Text Mining 101

Text Mining (or Text Analytics) is the discovery and communication of meaningful patterns in text data. As part of this 101, I would like to cover the building blocks of text mining:

  • TM process overview
  • Calculate term weight (TF-IDF)
  • Similarity distance measure (Cosine)
  • Overview of key text mining techniques

Text Mining Process Overview


Broadly, there are 4 stages in the text mining process. There are great open source tools available (R, Python, etc.) to carry out the process described here, and the steps remain almost the same irrespective of the analysis platform.

– Step 1: Data Assembly
– Step 2: Data Processing
– Step 3: Data Exploration or Visualization
– Step 4: Model Building


A brief description of the data processing steps:

Explore the corpus – Understand the types of variables, their functions, permissible values, and so on. Some formats, including HTML and XML, contain tags and other data structures that provide additional metadata.

Convert text to lowercase – This avoids distinguishing between words simply by case.

Remove numbers (if required) – Numbers may or may not be relevant to the analysis.

Remove punctuation – Punctuation can provide grammatical context that supports understanding. For initial analyses we often ignore punctuation; later, punctuation can be used to support the extraction of meaning.

Remove English stop words – Stop words are common words found in a language. Words like “for”, “of” and “are” are common stop words.

Remove own stop words (if required) – Along with English stop words, we could instead, or in addition, remove our own stop words. The choice of own stop words might depend on the domain of discourse, and might not become apparent until we have done some analysis.

Strip white space – Eliminate extra white spaces.
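
To make the steps above concrete, here is a minimal sketch in R using the tm package (the sample documents and the choice of “skills” as an own stop word are my own, for illustration):

    # Minimal cleaning sketch with the tm package (illustrative only)
    library(tm)

    docs   <- c("Statistics skills and programming skills are equally important!",
                "Statistics skills and domain knowledge are important for analytics in 2016.")
    corpus <- VCorpus(VectorSource(docs))

    corpus <- tm_map(corpus, content_transformer(tolower))      # convert to lowercase
    corpus <- tm_map(corpus, removeNumbers)                     # remove numbers (if required)
    corpus <- tm_map(corpus, removePunctuation)                 # remove punctuation
    corpus <- tm_map(corpus, removeWords, stopwords("english")) # remove English stop words
    corpus <- tm_map(corpus, removeWords, c("skills"))          # remove own stop words (if required)
    corpus <- tm_map(corpus, stripWhitespace)                   # strip extra white space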

Stemming – Transforms words to their root form. Stemming uses an algorithm that removes common word endings from English words, such as “es”, “ed” and “’s”. For example, “computer” and “computers” both become “comput”.

Lemmatisation – Transforms words to their dictionary base form, e.g. “produce” and “produced” both become “produce”.
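
A quick way to see the difference in R; SnowballC provides the Porter stemmer used by tm, and textstem is one option for lemmatisation (my assumption, not the only choice):

    library(SnowballC)
    wordStem(c("computer", "computers"))        # both become "comput"

    library(textstem)                           # one possible lemmatiser
    lemmatize_words(c("produce", "produced"))   # both become "produce"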

Sparse terms – We are often not interested in infrequent terms in our documents. Such “sparse” terms should be removed from the document term matrix.

Document term matrix – A document term matrix is simply a matrix with documents as the rows, terms as the columns, and the frequency counts of words as the cells of the matrix.
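
Continuing the tm sketch above, the document term matrix is built from the cleaned corpus, and sparse terms are dropped with a threshold (the 0.95 here is an arbitrary illustrative choice):

    dtm <- DocumentTermMatrix(corpus)   # rows = documents, columns = terms
    inspect(dtm)                        # view the counts per document/term

    # Drop terms whose sparsity exceeds the chosen threshold
    dtm2 <- removeSparseTerms(dtm, sparse = 0.95)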


Calculate Term Weight – TF-IDF


How frequently does a term appear?

Term Frequency: TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

How important is a term?

DF: Document Frequency = d (number of documents containing a given term) / D (the size of the collection of documents)

To normalize, take log(d/D); but since d ≤ D, log(d/D) gives a negative (or zero) value, so we invert the ratio inside the log expression. Essentially, we are compressing the scale of values so that very large and very small quantities can be compared smoothly.

IDF: Inverse Document Frequency IDF(t) = log(Total number of documents / Number of documents with term t in it)

Example:

Consider a document containing 100 words wherein the word CAR appears 3 times

TF(CAR) = 3 / 100 = 0.03

Now, assume we have 10 million documents and the word CAR appears in one thousand of these

IDF(CAR) = log(10,000,000 / 1,000) = 4 (using a base-10 logarithm)

The TF-IDF weight is the product of these quantities: 0.03 * 4 = 0.12
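
The same arithmetic in R (note the base-10 logarithm):

    tf  <- 3 / 100                  # TF(CAR)  = 0.03
    idf <- log10(10000000 / 1000)   # IDF(CAR) = 4
    tf * idf                        # TF-IDF   = 0.12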


Similarity Distance Measure (Cosine)


Why Cosine?

There is a detailed paper comparing the efficiency of different distance measures for text documents. The general observation is that cosine similarity works better than Euclidean distance for text data.


So let's understand how to calculate cosine similarity.

Example:

Text 1: statistics skills and programming skills are equally important for analytics

Text 2: statistics skills and domain knowledge are important for analytics

Text 3: I like reading books and travelling

The document term matrix for the above 3 texts has one row per document and one column per term. Reading off the vectors below, the column (term) order is: statistics, skills, and, programming, domain, are, equally, important, for, analytics, knowledge, i, like, reading, books, travelling.

The three vectors are:

T1 = (1,2,1,1,0,1,1,1,1,1,0,0,0,0,0,0)

T2 = (1,1,1,0,1,1,0,1,1,1,1,0,0,0,0,0)

T3 = (0,0,1,0,0,0,0,0,0,0,0,1,1,1,1,1)

Degree of Similarity (T1 & T2) = (T1 %*% T2) / (sqrt(sum(T1^2)) * sqrt(sum(T2^2))) ≈ 74%

Degree of Similarity (T1 & T3) = (T1 %*% T3) / (sqrt(sum(T1^2)) * sqrt(sum(T3^2))) ≈ 11%
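
These numbers can be reproduced in a few lines of R:

    T1 <- c(1,2,1,1,0,1,1,1,1,1,0,0,0,0,0,0)
    T2 <- c(1,1,1,0,1,1,0,1,1,1,1,0,0,0,0,0)
    T3 <- c(0,0,1,0,0,0,0,0,0,0,0,1,1,1,1,1)

    cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
    cosine(T1, T2)   # ~0.74
    cosine(T1, T3)   # ~0.11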


Overview of key Text Mining techniques


There are 3 key techniques in practice:

  1. N-gram based analytics
  2. Shallow Natural Language Processing technique
  3. Deep Natural Language Processing technique

N-gram based analytics

Definition:

  • An n-gram is a contiguous sequence of n items from a given sequence of text
  • The items can be syllables, letters, words or base pairs according to the application

Application:

  • Probabilistic language model for predicting the next item in a sequence, in the form of an (n − 1)-order Markov model
  • Widely used in probability, communication theory, computational linguistics, biological sequence analysis

Advantage:

  • Relatively simple
  • By simply increasing n, the model can be used to store more context

Disadvantage: 

The semantic value of the items is not considered

Example:

Let's look at the n-gram output for the sample sentence “defense attorney for liberty and montecito”

  • 1-gram: defense, attorney, for, liberty, and, montecito
  • 2-gram: defense attorney, attorney for, for liberty, liberty and, and montecito
  • 3-gram: defense attorney for, attorney for liberty, for liberty and, liberty and montecito
  • 4-gram: defense attorney for liberty, attorney for liberty and, for liberty and montecito
  • 5-gram: defense attorney for liberty and, attorney for liberty and montecito
  • 6-gram: defense attorney for liberty and montecito
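
A simple base R sketch that generates these word n-grams:

    ngrams <- function(text, n) {
      words <- strsplit(text, "\\s+")[[1]]
      if (length(words) < n) return(character(0))
      sapply(seq_len(length(words) - n + 1),
             function(i) paste(words[i:(i + n - 1)], collapse = " "))
    }

    s <- "defense attorney for liberty and montecito"
    ngrams(s, 2)   # "defense attorney" "attorney for" "for liberty" ...
    ngrams(s, 6)   # "defense attorney for liberty and montecito"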

Shallow NLP technique

Definition:

  • Assign a syntactic label (noun, verb etc.) to a chunk
  • Knowledge extraction from text through a semantic/syntactic analysis approach


Application:

  • Taxonomy extraction (predefined terms and entities). Entities: People, organizations, locations, times, dates, prices, genes, proteins, diseases, medicines
  • Concept extraction (main idea or a theme)

Advantage: Less noisy than n-grams

Disadvantage: Does not specify the role of items in the main sentence

Example: Consider the sentence “The driver from Bangalore crashed the bike with the black bumper”. Let's examine the results of applying the n-gram and shallow NLP techniques to extract concepts from this sample sentence.

Apply the 3 steps below to the example sentence (a code sketch follows the outputs):

  • Convert to lowercase & PoS tag
  • Remove stop words
  • Retain only nouns & verbs, as these hold higher weight in the sentence

1-gram output: driver, bangalore, crashed, bike, bumper


Bi-gram output with nouns/verbs retained: crashed bike, driver bangalore, bangalore crashed


3-gram output with nouns/verbs retained: driver bangalore crashed, bangalore crashed bike

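One way to reproduce this kind of output in R is with the udpipe package for PoS tagging (an assumption on my part; any tagger would do, and the exact tokens retained can vary by model):

    library(udpipe)
    m  <- udpipe_download_model(language = "english")   # downloads a model; needs internet
    ud <- udpipe_load_model(m$file_model)

    s   <- tolower("The driver from Bangalore crashed the bike with the black bumper")
    ann <- as.data.frame(udpipe_annotate(ud, x = s))

    # Retain only nouns, proper nouns and verbs
    kept <- ann$token[ann$upos %in% c("NOUN", "PROPN", "VERB")]
    kept                                    # 1-gram output
    paste(head(kept, -1), tail(kept, -1))   # bi-gram output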

Conclusion:

  • 1-gram: Reduced noise, however no clear context
  • Bi-gram & 3-gram: Increased context, however there is information loss (e.g. “bumper” tells us what was crashed, but it is not in the output)

Deep NLP technique

Definition:

  • An extension of shallow NLP
  • Get the syntactic relationship between each pair of words
  • Apply sentence segmentation to determine the sentence boundaries
  • The Stanford Parser (or any similar parser) is then applied to generate output in the form of dependency relations, which represent the syntactic relations within each sentence
  • Detected relationships are expressed as complex constructions to retain the context
  • Example relationships: Located in, employed by, part of, married to

Application: Develop features and representations appropriate for complex interpretation tasks, such as fraud detection, or prediction based on complex RNA-sequence data in the life sciences.


The example sentence above (“The driver from Bangalore crashed the bike with the black bumper”) can be represented using triples (Subject : Predicate [Modifier] : Object) without losing the context. Modifiers include negations, multi-word expressions, and adverbial modifiers like “not”, “maybe” and “however”. You can learn more about Stanford typed dependencies here.

Triples Output:

  • driver : from : bangalore
  • driver : crashed : bike
  • driver : crashed with : bumper
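
For downstream use, the extracted triples can be held in a simple table; a minimal sketch (the data frame layout is my own choice for illustration):

    triples <- data.frame(
      subject   = c("driver", "driver", "driver"),
      predicate = c("from", "crashed", "crashed with"),
      object    = c("bangalore", "bike", "bumper")
    )
    subset(triples, predicate == "crashed")   # what did the driver crash?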

Disadvantages:

  • Requires well-structured, sentence-level text
  • Abbreviations and grammatical errors in sentences will mislead the analysis

Summary



Hope this helps and I welcome feedback/comments.

