Text mining, or text analytics, is the discovery and communication of meaningful patterns in text data. In this 101, I would like to cover the building blocks of text mining (TM):
- TM process overview
- Calculate term weight (TF-IDF)
- Similarity distance measure (Cosine)
- Overview of key text mining techniques
Text Mining Process Overview
Broadly, there are 4 stages in the text mining process. Great open source tools (R, Python, etc.) are available to carry out the process described here, and the steps remain almost the same irrespective of the analysis platform.
– Step 1: Data Assembly
– Step 2: Data Processing
– Step 3: Data Exploration or Visualization
– Step 4: Model Building
Brief description about Data Processing steps
Explore Corpus – Understand the types of variables, their functions, permissible values, and so on. Some formats, including HTML and XML, contain tags and other data structures that provide more metadata.
Convert text to lowercase – This avoids distinguishing between words simply on case.
Remove numbers (if required) – Numbers may or may not be relevant to our analyses.
Remove Punctuation – Punctuation can provide grammatical context which supports understanding. Often for initial analyses we ignore the punctuation. Later we will use punctuation to support the extraction of meaning.
Remove English stop words – Stop words are common words found in a language. Words like “for”, “of” and “are” are common stop words.
Remove own stop words (if required) – Along with English stop words, we could instead or in addition remove our own stop words. The choice of own stop words might depend on the domain of discourse, and might not become apparent until we’ve done some analysis.
Strip white space – Eliminate extra white spaces.
Stemming – Transforms words to their root form. Stemming uses an algorithm that removes common word endings for English words, such as “es”, “ed” and “’s”. For example, “computer” & “computers” become “comput”.
Lemmatisation – Transforms words to their dictionary base form, e.g., “produce” & “produced” become “produce”.
Sparse terms – We are often not interested in infrequent terms in our documents. Such “sparse” terms should be removed from the document term matrix.
Document term matrix – A document term matrix is simply a matrix with documents as the rows and terms as the columns and a count of the frequency of words as the cells of the matrix.
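The simpler processing steps above, plus the document term matrix, can be sketched in plain Python. This is only a minimal illustration: the stop word list is a tiny made-up one (an assumption, not a standard list), the function names are hypothetical, and stemming and lemmatisation are left out because they normally rely on a dedicated library.

```python
import re
import string
from collections import Counter

# Assumption: a tiny illustrative stop word list, not a standard one.
STOP_WORDS = {"a", "an", "and", "are", "for", "i", "of", "the"}

def preprocess(text, extra_stop_words=frozenset()):
    """Lowercase, drop numbers and punctuation, remove stop words, strip white space."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)  # remove numbers
    # Remove ASCII punctuation only; curly quotes and similar are not handled here.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # split() also collapses extra white space
    stop = STOP_WORDS | set(extra_stop_words)  # English plus own stop words
    return [t for t in tokens if t not in stop]

def document_term_matrix(docs):
    """Return (terms, rows): documents as rows, terms as columns, counts as cells."""
    counts = [Counter(preprocess(d)) for d in docs]
    terms = sorted(set().union(*counts))
    rows = [[c[t] for t in terms] for c in counts]
    return terms, rows

docs = ["Statistics skills and programming skills!", "I like reading 2 books."]
terms, rows = document_term_matrix(docs)
for row in rows:
    print(dict(zip(terms, row)))
```

Sparse terms could then be dropped by removing any column whose total count falls below a threshold of your choosing.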
Calculate Term Weight – TF-IDF
How frequently does a term appear?
Term Frequency: TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
How important is a term?
DF: Document Frequency = d (number of documents containing a given term) / D (the size of the collection of documents)
To normalize, we could take log(d/D), but since d can never exceed D, log(d/D) is zero or negative. So we invert the ratio inside the log expression. Essentially, the log compresses the scale of values so that very large and very small quantities can be compared smoothly.
IDF: Inverse Document Frequency IDF(t) = log(Total number of documents / Number of documents with term t in it)
Consider a document containing 100 words wherein the word CAR appears 3 times
TF(CAR) = 3 / 100 = 0.03
Now, assume we have 10 million documents and the word CAR appears in one thousand of these
IDF(CAR) = log(10,000,000 / 1,000) = log(10,000) = 4 (using a base-10 logarithm)
The TF-IDF weight is the product of these quantities: 0.03 * 4 = 0.12
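The CAR example can be checked with a few lines of Python. This is a minimal sketch of the formulas above, not a library implementation; the function names are hypothetical, and the base-10 logarithm is an assumption inferred from the example arriving at 4.

```python
import math

def term_frequency(term_count, total_terms):
    """TF(t) = occurrences of t in the document / total terms in the document."""
    return term_count / total_terms

def inverse_document_frequency(total_docs, docs_with_term):
    """IDF(t) = log10(total documents / documents containing t)."""
    return math.log10(total_docs / docs_with_term)

tf = term_frequency(3, 100)                           # CAR appears 3 times in 100 words
idf = inverse_document_frequency(10_000_000, 1_000)   # CAR appears in 1,000 of 10 million documents
print(round(tf, 4), round(idf, 4), round(tf * idf, 4))
```

Libraries often use the natural logarithm and add smoothing terms, so their TF-IDF values will generally differ from this hand calculation.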
Similarity Distance Measure (Cosine)
Here is a detailed paper on comparing the efficiency of different distance measures for text documents. The general observation is that cosine similarity works better than Euclidean distance for text data.
So let's understand how to calculate cosine similarity.
Text 1: statistics skills and programming skills are equally important for analytics
Text 2: statistics skills and domain knowledge are important for analytics
Text 3: I like reading books and travelling
The document term matrix for the above 3 texts has one row per text and one column per distinct term (16 terms in total). The three row vectors are:
T1 = (1,2,1,1,0,1,1,1,1,1,0,0,0,0,0,0)
T2 = (1,1,1,0,1,1,0,1,1,1,1,0,0,0,0,0)
T3 = (0,0,1,0,0,0,0,0,0,0,0,1,1,1,1,1)
Degree of Similarity (T1 & T2) = (T1 %*% T2) / (sqrt(sum(T1^2)) * sqrt(sum(T2^2))) = 77%
Degree of Similarity (T1 & T3) = (T1 %*% T3) / (sqrt(sum(T1^2)) * sqrt(sum(T3^2))) = 12%
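The same calculation can be written in Python using the three vectors above. This is a minimal sketch: the function name is hypothetical, and it assumes neither vector is all zeros (that case would divide by zero).

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

T1 = (1, 2, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
T2 = (1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0)
T3 = (0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

print(round(cosine_similarity(T1, T2), 2))  # 0.77
print(round(cosine_similarity(T1, T3), 2))  # 0.12
```

T1 and T2 share 7 terms, so their dot product is 8 (the term “skills” counts twice), while T1 and T3 share only the word “and”.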
Overview of key Text Mining techniques
Three key techniques are in practice:
- N-gram based analytics
- Shallow Natural Language Processing technique
- Deep Natural Language Processing technique
N-gram based analytics
- An n-gram is a contiguous sequence of n items from a given sequence of text
- The items can be syllables, letters, words or base pairs according to the application
- Probabilistic language model for predicting the next item in a sequence, in the form of an (n − 1)-order Markov model
- Widely used in probability, communication theory, computational linguistics, biological sequence analysis
- Relatively simple
- By simply increasing n, the model can be used to store more context
However, the semantic value of the items is not considered.
Let's look at the n-gram output for the sample sentence “defense attorney for liberty and montecito”
- 1-gram: defense, attorney, for, liberty, and, montecito
- 2-gram: defense attorney, attorney for, for liberty, liberty and, and montecito
- 3-gram: defense attorney for, attorney for liberty, for liberty and, liberty and montecito
- 4-gram: defense attorney for liberty, attorney for liberty and, for liberty and montecito
- 5-gram: defense attorney for liberty and, attorney for liberty and montecito
- 6-gram: defense attorney for liberty and montecito
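Word n-grams come from sliding a window of n words across the sentence, so a sentence of k words has k − n + 1 n-grams. A minimal sketch, assuming simple white space tokenisation (the function name is hypothetical):

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined back into a string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "defense attorney for liberty and montecito".split()
for n in range(1, len(tokens) + 1):
    print(f"{n}-gram:", ", ".join(ngrams(tokens, n)))
```

When n is larger than the number of tokens, the function returns an empty list.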
Shallow NLP technique
- Assign a syntactic label (noun, verb etc.) to a chunk
- Knowledge extraction from text through semantic/syntactic analysis approach
- Taxonomy extraction (predefined terms and entities). Entities: People, organizations, locations, times, dates, prices, genes, proteins, diseases, medicines
- Concept extraction (main idea or a theme)
Advantage: Less noisy than n-grams
Disadvantage: Does not specify role of items in the main sentence
Example: Consider the sentence “The driver from Bangalore crashed the bike with the black bumper”. Let's examine the results of applying the n-gram and shallow NLP techniques to extract concepts from the sample sentence.
Apply the 3 steps below to the example sentence:
- Convert to lowercase & PoS tag
- Remove stop words
- Retain only nouns & verbs, as these hold higher weight in the sentence
1-gram output: driver, bangalore, crashed, bike, bumper
Bi-gram output with nouns/verbs retained: crashed bike, driver bangalore, bangalore crashed
3-gram output with nouns/verbs retained: driver bangalore crashed, bangalore crashed bike
- 1-gram: Reduced noise, however no clear context
- Bi-gram & 3-gram: Increased context, however there is information loss (for example, “bumper”, which tells us what was crashed, is not in the output)
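The three steps (lowercase and PoS tag, remove stop words, retain nouns and verbs) can be sketched as follows. The PoS tags here are hand-assigned in Penn Treebank style purely for illustration, which is an assumption; a real pipeline would get them from a tagger. The stop word list is likewise a minimal one made up for this sentence.

```python
# Assumption: hand-assigned PoS tags (Penn Treebank style), not tagger output.
TAGGED = [
    ("The", "DT"), ("driver", "NN"), ("from", "IN"), ("Bangalore", "NNP"),
    ("crashed", "VBD"), ("the", "DT"), ("bike", "NN"), ("with", "IN"),
    ("the", "DT"), ("black", "JJ"), ("bumper", "NN"),
]
STOP_WORDS = {"the", "from", "with"}  # assumption: minimal list for this sentence

def nouns_and_verbs(tagged, stop_words):
    """Lowercase each word, drop stop words, keep only noun (NN*) and verb (VB*) tags."""
    kept = []
    for word, tag in tagged:
        word = word.lower()
        if word in stop_words:
            continue
        if tag.startswith(("NN", "VB")):
            kept.append(word)
    return kept

print(nouns_and_verbs(TAGGED, STOP_WORDS))
# ['driver', 'bangalore', 'crashed', 'bike', 'bumper']
```

The adjective “black” is dropped along with the stop words, which illustrates how this filtering trades detail for reduced noise.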
Deep NLP technique
- Extension to the shallow NLP
- Get the syntactic relationship between each pair of words
- Apply sentence segmentation to determine the sentence boundaries
- The Stanford Parser (or any similar) is then applied to generate output in the form of dependency relations, which represent the syntactic relations within each sentence
- Detected relationships are expressed as complex constructions to retain the context
- Example relationships: Located in, employed by, part of, married to
Applications: Develop features and representations appropriate for complex interpretation tasks like fraud detection and prediction activities based on complex RNA-Sequence in life science
The example sentence above can be represented using triples (Subject: Predicate [Modifier]: Object) without losing the context. Modifiers are negations, multi-word expressions, and adverbial modifiers like not, maybe, however etc. You can learn more about Stanford typed dependencies here.
- driver : from : bangalore
- driver : crashed : bike
- driver : crashed with : bumper
- Sentence level is too structured
- Usage of abbreviations and grammatical errors in a sentence will mislead the analysis
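Once extracted, the subject : predicate : object triples listed earlier for the example sentence can be held in a simple structure and queried. This is only a sketch of one possible representation: the `Triple` type and `relations_of` helper are hypothetical names, and the triples are typed in by hand rather than produced by a parser.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str  # may include a modifier, e.g. "crashed with"
    obj: str

# Hand-entered triples for "The driver from Bangalore crashed the bike with the black bumper".
triples = [
    Triple("driver", "from", "bangalore"),
    Triple("driver", "crashed", "bike"),
    Triple("driver", "crashed with", "bumper"),
]

def relations_of(subject, triples):
    """Return (predicate, object) pairs for every triple with the given subject."""
    return [(t.predicate, t.obj) for t in triples if t.subject == subject]

print(relations_of("driver", triples))
```

Unlike the n-gram output, each entry keeps the relationship between the two words it connects.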
Hope this helps and I welcome feedback/comments.