Triples – Deep Natural Language Processing

Problem: In text mining, extracting keywords (n-grams) alone cannot produce meaningful data or discover “unknown” themes and trends.

Objective: The aim here is to extract dependency relations from sentences, i.e., to extract sets of the form {subject, predicate [modifiers], object} out of syntactically parsed sentences, using the Stanford Parser and OpenNLP.

Steps: 

1) Get the syntactic relationship between each pair of words

2) Apply sentence segmentation to determine the sentence boundaries

3) The Stanford Parser is then applied to generate output in the form of dependency relations, which represent the syntactic relationships within each sentence

How is this different from n-grams?

Dependency relations allow the similarity comparison to be based on the syntactic relations between words, instead of having to match words in their exact order as in n-gram based comparisons.

Example:

Sentence: “The flat tire was not changed by driver”

Stanford dependency relations: 

root(ROOT-0, changed-6)
det(tire-3, the-1)
amod(tire-3, flat-2)
nsubjpass(changed-6, tire-3)
auxpass(changed-6, was-4)
neg(changed-6, not-5)
prep(changed-6, by-7)
pobj(by-7, driver-8)

Refer to the Stanford typed dependencies manual for the full list and more info: http://nlp.stanford.edu/software/dependencies_manual.pdf

Triples output in the form (Subject : Predicate [modifier] : Object):

driver : changed [not] : tire

Extraction Logic: You can use the base logic below to build the functionality in your favorite language (R/Python/Java, etc.). Please note that this is only the base logic and needs enhancement.

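Since this is only base logic, here is a minimal sketch in Python using spaCy as a stand-in for the Stanford Parser; the dependency labels and rules below are simplified assumptions that handle only simple active and passive clauses, and will need enhancement for real text:

import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triples(text):
    triples = []
    for sent in nlp(text).sents:  # sentence segmentation
        for token in sent:
            # A subject (active "nsubj" or passive "nsubjpass") points at its predicate.
            if token.dep_ not in ("nsubj", "nsubjpass"):
                continue
            pred = token.head
            mods = [c.text for c in pred.children if c.dep_ == "neg"]  # e.g., "not"
            for child in pred.children:
                if child.dep_ == "dobj":  # active clause: subject-verb-object
                    triples.append((token.text, pred.lemma_, mods, child.text))
                elif child.dep_ == "agent":  # passive clause: "by <agent>" supplies the subject
                    for obj in child.children:
                        if obj.dep_ == "pobj":
                            triples.append((obj.text, pred.lemma_, mods, token.text))
    return triples

print(extract_triples("The flat tire was not changed by driver"))
# Expected (roughly): [('driver', 'change', ['not'], 'tire')]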

Challenges:

  • Highly dependent on well-formed sentence structure
  • Abbreviations and grammatical errors in sentences will mislead the analysis

 

Hope this article is useful!

When to change job/role?

I have met many happy employees, and this blog is definitely not for them, as they already “stay hungry, stay foolish”. On the other hand, there are employees who are unhappy or on the verge of becoming unhappy, and everyone goes through this phase at some point in their career. This blog is meant for the second category.

Let’s look at common reasons for unhappiness (only a few key ones):

  • My duties have increased or changed but not my salary!
  • Same boring work every day and nothing new
  • I don’t like my boss and/or people I work with
  • My company is sinking or not doing well in the market, so there is a decline in the quality of facilities given to employees
  • Having health issues due to stress at work

And the list goes on…

I think the reasons can be categorized into three key aspects. I call them the 3 S’s; they are listed below in order of priority.

  1. Salary
  2. Satisfaction
  3. Standard (Brand Equity)

Salary

No doubt this is the key driver for people to work. Over time an individual’s personal commitments increase (home/bike/car loans, etc.), leading to an increase in salary needs. The issue arises when the salary paid reaches or crosses the break-even point, i.e., gains <= losses; in simple terms, the salary does not meet the individual’s minimum expectation.

Satisfaction

This is the second key aspect after salary. Many individuals, though happy with the salary they are paid, are unhappy with the work due to various other reasons: the work environment, people, the nature of the work or the work itself, and personal issues.

Standard (Brand Equity)

Last in the list, an organization’s standard/brand name does play a role in an individual’s decision to stay with or leave an organization. The standard or brand name has indirect implications if not direct ones, like respect within society or among peer organizations. For example, you may get lower interest rates on bank loans when working for a well-established organization, and premium treatment from certain service providers.

Rule of 2/3:

A general rule, based on observation, is that for any individual to continue in an organization, a minimum of two of the three aspects must be met. Note that this might not be reliable for every situation and geography. However, if you are unhappy or on the verge of becoming unhappy, do take time to rate yourself on the three aspects mentioned above and apply the rule of 2/3 to decide whether a job/role change is required!

“Stay Hungry, Stay Foolish, Stay Healthy” 

If you decide to change role/job based on the rule of 2/3, remember that there is no shortcut to being happy and successful.


Stay Hungry: Never stop learning and always look out for the next big thing. Everything in this world, and the world itself, is always moving; if you don’t run along, you’ll be run over. To paraphrase Charles Darwin: “It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change.” For example, in current times, learning SMAC (social, mobile, analytics and cloud) would give you a better edge in the job market. Associating with or working on the next big thing will help you meet your salary and satisfaction aspects.


Stay Foolish: Gain knowledge to challenge the status quo. History has taught us that great minds were initially not received well; however, believing in their dreams, presence of mind, and hard work led to success. So don’t be afraid to fail, and learn from it to succeed. To paraphrase Thomas Edison: “I have not failed, not once. I’ve discovered ten thousand ways that don’t work.”

Donkey in the well story: One day a farmer’s donkey fell down into a well. The animal cried piteously for hours for help, and finally stopped as it could not bray anymore. Earlier, the villagers had decided to cover up the well, as many had fallen into it in the past. Soon they all grabbed shovels and began to shovel dirt into the well without realizing the donkey’s presence in it. The donkey realized what was happening, but could not bray for help. With each shovel of dirt that hit his back, the donkey did something amazing: he shook it off and took a step up.

As the villagers continued to shovel dirt on top of the animal, he would shake it off and take a step up. Pretty soon, everyone was amazed as the donkey stepped up over the edge of the well and happily trotted off!

Moral: Life is going to shovel dirt on you, all kinds of dirt. The trick to getting out of the well is to shake it off and take a step up. Each of our troubles is a stepping stone. We can get out of the deepest wells just by not stopping and never giving up. Shake it off and take a step up.


Stay Healthy: You can’t achieve anything we discussed in the ‘Stay Hungry’ and ‘Stay Foolish’ sections if you don’t maintain a healthy body and mind. Good energy levels and presence of mind are two key ingredients of successful people. So eat healthy, exercise regularly and meditate daily.

I want to end with some good quotes and a poem.

Great quotes

“In order to succeed, we must first believe that we can” – Nikos Kazantzakis

“Success occurs when opportunity meets preparation” – Zig Ziglar

“Stay Hungry, Stay Foolish” – Steve Jobs

If – by Rudyard Kipling

If you can keep your head when all about you

Are losing theirs and blaming it on you,

If you can trust yourself when all men doubt you,

But make allowance for their doubting too;

If you can wait and not be tired by waiting,

Or being lied about, don’t deal in lies,

Or being hated, don’t give way to hating,

And yet don’t look too good, nor talk too wise:

If you can dream—and not make dreams your master;

If you can think—and not make thoughts your aim;

If you can meet with Triumph and Disaster

And treat those two impostors just the same;

If you can bear to hear the truth you’ve spoken

Twisted by knaves to make a trap for fools,

Or watch the things you gave your life to, broken,

And stoop and build ’em up with worn-out tools:

If you can make one heap of all your winnings

And risk it on one turn of pitch-and-toss,

And lose, and start again at your beginnings

And never breathe a word about your loss;

If you can force your heart and nerve and sinew

To serve your turn long after they are gone,

And so hold on when there is nothing in you

Except the Will which says to them: ‘Hold on!’

If you can talk with crowds and keep your virtue,

Or walk with Kings—nor lose the common touch,

If neither foes nor loving friends can hurt you,

If all men count with you, but none too much;

If you can fill the unforgiving minute

With sixty seconds’ worth of distance run,

Yours is the Earth and everything that’s in it,

And—which is more—you’ll be a Man, my son!


I welcome feedback and comments.

Text Mining 101

Text Mining, or Text Analytics, is the discovery and communication of meaningful patterns in text data. As part of this 101, I would like to cover the building blocks of TM:

  • TM process overview
  • Calculate term weight (TF-IDF)
  • Similarity distance measure (Cosine)
  • Overview of key text mining techniques

Text Mining Process Overview


Broadly, there are four stages in the text mining process. Great open source tools (R, Python, etc.) are available to carry out the process described here. The steps remain almost the same irrespective of the analysis platform.

– Step 1: Data Assembly
– Step 2: Data Processing
– Step 3: Data Exploration or Visualization
– Step 4: Model Building


Brief description of the Data Processing steps

Explore Corpus – Understand the types of variables, their functions, permissible values, and so on. Some formats including html and xml contain tags and other data structures that provide more metadata.

Convert text to lowercase – This avoids distinguishing between words simply based on case.

Remove Numbers (if required) – Numbers may or may not be relevant to our analyses.

Remove Punctuation – Punctuation can provide grammatical context that supports understanding. For initial analyses we often ignore punctuation; later, punctuation can be used to support the extraction of meaning.

Remove English stop words – Stop words are common words found in a language. Words like for, of and are are common stop words.

Remove own stop words (if required) – Along with English stop words, we could instead, or in addition, remove our own stop words. The choice of own stop words might depend on the domain of discourse, and might not become apparent until we’ve done some analysis.

Strip white space – Eliminate extra white spaces.

Stemming – Transforms words to their root form. Stemming uses an algorithm that removes common word endings from English words, such as “es”, “ed” and “’s”. For example, “computer” and “computers” both become “comput”.

Lemmatisation – Transforms words to their dictionary base form, e.g., “produce” and “produced” both become “produce”.

Sparse terms – We are often not interested in infrequent terms in our documents. Such “sparse” terms should be removed from the document term matrix.

Document term matrix – A document term matrix is simply a matrix with documents as the rows and terms as the columns and a count of the frequency of words as the cells of the matrix.
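A minimal sketch of these processing steps in Python, using NLTK and scikit-learn (assuming the NLTK stopwords resource has been downloaded; an equivalent pipeline can be built with R's tm package). Lemmatisation and corpus exploration are left out:

import re
from nltk.corpus import stopwords  # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    text = text.lower()                    # convert text to lowercase
    text = re.sub(r"[0-9]+", " ", text)    # remove numbers (if required)
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation
    tokens = text.split()                  # splitting also strips extra white space
    tokens = [t for t in tokens if t not in stop]      # remove English stop words
    return " ".join(stemmer.stem(t) for t in tokens)   # stemming

docs = ["The flat tire was not changed by driver.",
        "The driver from Bangalore crashed the bike."]

# Build the document term matrix; raise min_df to drop sparse terms.
vectorizer = CountVectorizer(preprocessor=preprocess, min_df=1)
dtm = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(dtm.toarray())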


Calculate Term Weight – TF-IDF


How frequently does a term appear?

Term Frequency: TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

How important is a term?

DF: Document Frequency = d (number of documents containing a given term) / D (the size of the collection of documents)

To normalize, take log(d/D); but since d ≤ D, log(d/D) gives a negative value, so we invert the ratio inside the log expression. Essentially, we are compressing the scale of values so that very large and very small quantities can be smoothly compared.

IDF: Inverse Document Frequency IDF(t) = log(Total number of documents / Number of documents with term t in it)

Example:

Consider a document containing 100 words wherein the word CAR appears 3 times

TF(CAR) = 3 / 100 = 0.03

Now, assume we have 10 million documents and the word CAR appears in one thousand of these

IDF(CAR) = log(10,000,000 / 1,000) = 4

The TF-IDF weight is the product of these quantities: 0.03 × 4 = 0.12
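The same arithmetic in Python (the example above uses log base 10):

import math

def tf(term, doc_tokens):
    # term frequency: count of the term / total terms in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, docs):
    # inverse document frequency, log base 10 as in the example above
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log10(len(docs) / n_containing)

# Worked check of the CAR example:
print(3 / 100)                          # TF  = 0.03
print(math.log10(10_000_000 / 1_000))   # IDF = 4.0
print(0.03 * 4)                         # TF-IDF = 0.12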


Similarity Distance Measure (Cosine)


Why Cosine?

Here is a detailed paper comparing the efficiency of different distance measures for text documents. The general observation is that cosine similarity works better than Euclidean distance for text data.

Cosine similarity: cos(θ) = (A · B) / (||A|| × ||B||), i.e., the dot product of the two term vectors divided by the product of their magnitudes.

So let’s understand how to calculate cosine similarity.

Example:

Text 1: statistics skills and programming skills are equally important for analytics

Text 2: statistics skills and domain knowledge are important for analytics

Text 3: I like reading books and travelling

The document term matrix for the above three texts would be:

Term         Text1  Text2  Text3
statistics     1      1      0
skills         2      1      0
and            1      1      1
programming    1      0      0
domain         0      1      0
are            1      1      0
equally        1      0      0
important      1      1      0
for            1      1      0
analytics      1      1      0
knowledge      0      1      0
i              0      0      1
like           0      0      1
reading        0      0      1
books          0      0      1
travelling     0      0      1

(Reading each Text column top to bottom gives the corresponding vector below.)

The three vectors are:

T1 = (1,2,1,1,0,1,1,1,1,1,0,0,0,0,0,0)

T2 = (1,1,1,0,1,1,0,1,1,1,1,0,0,0,0,0)

T3 = (0,0,1,0,0,0,0,0,0,0,0,1,1,1,1,1)

Degree of Similarity (T1 & T2) = (T1 %*% T2) / (sqrt(sum(T1^2)) * sqrt(sum(T2^2))) = 77%

Degree of Similarity (T1 & T3) = (T1 %*% T3) / (sqrt(sum(T1^2)) * sqrt(sum(T3^2))) = 12%
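The same calculation in Python with NumPy (the expressions above use R’s %*% matrix-product notation):

import numpy as np

T1 = np.array([1,2,1,1,0,1,1,1,1,1,0,0,0,0,0,0])
T2 = np.array([1,1,1,0,1,1,0,1,1,1,1,0,0,0,0,0])
T3 = np.array([0,0,1,0,0,0,0,0,0,0,0,1,1,1,1,1])

def cosine(a, b):
    # dot product divided by the product of the vector magnitudes
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine(T1, T2), 2))  # 0.77
print(round(cosine(T1, T3), 2))  # 0.12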


Overview of key Text Mining techniques


There are three key techniques in practice:

  1. N-gram based analytic
  2. Shallow Natural Language Processing technique
  3. Deep Natural Language Processing technique

N-gram based analytic

Definition:

  • An n-gram is a contiguous sequence of n items from a given sequence of text
  • The items can be syllables, letters, words or base pairs according to the application

Application:

  • Probabilistic language models predict the next item in a sequence using an (n − 1)-order Markov model
  • Widely used in probability, communication theory, computational linguistics, biological sequence analysis

Advantage:

  • Relatively simple
  • By simply increasing n, the model can be used to store more context

Disadvantage: 

The semantic value of the items is not considered

Example:

Let’s look at the n-gram output for the sample sentence “defense attorney for liberty and montecito”

  • 1-gram: defense, attorney, for, liberty, and, montecito
  • 2-gram: defense attorney, attorney for, for liberty, liberty and, and montecito
  • 3-gram: defense attorney for, attorney for liberty, for liberty and, liberty and montecito
  • 4-gram: defense attorney for liberty, attorney for liberty and, for liberty and montecito
  • 5-gram: defense attorney for liberty and, attorney for liberty and montecito
  • 6-gram: defense attorney for liberty and montecito
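A few lines of Python reproduce these lists (a minimal sketch; any tokenizer can replace the simple split):

def ngrams(tokens, n):
    # all contiguous sequences of n tokens
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "defense attorney for liberty and montecito".split()
for n in range(1, 7):
    print(f"{n}-gram:", ngrams(tokens, n))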

Shallow NLP technique

Definition:

  • Assign a syntactic label (noun, verb etc.) to a chunk
  • Knowledge extraction from text through a semantic/syntactic analysis approach


Application:

  • Taxonomy extraction (predefined terms and entities). Entities: People, organizations, locations, times, dates, prices, genes, proteins, diseases, medicines
  • Concept extraction (main idea or a theme)

Advantage: Less noisy than n-grams

Disadvantage: Does not specify the role of items in the main sentence

Example: Consider the sentence “The driver from Bangalore crashed the bike with the black bumper”. Let’s examine the results of applying the n-gram and shallow NLP techniques to extract the concept from the sample sentence.

Apply the three steps below to the example sentence (a code sketch follows the outputs):

  • Convert to lowercase & PoS tag
  • Remove stop words
  • Retain only nouns and verbs, as these hold higher weight in the sentence

1-gram output: driver, bangalore, crashed, bike, bumper

Bi-gram output with nouns/verbs retained: crashed bike, driver bangalore, bangalore crashed

3-gram output with nouns/verbs retained: driver bangalore crashed, bangalore crashed bike
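A rough sketch of these three steps in Python with NLTK (assuming the punkt, stopwords and averaged_perceptron_tagger resources have been downloaded; exact tags vary by tagger version, so the outputs in comments are approximate):

import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams

sentence = "The driver from Bangalore crashed the bike with the black bumper"

tokens = nltk.word_tokenize(sentence.lower())   # convert to lowercase & tokenize
tagged = nltk.pos_tag(tokens)                   # PoS tag
stop = set(stopwords.words("english"))
kept = [w for w, tag in tagged
        if w not in stop and tag.startswith(("NN", "VB"))]  # nouns & verbs only
print(kept)                  # roughly: ['driver', 'bangalore', 'crashed', 'bike', 'bumper']
print(list(ngrams(kept, 2))) # bi-grams over the retained words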

Conclusion:

  • 1-gram: Reduced noise, however no clear context
  • Bi-gram & 3-gram: Increased context, however there is information loss (e.g., bumper, which tells us what was crashed, is missing from the output)

Deep NLP technique

Definition:

  • An extension of shallow NLP
  • Get the syntactic relationship between each pair of words
  • Apply sentence segmentation to determine the sentence boundaries
  • The Stanford Parser (or any similar) is then applied to generate output in the form of dependency relations, which represent the syntactic relations within each sentence
  • Detected relationships are expressed as complex constructions to retain the context
  • Example relationships: Located in, employed by, part of, married to

Applications: Develop features and representations appropriate for complex interpretation tasks, like fraud detection, and prediction activities based on complex RNA sequences in the life sciences

(Figure: dependency parse of the sample sentence “The driver from Bangalore crashed the bike with the black bumper”)

The above sentence can be represented using triples (Subject : Predicate [Modifier] : Object) without losing the context. Modifiers are negations, multi-word expressions, and adverbial modifiers like not, maybe, however, etc. You can learn more about Stanford typed dependencies here.

Triples Output:

  • driver : from : bangalore
  • driver : crashed : bike
  • driver : crashed with : bumper

Disadvantages:

  • Highly dependent on well-formed sentence structure
  • Abbreviations and grammatical errors in sentences will mislead the analysis

Summary

Hope this helps and I welcome feedback/comments.

