Triplets for concept extraction from English sentence (Deep NLP)

I recently published a white paper with the above-mentioned title at the ‘Fourth International Conference on Business Analytics and Intelligence’, held 19–21 December 2016 at the Indian Institute of Science, Bangalore. Here I present the key contents from the paper.



In Text Mining, extracting n-gram keywords alone cannot produce meaningful information or uncover “unknown” themes and trends. Triples are a way to represent the information in a text sentence in fewer words without losing the context. The application of triples leads to higher accuracy for complex interpretation tasks such as fraud detection, and prediction activities based on complex RNA-Sequence data in life science. There are different techniques for extracting this information before representing it as triples, and the choice of technique depends on the kind of data being read as input. In this paper we briefly evaluate the different methods in practice for extracting triples from English sentences. An advanced NLP technique is then presented in detail to extract dependency relations from a sentence, i.e., to extract sets of the form {subject, predicate [modifier], object} out of syntactically parsed sentences using the Stanford Parser. The technique is an extension of shallow NLP. First, the syntactic relationship between each pair of words is obtained. Sentence segmentation is applied to determine the sentence boundaries. The Stanford Parser is then applied to generate output in the form of dependency relations, which represent the syntactic relations within each sentence. Detected relationships are expressed as complex constructions to retain the context.


Keywords: Text Analytics, Text Mining, Concept Extraction, Triples, Triplets


Resource Description Framework (RDF) is a well-known data model for information extraction and was adopted as a World Wide Web Consortium recommendation in 1999 as a general method for conceptual description or modeling of information that is implemented in web resources. RDF relates entities in the subject-predicate-object format, where the subject and object are related to one another by the predicate. Later it was also used in knowledge management applications involving structured text content. According to the approach presented in [1], a triplet in a text sentence is defined as a relation between subject and object, the relation being the predicate. The aim here is to extract sets of the form {subject, predicate, object} out of syntactically parsed sentences. The triple is a minimal representation of information that does not lose the context. In the current research we look to enhance the objective of extracting aspects by using additional descriptors, such as modifiers, alongside the predicate. A descriptor is a word, especially an adjective or any other modifier used attributively, which restricts or adds to the sense of a head noun. Descriptors express opinions and sentiments about an aspect, which can be further used in generating summaries for the aspects. For example, “The flat tire was replaced by the driver” can be represented as driver:replaced:tire, which is subject:predicate[modifier]:object.


A decent amount of research and implementation has been carried out in the past in the area of extracting triplets/triples from text sentences for concept extraction.

There has been usage of two major techniques:

1.     Machine Learning Technique:

A machine learning approach has been used [1] to extract subject-predicate-object triplets from English sentences. An SVM is used to train a model on human-annotated triplets, with features computed from three parsers. The sentence is tokenized and the stop words and punctuation are removed, which gives a list of the important tokens in the sentence. The next step is to get all possible ordered combinations of three tokens from the list; the resulting combinations are the triplet candidates. From here on, the problem is treated as a binary classification problem where the triplet candidates must be classified as positive or negative. The SVM model assigns a positive score to those candidates which should be extracted as triplets, and a negative score to the others. The resulting triplet is formed from the words with the highest positive scores. As opposed to the subject and the verb, the objects differ among the positively classified triplet candidates. In such cases an attempt is made to merge the different triplet elements (in this case objects): if two or more words are consecutive in the list of important tokens, they are merged. Where merges have been made in the object, the tokens are connected by the stop words from the original sentence. With the merging method described above, it will not always be possible to merge all tokens into a single set; in that case several triplets are obtained, one for each set. Note that in practice the classification described above yields many false positives, so it does not work to take them all as resulting triplets. Instead only the top few from the descending ordered list of triplet candidates are taken.
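The candidate-generation step described above can be sketched in Python. This is only an illustration of the enumeration of ordered three-token combinations; the stop-word list is illustrative (not the one used in [1]) and the SVM scoring itself is not shown:

```python
from itertools import combinations

def triplet_candidates(sentence, stop_words):
    """Tokenize, drop stop words/punctuation, then enumerate all ordered
    three-token combinations -- the triplet candidates described above."""
    tokens = [t.strip(".,;:!?").lower() for t in sentence.split()]
    important = [t for t in tokens if t and t not in stop_words]
    # combinations() keeps the original token order within each candidate
    return list(combinations(important, 3))

# illustrative stop-word list
stops = {"the", "was", "by", "a", "an", "of"}
cands = triplet_candidates("The flat tire was replaced by the driver.", stops)
# 4 important tokens (flat, tire, replaced, driver) -> C(4,3) = 4 candidates
print(cands)
```

In the full approach, each of these candidates would then be scored by the trained SVM, and only the top-scoring ones kept.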

2.     Tree Bank Parser:

A treebank is a text corpus where each sentence belonging to the corpus has a syntactic structure added to it. A detailed extraction logic using different parser techniques has been presented in [2].

–       Stanford Parser:

It is a natural language parser developed by Dan Klein and Christopher D. Manning from The Stanford Natural Language Processing Group [1, 2]. The package contains a Java implementation of probabilistic natural language parsers; a graphical user interface is also available for parse tree visualization. The software is available at [4]. The Stanford Parser generates a Treebank parse tree for a given input sentence. Figure 1 depicts the parse tree for the sentence “the flat tire was replaced by the driver”. A sentence (S) is represented by the parser as a tree having three children: a noun phrase (NP), a verbal phrase (VP) and the full stop (.). The root of the tree is S. The triplet extracted from this sentence is driver – replaced – tire.


–       Link Parser:

This application uses the link grammar, generating a linkage after parsing a sentence. It can be downloaded from the web site [5]. Detailed explanations of what the different link labels mean are available at [6].


–       Minipar Parser:

It is a parser developed by Dekang Lin. Minipar takes one sentence at a time as input and generates tokens of type ‘DepTreeNode’, then assigns relations between these tokens. Each DepTreeNode has a feature called ‘word’: this is the actual text of the word.

[Figure 3]

–       The Multi-Liaison Algorithm:

According to the approach presented in [3], an English sentence can have multiple subjects and objects, and the Multi-Liaison Algorithm was presented for extracting multiple connections or links between subject and object from an input sentence, which can have one or more subjects, predicates and objects. The parse tree visualization and the dependencies generated by the Stanford Parser are used to extract this information from the given sentence. Using the dependencies, an output is generated which displays which subject is related to which object, along with the connecting predicate. Finding the subjects and objects helps determine the entities involved, and the predicates determine the relationship that exists between the subject and the object. The algorithm developed to do so is elucidated step by step in [3]. It was named ‘The Multi-Liaison Algorithm’ because it displays the liaison, i.e. the relationship and association, between the subjects and predicates.

Example input sentence: “The old beggar ran after the rich man who was wearing a black coat”

Multi-liaison algorithm output: 1) beggar – ran – man 2) man – wearing – coat


Introducing a modifier alongside the predicate will increase the clarity of obscure facts in sentences. The goal here is to extract sets of the form {subject; predicate [modifier]; object} out of syntactically parsed sentences. Modifiers are words or phrases that give additional detail about the subject discussed in a sentence. Since these words enhance the reception of a sentence, they tend to be describing words such as adjectives and adverbs. In addition, phrases that modify tend to describe adjectives and adverbs, such as adjective clauses and adverbial phrases. They equip the writer with the capability to provide the reader with the most accurate illustration words can allow. For example, a writer can write a simple sentence that states the facts and nothing more, such as “Joseph caught a fish.” If the writer chooses to utilize modifiers, the sentence could read as follows: “Joseph was a nice tall boy from India, who caught a fish which was smaller than a Mars bar”.

The additional details in the sentence, by way of modifiers, engage the reader and hold their attention.

As per the Stanford CoreNLP dependency manual [7], there are 22 types of modifiers in English sentences. In this paper a set of 3 key or essential modifiers has been identified that will help us get more context out of a sentence.

1.     mwe – Multi-Word Expression Modifier:

The multi-word expression modifier relation is used for certain fixed multi-word expressions that function like a single word (e.g., “rather than”, “as well as”)


2.     advmod – Adverbial Modifier:

An adverbial modifier of a word is a (non-clausal) adverb or adverb-headed phrase that serves to modify the meaning of the word.



3.     neg – Negation Modifier:

The negation modifier is the relation between a negation word and the word it modifies.




  1. Get the syntactic relationship between each pair of words
  2. Apply sentence segmentation to determine the sentence boundaries
  3. The Stanford Parser is then applied to generate output in the form of dependency relations, which represent the syntactic relationships within each sentence
Example Sentence: “The flat tire was not replaced by driver”
Stanford dependency relations: 
root(ROOT-0, replaced-6)
det(tire-3, the-1)
amod(tire-3, flat-2)
nsubjpass(replaced-6, tire-3)
auxpass(replaced-6, was-4)
neg(replaced-6, not-5)
prep(replaced-6, by-7)
pobj(by-7, driver-8)

Triples output in the form (Subject : Predicate [modifier] : Object) :

driver : replaced[not] : tire


You can use the base logic below to build this functionality in the language you are most comfortable with (R/Python/Java/etc.). Please note that this is only the base logic and needs enhancement.

Step 1: Annotate Using StanfordCoreNLP_Pipeline

In this stage we divide a string into tokens based on the given delimiters; a token is one piece of information, a “word”. The string is tokenized via a tokenizer (using a TokenizerAnnotator), and then Penn Treebank annotation is used to add things like lemmas, POS tags, and named entities. These are returned as a list of CoreLabels.

Step 2: Read Into NLP Tree Object

Transforms the trees by turning the labels into their basic categories according to the TreebankLanguagePack.

Step 3: Extract The Basic Dependencies

Stanford dependencies provide a representation of grammatical relations between words in a sentence.

Step 4: Extract Subject Predicate Object FromBasic Dependencies Table

            // a single sentence can have multiple subjects, predicates and objects, so declare lists
            subject = []
            predicate = []
            object = []
            if nsubj exists then // nominal sentence
                        if dobj exists then
                                    append nsubj to subject
                                    append prep to predicate
                                    append dobj to object
                        elseif pobj exists then
                                    append nsubj to subject
                                    append prep to predicate
                                    append pobj to object
                        elseif xcomp exists then
                                    append nsubj to subject
                                    append prep to predicate
                                    append xcomp to object
            elseif nsubjpass exists then // passive sentence
                        append agent to subject
                        append root to predicate
                        append nsubjpass to object
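The pseudocode above can be sketched in Python over a hardcoded dependency table that stands in for the parser output shown earlier. Two assumptions in this sketch: the predicate is taken to be the verb governing nsubj/nsubjpass, which matches the worked examples, and the passive agent is read from prep "by" + pobj, since basic (non-collapsed) dependencies do not emit an agent relation:

```python
# Dependency table mirroring the parser output shown earlier
# for "The flat tire was not replaced by driver".
deps = [
    ("root",      "ROOT",     "replaced"),
    ("det",       "tire",     "the"),
    ("amod",      "tire",     "flat"),
    ("nsubjpass", "replaced", "tire"),
    ("auxpass",   "replaced", "was"),
    ("neg",       "replaced", "not"),
    ("prep",      "replaced", "by"),
    ("pobj",      "by",       "driver"),
]

def extract_spo(dep_table):
    """Map a basic-dependency table to (subject, predicate, object) lists."""
    rel = {}
    for name, gov, dep in dep_table:
        rel.setdefault(name, []).append((gov, dep))
    subject, predicate, obj = [], [], []
    if "nsubj" in rel:                              # nominal / active sentence
        gov, dep = rel["nsubj"][0]
        for obj_rel in ("dobj", "pobj", "xcomp"):
            if obj_rel in rel:
                subject.append(dep)
                predicate.append(gov)               # the governing verb
                obj.append(rel[obj_rel][0][1])
                break
    elif "nsubjpass" in rel:                        # passive sentence
        gov, dep = rel["nsubjpass"][0]
        # in basic dependencies the agent surfaces as prep "by" + pobj
        agent = rel["pobj"][0][1] if "pobj" in rel else "{unspecified}"
        subject.append(agent)
        predicate.append(gov)
        obj.append(dep)
    return subject, predicate, obj

print(extract_spo(deps))
# (['driver'], ['replaced'], ['tire'])
```

A production version would of course walk all nsubj/nsubjpass instances rather than just the first, since a sentence can yield several triples.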

Step 5: Extract Modifier And Named Entity

If the subject is a named entity then we can anonymize it to help us compare the concepts of two sentences. For example, in the two sentences below the concept is the same but the subject differs, so anonymizing the subject tells us the concept is the same.

Sentence 1: ‘The flat tire was replaced by John’ and the triples would be John:replaced:tire

Sentence 2: ‘The flat tire was replaced by Joe’ and the triples would be Joe:replaced:tire

Anonymizing the named entity would look as shown below which makes the comparison easy.

Post anonymizing, sentence 1 output: {unspecified}: replaced: tire

Post anonymizing, sentence 2 output: {unspecified}: replaced: tire

            if subject = named_entity then
                        subject = {unspecified} // anonymize
            modifier = advmod + mwe + neg
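Step 5 might look like the following sketch, continuing the running example. The named-entity check is stubbed with a hardcoded set purely for illustration; a real pipeline would use the NER annotations from Stanford CoreNLP:

```python
# stand-in for a real NER tagger (illustrative only)
NAMED_ENTITIES = {"john", "joe"}

deps = [
    ("nsubjpass", "replaced", "tire"),
    ("neg",       "replaced", "not"),
    ("prep",      "replaced", "by"),
    ("pobj",      "by",       "driver"),
]

def format_triple(subject, predicate, obj, dep_table):
    # modifier = advmod + mwe + neg attached to the predicate
    mods = [dep for name, gov, dep in dep_table
            if name in ("advmod", "mwe", "neg") and gov == predicate]
    if subject.lower() in NAMED_ENTITIES:
        subject = "{unspecified}"            # anonymize named entities
    mod = "[" + ",".join(mods) + "]" if mods else ""
    return f"{subject} : {predicate}{mod} : {obj}"

print(format_triple("driver", "replaced", "tire", deps))
# driver : replaced[not] : tire
print(format_triple("John", "replaced", "tire", []))
# {unspecified} : replaced : tire
```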

Step 6: Represent Subject : Predicate [Modifier] : Object


A nominal sentence is a linguistic term that refers to a nonverbal sentence (i.e. a sentence without a finite verb). As a nominal sentence does not have a verbal predicate, it may contain a nominal predicate, an adjectival predicate, an adverbial predicate or even a prepositional predicate.

Active and Passive Sentences. A sentence is written in active voice when the subject of the sentence performs the action in the sentence, e.g. “The girl was washing the dog.” A sentence is written in passive voice when the subject of the sentence has an action done to it by someone or something else, e.g. “The dog was being washed by the girl.”

[Figure 4]

Chart Reference:
nsubj : nominal subject
nsubjpass: passive nominal subject
dobj: direct object
root: root
xcomp: open clausal complement
prep: prepositional modifier
pobj: object of a preposition
neg: negation modifier
advmod: adverbial modifier
mwe: multi-word expression

Note: More details on all possible dependencies can be found in the Stanford dependency manual [7]


Inclusion of a modifier alongside the predicate helps bring more meaningful context into the triples structure. In general, the deep NLP technique for sentence-level analysis is highly structured, and the use of abbreviations or grammatical errors in a sentence will mislead the analysis. However, for a well-formed English sentence, extraction of triples gives us the key elements of the sentence. In addition, inclusion of negation, multi-word expression and adverbial modifiers alongside the predicate helps bring more context to the triples. We have to be cautious in choosing the kinds of additional modifiers (if any) to show alongside the predicate, depending on the business context. Further research is needed to better understand which additional modifiers qualify to appear alongside the predicate to store more context from different types of sentences.


  1. Lorand Dali, Blaž Fortuna, Artificial Intelligence Laboratory, Jožef Stefan Institute, “Triples extraction from sentences using SVM” in Slovenian KDD Conference on Data Mining and Data Warehouses (SiKDD), Ljubljana, 2008.
  2. Delia Rusu, Lorand Dali, Blaž Fortuna, Marko Grobelnik, Dunja Mladenić, “Triplet extraction from sentences”, Artificial Intelligence Laboratory, Jožef Stefan Institute, Slovenia, Nov. 7, 2008.
  3. Anjali Ganesh Jivani, Amisha Hetal Shingala, Paresh V. Virparia, “The Multi-Liaison Algorithm”, International Journal of Advanced Computer Science and Applications, Vol. 2, No. 5, 2011.
  4. Stanford Parser web page:
  5. Link Parser web page:
  6. Link labels web page:
  7. Stanford dependency manual

If you are new to text mining, you can learn the basic concepts involved in text mining here! I welcome your feedback.


Text Mining 101

Text Mining or Text Analytics is the discovery and communication of meaningful patterns in text data. As part of this 101, I would like to cover the building blocks of text mining:

  • TM process overview
  • Calculate term weight (TF-IDF)
  • Similarity distance measure (Cosine)
  • Overview of key text mining techniques

Text Mining Process Overview

Broadly, there are 4 stages in the text mining process. There are great open source tools available (R, Python, etc.) to carry out the process described here. The steps remain almost the same irrespective of the analysis platform.

– Step 1: Data Assembly
– Step 2: Data Processing
– Step 3: Data Exploration or Visualization
– Step 4: Model Building


Brief description of the Data Processing steps

Explore Corpus – Understand the types of variables, their functions, permissible values, and so on. Some formats, including HTML and XML, contain tags and other data structures that provide more metadata.

Convert text to lowercase – This avoids distinguishing between words simply based on case.

Remove Numbers (if required) – Numbers may or may not be relevant to our analyses.

Remove Punctuation – Punctuation can provide grammatical context which supports understanding. Often for initial analyses we ignore the punctuation. Later we will use punctuation to support the extraction of meaning.

Remove English stop words – Stop words are common words found in a language; words like for, of, are, etc. are common English stop words.

Remove own stop words (if required) – Along with English stop words, we could instead, or in addition, remove our own stop words. The choice of own stop words might depend on the domain of discourse, and might not become apparent until we’ve done some analysis.

Strip white space – Eliminate extra white spaces.

Stemming – Transforms words to their root form. Stemming uses an algorithm that removes common word endings from English words, such as “es”, “ed” and “’s”. For example, “computer” & “computers” both become “comput”.

Lemmatisation – Transforms words to their dictionary base form, e.g., “produce” & “produced” become “produce”.

Sparse terms – We are often not interested in infrequent terms in our documents. Such “sparse” terms should be removed from the document term matrix.

Document term matrix – A document term matrix is simply a matrix with documents as the rows and terms as the columns and a count of the frequency of words as the cells of the matrix.
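The processing steps above can be sketched in Python. This minimal version covers only lowercasing, punctuation and stop-word removal, and building the document term matrix; stemming and the other steps are omitted, and the stop-word list is illustrative only:

```python
import string
from collections import Counter

# illustrative stop-word list only
STOP_WORDS = {"for", "of", "are", "the", "and", "a", "i", "in"}

def preprocess(text):
    """Lowercase, strip punctuation and stop words (stemming omitted)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in STOP_WORDS]

def document_term_matrix(docs):
    """Rows are documents, columns are terms, cells are term counts."""
    tokens = [preprocess(d) for d in docs]
    vocab = sorted({w for toks in tokens for w in toks})
    return vocab, [[Counter(toks)[w] for w in vocab] for toks in tokens]

vocab, dtm = document_term_matrix([
    "Data processing cleans the text.",
    "Text mining finds patterns in text.",
])
print(vocab)
print(dtm)
```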

Calculate Term Weight – TF-IDF

How frequently does a term appear?

Term Frequency: TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

How important is a term?

DF: Document Frequency = d (number of documents containing a given term) / D (the size of the collection of documents)

To normalize, take log(d/D); but since d ≤ D, log(d/D) gives a negative (or zero) value, so we invert the ratio inside the log expression. Essentially we are compressing the scale of values so that very large and very small quantities can be smoothly compared.

IDF: Inverse Document Frequency IDF(t) = log(Total number of documents / Number of documents with term t in it)


Consider a document containing 100 words wherein the word CAR appears 3 times.

TF(CAR) = 3 / 100 = 0.03

Now, assume we have 10 million documents and the word CAR appears in one thousand of these

IDF(CAR) = log(10,000,000 / 1,000) = 4

The TF-IDF weight is the product of these quantities: 0.03 * 4 = 0.12
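The calculation above is straightforward to sketch in Python (using log base 10, matching the worked CAR example):

```python
import math

def tf(term_count, total_terms):
    """Term frequency: occurrences of the term / total terms in the document."""
    return term_count / total_terms

def idf(n_docs, n_docs_with_term):
    """Inverse document frequency, log base 10."""
    return math.log10(n_docs / n_docs_with_term)

def tf_idf(term_count, total_terms, n_docs, n_docs_with_term):
    return tf(term_count, total_terms) * idf(n_docs, n_docs_with_term)

# the CAR example: 3 occurrences in a 100-word document,
# 1,000 of 10,000,000 documents contain the term
weight = tf_idf(3, 100, 10_000_000, 1_000)
print(weight)  # ~0.12
```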

Similarity Distance Measure (Cosine)

Why Cosine?

Here is a detailed paper comparing the efficiency of different distance measures for text documents. The general observation is that cosine similarity works better than Euclidean distance for text data.


So let’s understand how to calculate cosine similarity.


Text 1: statistics skills and programming skills are equally important for analytics

Text 2: statistics skills and domain knowledge are important for analytics

Text 3: I like reading books and travelling

Document Term Matrix for the above 3 texts (term counts per text):

Term          T1  T2  T3
statistics     1   1   0
skills         2   1   0
and            1   1   1
programming    1   0   0
domain         0   1   0
are            1   1   0
equally        1   0   0
important      1   1   0
for            1   1   0
analytics      1   1   0
knowledge      0   1   0
I              0   0   1
like           0   0   1
reading        0   0   1
books          0   0   1
travelling     0   0   1

The three vectors are:

T1 = (1,2,1,1,0,1,1,1,1,1,0,0,0,0,0,0)

T2 = (1,1,1,0,1,1,0,1,1,1,1,0,0,0,0,0)

T3 = (0,0,1,0,0,0,0,0,0,0,0,1,1,1,1,1)

Degree of Similarity (T1 & T2) = (T1 %*% T2) / (sqrt(sum(T1^2)) * sqrt(sum(T2^2))) = 77%

Degree of Similarity (T1 & T3) = (T1 %*% T3) / (sqrt(sum(T1^2)) * sqrt(sum(T3^2))) = 12%
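The same calculation can be done in plain Python using the three vectors above:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

T1 = (1, 2, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
T2 = (1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0)
T3 = (0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

print(round(cosine_similarity(T1, T2) * 100))  # 77
print(round(cosine_similarity(T1, T3) * 100))  # 12
```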

Overview of key Text Mining techniques

Three key techniques are in practice:

  1. N-gram based analytic
  2. Shallow Natural Language Processing technique
  3. Deep Natural Language Processing technique

N-gram based analytic


  • An n-gram is a contiguous sequence of n items from a given sequence of text
  • The items can be syllables, letters, words or base pairs according to the application


  • A probabilistic language model predicts the next item in such a sequence in the form of an (n − 1)-order Markov model
  • Widely used in probability, communication theory, computational linguistics, biological sequence analysis


Advantages:

  • Relatively simple
  • Simply by increasing n, the model can be used to store more context

Disadvantage:

  • Semantic value of the item is not considered


Let’s look at the n-gram output for the sample sentence “defense attorney for liberty and montecito”

  • 1-gram: defense, attorney, for, liberty, and, montecito
  • 2-gram: defense attorney, attorney for, for liberty, liberty and, and montecito
  • 3-gram: defense attorney for, attorney for liberty, for liberty and, liberty and montecito
  • 4-gram: defense attorney for liberty, attorney for liberty and, for liberty and montecito
  • 5-gram: defense attorney for liberty and montecito, attorney for liberty and montecito
  • 6-gram: defense attorney for liberty and montecito
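A minimal n-gram generator that reproduces the lists above (whitespace tokenization only, for illustration):

```python
def ngrams(text, n):
    """Return all contiguous n-token sequences of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "defense attorney for liberty and montecito"
print(ngrams(sentence, 2))
# ['defense attorney', 'attorney for', 'for liberty', 'liberty and', 'and montecito']
```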

Shallow NLP technique


  • Assign a syntactic label (noun, verb etc.) to a chunk
  • Knowledge extraction from text through semantic/syntactic analysis approach



  • Taxonomy extraction (predefined terms and entities). Entities: People, organizations, locations, times, dates, prices, genes, proteins, diseases, medicines
  • Concept extraction (main idea or a theme)

Advantage: Less noisy than n-grams

Disadvantage: Does not specify role of items in the main sentence

Example: Consider the sentence “The driver from Bangalore crashed the bike with the black bumper”. Let’s examine the results of applying the n-gram and shallow NLP techniques to extract the concept from this sample sentence.

Apply the 3 steps below to the example sentence:

  • Convert to lowercase & PoS tag
  • Remove stop words
  • Retain only nouns & verbs as these hold higher weight in the sentence

1-gram output: driver, bangalore, crashed, bike, bumper


Bi-gram output with nouns/verbs retained: crashed bike, driver bangalore, bangalore crashed


3-gram output with nouns/verbs retained: driver bangalore crashed, bangalore crashed bike



  • 1-gram: Reduced noise, however no clear context
  • Bi-gram & 3-gram: Increased context, however there is information loss (e.g. “bumper” tells us what was crashed, but it is not in the output)

Deep NLP technique


  • Extension to the shallow NLP
  • Get the syntactic relationship between each pair of words
  • Apply sentence segmentation to determine the sentence boundaries
  • The Stanford Parser (or any similar) is then applied to generate output in the form of dependency relations, which represent the syntactic relations within each sentence
  • Detected relationships are expressed as complex construction to retain the context
  • Example relationships: Located in, employed by, part of, married to

Applications: Develop features and representations appropriate for complex interpretation tasks like fraud detection, and prediction activities based on complex RNA-Sequence data in life science.


The above sentence can be represented using triples (Subject : Predicate [Modifier] : Object) without losing the context. Modifiers are negations, multi-word expressions and adverbial modifiers like not, maybe, however, etc. You can learn more about Stanford typed dependencies here.

Triples Output:

  • driver : from : bangalore
  • driver : crashed : bike
  • driver : crashed with : bumper


Disadvantages:

  • Sentence-level analysis is too structured
  • Usage of abbreviations and grammatical errors in the sentence will mislead the analysis



Hope this helps and I welcome feedback/comments.