Accelerate Analysis, Insight & Data Mining (for non coders)

R and Python are the most popular, powerful, and well designed (reasonably approachable for non-techies) open source, command-line based languages for anyone interested in analysis, data science, or machine learning. Getting started and moving to a proficient level in these (or any) programming languages is a tedious and time-consuming process. However, accelerated delivery of insight, analysis, modelling, and proofs of concept is a key characteristic of a successful analytics team, because it validates the strategies that enable us to drive business decisions in the right direction. The aim of this article is to provide useful resources around tools that help analysts accelerate the delivery of insight, analysis, and modelling.

Let's first understand how the analysis, insight, and data mining process is performed.

How do we perform analysis/insight?

The objective for an analyst is to convert data/information into insight and recommend a possible action that the business can take. Insight is the mechanism for doing this successfully, and throughout the work it is key to always keep the business context in mind.

Insight_Model

How do we do data mining?

Three data mining process frameworks have been the most popular, and are widely practiced by data mining experts/researchers to build machine learning systems. You'll notice that the core phases are covered by all three frameworks, with little difference between them.

1. Knowledge Discovery in Databases (KDD) process model

KDD

KDD is essentially the integration of multiple data mining technologies; the process was presented by Fayyad et al. in 1996. Learn more from the link https://www.aaai.org/ojs/index.php/aimagazine/article/viewFile/1230/1131

2. CRoss Industry Standard Process for Data Mining (CRISP-DM)

CRISP_DM

CRISP-DM creates an unbiased methodology that is not domain dependent and consolidates data mining best practices. CRISP-DM was voted the leading methodology for data mining in polls in 2002, 2004 & 2007. You can learn more about CRISP-DM from the following link: https://www.the-modeling-agency.com/crisp-dm.pdf

3. Sample, Explore, Modify, Model and Assess (SEMMA)

SEMMA

SEMMA is a sequence of 5 steps for building machine learning models, incorporated in SAS Enterprise Miner by SAS Inc. You can learn more about SEMMA through the following link: http://www2.sas.com/proceedings/sugi22/DATAWARE/PAPER128.PDF

Tools that you know & able to use = your skill level!

A Command-Line Interface (CLI) is far faster and more powerful than user-interface based tools in terms of flexibility and features; however, a Graphical User Interface (GUI) is easy to use, interactive, and comes with out-of-the-box visualization. The aim here is not GUI vs CLI, but to be equipped with enough tools to understand and apply basic concepts, gain independence when it comes to data analysis, and communicate it effectively. At the end of the day, the quick-delivery skills of an analyst boil down to the tools they are aware of and able to use, so the goal here is to introduce you to the MUST-KNOW GUI based open source tools that do not require much coding to get started, and that help accelerate the delivery of analytics (particularly during the initial stage of your data science adoption).

Screenshot from 2019-11-01 20-05-10

Tool power to the analyst

Here is a summary of the general advantages and disadvantages of GUI based analytical tools:

Pros:

  • GUIs are point-and-click, which makes them fun to work with
  • Great functionality, many data mining packages, stunning out-of-the-box visualizations
  • Cross-platform support for the key operating systems (Windows, Mac, Linux)

Cons:

  • They store all of the data in RAM, so they can crash or run slowly with high volumes of data

These tools to accelerate analysis, insight, and mining can be divided into two categories.

  1. R/Python aiders: These are built to utilize the native capabilities of R/Python. Note that I have listed the most popular GUIs for R/Python below (the list does not necessarily cover all of the available tools)
    1. Rattle (R)
    2. JGR/Deducer (R)
    3. RCommander (R)
    4. Orange (Python)
  2. Independent platforms
    1. H2O
    2. KNIME

1.1 R: Rattle

How to install: run the below command in your RStudio

install.packages("rattle", repos="http://rattle.togaware.com", type="source")

How to launch:

library(rattle);rattle()

Full Tutorial: Click here!

Sample Screenshot(s):

rattle.png

1.2 JGR / Deducer

How to Install: run the below command in your RStudio

install.packages(c("JGR","Deducer","DeducerExtras"), dependencies=TRUE)

How to launch: 

library(JGR); JGR()

Tutorials: Link-1 here!, Link-2 here!

Sample Screenshots:

JGB.png

1.3 RCommander

How to install: run the below command in your RStudio

install.packages("Rcmdr", dependencies=T)

How to launch:

library(Rcmdr)

Tutorials: Link-1 here!, Link-2 here!

Sample Screenshots:

RCommander.png

1.4 Python Orange

Orange is an open source data analysis and visualization tool from the AI Laboratory in Ljubljana, Slovenia.

How to install: you can download the appropriate executable installation file from their official web site for your OS here!

How to launch: The installation will create a desktop shortcut and add the application to the menu bar; it can be launched from either.

Tutorial: Click here!

Sample Screenshot:

Orange.png

2.1 H2O

H2O is open-source software for big-data analysis, produced by the company H2O.ai, which launched in 2011 in Silicon Valley. H2O allows users to fit thousands of potential models as part of discovering patterns in data. It is claimed to be one of the world's leading open source deep learning platforms, used by over 100,000 data scientists and more than 10,000 organizations around the world. Their stated design goal is "To Bring Beautiful Business Transformation Through AI and Visual Intelligence", achieved through 1) Make it Open, 2) Make it Fast, Really Fast, and 3) Make it Beautiful.

How to Install: Click here to learn more.
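If you prefer to stay inside R, here is a minimal sketch (my own shortcut, not the official installation guide): the h2o package on CRAN plus a local Java runtime is enough to start a local cluster and reach the Flow web UI.

# Minimal sketch: install the h2o R package and start a local H2O cluster.
# Assumes the CRAN package and a Java runtime are available; see the official
# installation guide linked above for the recommended route.
install.packages("h2o")

library(h2o)
h2o.init()   # starts (or connects to) a local H2O cluster
# The H2O Flow web UI is then typically served at http://localhost:54321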

Tutorial: Click here!

Sample Screenshot:

H2o.png

2.2 KNIME

KNIME, the Konstanz Information Miner, is an open source data analytics, reporting, and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipeline concept. KNIME Analytics Platform is the leading open solution for data-driven innovation, helping you discover the potential hidden in your data, mine for fresh insights, or predict new futures. Its enterprise-grade, open source platform is fast to deploy, easy to scale, and intuitive to learn. With more than 1000 modules, hundreds of ready-to-run examples, a comprehensive range of integrated tools, and the widest choice of advanced algorithms available, KNIME Analytics Platform is the perfect toolbox for any data scientist. Their steady course on an unrestricted open source is your passport to a global community of data scientists, their expertise, and their active contributions. Read more here!

How to Install: You can download the appropriate executable installation file from their official website for your OS here!

Tutorial: Link-1 here!, Link-2 here!

Sample Screenshots:

KNIME.png

2.3 WEKA

Waikato Environment for Knowledge Analysis (WEKA) is a suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. It is free software licensed under the GNU General Public License. Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

How to Install: Click here!

Tutorials: Link-1 here!, Link-2 here!, Link-3 (video) here!

Sample Screenshot:

WEKA.png

Hope this article was useful!

Sentiment Analysis

Sentiment

Definition:

The process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic, product, etc. is positive, negative, or neutral

Use case:

Customers' online comments/feedback from an insurance company's website have been scraped to run through sentiment analysis.

You can find full R code along with the data set in my git repository here

Steps:

  • Load required R libraries

    # source("http://bioconductor.org/biocLite.R")
    # biocLite("Rgraphviz")
    # install.packages('tm')
    # install.packages('wordcloud')
    # download.file("http://cran.cnr.berkeley.edu/src/contrib/Archive/Rstem/Rstem_0.4-1.tar.gz", "Rstem_0.4-1.tar.gz")
    # install.packages("Rstem_0.4-1.tar.gz", repos=NULL, type="source")
    # download.file("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz", "sentiment.tar.gz")
    # install.packages("sentiment.tar.gz", repos=NULL, type="source")
    
    # Load libraries
    library(wordcloud)
    library(tm)
    library(plyr)
    library(ggplot2)
    library(grid)
    library(sentiment)
    library(Rgraphviz)
  • Pre-process data:

    Text pre-processing is an important step to reduce noise in the data. Each step is discussed below.

  • convert to lower: this is to avoid distinguishing between words simply based on case
  • remove punctuation: punctuation can provide grammatical context which supports understanding. Often for initial analyses we ignore the punctuation
  • remove numbers: numbers may or may not be relevant to our analyses
  • remove stop words: stop words are common words found in a language. Words like for, of, are, etc. are common stop words
  • create document term matrix: a document term matrix is simply a matrix with documents as the rows and terms as the columns and a count of the frequency of words as the cells of the matrix
df <- read.table("../input/data.csv", sep=",", header=TRUE)       # load the scraped feedback
corp <- Corpus(VectorSource(df$Review))                           # build a corpus from the Review column
corp <- tm_map(corp, tolower)                                     # convert to lower case
corp <- tm_map(corp, removePunctuation)                           # remove punctuation
corp <- tm_map(corp, removeNumbers)                               # remove numbers
# corp <- tm_map(corp, stemDocument, language = "english")        # (optional) stemming
corp <- tm_map(corp, removeWords, c("the", stopwords("english"))) # remove stop words
corp <- tm_map(corp, PlainTextDocument)
corp.tdm <- TermDocumentMatrix(corp, control = list(minWordLength = 3))  # term-document matrix
corp.dtm <- DocumentTermMatrix(corp, control = list(minWordLength = 3))  # document-term matrix
  • Insight through visualization

    • Word cloud: This visualization draws words whose font size reflects their frequency.

      wordcloud(corp, scale=c(5,0.5), max.words=100, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, 'Dark2'))

      Word Cloud

    • Frequency plot: This visualization presents a bar chart whose length corresponds to the frequency with which a particular word occurred
      corp.tdm.df <- sort(rowSums(as.matrix(corp.tdm)), decreasing=TRUE) # term frequencies, sorted in descending order
      df.freq <- data.frame(word = names(corp.tdm.df),freq=corp.tdm.df) # Table with terms and frequency
      
      # Set minimum term frequency value. The charts will be created for terms > or = to the minimum value that we set.
      freqControl <- 100
      # Frequency Plot
      freqplotData <- subset(df.freq, df.freq$freq > freqControl)
      freqplotData$word <- ordered(freqplotData$word,levels=levels(freqplotData$word)[unclass(freqplotData$word)])
      freqplot <- ggplot(freqplotData,aes(reorder(word,freq), freq))
      freqplot <- freqplot + geom_bar(stat="identity")
      freqplot <- freqplot + theme(axis.text.x=element_text(angle=90,hjust=1)) + coord_flip() 
      freqplot + xlim(rev(levels(freqplotData$word)))+ ggtitle("Frequency Plot")

      Frequency.png

    • Correlation plot: Here, we choose the N most frequent words as the nodes and include links between words when they have at least a correlation of x%
      # Correlation Plot
      # 50 of the more frequent words have been chosen as the nodes and include links between words
      # when they have at least a correlation of 0.2
      # By default (without providing terms and a correlation threshold) the plot function chooses a
      # random 20 terms with a threshold of 0.7
      plot(corp.tdm,terms=findFreqTerms(corp.tdm,lowfreq=freqControl)[1:50],corThreshold=0.2, main="Correlation Plot")

      Correlation.png

    • Paired word cloud: This is a customized word cloud. Here, we pick the top N most frequent words and extract associated words with strong correlation. We combine each top-N word with every associated word (say one of my top words is broken and one of the associated words is pipe; the combined word would be broken-pipe). Then we create a word cloud of the combined words. Although the concept is good, the chart below does not appear helpful, so I need to figure out a better representation.

      # Paired-Terms wordcloud
      # pick the top N most frequent words and extract associated words with strong correlation (say 70%). 
      # Combine individual top N words with every associated word.
      nFreqTerms <- findFreqTerms(corp.dtm,lowfreq=freqControl)
      nFreqTermsAssocs <- findAssocs(corp.dtm, nFreqTerms, 0.3)
      pairedTerms <- c()
      for (i in 1:length(nFreqTermsAssocs)){
        if(length(names(nFreqTermsAssocs[[i]]))!=0) 
          lapply(names(nFreqTermsAssocs[[i]]),function(x) pairedTerms <<- c(pairedTerms,paste(names(nFreqTermsAssocs[i]),x,sep="-")))
      }
      wordcloud(pairedTerms, random.order=FALSE, colors=brewer.pal(8, "Dark2"))

      Paired Word Cloud

  • Sentiment Score

    • Load positive / negative terms corpus

      The corpus contains around 6,800 words; this list was compiled over many years, starting from the first paper by Hu and Liu, KDD-2004. Although necessary, having an opinion lexicon is far from sufficient for accurate sentiment analysis. See this paper: Sentiment Analysis and Subjectivity, or the Sentiment Analysis

    • Calculate positive / negative score

      We simply calculate the positive/negative score by comparing the terms with the positive/negative term corpus and summing the occurrence counts

    • Classify emotion

      R package sentiment by Timothy Jurka has a function that helps us to analyze some text and classify it in different types of emotion: anger, disgust, fear, joy, sadness, and surprise. The classification can be performed using two algorithms: one is a naive Bayes classifier trained on Carlo Strapparava and Alessandro Valitutti’s emotions lexicon; the other one is just a simple voter procedure.

    • Classify polarity

      Another function from sentiment package, classify_polarity allows us to classify some text as positive or negative. In this case, the classification can be done by using a naive Bayes algorithm trained on Janyce Wiebe’s subjectivity lexicon; or by a simple voter algorithm.

      hu.liu.pos = scan('../input/positive-words.txt', what = 'character',comment.char=';') 
      hu.liu.neg = scan('../input/negative-words.txt',what = 'character',comment.char= ';') 
      pos.words = c(hu.liu.pos)
      neg.words = c(hu.liu.neg)
      score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
      {
        require(plyr)
        require(stringr)
        
        # we got a vector of sentences. plyr will handle a list
        # or a vector as an "l" for us
        # we want a simple array ("a") of scores back, so we use
        # "l" + "a" + "ply" = "laply":
        scores = laply(sentences, function(sentence, pos.words, neg.words) {
          
          # clean up sentences with R's regex-driven global substitute, gsub():
          sentence = gsub('[[:punct:]]', '', sentence)
          sentence = gsub('[[:cntrl:]]', '', sentence)
          sentence = gsub('\\d+', '', sentence)
          # and convert to lower case:
          sentence = tolower(sentence)
          
          # split into words. str_split is in the stringr package
          word.list = str_split(sentence, '\\s+')
          # sometimes a list() is one level of hierarchy too much
          words = unlist(word.list)
          
          # compare our words to the dictionaries of positive & negative terms
          pos.matches = match(words, pos.words)
          neg.matches = match(words, neg.words)
          
          # match() returns the position of the matched term or NA
          # we just want a TRUE/FALSE:
          pos.matches= !is.na(pos.matches)
          neg.matches= !is.na(neg.matches)
          
          # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
          score = sum(pos.matches) - sum(neg.matches)
          
          return(score)
        }, pos.words, neg.words, .progress=.progress )
        
        scores.df = data.frame(score=scores, text=sentences)
        return(scores.df)
      }
      
      review.scores<- score.sentiment(df$Review,pos.words,neg.words,.progress='text')
      #classify emotion
      class_emo = classify_emotion(df$Review, algorithm="bayes", prior=1.0)
      #get emotion best fit
      emotion = class_emo[,7]
      # substitute NA's by "unknown"
      emotion[is.na(emotion)] = "unknown"
      
      # classify polarity
      class_pol = classify_polarity(df$Review, algorithm="bayes")
      
      # get polarity best fit
      polarity = class_pol[,4]
      
      # data frame with results
      sent_df = data.frame(text=df$Review, emotion=emotion, polarity=polarity, stringsAsFactors=FALSE)
      
      # sort data frame
      sent_df = within(sent_df, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))
      
      
    • Visualize

      • Distribution of overall score
        ggplot(review.scores, aes(x=score)) + 
          geom_histogram(binwidth=1) + 
          xlab("Sentiment score") + 
          ylab("Frequency") + 
          ggtitle("Distribution of sentiment score") +
          theme_bw()  + 
          theme(axis.title.x = element_text(vjust = -0.5, size = 14)) + 
          theme(axis.title.y=element_text(size = 14, angle=90, vjust = -0.25)) + 
          theme(plot.margin = unit(c(1,1,2,2), "lines"))

        Sentiment Score Distribution.png

      • Distribution of score for a given term
        review.pos<- subset(review.scores,review.scores$score>= 2) 
        review.neg<- subset(review.scores,review.scores$score<= -2)
        claim <- subset(review.scores, regexpr("claim", review.scores$text) > 0) 
        ggplot(claim, aes(x = score)) + geom_histogram(binwidth = 1) + ggtitle("Sentiment score for the token 'claim'") + xlab("Score") + ylab("Frequency") + theme_bw()  + theme(axis.title.x = element_text(vjust = -0.5, size = 14)) + theme(axis.title.y = element_text(size = 14, angle = 90, vjust = -0.25)) + theme(plot.margin = unit(c(1,1,2,2), "lines"))

        Claim Score.png

      • Distribution of emotion
        # plot distribution of emotions
        ggplot(sent_df, aes(x=emotion)) +
          geom_bar(aes(y=..count.., fill=emotion)) +
          scale_fill_brewer(palette="Dark2") +
          labs(x="emotion categories", y="number of Feedback", 
               title = "Sentiment Analysis of Feedback about claim(classification by emotion)",
               plot.title = element_text(size=12))

        Emotions Distribution.png

      • Distribution of polarity
        # plot distribution of polarity
        ggplot(sent_df, aes(x=polarity)) +
          geom_bar(aes(y=..count.., fill=polarity)) +
          scale_fill_brewer(palette="RdGy") +
          labs(x="polarity categories", y="number of Feedback", 
               title = "Sentiment Analysis of Feedback about claim (classification by polarity)",
               plot.title = element_text(size=12))

        Polarity.png

      • Text by emotion
        # separating text by emotion
        emos = levels(factor(sent_df$emotion))
        nemo = length(emos)
        emo.docs = rep("", nemo)
        for (i in 1:nemo)
        {
          tmp = df$Review[emotion == emos[i]]
          emo.docs[i] = paste(tmp, collapse=" ")
        }
        
        # remove stopwords
        emo.docs = removeWords(emo.docs, stopwords("english"))
        # create corpus
        corpus = Corpus(VectorSource(emo.docs))
        tdm = TermDocumentMatrix(corpus)
        tdm = as.matrix(tdm)
        colnames(tdm) = emos
        
        # comparison word cloud
        comparison.cloud(tdm, colors = brewer.pal(nemo, "Dark2"),
                         scale = c(3,.5), random.order = FALSE, title.size = 1.5)

        Text by emotion.png

The next post will cover sentiment analysis in R + Hadoop.

 

Reference:

The above write-up is based on tutorials from the following links:

Triples – Deep Natural Language Processing

Problem: In text mining, extracting keywords (n-grams) alone cannot produce meaningful data nor discover "unknown" themes and trends.

Objective: The aim here is to extract dependency relations from sentences, i.e., extract sets of the form {subject, predicate[modifiers], object} out of syntactically parsed sentences, using the Stanford Parser and OpenNLP.

Steps: 

1) Get the syntactic relationship between each pair of words

2) Apply sentence segmentation to determine the sentence boundaries

3) The Stanford Parser is then applied to generate output in the form of dependency relations, which represent the syntactic relationships within each sentence

How is this different from n-grams?

Dependency relations allow the similarity comparison to be based on the syntactic relations between words, instead of having to match words in their exact order as in n-gram based comparisons.

Example:

Sentence: “The flat tire was not changed by driver”

Stanford dependency relations: 

root(ROOT-0, changed-6)
det(tire-3, the-1)
amod(tire-3, flat-2)
nsubjpass(changed-6, tire-3)
auxpass(changed-6, was-4)
neg(changed-6, not-5)
prep(changed-6, by-7)
pobj(by-7, driver-8)

Refer to the Stanford typed dependencies manual for the full list & more info: http://nlp.stanford.edu/software/dependencies_manual.pdf

Triples output in the form (Subject : Predicate [modifier] : Object) :  

driver : changed [not] : tire

Extraction Logic: You can use the below base logic to build the functionality in your favorite/comfortable language (R/Python/Java/etc.). Please note that this is only the base logic and needs enhancement; a rough R sketch follows the diagram.

triples
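For illustration only, here is a hedged R sketch of that base logic under simplifying assumptions: the dependency strings shown above are parsed with regular expressions, and only a handful of relations (root, nsubjpass, pobj, neg) are handled, just enough to reproduce the example triple.

# Rough sketch: build a {subject : predicate [modifier] : object} triple from
# Stanford dependency strings such as "nsubjpass(changed-6, tire-3)".
# Only a few relations are handled; a real implementation needs many more cases.
deps <- c("root(ROOT-0, changed-6)", "det(tire-3, the-1)", "amod(tire-3, flat-2)",
          "nsubjpass(changed-6, tire-3)", "auxpass(changed-6, was-4)",
          "neg(changed-6, not-5)", "prep(changed-6, by-7)", "pobj(by-7, driver-8)")

# split "rel(gov-i, dep-j)" into relation, governor word and dependent word
parse_dep <- function(x) {
  m <- regmatches(x, regexec("^(\\w+)\\(([^-]+)-\\d+, ([^-]+)-\\d+\\)$", x))[[1]]
  list(rel = m[2], gov = m[3], dep = m[4])
}
parsed <- lapply(deps, parse_dep)

get_dep <- function(rel) {                 # dependent word of the first matching relation
  hit <- Filter(function(p) p$rel == rel, parsed)
  if (length(hit) > 0) hit[[1]]$dep else NA
}

predicate <- get_dep("root")        # "changed"
object    <- get_dep("nsubjpass")   # passive subject is the logical object: "tire"
subject   <- get_dep("pobj")        # agent of "by" is the logical subject: "driver"
modifier  <- get_dep("neg")         # "not"

cat(sprintf("%s : %s [%s] : %s\n", subject, predicate, modifier, object))
# driver : changed [not] : tire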

Challenges:

  • It works at the sentence level, which assumes well-structured text
  • Abbreviations and grammatical errors in sentences will mislead the analysis

 

Hope this article is useful!

Text Mining 101

Text Mining, or Text Analytics, is the discovery and communication of meaningful patterns in text data. As part of this 101, I would like to cover the building blocks of TM:

  • TM process overview
  • Calculate term weight (TF-IDF)
  • Similarity distance measure (Cosine)
  • Overview of key text mining techniques

Text Mining Process Overview


Broadly, there are 4 stages in the text mining process. There are great open source tools available (R, Python, etc.) to carry out the process described here. The steps remain almost the same irrespective of the analysis platform.

– Step 1: Data Assembly
– Step 2: Data Processing
– Step 3: Data Exploration or Visualization
– Step 4: Model Building

R - Text Mining-001

Brief description of the Data Processing steps

Explore Corpus – Understand the types of variables, their functions, permissible values, and so on. Some formats, including HTML and XML, contain tags and other data structures that provide more metadata.

Convert text to lowercase – This is to avoid distinguishing between words simply based on case.

Remove Number(if required) – Numbers may or may not be relevant to our analyses.

Remove Punctuation – Punctuation can provide grammatical context which supports understanding. Often for initial analyses we ignore the punctuation. Later we will use punctuation to support the extraction of meaning.

Remove English stop words – Stop words are common words found in a language. Words like for, of, are, etc. are common stop words.

Remove Own stop words(if required) – Along with English stop words, we could instead or in addition remove our own stop words. The choice of own stop word might depend on the domain of discourse, and might not become apparent until we’ve done some analysis.

Strip white space – Eliminate extra white spaces.

Stemming – Transforms words to their root form. Stemming uses an algorithm that removes common word endings for English words, such as "es", "ed" and "'s". For example, "computer" & "computers" become "comput".

Lemmatisation – Transforms words to their dictionary base form, e.g., "produce" & "produced" become "produce".

Sparse terms – We are often not interested in infrequent terms in our documents. Such “sparse” terms should be removed from the document term matrix.

Document term matrix – A document term matrix is simply a matrix with documents as the rows and terms as the columns and a count of the frequency of words as the cells of the matrix.
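A minimal sketch of these processing steps with the tm and SnowballC packages (the two example documents and the sparsity threshold are illustrative; newer versions of tm need content_transformer() around base functions such as tolower):

# Illustrative sketch of the processing steps above
library(tm)
library(SnowballC)

docs <- c("Computers and computer science!", "Produce was produced in 2019.")
corp <- Corpus(VectorSource(docs))
corp <- tm_map(corp, content_transformer(tolower))       # convert to lowercase
corp <- tm_map(corp, removeNumbers)                      # remove numbers
corp <- tm_map(corp, removePunctuation)                  # remove punctuation
corp <- tm_map(corp, removeWords, stopwords("english"))  # remove English stop words
corp <- tm_map(corp, stripWhitespace)                    # strip extra white space
corp <- tm_map(corp, stemDocument)                       # stemming: "computers" -> "comput"

dtm <- DocumentTermMatrix(corp)                          # document term matrix
dtm <- removeSparseTerms(dtm, 0.99)                      # drop very sparse terms
inspect(dtm)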


Calculate Term Weight – TF-IDF


How frequently does a term appear?

Term Frequency: TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

How important is a term?

DF: Document Frequency = d (number of documents containing a given term) / D (the size of the collection of documents)

To normalize, take log(d/D); but since D ≥ d, log(d/D) gives a negative (or zero) value, so we invert the ratio inside the log expression. Essentially we are compressing the scale of values so that very large and very small quantities can be smoothly compared.

IDF: Inverse Document Frequency IDF(t) = log(Total number of documents / Number of documents with term t in it)

Example:

Consider a document containing 100 words wherein the word CAR appears 3 times

TF(CAR) = 3 / 100 = 0.03

Now, assume we have 10 million documents and the word CAR appears in one thousand of these

IDF(CAR) = log(10,000,000 / 1,000) = 4

TF-IDF weight is product of these quantities: 0.03 * 4 = 0.12
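The same arithmetic as a short R sketch (log base 10, matching the example above):

tf  <- 3 / 100                  # CAR appears 3 times in a 100-word document
idf <- log10(10000000 / 1000)   # 10 million documents, CAR appears in 1,000 of them
tf * idf                        # 0.03 * 4 = 0.12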


Similarity Distance Measure (Cosine)


Why Cosine?

Here is a detailed paper comparing the efficiency of different distance measures for text documents. The general observation is that cosine similarity works better than Euclidean distance for text data.

Cosine-001

So let's understand how to calculate cosine similarity.

Example:

Text 1: statistics skills and programming skills are equally important for analytics

Text 2: statistics skills and domain knowledge are important for analytics

Text 3: I like reading books and travelling

The Document Term Matrix for the above 3 texts would be:

DTM-001

The three vectors are:

T1 = (1,2,1,1,0,1,1,1,1,1,0,0,0,0,0,0)

T2 = (1,1,1,0,1,1,0,1,1,1,1,0,0,0,0,0)

T3 = (0,0,1,0,0,0,0,0,0,0,0,1,1,1,1,1)

Degree of Similarity (T1 & T2) = (T1 %*% T2) / (sqrt(sum(T1^2)) * sqrt(sum(T2^2))) = 77%

Degree of Similarity (T1 & T3) = (T1 %*% T3) / (sqrt(sum(T1^2)) * sqrt(sum(T3^2))) = 12%
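A quick R sketch of the same calculation, using the three term vectors above:

# cosine similarity between two document-term vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

T1 <- c(1,2,1,1,0,1,1,1,1,1,0,0,0,0,0,0)
T2 <- c(1,1,1,0,1,1,0,1,1,1,1,0,0,0,0,0)
T3 <- c(0,0,1,0,0,0,0,0,0,0,0,1,1,1,1,1)

round(cosine(T1, T2), 2)   # 0.77
round(cosine(T1, T3), 2)   # 0.12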


Overview of key Text Mining techniques


There are 3 key techniques in practice:

  1. N-gram based analytic
  2. Shallow Natural Language Processing technique
  3. Deep Natural Language Processing technique

N-gram based analytic

Definition:

  • An n-gram is a contiguous sequence of n items from a given sequence of text
  • The items can be syllables, letters, words or base pairs according to the application

Application:

  • Probabilistic language models for predicting the next item in a sequence, in the form of an (n − 1)-order Markov model
  • Widely used in probability, communication theory, computational linguistics, biological sequence analysis

Advantage:

  • Relatively simple
  • By simply increasing n, the model can be used to store more context

Disadvantage: 

Semantic value of the item is not considered

Example:

Let's look at the n-gram output for the sample sentence "defense attorney for liberty and montecito" (a small base-R sketch for generating these n-grams follows the list):

  • 1-gram: defense, attorney, for, liberty, and, montecito
  • 2-gram: defense attorney, attorney for, for liberty, liberty and, and montecito
  • 3-gram: defense attorney for, attorney for liberty, for liberty and, liberty and montecito
  • 4-gram: defense attorney for liberty, attorney for liberty and, for liberty and montecito
  • 5-gram: defense attorney for liberty and montecito, attorney for liberty and montecito
  • 6-gram: defense attorney for liberty and montecito
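A small base-R sketch for generating such word n-grams (plain whitespace tokenization assumed):

# Generate word n-grams from a sentence using simple whitespace tokenization
ngrams <- function(sentence, n) {
  words <- strsplit(tolower(sentence), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

ngrams("defense attorney for liberty and montecito", 2)
# "defense attorney" "attorney for" "for liberty" "liberty and" "and montecito"
ngrams("defense attorney for liberty and montecito", 6)
# "defense attorney for liberty and montecito"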

Shallow NLP technique

Definition:

  • Assign a syntactic label (noun, verb etc.) to a chunk
  • Knowledge extraction from text through semantic/syntactic analysis approach

Shallow-001

Application:

  • Taxonomy extraction (predefined terms and entities). Entities: People, organizations, locations, times, dates, prices, genes, proteins, diseases, medicines
  • Concept extraction (main idea or a theme)

Advantage: Less noisy than n-grams

Disadvantage: Does not specify role of items in the main sentence

Example: Consider the sentence "The driver from Bangalore crashed the bike with the black bumper". Let's examine the results of applying the n-gram and shallow NLP techniques to extract the concept from this sample sentence.

Apply the below 3 steps to the example sentence:

  • Convert to lowercase & PoS tag
  • Remove stop words
  • Retain only nouns & verbs, as these carry higher weight in the sentence

1-gram output: driver, bangalore, crashed, bike, bumper

1-gram

Bi-gram output with nouns/verbs retained: crashed bike, driver bangalore, bangalore crashed

2-gram

3-gram output with nouns/verbs retained: driver bangalore crashed, bangalore crashed bike

3-gram

Conclusion:

  • 1-gram: Reduced noise, however no clear context
  • Bi-gram & 3-gram: Increased context, however there is information loss (e.g., "bumper", which tells us more about what was crashed, is not in the output)

Deep NLP technique

Definition:

  • Extension to the shallow NLP
  • Get the syntactic relationship between each pair of words
  • Apply sentence segmentation to determine the sentence boundaries
  • The Stanford Parser (or any similar) is then applied to generate output in the form of dependency relations, which represent the syntactic relations within each sentence
  • Detected relationships are expressed as complex construction to retain the context
  • Example relationships: Located in, employed by, part of, married to

Applications: Develop features and representations appropriate for complex interpretation tasks, such as fraud detection, or prediction based on complex RNA-sequence data in the life sciences

Capture

The above sentence can be represented using triples (Subject : Predicate [Modifier] : Object) without losing the context. Modifiers are negations, multi-word expressions, and adverbial modifiers like not, maybe, however, etc. You can learn more about Stanford typed dependencies here.

Triples Output:

  • driver : from : bangalore
  • driver : crashed : bike
  • driver : crashed with : bumper

Disadvantages:

  • It works at the sentence level, which assumes well-structured text
  • Abbreviations and grammatical errors in sentences will mislead the analysis

Summary


Summary

Hope this helps and I welcome feedback/comments.


References: