Sentiment Analysis



The process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic, product, etc. is positive, negative, or neutral

Use case:

Customer’s on line comments/feedback from an insurance companies website has been scrapped to run through the sentiment analysis.

You can find full R code along with the data set in my git repository here


  • Load required R libraries

    # source("")
    # biocLite("Rgraphviz")
    # install.packages('tm')
    # install.packages('wordcloud')
    # download.file("", "Rstem_0.4-1.tar.gz")
    # install.packages("Rstem_0.4-1.tar.gz", repos=NULL, type="source")
    # download.file("", "sentiment.tar.gz")
    # install.packages("sentiment.tar.gz", repos=NULL, type="source")# Load libraries
  • Pre-process data:

    Text pre-processing is an important step to reduce noise from the data. Each step is discussed below

  • convert to lower: this is to avoid distinguish between words simply on case
  • remove punctuation: punctuation can provide grammatical context which supports understanding. Often for initial analyses we ignore the punctuation
  • remove numbers: numbers may or may not be relevant to our analyses
  • remove stop words: stop words are common words found in a language. Words like for, of, are etc are common stop word
  • create document term matrix: a document term matrix is simply a matrix with documents as the rows and terms as the columns and a count of the frequency of words as the cells of the matrix
df <- read.table("../input/data.csv",sep=",",header=TRUE)
corp <- Corpus(VectorSource(df$Review)) 
corp <- tm_map(corp, tolower) 
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
# corp <- tm_map(corp, stemDocument, language = "english") 
corp <- tm_map(corp, removeWords, c("the", stopwords("english"))) 
corp <- tm_map(corp, PlainTextDocument)
corp.tdm <- TermDocumentMatrix(corp, control = list(minWordLength = 3)) 
corp.dtm <- DocumentTermMatrix(corp, control = list(minWordLength = 3))
  • Insight through visualization

    • Word cloud: This visualization generates words whose font size relates to its frequency.

      wordcloud(corp, scale=c(5,0.5), max.words=100, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, 'Dark2'))

      Word Cloud

    • Frequency plot:This visualization presents a bar chart whose length corresponds the frequency a particular word occurred
      corp.tdm.df <- sort(rowSums(corp.tdm.df),decreasing=TRUE) # populate term frequency and sort in decesending order
      df.freq <- data.frame(word = names(corp.tdm.df),freq=corp.tdm.df) # Table with terms and frequency
      # Set minimum term frequency value. The charts will be created for terms > or = to the minimum value that we set.
      freqControl <- 100
      # Frequency Plot
      freqplotData <- subset(df.freq, df.freq$freq > freqControl)
      freqplotData$word <- ordered(freqplotData$word,levels=levels(freqplotData$word)[unclass(freqplotData$word)])
      freqplot <- ggplot(freqplotData,aes(reorder(word,freq), freq))
      freqplot <- freqplot + geom_bar(stat="identity")
      freqplot <- freqplot + theme(axis.text.x=element_text(angle=90,hjust=1)) + coord_flip() 
      freqplot + xlim(rev(levels(freqplotData$word)))+ ggtitle("Frequency Plot")


    • Correlation plot: Here, we choose N number of high frequent words as the nodes and include links between words when they have at least a correlation of x %
      # Correlation Plot
      # 50 of the more frequent words have been chosen as the nodes and include links between words
      # when they have at least a correlation of 0.2
      # By default (without providing terms and a correlation threshold) the plot function chooses a
      # random 20 terms with a threshold of 0.7
      plot(corp.tdm,terms=findFreqTerms(corp.tdm,lowfreq=freqControl)[1:50],corThreshold=0.2, main="Correlation Plot")


    • Paired word cloud: This is a customized word cloud. Here, we pick the top N most frequent words and extract associated words with strong correlation. Combine individual top N words with the every associated word (say one of my top words is broken and one of the associated words is pipe; the combined word would be broken-pipe). Then we create a word cloud on the combined words. Although the concept is good, the chart below does not appear helpful. So need to figure out a better representation

      # Paired-Terms wordcloud
      # pick the top N most frequent words and extract associated words with strong correlation (say 70%). 
      # Combine individual top N words with every associated word.
      nFreqTerms <- findFreqTerms(corp.dtm,lowfreq=freqControl)
      nFreqTermsAssocs <- findAssocs(corp.dtm, nFreqTerms, 0.3)
      pairedTerms <- c()
      for (i in 1:length(nFreqTermsAssocs)){
          lapply(names(nFreqTermsAssocs[[i]]),function(x) pairedTerms <<- c(pairedTerms,paste(names(nFreqTermsAssocs[i]),x,sep="-")))
      wordcloud(pairedTerms,random.order=FALSE,colors=dark2,main="Paired Wordcloud")

      Paired Word Cloud

  • Sentiment Score

    • Load positive / negative terms corpus

      The corpus contains around 6800 words, this list was compiled over many years starting from first paper by Hu and Liu, KDD-2004. Although necessary, having an opinion lexicon is far from sufficient for accurate sentiment analysis. See this paper: Sentiment Analysis and Subjectivity or the Sentiment Analysis

    • Calculate positive / negative score

      Simply we calculate the positive / negative score by comparing the terms with positive/negative term corpus and summing the occurrence count

    • Classify emotion

      R package sentiment by Timothy Jurka has a function that helps us to analyze some text and classify it in different types of emotion: anger, disgust, fear, joy, sadness, and surprise. The classification can be performed using two algorithms: one is a naive Bayes classifier trained on Carlo Strapparava and Alessandro Valitutti’s emotions lexicon; the other one is just a simple voter procedure.

    • Classify polarity

      Another function from sentiment package, classify_polarity allows us to classify some text as positive or negative. In this case, the classification can be done by using a naive Bayes algorithm trained on Janyce Wiebe’s subjectivity lexicon; or by a simple voter algorithm.

      hu.liu.pos = scan('../input/positive-words.txt', what = 'character',comment.char=';') 
      hu.liu.neg = scan('../input/negative-words.txt',what = 'character',comment.char= ';') 
      pos.words = c(hu.liu.pos)
      neg.words = c(hu.liu.neg)
      score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
        # we got a vector of sentences. plyr will handle a list
        # or a vector as an "l" for us
        # we want a simple array ("a") of scores back, so we use
        # "l" + "a" + "ply" = "laply":
        scores = laply(sentences, function(sentence, pos.words, neg.words) {
          # clean up sentences with R's regex-driven global substitute, gsub():
          sentence = gsub('[[:punct:]]', '', sentence)
          sentence = gsub('[[:cntrl:]]', '', sentence)
          sentence = gsub('\\d+', '', sentence)
          # and convert to lower case:
          sentence = tolower(sentence)
          # split into words. str_split is in the stringr package
          word.list = str_split(sentence, '\\s+')
          # sometimes a list() is one level of hierarchy too much
          words = unlist(word.list)
          # compare our words to the dictionaries of positive & negative terms
          pos.matches = match(words, pos.words)
          neg.matches = match(words, neg.words)
          # match() returns the position of the matched term or NA
          # we just want a TRUE/FALSE:
          pos.matches= !
          neg.matches= !
          # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
          score = sum(pos.matches) - sum(neg.matches)
        }, pos.words, neg.words, .progress=.progress )
        scores.df = data.frame(score=scores, text=sentences)
      review.scores<- score.sentiment(df$Review,pos.words,neg.words,.progress='text')
      #classify emotion
      class_emo = classify_emotion(df$Review, algorithm="bayes", prior=1.0)
      #get emotion best fit
      emotion = class_emo[,7]
      # substitute NA's by "unknown"
      emotion[] = "unknown"
      # classify polarity
      class_pol = classify_polarity(df$Review, algorithm="bayes")
      # get polarity best fit
      polarity = class_pol[,4]
      # data frame with results
      sent_df = data.frame(text=df$Review, emotion=emotion, polarity=polarity, stringsAsFactors=FALSE)
      # sort data frame
      sent_df = within(sent_df, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))
    • Visualize

      • Distribution of overall score
        ggplot(review.scores, aes(x=score)) + 
          geom_histogram(binwidth=1) + 
          xlab("Sentiment score") + 
          ylab("Frequency") + 
          ggtitle("Distribution of sentiment score") +
          theme_bw()  + 
          theme(axis.title.x = element_text(vjust = -0.5, size = 14)) + 
          theme(axis.title.y=element_text(size = 14, angle=90, vjust = -0.25)) + 
          theme(plot.margin = unit(c(1,1,2,2), "lines"))

        Sentiment Score Distribution.png

      • Distribution of score for a given term
        review.pos<- subset(review.scores,review.scores$score>= 2) 
        review.neg<- subset(review.scores,review.scores$score<= -2)
        claim <- subset(review.scores, regexpr("claim", review.scores$text) > 0) 
        ggplot(claim, aes(x = score)) + geom_histogram(binwidth = 1) + ggtitle("Sentiment score for the token 'claim'") + xlab("Score") + ylab("Frequency") + theme_bw()  + theme(axis.title.x = element_text(vjust = -0.5, size = 14)) + theme(axis.title.y = element_text(size = 14, angle = 90, vjust = -0.25)) + theme(plot.margin = unit(c(1,1,2,2), "lines"))

        Claim Score.png

      • Distribution of emotion
        # plot distribution of emotions
        ggplot(sent_df, aes(x=emotion)) +
          geom_bar(aes(y=..count.., fill=emotion)) +
          scale_fill_brewer(palette="Dark2") +
          labs(x="emotion categories", y="number of Feedback", 
               title = "Sentiment Analysis of Feedback about claim(classification by emotion)",
               plot.title = element_text(size=12))

        Emotions Distribution.png

      • Distribution of polarity
        # plot distribution of polarity
        ggplot(sent_df, aes(x=polarity)) +
          geom_bar(aes(y=..count.., fill=polarity)) +
          scale_fill_brewer(palette="RdGy") +
          labs(x="emotion categories", y="number of Feedback", 
               title = "Sentiment Analysis of Feedback about claim(classification by emotion)",
               plot.title = element_text(size=12))


      • Text by emotion
        # separating text by emotion
        emos = levels(factor(sent_df$emotion))
        nemo = length(emos) = rep("", nemo)
        for (i in 1:nemo)
          tmp = df$Review[emotion == emos[i]]
[i] = paste(tmp, collapse=" ")
        # remove stopwords = removeWords(, stopwords("english"))
        # create corpus
        corpus = Corpus(VectorSource(
        tdm = TermDocumentMatrix(corpus)
        tdm = as.matrix(tdm)
        colnames(tdm) = emos
        # comparison word cloud, colors = brewer.pal(nemo, "Dark2"),
                         scale = c(3,.5), random.order = FALSE, title.size = 1.5)

        Text by emotion.png

Next post to cover sentiment analysis in R + Hadoop.



The above write up is based on the tutorials from following links:

Triples – Deep Natural Language Processing

Problem: In Text Mining extracting keywords (n-grams) alone cannot produce meaningful data nor discover “unknown” themes and trends.

Objective: The aim here is to extract dependency relation from sentence i.e., extract sets of the form {subject, predicate[modifiers], object} out of syntactically parsed sentences, using Stanford parser and opennlp.


1) Get the syntactic relationship between each pair of words

2) Apply sentence segmentation to determine the sentence boudaries

3) The Stanford Parser is then applied to generate output in the form of dependency relations, which represent the syntactic relationships within each sentence

How this is different from n-gram?

Dependency relation allows the similarity comparison to be based on the syntactic relations between words, instead of having to match words in their exact order in n-gram based comparisons.


Sentence: “The flat tire was not changed by driver”

Stanford dependency relations: 

root(ROOT-0, changed-6)
det(tire-3, the-1)
amod(tire-3, flat-2)
nsubjpass(changed-6, tire-3)
auxpass(changed-6, was-4)
neg(changed-6, not-5)
prep(changed-6, by-7)
pobj(by-7, driver-8)

Refer Stanford typed dependencies manual for full list & more info:

Triples output in the form (Subject : Predicate [modifier] : Object) :  

driver : changed [not] : tire

Extraction Logic: You can use the below base logic to build the functionality in your favorite/comfortable language (R/Python/Java/etc). Please note that this is only the base logic and needs enhancement.



  • Sentence level is too structured
  • Usage of abbreviations and grammatical errors in sentence will mislead the analysis


Hope this article is useful!

Text Mining 101

Text Mining or Text Analytic is the discovery and communication of meaningful patterns in text data. As part of 101, I would like to cover the building blocks of TM:

  • TM process overview
  • Calculate term weight (TF-IDF)
  • Similarity distance measure (Cosine)
  • Overview of key text mining techniques

Text Mining Process Overview

Broadly there are 4 stages in the text mining process. There are great open source tools available (R, python, etc) to carry out the process mentioned here. The steps almost remain the same irrespective of the analysis platform.

– Step 1: Data Assembly
– Step 2: Data Processing
– Step 3: Data Exploration or Visualization
– Step 4: Model Building

R - Text Mining-001

Brief description about Data Processing steps

Explore Corpus – Understand the types of variables, their functions, permissible values, and so on. Some formats including html and xml contain tags and other data structures that provide more metadata.

Convert text to lowercase – This is to avoid distinguish between words simply on case.

Remove Number(if required) – Numbers may or may not be relevant to our analyses.

Remove Punctuation – Punctuation can provide grammatical context which supports understanding. Often for initial analyses we ignore the punctuation. Later we will use punctuation to support the extraction of meaning.

Remove English stop words – Stop words are common words found in a language. Words like for, of, are etc are common stop words.

Remove Own stop words(if required) – Along with English stop words, we could instead or in addition remove our own stop words. The choice of own stop word might depend on the domain of discourse, and might not become apparent until we’ve done some analysis.

Strip white space – Eliminate extra white spaces.

Stemming – Transforms to root word. Stemming uses an algorithm that removes common word endings for English words, such as “es”, “ed” and “’s”. For example i.e., 1) “computer” & “computers” become “comput”

Lemmatisation – transform to dictionary base form i.e., “produce” & “produced” become “produce”

Sparse terms – We are often not interested in infrequent terms in our documents. Such “sparse” terms should be removed from the document term matrix.

Document term matrix – A document term matrix is simply a matrix with documents as the rows and terms as the columns and a count of the frequency of words as the cells of the matrix.

Calculate Term Weight – TF-IDF

How frequently term appears?

Term Frequency: TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

How important a term is?

DF: Document Frequency = d (number of documents containing a given term) / D (the size of the collection of documents)

To normalize take log(d/D), but often D > d and log(d/D) will give negative value. So invert the ratio inside log expression. Essentially we are compressing the scale of values so that very large or very small quantities are smoothly compared

IDF: Inverse Document Frequency IDF(t) = log(Total number of documents / Number of documents with term t in it)


Consider a document containing 100 words wherein the word CAR appears 3 times

TF(CAR) = 3 / 100 = 0.03

Now, assume we have 10 million documents and the word CAR appears in one thousand of these

IDF(CAR) = log(10,000,000 / 1,000) = 4

TF-IDF weight is product of these quantities: 0.03 * 4 = 0.12

Similarity Distance Measure (Cosine)

Why Cosine?

Here is a detailed paper on comparing the efficiency of different distance measures for text documents. General observation is that the Cosine similarity works better than the Euclidean for text data.


So lets understand how to calculate Cosine similarity.


Text 1: statistics skills and programming skills are equally important for analytics

Text 2: statistics skills and domain knowledge are important for analytics

Text 3: I like reading books and travelling

Document Term Matrix for the above 3 text would be:


The three vectors are:

T1 = (1,2,1,1,0,1,1,1,1,1,0,0,0,0,0,0)

T2 = (1,1,1,0,1,1,0,1,1,1,1,0,0,0,0,0)

T3 = (0,0,1,0,0,0,0,0,0,0,0,1,1,1,1,1)

Degree of Similarity (T1 & T2) = (T1 %*% T2) / (sqrt(sum(T1^2)) * sqrt(sum(T2^2))) = 77%

Degree of Similarity (T1 & T3) = (T1 %*% T3) / (sqrt(sum(T1^2)) * sqrt(sum(T3^2))) = 12%

Overview of key Text Mining techniques

There 3 key techniques are in practice:

  1. N-gram based analytic
  2. Shallow Natural Language Processing technique
  3. Deep Natural Language Processing technique

N-gram based analytic


  • n-gramis a contiguous sequence of n items from a given sequenceof text
  • The items can be syllables, letters, words or base pairs according to the application


  • Probabilistic language model for predicting the next item in a sequence in the form of a (n − 1)
  • Widely used in probability, communication theory, computational linguistics, biological sequence analysis


  • Relatively simple
  • Simply increasing n, model can be used to store more context


Semantic value of the item is not considered


Lets look at the n-gram output for the sample sentence “defense attorney for liberty and montecito”

  • 1-gram: defense, attorney, for, liberty, and, montecito
  • 2-gram: defense attorney, for liberty, and montecito, attorney for, liberty and, attorney for
  • 3-gram: defense attorney for, liberty and montecito, attorney for liberty, for liberty and, liberty and montecito
  • 4-gram: defense attorney for liberty, attorney for liberty and, for liberty and montecito,
  • 5-gram: defense attorney for liberty and montecito, attorney for liberty and montecito
  • 6-gram: defense attorney for liberty and montecito

Shallow NLP technique


  • Assign a syntactic label (noun, verb etc.) to a chunk
  • Knowledge extraction from text through semantic/syntactic analysis approach



  • Taxonomy extraction (predefined terms and entities). Entities: People, organizations, locations, times, dates, prices, genes, proteins, diseases, medicines
  • Concept extraction (main idea or a theme)

Advantage: Less noisy than n-grams

Disadvantage: Does not specify role of items in the main sentence

Example: Consider the sentence “The driver from Bangalore crashed the bike with the black bumper”. Lets examine the results of applying the n-gram and shallow NLP technique to extract the concept to the sample sentence.

Apply below 3 steps to example sentence

  • Convert to lowercase & PoS tag
  • Remove stop words
  • Retain only Noun’s & Verb’s as these hold higher weight in the sentence

1-gram output: driver, bangalore, crashed, bike, bumper


Bi-gram output with noun/verb’s retained: crashed bike, driver bangalore, bangalore crashed


3-gram output with noun/verb’s retained: driver bangalore crashed, bangalore crashed bike



  • 1-gram: Reduced noise, however no clear context
  • Bi-gram & 3-gram:  Increased context, however there is a information loss (like bumper tells us what is crashed which is not in output)

Deep NLP technique


  • Extension to the shallow NLP
  • Get the syntactic relationship between each pair of words
  • Apply sentence segmentation to determine the sentence boundaries
  • The Stanford Parser (or any similar) is then applied to generate output in the form of dependency relations, which represent the syntactic relations within each sentence
  • Detected relationships are expressed as complex construction to retain the context
  • Example relationships: Located in, employed by, part of, married to

ApplicationsDevelop features and representations appropriate for complex interpretation tasks like fraud detection and prediction activities based on complex RNA-Sequence in life science


The above sentence can be represented using triples (Subject: Predicate [Modifier]: Object) without loosing the context. Modifiers are negations, multi-word expression, adverbial modifier like not, maybe, however etc. You can learn more about stanford typed dependency here.

Triples Output:

  • driver : from : bangalore
  • driver : crashed : bike
  • driver : crashed with : bumper


  • Sentence level is too structured
  • Usage of abbreviations and grammatical errors in sentence will mislead the analysis



Hope this helps and I welcome feedback/comments.