Accelerate Analysis, Insight & Data Mining (for non-coders)

R and Python are the most popular, powerful, and (reasonably, for non-techies) well-designed open source command-line based languages for analysis, data science, and machine learning enthusiasts. Getting started and reaching proficiency in these (or any) programming languages is a tedious and time-consuming process. However, accelerated delivery of insight, analysis, modeling, and proofs of concept is a key characteristic of a successful analytics team, validating the strategies that enable us to drive business decisions in the right direction. The aim of this article is to provide useful resources around tools that will help analysts accelerate the delivery of insight, analysis and modelling.

Let's first understand how the analysis, insight, and data mining process is performed.

How do we perform analysis/insight?

The objective for an analyst is to convert Data/Information into Insight and recommend a possible Action that the business can take. Insight is the mechanism to do this successfully, and throughout the work it is key that we always keep the Business Context in mind.

Insight_Model

How do we do data mining?

Three data mining process frameworks have been the most popular and are widely practiced by data mining experts and researchers to build machine learning systems. You'll notice that the core phases are covered by all three frameworks, with little difference between them.

1. Knowledge Discovery Databases (KDD) process model

KDD

The concept of KDD, essentially the integration of multiple data mining technologies, was presented by Fayyad in 1996; learn more at https://www.aaai.org/ojs/index.php/aimagazine/article/viewFile/1230/1131

2. CRoss Industry Standard Process for Data Mining (CRISP-DM)

CRISP_DM

CRISP-DM provides an unbiased methodology that is not domain dependent and consolidates data mining best practices. CRISP-DM was voted the leading methodology for data mining in polls in 2002, 2004 and 2007. You can learn more about CRISP-DM at https://www.the-modeling-agency.com/crisp-dm.pdf

3. Sample, Explore, Modify, Model and Assess (SEMMA)

SEMMA

SEMMA is a sequence of five steps for building machine learning models, incorporated in SAS Enterprise Miner by SAS Inc. You can learn more about SEMMA at http://www2.sas.com/proceedings/sugi22/DATAWARE/PAPER128.PDF

Tools that you know & are able to use = your skill level!

A Command-Line Interface (CLI) is faster and more powerful than user-interface-based tools in terms of flexibility and features; however, a Graphical User Interface (GUI) is easy to use, interactive, and offers out-of-the-box visualization. The aim here is not GUI vs. CLI, but to be equipped with tools to understand and apply basic concepts, gain independence in data analysis, and communicate results effectively. At the end of the day, the quick-delivery skills of analysts boil down to the tools they know and are able to use, so the goal here is to introduce you to the MUST-KNOW GUI-based open source tools that do not require much coding to get started, to help accelerate the delivery of analytics (particularly during the initial stage of your data science adoption).

Screenshot from 2019-11-01 20-05-10

Tool power to the analyst

Here is a summary of the general advantages and disadvantages of GUI-based analytical tools.

Pros:

  • GUIs are point-and-click, so lots of fun
  • Great functionality, many data mining packages, stunning out-of-the-box visualizations
  • Cross-platform support (Windows, Mac, Linux)

Cons:

  • Store all of the data in RAM, so can crash or run slowly with high volumes of data

These tools to accelerate analysis, insight, and mining can be divided into two categories.

  1. R/Python aiders: These are built to utilize the native capabilities of R/Python. Note that I have listed the most popular GUIs for R/Python below (the list does not necessarily cover all available tools)
    1. Rattle (R)
    2. JGR/Deducer (R)
    3. RCommander (R)
    4. Orange (Python)
  2. Independent platforms
    1. H2O
    2. KNIME
    3. WEKA

1.1 R: Rattle

How to install: run the below command in your RStudio

install.packages("rattle", repos="http://rattle.togaware.com", type="source")

How to launch:

library(rattle);rattle()

Full Tutorial: Click here!

Sample Screenshot(s):

rattle.png

1.2 JGR / Deducer

How to Install: run the below command in your RStudio

install.packages(c("JGR","Deducer","DeducerExtras"), dependencies=T)

How to launch: 

library(JGR); JGR()

Tutorials: Link -1 here!, Link-2 here!

Sample Screenshots:

JGR.png

1.3 RCommander

How to install: run the below command in your RStudio

install.packages("Rcmdr", dependencies=T)

How to launch:

library(Rcmdr)

Tutorials: Link-1 here!, Link-2 here!

Sample Screenshots:

RCommander.png

1.4 Python Orange

Orange is an open source tool from the AI Laboratory in Ljubljana, Slovenia.

How to install: download the appropriate executable installation file for your OS from their official website here!

How to launch: the installation creates a desktop shortcut and adds the application to the menu; it can be launched from either.

Tutorial: Click here!

Sample Screenshot:

Orange.png

2.1 H2O

H2O is open-source software for big-data analysis, produced by the company H2O.ai, which launched in 2011 in Silicon Valley. H2O allows users to fit thousands of potential models as part of discovering patterns in data. It is claimed to be one of the world's leading open source deep learning platforms, used by over 100,000 data scientists and more than 10,000 organizations around the world. Their design goal is "To Bring Beautiful Business Transformation Through AI and Visual Intelligence" through three principles: 1) Make it Open, 2) Make it Fast, Really Fast, 3) Make it Beautiful.

How to Install: Click here to learn more.

Tutorial: Click here!

Sample Screenshot:

H2o.png

2.2 KNIME

KNIME, the Konstanz Information Miner, is an open source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipeline concept. KNIME Analytics Platform is the leading open solution for data-driven innovation, helping you discover the potential hidden in your data, mine for fresh insights, or predict new futures. Its enterprise-grade, open source platform is fast to deploy, easy to scale and intuitive to learn. With more than 1000 modules, hundreds of ready-to-run examples, a comprehensive range of integrated tools, and the widest choice of advanced algorithms available, KNIME Analytics Platform is the perfect toolbox for any data scientist. Their steady course on an unrestricted open source is your passport to a global community of data scientists, their expertise, and their active contributions. Read more here!

How to Install: You can download the appropriate executable installation file from their official website for your OS here!

Tutorial: Link-1 here!, Link-2 here!

Sample Screenshots:

KNIME.png

2.3 WEKA

Waikato Environment for Knowledge Analysis (WEKA) is a suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. It is free software licensed under the GNU General Public License (GPL). Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

How to Install: Click here!

Tutorials: Link-1 here!, Link-2 here!, Link-3 (video) here!

Sample Screenshot:

WEKA.png

Hope this article was useful!

Triplets for concept extraction from English sentence (Deep NLP)

I recently published a white paper with the above-mentioned title at the 'Fourth International Conference on Business Analytics and Intelligence', held 19–21 December 2016 at the Indian Institute of Science, Bangalore. Here I present the key contents from the paper.

 

ABSTRACT

In text mining, extracting n-gram keywords alone cannot produce meaningful information nor uncover "unknown" themes and trends. Triples are a way to represent information from a text sentence in fewer words without losing the context. The application of triples leads to higher accuracy for complex interpretation tasks such as fraud detection and prediction activities based on complex RNA-sequence data in life science. There are different techniques for getting this information before representing it as triples, and the techniques depend on the kind of data being read as input. In this paper we briefly evaluate different methods in practice for performing triples extraction from English sentences. An advanced NLP technique is presented and discussed in detail to extract dependency relations from a sentence, i.e., to extract sets of the form {subject, predicate [modifiers], object} out of syntactically parsed sentences, using the Stanford parser. The technique is an extension of shallow NLP. First we get the syntactic relationship between each pair of words and apply sentence segmentation to determine the sentence boundaries. The Stanford Parser is then applied to generate output in the form of dependency relations, which represent the syntactic relations within each sentence. Detected relationships are expressed as complex constructions to retain the context.

KEYWORDS

Text Analytics, Text Mining, Concept Extraction, Triples, Triplets

INTRODUCTION

Resource Description Framework (RDF) is a well-known data model for information extraction and was adopted as a World Wide Web Consortium recommendation in 1999 as a general method for conceptual description or modeling of information implemented in web resources. RDF relates entities in the subject-predicate-object format, where the subject and object are related to one another by the predicate. Later it was also used in knowledge management applications involving structured text content. According to the approach presented in [1], a triplet in a text sentence is defined as a relation between subject and object, the relation being the predicate. The aim here is to extract sets of the form {subject, predicate, object} out of syntactically parsed sentences. The triple is a minimal representation of information that does not lose the context. In the current research we look to enhance the objective of extracting aspects by using additional descriptors, such as modifiers, alongside the predicate. A descriptor is a word, especially an adjective or any other modifier used attributively, which restricts or adds to the sense of a head noun. Descriptors express opinions and sentiments about an aspect, which can be further used in generating summaries for the aspects. For example, "The flat tire was replaced by the driver" can be represented as driver:replaced:tire, which is subject:predicate[modifier]:object.

EXISTING APPROACHES:

A decent amount of research and implementation has been carried out in the past in the area of extracting triplets/triples from text sentences for concept extraction.

Two major techniques have been used:

1.     Machine Learning Technique:

A machine learning approach has been used [1] to extract subject-predicate-object triplets from English sentences. An SVM is used to train a model on human-annotated triplets, with features computed from three parsers. The sentence is tokenized and the stop words and punctuation are removed, giving a list of the important tokens in the sentence. The next step is to get all possible ordered combinations of three tokens from the list; the resulting combinations are the triplet candidates. From here the problem is seen as a binary classification problem where the triplet candidates must be classified as positive or negative. The SVM model assigns a positive score to those candidates which should be extracted as triplets, and a negative score to the others. Using the words with the highest positive scores, the resulting triplet is formed. As opposed to the subject and the verb, the objects differ among the positively classified triplet candidates. In such cases an attempt is made to merge the different triplet elements (in this case objects): if two or more words are consecutive in the list of important tokens, they are merged. Where merges have been done in the object, the tokens are connected by the stop words from the original sentence. With this merging method it will not always be possible to merge all tokens into a single set; in that case several triplets, one for each of the sets, are obtained. Note that in practice the classification described above produces many false positives, so it does not work to take them all for the resulting triplets; instead only the top few from the descending-ordered list of triplet candidates are taken.
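The candidate-generation step described above (all ordered combinations of three important tokens) can be sketched in a few lines of Python; the function name, stop-word list and naive tokenization here are illustrative, not from the paper:

```python
from itertools import permutations

def triplet_candidates(sentence, stop_words):
    # Tokenize naively, strip punctuation, and drop stop words,
    # keeping the "important" tokens in their original order.
    tokens = [t.strip(".,!?").lower() for t in sentence.split()]
    important = [t for t in tokens if t and t not in stop_words]
    # Every ordered combination of three distinct tokens is a candidate;
    # the SVM would then classify each candidate as a valid/invalid triplet.
    return list(permutations(important, 3))

stops = {"the", "was", "by", "a"}
cands = triplet_candidates("The flat tire was replaced by the driver.", stops)
# 4 important tokens -> 4*3*2 = 24 ordered candidates,
# among them the correct ("driver", "replaced", "tire")
```

In the real approach the classifier, not this enumeration, decides which of the 24 candidates survive.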

2.     Tree Bank Parser:

A treebank is a text corpus where each sentence belonging to the corpus has a syntactic structure added to it. Detailed extraction logic using different parser techniques has been presented in [2].

–       Stanford Parser:

It is a natural language parser developed by Dan Klein and Christopher D. Manning from The Stanford Natural Language Processing Group [1, 2]. The package contains a Java implementation of probabilistic natural language parsers; a graphical user interface is also available for parse tree visualization. The software is available at [4]. The Stanford Parser generates a Treebank parse tree for the input sentence. Figure 1 depicts the parse tree for the sentence "the flat tire was replaced by the driver". A sentence (S) is represented by the parser as a tree having three children: a noun phrase (NP), a verbal phrase (VP) and the full stop (.). The root of the tree is S. The triplet extracted out of this sentence is driver – replaced – tire.

figure-1

–       Link Parser:

This application uses the link grammar, generating a linkage after parsing a sentence. It can be downloaded from the website [5]. Detailed explanations of what the different link labels mean are available at [6].

figure-2

–       Minipar Parser:

It is a parser developed by Dekang Lin. Minipar takes one sentence at a time as input, generates tokens of type 'DepTreeNode', and then assigns relations between these tokens. Each DepTreeNode has a feature called 'word': the actual text of the word.

Figure 3.png

–       The Multi-Liaison Algorithm:

According to the approach presented in [3], an English sentence can have multiple subjects and objects, and the Multi-Liaison Algorithm was presented for extracting multiple connections or links between subject and object from an input sentence with one or more subjects, predicates and objects. The parse tree visualization and the dependencies generated by the Stanford Parser are used to extract this information from the given sentence. Using the dependencies, an output is generated which displays which subject is related to which object and the connecting predicate. Finding the subjects and objects helps in determining the entities involved, and the predicates determine the relationship that exists between subject and object. An algorithm was developed to do so and is elucidated in detail step-wise. It was named 'The Multi-Liaison Algorithm' since it displays the liaison between the subjects and objects; the word 'liaison' has been used as it denotes the relationship and association between the subjects and predicates.

Example input sentence: “The old beggar ran after the rich man who was wearing a black coat”

Multi-liaison algorithm output: 1) beggar – ran – man 2) man – wearing – coat

PROPOSED APPROACH:

Introducing a modifier alongside the predicate increases the clarity of obscure facts in sentences. The goal here is to extract sets of the form {subject; predicate [modifier]; object} out of syntactically parsed sentences. Modifiers are words or phrases that give additional detail about the subject discussed in a sentence. Since these words enhance the reception of a sentence, they tend to be describing words such as adjectives and adverbs. In addition, modifying phrases tend to describe adjectives and adverbs, such as adjective clauses and adverbial phrases. They equip the writer with the capability to provide the reader with the most accurate illustration words can allow. For example, a writer can write a simple sentence that states the facts and nothing more, such as "Joseph caught a fish." If the writer chooses to utilize modifiers, the sentence could read as follows: "Joseph was a nice tall boy from India, who caught a fish which was smaller than a Mars bar".

The additional details in the sentence, by way of modifiers, engage the reader and hold their attention.

As per the StanfordCoreNLP dependency manual [7], there are 22 types of modifiers in English sentences. In this paper a set of 3 key or essential modifiers has been identified that helps us get more context out of a sentence.

1.     mwe – Multi-Word Expression Modifier:

Example:

table-1

2.     advmod – Adverbial Modifier:

An adverbial modifier of a word is a (non-clausal) adverb or adverb-headed phrase that serves to modify the meaning of the word.

Example:

table-2

3.     neg – Negation Modifier:

The negation modifier is the relation between a negation word and the word it modifies.

Example:

table-3

TRIPLES EXTRACTION – HIGH LEVEL STEPS: 

  1. Get the syntactic relationship between each pair of words
  2. Apply sentence segmentation to determine the sentence boundaries
  3. The Stanford Parser is then applied to generate output in the form of dependency relations, which represent the syntactic relationships within each sentence
Example Sentence: "The flat tire was not replaced by driver"
Stanford dependency relations:
root(ROOT-0, replaced-6)
det(tire-3, the-1)
amod(tire-3, flat-2)
nsubjpass(replaced-6, tire-3)
auxpass(replaced-6, was-4)
neg(replaced-6, not-5)
prep(replaced-6, by-7)
pobj(by-7, driver-8)

Triples output in the form (Subject : Predicate [modifier] : Object) :

driver : replaced[not] : tire
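The dependency strings above follow a regular relation(governor-index, dependent-index) format and can be parsed mechanically. A minimal Python sketch (the function name and regex are mine, for illustration):

```python
import re

def parse_dependency(line):
    # Parse a Stanford-style dependency string such as
    # "nsubjpass(replaced-6, tire-3)" into (relation, governor, dependent),
    # dropping the word indices.
    m = re.match(r"(\w+)\((\S+)-\d+,\s*(\S+)-\d+\)", line)
    relation, governor, dependent = m.groups()
    return relation, governor, dependent

deps = [parse_dependency(s) for s in [
    "root(ROOT-0, replaced-6)",
    "nsubjpass(replaced-6, tire-3)",
    "neg(replaced-6, not-5)",
    "pobj(by-7, driver-8)",
]]
# deps[1] is ("nsubjpass", "replaced", "tire")
```

With the relations in this tuple form, the triple driver : replaced[not] : tire falls out of the extraction logic described in the next section.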

EXTRACTION LOGIC: 

You can use the below base logic to build the functionality in your favorite/comfortable language (R/Python/Java/etc). Please note that this is only the base logic and needs enhancement.

Step 1: Annotate Using StanfordCoreNLP_Pipeline

In this stage we will divide a string into tokens based on the given delimiters. Token is one piece of information, a “word”. The String is tokenized via a tokenizer (using a TokenizerAnnotator) and then Penn treebank annotation is used to add things like lemmas, POS tags, and named entities. These are returned as a list of CoreLabels.
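CoreNLP's own tokenizer is a Java Penn Treebank tokenizer; as a minimal stand-in (not the CoreNLP API), tokenization can be illustrated with a regex that splits words from punctuation:

```python
import re

def tokenize(text):
    # Split into word tokens and single punctuation tokens, roughly in
    # the spirit of Penn Treebank tokenization (a simplification).
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("The flat tire was not replaced by driver.")
# ['The', 'flat', 'tire', 'was', 'not', 'replaced', 'by', 'driver', '.']
```

The real pipeline then attaches lemmas, POS tags and named-entity labels to each of these tokens.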

Step 2: Read Into NLP Tree Object

Transforms trees by turning the labels into their basic categories according to the TreebankLanguagePack

Step 3: Extract The Basic Dependencies

Stanford dependencies provides a representation of grammatical relations between words in a sentence.

Step 4: Extract Subject Predicate Object FromBasic Dependencies Table

EXTRACT-SUBJECT-PREDICATE-OBJECT(Basic_Dependencies)
    // a single sentence can have multiple subjects, predicates and objects, so declare lists
    subject = []
    predicate = []
    object = []
    if nsubj exists then // nominal sentence
        if dobj exists then
            append nsubj to subject
            append prep to predicate
            append dobj to object
        else if pobj exists then
            append nsubj to subject
            append prep to predicate
            append pobj to object
        else
            append nsubj to subject
            append prep to predicate
            append xcomp to object
    else if nsubjpass exists then // passive sentence
        append agent to subject
        append root to predicate
        append nsubjpass to object

Step 5: Extract Modifier And Named Entity

If the subject is named entity then we can anonymize to help us compare concept between two sentence. For example, in the below two sentence the concept is same however the subject differs so anonymizing the subject will tell us the concept is same.

Sentence 1: ‘The flat tire was replaced by John’ and the triples would be John:replaced:tire

Sentence 2: ‘The flat tire was replaced by Joe’ and the triples would be Joe:replaced:tire

Anonymizing the named entity would look as shown below which makes the comparison easy.

Post anonymizing, sentence 1 output: {unspecified}: replaced: tire

Post anonymizing, sentence 2 output: {unspecified}: replaced: tire

EXTRACT-MODIFIERS-ENTITY(Basic_Dependencies)
    if subject = named_entity then
        subject = {unspecified} // anonymize

    modifier = advmod + mwe + neg
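The anonymization and modifier-collection step can be sketched as follows; the function name and formatting are mine, and a real system would use an NER tagger rather than a fixed set of names:

```python
def triple_with_modifiers(subject, predicate, obj, deps, named_entities):
    # Anonymize a named-entity subject so triples from different
    # sentences can be compared at the concept level.
    if subject in named_entities:
        subject = "{unspecified}"
    # Collect the three "essential" modifiers: adverbial, multi-word
    # expression, and negation.
    modifiers = [deps[r][1] for r in ("advmod", "mwe", "neg") if r in deps]
    mod = "[" + ",".join(modifiers) + "]" if modifiers else ""
    return f"{subject} : {predicate}{mod} : {obj}"

out = triple_with_modifiers("John", "replaced", "tire",
                            {"neg": ("replaced", "not")}, {"John", "Joe"})
# "{unspecified} : replaced[not] : tire"
```

Running the same call with "Joe" as subject produces the identical output, which is exactly what makes the concept comparison easy.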

Step 6: Represent Subject : Predicate[Modifier] : Object

TRIPLES EXTRACTION LOGIC FLOW CHART:

A nominal sentence is a linguistic term that refers to a nonverbal sentence (i.e. a sentence without a finite verb). As a nominal sentence does not have a verbal predicate, it may contain a nominal predicate, an adjectival predicate, an adverbial predicate or even a prepositional predicate.

Active and passive sentences: a sentence is written in active voice when the subject of the sentence performs the action, e.g. "The girl was washing the dog." A sentence is written in passive voice when the subject of the sentence has an action done to it by someone or something else.

Figure 4.png

Chart Reference:
nsubj : nominal subject
nsubjpass: passive nominal subject
dobj: direct object
root: root
xcomp: open clausal complement
prep: prepositional modifier
pobj: object of a preposition
neg: negation modifier
advmod: adverbial modifier
mwe: multi-word expression

Note: More details on all possible dependencies can be found in the Stanford dependency manual [7]

CONCLUSION:

Inclusion of a modifier alongside the predicate helps bring more meaningful context into the triples structure. In general, the deep NLP technique for sentence-level analysis is highly structured, and the use of abbreviations or grammatical errors in a sentence will mislead the analysis. However, for a proper English sentence, extraction of triples gives us the key elements of the sentence. In addition, inclusion of negation, multi-word expression and adverbial modifiers alongside the predicate helps bring more context to the triples. We have to be cautious in choosing which additional modifiers (if any) to show alongside the predicate, depending on the business context. Further research is to be done to better understand which additional modifiers qualify to be shown alongside the predicate to store more context from different types of sentences.

REFERENCE:

  1. Lorand Dali, Blaž Fortuna, Artificial Intelligence Laboratory, Jožef Stefan Institute, “Triples extraction from sentences using SVM” in Slovenian KDD Conference on Data Mining and Data Warehouses (SiKDD), Ljubljana 2008 http://ailab.ijs.si/dunja/SiKDD2008/Papers/Dali_Final.pdf
  2. Delia Rusu, Lorand Dali, Blaz Fortuna, Marko Grobelnik, Dunja Mladenic, “Triplet extraction from sentences”, Artificial Intelligence Laboratory, Jožef Stefan Institute, Slovenia, Nov. 7, 2008. http://ailab.ijs.si/dunja/SiKDD2007/Papers/Rusu_Trippels.pdf
  3. Ms. Anjali Ganesh Jivani, Ms. Amisha Hetal Shingala, Dr. Paresh V. Virparia, “The Multi-Liaison Algorithm”, International Journal of Advanced Computer Science and Applications, Vol. 2, No. 5, 2011. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.625.507&rep=rep1&type=pdf
  4. Stanford Parser web page: http://nlp.stanford.edu/software/lex-parser.shtml
  5. Link Parser web page: http://www.link.cs.cmu.edu/link/
  6. Link labels web page: http://www.link.cs.cmu.edu/link/dict/summarizelinks.html
  7. Stanford dependency manual http://nlp.stanford.edu/software/dependencies_manual.pdf

If you are new to text mining, you can learn the basic concepts involved in text mining here! I welcome your feedback.

 

Rhadoop – Sentiment Analysis

In continuation of my previous post on sentiment analysis, let's explore performing the same analysis using RHadoop.

What will be covered?

  • Environment & Pre-requisites
  • Rhadoop in action
    • Setting Rhadoop environment variables
    • Setting working folder paths
    • Loading data
    • Scoring function
    • Writing Mapper
    • Writing Reducer
    • Run your Map-Reduce program
    • Read data output from hadoop to R data frame

Environment:

I performed this analysis on the setup given below:

Pre-requisites:

Ensure that all Hadoop processes are running. You can do this by running the following Hadoop commands in your terminal:

start-dfs.sh and start-yarn.sh

Then run the command jps in your terminal; the result should look similar to the screenshot below:

11

RHadoop In Action:

Set up the environment variables; note that the paths may change based on your version of Ubuntu and Hadoop (I'm using Hadoop 2.4.0):

Sys.setenv("HADOOP_CMD"="/usr/local/hadoop/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar")

Setting folder paths:

To ease testing during the development stage, the flexibility of switching between the local and Hadoop backends is useful. In the code below, setting LOCAL = T uses files from a local folder; otherwise set LOCAL = F to use Hadoop.

# Root folder path
setwd('/home/manohar/example/Sentiment_Analysis')

# Set "LOCAL" variable to T to execute using rmr's local backend.
# Otherwise, use Hadoop (which needs to be running, correctly configured, etc.)
LOCAL=F

if (LOCAL)
{
  rmr.options(backend = 'local')
  
  # we have smaller extracts of the data in this project's 'local' subdirectory
  hdfs.data.root = '/home/manohar/example/Sentiment_Analysis/'
  hdfs.data = file.path(hdfs.data.root, 'data', 'data.csv')
  
  hdfs.out.root = hdfs.data.root
  
} else {
  rmr.options(backend = 'hadoop')
  
  # assumes 'Sentiment_Analysis/data' input path exists on HDFS under /home/manohar/example
  
  hdfs.data.root = '/home/manohar/example/Sentiment_Analysis/'
  hdfs.data = file.path(hdfs.data.root, 'data')
  
  # writes output to 'Sentiment_Analysis' directory in user's HDFS home (e.g., /home/manohar/example/Sentiment_Analysis/)
  hdfs.out.root = 'Sentiment_Analysis'
}

hdfs.out = file.path(hdfs.out.root, 'out')

 

Loading Data:

The code below copies the file from the local filesystem to Hadoop; if the file already exists it will return TRUE.

# equivalent to hadoop dfs -copyFromLocal
hdfs.put(hdfs.data,  hdfs.data)

 

Our data is in a CSV file, so we set the input format for better code readability, especially in the mapper stage.

# asa.csv.input.format() - read CSV data files and label field names
# for better code readability (especially in the mapper)
#
asa.csv.input.format = make.input.format(format='csv', mode='text', streaming.format = NULL, sep=',',
                                         col.names = c('ID', 'Name', 'Gender', 'Age','OverAllRating',                                             
                                                       'ReviewType', 'ReviewTitle', 'Benefits', 'Money', 'Experience', 
                                                       'Purchase', 'claimsProcess', 'SpeedResolution', 'Fairness',            
                                                       'ReviewDate', 'Review', 'Recommend', 'ColCount'),
                                         stringsAsFactors=F)

 

Load the opinion lexicons; the files and the paper on opinion lexicons can be found here

pos_words <- scan('/home/manohar/example/Sentiment_Analysis/data/positive-words.txt', what='character', comment.char=';')
neg_words <- scan('/home/manohar/example/Sentiment_Analysis/data/negative-words.txt', what='character', comment.char=';')

 

Scoring Function:

Below is the main function that calculates the sentiment score, written by Jeffrey Breen (source here!)

score.sentiment = function(sentence, pos.words, neg.words)
{
  require(plyr)
  require(stringr)
  
  score = laply(sentence, function(sentence, pos.words, neg.words) {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)
    
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    
    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    
    return(score)
  }, pos.words, neg.words)
  
  score.df = data.frame(score)
  return(score.df)
}
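For readers outside R, the same bag-of-words idea can be sketched in Python (a translation of the approach, not Jeffrey Breen's original code):

```python
import re

def score_sentiment(sentence, pos_words, neg_words):
    # Strip punctuation and digits, lower-case, then count
    # positive-word matches minus negative-word matches.
    cleaned = re.sub(r"[^\w\s]|\d", "", sentence).lower()
    words = cleaned.split()
    return sum(w in pos_words for w in words) - sum(w in neg_words for w in words)

score = score_sentiment("Great service, but slow claims",
                        {"great", "good"}, {"slow", "bad"})
# one positive match, one negative match -> 0
```

In practice pos_words/neg_words would be the opinion lexicons loaded earlier rather than these toy sets.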

 

Mapper:

This is the first stage in the map-reduce process. In the classic word-count example, the mapper splits each line into separate words (tokenizing the string) and, for each word seen, outputs the word together with a count of 1 to indicate it has seen the word once. The map phase runs in parallel because Hadoop uses a divide-and-conquer approach to execute your code as fast as possible; all the computation, processing and distribution of data takes place here. In our case we are using a single node, so the code and logic are fairly simple.

The mapper gets keys and values from the input formatter. In our case, the key is NULL and the value is a data.frame from read.table()

mapper = function(key, val.df) {  
  # Remove header lines
  val.df = subset(val.df, Review != 'Review')
  output.key = data.frame(Review = as.character(val.df$Review),stringsAsFactors=F)
  output.val = data.frame(val.df$Review)
  return( keyval(output.key, output.val) )
}

 

Reducer:

In the word-count example, the reduce phase sums up the number of times each word was seen and writes that count together with the word as output; here, the reducer computes a sentiment score for each review.

Two sub-phases run internally before our code gives its final result: shuffle and sort. Shuffle collects records with the same key into a single group, and sort orders the data by key.

reducer = function(key, val.df) {  
  output.key = key
  output.val = data.frame(score.sentiment(val.df, pos_words, neg_words))
  return( keyval(output.key, output.val) )  
}
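The map → shuffle/sort → reduce flow described above can be illustrated in miniature with plain Python (word count as the canonical example; this is a conceptual sketch, not RHadoop):

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    # Map: emit (key, value) pairs for every input record.
    pairs = [kv for rec in records for kv in mapper(rec)]
    # Shuffle & sort: group values by key, visiting keys in sorted order.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    # Reduce: combine each key's values into a single result.
    return {k: reducer(k, groups[k]) for k in sorted(groups)}

# Word count: map each word to 1, reduce by summing the counts.
counts = run_mapreduce(["good good service", "slow service"],
                       mapper=lambda line: [(w, 1) for w in line.split()],
                       reducer=lambda k, vs: sum(vs))
# {'good': 2, 'service': 2, 'slow': 1}
```

Hadoop does the same thing, except the map and reduce calls are distributed across nodes and the shuffle happens over the network.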

 

Running your Map-Reduce:

Executing the map-reduce program:

mr.sa = function (input, output) {
  mapreduce(input = input,
            output = output,
            input.format = asa.csv.input.format,
            map = mapper,
            reduce = reducer,
            verbose=T)
}
out = mr.sa(hdfs.data, hdfs.out)

------- output on screen ------
> out = mr.sa(hdfs.data, hdfs.out)
16/09/09 10:11:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
packageJobJar: [/usr/local/hadoop/data/hadoop-unjar2099064477903127749/] [] /tmp/streamjob6583314935744487158.jar tmpDir=null
16/09/09 10:11:33 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8050
16/09/09 10:11:33 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8050
16/09/09 10:11:36 INFO mapred.FileInputFormat: Total input paths to process : 3
16/09/09 10:11:36 INFO mapreduce.JobSubmitter: number of splits:4
16/09/09 10:11:38 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1473394634383_0002
16/09/09 10:11:39 INFO impl.YarnClientImpl: Submitted application application_1473394634383_0002
16/09/09 10:11:39 INFO mapreduce.Job: The url to track the job: http://manohar-dt:8088/proxy/application_1473394634383_0002/
16/09/09 10:11:39 INFO mapreduce.Job: Running job: job_1473394634383_0002
16/09/09 10:11:58 INFO mapreduce.Job: Job job_1473394634383_0002 running in uber mode : false
16/09/09 10:11:58 INFO mapreduce.Job:  map 0% reduce 0%
16/09/09 10:12:27 INFO mapreduce.Job:  map 48% reduce 0%
16/09/09 10:12:37 INFO mapreduce.Job:  map 100% reduce 0%
16/09/09 10:13:22 INFO mapreduce.Job:  map 100% reduce 100%
16/09/09 10:13:35 INFO mapreduce.Job: Job job_1473394634383_0002 completed successfully
16/09/09 10:13:36 INFO mapreduce.Job: Counters: 50
    File System Counters
        FILE: Number of bytes read=1045750
        FILE: Number of bytes written=2580683
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=679293
        HDFS: Number of bytes written=578577
        HDFS: Number of read operations=15
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters 
        Launched map tasks=4
        Launched reduce tasks=1
        Data-local map tasks=4
        Total time spent by all maps in occupied slots (ms)=148275
        Total time spent by all reduces in occupied slots (ms)=53759
        Total time spent by all map tasks (ms)=148275
        Total time spent by all reduce tasks (ms)=53759
        Total vcore-seconds taken by all map tasks=148275
        Total vcore-seconds taken by all reduce tasks=53759
        Total megabyte-seconds taken by all map tasks=151833600
        Total megabyte-seconds taken by all reduce tasks=55049216
    Map-Reduce Framework
        Map input records=9198
        Map output records=1818
        Map output bytes=1037580
        Map output materialized bytes=1045768
        Input split bytes=528
        Combine input records=0
        Combine output records=0
        Reduce input groups=1616
        Reduce shuffle bytes=1045768
        Reduce input records=1818
        Reduce output records=1720
        Spilled Records=3636
        Shuffled Maps =4
        Failed Shuffles=0
        Merged Map outputs=4
        GC time elapsed (ms)=1606
        CPU time spent (ms)=22310
        Physical memory (bytes) snapshot=1142579200
        Virtual memory (bytes) snapshot=5270970368
        Total committed heap usage (bytes)=947912704
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=678765
    File Output Format Counters 
        Bytes Written=578577
    rmr
        reduce calls=1616
16/09/09 10:13:36 INFO streaming.StreamJob: Output directory: /Sentiment_Analysis/out

 

Load output from hadoop to R data frame:

Read the output from the HDFS output folder into an R variable and convert it to a data frame for further processing.

results = from.dfs(out)

# put the result in a dataframe
df = sapply(results,c)
df = data.frame(df) # convert to dataframe
colnames(df) <- c('Review', 'score') # assign column names

print(head(df))

------- Result -----
                                             Review score
1                              Very good experience     1
2                         It was a classic scenario     1
3              I have chosen  for all my insurances     0
4           As long as customers understand the t&c     0
5    time will tell if  live up to our expectations     0
6 Good price  good customer service happy to help..     3

 

Now we have the sentiment score for each text. This opens up opportunities for further analysis, such as classifying emotion and polarity, and a whole range of visualizations for insight. Please see my previous post here to learn more about this.
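For instance, a quick polarity label can be derived from the numeric score; a minimal sketch, assuming the data frame df above with its score column:

```r
# Hypothetical follow-up: bucket the numeric score into polarity labels
df$score <- as.numeric(as.character(df$score))  # ensure the score is numeric
df$polarity <- ifelse(df$score > 0, "positive",
                      ifelse(df$score < 0, "negative", "neutral"))
table(df$polarity)  # distribution of polarity across reviews
```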

You can find the full working code in my github account here!

 

Sentiment Analysis

Sentiment

Definition:

The process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic, product, etc. is positive, negative, or neutral

Use case:

Customers' online comments/feedback scraped from an insurance company's website are run through the sentiment analysis.

You can find full R code along with the data set in my git repository here

Steps:

  • Load required R libraries

    # source("http://bioconductor.org/biocLite.R")
    # biocLite("Rgraphviz")
    # install.packages('tm')
    # install.packages('wordcloud')
    # download.file("http://cran.cnr.berkeley.edu/src/contrib/Archive/Rstem/Rstem_0.4-1.tar.gz", "Rstem_0.4-1.tar.gz")
    # install.packages("Rstem_0.4-1.tar.gz", repos=NULL, type="source")
    # download.file("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz", "sentiment.tar.gz")
    # install.packages("sentiment.tar.gz", repos=NULL, type="source")# Load libraries
    
    library(wordcloud)
    library(tm)
    library(plyr)
    library(ggplot2)
    library(grid)
    library(sentiment)
    library(Rgraphviz)
  • Pre-process data:

    Text pre-processing is an important step to reduce noise in the data. Each step is discussed below.

  • convert to lower case: this avoids distinguishing between words simply by case
  • remove punctuation: punctuation can provide grammatical context that supports understanding, but for initial analyses we often ignore it
  • remove numbers: numbers may or may not be relevant to our analyses
  • remove stop words: stop words are common words in a language, such as for, of and are
  • create document term matrix: a document term matrix is simply a matrix with documents as the rows, terms as the columns, and word frequency counts as the cells
df <- read.table("../input/data.csv",sep=",",header=TRUE)
corp <- Corpus(VectorSource(df$Review)) 
corp <- tm_map(corp, tolower) 
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
# corp <- tm_map(corp, stemDocument, language = "english") 
corp <- tm_map(corp, removeWords, c("the", stopwords("english"))) 
corp <- tm_map(corp, PlainTextDocument)
corp.tdm <- TermDocumentMatrix(corp, control = list(minWordLength = 3)) 
corp.dtm <- DocumentTermMatrix(corp, control = list(minWordLength = 3))
  • Insight through visualization

    • Word cloud: This visualization draws each word with a font size proportional to its frequency.

      wordcloud(corp, scale=c(5,0.5), max.words=100, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, 'Dark2'))

      Word Cloud

    • Frequency plot: This visualization presents a bar chart whose bar length corresponds to the frequency with which a particular word occurred
      corp.tdm.m <- as.matrix(corp.tdm) # convert the term-document matrix to a plain matrix
      corp.tdm.df <- sort(rowSums(corp.tdm.m), decreasing=TRUE) # term frequencies in descending order
      df.freq <- data.frame(word = names(corp.tdm.df), freq = corp.tdm.df) # table with terms and frequencies
      
      # Set minimum term frequency value. The charts will be created for terms > or = to the minimum value that we set.
      freqControl <- 100
      # Frequency Plot
      freqplotData <- subset(df.freq, df.freq$freq > freqControl)
      freqplotData$word <- ordered(freqplotData$word,levels=levels(freqplotData$word)[unclass(freqplotData$word)])
      freqplot <- ggplot(freqplotData,aes(reorder(word,freq), freq))
      freqplot <- freqplot + geom_bar(stat="identity")
      freqplot <- freqplot + theme(axis.text.x=element_text(angle=90,hjust=1)) + coord_flip() 
      freqplot + xlim(rev(levels(freqplotData$word)))+ ggtitle("Frequency Plot")

      Frequency.png

    • Correlation plot: Here, we choose the N most frequent words as the nodes and include links between words when they have at least a correlation of x%.
      # Correlation Plot
      # 50 of the more frequent words have been chosen as the nodes and include links between words
      # when they have at least a correlation of 0.2
      # By default (without providing terms and a correlation threshold) the plot function chooses a
      # random 20 terms with a threshold of 0.7
      plot(corp.tdm,terms=findFreqTerms(corp.tdm,lowfreq=freqControl)[1:50],corThreshold=0.2, main="Correlation Plot")

      Correlation.png

    • Paired word cloud: This is a customized word cloud. Here, we pick the top N most frequent words and extract associated words with strong correlation. We combine each top-N word with every associated word (say one of my top words is broken and one of the associated words is pipe; the combined word would be broken-pipe), then create a word cloud of the combined terms. Although the concept is good, the chart below does not appear helpful, so a better representation is needed.

      # Paired-Terms wordcloud
      # pick the top N most frequent words and extract associated words with strong correlation (say 70%). 
      # Combine individual top N words with every associated word.
      nFreqTerms <- findFreqTerms(corp.dtm,lowfreq=freqControl)
      nFreqTermsAssocs <- findAssocs(corp.dtm, nFreqTerms, 0.3)
      pairedTerms <- c()
      for (i in 1:length(nFreqTermsAssocs)){
        if(length(names(nFreqTermsAssocs[[i]]))!=0) 
          lapply(names(nFreqTermsAssocs[[i]]),function(x) pairedTerms <<- c(pairedTerms,paste(names(nFreqTermsAssocs[i]),x,sep="-")))
      }
      wordcloud(pairedTerms, random.order=FALSE, colors=brewer.pal(8, "Dark2"))

      Paired Word Cloud

  • Sentiment Score

    • Load positive / negative terms corpus

      The corpus contains around 6,800 words; this list was compiled over many years, starting from the first paper by Hu and Liu, KDD-2004. Although necessary, having an opinion lexicon is far from sufficient for accurate sentiment analysis. See the paper Sentiment Analysis and Subjectivity.

    • Calculate positive / negative score

      We simply calculate the positive/negative score by comparing each term with the positive/negative term corpus and summing the occurrence counts.

    • Classify emotion

      The R package sentiment by Timothy Jurka has a function that helps us analyze text and classify it into different types of emotion: anger, disgust, fear, joy, sadness, and surprise. The classification can be performed using two algorithms: a naive Bayes classifier trained on Carlo Strapparava and Alessandro Valitutti's emotions lexicon, or a simple voter procedure.

    • Classify polarity

      Another function from the sentiment package, classify_polarity, allows us to classify text as positive or negative. In this case, the classification can be done using a naive Bayes algorithm trained on Janyce Wiebe's subjectivity lexicon, or by a simple voter algorithm.

      hu.liu.pos = scan('../input/positive-words.txt', what = 'character',comment.char=';') 
      hu.liu.neg = scan('../input/negative-words.txt',what = 'character',comment.char= ';') 
      pos.words = c(hu.liu.pos)
      neg.words = c(hu.liu.neg)
      score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
      {
        require(plyr)
        require(stringr)
        
        # we got a vector of sentences. plyr will handle a list
        # or a vector as an "l" for us
        # we want a simple array ("a") of scores back, so we use
        # "l" + "a" + "ply" = "laply":
        scores = laply(sentences, function(sentence, pos.words, neg.words) {
          
          # clean up sentences with R's regex-driven global substitute, gsub():
          sentence = gsub('[[:punct:]]', '', sentence)
          sentence = gsub('[[:cntrl:]]', '', sentence)
          sentence = gsub('\\d+', '', sentence)
          # and convert to lower case:
          sentence = tolower(sentence)
          
          # split into words. str_split is in the stringr package
          word.list = str_split(sentence, '\\s+')
          # sometimes a list() is one level of hierarchy too much
          words = unlist(word.list)
          
          # compare our words to the dictionaries of positive & negative terms
          pos.matches = match(words, pos.words)
          neg.matches = match(words, neg.words)
          
          # match() returns the position of the matched term or NA
          # we just want a TRUE/FALSE:
          pos.matches= !is.na(pos.matches)
          neg.matches= !is.na(neg.matches)
          
          # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
          score = sum(pos.matches) - sum(neg.matches)
          
          return(score)
        }, pos.words, neg.words, .progress=.progress )
        
        scores.df = data.frame(score=scores, text=sentences)
        return(scores.df)
      }
      
      review.scores<- score.sentiment(df$Review,pos.words,neg.words,.progress='text')
      #classify emotion
      class_emo = classify_emotion(df$Review, algorithm="bayes", prior=1.0)
      #get emotion best fit
      emotion = class_emo[,7]
      # substitute NA's by "unknown"
      emotion[is.na(emotion)] = "unknown"
      
      # classify polarity
      class_pol = classify_polarity(df$Review, algorithm="bayes")
      
      # get polarity best fit
      polarity = class_pol[,4]
      
      # data frame with results
      sent_df = data.frame(text=df$Review, emotion=emotion, polarity=polarity, stringsAsFactors=FALSE)
      
      # sort data frame
      sent_df = within(sent_df, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))
      
      
    • Visualize

      • Distribution of overall score
        ggplot(review.scores, aes(x=score)) + 
          geom_histogram(binwidth=1) + 
          xlab("Sentiment score") + 
          ylab("Frequency") + 
          ggtitle("Distribution of sentiment score") +
          theme_bw()  + 
          theme(axis.title.x = element_text(vjust = -0.5, size = 14)) + 
          theme(axis.title.y=element_text(size = 14, angle=90, vjust = -0.25)) + 
          theme(plot.margin = unit(c(1,1,2,2), "lines"))

        Sentiment Score Distribution.png

      • Distribution of score for a given term
        review.pos<- subset(review.scores,review.scores$score>= 2) 
        review.neg<- subset(review.scores,review.scores$score<= -2)
        claim <- subset(review.scores, regexpr("claim", review.scores$text) > 0) 
        ggplot(claim, aes(x = score)) + geom_histogram(binwidth = 1) + ggtitle("Sentiment score for the token 'claim'") + xlab("Score") + ylab("Frequency") + theme_bw()  + theme(axis.title.x = element_text(vjust = -0.5, size = 14)) + theme(axis.title.y = element_text(size = 14, angle = 90, vjust = -0.25)) + theme(plot.margin = unit(c(1,1,2,2), "lines"))

        Claim Score.png

      • Distribution of emotion
        # plot distribution of emotions
        ggplot(sent_df, aes(x=emotion)) +
          geom_bar(aes(y=..count.., fill=emotion)) +
          scale_fill_brewer(palette="Dark2") +
          labs(x="emotion categories", y="number of Feedback", 
               title = "Sentiment Analysis of Feedback about claim(classification by emotion)",
               plot.title = element_text(size=12))

        Emotions Distribution.png

      • Distribution of polarity
        # plot distribution of polarity
        ggplot(sent_df, aes(x=polarity)) +
          geom_bar(aes(y=..count.., fill=polarity)) +
          scale_fill_brewer(palette="RdGy") +
          labs(x="polarity categories", y="number of Feedback", 
               title = "Sentiment Analysis of Feedback about claim (classification by polarity)",
               plot.title = element_text(size=12))

        Polarity.png

      • Text by emotion
        # separating text by emotion
        emos = levels(factor(sent_df$emotion))
        nemo = length(emos)
        emo.docs = rep("", nemo)
        for (i in 1:nemo)
        {
          tmp = df$Review[emotion == emos[i]]
          emo.docs[i] = paste(tmp, collapse=" ")
        }
        
        # remove stopwords
        emo.docs = removeWords(emo.docs, stopwords("english"))
        # create corpus
        corpus = Corpus(VectorSource(emo.docs))
        tdm = TermDocumentMatrix(corpus)
        tdm = as.matrix(tdm)
        colnames(tdm) = emos
        
        # comparison word cloud
        comparison.cloud(tdm, colors = brewer.pal(nemo, "Dark2"),
                         scale = c(3,.5), random.order = FALSE, title.size = 1.5)

        Text by emotion.png

Next post to cover sentiment analysis in R + Hadoop.

 

Reference:

The above write up is based on the tutorials from following links:

Clothing Sales Prediction – Mini DataHack

minidatahack-cover

Analytics Vidhya organized a weekend mini data hackathon for Clothing Sales Prediction. The hackathon ran from 20:00 to 23:00 (UTC+5:30) on 28th May 2016.

It's my second hack on this forum; earlier I had participated in The Seer's Accuracy hackathon and finished 54th on the public leaderboard. I was hoping this one would be better, but due to time constraints it was not.

Unfortunately, my participation was delayed by an hour, so I only had two hours to solve the problem.

Problem Statement:

SimpleBuy is a clothing company which runs operations in brick-and-mortar fashion. Be it parent, child, man or woman, they have a wide range of products catering to the needs of every individual. They aim to become a one-stop destination for all clothing desires.

Their combination of offline and online channels is doing quite well. Their stock now runs out even faster than they can replenish it. Customers are no longer skeptical about their quality. Their offline stores let customers physically check clothes before buying them, especially the expensive ones. In addition, their delivery channels are known to achieve six sigma efficiency.

However, SimpleBuy can only provide this experience if they can manage their inventory well. Hence, they need to forecast sales ahead of time, and this is where you will help them today. SimpleBuy has provided you with their sales data for the last 2 years, and they want you to predict the sales for the next 12 months.

Data:

The train data had only two columns i.e., ‘Date’ and ‘Number_SKU_Sold’

Train Data: 2007 and 2008 (Daily Sales,  587 records)

Test Data: 2009 (only contained date column, 365 records)

Model:

As this is time-series data, I felt this was the right opportunity to try my hand at the "forecast" R package. Referring to Dataiku's time-series tutorial, I tried 3 models from the package.

Model 1: Exponential smoothing state space (ETS)

Model 2: Auto ARIMA
The auto.arima() function automatically searches for the best model and optimizes the parameters.

Model 3: TBATS

TBATS (Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components) is designed for use when there are multiple cyclic patterns, e.g. daily, weekly and yearly patterns, in a single time series.

On comparing the 3 models on AIC, TBATS seemed to perform slightly better than ETS/ARIMA.

Model_Compare

Note that the model with the smallest AIC is the best-fitting model. However, this submission performed poorly on the public leaderboard.
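The three candidate models can be sketched with the forecast package as below; a rough outline, assuming the daily sales have been loaded into a data frame named train with a Number_SKU_Sold column, and that weekly seasonality is used (both are my assumptions, not from the hackathon code):

```r
library(forecast)

# Hypothetical series: daily sales with weekly seasonality (frequency = 7)
sales.ts <- ts(train$Number_SKU_Sold, frequency = 7)

fit.ets   <- ets(sales.ts)         # Model 1: exponential smoothing state space
fit.arima <- auto.arima(sales.ts)  # Model 2: auto-selected ARIMA
fit.tbats <- tbats(sales.ts)       # Model 3: TBATS, handles multiple seasonalities

# Compare the fits on AIC; the smallest AIC indicates the best-fitting model
c(ETS = fit.ets$aic, ARIMA = AIC(fit.arima), TBATS = fit.tbats$AIC)

# Forecast 365 daily points ahead with the chosen model
fc <- forecast(fit.tbats, h = 365)
```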

So I quickly moved to Random Forest, as I was more comfortable with it and it gives better results most of the time. I extracted features from the date such as year, month, day of month and day of the year, and added 2 more features to weight days (an idea taken from Kaggle's Walmart sales prediction solutions).
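The date-feature extraction can be sketched as follows; a minimal outline assuming a Date column in the training data (the variable names are mine, not from the hackathon code):

```r
library(randomForest)

# Hypothetical feature engineering from the 'Date' column
train$Date  <- as.Date(train$Date)
train$year  <- as.numeric(format(train$Date, "%Y"))
train$month <- as.numeric(format(train$Date, "%m"))
train$day   <- as.numeric(format(train$Date, "%d"))
train$yday  <- as.numeric(format(train$Date, "%j"))  # day of the year

# Fit a random forest on the engineered date features
fit.rf <- randomForest(Number_SKU_Sold ~ year + month + day + yday,
                       data = train, ntree = 500)
```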

Github:

Here is my GitHub repository.

Results:

This model scored 21046427.5142 on the public LB and ranked 107th; view the public leaderboard.

Conclusion:

I clearly missed adding a few possible key features (day of the week, seasonality, holidays etc.) which could have improved the score. However, given that I had only two hours to solve the problem, I am glad I was able to complete a submission.

Out of personal interest I will definitely come back to this problem to see how the score can be improved.

It was a very interesting problem and thanks to the Analytics Vidhya organizers.

 

 

 

R-Hadoop Integration on Ubuntu

Contents

  • About the Manual
  • Pre-requisites
  • Install R Base on Hadoop
  • Install R Studio on Hadoop
  • Install RHadoop packages

RHadoop is a collection of four R packages that allow users to manage and analyze data with Hadoop.

  1. plyrmr – higher-level plyr-like data processing for structured data, powered by rmr
  2. rmr – functions providing Hadoop MapReduce functionality in R
  3. rhdfs – functions providing file management of HDFS from within R
  4. rhbase – functions providing database management for the HBase distributed database from within R
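Once installed, the packages work together roughly as below: the canonical rmr2 smoke test, sketched under the assumption that the HADOOP_CMD and HADOOP_STREAMING environment variables are already set:

```r
library(rhdfs)
library(rmr2)

hdfs.init()  # initialise the connection to HDFS (uses HADOOP_CMD)

# Push a local R object into HDFS, run a map over it, read the result back
small.ints <- to.dfs(1:10)
out <- mapreduce(input = small.ints,
                 map = function(k, v) keyval(v, v^2))  # square each value
from.dfs(out)
```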

This manual covers R and Hadoop 2.4.0 integration on Ubuntu 14.04.

Pre-requisites:

We assume that the user has the following two up and running before starting the R and Hadoop integration:

Ubuntu 14.04

Hadoop 2.x +

Read my blog post on setting up a single-node Hadoop cluster to learn more.

Pre-requisite:

Once the Hadoop installation is done, make sure that all the processes are running.

Run the command jps in your terminal; the result should look similar to the screenshot below:

11

R installation

Step 1: Click on the Ubuntu-software center.

1.png

Step 2: Open Ubuntu Software Center in full-screen mode (if the window is small, the search option may not be visible). Search for r-base, click on the first link, then click Install.

2.png

Step 3: Once the installation is done, open your terminal. Type the command R and the R console will open.

 

You can perform any operation in this R console; for example, to plot a sequence of values:

plot(seq(1,1000,2.3))

The graph produced by this plot() call is shown in the screenshot below:

3.png

Step 4:

To exit the R console, give the command

q()

Type y to save the workspace, n to exit without saving, or c to continue in the same workspace.

Step 5: Now we install RStudio on Ubuntu.

  • Open your browser and download RStudio. I downloaded RStudio 0.98.953 – Debian 6+/Ubuntu 10.04+ (32-bit); this is actually a file named rstudio-0.98.953-amd32.deb

4.png

Go to the Downloads folder, right-click on the downloaded file, open it with Ubuntu Software Center and click Install.

5.png

6.png

Open a terminal and type R; you can now use both the R console and RStudio.

7.png

Install RHadoop packages

Step 1: Install Thrift

sudo apt-get install libboost-dev libboost-test-dev libboost-program-options-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev

$ cd /tmp

If the command below does not work, please download the Thrift tarball manually.

$ sudo wget -qO- https://dist.apache.org/repos/dist/release/thrift/0.9.0/thrift-0.9.0.tar.gz | tar zx

$ cd thrift-0.9.0/

$ ./configure

$ make

$ sudo make install

$ thrift --help

 

Step 2: Install supporting R packages:

install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "dplyr", "R.methodsS3", "caTools", "Hmisc"), lib="/usr/local/R/library")

Step 3: Download below packages from https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads

rmr2

rhdfs

rhbase

plyrmr

In an R terminal, run the following commands to install the packages, replacing <path> with your downloaded file location. First edit the R environment file so R sessions can find Hadoop:

sudo gedit /etc/R/Renviron
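The Renviron file typically needs the same Hadoop variables exported in .bashrc earlier, so that R sessions (including RStudio) can find them; a sketch using the paths from this manual:

```
HADOOP_CMD=/usr/local/hadoop/bin/hadoop
HADOOP_STREAMING=/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar
```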

Install RHadoop (rhdfs, rhbase, rmr2 and plyrmr)

Install relevant packages:

install.packages("rhdfs_1.0.8.tar.gz", repos=NULL, type="source")

install.packages("rmr2_3.1.2.tar.gz", repos=NULL, type="source")

install.packages("plyrmr_0.3.0.tar.gz", repos=NULL, type="source")

install.packages("rhbase_1.2.1.tar.gz", repos=NULL, type="source")

References

You'll find YouTube videos and step-by-step instructions on installing R with Hadoop at the following links.

RDataMining: R on Hadoop – step-by-step instructions

URL: http://www.rdatamining.com/tutorials/rhadoop

Youtube: Word count map reduce program in R

URL: http://www.youtube.com/watch?v=hSrW0Iwghtw

Revolution Analytics: RHadoop packages

URL: https://github.com/RevolutionAnalytics/RHadoop/wiki

Install R-base Guide

URL: http://www.sysads.co.uk/2014/06/install-r-base-3-1-0-ubuntu-14-04/

 

In the next blog post I'll show a sample sentiment analysis using map-reduce in R with the rmr package.

 

Setting up a Single Node Hadoop Cluster

Step By Step Hadoop Installation Guide

Setting up Single Node Hadoop Cluster on Windows over VM

Contents

  • Objective
  • Current Environments
  • Download VM and Ubuntu 14.04
  • Install Ubuntu on VM
  • Install Hadoop 2.4 on Ubuntu 14.04

 

Objective: This document will help you set up Hadoop 2.4.0 on Ubuntu 14.04 in a virtual machine on a Windows operating system.

Current environment includes:

  • Windows XP/7 – 32 bit
  • VM Player (Non-commercial use only)
  • Ubuntu 14.04 32 bit
  • Java 1.7
  • Hadoop 2.4.0

Download and Install VM Player from the link https://www.vmware.com/tryvmware/?p=player

Download Ubuntu 14.04 iso file from the link: http://www.ubuntu.com/download/desktop

Download the list of Hadoop commands for reference from the following link: http://hadoop.apache.org/docs/r1.0.4/commands_manual.pdf (don't be afraid of this file; it is just a reference to help you learn more about important Hadoop commands).

Install Ubuntu in VM:

  • Click on Create a New Virtual Machine
  • Browse and select the Ubuntu iso file.
  • Personalize Linux by providing appropriate details.
  • Follow through the wizard steps to finish installation.

1

2.png

3.png

Install Hadoop 2.4 on Ubuntu 14.04

Step 1: Open Terminal

4.png

Step 2: Download Hadoop tar file by running the below command in terminal

wget http://mirror.fibergrid.in/apache/hadoop/common/stable/hadoop-2.7.2.tar.gz

Step 3: Unzip tar file through command: tar -xzf hadoop-2.7.2.tar.gz

Step 4: Let’s move everything into a more appropriate directory:

sudo mv hadoop-2.7.2/ /usr/local

cd /usr/local

sudo ln -s hadoop-2.7.2/ hadoop

Let's create a directory for Hadoop to store its data in later:

mkdir /usr/local/hadoop/data

 

Step 5: Set up user and permission (Replace manohar by your user id)

sudo addgroup hadoop

sudo adduser --ingroup hadoop manohar

sudo chown -R manohar:hadoop /usr/local/hadoop/

Step 6: Install ssh:

sudo apt-get install ssh

ssh-keygen -t rsa -P ""

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Step 7: Install Java:

sudo apt-get update

sudo apt-get install default-jdk

sudo gedit ~/.bashrc

This will open the .bashrc file in a text editor. Go to the end of the file and paste/type the following content in it:

#HADOOP VARIABLES START

export HADOOP_HOME=/usr/local/hadoop

export JAVA_HOME=/usr

export HADOOP_INSTALL=/usr/local/hadoop

export PATH=$PATH:$HADOOP_INSTALL/bin

export PATH=$PATH:$HADOOP_INSTALL/sbin

export HADOOP_MAPRED_HOME=$HADOOP_INSTALL

export HADOOP_COMMON_HOME=$HADOOP_INSTALL

export HADOOP_HDFS_HOME=$HADOOP_INSTALL

export YARN_HOME=$HADOOP_INSTALL

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"

export HADOOP_PREFIX=$HADOOP_INSTALL

export HADOOP_CMD=$HADOOP_INSTALL/bin/hadoop

export HADOOP_STREAMING=$HADOOP_INSTALL/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar

#HADOOP VARIABLES END

5.png

After saving and closing the .bashrc file, execute the following command so that your system recognizes the newly created environment variables:

source ~/.bashrc

Putting the above content in the .bashrc file ensures that these variables are always available when your VM starts up.

Step 8:

Unfortunately, Hadoop and IPv6 don't play nicely together, so we'll have to disable IPv6. To do this, open /etc/sysctl.conf and add the following lines to the end:

net.ipv6.conf.all.disable_ipv6 = 1

net.ipv6.conf.default.disable_ipv6 = 1

net.ipv6.conf.lo.disable_ipv6 = 1

Type the command: sudo gedit /etc/sysctl.conf

6

Step 9: Editing /usr/local/hadoop/etc/hadoop/hadoop-env.sh:

 sudo gedit /usr/local/hadoop/etc/hadoop/hadoop-env.sh

In this file, locate the line that exports the JAVA_HOME variable. Change this line to the following:

Change export JAVA_HOME=${JAVA_HOME} to match the JAVA_HOME you set in your .bashrc (for us JAVA_HOME=/usr).

Also, change this line:

export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

to:

export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Djava.library.path=$HADOOP_PREFIX/lib"

And finally, add the following line:

export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native

Step 10: Editing /usr/local/hadoop/etc/hadoop/core-site.xml:

sudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml

In this file, enter the following content in between the <configuration></configuration> tag:

<property>

<name>fs.default.name</name>

<value>hdfs://localhost:9000</value>

</property>

<property>

<name>hadoop.tmp.dir</name>

<value>/usr/local/hadoop/data</value>

</property>

7.png

Step 11: Editing /usr/local/hadoop/etc/hadoop/yarn-site.xml:

sudo gedit /usr/local/hadoop/etc/hadoop/yarn-site.xml

In this file, enter the following content in between the <configuration></configuration> tag:

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>

<property>

<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>

<value>org.apache.hadoop.mapred.ShuffleHandler</value>

</property>

<property>

<name>yarn.resourcemanager.resource-tracker.address</name>

<value>localhost:8025</value>

</property>

<property>

<name>yarn.resourcemanager.scheduler.address</name>

<value>localhost:8030</value>

</property>

<property>

<name>yarn.resourcemanager.address</name>

<value>localhost:8050</value>

</property>

The yarn-site.xml file should look something like this:

8.png

Step 12: Creating and Editing /usr/local/hadoop/etc/hadoop/mapred-site.xml:

 By default, the /usr/local/hadoop/etc/hadoop/ folder contains the /usr/local/hadoop/etc/hadoop/mapred-site.xml.template file which has to be renamed/copied with the name mapred-site.xml. This file is used to specify which framework is being used for MapReduce.

This can be done using the following command:

cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

Once this is done, open the newly created file with following command:

sudo gedit /usr/local/hadoop/etc/hadoop/mapred-site.xml

In this file, enter the following content in between the <configuration></configuration> tag:

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

The mapred-site.xml file should look something like this:

9.png

Step 13: Editing /usr/local/hadoop/etc/hadoop/hdfs-site.xml:

 The /usr/local/hadoop/etc/hadoop/hdfs-site.xml has to be configured for each host in the cluster that is being used. It is used to specify the directories which will be used as the namenode and the datanode on that host.

Before editing this file, we need to create two directories which will contain the namenode and the datanode for this Hadoop installation. This can be done using the following commands:

sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode

sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode

Open the /usr/local/hadoop/etc/hadoop/hdfs-site.xml file with the following command:

sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml

In this file, enter the following content between the <configuration> and </configuration> tags:

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>

<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>

The hdfs-site.xml file should look something like this:

[Screenshot: hdfs-site.xml after editing]
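If you want to double-check the edits, the properties can be read back with Python's standard-library XML parser. This is an optional sanity check, not part of the Hadoop setup itself; the helper name read_properties is only an illustration.

```python
import xml.etree.ElementTree as ET

def read_properties(xml_text):
    """Return a {name: value} dict for every <property> in a Hadoop config file."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}

# The hdfs-site.xml content from this step, shown inline here; in practice you
# would read /usr/local/hadoop/etc/hadoop/hdfs-site.xml from disk instead.
sample = """<configuration>
<property><name>dfs.replication</name><value>1</value></property>
<property><name>dfs.namenode.name.dir</name><value>file:/usr/local/hadoop_store/hdfs/namenode</value></property>
<property><name>dfs.datanode.data.dir</name><value>file:/usr/local/hadoop_store/hdfs/datanode</value></property>
</configuration>"""

props = read_properties(sample)
print(props["dfs.replication"])  # → 1
```

The same helper works for yarn-site.xml and mapred-site.xml, since all Hadoop configuration files share the same property/name/value layout.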

Step 14: Format the New Hadoop Filesystem:

After completing all the configuration outlined in the above steps, the Hadoop filesystem needs to be formatted so that it can start being used. This is done by executing the following command:

hdfs namenode -format

Note: This only needs to be done once before you start using Hadoop. If this command is executed again after Hadoop has been used, it’ll destroy all the data on the Hadoop filesystem.

Step 15: Start Hadoop

All that remains to be done is starting the newly installed single node cluster:

start-dfs.sh

While executing this command, you’ll be prompted twice with a message similar to the following:

Are you sure you want to continue connecting (yes/no)?

Type in yes for both these prompts and press the enter key. Once this is done, execute the following command:

start-yarn.sh

Executing the above two commands will get Hadoop up and running. You can verify this by typing in the following command:

jps

Executing this command should show you something similar to the following:

[Screenshot: jps output listing the running Hadoop daemons]

If you can see a result similar to the one depicted in the screenshot above, with daemons such as NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager running, it means that you now have a functional instance of Hadoop running on your VPS.

 

VBA Coding Best Practice

DEFINITION: Best practices are the agreed general set of guidelines that are believed to be more effective at delivering MS Excel-based tools which are:

• User friendly

• Easy to maintain

• More reliable and robust

These are just general guidelines; a professional developer will always assess the options and make the appropriate choice for their specific situation. These suggestions are specific to Excel VBA.

BEST PRACTICE PRINCIPLES

1. Easy to read and follow what's happening

2. Efficient code

3. Flexible and easy to change

4. Robust and deals with errors

5. Uses existing Excel functionality where possible

CONTENTS: The content is divided into the sections given below.

[Slides: Variable_Fun_Sub, V2]

Scope – Three levels of scope exist for each variable in VBA: Public, Private, and Local.

Prefix   Meaning                                                              Example
<none>   Local variable, procedure-level lifetime, declared with "Dim"        intOrderValue
st       Local variable, object lifetime, declared with "Static"              stLastInvoiceID
m        Private (module) variable, object lifetime, declared with "Private"  mcurRunningSum
g        Public (global) variable, object lifetime, declared with "Public"    glngGrandTotal

Var_Type (for variables)

Var_Type   Object Type      Example
bln or b   Boolean          blnPaid or bPaid
byt        Byte             bytPrice
int or i   Integer          intStoreID
lng        Long             lngSales
obj        Object           objArc
dbl        Double           dblSales
str or s   String           strName or sName
var or v   Variant          varColor or vColor
dte        Date             dteBirthDate
dec        Decimal          decLongitude
cht        Chart            chtSales
chk        Check box        chkReadOnly
cmd        Command button   cmdCancel
lbl        Label            lblHelpMessage
opt        Option button    optFrench

SUFFIXES – Suffixes modify the base name of an object, indicating additional information about a variable. You'll likely create your own suffixes specific to your development work. The table below lists some generic, commonly used VBA suffixes.

Suffix   Meaning                                                                         Example
Min      The absolute first element in an array or other kind of list                    iastrNamesMin
First    The first element to be used in an array or list during the current operation   iaintFontSizesFirst
Last     The last element to be used in an array or list during the current operation    igphsGlyphCollectionLast
Max      The absolute last element in an array or other kind of list                     iastrNamesMax

[Slide: Modules]

[Slide: Avoid duplicates]

[Slides: BP - GSR Excel, VBA, parts 001–009]

Triples – Deep Natural Language Processing

Problem: In text mining, extracting keywords (n-grams) alone cannot produce meaningful data nor discover “unknown” themes and trends.

Objective: The aim here is to extract dependency relations from sentences, i.e., extract sets of the form {subject, predicate[modifiers], object} out of syntactically parsed sentences, using the Stanford Parser and OpenNLP.

Steps: 

1) Apply sentence segmentation to determine the sentence boundaries

2) Parse each sentence with the Stanford Parser to get the syntactic relationship between each pair of words

3) The parser output is generated in the form of dependency relations, which represent the syntactic relationships within each sentence
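The sentence-segmentation step can be approximated with a minimal rule-based splitter, sketched below in Python. This is only an illustration; production pipelines should use a trained detector (e.g. OpenNLP's sentence detector), since a plain regex will mis-handle abbreviations.

```python
import re

def segment_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? when followed by
    whitespace and an uppercase letter. Only an approximation; abbreviations
    such as "Dr." will be split incorrectly."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

segment_sentences("The flat tire was not changed by driver. He called for help.")
# → ['The flat tire was not changed by driver.', 'He called for help.']
```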

How is this different from n-grams?

Dependency relations allow the similarity comparison to be based on the syntactic relations between words, instead of having to match words in their exact order as in n-gram based comparisons.

Example:

Sentence: “The flat tire was not changed by driver”

Stanford dependency relations: 

root(ROOT-0, changed-6)
det(tire-3, the-1)
amod(tire-3, flat-2)
nsubjpass(changed-6, tire-3)
auxpass(changed-6, was-4)
neg(changed-6, not-5)
prep(changed-6, by-7)
pobj(by-7, driver-8)

Refer to the Stanford typed dependencies manual for the full list and more info: http://nlp.stanford.edu/software/dependencies_manual.pdf

Triples output in the form (Subject : Predicate [modifier] : Object) :  

driver : changed [not] : tire

Extraction Logic: You can use the base logic below to build the functionality in your favorite language (R/Python/Java/etc.). Please note that this is only the base logic and needs enhancement.

[Image: triples extraction logic]
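The base logic can be sketched in Python, taking Stanford-style dependency strings like those shown above as input. The function names and the simple passive-voice handling are illustrative assumptions; as noted, this needs enhancement for real text (it only inspects the root, nsubj, nsubjpass, dobj, pobj and neg relations).

```python
import re

# Matches a Stanford-style dependency string, e.g. "nsubjpass(changed-6, tire-3)"
DEP_RE = re.compile(r"(\w+)\((\S+?)-\d+, (\S+?)-\d+\)")

def parse_deps(lines):
    """Parse dependency strings into (relation, governor, dependent) tuples."""
    out = []
    for line in lines:
        m = DEP_RE.match(line.strip())
        if m:
            out.append((m.group(1), m.group(2), m.group(3)))
    return out

def extract_triple(dep_lines):
    """Assemble (subject, predicate, [modifiers], object) from one sentence."""
    deps = parse_deps(dep_lines)
    rel = {}  # relation name -> (governor, dependent), first occurrence wins
    for r, gov, dep in deps:
        rel.setdefault(r, (gov, dep))

    predicate = rel.get("root", (None, None))[1]   # the dependent of ROOT
    modifiers = [d for r, g, d in deps if r == "neg" and g == predicate]

    if "nsubjpass" in rel:
        # Passive voice: the grammatical subject is the semantic object, and
        # the agent (if present) appears as the object of the "by" preposition.
        obj = rel["nsubjpass"][1]
        subj = rel.get("pobj", (None, None))[1]
    else:
        subj = rel.get("nsubj", (None, None))[1]
        obj = rel.get("dobj", (None, None))[1]
    return subj, predicate, modifiers, obj
```

For the dependency output above, this returns ('driver', 'changed', ['not'], 'tire'), i.e. the triple driver : changed [not] : tire.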

Challenges:

  • The approach works at sentence level and assumes well-structured sentences
  • Abbreviations and grammatical errors in sentences will mislead the analysis

 

Hope this article is useful!