I was fortunate to have had the chance to work on some exciting text mining projects during my consulting role in the insurance domain. I summarized my learning and experience in my first LinkedIn post; you can find it here!
Rhadoop – Sentiment Analysis
In continuation of my previous post on sentiment analysis, let's explore how to perform the same analysis using RHadoop.
What will be covered?
- Environment & Pre-requisites
- Rhadoop in action
- Setting Rhadoop environment variables
- Setting working folder paths
- Loading data
- Scoring function
- Writing Mapper
- Writing Reducer
- Run your Map-Reduce program
- Read data output from hadoop to R data frame
Environment:
I performed this analysis on the following set-up:
- Single Node Hadoop Cluster set up over Ubuntu 14.04 (learn how to set up here!)
- RHadoop, a collection of four R packages that allow users to manage and analyse data with Hadoop (learn how to set it up here!)
Pre-requisites:
Ensure that all Hadoop processes are running. You can do this by running the following commands in your terminal:
start-dfs.sh and start-yarn.sh
Then, run the command jps in your terminal; the result should look similar to the screenshot below:
RHadoop In Action:
Set up the environment variables. Note that the paths may change based on your Ubuntu and Hadoop versions (I'm using Hadoop 2.4.0).
Sys.setenv("HADOOP_CMD"="/usr/local/hadoop/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar")
Setting folder paths:
To ease testing during development, it is useful to be able to switch between the local and Hadoop back ends. In the code below, setting LOCAL = T uses files from the local folder, while LOCAL = F uses files from Hadoop.
# Root folder path
setwd('/home/manohar/example/Sentiment_Analysis')

# Set "LOCAL" variable to T to execute using rmr's local backend.
# Otherwise, use Hadoop (which needs to be running, correctly configured, etc.)
LOCAL = F

if (LOCAL) {
  rmr.options(backend = 'local')
  # we have smaller extracts of the data in this project's 'local' subdirectory
  hdfs.data.root = '/home/manohar/example/Sentiment_Analysis/'
  hdfs.data = file.path(hdfs.data.root, 'data', 'data.csv')
  hdfs.out.root = hdfs.data.root
} else {
  rmr.options(backend = 'hadoop')
  # assumes 'Sentiment_Analysis/data' input path exists on HDFS under /home/manohar/example
  hdfs.data.root = '/home/manohar/example/Sentiment_Analysis/'
  hdfs.data = file.path(hdfs.data.root, 'data')
  # writes output to 'Sentiment_Analysis' directory in the user's HDFS home
  # (e.g., /home/manohar/example/Sentiment_Analysis/)
  hdfs.out.root = 'Sentiment_Analysis'
}
hdfs.out = file.path(hdfs.out.root, 'out')
Loading Data:
The code below copies the file from the local file system to Hadoop; if the file already exists, it simply returns TRUE.
# equivalent to hadoop dfs -copyFromLocal
hdfs.put(hdfs.data, hdfs.data)
Our data is in a CSV file, so we define the input format up front for better code readability, especially in the mapper stage.
# asa.csv.input.format() - read CSV data files and label field names
# for better code readability (especially in the mapper)
asa.csv.input.format = make.input.format(format = 'csv', mode = 'text',
                                         streaming.format = NULL, sep = ',',
                                         col.names = c('ID', 'Name', 'Gender', 'Age', 'OverAllRating',
                                                       'ReviewType', 'ReviewTitle', 'Benefits', 'Money',
                                                       'Experience', 'Purchase', 'claimsProcess',
                                                       'SpeedResolution', 'Fairness', 'ReviewDate',
                                                       'Review', 'Recommend', 'ColCount'),
                                         stringsAsFactors = F)
Load the opinion lexicons; the files and the paper on opinion lexicons can be found here.
pos_words <- scan('/home/manohar/example/Sentiment_Analysis/data/positive-words.txt',
                  what = 'character', comment.char = ';')
neg_words <- scan('/home/manohar/example/Sentiment_Analysis/data/negative-words.txt',
                  what = 'character', comment.char = ';')
Scoring Function:
Below is the main function that calculates the sentiment score, written by Jeffrey Breen (source here!)
score.sentiment = function(sentence, pos.words, neg.words) {
  require(plyr)
  require(stringr)
  score = laply(sentence, function(sentence, pos.words, neg.words) {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words)
  score.df = data.frame(score)
  return(score.df)
}
Mapper:
The map phase is the first stage of the map-reduce process. In the classic word-count example, it splits each input line into separate words (tokenizing the string) and, for each word seen, outputs the word together with a count of 1 to indicate it has been seen once. The mapper runs in parallel because Hadoop uses a divide-and-conquer approach to solve the problem, which is what makes the computation fast; this is the phase in which the computation, processing and distribution of data take place. In our case, however, we are running on a single node and the code and logic are fairly simple.
The mapper gets keys and values from the input formatter. In our case, the key is NULL and the value is a data.frame from read.table()
mapper = function(key, val.df) {
  # Remove header lines
  val.df = subset(val.df, Review != 'Review')
  output.key = data.frame(Review = as.character(val.df$Review), stringsAsFactors = F)
  output.val = data.frame(val.df$Review)
  return( keyval(output.key, output.val) )
}
Reducer:
In the classic word-count example, the reduce phase then sums up the number of times each word was seen and writes that count together with the word as output. In our case, the reducer applies the scoring function to each review and emits the review together with its sentiment score.
Two sub-steps run internally before the reducer produces its final result: shuffle and sort. Shuffle groups values that share the same key into a single unit, and sort orders the data by key.
reducer = function(key, val.df) {
  output.key = key
  output.val = data.frame(score.sentiment(val.df, pos_words, neg_words))
  return( keyval(output.key, output.val) )
}
Running your Map-Reduce:
Execute the map-reduce program:
mr.sa = function(input, output) {
  mapreduce(input = input,
            output = output,
            input.format = asa.csv.input.format,
            map = mapper,
            reduce = reducer,
            verbose = T)
}

out = mr.sa(hdfs.data, hdfs.out)

------- output on screen ------
> out = mr.sa(hdfs.data, hdfs.out)
16/09/09 10:11:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
packageJobJar: [/usr/local/hadoop/data/hadoop-unjar2099064477903127749/] [] /tmp/streamjob6583314935744487158.jar tmpDir=null
16/09/09 10:11:33 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8050
16/09/09 10:11:33 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8050
16/09/09 10:11:36 INFO mapred.FileInputFormat: Total input paths to process : 3
16/09/09 10:11:36 INFO mapreduce.JobSubmitter: number of splits:4
16/09/09 10:11:38 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1473394634383_0002
16/09/09 10:11:39 INFO impl.YarnClientImpl: Submitted application application_1473394634383_0002
16/09/09 10:11:39 INFO mapreduce.Job: The url to track the job: http://manohar-dt:8088/proxy/application_1473394634383_0002/
16/09/09 10:11:39 INFO mapreduce.Job: Running job: job_1473394634383_0002
16/09/09 10:11:58 INFO mapreduce.Job: Job job_1473394634383_0002 running in uber mode : false
16/09/09 10:11:58 INFO mapreduce.Job: map 0% reduce 0%
16/09/09 10:12:27 INFO mapreduce.Job: map 48% reduce 0%
16/09/09 10:12:37 INFO mapreduce.Job: map 100% reduce 0%
16/09/09 10:13:22 INFO mapreduce.Job: map 100% reduce 100%
16/09/09 10:13:35 INFO mapreduce.Job: Job job_1473394634383_0002 completed successfully
16/09/09 10:13:36 INFO mapreduce.Job: Counters: 50
    File System Counters
        FILE: Number of bytes read=1045750
        FILE: Number of bytes written=2580683
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=679293
        HDFS: Number of bytes written=578577
        HDFS: Number of read operations=15
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=4
        Launched reduce tasks=1
        Data-local map tasks=4
        Total time spent by all maps in occupied slots (ms)=148275
        Total time spent by all reduces in occupied slots (ms)=53759
        Total time spent by all map tasks (ms)=148275
        Total time spent by all reduce tasks (ms)=53759
        Total vcore-seconds taken by all map tasks=148275
        Total vcore-seconds taken by all reduce tasks=53759
        Total megabyte-seconds taken by all map tasks=151833600
        Total megabyte-seconds taken by all reduce tasks=55049216
    Map-Reduce Framework
        Map input records=9198
        Map output records=1818
        Map output bytes=1037580
        Map output materialized bytes=1045768
        Input split bytes=528
        Combine input records=0
        Combine output records=0
        Reduce input groups=1616
        Reduce shuffle bytes=1045768
        Reduce input records=1818
        Reduce output records=1720
        Spilled Records=3636
        Shuffled Maps =4
        Failed Shuffles=0
        Merged Map outputs=4
        GC time elapsed (ms)=1606
        CPU time spent (ms)=22310
        Physical memory (bytes) snapshot=1142579200
        Virtual memory (bytes) snapshot=5270970368
        Total committed heap usage (bytes)=947912704
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=678765
    File Output Format Counters
        Bytes Written=578577
rmr reduce calls=1616
16/09/09 10:13:36 INFO streaming.StreamJob: Output directory: /Sentiment_Analysis/out
Load output from hadoop to R data frame:
Read the output from the Hadoop folder into an R variable and convert it to a data frame for further processing.
results = from.dfs(out)

# put the result in a dataframe
df = sapply(results, c)
df = data.frame(df)                   # convert to dataframe
colnames(df) <- c('Review', 'score')  # assign column names
print(head(df))

------- Result -----
  Review                                            score
1 Very good experience                                  1
2 It was a classic scenario                             1
3 I have chosen for all my insurances                   0
4 As long as customers understand the t&c               0
5 time will tell if live up to our expectations         0
6 Good price good customer service happy to help..      3
Now we have a sentiment score for each text. This opens up opportunities for further analysis, such as classifying emotion and polarity, and a whole lot of visualization for insight. Please see my previous post here to learn more about this.
You can find the full working code in my github account here!
Sentiment Analysis
Definition:
The process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic, product, etc. is positive, negative, or neutral
Use case:
Customers' online comments/feedback were scraped from an insurance company's website and run through the sentiment analysis.
You can find the full R code along with the data set in my git repository here.
Steps:
- Load required R libraries
# source("http://bioconductor.org/biocLite.R")
# biocLite("Rgraphviz")
# install.packages('tm')
# install.packages('wordcloud')
# download.file("http://cran.cnr.berkeley.edu/src/contrib/Archive/Rstem/Rstem_0.4-1.tar.gz", "Rstem_0.4-1.tar.gz")
# install.packages("Rstem_0.4-1.tar.gz", repos=NULL, type="source")
# download.file("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz", "sentiment.tar.gz")
# install.packages("sentiment.tar.gz", repos=NULL, type="source")

# Load libraries
library(wordcloud)
library(tm)
library(plyr)
library(ggplot2)
library(grid)
library(sentiment)
library(Rgraphviz)
- Pre-process data
Text pre-processing is an important step to reduce noise in the data. Each step is discussed below:
- convert to lower case: this avoids distinguishing between words simply because of case
- remove punctuation: punctuation can provide grammatical context that supports understanding, but for initial analyses we often ignore it
- remove numbers: numbers may or may not be relevant to our analyses
- remove stop words: stop words are common words found in a language, such as for, of and are
- create document term matrix: a document term matrix is simply a matrix with documents as rows, terms as columns and word frequency counts as cells
df <- read.table("../input/data.csv", sep = ",", header = TRUE)
corp <- Corpus(VectorSource(df$Review))
corp <- tm_map(corp, tolower)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
# corp <- tm_map(corp, stemDocument, language = "english")
corp <- tm_map(corp, removeWords, c("the", stopwords("english")))
corp <- tm_map(corp, PlainTextDocument)
corp.tdm <- TermDocumentMatrix(corp, control = list(minWordLength = 3))
corp.dtm <- DocumentTermMatrix(corp, control = list(minWordLength = 3))
- Insight through visualization
- Word cloud: this visualization draws words with a font size proportional to their frequency.
wordcloud(corp, scale = c(5, 0.5), max.words = 100, random.order = FALSE,
          rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, 'Dark2'))
- Frequency plot: this visualization presents a bar chart in which the bar length corresponds to the frequency with which a particular word occurred.
# populate term frequency and sort in descending order
corp.tdm.df <- sort(rowSums(as.matrix(corp.tdm)), decreasing = TRUE)

# Table with terms and frequency
df.freq <- data.frame(word = names(corp.tdm.df), freq = corp.tdm.df)

# Set minimum term frequency value. The charts will be created for terms
# with frequency greater than or equal to the minimum value that we set.
freqControl <- 100
# Frequency Plot
freqplotData <- subset(df.freq, df.freq$freq > freqControl)
freqplotData$word <- ordered(freqplotData$word,
                             levels = levels(freqplotData$word)[unclass(freqplotData$word)])
freqplot <- ggplot(freqplotData, aes(reorder(word, freq), freq))
freqplot <- freqplot + geom_bar(stat = "identity")
freqplot <- freqplot + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + coord_flip()
freqplot + xlim(rev(levels(freqplotData$word))) + ggtitle("Frequency Plot")
- Correlation plot: here we choose the N most frequent words as nodes and include links between words when they have at least a correlation of x%.
# Correlation Plot
# 50 of the more frequent words have been chosen as the nodes and include links between words
# when they have at least a correlation of 0.2
# By default (without providing terms and a correlation threshold) the plot function chooses a
# random 20 terms with a threshold of 0.7
plot(corp.tdm,
     terms = findFreqTerms(corp.tdm, lowfreq = freqControl)[1:50],
     corThreshold = 0.2,
     main = "Correlation Plot")
- Paired word cloud: this is a customized word cloud. We pick the top N most frequent words and extract associated words with strong correlation, then combine each top-N word with every associated word (say one of my top words is broken and one of the associated words is pipe; the combined word would be broken-pipe) and create a word cloud on the combined words. Although the concept is good, the chart below does not appear helpful, so I need to figure out a better representation.
# Paired-Terms wordcloud
# pick the top N most frequent words and extract associated words with strong correlation (say 70%).
# Combine individual top N words with every associated word.
nFreqTerms <- findFreqTerms(corp.dtm, lowfreq = freqControl)
nFreqTermsAssocs <- findAssocs(corp.dtm, nFreqTerms, 0.3)

pairedTerms <- c()
for (i in 1:length(nFreqTermsAssocs)) {
  if (length(names(nFreqTermsAssocs[[i]])) != 0)
    lapply(names(nFreqTermsAssocs[[i]]),
           function(x) pairedTerms <<- c(pairedTerms, paste(names(nFreqTermsAssocs[i]), x, sep = "-")))
}

wordcloud(pairedTerms, random.order = FALSE, colors = brewer.pal(8, "Dark2"), main = "Paired Wordcloud")
- Sentiment Score
- Load positive / negative terms corpus
The corpus contains around 6,800 words; the list was compiled over many years, starting from the first paper by Hu and Liu, KDD-2004. Although necessary, an opinion lexicon is far from sufficient for accurate sentiment analysis. See the paper Sentiment Analysis and Subjectivity.
- Calculate positive / negative score
We calculate the positive / negative score simply by comparing each term with the positive / negative term corpus and summing the occurrence counts; the score is the number of positive matches minus the number of negative matches.
- Classify emotion
R package sentiment by Timothy Jurka has a function that helps us to analyze some text and classify it in different types of emotion: anger, disgust, fear, joy, sadness, and surprise. The classification can be performed using two algorithms: one is a naive Bayes classifier trained on Carlo Strapparava and Alessandro Valitutti’s emotions lexicon; the other one is just a simple voter procedure.
- Classify polarity
Another function from sentiment package, classify_polarity allows us to classify some text as positive or negative. In this case, the classification can be done by using a naive Bayes algorithm trained on Janyce Wiebe’s subjectivity lexicon; or by a simple voter algorithm.
hu.liu.pos = scan('../input/positive-words.txt', what = 'character', comment.char = ';')
hu.liu.neg = scan('../input/negative-words.txt', what = 'character', comment.char = ';')
pos.words = c(hu.liu.pos)
neg.words = c(hu.liu.neg)

score.sentiment = function(sentences, pos.words, neg.words, .progress = 'none') {
  require(plyr)
  require(stringr)
  # we got a vector of sentences. plyr will handle a list
  # or a vector as an "l" for us
  # we want a simple array ("a") of scores back, so we use
  # "l" + "a" + "ply" = "laply":
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress = .progress)
  scores.df = data.frame(score = scores, text = sentences)
  return(scores.df)
}

review.scores <- score.sentiment(df$Review, pos.words, neg.words, .progress = 'text')
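As a quick illustration of how the lexicon scoring works, here is a made-up review run through the function above (illustrative only; it assumes "helpful" appears in the positive word list and "slow" in the negative one):

# Illustrative only: score a single made-up review with the function above
score.sentiment("the claims process was slow but the staff were helpful",
                pos.words, neg.words)
# "helpful" matches the positive lexicon and "slow" the negative one,
# so the score is 1 - 1 = 0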
# classify emotion
class_emo = classify_emotion(df$Review, algorithm = "bayes", prior = 1.0)
# get emotion best fit
emotion = class_emo[, 7]
# substitute NA's by "unknown"
emotion[is.na(emotion)] = "unknown"

# classify polarity
class_pol = classify_polarity(df$Review, algorithm = "bayes")
# get polarity best fit
polarity = class_pol[, 4]

# data frame with results
sent_df = data.frame(text = df$Review, emotion = emotion,
                     polarity = polarity, stringsAsFactors = FALSE)

# sort data frame
sent_df = within(sent_df,
                 emotion <- factor(emotion, levels = names(sort(table(emotion), decreasing = TRUE))))
- Visualize
- Distribution of overall score
ggplot(review.scores, aes(x = score)) +
  geom_histogram(binwidth = 1) +
  xlab("Sentiment score") +
  ylab("Frequency") +
  ggtitle("Distribution of sentiment score") +
  theme_bw() +
  theme(axis.title.x = element_text(vjust = -0.5, size = 14)) +
  theme(axis.title.y = element_text(size = 14, angle = 90, vjust = -0.25)) +
  theme(plot.margin = unit(c(1, 1, 2, 2), "lines"))
- Distribution of score for a given term
review.pos <- subset(review.scores, review.scores$score >= 2)
review.neg <- subset(review.scores, review.scores$score <= -2)

claim <- subset(review.scores, regexpr("claim", review.scores$text) > 0)

ggplot(claim, aes(x = score)) +
  geom_histogram(binwidth = 1) +
  ggtitle("Sentiment score for the token 'claim'") +
  xlab("Score") +
  ylab("Frequency") +
  theme_bw() +
  theme(axis.title.x = element_text(vjust = -0.5, size = 14)) +
  theme(axis.title.y = element_text(size = 14, angle = 90, vjust = -0.25)) +
  theme(plot.margin = unit(c(1, 1, 2, 2), "lines"))
- Distribution of emotion
# plot distribution of emotions
ggplot(sent_df, aes(x = emotion)) +
  geom_bar(aes(y = ..count.., fill = emotion)) +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "emotion categories", y = "number of feedback",
       title = "Sentiment Analysis of Feedback about claim (classification by emotion)") +
  theme(plot.title = element_text(size = 12))
- Distribution of polarity
# plot distribution of polarity
ggplot(sent_df, aes(x = polarity)) +
  geom_bar(aes(y = ..count.., fill = polarity)) +
  scale_fill_brewer(palette = "RdGy") +
  labs(x = "polarity categories", y = "number of feedback",
       title = "Sentiment Analysis of Feedback about claim (classification by polarity)") +
  theme(plot.title = element_text(size = 12))
- Text by emotion
# separating text by emotion
emos = levels(factor(sent_df$emotion))
nemo = length(emos)
emo.docs = rep("", nemo)
for (i in 1:nemo) {
  tmp = df$Review[emotion == emos[i]]
  emo.docs[i] = paste(tmp, collapse = " ")
}

# remove stopwords
emo.docs = removeWords(emo.docs, stopwords("english"))

# create corpus
corpus = Corpus(VectorSource(emo.docs))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = emos

# comparison word cloud
comparison.cloud(tdm, colors = brewer.pal(nemo, "Dark2"),
                 scale = c(3, .5), random.order = FALSE, title.size = 1.5)
The next post will cover sentiment analysis in R + Hadoop.
Reference:
The above write up is based on the tutorials from following links:
Clothing Sales Prediction – Mini DataHack
Analytics Vidhya organized a weekend mini data hackathon for Clothing Sales Prediction. The hackathon started at 20:00 (UTC + 5:30 ) on 28th May, 2016 and closed at 23:00 on 28th May, 2016 (UTC + 5:30)
This is my second hack on this forum; earlier I had participated in The Seer's Accuracy hackathon and finished 54th on the public leaderboard. I was hoping this one would be better, but due to time constraints I could not improve on that.
Unfortunately, my participation was delayed by an hour, so I had only two hours to solve the problem.
Problem Statement:
SimpleBuy is a clothing company which runs operations in a brick-and-mortar fashion. Be it parent, child, man or woman, they have a wide range of products catering to the needs of every individual. They aim to become a one-stop destination for all clothing desires.
Their combination of offline and online channels is doing quite well. Their stock now runs out faster than they can replenish it, and customers are no longer skeptical about their quality. Their offline stores help customers physically check clothes before buying them, especially the expensive ones. In addition, their delivery channels are known to achieve six-sigma efficiency.
However, SimpleBuy can only provide this experience if they manage their inventory well. Hence, they need to forecast sales ahead of time, and this is where you will help them today. SimpleBuy has provided you with their sales data for the last 2 years, and they want you to predict the sales for the next 12 months.
Data:
The train data had only two columns i.e., ‘Date’ and ‘Number_SKU_Sold’
Train Data: 2007 and 2008 (Daily Sales, 587 records)
Test Data: 2009 (only contained date column, 365 records)
Model:
As this is time-series data, I felt this was the right opportunity to try my hand at the "forecast" R package. Referring to Dataiku's time-series tutorial, I tried three models from the package.
Model 1: Exponential smoothing state space (ETS)
Model 2: Auto ARIMA
The auto.arima() function automatically searches for the best model and optimizes the parameters.
Model 3: TBATS
TBATS (Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components) is designed for use when there are multiple cyclic patterns e.g. daily, weekly and yearly patterns in a single time series.
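To make the comparison concrete, here is a minimal sketch of how the three models can be fitted and compared; the column name Number_SKU_Sold comes from the problem data, while the data frame name train and the yearly frequency are my assumptions:

library(forecast)

# Treat the data as a daily series with yearly seasonality (an assumption;
# 'train' is assumed to hold the Date and Number_SKU_Sold columns)
sales.ts <- ts(train$Number_SKU_Sold, frequency = 365)

fit.ets   <- ets(sales.ts)        # Model 1: exponential smoothing state space
fit.arima <- auto.arima(sales.ts) # Model 2: automatic ARIMA search
fit.tbats <- tbats(sales.ts)      # Model 3: TBATS

# Compare the three fits on AIC (lower is better)
c(ETS = fit.ets$aic, ARIMA = fit.arima$aic, TBATS = fit.tbats$AIC)

# Forecast the next 12 months (365 daily values) with the chosen model
fc <- forecast(fit.tbats, h = 365)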
Comparing the three models on AIC, TBATS seems to perform slightly better than ETS/ARIMA.
Note that the model with the smallest AIC is the best-fitting model. However, the submission performed poorly on the public leaderboard.
So I quickly moved to Random Forest, as I was more comfortable with it and it gives better results most of the time. I extracted features from the date, such as year, month, day of month and day of the year, and added two more features to weight days (an idea borrowed from a Kaggle Walmart sales prediction solution).
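A rough sketch of that feature engineering and model is given below; the two day-weighting features are left out, and the train / test data frame names are my assumptions:

library(randomForest)

# Date-derived features (assuming 'train' holds the Date and Number_SKU_Sold columns)
train$Date  <- as.Date(train$Date)
train$year  <- as.numeric(format(train$Date, "%Y"))
train$month <- as.numeric(format(train$Date, "%m"))
train$day   <- as.numeric(format(train$Date, "%d"))  # day of month
train$yday  <- as.numeric(format(train$Date, "%j"))  # day of year

rf.fit <- randomForest(Number_SKU_Sold ~ year + month + day + yday,
                       data = train, ntree = 500)

# The same features are built from the test dates before predicting
# pred <- predict(rf.fit, newdata = test)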
Github:
Results:
This model scored 21046427.5142 on the public LB and ranked 107th (view the public leaderboard).
Conclusion:
I clearly missed adding a few possible key features (day of the week, seasonality, holidays, etc.) which could have improved the score. However, given that I had only two hours to solve the problem, I am glad I was able to complete a submission.
Out of personal interest, I will definitely come back to this problem to see how the score can be improved.
It was a very interesting problem and thanks to the Analytics Vidhya organizers.
R-Hadoop Integration on Ubuntu
Contents
- About the Manual
- Pre-requisites
- Install R Base on Hadoop
- Install R Studio on Hadoop
- Install RHadoop packages
RHadoop is a collection of four R packages that allow users to manage and analyze data with Hadoop.
- plyrmr – higher-level, plyr-like data processing for structured data, powered by rmr
- rmr – functions providing Hadoop MapReduce functionality in R
- rhdfs – functions providing file management of the HDFS from within R
- rhbase – functions providing database management for the HBase distributed database from within R
This manual covers R and Hadoop 2.4.0 integration on Ubuntu 14.04.
Pre-requisites:
We assume that the user has the following two components up and running before starting the R and Hadoop integration:
– Ubuntu 14.04
– Hadoop 2.x +
Read my blog post here to learn more about setting up a single-node Hadoop cluster.
Pre-requisite:
Once the Hadoop installation is done, make sure that all the processes are running.
Run the command jps in your terminal; the result should look similar to the screenshot below:
R installation
Step 1: Open the Ubuntu Software Center.
Step 2: Open the Ubuntu Software Center in full-screen mode (if the window is too small, you cannot see the search option), search for R-base and click on the first result. Click on Install.
Step 3: Once the installation is done, open your terminal. Type the command R and the R console will open.
You can perform any operation in this R console. For example, to plot a graph of some values:
plot(seq(1,1000,2.3))
You can see the graph produced by this plot function in the screenshot below:
Step 4:
If you want to exit the R console, run the command
q()
If you want to save the workspace, type y; otherwise type n.
c cancels the quit and keeps you in the current workspace.
Step 7: Now we install RStudio on Ubuntu.
- Open your browser and download RStudio. I downloaded RStudio 0.98.953 – Debian 6+/Ubuntu 10.04+ (32-bit); this is actually the file rstudio-0.98.953-amd32.deb.
Go to the Downloads folder, right-click on the downloaded file, open it with the Ubuntu Software Center and click on Install.
Go to the terminal and type R; you now have both the R console and RStudio available.
Install RHadoop packages
Step 1: Install Thrift
sudo apt-get install libboost-dev libboost-test-dev libboost-program-options-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev
$ cd /tmp
If the command below does not work, please download the Thrift tarball manually.
$ sudo wget -O - https://dist.apache.org/repos/dist/release/thrift/0.9.0/thrift-0.9.0.tar.gz | tar zx
$ cd thrift-0.9.0/
$ ./configure
$ make
$ sudo make install
$ thrift --help
Step 2: Install supporting R packages:
install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "dplyr", "R.methodsS3", "caTools", "Hmisc"), lib="/usr/local/R/library")
Step 3: Download below packages from https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads
rmr2
rhdfs
rhbase
plyrmr
In the R terminal, run the commands below to install the packages; replace <path> with your downloaded file location.
sudo gedit /etc/R/Renviron
Install RHadoop (rhdfs, rhbase, rmr2 and plyrmr)
Install relevant packages:
install.packages("rhdfs_1.0.8.tar.gz", repos=NULL, type="source")
install.packages("rmr2_3.1.2.tar.gz", repos=NULL, type="source")
install.packages("plyrmr_0.3.0.tar.gz", repos=NULL, type="source")
install.packages("rhbase_1.2.1.tar.gz", repos=NULL, type="source")
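Once the packages are installed, a quick sanity check is to push a few numbers through MapReduce. This is a minimal sketch, assuming HADOOP_CMD and HADOOP_STREAMING are already set as environment variables and Hadoop is running:

library(rhdfs)
library(rmr2)

hdfs.init()                  # connect to HDFS

small.ints <- to.dfs(1:10)   # write a small vector to HDFS
squares <- mapreduce(input = small.ints,
                     map = function(k, v) keyval(v, v^2))
from.dfs(squares)            # should return each number paired with its square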
References
You'll find YouTube videos and step-by-step instructions on installing R on Hadoop at the links below.
RDataMining: R on Hadoop – step-by-step instructions
URL: http://www.rdatamining.com/tutorials/rhadoop
Youtube: Word count map reduce program in R
URL: http://www.youtube.com/watch?v=hSrW0Iwghtw
Revolution Analytics: RHadoop packages
URL: https://github.com/RevolutionAnalytics/RHadoop/wiki
Install R-base Guide
URL: http://www.sysads.co.uk/2014/06/install-r-base-3-1-0-ubuntu-14-04/
In the next blog post I'll show a sample sentiment analysis using MapReduce in R with the rmr package.
Setting up a Single Node Hadoop Cluster
Step By Step Hadoop Installation Guide
Setting up Single Node Hadoop Cluster on Windows over VM
Contents
- Objective
- Current Environments
- Download VM and Ubuntu 14.04
- Install Ubuntu on VM
- Install Hadoop 2.4 on Ubuntu 14.04
Objective: This document will help you set up Hadoop 2.4.0 on Ubuntu 14.04 in a virtual machine on your Windows operating system.
Current environment includes:
- Windows XP/7 – 32 bit
- VM Player (Non-commercial use only)
- Ubuntu 14.04 32 bit
- Java 1.7
- Hadoop 2.4.0
Download and Install VM Player from the link https://www.vmware.com/tryvmware/?p=player
Download Ubuntu 14.04 iso file from the link: http://www.ubuntu.com/download/desktop
Download the list of Hadoop commands for reference from the following link: http://hadoop.apache.org/docs/r1.0.4/commands_manual.pdf (don't be afraid of this file; it is just a reference to help you learn more about the important Hadoop commands)
Install Ubuntu in VM:
- Click on Create a New Virtual Machine
- Browse and select the Ubuntu iso file.
- Personalize Linux by providing appropriate details.
- Follow through the wizard steps to finish installation.
Install Hadoop 2.4 on Ubuntu 14.04
Step 1: Open Terminal
Step 2: Download Hadoop tar file by running the below command in terminal
wget http://mirror.fibergrid.in/apache/hadoop/common/stable/hadoop-2.7.2.tar.gz
Step 3: Unzip tar file through command: tar -xzf hadoop-2.7.2.tar.gz
Step 4: Let’s move everything into a more appropriate directory:
sudo mv hadoop-2.7.2/ /usr/local
cd /usr/local
sudo ln -s hadoop-2.7.2/ hadoop
Let's create a directory to store Hadoop data for later use:
mkdir /usr/local/hadoop/data
Step 5: Set up the user and permissions (replace manohar with your user id)
sudo addgroup hadoop
sudo adduser --ingroup hadoop manohar
sudo chown -R manohar:hadoop /usr/local/hadoop/
Step 6: Install ssh:
sudo apt-get install ssh
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Step 7: Install Java:
sudo apt-get update
sudo apt-get install default-jdk
sudo gedit ~/.bashrc
This will open the .bashrc file in a text editor. Go to the end of the file and paste/type the following content in it:
#HADOOP VARIABLES START
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
export HADOOP_PREFIX=$HADOOP_INSTALL
export HADOOP_CMD=$HADOOP_INSTALL/bin/hadoop
export HADOOP_STREAMING=$HADOOP_INSTALL/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar
#HADOOP VARIABLES END
After saving and closing the .bashrc file, execute the following command so that your system recognizes the newly created environment variables:
source ~/.bashrc
Putting the above content in the .bashrc file ensures that these variables are always available when your VPS starts up.
Step 8:
Unfortunately, Hadoop and IPv6 don't play nicely together, so we'll have to disable IPv6. To do this, open /etc/sysctl.conf and add the following lines at the end:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Type the command: sudo gedit /etc/sysctl.conf
Step 9: Editing /usr/local/hadoop/etc/hadoop/hadoop-env.sh:
sudo gedit /usr/local/hadoop/etc/hadoop/hadoop-env.sh
In this file, locate the line that exports the JAVA_HOME variable. Change this line to the following:
Change export JAVA_HOME=${JAVA_HOME} to match the JAVA_HOME you set in your .bashrc (for us JAVA_HOME=/usr).
Also, change this line:
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
to:
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Djava.library.path=$HADOOP_PREFIX/lib"
And finally, add the following line:
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
Step 10: Editing /usr/local/hadoop/etc/hadoop/core-site.xml:
sudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml
In this file, enter the following content in between the <configuration></configuration> tag:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/data</value>
</property>
Step 11: Editing /usr/local/hadoop/etc/hadoop/yarn-site.xml:
sudo gedit /usr/local/hadoop/etc/hadoop/yarn-site.xml
In this file, enter the following content in between the <configuration></configuration> tag:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>localhost:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>localhost:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8050</value>
</property>
The yarn-site.xml file should look something like this:
Step 12: Creating and Editing /usr/local/hadoop/etc/hadoop/mapred-site.xml:
By default, the /usr/local/hadoop/etc/hadoop/ folder contains the /usr/local/hadoop/etc/hadoop/mapred-site.xml.template file which has to be renamed/copied with the name mapred-site.xml. This file is used to specify which framework is being used for MapReduce.
This can be done using the following command:
cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
Once this is done, open the newly created file with following command:
sudo gedit /usr/local/hadoop/etc/hadoop/mapred-site.xml
In this file, enter the following content in between the <configuration></configuration> tag:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
The mapred-site.xml file should look something like this:
Step 13: Editing /usr/local/hadoop/etc/hadoop/hdfs-site.xml:
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml has to be configured for each host in the cluster that is being used. It is used to specify the directories which will be used as the namenode and the datanode on that host.
Before editing this file, we need to create two directories which will contain the namenode and the datanode for this Hadoop installation. This can be done using the following commands:
sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
Open the /usr/local/hadoop/etc/hadoop/hdfs-site.xml file with following command:
sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml
In this file, enter the following content in between the <configuration></configuration> tag:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
The hdfs-site.xml file should look something like this:
Step 14: Format the New Hadoop Filesystem:
After completing all the configuration outlined in the above steps, the Hadoop filesystem needs to be formatted so that it can start being used. This is done by executing the following command:
hdfs namenode -format
Note: This only needs to be done once before you start using Hadoop. If this command is executed again after Hadoop has been used, it’ll destroy all the data on the Hadoop filesystem.
Step 15: Start Hadoop
All that remains to be done is starting the newly installed single node cluster:
start-dfs.sh
While executing this command, you’ll be prompted twice with a message similar to the following:
Are you sure you want to continue connecting (yes/no)?
Type in yes for both these prompts and press the enter key. Once this is done, execute the following command:
start-yarn.sh
Executing the above two commands will get Hadoop up and running. You can verify this by typing in the following command:
jps
Executing this command should show you something similar to the following:
If you see a result similar to the one depicted in the screenshot above, you now have a functional instance of Hadoop running on your VPS.
VBA Coding Best Practice
DEFINITION: Best practices are the agreed general set of guidelines that are believed to be more effective at delivering MS Excel-based tools which are:
- User friendly
- Easy to maintain
- More reliable and robust
These are just general guidelines; a professional developer will always assess the options and make the appropriate choice for their specific situation. These suggestions are specific to Excel VBA.
BEST PRACTICE PRINCIPLES
1. Easy to read and follow what's happening
2. Efficient code
3. Flexible and easy to change
4. Robust and deals with errors
5. Uses existing Excel functionality where possible
CONTENTS: The content is divided into the sections given below.
Scope – Three levels of scope exist for each variable in VBA: Public, Private, and Local
Scope | Meaning | Example |
--- | --- | --- |
<none> | Local variable, procedure-level lifetime, declared with “Dim” | intOrderValue |
st | Local variable, object lifetime, declared with “Static” | stLastInvoiceID |
m | Private (module) variable, object lifetime, declared with “Private” | mcurRunningSum |
g | Public (global) variable, object lifetime, declared with “Public” | glngGrandTotal |
Var_Type (for variables)
Var_Type | Object Type | Example |
--- | --- | --- |
bln or b | Boolean | blnPaid or bPaid |
byt | Byte | bytPrice |
int or i | Integer | intStoreID |
lng | Long | lngSales |
obj | Object | objArc |
dbl | Double | dblSales |
str or s | String | strName or sName |
var or v | Variant | varColor or vColor |
dte | Date | dteBirthDate |
dec | Decimal | decLongitude |
cht | Chart | chtSales |
chk | Check box | chkReadOnly |
cmd | Command button | cmdCancel |
lbl | Label | lblHelpMessage |
opt | Option button | optFrench |
SUFFIXES – Suffixes modify the base name of an object, indicating additional information about a variable. You'll likely create your own suffixes specific to your development work. The table below lists some generic, commonly used VBA suffixes.
Suffix | Meaning | Example |
--- | --- | --- |
Min | The absolute first element in an array or other kind of list | iastrNamesMin |
First | The first element to be used in an array or list during the current operation | iaintFontSizesFirst |
Last | The last element to be used in an array or list during the current operation | igphsGlyphCollectionLast |
Max | The absolute last element in an array or other kind of list | iastrNamesMax |