Clothing Sales Prediction – Mini DataHack


Analytics Vidhya organized a weekend mini data hackathon on Clothing Sales Prediction. The hackathon started at 20:00 (UTC+5:30) on 28th May 2016 and closed at 23:00 (UTC+5:30) the same day.

This is my second hack on this forum; earlier I participated in The Seer’s Accuracy hackathon and finished in 54th place on the public leaderboard. I was hoping this one would go better, but due to time constraints it did not.

Unfortunately, my participation was delayed by an hour, so I had only two hours to solve the problem.

Problem Statement:

SimpleBuy is a clothing company which runs brick-and-mortar operations. Be it parent, child, man, or woman, they have a wide range of products catering to the needs of every individual. They aim to become a one-stop destination for all clothing desires.

Their idea of offline and online channels is doing quite well. Their stock now runs out faster than they can replenish it. Customers are no longer skeptical about their quality. Their offline stores help customers physically check clothes before buying them, especially the expensive ones. In addition, their delivery channels are known to achieve six sigma efficiency.

However, SimpleBuy can only provide this experience if they manage their inventory well. Hence, they need to forecast sales ahead of time, and this is where you will help them today. SimpleBuy has provided you with their sales data for the last 2 years, and they want you to predict sales for the next 12 months.


The train data had only two columns: ‘Date’ and ‘Number_SKU_Sold’.

Train Data: 2007 and 2008 (Daily Sales,  587 records)

Test Data: 2009 (only contained date column, 365 records)


As this is time-series data, I felt this was the right opportunity to try my hand at the “forecast” R package. Referring to Dataiku’s time-series tutorial, I tried 3 models from the package.

Model 1: ETS (exponential smoothing state space model)

Model 2: Auto ARIMA
The auto.arima() function automatically searches for the best model and optimizes the parameters.

Model 3: TBATS

TBATS (Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components) is designed for use when there are multiple cyclic patterns e.g. daily, weekly and yearly patterns in a single time series.

On comparing the 3 models on AIC, TBATS seemed to perform slightly better than ETS/ARIMA.


Note that the model with the smallest AIC is the best-fitting model. However, the submission performed poorly on the public leaderboard.

So I quickly moved to Random Forest, as I was more comfortable with it and it gives good results most of the time. I extracted features from the date, such as year, month, day of month, and day of year, and added 2 more features to weight days (an idea borrowed from a Kaggle Walmart sales prediction solution).
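A minimal sketch of this kind of date-feature extraction (in Python here for illustration; the hackathon code itself was different, and the is_weekend feature is a hypothetical stand-in for the Walmart-style day weights):

```python
from datetime import datetime

def date_features(date_str):
    """Extract calendar features from a 'YYYY-MM-DD' date string
    for a tree-based model such as Random Forest."""
    d = datetime.strptime(date_str, "%Y-%m-%d")
    return {
        "year": d.year,
        "month": d.month,
        "day_of_month": d.day,
        "day_of_year": d.timetuple().tm_yday,
        # Day of week -- one of the features I missed in the rush:
        "day_of_week": d.weekday(),  # Monday=0 ... Sunday=6
        # Hypothetical day-weighting feature (Walmart-solution style):
        "is_weekend": int(d.weekday() >= 5),
    }

print(date_features("2008-12-27"))
```

Each row of the train/test data can be expanded this way before being fed to the model.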


Here is my GitHub repository.


This model scored 21046427.5142 on the public LB and ranked 107th; view the public leaderboard.


I clearly missed adding a few potentially key features (day of the week, seasonality, holidays, etc.) which could have improved the score. However, given that I had only two hours to solve the problem, I am glad that I was able to complete a submission.

Out of personal interest, I will definitely come back to this problem to see how the score can be improved.

It was a very interesting problem and thanks to the Analytics Vidhya organizers.




R-Hadoop Integration on Ubuntu


  • About the Manual
  • Pre-requisites
  • Install R Base on Hadoop
  • Install R Studio on Hadoop
  • Install RHadoop packages

RHadoop is a collection of four R packages that allow users to manage and analyze data with Hadoop.

  1. plyrmr – higher-level plyr-like data processing for structured data, powered by rmr
  2. rmr – functions providing Hadoop MapReduce functionality in R
  3. rhdfs – functions providing file management of the HDFS from within R
  4. rhbase – functions providing database management for the HBase distributed database from within R

This manual covers R and Hadoop 2.4.0 integration on Ubuntu 14.04.


We assume that the user has the following two up and running before starting the R and Hadoop integration:

Ubuntu 14.04

Hadoop 2.x +

Read my blog post on setting-up-a-single-node-hadoop-cluster to learn more.

Pre-requisites:

Once Hadoop installation is done, make sure that all the processes are running:

Run the command jps in your terminal; the result should look similar to the screenshot below:


R installation

Step 1: Click on the Ubuntu-software center.


Step 2: Open Ubuntu Software Center in full-screen mode (if the window is small, the search option may not be visible). Search for r-base, click on the first result, and click Install.


Step 3: Once the installation is done, open your terminal. Type the command R and the R console will open.


You can perform any operation on this R console, for example, to plot a graph of some variables:

The graph produced by the plot function can be seen in the screenshot below:


Step 4:

To exit the R console, give the command:

q()

If you want to save the workspace, type y; otherwise type n. Type c to continue in the same workspace.

Step 5: Now we install RStudio on Ubuntu.

  • Open your browser and download RStudio. I downloaded RStudio 0.98.953 – Debian 6+/Ubuntu 10.04+ (32-bit); the downloaded file is rstudio-0.98.953-amd32.deb.


Go to the Downloads folder, right-click on the downloaded file, open it with Ubuntu Software Center, and click Install.



Go to the terminal and type R to open the R console, or launch RStudio.


Install RHadoop packages

Step 1: Install Thrift

sudo apt-get install libboost-dev libboost-test-dev libboost-program-options-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev

$ cd /tmp

If the command below does not work, please manually download the Thrift tarball:

$ sudo wget | tar zx

$ cd thrift-0.9.0/

$ ./configure

$ make

$ sudo make install

$ thrift --help


Step 2: Install supporting R packages:

install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "dplyr", "R.methodsS3", "caTools", "Hmisc"), lib="/usr/local/R/library")

Step 3: Download the following packages (rhdfs, rmr2, plyrmr and rhbase) from the RHadoop repository.

In the R terminal, run the commands below to install the packages, replacing <path> with your downloaded file location:

sudo gedit /etc/R/Renviron
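The Renviron edit is for making Hadoop visible to R: rmr2 expects the HADOOP_CMD and HADOOP_STREAMING environment variables to be set. As a sketch (the paths follow the Hadoop layout used in this manual, and the streaming-jar version is a placeholder to be replaced with yours), add lines like these to /etc/R/Renviron:

```
HADOOP_CMD=/usr/local/hadoop/bin/hadoop
HADOOP_STREAMING=/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-<version>.jar
```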

Install RHadoop (rhdfs, rhbase, rmr2 and plyrmr)

Install relevant packages:

install.packages("rhdfs_1.0.8.tar.gz", repos=NULL, type="source")

install.packages("rmr2_3.1.2.tar.gz", repos=NULL, type="source")

install.packages("plyrmr_0.3.0.tar.gz", repos=NULL, type="source")

install.packages("rhbase_1.2.1.tar.gz", repos=NULL, type="source")


You’ll find a YouTube video and step-by-step instructions on installing R on Hadoop at the following links.


RDataMining: R on Hadoop – step-by-step instructions


Youtube: Word count map reduce program in R


Revolution Analytics: RHadoop packages


Install R-base Guide



In the next blog post I’ll show a sample sentiment analysis using MapReduce in R with the rmr package.


Setting up a Single Node Hadoop Cluster

Step By Step Hadoop Installation Guide

Setting up Single Node Hadoop Cluster on Windows over VM


  • Objective
  • Current Environments
  • Download VM and Ubuntu 14.04
  • Install Ubuntu on VM
  • Install Hadoop 2.4 on Ubuntu 14.04


Objective: This document will help you set up Hadoop 2.4.0 on Ubuntu 14.04 in a virtual machine on a Windows operating system.

Current environment includes:

  • Windows XP/7 – 32 bit
  • VM Player (Non-commercial use only)
  • Ubuntu 14.04 32 bit
  • Java 1.7
  • Hadoop 2.4.0

Download and Install VM Player from the link

Download Ubuntu 14.04 iso file from the link:

Download the list of Hadoop commands for reference from the following link. (Don’t be afraid of this file; it is just for your reference, to help you learn more about important Hadoop commands.)

Install Ubuntu in VM:

  • Click on Create a New Virtual Machine
  • Browse and select the Ubuntu iso file.
  • Personalize Linux by providing appropriate details.
  • Follow through the wizard steps to finish installation.




Install Hadoop 2.4 on Ubuntu 14.04

Step 1: Open Terminal


Step 2: Download Hadoop tar file by running the below command in terminal


Step 3: Unzip the tar file with the command: tar -xzf hadoop-2.7.2.tar.gz

Step 4: Let’s move everything into a more appropriate directory:

sudo mv hadoop-2.7.2/ /usr/local

cd /usr/local

sudo ln -s hadoop-2.7.2/ hadoop

Let’s create a directory to store Hadoop data for later use:

mkdir /usr/local/hadoop/data


Step 5: Set up user and permissions (replace manohar with your user id)

sudo addgroup hadoop

sudo adduser --ingroup hadoop manohar

sudo chown -R manohar:hadoop /usr/local/hadoop/

Step 6: Install ssh:

sudo apt-get install ssh

ssh-keygen -t rsa -P ""

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Step 7: Install Java:

sudo apt-get update

sudo apt-get install default-jdk

sudo gedit ~/.bashrc

This will open the .bashrc file in a text editor. Go to the end of the file and paste/type the following content in it:


export HADOOP_HOME=/usr/local/hadoop

export JAVA_HOME=/usr

export HADOOP_INSTALL=/usr/local/hadoop

export PATH=$PATH:$HADOOP_INSTALL/bin

export PATH=$PATH:$HADOOP_INSTALL/sbin

export HADOOP_MAPRED_HOME=$HADOOP_INSTALL

export HADOOP_COMMON_HOME=$HADOOP_INSTALL

export HADOOP_HDFS_HOME=$HADOOP_INSTALL

export YARN_HOME=$HADOOP_INSTALL

export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"


export HADOOP_CMD=$HADOOP_INSTALL/bin/hadoop

export HADOOP_STREAMING=$HADOOP_INSTALL/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar



After saving and closing the .bashrc file, execute the following command so that your system recognizes the newly created environment variables:

source ~/.bashrc

Putting the above content in the .bashrc file ensures that these variables are always available when your VM starts up.

Step 8:

Unfortunately, Hadoop and IPv6 don’t play nicely together, so we’ll have to disable IPv6. To do this, open /etc/sysctl.conf with the command: sudo gedit /etc/sysctl.conf and add the following lines to the end:

net.ipv6.conf.all.disable_ipv6 = 1

net.ipv6.conf.default.disable_ipv6 = 1

net.ipv6.conf.lo.disable_ipv6 = 1


Step 9: Editing /usr/local/hadoop/etc/hadoop/hadoop-env.sh:

sudo gedit /usr/local/hadoop/etc/hadoop/hadoop-env.sh

In this file, locate the line that exports the JAVA_HOME variable. Change export JAVA_HOME=${JAVA_HOME} to match the JAVA_HOME you set in your .bashrc (for us, JAVA_HOME=/usr).

Also, change this line:



export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_PREFIX/lib"

And finally, add the following line:


Step 10: Editing /usr/local/hadoop/etc/hadoop/core-site.xml:

sudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml

In this file, enter the following content in between the <configuration></configuration> tags (standard values for a single-node setup):

<property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:9000</value>
</property>

Step 11: Editing /usr/local/hadoop/etc/hadoop/yarn-site.xml:

sudo gedit /usr/local/hadoop/etc/hadoop/yarn-site.xml

In this file, enter the following content in between the <configuration></configuration> tags (standard values for a single-node setup):

<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>

<property>
   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
The yarn-site.xml file should look something like this:


Step 12: Creating and Editing /usr/local/hadoop/etc/hadoop/mapred-site.xml:

By default, the /usr/local/hadoop/etc/hadoop/ folder contains a mapred-site.xml.template file, which has to be renamed/copied to mapred-site.xml. This file is used to specify which framework is being used for MapReduce.

This can be done using the following command:

cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

Once this is done, open the newly created file with following command:

sudo gedit /usr/local/hadoop/etc/hadoop/mapred-site.xml

In this file, enter the following content in between the <configuration></configuration> tags (this tells Hadoop to use YARN for MapReduce):

<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>
The mapred-site.xml file should look something like this:


Step 13: Editing /usr/local/hadoop/etc/hadoop/hdfs-site.xml:

 The /usr/local/hadoop/etc/hadoop/hdfs-site.xml has to be configured for each host in the cluster that is being used. It is used to specify the directories which will be used as the namenode and the datanode on that host.

Before editing this file, we need to create two directories which will contain the namenode and the datanode for this Hadoop installation. This can be done using the following commands:

sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode

sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode

Open the /usr/local/hadoop/etc/hadoop/hdfs-site.xml file with following command:

sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml

In this file, enter the following content in between the <configuration></configuration> tags (the replication factor is 1 for a single node, and the paths are the namenode/datanode directories created above):

<property>
   <name>dfs.replication</name>
   <value>1</value>
</property>

<property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>

<property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
The hdfs-site.xml file should look something like this:


Step 14: Format the New Hadoop Filesystem:

After completing all the configuration outlined in the above steps, the Hadoop filesystem needs to be formatted so that it can start being used. This is done by executing the following command:

hdfs namenode -format

Note: This only needs to be done once before you start using Hadoop. If this command is executed again after Hadoop has been used, it’ll destroy all the data on the Hadoop filesystem.

Step 15: Start Hadoop

All that remains to be done is starting the newly installed single node cluster:

start-dfs.sh

While executing this command, you’ll be prompted twice with a message similar to the following:

Are you sure you want to continue connecting (yes/no)?

Type in yes for both these prompts and press the enter key. Once this is done, execute the following command:

start-yarn.sh

Executing the above two commands will get Hadoop up and running. You can verify this by typing in the following command:

jps

Executing this command should show you something similar to the following:


If you can see a result similar to the one depicted in the screenshot above, you now have a functional instance of Hadoop running on your VM.


VBA Coding Best Practice

DEFINITION: Best practices are an agreed general set of guidelines believed to be more effective at delivering MS Excel-based tools which are:

  • User friendly

  • Easy to maintain

  • More reliable and robust

These are just general guidelines; a professional developer will always assess the options and make the appropriate choice in their specific situation. These suggestions are specific to Excel, VBA.


Good VBA code is:

  1. Easy to read and follow what’s happening

  2. Efficient

  3. Flexible and easy to change

  4. Robust and deals with errors

  5. Uses existing Excel functionality where possible.

CONTENTS: The document is divided into the sections given below.




Scope: Three levels of scope exist for each variable in VBA: Public, Private, and Local. A prefix indicates how the variable is declared:

Scope Meaning Example
<none> Local variable, procedure-level lifetime, declared with "Dim" intOrderValue
st Local variable, object lifetime, declared with "Static" stLastInvoiceID
m Private (module) variable, object lifetime, declared with "Private" mcurRunningSum
g Public (global) variable, object lifetime, declared with "Public" glngGrandTotal

Var_Type (for variables)

Var_Type Object Type Example
bln or b Boolean blnPaid or bPaid
byt Byte bytPrice
int or i Integer intStoreID
lng Long lngSales
obj Object objArc
dbl Double dblSales
str or s String strName or sName
var or v Variant varColor or vColor
dte Date dteBirthDate
dec Decimal decLongitude
cht Chart chtSales
chk Check box chkReadOnly
cmd Command button cmdCancel
lbl Label lblHelpMessage
opt Option button optFrench

SUFFIXES – Suffixes modify the base name of an object, indicating additional information about a variable. You’ll likely create your own suffixes specific to your development work. The table below lists some generic, commonly used VBA suffixes.

Suffix Object Type Example
Min The absolute first element in an array or other kind of list iastrNamesMin
First The first element to be used in an array or list during the current operation iaintFontSizesFirst
Last The last element to be used in an array or list during the current operation igphsGlyphCollectionLast
Max The absolute last element in an array or other kind of list iastrNamesMax



Triples – Deep Natural Language Processing

Problem: In text mining, extracting keywords (n-grams) alone can neither produce meaningful data nor discover “unknown” themes and trends.

Objective: The aim here is to extract dependency relations from sentences, i.e., extract sets of the form {subject, predicate[modifiers], object} out of syntactically parsed sentences, using the Stanford Parser and OpenNLP.


1) Get the syntactic relationship between each pair of words

2) Apply sentence segmentation to determine the sentence boundaries

3) The Stanford Parser is then applied to generate output in the form of dependency relations, which represent the syntactic relationships within each sentence

How is this different from n-grams?

Dependency relations allow the similarity comparison to be based on the syntactic relations between words, instead of having to match words in their exact order as in n-gram-based comparisons.


Sentence: “The flat tire was not changed by driver”

Stanford dependency relations: 

root(ROOT-0, changed-6)
det(tire-3, the-1)
amod(tire-3, flat-2)
nsubjpass(changed-6, tire-3)
auxpass(changed-6, was-4)
neg(changed-6, not-5)
prep(changed-6, by-7)
pobj(by-7, driver-8)

Refer to the Stanford typed dependencies manual for the full list and more info:

Triples output in the form (Subject : Predicate [modifier] : Object) :  

driver : changed [not] : tire

Extraction Logic: You can use the base logic below to build the functionality in your favorite language (R/Python/Java/etc.). Please note that this is only the base logic and needs enhancement.
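A minimal sketch of that base logic (in Python here; the function and variable names are my own, and it only handles the passive-voice pattern from the example above, i.e. root, nsubjpass, neg and prep/pobj relations):

```python
import re

def extract_triple(dependencies):
    """Build a 'subject : predicate [modifiers] : object' triple from
    Stanford-style dependency strings such as 'nsubjpass(changed-6, tire-3)'.

    Only covers the passive pattern from the example sentence; a real
    implementation needs to handle many more relation types."""
    rels = {}
    for dep in dependencies:
        m = re.match(r"(\w+)\((\S+)-\d+, (\S+)-\d+\)", dep)
        if m:
            rel, head, dependent = m.groups()
            rels.setdefault(rel, []).append((head, dependent))

    predicate = rels["root"][0][1]     # word the ROOT points to
    obj = rels["nsubjpass"][0][1]      # passive subject is the logical object
    subject = rels["pobj"][0][1]       # agent hangs off the 'by' preposition
    modifiers = [d for _, d in rels.get("neg", [])]

    pred = predicate + (" [" + ",".join(modifiers) + "]" if modifiers else "")
    return "{} : {} : {}".format(subject, pred, obj)

deps = [
    "root(ROOT-0, changed-6)",
    "det(tire-3, the-1)",
    "amod(tire-3, flat-2)",
    "nsubjpass(changed-6, tire-3)",
    "auxpass(changed-6, was-4)",
    "neg(changed-6, not-5)",
    "prep(changed-6, by-7)",
    "pobj(by-7, driver-8)",
]
print(extract_triple(deps))  # driver : changed [not] : tire
```

Running this on the dependency relations listed above reproduces the triple shown earlier.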



Limitations:

  • Sentence-level analysis is too structured
  • Usage of abbreviations and grammatical errors in sentences will mislead the analysis


Hope this article is useful!