I was fortunate to have had the chance to work on some exciting text mining projects during my consulting role in the insurance domain. I summarized my learnings and experience in my first LinkedIn post; you can find it here!
Analytics Vidhya organized a weekend mini data hackathon for Clothing Sales Prediction. The hackathon ran from 20:00 to 23:00 (UTC+5:30) on 28th May 2016.
It's my second hackathon on this forum; earlier I participated in The Seer's Accuracy hackathon and ended up in 54th place on the public leaderboard. I was hoping this one would go better, but due to time constraints it did not.
Unfortunately, my participation was delayed by an hour, so I had only two hours to solve the problem.
SimpleBuy is a clothing company which runs operations in brick-and-mortar fashion. Be it parent or child, man or woman, they have a wide range of products catering to the needs of every individual. They aim to become a one-stop destination for all clothing desires.
Their combination of offline and online channels is doing quite well. Their stock now runs out even faster than they can replenish it. Customers are no longer skeptical about their quality. Their offline stores let customers physically check clothes before buying them, especially the expensive ones. In addition, their delivery channels are known to achieve six sigma efficiency.
However, SimpleBuy can only provide this experience if they manage their inventory well. Hence, they need to forecast sales ahead of time, and this is where you will help them today. SimpleBuy has provided you with their sales data for the last 2 years, and they want you to predict the sales for the next 12 months.
The train data had only two columns: ‘Date’ and ‘Number_SKU_Sold’
Train Data: 2007 and 2008 (Daily Sales, 587 records)
Test Data: 2009 (only contained date column, 365 records)
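Given that schema, loading the data is straightforward. A minimal sketch in Python with pandas (the post's actual work was done in R, and the file contents below are made-up stand-ins for the hackathon CSVs):

```python
import io
import pandas as pd

# Inline stand-ins for the hackathon files; column names come from the post,
# the values here are invented for illustration.
train_csv = io.StringIO(
    "Date,Number_SKU_Sold\n"
    "2007-01-01,120\n"
    "2007-01-02,95\n"
)
test_csv = io.StringIO("Date\n2009-01-01\n2009-01-02\n")

train = pd.read_csv(train_csv, parse_dates=["Date"])  # daily sales history
test = pd.read_csv(test_csv, parse_dates=["Date"])    # dates to predict for

print(train.columns.tolist())  # ['Date', 'Number_SKU_Sold']
```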
As this is time-series data, I felt this was the right opportunity to try my hand at the “forecast” R package. Referring to Dataiku’s time-series tutorial, I tried 3 models from the package.
Model 1: ETS (Exponential Smoothing State Space)
Model 2: Auto ARIMA
The auto.arima() function automatically searches over ARIMA orders and selects the best-fitting model according to an information criterion (AICc by default).
Model 3: TBATS
TBATS (Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components) is designed for use when there are multiple cyclic patterns e.g. daily, weekly and yearly patterns in a single time series.
On comparing the 3 models by AIC, TBATS seemed to perform slightly better than ETS/ARIMA.
Note that the model with the smallest AIC is the best-fitting model. However, the submission performed poorly on the public leaderboard.
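For reference, AIC trades goodness of fit against model complexity: AIC = 2k - 2 ln(L̂), where k is the number of estimated parameters and L̂ is the maximized likelihood. A tiny illustration:

```python
import math

def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2*ln(L-hat)."""
    return 2 * n_params - 2 * log_likelihood

# Same parameter count, higher likelihood -> lower (better) AIC
print(aic(-100.0, 3))  # 206.0
print(aic(-90.0, 3))   # 186.0
```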
So I quickly moved to Random Forest, as I was more comfortable with it and it gives good results most of the time. I extracted features from the date, such as year, month, day, day of month and day of the year, and added 2 more features to weight days (an idea borrowed from a Kaggle Walmart sales prediction solution).
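A minimal sketch of that date-based feature extraction, using only the Python standard library (feature names are illustrative; the two day-weighting features from the Walmart solution are omitted, and day_of_week is shown for completeness even though the post notes it was missed):

```python
from datetime import date

def date_features(d: date) -> dict:
    """Derive simple calendar features from a date for a tree-based model."""
    return {
        "year": d.year,
        "month": d.month,
        "day_of_month": d.day,
        "day_of_year": d.timetuple().tm_yday,
        "day_of_week": d.weekday(),  # Monday = 0; a feature the post missed
    }

print(date_features(date(2007, 2, 1)))
# {'year': 2007, 'month': 2, 'day_of_month': 1, 'day_of_year': 32, 'day_of_week': 3}
```

Each training row would then carry these features alongside Number_SKU_Sold as the target for the Random Forest.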
This model scored 21046427.5142 on the public LB and ranked 107th; view the public leaderboard.
I clearly missed adding a few key features (day of the week, seasonality, holidays, etc.) which could have improved the score. However, given that I had only two hours to solve the problem, I'm glad I was able to complete a submission.
Out of personal interest, I will definitely come back to this problem to see how the score can be improved.
It was a very interesting problem and thanks to the Analytics Vidhya organizers.