Prediction on large-scale data using extreme gradient boosting
Abstract
This paper presents a use case of data mining for retail demand and sales forecasting. In
particular, the Extreme Gradient Boosting (XGBoost) algorithm is used to build a prediction
model that accurately estimates probable sales for the retail outlets of a major European
pharmacy retailing company. The forecast of potential sales is based on a mixture of temporal and
economic features, including prior sales data, store promotions, retail competitors, school and
state holidays, the location and accessibility of the store, and the time of year. The model
building process was guided by common sense reasoning and by analytic knowledge discovered
during data analysis, from which definitive conclusions were drawn. The performance of the XGBoost
predictor was compared with that of more traditional regression algorithms such as Linear
Regression and Random Forest Regression. The findings reveal not only that the XGBoost algorithm
outperforms the traditional modeling approaches in prediction accuracy, but also that it
uncovers knowledge hidden in the data, which helps in building a more robust feature set
and strengthening the sales prediction model.