Prediction of Purchase Volume Coffee Shops in Surabaya Using Catboost with Leave-One-Out Cross Validation

Authors

DOI:

https://doi.org/10.26555/jiteki.v11i1.30610

Keywords:

Catboost, LightGBM, Coffee Shops, Gradient Boosting

Abstract

Indonesia's coffee consumption grew from 265,000 tons in 2015 to 294,000 tons in 2020, averaging 2% annual growth, with a projected 368,000 tons by 2024. Coffee shops are one segment of this business, and they often struggle to attract customers quickly, risking low purchase volume within their first five years. In their first year, challenges include management, company size, service quality, and customer preferences. This study adopts a quantitative approach to develop a machine-learning-based purchase prediction application, together with strategies to increase purchase volume, for three coffee shops in Surabaya. It uses CatBoost, with LightGBM as a comparison, across multiple coffee shop locations. Leave-One-Out Cross-Validation (LOOCV) is used to address research limitations such as overfitting and bias while making the evaluation more reliable. The CatBoost model achieved an MAE of 0.91 and a MAPE of 15%, outperforming LightGBM's MAE of 1.13 and MAPE of 18%. These results establish CatBoost as the superior model for purchase prediction in this setting and confirm its effectiveness for the coffee shop industry, providing insights and practical applications in business forecasting. The research also helps coffee shop owners in Surabaya understand market characteristics, such as the most profitable coffee types and the locations with the highest customer density. Additionally, it supports optimizing purchase volume to increase profit by developing new strategies based on the prediction results. In conclusion, CatBoost accurately predicts purchase volume, helping coffee shops identify target markets and refine strategies based on customer preferences.
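The evaluation scheme named in the abstract (LOOCV scored with MAE and MAPE) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: a one-feature least-squares line stands in for CatBoost or LightGBM so the example runs with the standard library only, and the data values and the names `fit_line`, `loocv_predictions`, `visits`, and `cups` are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (stand-in for a boosted model)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        # All x values identical: fall back to predicting the mean.
        return y_bar, 0.0
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b


def loocv_predictions(xs, ys):
    """Predict each sample from a model trained on all the other samples."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        preds.append(a + b * xs[i])
    return preds


def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def mape(y_true, y_pred):
    """Mean absolute percentage error; assumes no true value is zero."""
    return 100 * mean(abs((t - p) / t) for t, p in zip(y_true, y_pred))


# Hypothetical toy data: weekly visits vs. cups purchased per customer.
visits = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cups = [2.1, 3.9, 6.2, 7.8, 10.1, 12.3]

held_out = loocv_predictions(visits, cups)
print(f"LOOCV MAE:  {mae(cups, held_out):.3f}")
print(f"LOOCV MAPE: {mape(cups, held_out):.1f}%")
```

Because every sample is held out exactly once, the reported errors come only from predictions on data the model did not see during fitting, which is why LOOCV suits small datasets. Swapping in a real gradient-boosting regressor would only change the fit and predict steps inside the loop.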

Author Biographies

Calvien Danny Nariyana, Universitas Pembangunan Nasional “Veteran” Jawa Timur

Calvien Danny Nariyana is an undergraduate student in Computer Science at Universitas Pembangunan Nasional "Veteran" Jawa Timur, Surabaya. He has a keen interest in machine learning and artificial intelligence, with his primary research focus on leveraging data science to tackle a variety of challenges in science and technology. Calvien is deeply passionate about utilizing advanced methods in data analysis to create innovative and practical solutions.

Trimono Trimono, Universitas Pembangunan Nasional “Veteran” Jawa Timur

Trimono is a lecturer in the Computer Science Study Program, UPN Veteran Jawa Timur, Indonesia. He earned a Bachelor's degree in Statistics from Diponegoro State University and a Master's degree in Mathematics from Bandung Institute of Technology. He has a strong passion for Risk Management, Time Series Analysis, and Financial Statistics. As a lecturer, Trimono actively teaches and mentors Mathematics students, integrating multiple scientific disciplines through the application of Data Science.

Published

2025-03-21

How to Cite

[1]
C. D. Nariyana, M. Idhom, and T. Trimono, “Prediction of Purchase Volume Coffee Shops in Surabaya Using Catboost with Leave-One-Out Cross Validation”, J. Ilm. Tek. Elektro Komput. Dan Inform, vol. 11, no. 1, pp. 124–138, Mar. 2025.

Section

Articles