Prediction of Purchase Volume Coffee Shops in Surabaya Using Catboost with Leave-One-Out Cross Validation
DOI:
https://doi.org/10.26555/jiteki.v11i1.30610
Keywords:
Catboost, LightGBM, Coffee Shops, Gradient Boosting
Abstract
Indonesia's coffee consumption grew from 265,000 tons in 2015 to 294,000 tons in 2020, averaging 2% annual growth, with a projected 368,000 tons by 2024. Coffee shops are one part of this market, and they often struggle to attract customers quickly, risking low purchase volume within their first five years. In the first year, the challenges include management, company size, service quality, and customer preferences. This study adopts a quantitative approach to develop a machine-learning-based purchase prediction application, together with a strategy to increase purchase volume, for three coffee shops in Surabaya. It uses CatBoost, with LightGBM as a comparison, across multiple coffee shop locations. Leave-One-Out Cross-Validation (LOOCV) is used to address research limitations such as overfitting and bias while improving the accuracy of the evaluation. The CatBoost model achieved an MAE of 0.91 and a MAPE of 15%, outperforming LightGBM's MAE of 1.13 and MAPE of 18%. These results establish CatBoost as the better of the two models for purchase prediction and confirm that it predicts with good accuracy for the coffee shop industry, providing insights and practical applications in business forecasting. The research also helps coffee shop owners in Surabaya understand market characteristics, such as the most profitable coffee types and the locations with the highest customer density. Additionally, it helps owners optimize purchase volume and increase profit by developing new strategies based on the prediction results. In conclusion, CatBoost accurately predicts purchase volume, helping coffee shops identify target markets and refine strategies based on customer preferences.
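The evaluation procedure named in the abstract (LOOCV scored with MAE and MAPE) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the one-feature least-squares line stands in for CatBoost or LightGBM so the example needs only the standard library, and the function names and toy numbers are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (a stand-in for the real model)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        return y_bar, 0.0  # all x equal: fall back to predicting the mean
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b

def loocv_errors(xs, ys):
    """Return (MAE, MAPE in percent) from leave-one-out cross-validation."""
    abs_errs, pct_errs = [], []
    for i in range(len(xs)):
        # Hold out sample i and train on the remaining n - 1 samples.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        err = abs(ys[i] - (a + b * xs[i]))
        abs_errs.append(err)
        pct_errs.append(err / abs(ys[i]))  # assumes no target equals zero
    return mean(abs_errs), 100 * mean(pct_errs)

# Hypothetical toy data, not taken from the study.
features = [1.0, 2.0, 3.0]
purchases = [1.0, 2.0, 4.0]
mae, mape = loocv_errors(features, purchases)
print(f"LOOCV MAE = {mae:.2f}, MAPE = {mape:.0f}%")
```

Each sample is predicted by a model that never saw it, so the averaged errors estimate out-of-sample performance. To compare two models as the study does, the same loop would be run once per model, with the model's own fit and predict calls replacing `fit_line`.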
License
Copyright (c) 2025 Calvien Danny Nariyana, Mohammad Idhom, Trimono Trimono

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with JITEKI agree to the following terms:
- Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0) that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.