Short Story Category Classification Using XGBoost with Chi-Square Feature Selection
Keywords:
text classification, short stories, xgboost, random forest, chi-square, ensemble learning
Abstract
Text classification is one of the major challenges in natural language processing, particularly when categorizing texts by genre. This study develops a system that classifies Indonesian short stories into three genres: romance, horror, and religion. Two ensemble-based machine learning algorithms, XGBoost and Random Forest, are compared in the experiments. Before model training, the short stories undergo text preprocessing, and features are extracted with TF-IDF. Chi-Square feature selection is then applied to retain the features most relevant to the genre labels. The models are trained with various hyperparameter combinations and validated using 5-fold cross-validation. The experimental results show that Chi-Square feature selection improves model accuracy. Final evaluation is performed on held-out test data using the best hyperparameter configuration. XGBoost achieves the best performance with an F1-score of 89%, while Random Forest achieves an F1-score of 86%. These results indicate that XGBoost generalizes better to unseen data, despite using fewer trees than Random Forest.
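To make the Chi-Square selection step concrete, the following is a minimal sketch of how terms can be scored against genre labels. It is not the authors' implementation. It assumes the textbook 2x2 contingency formulation based on whether a term is present in a document, whereas the study applies selection to TF-IDF features and library implementations may compute the statistic differently. The toy documents, the function names `chi_square_scores` and `select_top_k`, and the choice to keep the maximum score across classes are all illustrative assumptions.

```python
from collections import Counter

def chi_square_scores(docs, labels):
    """Score each term by its largest chi-square value over all classes.

    Uses document presence/absence (an assumption for illustration),
    with the standard 2x2 table per (term, class) pair:
        a = docs of the class containing the term
        b = docs of other classes containing the term
        c = docs of the class without the term
        d = docs of other classes without the term
    """
    n = len(docs)
    class_counts = Counter(labels)
    doc_sets = [set(doc.split()) for doc in docs]
    vocab = set().union(*doc_sets)
    scores = {}
    for term in vocab:
        best = 0.0
        for cls, n_cls in class_counts.items():
            a = sum(1 for s, y in zip(doc_sets, labels) if term in s and y == cls)
            b = sum(1 for s, y in zip(doc_sets, labels) if term in s and y != cls)
            c = n_cls - a
            d = n - n_cls - b
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:  # skip degenerate tables to avoid division by zero
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores

def select_top_k(docs, labels, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    scores = chi_square_scores(docs, labels)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Hypothetical toy corpus: two tiny "stories" per genre.
toy_docs = [
    "cinta hati rindu",
    "hantu malam gelap",
    "doa iman malam",
    "cinta rindu malam",
    "hantu gelap takut",
    "doa iman sabar",
]
toy_labels = ["romance", "horror", "religion", "romance", "horror", "religion"]

toy_scores = chi_square_scores(toy_docs, toy_labels)
toy_selected = select_top_k(toy_docs, toy_labels, 6)
```

In this toy corpus, a word such as "malam" occurs once in every genre, so its score is zero and it is dropped, while words confined to a single genre receive the highest scores. That is the sense in which the selection step improves feature relevance before the classifiers are trained.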