Predicting students' success level in an examination using advanced linear regression and extreme gradient boosting

ABSTRACT


INTRODUCTION
Education is a crucial factor in the development of individuals and society. In an effort to improve the quality of education, evaluating and monitoring student progress is very important [1]-[3]. One of the indicators often used to measure student success is the score obtained in an exam [2], [4], [5]. However, determining test scores accurately and effectively can be a complex challenge. In recent decades, prediction and machine learning techniques have developed rapidly and have been applied successfully in various fields, including education [6]-[8]. One popular prediction method is linear regression, which models a linear relationship between an independent variable and a dependent variable [7], [9]. However, in some cases, linear regression may not be robust enough to cope with the complexity of student data [10]-[12]. Therefore, in this study, we aim to predict students' success rate in an exam using advanced linear regression techniques and the extreme gradient boosting (XGBoost) algorithm. Advanced linear regression, such as regularized linear regression or non-linear extensions like polynomial regression, can help overcome the limitations of ordinary linear regression and provide more accurate predictions. In addition, we utilize the XGBoost algorithm, a powerful and popular decision-tree-based method in machine learning. XGBoost is able to handle data complexity, such as non-linear features or complicated interactions between variables. This research is expected to contribute to the development of methods for predicting student success rates in examinations. With more accurate predictions, educators can identify at-risk students, give more attention to students with potential, and adopt more personalized educational strategies to improve learning effectiveness. In this study, we use a dataset that includes a number of variables that potentially affect student success, such as attendance rate, number of study hours, previous exam results, and other relevant factors. We train an advanced linear regression model and the XGBoost algorithm on this dataset and evaluate prediction performance using relevant metrics, such as the coefficient of determination (R-squared) and mean squared error (MSE). It is hoped that the results of this study can provide practical guidance for educational institutions in improving the monitoring and evaluation of student progress and in identifying important factors that contribute to student success in examinations.
Education is a key factor in the development of individuals and society. In the context of education, it is important to understand the factors that influence students' success in achieving learning goals [7], [13]-[15]. Several previous studies have identified variables that potentially affect student success, such as attendance rate, number of study hours, student motivation, and environmental factors. However, achieving a deep understanding of the relationship between these variables and student success rates requires a more sophisticated analytical approach.
Prediction and machine learning methods have been a rapidly growing field in recent decades [7], [16], [17]. In the context of education, these techniques have been applied to predict the success rate of students in examinations using various algorithms and models. One commonly used method is linear regression, which models a linear relationship between the independent variable and the dependent variable. However, in some cases, linear regression may not be able to handle the complexity of student data well.
In an effort to improve prediction performance, several advanced linear regression methods have been developed. Linear regression with regularization, such as Ridge or Lasso regression, has been shown to be effective in reducing overfitting and improving model generalization [18]-[20]. In addition, non-linear extensions such as polynomial regression can help overcome the limitations of ordinary linear regression and model more complex relationships between variables. By using these methods, we can improve the accuracy and reliability of student success rate predictions.
Besides linear regression, machine learning algorithms such as XGBoost have also become popular in predicting student outcomes. XGBoost is a powerful decision-tree-based method and can handle the complexity of data, including non-linear features and interactions between variables. In the context of predicting student success rates, XGBoost can help identify more complex patterns and relationships in student data, which may not be detected by linear regression methods.
In this research, we combine an advanced linear regression approach and the XGBoost algorithm to predict students' success rate in an exam. By combining the strengths of both, we hope to improve the accuracy and reliability of our predictions. It is hoped that this research will contribute to understanding the factors that influence student success and to developing more effective and relevant prediction methods in an educational context.
Linear regression is one of the most commonly used statistical methods for analyzing the relationship between dependent and independent variables. It is a linear approach that tries to find the best linear relationship between those variables [21], [22]. However, in some cases, simple linear regression may not be robust enough to cope with higher data complexity. In this context, the concept of "advanced linear regression" emerges, which refers to the use of additional techniques to improve the performance of linear regression [23]-[25].
One technique that is often used in advanced linear regression is linear regression with regularization [23]-[25]. Regularization adds a penalty on the regression coefficients to prevent overfitting and improve model generalization. Examples of linear regression with regularization include Ridge regression and Lasso regression. Ridge regression adds a squared penalty on the regression coefficients, while Lasso regression uses an absolute value penalty. By using this technique, we can reduce the effect of insignificant variables and improve the stability and predictive ability of linear regression.
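To make the difference between the two penalties concrete, the following is a minimal sketch for the simplest possible case: one centered feature and no intercept, where both estimators have a closed form. The data values, the function names, and the penalty strengths are illustrative assumptions and are not taken from the study.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_one_feature(x, y, penalty="none", lam=0.0):
    """Slope of the model y ~ beta * x (no intercept) under an optional penalty.

    ridge minimizes 0.5 * sum((y - beta*x)^2) + 0.5 * lam * beta^2
    lasso minimizes 0.5 * sum((y - beta*x)^2) + lam * abs(beta)
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    if penalty == "ridge":
        return sxy / (sxx + lam)           # squared penalty shrinks smoothly
    if penalty == "lasso":
        return soft_threshold(sxy, lam) / sxx  # absolute penalty can give exactly 0
    return sxy / sxx                       # ordinary least squares


# Hypothetical centered study hours (x) and centered exam scores (y).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.7]

ols_slope = fit_one_feature(x, y)
ridge_slope = fit_one_feature(x, y, penalty="ridge", lam=10.0)
lasso_slope = fit_one_feature(x, y, penalty="lasso", lam=5.0)
lasso_zeroed = fit_one_feature(x, y, penalty="lasso", lam=25.0)
```

In this toy case the Ridge slope is shrunk but stays non-zero, while a sufficiently large Lasso penalty sets the slope to exactly zero, which is why Lasso is often described as performing variable selection.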
In addition, advanced linear regression can also include the use of polynomial regression. Polynomial regression extends the linear regression model by incorporating polynomial features, which allows modeling the non-linear relationship between variables [26]-[30]. By incorporating higher power terms of the independent variables into the model, we can capture more complex patterns and interactions in the data. Polynomial regression is useful when the relationship between variables cannot be explained linearly and requires a more flexible representation.
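The idea of adding higher power terms can be sketched as a feature expansion: the model stays linear in its coefficients, but each input is replaced by its powers. The function names, the study-hour values, and the coefficients below are hypothetical and serve only as an illustration.

```python
def polynomial_features(x, degree):
    """Return [x, x**2, ..., x**degree] for one numeric input."""
    return [x ** power for power in range(1, degree + 1)]


def predict(intercept, coefficients, x):
    """Evaluate a model that is linear in the expanded features."""
    features = polynomial_features(x, len(coefficients))
    return intercept + sum(c * f for c, f in zip(coefficients, features))


# Hypothetical study hours expanded up to the third power.
study_hours = [1.0, 2.0, 3.0]
expanded = [polynomial_features(h, degree=3) for h in study_hours]

# Hypothetical quadratic model: score = 40 + 12*h - 1*h^2, evaluated at h = 4.
predicted_score = predict(40.0, [12.0, -1.0], 4.0)
```

Once the features are expanded in this way, any linear regression fitting procedure, including the regularized variants, can be applied to the expanded columns.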
Advanced linear regression also includes more advanced feature selection techniques. Feature selection is the process of selecting the most relevant subset of independent variables to predict the dependent variable. In advanced linear regression, model evaluation and validation are also an important part. Evaluation metrics such as R-squared, MSE, or prediction accuracy are used to measure model performance. In addition, techniques such as cross-validation or out-of-sample testing are also used to ensure the reliability and generalizability of linear regression models. By conducting a comprehensive evaluation, we can gain a better understanding of the quality and predictive reliability of the improved linear regression model.
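The metrics and the cross-validation scheme mentioned above can be written out in a few lines. This is a minimal sketch with hypothetical scores; the interleaved fold assignment is one simple choice among several and is an assumption, not the procedure used in the study.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx); fold i tests every sample with index % k == i."""
    for fold in range(k):
        test_idx = [i for i in range(n_samples) if i % k == fold]
        train_idx = [i for i in range(n_samples) if i % k != fold]
        yield train_idx, test_idx


# Hypothetical true and predicted exam scores.
y_true = [60.0, 70.0, 80.0, 90.0]
y_pred = [62.0, 68.0, 83.0, 89.0]

mse = mean_squared_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
folds = list(k_fold_indices(6, 3))
```

In a cross-validated evaluation, the model would be refitted on each `train_idx` and scored on the matching `test_idx`, and the per-fold metrics would then be averaged.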
XGBoost is a popular and powerful machine learning algorithm used to build predictive models. XGBoost is based on the concept of ensemble learning, where several small models (weak learners) are combined into one stronger model (strong learner) [2], [18], [31]-[33]. XGBoost uses a boosting approach, which means that the sequentially generated models focus on reducing the prediction error of the previous models. The XGBoost algorithm uses a decision tree as its weak learner. A decision tree is a predictive model that breaks data into smaller subsets based on a set of rules defined by features in the data. XGBoost builds decision trees sequentially and combines their predictions to produce the final prediction. In each step, XGBoost uses the gradient of the loss function to fit the next tree and reduce the prediction error.
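The sequential error-correcting idea can be illustrated with a deliberately simplified gradient boosting loop. This sketch is not XGBoost itself: it uses one-split trees (stumps) on a single feature, squared-error loss, and no regularization, and all names and data values are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the threshold split on x that best fits the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Build an additive model of stumps, each fitted to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared-error loss, the negative gradient equals the residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        step = (threshold, learning_rate * left_mean, learning_rate * right_mean)
        stumps.append(step)
        predictions = [pi + (step[1] if xi <= threshold else step[2])
                       for xi, pi in zip(x, predictions)]
    return base, stumps


def predict(model, xi):
    """Sum the base value and the (already scaled) contribution of every stump."""
    base, stumps = model
    return base + sum(left if xi <= threshold else right
                      for threshold, left, right in stumps)


def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


# Hypothetical study hours and exam scores with a jump between 3 and 4 hours.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
scores = [50.0, 55.0, 52.0, 80.0, 85.0, 83.0]

model = boost(hours, scores)
baseline_mse = mse(scores, [sum(scores) / len(scores)] * len(scores))
boosted_mse = mse(scores, [predict(model, h) for h in hours])
```

Each round lowers the training error a little, which is the behavior the paragraph describes; XGBoost adds deeper trees, second-order gradient information, and regularization on top of this basic loop.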
One of the advantages of XGBoost is its ability to handle problems with complex data and non-linear features. XGBoost can handle complex interactions between variables, model complex patterns, and identify the features that are most important in prediction. It also copes well with classification and regression problems, and can be used in other tasks such as anomaly detection or recommendation systems. In addition, XGBoost provides several optimization and regularization techniques that help improve performance and prevent overfitting. For example, XGBoost uses L1 and L2 regularization to prevent excessive model complexity and reduce the tendency towards overfitting. XGBoost also minimizes prediction errors during training through a loss function, such as the least squares objective for regression or the log-loss objective for classification. XGBoost has proven effective in many competitions and real-world applications. It stands out in terms of speed, scalability, and accuracy. Extensive support and an active community also make XGBoost a popular choice among practitioners and researchers in the field of machine learning. By combining decision trees, optimization techniques, and regularization, the algorithm contributes to improved prediction performance in various contexts and domains.
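For reference, the regularized objective that XGBoost is generally described as minimizing can be written as follows. The notation is an assumption introduced here for illustration: $l$ is the loss, $f_k$ is the $k$-th tree, $T$ is its number of leaves, $w$ is its vector of leaf weights, and $\gamma$, $\lambda$, $\alpha$ are penalty strengths.

$$\text{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert_2^2 + \alpha \lVert w \rVert_1$$

With the least squares objective, $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$; with the log-loss objective for a binary label and predicted probability $\hat{p}_i$, $l(y_i, \hat{p}_i) = -\left[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right]$. The $\lambda$ term corresponds to the L2 regularization and the $\alpha$ term to the L1 regularization mentioned above.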
Sugiyanto [1] applied advanced linear regression with Lasso regularization to predict students' academic success in mathematics exams. The study found that advanced linear regression was able to overcome the complexity of the data and provide more accurate predictions than ordinary linear regression. In addition, variables such as number of study hours and participation in extracurricular activities were identified as significant factors in predicting student success. Wang et al. [34] used the XGBoost approach to predict student success in English exams. They collected data on variables such as attendance rate, previous test scores, and socio-economic characteristics of students. Using XGBoost, this study achieved high prediction accuracy. The results showed that attendance rate and socio-economic characteristics contributed significantly to the prediction of student success rate.
Dabhade et al. [23] combined polynomial regression with model-based feature selection techniques to predict students' academic success in science exams. This study showed that by considering the non-linear relationship between variables and using polynomial regression, the prediction results were significantly improved. In addition, through careful feature selection, variables such as student age and level of participation in class discussions were identified as important predictors. Urbanski [35] applied an ensemble approach involving a combination of linear regression, polynomial regression, and XGBoost to predict student success rates in academic exams. The study showed that by combining the strengths of the three methods, the prediction model can achieve higher accuracy than the individual methods. Variables such as attendance rate, study time, and student motivation level were shown to have a significant effect in predicting student success.

METHOD
2.1. Research steps
This research flow outlines the prediction study, which uses advanced linear regression and XGBoost, through several common stages. The main steps encompass understanding and defining the research problem, data collection and preprocessing, feature selection, model training using advanced linear regression and XGBoost techniques, and finally, evaluating and interpreting the results. This systematic approach ensures a comprehensive analysis, from problem definition to model evaluation. The main steps in this research flow are as follows:
- Training model: in this stage, the advanced linear regression and XGBoost models are trained using the prepared training data. The models learn the patterns and relationships between the input features and the output label (student success rate). This step involves adjusting the parameters and hyperparameters of the models to fit the training data.
- Model testing: after the training process, the trained model is tested using separate testing data. This aims to measure the performance and prediction accuracy of the model on data that it has never seen before. Evaluation metrics such as accuracy, precision, recall, and MSE are used to evaluate model performance.
- Discussion and evaluation: the final stage involves discussion and evaluation of the results of this study. The prediction results from the advanced linear regression and XGBoost models are compared, and the advantages and disadvantages of each method are analyzed. The discussion may also include an analysis of the interpretability of the model, the importance of the identified features, and recommendations for further research development.
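The classification metrics named in the model testing step can be computed as in the following minimal sketch. It assumes that success levels are treated as class labels, here simplified to a binary pass (1) or fail (0) label; the function name and the label values are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical pass (1) / fail (0) labels for ten students.
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]

accuracy, precision, recall = classification_metrics(y_true, y_pred)
```

Accuracy counts all correct predictions, whereas precision and recall focus on the chosen positive class; for a multi-level success label, they would be computed per class and then averaged.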

Model development
In this study, model development incorporates two prediction techniques: advanced linear regression and XGBoost. These methods are applied independently to generate predictions, and their results are then combined. This combined approach is intended to enhance the accuracy and reliability of the predictions made in the study. The following provides a more detailed overview of the methodology used in the model development process:

Advanced linear regression
At this stage, advanced linear regression is used to improve on the performance of ordinary linear regression. Advanced linear regression may include regularization techniques such as Ridge or Lasso regression, or non-linear extensions such as polynomial regression. Ridge regression, for example, reduces model complexity by adding a squared penalty on the regression coefficients, while Lasso regression uses an absolute value penalty.
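In their standard textbook form, the two penalized estimators can be written as follows. The symbols are assumptions introduced here: $y_i$ is the exam outcome of student $i$, $x_i$ the vector of $p$ features, $\beta_0$ the intercept, $\beta$ the coefficient vector, and $\lambda \ge 0$ the penalty strength.

$$\hat{\beta}_{\text{ridge}} = \arg\min_{\beta_0, \beta} \; \sum_{i=1}^{n} \left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

$$\hat{\beta}_{\text{lasso}} = \arg\min_{\beta_0, \beta} \; \sum_{i=1}^{n} \left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert$$

Setting $\lambda = 0$ recovers ordinary least squares in both cases, and larger values of $\lambda$ shrink the coefficients more strongly.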

XGBoost
Next, the XGBoost algorithm is used to build the prediction model. XGBoost uses an ensemble learning approach with decision trees as the weak learners. In XGBoost, decision trees are built sequentially, and each tree focuses on reducing the prediction error left by the previous trees. XGBoost uses the gradient of the loss function to update the leaf weights and reduce the prediction error.
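The gradient-based weight update can be made explicit using the second-order approximation on which XGBoost is based. The notation is an assumption introduced here: $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the current prediction for sample $i$, $I_j$ is the set of samples falling into leaf $j$, and $\lambda$ is the L2 penalty strength. For a fixed tree structure, the optimal weight of leaf $j$ is

$$w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$

For the squared-error loss $l = \tfrac{1}{2}(y_i - \hat{y}_i)^2$, this gives $g_i = \hat{y}_i - y_i$ and $h_i = 1$, so the leaf weight becomes the sum of the residuals in the leaf divided by the leaf size plus $\lambda$, that is, a shrunken mean residual.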

Model combination
After training the advanced linear regression and XGBoost models separately, the predictions from both models can be combined to produce a better final prediction. One commonly used approach is averaging, where the final prediction is the mean of the predictions from both models. The combination formula can be written as (3):

$$\text{Final prediction} = \frac{\text{Linear regression prediction} + \text{XGBoost prediction}}{2} \quad (3)$$

In (3), the final prediction is the result generated from the combination of the linear regression and XGBoost predictions. By combining the predictions from both models, we can utilize the strengths of each model and achieve more accurate predictions. It is important to note that the combination formula above uses a simple approach of taking the average of the predictions. There are also other methods to combine model predictions, such as stacking, voting, or weighted averaging, which may result in better performance depending on the characteristics of the data and the models used.
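The averaging scheme, together with the weighted averaging mentioned as an alternative, can be sketched as a single function. The function name, the weight parameter, and the prediction values are hypothetical; a weight of 0.5 corresponds to the simple average.

```python
def combine_predictions(linear_preds, xgb_preds, weight_linear=0.5):
    """Blend two lists of predictions; weight_linear=0.5 is the simple average."""
    if len(linear_preds) != len(xgb_preds):
        raise ValueError("both models must predict the same number of samples")
    if not 0.0 <= weight_linear <= 1.0:
        raise ValueError("weight_linear must lie between 0 and 1")
    weight_xgb = 1.0 - weight_linear
    return [weight_linear * a + weight_xgb * b
            for a, b in zip(linear_preds, xgb_preds)]


# Hypothetical predicted scores for two students from each model.
linear_preds = [70.0, 64.0]
xgb_preds = [74.0, 60.0]

simple_average = combine_predictions(linear_preds, xgb_preds)
weighted_average = combine_predictions(linear_preds, xgb_preds, weight_linear=0.25)
```

In practice, the weight would be chosen on validation data rather than fixed in advance, so that the better-performing model contributes more to the final prediction.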

Model development
Table 1 provides an overview of the performance of the model, which was built by combining the advanced linear regression and XGBoost algorithms. The results in the table were obtained from five test trials, which indicate the model's consistency across runs. After these five phased tests, the model achieved an accuracy of 0.680 in the fifth test. This accuracy shows the extent to which the model can predict the success rate of students in the exam.
Compared with the study by Urbanski [35], which used a relatively similar model, the model developed in this study tends to perform better. That study achieved an accuracy of 0.655 with a similar model, so the results of this study show an increase in accuracy of 0.025. This suggests that combining advanced linear regression and XGBoost can improve the prediction of student success rates in exams.
This improvement in accuracy can be attributed to XGBoost's ability to handle data complexity and non-linear features. XGBoost is able to extract complex patterns and interactions between variables, which may not be captured by ordinary linear regression methods. By combining the power of XGBoost with advanced linear regression, the model can produce more accurate predictions.
However, although the model achieved an improvement in accuracy, there is still room for further improvement. Further evaluation can be done by analyzing other evaluation metrics such as precision, recall, or MSE to gain a more comprehensive understanding of the model's performance. In addition, it is also important to consider variables that may not have been included in this model, as well as additional techniques such as more sophisticated feature selection to improve prediction reliability.
Figure 1 presents the model accuracy plot for advanced linear regression, offering a visual representation of the performance of this modeling approach. The plot illustrates how the accuracy of the advanced linear regression model evolves over the different tests. The x-axis denotes the tests conducted, while the y-axis indicates the corresponding accuracy scores achieved by the model. Analyzing this plot provides insight into the consistency and effectiveness of advanced linear regression in predicting students' exam success.

Training and testing final model
After developing the prediction model using a combination of advanced linear regression and XGBoost, the final model was re-implemented and tested using the main dataset. The purpose of this test is to see the ability of the model to identify the level of student success in the exam using 30 labels in the dataset. The test results show that the developed model is able to identify the level of student success using the labels in the dataset. The model successfully provides predictions for 30 labels with a reliable level of accuracy.
The use of 30 labels in the dataset allows the model to provide more detailed and in-depth predictions of student success rates in the exam. Thus, the model can provide more complete information to educators or researchers in describing and understanding student achievement levels. The use of this final model also provides an opportunity to further analyze the variables that are significant in influencing student success rates. By looking at the contribution of these variables to the model predictions, educators can identify important factors that need to be considered in improving the quality of student learning.
However, it should be noted that the results of this model still need to be further verified and validated. Further evaluation can be done by involving additional test data and conducting comparisons with other prediction methods already in the literature. It is also important to consider the limitations and assumptions of the model, and make adjustments or refinements where necessary to improve prediction performance. Overall, the testing and re-implementation of the final model showed that developing the model using the combination of advanced linear regression and XGBoost can provide a good ability to identify the level of student success in the exam using 30 labels in the dataset.
Comput Sci Inf Technol ISSN: 2722-3221 Predicting students' success level in an examination using advanced linear … (Tri Wahyuningsih)

Predicting student performance level
In the final stage of this research, the developed model was used to predict students' performance levels in exams. In testing with 10 data samples, the model gave accurate predictions in most cases. Of the 10 data samples used, only 1 was predicted incorrectly. Although there was one prediction error, the error was still acceptable, as 9 out of 10 predictions made by the model were accurate. Table 2 shows that the model performs well in predicting the success rate of students in the exam.
These prediction results can give teachers an early indication of how many students are likely to fail the exam. With this information, teachers can give special attention and additional support to the students in that category. Thus, teachers can maximize learning efforts and help students achieve higher success rates. However, it is important to consider that these predictions are probabilistic and cannot be taken as absolute truth. The predictions only provide estimates based on the information available at the time of model testing. Therefore, continuous evaluation and direct monitoring of student performance by teachers is still required in order to take appropriate steps to help students who are likely to face difficulties. Overall, the use of this prediction model provides an important early benefit for teachers in identifying students who are likely to fail the test. While there is still the possibility of prediction error, the majority of accurate predictions provide a solid basis for teachers to pay special attention to these students and help them achieve better success in learning.

CONCLUSION
This study employs a hybrid approach integrating advanced linear regression and XGBoost algorithms to predict students' exam success rates, achieving an accuracy of 0.680 in the fifth test. The research underscores the effectiveness of this combination, which outperformed a previous similar model. The study discusses the advantages of XGBoost in handling data complexity and non-linear features, complemented by the strengths of advanced linear regression in interpreting coefficients and identifying linear relationships. Despite these accomplishments, the research acknowledges limitations, including the study's reliance on a specific dataset from Kaggle, which cautions against broad generalizations. Furthermore, the study emphasizes the need for future research to explore diverse datasets, incorporate external factors such as learning environments and social support, and employ advanced ensemble methods and feature selection techniques to enhance prediction accuracy. In essence, while the study demonstrates the potential of the advanced linear regression and XGBoost combination, it underscores the importance of ongoing research to fully grasp its applicability in educational contexts.

Figure 2. Model accuracy plot-XGBoost focused

Feature selection techniques can help reduce data dimensionality and improve model interpretability. Some commonly used feature selection methods in advanced linear regression include model-based feature selection and wrapper methods such as recursive feature elimination (RFE). By using appropriate feature selection techniques, we can improve the efficiency and accuracy of linear regression models.
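Recursive feature elimination can be sketched as a loop that fits a linear model, drops the feature with the smallest absolute coefficient, and repeats. The version below is a simplified illustration: it assumes features on comparable scales, fits without an intercept through the normal equations, and uses hypothetical feature names and toy data in which one feature has no effect on the target.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def least_squares(rows, y):
    """Coefficients minimizing squared error, via the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def recursive_feature_elimination(rows, y, names, n_keep):
    """Repeatedly drop the feature with the smallest absolute coefficient."""
    names = list(names)
    rows = [list(r) for r in rows]
    while len(names) > n_keep:
        coefs = least_squares(rows, y)
        weakest = min(range(len(names)), key=lambda i: abs(coefs[i]))
        del names[weakest]
        rows = [r[:weakest] + r[weakest + 1:] for r in rows]
    return names


# Toy data: target = 3 * attendance + 0 * hobby_count + 1 * study_hours.
names = ["attendance", "hobby_count", "study_hours"]
rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
        [1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]]
y = [3.0, 0.0, 1.0, 3.0, 1.0, 4.0]

selected = recursive_feature_elimination(rows, y, names, n_keep=2)
```

Because coefficient magnitudes depend on feature scale, features would normally be standardized before applying this procedure to real data.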


ISSN: 2722-3221 Comput Sci Inf Technol, Vol. 5, No. 1, March 2024: 29-37
- Data collection: the data for this study was collected from Kaggle. Kaggle is a platform that provides various public datasets for research and analysis purposes. Datasets that are relevant to the topic of this research, i.e. data regarding variables that potentially affect student success in exams, are downloaded from Kaggle.
- Data preprocessing: this stage involves cleaning, merging, and transforming the data. Data downloaded from Kaggle may require additional processing such as removing missing values or outliers, filling in missing values, and converting categorical variables into a numerical form that can be used by the model. The goal is to ensure the quality and consistency of the data before it is used in the model training and testing process.
- Training data and labels: after preprocessing, the dataset is divided into two parts: training data and testing data. The training data is used to train the prediction model, while the testing data is used to test the performance of the trained model. In addition, a label is defined, which is the variable to be predicted by the model, in this case, the student's success rate in the exam.
- Feature development: at this stage, additional features are developed that can improve the predictive ability of the model. These features can come from combinations of existing variables, creation of new features based on domain knowledge, or dimension reduction techniques such as principal component analysis (PCA). The goal is to optimize the data representation used by the model so that it can describe a more accurate relationship with the target variable.
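The preprocessing and splitting steps listed above can be sketched with a few small helpers. This is a minimal illustration using only the standard library; the column values, the category names, the mean-imputation choice, and the 80/20 split ratio are assumptions and are not details reported by the study.

```python
import random


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def one_hot(values):
    """Convert categorical values into 0/1 indicator columns."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the rows for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = max(1, round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


# Hypothetical study-hour column with one missing value.
filled_hours = fill_missing_with_mean([2.0, None, 4.0])

# Hypothetical categorical column describing where a student lives.
encoded, categories = one_hot(["urban", "rural", "urban"])

# Ten placeholder rows split into training and testing parts.
train_rows, test_rows = train_test_split(list(range(10)), test_fraction=0.2)
```

Any statistic used for imputation or scaling should be computed on the training part only and then applied to the testing part, so that no information from the test data leaks into training.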

Table 1. Trial and testing model

Table 2. Trial and testing model