Modern educational institutions collect extensive data, and there is growing interest in using it to identify students who may be falling behind and to pinpoint areas in which they can improve. Machine learning can help educators by flagging students who are at risk of failing a course. While a statistical model that predicts whether a student will score proficiently on an exam is powerful, being able to identify the factors behind each prediction is just as important. This article walks through our work on predicting student performance with machine learning, along with our findings.
Given the educational data available across a state, it is first necessary to identify the data attributes that serve as strong indicators of a student’s success. We started with an extensive review of previous research, which showed us which areas to focus on. We then ran our own exploratory analysis to find trends and relationships within our data. During this analysis, we used many data visualizations as well as statistical tests such as:
- Shapiro-Wilk Test – examines how closely data fits a normal distribution
- Two-Tailed T-Test – tests whether the means of two groups differ, using a significance level of 0.05
- Mann-Whitney U-Test – tests whether two groups differ when the data does not follow a normal distribution; unlike the T-Test, it does not assume a specific distribution
- Correlation Coefficients – a statistical measure of the strength of the linear relationship between two variables
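To make two of the tests above concrete, here is a minimal sketch using only the Python standard library. It is not our project code: the function names and the sample scores are invented for illustration, and the Mann-Whitney p-value uses a normal approximation without tie or continuity corrections, so it should be read as approximate. In practice a statistics library would normally be used instead.

```python
from math import sqrt
from statistics import NormalDist, mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, approximate p-value)."""
    # Pool both samples, remembering which group each value came from.
    pooled = sorted((value, group) for group, sample in enumerate((a, b)) for value in sample)
    # Assign 1-based ranks, giving tied values their average rank.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    rank_sum_a = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    n_a, n_b = len(a), len(b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mu = n_a * n_b / 2
    sigma = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mu) / sigma
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_a, p_value


# Illustrative, made-up scores (not from our dataset).
prior_scores = [55, 60, 65, 70, 75, 80]
current_scores = [58, 61, 70, 72, 79, 85]
print("correlation:", round(pearson_r(prior_scores, current_scores), 3))

group_a = [61, 64, 70, 72, 75]
group_b = [66, 71, 78, 80, 84]
u_stat, p = mann_whitney_u(group_a, group_b)
print("U:", u_stat, "differs at 0.05:", p < 0.05)
```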
We identified student demographic data, past student performance, and historical standardized test scores as some of the strongest variables for predicting student performance.
Refining Educational Data for Predictions
Once we had selected the variables to train a predictive model on, we ran several data preparation steps. Raw data can have many issues, such as missing data, invalid data types, and non-standardized numerical data. We started by standardizing our numerical data because different tests are scored on different scales. This matters because, for many models, variables with larger magnitudes can influence predictions more than those with smaller magnitudes. Standardizing puts all variables on a comparable scale. Next, because some of our data is non-numerical, we encoded these data points as numerical representations of the character, Boolean, or string values.
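The two preparation steps above can be sketched as follows. This is an illustration under assumptions rather than our pipeline: it uses z-score standardization and one-hot encoding, which are common choices but only one of several options for each step, and the helper names and sample values are made up.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit standard deviation (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant column carries no scale information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Encode categorical values as 0/1 indicator rows, one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Two tests scored on very different scales (made-up numbers).
test_a = [410, 520, 630]
test_b = [17, 22, 30]
print([round(z, 2) for z in standardize(test_a)])
print([round(z, 2) for z in standardize(test_b)])

# A hypothetical categorical attribute.
categories, encoded = one_hot(["yes", "no", "yes"])
print(categories, encoded)
```

After standardization both test columns are expressed in standard deviations from their own mean, so neither dominates simply because of its scoring scale.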
Finally, missing data refers to samples that lack certain attributes, such as a student without a standardized test score from the previous year. Many machine learning models struggle to handle samples with missing data, which motivated us to impute these values. Simple imputation methods fill in the mean, median, or mode of the column, but we can do better. We used a K-Nearest Neighbor algorithm that uses a sample's other data attributes to predict what the missing value would be. It does this by finding the K students most similar to that student based on the non-missing data and inferring the missing value from those K samples.
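The idea behind K-Nearest Neighbor imputation can be sketched in a few lines. This is a simplified illustration, not the implementation we used: it assumes numeric features that are already standardized, marks missing values with None, measures similarity with a root-mean-squared difference over the features both rows have, and fills each gap with the plain average of the K nearest rows. The function name and the data are hypothetical.

```python
from math import sqrt


def knn_impute(rows, k=2):
    """Return a copy of rows with None entries filled from the k most similar rows."""

    def distance(a, b):
        # Compare only the features that are present in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors are other rows where this column is observed.
            candidates = [
                (distance(row, other), other[col])
                for j, other in enumerate(rows)
                if j != i and other[col] is not None
            ]
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                imputed[i][col] = sum(nearest) / len(nearest)
    return imputed


# Each row is one student: [feature_1, feature_2]; the last student is missing feature_2.
students = [
    [1.0, 10.0],
    [1.1, 12.0],
    [5.0, 50.0],
    [1.05, None],
]
print(knn_impute(students, k=2))
```

The last student is most similar to the first two, so the missing value is filled from them rather than from the column-wide mean, which the third student would pull upward.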
Choosing the Right Classifier for our Data
The next step in our process was to identify the classification algorithms to apply to our dataset and select the best performing one for predicting student performance. We chose the top performing algorithms from our literature review of previous work that applied machine learning to the education domain. The algorithms we experimented with were XGBoost, RandomForestClassifier, Neural Networks, Support Vector Machine, and Logistic Regression. To evaluate each model, we set up a K-Fold Cross Validation pipeline that splits the dataset into K groups and uses each group as the test set exactly once, which gives more stable results than a single train/test split.
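A minimal sketch of K-Fold Cross Validation scored with F1 is shown below. It is illustrative only: the function names are our own for this example, the `fit_predict` callable stands in for any classifier, and the folds are taken in order without shuffling or stratification, which a real pipeline would usually add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs so each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cross_validate(features, labels, k, fit_predict):
    """Average F1 over k folds; fit_predict(train_X, train_y, test_X) returns predictions."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        predictions = fit_predict(
            [features[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [features[i] for i in test_idx],
        )
        scores.append(f1_score([labels[i] for i in test_idx], predictions))
    return sum(scores) / len(scores)


# A trivial stand-in "model" that always predicts the positive class.
def always_positive(train_X, train_y, test_X):
    return [1] * len(test_X)


X = [[0], [1], [2], [3], [4], [5]]
y = [1, 0, 1, 0, 1, 0]
print(round(cross_validate(X, y, k=3, fit_predict=always_positive), 3))
```

Passing the model in as a callable keeps the evaluation loop identical for every algorithm being compared, so differences in score come from the model and not from the split.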
Following our experiments, we found that the RandomForestClassifier yielded the best results, with an F1-Score of 83.5%. We also added feature selection to our pipeline to remove redundant features. Features are often correlated with one another, which means that including both adds little new information. Below is a figure visualizing how model performance changes as more features are included. The F1-Score plateaus, and adding more attributes beyond that point yields no lift in performance.
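One simple way to see how correlated features become redundant is a correlation filter, sketched below. This is an assumption for illustration and not necessarily the feature selection method used in our pipeline: it keeps features in the order given and drops any feature whose absolute Pearson correlation with an already kept feature reaches a threshold. The column names, values, and the 0.95 threshold are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def drop_redundant(columns, threshold=0.95):
    """Return the names of features to keep, skipping near-duplicates of kept ones."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson_r(values, columns[other])) < threshold for other in kept):
            kept.append(name)
    return kept


# Hypothetical columns: the second is just a rescaled copy of the first.
columns = {
    "score_raw": [1, 2, 3, 4],
    "score_percent": [2, 4, 6, 8],
    "attendance": [4, 1, 3, 2],
}
print(drop_redundant(columns))
```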
The Magic Behind the Random Forest Algorithm
The Random Forest algorithm is a popular machine learning algorithm built on decision trees. A decision tree works by asking simple questions about a data sample and following different paths down the tree until it reaches a result node. A Random Forest is an ensemble of decision trees whose outputs are aggregated to make a final prediction. It relies on two techniques: bagging and feature bagging. Bagging means that each decision tree is trained on a random sample of the dataset, typically drawn with replacement. Feature bagging means that a tree considers only a random subset of the available data attributes, which in common implementations is redrawn at every split. Using this method, we create numerous decision trees that are individually weak but, when combined, produce a well-performing model for predicting student performance.
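The sampling and aggregation steps described above can be sketched without training any trees. This is a simplified illustration with hypothetical helper names: it draws one bootstrap sample and one feature subset per tree (real implementations usually redraw the feature subset at each split) and aggregates predictions by majority vote. The tree training itself is left out.

```python
import random
from collections import Counter


def bootstrap_sample(n_samples, rng):
    """Bagging: draw n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def feature_subset(n_features, n_selected, rng):
    """Feature bagging: pick a random subset of feature indices without replacement."""
    return sorted(rng.sample(range(n_features), n_selected))


def majority_vote(predictions):
    """Aggregate the class predictions of all trees into one final prediction."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
n_samples, n_features, n_trees = 8, 5, 3

# Each tree would be trained on its own rows and its own feature subset.
for tree_id in range(n_trees):
    rows = bootstrap_sample(n_samples, rng)
    features = feature_subset(n_features, 2, rng)
    print(f"tree {tree_id}: rows={rows} features={features}")

# Suppose three trained trees voted as follows for one student.
print("final prediction:", majority_vote(["pass", "fail", "pass"]))
```

Because every tree sees different rows and different attributes, their errors are less correlated, which is what lets the combined vote outperform any single tree.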
Understanding the Why in our Model
In many domains, including education, knowing what a machine learning model predicts is not as important as knowing why the model made that prediction.
Explainable AI allows us to understand the reasoning behind AI-driven predictions and gives insight into the key indicators of a student’s success. It also allows subject matter experts to analyze machine learning decisions and find inconsistencies that can be used to improve model performance.
While ML models may seem like black box solutions that perform magic, explainable AI sheds some light on what is happening. With this motivation, we created a pipeline that, given a student and their educational data, generates a prediction along with the primary indicators the model used to reach it, as in the figure below.
Merging Data Science and Education
In conclusion, we have not only used machine learning to predict student performance but have also showcased the use of explainable AI to add transparency to our predictions. By combining the extensive educational data available to our team with modern data science techniques, we achieved a model F1-Score of 83.5%. While the model performance is strong on its own, we understand the importance of answering the question of ‘why’ the model infers its answers, and we made use of explainable AI techniques to do so. The use of explainable AI in our pipeline offers educators the opportunity to understand the critical factors influencing a student’s success.