Predicting Student Performance Through Machine Learning

By Connor Dolan

With the level of data collection in modern education, there is growing interest in using this data to identify students who may be falling behind and to pinpoint areas in which they can improve. Using the extensive data collected by educational institutions, machine learning can help educators by flagging students who are at risk of failing a course. While a statistical model that can predict whether a student will score proficiently on an exam is powerful, being able to identify the factors behind each prediction is just as important. This article walks through our work in predicting student performance through machine learning and our findings.

Identifying Data

Given the educational data available across a state, it is first necessary to identify the data attributes that serve as strong indicators of a student’s success. We started with an extensive literature review of previous research, which gave us insight into which areas to focus on. Following this literature review, we ran our own exploratory analysis to find trends and relationships within our data. During our analysis, we used many data visualizations as well as statistical tests such as:

Shapiro-Wilk Test – examines how closely data fits a normal distribution
Two-Tailed T-Test – tests whether the means of two groups are significantly different, using a significance level of 0.05
Mann-Whitney U-Test – tests whether two groups differ when the data does not follow a normal distribution; unlike the T-Test, it does not assume a specific distribution
Correlation Coefficients – a statistical measure of the strength and direction of the linear relationship between two variables
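As a small illustration of the last item, here is a minimal sketch of the Pearson correlation coefficient written with only the Python standard library. The article does not say which tooling was used; in practice a statistics library such as SciPy (whose `scipy.stats` module provides `shapiro`, `ttest_ind`, `mannwhitneyu`, and `pearsonr`) would likely be the more convenient choice. The data and variable names below are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: last year's test score vs. this year's exam score.
prior_scores = [410, 455, 480, 520, 560, 600]
exam_scores = [58, 64, 63, 75, 80, 88]
r = pearson_r(prior_scores, exam_scores)
print(f"Pearson r = {r:.3f}")
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship.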

We identified student demographic data, past student performance, and historical standardized test scores as some of our strongest variables for predicting student performance.

Refining Educational Data for Predictions

Once we selected which variables we would train a predictive model on, we had to run some data preparation steps. The raw data can have many issues, such as missing data, invalid data types, and non-standardized numerical data. We started by standardizing our numerical data because different tests are scored on different scales. Standardizing this data is important because, in many models, variables of larger magnitude influence the predictions more than those of smaller magnitude. Standardizing them puts all variables on a comparable scale. Next, because some of our data is non-numerical, we encoded these data points using a numerical representation of the character, Boolean, or string.
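The two preparation steps above can be sketched as follows. This is a minimal, standard-library illustration under stated assumptions: z-score standardization is used as the scaling method, and one-hot encoding is used as the numerical representation, which is one common choice (the article does not specify the encoding scheme). All column names and values are hypothetical.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def one_hot(values):
    """Encode a categorical column as one 0/1 column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical columns: two tests on different scales and one categorical attribute.
test_a = [12, 15, 18, 21]        # e.g. scored out of 36
test_b = [400, 500, 600, 700]    # e.g. scored out of 800
program = ["standard", "honors", "standard", "honors"]

z_a = standardize(test_a)
z_b = standardize(test_b)
categories, encoded = one_hot(program)
```

After standardization, the two test columns are directly comparable even though their raw scales differ by more than an order of magnitude.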

Finally, missing data refers to data samples that lack certain attributes, such as a student who has no standardized test score from the previous year. Many machine learning models struggle to handle samples with missing data, which motivated us to impute these values. Simple imputation methods fill the gap with the mean, median, or mode of the column, but we can do better. We used a K-Nearest Neighbor algorithm that uses the other data attributes of a sample to predict what the missing value would be. It does this by finding the K students most similar to that student based on the non-missing data and inferring the missing value from those K samples.
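The idea can be sketched in a few lines of plain Python. This is an illustrative sketch, not the pipeline used in this work: it assumes standardized numeric features, Euclidean distance as the similarity measure, and a simple average of the neighbors' values. The function name and data are hypothetical. Libraries such as scikit-learn offer a ready-made `KNNImputer` for the same purpose.

```python
import math

def knn_impute(rows, target_row, missing_idx, k=3):
    """Estimate rows[target_row][missing_idx] from the k most similar rows.

    Similarity is Euclidean distance over the target's non-missing columns.
    Missing values are represented as None.
    """
    target = rows[target_row]
    known = [i for i, v in enumerate(target) if v is not None and i != missing_idx]
    candidates = []
    for j, row in enumerate(rows):
        # Skip the target itself and any row that cannot supply the value.
        if j == target_row or row[missing_idx] is None:
            continue
        if any(row[i] is None for i in known):
            continue
        dist = math.sqrt(sum((row[i] - target[i]) ** 2 for i in known))
        candidates.append((dist, row[missing_idx]))
    if not candidates:
        raise ValueError("no complete neighbors available")
    candidates.sort(key=lambda pair: pair[0])
    nearest = candidates[:k]
    return sum(value for _, value in nearest) / len(nearest)

# Hypothetical standardized features; the last student is missing column 2.
rows = [
    [0.9, 1.1, 1.0],
    [1.0, 0.9, 1.2],
    [-1.0, -1.1, -0.9],
    [-0.9, -1.0, -1.1],
    [1.1, 1.0, None],
]
imputed = knn_impute(rows, target_row=4, missing_idx=2, k=2)
```

Here the two nearest students are the first two rows, so the imputed value is the average of their values in the missing column.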

Choosing the Right Classifier for our Data

The next step in our process was to identify the classification algorithms we wanted to apply to our dataset and select the best-performing one for predicting student performance. We chose the top-performing algorithms from our literature review of previous work that applied machine learning to the education domain. The algorithms we experimented with were XGBoost, RandomForestClassifier, Neural Networks, Support Vector Machine, and Logistic Regression. To evaluate the performance of each of these models, we set up a K-Fold Cross Validation pipeline in which we split the dataset into K groups. Each group is used as the test set exactly once while the remaining groups are used for training, which gives more stable results than a single train/test split.
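The splitting logic behind K-Fold Cross Validation can be sketched as below. This is a simplified, standard-library illustration with hypothetical function names; it shuffles once with a fixed seed and does not stratify by class, which a production pipeline might want to do.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) so each sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k roughly equal groups.
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

def cross_validate(n_samples, k, fit_and_score):
    """Average a score over k folds; fit_and_score(train, test) returns a float."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

splits = list(k_fold_indices(n_samples=10, k=5))
```

Each candidate model would be passed in through `fit_and_score`, trained on the `train` indices, scored on the `test` indices, and compared on the averaged score.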

Following our experiments, we found that the RandomForestClassifier yielded the best results, with an F1-Score of 83.5%. We also added feature selection to our pipeline to remove redundant features. Features are often correlated with one another, which means that little additional information is gained from including both. Below is a figure visualizing how model performance changes as more features are included. The F1-Score plateaus, and adding more attributes beyond that point yields no lift in performance.

[Figure: F1-Score as a function of the number of features included]
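For reference, the F1-Score used throughout this section is the harmonic mean of precision and recall. The following minimal sketch computes it for binary labels; the labels are made-up example data, not results from this work.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 1 = proficient, 0 = not proficient.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
score = f1_score(y_true, y_pred)
```

Because it balances false positives against false negatives, F1 is often preferred over plain accuracy when the two classes are not equally common.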

The Magic Behind the Random Forest Algorithm

The Random Forest algorithm is a popular machine learning algorithm built from many decision trees. A decision tree works by asking simple questions about a data sample and following different paths down the tree until it reaches a result node. A Random Forest is an ensemble of such trees whose outputs are aggregated to make a final prediction. The algorithm relies on two techniques: bagging and feature bagging. Bagging means that each decision tree is trained on a random sample of the dataset, typically drawn with replacement. Feature bagging means that each decision tree is trained on only a subset of the available data attributes. Using this method, we create numerous weakly performing decision trees that, when combined, produce a well-performing model for predicting student performance.
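To make bagging and feature bagging concrete, here is a deliberately tiny sketch in plain Python. It is not the model used in this work: each "tree" is a one-question decision stump standing in for a full decision tree, features are sampled once per tree as described above (many implementations instead resample features at every split), and all names and data are hypothetical.

```python
import random
from collections import Counter

def train_stump(X, y, feature_indices):
    """Pick the single-feature threshold rule with the fewest training errors."""
    best = None
    for f in feature_indices:
        for threshold in sorted({row[f] for row in X}):
            for above_label in (0, 1):
                preds = [above_label if row[f] > threshold else 1 - above_label for row in X]
                errors = sum(p != t for p, t in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, above_label)
    _, f, threshold, above_label = best
    return lambda row: above_label if row[f] > threshold else 1 - above_label

def train_forest(X, y, n_trees=25, n_features=2, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        # Bagging: draw a bootstrap sample of rows (with replacement).
        sample = [rng.randrange(len(X)) for _ in range(len(X))]
        # Feature bagging: this tree only sees a random subset of columns.
        features = rng.sample(range(len(X[0])), n_features)
        trees.append(train_stump([X[i] for i in sample], [y[i] for i in sample], features))
    return trees

def predict(trees, row):
    """Aggregate the individual trees by majority vote."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical standardized features; label 1 = proficient, 0 = not proficient.
X = [
    [-1.2, -0.9, -1.1], [-0.8, -1.3, -0.7], [-1.0, -0.6, -0.9],
    [-0.5, -0.8, -1.4], [-0.7, -1.1, -0.6], [0.6, 0.9, 0.8],
    [1.1, 0.7, 1.2], [0.9, 1.3, 0.5], [1.4, 0.8, 1.0], [0.7, 1.0, 0.9],
]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
trees = train_forest(X, y)
```

Calling `predict(trees, row)` on a new student's feature row returns the class that most of the individual stumps voted for.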

Understanding the Why in our Model

In many domains, including education, knowing the outcome of a machine learning model is not as important as knowing why the model made that prediction.

Explainable AI allows us to comprehend the reasoning behind AI-driven predictions and gives insight into the key indicators of a student’s success. Explainable AI also allows subject matter experts to analyze machine learning decisions and find inconsistencies that can be used to improve model performance.

While ML models may seem like black box solutions that perform magic, explainable AI sheds some light on what is happening. Using this motivation, we created a pipeline that, given a student and their educational data, generates a prediction and the primary indicators the model used to generate its prediction, such as in the figure below.
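To make the idea of "a prediction plus its primary indicators" concrete, here is a minimal sketch. The article does not name the explanation technique that was used, so this example substitutes the simplest possible case: a linear (logistic) scoring model, where each feature's contribution is its weight times its value. The weights, feature names, and student values are all hypothetical, and explaining a Random Forest would require a dedicated attribution method rather than this shortcut.

```python
import math

def explain_prediction(weights, bias, features, top_n=2):
    """Return (probability, top contributions) for one student under a linear model."""
    # Per-feature contribution to the score: weight * standardized value.
    contributions = {name: weights[name] * value for name, value in features.items()}
    logit = bias + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    # Rank features by how strongly they pushed the score in either direction.
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return probability, ranked[:top_n]

# Hypothetical model weights and one student's standardized features.
weights = {
    "last_year_test_score": 1.5,
    "prior_course_grade": 1.0,
    "two_years_ago_test_score": 0.5,
}
student = {
    "last_year_test_score": -1.2,
    "prior_course_grade": 0.3,
    "two_years_ago_test_score": -0.4,
}
probability, indicators = explain_prediction(weights, bias=0.0, features=student)
for name, contribution in indicators:
    print(f"{name}: {contribution:+.2f}")
```

In this example the low score on last year's test is the dominant factor pulling the predicted probability of proficiency below 50%, which is the kind of per-student signal an educator could act on.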

Merging Data Science and Education

In conclusion, we have not only used machine learning to predict student performance but have also showcased the use of explainable AI to add transparency to our predictions. By leveraging the extensive educational data available to our team and modern data science techniques, we achieved a model F1-Score of 83.5%. While the model performance is strong on its own, we understand the importance of answering the question of ‘why’ the model makes each prediction, and we use explainable AI techniques to do so. The use of explainable AI in our pipeline offers educators the opportunity to understand the critical factors influencing a student’s success.

About Connor Dolan

Connor is a Consultant on the Artificial Intelligence team.
