Predicting Student Performance: Part 1

Abstract — This project focuses on predicting student performance and looking for trends in the data. It matters because education is a key factor in achieving long-term economic progress. I want to understand the effects that demographic, social and economic factors have on student performance. My method is to apply classification algorithms to predict student performance. My goal is to dig deeper into what drives student performance.

Keywords — Student, classification, performance, education.

I. INTRODUCTION

The main goal of this project is to use data collected from students in two schools in Portugal to see whether it is possible to predict if a student will pass or fail a course. Using classification algorithms, I will predict whether the student's final score is a pass or a fail. The classification methods that I use in my analysis are logistic regression, k-nearest neighbors, decision tree, random forest, gradient boosting and AdaBoost. To evaluate my results, I will compute a confusion matrix, accuracy, recall, precision and F1 score for each of my classifiers.

II. BACKGROUND

Student performance is an essential concern for higher education institutions, because one of the criteria by which universities are judged is their record of academic achievement. Most higher education institutions in Portugal use final grades to evaluate student performance. By analyzing student performance, an institution can plan a strategic support program for students during their period of study.

III. EXPERIMENTS/METHODOLOGY

1. DATASET DESCRIPTION

The “Student Performance” data set can be found in the UCI Machine Learning Repository of the University of California, Irvine (https://archive.ics.uci.edu/ml/datasets/student+performance).

The dataset contains two tables: one from the math course and another from the Portuguese course. I merged these two tables to have more data for my analysis.
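
As a minimal sketch of such a merge, assuming both tables are semicolon-separated text with identical columns, the two tables can simply be stacked row-wise. The inline sample rows, the added `course` column, and the helper names `read_table` and `merge_tables` are illustrative assumptions, not part of the original analysis.

```python
import csv
import io

# Tiny stand-ins for the two course tables (hypothetical rows).
MATH_CSV = "school;sex;age;G3\nGP;F;18;6\nGP;M;17;15\n"
PORTUGUESE_CSV = "school;sex;age;G3\nGP;F;18;11\nMS;M;16;9\nMS;F;17;13\n"


def read_table(text, course):
    """Parse a semicolon-separated table and tag each row with its course."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter=";"))
    for row in rows:
        row["course"] = course  # remember which table the row came from
    return rows


def merge_tables(math_rows, portuguese_rows):
    """Stack the two tables row-wise; both must share the same columns."""
    return math_rows + portuguese_rows


students = merge_tables(
    read_table(MATH_CSV, "math"),
    read_table(PORTUGUESE_CSV, "portuguese"),
)
print(len(students))  # 2 math rows + 3 Portuguese rows
```

Keeping a `course` tag is optional, but it makes it possible to tell afterwards which course a row came from.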

A few characteristics of my data set are:

It has 1044 entries and 33 features.
Most of my features are categorical values, which need to be encoded for the classification analysis.
The dataset consists of demographic, social and economic characteristics of a student.
My target value is the final score.
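
The text does not say which encoding was applied to the categorical features. One common option is one-hot encoding, sketched below in plain Python; the function name `one_hot_encode` and the sample rows are hypothetical.

```python
def one_hot_encode(rows, categorical_columns):
    """Replace each categorical column with one 0/1 column per category."""
    # Collect the sorted set of categories seen in each column.
    categories = {
        col: sorted({row[col] for row in rows}) for col in categorical_columns
    }
    encoded = []
    for row in rows:
        # Keep the non-categorical columns unchanged.
        new_row = {k: v for k, v in row.items() if k not in categories}
        for col, values in categories.items():
            for value in values:
                new_row[f"{col}_{value}"] = 1 if row[col] == value else 0
        encoded.append(new_row)
    return encoded


# Hypothetical rows; the real table has many more columns.
sample = [
    {"school": "GP", "sex": "F", "age": 18},
    {"school": "MS", "sex": "M", "age": 16},
]
encoded_sample = one_hot_encode(sample, ["school", "sex"])
print(encoded_sample[0])
```

Each categorical column becomes one indicator column per category, so the classifiers only ever see numeric inputs.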

2. TARGET FEATURE

My target feature is the final score. It contains numeric values that needed to be encoded as categorical values for my analysis. The feature ranges from zero to twenty. I labeled scores below ten as fail and scores from ten to twenty as pass. Around 24% of students did not pass the course and around 77% passed, so the classes are imbalanced.
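
The binarization can be sketched as follows, assuming that a score of exactly ten counts as a pass. The constant `PASS_THRESHOLD`, the helper names, and the example scores are illustrative and not taken from the dataset.

```python
from collections import Counter

PASS_THRESHOLD = 10  # assumed cut-off: scores of 10 or more count as a pass


def to_label(final_score):
    """Map a 0-20 final score to 'pass' or 'fail'."""
    return "pass" if final_score >= PASS_THRESHOLD else "fail"


def class_shares(scores):
    """Return the fraction of each label among the given scores."""
    counts = Counter(to_label(s) for s in scores)
    total = len(scores)
    return {label: count / total for label, count in counts.items()}


# Hypothetical scores, not taken from the dataset.
example_scores = [0, 8, 9, 10, 12, 15, 18, 20]
shares = class_shares(example_scores)
print(shares)
```

Comparing the class shares right after labeling is a quick way to spot an imbalance before any model is trained.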

3. CORRELATION

Before building my classification models, I computed the correlations between features because I wanted to see whether I had highly correlated features that needed to be removed. I found that the first period grade and the second period grade were highly correlated with the final period grade, so I removed these two features from my analysis.
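
A minimal sketch of this check, assuming Pearson correlation and an illustrative cut-off of 0.8 (the text does not state which threshold was used). The column values are made up; `G1`, `G2` and `G3` stand for the first, second and final period grades.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def highly_correlated(columns, target, threshold=0.8):
    """Names of columns whose |correlation| with the target exceeds threshold.

    The default threshold of 0.8 is an assumption made for this sketch.
    """
    return [
        name
        for name, values in columns.items()
        if name != target and abs(pearson(values, columns[target])) > threshold
    ]


# Hypothetical values, not taken from the dataset.
columns = {
    "G1": [5, 8, 10, 12, 15, 18],
    "G2": [6, 8, 11, 12, 14, 19],
    "absences": [4, 0, 10, 2, 6, 3],
    "G3": [5, 9, 11, 12, 15, 19],
}
to_drop = highly_correlated(columns, target="G3")
print(to_drop)
```

Features flagged this way would let a model recover the target almost directly, which is why they are dropped before training.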

3. CLASS IMBALANCE

To address the class imbalance problem, I tried two undersampling techniques, Tomek links and cluster centroids, and two oversampling techniques, SMOTE and oversampling followed by undersampling. The best result for my dataset came from SMOTE (Synthetic Minority Oversampling Technique). It works by randomly picking a point from the minority class and finding its k nearest neighbors within that class. A new synthetic point is then placed between the chosen point and one of those neighbors.
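
The interpolation step can be illustrated with a simplified plain-Python sketch. It is not the library implementation used for the analysis; the function name `smote`, its parameters, and the sample points are assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class points."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        point = rng.choice(minority)
        # k nearest minority-class neighbours of the chosen point (excluding itself).
        others = [p for p in minority if p is not point]
        neighbours = sorted(others, key=lambda p: math.dist(point, p))[:k]
        neighbour = rng.choice(neighbours)
        # Interpolate at a random position on the segment point -> neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(point, neighbour))
        )
    return synthetic


# Hypothetical two-feature minority samples.
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]
new_points = smote(minority_points, n_new=6, k=2)
print(len(new_points))  # 6 synthetic points
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region the minority class already occupies.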

4. DATA MINING ALGORITHMS

To model my data, I used logistic regression, k-nearest neighbors and a decision tree classifier. First I fitted each classifier with its default parameters, then I tuned the parameters of each classifier. After tuning, the accuracy and the F1 score of each classifier increased by around 5% to 8%. However, none of these models was very strong. Next, I tried ensemble methods because I expected them to increase the accuracy and the F1 score. The ensembles that I used were random forest, gradient boosting and AdaBoost. I followed the same procedure: I fitted the ensembles with their default parameters first and then tuned their parameters. As with the single classifiers, the accuracy and the F1 score increased after tuning.
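
The idea of comparing parameter settings can be sketched with a small grid search. To keep the example dependency-free, a plain-Python k-nearest-neighbors classifier stands in for the models above, and leave-one-out accuracy is used as the selection criterion. Both choices, the candidate values of k, and the data points are assumptions made for illustration, not the procedure used in the analysis.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority label among the k training points closest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out accuracy of k-NN on (features, label) pairs."""
    hits = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]  # hold out the i-th sample
        hits += knn_predict(train, features, k) == label
    return hits / len(data)


def tune_k(data, candidates):
    """Return the candidate k with the best leave-one-out accuracy."""
    scores = {k: loo_accuracy(data, k) for k in candidates}
    best_k = max(scores, key=scores.get)  # first candidate wins ties
    return best_k, scores


# Hypothetical two-feature data set with labels.
data = [
    ((1.0, 1.0), "fail"), ((1.2, 0.8), "fail"), ((0.9, 1.3), "fail"),
    ((3.0, 3.1), "pass"), ((3.2, 2.9), "pass"), ((2.8, 3.3), "pass"),
    ((3.1, 3.4), "pass"),
]
best_k, scores = tune_k(data, candidates=[1, 3, 5])
print(best_k, scores)
```

The same loop generalizes to any parameter grid: score each setting on held-out data and keep the best one.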

5. EVALUATION METRICS

To evaluate my results, I computed a confusion matrix, accuracy, recall, precision and F1 score for each of my classifiers.
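
For a binary problem, all four scores follow from the confusion matrix counts. The sketch below assumes that "pass" is treated as the positive class, which the text does not specify; the labels and predictions are made up.

```python
def confusion_counts(y_true, y_pred, positive="pass"):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def scores_from_counts(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the four confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels and predictions.
y_true = ["pass", "pass", "pass", "pass", "fail", "fail", "fail", "pass"]
y_pred = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
counts = confusion_counts(y_true, y_pred)
metrics = scores_from_counts(*counts)
print(counts, metrics)
```

With imbalanced classes, precision, recall and F1 are more informative than accuracy alone, since a model that always predicts the majority class can still reach a high accuracy.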

VII. RESULTS

The classifier with the highest accuracy and F1 score was gradient boosting, with an accuracy of 88.02% and an F1 score of 89.16%.

I also plotted the feature importances of this model and found that mother’s education was the most important feature. This is plausible if mothers are the primary caregivers for most students.

VIII. CONCLUSION

Being able to predict whether a student will fail a course with these tools may help schools intervene early and prevent student failure. Since my data set comes from two schools in Portugal, I would like to find more data from other countries so that the predictions can apply more broadly. I would also like to build another model that classifies students into multiple labels (poor, fair and good). Finally, I would like to build a linear regression model, since my target value is numeric, and see whether the results change.

