This project examines a dataset of over 4,500 entries and 37 variables used to predict students’ academic success. According to the accompanying documentation, the data was collected from several disjoint sources within the Polytechnic Institute of Portalegre’s databases. From this data, a machine learning model was trained on past students who had already graduated or dropped out in order to predict academic outcomes. Some of the variables are personal characteristics, such as race or gender. Does the structure or selection of these variables reflect biases or other assumptions about students’ backgrounds, and does that affect the outcome of their academic prediction? Our method of reading the data centered on understanding why each variable was included and what it contributes to the prediction. Each variable has a purpose: some, such as demographic characteristics, serve analytical ends, letting us ask after the fact whether personal attributes relate to the predicted outcome, while others, such as first- and second-term grades, correlate directly with the outcome. To gauge how each variable relates to the prediction, we fit a simple linear regression for each one and examined its R and F values. A higher R value (and correspondingly higher F value) indicates a stronger linear association with the outcome, giving a correlation-style measure for each variable. Our findings showed that male students and older students were often predicted to drop out at higher rates than other students. This may point to assumptions the model encodes about these personal attributes, even though such characteristics are not typically targets of direct discrimination in society. Even so, other variables in the data, such as nationality and displacement status, carry genuine explanatory value, since they can affect a student’s life outside the classroom.
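The per-variable regression step described above can be sketched as follows. This is a minimal illustration, not the project's actual analysis code: the column names and data are synthetic stand-ins for the real dataset, and the binary outcome (1 = dropout, 0 = graduate) is an assumed encoding.

```python
# Hedged sketch: simple linear regression of a binary dropout outcome on
# each variable separately, reporting R^2 and the F-statistic.
# All data and column names here are synthetic, not the real dataset.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors (hypothetical names for illustration only).
predictors = {
    "age_at_enrollment": rng.normal(22, 4, n),
    "first_term_grade": rng.normal(12, 3, n),
}
# Synthetic binary outcome: 1 = dropout, 0 = graduate.
outcome = (rng.random(n) < 0.3).astype(float)

for name, x in predictors.items():
    res = stats.linregress(x, outcome)
    r2 = res.rvalue ** 2
    # For simple linear regression with one predictor, the overall
    # F-statistic equals t^2: F = R^2 * (n - 2) / (1 - R^2).
    f_stat = r2 * (n - 2) / (1 - r2)
    print(f"{name}: R^2 = {r2:.3f}, F = {f_stat:.2f}")
```

Variables with larger R and F values show a stronger linear association with the predicted outcome, which is the ranking criterion the analysis relies on.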
This project examines a dataset used by a machine learning model to predict student outcomes in higher education, and looks for assumptions or bias in those predictions.