Predicting Graduate Admissions using Linear Regression

Tanwir Khan
5 min read · Oct 16, 2019


The dataset and the content described below have been taken from Kaggle to aid the reader's understanding. Please click the link below to download the dataset.

The dataset version that has been used here is Admission_Predict_Ver1.1.csv

Context

This dataset is created for prediction of Graduate Admissions from an Indian perspective.

Content

The dataset contains several parameters which are considered important during the application for Masters Programs. The parameters included are:

  1. GRE Scores (out of 340)
  2. TOEFL Scores (out of 120)
  3. University Rating (out of 5)
  4. Statement of Purpose and Letter of Recommendation Strength (out of 5)
  5. Undergraduate GPA (out of 10)
  6. Research Experience (either 0 or 1)
  7. Chance of Admit (ranging from 0 to 1)

Acknowledgements

This dataset is inspired by the UCLA Graduate Dataset. The test scores and GPA are in the older format. The dataset is owned by Mohan S Acharya.

Inspiration

This dataset was built with the purpose of helping students in shortlisting universities with their profiles. The predicted output gives them a fair idea about their chances for a particular university.

Citation

Please cite the following if you are interested in using the dataset : Mohan S Acharya, Asfia Armaan, Aneeta S Antony : A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science 2019

GOAL

Our goal here would be to predict the “Chance of Admit” based on the different parameters that are provided in the dataset.

We will achieve this goal by using the Linear Regression model.

Based on the data that we have, we will split our data into training and testing sets. The training set will have the features and labels on which our model will be trained. The label here is the “Chance of Admit”. If you think of it from a non-technical standpoint, the label is basically the output that we want, and the features are the parameters that drive us towards that output. Once our model is trained, we will run it on the test set and predict the output. Then we will compare the predicted results with the actual results to see how our model performed.

This whole process of training the model using features and known labels and later testing it to predict the output is called Supervised Learning.
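The feature/label separation and train/test split described above can be sketched as follows. This is a minimal illustration with scikit-learn; the label column name is an assumption based on the dataset description (the actual CSV header in Admission_Predict_Ver1.1.csv may differ slightly, e.g. trailing spaces).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_features_labels(df, label_col="Chance of Admit", test_size=0.2, seed=42):
    """Separate the label column from the features, then split into train/test sets."""
    X = df.drop(columns=[label_col])  # features: GRE, TOEFL, GPA, etc.
    y = df[label_col]                 # label: the value we want to predict
    return train_test_split(X, y, test_size=test_size, random_state=seed)
```

With `test_size=0.2`, one row in five is held back for testing, and the fixed `random_state` makes the split reproducible.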

Now, before diving into how to prepare and model our data, let’s get a brief understanding of how Linear Regression works.

Linear Regression

It is a statistical method used to obtain a formula for predicting the values of one variable from another when there is a relationship between the two variables.

The formula for simple linear regression is that of a straight line: y = mx + c

The variables y and x in the formula are the ones whose relationship will be determined.

Both the variables are named as below:

  1. y : Dependent variable
  2. x : Independent variable

The above equation is the slope-intercept form of a line, in which y denotes the dependent variable, c denotes the intercept, m denotes the slope, and x is the independent variable.

So, given the independent variable x, the regression model computes the values of c and m that minimize the sum of squared differences between the actual values of the dependent variable y and the values of y predicted by the line.
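For simple linear regression, the slope m and intercept c that minimize the squared differences can be computed in closed form. A small illustration with numpy, using made-up numbers that lie exactly on the line y = 2x + 1:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])  # exactly y = 2x + 1

# Least-squares slope and intercept for y = m*x + c:
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()
# m is ~2.0 and c is ~1.0, recovering the line the points were drawn from
```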

To create the Linear Regression model, we will use Python.

P.S.: I have embedded screenshots of the Jupyter notebook here, as there is currently an issue with GitHub Gist that prevented me from rendering the notebook there.

However, I am embedding the Google Colab version of the Jupyter notebook here. Please click on it to see the code.
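Since the notebook itself is only embedded as screenshots, here is a minimal sketch of the modelling steps with scikit-learn. The file name matches the one mentioned above; the label column name is an assumption based on the dataset description, and this is a simplified outline rather than the exact notebook code.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

def train_admission_model(df, label_col="Chance of Admit"):
    """Fit a linear regression on the admission data and report the test-set R^2."""
    X = df.drop(columns=[label_col])
    y = df[label_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    model = LinearRegression().fit(X_train, y_train)
    return model, r2_score(y_test, model.predict(X_test))

# Usage (assuming the CSV is in the working directory):
# df = pd.read_csv("Admission_Predict_Ver1.1.csv")
# model, score = train_admission_model(df)
```

The R² score compares the model’s predictions on the held-out test set against the actual “Chance of Admit” values, which is exactly the predicted-vs-actual comparison described earlier.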
