The final year of college is a very different experience from the previous years as a student. It is the year our focus changes from academics to life outside as we begin taking part in campus recruitment activities. The priorities shift from scoring better grades to scoring a better salary. In this regard, the Campus Recruitment data set on Kaggle really intrigued me. The data set contains placement data of MBA students from a college in India. Although the data set is not very large, each record is highly granular: every row contains the complete academic details of a student, from their percentage scores at the Secondary and Higher Secondary levels all the way to placement and salary.
There were plenty of categorical and numerical variables in the dataset, which gave it a lot of versatility and made it a fit for several kinds of analysis. The dataset contains data on 215 students, out of which 148 were able to secure placement. In this blog we share some key insights we got from the data as well as their implications. It is important to highlight that the dataset is limited in nature and thus should be understood in its own context, without generalizing it to campus recruitment at large.
The data cleaning process began by understanding the data. We started off by examining the data to check for missing values while also looking at important summary statistics and the data types of the different columns. Using the head() function, we examined the first few rows, which showed missing values in the salary column. This suggested there might be more in the dataset, and a full count found 67 missing values in the “Salary” column. This left us with a decision on how to deal with the missing values. It is important to know that salary was displayed only for those students who were placed. There were a number of alternative options.
1. We could delete rows with missing values. This would remove every student who was not placed, so it was not a good idea.
2. We could fill the missing values with the mean salary. This would be completely wrong, since students who were not placed should not have a salary.
3. We could fill the missing values with 0. This keeps the column numeric and ensures consistency.
Thus, we opted for the third option. The “serial number” column was dropped to minimize redundancy since it was exactly the same as the index. Lastly, to cap off the data cleaning process we decided to change the column names to make it easier to understand what each of them contained. To ensure consistency, each column name started with a capital letter.
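The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration on a four-row toy frame rather than the real file, and the column names (`sl_no`, `gender`, `status`, `salary`) and the renamed versions are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle file; column names are assumptions for illustration.
raw = pd.DataFrame({
    "sl_no": [1, 2, 3, 4],
    "gender": ["M", "F", "M", "F"],
    "status": ["Placed", "Not Placed", "Placed", "Not Placed"],
    "salary": [270000.0, np.nan, 250000.0, np.nan],
})

print(raw.head())          # first rows, where the missing salaries show up
print(raw.dtypes)          # data type of each column
print(raw.isnull().sum())  # missing values per column

clean = (
    raw.drop(columns=["sl_no"])   # serial number duplicates the index
       .fillna({"salary": 0})     # option 3: not placed means salary 0
       .rename(columns={"gender": "Gender", "status": "Status", "salary": "Salary"})
)

# A column holding NaN is stored as float, so cast back once the gaps are filled.
clean["Salary"] = clean["Salary"].astype(int)
print(clean)
```

Note that the missing salaries line up exactly with the students who were not placed, which is why filling with 0 is a statement about placement rather than a guess at a value.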
Exploring the dataset (EDA):
EDA is an extremely broad process which gives us useful insights into the dataset. However, these insights should be based on questions that we aim to answer; otherwise there will be too much information to make sense of. We use the power of statistics and visualization to obtain insights and answer those questions. Since the ultimate aim of this exercise is to use machine learning to predict whether a student will get placed, it is important to study how the different variables relate to placement. Since salary is the result of placement, we should also be able to predict it given a student record. With this in mind we prepared six questions, each of which deals with a certain aspect of the data set.
Q1. What is the count for each of the categorical variables?
Understanding the composition of each category is important for further analysis. For example, a difference between the number of males and females means we can’t use sums to compare their salaries. Rather, we should use proportions or a measure of central tendency like the mean or median.
Bar charts for each of the categorical variables generated the following insights. The number of male students is greater. Commerce and Management was the most popular undergrad degree choice, while Marketing and Finance was the preferred MBA specialization. Most of the students did not have work experience, yet most of the students were able to secure jobs.
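The counts behind bar charts like these can be obtained with `value_counts`. The sketch below uses a made-up six-row frame, and the column names are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows; column names are assumptions for illustration.
df = pd.DataFrame({
    "Gender": ["M", "M", "F", "M", "F", "M"],
    "Work_Experience": ["No", "Yes", "No", "No", "No", "Yes"],
    "Status": ["Placed", "Placed", "Not Placed", "Placed", "Placed", "Not Placed"],
})

categorical_cols = ["Gender", "Work_Experience", "Status"]

# One Series of class counts per categorical column
counts = {col: df[col].value_counts() for col in categorical_cols}

for col, series in counts.items():
    print(series, end="\n\n")
    # series.plot(kind="bar", title=col) would draw the bar chart (needs matplotlib)
```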
Q2. How does placement compare across three categorical variables, i.e. Under-Grad Degree Type, MBA Specialization and Work Experience?
The reason for selecting these three variables was basic intuition. It is natural for people to go for a certain degree or specialization in order to have a better chance of landing a job. Similarly, internships or interim jobs also help in securing better jobs or more salary. The bar graphs below illustrate these differences in terms of placement.
As pointed out before, this is where proportion comes into play. At first glance it seems as if people without work experience have a higher chance of being both placed and not placed, simply because there are more of them. Seems weird, right? Well, if we look at the graph, roughly 60 out of 70 people with experience were placed, which is a greater proportion than for people without experience.
In this case, however, Commerce and Management students do indeed have the higher placement, since they have not only the higher absolute number of placed students but also the greater proportion, 0.71 against 0.66 for Sci & Tech. This difference, however, is too small to have any significance.
Marketing and Finance has the highest absolute number as well as the greater proportion for placement.
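The difference between absolute counts and proportions can be made concrete with `pd.crosstab`. The toy data below is invented so that both groups have the same number of placed students but very different placement rates; the column names are assumptions.

```python
import pandas as pd

# Invented rows: 3 students with experience, 5 without.
df = pd.DataFrame({
    "Work_Experience": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"],
    "Status": ["Placed", "Placed", "Not Placed", "Placed", "Placed",
               "Not Placed", "Not Placed", "Not Placed"],
})

# Absolute counts: what a plain grouped bar chart shows
abs_counts = pd.crosstab(df["Work_Experience"], df["Status"])

# Row-wise proportions: the share of each group that was placed
proportions = pd.crosstab(df["Work_Experience"], df["Status"], normalize="index")
placed_share = proportions["Placed"]

print(abs_counts)
print(placed_share)
```

Here both groups have two placed students, yet two thirds of the experienced group is placed against 40% of the other group, which is the comparison that matters.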
Q3. Is there a Gender-bias in the data?
The gender pay gap is a well-known and well-documented issue. We decided to generate our own insights on this complex issue. The following graph illustrates the average salary for each.
This shows that the average salary for males is greater. However, it is important to analyze this further. To do so, we controlled for different categorical variables to check whether the differences were purely because of gender or whether there were other factors to account for, such as specialization or degree type.
Interestingly, the graph shows that for two of the undergrad degree types, i.e. “Sci&Tech” and “Comm&Mgmt”, males enjoy a higher salary, but for other degrees females have a higher average salary. Moving on!
We’ll now compare the two on the basis of percentages for MBA, Undergrad as well as Employability test.
Comparing bins for each of them shows that males generally enjoy greater salary in most of the percentage bins.
For both cases of work experience, males have a higher average salary. Now the question is, does this gender bias translate into job placement as well? To answer this, we compared the proportion of each gender that was placed, for example the percentage of females placed out of the total number of female students.
The graphs show that 70% of males were placed compared to 60% of females. Thus, for this dataset we may say that a gender bias exists. However, the results are not generalizable given the limited data.
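A rough sketch of the comparisons in this section, using `groupby` on invented rows. The column names are assumptions, and so is the choice to average salary over placed students only; the figures here are made up and do not reproduce the 70% and 60% quoted above.

```python
import pandas as pd

# Invented rows; column names are assumptions for illustration.
df = pd.DataFrame({
    "Gender": ["M", "M", "M", "F", "F", "F"],
    "Degree_Type": ["Sci&Tech", "Comm&Mgmt", "Others",
                    "Sci&Tech", "Comm&Mgmt", "Others"],
    "Status": ["Placed", "Placed", "Not Placed", "Placed", "Not Placed", "Placed"],
    "Salary": [300000, 280000, 0, 260000, 0, 250000],
})

# Assumption: average salary is taken over placed students only,
# so the zero-filled salaries do not drag the means down.
placed = df[df["Status"] == "Placed"]
avg_salary = placed.groupby("Gender")["Salary"].mean()

# Controlling for degree type: one row per degree, one column per gender
avg_by_degree = placed.groupby(["Degree_Type", "Gender"])["Salary"].mean().unstack()

# Share of each gender that was placed
placement_rate = (df["Status"] == "Placed").groupby(df["Gender"]).mean()

print(avg_salary, avg_by_degree, placement_rate, sep="\n\n")
```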
Q4. What is the relationship between different numerical variables and what is its direction i.e positive or negative?
The answer to this question will help us understand whether higher marks are related to higher salary or vice versa, as well as how the other variables relate to each other. We used the “Pearson” correlation, the default method, between the numerical variables. A negative value indicates a negative relationship and a positive value a positive one. The closer the value is to 0, the weaker the relationship. An absolute value of 0.5 or above, or close to it, may signify a notable correlation.
At first glance there appears to be a positive correlation between all variables, with a few notable relations between the percentages and “salary”, which is our variable of interest. However, since we added a 0 for the salary of everyone who was not placed, there is a chance these correlation values are higher than they should be. Thus, we controlled for placement and checked only those who got placed.
This presents a different picture. There appear to be no notable relations between Salary and the percentage-based variables. As a matter of fact, the degree percentage appears to be negatively correlated with Salary; however, with a value close to 0, the relationship is negligible.
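The effect of the zero-filled salaries on the correlation can be seen on a tiny invented example. The six rows below are constructed so that salary is unrelated to the score among placed students; the column names and values are assumptions for illustration.

```python
import pandas as pd

# Invented rows: the two lowest scorers are not placed (salary filled with 0).
df = pd.DataFrame({
    "MBA_Percentage": [50, 55, 60, 65, 70, 75],
    "Status": ["Not Placed", "Not Placed", "Placed", "Placed", "Placed", "Placed"],
    "Salary": [0, 0, 250000, 300000, 300000, 250000],
})

# Pearson correlation over everyone, zeros included
corr_all = df["MBA_Percentage"].corr(df["Salary"])

# Same correlation restricted to placed students
placed = df[df["Status"] == "Placed"]
corr_placed = placed["MBA_Percentage"].corr(placed["Salary"])

print(round(corr_all, 3), round(corr_placed, 3))
```

Over all six rows the correlation comes out strongly positive, but it vanishes once only placed students are kept: the zeros mostly encode who got placed, not how much they earn.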
Q5. What is the distribution of salaries for people who did get placed?
The box plot shows the median value of salary is between 200 and 300K with outliers ranging from just above 400K all the way to 900K. It may be inferred that most of the students got jobs in India while the outliers with high salaries were able to get jobs in top companies within or outside India. This is also evident from the density plot which shows the concentration of salaries between 200 and 300K.
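For reference, the quantities a box plot draws can be computed directly. The sketch below uses made-up salaries and assumes the common convention that points beyond 1.5 times the interquartile range above the third quartile are shown as outliers.

```python
import pandas as pd

# Made-up salaries of placed students (illustrative values only)
salary = pd.Series([200000, 220000, 240000, 250000, 260000,
                    280000, 300000, 450000, 900000])

q1, median, q3 = salary.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Assumed convention: anything above Q3 + 1.5 * IQR is drawn as an outlier
upper_fence = q3 + 1.5 * iqr
outliers = salary[salary > upper_fence]

print(median, upper_fence)
print(outliers.tolist())
```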
Q6. What is the distribution of percentages for people who got placed and those who did not?
Probably the most popular question from students: “Does my GPA matter for job placement?” Being a student it is extremely easy to relate with this question especially when we are constantly told that the demands of the job in the real world are usually different from the academic knowledge we get from an institution. This data set gave us a chance to explore the age old question. We prepared density plots comparing the distribution of Employability test, Under-Grad Degree and MBA percentage as shown below.
From these three plots, there is no pattern from which we can infer that people with higher percentages had a greater chance of securing placement. However, beyond a certain threshold of percentage in each of the graphs, the number of students placed is greater than the number of students not placed. Yet for a majority of scores the peaks for placed and not placed are close to each other, indicating that percentage may not matter after all in terms of placement.
Machine Learning and Statistical Inference
The one-way analysis of variance (ANOVA) is used to determine whether there are any statistically significant differences between the means of two or more independent groups. The question we put to the one-way ANOVA test was whether there is a statistically significant difference in the salaries of people based on their specialization. Using basic descriptive statistics (means, boxplots), we observed that people who had specialized in Mkt&Fin had higher mean salaries compared to Mkt&HR. Moreover, the boxplots showed that Mkt&Fin had considerably larger outliers than Mkt&HR. This led to the following hypotheses.
Null Hypothesis: The difference in salaries of people with different specializations is not statistically significant and is due to chance.
Alternate Hypothesis: The difference in salaries of people with different specializations is statistically significant, and thus specialization does affect the salary people earn.
The ANOVA test was carried out in two steps:
Step 1: Fit the model using an estimation method; by default, the ordinary least squares (OLS) method was used. This estimation method gives an estimate of the parameter being tested.
Step 2: Pass the fitted model into the ANOVA method to produce the ANOVA table.
As the p-value was greater than 0.05, we fail to reject the null hypothesis: the differences in Salaries (and their means) based on Specialization are not statistically significant at the 5% significance level and may be due to chance.
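To illustrate the test, here is a minimal sketch on invented salaries. It uses `scipy.stats.f_oneway` rather than the two-step OLS and ANOVA-table route described above; for a single factor such as specialization both compute the same F statistic. The numbers are made up and are not the values from our dataset.

```python
from scipy import stats

# Made-up salaries (in thousands) for the two specializations
mkt_fin = [250, 300, 280, 320, 270, 400]
mkt_hr = [240, 260, 300, 250, 270, 290]

# One-way ANOVA: are the group means further apart than chance would explain?
f_stat, p_value = stats.f_oneway(mkt_fin, mkt_hr)

alpha = 0.05                      # 5% significance level
reject_null = p_value < alpha     # False means we fail to reject the null

print(f"F = {f_stat:.3f}, p = {p_value:.3f}, reject null: {reject_null}")
```

In this toy case the Mkt&Fin mean is higher and it contains one large value, yet the test still fails to reject the null, which mirrors the situation described above.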
Predicting MBA percentage from Degree Percentage using Simple Linear Regression
In order to use the simple/multiple linear regression model, there needs to be a significant correlation between two features. As per the correlation matrix shown above, there was only one relation with a coefficient of 0.5 or above, i.e. MBA percentage and degree percentage.
The following shows the results of linear regression. The scatter plot illustrates the relationship between x (Degree Percentage) and y (MBA Percentage). The line of best fit (shown in green) is positioned appropriately between the points, with roughly equal numbers of points above and below it. This gives the relationship between x and the values predicted by the linear regression model.
To further dissect this relationship, we checked how close our data is to the predicted regression line, in other words, the proportion of variance in the MBA percentage that is predictable from Degree Percentage. This was done using the r2_score function from sklearn.metrics. Our R squared value of 0.1619 indicated that 16.19% of the variance in MBA percentage can be explained through degree percentage. The following equation summarizes the relationship between the two variables:
y = 0.31895971x + 41.108770838441785
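A sketch of how such a fit and its R squared can be produced with scikit-learn. The data here is synthetic, generated around a line with a similar slope and intercept, and R squared is computed on the same data used for fitting; both are assumptions for illustration. The last lines read the reported equation with 0.31895971 as the slope on Degree Percentage and 41.108770838441785 as the intercept, which is also an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for the two columns (not the real data)
rng = np.random.default_rng(42)
degree_p = rng.uniform(50, 90, size=100)
mba_p = 0.319 * degree_p + 41.1 + rng.normal(0, 5, size=100)

X = degree_p.reshape(-1, 1)            # scikit-learn expects a 2D feature array
model = LinearRegression().fit(X, mba_p)
predicted = model.predict(X)

slope, intercept = model.coef_[0], model.intercept_
r2 = r2_score(mba_p, predicted)        # share of variance explained by the line
print(f"y = {slope:.4f}x + {intercept:.4f}, R^2 = {r2:.4f}")

# Using the reported equation, read as slope and intercept (assumption):
def reported_line(x):
    return 0.31895971 * x + 41.108770838441785

print(reported_line(70))  # predicted MBA percentage for a degree percentage of 70
```

For a simple regression fitted and scored on the same data, R squared equals the square of the Pearson correlation between the two columns.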
Model for Predicting Student Placement
So, it boils down to this. We will be answering two questions in this part.
· Given the record of a student can we predict if he/she will get placed or not?
· How accurate is this prediction?
Before specifying which technique we used we will eliminate the options. Since we did not have a continuous dependent variable, linear regression was not possible. Our dependent variable i.e “Status of Placement” was categorical with only two possible values (binary) “Placed” and “Not Placed”. In short we were classifying each student into one of the two categories. For such classification we used “Binary Logistic Regression”.
Binary Logistic Regression:
We had to make certain changes in our dataset to make it suitable for logistic regression. There were several features/columns with categorical data which needed to be converted to numbers. Columns with two classes, such as Gender, were easily converted to 0s and 1s using the replace method from the pandas library. For columns with more than two classes, such as Under-Grad Degree Type, we used one-hot encoding, converting each class into a dummy column of 1s and 0s. We also dropped the salary column since it was a result of placement, not a regressor.
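A minimal sketch of this preparation on three invented rows. The column names and the 0/1 coding are assumptions, and `Series.map` is used here in place of the `replace` call mentioned above because it returns an integer column directly.

```python
import pandas as pd

# Invented rows; column names are assumptions for illustration.
df = pd.DataFrame({
    "Gender": ["M", "F", "M"],
    "Degree_Type": ["Sci&Tech", "Comm&Mgmt", "Others"],
    "Status": ["Placed", "Not Placed", "Placed"],
    "Salary": [270000, 0, 250000],
})

# Salary is a consequence of placement, so it cannot be a predictor
encoded = df.drop(columns=["Salary"])

# Two-class columns become a single 0/1 column
encoded["Gender"] = encoded["Gender"].map({"M": 1, "F": 0})
encoded["Status"] = encoded["Status"].map({"Placed": 1, "Not Placed": 0})

# Multi-class columns become one dummy column per class
encoded = pd.get_dummies(encoded, columns=["Degree_Type"], dtype=int)

print(encoded)
```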
Furthermore, we decided to split our data between training and testing. The training data was to be used to train the model while the test data was used later to gauge model accuracy. There were two options to choose from. We could either go with 90% of the data allocated to training set and 10% to the test set or a 75:25 split. We went for 75:25 for a better balance between complexity and generalizability while also making sure enough data points were allocated to the test data given that the campus recruitment data set is not very large.
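The split and the model fit can be sketched as below. The features are random numbers standing in for the encoded student records, and `random_state` and `max_iter` are arbitrary choices, so the accuracy printed here says nothing about the real data. With 215 rows, a 25% test share works out to 54 test records.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Random stand-in for 215 encoded student records (not the real data)
rng = np.random.default_rng(0)
n = 215
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.8, size=n) > 0).astype(int)

# 75:25 split between training and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# score() returns the share of test records classified correctly
accuracy = model.score(X_test, y_test)
print(len(X_train), len(X_test), round(accuracy, 3))
```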
Following the creation of the model, we needed to test the model accuracy. For this purpose we used the test data and used the following techniques to gauge accuracy:
- Confusion Matrix
- Classification report
- ROC Curve
Using the score method we got an accuracy percentage of 85.1%. This showed that our model could correctly predict placement for a student 85/100 times.
The confusion matrix shown above indicates that for the 54 student records in the test data, the model correctly predicted the placement of 34 students while returning a false positive only 3 times. Similarly, it correctly predicted that students were not placed for 12 records while returning a false negative for 5 of them. Overall, this gives an accuracy of 46/54 ≈ 0.85, or 85%.
The table above shows the classification report. Precision tells us the accuracy of positive predictions, which in this case means predictions of placement. This value is pivotal since it indicates how often the model commits a Type 1 error, i.e. a false positive. A Type 1 error in this case would be predicting placement for a student who does not actually get placed; a company missing out on a talented individual who should have been hired but was not would instead be a false negative (Type 2 error). Looking at the weighted average, we can see that about 85% of the model’s predictions are correct.
Recall is similar to accuracy in the sense that it gives the fraction of actual positives that were correctly identified. Lastly, the f1-score is the harmonic mean of precision and recall, and its weighted average also comes to 85%. One thing to note is that the model generally has greater accuracy in predicting placement (marked by 1) compared to non-placement (marked by 0).
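These metrics can be recomputed by hand from the four confusion matrix counts quoted above (34 true positives, 3 false positives, 12 true negatives, 5 false negatives). The variable names are ours; the arithmetic is the standard definition of each metric.

```python
# Counts quoted above for the 54 test records
tp, fp, tn, fn = 34, 3, 12, 5

accuracy = (tp + tn) / (tp + fp + tn + fn)   # 46 / 54

# Class 1 (placed)
precision = tp / (tp + fp)                   # how many predicted placements were right
recall = tp / (tp + fn)                      # how many actual placements were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# Class 0 (not placed): the roles of the counts swap
precision_not_placed = tn / (tn + fn)
recall_not_placed = tn / (tn + fp)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
print(f"not placed: precision={precision_not_placed:.3f} "
      f"recall={recall_not_placed:.3f}")
```

Precision for the placed class comes to about 0.92 against about 0.71 for the not placed class, which is the gap between the two classes noted above.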
The ROC Curve shows the trade-off between the false positive rate and the true positive rate. The baseline (in blue) has no predictive value, while the orange curve shows the effect of the actual test, i.e. logistic regression. The area under this curve (AUC) measures how well the model ranks placed students above those not placed, and in this case it is very high at 92.21%. Another indicator of the model’s quality is that the orange curve lies well away from the baseline.
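A small sketch of how the ROC curve and AUC are computed with scikit-learn. The four labels and scores are invented; in practice the scores would be the predicted probabilities of placement, for example from `model.predict_proba(X_test)[:, 1]`.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Invented labels (1 = placed) and predicted probabilities of placement
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (false positive rate, true positive rate) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the curve: 0.5 matches the baseline, 1.0 is a perfect ranking
auc = roc_auc_score(y_true, y_score)

print(list(zip(fpr, tpr)))
print(auc)
```

Of the four placed versus not placed pairs in this example, three are ranked correctly, so the AUC is 0.75.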
Thus, given all the information regarding the accuracy of the model we can safely conclude that the model is extremely reliable in predicting placement.
The goal of the project was to make a model to predict student placement. This was achieved using Binary Logistic Regression. Apart from this, One-Way ANOVA was used for statistical inference between MBA Specialization and Salary. This gave us an interesting insight: we found no statistically significant evidence that Salary depends on the specialization chosen during the MBA. We were also able to make a simple linear regression model to predict MBA percentage based on degree percentage.
EDA also generated valuable insights. There was indeed a gender bias in the data available. No evidence was found that higher marks throughout the academic journey of a student lead to a higher chance of placement. Most people placed earned a salary between 200–300k.
As pointed out in the beginning, this dataset is from a particular university and thus, given a different dataset, the results can differ. However, such models can indeed have great industrial applications, especially within the HR department of a company. While the results can’t be relied upon on their own, they can help narrow down a long list of possible employees.