The final year of college is very different from the years that come before it. It is the year our focus shifts from academics to life outside as we begin taking part in campus recruitment activities. The priorities shift from scoring better grades to scoring a better salary. This is why the Campus Recruitment data set on Kaggle really intrigued me. The data set contains placement data of MBA students from a college in India. Although the data set is not very large, each record is highly detailed: a row contains the complete academic history of a student, from percentage scores at the Secondary and Higher Secondary levels all the way to placement and salary.
There were plenty of categorical and numerical variables in the dataset, which made it versatile and a fit for several kinds of analysis. The dataset contains data on 215 students, of whom 148 secured placement. In this blog we share some key insights we got from the data as well as their implications. It is important to highlight that the dataset is limited, so the findings should be understood in their own context and not generalized to campus recruitment as a whole.
The data cleaning process began with understanding the data. We started by checking for missing values, looking at important summary statistics and examining the data types of the columns. The first few rows, examined with the head() function, showed missing values in the salary column. This suggested there might be more in the dataset, and a full count found 67 missing values in the “Salary” column. We therefore had to decide how to deal with them. It is important to know that salary is recorded only for students who were placed, so the 67 missing values correspond to the 67 students (215 minus 148) who were not placed. We considered three options.
1. We could delete rows with missing values. This would remove every student who was not placed, so it is not a good idea.
2. We could fill the missing values with the mean salary. This would be completely wrong, since students who were not placed should not have a salary.
3. We could fill the missing values with 0. This lets the column be stored as integers and keeps it consistent.
Thus, we opted for the third option. The “serial number” column was dropped because it was redundant: it carried the same information as the index. Lastly, to cap off the data cleaning process, we renamed the columns to make it easier to understand what each of them contained. To ensure consistency, each column name starts with a capital letter.
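As a rough sketch, the cleaning steps described above could look like this in pandas. The toy data frame, the raw column names (sl_no, gender, mba_p, status, salary) and the new names in the rename mapping are all assumptions made for illustration, since the original code is not shown here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle file. Column names are assumed for illustration.
raw = pd.DataFrame({
    "sl_no": [1, 2, 3, 4],
    "gender": ["M", "F", "M", "F"],
    "mba_p": [58.8, 66.3, 57.8, 59.4],
    "status": ["Placed", "Placed", "Not Placed", "Not Placed"],
    "salary": [270000.0, 200000.0, np.nan, np.nan],
})

# Count missing values per column before changing anything.
missing_before = raw.isna().sum()

# Hypothetical mapping to capitalized, more readable names.
NEW_NAMES = {
    "gender": "Gender",
    "mba_p": "Mba_percentage",
    "status": "Status",
    "salary": "Salary",
}

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the serial number, zero-fill salary, and rename columns."""
    out = df.drop(columns=["sl_no"])
    # Students who were not placed have no salary, so 0 stands in for it.
    out["salary"] = out["salary"].fillna(0).astype(int)
    return out.rename(columns=NEW_NAMES)

cleaned = clean(raw)
print(missing_before["salary"])  # missing salaries in the toy data
print(cleaned.dtypes)            # Salary is now an integer column
```

One thing to keep in mind with this choice: every later average or correlation that includes the zero-filled rows treats "not placed" as a salary of 0, so salary statistics are usually better computed on placed students only.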
Exploring the dataset (EDA):
EDA is an extremely broad process that gives us useful insights into the dataset. However, these insights should be driven by the questions we aim to answer; otherwise there will be too much information to make sense of. We use statistics and visualization to obtain insights and answer those questions. Since the ultimate aim of this exercise is to use machine learning to predict whether a student will get placed, it is important to study how the different variables relate to placement. Since salary is the result of placement, we should also be able to predict it given a student record. With this in mind we prepared six questions, each of which deals with a certain aspect of the data set.
Q1. What is the count for each of the categorical variables?
Understanding the composition of each category is important for further analysis. For example, a difference between the number of males and females means we can’t use sums to compare their salaries. Rather, we should use proportions or a measure of central tendency such as the mean or median.
We produced a bar chart for each of the categorical variables, which gave the following insights. The number of male students is greater. Commerce and Management was the most popular undergrad degree choice, while Marketing and Finance was the preferred MBA specialization. Most of the students did not have work experience, yet most of them were able to secure jobs.
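A minimal sketch of how such counts can be obtained with pandas is shown below. The toy rows and the column names (Gender, Work_experience, Status) are assumptions for illustration and do not reproduce the real data.

```python
import pandas as pd

# Toy data with assumed column names, not the real dataset.
df = pd.DataFrame({
    "Gender": ["M", "M", "F", "M", "F"],
    "Work_experience": ["No", "Yes", "No", "No", "No"],
    "Status": ["Placed", "Placed", "Not Placed", "Placed", "Placed"],
})

categorical_cols = ["Gender", "Work_experience", "Status"]

# One frequency table per categorical column.
counts = {col: df[col].value_counts() for col in categorical_cols}

for col, table in counts.items():
    print(col)
    print(table)
```

With matplotlib installed, each table can be turned into a bar chart with `counts[col].plot(kind="bar")`.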
Q2. How does placement compare across three categorical variables, i.e. Under-Grad Degree Type, MBA Specialization and Work Experience?
The reason for selecting these three variables was basic intuition. It is natural for people to go for a certain degree or specialization in order to have a better chance of landing a job. Similarly, internships or interim jobs also help in securing better jobs or more salary. The bar graphs below illustrate these differences in terms of placement.
As I pointed out before, this is where proportion comes into play. At first glance, the raw counts make it seem as if people without work experience are more likely to be both placed and not placed. Seems weird, right? That is simply because there are more of them. If we look at the proportions instead, roughly 60 of the 70 students with experience were placed, a higher share than among students without experience.
For undergrad degree type, however, Commerce and Management students do have the better placement record: they have not only the higher absolute number of placed students but also the greater proportion, 0.71 against 0.66 for Sci & Tech. This difference, however, is too small to carry much weight.
Among MBA specializations, Marketing and Finance has both the highest absolute number and the greater proportion of placed students.
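The difference between absolute counts and proportions can be sketched with a cross-tabulation. The toy rows and column names (Work_experience, Status) below are assumptions for illustration; the numbers are made up and only mimic the pattern discussed above.

```python
import pandas as pd

# Toy data with assumed column names: fewer students have experience,
# but a larger share of them is placed.
df = pd.DataFrame({
    "Work_experience": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"],
    "Status": ["Placed", "Placed", "Not Placed",
               "Placed", "Placed", "Placed", "Not Placed", "Not Placed"],
})

# Absolute counts per group.
counts = pd.crosstab(df["Work_experience"], df["Status"])

# Row-wise proportions: each row sums to 1, so groups of different
# sizes become comparable.
rates = pd.crosstab(df["Work_experience"], df["Status"], normalize="index")

print(counts)
print(rates.round(2))
```

In this toy example the group without experience has more placed students in absolute terms (3 against 2), yet the placement rate is higher for the group with experience (about 0.67 against 0.60).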
Q3. Is there a Gender-bias in the data?
The gender pay gap is a well-known and well-documented issue. We decided to generate our own insights on this complex issue. The following graph illustrates the average salary for each gender.
This shows that the average salary for males is greater. However, it is important to analyze this further. To do so we controlled for different categorical variables to check whether the differences were purely because of gender or whether there were other factors to account for, such as specialization or degree type.
Interestingly, the graph shows that for two of the undergrad degree types, i.e. “Sci&Tech” and “Comm&Mgmt”, males enjoy a higher salary, but for other degrees females have a higher average salary. Moving on!
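One way to control for a second variable is to compute the average salary per gender within each degree type. The sketch below uses made-up rows and assumed column names (Gender, Degree_type, Status, Salary). It also restricts the average to placed students, which is an assumption on my part and not necessarily how the averages above were computed.

```python
import pandas as pd

# Toy data with assumed column names; salaries are invented.
df = pd.DataFrame({
    "Gender": ["M", "F", "M", "F", "M", "F", "F"],
    "Degree_type": ["Sci&Tech", "Sci&Tech", "Comm&Mgmt", "Comm&Mgmt",
                    "Others", "Others", "Others"],
    "Status": ["Placed", "Placed", "Placed", "Placed",
               "Placed", "Placed", "Not Placed"],
    "Salary": [300000, 260000, 280000, 250000, 200000, 240000, 0],
})

# Keep placed students only so zero-filled salaries do not drag the mean down.
placed = df[df["Status"] == "Placed"]

# Rows: degree type, columns: gender, cells: mean salary.
table = placed.pivot_table(
    index="Degree_type", columns="Gender", values="Salary", aggfunc="mean"
)
print(table)
```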
We’ll now compare the two on the basis of percentages for MBA, Undergrad as well as Employability test.
Comparing bins for each of them shows that males generally enjoy greater salary in most of the percentage bins.
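A sketch of this kind of binned comparison is shown below. The bin edges, labels, toy rows and column names (Gender, Mba_percentage, Salary) are all assumptions for illustration.

```python
import pandas as pd

# Toy data with assumed column names; values are invented.
df = pd.DataFrame({
    "Gender": ["M", "M", "F", "M", "F"],
    "Mba_percentage": [55.0, 57.0, 58.0, 65.0, 68.0],
    "Salary": [260000, 280000, 240000, 300000, 310000],
})

# Hypothetical bin edges; each bin includes its right edge, e.g. (50, 60].
edges = [50, 60, 70, 80]
labels = ["50-60", "60-70", "70-80"]
df["Mba_bin"] = pd.cut(df["Mba_percentage"], bins=edges, labels=labels)

# Mean salary per bin and gender; observed=True drops empty bins.
by_bin = (
    df.groupby(["Mba_bin", "Gender"], observed=True)["Salary"]
    .mean()
    .unstack("Gender")
)
print(by_bin)
```

The same pattern applies to the undergrad and employability test percentages by swapping the column passed to `pd.cut`.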
For both cases of work experience, males have a higher average salary. Now the question is, does this gender bias translate into job placement as well? To find out, we compared the proportion placed within each gender, for example the percentage of females placed out of the total number of female students.
The graphs show that 70% of males were placed compared with 60% of females. Thus, for this dataset we may say that a gender bias exists; however, the results are not generalizable given the limited data.
Q4. What is the relationship between the different numerical variables, and what is its direction, i.e. positive or negative?
The answer to this question will help us understand whether higher marks are related to higher salary, as well as how the other variables relate to one another. We used the “Pearson” correlation between numerical variables by default. A negative value indicates a negative relationship and a positive value a positive one. The closer the value is to 0, the weaker the relationship. An absolute value above or near 0.5 may signal a reasonably strong correlation.
At first glance there appears to be a positive correlation between all variables, with a few reasonably strong relations between the percentages and “salary”, which is our variable of interest. However, since we added a 0 for the salary of everyone who was not placed, these correlation values may be higher than they should be. Thus, we controlled for placement and checked only those who got placed.
This presents a different picture. There appear to be no strong relations between Salary and the percentage variables. As a matter of fact, the degree percentage appears to be negatively correlated with Salary; however, with a value close to 0, the relationship is negligible.
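The effect of the zero-filled salaries on the correlation can be illustrated with a small, made-up example. The column names (Mba_percentage, Status, Salary) and all values are assumptions for illustration and are not taken from the real data.

```python
import pandas as pd

# Toy data: unplaced students have lower percentages and a salary of 0.
df = pd.DataFrame({
    "Mba_percentage": [50, 52, 60, 62, 64, 66],
    "Status": ["Not Placed", "Not Placed",
               "Placed", "Placed", "Placed", "Placed"],
    "Salary": [0, 0, 300000, 250000, 280000, 260000],
})

numeric_cols = ["Mba_percentage", "Salary"]

# Pearson correlation on all rows, zero-filled salaries included.
corr_all = df[numeric_cols].corr(method="pearson")
r_all = corr_all.loc["Mba_percentage", "Salary"]

# Pearson correlation on placed students only.
placed = df[df["Status"] == "Placed"]
corr_placed = placed[numeric_cols].corr(method="pearson")
r_placed = corr_placed.loc["Mba_percentage", "Salary"]

print(round(r_all, 2), round(r_placed, 2))
```

In this toy example the correlation is strongly positive when the zeros are included and turns negative once only placed students are considered. The zeros mostly encode "placed or not", so the first number says more about placement than about salary.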
Q5. What is the distribution of salaries for people who did get placed?
The box plot shows that the median salary is between 200K and 300K, with outliers ranging from just above 400K all the way to 900K. One possible reading is that most of the students got jobs in India, while the outliers with high salaries got jobs in top companies within or outside India, although the data itself does not tell us where the jobs are. The concentration of salaries between 200K and 300K is also evident from the density plot.
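The numbers behind a box plot can also be computed directly. The sketch below assumes the common 1.5 × IQR whisker rule, which is the matplotlib default, and uses invented salaries for placed students.

```python
import pandas as pd

# Invented salaries of placed students, for illustration only.
salary = pd.Series(
    [200000, 220000, 240000, 250000, 260000, 270000, 280000, 300000, 940000]
)

median = salary.median()
q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = salary[(salary < lower_fence) | (salary > upper_fence)]

print(median, q1, q3)
print(outliers.tolist())
```

Because the median is barely affected by a handful of very high salaries, it is a safer summary than the mean for a distribution like this one.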
Q6. What is the distribution of percentages for people who got placed and those who did not?
Probably the most popular question from students: “Does my GPA matter for job placement?” As a student it is extremely easy to relate to this question, especially when we are constantly told that the demands of a job in the real world are usually different from the academic knowledge we get from an institution. This data set gave us a chance to explore the age-old question. We prepared density plots comparing the distribution of Employability test, Under-Grad Degree and MBA percentage as shown below.
From these three plots there is no pattern from which we can infer that people with higher percentages had a greater chance of securing placement. Beyond a certain percentage threshold in each graph, the number of students placed is greater than the number not placed. Yet for the majority of scores the peaks for placed and not placed are close to each other, indicating that percentage may not matter much after all in terms of placement.
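The same comparison can be made numerically by summarizing each percentage per placement status and counting students above a cut-off. The toy rows, the column names (Mba_percentage, Status) and the threshold of 65 are all assumptions for illustration.

```python
import pandas as pd

# Toy data with assumed column names; values are invented.
df = pd.DataFrame({
    "Mba_percentage": [60, 62, 64, 70, 58, 61, 63],
    "Status": ["Placed", "Placed", "Placed", "Placed",
               "Not Placed", "Not Placed", "Not Placed"],
})

# Centre of each distribution: close values suggest overlapping peaks.
summary = df.groupby("Status")["Mba_percentage"].agg(["mean", "median"])

# Hypothetical threshold: who is left above it?
THRESHOLD = 65
above = df.loc[df["Mba_percentage"] > THRESHOLD, "Status"].value_counts()

print(summary)
print(above)
```

In this toy example the two medians are only two points apart, while everyone above the threshold is placed, which mirrors the pattern described above.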
With these insights from exploration in hand, we can now move towards using machine learning to predict placement.