Data Analysis of Job Postings on Glassdoor.
I came up with this personal project to test my skills to the fullest and learn new things. In this project I scraped job postings for the position of 'Data Scientist' from glassdoor.com, analyzed the gathered data, and framed a machine learning problem out of it. In the write-up below I describe what I learned. I selected California, Washington, and New York as the main states in which to search for roles.
The project consists of three main Jupyter notebooks.
Glassdoor is a website where current and former employees anonymously review companies. Glassdoor also allows users to anonymously submit and view salaries as well as search and apply for jobs on its platform. Glassdoor launched its site in 2008 as a site that “collects company reviews and real salaries from employees of large companies and displays them anonymously for all members to see,” according to TechCrunch. The company then averaged the reported salaries, posting these averages alongside the reviews employees made of the management and culture of the companies they worked for, including some of the larger tech companies like Google and Yahoo. The site also allows the posting of office photographs and other company-relevant media.
In this part of the project I developed a web scraper that collects data from glassdoor.com. Here's how I went about creating it.
There were two approaches to this:
Once I had extracted all the links from all the result pages on glassdoor.com, I counted how many jobs were listed on each page, which turned out to be 30.
Then I visited every extracted link using Selenium, parsed the page source into a Beautiful Soup object, and extracted the required elements.
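The parsing half of that step can be sketched roughly as follows. This is a minimal sketch, not the scraper from the notebook: the HTML snippet, the class names (job-title, employer, location), and the function name parse_job_page are all hypothetical, because the write-up does not list the actual page elements.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real Glassdoor element and class names are not
# given in the write-up, so this sample page is made up for illustration.
SAMPLE_PAGE = """
<html><body>
  <div class="job-title">Data Scientist</div>
  <div class="employer">Example Corp</div>
  <div class="location">San Francisco, CA</div>
</body></html>
"""


def parse_job_page(page_source):
    """Extract a few fields from the HTML of a single job posting."""
    soup = BeautifulSoup(page_source, "html.parser")

    def text_of(css_class):
        # Return None instead of raising when an element is missing,
        # so one incomplete posting does not stop the whole run.
        node = soup.find("div", class_=css_class)
        return node.get_text(strip=True) if node is not None else None

    return {
        "title": text_of("job-title"),
        "company": text_of("employer"),
        "location": text_of("location"),
    }


job = parse_job_page(SAMPLE_PAGE)
print(job)
```

With Selenium, page_source would come from driver.page_source after calling driver.get(link) for each extracted link.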
First of all, I used interactive plots in it. I created them with plotly and did not use the offline version, so sorry if you cannot see the plots. If you want to see the notebook with the plots, here is the link: eda notebook
I ran all of my code on Deepnote; you should look it up.
When I read all the data from the CSV files, I found that it contained duplicated rows, so my first task was to delete them. After that the data was pretty much clean, thanks to the careful scraping.
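A minimal sketch of that cleanup with pandas is shown here. The column names and rows are made up for illustration, and the in-memory strings stand in for the scraped CSV files.

```python
import io

import pandas as pd

# Hypothetical data: two small "files" standing in for the scraped CSVs.
CSV_PART_1 = """title,company,state
Data Scientist,Example Corp,CA
Data Engineer,Sample Inc,WA
"""
CSV_PART_2 = """title,company,state
Data Scientist,Example Corp,CA
Data Analyst,Demo LLC,NY
"""

# Read every file and stack the rows into one DataFrame.
frames = [pd.read_csv(io.StringIO(text)) for text in (CSV_PART_1, CSV_PART_2)]
jobs = pd.concat(frames, ignore_index=True)

# Count fully identical rows, then keep only the first copy of each.
n_duplicates = int(jobs.duplicated().sum())
deduped = jobs.drop_duplicates().reset_index(drop=True)

print(f"dropped {n_duplicates} duplicate row(s), {len(deduped)} remain")
```

By default drop_duplicates compares all columns; passing subset=[...] would treat rows as duplicates based on selected columns only.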
Beginning of EDA
There are 12 columns in the data. They are as follows:
a. State column
As mentioned above, I used California, Washington, and New York as place queries. But the scraper also collected some data from other states like Texas, Maryland, and Virginia, maybe because these results were also part of the search.
The first thing I did was count the number of jobs in each city of each state to find the top 5 cities. It turns out CA and TX have the most jobs. The city of San Francisco dominated with the most jobs.
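Counts like these can be computed along the following lines. This is only a sketch: the column names state and city are assumptions, and the rows are invented, not taken from the scraped data.

```python
import pandas as pd

# Made-up rows; "state" and "city" are assumed column names.
jobs = pd.DataFrame({
    "state": ["CA", "CA", "CA", "CA", "TX", "TX", "WA"],
    "city": [
        "San Francisco", "San Francisco", "San Francisco", "San Jose",
        "Austin", "Austin", "Seattle",
    ],
})

# Number of postings per state, largest first.
jobs_per_state = jobs["state"].value_counts()

# Number of postings per (state, city) pair, keeping the five largest.
top_cities = (
    jobs.groupby(["state", "city"])
    .size()
    .sort_values(ascending=False)
    .head(5)
)

print(jobs_per_state)
print(top_cities)
```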
The next step was to look at the average annual minimal and maximal salaries in each state.
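One way to compute such per-state averages is sketched below. The column names min_salary and max_salary are assumptions, and the salary figures are invented for illustration, not results from the data set. Keeping a posting count next to each mean makes it easier to spot states whose averages rest on very few rows.

```python
import pandas as pd

# Invented numbers; "min_salary" and "max_salary" are assumed column names.
jobs = pd.DataFrame({
    "state": ["CA", "CA", "CA", "NY", "DC", "DC"],
    "min_salary": [100_000, 120_000, 110_000, 150_000, 105_000, 115_000],
    "max_salary": [150_000, 170_000, 160_000, 200_000, 155_000, 165_000],
})

# Average minimal and maximal salary per state, plus the number of
# postings each average is based on.
salary_by_state = jobs.groupby("state").agg(
    postings=("min_salary", "count"),
    avg_min_salary=("min_salary", "mean"),
    avg_max_salary=("max_salary", "mean"),
)

print(salary_by_state)
```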
What's the inference about salaries?
The average minimal salary for NY is greater than for CA and DC. That may be because there are fewer data points for NY, which suggests it is an outlier.
Both DC and California offer almost the same average salaries, both minimal and maximal.
Next I looked at which city offered the highest average minimal annual salary.
Outcome: South San Francisco is the city with the highest minimal annual salary.
b. Industry Column
As expected, companies in the IT sector have the most jobs, followed by industries that fall under Business Services.
One interesting thing for me was that Biotech requires more data science people than Finance, at least in this data set.
Salaries for the top 8 industries with the most jobs
Outcome: Agriculture and Forestry is an outlier because there is only one example of it in the whole data set. The IT industry pays the best average minimal and maximal yearly salaries.
c. Exploring the Company Column
According to the data and the above plot, Genentech, Booz Allen Hamilton Inc., and Amazon are the companies with the most openings.
Average annual minimal and maximal salaries for the top 5 companies with the most job postings in each state. The sequence of plots is California, Texas, DC, Virginia, Maryland.
d. Job Titles
The top 3 titles were: Data Scientist, Data Engineer, Data Analyst.
The salaries for the top 3 titles
e. Job Description
I have an exhaustive explanation in the modelling notebook; you can look at it there. There are explanations for every plot and method I used.