Project 4 Findings
Findings Data Scientist Salary Scraping Scenario
Layman’s Summary
We found that the best site to use for this project was careerbuilder.com. It’s easy for our systems to read and has enough data that we don’t have to try to put results together from multiple places.
We fed the listings into our system and got a table with job titles, job descriptions, salaries, company names, and locations.
The average salary for all of those jobs was $95,000 / year so we said that anything over that would be a high paying job and anything under it would be low paying.
We were able to take all of the words in all of the job descriptions and run them through a program that showed us which ones were most often found in descriptions of high-paying jobs and which ones were most often in descriptions of low-paying ones. That helped us figure out whether these words could help us guess if another job was likely to be high-paying or low-paying. They were okay, but not great.
We then saw whether adding in the job’s location and title helped guess how well a job payed, which it did a little.
Technical Summary
In order to collect the data required for the project we opted to utilize careerbuilder.com. It was an ideal source as it had a good number of listings with salaries, hosted its job postings internally, and did not utilize AJAX technology.
Data scraping was performed on a search of data science jobs with a minimum pay of $20,000. The minimum was primarily to filter out listings which would not contain salary data. On each page of listings we used BeautifulSoup to locate and acquire the job title, job description, salary range, company, and location, creating lists of each.
These lists were then combined into a DataFrame and the data was cleaned, removing html code and other extraneous characters (e.g. ‘$’’, ‘-‘’, or ‘,’).
Salary ranges were split into high and low ends of the range, allowing a single mean salary to be assigned to the job. Location was split into City and State columns and numerical data was converted from object-type.
We found that the mean salary for these jobs was approximately $95,000/yr so we set this as the threshold for high- versus low-paying positions.
The text from the job descriptions was consolidated and processed through sklearn’s CountVectorizer to find the 50 words with the highest likelihood to impact salary. The highest negative coefficients were for “level” and “coordinator” while the highest positive coefficient by far was “infrastructure” followed somewhat distantly by “architect”, “bangkok”, and “director.”
This word-based model had an roc_auc of 0.75, accuracy of 0.66, precision of 0.66, and recall of .57.
We created a second model using location and job title as well as job description which resulted in precision of 0.66. When C was increased to 500 precision grew to 0.71.