CASE STUDY: Data Science job postings on Glassdoor (Performing Exploratory Data Analysis)
Introduction
As a data analyst, I embarked on a journey to conduct comprehensive data cleaning and exploratory data analysis (EDA) to improve data quality and unveil hidden patterns within a dataset. This project aims to provide valuable insights and facilitate informed decision-making. In this blog post, I will walk you through the entire process, from the dataset explanation to the findings.
Goal
The primary goal of this project is to conduct a thorough data cleaning process and perform exploratory data analysis (EDA) on a dataset of data science job postings from Glassdoor. By doing so, I aim to improve data quality, identify patterns, and facilitate informed decision-making for future analysis. Through this project, I showcase my proficiency in data preprocessing and analytical techniques.
Explain the Dataset
The dataset I worked with comprises 15 columns; the main ones are described below: (Link of the dataset)
- Job Title: The job title associated with each entry.
- Salary Estimate: The salary range for each job position.
- Rating: Company rating on a scale of 0 to 5.
- Company Name: The name of the hiring company.
- Location: The location of the job opportunity.
- Size: The size category of the company in terms of employees.
- Founded: The year the company was founded.
- Type of Ownership: The ownership type of the company.
- Industry: The industry to which the company belongs.
- Sector: The sector to which the company belongs.
- Revenue: The revenue of the company.
- Competitors: Competitor companies.
Explaining the Process
My data cleaning and EDA process involved several key steps:
1. Importing Libraries: I began by importing essential Python libraries, including pandas, numpy, matplotlib, and seaborn, to assist in data manipulation and visualization.
2. Reading the File: I loaded the dataset from the provided CSV file into a pandas DataFrame for further analysis.
3. Handling Missing Values: I checked for missing values within the dataset and found none, ensuring that the data was complete and ready for analysis.
4. Data Transformation:
- I dropped the ‘index’ column as it served as a serial number and had the potential to affect the analysis.
- I extracted the lower and upper salary limits from the ‘Salary Estimate’ column and created new columns to represent these values.
5. Data Cleaning and Transformation for Specific Columns:
- I processed columns such as ‘Job Title’, ‘Company Name’, ‘Size’, ‘Type of Ownership’, ‘Industry’, ‘Sector’, and ‘Revenue’ to handle missing values and improve data quality.
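The loading, missing-value check, and salary extraction steps above can be sketched roughly as follows. This is a minimal illustration, not the exact code from my notebook: the sample rows are made up, the salary format ("$137K-$171K (Glassdoor est.)") is an assumption about how the 'Salary Estimate' strings look, and the helper name clean_postings and the column name Low_Salary_in_dollar are hypothetical (only High_Salary_in_dollar appears in the findings below).

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV file; these rows are made up for illustration.
SAMPLE_CSV = """index,Job Title,Salary Estimate,Rating
0,Data Scientist,$137K-$171K (Glassdoor est.),3.1
1,Data Analyst,$75K-$131K (Glassdoor est.),4.2
2,ML Engineer,$79K-$133K (Employer est.),3.8
"""


def clean_postings(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the serial-number column and split the salary range into two numeric columns."""
    # drop() returns a new DataFrame, so the caller's frame is left untouched.
    df = df.drop(columns=["index"], errors="ignore")
    # Assumed format: "$<low>K-$<high>K (...)"; capture the two numbers.
    bounds = df["Salary Estimate"].str.extract(r"\$(\d+)K\s*-\s*\$(\d+)K").astype(float)
    df["Low_Salary_in_dollar"] = bounds[0] * 1000   # hypothetical column name
    df["High_Salary_in_dollar"] = bounds[1] * 1000
    return df


raw = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(raw.isnull().sum())  # per-column count of missing (null) values
cleaned = clean_postings(raw)
print(cleaned[["Job Title", "Low_Salary_in_dollar", "High_Salary_in_dollar"]])
```

With the real file, the io.StringIO(...) argument would simply be replaced by the path to the CSV. Rows whose salary string does not match the assumed pattern would end up as NaN in the new columns, which makes them easy to spot afterwards.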
My Findings
During the exploratory data analysis (EDA) phase, I made several noteworthy observations:
1. Size of Companies:
- The most common size category is 51 to 200 employees, with 135 companies falling into this category.
- Companies with 1001 to 5000 employees and 1 to 50 employees are also relatively common, with 104 and 86 companies, respectively.
- The least common size category is companies with 5001 to 10,000 employees, with only 61 companies falling into this group.
2. Type of Ownership:
- The most common type of ownership is “Company — Private,” with 397 companies falling into this category.
- “Company — Public” is the second most common type of ownership, with 153 companies.
- Other types of ownership, such as “Nonprofit Organization” and “Subsidiary or Business Segment,” are less common.
3. Industry:
- The dataset covers a wide range of industries, with the top three being Biotech & Pharmaceuticals (66 companies), IT Services (61 companies), and Computer Hardware & Software (57 companies).
- Many other industries are represented, including Aerospace & Defense, Enterprise Software & Network Solutions, and Consulting.
4. Sector:
- The most common sector is Information Technology, with 188 companies falling into this category.
- Other significant sectors include Business Services (120 companies), Biotech & Pharmaceuticals (66 companies), and Aerospace & Defense (46 companies).
5. Revenue:
- The dataset includes companies with a diverse range of revenue levels.
- The most common revenue range is 100 to 500 million (USD) with 94 companies.
- Some companies have very high revenue levels, such as 10+ billion (USD), while others have lower revenues, such as Less than 1 million (USD).
6. Location:
- The dataset includes companies from various locations, with San Francisco, CA having the highest representation (69 companies).
- Other notable locations include New York, NY (50 companies), Washington, DC (26 companies), and Boston, MA (24 companies).
7. Skewness and Kurtosis for Rating:
- Skewness value is approximately 0.0187, indicating a nearly symmetric distribution of ratings with a slight right (positive) skew.
- Kurtosis value is approximately -0.44, suggesting a platykurtic distribution with thinner tails compared to a normal distribution.
8. Skewness and Kurtosis for High_Salary_in_dollar (Upper limit of salary):
- Skewness value is approximately 1.09, indicating a moderate right (positive) skew in salary distribution.
- Kurtosis value is approximately 2.41, suggesting a leptokurtic distribution with heavier tails and a more peaked shape than a normal distribution.
9. Outliers for High_Salary_in_dollar and Rating:
- Outliers were identified for both high salaries and ratings, which may require further investigation for their impact on the analysis.
10. Relationship between Rating and Salary:
- Virtually no linear relationship was observed between company ratings and the upper salary limit: the correlation coefficient of 0.0099 is positive but negligible.
11. Salary Variation Across Locations:
- Salaries exhibited significant variation across different locations, with high-paying and low-paying locations identified.
12. Salary with Respect to Industry:
- Salary levels differed noticeably from one industry to another.
13. Rating Variation by Sector:
- Employee ratings varied across different sectors, with some sectors having higher average ratings than others.
14. Relationship between Rating and Type of Ownership:
- Average ratings differed by type of ownership, with certain ownership types associated with higher average ratings.
15. Companies with the Most Job Postings:
- Companies with the highest number of job postings were identified, indicating active recruitment and potential job opportunities.
16. Comparison of Industry with Respect to Salary and Rating:
- Industries that offered the highest average salaries and received the highest ratings from employees were identified.
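For readers who want to reproduce this kind of summary, here is a small sketch of how the counts, skewness, kurtosis, outliers, correlation, and group averages above can be computed with pandas. The data frame below is made up, so its numbers will not match the findings. The 1.5 × IQR rule used for outliers is an assumption for illustration (other rules exist), and the helper name iqr_outliers is hypothetical. Note that pandas' kurt() reports excess kurtosis, where a normal distribution scores 0.

```python
import pandas as pd

# Made-up rows standing in for the cleaned dataset.
df = pd.DataFrame({
    "Sector": ["Information Technology", "Information Technology",
               "Business Services", "Business Services", "Finance", "Finance"],
    "Rating": [3.5, 4.0, 3.0, 4.5, 3.8, 4.2],
    "High_Salary_in_dollar": [120000, 130000, 110000, 125000, 115000, 300000],
})


def iqr_outliers(series: pd.Series) -> pd.Series:
    """Return the values lying more than 1.5 * IQR below Q1 or above Q3."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)]


# Frequency of each category (the same call works for Size, Industry, Location, ...).
sector_counts = df["Sector"].value_counts()

# Shape of the salary distribution.
salary_skew = df["High_Salary_in_dollar"].skew()
salary_kurt = df["High_Salary_in_dollar"].kurt()  # excess kurtosis (normal = 0)

# Outliers under the assumed 1.5 * IQR rule.
salary_outliers = iqr_outliers(df["High_Salary_in_dollar"])

# Pearson correlation between rating and upper salary limit.
rating_salary_corr = df["Rating"].corr(df["High_Salary_in_dollar"])

# Average rating and salary per sector.
mean_by_sector = df.groupby("Sector")[["Rating", "High_Salary_in_dollar"]].mean()

print(sector_counts)
print(salary_skew, salary_kurt)
print(salary_outliers)
print(rating_salary_corr)
print(mean_by_sector)
```

Swapping "Sector" for "Industry", "Location", or "Type of Ownership" in the groupby call gives the other comparisons listed above.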
Conclusion
In conclusion, this data cleaning and exploratory data analysis project showcases my ability to transform raw data into meaningful insights. By addressing missing values, transforming data, and conducting a thorough analysis, I have enhanced the quality of the dataset and gained valuable insights into factors such as salary distribution, industry impact, and employee ratings.
As a data analyst, I am equipped with the skills to preprocess data effectively, conduct in-depth exploratory analyses, and provide actionable insights for informed decision-making. This project serves as a testament to my proficiency in data analytics and my commitment to delivering high-quality work in the field of data analysis.
Thank you for joining me on this journey of data cleaning and exploratory analysis of data science job postings on Glassdoor! If you’re interested in exploring the cleaned dataset or learning more about the specific code used in this project, feel free to check out my GitHub for the full code and resources.
Thanks for Reading!
If you enjoyed this, follow me to never miss another article!
If you have any questions, feel free to ask!