This analysis and its charts are intended to help aspiring data professionals make smarter decisions. The data was collected from the Glassdoor website.
<br>It was cleaned and transformed before the analysis.

In [1]:
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"

In [2]:
data = pd.read_csv("data_scientist_jobinfo.csv")

In [3]:
data.head()

Unnamed: 0,job_title,Location,Sector,Python,R,Scala,Spark,AWS,SQL,Excel,PowerBI,Tableau,Tensorflow,Pytorch,Keras,Company_Size,Company_Age
0,Engineer,Winnipeg,Information Technology,1,0,0,0,0,1,1,1,1,0,0,0,Medium,34
1,Scientist,Toronto,Information Technology,1,0,0,1,1,1,0,0,0,0,0,0,Small,7
2,Scientist,Toronto,Business Services,1,0,1,1,1,1,0,0,0,0,0,0,Medium,28
3,Scientist,Vancouver,Information Technology,1,0,1,0,1,0,1,0,0,0,0,0,Medium,10
4,Analyst,Waterloo,-1,1,0,0,0,1,1,1,0,0,0,0,0,Small,-1


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 532 entries, 0 to 531
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   job_title     532 non-null    object
 1   Location      532 non-null    object
 2   Sector        532 non-null    object
 3   Python        532 non-null    int64 
 4   R             532 non-null    int64 
 5   Scala         532 non-null    int64 
 6   Spark         532 non-null    int64 
 7   AWS           532 non-null    int64 
 8   SQL           532 non-null    int64 
 9   Excel         532 non-null    int64 
 10  PowerBI       532 non-null    int64 
 11  Tableau       532 non-null    int64 
 12  Tensorflow    532 non-null    int64 
 13  Pytorch       532 non-null    int64 
 14  Keras         532 non-null    int64 
 15  Company_Size  532 non-null    object
 16  Company_Age   532 non-null    int64 
dtypes: int64(13), object(4)
memory usage: 70.8+ KB


The dataset has 532 rows and 17 columns.
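
The preview above contains `-1` in both `Sector` and `Company_Age`, which looks like a placeholder for missing information. The later cells filter those rows out by hand; a minimal sketch of an alternative, assuming `-1` always means "unknown", is to convert the placeholders to missing values once. The `toy` frame below is made-up data for illustration, not the real file.

```python
import pandas as pd

# Toy rows mimicking two of the notebook's columns (made-up values, not the real data)
toy = pd.DataFrame({
    "Sector": ["Information Technology", "-1", "Finance"],
    "Company_Age": [34, -1, 7],
})

# mask() blanks out the entries where the condition is True, leaving a missing value behind
cleaned = toy.assign(
    Sector=toy["Sector"].mask(toy["Sector"] == "-1"),
    Company_Age=toy["Company_Age"].mask(toy["Company_Age"] < 0),
)

# Missing values are skipped by default in counts and averages
print(cleaned.isna().sum())
print(cleaned["Company_Age"].mean())
```

With the placeholders marked as missing, calls such as `value_counts()` and `mean()` ignore them without an explicit filter in every cell.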

In [5]:
fig = px.pie(data, names='job_title', title='Job Title', color_discrete_sequence=px.colors.sequential.haline)
fig.update_traces(textposition='inside', textinfo='percent+label+value', pull=[0, 0.2, 0, 0, 0, 0],
                 marker=dict(line=dict(color='#000000', width=2)))

fig.show()

Based on the pie chart, roughly 38.2% of the posted data jobs are Data Scientist roles. Data Analyst comes second with 26.3% and Data Engineer third with 18.4%. Other roles, such as Research Scientist, Machine Learning Engineer (MLE) and Director, are each under 10%. Part of this gap comes from overlapping job roles: some companies fold MLE tasks into the Data Scientist role. Even so, the chart clearly shows that Data Scientists are in demand.
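
The percentages quoted from the pie chart can also be computed directly, which is handy when the exact numbers are needed in text. This is a small sketch using `value_counts(normalize=True)`; the `toy` frame is made-up data, so its shares are not the ones reported above.

```python
import pandas as pd

# Made-up job titles for illustration only
toy = pd.DataFrame({"job_title": ["Scientist", "Scientist", "Analyst", "Engineer"]})

# normalize=True returns fractions of the total; scale to percent and round
shares = toy["job_title"].value_counts(normalize=True).mul(100).round(1)
print(shares)
```

Running the same expression on `data['job_title']` should reproduce the shares that the pie chart displays.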

In [6]:
print("Top 10 Sectors which have the most Jobs: \n")

sector_data = data[data['Sector']!='-1']
print(sector_data['Sector'].value_counts()[:10])

Top 10 Sectors which have the most Jobs: 

Information Technology       131
Business Services             55
Finance                       52
Biotech & Pharmaceuticals     36
Retail                        29
Media                         22
Manufacturing                 18
Insurance                     13
Telecommunications            11
Healthcare                    10
Name: Sector, dtype: int64


In [7]:
sector_wise = sector_data.groupby(by=['Sector'])['job_title'].count()
fig = go.Figure(data=[go.Bar(x=sector_wise.index, y=sector_wise.values)])

fig.update_traces(marker_color='rgb(158,202,225)', marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.8)

fig.update_layout(xaxis={'categoryorder':'total descending'},
                  title="Sector wise Total Jobs",
                  xaxis_title="Sectors",
                  yaxis_title="Total Jobs")  # rows with Sector == '-1' are excluded, so the bars do not sum to 532

fig.update_xaxes(tickangle=45, tickfont=dict(family='Rockwell', color='crimson', size=14))
fig.update_yaxes(tickfont=dict(family='Rockwell', color='darkblue', size=14))

fig.show()

The IT sector has more jobs than any other. In fact, it has over 100 postings, while Business Services, in second place, has only around 50. The Finance, Biotech & Pharmaceuticals and Retail sectors also have a relatively large number of postings. Aspiring data scientists can use this to choose which sector to target.

In [8]:
pivot_data = data[data['Sector']!='-1']

# Show every row of the result instead of letting pandas truncate the display
pd.set_option('display.max_rows', None)
pd.pivot_table(pivot_data, index =['Sector','job_title'],values='Company_Age', aggfunc='count').sort_values(
    'Company_Age', ascending = False).rename(columns={'Company_Age':'Job Count'})[:20]

Unnamed: 0_level_0,Unnamed: 1_level_0,Job Count
Sector,job_title,Unnamed: 2_level_1
Information Technology,Scientist,41
Information Technology,Engineer,37
Information Technology,Analyst,31
Business Services,Analyst,23
Biotech & Pharmaceuticals,Scientist,22
Finance,Scientist,19
Retail,Scientist,18
Business Services,Scientist,16
Finance,Analyst,15
Finance,Engineer,11


The table above shows which job roles each sector wants most. For instance, Business Services needs more analysts than scientists, which makes sense since that sector focuses on making smarter decisions by analysing data rather than building models.
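
A wide layout, with one row per sector and one column per role, can make comparisons like analysts versus scientists easier to read than the long sorted list. The sketch below uses `pd.crosstab` as one possible way to build it; the `toy` rows are made up for illustration.

```python
import pandas as pd

# Made-up postings for illustration only
toy = pd.DataFrame({
    "Sector": ["Information Technology", "Information Technology",
               "Business Services", "Business Services", "Business Services"],
    "job_title": ["Scientist", "Engineer", "Analyst", "Analyst", "Scientist"],
})

# Rows are sectors, columns are job titles, cells are posting counts
role_by_sector = pd.crosstab(toy["Sector"], toy["job_title"])
print(role_by_sector)
```

Combinations that never occur show up as 0, so gaps in a sector's hiring are visible at a glance.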

In [9]:
print(data['Company_Size'].value_counts())

pd.pivot_table(pivot_data, index =['Company_Size','job_title'],values='Company_Age', aggfunc='count').sort_values(
    ['Company_Size','Company_Age'], ascending = False).rename(columns={'Company_Age':'Job Count'})[:20]

Small     227
Medium    178
Large     127
Name: Company_Size, dtype: int64


Unnamed: 0_level_0,Unnamed: 1_level_0,Job Count
Company_Size,job_title,Unnamed: 2_level_1
Small,Analyst,41
Small,Scientist,34
Small,Engineer,25
Small,MLE,8
Small,Researcher,8
Small,Director,3
Medium,Scientist,65
Medium,Analyst,47
Medium,Engineer,31
Medium,Researcher,18


The table above tells us that it is not only big companies that make use of data. Smaller companies are also starting to realize the power of data and how it can help them, and they are the ones hiring the most.
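
Because the size groups contain different numbers of postings, raw counts can hide differences in the mix of roles. One option, sketched here on made-up rows, is to express each role as a share of its own size group.

```python
import pandas as pd

# Made-up postings for illustration only
toy = pd.DataFrame({
    "Company_Size": ["Small", "Small", "Small", "Small", "Large", "Large"],
    "job_title": ["Analyst", "Analyst", "Scientist", "Engineer", "Scientist", "Scientist"],
})

# Share of each role within its company-size group (each row sums to 1)
role_mix = (
    toy.groupby("Company_Size")["job_title"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print(role_mix)
```

Each row then answers the question "of the postings from companies of this size, what fraction is for each role?".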

In [10]:
fig = px.histogram(data[data['Company_Age']>0], x="Company_Age",
                   opacity=.8, labels={'Company_Age':'Company Age'},
                   title='Histogram of Company\'s Age',
                   color_discrete_sequence=['rgb(0, 100, 100)'])

fig.show()

This histogram shows that newer companies are also hiring data professionals to make smarter decisions for their businesses. It also suggests that you don't need a huge amount of data to drive more business profit. <b>It's about how you use what you have to solve business problems.</b>

In [11]:
pd.pivot_table(data, index =['Location'],values='Company_Age', aggfunc='count').sort_values(
    'Company_Age', ascending = False).rename(columns={'Company_Age':'Job_Count'})[:10]

Unnamed: 0_level_0,Job_Count
Location,Unnamed: 1_level_1
Toronto,152
Vancouver,72
Montreal,68
Mississauga,29
Ottawa,25
Brampton,21
Calgary,15
Canada,9
Waterloo,8
Victoria,8


The table above shows that most jobs are in the bigger cities.
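
The table also contains a `Canada` row, which is a country rather than a city, presumably from postings that did not name a specific location. A minimal sketch of excluding it before ranking cities, on made-up rows:

```python
import pandas as pd

# Made-up locations for illustration only
toy = pd.DataFrame({"Location": ["Toronto", "Toronto", "Canada", "Vancouver"]})

# Keep only rows whose location is not the country-level placeholder
city_counts = toy.loc[toy["Location"] != "Canada", "Location"].value_counts()
print(city_counts)
```

Whether such rows should be dropped or reported separately depends on how the postings were scraped, so treat the filter as an assumption.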

In [12]:
specs = [[{'type':'domain'}, {'type':'domain'}], [{'type':'domain'}, {'type':'domain'}]]

fig = make_subplots(rows=2, cols=2, specs=specs, subplot_titles=['Python', 'R', 'SQL', 'Scala'])

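# value_counts() orders by frequency, so reindex() pins the order to match the hard-coded labels
fig.add_trace(go.Pie(labels=['Yes','No'], values=data['Python'].value_counts().reindex([1, 0], fill_value=0),
                    name='Python', marker_colors=['#00FFFF','#550000']), 1, 1)

fig.add_trace(go.Pie(labels=['No','Yes'], values=data['R'].value_counts().reindex([0, 1], fill_value=0), name='R'), 1, 2)

fig.add_trace(go.Pie(labels=['No','Yes'], values=data['SQL'].value_counts().reindex([0, 1], fill_value=0), name='SQL'), 2, 1)

fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Scala'].value_counts().reindex([0, 1], fill_value=0), name='Scala'), 2, 2)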

fig.update_traces(textposition='inside', textinfo='percent+label+value', hole=.3,
                 marker=dict(line=dict(color='#000000', width=2)))

fig.update(layout_title_text='Languages Requirements',
           layout_showlegend=True)

fig.update_layout(
    autosize=False,
    width=700,
    height=700)

fig = go.Figure(fig)

fig.show()

The pie charts above illustrate that Python and SQL are must-have languages for any data professional. Other languages depend on the company's requirements. Scala is also getting popular because of Apache Spark.
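
Since `value_counts()` orders its result by frequency, a hand-written label list can end up attached to the wrong slice when the majority flips between columns. A minimal sketch of one way to keep labels and counts together is shown below; `yes_no_counts` is a hypothetical helper, not part of the original notebook, and the `toy` frame is made-up data.

```python
import pandas as pd

def yes_no_counts(flags: pd.Series) -> pd.Series:
    """Count 0/1 flags and return them in a fixed No/Yes order."""
    # reindex() guarantees both categories exist and always appear in the same order
    counts = flags.value_counts().reindex([0, 1], fill_value=0)
    counts.index = ["No", "Yes"]
    return counts

# Made-up skill flags for illustration only
toy = pd.DataFrame({"Python": [1, 1, 1, 0], "Scala": [0, 0, 0, 0]})

python_counts = yes_no_counts(toy["Python"])
scala_counts = yes_no_counts(toy["Scala"])
print(python_counts)
print(scala_counts)
```

The result can be passed to a pie trace as `labels=counts.index` and `values=counts.values`, so the same code works for every skill column regardless of which answer is more common.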

In [13]:
specs = [[{'type':'domain'}, {'type':'domain'}], [{'type':'domain'},{'type':'domain'}]]

fig = make_subplots(rows=2, cols=2, specs=specs, subplot_titles=['Tensorflow', 'Pytorch', 'Keras'])

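# value_counts() orders by frequency, so reindex() pins the order to match the hard-coded labels
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Tensorflow'].value_counts().reindex([0, 1], fill_value=0),
                     name='Tensorflow', marker_colors=['#550000','#00FFFF']), 1, 1)

fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Pytorch'].value_counts().reindex([0, 1], fill_value=0), name='Pytorch'), 1, 2)

fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Keras'].value_counts().reindex([0, 1], fill_value=0), name='Keras'), 2, 1)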

fig.update_traces(textposition='inside', textinfo='percent+label+value', hole=.3,
                 marker=dict(line=dict(color='#000000', width=2)))

fig.update(layout_title_text='DL Framework Requirements',
           layout_showlegend=True)

fig.update_layout(autosize=False,
                  width=800,
                  height=800)

fig = go.Figure(fig)
fig.show()

Most companies require TensorFlow and its higher-level API, Keras. TensorFlow is more popular than PyTorch because of its deployment functionality. Nevertheless, PyTorch is also popular for its ease of use.

In [14]:
specs = [[{'type':'domain'}, {'type':'domain'}], [{'type':'domain'},{'type':'domain'}]]

fig = make_subplots(rows=2, cols=2, specs=specs, subplot_titles=['Excel', 'Tableau', 'PowerBI'])

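# value_counts() orders by frequency, so reindex() pins the order to match the hard-coded labels
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Excel'].value_counts().reindex([0, 1], fill_value=0),
                     name='Excel', marker_colors=['#550000','#00FFFF']), 1, 1)

fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Tableau'].value_counts().reindex([0, 1], fill_value=0), name='Tableau'), 1, 2)

fig.add_trace(go.Pie(labels=['No','Yes'], values=data['PowerBI'].value_counts().reindex([0, 1], fill_value=0),
                     name='PowerBI'), 2, 1)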

fig.update_traces(textposition='inside', textinfo='percent+label+value', hole=.3,
                 marker=dict(line=dict(color='#000000', width=2)))

fig.update(layout_title_text='BI Tool Requirements',
           layout_showlegend=True)

fig.update_layout(autosize=False,
                  width=800,
                  height=800)

fig = go.Figure(fig)
fig.show()

In terms of visualization tools, Excel is still popular, but Tableau is a more powerful tool that is very easy to use and doesn't require any coding skills.

In [15]:
specs = [[{'type':'domain'}, {'type':'domain'}]]

fig = make_subplots(rows=1, cols=2, specs=specs, subplot_titles=['AWS', 'Spark'])

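# value_counts() orders by frequency, so reindex() pins the order to match the hard-coded labels
fig.add_trace(go.Pie(labels=['No','Yes'], values=data['AWS'].value_counts().reindex([0, 1], fill_value=0),
                     name='AWS', marker_colors=['#550000','#00FFFF']), 1, 1)

fig.add_trace(go.Pie(labels=['No','Yes'], values=data['Spark'].value_counts().reindex([0, 1], fill_value=0), name='Spark'), 1, 2)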

fig.update_traces(textposition='inside', textinfo='percent+label+value', hole=.3,
                 marker=dict(line=dict(color='#000000', width=2)))

fig.update(layout_title_text='AWS & Spark Requirements',
           layout_showlegend=True)

fig = go.Figure(fig)
fig.show()

AWS and Spark are the most important technologies to know for better job prospects at larger companies.

In [16]:
columns = ['Python', 'R', 'AWS', 'Scala', 'Excel', 'Tableau', 'PowerBI', 'Spark', 'SQL', 'Pytorch', 'Tensorflow', 'Keras']
count = []

for col in columns:
    count.append(data[data[col]==1][col].count())


fig = go.Figure(data=[go.Bar(x=columns, y=count)])

fig.update_traces(marker_color='darkblue', marker_line_color='rgb(0,255,255)',
                  marker_line_width=1.5, opacity=.8)

fig.update_layout(xaxis={'categoryorder':'total descending'},
                  title="Number of times Tool & Technologies Mentioned in Job Descriptions",
                  xaxis_title="Tools & Technologies",
                  yaxis_title="Count(532)")

fig.update_xaxes(tickfont=dict(family='Rockwell', color='crimson', size=14))
fig.update_yaxes(tickfont=dict(family='Rockwell', color='darkblue', size=14))

fig.show()

This bar graph shows which tools you should focus on learning. One more thing: Keras is last here, but that doesn't mean it isn't required. Most companies do not include it in the job description because they expect you to know such basic tools for easy model development.
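
Which tools to focus on may also depend on the role being targeted. As a rough sketch, the 0/1 skill columns can be averaged per job title to get the share of postings that mention each tool; the `toy` frame below is made-up data with only two of the skill columns.

```python
import pandas as pd

# Made-up postings for illustration only
toy = pd.DataFrame({
    "job_title": ["Scientist", "Scientist", "Analyst", "Analyst"],
    "Python": [1, 1, 1, 0],
    "Excel": [0, 0, 1, 1],
})

# The mean of a 0/1 column is the fraction of postings that mention the tool
skill_share = toy.groupby("job_title")[["Python", "Excel"]].mean().mul(100)
print(skill_share)
```

Applied to the real data with the full `columns` list, this would give one row per role and one percentage per tool.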