{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

Analyzing Data Science Stack Exchange Data

\n", "

Determine the most popular content on the Data Science community of Stack Exchange

\n", "\n", "Stack Exchange is a network of question-and-answer websites on topics in diverse fields, each site covering a specific topic, where questions, answers, and users are subject to a reputation award process. The reputation system allows the sites to be self-moderating.
\n", "Data Science Stack Exchange is a question-and-answer forum that connects the Data Science community. It mainly hosts posts on Data Science topics such as Machine Learning, Statistical Mathematics and Visualization. Each post usually contains a question on one of these topics, followed by a string of answers.\n", "\n", "\n", "The aim of this project is to analyze the following:-\n", "\n", "* Questions posted on the Data Science community of Stack Exchange in 2019, to determine the most popular topics among these posts. \n", "* The trend in the popularity of the deep learning topic over the years. \n", " * Would it benefit the community to post on Deep Learning?\n", " * What other topics can be paired with Deep Learning for the posts?\n", "\n", "The purpose of the analysis is to decide which posts to write, to help the community, on the topics that are most sought after in the Data Science community of Stack Exchange.
\n", "Stack Exchange hosts an open database that the community can query. After exploring the database, the *Posts* table was found to be relevant for the end goal. The result of the query is downloaded into the 2019_questions.csv file, which contains posts from the year 2019 only.\n", "\n", "The *Posts* table has the following attributes (columns):
\n", "\n", " PostTypeId : id\n", " CreationDate : date the post was uploaded\n", " Score : the score of the post\n", " ViewCount : number of views\n", " Tags : tags associated with the posts\n", " AnswerCount : number of replies to the post\n", " FavoriteCount : number of likes received by the post\n", "\n", "Of the post types, we will mainly focus on Questions and Answers. The other kinds of posts are inconsequential.
\n", "\n", "The database is queried for all questions & answers in 2019 with the columns above, and the result is stored in the CSV file 2019_questions.csv\n", "For the second part of the analysis, all questions are downloaded into all_questions.csv" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from itertools import combinations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset is read into the environment with the `parse_dates` parameter so that the column *CreationDate* is automatically read in as a datetime object." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('2019_questions.csv',parse_dates=['CreationDate'])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 8839 entries, 0 to 8838\n", "Data columns (total 7 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Id 8839 non-null int64 \n", " 1 CreationDate 8839 non-null datetime64[ns]\n", " 2 Score 8839 non-null int64 \n", " 3 ViewCount 8839 non-null int64 \n", " 4 Tags 8839 non-null object \n", " 5 AnswerCount 8839 non-null int64 \n", " 6 FavoriteCount 1407 non-null float64 \n", "dtypes: datetime64[ns](1), float64(1), int64(4), object(1)\n", "memory usage: 483.5+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Id 0\n", "CreationDate 0\n", "Score 0\n", "ViewCount 0\n", "Tags 0\n", "AnswerCount 0\n", "FavoriteCount 7432\n", "dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isna().sum()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ 
"df.fillna(0,inplace=True)\n", "df.FavoriteCount = df.FavoriteCount.astype(int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The *FavoriteCount* column has null values. Since the focus of the project is on finding the most popular posts, *popular* has to be defined in this context. *FavoriteCount* gives the number of likes received by a post. While this gives a peek into which posts might be popular, it does not show the entire picture: many posts attract a lot of viewers, but not every viewer *likes* the post. Hence *ViewCount* is a better statistic for how popular a post is. By this logic, a post is considered popular for this project if it accumulates a relatively large number of views.
\n", "This means the *FavoriteCount* column is no longer of importance, and hence its null values are filled with 0s.
\n", "\n", "The dataset as such does not contain the text of the questions posted. This makes it difficult to learn the topic of a question. An alternative is to consider the tags given to the question by the author. These tags summarize the topic of the question. The *Tags* column holds the tags for the question. Each question can have multiple tags." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0 \n", "1 \n", "3 \n", "4 \n", " ... \n", "8834 \n", "8835 \n", "8836 \n", "8837 \n", "8838 \n", "Name: Tags, Length: 8839, dtype: object" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.Tags" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The *Tags* column is the center of interest to the project. The tags have to be cleaned and made uniform for further analysis. \n", "\n", "* All types of chevrons are removed\n", "* The string is split on ',' to convert it to list format\n", "\n", "A list format makes it easier to carry out any modifications/inferences to the *Tags* column." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 [machine-learning, data-mining]\n", "1 [machine-learning, regression, linear-regressi...\n", "2 [python, time-series, forecast, forecasting]\n", "3 [machine-learning, scikit-learn, pca]\n", "4 [dataset, bigdata, data, speech-to-text]\n", " ... 
\n", "8834 [pca, dimensionality-reduction, linear-algebra]\n", "8835 [keras, weight-initialization]\n", "8836 [python, visualization, seaborn]\n", "8837 [time-series]\n", "8838 [k-nn]\n", "Name: Tags, Length: 8839, dtype: object" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Turn '<a><b>' into 'a,b', then split into a list of tags.\n", "# regex is set explicitly because its default differs across pandas versions.\n", "df.Tags = df.Tags.str.replace(\"><\",\",\",regex=False)\n", "df.Tags = df.Tags.str.replace(\"[<>]\",\"\",regex=True)\n", "df.Tags = df.Tags.str.split(\",\")\n", "df.Tags" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Popularity, as defined above, is the relative number of views on a post. Since the text of the posts is missing from the data, popularity is instead based on the tags. The revised definition: the popularity of a topic (tag) is given by the number of posts on that topic and the total views amassed by the topic overall.
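
As a rough sketch of this definition (not the notebook's own implementation, which uses explicit loops), the per-tag post count and view total could also be computed in one pass with pandas `explode` and `groupby`. The tiny `sample` frame and the function name `tag_popularity` are illustrative assumptions, and the values are made up.

```python
import pandas as pd

# Hypothetical miniature of the cleaned data: one row per question,
# with Tags already converted to a list (values are made up).
sample = pd.DataFrame({
    'Tags': [['python', 'pandas'], ['python'], ['keras', 'python']],
    'ViewCount': [100, 50, 30],
})


def tag_popularity(data):
    # One row per (question, tag), then count posts and sum views per tag.
    exploded = data.explode('Tags')
    summary = exploded.groupby('Tags')['ViewCount'].agg(['size', 'sum'])
    summary.columns = ['TagCount', 'ViewCount']
    return summary.reset_index()


popularity = tag_popularity(sample)
print(popularity)
```

This assumes a pandas version that provides `DataFrame.explode` and that no question lists the same tag twice.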
\n", "\n", "For this purpose, two functions are coded. The `count_tags()` function returns the number of posts tagged with the topic (tag) passed. The `count_views()` function returns the number of views accumulated by the topic (tag) across all posts. A new dataframe is created that holds the returned value for each unique tag in the dataset." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                  Tags  TagCount  ViewCount
0                 .net         1        438
1  3d-object-detection         1          7
2    3d-reconstruction         9       1129
3              ab-test         6        153
4             accuracy        89      15233
\n", "
" ], "text/plain": [ " Tags TagCount ViewCount\n", "0 .net 1 438\n", "1 3d-object-detection 1 7\n", "2 3d-reconstruction 9 1129\n", "3 ab-test 6 153\n", "4 accuracy 89 15233" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tags = np.unique(np.concatenate(df.Tags,axis=None))\n", "tag_count = {}\n", "\n", "# Number of posts carrying the given tag\n", "def count_tags(tag,data):\n", "    ctr=0\n", "    for row in data['Tags']:\n", "        if tag in row:\n", "            ctr+=1\n", "    return ctr\n", "\n", "# Total views over all posts carrying the given tag\n", "def count_views(tag,data):\n", "    ctr=0\n", "    for row in data[['Tags','ViewCount']].iterrows():\n", "        if tag in row[1]['Tags']:\n", "            ctr+=row[1]['ViewCount']\n", "    return ctr\n", "\n", "for tag in tags:\n", "    tag_count[tag] = [count_tags(tag,df),count_views(tag,df)]\n", "\n", "df_tags = pd.DataFrame.from_dict(tag_count,orient='index').reset_index()\n", "df_tags.columns = ['Tags','TagCount','ViewCount']\n", "df_tags.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataframe summarizes the popularity of each tag for the purpose of the project. *TagCount* shows the most popular tags in terms of the number of questions posted with that tag. The plot below gives a relative comparison among the top 10 tags with the highest number of posts on the Data science Stack Exchange server." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_tags.sort_values(by='TagCount',ascending=False,inplace=True)\n", "most_used = df_tags.Tags.iloc[:20]\n", "\n", "plt.style.use('fivethirtyeight')\n", "plt.figure(figsize=(24,12))\n", "sns.barplot(y=df_tags.Tags.iloc[0:10],x=df_tags.TagCount.iloc[0:10])\n", "plt.ylabel(\"Popular Tags\")\n", "plt.xlabel(\"Number of Posts\")\n", "plt.xticks([],rotation=15)\n", "plt.title(\"Top 10 most posted topics on Data science Stack Exchange\",fontsize=32,y=1.07)\n", "plt.suptitle('Top 10 tags with the highest number of posts',y=0.93, fontsize = 25)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, *ViewCount* shows the most popular tags with respect to the number of views received by the posts carrying each tag. The plot below gives a relative comparison between the top 10 most viewed topics (tags) on the Data science Stack Exchange server." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_tags.sort_values(by='ViewCount',ascending=False,inplace=True)\n", "most_viewed = df_tags.Tags.iloc[:20]\n", "\n", "plt.figure(figsize=(24,12))\n", "sns.barplot(y=df_tags.Tags.iloc[0:10],x=df_tags.ViewCount.iloc[0:10])\n", "plt.ylabel(\"Popular Tags\")\n", "plt.xlabel(\"Number of Views\")\n", "plt.xticks([],rotation=45)\n", "plt.title(\"Top 10 most viewed topics on Data science Stack Exchange\",fontsize=32,y=1.07)\n", "plt.suptitle('Top 10 tags with the highest number of views',y=0.93, fontsize = 25)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two plots above give the most popular tags in terms of number of views amassed or the number of questions posted with that tag. Between the two groups, there are common tags. These common tags are ultimately the most popular tags on the Data science Stack Exchange server. They have high views and high number of posts.\n", "\n", "NOTE : The plot shows the top 10 popular tags for both criteria, but for the intersection, the top 20 tags are considered, to increase the resulting set." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['classification' 'clustering' 'cnn' 'dataset' 'deep-learning' 'keras'\n", " 'lstm' 'machine-learning' 'neural-network' 'nlp' 'pandas' 'python'\n", " 'regression' 'scikit-learn' 'tensorflow' 'time-series']\n" ] } ], "source": [ "popular_tags = np.intersect1d(most_used,most_viewed)\n", "print(popular_tags)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Between the two segments there are 16 common tags. These tags are the most popular tags :-
\n", "\n", " * classification * clustering * cnn * dataset\n", " * deep-learning * keras * lstm * machine-learning\n", " * neural-network * nlp * pandas * python\n", " * regression * scikit-learn * tensorflow * time-series\n", "\n", "The popular tags mostly belong to a common topic. Based on the dataset, relations between these tags can be discovered to better understand which pairs or sets of tags are widely popular and thus which segment of data science is widely popular in terms of questions on the server.
\n", "\n", "One method is to make all possible pairs from the popular tags and make a table with the pair's Tag counts and View counts. This results in a popularity index as done for the single tags.
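\n", "The pairs are combinations of the popular tags only, not of the entire tag list. Under the assumption that only the popular tags (taken from the top 20 by each criterion) are frequent and the rest are not, any pair containing one of the remaining tags can be eliminated via the downward-closure (Apriori) property from data mining: a pair cannot be frequent unless both of its tags are frequent.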
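
A minimal sketch of this pair-counting idea, under stated assumptions: the `sample` frame and the `popular` set are made up for illustration, and the loop runs over questions (counting each co-occurring popular pair once per question) rather than over every candidate pair as the notebook does.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Hypothetical cleaned data: Tags is a list per question (values are made up).
sample = pd.DataFrame({
    'Tags': [['keras', 'python', 'r'], ['keras', 'python'], ['python', 'pandas']],
    'ViewCount': [10, 20, 5],
})
# Hypothetical set of popular tags; pairs involving other tags are pruned.
popular = {'keras', 'python', 'pandas'}

pair_posts = Counter()
pair_views = Counter()
for tags, views in zip(sample['Tags'], sample['ViewCount']):
    # Keep only popular tags, sorted so each pair has one canonical order.
    kept = sorted(set(tags) & popular)
    for pair in combinations(kept, 2):
        pair_posts[pair] += 1
        pair_views[pair] += views

print(pair_posts.most_common())
```

Pairs that never occur together simply do not appear in the counters, which is equivalent to a count of zero.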
\n", "\n", "Similar to previous steps, two functions are coded. The `count_pair()` function returns the number of times the pair of tags have appeared together in the questions posted. The `count_pair_views()` function returns the number of views amassed by posts tagged with the pair. The popular pair of tags will be the intersection of the two i.e. pair of tags having high number of posts as well as high number of views." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                              Tags  TagCount  ViewCount
0     (classification, clustering)        12       1532
1            (classification, cnn)        20       1096
2        (classification, dataset)        28       1991
3  (classification, deep-learning)        59      35105
4          (classification, keras)        58      37610
\n", "
" ], "text/plain": [ " Tags TagCount ViewCount\n", "0 (classification, clustering) 12 1532\n", "1 (classification, cnn) 20 1096\n", "2 (classification, dataset) 28 1991\n", "3 (classification, deep-learning) 59 35105\n", "4 (classification, keras) 58 37610" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pair_tags = list(combinations(popular_tags,2))\n", "pair_tag_count = {}\n", "\n", "# Number of posts carrying both tags of the pair\n", "def count_pair(tags,data):\n", "    ctr=0\n", "    for row in data['Tags']:\n", "        if set(tags).issubset(row):\n", "            ctr+=1\n", "    return ctr\n", "\n", "# Total views over all posts carrying both tags of the pair\n", "def count_pair_views(tags,data):\n", "    ctr=0\n", "    for row in data[['Tags','ViewCount']].iterrows():\n", "        if set(tags).issubset(row[1]['Tags']):\n", "            ctr+=row[1]['ViewCount']\n", "    return ctr\n", "\n", "for tags in pair_tags:\n", "    pair_tag_count[tags] = [count_pair(tags,df),count_pair_views(tags,df)]\n", "\n", "df_pair_tags = pd.DataFrame.from_dict(pair_tag_count,orient='index').reset_index()\n", "df_pair_tags.columns = ['Tags','TagCount','ViewCount']\n", "df_pair_tags.head(5)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "df_pair_tags.sort_values(by='TagCount',ascending=False,inplace=True)\n", "most_used_pair = df_pair_tags.Tags.iloc[:10]\n", "\n", "df_pair_tags.sort_values(by='ViewCount',ascending=False,inplace=True)\n", "most_viewed_pair = df_pair_tags.Tags.iloc[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The initial tags were the most popular individually. The dataframe above was built from their pairwise combinations: for every pair of tags, it shows the number of posts tagged with that pair and the total number of views amassed. Since the initial set consists of popular tags, a popular pair indicates an association between the two tags forming it." 
] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([('deep-learning', 'keras'), ('deep-learning', 'machine-learning'),\n", " ('deep-learning', 'neural-network'), ('keras', 'neural-network'),\n", " ('keras', 'python'), ('keras', 'tensorflow'),\n", " ('machine-learning', 'neural-network'),\n", " ('machine-learning', 'python'), ('pandas', 'python')], dtype=object)" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "popular_tags_pair = np.intersect1d(most_used_pair,most_viewed_pair)\n", "popular_tags_pair" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The popular pairs of tags resulting from the above analysis are :-\n", "\n", " * deep-learning : keras \n", " * deep-learning : machine-learning \n", " * deep-learning : neural-network\n", " * keras : neural-network \n", " * keras : python \n", " * keras : tensorflow\n", " * machine-learning : neural-network \n", " * machine-learning : python \n", " * pandas : python\n", "\n", "The concept used above is that of finding frequent item sets from data mining, or in this case popular item sets. The next step in the algorithm would be to make triplets from these popular pairs. This would represent an association between three tags.
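
The triplet step can be sketched as follows, as an illustration rather than the notebook's own code: starting from the nine popular pairs listed above, a triplet is kept only when all three of its pairs are popular. The function name `candidate_triplets` is a hypothetical helper.

```python
from itertools import combinations

# The nine popular pairs reported above, stored as unordered sets.
popular_pairs = {
    frozenset(p) for p in [
        ('deep-learning', 'keras'), ('deep-learning', 'machine-learning'),
        ('deep-learning', 'neural-network'), ('keras', 'neural-network'),
        ('keras', 'python'), ('keras', 'tensorflow'),
        ('machine-learning', 'neural-network'),
        ('machine-learning', 'python'), ('pandas', 'python'),
    ]
}


def candidate_triplets(pairs):
    # All tags that occur in at least one popular pair.
    items = sorted(set().union(*pairs))
    # Keep a triplet only if each of its three pairs is popular.
    return [
        trio for trio in combinations(items, 3)
        if all(frozenset(p) in pairs for p in combinations(trio, 2))
    ]


triplets = candidate_triplets(popular_pairs)
print(triplets)
```

Under this strict check, a triplet containing both keras and machine-learning is not generated, because that pair is not among the nine pairs above.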
\n", "The possible triplets that can be formed as a combination of the above popular pairs are :-\n", "\n", "* deep-learning : keras : neural-network\n", "* deep-learning : machine-learning : neural-network\n", "* machine-learning : python : keras (the keras : machine-learning pair is not in the list above, so this triplet only partly satisfies the rule in the note below)\n", "\n", "NOTE : The triplets formed are not all the possible triplets. They are pruned based on the following logic: a triplet is formed if and only if all pairs formed from the triplet are part of the set of pairs from which the triplets are derived." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the first part of the analysis the following conclusions are drawn :-\n", "\n", "* More generally, tags representing *predictive analysis* are more popular on the Data science community of the Stack Exchange server than those for the descriptive analysis or visualization part of the Data science field.\n", "* The following tags are associated with each other and popular across the Data science community on Stack Exchange server,\n", "\n", " * deep-learning\n", " * machine-learning\n", " * neural-network\n", " * keras" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next part of the analysis is to discover the trend in popularity of deep-learning over the years and to answer the questions :-\n", "\n", "* Would it benefit the community to post on Deep learning?\n", "* What other topics can be paired with Deep learning for the posts?\n", "\n", "A different dataset has been downloaded for this purpose. Instead of just the questions from the year 2019, it contains all questions across all years." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
      Id         CreationDate  Tags
0  45416  2019-02-12 00:36:29  <python><keras><tensorflow><cnn><probability>
1  45418  2019-02-12 00:50:39  <neural-network>
2  45422  2019-02-12 04:40:51  <python><ibm-watson><chatbot>
3  45426  2019-02-12 04:51:49  <keras>
4  45427  2019-02-12 05:08:24  <r><predictive-modeling><machine-learning-mode...
\n", "
" ], "text/plain": [ " Id CreationDate \\\n", "0 45416 2019-02-12 00:36:29 \n", "1 45418 2019-02-12 00:50:39 \n", "2 45422 2019-02-12 04:40:51 \n", "3 45426 2019-02-12 04:51:49 \n", "4 45427 2019-02-12 05:08:24 \n", "\n", " Tags \n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "RangeIndex: 21576 entries, 0 to 21575\n", "Data columns (total 3 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Id 21576 non-null int64 \n", " 1 CreationDate 21576 non-null object\n", " 2 Tags 21576 non-null object\n", "dtypes: int64(1), object(2)\n", "memory usage: 505.8+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, the *Tags* column is cleaned for ease of analysis.\n", "* All chevrons are removed from the tags column.\n", "* The tags are converted to lists for ease of use" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# regex is set explicitly because its default differs across pandas versions.\n", "df.Tags = df.Tags.str.replace(\"><\",\",\",regex=False)\n", "df.Tags = df.Tags.str.replace(\"[<>]\",\"\",regex=True)\n", "df.Tags = df.Tags.str.split(\",\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The purpose is to find the general trend deep learning has witnessed over time. The *CreationDate* column holds granular information about when each question was posted. Since the aim is only to find the trend, the year extracted from *CreationDate* suffices." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "df.CreationDate = pd.to_datetime(df.CreationDate)\n", "df['Year'] = df.CreationDate.dt.year" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Posts containing the deep-learning tag are relevant to the analysis. A new column *is_dl* is created to identify which posts are tagged as deep-learning." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "df['is_dl'] = df.Tags.apply(\n", " lambda x: 1 if 'deep-learning' in x else 0\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The posts have been identified as relevant or not and the years have been extracted from *CreationDate*.
\n", "From these results, two groupings are made. Group 1 groups on the *Year* column and aggregates by size; in short, it gives the number of questions posted on the Data science Stack Exchange server for each year.\n", "Group 2 groups on the *Year* column only those questions that were tagged deep-learning, again aggregating by size. Thus group 2 gives the number of deep-learning questions posted on the Data science Stack Exchange for each year.
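
As a hedged sketch of the same two counts, a single `groupby` on a 0/1 flag column can produce both numbers at once: the sum of the flag is the number of flagged questions and the group size is the total. The `sample` frame is made up for illustration; the column names follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical data: one row per question with its year and a 0/1 flag.
sample = pd.DataFrame({
    'Year': [2018, 2018, 2019, 2019, 2019],
    'is_dl': [1, 0, 1, 1, 0],
})

yearly = (
    sample.groupby('Year')['is_dl']
    .agg(['sum', 'size'])
    .rename(columns={'sum': 'DL_posts', 'size': 'Total'})
    .reset_index()
)
# Share of flagged questions per year, in percent.
yearly['Percentage'] = yearly['DL_posts'] / yearly['Total'] * 100
print(yearly)
```

One design difference: a year with no flagged questions is kept here with a count of 0, whereas an inner merge of two separate groupings would drop it.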
\n", "\n", "Both these groupings are combined using the `pd.merge()` function. The resulting dataframe gives the number of questions tagged deep-learning versus the total number of questions posted for each year. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   Year  DL_posts  Total
0  2014         8    562
1  2015        30   1167
2  2016       157   2146
3  2017       425   2957
4  2018       902   5475
5  2019      1216   8810
6  2020        67    459
\n", "
" ], "text/plain": [ " Year DL_posts Total\n", "0 2014 8 562\n", "1 2015 30 1167\n", "2 2016 157 2146\n", "3 2017 425 2957\n", "4 2018 902 5475\n", "5 2019 1216 8810\n", "6 2020 67 459" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_1 = df.groupby(by=['Year'])\n", "grouped_2 = df[df.is_dl == 1].groupby(by=['Year'])\n", "\n", "temp_1 = pd.DataFrame(grouped_2.size(),columns=['DL_posts']).reset_index()\n", "temp_2 = pd.DataFrame(grouped_1.size(),columns=['Total']).reset_index()\n", "\n", "df_posts = pd.merge(temp_1,temp_2,left_on=['Year'],right_on=['Year'],how='inner')\n", "df_posts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dividing the columns *DL_posts* and *Total* gives the proportion of posts posted as deep-learning as compared to total posts for that year. These proportions can be used to visualize the trend in the number of posts tagged deep-learning through the years." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
      DL_posts  Total  Percentage
Year
2014         8    562    1.423488
2015        30   1167    2.570694
2016       157   2146    7.315937
2017       425   2957   14.372675
2018       902   5475   16.474886
2019      1216   8810   13.802497
2020        67    459   14.596950
\n", "
" ], "text/plain": [ " DL_posts Total Percentage\n", "Year \n", "2014 8 562 1.423488\n", "2015 30 1167 2.570694\n", "2016 157 2146 7.315937\n", "2017 425 2957 14.372675\n", "2018 902 5475 16.474886\n", "2019 1216 8810 13.802497\n", "2020 67 459 14.596950" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_posts['Percentage'] = (df_posts.DL_posts/df_posts.Total) * 100\n", "df_posts.set_index('Year',inplace=True)\n", "df_posts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The table shows an upward trend in the number of posts tagged as deep-learning, from 8 in 2014 to 1216 in 2019. Their share of all questions rose from about 1.4% in 2014 to about 16.5% in 2018, before dipping to about 13.8% in 2019. The year 2020 has seen fewer posts as it is the current year, hence it is excluded from the analysis. The plot below visualizes the trend shown in the table for better understanding." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.style.use('seaborn')\n", "sns.set_style('white')\n", "\n", "plt.figure(figsize=(10,6))\n", "sns.lineplot(x=df_posts.index[:-1],y=df_posts.iloc[:-1].DL_posts, markers=\"o\",style=True, legend=False)\n", "plt.xlabel(\"\")\n", "plt.ylabel(\"\")\n", "plt.title(\"deep-learning tagged questions posted over the year\",fontsize=18,loc='center',y=1.06)\n", "plt.suptitle(\"Trend in the questions tagged as deep-learning vs total number of questions posted\",y=0.93,fontsize=14)\n", "plt.tick_params(left=False,bottom=False)\n", "plt.yticks([])\n", "plt.xticks([2014,2019])\n", "plt.text(max(df_posts.index[:-1]),max(df_posts.DL_posts)-90,max(df_posts.DL_posts))\n", "plt.text(min(df_posts.index[:-1]),min(df_posts.DL_posts)+50,min(df_posts.DL_posts))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "The visualization confirms the trend shown in the table. The following conclusions are drawn from the plot :-\n", "* Every year from 2014 to 2019, the number of questions tagged as deep-learning has increased.\n", "* Between 2014 and 2019, the share of questions tagged as deep-learning rose by about 12 percentage points (from about 1.4% to about 13.8%)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the previous analysis, it was found that Deep Learning was associated with other tags such as :-\n", "\n", "* keras\n", "* machine-learning\n", "* neural-network\n", "* tensorflow\n", "\n", "Of these four tags, neural networks are the underlying technology of Deep Learning; similarly, Keras and Tensorflow are frameworks built to support and implement deep learning. These three tags are highly associated with Deep Learning. Even though Machine Learning and Deep Learning intersect, Machine Learning also covers topics outside the scope of Deep Learning.
\n", "\n", "By this logic, the assumption is made that any post containing at least one of these tags,\n", "\n", "* deep-learning\n", "* keras\n", "* neural-network\n", "* tensorflow\n", "\n", "is relevant to the topic, and hence the post is categorized as a Deep Learning question.
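
A minimal sketch of this rule, assuming the tags are already stored as lists: a post is flagged when its tag list shares at least one element with the set of relevant tags. The `sample` frame is made up for illustration, and the flag column is named `is_relevant` here.

```python
import pandas as pd

relevant_tags = {'deep-learning', 'keras', 'tensorflow', 'neural-network'}

# Hypothetical cleaned data: Tags is a list per question (values are made up).
sample = pd.DataFrame({
    'Tags': [['python', 'keras'], ['pandas'], ['deep-learning', 'cnn']],
})

# 1 if the post carries at least one relevant tag, else 0.
sample['is_relevant'] = sample['Tags'].apply(
    lambda tags: int(not relevant_tags.isdisjoint(tags))
)
print(sample)
```

Using a set with `isdisjoint` avoids building an intermediate array for every row and stops at the first shared tag.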
\n", "A new column is created, *is_relevant*, to indicate the above. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "relevant_tags = ['deep-learning','keras','tensorflow','neural-network']\n", "\n", "df['is_relevant'] = df.Tags.apply( \n", " lambda x: 1 if list(np.intersect1d(x,relevant_tags)) else 0 \n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar to the previous logic, the grouping will happen based on years and aggregate on relevance of the post indicated by *is_relevant*." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   Year  DL_posts  Total
0  2014        30    562
1  2015       114   1167
2  2016       417   2146
3  2017       883   2957
4  2018      1876   5475
5  2019      2695   8810
6  2020       130    459
\n", "
" ], "text/plain": [ " Year DL_posts Total\n", "0 2014 30 562\n", "1 2015 114 1167\n", "2 2016 417 2146\n", "3 2017 883 2957\n", "4 2018 1876 5475\n", "5 2019 2695 8810\n", "6 2020 130 459" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_1 = df.groupby(by=['Year'])\n", "grouped_2 = df[df.is_relevant == 1].groupby(by=['Year'])\n", "\n", "temp_1 = pd.DataFrame(grouped_2.size(),columns=['DL_posts']).reset_index()\n", "temp_2 = pd.DataFrame(grouped_1.size(),columns=['Total']).reset_index()\n", "\n", "df_posts = pd.merge(temp_1,temp_2,left_on=['Year'],right_on=['Year'],how='inner')\n", "df_posts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dividing *DL_posts* by *Total* gives the proportion of posts relevant to the analysis compared to the total posts for that year. These proportions can be used to visualize the trend in the number of relevant posts through the years. \n", "\n", "NOTE : A relevant post here means that the post carries at least one of the four tags listed above. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
      DL_posts  Total  Percentage
Year
2014        30    562    5.338078
2015       114   1167    9.768638
2016       417   2146   19.431500
2017       883   2957   29.861346
2018      1876   5475   34.264840
2019      2695   8810   30.590238
2020       130    459   28.322440
\n", "
" ], "text/plain": [ " DL_posts Total Percentage\n", "Year \n", "2014 30 562 5.338078\n", "2015 114 1167 9.768638\n", "2016 417 2146 19.431500\n", "2017 883 2957 29.861346\n", "2018 1876 5475 34.264840\n", "2019 2695 8810 30.590238\n", "2020 130 459 28.322440" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_posts['Percentage'] = (df_posts.DL_posts/df_posts.Total) * 100\n", "df_posts.set_index('Year',inplace=True)\n", "df_posts" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.style.use('seaborn')\n", "sns.set_style('white')\n", "\n", "plt.figure(figsize=(10,6))\n", "sns.lineplot(x=df_posts.index[:-1],y=df_posts.iloc[:-1].DL_posts, markers=\"o\",style=True, legend=False)\n", "plt.xlabel(\"\")\n", "plt.ylabel(\"\")\n", "plt.title(\"Deep Learning questions posted over the year\",fontsize=18,loc='center',y=1.06)\n", "plt.suptitle(\"Trend in the questions from the Deep Learning category vs total number of questions posted\",y=0.93,fontsize=14)\n", "plt.tick_params(left=False,bottom=False)\n", "plt.yticks([])\n", "plt.xticks([2014,2019])\n", "plt.text(max(df_posts.index[:-1]),max(df_posts.DL_posts)-170,max(df_posts.DL_posts))\n", "plt.text(min(df_posts.index[:-1]),min(df_posts.DL_posts)+70,min(df_posts.DL_posts))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following inferences are made from the plot :-\n", "\n", "* Similar to the previous plot, there is an upward trend in the number of questions posted in the Deep Learning category.\n", "* From 2014 to 2019, the share of questions in the Deep Learning category rose by about 25 percentage points (from about 5.3% to about 30.6%)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The conclusions gathered from the project are :-\n", "\n", "* The following topics (tags) are most popular in the Data science community on Stack Exchange,\n", "\n", " * classification * clustering * cnn * dataset\n", " * deep-learning * keras * lstm * machine-learning\n", " * neural-network * nlp * pandas * python\n", " * regression * scikit-learn * tensorflow * time-series\n", " \n", "* Among the most popular topics, there is an association between certain topics, such as,\n", " \n", " * deep-learning : keras : neural-network\n", " * deep-learning : machine-learning : neural-network\n", " * machine-learning : python : keras\n", "\n", "* The Deep Learning category consists of the following tags associated with the deep-learning tag,\n", "\n", " * keras \n", " * tensorflow\n", " * neural-network\n", "\n", "* The Deep Learning category has seen an upward trend in the number of posts since 2014.\n", "\n", "* With roughly 30% of the questions posted in 2019, Deep Learning appears to be one of the most sought-after categories in the Data science community on Stack Exchange.\n", "\n", "* Creating posts related to Deep Learning would likely benefit the community and be popular within it." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 2 }