{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Performance of Mobile Gaming Apps\n", "\n", "* The Impact of a Business Model on the Relationship between Entrance Timing and Performance among Mobile Games\n", "\n", "The goal of this notebook is to guide readers through the process of analyzing Apple AppStore data received from Tilburg University. For more information read through the attached paper that was created for the purpose of finishing the master course Strategy and Business Models. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of Contents \n", " \n", "1. [Dataset](#dataset)\n", " \n", "2. [Purpose of this Study](#goal)\n", "\n", "3. [Functions](#functions) \n", "\n", "4. [Data Preparation](#dataprep)\n", "\n", " 4.1 [Time Slots](#timeslots)\n", " \n", " 4.2 [EDA](#eda)\n", " \n", " 4.3 [Creating Final Variables](#variables)\n", " \n", " 4.4 [Correlation Matrix](#correlation)\n", "\n", "5. [Selecting Statistical Model](#selectingmodel)\n", "\n", "6. [Statistical Analysis - Revenue Model](#revenue)\n", "\n", " 6.1 [Main Effects](#revenuemain)\n", " \n", " 6.2 [Interaction Effects](#revenueinteraction)\n", " \n", " 6.3 [Truncated Model](#revenuetruncated)\n", " \n", " 6.4 [Logistical Model](#revenuelogistical)\n", " \n", " 6.5 [Results](#revenueresults)\n", "\n", "7. [Statistical Analysis - Technological Innovation](#ti)\n", "\n", " 7.1 [Main Effects](#timain)\n", " \n", " 7.2 [Interaction Effects](#tiinteraction)\n", " \n", " 7.3 [Truncated Model](#titruncated)\n", " \n", " 7.4 [Logistical Model](#tilogistical)\n", " \n", " 7.5 [Results](#tiresults)\n", "\n", "8. [Conclusion](#conclusion) \n", "\n", "9. [References](#literature) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Dataset \n", "[Back to Table of Contents](#table)\n", "\n", "The dataset used in the study is information regarding apps scraped from the Apple AppStore. 
For each app, attributes such as its category, average rating, and number of ratings were collected. For more details see \"Data Preparation\". \n",
"\n",
"### 2. Purpose of this study \n",
"[Back to Table of Contents](#table)\n",
"\n",
"The main purpose of the study is to research the effects of a business model on the performance of Apple AppStore apps. Without going too deep into the academic background, the following hypotheses were tested in the paper: \n",
" \n",
"* Hypothesis 1: Both early and late entrants will perform better when they make use of\n",
"the free revenue model compared to the paid revenue model.\n",
"\n",
"* Hypothesis 2: Order of entry and revenue model interact such that free apps will\n",
"perform better when they are late entrants versus early entrants, whereas paid apps will\n",
"perform better when they are early entrants versus late entrants.\n",
" \n",
"* Hypothesis 3: Order of entry and technological innovation interact such that apps using\n",
"technological innovation will perform better when they are early entrants versus late\n",
"entrants, whereas apps not using technological innovation will perform better when they\n",
"are late entrants versus early entrants.\n",
"\n",
"I start by cleaning the data and doing some exploratory data analysis (EDA) before moving on to the statistical analysis. Throughout the process, I try to be as clear as possible about what is done and why. For the details, however, you may want to consult the paper written with my fellow students. Moreover, since I'm not able to share the data, this notebook cannot be run on your system; it is meant to show my thought process during the analysis. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Functions \n",
"[Back to Table of Contents](#table)\n",
"\n",
"Here are all the functions used in this study. It's a lot, I know! 
Normally this would all be in a separate .py file, but the goal is to be transparent, so here it is :-)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import re as re\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import seaborn as sns\n", "\n", "import string\n", "\n", "from collections import Counter\n", "from nltk.corpus import stopwords\n", "from collections import Counter\n", "\n", "%matplotlib inline \n", "%load_ext rpy2.ipython\n", "\n", "def create_4_timeslots(df):\n", " \"\"\" Creates and returns dataframes in 4 timeslots:\n", " - first_seen_2015: Apps in 2015 that were first seen around may\n", " - last_seen_2015: Apps in 2015 that were last seen around september\n", " - first_seen_2017: Apps in 2017 that were first seen around may\n", " - last_seen_2017: Apps in 2017 that were last seen around september\n", " \"\"\"\n", " # Create a dataframe with apps in 2015 that were first seen around may\n", " first_seen_2015 = df[(df['timestamp'] == df['firstseen']) & (df['firstseen'].str.contains('2015'))].groupby(by='id').first()\n", " first_seen_2015 = first_seen_2015.reset_index()\n", "\n", " # Create a dataframe with apps in 2017 that were first seen around may\n", " first_seen_2017 = df[(df['timestamp'] == df['firstseen']) & (df['firstseen'].str.contains('2017'))].groupby(by='id').first()\n", " first_seen_2017 = first_seen_2017.reset_index()\n", "\n", " # Create dataframes with apps in 2015 and 2017 that were last seen around september\n", " last_seen_2015 = df[(df['week'] > 30) & (df['week'] < 40)].groupby(by='id').last().reset_index()\n", " last_seen_2017 = df[df['week'] > 90].groupby(by='id').last().reset_index()\n", " \n", " return first_seen_2015, last_seen_2015, first_seen_2017, last_seen_2017\n", "\n", "def create_4_equal_timeslots(first_seen_2015, last_seen_2015, first_seen_2017, last_seen_2017):\n", " \"\"\" Returns the dataframes of the 4 
timeslots so that they include the exact same apps.\n", " - first_seen_2015 and last_seen_2015 contain the same apps\n", " - first_seen_2017 and last_seen_2017 contain the same apps\n", " \"\"\"\n", " # Making sure all id's that are in last_seen are also in first_seen\n", " list_ls_2015 = list(last_seen_2015['id'])\n", " first_seen_2015 = first_seen_2015[first_seen_2015['id'].isin(list_ls_2015)]\n", "\n", " # Making sure all id's that are in first_seen are also in last_seen\n", " list_fs_2015 = list(first_seen_2015['id'])\n", " last_seen_2015 = last_seen_2015[last_seen_2015['id'].isin(list_fs_2015)]\n", "\n", " # Making sure all id's that are in last_seen are also in first_seen\n", " list_ls_2017 = list(last_seen_2017['id'])\n", " first_seen_2017 = first_seen_2017[first_seen_2017['id'].isin(list_ls_2017)]\n", "\n", " # Making sure all id's that are in first_seen are also in last_seen\n", " list_fs_2017 = list(first_seen_2017['id'])\n", " last_seen_2017 = last_seen_2017[last_seen_2017['id'].isin(list_fs_2017)]\n", " \n", " return first_seen_2015, last_seen_2015, first_seen_2017, last_seen_2017\n", "\n", "def create_common_words():\n", " \"\"\" Returns a dataframe that contains the most common words for apps found in 2015 and 2017. \n", " I specifically choose to only include first seen apps seeing as that was their initial strategy. 
\n",
"    \"\"\"\n",
"    # Create stopwords and add some stopwords that I want removed\n",
"    stopwords_english = stopwords.words('english')\n",
"\n",
"    for word in ['u2022', '', 'u2028', 'will', 'get', 'make', 'like', 'just', 'use', 'u2013', 'let', 'game', '\\u2022', '-', '&', \n",
"                 'u', 'e', 'f', 'b', 'c', 'cu', 'bu', 'au', 'fu', 'us', 'go', 'du', 'eu', 'ea', 'uff', 'n', 'one']:\n",
"        stopwords_english.append(word)\n",
"\n",
"    # Create descriptions for each year (first seen), clean them and count the number of words\n",
"    description = {'2017': '', '2015': ''}\n",
"\n",
"    for year, value in description.items():\n",
"        description[year] = eval('first_seen_{}'.format(year))['description'].str.cat(sep=' ')  # create one string of column\n",
"        description[year] = re.sub('[^a-zA-Z]', ' ', description[year])  # only keep letters\n",
"        description[year] = description[year].replace(\"\\\\\", \"\").lower()  # remove backslashes and lower the text\n",
"        description[year] = ' '.join(description[year].split())  # remove extra spaces\n",
"        description[year] = description[year].split(' ')  # create a list with words\n",
"        description[year] = Counter(description[year])  # count how often a word occurs\n",
"\n",
"        # Removing stopwords\n",
"        for word in stopwords_english:\n",
"            if word in description[year].keys():\n",
"                del description[year][word]\n",
"\n",
"    # Create a dataframe of the count of words for each year for easier readability\n",
"    df = pd.DataFrame()\n",
"\n",
"    for year in ['2015', '2017']:\n",
"        for action, value in {'word': 0, 'count': 1}.items():\n",
"            df['{}_{}'.format(year, action)] = [word[value] for word in description[year].most_common(1000)]\n",
"\n",
"    return df\n",
"\n",
"def join_first_last(first_seen_2015, last_seen_2015, first_seen_2017, last_seen_2017):\n",
"    \"\"\" Returns the following two dataframes:\n",
"    - df_2015: An inner join of first_seen_2015 and last_seen_2015\n",
"    - df_2017: An inner join of first_seen_2017 and last_seen_2017\n",
"    \"\"\"\n",
"    # 
Merges first_seen_2015 with last_seen_2015 and adds _first and _last to columns to show which belong to which data\n", " first_seen_2015.columns = [column + \"_first\" if column != 'id' else 'id' for column in first_seen_2015.columns]\n", " last_seen_2015.columns = [column + \"_last\" if column != 'id' else 'id' for column in last_seen_2015.columns]\n", " df_2015 = pd.merge(first_seen_2015, last_seen_2015, on='id')\n", " \n", " # Merges first_seen_2017 with last_seen_2017 and adds _first and _last to columns to show which belong to which data\n", " first_seen_2017.columns = [column + \"_first\" if column != 'id' else 'id' for column in first_seen_2017.columns]\n", " last_seen_2017.columns = [column + \"_last\" if column != 'id' else 'id' for column in last_seen_2017.columns]\n", " df_2017 = pd.merge(first_seen_2017, last_seen_2017, on='id')\n", "\n", " return df_2015, df_2017\n", "\n", "def get_change(row, column_1, column_2):\n", " \"\"\" Used for lambda expression. Compares two columns and gives back a 1 if there's a difference and 0 if there isn't. \n", " \"\"\"\n", " if row[column_1] != row[column_2]:\n", " return 1\n", " else:\n", " return 0\n", " \n", "def create_change_columns(df_2015, df_2017, columns = ['price', 'screenshots', 'content_rating', 'compatibility', \n", " 'size', 'quan_language', 'appversion', 'ratingscurrentversion', \n", " 'ratingcurrentversion', 'title']):\n", " \"\"\" This will return two dataframes that have a number of new columns that signify the differences between the\n", " value when firstseen and lastseen. For example, the 'price' column might change when first released and seen a\n", " half year later. 
This function will return dataframes with columns that show whether there was a change (1) or not (0).\n",
"    \"\"\"\n",
"    for column in columns:\n",
"        df_2015['change_{}'.format(column)] = df_2015.apply(lambda row: get_change(row, '{}_first'.format(column),\n",
"                                                                                   '{}_last'.format(column)), axis = 1)\n",
"        df_2017['change_{}'.format(column)] = df_2017.apply(lambda row: get_change(row, '{}_first'.format(column),\n",
"                                                                                   '{}_last'.format(column)), axis = 1)\n",
"    return df_2015, df_2017\n",
"\n",
"def show_changes(df_2015, df_2017, columns = ['price', 'screenshots', 'content_rating', 'compatibility', 'size',\n",
"                                              'quan_language', 'appversion', 'ratingscurrentversion', 'ratingcurrentversion',\n",
"                                              'title']):\n",
"    \"\"\" Prints the number of changes of a column between firstseen and lastseen for 2015 and 2017\n",
"    \"\"\"\n",
"    for year in ['2015', '2017']:\n",
"        print('Changes in column between firstseen and lastseen of {}:\\n'.format(year))\n",
"\n",
"        for column in columns:\n",
"            changes = eval('df_{}'.format(year))['change_{}'.format(column)].value_counts()[1]\n",
"\n",
"            # Pad short column names so the output lines up; longer names are printed as-is\n",
"            if len(column) < 20:\n",
"                column = (' '*(20-len(column))) + column\n",
"\n",
"            print('{}: \\t{} of {}'.format(column, changes, len(eval('df_{}'.format(year)))))\n",
"        print()\n",
"\n",
"def optimized_for(df_2015, df_2017, devices = ['iphone', 'ipad', 'ipod touch']):\n",
"    \"\"\" Prints how many of the apps are optimized for certain devices based on their release\n",
"    \"\"\"\n",
"    print('Apps in 2015 that are optimized for the following devices (based on their release): \\n')\n",
"    for value in devices:\n",
"        df_2015['optimized_for_{}'.format(value)] = df_2015.apply(lambda row: 1 if value in row['compatibility_first'].lower()\n",
"                                                                  else 0, axis = 1)\n",
"        optimized = df_2015['optimized_for_{}'.format(value)].value_counts()[1]\n",
"        value = ' '*(20-len(value)) + value\n",
"        print('{}: \\t {} out of {}'.format(value, optimized, len(df_2015)))\n",
"    print()\n",
"    
\n", " print('Apps in 2017 that are optimized for the following devices (based on their release): \\n')\n", " for value in devices:\n", " df_2017['optimized_for_{}'.format(value)] = df_2017.apply(lambda row: 1 if value in row['compatibility_first'].lower() \n", " else 0, axis = 1)\n", " optimized = df_2017['optimized_for_{}'.format(value)].value_counts()[1]\n", " value = ' '*(20-len(value)) + value\n", " print('{}: \\t {} out of {}'.format(value, optimized, len(df_2017)))\n", " print()\n", " \n", "def count_subcategories(df_2015, df_2017):\n", " \"\"\" Prints how many apps there are in certain subcategories which are based on certain keywords in an apps description. \n", " \"\"\"\n", " slot_games = ['casino', 'slots', 'slot']\n", " driving = ['race', 'drive', 'car', 'driving', 'parking']\n", " puzzle = ['puzzle']\n", " adventure = ['adventure', 'jump', 'platformer']\n", " shooter = ['shoot', 'gun', 'pistol', 'sniper', 'war', 'vehicle']\n", " \n", " subcategories = {'slot_games': slot_games, 'driving': driving, 'puzzle': puzzle, 'adventure': adventure, 'shooter': shooter}\n", " \n", " print('Number of apps in 2015 in the following subcategories:\\n')\n", " for category, search_terms in subcategories.items():\n", " amount = len(df_2015[df_2015['description_first'].str.contains('|'.join(search_terms))])\n", " \n", " category = \" \"*(20-len(category)) + category\n", " print(category, ': ', amount, '\\tBased on the following terms: {}'.format(', '.join(search_terms)))\n", " print(' Total Apps : {}'.format(len(df_2015)))\n", " \n", " print('\\nNumber of apps in 2017 in the following subcategories:\\n')\n", " for category, search_terms in subcategories.items():\n", " amount = len(df_2017[df_2017['description_first'].str.contains('|'.join(search_terms))])\n", " \n", " category = \" \"*(20-len(category)) + category\n", " print(category, ': ', amount, '\\tBased on the following terms: {}'.format(', '.join(search_terms)))\n", " print(' Total Apps : 
{}'.format(len(df_2017)))\n",
"\n",
"def get_difference_rating(row1, row2):\n",
"    \"\"\" Returns the difference in rating between first seen and last seen.\n",
"    If both first seen and last seen have a rating of -1, then it will return 0\n",
"    If only first seen has a rating of -1, then it will return the rating of last seen\n",
"    In all other cases it returns the difference between last seen and first seen\n",
"    \"\"\"\n",
"    if row1 == -1:\n",
"        if row1 == row2:\n",
"            return 0\n",
"        else:\n",
"            return row2\n",
"    else:\n",
"        return row2 - row1\n",
"\n",
"def return_difference_rating(df_2015, df_2017):\n",
"    \"\"\" Calculates the difference between the rating(s) of an app when it was last seen and when it was first seen.\n",
"    Returns two dataframes, each with two extra columns indicating the difference in rating(s).\n",
"    \"\"\"\n",
"    df_2015['difference_rating'] = df_2015.apply(lambda row: get_difference_rating(row['ratingcurrentversion_first'],\n",
"                                                                                  row['ratingcurrentversion_last']), axis = 1)\n",
"    df_2017['difference_rating'] = df_2017.apply(lambda row: get_difference_rating(row['ratingcurrentversion_first'],\n",
"                                                                                  row['ratingcurrentversion_last']), axis = 1)\n",
"    df_2015['difference_ratings'] = df_2015.apply(lambda row: get_difference_rating(row['ratingscurrentversion_first'],\n",
"                                                                                   row['ratingscurrentversion_last']), axis = 1)\n",
"    df_2017['difference_ratings'] = df_2017.apply(lambda row: get_difference_rating(row['ratingscurrentversion_first'],\n",
"                                                                                   row['ratingscurrentversion_last']), axis = 1)\n",
"    return df_2015, df_2017\n",
"\n",
"def get_revenue_model(row):\n",
"    \"\"\" Returns whether an app is freemium or paid\n",
"    \"\"\"\n",
"    if row['price_first'] == 0:\n",
"        return 'Freemium'\n",
"    else:\n",
"        return 'Paid'\n",
"\n",
"def get_subcategory(row):\n",
"    \"\"\" Get subcategory based on how many keywords are present in the description\n",
"    \"\"\"\n",
"    slot_games = ['casino', 'slots', 'slot']\n",
"    driving = ['race', 'drive', 'car', 
'driving', 'parking', 'park', 'racing']\n",
"    puzzle = ['puzzle', 'puzzles', 'puzzling']\n",
"    adventure = ['adventure', 'jump', 'platformer']\n",
"    shooter = ['shoot', 'gun', 'pistol', 'sniper', 'war', 'vehicle']\n",
"    # Note: multi-word terms will not match the single-word Counter built below\n",
"    matching = ['match 3', 'match three', 'match four', 'clues']\n",
"\n",
"    subcategories = {'slot_games': slot_games, 'driving': driving, 'puzzle': puzzle, 'adventure': adventure, 'shooter': shooter,\n",
"                     'matching': matching}\n",
"    count_categories = {'slot_games': 0, 'driving': 0, 'puzzle': 0, 'adventure': 0, 'shooter': 0, 'matching': 0}\n",
"\n",
"    description = row['description_first']\n",
"    description = re.sub('[^a-zA-Z]', ' ', description)  # only keep letters\n",
"    description = description.replace(\"\\\\\", \"\").lower()  # remove backslashes and lower the text\n",
"    description = ' '.join(description.split())  # remove extra spaces\n",
"    description = description.split(' ')  # create a list with words\n",
"    description = Counter(description)  # count how often a word occurs\n",
"\n",
"    # Count how many times a certain keyword in one of the categories is seen in a description\n",
"    for category in subcategories:\n",
"        for word in subcategories[category]:\n",
"            if word in description.keys():\n",
"                count_categories[category] += description[word]\n",
"\n",
"    # The category with the most words is returned; ties and all-zero counts return 'Other'\n",
"    if Counter(count_categories).most_common(1)[0][1] == 0:\n",
"        return 'Other'\n",
"    elif Counter(count_categories).most_common(1)[0][1] == Counter(count_categories).most_common(2)[1][1]:\n",
"        return 'Other'\n",
"    else:\n",
"        return Counter(count_categories).most_common(1)[0][0]\n",
"\n",
"def create_subcategory(row):\n",
"    \"\"\" Get subcategory based on keywords being present in the description\n",
"    \"\"\"\n",
"    slot_games = ['casino', 'slots', 'slot']\n",
"    driving = ['race', 'drive', 'car', 'driving', 'parking', 'park', 'racing']\n",
"    puzzle = ['puzzle', 'puzzles', 'puzzling', 'match 3', 'match three', 'match four', 
'clues']\n",
"    adventure = ['adventure', 'jump', 'platformer']\n",
"    shooter = ['shoot', 'gun', 'pistol', 'sniper', 'war', 'vehicle']\n",
"\n",
"    description = row['description_first']\n",
"    description = re.sub('[^a-zA-Z]', ' ', description)  # only keep letters\n",
"    description = description.replace(\"\\\\\", \"\").lower()  # remove backslashes and lower the text\n",
"    description = ' '.join(description.split())  # remove extra spaces\n",
"    description = description.split(' ')  # create a list with words\n",
"\n",
"    categories = {'slot_game': slot_games, 'driving': driving, 'puzzle': puzzle, 'adventure': adventure, 'shooter': shooter}\n",
"\n",
"    for name, category in categories.items():\n",
"        counter = 0\n",
"        for word in category:\n",
"            if (word in description) and (counter == 0):\n",
"                row[name] = 1\n",
"                counter += 1\n",
"                break\n",
"            else:\n",
"                continue\n",
"        if counter == 0:\n",
"            row[name] = 0\n",
"    return row\n",
"\n",
"def get_ios_version(row):\n",
"    \"\"\" Return the major version of the lowest iOS version the app supports\n",
"    \"\"\"\n",
"    return row['compatibility_first'].split('iOS')[1].strip().split(' ')[0].strip().split('.')[0]\n",
"\n",
"def get_content_rating(row):\n",
"    \"\"\" Return the minimum age for a game\n",
"    \"\"\"\n",
"    rating = row['content_rating_first'].split('+')[0]\n",
"\n",
"    try:\n",
"        rating = int(rating)\n",
"        return rating\n",
"    except:\n",
"        rating = re.sub('[^0-9]', '', row['content_rating_first'])\n",
"\n",
"    try:\n",
"        if int(rating) > 20:\n",
"            print(row['content_rating_first'])\n",
"        return int(rating)\n",
"    except:\n",
"        return None\n",
"\n",
"def create_variables(df_2015, df_2017):\n",
"    # Create early vs. 
late mover columns\n",
"    df_2015['mover'] = 'early'\n",
"    df_2017['mover'] = 'late'\n",
"\n",
"    # Get revenue model\n",
"    df_2015['revenue'] = df_2015.apply(lambda row: get_revenue_model(row), axis = 1)\n",
"    df_2017['revenue'] = df_2017.apply(lambda row: get_revenue_model(row), axis = 1)\n",
"\n",
"    # Get the subcategory\n",
"    df_2015 = df_2015.apply(lambda row: create_subcategory(row), axis = 1)\n",
"    df_2017 = df_2017.apply(lambda row: create_subcategory(row), axis = 1)\n",
"\n",
"    # Get optimized for ipod touch\n",
"    df_2015['optimized_ipod_touch'] = df_2015.apply(lambda row: 1 if 'ipod touch' in row['compatibility_first'].lower()\n",
"                                                    else 0, axis = 1)\n",
"    df_2017['optimized_ipod_touch'] = df_2017.apply(lambda row: 1 if 'ipod touch' in row['compatibility_first'].lower()\n",
"                                                    else 0, axis = 1)\n",
"\n",
"    # Get the lowest version of iOS for which the app will work\n",
"    df_2015['ios_version'] = df_2015.apply(lambda row: get_ios_version(row), axis = 1)\n",
"    df_2017['ios_version'] = df_2017.apply(lambda row: get_ios_version(row), axis = 1)\n",
"\n",
"    # Content rating\n",
"    df_2015['content_rating'] = df_2015.apply(lambda row: get_content_rating(row), axis = 1)\n",
"    df_2017['content_rating'] = df_2017.apply(lambda row: get_content_rating(row), axis = 1)\n",
"\n",
"    return df_2015, df_2017\n",
"\n",
"def show_correlation_matrix(df):\n",
"    sns.set(style=\"white\")\n",
"    corr = df.corr()\n",
"\n",
"    # Generate a mask for the upper triangle\n",
"    mask = np.zeros_like(corr, dtype=bool)  # np.bool is deprecated; use the builtin bool\n",
"    mask[np.triu_indices_from(mask)] = True\n",
"\n",
"    # Set up the matplotlib figure\n",
"    f, ax = plt.subplots(figsize=(11, 9))\n",
"\n",
"    # Generate a custom diverging colormap\n",
"    cmap = sns.diverging_palette(220, 10, as_cmap=True)\n",
"\n",
"    # Draw the heatmap with the mask and correct aspect ratio\n",
"    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,\n",
"                square=True, linewidths=.5, cbar_kws={\"shrink\": .5}, annot=True, fmt=\".2f\")\n",
"\n",
"def get_innovation(row):\n",
"    # Technological Innovation (TI)\n",
"    innovation = ['gyroscope', 'accelerometer', 'vr', 'ar', 'a.r', 'vr-', 'iamcardboard',\n",
"                  'fibrum', 'homido', 'zeiss', 'beenoculus', 'colorcross', 'airvr', 'gyrometer', 'prodji', 'advanceddji',\n",
"                  'prodroneprix', 'onelick', 'vrarchos', 'vrdive', 'vrfreefly', 'gamepad', 'bluetooth']\n",
"\n",
"    description = row['description_first']\n",
"    description = re.sub('[^a-zA-Z]', ' ', description)  # only keep letters\n",
"    description = description.replace(\"\\\\\", \" \").lower()  # remove backslashes and lower the text\n",
"    description = ' '.join(description.split())  # remove extra spaces\n",
"    description = description.split(' ')  # create a list with words\n",
"\n",
"    row['innovation'] = 0\n",
"    for word in innovation:\n",
"        for word_2 in description:\n",
"            if word == word_2:\n",
"                row['innovation'] = 1\n",
"\n",
"    description = re.sub('[^a-zA-Z]', ' ', row['description_first']).lower().replace('\\\\', '')\n",
"    terms = ['augmented reality', 'virtual reality', 'motion control', 'tilt the device', 'tilting the device',\n",
"             'google cardboard', 'vr-', 'facing camera', 'tilt your device', 'camera lens', 'tilt your head',\n",
"             'gyro sensor', 'game pad', 'rotate lens', 'wear your glasses']\n",
"    for term in terms:\n",
"        if term in description:\n",
"            row['innovation'] = 1\n",
"    if ('vr-' in row['description_first'].lower().replace('\\\\', '')):\n",
"        row['innovation'] = 1\n",
"    if ('ar-' in row['description_first'].lower().replace('\\\\', '')):\n",
"        row['innovation'] = 1\n",
"\n",
"    return row" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Data Preparation \n",
"[Back to Table of Contents](#table)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset app_details contains information on all apps in the Apple AppStore over the last 3 years. 
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.read_csv('Data set/app_details.csv', low_memory=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.1 Time Slots \n",
"[Back to Table of Contents](#table)\n",
"\n",
"I want to see the difference in performance between early and late movers five months after release. Therefore, four timeslots need to be created: \n",
"- Apps first seen in 2015 (i.e., first_seen_2015)\n",
"- Apps 5 months after first seen in 2015 (i.e., last_seen_2015)\n",
"- Apps first seen in 2017 (i.e., first_seen_2017)\n",
"- Apps 5 months after first seen in 2017 (i.e., last_seen_2017)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Moreover, it is important that the same apps appear in first_seen_2015 and last_seen_2015; the same holds for 2017. The datasets are then combined into df_2015 and df_2017, where each dataset contains information about an app when it was first seen in 2015/2017 and 5 months later. 
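" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of this alignment-and-join step (on hypothetical toy data, not the actual AppStore dataset), keeping only the shared ids and then inner-joining on `id` works as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch on made-up toy data; the real pipeline uses the functions defined above\n",
"import pandas as pd\n",
"\n",
"toy_first = pd.DataFrame({'id': [1, 2, 3], 'price': [0.0, 0.99, 1.99]})\n",
"toy_last = pd.DataFrame({'id': [2, 3, 4], 'price': [0.99, 0.0, 2.99]})\n",
"\n",
"# Keep only ids present in both snapshots (cf. create_4_equal_timeslots)\n",
"toy_first = toy_first[toy_first['id'].isin(toy_last['id'])]\n",
"toy_last = toy_last[toy_last['id'].isin(toy_first['id'])]\n",
"\n",
"# Suffix the columns and inner-join on id (cf. join_first_last)\n",
"toy_first.columns = [c + '_first' if c != 'id' else 'id' for c in toy_first.columns]\n",
"toy_last.columns = [c + '_last' if c != 'id' else 'id' for c in toy_last.columns]\n",
"toy_df = pd.merge(toy_first, toy_last, on='id')  # only ids 2 and 3 remain\n",
"toy_df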
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of records in First Seen 2015: \t10547\n",
"Number of records in Last Seen 2017: \t12958\n",
"Number of records in First Seen 2017: \t12958\n",
"Number of records in Last Seen 2015: \t10547\n" ] } ], "source": [ "first_seen_2015, last_seen_2015, first_seen_2017, last_seen_2017 = create_4_timeslots(df)\n",
"first_seen_2015, last_seen_2015, first_seen_2017, last_seen_2017 = create_4_equal_timeslots(first_seen_2015, last_seen_2015,\n",
"                                                                                            first_seen_2017, last_seen_2017)\n",
"df_2015, df_2017 = join_first_last(first_seen_2015, last_seen_2015, first_seen_2017, last_seen_2017)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.2 EDA \n",
"[Back to Table of Contents](#table)\n",
"\n",
"Although I highly recommend doing extensive exploratory data analysis before moving on to modelling, that would make this notebook far too long, so I will show only a few key aspects. \n",
"First, I check how many apps were optimized for certain devices based on their release year. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Apps in 2015 that are optimized for the following devices (based on their release): \n",
"\n",
"              iphone: \t 10283 out of 10547\n",
"                ipad: \t 10546 out of 10547\n",
"          ipod touch: \t 10272 out of 10547\n",
"\n",
"Apps in 2017 that are optimized for the following devices (based on their release): \n",
"\n",
"              iphone: \t 12897 out of 12958\n",
"                ipad: \t 12957 out of 12958\n",
"          ipod touch: \t 6566 out of 12958\n",
"\n" ] } ], "source": [ "optimized_for(df_2015, df_2017, devices = ['iphone', 'ipad', 'ipod touch'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, I check how many apps changed certain characteristics between their first release and 5 months later. 
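" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a side note: the row-wise apply in create_change_columns can equivalently be written in vectorized form, which is usually much faster on frames of this size. A minimal sketch on made-up toy data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch (hypothetical toy data): vectorized change flag\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({'price_first': [0.0, 0.99, 1.99], 'price_last': [0.0, 1.99, 1.99]})\n",
"toy['change_price'] = (toy['price_first'] != toy['price_last']).astype(int)\n",
"toy['change_price'].tolist()  # [0, 1, 0]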
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Changes in column between firstseen and lastseen of 2015:\n", "\n", " price: \t376 of 10547\n", " screenshots: \t201 of 10547\n", " content_rating: \t150 of 10547\n", " compatibility: \t10137 of 10547\n", " size: \t1906 of 10547\n", " quan_language: \t252 of 10547\n", " appversion: \t2442 of 10547\n", " title: \t534 of 10547\n", "\n", "Changes in column between firstseen and lastseen of 2017:\n", "\n", " price: \t276 of 12958\n", " screenshots: \t148 of 12958\n", " content_rating: \t164 of 12958\n", " compatibility: \t6663 of 12958\n", " size: \t1592 of 12958\n", " quan_language: \t116 of 12958\n", " appversion: \t1956 of 12958\n", " title: \t442 of 12958\n", "\n" ] } ], "source": [ "df_2015, df_2017 = create_change_columns(df_2015, df_2017)\n", "show_changes(df_2015, df_2017)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, I checked for common words in the description of apps in order to get a feeling for which words might represent certain categories. Using those common words, initial categories were constructed within the category \"gaming\". " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>2015_word</th>\n",
"      <th>2015_count</th>\n",
"      <th>2017_word</th>\n",
"      <th>2017_count</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>play</td><td>9478</td><td>play</td><td>10419</td></tr>\n",
"    <tr><th>1</th><td>fun</td><td>5750</td><td>games</td><td>7594</td></tr>\n",
"    <tr><th>2</th><td>features</td><td>4192</td><td>fun</td><td>6914</td></tr>\n",
"    <tr><th>3</th><td>free</td><td>4084</td><td>free</td><td>6241</td></tr>\n",
"    <tr><th>4</th><td>time</td><td>3596</td><td>time</td><td>5306</td></tr>\n",
"    <tr><th>5</th><td>games</td><td>3119</td><td>features</td><td>5032</td></tr>\n",
"    <tr><th>6</th><td>new</td><td>3108</td><td>new</td><td>5024</td></tr>\n",
"  </tbody>\n",
"</table>