{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Profitable App genres - iOS App Store and Google Play Store\n",
    "\n",
    "This data analysis project looks at the apps listed in the App Store and Google Play markets and profiles the free apps in each marketplace.\n",
    "\n",
    "**Note:** We are only interested in English language apps in this project.\n",
    "\n",
    "## Goal:\n",
    "Through this project, our aim is to:\n",
    "\n",
    "1. Understand the free apps listed in the App Store and Google Play markets based on their actual usage statistics and user ratings\n",
    "2. From the analysis, come up with one app profile best suited for development as a free app in both marketplaces, one that maximises in-app ad revenue"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Dataset:\n",
    "\n",
    "For this project we are going to use the following datasets:\n",
    "\n",
    "1. A [dataset](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).\n",
    "2. A [dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preliminary Analysis:\n",
    "\n",
    "Let's now explore the datasets to understand them in a bit more detail.\n",
    "\n",
    "First, we will create three reusable functions: one to read a dataset (iOS or Android) from CSV, one to print sample rows for our initial analysis, and one to flag rows with missing column values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Function to read a CSV file, and return the contents as a list of lists\n",
    "def csv_reader(file_name_with_path):\n",
    "    from csv import reader\n",
    "    # The with block closes the file after reading; this assumes the CSV files are UTF-8 encoded\n",
    "    with open(file_name_with_path, encoding='utf8', newline='') as open_file:\n",
    "        dataset = list(reader(open_file))\n",
    "    return dataset\n",
    "\n",
    "# Function to read the dataset (list of lists) and print data range as passed\n",
    "def explore_data(dataset, start, end, rows_and_columns=False):\n",
    "    dataset_slice = dataset[start:end]\n",
    "    for row in dataset_slice:\n",
    "        print(row)\n",
    "        print('\\n') # adds blank lines after each row\n",
    "\n",
    "    # If row and column statistics are required (passed as parameter)\n",
    "    if rows_and_columns:\n",
    "        print('Number of rows:', len(dataset))\n",
    "        print('Number of columns:', len(dataset[0]))\n",
    "\n",
    "# Function to check that every row has as many columns as the header - dataset is to be passed with the header row\n",
    "def print_missing_column_values(dataset):\n",
    "    header_length = len(dataset[0])\n",
    "    # enumerate gives the true position of each row, even if an identical row appears earlier\n",
    "    for index, row in enumerate(dataset[1:], start=1):\n",
    "        if len(row) != header_length:\n",
    "            print('Index = ', index)\n",
    "            print('Data row = ', row)\n",
    "            print('\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at a few rows from both the datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "=====Apple Store=====\n",
      "\n",
      "['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']\n",
      "\n",
      "\n",
      "['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']\n",
      "\n",
      "\n",
      "['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']\n",
      "\n",
      "\n",
      "['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']\n",
      "\n",
      "\n",
      "['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']\n",
      "\n",
      "\n",
      "Number of rows: 7197\n",
      "Number of columns: 16\n",
      "\n",
      "\n",
      "=====Google Play Store=====\n",
      "\n",
      "['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']\n",
      "\n",
      "\n",
      "['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']\n",
      "\n",
      "\n",
      "['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']\n",
      "\n",
      "\n",
      "['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']\n",
      "\n",
      "\n",
      "['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']\n",
      "\n",
      "\n",
      "Number of rows: 10841\n",
      "Number of columns: 13\n"
     ]
    }
   ],
   "source": [
    "print('='*5+'Apple Store'+'='*5+'\\n')\n",
    "apple = csv_reader('AppleStore.csv')\n",
    "explore_data(apple[1:],0,5,True)\n",
    "print('\\n')\n",
    "print('='*5+'Google Play Store'+'='*5+'\\n')\n",
    "android = csv_reader('googleplaystore.csv')\n",
    "explore_data(android[1:],0,5,True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have seen some sample data along with the number of rows and columns in each dataset, let's look at the column names and identify the ones that might be useful for our analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "=====Apple Store=====\n",
      "\n",
      "['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']\n",
      "\n",
      "\n",
      "=====Google Play Store=====\n",
      "\n",
      "['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']\n"
     ]
    }
   ],
   "source": [
    "print('='*5+'Apple Store'+'='*5+'\\n')\n",
    "print(apple[0])\n",
    "print('\\n')\n",
    "print('='*5+'Google Play Store'+'='*5+'\\n')\n",
    "print(android[0])\n",
    "\n",
    "apple_header = apple[0]\n",
    "android_header = android[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the iOS dataset, a few columns that could be useful are:\n",
    "\n",
    "- *track_name* - Name of the app\n",
    "- *price* - Price of the app that might help us determine free vs paid apps\n",
    "- *prime_genre* - Genre classification of the app for our profiling\n",
    "- *cont_rating* - Content rating or Age group relevance\n",
    "- *user_rating* - Overall user rating of the app\n",
    "- *rating_count_tot* - Total number of users that reviewed/rated the app\n",
    "\n",
    "For full details, refer to the dataset [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).\n",
    "\n",
    "In the Android dataset, a few columns that could be useful are:\n",
    "\n",
    "- *App* - Name of the app\n",
    "- *Price* - Price of the app that might help us determine free vs paid apps\n",
    "- *Genres* - Genre classification of the app for our profiling\n",
    "- *Category* - Another classification for the app that might aid our profiling\n",
    "- *Content Rating* - Content rating or Age group relevance\n",
    "- *Rating* - Overall user rating of the app\n",
    "- *Reviews* - Total number of users that reviewed/rated the app\n",
    "- *Installs* - Total number of users who have installed the app\n",
    "\n",
    "For full details, refer to the dataset [documentation](https://www.kaggle.com/lava18/google-play-store-apps)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data cleansing\n",
    "\n",
    "We will now check whether the data needs any cleansing before we start profiling and analysis.\n",
    "\n",
    "### Missing column values\n",
    "\n",
    "We will use the **print_missing_column_values** function we created to check both datasets for apps with missing information, printing those rows so that we can decide how to handle them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "=====Google Play Store=====\n",
      "\n",
      "Index =  10473\n",
      "Data row =  ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']\n",
      "\n",
      "\n",
      "=====Apple App Store=====\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print('='*5+'Google Play Store'+'='*5+'\\n')\n",
    "print_missing_column_values(android)\n",
    "print('='*5+'Apple App Store'+'='*5+'\\n')\n",
    "print_missing_column_values(apple)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the output above, we can see that one app in the Google Play dataset has a data point missing. Comparing it against the sample rows for Google Play apps, it looks like this app is missing a value for the **Category** column.\n",
    "\n",
    "Checking this app in the [Google Play Store](https://play.google.com/store/apps/details?id=com.lifemade.internetPhotoframe) reveals that it is categorised as ***Lifestyle***.\n",
    "\n",
    "We will correct this by inserting the missing category at index 1 of the row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Life Made WI-Fi Touchscreen Photo Frame', 'LIFESTYLE', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']\n"
     ]
    }
   ],
   "source": [
    "android[10473].insert(1,'LIFESTYLE')\n",
    "print(android[10473])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Duplicate entries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As per the discussion about the Google Play dataset, it suffers from a lot of duplicates. However, the App Store dataset does not have any duplicates.\n",
    "\n",
    "Let's now see how many duplicate apps we have in the Google Play dataset, and look at some of them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total number of apps in the dataset:  10841\n",
      "Number of duplicate apps:  1181\n",
      "\n",
      "\n",
      "=====Few duplicate apps=====\n",
      "\n",
      "['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']\n"
     ]
    }
   ],
   "source": [
    "duplicate_apps = []\n",
    "unique_apps = []\n",
    "for app in android[1:]:\n",
    "    name = app[0]\n",
    "    if name in unique_apps:\n",
    "        duplicate_apps.append(name)\n",
    "    else:\n",
    "        unique_apps.append(name)\n",
    "print('Total number of apps in the dataset: ',len(android[1:]))\n",
    "print('Number of duplicate apps: ', len(duplicate_apps))\n",
    "print('\\n')\n",
    "print('='*5+'Few duplicate apps'+'='*5+'\\n')\n",
    "print(duplicate_apps[:15])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that **1181** entries are duplicates.\n",
    "\n",
    "Rather than removing the duplicates at random, we will use the ***Reviews*** column, on the basis that the higher the total number of reviews, the more recent the app entry.\n",
    "\n",
    "The first step is to build a dictionary from the Android dataset that maps each app name to the maximum review count found for that app."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Expected rows after cleanup:  9660\n"
     ]
    }
   ],
   "source": [
    "reviews_max = {}\n",
    "for row in android[1:]:\n",
    "    name = row[0]\n",
    "    n_reviews = float(row[3]) #total number of reviews\n",
    "    if name in reviews_max and reviews_max[name] < n_reviews:\n",
    "        reviews_max[name] = n_reviews\n",
    "    elif name not in reviews_max:\n",
    "        reviews_max[name] = n_reviews\n",
    "print('Expected rows after cleanup: ',len(reviews_max))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have each app and its maximum review count, we use the above dictionary to build our dataset of unique apps. After cleanup we should have **9660** unique rows.\n",
    "\n",
    "- We start by creating two empty lists, **android_clean** and **already_added**\n",
    "- We loop through our Android dataset (ignoring the header row) and, for each row, we add the row to the **android_clean** list and the app name to the **already_added** list if:\n",
    "    - The review count matches the maximum review count in the dictionary for that app\n",
    "    - The app name is not already in the **already_added** list\n",
    "\n",
    "**Note**: We need to check the **already_added** list to make sure that we add the app only once when several duplicates share the same maximum number of reviews."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "9660\n",
      "[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]\n"
     ]
    }
   ],
   "source": [
    "android_clean = []\n",
    "already_added = []\n",
    "for row in android[1:]:\n",
    "    name = row[0]\n",
    "    n_reviews = float(row[3]) #total number of reviews\n",
    "    if (reviews_max[name] == n_reviews) and (name not in already_added):\n",
    "        android_clean.append(row)\n",
    "        already_added.append(name)\n",
    "print(len(android_clean))\n",
    "print(android_clean[0:3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Non-English apps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this project our aim is only to analyse and profile English language apps, so we need to identify any non-English apps and remove them from our datasets.\n",
    "\n",
    "First, we will write a function **is_english** that does the following:\n",
    "- Takes a string as input\n",
    "- Checks each character using the built-in function **ord** to get its Unicode code point, and sees whether it falls outside the ASCII range that covers English text (0-127)\n",
    "- If we find more than three characters in a string outside our range then we return False (Not English)\n",
    "- If not, we return True (English)\n",
    "\n",
    "**Note**: To avoid the mistake of removing some apps with smileys and other special characters (e.g. 'Instachat 😜' or 'Docs To Go™ Free Office Suite'), we only treat a name as non-English if it has more than 3 characters outside this range."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "False\n",
      "True\n",
      "True\n"
     ]
    }
   ],
   "source": [
    "def is_english(a_string):\n",
    "    no_of_chars = 0\n",
    "    for char in a_string:\n",
    "        if ord(char) > 127:\n",
    "            no_of_chars += 1\n",
    "        \n",
    "        if no_of_chars > 3:\n",
    "            return False\n",
    "    return True\n",
    "\n",
    "print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))\n",
    "print(is_english('Instachat 😜'))\n",
    "print(is_english('Docs To Go™ Free Office Suite'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we use this function in a loop over our Apple and Android datasets to build our datasets of English-only apps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "English only android apps:  9615\n",
      "Non-English android apps:  45\n",
      "% of English apps in Android:  99.53\n",
      "\n",
      "\n",
      "English only Apple apps:  6183\n",
      "Non-English apple apps:  1014\n",
      "% of English apps in Apple:  85.91\n"
     ]
    }
   ],
   "source": [
    "apple_english_only = []\n",
    "android_english_only = []\n",
    "for app in android_clean:\n",
    "    name = app[0]\n",
    "    if is_english(name):\n",
    "        android_english_only.append(app)\n",
    "print('English only android apps: ',len(android_english_only))\n",
    "print('Non-English android apps: ',len(android_clean) - len(android_english_only))\n",
    "print('% of English apps in Android: ',round((len(android_english_only)/len(android_clean))*100,2))\n",
    "print('\\n')\n",
    "for app in apple[1:]:\n",
    "    name = app[1]\n",
    "    if is_english(name):\n",
    "        apple_english_only.append(app)\n",
    "print('English only Apple apps: ',len(apple_english_only))\n",
    "print('Non-English apple apps: ',len(apple[1:]) - len(apple_english_only))\n",
    "print('% of English apps in Apple: ',round((len(apple_english_only)/len(apple[1:]))*100,2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems that while the Android dataset has a lot of duplicate apps, the Apple dataset has far more non-English apps than the Android one (1014 against 45)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Free vs paid apps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As mentioned before, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Hence we need to isolate the free apps from the non-free apps in both datasets.\n",
    "\n",
    "This will be the last step in our data cleaning process.\n",
    "\n",
    "**Note**:\n",
    "- In the Apple dataset, we can convert the column **price** (index: 4) directly to float, as its values are plain numbers (e.g. '0.0') without currency symbols\n",
    "- In the Android dataset, we can rely on the column **Type** (index: 6) being 'Free' to determine whether an app is free or non-free"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total free apps in Apple Store:  3222\n",
      "Total free apps in Android Store:  8864\n"
     ]
    }
   ],
   "source": [
    "apple_final = []\n",
    "for app in apple_english_only:\n",
    "    price = float(app[4])\n",
    "    if price == 0:\n",
    "        apple_final.append(app)\n",
    "print('Total free apps in Apple Store: ',len(apple_final))\n",
    "\n",
    "android_final = []\n",
    "for app in android_english_only:\n",
    "    free = app[6]\n",
    "    if free == 'Free':\n",
    "        android_final.append(app)\n",
    "print('Total free apps in Android Store: ',len(android_final))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After going through a series of data cleaning steps, we are left with **3222** apps in the Apple dataset and **8864** apps in the Android dataset, which we will use for the rest of our profiling and analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Analysis\n",
    "\n",
    "As mentioned at the start, our goal is to determine the kinds of apps that are likely to attract more users, because the number of people using our apps affects our revenue.\n",
    "\n",
    "To minimise risks and overhead, we plan to develop the same app for both iOS and Android. The idea is to launch on the Android market first, develop the app further based on the response from users, and build an iOS version if it proves profitable.\n",
    "\n",
    "For this reason, we need to analyse and determine the app profiles/genres that could attract more users on both iOS and Android.\n",
    "\n",
    "In the Apple dataset, we have a clear column called <mark>prime_genre</mark> that can aid our analysis. However, in the Android dataset, we have two columns <mark>Category</mark> and <mark>Genres</mark>."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we start profiling apps based on each dataset's genre columns, we will build two functions that we can reuse for both datasets to build and display frequency tables.\n",
    "\n",
    "1. **freq_table:** This function builds a frequency table from a dataset (list of lists) and the index number of the column of interest. It returns the frequencies as percentages, rounded to two decimal places.\n",
    "\n",
    "2. **display_table:** This function takes the frequency table returned by the function above, sorts it by percentage in descending order, and displays the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Function to build a frequency table\n",
    "# Takes:\n",
    "# A dataset (a list of lists) and \n",
    "# Index number for the column we are building the frequency table\n",
    "def freq_table(dataset, index):\n",
    "    result = {}\n",
    "    for row in dataset:\n",
    "        key = row[index]\n",
    "        if key in result:\n",
    "            result[key] += 1\n",
    "        else:\n",
    "            result[key] = 1\n",
    "            \n",
    "    # Make the frequency table as a percentage\n",
    "    total_apps = len(dataset)\n",
    "    for key in result:\n",
    "        result[key] /= total_apps\n",
    "        result[key] *= 100\n",
    "        result[key] = round(result[key] ,2)\n",
    "    return result\n",
    "\n",
    "\n",
    "def display_table(freq_tbl):\n",
    "    table = freq_tbl\n",
    "    table_display = []\n",
    "    for key in table:\n",
    "        key_val_as_tuple = (table[key], key)\n",
    "        table_display.append(key_val_as_tuple)\n",
    "\n",
    "    table_sorted = sorted(table_display, reverse = True)\n",
    "    for entry in table_sorted:\n",
    "        print(entry[1], ':', entry[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Apple app store\n",
    "\n",
    "Now we will use the function ***display_table*** against the Apple app store dataset to see the results based on the ***prime_genre*** column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "=====Apple App Store - By Prime_Genre=====\n",
      "\n",
      "Games : 58.16\n",
      "Entertainment : 7.88\n",
      "Photo & Video : 4.97\n",
      "Education : 3.66\n",
      "Social Networking : 3.29\n",
      "Shopping : 2.61\n",
      "Utilities : 2.51\n",
      "Sports : 2.14\n",
      "Music : 2.05\n",
      "Health & Fitness : 2.02\n",
      "Productivity : 1.74\n",
      "Lifestyle : 1.58\n",
      "News : 1.33\n",
      "Travel : 1.24\n",
      "Finance : 1.12\n",
      "Weather : 0.87\n",
      "Food & Drink : 0.81\n",
      "Reference : 0.56\n",
      "Business : 0.53\n",
      "Book : 0.43\n",
      "Navigation : 0.19\n",
      "Medical : 0.19\n",
      "Catalogs : 0.12\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print('='*5+'Apple App Store - By Prime_Genre'+'='*5+'\\n')\n",
    "freq_tbl = freq_table(apple_final,11) #prime_genre\n",
    "display_table(freq_tbl)\n",
    "print('\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Purely from the genre perspective, we can see the following pattern based on the number of apps:\n",
    "\n",
    "- *Games* make up by far the largest share of the free English apps (about 58%)\n",
    "- Overall, the Apple App Store seems to have a higher proportion of apps for entertainment purposes (games, photo and video, social networking, sports, music) than for practical purposes (education, shopping, utilities, productivity, lifestyle)\n",
    "\n",
    "Even though the proportions by number of apps present the above picture, the same might not be true with regard to the number of users/reviews.\n",
    "\n",
    "Let's now look at the Android dataset using the columns <mark>Category</mark> and <mark>Genres</mark>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "=====Google Play Store - By Category =====\n",
      "\n",
      "FAMILY : 18.9\n",
      "GAME : 9.72\n",
      "TOOLS : 8.46\n",
      "BUSINESS : 4.59\n",
      "LIFESTYLE : 3.91\n",
      "PRODUCTIVITY : 3.89\n",
      "FINANCE : 3.7\n",
      "MEDICAL : 3.53\n",
      "SPORTS : 3.4\n",
      "PERSONALIZATION : 3.32\n",
      "COMMUNICATION : 3.24\n",
      "HEALTH_AND_FITNESS : 3.08\n",
      "PHOTOGRAPHY : 2.94\n",
      "NEWS_AND_MAGAZINES : 2.8\n",
      "SOCIAL : 2.66\n",
      "TRAVEL_AND_LOCAL : 2.34\n",
      "SHOPPING : 2.25\n",
      "BOOKS_AND_REFERENCE : 2.14\n",
      "DATING : 1.86\n",
      "VIDEO_PLAYERS : 1.79\n",
      "MAPS_AND_NAVIGATION : 1.4\n",
      "FOOD_AND_DRINK : 1.24\n",
      "EDUCATION : 1.16\n",
      "ENTERTAINMENT : 0.96\n",
      "LIBRARIES_AND_DEMO : 0.94\n",
      "AUTO_AND_VEHICLES : 0.93\n",
      "HOUSE_AND_HOME : 0.82\n",
      "WEATHER : 0.8\n",
      "EVENTS : 0.71\n",
      "PARENTING : 0.65\n",
      "ART_AND_DESIGN : 0.64\n",
      "COMICS : 0.62\n",
      "BEAUTY : 0.6\n",
      "\n",
      "\n",
      "=====Google Play Store - By Genre =====\n",
      "\n",
      "Tools : 8.45\n",
      "Entertainment : 6.07\n",
      "Education : 5.35\n",
      "Business : 4.59\n",
      "Productivity : 3.89\n",
      "Lifestyle : 3.89\n",
      "Finance : 3.7\n",
      "Medical : 3.53\n",
      "Sports : 3.46\n",
      "Personalization : 3.32\n",
      "Communication : 3.24\n",
      "Action : 3.1\n",
      "Health & Fitness : 3.08\n",
      "Photography : 2.94\n",
      "News & Magazines : 2.8\n",
      "Social : 2.66\n",
      "Travel & Local : 2.32\n",
      "Shopping : 2.25\n",
      "Books & Reference : 2.14\n",
      "Simulation : 2.04\n",
      "Dating : 1.86\n",
      "Arcade : 1.85\n",
      "Video Players & Editors : 1.77\n",
      "Casual : 1.76\n",
      "Maps & Navigation : 1.4\n",
      "Food & Drink : 1.24\n",
      "Puzzle : 1.13\n",
      "Racing : 0.99\n",
      "Role Playing : 0.94\n",
      "Libraries & Demo : 0.94\n",
      "Auto & Vehicles : 0.93\n",
      "Strategy : 0.9\n",
      "House & Home : 0.82\n",
      "Weather : 0.8\n",
      "Events : 0.71\n",
      "Adventure : 0.68\n",
      "Comics : 0.61\n",
      "Beauty : 0.6\n",
      "Art & Design : 0.6\n",
      "Parenting : 0.5\n",
      "Card : 0.45\n",
      "Casino : 0.43\n",
      "Trivia : 0.42\n",
      "Educational;Education : 0.39\n",
      "Board : 0.38\n",
      "Educational : 0.37\n",
      "Education;Education : 0.34\n",
      "Word : 0.26\n",
      "Casual;Pretend Play : 0.24\n",
      "Music : 0.2\n",
      "Racing;Action & Adventure : 0.17\n",
      "Puzzle;Brain Games : 0.17\n",
      "Entertainment;Music & Video : 0.17\n",
      "Casual;Brain Games : 0.14\n",
      "Casual;Action & Adventure : 0.14\n",
      "Arcade;Action & Adventure : 0.12\n",
      "Action;Action & Adventure : 0.1\n",
      "Educational;Pretend Play : 0.09\n",
      "Simulation;Action & Adventure : 0.08\n",
      "Parenting;Education : 0.08\n",
      "Entertainment;Brain Games : 0.08\n",
      "Board;Brain Games : 0.08\n",
      "Parenting;Music & Video : 0.07\n",
      "Educational;Brain Games : 0.07\n",
      "Casual;Creativity : 0.07\n",
      "Art & Design;Creativity : 0.07\n",
      "Education;Pretend Play : 0.06\n",
      "Role Playing;Pretend Play : 0.05\n",
      "Education;Creativity : 0.05\n",
      "Role Playing;Action & Adventure : 0.03\n",
      "Puzzle;Action & Adventure : 0.03\n",
      "Entertainment;Creativity : 0.03\n",
      "Entertainment;Action & Adventure : 0.03\n",
      "Educational;Creativity : 0.03\n",
      "Educational;Action & Adventure : 0.03\n",
      "Education;Music & Video : 0.03\n",
      "Education;Brain Games : 0.03\n",
      "Education;Action & Adventure : 0.03\n",
      "Adventure;Action & Adventure : 0.03\n",
      "Video Players & Editors;Music & Video : 0.02\n",
      "Sports;Action & Adventure : 0.02\n",
      "Simulation;Pretend Play : 0.02\n",
      "Puzzle;Creativity : 0.02\n",
      "Music;Music & Video : 0.02\n",
      "Entertainment;Pretend Play : 0.02\n",
      "Casual;Education : 0.02\n",
      "Board;Action & Adventure : 0.02\n",
      "Video Players & Editors;Creativity : 0.01\n",
      "Trivia;Education : 0.01\n",
      "Travel & Local;Action & Adventure : 0.01\n",
      "Tools;Education : 0.01\n",
      "Strategy;Education : 0.01\n",
      "Strategy;Creativity : 0.01\n",
      "Strategy;Action & Adventure : 0.01\n",
      "Simulation;Education : 0.01\n",
      "Role Playing;Brain Games : 0.01\n",
      "Racing;Pretend Play : 0.01\n",
      "Puzzle;Education : 0.01\n",
      "Parenting;Brain Games : 0.01\n",
      "Music & Audio;Music & Video : 0.01\n",
      "Lifestyle;Pretend Play : 0.01\n",
      "Lifestyle;Education : 0.01\n",
      "Health & Fitness;Education : 0.01\n",
      "Health & Fitness;Action & Adventure : 0.01\n",
      "Entertainment;Education : 0.01\n",
      "Communication;Creativity : 0.01\n",
      "Comics;Creativity : 0.01\n",
      "Casual;Music & Video : 0.01\n",
      "Card;Action & Adventure : 0.01\n",
      "Books & Reference;Education : 0.01\n",
      "Art & Design;Pretend Play : 0.01\n",
      "Art & Design;Action & Adventure : 0.01\n",
      "Arcade;Pretend Play : 0.01\n",
      "Adventure;Education : 0.01\n",
      " : 0.01\n"
     ]
    }
   ],
   "source": [
    "print('='*5+'Google Play Store - By Category '+'='*5+'\\n')\n",
    "freq_tbl = freq_table(android_final,1) #Category\n",
    "display_table(freq_tbl)\n",
    "print('\\n')\n",
    "print('='*5+'Google Play Store - By Genre '+'='*5+'\\n')\n",
    "freq_tbl = freq_table(android_final,9) #Genre\n",
    "display_table(freq_tbl)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we look at the profile of apps, in Android dataset, we have both <mark>Category</mark> and <mark>Genre</mark> columns. And looking at the frequency table above, it is clear that the Genre column is more granular and seem to have a sub-category level information.\n",
    "\n",
    "Since at this point our analysis is going to involve high level categorisation, we will continue our analysis only with <mark>Category</mark> column for Android dataset.\n",
    "\n",
    "At first when we look at the frequency table based on category column we see that Android market seem to have a more balanced spread of apps across categories unlike Apple app store. Also, we have more apps that are for practical (such as productivity, finance, family, tools) than fun purposes.\n",
    "\n",
    "However, when we [look](https://play.google.com/store/apps/category/FAMILY?hl=en) at the apps in the **FAMILY** category (about 19%) for example, most of them are game apps for kids. Even then, we see apps more for practical purposes than fun unlike Apple.\n",
    "\n",
    "Up to this point, we found that the Apple app Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea about the kind of apps that have most users.\n",
    "\n",
    "One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the <mark>Installs</mark> column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the <mark>rating_count_tot</mark> app."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Most popular apps - App Store\n",
    "\n",
    "Let's start with calculating the average number of user ratings per app genre on the App Store."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Navigation : 86090\n",
      "Reference : 74942\n",
      "Social Networking : 71548\n",
      "Music : 57327\n",
      "Weather : 52280\n",
      "Book : 39758\n",
      "Food & Drink : 33334\n",
      "Finance : 31468\n",
      "Photo & Video : 28442\n",
      "Travel : 28244\n",
      "Shopping : 26920\n",
      "Health & Fitness : 23298\n",
      "Sports : 23009\n",
      "Games : 22789\n",
      "News : 21248\n",
      "Productivity : 21028\n",
      "Utilities : 18684\n",
      "Lifestyle : 16486\n",
      "Entertainment : 14030\n",
      "Business : 7491\n",
      "Education : 7004\n",
      "Catalogs : 4004\n",
      "Medical : 612\n"
     ]
    }
   ],
   "source": [
    "app_store_genres = freq_table(apple_final, 11)\n",
    "genre_avg_ratings = {}\n",
    "for genre in app_store_genres:\n",
    "    total = 0 #total apps in genre\n",
    "    len_genre = 0 #total rating count of all apps in genre\n",
    "    for app in apple_final:\n",
    "        genre_app = app[11]\n",
    "        if genre == genre_app:\n",
    "            app_usr_rating_tot = int(app[5])\n",
    "            len_genre += 1\n",
    "            total += app_usr_rating_tot\n",
    "    avg_genre = round(total/len_genre)\n",
    "    genre_avg_ratings[genre] = avg_genre\n",
    "\n",
    "display_table(genre_avg_ratings)       "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Even though games genre had larger proportion of apps, we see that Navigation and Reference genres have more average users.\n",
    "\n",
    "Before making any conclusions, let's dig a little deeper on the apps in these genre."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "***** Top 5 apps for Navigation *****\n",
      "Waze - GPS Navigation, Maps & Real-time Traffic : 345046 (66.8%)\n",
      "Google Maps - Navigation & Transit : 154911 (29.99%)\n",
      "Geocaching® : 12811 (2.48%)\n",
      "CoPilot GPS – Car Navigation & Offline Maps : 3582 (0.69%)\n",
      "ImmobilienScout24: Real Estate Search in Germany : 187 (0.04%)\n",
      "\n",
      "\n",
      "***** Top 5 apps for Reference *****\n",
      "Bible : 985920 (73.09%)\n",
      "Dictionary.com Dictionary & Thesaurus : 200047 (14.83%)\n",
      "Dictionary.com Dictionary & Thesaurus for iPad : 54175 (4.02%)\n",
      "Google Translate : 26786 (1.99%)\n",
      "Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418 (1.37%)\n",
      "\n",
      "\n",
      "***** Top 5 apps for Social Networking *****\n",
      "Facebook : 2974676 (39.22%)\n",
      "Pinterest : 1061624 (14.0%)\n",
      "Skype for iPhone : 373519 (4.93%)\n",
      "Messenger : 351466 (4.63%)\n",
      "Tumblr : 334293 (4.41%)\n",
      "\n",
      "\n",
      "***** Top 5 apps for Music *****\n",
      "Pandora - Music & Radio : 1126879 (29.78%)\n",
      "Spotify Music : 878563 (23.22%)\n",
      "Shazam - Discover music, artists, videos & lyrics : 402925 (10.65%)\n",
      "iHeartRadio – Free Music & Radio Stations : 293228 (7.75%)\n",
      "SoundCloud - Music & Audio : 135744 (3.59%)\n",
      "\n",
      "\n",
      "***** Top 5 apps for Weather *****\n",
      "The Weather Channel: Forecast, Radar & Alerts : 495626 (33.86%)\n",
      "The Weather Channel App for iPad – best local forecast, radar map, and storm tracking : 208648 (14.25%)\n",
      "WeatherBug - Local Weather, Radar, Maps, Alerts : 188583 (12.88%)\n",
      "MyRadar NOAA Weather Radar Forecast : 150158 (10.26%)\n",
      "AccuWeather - Weather for Life : 144214 (9.85%)\n",
      "\n",
      "\n",
      "***** Top 5 apps for Book *****\n",
      "Kindle – Read eBooks, Magazines & Textbooks : 252076 (45.29%)\n",
      "Audible – audio books, original series & podcasts : 105274 (18.91%)\n",
      "Color Therapy Adult Coloring Book for Adults : 84062 (15.1%)\n",
      "OverDrive – Library eBooks and Audiobooks : 65450 (11.76%)\n",
      "HOOKED - Chat Stories : 47829 (8.59%)\n",
      "\n",
      "\n",
      "***** Top 5 apps for Finance *****\n",
      "Chase Mobile℠ : 233270 (20.59%)\n",
      "Mint: Personal Finance, Budget, Bills & Money : 232940 (20.56%)\n",
      "Bank of America - Mobile Banking : 119773 (10.57%)\n",
      "PayPal - Send and request money safely : 119487 (10.55%)\n",
      "Credit Karma: Free Credit Scores, Reports & Alerts : 101679 (8.98%)\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "def top_apps_by_genre(dataset, genre, genre_index, appname_index, users_index, top_n = 5, pct = False):\n",
    "    genre_apps = []\n",
    "    total_genre_users = 0\n",
    "    for app in dataset:        \n",
    "        app_genre = app[genre_index]\n",
    "        app_name = app[appname_index]\n",
    "        app_users = int((app[users_index].replace(',','')).replace('+',''))        \n",
    "        if app_genre == genre:\n",
    "            total_genre_users += app_users\n",
    "            app_tupple = (app_users, app_name)\n",
    "            genre_apps.append(app_tupple)\n",
    "    top = 0\n",
    "    print('*'*5,'Top',top_n,'apps for',genre,'*'*5)    \n",
    "    for app in sorted(genre_apps, reverse = True):\n",
    "        top += 1\n",
    "        if top > top_n:\n",
    "            print('\\n')\n",
    "            break\n",
    "        app_name = app[1]\n",
    "        app_users = app[0]\n",
    "        if pct == True:\n",
    "            if total_genre_users != 0:\n",
    "                app_user_pct = round((app_users / total_genre_users) * 100,2)\n",
    "            else:\n",
    "                app_user_pct = 0\n",
    "            print(app_name,':',app_users, '(' + str(app_user_pct)+'%)')\n",
    "        else:\n",
    "            print(app_name,':',app_users)\n",
    "        \n",
    "top_apps_by_genre(dataset = apple_final, genre=\"Navigation\", genre_index=-5, appname_index=1, users_index=5, pct=True)\n",
    "top_apps_by_genre(dataset = apple_final, genre=\"Reference\", genre_index=-5, appname_index=1, users_index=5, pct=True)\n",
    "top_apps_by_genre(dataset = apple_final, genre=\"Social Networking\", genre_index=-5, appname_index=1, users_index=5, pct=True)\n",
    "top_apps_by_genre(dataset = apple_final, genre=\"Music\", genre_index=-5, appname_index=1, users_index=5, pct=True)\n",
    "top_apps_by_genre(dataset = apple_final, genre=\"Weather\", genre_index=-5, appname_index=1, users_index=5, pct=True)\n",
    "top_apps_by_genre(dataset = apple_final, genre=\"Book\", genre_index=-5, appname_index=1, users_index=5, pct=True)\n",
    "top_apps_by_genre(dataset = apple_final, genre=\"Finance\", genre_index=-5, appname_index=1, users_index=5, pct=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at the Top 5 apps for few genres which command highest users, we see the following patterns:\n",
    "\n",
    "1. Consistently in <mark>Navigation</mark>, <mark>Social Networking</mark>, <mark>Music</mark> generes few apps dominate the user base and hence they skew the results overall\n",
    "2. <mark>Reference</mark> genre is interesting. Bible and Dictionary has near monopoly same as Navigation with Google (Waze and Google Maps owned by Google)\n",
    "3. <mark>Weather</mark> genre shows promise and is a practical use app rather than fun app which is near saturation point in AppStore - However, considering our primary goal of free app and maximising ad-revenue, weather might not be suitable genre were users would not stay long enough within the app\n",
    "4. Other genres that provide practical purpose such as \"Food & Drink\" could be considered but these require deeper partnerships at the supply chain level but may be perhaps marketplace is preferrable option here\n",
    "\n",
    "Even though <mark>Book</mark> genre is dominated by Amazon, we see promise that small business apps such as \"Color Therapy Adult Coloring Book for Adults\", \"HOOKED - Chat Stories\" and \"OverDrive – Library eBooks and Audiobooks\" have good percentage of user base too (Combined ~35%).\n",
    "\n",
    "So if we bring a standalone app in the <mark>Book</mark> genre perhaps of some famous best seller book there is potential for maximising ad-revenue by keeping the user within our app for longer time or creativity app such as kids or adult colouring book.\n",
    "\n",
    "**Note**: We do need to check about the rights and partnership for published books, but more importantly it should not already been an eBook/Audio book through Amazon's platform.\n",
    "\n",
    "<mark>Finance</mark> genre is interesting:\n",
    "\n",
    "- Proportion of apps is only 1.12% of the AppStore\n",
    "- Average number of users on the apps is quite high\n",
    "- Top 5 apps in this space shows no monopoly by any big players and spread is even\n",
    "- There are higher proportion of apps that provide banking/payment services but also there is services for personal finance\n",
    "\n",
    "Even though this genre requires deeper domain expertise, but if we could partner with some wealth management company or Financial adviser then we have a potential to add significant value to the user at the same time maximising the ad-revenue related to the financial advise in our app.\n",
    "\n",
    "From all of the genres in the AppStore, two genres definetly emerge as potentials in the AppStore:\n",
    "1. <mark>Book</mark> genre - Popular/best selling book not yet in big platforms (such as Amazon, Google, Apple) or Creativity app such as kids or adult colouring book\n",
    "2. <mark>Finance</mark> genre - Personal Finance app and maximise ad-revenue\n",
    "\n",
    "Both of these definetly fits our theme of apps for practical use.\n",
    "\n",
    "Let's now explore the Google PlayStore."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Most popular apps - Google Play Store\n",
    "\n",
    "In PlayStore dataset, the <mark>Installs</mark> column has a range such as 1,000+, 10,000+. But for our analysis we will consider these as hard numbers (By replacing \",\" and \"+\") as we are going to employ the same technique across the dataset it should not cause any error in judgment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "COMMUNICATION : 38456119\n",
      "VIDEO_PLAYERS : 24727872\n",
      "SOCIAL : 23253652\n",
      "PHOTOGRAPHY : 17840110\n",
      "PRODUCTIVITY : 16787331\n",
      "GAME : 15588016\n",
      "TRAVEL_AND_LOCAL : 13984078\n",
      "ENTERTAINMENT : 11640706\n",
      "TOOLS : 10801391\n",
      "NEWS_AND_MAGAZINES : 9549178\n",
      "BOOKS_AND_REFERENCE : 8767812\n",
      "SHOPPING : 7036877\n",
      "PERSONALIZATION : 5201483\n",
      "WEATHER : 5074486\n",
      "HEALTH_AND_FITNESS : 4188822\n",
      "MAPS_AND_NAVIGATION : 4056942\n",
      "FAMILY : 3697848\n",
      "SPORTS : 3638640\n",
      "ART_AND_DESIGN : 1986335\n",
      "FOOD_AND_DRINK : 1924898\n",
      "EDUCATION : 1833495\n",
      "BUSINESS : 1712290\n",
      "LIFESTYLE : 1433676\n",
      "FINANCE : 1387692\n",
      "HOUSE_AND_HOME : 1331541\n",
      "DATING : 854029\n",
      "COMICS : 817657\n",
      "AUTO_AND_VEHICLES : 647318\n",
      "LIBRARIES_AND_DEMO : 638504\n",
      "PARENTING : 542604\n",
      "BEAUTY : 513152\n",
      "EVENTS : 253542\n",
      "MEDICAL : 120551\n"
     ]
    }
   ],
   "source": [
    "['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']\n",
    "play_store_genres = freq_table(android_final, 1)\n",
    "genre_avg_ratings = {}\n",
    "for genre in play_store_genres:\n",
    "    total = 0 #total apps in genre\n",
    "    len_genre = 0 #total rating count of all apps in genre\n",
    "    for app in android_final:\n",
    "        genre_app = app[1]\n",
    "        if genre == genre_app:\n",
    "            app_usr_rating_tot = int((app[5].replace(',','')).replace('+',''))\n",
    "            len_genre += 1\n",
    "            total += app_usr_rating_tot\n",
    "    avg_genre = round(total/len_genre)\n",
    "    genre_avg_ratings[genre] = avg_genre\n",
    "\n",
    "display_table(genre_avg_ratings)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Even though the categorisation on the PlayStore is slightly different to the AppStore, there are some common themes and categories. But PlayStore user base shows good promise for both fun and practical purpose apps.\n",
    "\n",
    "Let's drill a bit more into the top 3 categories here and also we will explore two potentials from the App store - **Books** and **Finance** genre."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "***** Top 5 apps for COMMUNICATION *****\n",
      "WhatsApp Messenger : 1000000000 (9.06%)\n",
      "Skype - free IM & video calls : 1000000000 (9.06%)\n",
      "Messenger – Text and Video Chat for Free : 1000000000 (9.06%)\n",
      "Hangouts : 1000000000 (9.06%)\n",
      "Google Chrome: Fast & Secure : 1000000000 (9.06%)\n",
      "\n",
      "\n",
      "***** Top 5 apps for VIDEO_PLAYERS *****\n",
      "YouTube : 1000000000 (25.43%)\n",
      "Google Play Movies & TV : 1000000000 (25.43%)\n",
      "MX Player : 500000000 (12.72%)\n",
      "VivaVideo - Video Editor & Photo Movie : 100000000 (2.54%)\n",
      "VideoShow-Video Editor, Video Maker, Beauty Camera : 100000000 (2.54%)\n",
      "\n",
      "\n",
      "***** Top 5 apps for PRODUCTIVITY *****\n",
      "Google Drive : 1000000000 (17.27%)\n",
      "Microsoft Word : 500000000 (8.63%)\n",
      "Google Calendar : 500000000 (8.63%)\n",
      "Dropbox : 500000000 (8.63%)\n",
      "Cloud Print : 500000000 (8.63%)\n",
      "\n",
      "\n",
      "***** Top 5 apps for FINANCE *****\n",
      "Google Pay : 100000000 (21.97%)\n",
      "PayPal : 50000000 (10.99%)\n",
      "İşCep : 10000000 (2.2%)\n",
      "Wells Fargo Mobile : 10000000 (2.2%)\n",
      "Mobile Bancomer : 10000000 (2.2%)\n",
      "\n",
      "\n",
      "***** Top 5 apps for BOOKS_AND_REFERENCE *****\n",
      "Google Play Books : 1000000000 (60.03%)\n",
      "Wattpad 📖 Free Books : 100000000 (6.0%)\n",
      "Bible : 100000000 (6.0%)\n",
      "Audiobooks from Audible : 100000000 (6.0%)\n",
      "Amazon Kindle : 100000000 (6.0%)\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "top_apps_by_genre(dataset = android_final, genre=\"COMMUNICATION\", genre_index=1, appname_index=0, users_index=5, pct=True)\n",
    "top_apps_by_genre(dataset = android_final, genre=\"VIDEO_PLAYERS\", genre_index=1, appname_index=0, users_index=5, pct=True)\n",
    "top_apps_by_genre(dataset = android_final, genre=\"PRODUCTIVITY\", genre_index=1, appname_index=0, users_index=5, pct=True)\n",
    "top_apps_by_genre(dataset = android_final, genre=\"FINANCE\", genre_index=1, appname_index=0, users_index=5, pct=True)\n",
    "top_apps_by_genre(dataset = android_final, genre=\"BOOKS_AND_REFERENCE\", genre_index=1, appname_index=0, users_index=5, pct=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WhatsApp Messenger : 1,000,000,000+\n",
      "imo beta free calls and text : 100,000,000+\n",
      "Android Messages : 100,000,000+\n",
      "Google Duo - High Quality Video Calls : 500,000,000+\n",
      "Messenger – Text and Video Chat for Free : 1,000,000,000+\n",
      "imo free video calls and chat : 500,000,000+\n",
      "Skype - free IM & video calls : 1,000,000,000+\n",
      "Who : 100,000,000+\n",
      "GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+\n",
      "LINE: Free Calls & Messages : 500,000,000+\n",
      "Google Chrome: Fast & Secure : 1,000,000,000+\n",
      "Firefox Browser fast & private : 100,000,000+\n",
      "UC Browser - Fast Download Private & Secure : 500,000,000+\n",
      "Gmail : 1,000,000,000+\n",
      "Hangouts : 1,000,000,000+\n",
      "Messenger Lite: Free Calls & Messages : 100,000,000+\n",
      "Kik : 100,000,000+\n",
      "KakaoTalk: Free Calls & Text : 100,000,000+\n",
      "Opera Mini - fast web browser : 100,000,000+\n",
      "Opera Browser: Fast and Secure : 100,000,000+\n",
      "Telegram : 100,000,000+\n",
      "Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+\n",
      "UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+\n",
      "Viber Messenger : 500,000,000+\n",
      "WeChat : 100,000,000+\n",
      "Yahoo Mail – Stay Organized : 100,000,000+\n",
      "BBM - Free Calls & Messages : 100,000,000+\n"
     ]
    }
   ],
   "source": [
    "for app in android_final:\n",
    "    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'\n",
    "                                      or app[5] == '500,000,000+'\n",
    "                                      or app[5] == '100,000,000+'):\n",
    "        print(app[0], ':', app[5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we removed all the communication apps that have over 100 million installs, the average would be reduced roughly ten times:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3603485.3884615386"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "under_100_m = []\n",
    "\n",
    "for app in android_final:\n",
    "    n_installs = app[5]\n",
    "    n_installs = n_installs.replace(',', '')\n",
    "    n_installs = n_installs.replace('+', '')\n",
    "    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):\n",
    "        under_100_m.append(float(n_installs))\n",
    "        \n",
    "sum(under_100_m) / len(under_100_m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see a similar pattern for the video players category, which is the runner-up with 24,727,872 installs. The market is dominated by apps like Youtube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), or productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).\n",
    "\n",
    "Again, these app genres might seem more popular than reality and moreover, these niches seem to be dominated by a few giants who are hard to compete against.\n",
    "\n",
    "The game genre seems popular here too, but previously we found that in App Store this market seems a bit saturated, so we'd like to come up with a different app recommendation if possible.\n",
    "\n",
    "The books and reference genre looks fairly popular as well, with an average number of installs of 8,767,811, but, here again we see few giants dominating the market such as Amazon, Google and Bible."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "To reiterate our primary goal - We are going to launch the app in both AppStore and PlayStore albeit different timeline.\n",
    "\n",
    "So from that perspective, we are looking at the Top 5 apps on user base by the same genre/category in the AppStore we listed as potentials (Books and Finance).\n",
    "\n",
    "Here again the <mark>Books</mark> category shows dominance by big player with their marketplace apps such as Google or Amazon.\n",
    "\n",
    "This makes it especially hard to launch a famous/best selling book as app without having publishing license/rights issue with these big players is going to be hard.\n",
    "\n",
    "**Our recommendation at this stage is to develop a <mark>Personal Finance</mark> app in both AppStore and PlayStore**.\n",
    "\n",
    "**Note**:\n",
    "- We should not consider any sub-categories which require banking license, deep domain expertise and infrastructure (such as Payments) at this stage\n",
    "- To maximise the value we create for our user base and in turn maximise our ad-revenue, we should look at a partnership with a Financial advisor."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}