{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# COPY Matching App Type & Number of Users\n", "\n", "* Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.\n", "* Cost of app is free, revenue to be generated by advertisement views.\n", "* Two app sources are considered, Google Play for Android users and the Apple App Store for iOS users." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "def opener(file):\n", " import csv\n", " from csv import reader\n", " opened_file = open(file)\n", " read_file = reader(opened_file)\n", " data_file = list(read_file)\n", " return data_file\n", "\n", "ios_data = opener('AppleStore.csv')\n", "android_data = opener('googleplaystore.csv')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# function to isolate given number of rows in data set\n", "def slicer(data_file, start, stop):\n", " slice = data_file[start:stop]\n", " for row in slice:\n", " if row == slice[-1]:\n", " print(row)\n", " print('***')\n", " else:\n", " print(row)\n", " print('\\n')\n", " \n", "# function to count the number of total rows (including any header if present) and number of columns.\n", "def row_column_counter(data_file):\n", " print('There are ' + str(len(data_file)) + ' rows.')\n", " print('There are ' + str(len(data_file[0])) + ' columns.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Here is the header row for the iOS data file, and the first two rows as examples. 
There are 7,198 rows, including the header, and 16 data columns.__\n", "\n", "__The original data set is available here: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps__" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']\n", "\n", "\n", "['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']\n", "\n", "\n", "['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']\n", "***\n", "There are 7198 rows.\n", "There are 16 columns.\n" ] } ], "source": [ "slicer(ios_data, 0, 3)\n", "row_column_counter(ios_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Most of the iOS headings are self-explanatory, but for clarity here is a chart:__\n", "\n", "|Heading |Definition | Heading | Definition |\n", "|:-- |:-- |:-- |:-- |\n", "|id |App ID |user_rating_ver|Avg User Rating (current version) |\n", "|track_name |App Name |ver |Latest Version Code |\n", "|size_bytes |Size (in Bytes) |cont_rating |Content Rating |\n", "|currency |Currency Type |prime_genre |Primary Genre |\n", "|price |Price |sup_devices.num|Number of Supported Devices |\n", "|rating_count_tot|User Rating Count (all versions) |ipadSc_urls.num|Number of Screenshots Shown for Display|\n", "|rating_count_ver|User Rating Count (current version)|lang.num |Number of Supported Languages |\n", "|user_rating |Avg User Rating (all versions) |vpp_lic |VPP Device-Based Licensing Available |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Here is the header row for the Android data file, and the first two rows as 
examples. There are 10,842 rows, including the header, and 13 data columns.__\n", "\n", "__The original data set is available here: https://www.kaggle.com/lava18/google-play-store-apps__" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']\n", "\n", "\n", "['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']\n", "\n", "\n", "['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']\n", "***\n", "There are 10842 rows.\n", "There are 13 columns.\n" ] } ], "source": [ "slicer(android_data, 0, 3)\n", "row_column_counter(android_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__And here is a chart detailing the Android headings:__\n", "\n", "|Heading |Definition |Heading |Definition |\n", "|:-- |:-- |:-- |:-- |\n", "|App |Application Name |Price |Price |\n", "|Category|Category |Content Rating|Target Age Group |\n", "|Rating |User Rating |Genres |Genres |\n", "|Reviews |User Rating Count |Last Updated |Last Update (when scraped) |\n", "|Size |Size (in Megabytes)|Current Ver |Current Version |\n", "|Installs|Number of Downloads|Android Ver |Minimum Required Version of Android|\n", "|Type |Paid or Free | | |\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Below is an example of a corrupted row. It can simply be deleted, but careless deletion can cause problems with the data set.__\n", "__Here is a link to the discussion of the error: https://www.kaggle.com/lava18/google-play-store-apps/discussion/164101__\n", "\n", "The entry for the category column is missing. 
One solution is to fill in the missing value. Alternatively, the row can be removed using ```del```, but if that code block is run more than once it will delete rows beyond the corrupted one." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "12\n", "['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']\n", "13\n", "['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']\n", "1\n" ] } ], "source": [ "# Row 10472 of the Google Play data (header excluded), which is list index 10473\n", "\n", "print(len(android_data[10473]))\n", "print(android_data[10473])\n", "\n", "print(len(android_data[10474]))\n", "print(android_data[10474])\n", "\n", "# Check whether any other rows are missing entries.\n", "short_rows = []\n", "for row in android_data:\n", " if len(row) != 13:\n", " short_rows.append(row)\n", "\n", "print(len(short_rows))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__A quick look shows that the only row with this problem is the one already identified, so filling in the value or using ```del``` are both reasonable options.__\n", "\n", "__However, sometimes the best practice is to leave the original data set unchanged. 
In that case it is possible to create a copy and fill it with only the 'good' data.__\n", "\n", "* print the length of the data set\n", "* print the length of an (initially empty) copy of the data set\n", "* use a function to add every row in the original data set that is the same length as the header row to the copy\n", "* print the lengths of the two data sets again" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The android data set length is 10842\n", "The android data set copy length is 0\n", "The android data set length is 10842\n", "The android data set copy length is 10841\n" ] } ], "source": [ "data_copy = []\n", "print('The android data set length is ' + str(len(android_data)))\n", "print('The android data set copy length is ' + str(len(data_copy)))\n", "\n", "# copy every row whose length matches the header row\n", "def refiner(data_set, copy):\n", " for row in data_set:\n", " if len(row) == len(data_set[0]):\n", " copy.append(row)\n", "\n", "refiner(android_data, data_copy)\n", "print('The android data set length is ' + str(len(android_data)))\n", "print('The android data set copy length is ' + str(len(data_copy)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__For convenience in the rest of the project, the original Android data set can be replaced by the cleaned copy.__" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10841\n" ] } ], "source": [ "android_data = data_copy\n", "print(len(android_data))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Many data sets contain duplicate entries that need to be consolidated or removed. In this case there are three entries for 'Slack'. 
One of them has a different value in the fourth position, the number of reviews (51510 vs. 51507).__" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']\n", "['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']\n", "['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']\n" ] } ], "source": [ "for row in android_data:\n", " app_name = row[0]\n", " if app_name == 'Slack':\n", " print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__The code below compiles the app names into two lists, one of unique entries and one of duplicated entries. 
It then prints the first ten entries of the duplicate list as an example, along with the number of times the app 'Slack' appears across both lists.__" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of unique apps: 9659\n", "Number of duplicate apps: 1181\n", "\n", "\n", "Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']\n", "\n", "\n", "The number of times \"Slack\" appears in either data set: 3\n" ] } ], "source": [ "duplicate_apps = []\n", "unique_apps = []\n", "\n", "for row in android_data[1:]:\n", " name = row[0]\n", " if name in unique_apps:\n", " duplicate_apps.append(name)\n", " else:\n", " unique_apps.append(name)\n", " \n", "print('Number of unique apps:', len(unique_apps))\n", "print('Number of duplicate apps:', len(duplicate_apps))\n", "print('\\n')\n", "print('Examples of duplicate apps:', duplicate_apps[:10])\n", "print('\\n')\n", "\n", "# count the 'Slack' entries across both lists\n", "slack_count = unique_apps.count('Slack') + duplicate_apps.count('Slack')\n", "\n", "print('The number of times \"Slack\" appears in either data set:', slack_count)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__One criterion for deciding which entry to retain is to keep the one with the highest number of reviews.__\n", "\n", "__Below, the code loops through all the data and builds a dictionary with the app name as the key and the number of reviews as the corresponding value. If the code finds an entry whose name already exists in the dictionary, it keeps whichever entry has the most reviews.__\n", "\n", "__For the example, we want to keep the third entry for 'Slack', where the User Rating Count = 51510. 
The code also shows that the ```reviews_max``` dictionary is the same length as the ```unique_apps``` list above.__" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The number of reviews for \"Slack\": 51510.0\n", "The number of unique entries: 9659\n" ] } ], "source": [ "reviews_max = {}\n", "\n", "for row in android_data[1:]:\n", " name = row[0]\n", " n_reviews = float(row[3])\n", " if name not in reviews_max:\n", " reviews_max[name] = n_reviews\n", " elif reviews_max[name] < n_reviews:\n", " reviews_max[name] = n_reviews\n", "\n", "print('The number of reviews for \"Slack\":', reviews_max['Slack'])\n", "print('The number of unique entries:', len(reviews_max))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__In this loop the entire row is added to a new, cleaned data set if the User Rating Count in the original data set matches the value in the ```reviews_max``` dictionary built above.__\n", "\n", "__The output confirms that the length of the new data set matches the length of both the ```reviews_max``` dictionary and the ```unique_apps``` list. 
It also shows the correct User Rating Count value for the 'Slack' application.__" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The length of the android_clean data set: 9659\n", "\n", "\n", "['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']\n" ] } ], "source": [ "android_clean = []\n", "already_added = []\n", "\n", "# the already_added check prevents keeping two identical rows that both match the maximum review count\n", "for row in android_data[1:]:\n", " name = row[0]\n", " n_reviews = float(row[3])\n", " if reviews_max[name] == n_reviews and name not in already_added:\n", " android_clean.append(row)\n", " already_added.append(name)\n", " \n", "print('The length of the android_clean data set:', len(android_clean))\n", "\n", "print('\\n')\n", "for row in android_clean:\n", " if row[0] == 'Slack':\n", " print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__The iOS data set uses an ID number for each application. It's straightforward to use this to see if there are any duplicate entries.__" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The total number of entries in the iOS data set: 7198\n", "The total number of unique ID numbers in the iOS data set: 7198\n" ] } ], "source": [ "print('The total number of entries in the iOS data set:', len(ios_data))\n", "\n", "ios_id_nums = []\n", "for row in ios_data:\n", " if row[0] not in ios_id_nums:\n", " ios_id_nums.append(row[0])\n", " else:\n", " print('This ID number already exists:', row[0])\n", " \n", "print('The total number of unique ID numbers in the iOS data set:', len(ios_id_nums))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Any title with three or more characters outside the common English (ASCII) character set will be removed from the data set. 
This allows for some special characters in titles, but reduces the likelihood that the application is intended for a non-English-speaking audience.__\n", "__In the code blocks below the function to do this is written, tested on a sample data set, and then called on the Android and iOS files.__" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n", "False\n", "True\n", "True\n" ] } ], "source": [ "string1 = 'Instagram'\n", "string2 = '爱奇艺PPS -《欢乐颂2》电视剧热播'\n", "string3 = 'Docs To Go™ Free Office Suite'\n", "string4 = 'Instachat 😜'\n", "\n", "# return False for names with three or more non-ASCII characters, True otherwise\n", "def common_english_character(string):\n", " c_count = 0\n", " for character in string:\n", " if ord(character) > 127:\n", " c_count += 1\n", " if c_count == 3:\n", " return False\n", " return True\n", "\n", "print(common_english_character(string1))\n", "print(common_english_character(string2))\n", "print(common_english_character(string3))\n", "print(common_english_character(string4))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[['Instagram'], ['Docs To Go™ Free Office Suite'], ['Instachat 😜']]\n", "[['爱奇艺PPS -《欢乐颂2》电视剧热播']]\n" ] } ], "source": [ "dataset = [ ['Instagram'], ['爱奇艺PPS -《欢乐颂2》电视剧热播'], ['Docs To Go™ Free Office Suite'], ['Instachat 😜']]\n", "non_english_dataset = []\n", "\n", "def common_english_character(dataset1, dataset2):\n", " # NOTE: this loops over the global dataset, not the dataset1 parameter,\n", " # and it removes rows from a list while iterating over it, which skips\n", " # the element that follows each removal\n", " for row in dataset:\n", " c_count = 0\n", " app_name = row[0]\n", " for character in app_name:\n", " if ord(character) > 127:\n", " c_count += 1\n", " if c_count >= 3:\n", " dataset1.remove(row)\n", " dataset2.append(row)\n", " \n", "common_english_character(dataset, non_english_dataset)\n", "print(dataset)\n", "print(non_english_dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__*Note: ```common_english_character``` (and ```cec_ios``` below) iterate over the global ```dataset``` variable rather than their first parameter, and they remove rows from a list while looping over it, which skips the element after every removal. The counts below are therefore incomplete, and re-running these cells removes additional rows each time.*__" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ {
"name": "stdout", "output_type": "stream", "text": [ "61\n", "[['Truyện Vui Tý Quậy', 'COMICS', '4.5', '144', '4.7M', '10,000+', 'Free', '0', 'Everyone', 'Comics', 'July 19, 2018', '3.0', '4.0.3 and up'], ['Flame - درب عقلك يوميا', 'EDUCATION', '4.6', '56065', '37M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 26, 2018', '3.3', '4.1 and up'], ['At home - rental · real estate · room finding application such as apartment · apartment', 'HOUSE_AND_HOME', '3.8', '2496', 'Varies with device', '500,000+', 'Free', '0', 'Everyone', 'House & Home', 'July 5, 2018', 'Varies with device', 'Varies with device'], ['乐屋网: Buying a house, selling a house, renting a house', 'HOUSE_AND_HOME', '3.7', '2248', '15M', '100,000+', 'Free', '0', 'Everyone', 'House & Home', 'August 3, 2018', 'v3.1.1', '4.0 and up'], ['သိင်္ Astrology - Min Thein Kha BayDin', 'LIFESTYLE', '4.7', '2225', '15M', '100,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'July 26, 2018', '4.2.1', '4.0.3 and up']]\n", "9598\n", "9598\n" ] } ], "source": [ "non_english_android = []\n", "dataset = android_clean\n", "\n", "common_english_character(dataset, non_english_android)\n", "\n", "print(len(non_english_android))\n", "print(non_english_android[:5])\n", "print(len(android_clean))\n", "print(len(dataset))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7198\n", "0\n" ] } ], "source": [ "non_english_ios = []\n", "\n", "def cec_ios(dataset1, dataset2):\n", " for row in dataset:\n", " c_count = 0\n", " app_name = row[1]\n", " for character in app_name:\n", " if ord(character) > 127:\n", " c_count += 1\n", " if c_count >= 3:\n", " dataset1.remove(row)\n", " dataset2.append(row)\n", " \n", "cec_ios(ios_data[1:], non_english_ios)\n", "\n", "print(len(ios_data))\n", "print(len(non_english_ios))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"7198\n", "6487\n", "710\n" ] } ], "source": [ "non_english_ios = []\n", "dataset = ios_data[1:]\n", "\n", "cec_ios(dataset, non_english_ios)\n", "\n", "print(len(ios_data))\n", "print(len(dataset))\n", "print(len(non_english_ios))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6866\n" ] } ], "source": [ "extra = []\n", "\n", "cec_ios(ios_data, extra)\n", "\n", "print(len(ios_data))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }