{ "cells": [ { "cell_type": "markdown", "id": "6216ec6b", "metadata": {}, "source": [ "# PROFITABLE APPS ON THE MARKET - A GUIDED DATA ANALYSIS PROJECT\n", "\n", "**Project Description**\n", "\n", "Data analysis is the process of exploring, cleaning and transforming data into a usable format. In this analysis, we are going to explore the apps on both the Apple App Store and the Google Play Store and see what types of apps are currently attracting users. Most of us use a smartphone every day, and we use different apps for different purposes.\n", "\n", "**Project Goal**\n", "\n", "The goal of this analysis is to recommend the types of apps that developers can work on. To get there, we are going to explore, clean, transform and analyze the data to come up with a recommendation." ] }, { "cell_type": "markdown", "id": "7b837550-c82c-4131-b804-d6927d6cfaf9", "metadata": {}, "source": [ "## Defining the Functions to open, read and explore the dataset\n", "\n", "Before we start our exploration, cleaning and analysis, we're going to define functions that will load and explore our dataset." ] }, { "cell_type": "code", "execution_count": 1, "id": "08cc369f", "metadata": {}, "outputs": [], "source": [ "def DatasetCsv(file):\n", "    '''This function takes in one parameter:\n", "    file = the file name / file path that needs to be converted to a list of lists\n", "    '''\n", "    from csv import reader\n", "    # Open with a context manager so the file is closed after reading\n", "    with open(file, encoding = 'utf8') as dataset:\n", "        data = list(reader(dataset))\n", "\n", "    return data" ] }, { "cell_type": "code", "execution_count": 2, "id": "8285bf00", "metadata": {}, "outputs": [], "source": [ "def explore_data(dataset,start,end,rows_and_columns = False):\n", "    '''This function takes in four parameters:\n", "    dataset = the dataset, given as a list of lists. 
It contains all the rows and an optional header.\n", "    start and end = integers marking the start and end of the slice to print from the list of lists.\n", "    rows_and_columns = a boolean; when True, the function also prints the number of columns and rows of the dataset.\n", "    '''\n", "    dataset_slice = dataset[start:end]\n", "    for i in dataset_slice:\n", "        print(i)\n", "\n", "    if rows_and_columns:\n", "        print('')\n", "        print('The number of column is', len(dataset[0]))\n", "        print('The number of row is', len(dataset))" ] }, { "cell_type": "markdown", "id": "cb5a05ab-6c15-4a97-a406-501dec4e01a3", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "c8dc46cb-2aca-450a-8f83-7919543791eb", "metadata": {}, "source": [ "# Initial Dataset Exploration\n", "\n", "To understand which `variables` should drive the decision on the type of application to build, we'll look at the **first five (5) rows** of each dataset and identify these `variables`." ] }, { "cell_type": "markdown", "id": "de35a64d-c7cf-47df-95c0-8857c1c78831", "metadata": {}, "source": [ "### iOS App Store\n", "\n", "First, we will explore the iOS dataset. The source for this dataset can be found *[here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)*. This dataset was last updated on 2018-06-05."
] }, { "cell_type": "code", "execution_count": 3, "id": "2810ff9e-cef8-4bc6-ba9b-fbb8c77d2f11", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']\n", "['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']\n", "['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']\n", "['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']\n", "['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']\n", "['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']\n", "\n", "The number of column is 17\n", "The number of row is 7198\n" ] } ], "source": [ "ios = DatasetCsv(r'C:\\Users\\Mico\\OneDrive\\Desktop\\DATASETS\\KAGGLE\\APPLE STORE\\AppleStore.csv') # Load 'AppleStore.csv' with DatasetCsv(file) and store it in the variable 'ios'\n", "explore_data(ios,0,6,rows_and_columns = True) # Initial exploration of the 'ios' dataset: header plus the first five rows" ] }, { "cell_type": "markdown", "id": "1c19a6ce-429a-45c0-b3f1-1c1b23e84607", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "id": "319bc920-d56a-456a-801b-711538f24aad", "metadata": {}, "source": [ "Based on our observation of the dataset, we have `7,197 apps` available (7,198 rows including the header). 
From the `17 columns` that are present (including an unnamed index column), we identified the following columns as useful for this analysis:" ] }, { "cell_type": "markdown", "id": "7a3eda61-fc6d-459e-9b8f-df92c0142924", "metadata": {}, "source": [ "| Column | Description | Ideal Datatype |\n", "|:---: | :---: | :---: |\n", "| size_bytes | Size (in bytes) | Integer |\n", "| price | Price amount | Float |\n", "| rating_count_tot | Total number of user ratings | Integer |\n", "| user_rating | Average user rating value | Float |\n", "| cont_rating | Content rating | String |\n", "| prime_genre | Primary genre | String |\n", "| ipadSc_urls.num | Number of screenshots shown for display | Integer |\n", "| lang.num | Number of supported languages | Integer |\n", "\n", "For more information about the columns: [Kaggle - Mobile App Store ( 7200 apps)](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)" ] }, { "cell_type": "markdown", "id": "36ae924b-2e75-48fd-9259-d8b02c8feedf", "metadata": {}, "source": [ "### Google Play Store\n", "\n", "For the Google Play Store dataset, the source can be found *[here](https://www.kaggle.com/lava18/google-play-store-apps)*. This dataset was last updated on 2018-09-04."
] }, { "cell_type": "code", "execution_count": 4, "id": "f2e3a981", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']\n", "['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']\n", "['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']\n", "['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']\n", "['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']\n", "['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']\n", "\n", "The number of column is 13\n", "The number of row is 10842\n" ] } ], "source": [ "android = DatasetCsv(r'C:\\Users\\Mico\\OneDrive\\Desktop\\DATASETS\\KAGGLE\\GOOGLE PLAY STORE\\googleplaystore.csv')\n", "explore_data(android,0,6,rows_and_columns = True)" ] }, { "cell_type": "markdown", "id": "3ae2949f-4919-46b6-882d-42591cc7dee4", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "id": "9e554f62-b78b-40b6-8c22-b4c1b599233e", "metadata": {}, "source": [ "Based on our observation of the dataset, we have `10,841 apps` available (10,842 rows including the header). 
From the `13 columns` that are present, we identified the following columns as useful for this analysis:" ] }, { "cell_type": "markdown", "id": "d0bb0faa-e096-41f5-84bb-40e02ba9906c", "metadata": {}, "source": [ "| Column | Description | Ideal Datatype |\n", "|:---: | :---: | :---: |\n", "| Category | Category the app belongs to | String |\n", "| Rating | Average user rating value | Float |\n", "| Reviews | Total number of user reviews | Integer |\n", "| Size | Size of the app (in megabytes) | Float |\n", "| Installs | Number of user downloads/installs for the app | Integer |\n", "| Type | Paid or Free | Boolean |\n", "| Content Rating | Age group the app is targeted at - Children / Mature 21+ / Adult | String |\n", "| Genres | An app can belong to multiple genres (apart from its main category). For eg, a musical family game will belong to | String |\n", "\n", "For more information about the columns: [Kaggle - Google Play Store Apps](https://www.kaggle.com/lava18/google-play-store-apps)" ] }, { "cell_type": "markdown", "id": "79a0068d-4f09-403a-9a02-a93c954ac35b", "metadata": { "tags": [] }, "source": [ "---" ] }, { "cell_type": "markdown", "id": "17bad710-ddb9-4782-b39f-61166cadbdf8", "metadata": {}, "source": [ "# Data Cleaning\n", "\n", "Before further analysis, we have to ensure that the dataset is accurate and in a usable format. Remember that our goal is to understand what types of apps attract more users. Our target audience is *English-speaking* users, and we will only build *free* apps.\n", "\n", "**In this section we are going to do the following:**\n", "- Detect inaccurate data, and correct or remove it.\n", "- Detect duplicate data, and remove the duplicates.\n", "- Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.\n", "- Remove apps that aren't free."
] }, { "cell_type": "markdown", "id": "a6c864a8-4a40-4519-98a9-c955696222a2", "metadata": {}, "source": [ "### Data inaccuracy" ] }, { "cell_type": "markdown", "id": "615033c4-8e29-46c2-a047-d6512637eb7e", "metadata": {}, "source": [ "For our Google Play dataset, a discussion on [Kaggle](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) states that the `Rating` attribute is missing on entry **10472**. Note that `Rating` is the 3rd column, which makes its index **[2]**. Our dataset has a header row, and the reported entry number may or may not count the header, so we will explore entries **10472-10473** to investigate." ] }, { "cell_type": "code", "execution_count": 5, "id": "688491ae-9621-4cf4-83aa-1c1bb29da939", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']\n", "['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']\n", "Number of columns in entry 10472: 13 columns\n", "Number of columns in entry 10473: 12 columns\n" ] } ], "source": [ "explore_data(android,10472,10474) # Note: a slice [start:end] excludes the end index, so 10472:10474 prints entries 10472 and 10473\n", "print('Number of columns in entry 10472:',len(android[10472]),'columns')\n", "print('Number of columns in entry 10473:',len(android[10473]),'columns')" ] }, { "cell_type": "markdown", "id": "ff3b75eb-02c6-45c1-874c-4d698cc82df9", "metadata": { "tags": [] }, "source": [ " " ] }, { "cell_type": "markdown", "id": "444e3023-f252-4e89-884e-65dc1e2e7fd0", "metadata": {}, "source": [ "After analyzing the two observations, we noticed that entry **10473** is missing a column. `Rating` is present in the data, and `Category` is missing. 
Upon further investigation, we noticed an element with an empty `''` entry. Based on the elements beside it, we can deduce that this is the `Genres` attribute.\n", "\n", "This row would cause errors in the analysis, and we've identified `Category` and `Genres` as important parts of the analysis. We can either look up the application **Life Made WI-Fi Touchscreen Photo Frame** in the Google Play Store to populate the data, or we can delete the observation. For this analysis, we will delete the observation." ] }, { "cell_type": "code", "execution_count": 6, "id": "902db15a-0b35-4572-9baa-af68a3ed86d0", "metadata": {}, "outputs": [], "source": [ "del android[10473]" ] }, { "cell_type": "markdown", "id": "8a7be1cc-41a5-4dd8-a706-edda0e397c3a", "metadata": {}, "source": [ "To verify that the change was applied to the dataset, we will run the `explore_data` function again and check the number of columns." ] }, { "cell_type": "code", "execution_count": 7, "id": "265f9e25-2a53-43fb-b38e-67fdd5c0caac", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']\n", "['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']\n", "Number of columns in entry 10472: 13 columns\n", "Number of columns in entry 10473: 13 columns\n" ] } ], "source": [ "explore_data(android,10472,10474)\n", "print('Number of columns in entry 10472:',len(android[10472]),'columns')\n", "print('Number of columns in entry 10473:',len(android[10473]),'columns')" ] }, { "cell_type": "markdown", "id": "77bd25b5-9847-4716-9617-ddc1c8a6418a", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "id": "dc073d97-2f79-445f-b689-c943988ed512", "metadata": {}, "source": [ "### Data 
Duplicates" ] }, { "cell_type": "markdown", "id": "6ec5eefe-2181-49ac-8c63-7e8d139ef1cf", "metadata": {}, "source": [ "Now we will investigate whether the dataset has any duplicates. We will identify duplicates using the `App` (index[0]) attribute, which gives us the name of the app." ] }, { "cell_type": "code", "execution_count": 8, "id": "bc9e49d5-ca4b-4660-a7e1-f433376ff662", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The unique number of apps is: 9659\n", "The number of duplicate apps is: 1181\n", "Dupicated apps examples: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']\n", "The most duplicated apps is/are: ['ROBLOX']\n" ] } ], "source": [ "android_rows = android[1:] # The rows of the android dataset, without the header\n", "unique_android_apps = list() # Names of unique applications in the Google Play Store dataset\n", "duplicate_android_apps = list() # Names of apps that appear more than once (one entry per extra copy)\n", "most_dup_app = list() # App(s) with the highest number of duplicates\n", "\n", "\n", "# Populate unique_android_apps and duplicate_android_apps\n", "\n", "for i in android_rows:\n", "    name = i[0]\n", "\n", "    if name in unique_android_apps:\n", "        duplicate_android_apps.append(name)\n", "    else:\n", "        unique_android_apps.append(name)\n", "\n", "frequency_of_duplicate = dict() # Frequency table of the duplicates\n", "\n", "for i in duplicate_android_apps:\n", "    frequency_of_duplicate[i] = frequency_of_duplicate.get(i,0) + 1\n", "\n", "# Identify which app(s) have the highest number of duplicates\n", "\n", "max_duplicates = max(frequency_of_duplicate.values())\n", "for i in frequency_of_duplicate:\n", "    if frequency_of_duplicate[i] == max_duplicates:\n", "        most_dup_app.append(i)\n", "\n", "print('The unique number of apps is:',len(unique_android_apps))\n", "print('The number of duplicate apps 
is:',len(duplicate_android_apps))\n", "print('Dupicated apps examples:',duplicate_android_apps[:5])\n", "print('The most duplicated apps is/are:', most_dup_app)" ] }, { "cell_type": "markdown", "id": "088fd881-ac6f-4cc4-b3fe-3c21b5286da2", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "id": "61d8af9f-65bb-461a-a6ac-0c6e9043df7f", "metadata": {}, "source": [ "As we can see, there are a total of **1,181** duplicate entries. The app with the most duplicates is **'ROBLOX'**.\n", "\n", "We will investigate this duplicated app, since we should not remove duplicates at random." ] }, { "cell_type": "code", "execution_count": 9, "id": "d5d5078a-5d88-43c0-8218-2d0c0967e11b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['ROBLOX', 'GAME', '4.5', '4447388', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']\n", "['ROBLOX', 'GAME', '4.5', '4447346', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']\n", "['ROBLOX', 'GAME', '4.5', '4448791', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']\n", "['ROBLOX', 'GAME', '4.5', '4449882', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']\n", "['ROBLOX', 'GAME', '4.5', '4449910', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']\n", "['ROBLOX', 'FAMILY', '4.5', '4449910', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']\n", "['ROBLOX', 'FAMILY', '4.5', '4450855', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and 
up']\n", "['ROBLOX', 'FAMILY', '4.5', '4450890', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']\n", "['ROBLOX', 'FAMILY', '4.5', '4443407', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']\n" ] } ], "source": [ "# Print every row whose app name matches an entry in the most_dup_app list\n", "\n", "for i in most_dup_app:\n", "    for x in android_rows:\n", "        if x[0] == i:\n", "            print(x)" ] }, { "cell_type": "markdown", "id": "40a9b6bd-d8ed-4b04-8c81-67297f012e6f", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "id": "47a5a8c5-026b-4556-8ea9-31dacfefc3b2", "metadata": {}, "source": [ "The `Category` and `Reviews` attributes differ between the **'ROBLOX'** entries. We can use this information to create a criterion for deleting observations. For this project, we are going to keep, among the duplicates, the observation that has the highest `Reviews` count, since we can assume that the entry with the most reviews is the most recent one.\n", "\n", "As we start removing the duplicates, we have to keep in mind that the total number of apps should be **9,659**." ] }, { "cell_type": "code", "execution_count": 10, "id": "f6220b74-713d-497b-a9e8-5ad0f5dc2500", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The unique number of apps is: 9659\n" ] } ], "source": [ "reviews_max = dict() # Maps each app name to its highest review count. Dictionary keys are unique, so each app name is stored only once. 
The value is updated whenever a duplicate has a higher review count.\n", "\n", "for i in android_rows:\n", "    name = i[0]\n", "    reviews = float(i[3])\n", "\n", "    if name in reviews_max and reviews_max[name] < reviews:\n", "        reviews_max[name] = reviews\n", "    elif name not in reviews_max:\n", "        reviews_max[name] = reviews\n", "\n", "# To verify that we have the correct number of unique apps\n", "print('The unique number of apps is:',len(reviews_max))" ] }, { "cell_type": "markdown", "id": "db98f1f7-4ef7-477b-bdb4-74769fb2f7ba", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "id": "ff3126f2-d3a2-4de6-8c28-30254ae1c6b4", "metadata": {}, "source": [ "We will now remove the duplicate apps by looping through the `android_rows` list. The cleaned dataset will be stored in a new list, `android_clean`." ] }, { "cell_type": "code", "execution_count": 11, "id": "cb6765fd-2c29-44fa-8b4f-cfe74ba99fde", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The number of rows in our new dataset 'android_clean' is 9659\n" ] } ], "source": [ "android_clean = list()\n", "already_added = list() # Names already copied to android_clean, used to avoid adding a duplicate\n", "\n", "for i in android_rows:\n", "    name = i[0]\n", "    reviews = float(i[3])\n", "\n", "    if (name in reviews_max) and (reviews_max[name] == reviews) and (name not in already_added):\n", "        android_clean.append(i)\n", "        already_added.append(name) # Needed because several duplicates of an app can share the same (highest) review count\n", "\n", "# To verify that all the data is unique, remember that there are 9,659 unique apps.\n", "print(\"The number of rows in our new dataset 'android_clean' is\",len(android_clean))" ] }, { "cell_type": "markdown", "id": "3d445352-d7ee-46b9-8c0c-1703947b5108", "metadata": {}, 
"source": [ " " ] }, { "cell_type": "markdown", "id": "fcc38e93-b6bc-4884-a5b3-608a4b1b2362", "metadata": {}, "source": [ "### Removing non-English apps" ] }, { "cell_type": "markdown", "id": "0457f84b-5e6c-43f3-a06f-ef53f9069024", "metadata": {}, "source": [ "Since the app that we are going to build is targeted at an *English-speaking* audience, we have to remove the apps whose names contain non-English characters. In the [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange) system, the characters commonly used in English text have code points in the range 0 to 127. We can get the code point of a character by passing it as an argument to the built-in function `ord()`. Below, we are going to create a function that loops through the characters of an app name. It returns `True` if the name contains at most three characters outside the ASCII range, which tolerates names with a few symbols or emoji, and `False` if the name contains more than three such characters." ] }, { "cell_type": "code", "execution_count": 12, "id": "9f7c62ec-65b4-4fd4-9766-574899fc8fde", "metadata": {}, "outputs": [], "source": [ "def english(string):\n", "    '''This function takes in one argument:\n", "    string = the string that we are going to loop through to check for non-English characters. If more than three characters have a code point > 127, the function returns False; otherwise it returns True'''\n", "    count = 0\n", "    for character in string:\n", "        if ord(character) > 127:\n", "            count += 1\n", "\n", "        if count > 3:\n", "            return False\n", "    return True" ] }, { "cell_type": "markdown", "id": "7481efd6-2459-4196-ac6a-abb33ad81465", "metadata": {}, "source": [ "Using the newly defined function `english()`, we will create a new list, `non_english_app`, and apply the `explore_data()` function to get an overview of these apps."
] }, { "cell_type": "code", "execution_count": 13, "id": "df7dc0ef-2eb9-4e8b-8d6b-ea1b19bd3753", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Non-english app example:\n", "\n", "['Flame - درب عقلك يوميا', 'EDUCATION', '4.6', '56065', '37M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 26, 2018', '3.3', '4.1 and up']\n", "['သိင်္ Astrology - Min Thein Kha BayDin', 'LIFESTYLE', '4.7', '2225', '15M', '100,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'July 26, 2018', '4.2.1', '4.0.3 and up']\n", "['РИА Новости', 'NEWS_AND_MAGAZINES', '4.5', '44274', '8.0M', '1,000,000+', 'Free', '0', 'Everyone', 'News & Magazines', 'August 6, 2018', '4.0.6', '4.4 and up']\n", "['صور حرف H', 'ART_AND_DESIGN', '4.4', '13', '4.5M', '1,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 27, 2018', '2.0', '4.0.3 and up']\n", "['L.POINT - 엘포인트 [ 포인트, 멤버십, 적립, 사용, 모바일 카드, 쿠폰, 롯데]', 'LIFESTYLE', '4.0', '45224', '49M', '5,000,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'August 1, 2018', '6.5.1', '4.1 and up']\n", "['RMEduS - 음성인식을 활용한 R 프로그래밍 실습 시스템', 'FAMILY', 'NaN', '4', '64M', '1+', 'Free', '0', 'Everyone', 'Education', 'July 17, 2018', '1.0.1', '4.4 and up']\n", "\n", "The number of column is 13\n", "The number of row is 45\n" ] } ], "source": [ "non_english_app = list()\n", "for i in android_clean:\n", "    # english() returns False when the name has more than three non-ASCII characters\n", "    if not english(i[0]):\n", "        non_english_app.append(i)\n", "print('Non-english app example:\\n')\n", "explore_data(non_english_app,0,6,rows_and_columns = True)" ] }, { "cell_type": "markdown", "id": "829ceee0-ce1d-4355-b4d5-6bef484b038a", "metadata": {}, "source": [ "To continue with our data cleaning, we are going to remove the `non_english_app` entries from our dataset and create a new dataset that will be assigned to the variable `english_android_app`."
] }, { "cell_type": "code", "execution_count": 14, "id": "213429f0-2126-473d-a327-1e80dc3ef8ef", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "English android app: 9614 rows\n" ] } ], "source": [ "english_android_app = list()\n", "\n", "for row in android_clean:\n", "    if row not in non_english_app:\n", "        english_android_app.append(row)\n", "print('English android app:',len(english_android_app),'rows')" ] }, { "cell_type": "markdown", "id": "6153fffd-d363-4f78-90db-6cb7b7da01fc", "metadata": {}, "source": [ "Remember that the number of unique apps before we removed the `non_english_app` entries was **9,659**, and the total number of `non_english_app` entries is **45**. Since 9,659 - 45 = 9,614, we successfully removed the non-English apps from our dataset." ] }, { "cell_type": "markdown", "id": "7de7b793-afd9-433e-8bde-db58e386106e", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "id": "d8bd15f2-7071-4cf1-97e7-87bac1f35bd0", "metadata": {}, "source": [ "### Extracting free apps" ] }, { "cell_type": "markdown", "id": "6ca54320-e7a5-44be-9929-411fa4f75009", "metadata": {}, "source": [ "As we've mentioned, our focus is to build a free app, and the main source of revenue will come from in-app ads. So in this section, we are going to isolate the free apps from the paid apps.\n", "\n", "First, we are going to identify the free apps in the `english_android_app` dataset."
] }, { "cell_type": "code", "execution_count": 15, "id": "324a0eef-c241-4a54-ab9e-95c661c6c7c5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']\n", "['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']\n", "['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']\n", "['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']\n", "['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']\n", "['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'April 26, 2018', '1.1', '4.0.3 and up']\n", "['Infinite Painter', 'ART_AND_DESIGN', '4.1', '36815', '29M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'June 14, 2018', '6.1.61.1', '4.2 and up']\n", "['Garden Coloring Book', 'ART_AND_DESIGN', '4.4', '13791', '33M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'September 20, 2017', '2.9.2', '3.0 and up']\n", "['Kids Paint Free - Drawing Fun', 'ART_AND_DESIGN', '4.7', '121', '3.1M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'July 3, 2018', '2.8', '4.0.3 and up']\n", "['Text on Photo - Fonteee', 'ART_AND_DESIGN', '4.4', '13880', '28M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'October 27, 2017', '1.0.4', '4.1 and up']\n", "\n", "The number of 
column is 13\n", "The number of row is 8863\n" ] } ], "source": [ "free_english_app = list()\n", "\n", "for row in english_android_app:\n", "    free_paid = row[6]\n", "    price = float(row[7].replace('$','')) # Also check the price so that no app marked 'Free' slips through with a non-zero price\n", "\n", "    if free_paid == 'Free' and price == 0.0:\n", "        free_english_app.append(row)\n", "\n", "# Exploring our cleaned dataset\n", "explore_data(free_english_app,0,10,rows_and_columns = True)" ] }, { "cell_type": "markdown", "id": "621b803c-7f05-45b4-b390-8e9c7eded3eb", "metadata": {}, "source": [ "After cleaning the data, we now have **`8,863 rows`** from the initial **10,841 rows**.\n", "\n", "We have now finished the following steps:\n", "- Detect inaccurate data, and correct or remove it.\n", "- Detect duplicate data, and remove the duplicates.\n", "- Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.\n", "- Remove apps that aren't free.\n", "\n", "We can say that the dataset `free_english_app` from the [Google Play Store Dataset](https://www.kaggle.com/lava18/google-play-store-apps) is now cleaned." ] }, { "cell_type": "markdown", "id": "6d6cccc3-bb5e-47a8-9767-abed445ae107", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "6531dfe9-b09a-4cf8-9ff5-7511a1bc1a44", "metadata": {}, "source": [ "# iOS\n", "For the [iOS dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps), we're going to follow the same steps that we applied to the [Google Play Store Dataset](https://www.kaggle.com/lava18/google-play-store-apps).\n", "\n", "To start, upon skimming the [discussion](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion/106176) on Kaggle, a user named *Marjan* reported that there are two duplicates in the dataset. We will investigate this report.\n", "\n", "In our initial data exploration earlier, the iOS dataset had **7,198 rows**. Excluding the header, we are expecting 7,197 rows."
] }, { "cell_type": "code", "execution_count": 16, "id": "85167779-dec7-4cbf-90fa-9ce98ed8bd96", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'VR Roller Coaster': 1, 'Mannequin Challenge': 1}\n" ] } ], "source": [ "ios_rows = ios[1:] # The rows of the iOS dataset, without the header\n", "unique_ios_apps = list() # Names of unique applications in the App Store dataset\n", "duplicate_ios_apps = list() # Names of apps that appear more than once\n", "\n", "\n", "for i in ios_rows:\n", "    name = i[2]\n", "\n", "    if name in unique_ios_apps:\n", "        duplicate_ios_apps.append(name)\n", "    else:\n", "        unique_ios_apps.append(name)\n", "\n", "frequency_of_ios_duplicate = dict() # Frequency table of the duplicates\n", "for i in duplicate_ios_apps:\n", "    frequency_of_ios_duplicate[i] = frequency_of_ios_duplicate.get(i,0) + 1\n", "\n", "print(frequency_of_ios_duplicate)" ] }, { "cell_type": "markdown", "id": "9225b4a8-e718-40bc-a215-7a3e3e353bd2", "metadata": {}, "source": [ "As we can see, two app names are duplicated in the `track_name` attribute. We will further investigate these two apps."
] }, { "cell_type": "code", "execution_count": 17, "id": "9eb093c3-e882-47c9-9bcb-ebdfea682caa", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index number: 3319\n", "['4000', '952877179', 'VR Roller Coaster', '169523200', 'USD', '0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']\n", "Index number: 5603\n", "['7579', '1089824278', 'VR Roller Coaster', '240964608', 'USD', '0', '67', '44', '3.5', '4', '0.81', '4+', 'Games', '38', '0', '1', '1']\n", "Index number: 7092\n", "['10751', '1173990889', 'Mannequin Challenge', '109705216', 'USD', '0', '668', '87', '3', '3', '1.4', '9+', 'Games', '37', '4', '1', '1']\n", "Index number: 7128\n", "['10885', '1178454060', 'Mannequin Challenge', '59572224', 'USD', '0', '105', '58', '4', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']\n" ] } ], "source": [ "# Print every row (with its position in ios_rows) whose name is in duplicate_ios_apps\n", "for i in duplicate_ios_apps:\n", "    for x in ios_rows:\n", "        name = x[2]\n", "        if i == name:\n", "            print('Index number:',ios_rows.index(x))\n", "            print(x)" ] }, { "cell_type": "markdown", "id": "b9414834-f386-4abf-94b5-3683dc871527", "metadata": {}, "source": [ "Upon further investigation, the `id` (index[1]) and `size_bytes` (index[3]) attributes differ between the entries that share a name, so we can conclude that these are different apps and we will keep them." ] }, { "cell_type": "markdown", "id": "e0142643-f836-418a-b4aa-c0f1c76a9fcf", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "id": "414fd64e-c734-455d-8508-4252e8a20926", "metadata": {}, "source": [ "To check data accuracy, we'll loop over all the rows and check for any row that does not contain **17 columns**."
] }, { "cell_type": "code", "execution_count": 18, "id": "8c2d961b-05e6-4e0f-a7bd-3a16f9ab71c7", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "All observations has 17 columns\n" ] } ], "source": [ "inaccurate_columns = list()\n", "for i in ios_rows:\n", "    if len(i) != 17:\n", "        inaccurate_columns.append(i)\n", "if len(inaccurate_columns) > 0:\n", "    print(inaccurate_columns)\n", "elif len(inaccurate_columns) == 0:\n", "    print('All observations has 17 columns')\n", " " ] }, { "cell_type": "markdown", "id": "e48d4ad9-3109-4540-acb5-5d4363d407be", "metadata": {}, "source": [ "Since all the observations have **17 columns**, we can safely index into any row without running into `IndexError: list index out of range`." ] }, { "cell_type": "markdown", "id": "d1cad213-6562-4425-8f39-70421d458f59", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "id": "628073a0-b007-4d80-9a8c-1e3bc1275641", "metadata": {}, "source": [ "Next, we will identify and remove the non-English apps using the `english()` function that we defined earlier." 
] }, { "cell_type": "code", "execution_count": 19, "id": "3f6bba68-0ede-41bd-91e9-2a8255c32d3c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['80', '299853944', '新浪新闻-阅读最新时事热门头条资讯视频', '115143680', 'USD', '0', '2229', '4', '3.5', '1', '6.2.1', '17+', 'News', '37', '0', '1', '1']\n", "['96', '303191318', '同花顺-炒股、股票', '122886144', 'USD', '0', '1744', '0', '3.5', '0', '10.10.46', '4+', 'Finance', '37', '0', '1', '1']\n", "['239', '331259725', '央视影音-海量央视内容高清直播', '54648832', 'USD', '0', '2070', '0', '2.5', '0', '6.2.0', '4+', 'Sports', '37', '0', '1', '1']\n", "['268', '336141475', '优酷视频', '204959744', 'USD', '0', '4885', '0', '3.5', '0', '6.7.0', '12+', 'Entertainment', '38', '0', '2', '1']\n", "['295', '340368403', 'クックパッド - No.1料理レシピ検索アプリ', '76644352', 'USD', '0', '115', '0', '3.5', '0', '17.5.1.0', '4+', 'Food & Drink', '37', '5', '1', '1']\n", "\n", "The number of column is 17\n", "The number of row is 1014\n" ] } ], "source": [ "non_english_ios_app = list()\n", "\n", "for i in ios_rows:\n", " name = i[2]\n", " \n", " if english(name) == False:\n", " non_english_ios_app.append(i)\n", "explore_data(non_english_ios_app,0,5,rows_and_columns = True)" ] }, { "cell_type": "markdown", "id": "4fb35483-5774-4414-acdd-c095e0921a79", "metadata": {}, "source": [ "As we observed there are a total of **1014 rows** of `non_english_ios_app` in the iOS dataset. We're going to isolate these apps." 
] }, { "cell_type": "code", "execution_count": 20, "id": "5ea701b3-7648-4066-b3e1-dc362e480fbc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']\n", "['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']\n", "['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']\n", "['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']\n", "['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']\n", "\n", "The number of column is 17\n", "The number of row is 6183\n" ] } ], "source": [ "english_ios_apps = list()\n", "\n", "for i in ios_rows:\n", "    if i not in non_english_ios_app:\n", "        english_ios_apps.append(i)\n", "explore_data(english_ios_apps,0,5,rows_and_columns = True)" ] }, { "cell_type": "markdown", "id": "3fbda8de-5ba0-4442-97f2-e3c48bef2ef3", "metadata": {}, "source": [ "Remember that the iOS dataset has **7,197 rows**. After removing the **1,014 non-English apps**, we are left with a dataset of **6,183 rows**." 
] }, { "cell_type": "code", "execution_count": 21, "id": "fbebfed9-b587-4be0-8a86-c04c98e2d4ea", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Free english ios apps example:\n", "\n", "['252', '334235181', 'Trainline UK: Live Train Times, Tickets & Planner', '110198784', 'USD', '0', '248', '0', '4', '0', '22', '4+', 'Travel', '37', '4', '1', '1']\n", "['253', '334256223', 'CBS News - Watch Free Live Breaking News', '78047232', 'USD', '0', '11691', '44', '3.5', '4.5', '3.5.1', '12+', 'News', '37', '5', '1', '1']\n", "['255', '334503000', 'The Impossible Quiz!', '44652544', 'USD', '0', '18884', '451', '4', '4.5', '1.62', '9+', 'Entertainment', '37', '0', '1', '1']\n", "['261', '335364882', 'Walgreens – Pharmacy, Photo, Coupons and Shopping', '169138176', 'USD', '0', '88885', '333', '4.5', '4', '6.5', '12+', 'Shopping', '37', '5', '1', '1']\n", "['264', '335744614', 'NBA', '112074752', 'USD', '0', '43682', '19', '3.5', '2.5', '2013.4.3', '4+', 'Sports', '37', '5', '1', '1']\n", "['266', '335875911', 'My Cycles Period and Ovulation Tracker', '77686784', 'USD', '0', '7469', '68', '3.5', '5', '5.10.3', '12+', 'Health & Fitness', '37', '0', '2', '1']\n", "\n", "The number of column is 17\n", "The number of row is 3222\n" ] } ], "source": [ "free_english_ios_apps = list()\n", "\n", "for i in english_ios_apps:\n", "    price = float(i[5])\n", " \n", "    if price == 0.0:\n", "        free_english_ios_apps.append(i)\n", "print('Free english ios apps example:')\n", "print('')\n", "explore_data(free_english_ios_apps,100,106,rows_and_columns = True)\n" ] }, { "cell_type": "markdown", "id": "e6706573-c619-4aa4-aa6c-89c78593526d", "metadata": {}, "source": [ "From the **6,183 rows** of the `english_ios_apps` dataset, we isolated the free apps, giving us a clean dataset with **3,222 rows**." 
] }, { "cell_type": "markdown", "id": "57d0d274-2236-4474-8b81-db74e798645b", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "19dd9c51-0aad-4d53-ae2c-11d6d6e3395c", "metadata": {}, "source": [ "# Data Analysis\n", "\n", "After cleaning the data, we can now proceed to our analysis. Remember that our goal is to build a free app whose main revenue will come from advertisements.\n", "\n", "We are going to follow these steps:\n", "1. Build a minimal Android version of the app, and add it to Google Play Store.\n", "2. If the app has a good response from users, we develop it further.\n", "3. If the app is profitable after six months, we will build an iOS version of the app and add it to the Apple Store.\n", "\n", "In this analysis, we are going to analyze the markets for both the Google Play Store and the Apple Store, since our end goal is to release the app on both platforms." ] }, { "cell_type": "markdown", "id": "07527411-a60f-473f-833f-74957bf59a06", "metadata": {}, "source": [ "We will begin our analysis by looking at the most common genres in both markets. We are going to use the dataset `free_english_app` with **8,863 rows** for Android apps and `free_english_ios_apps` with **3,222 rows** for iOS apps." 
] }, { "cell_type": "code", "execution_count": 22, "id": "89d3fc25-2416-4d7f-b101-a6f80e3ff671", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']\n", "['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']\n", "['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']\n", "['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']\n", "['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']\n", "\n", "The number of column is 13\n", "The number of row is 8863\n" ] } ], "source": [ "explore_data(free_english_app, 0,5,rows_and_columns = True)" ] }, { "cell_type": "code", "execution_count": 23, "id": "690028f9-0098-41b1-a093-febab5b77f8d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']\n", "['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']\n", "['4', '282614216', 'eBay: Best App to Buy, Sell, Save! 
Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']\n", "['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']\n", "['7', '283646709', 'PayPal - Send and request money safely', '227795968', 'USD', '0', '119487', '879', '4', '4.5', '6.12.0', '4+', 'Finance', '37', '0', '19', '1']\n", "\n", "The number of column is 17\n", "The number of row is 3222\n" ] } ], "source": [ "explore_data(free_english_ios_apps,0,5,rows_and_columns = True)" ] }, { "cell_type": "markdown", "id": "83ca6e37-31aa-4e6b-8bfb-9b6fc07a247d", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "id": "353c3af7-55b8-4369-83b7-fe56300c36df", "metadata": {}, "source": [ "Using the function `frequency_column()`, we will build a frequency table for each of our datasets. Using the function `display_table()`, we will then display each table sorted in descending order, with the values expressed as percentages." 
] }, { "cell_type": "code", "execution_count": 24, "id": "09321af8-30eb-458f-8c1b-883d51f6ff42", "metadata": {}, "outputs": [], "source": [ "def frequency_column(dataset,column):\n", "    '''This function takes two arguments:\n", "    dataset - where the frequency table will be extracted\n", "    column - the index for the frequency table'''\n", "    genre_list = list()\n", "    for i in dataset:\n", "        genre = i[column]\n", "        genre_list.append(genre)\n", "\n", "    frequency_table = dict()\n", " \n", "    for i in genre_list:\n", "        frequency_table[i] = (frequency_table.get(i,0) + 1)\n", " \n", "    total_value = sum(frequency_table.values())\n", " \n", "    frequency_percentage = dict()\n", "    for i in frequency_table:\n", "        frequency_percentage[i] = round(((frequency_table[i]/total_value)*100),2)\n", " \n", "    return frequency_percentage\n", "\n", "def display_table(dataset,column,numofrows=10):\n", "    '''This function takes three arguments:\n", "    dataset - where the displayed table will be extracted\n", "    column - the index for the displayed table\n", "    numofrows - the number of rows to display (10 by default)'''\n", "    table = frequency_column(dataset,column)\n", "    displayed_table = list()\n", " \n", "    for x,y in table.items():\n", "        key_and_value = (y,x)\n", "        displayed_table.append(key_and_value)\n", "    displayed_table_sorted = sorted(displayed_table,reverse = True)\n", "    for x,y in displayed_table_sorted[:numofrows]:\n", "        print(y,':',x)" ] }, { "cell_type": "code", "execution_count": 25, "id": "b7480ada-5d68-42a3-88f7-e696d3abcbef", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Apple Store Genre: \n", "\n", "Games : 58.16\n", "Entertainment : 7.88\n", "Photo & Video : 4.97\n", "Education : 3.66\n", "Social Networking : 3.29\n", "Shopping : 2.61\n", "Utilities : 2.51\n", "Sports : 2.14\n", "Music : 2.05\n", "Health & Fitness : 2.02\n" ] } ], "source": [ "print('Apple Store Genre: \\n')\n", "display_table(free_english_ios_apps,-5)" ] }, { "cell_type": "markdown", "id": 
"a3a1e937-381f-46cb-8f4b-5b58f38949fe", "metadata": {}, "source": [ "By using the defined functions above, we can see the top 10 most common app for our `free_english_ios_apps` dataset. As we can observe, more than half of those app are games. Our top 3 `genre` which makes about **71%** of our dataset comes from the entertainment kind of apps." ] }, { "cell_type": "code", "execution_count": 26, "id": "80c92dc4-9d6e-4db4-947a-4ddccb0330a1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Google Play Store Categories: \n", "\n", "FAMILY : 18.9\n", "GAME : 9.73\n", "TOOLS : 8.46\n", "BUSINESS : 4.59\n", "LIFESTYLE : 3.9\n", "PRODUCTIVITY : 3.89\n", "FINANCE : 3.7\n", "MEDICAL : 3.53\n", "SPORTS : 3.4\n", "PERSONALIZATION : 3.32\n", "\n", "Google Play Store Genre: \n", "\n", "Tools : 8.45\n", "Entertainment : 6.07\n", "Education : 5.35\n", "Business : 4.59\n", "Productivity : 3.89\n", "Lifestyle : 3.89\n", "Finance : 3.7\n", "Medical : 3.53\n", "Sports : 3.46\n", "Personalization : 3.32\n" ] } ], "source": [ "print('Google Play Store Categories: \\n')\n", "display_table(free_english_app,1)\n", "print('\\nGoogle Play Store Genre: \\n')\n", "display_table(free_english_app,-4)" ] }, { "cell_type": "markdown", "id": "e2b47a08-a6ba-4035-af79-2bb380fd0f35", "metadata": {}, "source": [ "As for our `free_english_apps` dataset for Google Play Store. We can observe that there's more division for `genre`. Most of the app in Google Play Store are for practical usage type of apps. Although games is high in the list as well, but the variance of the result is closer to one another. It creates a balance between 'entertainment' and 'practical purposes' kind of apps." 
] }, { "cell_type": "markdown", "id": "0b578b76-f631-49b8-be55-05f87c8fa737", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "id": "1f55e2e6-07d6-42dc-b559-d7b7dbf12b47", "metadata": {}, "source": [ "Next, we want to find out which genres attract the most users. For the Google Play data set, we can find this information in the `Installs` column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column." ] }, { "cell_type": "code", "execution_count": 27, "id": "35fee6d5-ac32-4b0d-9ea5-1210c5567cb9", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Navigation : 86090.33\n", "Reference : 74942.11\n", "Social Networking : 71548.35\n", "Music : 57326.53\n", "Weather : 52279.89\n", "Book : 39758.5\n", "Food & Drink : 33333.92\n", "Finance : 31467.94\n", "Photo & Video : 28441.54\n", "Travel : 28243.8\n" ] } ], "source": [ "genre_in_ios = frequency_column(free_english_ios_apps,-5)\n", "average_per_genre = list()\n", "\n", "for i in genre_in_ios:\n", "    total_rating = 0\n", "    len_rating = 0\n", "    for x in free_english_ios_apps:\n", "        ratingcount = int(x[6])\n", "        if x[-5] == i: # compare against the genre column only\n", "            total_rating += ratingcount\n", "            len_rating += 1\n", "    average_rating = total_rating/len_rating\n", "    addtogenre = i,average_rating\n", "    average_per_genre.append(addtogenre)\n", "displayed_table = list()\n", "for x,y in average_per_genre:\n", "    key_and_value = (y,x)\n", "    displayed_table.append(key_and_value)\n", "displayed_table_sorted = sorted(displayed_table,reverse = True)\n", "for x,y in displayed_table_sorted[:10]:\n", "    print(y,':',round(x,2))" ] }, { "cell_type": "markdown", "id": "3e33f6ad-39b1-4539-aace-a196e3e4d0b7", "metadata": {}, "source": [ "As we can observe, navigation apps have the highest average number of user ratings in the Apple Store, followed by reference and social networking apps. Recall that `Games` is by far the most common genre. 
This genre, however, is not present in our top 10 genres by average number of ratings." ] }, { "cell_type": "markdown", "id": "a5c68237-107f-4908-9a92-af53acba2248", "metadata": {}, "source": [ "As for Google Play, we'll analyze both the `genre` and `category` columns." ] }, { "cell_type": "code", "execution_count": 28, "id": "feb05bca-70ca-465f-9187-51a047a9db13", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average installs for genre in android:\n", "\n", "Communication : 38456119.17\n", "Adventure;Action & Adventure : 35333333.33\n", "Video Players & Editors : 24947335.8\n", "Social : 23253652.13\n", "Arcade : 22888365.49\n", "Casual : 19569221.6\n", "Puzzle;Action & Adventure : 18366666.67\n", "Photography : 17840110.4\n", "Educational;Action & Adventure : 17016666.67\n", "Productivity : 16787331.34\n", "\n", "\n", "Average installs for categories in android:\n", "\n", "COMMUNICATION : 38456119.17\n", "VIDEO_PLAYERS : 24727872.45\n", "SOCIAL : 23253652.13\n", "PHOTOGRAPHY : 17840110.4\n", "PRODUCTIVITY : 16787331.34\n", "GAME : 15588015.6\n", "TRAVEL_AND_LOCAL : 13984077.71\n", "ENTERTAINMENT : 11640705.88\n", "TOOLS : 10801391.3\n", "NEWS_AND_MAGAZINES : 9549178.47\n" ] } ], "source": [ "print('Average installs for genre in android:')\n", "print('')\n", "genre_in_android = frequency_column(free_english_app,-4)\n", "average_per_genre = list()\n", "for i in genre_in_android:\n", "    total_install = 0\n", "    len_install = 0\n", "    for x in free_english_app:\n", " \n", "        installs = int(x[5].replace('+','').replace(',',''))\n", "        if x[-4] == i: # compare against the genre column only\n", "            total_install += installs\n", "            len_install += 1\n", "    average_install = total_install/len_install\n", "    addtogenre = i,average_install\n", "    average_per_genre.append(addtogenre)\n", "displayed_table = list()\n", "for x,y in average_per_genre:\n", "    key_and_value = (y,x)\n", "    displayed_table.append(key_and_value)\n", "displayed_table_sorted = sorted(displayed_table,reverse = True)\n", "for x,y in 
displayed_table_sorted[:10]:\n", "    print(y,':',round(x,2))\n", "\n", "print('\\n')\n", "print('Average installs for categories in android:')\n", "print('')\n", "categories_in_android = frequency_column(free_english_app,1)\n", "average_per_categories = list()\n", "for i in categories_in_android:\n", "    total_install = 0\n", "    len_install = 0\n", "    for x in free_english_app:\n", " \n", "        installs = int(x[5].replace('+','').replace(',',''))\n", "        if x[1] == i: # compare against the category column only\n", "            total_install += installs\n", "            len_install += 1\n", "    average_install = total_install/len_install\n", "    addtogenre = i,average_install\n", "    average_per_categories.append(addtogenre)\n", "displayed_table = list()\n", "for x,y in average_per_categories:\n", "    key_and_value = (y,x)\n", "    displayed_table.append(key_and_value)\n", "displayed_table_sorted = sorted(displayed_table,reverse = True)\n", "for x,y in displayed_table_sorted[:10]:\n", "    print(y,':',round(x,2))" ] }, { "cell_type": "markdown", "id": "bc425202-db30-4b7b-bb47-4670ffbca19c", "metadata": {}, "source": [ "In both the `genre` and `category` breakdowns, **Communication** has the highest average number of installs. The Google Play averages are also much larger than the App Store figures, although the two are not directly comparable: Google Play reports installs, while for the App Store we used rating counts as a proxy." ] }, { "cell_type": "markdown", "id": "e067b357-b3d2-4b0f-8da1-facddf7916ad", "metadata": {}, "source": [ "## App Recommendation\n", "\n", "This generation is heavily engaged with social media, sharing photos, videos, and clips. Building an app whose output can be shared on those platforms could therefore be a good idea. In the following section, we'll explore the `Photography`-related apps on Google Play Store." 
] }, { "cell_type": "code", "execution_count": 35, "id": "ea27994d-a8bd-48af-a0de-85a894d36605", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PHOTOGRAPHY 2.94\n" ] } ], "source": [ "for i in categories_in_android:\n", "    if i == 'PHOTOGRAPHY':\n", "        print(i,categories_in_android[i])" ] }, { "cell_type": "markdown", "id": "e04ccc2a-4f48-4372-bc18-62f019c9b74b", "metadata": {}, "source": [ "About **2.94%** of the free English apps in our Google Play dataset belong to the `PHOTOGRAPHY` category." ] }, { "cell_type": "markdown", "id": "15acb43f-c1cc-4286-acfd-8e49ca9dae6b", "metadata": {}, "source": [ "Next, we'll look at the most downloaded `Photography`-related apps, that is, those with more installs than the category average." ] }, { "cell_type": "code", "execution_count": 68, "id": "d8171085-98e2-408c-840b-aedbca5fcfe8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(1000000000.0, 'Google Photos'),\n", " (100000000.0, 'Z Camera - Photo Editor, Beauty Selfie, Collage'),\n", " (100000000.0, 'YouCam Perfect - Selfie Photo Editor'),\n", " (100000000.0, 'YouCam Makeup - Magic Selfie Makeovers'),\n", " (100000000.0, 'Sweet Selfie - selfie camera, beauty cam, photo edit'),\n", " (100000000.0, 'S Photo Editor - Collage Maker , Photo Collage'),\n", " (100000000.0, 'Retrica'),\n", " (100000000.0, 'PicsArt Photo Studio: Collage Maker & Pic Editor'),\n", " (100000000.0, 'PhotoGrid: Video & Pic Collage Maker, Photo Editor'),\n", " (100000000.0, 'Photo Editor Pro'),\n", " (100000000.0, 'Photo Editor Collage Maker Pro'),\n", " (100000000.0, 'Photo Collage Editor'),\n", " (100000000.0, 'LINE Camera - Photo editor'),\n", " (100000000.0, 'Cymera Camera- Photo Editor, Filter,Collage,Layout'),\n", " (100000000.0, 'Candy Camera - selfie, beauty camera, photo editor'),\n", " (100000000.0, 'Camera360: Selfie Photo Editor with Funny Sticker'),\n", " (100000000.0, 'BeautyPlus - Easy Photo Editor & Selfie Camera'),\n", " (100000000.0, 'B612 - Beauty & Filter Camera'),\n", " (100000000.0, 'AR effect'),\n", " (50000000.0, 'Video 
Editor Music,Cut,No Crop'),\n", " (50000000.0, 'VSCO'),\n", " (50000000.0, 'Square InPic - Photo Editor & Collage Maker'),\n", " (50000000.0, 'Snapseed'),\n", " (50000000.0, 'Selfie Camera - Photo Editor & Filter & Sticker'),\n", " (50000000.0, 'SNOW - AR Camera'),\n", " (50000000.0, 'Pixlr – Free Photo Editor'),\n", " (50000000.0, 'Pic Collage - Photo Editor'),\n", " (50000000.0, 'PhotoWonder: Pro Beauty Photo Editor Collage Maker'),\n", " (50000000.0, 'Photo Lab Picture Editor: face effects, art frames'),\n", " (50000000.0, 'Photo Effects Pro'),\n", " (50000000.0, 'Photo Editor by Aviary'),\n", " (50000000.0, 'Photo Editor Selfie Camera Filter & Mirror Image'),\n", " (50000000.0, 'Motorola Camera'),\n", " (50000000.0, 'MomentCam Cartoons & Stickers'),\n", " (50000000.0, 'MakeupPlus - Your Own Virtual Makeup Artist'),\n", " (50000000.0, 'Keepsafe Photo Vault: Hide Private Photos & Videos'),\n", " (50000000.0, 'InstaSize Photo Filters & Collage Editor'),\n", " (50000000.0, 'InstaBeauty -Makeup Selfie Cam'),\n", " (50000000.0, 'Boomerang from Instagram'),\n", " (50000000.0, 'Adobe Photoshop Express:Photo Editor Collage Maker'),\n", " (50000000.0, 'ASUS Gallery')]" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "total_install_category = 0\n", "len_install_category = 0\n", "for i in free_english_app:\n", " category = i[1]\n", " install = float(i[5].replace('+','').replace(',',''))\n", " if category == 'PHOTOGRAPHY':\n", " total_install_category += install\n", " len_install_category += 1\n", " \n", "\n", " \n", "average_install = total_install_category/len_install_category\n", " \n", "above_average = list()\n", "for i in free_english_app:\n", " install = float(i[5].replace('+','').replace(',',''))\n", " category = i[1]\n", " if install > average_install and category == 'PHOTOGRAPHY' :\n", " above_average.append(i)\n", "sorted_list = list()\n", "for i in above_average:\n", " install = 
float(i[5].replace('+','').replace(',',''))\n", " name = i[0]\n", " key_val = install,name\n", " sorted_list.append(key_val)\n", "sorted(sorted_list, reverse = True)" ] }, { "cell_type": "markdown", "id": "2ee40df5-2d00-49c4-b20b-d0118b027778", "metadata": {}, "source": [ "# Conclusion" ] }, { "cell_type": "markdown", "id": "d3c19bad-678d-489a-9b05-70b4d677193f", "metadata": {}, "source": [ "As we can see from the results, `Photography`-related apps that apply pre-editing to photos are among the most downloaded apps on the market. It is a good idea to investigate these kinds of apps further and build around them. Since advertising is the main revenue source for this app, we could show an advertisement just before the app displays the pre-edited result. Alternatively, users could pay for a pro version of the app to get an advertisement-free experience. We could also do a deeper study in which we analyze the other categories and apply some of their concepts to the `Photography`-related app that we're going to build. Since many apps of this kind have already been released on the market, we will have to innovate on features in order to stand out." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }