# PROFITABLE APPS ON THE MARKET - A GUIDED DATA ANALYSIS PROJECT

**Project Description**
 
 Data analysis is the process of exploring, cleaning and transforming data into a usable format. In this analysis, we are going to explore the apps on both the Apple App Store and the Google Play Store and see what types of apps are currently attracting users. Most of us use a smartphone every day, and we use different apps for different purposes.

**Project Goal**
 
 The goal of this analysis is to recommend the type of apps that developers should work on. To come up with a recommendation, we are going to explore, clean, transform and analyze the data.

## Defining the Functions to open, read and explore the dataset

Before we start our exploration, cleaning and analysis, we're going to define a function that loads a dataset and a function that explores it.

In [1]:
def DatasetCsv(file):
    '''This function takes one parameter:
    file = the file name / file path of the CSV file that needs to be converted to a list of lists
    '''
    from csv import reader
    # The with statement closes the file once it has been read
    with open(file, encoding = 'utf8') as dataset:
        data = list(reader(dataset))

    return data
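
As a rough illustration of what this function returns, the sketch below runs the same `csv.reader` logic on a small in-memory sample instead of a file. The helper name `load_rows` and the sample text are assumptions made for this example only. Note that every value comes back as a string, which is why the "Ideal Datatype" of each column matters later on.

```python
import csv
import io

def load_rows(handle):
    """Hypothetical helper: split an open CSV handle into (header, rows)."""
    data = list(csv.reader(handle))
    return data[0], data[1:]

# Made-up sample standing in for a real CSV file
sample = io.StringIO(
    "App,Category,Rating\n"
    "Sketch,ART_AND_DESIGN,4.5\n"
    "Bible,Reference,4.5\n"
)
header, rows = load_rows(sample)
print(header)
print(len(rows), 'rows and', len(header), 'columns')
```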

In [2]:
def explore_data(dataset,start,end,rows_and_columns = False):
    '''This function takes four parameters:
    dataset = the data set as a list of lists, containing all the rows and an optional header.
    start and end = integers that mark the start and end of the slice of rows to print (the end index is excluded).
    rows_and_columns = boolean; if True, the number of columns and rows of the dataset is also printed.
    '''
    dataset_slice = dataset[start:end]
    for i in dataset_slice:
        print(i)

    if rows_and_columns:
        print('')
        print('The number of column is', len(dataset[0]))
        print('The number of row is', len(dataset))
 
 

---

# Initial Dataset Exploration

To understand the `variables` to consider that will drive the decision making on building the type of application, we'll look into the **first five(5) rows** and identify these `variables`.

### iOS App Store

First, we will explore the iOS dataset. The source for this dataset can be found *[here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)*. This dataset was last updated on 2018-06-05.

In [3]:
ios = DatasetCsv(r'C:\Users\Mico\OneDrive\Desktop\DATASETS\KAGGLE\APPLE STORE\AppleStore.csv') #Using the defined function DatasetCsv(file) to store the dataset of 'AppleStore.csv' to the variable 'ios'
explore_data(ios,0,6,rows_and_columns = True) #Using the defined function explore_data(dataset,start,end,rows_and_columns = False) to do an initial exploration for the 'ios' dataset.

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']
['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']
['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']
['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']
['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45

From our observation of the dataset, it has `7,198 rows` (the header row plus 7,197 apps) and `17 columns`, counting the unnamed index column at the start of each row. We identified the following columns as useful for this analysis:

| Columns | Description | Ideal Datatype |
|:---: | :---:| :---: |
|size_bytes |Size (in Bytes) | Integer|
|price |Price amount | Float|
|rating_count_tot |Total number of user ratings | Integer|
|user_rating |Average user rating value| Float |
|cont_rating |Content rating | String|
|prime_genre |Primary genre | String|
|ipadSc_urls.num |Number of screenshots shown for display | Integer|
|lang.num |Number of supported languages | Integer|

For more information about the columns: [Kaggle - Mobile App Store ( 7200 apps)](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

### Google Play Store

The source for the Google Play Store dataset can be found *[here](https://www.kaggle.com/lava18/google-play-store-apps)*. This dataset was last updated on 2018-09-04.

In [4]:
android = DatasetCsv(r'C:\Users\Mico\OneDrive\Desktop\DATASETS\KAGGLE\GOOGLE PLAY STORE\googleplaystore.csv')
explore_data(android,0,6,rows_and_columns = True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'A

From our observation of the dataset, it has `10,842 rows` (the header row plus 10,841 apps) and `13 columns`. We identified the following columns as useful for this analysis:

| Column | Description | Ideal Datatype |
|:---: | :---:| :---:|
|Category |Category the app belongs to | String|
|Rating |Average user rating value | Float|
|Reviews |Total number of user reviews | Integer|
|Size |Size of the app (in megabytes) | Float|
|Installs |Number of user downloads/installs for the app| Integer|
|Type |Paid or Free | Boolean|
| Content Rating| Age group the app is targeted at - Children / Mature 21+ / Adult| String|
| Genres| An app can belong to multiple genres (apart from its main category). For eg, a musical family game will belong to| String|

For more information about the columns: [Kaggle - Google Play Store Apps](https://www.kaggle.com/lava18/google-play-store-apps)
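
Since every value is loaded as a string, reaching the ideal datatypes listed above requires a conversion step. The following is a minimal sketch of how that could look. The helper names are hypothetical, the `'$4.99'` price is a made-up sample, and treating `k` as one thousandth of `M` is an assumption.

```python
def parse_installs(text):
    """'10,000+' -> 10000 (a lower bound, since installs are reported in buckets)."""
    return int(text.replace(',', '').replace('+', ''))

def parse_price(text):
    """'$4.99' -> 4.99 and '0' -> 0.0."""
    return float(text.replace('$', ''))

def parse_size_mb(text):
    """'19M' -> 19.0 and '404k' -> 0.404; any other value returns None."""
    if text.endswith('M'):
        return float(text[:-1])
    if text.endswith('k'):
        return float(text[:-1]) / 1000  # assumption: 1M = 1000k
    return None

print(parse_installs('10,000+'))
print(parse_price('$4.99'))
print(parse_size_mb('19M'))
```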

---

# Data Cleaning

Before we do any further analysis, we have to ensure that the dataset is accurate and in a usable format. Remember that our goal is to understand what types of apps attract more users. Our target audience is *English-speaking* users and we will only build *free* apps.

**In this section we are going to do the following:**
- Detect inaccurate data, and correct or remove it.
- Detect duplicate data, and remove the duplicates.
- Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
- Remove apps that aren't free.

### Data inaccuracy

For our Google Play dataset, a discussion on [kaggle](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) states that the `Rating` attribute is missing on entry **10472**. Note that `Rating` is the 3rd column, which makes its index **[2]**. Our dataset has a header row, and the reported entry number may or may not count the header, so we will explore entries **10472-10473** to investigate.

In [5]:
explore_data(android,10472,10474) # A slice [start:end] excludes the end index, so this prints entries 10472 and 10473
print('Number of columns in entry 10472:',len(android[10472]),'columns')
print('Number of columns in entry 10473:',len(android[10473]),'columns')

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Number of columns in entry 10472: 13 columns
Number of columns in entry 10473: 12 columns


After analyzing the two observations, we noticed that entry **10473** is missing a column. `Rating` is present in the data and `Category` is the value that is missing. Upon further investigation, we noticed that one element is an empty string (`''`). Based on the elements next to it, we can deduce that this is the `Genres` attribute.

This observation would cause errors in the analysis, and we've identified `Category` and `Genres` as important parts of the analysis. We can either look up the application **Life Made WI-Fi Touchscreen Photo Frame** in the Google Play Store to populate the data, or we can delete the observation. For this analysis we will delete the observation.
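
The manual check above can also be generalized. As a minimal sketch (the function name `find_malformed_rows` and the shortened sample rows are assumptions for this example), we could scan the whole dataset for rows whose length differs from the header row:

```python
def find_malformed_rows(dataset):
    """Return the indices of rows whose length differs from the header row."""
    expected = len(dataset[0])
    return [index for index, row in enumerate(dataset) if len(row) != expected]

# Shortened sample: the last row is missing its Category value
sample = [
    ['App', 'Category', 'Rating'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
]
print(find_malformed_rows(sample))
```

Running a check like this before deleting anything would show whether entry 10473 is the only malformed row.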

In [6]:
del android[10473]

To verify that the observation was removed from the dataset, we will run the `explore_data` function again and check the number of columns.

In [7]:
explore_data(android,10472,10474)
print('Number of columns in entry 10472:',len(android[10472]),'columns')
print('Number of columns in entry 10473:',len(android[10473]),'columns')

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
Number of columns in entry 10472: 13 columns
Number of columns in entry 10473: 13 columns


### Data Duplicate

Now we will investigate whether the dataset has any duplicates. We will identify duplicates by the `App` attribute (index[0]), which gives us the name of the app.

In [8]:
android_rows = android[1:] # Extracting the rows of the android dataset (without the header)
unique_android_apps = list() # List of unique applications in Google Play Store
duplicate_android_apps = list() # List of app names that were seen more than once
most_dup_app = list() # App(s) with the highest number of duplicates


# Using this for loop, we are going to populate the unique_android_apps and duplicate_android_apps

for i in android_rows:
    name = i[0]

    if name in unique_android_apps:
        duplicate_android_apps.append(name)
    else:
        unique_android_apps.append(name)

frequency_of_duplicate = dict() # Frequency table of the duplicates

for i in duplicate_android_apps:
    frequency_of_duplicate[i] = frequency_of_duplicate.get(i,0) + 1

# Using this for loop, we are going to identify which app has the highest number of duplicates

max_duplicates = max(frequency_of_duplicate.values()) # Computed once instead of on every iteration
for i in frequency_of_duplicate:
    if frequency_of_duplicate[i] == max_duplicates:
        most_dup_app.append(i)

print('The unique number of apps is:',len(unique_android_apps))
print('The number of duplicate apps is:',len(duplicate_android_apps))
print('Duplicated apps examples:',duplicate_android_apps[:5])
print('The most duplicated apps is/are:', most_dup_app)

The unique number of apps is: 9659
The number of duplicate apps is: 1181
Duplicated apps examples: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']
The most duplicated apps is/are: ['ROBLOX']


As we can see, there are a total of **1,181** duplicate entries, and the app with the most duplicates is **'ROBLOX'**.
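
The same tally can be sketched more compactly with `collections.Counter` from the standard library. This is an alternative illustration rather than the approach used in this project, and the list of names below is a made-up sample.

```python
from collections import Counter

# Made-up sample of app names, including repeats
names = ['ROBLOX', 'Box', 'ROBLOX', 'Zoom', 'Box', 'ROBLOX']

name_counts = Counter(names)
# Extra copies beyond the first occurrence of each name
duplicates = {name: count - 1 for name, count in name_counts.items() if count > 1}

print('Unique names:', len(name_counts))
print('Duplicate entries:', sum(duplicates.values()))
print('Most duplicated:', max(duplicates, key=duplicates.get))
```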

We will investigate this duplicated app since we cannot remove duplicates randomly. 

In [9]:
# Using this for loop, we are going to verify the data of the most_dup_app list

for i in most_dup_app:
    for x in android_rows:
        if i == x[0]: # Match on the app name column
            print(x)

['ROBLOX', 'GAME', '4.5', '4447388', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4447346', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4448791', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4449882', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4449910', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'FAMILY', '4.5', '4449910', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1

The `Category` and `Reviews` attributes differ between the duplicate entries of the **'ROBLOX'** app. We can use this information to create a criterion for deleting observations. For this project, among the duplicates we are going to keep the observation that has the highest `Reviews` value, since we can assume that the entry with the most reviews is the most recent one.

As we start removing the duplicates, we have to keep in mind that the total number of apps should be **9,659**.

In [10]:
reviews_max = dict() # Maps each app name to its highest number of reviews. Dictionary keys are unique, so each app name is stored only once.

for i in android_rows:
    name = i[0]
    reviews = float(i[3])

    if name in reviews_max and reviews_max[name] < reviews:
        reviews_max[name] = reviews
    elif name not in reviews_max:
        reviews_max[name] = reviews

#To verify that we have the correct number of unique apps
print('The unique number of apps is:',len(reviews_max))

The unique number of apps is: 9659


We will now remove the duplicate apps by looping through the `android_rows` list. The cleaned dataset will be stored in a new list assigned to the variable `android_clean`.

In [11]:
android_clean = list()
already_added = list() # Names that are already in android_clean, so that we do not add an app twice

for i in android_rows:
    name = i[0]
    reviews = float(i[3])

    if (name in reviews_max) and (reviews_max[name] == reviews) and (name not in already_added):
        android_clean.append(i)
        already_added.append(name) # Needed because duplicates of an app can share the same highest number of reviews

# To verify that all the data is unique, remember that there are 9,659 unique apps.
print("The number of rows in our new dataset 'android_clean' is",len(android_clean))
 

The number of rows in our new dataset 'android_clean' is 9659
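
The two loops above can also be combined. The sketch below is a hypothetical single-pass variant (the function name `keep_most_reviewed` is an assumption) that stores the whole row with the highest review count for each name. The rows are shortened to four columns, and the 'Box' values are made up.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # A strict comparison keeps the earlier row when review counts tie
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ['ROBLOX', 'GAME', '4.5', '4447388'],
    ['ROBLOX', 'GAME', '4.5', '4449910'],
    ['ROBLOX', 'FAMILY', '4.5', '4449910'],
    ['Box', 'BUSINESS', '4.2', '159872'],
]
cleaned = keep_most_reviewed(sample)
for row in cleaned:
    print(row)
```

Because a dictionary lookup does not scan a list, this variant avoids the repeated `name not in already_added` search, which may matter on larger datasets.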


### Removing non-English apps

Since the app that we are going to build is targeted at an *English-speaking* audience, we have to remove the apps whose names contain non-English characters. In the [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange) system, the characters commonly used in English text have codes in the range 0 to 127. We can get the code of each character by passing it as an argument to the function `ord()`. Below, we are going to create a function that loops through the characters of an app name. So that names with only a few stray symbols are not discarded, the function returns `False` only if the name contains more than three characters with a code above 127, and `True` otherwise.

In [12]:
def english(string):
    '''This function takes one argument:
    string = the string to check for non-English characters. If more than three characters have a code greater than 127, the function returns False; otherwise it returns True.'''
    count = 0
    for character in string:
        if ord(character) > 127:
            count += 1

        if count > 3:
            return False
    return True
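
To make the threshold concrete, here is a minimal sketch of the same idea written with `sum()`. The name `is_mostly_ascii` and its `max_non_ascii` parameter are assumptions for this example, and the second app name is a made-up sample.

```python
def is_mostly_ascii(name, max_non_ascii=3):
    """True if at most max_non_ascii characters have a code above 127."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

# ord() gives the numeric code of a single character
print(ord('a'), ord('™'), ord('爱'))

print(is_mostly_ascii('Instagram'))
print(is_mostly_ascii('Docs To Go™ Free Office Suite'))
print(is_mostly_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))
```

The name with a single `™` symbol still passes, while the name made up mostly of Chinese characters does not.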

Using the newly defined function `english()`, we will create a new list, `non_english_app`, and apply the `explore_data()` function to get an overview of these apps.

In [13]:
non_english_app = list()
for i in android_clean:
    if not english(i[0]): # i[0] is the app name
        non_english_app.append(i)
print('Non-english app example:\n')
explore_data(non_english_app,0,6,rows_and_columns = True)

Non-english app example:

['Flame - درب عقلك يوميا', 'EDUCATION', '4.6', '56065', '37M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 26, 2018', '3.3', '4.1 and up']
['သိင်္ Astrology - Min Thein Kha BayDin', 'LIFESTYLE', '4.7', '2225', '15M', '100,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'July 26, 2018', '4.2.1', '4.0.3 and up']
['РИА Новости', 'NEWS_AND_MAGAZINES', '4.5', '44274', '8.0M', '1,000,000+', 'Free', '0', 'Everyone', 'News & Magazines', 'August 6, 2018', '4.0.6', '4.4 and up']
['صور حرف H', 'ART_AND_DESIGN', '4.4', '13', '4.5M', '1,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 27, 2018', '2.0', '4.0.3 and up']
['L.POINT - 엘포인트 [ 포인트, 멤버십, 적립, 사용, 모바일 카드, 쿠폰, 롯데]', 'LIFESTYLE', '4.0', '45224', '49M', '5,000,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'August 1, 2018', '6.5.1', '4.1 and up']
['RMEduS - 음성인식을 활용한 R 프로그래밍 실습 시스템', 'FAMILY', 'NaN', '4', '64M', '1+', 'Free', '0', 'Everyone', 'Education', 'July 17, 2018', '1.0.1', '4.4 and up']

The numb

To continue with our data cleaning, we are going to remove the rows in `non_english_app` from our dataset and create a new dataset that will be assigned to the variable `english_android_app`.

In [14]:
english_android_app = list()

for row in android_clean:
    if row not in non_english_app:
        english_android_app.append(row)
print('English android app:',len(english_android_app),'rows')

English android app: 9614 rows


Remember that the number of unique apps before we removed the `non_english_app` rows was **9,659**. The total number of `non_english_app` rows is **45**, and 9,659 - 45 = 9,614, so we successfully removed them from our dataset.

### Extracting free apps

As we've mentioned, our focus is to build a free app whose main source of revenue will be in-app ads. So in this section, we are going to separate the free apps from the paid apps.

First we are going to identify the free apps in the `english_android_app` dataset.

In [15]:
free_english_app = list()

for row in english_android_app:
    free_paid = row[6]
    price = float(row[7].replace('$','')) # The price is checked as well to ensure that no app labeled Free has a non-zero price

    if free_paid == 'Free' and price == 0.0:
        free_english_app.append(row)

#Exploring our cleaned dataset
explore_data(free_english_app,0,10,rows_and_columns = True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']
['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']
['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free',

After exploring the data, we now have **`8,863 rows`** from the initial **10,841 rows**.

Now that we've finished the following:
- Detect inaccurate data, and correct or remove it.
- Detect duplicate data, and remove the duplicates.
- Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
- Remove apps that aren't free.

We can say that the dataset `free_english_app` from [Google Play Store Dataset](https://www.kaggle.com/lava18/google-play-store-apps) is now cleaned.

---

# iOS
For the [iOS dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps), we're going to do the same thing that we did to the [Google Play Store Dataset](https://www.kaggle.com/lava18/google-play-store-apps). 

To start, upon skimming the [discussion](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion/106176) on kaggle, we found that a user named *Marjan* reported two duplicates in the dataset. We will investigate this report.

In our initial data exploration earlier, the iOS dataset had **7,198 rows**. Excluding the header, we are expecting 7,197 rows.

In [16]:
ios_rows = ios[1:] # Extracting the rows of the iOS dataset (without the header)
unique_ios_apps = list() # List of unique applications in the Apple Store
duplicate_ios_apps = list() # List of app names that were seen more than once
most_dup_ios_app = list()


for i in ios_rows:
    name = i[2]

    if name in unique_ios_apps:
        duplicate_ios_apps.append(name)
    else:
        unique_ios_apps.append(name)

frequency_of_ios_duplicate = dict()
for i in duplicate_ios_apps:
    frequency_of_ios_duplicate[i] = frequency_of_ios_duplicate.get(i,0) + 1

print(frequency_of_ios_duplicate)

{'VR Roller Coaster': 1, 'Mannequin Challenge': 1}


As we can see, there are two duplicated names in the `track_name` attribute. We will further investigate these two apps.

In [17]:
for i in duplicate_ios_apps:
    for index, x in enumerate(ios_rows): # enumerate gives the row index without searching the list again
        name = x[2]
        if i == name:
            print('Index number:',index)
            print(x)

Index number: 3319
['4000', '952877179', 'VR Roller Coaster', '169523200', 'USD', '0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
Index number: 5603
['7579', '1089824278', 'VR Roller Coaster', '240964608', 'USD', '0', '67', '44', '3.5', '4', '0.81', '4+', 'Games', '38', '0', '1', '1']
Index number: 7092
['10751', '1173990889', 'Mannequin Challenge', '109705216', 'USD', '0', '668', '87', '3', '3', '1.4', '9+', 'Games', '37', '4', '1', '1']
Index number: 7128
['10885', '1178454060', 'Mannequin Challenge', '59572224', 'USD', '0', '105', '58', '4', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']


Upon further investigation, the `id` (index[1]) and `size_bytes` (index[3]) attributes differ within each pair, so we can conclude that these are different apps and we will keep them.

To check the data accuracy, we'll loop through all the rows and look for any row that does not contain **17 columns**.

In [18]:
inaccurate_columns = list()
for i in ios_rows:
    if len(i) != 17:
        inaccurate_columns.append(i)
if len(inaccurate_columns) > 0:
    print(inaccurate_columns)
else:
    print('All observations have 17 columns')


All observations have 17 columns


Since all the observations have **17 columns**, it is safe to assume that we will not get an `IndexError: list index out of range` when indexing the rows.

Now we will check for and remove the non-English apps using the function `english()` that we defined earlier.

In [19]:
non_english_ios_app = list()

for i in ios_rows:
    name = i[2]

    if not english(name):
        non_english_ios_app.append(i)
explore_data(non_english_ios_app,0,5,rows_and_columns = True)

['80', '299853944', '新浪新闻-阅读最新时事热门头条资讯视频', '115143680', 'USD', '0', '2229', '4', '3.5', '1', '6.2.1', '17+', 'News', '37', '0', '1', '1']
['96', '303191318', '同花顺-炒股、股票', '122886144', 'USD', '0', '1744', '0', '3.5', '0', '10.10.46', '4+', 'Finance', '37', '0', '1', '1']
['239', '331259725', '央视影音-海量央视内容高清直播', '54648832', 'USD', '0', '2070', '0', '2.5', '0', '6.2.0', '4+', 'Sports', '37', '0', '1', '1']
['268', '336141475', '优酷视频', '204959744', 'USD', '0', '4885', '0', '3.5', '0', '6.7.0', '12+', 'Entertainment', '38', '0', '2', '1']
['295', '340368403', 'クックパッド - No.1料理レシピ検索アプリ', '76644352', 'USD', '0', '115', '0', '3.5', '0', '17.5.1.0', '4+', 'Food & Drink', '37', '5', '1', '1']

The number of column is 17
The number of row is 1014


As we observed, there are a total of **1,014 rows** in `non_english_ios_app`. We're going to separate these apps from the rest of the iOS dataset.

In [20]:
english_ios_apps = list()

for i in ios_rows:
    if i not in non_english_ios_app:
        english_ios_apps.append(i)
explore_data(english_ios_apps,0,5,rows_and_columns = True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']
['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']
['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']
['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']
['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']

The number of column is 17
The number of row is 6183


Remember that the iOS dataset has **7,197 rows**. By isolating the English apps, we've removed **1,014 non-English apps**, giving us a dataset with **6,183 rows**.

In [21]:
free_english_ios_apps = list()

for i in english_ios_apps:
    price = float(i[5])

    if price == 0.0:
        free_english_ios_apps.append(i)
print('Free english ios apps example:')
print('')
explore_data(free_english_ios_apps,100,106,rows_and_columns = True)


Free english ios apps example:

['252', '334235181', 'Trainline UK: Live Train Times, Tickets & Planner', '110198784', 'USD', '0', '248', '0', '4', '0', '22', '4+', 'Travel', '37', '4', '1', '1']
['253', '334256223', 'CBS News - Watch Free Live Breaking News', '78047232', 'USD', '0', '11691', '44', '3.5', '4.5', '3.5.1', '12+', 'News', '37', '5', '1', '1']
['255', '334503000', 'The Impossible Quiz!', '44652544', 'USD', '0', '18884', '451', '4', '4.5', '1.62', '9+', 'Entertainment', '37', '0', '1', '1']
['261', '335364882', 'Walgreens – Pharmacy, Photo, Coupons and Shopping', '169138176', 'USD', '0', '88885', '333', '4.5', '4', '6.5', '12+', 'Shopping', '37', '5', '1', '1']
['264', '335744614', 'NBA', '112074752', 'USD', '0', '43682', '19', '3.5', '2.5', '2013.4.3', '4+', 'Sports', '37', '5', '1', '1']
['266', '335875911', 'My Cycles Period and Ovulation Tracker', '77686784', 'USD', '0', '7469', '68', '3.5', '5', '5.10.3', '12+', 'Health & Fitness', '37', '0', '2', '1']

The number of c

From the **6,183 rows** of the `english_ios_apps` dataset, we've isolated the free apps, giving us a clean dataset with **`3,222 rows`**.

---

# Data Analysis

After cleaning the data, we can now proceed to our analysis. Remember that our goal is to build a free app whose main revenue will come from advertisements.

We are going to follow these steps:
1. Build a minimal Android version of the app, and add it to Google Play Store.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we will build an iOS version of the app and add it to the Apple Store.

In this analysis process, we are going to analyze both the market for Google Play Store and Apple store since our end goal is to release the app on both platforms.

We will begin our analysis by looking at the most common genres in both markets. We are going to use the dataset `free_english_app` with **8,863 rows** for Android apps and `free_english_ios_apps` with **3,222 rows** for iOS apps.

In [22]:
explore_data(free_english_app, 0,5,rows_and_columns = True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']
['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']

The number of column is 13
The number of row is 8863


In [23]:
explore_data(free_english_ios_apps,0,5,rows_and_columns = True)

['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']
['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']
['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']
['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']
['7', '283646709', 'PayPal - Send and request money safely', '227795968', 'USD', '0', '119487', '879', '4', '4.5', '6.12.0', '4+', 'Finance', '37', '0', '19', '1']

The number of column is 17
The number of row is 3222


Using the function `frequency_column()`, defined below, we are going to make a frequency table for each of our datasets. Using the function `display_table()`, also defined below, we're going to display the frequency table sorted in descending order, with the values expressed as percentages.

In [24]:
def frequency_column(dataset,column):
    '''This function takes two arguments:
    dataset - the dataset the frequency table is built from
    column - the index of the column to count
    It returns a dictionary that maps each value to its share of the rows, in percent.'''
    genre_list = list()
    for i in dataset:
        genre = i[column]
        genre_list.append(genre)

    frequency_table = dict()

    for i in genre_list:
        frequency_table[i] = (frequency_table.get(i,0) + 1)

    total_value = sum(frequency_table.values())

    frequency_percentage = dict()
    for i in frequency_table:
        frequency_percentage[i] = round(((frequency_table[i]/total_value)*100),2)

    return frequency_percentage

def display_table(dataset,column,numofrows=10):
    '''This function takes three arguments:
    dataset - the dataset the displayed table is built from
    column - the index of the column to display
    numofrows - the number of rows to print (10 by default)'''
    table = frequency_column(dataset,column)
    displayed_table = list()

    for x,y in table.items():
        key_and_value = (y,x) # The percentage comes first so that sorting orders by percentage
        displayed_table.append(key_and_value)
    displayed_table_sorted = sorted(displayed_table,reverse = True)
    for x,y in displayed_table_sorted[:numofrows]:
        print(y,':',x)
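
For comparison, the same percentage table can be sketched with `collections.Counter`, whose `most_common()` method already returns the values sorted by count. The function name `percentage_table` and the tiny sample are assumptions for this example, not part of the project code.

```python
from collections import Counter

def percentage_table(dataset, column):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[column] for row in dataset)
    total = sum(counts.values())
    return [(value, round(count / total * 100, 2))
            for value, count in counts.most_common()]

# Made-up sample: three games and one music app
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for genre, share in percentage_table(sample, 1):
    print(genre, ':', share)
```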

In [25]:
print('Apple Store Genre: \n')
display_table(free_english_ios_apps,-5)

Apple Store Genre: 

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02


Using the functions defined above, we can see the 10 most common genres in our `free_english_ios_apps` dataset. As we can observe, more than half of the apps are games. The top 3 genres (Games, Entertainment, and Photo & Video) make up about **71%** of our dataset, and all three are entertainment-oriented.
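
The 71% figure is the sum of the three largest shares in the output above. As a quick check (the dictionary below simply copies those three printed values):

```python
# Shares copied from the Apple Store genre output above
top_three = {'Games': 58.16, 'Entertainment': 7.88, 'Photo & Video': 4.97}

combined_share = round(sum(top_three.values()), 2)
print(combined_share)
```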

In [26]:
print('Google Play Store Categories: \n')
display_table(free_english_app,1)
print('\nGoogle Play Store Genre: \n')
display_table(free_english_app,-4)

Google Play Store Categories: 

FAMILY : 18.9
GAME : 9.73
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32

Google Play Store Genre: 

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32


In the Google Play Store dataset, `free_english_app`, the apps are spread across more categories and genres. Many of the top entries are practical apps such as tools, business and productivity. Games still rank high in the list, but the percentages are much closer to one another than on the App Store, which gives a better balance between entertainment apps and apps built for practical purposes.
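One rough way to check that the Google Play percentages sit closer together is to compare the spread of the two top-10 lists. The sketch below copies the percentages from the printed outputs above and uses the population standard deviation as the measure of spread; that choice of measure, and the variable names, are assumptions made for illustration.

```python
from statistics import pstdev

# Top-10 percentages copied from the printed tables above
ios_top10 = [58.16, 7.88, 4.97, 3.66, 3.29, 2.61, 2.51, 2.14, 2.05, 2.02]
play_top10 = [18.9, 9.73, 8.46, 4.59, 3.9, 3.89, 3.7, 3.53, 3.4, 3.32]

# A larger standard deviation means the shares are less evenly distributed
ios_spread = pstdev(ios_top10)
play_spread = pstdev(play_top10)

print('App Store top-10 spread:', round(ios_spread, 2))
print('Google Play top-10 spread:', round(play_spread, 2))

# Combined share of the three largest entries in each store
print('App Store top-3 share:', round(sum(ios_top10[:3]), 2))
print('Google Play top-3 share:', round(sum(play_top10[:3]), 2))
```

A smaller spread and a smaller top-3 share for Google Play would support the observation that no single category dominates there.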

Next, we want to know which genres have the most users. For the Google Play dataset, we can find this information in the `Installs` column, but it is missing from the App Store dataset. As a workaround, we'll use the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

In [27]:
genre_in_ios = frequency_column(free_english_ios_apps, -5)
average_per_genre = list()

for i in genre_in_ios:
    total_rating = 0
    len_rating = 0
    for x in free_english_ios_apps:
        ratingcount = int(x[6])
        # Compare against the genre column only, not every field in the row
        if x[-5] == i:
            total_rating += ratingcount
            len_rating += 1
    average_rating = total_rating / len_rating
    addtogenre = i, average_rating
    average_per_genre.append(addtogenre)

# Put the average first so that sorting orders by average
displayed_table = list()
for x, y in average_per_genre:
    key_and_value = (y, x)
    displayed_table.append(key_and_value)
displayed_table_sorted = sorted(displayed_table, reverse=True)
for x, y in displayed_table_sorted[:10]:
    print(y, ':', round(x, 2))

Navigation : 86090.33
Reference : 74942.11
Social Networking : 71548.35
Music : 57326.53
Weather : 52279.89
Book : 39758.5
Food & Drink : 33333.92
Finance : 31467.94
Photo & Video : 28441.54
Travel : 28243.8


As we can observe, navigation apps have the highest average number of user ratings on the App Store, followed by reference and social networking apps. Even though `Games` is the most common genre, it does not appear in the top 10 genres by average number of ratings.
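The per-genre average can also be computed in a single pass over the rows instead of looping over the whole dataset once per genre. The following is a minimal sketch under that assumption: `average_by_group` and `sample_rows` are hypothetical names, and the rows are toy data, not records from the dataset.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Average an integer column per group in a single pass over the rows."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += int(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical toy rows: [name, rating_count, genre]
sample_rows = [
    ['App A', '100', 'Navigation'],
    ['App B', '300', 'Navigation'],
    ['App C', '50', 'Games'],
]

averages = average_by_group(sample_rows, group_index=2, value_index=1)

# Print the groups from highest to lowest average
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', round(avg, 2))
```

With the column indexes used in this notebook, the equivalent call would presumably be `average_by_group(free_english_ios_apps, -5, 6)`.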

For Google Play, we'll analyze both the `genre` and the `category` columns.

In [28]:
print('Average installs for genre in android:')
print('')
genre_in_android = frequency_column(free_english_app, -4)
average_per_genre = list()
for i in genre_in_android:
    total_install = 0
    len_install = 0
    for x in free_english_app:
        # Strip '+' and ',' so that the install count can be converted to int
        installs = int(x[5].replace('+', '').replace(',', ''))
        # Compare against the genre column only
        if x[-4] == i:
            total_install += installs
            len_install += 1
    average_install = total_install / len_install
    addtogenre = i, average_install
    average_per_genre.append(addtogenre)
displayed_table = list()
for x, y in average_per_genre:
    key_and_value = (y, x)
    displayed_table.append(key_and_value)
displayed_table_sorted = sorted(displayed_table, reverse=True)
for x, y in displayed_table_sorted[:10]:
    print(y, ':', round(x, 2))

print('\n')
print('Average installs for categories in android:')
print('')
categories_in_android = frequency_column(free_english_app, 1)
average_per_categories = list()
for i in categories_in_android:
    total_install = 0
    len_install = 0
    for x in free_english_app:
        installs = int(x[5].replace('+', '').replace(',', ''))
        # Compare against the category column only
        if x[1] == i:
            total_install += installs
            len_install += 1
    average_install = total_install / len_install
    addtogenre = i, average_install
    average_per_categories.append(addtogenre)
displayed_table = list()
for x, y in average_per_categories:
    key_and_value = (y, x)
    displayed_table.append(key_and_value)
displayed_table_sorted = sorted(displayed_table, reverse=True)
for x, y in displayed_table_sorted[:10]:
    print(y, ':', round(x, 2))

Average installs for genre in android:

Communication : 38456119.17
Adventure;Action & Adventure : 35333333.33
Video Players & Editors : 24947335.8
Social : 23253652.13
Arcade : 22888365.49
Casual : 19569221.6
Puzzle;Action & Adventure : 18366666.67
Photography : 17840110.4
Educational;Action & Adventure : 17016666.67
Productivity : 16787331.34


Average installs for categories in android:

COMMUNICATION : 38456119.17
VIDEO_PLAYERS : 24727872.45
SOCIAL : 23253652.13
PHOTOGRAPHY : 17840110.4
PRODUCTIVITY : 16787331.34
GAME : 15588015.6
TRAVEL_AND_LOCAL : 13984077.71
ENTERTAINMENT : 11640705.88
TOOLS : 10801391.3
NEWS_AND_MAGAZINES : 9549178.47


**Communication** has the highest average number of installs in both the `genre` and the `category` tables. The averages are also much larger for the Google Play Store than for the App Store, although the two figures are not directly comparable, since one counts installs and the other counts user ratings.
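The install values are stored as text, which is why the code above strips the `+` and `,` characters before converting them to numbers. The sketch below isolates that step in a small helper. `parse_installs` and the sample labels are hypothetical, and reading the trailing `+` as "at least this many installs" is an assumption rather than something stated in the dataset.

```python
def parse_installs(raw):
    """Convert an install label such as '1,000,000+' to an integer.

    Assumption: the trailing '+' marks an open-ended bucket, so the
    result is best read as a lower bound rather than an exact count.
    """
    return int(raw.replace('+', '').replace(',', ''))

# Hypothetical labels in the format the cleanup code above expects
labels = ['10,000+', '1,000,000+', '500+']
parsed = [parse_installs(label) for label in labels]
print(parsed)
```

If that reading is right, the averages above are rough estimates rather than precise user counts.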

## App Recommendation

This generation is heavily engaged with social media and constantly shares photos, videos and clips there, so building an app whose output can be shared on those platforms could be a good idea. In the following section, we'll explore the `Photography` apps on the Google Play Store.

In [35]:
for i in categories_in_android:
    if i == 'PHOTOGRAPHY':
        print(i, categories_in_android[i])

PHOTOGRAPHY 2.94


About **2.94%** of the apps in Google Play Store are related to `Photography`.
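Because `frequency_column()` returns a dictionary, a single category can also be looked up directly without looping over every key. The sketch below uses a hypothetical stand-in dictionary, `categories_sample`, filled with three of the percentages printed earlier.

```python
# Hypothetical stand-in for the dictionary returned by frequency_column()
categories_sample = {'FAMILY': 18.9, 'GAME': 9.73, 'PHOTOGRAPHY': 2.94}

# Direct lookup; .get() returns None instead of raising if the key is absent
share = categories_sample.get('PHOTOGRAPHY')
print('PHOTOGRAPHY', share)
```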

We'll explore the most downloaded `Photography` related apps.

In [68]:
total_install_category = 0
len_install_category = 0
for i in free_english_app:
    category = i[1]
    install = float(i[5].replace('+', '').replace(',', ''))
    if category == 'PHOTOGRAPHY':
        total_install_category += install
        len_install_category += 1

average_install = total_install_category / len_install_category

# Keep only the PHOTOGRAPHY apps with more installs than the category average
above_average = list()
for i in free_english_app:
    install = float(i[5].replace('+', '').replace(',', ''))
    category = i[1]
    if install > average_install and category == 'PHOTOGRAPHY':
        above_average.append(i)

# Build (installs, name) pairs so that sorting orders by installs
sorted_list = list()
for i in above_average:
    install = float(i[5].replace('+', '').replace(',', ''))
    name = i[0]
    key_val = install, name
    sorted_list.append(key_val)
sorted(sorted_list, reverse=True)

[(1000000000.0, 'Google Photos'),
 (100000000.0, 'Z Camera - Photo Editor, Beauty Selfie, Collage'),
 (100000000.0, 'YouCam Perfect - Selfie Photo Editor'),
 (100000000.0, 'YouCam Makeup - Magic Selfie Makeovers'),
 (100000000.0, 'Sweet Selfie - selfie camera, beauty cam, photo edit'),
 (100000000.0, 'S Photo Editor - Collage Maker , Photo Collage'),
 (100000000.0, 'Retrica'),
 (100000000.0, 'PicsArt Photo Studio: Collage Maker & Pic Editor'),
 (100000000.0, 'PhotoGrid: Video & Pic Collage Maker, Photo Editor'),
 (100000000.0, 'Photo Editor Pro'),
 (100000000.0, 'Photo Editor Collage Maker Pro'),
 (100000000.0, 'Photo Collage Editor'),
 (100000000.0, 'LINE Camera - Photo editor'),
 (100000000.0, 'Cymera Camera- Photo Editor, Filter,Collage,Layout'),
 (100000000.0, 'Candy Camera - selfie, beauty camera, photo editor'),
 (100000000.0, 'Camera360: Selfie Photo Editor with Funny Sticker'),
 (100000000.0, 'BeautyPlus - Easy Photo Editor & Selfie Camera'),
 (100000000.0, 'B612 - Beauty & Fil

# Conclusion

As we can see from the results, the `photography` apps that apply pre-editing to photos are among the most downloaded apps on the market. It is a good idea to investigate these kinds of apps further and build around them. Since advertising would be the main source of revenue for this app, we can show an advertisement before the app displays the pre-edited result. Alternatively, users can pay for a pro version of the app to get an ad-free experience. We can also do a deeper study in which we analyze the other categories and apply some of their concepts to the `photography` app that we're going to build. Since many apps of this kind have already been released on the market, we have to come up with innovative features in order to stand out.