# Profitable App genres - iOS App Store and Google Play Store

This is a data analysis project that looks at the apps listed in the App Store and Google Play markets and profiles the free apps in each marketplace.

**Note:** We are only interested in English language apps in this project.
 
## Goal:
Through this project, our aim is to:

1. Understand the free apps that are listed in the App Store and Google Play markets based on their actual usage statistics and user ratings
2. From the analysis, come up with one app profile best suited for development as a free app in both marketplaces, one that maximises in-app ad revenue

## Dataset:

For this project we are going to use the following datasets:

1. A [dataset](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
2. A [dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

## Preliminary Analysis:

Let's now explore the datasets to understand them in a bit more detail.

First, we will create three functions that we can reuse for both the iOS and Android datasets: one to read a dataset from CSV, one to print some sample data for our initial analysis, and one to report rows with missing columns.

In [1]:
from csv import reader

# Function to read a CSV file, and return the contents as list of lists
def csv_reader(file_name_with_path):
    # The context manager closes the file after reading; the files are assumed to be UTF-8 encoded
    with open(file_name_with_path, encoding='utf8', newline='') as open_file:
        return list(reader(open_file))

# Function to read the dataset (list of lists) and print data range as passed
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    # If row and column statistics is required (passed as parameter)
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

# Function to check if all the columns have data in the dataset - Dataset is to be passed with the header row
def print_missing_column_values(dataset):
    header_length = len(dataset[0])
    # enumerate reports the true row index even when two rows have identical content
    for index, row in enumerate(dataset[1:], start=1):
        if len(row) != header_length:
            print('Index = ', index)
            print('Data row = ', row)
            print('\n')

Let's look at a few rows from both the datasets.

In [2]:
print('='*5+'Apple Store'+'='*5+'\n')
apple = csv_reader('AppleStore.csv')
explore_data(apple[1:],0,5,True)
print('\n')
print('='*5+'Google Play Store'+'='*5+'\n')
android = csv_reader('googleplaystore.csv')
explore_data(android[1:],0,5,True)

=====Apple Store=====

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


=====Google Play Store=====

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['C

Now that we have seen some sample data along with the number of rows and columns in each dataset, let's look at the column names and identify the ones that might be useful for our analysis.

In [3]:
print('='*5+'Apple Store'+'='*5+'\n')
print(apple[0])
print('\n')
print('='*5+'Google Play Store'+'='*5+'\n')
print(android[0])

apple_header = apple[0]
android_header = android[0]

=====Apple Store=====

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


=====Google Play Store=====

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In the iOS dataset, a few columns that could be useful are:

- *track_name* - Name of the app
- *price* - Price of the app that might help us determine free vs paid apps
- *prime_genre* - Genre classification of the app for our profiling
- *cont_rating* - Content rating or Age group relevance
- *user_rating* - Overall user rating of the app
- *rating_count_tot* - Total number of users that reviewed/rated the app

For full details, refer to the dataset [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

In the Android dataset, a few columns that could be useful are:

- *App* - Name of the app
- *Price* - Price of the app that might help us determine free vs paid apps
- *Genres* - Genre classification of the app for our profiling
- *Category* - Another classification for the app that might aid our profiling
- *Content Rating* - Content rating or Age group relevance
- *Rating* - Overall user rating of the app
- *Reviews* - Total number of users that reviewed/rated the app
- *Installs* - Total number of users who have installed the app

For full details, refer to the dataset [documentation](https://www.kaggle.com/lava18/google-play-store-apps).

## Data cleansing

We will now examine the data to see whether any cleansing is needed before profiling and analysis.

### Missing column values

We will use the **print_missing_column_values** function we created to check both datasets for apps with missing information, printing those rows so that we can decide how to handle them.

In [4]:
print('='*5+'Google Play Store'+'='*5+'\n')
print_missing_column_values(android)
print('='*5+'Apple App Store'+'='*5+'\n')
print_missing_column_values(apple)

=====Google Play Store=====

Index = 10473
Data row = ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


=====Apple App Store=====



From the output above, we can see that one app in the Google Play dataset has a data point missing. Comparing it against the sample rows for Google Play apps, it looks like this app is missing the value for the **Category** column.

Checking this app in the [Google Playstore](https://play.google.com/store/apps/details?id=com.lifemade.internetPhotoframe) reveals that it is categorised as ***Lifestyle***.

We will correct this.

In [5]:
android[10473].insert(1,'LIFESTYLE')
print(android[10473])

['Life Made WI-Fi Touchscreen Photo Frame', 'LIFESTYLE', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
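
Re-running the cell above would insert the category a second time and shift the columns again. A minimal sketch of a guarded version is shown below; the helper name `fix_short_row` and the toy header and row are assumptions for illustration, not part of the original notebook.

```python
def fix_short_row(row, header, position, value):
    """Insert `value` at `position` only if the row is exactly one field short.

    Hypothetical helper: returns True when the row was changed.
    """
    if len(row) == len(header) - 1:
        row.insert(position, value)
        return True
    return False


header = ['App', 'Category', 'Rating']
row = ['Some App', '4.1']  # toy row that is missing its Category

first = fix_short_row(row, header, 1, 'LIFESTYLE')
second = fix_short_row(row, header, 1, 'LIFESTYLE')  # no-op on a second run
print(row, first, second)
```

With the real data, the equivalent call would pass `android[10473]` and `android[0]`, which makes the correction safe to run more than once.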


### Duplicate entries

According to the discussion around the Google Play dataset, it suffers from a lot of duplicate entries. The App Store dataset, by contrast, does not have any duplicates.

Let's now see how many duplicate apps we have in the Google Play dataset, and look at some of them.

In [6]:
duplicate_apps = []
unique_apps = []
for app in android[1:]:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Total number of apps in the dataset: ',len(android[1:]))
print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('='*5+'Few duplicate apps'+'='*5+'\n')
print(duplicate_apps[:15])


Total number of apps in the dataset: 10841
Number of duplicate apps: 1181


=====Few duplicate apps=====

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


We can see that **1181** entries are duplicates.

Rather than removing duplicates at random, we will use the ***Reviews*** column, on the basis that the higher the total number of reviews, the more recent the entry.

The first step is to build a dictionary from the Android dataset that maps each app name to the maximum review count seen for that app.

In [7]:
reviews_max = {}
for row in android[1:]:
    name = row[0]
    n_reviews = float(row[3]) #total number of reviews
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print('Expected rows after cleanup: ',len(reviews_max))

Expected rows after cleanup: 9660


Now that we have each app and its maximum review count, we use the dictionary above to build our dataset of unique apps. After cleanup we should have **9660** unique rows.

- We start by creating two empty lists, **android_clean** and **already_added**
- We loop through the Android dataset (ignoring the header) and, for each row, add the row to **android_clean** and the app name to **already_added** if:
 - The review count matches the maximum review count in the dictionary for that app
 - The app name is not already in the **already_added** list

**Note**: We need to check the **already_added** list so that an app is added only once when several of its duplicates share the same maximum number of reviews.

In [8]:
android_clean = []
already_added = []
for row in android[1:]:
    name = row[0]
    n_reviews = float(row[3]) #total number of reviews
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)
print(len(android_clean))
print(android_clean[0:3])

9660
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]
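
The tie case behind the **already_added** check can be seen on a toy example. The sketch below uses made-up rows of the form `[name, reviews]`; it also uses a set for the membership check, which is a design choice of this sketch rather than the notebook's approach.

```python
toy = [
    ['App A', '100'],
    ['App A', '250'],
    ['App A', '250'],  # exact duplicate of the best entry
    ['App B', '10'],
]

# Highest review count seen per app name
best = {}
for name, reviews in toy:
    n = float(reviews)
    if name not in best or best[name] < n:
        best[name] = n

# Without the guard, both 'App A' rows with 250 reviews are kept
without_guard = [r for r in toy if best[r[0]] == float(r[1])]

# With the guard, each app name is kept once
with_guard = []
seen = set()  # a set gives constant-time membership checks
for r in toy:
    if best[r[0]] == float(r[1]) and r[0] not in seen:
        with_guard.append(r)
        seen.add(r[0])

print(len(without_guard), len(with_guard))
```

On a dataset of roughly ten thousand rows a list works fine, but a set avoids rescanning the list of names on every iteration.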


### Non-English apps

In this project we only want to analyse and profile English language apps, so we need to identify any non-English apps and remove them from our datasets.

First, we will write a function **is_english** that does the following:
- Takes a string as input
- Checks each character with the built-in function **ord** to get its Unicode code point, and sees whether it falls outside the ASCII range (0-127) that covers common English characters
- If more than three characters in the string fall outside that range, returns False (not English)
- Otherwise, returns True (English)


**Note**: To avoid mistakenly removing apps whose names contain emoji or other special characters (e.g. 'Instachat 😜' or 'Docs To Go™ Free Office Suite'), we only treat a name as non-English if it has more than 3 characters outside the ASCII range.

In [9]:
def is_english(a_string):
    no_of_chars = 0
    for char in a_string:
        if ord(char) > 127:
            no_of_chars += 1

    if no_of_chars > 3:
        return False
    return True

print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Instachat 😜'))
print(is_english('Docs To Go™ Free Office Suite'))

False
True
True
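
The rule is a heuristic, and it counts any character above code point 127, including accented Latin letters. The sketch below shows how the threshold behaves; the helper `count_non_ascii` and the second sample name are made up for illustration.

```python
def count_non_ascii(text):
    """Count characters whose code point lies outside the ASCII range."""
    return sum(1 for ch in text if ord(ch) > 127)


samples = [
    'Docs To Go™ Free Office Suite',
    'Café Crème Brûlée Señor',  # invented name with several accented letters
    'Instachat 😜',
]
for name in samples:
    n = count_non_ascii(name)
    verdict = 'kept' if n <= 3 else 'dropped'
    print(name, '->', n, 'non-ASCII,', verdict)
```

A name written in a Latin-script language with many accents would therefore be dropped, while an English name with one or two symbols is kept.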


Then, we use this function in a loop over the Apple and Android datasets to build our datasets of English-only apps.

In [10]:
apple_english_only = []
android_english_only = []
for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english_only.append(app)
print('English only android apps: ',len(android_english_only))
print('Non-English android apps: ',len(android_clean) - len(android_english_only))
print('% of English apps in Android: ',round((len(android_english_only)/len(android_clean))*100,2))
print('\n')
for app in apple[1:]:
    name = app[1]
    if is_english(name):
        apple_english_only.append(app)
print('English only Apple apps: ',len(apple_english_only))
print('Non-English apple apps: ',len(apple[1:]) - len(apple_english_only))
print('% of English apps in Apple: ',round((len(apple_english_only)/len(apple[1:]))*100,2))

English only android apps: 9615
Non-English android apps: 45
% of English apps in Android: 99.53


English only Apple apps: 6183
Non-English apple apps: 1014
% of English apps in Apple: 85.91


It seems that while the Android dataset has a lot of duplicate apps, the Apple dataset has far more non-English apps than the Android one.

### Free vs paid apps

As mentioned before, we only build apps that are free to download and install, and our main source of revenue is in-app ads. Hence we need to isolate the free apps from the non-free apps in both datasets.

This will be our last step in the data cleaning process.

**Note**:
- In the Apple dataset, we can rely on the **price** column (index: 4) and convert it to float, as it holds plain numeric strings such as '0.0' with no currency symbols
- In the Android dataset, we can rely on the **Type** column (index: 6) being 'Free' to determine whether an app is free or non-free

In [11]:
apple_final = []
for app in apple_english_only:
    price = float(app[4])
    if price == 0:
        apple_final.append(app)
print('Total free apps in Apple Store: ',len(apple_final))

android_final = []
for app in android_english_only:
    free = app[6]
    if free == 'Free':
        android_final.append(app)
print('Total free apps in Android Store: ',len(android_final))

Total free apps in Apple Store: 3222
Total free apps in Android Store: 8864
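
If one preferred to filter the Android data by price rather than by **Type**, the price string would need cleaning first. The sketch below assumes that paid rows carry a leading dollar sign (for example '$4.99'); that format is an assumption, since the sample rows above only show '0'. The helper names are hypothetical.

```python
def parse_price(text):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float.

    Assumes at most a leading '$' as the currency symbol.
    """
    return float(text.strip().lstrip('$'))


def is_free(text):
    """Return True when the parsed price is zero."""
    return parse_price(text) == 0.0


for raw in ['0', '0.0', '$4.99']:
    print(repr(raw), '->', is_free(raw))
```

Relying on the **Type** column, as the notebook does, avoids this parsing step altogether.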


After this series of data cleaning steps, we are left with **3222** apps in the Apple dataset and **8864** apps in the Android dataset to use for profiling and analysis.

## Data Analysis

As mentioned at the start, our goal is to determine the kinds of apps that are likely to attract more users, because the number of people using our apps affects our revenue.

To minimise risk and overhead, we want an app profile that works on both iOS and Android. The idea is to launch on the Android market first, develop the app further based on the response from users, and build an iOS version if it proves profitable.

For this reason, we need to analyse and determine the app profiles/genres that could attract more users on both iOS and Android.

In the Apple dataset, we have a single column, prime_genre, that can aid our analysis. In the Android dataset, however, we have two columns: Category and Genres.

Before we start profiling apps based on the genre columns, we will build two functions that we can reuse for both datasets to build and display frequency tables.

1. **freq_table:** This function builds a frequency table from a dataset (list of lists) and the index of the column of interest. The frequencies are returned as percentages.

2. **display_table:** This function takes a frequency table produced by freq_table, sorts it from highest to lowest percentage, and displays the result.

In [12]:
# Function to build a frequency table
# Takes:
# A dataset (a list of lists) and
# Index number for the column we are building the frequency table
def freq_table(dataset, index):
    result = {}
    for row in dataset:
        key = row[index]
        if key in result:
            result[key] += 1
        else:
            result[key] = 1

    # Make the frequency table as a percentage
    total_apps = len(dataset)
    for key in result:
        result[key] /= total_apps
        result[key] *= 100
        result[key] = round(result[key] ,2)
    return result


def display_table(freq_tbl):
    table = freq_tbl
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
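
For comparison, the same percentage table can be built with `collections.Counter` from the standard library. This is an alternative sketch on toy data, not the notebook's implementation; the name `freq_table_counter` is made up here.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {key: round(count / total * 100, 2) for key, count in counts.items()}


toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_counter(toy, 1)

# Sort by percentage, highest first
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', pct)
```

One difference to be aware of: sorting by value alone leaves tied entries in insertion order, whereas display_table sorts `(value, key)` tuples and so breaks ties by key.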

### Apple app store

Now we will use the function ***display_table*** on the Apple App Store dataset to see the results based on the ***prime_genre*** column.

In [13]:
print('='*5+'Apple App Store - By Prime_Genre'+'='*5+'\n')
freq_tbl = freq_table(apple_final,11) #prime_genre
display_table(freq_tbl)
print('\n')

=====Apple App Store - By Prime_Genre=====

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12




Purely from the genre perspective, we can see the following pattern based on the number of apps:

- *Games* make up by far the largest share of free English apps (about 58%)
- Overall, the Apple App Store seems to have a higher proportion of apps for entertainment purposes (games, photo and video, social networking, sports, music) than for practical purposes (education, shopping, utilities, productivity, lifestyle)

Even though the proportions by number of apps present the above picture, the same might not be true with regard to the number of users/reviews.

Let's now look at the Android dataset using the Category and Genres columns.

In [14]:
print('='*5+'Google Play Store - By Category '+'='*5+'\n')
freq_tbl = freq_table(android_final,1) #Category
display_table(freq_tbl)
print('\n')
print('='*5+'Google Play Store - By Genre '+'='*5+'\n')
freq_tbl = freq_table(android_final,9) #Genre
display_table(freq_tbl)

=====Google Play Store - By Category =====

FAMILY : 18.9
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.91
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


=====Google Play Store - By Genre =====

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Loc

Before we look at the profile of apps, note that the Android dataset has both Category and Genres columns. Looking at the frequency tables above, it is clear that the Genres column is more granular and seems to hold sub-category level information.

Since our analysis at this point involves high-level categorisation, we will continue with the Category column only for the Android dataset.

At first glance, the frequency table for the Category column suggests that the Android market has a more balanced spread of apps across categories than the Apple App Store. It also has more apps for practical purposes (such as productivity, finance, family, tools) than for fun.

However, when we [look](https://play.google.com/store/apps/category/FAMILY?hl=en) at the apps in the **FAMILY** category (about 19%), for example, most of them are games for kids. Even so, we see more apps for practical purposes than for fun, unlike on Apple.

Up to this point, we have found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea of the kinds of apps that have the most users.

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play dataset, we can find this information in the Installs column, but this information is missing from the App Store dataset. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

## Most popular apps - App Store

Let's start with calculating the average number of user ratings per app genre on the App Store.

In [15]:
app_store_genres = freq_table(apple_final, 11)
genre_avg_ratings = {}
for genre in app_store_genres:
    total = 0 #total rating count of all apps in genre
    len_genre = 0 #total apps in genre
    for app in apple_final:
        genre_app = app[11]
        if genre == genre_app:
            app_usr_rating_tot = int(app[5])
            len_genre += 1
            total += app_usr_rating_tot
    avg_genre = round(total/len_genre)
    genre_avg_ratings[genre] = avg_genre

display_table(genre_avg_ratings)

Navigation : 86090
Reference : 74942
Social Networking : 71548
Music : 57327
Weather : 52280
Book : 39758
Food & Drink : 33334
Finance : 31468
Photo & Video : 28442
Travel : 28244
Shopping : 26920
Health & Fitness : 23298
Sports : 23009
Games : 22789
News : 21248
Productivity : 21028
Utilities : 18684
Lifestyle : 16486
Entertainment : 14030
Business : 7491
Education : 7004
Catalogs : 4004
Medical : 612


Even though the Games genre has the largest proportion of apps, we see that the Navigation and Reference genres have a higher average number of user ratings per app.

Before drawing any conclusions, let's dig a little deeper into the apps in these genres.

In [16]:
def top_apps_by_genre(dataset, genre, genre_index, appname_index, users_index, top_n = 5, pct = False):
    genre_apps = []
    total_genre_users = 0
    for app in dataset:
        app_genre = app[genre_index]
        app_name = app[appname_index]
        app_users = int((app[users_index].replace(',','')).replace('+',''))
        if app_genre == genre:
            total_genre_users += app_users
            app_tupple = (app_users, app_name)
            genre_apps.append(app_tupple)
    top = 0
    print('*'*5,'Top',top_n,'apps for',genre,'*'*5)
    for app in sorted(genre_apps, reverse = True):
        top += 1
        if top > top_n:
            print('\n')
            break
        app_name = app[1]
        app_users = app[0]
        if pct:
            if total_genre_users != 0:
                app_user_pct = round((app_users / total_genre_users) * 100,2)
            else:
                app_user_pct = 0
            print(app_name,':',app_users, '(' + str(app_user_pct)+'%)')
        else:
            print(app_name,':',app_users)

# genre_index=-5 is the prime_genre column (index 11 of the 16 columns)
top_apps_by_genre(dataset = apple_final, genre="Navigation", genre_index=-5, appname_index=1, users_index=5, pct=True)
top_apps_by_genre(dataset = apple_final, genre="Reference", genre_index=-5, appname_index=1, users_index=5, pct=True)
top_apps_by_genre(dataset = apple_final, genre="Social Networking", genre_index=-5, appname_index=1, users_index=5, pct=True)
top_apps_by_genre(dataset = apple_final, genre="Music", genre_index=-5, appname_index=1, users_index=5, pct=True)
top_apps_by_genre(dataset = apple_final, genre="Weather", genre_index=-5, appname_index=1, users_index=5, pct=True)
top_apps_by_genre(dataset = apple_final, genre="Book", genre_index=-5, appname_index=1, users_index=5, pct=True)
top_apps_by_genre(dataset = apple_final, genre="Finance", genre_index=-5, appname_index=1, users_index=5, pct=True)

***** Top 5 apps for Navigation *****
Waze - GPS Navigation, Maps & Real-time Traffic : 345046 (66.8%)
Google Maps - Navigation & Transit : 154911 (29.99%)
Geocaching® : 12811 (2.48%)
CoPilot GPS – Car Navigation & Offline Maps : 3582 (0.69%)
ImmobilienScout24: Real Estate Search in Germany : 187 (0.04%)


***** Top 5 apps for Reference *****
Bible : 985920 (73.09%)
Dictionary.com Dictionary & Thesaurus : 200047 (14.83%)
Dictionary.com Dictionary & Thesaurus for iPad : 54175 (4.02%)
Google Translate : 26786 (1.99%)
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418 (1.37%)


***** Top 5 apps for Social Networking *****
Facebook : 2974676 (39.22%)
Pinterest : 1061624 (14.0%)
Skype for iPhone : 373519 (4.93%)
Messenger : 351466 (4.63%)
Tumblr : 334293 (4.41%)


***** Top 5 apps for Music *****
Pandora - Music & Radio : 1126879 (29.78%)
Spotify Music : 878563 (23.22%)
Shazam - Discover music, artists, videos & lyrics : 402925 (10.65%)
iHeartRadio – Free Music & Radio Stations : 29

Looking at the top 5 apps for the few genres that command the highest user numbers, we see the following patterns:

1. In the Navigation, Social Networking and Music genres, a few apps consistently dominate the user base and hence skew the overall results
2. The Reference genre is interesting. Bible and Dictionary.com have a near monopoly, much like Google in Navigation (Waze and Google Maps are both owned by Google)
3. The Weather genre shows promise and is a practical-use genre rather than part of the fun-app segment, which is near saturation point in the App Store. However, considering our primary goal of a free app that maximises ad revenue, weather might not be a suitable genre, as users would not stay in the app long enough
4. Other genres that serve a practical purpose, such as "Food & Drink", could be considered, but these require deeper partnerships at the supply chain level, so a marketplace is perhaps the preferable option here
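
The skew described in point 1 can be illustrated by comparing the mean with the median. The sketch below uses only the five Navigation rating counts printed above, so it is an illustration on the top 5 rather than a statistic for the whole genre.

```python
from statistics import mean, median

# Rating counts of the top 5 Navigation apps, copied from the output above
ratings = [345046, 154911, 12811, 3582, 187]

avg = round(mean(ratings))
mid = median(ratings)
print('mean:', avg)
print('median:', mid)
print('mean is', round(avg / mid, 1), 'times the median')
```

When the mean sits far above the median like this, a handful of very large apps is driving the average, which is why a high genre average alone does not indicate an open market.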

Even though the Book genre is dominated by Amazon, we see promise in that smaller apps such as "Color Therapy Adult Coloring Book for Adults", "HOOKED - Chat Stories" and "OverDrive – Library eBooks and Audiobooks" also hold a good share of the user base (combined ~35%).

So if we bring out a standalone app in the Book genre, perhaps for some famous best-selling book, or a creativity app such as a colouring book for kids or adults, there is potential to maximise ad revenue by keeping the user within our app for longer.

**Note**: We would need to check the rights and partnerships for published books and, more importantly, the book should not already be available as an eBook/audiobook on Amazon's platform.

The Finance genre is interesting:

- Finance apps make up only 1.12% of the free English apps in the App Store
- The average number of users per app is quite high
- The top 5 apps in this space show no monopoly by any big player, and the spread is even
- A higher proportion of the apps provide banking/payment services, but there are also services for personal finance

This genre requires deeper domain expertise, but if we could partner with a wealth management company or financial adviser, we would have the potential to add significant value for the user while maximising ad revenue related to the financial advice in our app.

Of all the genres in the App Store, two definitely emerge as potential candidates:
1. Book genre - Popular/best-selling book not yet on the big platforms (such as Amazon, Google, Apple), or a creativity app such as a colouring book for kids or adults
2. Finance genre - A personal finance app that maximises ad revenue

Both of these definitely fit our theme of apps for practical use.

Let's now explore the Google PlayStore.

## Most popular apps - Google Play Store

In the Play Store dataset, the Installs column holds open-ended ranges such as 1,000+ or 10,000+ rather than exact counts. For our analysis we will treat these as hard numbers (by removing the "," and "+" characters). Since we apply the same technique across the whole dataset, it should not distort our comparison between categories.
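
The conversion can be sketched as a small helper. The name `parse_installs` is hypothetical; the notebook performs the same two replacements inline.

```python
def parse_installs(text):
    """Turn an install bucket such as '1,000+' into the integer 1000.

    The result is the lower bound of the bucket, not an exact count.
    """
    return int(text.replace(',', '').replace('+', ''))


for raw in ['1,000+', '10,000+', '1,000,000,000+']:
    print(raw, '->', parse_installs(raw))
```

Because every value is a lower bound, an app listed at 100,000+ may have anywhere up to the next bucket's installs, so the averages below are best read as conservative estimates.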

In [17]:
# Columns: ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
play_store_genres = freq_table(android_final, 1)
genre_avg_ratings = {}
for genre in play_store_genres:
    total = 0 #total installs of all apps in category
    len_genre = 0 #total apps in category
    for app in android_final:
        genre_app = app[1]
        if genre == genre_app:
            app_usr_rating_tot = int((app[5].replace(',','')).replace('+',''))
            len_genre += 1
            total += app_usr_rating_tot
    avg_genre = round(total/len_genre)
    genre_avg_ratings[genre] = avg_genre

display_table(genre_avg_ratings)

COMMUNICATION : 38456119
VIDEO_PLAYERS : 24727872
SOCIAL : 23253652
PHOTOGRAPHY : 17840110
PRODUCTIVITY : 16787331
GAME : 15588016
TRAVEL_AND_LOCAL : 13984078
ENTERTAINMENT : 11640706
TOOLS : 10801391
NEWS_AND_MAGAZINES : 9549178
BOOKS_AND_REFERENCE : 8767812
SHOPPING : 7036877
PERSONALIZATION : 5201483
WEATHER : 5074486
HEALTH_AND_FITNESS : 4188822
MAPS_AND_NAVIGATION : 4056942
FAMILY : 3697848
SPORTS : 3638640
ART_AND_DESIGN : 1986335
FOOD_AND_DRINK : 1924898
EDUCATION : 1833495
BUSINESS : 1712290
LIFESTYLE : 1433676
FINANCE : 1387692
HOUSE_AND_HOME : 1331541
DATING : 854029
COMICS : 817657
AUTO_AND_VEHICLES : 647318
LIBRARIES_AND_DEMO : 638504
PARENTING : 542604
BEAUTY : 513152
EVENTS : 253542
MEDICAL : 120551


Even though the categorisation in the Play Store is slightly different from that of the App Store, there are some common themes and categories. The Play Store user base shows good promise for both fun and practical-purpose apps.

Let's drill a bit further into some of the top categories here, and also explore the two potential candidates from the App Store: the **Books** and **Finance** genres.

In [18]:
top_apps_by_genre(dataset = android_final, genre="COMMUNICATION", genre_index=1, appname_index=0, users_index=5, pct=True)
top_apps_by_genre(dataset = android_final, genre="VIDEO_PLAYERS", genre_index=1, appname_index=0, users_index=5, pct=True)
top_apps_by_genre(dataset = android_final, genre="PRODUCTIVITY", genre_index=1, appname_index=0, users_index=5, pct=True)
top_apps_by_genre(dataset = android_final, genre="FINANCE", genre_index=1, appname_index=0, users_index=5, pct=True)
top_apps_by_genre(dataset = android_final, genre="BOOKS_AND_REFERENCE", genre_index=1, appname_index=0, users_index=5, pct=True)

***** Top 5 apps for COMMUNICATION *****
WhatsApp Messenger : 1000000000 (9.06%)
Skype - free IM & video calls : 1000000000 (9.06%)
Messenger – Text and Video Chat for Free : 1000000000 (9.06%)
Hangouts : 1000000000 (9.06%)
Google Chrome: Fast & Secure : 1000000000 (9.06%)


***** Top 5 apps for VIDEO_PLAYERS *****
YouTube : 1000000000 (25.43%)
Google Play Movies & TV : 1000000000 (25.43%)
MX Player : 500000000 (12.72%)
VivaVideo - Video Editor & Photo Movie : 100000000 (2.54%)
VideoShow-Video Editor, Video Maker, Beauty Camera : 100000000 (2.54%)


***** Top 5 apps for PRODUCTIVITY *****
Google Drive : 1000000000 (17.27%)
Microsoft Word : 500000000 (8.63%)
Google Calendar : 500000000 (8.63%)
Dropbox : 500000000 (8.63%)
Cloud Print : 500000000 (8.63%)


***** Top 5 apps for FINANCE *****
Google Pay : 100000000 (21.97%)
PayPal : 50000000 (10.99%)
İşCep : 10000000 (2.2%)
Wells Fargo Mobile : 10000000 (2.2%)
Mobile Bancomer : 10000000 (2.2%)


***** Top 5 apps for BOOKS_AND_REFERENCE ****

In [19]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

If we removed all the communication apps that have 100 million or more installs, the average would drop roughly tenfold:

In [20]:
under_100_m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))

sum(under_100_m) / len(under_100_m)

3603485.3884615386
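
The size of the drop can be checked directly from the two averages reported above. This is plain arithmetic on the printed results; the variable names are chosen here for illustration.

```python
avg_all = 38456119                   # COMMUNICATION average from the category table
avg_under_100m = 3603485.3884615386  # result of the cell above

ratio = avg_all / avg_under_100m
print('The full average is about', round(ratio, 1), 'times the filtered one')
```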

We see a similar pattern for the video players category, which is the runner-up with an average of 24,727,872 installs. The market is dominated by apps like YouTube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), and productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

Again, these app genres might seem more popular than they really are; moreover, these niches seem to be dominated by a few giants that are hard to compete against.

The game genre seems popular here too, but previously we found that in the App Store this market seems a bit saturated, so we'd like to come up with a different app recommendation if possible.

The books and reference genre looks fairly popular as well, with an average of 8,767,812 installs, but here again we see a few giants dominating the market, such as Amazon, Google and Bible.

## Conclusion

To reiterate our primary goal: we are going to launch the app in both the App Store and the Play Store, albeit on different timelines.

From that perspective, we looked at the top 5 apps by user base in the Play Store categories that correspond to the App Store genres we listed as potential candidates (Books and Finance).

Here again, the Books category is dominated by big players such as Google and Amazon with their marketplace apps.

This makes it especially hard to launch a famous/best-selling book as an app without running into publishing licence/rights issues with these big players.

**Our recommendation at this stage is to develop a Personal Finance app in both AppStore and PlayStore**.

**Note**:
- We should not consider any sub-categories that require a banking licence, deep domain expertise or heavy infrastructure (such as payments) at this stage
- To maximise the value we create for our user base, and in turn maximise our ad revenue, we should look at a partnership with a financial adviser.