{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Google Play Store\n", "\n", "Can we predict an application's success? How is the number of installations connected with other characteristics of the app?\n", "\n", "Let's make a few plots to see how a certain feature affects the installations.\n", "\n", "Data comes from [Kaggle](https://www.kaggle.com/lava18/google-play-store-apps)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:38:57.590379Z", "iopub.status.busy": "2024-04-17T07:38:57.590298Z", "iopub.status.idle": "2024-04-17T07:38:57.905954Z", "shell.execute_reply": "2024-04-17T07:38:57.905609Z" } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "from lets_plot import *\n", "from lets_plot.mapping import as_discrete" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:38:57.907418Z", "iopub.status.busy": "2024-04-17T07:38:57.907301Z", "iopub.status.idle": "2024-04-17T07:38:57.909353Z", "shell.execute_reply": "2024-04-17T07:38:57.909182Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", " " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "LetsPlot.setup_html()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:38:57.922685Z", "iopub.status.busy": "2024-04-17T07:38:57.922578Z", "iopub.status.idle": "2024-04-17T07:38:58.653045Z", "shell.execute_reply": "2024-04-17T07:38:58.652757Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(10841, 13)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
\n", "
" ], "text/plain": [ " App Category Rating \\\n", "0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 \n", "1 Coloring book moana ART_AND_DESIGN 3.9 \n", "2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 \n", "\n", " Reviews Size Installs Type Price Content Rating \\\n", "0 159 19M 10,000+ Free 0 Everyone \n", "1 967 14M 500,000+ Free 0 Everyone \n", "2 87510 8.7M 5,000,000+ Free 0 Everyone \n", "\n", " Genres Last Updated Current Ver Android Ver \n", "0 Art & Design January 7, 2018 1.0.0 4.0.3 and up \n", "1 Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up \n", "2 Art & Design August 1, 2018 1.2.4 4.0.3 and up " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"https://raw.githubusercontent.com/JetBrains/lets-plot-docs/master/data/googleplaystore.csv\")\n", "print(df.shape)\n", "df.head(3)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:38:58.654271Z", "iopub.status.busy": "2024-04-17T07:38:58.654189Z", "iopub.status.idle": "2024-04-17T07:38:58.669266Z", "shell.execute_reply": "2024-04-17T07:38:58.669094Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(10839, 13)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19922944 | 10000 | Free | 0.0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14680064 | 500000 | Free | 0.0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 9122611 | 5000000 | Free | 0.0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
\n", "
" ], "text/plain": [ " App Category Rating \\\n", "0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 \n", "1 Coloring book moana ART_AND_DESIGN 3.9 \n", "2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 \n", "\n", " Reviews Size Installs Type Price Content Rating \\\n", "0 159 19922944 10000 Free 0.0 Everyone \n", "1 967 14680064 500000 Free 0.0 Everyone \n", "2 87510 9122611 5000000 Free 0.0 Everyone \n", "\n", " Genres Last Updated Current Ver Android Ver \n", "0 Art & Design January 7, 2018 1.0.0 4.0.3 and up \n", "1 Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up \n", "2 Art & Design August 1, 2018 1.2.4 4.0.3 and up " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def size_to_bytes(size):\n", " size = size.lower()\n", " if size == 'varies with device' or size == '':\n", " return -1\n", " if 'k' in size:\n", " return int(float(size.split('k')[0]) * 1024)\n", " if 'm' in size:\n", " return int(float(size.split('m')[0]) * 1024 * 1024)\n", " return int(size)\n", "\n", "df = df[~df.Type.isna()]\n", "df = df[~df.Reviews.astype(str).str.contains('M')]\n", "\n", "df.Reviews = df.Reviews.astype(int)\n", "df.Size = df.Size.astype(str).apply(size_to_bytes).astype(int)\n", "df.Installs = df.Installs.astype(str).str.replace(',', '', regex=False)\\\n", " .str.replace('+', '', regex=False).astype(int)\n", "df.Price = df.Price.astype(str).str.replace('$', '', regex=False).astype(float)\n", "\n", "print(df.shape)\n", "df.head(3)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:38:58.670293Z", "iopub.status.busy": "2024-04-17T07:38:58.670198Z", "iopub.status.idle": "2024-04-17T07:38:58.702611Z", "shell.execute_reply": "2024-04-17T07:38:58.702323Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_df = df.groupby('Category').Installs.mean().to_frame().reset_index()\n", "\n", "ggplot() + \\\n", " geom_bar(aes(x=as_discrete('Category', order_by='Installs'), y='Installs', fill='Category'), \\\n", " data=cat_df, stat='identity', sampling=sampling_pick(cat_df.shape[0])) + \\\n", " scale_fill_brewer(type='qual', palette='Dark2') + \\\n", " xlab('category') + ylab('mean installations') + \\\n", " ggsize(600, 450) + \\\n", " ggtitle('Installations by Category') + \\\n", " theme(panel_grid_major_x='blank', legend_position='none')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we can see that some categories are much more popular than others." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:38:58.703762Z", "iopub.status.busy": "2024-04-17T07:38:58.703681Z", "iopub.status.idle": "2024-04-17T07:38:58.709053Z", "shell.execute_reply": "2024-04-17T07:38:58.708861Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gen_df = df.groupby('Genres').Installs.mean().to_frame().reset_index()\n", "\n", "ggplot() + \\\n", " geom_bar(aes(x=as_discrete('Genres', order_by='Installs'), y='Installs', fill='Genres'), \\\n", " data=gen_df, stat='identity', sampling=sampling_pick(gen_df.shape[0]), \\\n", " tooltips=layer_tooltips().line('genre|@Genres')\\\n", " .format('@Installs', '.0f')\\\n", " .line('mean installations|@Installs')) + \\\n", " scale_fill_brewer(type='qual', palette='Dark2') + \\\n", " ylab('mean installations') + \\\n", " ggsize(600, 300) + \\\n", " ggtitle('Installations by Genre') + \\\n", " theme(panel_grid_major_x='blank', legend_position='none', \\\n", " axis_title_x='blank', axis_text_x='blank', axis_ticks_x='blank')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see a big gap in popularity between different genres." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:38:58.710222Z", "iopub.status.busy": "2024-04-17T07:38:58.710136Z", "iopub.status.idle": "2024-04-17T07:38:58.779445Z", "shell.execute_reply": "2024-04-17T07:38:58.779065Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ggplot() + \\\n", " geom_bin2d(aes(x='Installs', y='Rating', fill='..count..'), \\\n", " data=df, color='white', size=1) + \\\n", " scale_fill_gradient(low='#e0ecf4', high='#8856a7') + \\\n", " scale_x_log10(name='installations') + \\\n", " ylim(1, 5) + ylab('rating') + \\\n", " ggsize(600, 300) + \\\n", " ggtitle('Connection Between Installations and Rating')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rating and the number of installations are loosely positively correlated. At the very least, apps rated below 3 rarely become popular." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:38:58.780598Z", "iopub.status.busy": "2024-04-17T07:38:58.780496Z", "iopub.status.idle": "2024-04-17T07:38:58.994286Z", "shell.execute_reply": "2024-04-17T07:38:58.994053Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ggplot() + \\\n", " geom_jitter(aes(x='Reviews', y='Installs', paint_a='Type'), \\\n", " data=df, shape=21, color='black', alpha=.1, fill_by='paint_a') + \\\n", " geom_smooth(aes(x='Reviews', y='Installs', group='Type', paint_a='Type'), \\\n", " data=df, method='loess', deg=2, color_by='paint_a') + \\\n", " scale_x_log10(name='reviews') + scale_y_log10(name='installations') + \\\n", " scale_brewer('paint_a', palette='Set2') + \\\n", " ggsize(600, 450) + \\\n", " ggtitle('Connection Between Installations and Reviews')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The plot shows that the number of installations and the number of reviews are strongly correlated, so the review count can stand in for popularity in the plots that follow.\n", "\n", "The smoothing curves for free and paid applications lie far enough apart that it is better to look at the two groups separately." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:38:58.995534Z", "iopub.status.busy": "2024-04-17T07:38:58.995421Z", "iopub.status.idle": "2024-04-17T07:38:59.067785Z", "shell.execute_reply": "2024-04-17T07:38:59.067463Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ggplot() + \\\n", " geom_bin2d(aes(x='Reviews', y='Size', fill='..count..'), \\\n", " data=df, color='white', size=1) + \\\n", " scale_fill_gradient(low='#e5f5f9', high='#2ca25f') + \\\n", " scale_x_log10(name='reviews') + scale_y_log10(name='size') + \\\n", " ggsize(600, 300) + \\\n", " ggtitle('Connection Between Reviews and Size')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like we might not be interested in apps smaller than 1 MB. For the others, there is only a minor correlation between size and the number of reviews." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2024-04-17T07:38:59.069038Z", "iopub.status.busy": "2024-04-17T07:38:59.068928Z", "iopub.status.idle": "2024-04-17T07:38:59.078388Z", "shell.execute_reply": "2024-04-17T07:38:59.078200Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ggplot() + \\\n", " geom_bin2d(aes(x='Reviews', y='Price', fill='..count..'), \\\n", " data=df[df.Type == 'Paid'], color='white', size=1) + \\\n", " scale_fill_gradient(low='#ffeda0', high='#f03b20') + \\\n", " scale_x_log10(name='reviews') + scale_y_log10(name='price') + \\\n", " ggsize(600, 300) + \\\n", " ggtitle('Connection Between Price and Reviews')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I see no discernible pattern here. In any case, paid apps are not very common; the rest are either free of charge or rely on other sources of monetization." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 4 }