{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Analyzing Startup Fundraising Deals from Crunchbase\n", "\n", "We will analyze startup investments from Crunchbase.com.\n", "\n", "Every year, thousands of startup companies raise financing from investors. Each time a startup raises money, we refer to the event as a fundraising round. Crunchbase is a website that crowdsources information on the fundraising rounds of many startups. The Crunchbase user community submits, edits, and maintains most of the information in Crunchbase.\n", "\n", "In return, Crunchbase makes the data available through a Web application and a fee-based API. Before Crunchbase switched to the paid API model, multiple groups crawled the site and released the data online. Because the information on the startups and their fundraising rounds is always changing, the data set we'll be using isn't completely up to date.\n", "\n", "The data set of investments we'll be exploring is current as of October 2013. You can download it from [GitHub].\n", "\n", "Throughout this guided project, we'll practice working with different memory constraints. In this step, let's assume we only have 10 megabytes of available memory. While `crunchbase-investments.csv` consumes 10.3 megabytes of disk space, we know from earlier missions that pandas often requires 4 to 6 times amount of space in memory as the file does on disk (especially when there's many string columns).\n", "\n", "[Crunchbase]: https://www.crunchbase.com/\n", "[Github]: https://github.com/datahoarder/crunchbase-october-2013/blob/master/crunchbase-investments.csv" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import sqlite3\n", "import pandas as pd\n", "import pprint as pp\n", "import matplotlib.pyplot as plt\n", "pd.options.display.max_columns = 99" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "crunchbase.db\r\n", "crunchbase-investments.csv\r\n", "project03_analyzing_startup_fundraising_deals_from_crunchbase.ipynb\r\n" ] } ], "source": [ "!ls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "- Because the data set contains over 50,000 rows, you'll need to read the data set into dataframes using 5,000 row chunks to ensure that each chunk consumes much less than 10 megabytes of memory.\n", "- Across all of the chunks, become familiar with:\n", " - Each column's missing value counts\n", " - Each column's memory footprint\n", " - The total memory footprint of all of the chunks combined\n", " - Which column(s) we can drop because they aren't useful for analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check Chunk's Memory" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Chunk 1 memory: 5.58 mb\n", "Chunk 2 memory: 5.53 mb\n", "Chunk 3 memory: 5.54 mb\n", "Chunk 4 memory: 5.53 mb\n", "Chunk 5 memory: 5.52 mb\n", "Chunk 6 memory: 5.55 mb\n", "Chunk 7 memory: 5.53 mb\n", "Chunk 8 memory: 5.51 mb\n", "Chunk 9 memory: 5.40 mb\n", "Chunk 10 memory: 4.64 mb\n", "Chunk 11 memory: 2.66 mb\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW4AAAD8CAYAAABXe05zAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAC0JJREFUeJzt3W+MZXddx/HPl26RP4I1dqJICYORmCBRwQ2CTYgWYoCS9oEk1qQqRNNoRNCYkPJEoo8gMQY1BrOCBgUFU8FU/gkGm8gDqrNtFcpC0mCFCqYDahE1YOXrg7mLyzCzcwp7Z+a7+3olk50799eZ769n8s7ZM/fMVncHgDkedtQDAPDQCDfAMMINMIxwAwwj3ADDCDfAMMINMIxwAwwj3ADDnFjHJ73yyit7c3NzHZ8a4KJ0+vTpz3T3xpK1awn35uZmtra21vGpAS5KVfVPS9e6VAIwjHADDCPcAMMIN8Awwg0wjHADDCPcAMMIN8Awwg0wzFrunAQ4Sps3v/NIvu69r772UL6OM26AYYQbYBjhBhhGuAGGEW6AYYQbYBjhBhhGuAGGEW6AYYQbYBjhBhhGuAGGEW6AYYQbYBjhBhhGuAGGEW6AYYQbYJhF4a6qX6qqu6vqw1X1J1X1iHUPBsDeDgx3VT0+ycuSnOzupya5LMkN6x4MgL0tvVRyIskjq+pEkkcl+dT6RgLgfA4Md3f/c5JfT/KJJJ9O8kB3v3fdgwGwtyWXSr45yfVJnpTk25M8uqpu3GPdTVW1VVVb29vbF35SAJIsu1Ty3CT/2N3b3f0/Sd6W5Ad3L+ruU919srtPbmxsXOg5AVhZEu5PJHlmVT2qqirJc5KcWe9YAOxnyTXu25PckuSOJB9a/Ten1jwXAPs4sWRRd78qyavWPAsAC7hzEmAY4QYYRrgBhhFugGGEG2AY4QYYRrgBhhFugGGEG2AY4QYYRrgBhhFugGGEG2AY4QYYRrgBhhFugGGEG2AY4QYYRrgBhhFugGGEG2AY4QYYRrgBhhFugGGEG2AY4QYYRrgBhhFugGGEG2AY4QYYRrgBhhFugGGEG2AY4QYYRrgBhhFugGGEG2CYReGuqiuq6paq+mhVnamqZ617MAD2dmLhut9M8p7uflFVPTzJo9Y4EwDncWC4q+qxSZ6d5MVJ0t1fTPLF9Y4FwH6WXCr5jiTbSf6gqu6sqtdX1aPXPBcA+1gS7hNJnp7kdd39tCT/meTm3Yuq6qaq2qqqre3t7Qs8JgBnLQn3fUnu6+7bV49vyU7Iv0J3n+ruk919cmNj40LOCMA5Dgx3d/9Lkk9W1XetPvScJB9Z61QA7Gvpq0p+IcmbV68o+XiSl6xvJADOZ1G4u/uuJCfXPAsAC7hzEmAY4QYYRrgBhhFugGGEG2AY4QYYRrgBhhFugGGEG2AY4QYYRrgBhhFugGGEG2AY4QYYRrgBhhFugGGEG2AY4QYYRrgBhhFugGGEG2AY4QYYRrgBhhFugGGEG2AY4QYYRrgBhhFugGGEG2AY4QYYRrgBhhFugGGEG2AY4QYYRrgBhhFugGGEG2CYxeGuqsuq6s6qesc6BwLg/B7KGffLk5xZ1yAALLMo3FV1VZJrk7x+veMAcJClZ9yvTfKKJF9a4ywALHBguKvqhUnu7+7TB6y7qaq2qmpre3v7gg0IwFdacsZ9dZLrqureJG9Jck1VvWn3ou4+1d0nu/vkxsbGBR4TgLMODHd3v7K7r+ruzSQ3JHl/d9+49skA2JPXcQMMc+KhLO7u25LctpZJAFjEGTfAMMINMIxwAwwj3ADDCDfAMMINMIxwAwwj3ADDCDfAMMINMIxwAwwj3ADDCDfAMMINMIxwAwwj3ADDCDfAMMINMIxwAwwj3ADDCDfAMMINMIxwAwwj3ADDCDfAMMINMIxwAwwj3ADDCDfAMMINMIxwAwwj3ADDCDfAMMINMIxwAwwj3ADDCDfAMAeGu6qeUFV/XVVnquruqnr5YQwGwN5OLFjzYJJf7u47quoxSU5X1fu6+yNrng2APRx4xt3dn+7uO1bv/0eSM0kev+7BANjbQ7rGXVWbSZ6W5PZ1DAPAwRaHu6q+McmfJfnF7v7cHs/fVFVbVbW1vb19IWcE4ByLwl1Vl2cn2m/u7rfttaa7T3X3ye4+ubGxcSFnBOAcS15VUknekORMd//G+kcC4HyWnHFfneQnklxTVXet3l6w5rkA2MeBLwfs7g8kqUOYBYAF3DkJMIxwAwwj3ADDCDfAMMINMIxwAwwj3ADDCDfAMMINMIxwAwwj3ADDCDfAMMINMIxwAwwj3ADDCDfAMMINMIxwAwxz4D9ddtg2b37nkXzde1997ZF8XS4NR/V9nfjevhg54wYYRrgBhhFugGGEG2AY4QYYRrgBhhFugGGEG2AY4QYYRrgBhhFugGGEG2AY4QYYRrgBhhFugGGEG2AY4QYYRrgBhlkU7qp6XlV9rKruqaqb1z0UAPs7MNxVdVmS30ny/CRPSfLjVfWUdQ8GwN6WnHE/I8k93f3x7v5ikrckuX69YwGwnyXhfnyST57z+L7VxwA4AicWrKk9PtZftajqpiQ3rR5+vqo+9vUMdtjqNXt++MoknzncSdbKfo63texnn+/tw3JJHaOv8//1E5cuXBLu+5I84ZzHVyX51O5F3X0qyamlX3iCqtrq7pNHPceFYj/H28W2n+Ti29Nx2c+SSyV/l+TJVfWkqnp4khuS3LresQDYz4Fn3N39YFW9NMlfJrksye93991rnwyAPS25VJLufleSd615luPoorr0E/s57i62/SQX356OxX6q+6t+zgjAMeaWd4BhLvlwV9Ujqupvq+rvq+ruqvrVPdZ8Q1W9dXXL/+1VtXn4ky6zcD8vrqrtqrpr9fYzRzHrQ1FVl1XVnVX1jj2eG3N8zjpgPxOPz71V9aHVvFt7PF9V9VurY/QPVfX0o5hzqQX7+aGqeuCcY/QrhznfomvcF7kvJLmmuz9fVZcn+UBVvbu7P3jOmp9O8m/d/Z1VdUOS1yT5saMYdoEl+0mSt3b3S49gvq/Vy5OcSfLYPZ6bdHzOOt9+knnHJ0l+uLv3e43z85M8efX2A0let/rzODvffpLkb7r7hYc2zTku+TPu3vH51cPLV2+7L/xfn+SNq/dvSfKcqtrrxqQjt3A/o1TVVUmuTfL6fZaMOT7Jov1cjK5P8oer788PJrmiqh531ENNdcmHO/nyX1vvSnJ/kvd19+27lnz5tv/ufjDJA0m+5XCnXG7BfpLkR1d/Zb2lqp6wx/PHyWuTvCLJl/Z5ftTxycH7SWYdn2Tn5OC9VXV6dRf1btN+dcZB+0mSZ60uSb67qr77MIcT7iTd/b/d/X3ZuSv0GVX11F1LFt32f1ws2M9fJNns7u9J8lf5/7PVY6eqXpjk/u4+fb5le3zsWB6fhfsZc3zOcXV3Pz07l0R+vqqevev5Mcdo5aD93JHkid39vUl+O8mfH+Zwwn2O7v73JLcled6up758239VnUjyTUn+9VCH+xrst5/u/mx3f2H18PeSfP8hj/ZQXJ3kuqq6Nzu/mfKaqnrTrjWTjs+B+xl2fJIk3f2p1Z/3J3l7dn6r6LkW/eqM4+Kg/XT3585eklzd53J5VV15WPNd8uGuqo2qumL1/iOTPDfJR3ctuzXJT63ef1GS9/cxfQH8kv3surZ4XXZ+SHYsdfcru/uq7t7Mzq9beH9337hr2Zjjs2Q/k45PklTVo6vqMWffT/IjST68a9mtSX5y9eqSZyZ5oLs/fcijLrJkP1X1bWd/jlJVz8hOSz97WDN6VUnyuCRvrJ1/MOJhSf60u99RVb+WZKu7b03yhiR/VFX3ZOdM7oajG/dAS/bzsqq6LsmD2dnPi49s2q/R4OOzp+HH51uTvH3VsRNJ/ri731NVP5sk3f272bnz+gVJ7knyX0leckSzLrFkPy9K8nNV9WCS/05yw2GeLLhzEmCYS/5SCcA0wg0wjHADDCPcAMMIN8Awwg0wjHADDCPcAMP8HyKL7A476WUwAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000, \n", " encoding='ISO-8859-1')\n", "\n", "memory_footprints = []\n", "i = 0\n", "\n", "for chunk in chunks_it:\n", " i += 1\n", " chunk_memory_mb = chunk.memory_usage(deep=True).sum() / 1048576\n", " memory_footprints.append(chunk_memory_mb)\n", " print(\"Chunk {0} memory: {1:0.2f} mb\".format(i, chunk_memory_mb))\n", "\n", "plt.hist(memory_footprints)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Observations**:\n", "\n", "- Each chunk is under the 10 MB of memory requirement we had set\n", "- Each chunk has ~5.5 MB except the last 2 chunks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check each column's missing value counts across all chunks" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "company_country_code 1\n", "company_name 1\n", "company_permalink 1\n", "company_region 1\n", "investor_region 2\n", "investor_permalink 2\n", "investor_name 2\n", "funded_quarter 3\n", "funded_at 3\n", "funded_month 3\n", "funded_year 3\n", "funding_round_type 3\n", "company_state_code 492\n", "company_city 533\n", "company_category_code 643\n", "raised_amount_usd 3599\n", "investor_country_code 12001\n", "investor_city 12480\n", "investor_state_code 16809\n", "investor_category_code 50427\n", "dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000, \n", " encoding='ISO-8859-1')\n", "\n", "missing_vc_list = []\n", "i = 0\n", "\n", "for chunk in chunks_it:\n", " i += 1\n", " #print(\"Chunk {0}:\".format(i))\n", " result = chunk.isnull().sum()\n", " #print(type(result))\n", " missing_vc_list.append(result)\n", "\n", "combined_missing_vc = pd.concat(missing_vc_list)\n", "unique_combined_missing_vc = combined_missing_vc.groupby(combined_missing_vc.index).sum()\n", "\n", "unique_combined_missing_vc.sort_values()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Observations**:\n", "\n", "- There is a lot of investor related information missing across the dataset, specifically investor state code, investor category code, investor city and investory country code" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check each column's memory footprint" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "company_permalink 4057788\n", "company_name 3591326\n", "company_category_code 3421104\n", "company_country_code 3172176\n", "company_state_code 3106051\n", "company_region 3411585\n", "company_city 3505926\n", "investor_permalink 4980548\n", "investor_name 3915666\n", "investor_category_code 622424\n", "investor_country_code 2647292\n", "investor_state_code 2476607\n", "investor_region 3396281\n", "investor_city 2885083\n", "funding_round_type 3410707\n", "funded_at 3542185\n", "funded_month 3383584\n", "funded_quarter 3383584\n", "funded_year 422960\n", "raised_amount_usd 422960\n", "dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000, \n", " encoding='ISO-8859-1')\n", "\n", "series_memory_fp = pd.Series()\n", "counter = 0\n", "\n", "for chunk in chunks_it:\n", " if counter == 0:\n", " series_memory_fp = chunk.memory_usage(deep=True)\n", " else:\n", " series_memory_fp += chunk.memory_usage(deep=True)\n", " counter += 1\n", "\n", "series_memory_fp = series_memory_fp.drop('Index')\n", "series_memory_fp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The total memory footprint of all of the chunks combined" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Chunk 1 memory: 5.58 mb\n", "Chunk 2 memory: 5.53 mb\n", "Chunk 3 memory: 5.54 mb\n", "Chunk 4 memory: 5.53 mb\n", "Chunk 5 memory: 5.52 mb\n", "Chunk 6 memory: 5.55 mb\n", "Chunk 7 memory: 5.53 mb\n", "Chunk 8 memory: 5.51 mb\n", "Chunk 9 memory: 5.40 mb\n", "Chunk 10 memory: 4.64 mb\n", "Chunk 11 memory: 2.66 mb\n", "Total memory: 56.99 mb\n" ] } ], "source": [ "chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000, \n", " encoding='ISO-8859-1')\n", "\n", "memory_footprints = []\n", "i = 0\n", "\n", "for chunk in chunks_it:\n", " i += 1\n", " chunk_memory_mb = chunk.memory_usage(deep=True).sum() / 1048576\n", " memory_footprints.append(chunk_memory_mb)\n", " print(\"Chunk {0} memory: {1:0.2f} mb\".format(i, chunk_memory_mb))\n", "\n", "print(\"Total memory: {0:0.2f} mb\".format(sum(memory_footprints)))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "56.9876070022583" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "series_memory_fp.sum() / (1048576)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Which column(s) we can drop because they aren't useful for analysis\n", "\n", "Let's drop columns that aren't very helpful (URLs) or columns that have too many missing columns (`> 90%`)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total Rows: 52870\n" ] } ], "source": [ "chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000, \n", " encoding='ISO-8859-1')\n", "total_num_rows = 0\n", "for chunk in chunks_it:\n", " total_num_rows += chunk.shape[0]\n", "\n", "print(\"Total Rows: {0}\".format(total_num_rows))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "company_country_code 0.001891\n", "company_name 0.001891\n", "company_permalink 0.001891\n", "company_region 0.001891\n", "investor_region 0.003783\n", "investor_permalink 0.003783\n", "investor_name 0.003783\n", "funded_quarter 0.005674\n", "funded_at 0.005674\n", "funded_month 0.005674\n", "funded_year 0.005674\n", "funding_round_type 0.005674\n", "company_state_code 0.930584\n", "company_city 1.008133\n", "company_category_code 1.216191\n", "raised_amount_usd 6.807263\n", "investor_country_code 22.699073\n", "investor_city 23.605069\n", "investor_state_code 31.793077\n", "investor_category_code 95.379232\n", "dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# get percentage of missing values per column\n", "unique_combined_missing_vc.sort_values() / total_num_rows * 100" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns to Keep:\n", "['company_name', 'company_category_code', 'company_country_code', 'company_state_code', 'company_region', 'company_city', 'investor_name', 'investor_country_code', 'investor_state_code', 'investor_region', 'investor_city', 'funding_round_type', 'funded_at', 'funded_month', 'funded_quarter', 'funded_year', 'raised_amount_usd']\n" ] } ], "source": [ "# Drop columns representing URLs or containing too many missing values ( > 90%)\n", "drop_cols = ['investor_permalink', 'company_permalink', 'investor_category_code']\n", "keep_cols = chunk.columns.drop(drop_cols)\n", "\n", "print('Columns to Keep:')\n", "print(keep_cols.tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting Data Types\n", "\n", "Let's get familiar with the column types before adding the data into SQLite.\n", "\n", "- Identify the types for each column.\n", "- Identify the numeric columns we can represent using more space efficient types.\n", "- For text columns:\n", " - Analyze the unique value counts across all of the chunks to see if we can convert them to a numeric type.\n", " - See if we clean clean any text columns and separate them into multiple numeric columns without adding any overhead when querying.\n", "- Make your changes to the code from the last step so that the overall memory the data consumes stays under 10 megabytes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Types for each column" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'company_category_code': {'object'},\n", " 'company_city': {'object'},\n", " 'company_country_code': {'object'},\n", " 'company_name': {'object'},\n", " 'company_region': {'object'},\n", " 'company_state_code': {'object'},\n", " 'funded_at': {'object'},\n", " 'funded_month': {'object'},\n", " 'funded_quarter': {'object'},\n", " 'funded_year': {'int64', 'float64'},\n", " 'funding_round_type': {'object'},\n", " 'investor_city': {'object', 'float64'},\n", " 'investor_country_code': {'object', 'float64'},\n", " 'investor_name': {'object'},\n", " 'investor_region': {'object'},\n", " 'investor_state_code': {'object', 'float64'},\n", " 'raised_amount_usd': {'float64'}}\n" ] } ], "source": [ "chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000, \n", " encoding='ISO-8859-1', usecols=keep_cols)\n", "col_types = {}\n", "i = 0\n", "for chunk in chunks_it:\n", " i += 1\n", " #print(\"Chunk {0}:\".format(i))\n", " #print(chunk.dtypes)\n", " for col in chunk.columns:\n", " if col not in col_types:\n", " # use a set to keep only unique values!\n", " col_types[col] = set([str(chunk.dtypes[col])])\n", " else:\n", " col_types[col].add(str(chunk.dtypes[col]))\n", " \n", " #print('\\t', col, '|', col_types[col])\n", " #break\n", "\n", "pp.pprint(col_types)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Observations**:\n", "\n", "- A few columns have several types: `funded_year`, `investor_city`, `investor_country_code`, `investor_state_code`\n", "- There are a number of **object** type columns which could be converted to more appropriate, space efficient data types" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Identify the numeric columns we can represent using more space efficient types" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Chunk 1 : Index(['funded_year', 'raised_amount_usd'], dtype='object')\n", "\n", "Chunk 2 : Index(['funded_year', 'raised_amount_usd'], dtype='object')\n", "\n", "Chunk 3 : Index(['funded_year', 'raised_amount_usd'], dtype='object')\n", "\n", "Chunk 4 : Index(['funded_year', 'raised_amount_usd'], dtype='object')\n", "\n", "Chunk 5 : Index(['funded_year', 'raised_amount_usd'], dtype='object')\n", "\n", "Chunk 6 : Index(['funded_year', 'raised_amount_usd'], dtype='object')\n", "\n", "Chunk 7 : Index(['funded_year', 'raised_amount_usd'], dtype='object')\n", "\n", "Chunk 8 : Index(['funded_year', 'raised_amount_usd'], dtype='object')\n", "\n", "Chunk 9 : Index(['funded_year', 'raised_amount_usd'], dtype='object')\n", "\n", "Chunk 10 : Index(['investor_country_code', 'investor_state_code', 'investor_city',\n", " 'funded_year', 'raised_amount_usd'],\n", " dtype='object')\n", "\n", "Chunk 11 : Index(['investor_country_code', 'investor_state_code', 'investor_city',\n", " 'funded_year', 'raised_amount_usd'],\n", " dtype='object')\n" ] } ], "source": [ "chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000, \n", " encoding='ISO-8859-1', usecols=keep_cols)\n", "count = 0\n", "\n", "for chunk in chunks_it:\n", " count += 1\n", " print(f'\\nChunk {count:<3}:', end=' ')\n", " float_cols = chunk.select_dtypes(include=['float', 'int']).columns\n", " print(float_cols)\n", " \n", " chunk['raised_amount_usd'] = pd.to_numeric(chunk['raised_amount_usd'],\n", " downcast='float')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Observations**:\n", "\n", "- The `raised_amount_usd` is the only consistent float column across all the chunks. We could use a more space efficient type for it.\n", "\n", "Let's dig a little deeper and look at the unique values within the numeric columns." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "CHUNK 1:\n", "\n", "funded_year\n", "type: int64 | 18 values: [2012 2011 2010 2009 2007]\n", "\n", "raised_amount_usd\n", "type: float64 | 625 values: [2000000. 20000. 70000. 75000. 100000.]\n", "\n", "CHUNK 2:\n", "\n", "funded_year\n", "type: int64 | 17 values: [2008 2009 2007 2010 2006]\n", "\n", "raised_amount_usd\n", "type: float64 | 613 values: [25000000. 20000000. 26000000. 34500000. 68000000.]\n", "\n", "CHUNK 3:\n", "\n", "funded_year\n", "type: int64 | 17 values: [2009 2007 2011 2012 2010]\n", "\n", "raised_amount_usd\n", "type: float64 | 634 values: [37400000. 6000000. 22300000. 8387128. 15700000.]\n", "\n", "CHUNK 4:\n", "\n", "funded_year\n", "type: int64 | 15 values: [2011 2012 2013 2010 2005]\n", "\n", "raised_amount_usd\n", "type: float64 | 588 values: [ 3000000. 5000000. 85000000. 15000000. 5300000.]\n", "\n", "CHUNK 5:\n", "\n", "funded_year\n", "type: int64 | 19 values: [2010 2007 2008 2011 2012]\n", "\n", "raised_amount_usd\n", "type: float64 | 627 values: [15000000. 14900000. 935000. 365000. 1300000.]\n", "\n", "CHUNK 6:\n", "\n", "funded_year\n", "type: int64 | 15 values: [2009 2010 2006 2005 2011]\n", "\n", "raised_amount_usd\n", "type: float64 | 612 values: [29000000. 2000000. 3200000. 3000000. 12000000.]\n", "\n", "CHUNK 7:\n", "\n", "funded_year\n", "type: float64 | 18 values: [2003. 2012. 2013. 2007. 2009.]\n", "\n", "raised_amount_usd\n", "type: float64 | 606 values: [ 500000. 3000000. 20000000. 1400000. 10000000.]\n", "\n", "CHUNK 8:\n", "\n", "funded_year\n", "type: int64 | 18 values: [2006 2010 2005 2011 2012]\n", "\n", "raised_amount_usd\n", "type: float64 | 615 values: [13000000. 12000000. 19300000. 12500000. 23500000.]\n", "\n", "CHUNK 9:\n", "\n", "funded_year\n", "type: int64 | 16 values: [2007 2010 2011 2012 2013]\n", "\n", "raised_amount_usd\n", "type: float64 | 617 values: [ 3500000. 9000000. 1400000. 10000000. 800000.]\n", "\n", "CHUNK 10:\n", "\n", "investor_country_code\n", "type: float64 | 1 values: [nan]\n", "\n", "investor_state_code\n", "type: float64 | 1 values: [nan]\n", "\n", "investor_city\n", "type: float64 | 1 values: [nan]\n", "\n", "funded_year\n", "type: int64 | 16 values: [2011 2013 2005 2006 2010]\n", "\n", "raised_amount_usd\n", "type: float64 | 405 values: [1000000. 1600000. 300000. 1100000. 250000.]\n", "\n", "CHUNK 11:\n", "\n", "investor_country_code\n", "type: float64 | 1 values: [nan]\n", "\n", "investor_state_code\n", "type: float64 | 1 values: [nan]\n", "\n", "investor_city\n", "type: float64 | 1 values: [nan]\n", "\n", "funded_year\n", "type: int64 | 15 values: [2012 2007 2008 2010 2006]\n", "\n", "raised_amount_usd\n", "type: float64 | 349 values: [ 3060000. 12000000. 500000. 750000. nan]\n" ] } ], "source": [ "chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000, \n", " encoding='ISO-8859-1', usecols=keep_cols)\n", "count = 0\n", "str_cols_vc = {}\n", "\n", "for chunk in chunks_it:\n", " count += 1\n", " print(f'\\nCHUNK {count}:')\n", " \n", " num_cols = chunk.select_dtypes(include=['float', 'int'])\n", " for col in num_cols:\n", " print(f'\\n{col}')\n", " current_col_vc = num_cols[col].value_counts()\n", " #continue\n", " \n", " if col in str_cols_vc:\n", " str_cols_vc[col].append(current_col_vc)\n", " else:\n", " str_cols_vc[col] = [current_col_vc]\n", " \n", " \n", " unique_values = chunk[col].unique()\n", " print(f'type: {chunk[col].dtype} | {len(unique_values)} values: {unique_values[:5]}')\n", " \n", " #break\n", "\n", "## Combine the value count results\n", "combined_vcs = {}\n", "\n", "for col, vc_list in str_cols_vc.items():\n", " combined_vc = pd.concat(vc_list)\n", " final_vc = combined_vc.groupby(combined_vc.index).sum()\n", " combined_vcs[col] = final_vc\n", "\n", "#pp.pprint(combined_vcs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Observations**:\n", "\n", "- The `funded_year` column can be converted from float to integer as it shows the funded year\n", "- There are only 2 numeric columns - `funded_year` and `raised_amount_usd`. The remaining columns are showing up because they have Nan values in the later chunks.\n", "- The `raised_amount_usd` column can be downcast as a float to a more space efficient type.\n", "- Addition: The `funded_at` column has the necessary year, month, day info - so the `funded_year` is no longer required!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text Columns\n", "\n", "- Analyze the unique value counts across all of the chunks to see if we can convert them to a numeric type.\n", "- See if we clean clean any text columns and separate them into multiple numeric columns without adding any overhead when querying." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "CHUNK 1:\n", "col: company_name | type: object | 3373 values: ['AdverCar' 'LaunchGram' 'uTaP' 'ZoopShop' 'eFuneral']\n", "col: company_category_code | type: object | 43 values: ['advertising' 'news' 'messaging' 'software' 'web']\n", "col: company_country_code | type: object | 1 values: ['USA']\n", "col: company_state_code | type: object | 48 values: ['CA' nan 'OH' 'NY' 'FL']\n", "col: company_region | type: object | 215 values: ['SF Bay' 'United States - Other' 'Columbus' 'Cleveland' 'New York']\n", "col: company_city | type: object | 543 values: ['San Francisco' 'Mountain View' nan 'columbus' 'Cleveland']\n", "col: investor_name | type: object | 1540 values: ['1-800-FLOWERS.COM' '10Xelerator'\n", " '2010 NYU Stern Business Plan Competition' '22Hundred Group' '3taps']\n", "col: investor_country_code | type: object | 46 values: ['USA' nan 'BMU' 'CHE' 'CAN']\n", "col: investor_state_code | type: object | 44 values: ['NY' 'OH' nan 'CA' 'MA']\n", "col: investor_region | type: object | 236 values: ['New York' 'Columbus' 'unknown' 'SF Bay' 'Boston']\n", "col: investor_city | type: object | 387 values: ['New York' 'Columbus' nan 'San Francisco' 'Palo Alto']\n", "col: funding_round_type | type: object | 9 values: ['series-a' 'other' 'angel' 'venture' 'series-b']\n", "col: funded_at | type: object | 1677 values: ['2012-10-30' '2012-01-23' '2012-01-01' '2012-02-15' '2011-09-08']\n", "col: funded_month | type: object | 159 values: ['2012-10' '2012-01' '2012-02' '2011-09' '2012-06']\n", "col: funded_quarter | type: object | 63 values: ['2012-Q4' '2012-Q1' '2011-Q3' '2012-Q2' '2012-Q3']\n", "\n", "CHUNK 2:\n", "col: company_name | type: object | 2946 values: ['Visible World' 'obopay' 'Jumptap' 'Avid Radiopharmaceuticals'\n", " 'Pacific Biosciences']\n", "col: company_category_code | type: object | 41 values: ['advertising' 'mobile' 'biotech' 'semiconductor' 'enterprise']\n", "col: company_country_code | type: object | 1 values: ['USA']\n", "col: company_state_code | type: object | 49 values: ['NY' 'CA' 'MA' 'PA' 'WA']\n", "col: company_region | type: object | 229 values: ['New York' 'SF Bay' 'Boston' 'Philadelphia' 'Seattle']\n", "col: company_city | type: object | 559 values: ['New York' 'Redwood City' 'Boston' 'Philadelphia' 'Menlo Park']\n", "col: investor_name | type: object | 623 values: ['AllianceBernstein' 'Allianz Private Equity Partners' 'Allianz ROSNO'\n", " 'Allied Capital' 'Allied Minds']\n", "col: investor_country_code | type: object | 40 values: ['USA' nan 'SWE' 'FRA' 'CHE']\n", "col: investor_state_code | type: object | 39 values: ['NY' nan 'MI' 'MA' 'IN']\n", "col: investor_region | type: object | 146 values: ['New York' 'unknown' 'Kalamazoo' 'Boston' 'Indianapolis']\n", "col: investor_city | type: object | 236 values: ['New York' nan 'Kalamazoo' 'Boston' 'Carmel']\n", "col: funding_round_type | type: object | 8 values: ['series-c+' 'venture' 'series-b' 'series-a' 'angel']\n", "col: funded_at | type: object | 1835 values: ['2008-04-14' '2008-04-21' '2008-08-26' '2009-05-21' '2007-05-09']\n", "col: funded_month | type: object | 163 values: ['2008-04' '2008-08' '2009-05' '2007-05' '2009-10']\n", "col: funded_quarter | type: object | 61 values: ['2008-Q2' '2008-Q3' '2009-Q2' '2007-Q2' '2009-Q4']\n", "\n", "CHUNK 3:\n", "col: company_name | type: object | 2922 values: ['Zag' 'FRX Polymers' 'Targeted Growth' 'Mendel Biotechnology' 'Yammer']\n", "col: company_category_code | type: object | 41 values: ['search' 'cleantech' 'biotech' 'enterprise' 'finance']\n", "col: company_country_code | type: object | 1 values: ['USA']\n", "col: company_state_code | type: object | 43 values: ['CA' 'MA' 'WA' 'NJ' 'OR']\n", "col: company_region | type: object | 208 values: ['Los Angeles' 'Boston' 'Seattle' 'SF Bay' 'unknown']\n", "col: company_city | type: object | 517 values: ['Santa Monica' 'Chelmsford' 'Seattle' 'Hayward' 'San Francisco']\n", "col: investor_name | type: object | 653 values: ['Capricorn Management' 'Capricorn Venture Partners' 'Capsalus'\n", " 'Capstone Partners' 'Capstone Ventures']\n", "col: investor_country_code | type: object | 32 values: ['USA' 'BEL' nan 'GBR' 'ISR']\n", "col: investor_state_code | type: object | 36 values: ['CT' nan 'GA' 'MA' 'OR']\n", "col: investor_region | type: object | 137 values: ['New York' 'Leuven' 'Atlanta' 'Boston' 'unknown']\n", "col: investor_city | type: object | 227 values: ['Greenwich' 'Leuven' 'Atlanta' 'Boston' nan]\n", "col: funding_round_type | type: object | 9 values: ['series-c+' 'series-a' 'venture' 'private-equity' 'series-b']\n", "col: funded_at | type: object | 1852 values: ['2009-02-05' '2009-07-27' '2007-02-07' '2007-06-13' '2011-06-06']\n", "col: funded_month | type: object | 159 values: ['2009-02' '2009-07' '2007-02' '2007-06' '2011-06']\n", "col: funded_quarter | type: object | 61 values: ['2009-Q1' '2009-Q3' '2007-Q1' '2007-Q2' '2011-Q2']\n", "\n", "CHUNK 4:\n", "col: company_name | type: object | 2847 values: ['Zettaset' 'BrightTag' 'Joyent' 'Exagen Diagnostics' 'Aspen Avionics']\n", "col: company_category_code | type: object | 42 values: ['enterprise' 'biotech' 'transportation' 'web' 'analytics']\n", "col: company_country_code | type: object | 1 values: ['USA']\n", "col: company_state_code | type: object | 47 values: ['CA' 'IL' 'NM' 'OR' 'UT']\n", "col: company_region | type: object | 182 values: ['SF Bay' 'Chicago' 'Albuquerque' 'Portland' 'Santa Barbara']\n", "col: company_city | type: object | 458 values: ['Mountain View' 'Chicago' 'San Francisco' 'Albuquerque' 'Portland']\n", "col: investor_name | type: object | 501 values: ['EPIC Ventures' 'Epiphany Ventures' 'Equis Capital Partners'\n", " 'Equitas Capital Advisors' 'Equitek Capital']\n", "col: investor_country_code | type: object | 30 values: ['USA' 'IND' 'GBR' nan 'FIN']\n", "col: investor_state_code | type: object | 37 values: ['UT' nan 'NJ' 'LA' 'OH']\n", "col: investor_region | type: object | 116 values: ['Salt Lake City' 'Mumbai' 'Hoboken' 'New Orleans' 'Cleveland']\n", "col: investor_city | type: object | 175 values: ['Salt Lake City' 'Mumbai' 'Hoboken' 'New Orleans' 'Beachwood']\n", "col: funding_round_type | type: object | 9 values: ['venture' 'series-c+' 'series-b' 'angel' 'series-a']\n", "col: funded_at | type: object | 1786 values: ['2011-07-12' '2011-08-15' '2012-01-23' '2012-04-25' '2012-09-10']\n", "col: funded_month | type: object | 146 values: ['2011-07' '2011-08' '2012-01' '2012-04' '2012-09']\n", "col: funded_quarter | type: object | 59 values: ['2011-Q3' '2012-Q1' '2012-Q2' '2012-Q3' '2012-Q4']\n", "\n", "CHUNK 5:\n", "col: company_name | type: object | 2941 values: ['ChargePoint, Inc.' 'Oncothyreon' 'Grouply' 'Blue Buzz Network'\n", " 'Glue Networks']\n", "col: company_category_code | type: object | 42 values: ['cleantech' 'biotech' 'network_hosting' 'news' 'enterprise']\n", "col: company_country_code | type: object | 1 values: ['USA']\n", "col: company_state_code | type: object | 46 values: ['CA' 'WA' 'NC' 'OH' 'NY']\n", "col: company_region | type: object | 204 values: ['SF Bay' 'Seattle' 'Sacramento' 'Los Angeles' 'Raleigh-Durham']\n", "col: company_city | type: object | 511 values: ['Campbell' 'Seattle' 'Redwood City' 'San Francisco' 'Sacramento']\n", "col: investor_name | type: object | 592 values: ['Hartford Ventures' 'Hartz Capital' 'Harvard Angel Group'\n", " 'Hasso Plattner Ventures' 'Hatch Partners, LLC']\n", "col: investor_country_code | type: object | 41 values: ['USA' nan 'DEU' 'ZAF' 'CHE']\n", "col: investor_state_code | type: object | 36 values: ['CT' nan 'WA' 'NC' 'CA']\n", "col: investor_region | type: object | 142 values: ['Hartford' 'unknown' 'TBD' 'Potsdam' 'Seattle']\n", "col: investor_city | type: object | 207 values: ['Hartford' nan 'Potsdam' 'Redmond' 'Raleigh']\n", "col: funding_round_type | type: object | 8 values: ['series-c+' 'venture' 'angel' 'series-a' 'series-b']\n", "col: funded_at | type: object | 1874 values: ['2010-09-08' '2010-09-24' '2007-06-01' '2008-01-01' '2008-01-14']\n", "col: funded_month | type: object | 165 values: ['2010-09' '2007-06' '2008-01' '2010-07' '2011-03']\n", "col: funded_quarter | type: object | 64 values: ['2010-Q3' '2007-Q2' '2008-Q1' '2011-Q1' '2012-Q3']\n", "\n", "CHUNK 6:\n", "col: company_name | type: object | 2909 values: ['Smith & Tinker' 'Sway' 'Shoutlet' 'Firefly Mobile' 'Fastmobile']\n", "col: company_category_code | type: object | 42 values: ['games_video' 'advertising' 'enterprise' 'mobile' 'software']\n", "col: company_country_code | type: object | 1 values: ['USA']\n", "col: company_state_code | type: object | 47 values: ['WA' 'WI' 'FL' 'IL' 'MA']\n", "col: company_region | type: object | 202 values: ['Seattle' 'Middleton' 'Madison' 'Miami' 'Chicago']\n", "col: company_city | type: object | 502 values: ['Bellevue' 'Middleton' 'Madison' 'Miami Beach' 'Rolling Meadows']\n", "col: investor_name | type: object | 566 values: ['Leo Capital Holdings' 'LeoGroup' 'Leopard Capital' 'Lerer Ventures'\n", " 'lessin X2, llc']\n", "col: investor_country_code | type: object | 34 values: ['USA' nan 'KHM' 'RUS' 'KOR']\n", "col: investor_state_code | type: object | 38 values: ['IL' nan 'NY' 'CA' 'MN']\n", "col: investor_region | type: object | 140 values: ['Chicago' 'unknown' 'Phnom Penh' 'New York' 'Moscow']\n", "col: investor_city | type: object | 207 values: ['Northbrook' nan 'Phnom Penh' 'New York' 'Moscow']\n", "col: funding_round_type | type: object | 8 values: ['venture' 'series-a' 'series-b' 'series-c+' 'angel']\n", "col: funded_at | type: object | 1822 values: ['2009-08-25' '2010-01-19' '2006-10-19' '2005-05-16' '2005-11-10']\n", "col: funded_month | type: object | 149 values: ['2009-08' '2010-01' '2006-10' '2005-05' '2005-11']\n", "col: funded_quarter | type: object | 60 values: ['2009-Q3' '2010-Q1' '2006-Q4' '2005-Q2' '2005-Q4']\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "CHUNK 7:\n", "col: company_name | type: object | 2987 values: ['Tricycle' 'Adapx' 'Action Engine' 'Vigilos' 'SprayCool']\n", "col: company_category_code | type: object | 42 values: [nan 'software' 'mobile' 'hardware' 'public_relations']\n", "col: company_country_code | type: object | 3 values: ['USA' nan '2008-02']\n", "col: company_state_code | type: object | 49 values: ['TN' 'WA' 'OR' 'NV' 'CA']\n", "col: company_region | type: object | 208 values: ['Chattanooga' 'Seattle' 'Liberty Lake' 'Portland' 'Spokane Valley']\n", "col: company_city | type: object | 530 values: ['Chattanooga' 'Seattle' 'Bellevue' 'Liberty Lake' 'Beaverton']\n", "col: investor_name | type: object | 688 values: ['Northwest Georgia Bank' 'Northwest Technology Ventures'\n", " 'Northwest Venture Associates' 'Northwestern Capital' 'Northzone']\n", "col: investor_country_code | type: object | 36 values: ['USA' nan 'SWE' 'GBR' 'CAN']\n", "col: investor_state_code | type: object | 37 values: ['GA' 'OR' 'WA' nan 'MN']\n", "col: investor_region | type: object | 156 values: ['Ringgold' 'Portland' 'Spokane' 'unknown' 'Stockholm']\n", "col: investor_city | type: object | 228 values: ['Ringgold' 'Beaverton' 'Spokane' nan 'Stockholm']\n", "col: funding_round_type | type: object | 9 values: ['other' 'venture' 'series-c+' 'series-b' 'angel']\n", "col: funded_at | type: object | 1882 values: ['2003-01-01' '2012-07-30' '2013-04-11' '2007-07-01' '2009-04-28']\n", "col: funded_month | type: object | 159 values: ['2003-01' '2012-07' '2013-04' '2007-07' '2009-04']\n", "col: funded_quarter | type: object | 62 values: ['2003-Q1' '2012-Q3' '2013-Q2' '2007-Q3' '2009-Q2']\n", "\n", "CHUNK 8:\n", "col: company_name | type: object | 3077 values: ['Ping Identity Corporation' 'MuleSoft' 'AmberPoint' 'Questra'\n", " 'Reva Systems']\n", "col: company_category_code | type: object | 41 values: ['security' 'enterprise' 'web' 'software' 'hardware']\n", "col: company_country_code | type: object | 1 values: ['USA']\n", "col: company_state_code | type: object | 48 values: ['CO' 'CA' 'MA' 'WA' 'TX']\n", "col: company_region | type: object | 208 values: ['Denver' 'SF Bay' 'Boston' 'Seattle' 'Austin']\n", "col: company_city | type: object | 497 values: ['Denver' 'San Francisco' 'Oakland' 'Redwood City' 'Westford']\n", "col: investor_name | type: object | 607 values: ['SAP Ventures' 'Saratoga Ventures' 'Sark Fund' 'Sarofim Fayez and Co.'\n", " 'SAS Investors']\n", "col: investor_country_code | type: object | 32 values: ['USA' nan 'SAU' 'IND' 'CYM']\n", "col: investor_state_code | type: object | 34 values: ['CA' nan 'TX' 'MA' 'NH']\n", "col: investor_region | type: object | 151 values: ['SF Bay' 'unknown' 'Houston' 'TBD' 'Dallas']\n", "col: investor_city | type: object | 212 values: ['Palo Alto' 'Santa Clara' nan 'Houston' 'Dallas']\n", "col: funding_round_type | type: object | 8 values: ['series-c+' 'series-b' 'venture' 'other' 'series-a']\n", "col: funded_at | type: object | 1852 values: ['2006-10-11' '2010-03-30' '2006-03-28' '2006-08-15' '2006-10-02']\n", "col: funded_month | type: object | 162 values: ['2006-10' '2010-03' '2006-03' '2006-08' '2010-05']\n", "col: funded_quarter | type: object | 62 values: ['2006-Q4' '2010-Q1' '2006-Q1' '2006-Q3' '2010-Q2']\n", "\n", "CHUNK 9:\n", "col: company_name | type: object | 2999 values: ['GuardianEdge Technologies' 'NetBase Solutions' 'GigaLogix' 'Axcient'\n", " 'Apsalar']\n", "col: company_category_code | type: object | 43 values: ['software' 'social' 'public_relations' 'enterprise' 'advertising']\n", "col: company_country_code | type: object | 1 values: ['USA']\n", "col: company_state_code | type: object | 46 values: ['CA' 'GA' 'MA' 'MN' 'WA']\n", "col: company_region | type: object | 177 values: ['SF Bay' 'Atlanta' 'Boston' 'Minneapolis' 'Los Angeles']\n", "col: company_city | type: object | 478 values: ['San Mateo' 'Mountain View' 'Oakland' 'SAN FRANCISCO' 'Pleasanton']\n", "col: investor_name | type: object | 1018 values: ['Thomvest Ventures' 'Thorner Ventures' 'Three Arch Partners'\n", " 'Thrive Capital' 'TI Ventures']\n", "col: investor_country_code | type: object | 28 values: ['USA' nan 'PAK' 'IND' 'JPN']\n", "col: investor_state_code | type: object | 35 values: ['CA' nan 'NY' 'TX' 'CT']\n", "col: investor_region | type: object | 112 values: ['SF Bay' 'TBD' 'New York' 'Dallas' 'unknown']\n", "col: investor_city | type: object | 168 values: ['Redwood City' nan 'Portola Valley' 'New York' 'Dallas']\n", "col: funding_round_type | type: object | 7 values: ['other' 'series-c+' 'series-a' 'series-b' 'angel']\n", "col: funded_at | type: object | 1783 values: ['2007-07-23' '2010-03-02' '2007-02-01' '2010-07-14' '2010-11-01']\n", "col: funded_month | type: object | 154 values: ['2007-07' '2010-03' '2007-02' '2010-07' '2010-11']\n", "col: funded_quarter | type: object | 59 values: ['2007-Q3' '2010-Q1' '2007-Q1' '2010-Q3' '2010-Q4']\n", "\n", "CHUNK 10:\n", "col: company_name | type: object | 2045 values: ['Bread' 'Nuiku' 'Savvy Cellar Wines' 'Enigma Technologies' 'Yext']\n", "col: company_category_code | type: object | 43 values: ['advertising' 'software' 'other' 'analytics' 'web']\n", "col: company_country_code | type: object | 1 values: ['USA']\n", "col: company_state_code | type: object | 42 values: ['CA' 'WA' 'NY' 'TX' 'CO']\n", "col: company_region | type: object | 128 values: ['SF Bay' 'Seattle' 'New York' 'Los Angeles' 'Austin']\n", "col: company_city | type: object | 326 values: ['San Francisco' 'Redmond' 'Redwood City' 'New York' 'West Hollywood']\n", "col: investor_name | type: object | 2361 values: ['Brendan Wallace' 'Brent Frei' 'Brent Harrison' 'Brent Hurley'\n", " 'Brent Metz']\n", "col: investor_region | type: object | 1 values: ['unknown']\n", "col: funding_round_type | type: object | 7 values: ['angel' 'series-a' 'series-b' 'venture' 'private-equity']\n", "col: funded_at | type: object | 1056 values: ['2011-02-01' '2013-09-17' '2005-01-01' '2013-02-27' '2006-11-01']\n", "col: funded_month | type: object | 135 values: ['2011-02' '2013-09' '2005-01' '2013-02' '2006-11']\n", "col: funded_quarter | type: object | 55 values: ['2011-Q1' '2013-Q3' '2005-Q1' '2013-Q1' '2006-Q4']\n", "\n", "CHUNK 11:\n", "col: company_name | type: object | 1541 values: ['NuORDER' 'ChaCha' 'Binfire' 'Unified Color' 'HItviews']\n", "col: company_category_code | type: object | 42 values: ['fashion' 'advertising' 'software' 'social' 'ecommerce']\n", "col: company_country_code | type: object | 1 values: ['USA']\n", "col: company_state_code | type: object | 40 values: ['CA' 'IN' 'FL' 'NY' 'MO']\n", "col: company_region | type: object | 106 values: ['Los Angeles' 'Indianapolis' 'Bocat Raton' 'SF Bay' 'New York']\n", "col: company_city | type: object | 271 values: ['West Hollywood' 'Carmel' 'Bocat Raton' 'South San Frnacisco'\n", " 'New York City']\n", "col: investor_name | type: object | 1337 values: ['Mortimer Singer' 'Morton Meyerson' 'Moshe Ariel' 'Mr. Andrew Oung'\n", " 'multiple parties']\n", "col: investor_region | type: object | 1 values: ['unknown']\n", "col: funding_round_type | type: object | 7 values: ['series-a' 'series-b' 'angel' 'other' 'venture']\n", "col: funded_at | type: object | 886 values: ['2012-10-01' '2007-10-01' '2008-04-18' '2010-01-01' '2007-11-29']\n", "col: funded_month | type: object | 126 values: ['2012-10' '2007-10' '2008-04' '2010-01' '2007-11']\n", "col: funded_quarter | type: object | 51 values: ['2012-Q4' '2007-Q4' '2008-Q2' '2010-Q1' '2012-Q2']\n" ] } ], "source": [ "chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000, \n", " encoding='ISO-8859-1', usecols=keep_cols)\n", "count = 0\n", "#total_num_rows = 0\n", "str_cols_vc = {}\n", "\n", "for chunk in chunks_it:\n", " count += 1\n", " #total_num_rows += chunk.shape[0]\n", " \n", " print(f'\\nCHUNK {count}:')\n", " #print(chunk.dtypes.value_counts())\n", " \n", " obj_cols = chunk.select_dtypes(include=['object'])\n", " #print(obj_cols.value_counts())\n", " #print(type(obj_cols))\n", " for col in obj_cols:\n", " current_col_vc = obj_cols[col].value_counts()\n", " \n", " if col in str_cols_vc:\n", " str_cols_vc[col].append(current_col_vc)\n", " else:\n", " str_cols_vc[col] = [current_col_vc]\n", " \n", " unique_values = chunk[col].unique()\n", " print(f'col: {col} | type: {chunk[col].dtype} |'\\\n", " f' {len(unique_values)} values: {unique_values[:5]}')\n", "\n", "#print(f'Total Num Rows: {total_num_rows}')\n", "\n", "## Combine the value count results\n", "combined_vcs = {}\n", "\n", "for col, vc_list in str_cols_vc.items():\n", " combined_vc = pd.concat(vc_list)\n", " final_vc = combined_vc.groupby(combined_vc.index).sum()\n", " combined_vcs[col] = final_vc\n", "\n", "#pp.pprint(combined_vcs)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dict_keys(['company_name', 'company_category_code', 'company_country_code', 'company_state_code', 'company_region', 'company_city', 'investor_name', 'investor_country_code', 'investor_state_code', 'investor_region', 'investor_city', 'funding_round_type', 'funded_at', 'funded_month', 'funded_quarter', 'funded_year', 'raised_amount_usd'])\n", "\n", "All Value Counts:\n", "{'company_category_code': 2/7/08 1\n", "advertising 3200\n", "analytics 1863\n", "automotive 164\n", "biotech 4951\n", "cleantech 1948\n", "consulting 233\n", "design 55\n", "ecommerce 2168\n", "education 783\n", "enterprise 4489\n", "fashion 368\n", "finance 931\n", "games_video 1893\n", "government 10\n", "hardware 1537\n", "health 670\n", "hospitality 331\n", "legal 87\n", "local 22\n", "manufacturing 310\n", "medical 1315\n", "messaging 452\n", "mobile 4067\n", "music 287\n", "nanotech 216\n", "network_hosting 1075\n", "news 363\n", "nonprofit 149\n", "other 481\n", "pets 43\n", "photo_video 230\n", "public_relations 659\n", "real_estate 190\n", "search 632\n", "security 996\n", "semiconductor 1292\n", "social 920\n", "software 7243\n", "sports 121\n", "transportation 130\n", "travel 337\n", "web 5015\n", "Name: company_category_code, dtype: int64,\n", " 'company_city': (Oct. 01, 2011 - Sep. 30, 2012) 2\n", "10000000 1\n", "ALLSTON 2\n", "ATLANTA 2\n", "AUSTIN 2\n", "Acton 90\n", "Addison 29\n", "Agoura Hills 13\n", "Akron 12\n", "Alachua 19\n", "Alameda 44\n", "Albuquerque 83\n", "Alexandria 40\n", "Aliso Viejo 55\n", "Aliso Viejo, 6\n", "Allen 14\n", "Allen park 6\n", "Allentown 20\n", "Alliance 1\n", "Allston 4\n", "Alpharetta 83\n", "Alpharetta, 1\n", "Altadena 17\n", "Altamonte Springs 1\n", "Alviso 1\n", "American Fork 40\n", "Ames 6\n", "Amesbury 5\n", "Amherst 9\n", "Anaheim 3\n", " ..\n", "Worcester 10\n", "Wyckoff 2\n", "Wynnewood 1\n", "Yaphank 2\n", "Yardley 1\n", "Yonkers 7\n", "Yorba Linda 2\n", "Youngstown 2\n", "Ypsilanti 1\n", "Zagreb 2\n", "Zeeland 2\n", "ann arbor 1\n", "atlanta 1\n", "bluffton 1\n", "columbus 1\n", "foster city 2\n", "hermosa beach 3\n", "irvine 3\n", "los angeles 11\n", "miami 4\n", "new york 5\n", "omaha 1\n", "pahoa 1\n", "san francisco 1\n", "santa monica 19\n", "sausalito 1\n", "scottsdale 2\n", "seattle 1\n", "w. babylon 2\n", "westminster 1\n", "Name: company_city, Length: 1229, dtype: int64,\n", " 'company_country_code': 2008-02 1\n", "USA 52868\n", "Name: company_country_code, dtype: int64,\n", " 'company_name': #waywire 5\n", "0xdata 1\n", "1-800-DENTIST 2\n", "1000memories 10\n", "100Plus 4\n", "1010data 1\n", "11i Solutions 1\n", "121nexus 5\n", "12Society 5\n", "1366 Technologies 12\n", "140 Proof 6\n", "15Five 8\n", "170 Systems 1\n", "1Cast 1\n", "1World Online 1\n", "1stdibs 5\n", "2080 Media 8\n", "20JEANS 5\n", "20x200 17\n", "23andMe 16\n", "24/7 Card 2\n", "247 Techies 2\n", "24M Technologies 3\n", "24PageBooks 1\n", "27 Perry 1\n", "27 bards 1\n", "2U 18\n", "2nd Story Software, Inc. 1\n", "2sms 1\n", "3-V Biosciences 7\n", " ..\n", "uromovie 1\n", "userfox 3\n", "ustyme 1\n", "vArmour Networks 1\n", "vSocial 3\n", "vcopious Software 3\n", "via680 1\n", "viaCycle 2\n", "vidIQ 7\n", "videoNEXT 2\n", "vline 2\n", "walkby 2\n", "webmynd 9\n", "wedgies 4\n", "weeSpring 1\n", "weendy 5\n", "whereIstand.com 1\n", "wmbly 1\n", "woodpellets.com 3\n", "writewith 1\n", "wunderloop 4\n", "xAd 4\n", "xG Technology 1\n", "xkoto 5\n", "y prime 1\n", "yaM Labs 1\n", "ybuy 4\n", "zozi 38\n", "zulily 6\n", "zuuka! 3\n", "Name: company_name, Length: 11573, dtype: int64,\n", " 'company_region': 2008 1\n", "Akron 11\n", "Alachua 19\n", "Albuquerque 83\n", "Allentown 20\n", "Alliance 1\n", "Ames 1\n", "Amherst 9\n", "Angier 4\n", "Appleton 3\n", "Asheville 3\n", "Ashford 1\n", "Ashland 4\n", "Atlanta 558\n", "Atlantic Highlands 8\n", "Auburn Hlls 2\n", "Augusta 1\n", "Aurora 3\n", "Austin 947\n", "Avon 1\n", "B 5\n", "Bakersfield 1\n", "Bala Cynwyd 6\n", "Baltimore 95\n", "Bangalore 3\n", "Bar Harbor 1\n", "Barre 2\n", "Baton Rouge 14\n", "Battleground 3\n", "Bedford 8\n", " ... \n", "West Trenton 2\n", "Westfield 14\n", "Westport 17\n", "Whippany 6\n", "White River 3\n", "Whiting 1\n", "Wilbraham 3\n", "Williamsburg 2\n", "Williamstown 9\n", "Williamsville 3\n", "Williston 2\n", "Willmington 25\n", "Wilmington 12\n", "Wilton 4\n", "Windber 2\n", "Windsor 9\n", "Winooski 3\n", "Winston Salem 6\n", "Woodbury 3\n", "Woodstock 5\n", "Woonsocket 1\n", "Wooster 3\n", "Yaphank 2\n", "Youngstown 2\n", "Zagreb 2\n", "Zeeland 2\n", "piscataway 9\n", "unknown 102\n", "w. babylon 2\n", "wilkes barre 1\n", "Name: company_region, Length: 546, dtype: int64,\n", " 'company_state_code': 2008-Q1 1\n", "AL 17\n", "AR 33\n", "AZ 261\n", "CA 27054\n", "CO 977\n", "CT 333\n", "DC 219\n", "DE 28\n", "FL 550\n", "GA 568\n", "HI 31\n", "IA 33\n", "ID 32\n", "IL 911\n", "IN 143\n", "KS 65\n", "KY 45\n", "LA 43\n", "MA 4889\n", "MD 600\n", "ME 94\n", "MI 265\n", "MN 205\n", "MO 182\n", "MS 33\n", "MT 19\n", "NC 596\n", "NE 39\n", "NH 159\n", "NJ 691\n", "NM 90\n", "NV 105\n", "NY 5890\n", "OH 439\n", "OK 29\n", "OR 466\n", "PA 918\n", "RI 191\n", "SC 97\n", "SD 3\n", "TN 176\n", "TX 1619\n", "UT 419\n", "VA 704\n", "VT 55\n", "WA 1888\n", "WI 163\n", "WV 9\n", "WY 1\n", "Name: company_state_code, dtype: int64,\n", " 'funded_at': 1987-01-01 1\n", "1987-06-16 1\n", "1995-04-01 1\n", "1995-07-01 1\n", "1995-11-30 3\n", "1996-01-01 2\n", "1997-03-01 1\n", "1997-06-01 1\n", "1997-11-10 3\n", "1998-01-01 1\n", "1998-02-09 3\n", "1998-04-01 1\n", "1998-05-01 4\n", "1998-07-01 1\n", "1998-08-01 1\n", "1999-01-01 16\n", "1999-03-01 10\n", "1999-04-01 13\n", "1999-04-15 1\n", "1999-05-01 3\n", "1999-06-01 3\n", "1999-06-07 2\n", "1999-07-01 2\n", "1999-07-07 1\n", "1999-07-22 1\n", "1999-08-02 2\n", "1999-08-30 2\n", "1999-09-01 8\n", "1999-09-02 1\n", "1999-10-01 14\n", " ..\n", "2013-08-27 36\n", "2013-08-28 35\n", "2013-08-29 55\n", "2013-08-30 11\n", "2013-09-01 27\n", "2013-09-03 39\n", "2013-09-04 68\n", "2013-09-05 45\n", "2013-09-06 13\n", "2013-09-07 5\n", "2013-09-09 23\n", "2013-09-10 40\n", "2013-09-11 50\n", "2013-09-12 38\n", "2013-09-13 22\n", "2013-09-15 4\n", "2013-09-16 38\n", "2013-09-17 60\n", "2013-09-18 41\n", "2013-09-19 38\n", "2013-09-20 20\n", "2013-09-22 17\n", "2013-09-23 34\n", "2013-09-24 45\n", "2013-09-25 36\n", "2013-09-26 56\n", "2013-09-27 51\n", "2013-09-29 2\n", "2013-09-30 34\n", "2013-10-01 5\n", "Name: funded_at, Length: 2808, dtype: int64,\n", " 'funded_month': 1987-01 1\n", "1987-06 1\n", "1995-04 1\n", "1995-07 1\n", "1995-11 3\n", "1996-01 2\n", "1997-03 1\n", "1997-06 1\n", "1997-11 3\n", "1998-01 1\n", "1998-02 3\n", "1998-04 1\n", "1998-05 4\n", "1998-07 1\n", "1998-08 1\n", "1999-01 16\n", "1999-03 10\n", "1999-04 14\n", "1999-05 3\n", "1999-06 5\n", "1999-07 4\n", "1999-08 4\n", "1999-09 9\n", "1999-10 16\n", "1999-11 3\n", "1999-12 11\n", "2000-01 21\n", "2000-02 8\n", "2000-03 13\n", "2000-04 35\n", " ... \n", "2011-05 691\n", "2011-06 760\n", "2011-07 641\n", "2011-08 752\n", "2011-09 837\n", "2011-10 624\n", "2011-11 666\n", "2011-12 667\n", "2012-01 1005\n", "2012-02 567\n", "2012-03 651\n", "2012-04 696\n", "2012-05 805\n", "2012-06 926\n", "2012-07 779\n", "2012-08 608\n", "2012-09 666\n", "2012-10 818\n", "2012-11 669\n", "2012-12 583\n", "2013-01 836\n", "2013-02 702\n", "2013-03 707\n", "2013-04 654\n", "2013-05 784\n", "2013-06 913\n", "2013-07 927\n", "2013-08 741\n", "2013-09 846\n", "2013-10 5\n", "Name: funded_month, Length: 192, dtype: int64,\n", " 'funded_quarter': 1987-Q1 1\n", "1987-Q2 1\n", "1995-Q2 1\n", "1995-Q3 1\n", "1995-Q4 3\n", "1996-Q1 2\n", "1997-Q1 1\n", "1997-Q2 1\n", "1997-Q4 3\n", "1998-Q1 4\n", "1998-Q2 5\n", "1998-Q3 2\n", "1999-Q1 26\n", "1999-Q2 22\n", "1999-Q3 17\n", "1999-Q4 30\n", "2000-Q1 42\n", "2000-Q2 66\n", "2000-Q3 28\n", "2000-Q4 40\n", "2001-Q1 38\n", "2001-Q2 20\n", "2001-Q3 26\n", "2001-Q4 32\n", "2002-Q1 40\n", "2002-Q2 35\n", "2002-Q3 37\n", "2002-Q4 33\n", "2003-Q1 54\n", "2003-Q2 50\n", " ... \n", "2006-Q3 942\n", "2006-Q4 915\n", "2007-Q1 1269\n", "2007-Q2 1189\n", "2007-Q3 1280\n", "2007-Q4 1043\n", "2008-Q1 1476\n", "2008-Q2 1414\n", "2008-Q3 1061\n", "2008-Q4 986\n", "2009-Q1 1074\n", "2009-Q2 1077\n", "2009-Q3 1020\n", "2009-Q4 1539\n", "2010-Q1 1578\n", "2010-Q2 1575\n", "2010-Q3 1651\n", "2010-Q4 1491\n", "2011-Q1 2026\n", "2011-Q2 2046\n", "2011-Q3 2230\n", "2011-Q4 1957\n", "2012-Q1 2223\n", "2012-Q2 2427\n", "2012-Q3 2053\n", "2012-Q4 2070\n", "2013-Q1 2245\n", "2013-Q2 2351\n", "2013-Q3 2514\n", "2013-Q4 5\n", "Name: funded_quarter, Length: 72, dtype: int64,\n", " 'funded_year': 1987.0 2\n", "1995.0 5\n", "1996.0 2\n", "1997.0 5\n", "1998.0 11\n", "1999.0 95\n", "2000.0 176\n", "2001.0 116\n", "2002.0 145\n", "2003.0 172\n", "2004.0 411\n", "2005.0 2990\n", "2006.0 3867\n", "2007.0 4781\n", "2008.0 4937\n", "2009.0 4710\n", "2010.0 6295\n", "2011.0 8259\n", "2012.0 8773\n", "2013.0 7115\n", "Name: funded_year, dtype: int64,\n", " 'funding_round_type': angel 8989\n", "crowdfunding 5\n", "other 964\n", "post-ipo 33\n", "private-equity 357\n", "series-a 13938\n", "series-b 8794\n", "series-c+ 10870\n", "venture 8917\n", "Name: funding_round_type, dtype: int64,\n", " 'investor_city': (Oct. 01, 2011 - Sep. 30, 2012) 2\n", "ABERDEEN 2\n", "ADELAIDE 2\n", "ALBURY 1\n", "AUSTIN 1\n", "Aalborg 1\n", "Abbotsford 1\n", "Abu Dhabi 2\n", "Accra 1\n", "Addison 4\n", "Adelaide 2\n", "Agoura Hills 4\n", "Aichi 1\n", "Akron 1\n", "Alamo 12\n", "Albany 23\n", "Alberta 1\n", "Albuquerque 13\n", "Alexandria 98\n", "Aliso Viejo 5\n", "Allentown 1\n", "Alpharetta 6\n", "Alphen aan den Rijn 8\n", "Alpine 1\n", "Altadena 15\n", "American Falls 1\n", "Ames 2\n", "Amesbury 1\n", "Amherst 18\n", "Amman 5\n", " ..\n", "Wilsonville 1\n", "Wilton 3\n", "Windermer 1\n", "Windermere 1\n", "Winnetka 2\n", "Winnipeg 2\n", "Winston-Salem 10\n", "Winter Park 18\n", "Wood Dale 1\n", "Woodcliff Lake 2\n", "Woodside 71\n", "Wooster 1\n", "Wrexham 1\n", "Wuppertal 3\n", "Wynne 1\n", "Yardley 2\n", "Yvette cedex 1\n", "Zagreb 1\n", "Zeist 1\n", "Zuerich 2\n", "Zug 1\n", "Zurich 41\n", "ZŸrich 1\n", "ZÌ_rich 2\n", "_stanbul 1\n", "cape town 1\n", "herndon 2\n", "kansas city 1\n", "paris 1\n", "portland metro 1\n", "Name: investor_city, Length: 990, dtype: int64,\n", " 'investor_country_code': ARE 7\n", "ARG 14\n", "AUS 163\n", "BEL 44\n", "BGR 4\n", "BHR 10\n", "BHS 1\n", "BMU 6\n", "BRA 10\n", "CAN 432\n", "CHE 184\n", "CHL 8\n", "CHN 143\n", "CYM 15\n", "CYP 2\n", "CZE 2\n", "DEU 349\n", "DNK 70\n", "EGY 1\n", "ESP 45\n", "EST 8\n", "FIN 36\n", "FRA 229\n", "GBR 825\n", "GHA 1\n", "GIB 3\n", "GRC 1\n", "HKG 53\n", "HRV 1\n", "HUN 9\n", " ... \n", "LIE 1\n", "LTU 1\n", "LUX 32\n", "MAR 1\n", "MEX 5\n", "MUS 23\n", "MYS 15\n", "NGA 6\n", "NLD 72\n", "NOR 37\n", "NZL 14\n", "PAK 1\n", "PHL 6\n", "POL 1\n", "PRI 2\n", "PRT 13\n", "QAT 2\n", "ROM 2\n", "RUS 63\n", "SAU 2\n", "SGP 105\n", "SVN 3\n", "SWE 120\n", "TUR 4\n", "TWN 47\n", "UKR 9\n", "USA 36574\n", "VNM 5\n", "WSM 4\n", "ZAF 5\n", "Name: investor_country_code, Length: 72, dtype: int64,\n", " 'investor_name': .406 Ventures 36\n", "1-800-FLOWERS.COM 1\n", "10 Others 1\n", "10X Fund LP 11\n", "10X Venture Partners 2\n", "10Xelerator 7\n", "1in10 Ventures 1\n", "1to1 Venture Partners 1\n", "2010 NYU Stern Business Plan Competition 1\n", "2020 Ventures 5\n", "212 Capital Partners 1\n", "21Ventures 12\n", "22Hundred Group 1\n", "2M Companies 2\n", "2x Consumer Products Growth Partners 1\n", "3 Others 1\n", "3 Wise Ventures 1\n", "313 Ventures 1\n", "37 Ventures 2\n", "3Com Ventures 1\n", "3M New Ventures 6\n", "3TS Capital Partners 2\n", "3TS Cisco Growth Fund 1\n", "3V SourceOne Ventures 1\n", "3i Deutschland Gesellschaft fŸr Industriebeteiligungen 1\n", "3i Group 46\n", "3taps 1\n", "3x5 Special Opportunity Partners 3\n", "42 Ventures 1\n", "4th Level Ventures 2\n", " ..\n", "impact investment fund 1\n", "incTANK Ventures 1\n", "international private investors 1\n", "jVen Capital 2\n", "jamell Givens 1\n", "john bialk 1\n", "kbs+ Ventures 12\n", "lessin X2, llc 1\n", "lilian samartino 1\n", "m8 Capital 7\n", "merrickhanna ventures 1\n", "michael gartenberg 1\n", "mike schell 1\n", "multiple parties 2\n", "netprice.com 3\n", "reInvention Capital 2\n", "redbutler 1\n", "reinmkr capital 9\n", "ru-Net Holdings 3\n", "sheila gulati 1\n", "thetime 3\n", "undisclosed investor 1\n", "vSpring Capital 37\n", "venBio 1\n", "venture capitals 6\n", "ventureblue Capital 2\n", "venturecapital.de 1\n", "yet2Ventures 1\n", "zohar israel 1\n", "zulily 1\n", "Name: investor_name, Length: 10465, dtype: int64,\n", " 'investor_region': Aalborg 1\n", "Abbotsford 1\n", "Aberdeen 2\n", "Abu Dhabi 2\n", "Accra 1\n", "Adelaide 4\n", "Aichi 1\n", "Akron 1\n", "Albany 23\n", "Alberta 1\n", "Albuquerque 13\n", "Allentown 1\n", "Alphen aan den Rijn 8\n", "American Falls 1\n", "Ames 1\n", "Amherst 18\n", "Amman 5\n", "Amsterdam 25\n", "Antwerpen 14\n", "Asheville 2\n", "Aspen 1\n", "Athens 19\n", "Atlanta 266\n", "Atsugi-City 1\n", "Auburn 2\n", "Auckland 10\n", "Auckland Central 1\n", "Austin 368\n", "Australia - Other 6\n", "Avon 7\n", " ... \n", "West Lafayette 2\n", "West Point 2\n", "Weston 1\n", "Westport 38\n", "Whitehouse Station 1\n", "Wichita 3\n", "Williamstown 66\n", "Willmington 20\n", "Wilmington 8\n", "Wilton 3\n", "Windermer 1\n", "Winnipeg 2\n", "Winston Salem 10\n", "Woodcliff Lake 2\n", "Woodstock 1\n", "Wooster 1\n", "Wrexham 1\n", "Wuppertal 3\n", "Wynne 1\n", "Zagreb 1\n", "Zuerich 2\n", "Zug 1\n", "Zurich 48\n", "chicago 7\n", "freehold 3\n", "mkalles 1\n", "portland metro 1\n", "regina 1\n", "unknown 12285\n", "wilkes barre 1\n", "Name: investor_region, Length: 585, dtype: int64,\n", " 'investor_state_code': AL 67\n", "AR 14\n", "AZ 84\n", "CA 18405\n", "CO 729\n", "CT 577\n", "DC 323\n", "DE 20\n", "FL 242\n", "GA 274\n", "HI 13\n", "IA 9\n", "ID 40\n", "IL 992\n", "IN 88\n", "KS 13\n", "KY 54\n", "LA 15\n", "MA 3619\n", "MD 486\n", "ME 41\n", "MI 315\n", "MN 101\n", "MO 148\n", "MS 6\n", "MT 1\n", "NC 339\n", "ND 5\n", "NE 35\n", "NH 51\n", "NJ 456\n", "NM 41\n", "NV 38\n", "NY 4404\n", "OH 309\n", "OK 21\n", "OR 85\n", "PA 762\n", "RI 92\n", "SC 34\n", "SD 9\n", "TN 147\n", "TX 816\n", "UT 200\n", "VA 579\n", "VT 26\n", "WA 847\n", "WI 82\n", "WV 4\n", "WY 3\n", "Name: investor_state_code, dtype: int64,\n", " 'raised_amount_usd': 1.000000e+03 3\n", "2.000000e+03 2\n", "2.100000e+03 1\n", "3.000000e+03 3\n", "5.000000e+03 8\n", "5.500000e+03 1\n", "5.600000e+03 1\n", "6.000000e+03 1\n", "6.500000e+03 2\n", "8.000000e+03 2\n", "8.300000e+03 1\n", "1.000000e+04 30\n", "1.156700e+04 2\n", "1.160000e+04 1\n", "1.180000e+04 1\n", "1.200000e+04 18\n", "1.300000e+04 2\n", "1.320000e+04 1\n", "1.400000e+04 6\n", "1.500000e+04 79\n", "1.528200e+04 1\n", "1.600000e+04 2\n", "1.650000e+04 1\n", "1.700000e+04 5\n", "1.800000e+04 47\n", "1.833300e+04 1\n", "1.890000e+04 1\n", "1.900000e+04 1\n", "2.000000e+04 133\n", "2.200000e+04 2\n", " ... \n", "2.250000e+08 9\n", "2.280000e+08 1\n", "2.380000e+08 2\n", "2.400000e+08 1\n", "2.490000e+08 1\n", "2.500000e+08 16\n", "2.580000e+08 2\n", "2.650000e+08 2\n", "2.750000e+08 5\n", "2.830000e+08 1\n", "3.000000e+08 17\n", "3.160000e+08 1\n", "3.190000e+08 5\n", "3.500000e+08 4\n", "3.601000e+08 4\n", "4.000000e+08 6\n", "4.500000e+08 7\n", "4.650000e+08 1\n", "4.900000e+08 4\n", "5.290000e+08 1\n", "5.350000e+08 1\n", "5.650000e+08 1\n", "7.500000e+08 1\n", "9.200000e+08 5\n", "9.500000e+08 10\n", "1.000000e+09 1\n", "1.050000e+09 2\n", "1.500000e+09 8\n", "2.600000e+09 1\n", "3.200000e+09 5\n", "Name: raised_amount_usd, Length: 1458, dtype: int64}\n" ] } ], "source": [ "chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000, \n", " encoding='ISO-8859-1', usecols=keep_cols)\n", "count = 0\n", "#total_num_rows = 0\n", "str_cols_vc = {}\n", "\n", "for chunk in chunks_it:\n", " count += 1\n", " #total_num_rows += chunk.shape[0]\n", " \n", " #print(f'\\nChunk {count}:')\n", " #print(chunk.dtypes.value_counts())\n", " \n", " #obj_cols = chunk.select_dtypes(include=['object'])\n", " #print(obj_cols.value_counts())\n", " #print(type(obj_cols))\n", " #columns = ['investor_city', 'investor_country_code',\n", " # 'investor_state_code']\n", " for col in chunk:\n", " current_col_vc = chunk[col].value_counts()\n", " \n", " if col in str_cols_vc:\n", " str_cols_vc[col].append(current_col_vc)\n", " else:\n", " str_cols_vc[col] = [current_col_vc]\n", "\n", "#print(f'Total Num Rows: {total_num_rows}')\n", "\n", "## Combine the value count results\n", "combined_vcs = {}\n", "\n", "for col, vc_list in str_cols_vc.items():\n", " combined_vc = pd.concat(vc_list)\n", " final_vc = combined_vc.groupby(combined_vc.index).sum()\n", " combined_vcs[col] = final_vc\n", "\n", "pp.pprint(combined_vcs.keys())\n", "\n", "print('\\nAll Value Counts:')\n", "pp.pprint(combined_vcs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Observations**:\n", "\n", "- The `funded_month`, `funded_quarter`, `funded_year` columns seem unnecessary as we already have a `funded_at` column which has the exact funding date. The other 2 columns can be easily derived from the latter column.\n", "- The `funded_at` column can be converted into a datetime type column\n", "- There are more opportunities to clean up the data further. As a potential extension of this project, a deeper dive into the data to further understand and clean up the data would be good." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total Rows: 52870\n" ] } ], "source": [ "chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000, \n", " encoding='ISO-8859-1', usecols=keep_cols)\n", "total_num_rows = 0\n", "for chunk in chunks_it:\n", " total_num_rows += chunk.shape[0]\n", "\n", "print(\"Total Rows: {0}\".format(total_num_rows))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['company_name', 'company_category_code', 'company_country_code', 'company_state_code', 'company_region', 'company_city', 'investor_name', 'investor_country_code', 'investor_state_code', 'investor_region', 'investor_city', 'funding_round_type', 'funded_at', 'raised_amount_usd']\n" ] } ], "source": [ "# Drop columns representing URLs or containing too many missing values ( > 90%) \n", "# or the 2 unnecessary fund date related columns we identified\n", "drop_cols = ['funded_month', 'funded_quarter', 'funded_year']\n", "keep_cols = chunk.columns.drop(drop_cols)\n", "\n", "print(keep_cols.tolist())" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "CHUNK 1:\n", "company_name : 0.32 (object) vs. 0.32 (object)\n", "company_category_code : 0.31 (object) vs. 0.31 (object)\n", "company_country_code : 0.29 (object) vs. 0.29 (object)\n", "company_state_code : 0.28 (object) vs. 0.28 (object)\n", "company_region : 0.31 (object) vs. 0.31 (object)\n", "company_city : 0.32 (object) vs. 0.32 (object)\n", "investor_name : 0.34 (object) vs. 0.34 (object)\n", "investor_country_code : 0.27 (object) vs. 0.27 (object)\n", "investor_state_code : 0.25 (object) vs. 0.25 (object)\n", "investor_region : 0.31 (object) vs. 0.31 (object)\n", "investor_city : 0.29 (object) vs. 0.29 (object)\n", "funding_round_type : 0.31 (object) vs. 0.31 (object)\n", "funded_at : 0.32 (object) vs. 0.04 (datetime64[ns])\n", "raised_amount_usd : 0.04 (float64) vs. 0.02 (float32)\n", "Memory: 3.93 mb vs 3.63 mb\n", "\n", "CHUNK 2:\n", "company_name : 0.33 (object) vs. 0.33 (object)\n", "company_category_code : 0.31 (object) vs. 0.31 (object)\n", "company_country_code : 0.29 (object) vs. 0.29 (object)\n", "company_state_code : 0.28 (object) vs. 0.28 (object)\n", "company_region : 0.31 (object) vs. 0.31 (object)\n", "company_city : 0.32 (object) vs. 0.32 (object)\n", "investor_name : 0.36 (object) vs. 0.36 (object)\n", "investor_country_code : 0.28 (object) vs. 0.28 (object)\n", "investor_state_code : 0.26 (object) vs. 0.26 (object)\n", "investor_region : 0.30 (object) vs. 0.30 (object)\n", "investor_city : 0.31 (object) vs. 0.31 (object)\n", "funding_round_type : 0.31 (object) vs. 0.31 (object)\n", "funded_at : 0.32 (object) vs. 0.04 (datetime64[ns])\n", "raised_amount_usd : 0.04 (float64) vs. 0.02 (float32)\n", "Memory: 4.00 mb vs 3.70 mb\n", "\n", "CHUNK 3:\n", "company_name : 0.33 (object) vs. 0.33 (object)\n", "company_category_code : 0.31 (object) vs. 0.31 (object)\n", "company_country_code : 0.29 (object) vs. 0.29 (object)\n", "company_state_code : 0.28 (object) vs. 0.28 (object)\n", "company_region : 0.31 (object) vs. 0.31 (object)\n", "company_city : 0.32 (object) vs. 0.32 (object)\n", "investor_name : 0.36 (object) vs. 0.36 (object)\n", "investor_country_code : 0.28 (object) vs. 0.28 (object)\n", "investor_state_code : 0.26 (object) vs. 0.26 (object)\n", "investor_region : 0.31 (object) vs. 0.31 (object)\n", "investor_city : 0.31 (object) vs. 0.31 (object)\n", "funding_round_type : 0.31 (object) vs. 0.31 (object)\n", "funded_at : 0.32 (object) vs. 0.04 (datetime64[ns])\n", "raised_amount_usd : 0.04 (float64) vs. 0.02 (float32)\n", "Memory: 4.01 mb vs 3.71 mb\n", "\n", "CHUNK 4:\n", "company_name : 0.32 (object) vs. 0.32 (object)\n", "company_category_code : 0.31 (object) vs. 0.31 (object)\n", "company_country_code : 0.29 (object) vs. 0.29 (object)\n", "company_state_code : 0.28 (object) vs. 0.28 (object)\n", "company_region : 0.31 (object) vs. 0.31 (object)\n", "company_city : 0.32 (object) vs. 0.32 (object)\n", "investor_name : 0.36 (object) vs. 0.36 (object)\n", "investor_country_code : 0.28 (object) vs. 0.28 (object)\n", "investor_state_code : 0.26 (object) vs. 0.26 (object)\n", "investor_region : 0.31 (object) vs. 0.31 (object)\n", "investor_city : 0.31 (object) vs. 0.31 (object)\n", "funding_round_type : 0.31 (object) vs. 0.31 (object)\n", "funded_at : 0.32 (object) vs. 0.04 (datetime64[ns])\n", "raised_amount_usd : 0.04 (float64) vs. 0.02 (float32)\n", "Memory: 4.01 mb vs 3.71 mb\n", "\n", "CHUNK 5:\n", "company_name : 0.33 (object) vs. 0.33 (object)\n", "company_category_code : 0.31 (object) vs. 0.31 (object)\n", "company_country_code : 0.29 (object) vs. 0.29 (object)\n", "company_state_code : 0.28 (object) vs. 0.28 (object)\n", "company_region : 0.31 (object) vs. 0.31 (object)\n", "company_city : 0.32 (object) vs. 0.32 (object)\n", "investor_name : 0.36 (object) vs. 0.36 (object)\n", "investor_country_code : 0.28 (object) vs. 0.28 (object)\n", "investor_state_code : 0.25 (object) vs. 0.25 (object)\n", "investor_region : 0.31 (object) vs. 0.31 (object)\n", "investor_city : 0.30 (object) vs. 0.30 (object)\n", "funding_round_type : 0.31 (object) vs. 0.31 (object)\n", "funded_at : 0.32 (object) vs. 0.04 (datetime64[ns])\n", "raised_amount_usd : 0.04 (float64) vs. 0.02 (float32)\n", "Memory: 3.99 mb vs 3.69 mb\n", "\n", "CHUNK 6:\n", "company_name : 0.33 (object) vs. 0.33 (object)\n", "company_category_code : 0.31 (object) vs. 0.31 (object)\n", "company_country_code : 0.29 (object) vs. 0.29 (object)\n", "company_state_code : 0.28 (object) vs. 0.28 (object)\n", "company_region : 0.31 (object) vs. 0.31 (object)\n", "company_city : 0.32 (object) vs. 0.32 (object)\n", "investor_name : 0.37 (object) vs. 0.37 (object)\n", "investor_country_code : 0.28 (object) vs. 0.28 (object)\n", "investor_state_code : 0.26 (object) vs. 0.26 (object)\n", "investor_region : 0.31 (object) vs. 0.31 (object)\n", "investor_city : 0.31 (object) vs. 0.31 (object)\n", "funding_round_type : 0.31 (object) vs. 0.31 (object)\n", "funded_at : 0.32 (object) vs. 0.04 (datetime64[ns])\n", "raised_amount_usd : 0.04 (float64) vs. 0.02 (float32)\n", "Memory: 4.02 mb vs 3.72 mb\n", "\n", "CHUNK 7:\n", "company_name : 0.33 (object) vs. 0.33 (object)\n", "company_category_code : 0.31 (object) vs. 0.31 (object)\n", "company_country_code : 0.29 (object) vs. 0.29 (object)\n", "company_state_code : 0.28 (object) vs. 0.28 (object)\n", "company_region : 0.31 (object) vs. 0.31 (object)\n", "company_city : 0.32 (object) vs. 0.32 (object)\n", "investor_name : 0.36 (object) vs. 0.36 (object)\n", "investor_country_code : 0.28 (object) vs. 0.28 (object)\n", "investor_state_code : 0.26 (object) vs. 0.26 (object)\n", "investor_region : 0.31 (object) vs. 0.31 (object)\n", "investor_city : 0.30 (object) vs. 0.30 (object)\n", "funding_round_type : 0.31 (object) vs. 0.31 (object)\n", "funded_at : 0.32 (object) vs. 0.04 (datetime64[ns])\n", "raised_amount_usd : 0.04 (float64) vs. 0.02 (float32)\n", "Memory: 4.00 mb vs 3.70 mb\n", "\n", "CHUNK 8:\n", "company_name : 0.32 (object) vs. 0.32 (object)\n", "company_category_code : 0.31 (object) vs. 0.31 (object)\n", "company_country_code : 0.29 (object) vs. 0.29 (object)\n", "company_state_code : 0.28 (object) vs. 0.28 (object)\n", "company_region : 0.31 (object) vs. 0.31 (object)\n", "company_city : 0.32 (object) vs. 0.32 (object)\n", "investor_name : 0.35 (object) vs. 0.35 (object)\n", "investor_country_code : 0.28 (object) vs. 0.28 (object)\n", "investor_state_code : 0.26 (object) vs. 0.26 (object)\n", "investor_region : 0.31 (object) vs. 0.31 (object)\n", "investor_city : 0.31 (object) vs. 0.31 (object)\n", "funding_round_type : 0.31 (object) vs. 0.31 (object)\n", "funded_at : 0.32 (object) vs. 0.04 (datetime64[ns])\n", "raised_amount_usd : 0.04 (float64) vs. 0.02 (float32)\n", "Memory: 3.99 mb vs 3.69 mb\n", "\n", "CHUNK 9:\n", "company_name : 0.32 (object) vs. 0.32 (object)\n", "company_category_code : 0.31 (object) vs. 0.31 (object)\n", "company_country_code : 0.29 (object) vs. 0.29 (object)\n", "company_state_code : 0.28 (object) vs. 0.28 (object)\n", "company_region : 0.31 (object) vs. 0.31 (object)\n", "company_city : 0.32 (object) vs. 0.32 (object)\n", "investor_name : 0.35 (object) vs. 0.35 (object)\n", "investor_country_code : 0.25 (object) vs. 0.25 (object)\n", "investor_state_code : 0.24 (object) vs. 0.24 (object)\n", "investor_region : 0.31 (object) vs. 0.31 (object)\n", "investor_city : 0.27 (object) vs. 0.27 (object)\n", "funding_round_type : 0.31 (object) vs. 0.31 (object)\n", "funded_at : 0.32 (object) vs. 0.04 (datetime64[ns])\n", "raised_amount_usd : 0.04 (float64) vs. 0.02 (float32)\n", "Memory: 3.90 mb vs 3.60 mb\n", "\n", "CHUNK 10:\n", "company_name : 0.32 (object) vs. 0.32 (object)\n", "company_category_code : 0.31 (object) vs. 0.31 (object)\n", "company_country_code : 0.29 (object) vs. 0.29 (object)\n", "company_state_code : 0.28 (object) vs. 0.28 (object)\n", "company_region : 0.31 (object) vs. 0.31 (object)\n", "company_city : 0.32 (object) vs. 0.32 (object)\n", "investor_name : 0.33 (object) vs. 0.33 (object)\n", "investor_country_code : 0.04 (float64) vs. 0.04 (float64)\n", "investor_state_code : 0.04 (float64) vs. 0.04 (float64)\n", "investor_region : 0.31 (object) vs. 0.31 (object)\n", "investor_city : 0.04 (float64) vs. 0.04 (float64)\n", "funding_round_type : 0.30 (object) vs. 0.30 (object)\n", "funded_at : 0.32 (object) vs. 0.04 (datetime64[ns])\n", "raised_amount_usd : 0.04 (float64) vs. 0.02 (float32)\n", "Memory: 3.23 mb vs 2.93 mb\n", "\n", "CHUNK 11:\n", "company_name : 0.18 (object) vs. 0.18 (object)\n", "company_category_code : 0.18 (object) vs. 0.18 (object)\n", "company_country_code : 0.16 (object) vs. 0.16 (object)\n", "company_state_code : 0.16 (object) vs. 0.16 (object)\n", "company_region : 0.18 (object) vs. 0.18 (object)\n", "company_city : 0.18 (object) vs. 0.18 (object)\n", "investor_name : 0.19 (object) vs. 0.19 (object)\n", "investor_country_code : 0.02 (float64) vs. 0.02 (float64)\n", "investor_state_code : 0.02 (float64) vs. 0.02 (float64)\n", "investor_region : 0.18 (object) vs. 0.18 (object)\n", "investor_city : 0.02 (float64) vs. 0.02 (float64)\n", "funding_round_type : 0.17 (object) vs. 0.17 (object)\n", "funded_at : 0.18 (object) vs. 0.02 (datetime64[ns])\n", "raised_amount_usd : 0.02 (float64) vs. 0.01 (float32)\n", "Memory: 1.85 mb vs 1.68 mb\n", "\n", "Memory:40.92 mb vs 37.74 mb\n" ] } ], "source": [ "# Memory Comparison - Before vs After Type Conversion\n", "chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000, \n", " encoding='ISO-8859-1', usecols=keep_cols)\n", "counter = 0\n", "total_initial_memory_mb = 0\n", "total_final_memory_mb = 0\n", "\n", "for chunk in chunks_it:\n", " counter += 1\n", " print(\"\\nCHUNK {0}:\".format(counter))\n", " \n", " total_initial_memory_mb += (chunk.memory_usage(deep=True).sum() / 1048576)\n", " \n", " #print('Initial Memory Usage:')\n", " memory_comparison = []\n", " initial_memory_mb = 0\n", " for col in chunk:\n", " chunk_memory_mb = chunk[col].memory_usage(deep=True) / 1048576\n", " initial_memory_mb += chunk_memory_mb\n", " initial_status = f'{col:<30}: {chunk_memory_mb:0.2f} ({chunk[col].dtype})'\n", " memory_comparison.append(initial_status)\n", " \n", " #for col in float_cols.columns:\n", " chunk['raised_amount_usd'] = pd.to_numeric(chunk['raised_amount_usd'], \n", " downcast='float')\n", " \n", " # convert object columns to datetime type\n", " chunk['funded_at'] = pd.to_datetime(chunk['funded_at'])\n", " \n", " total_memory_mb = 0\n", " for i, col in enumerate(chunk):\n", " chunk_memory_mb = chunk[col].memory_usage(deep=True) / 1048576\n", " total_memory_mb += chunk_memory_mb\n", " previous_info = memory_comparison[i]\n", " print(f'{previous_info:<10} vs. {chunk_memory_mb:0.2f} ({chunk[col].dtype})')\n", " \n", " print(f'Memory: {initial_memory_mb:0.2f} mb vs {total_memory_mb:0.2f} mb')\n", " \n", " total_final_memory_mb += (chunk.memory_usage(deep=True).sum() / 1048576)\n", "\n", "print(f'\\nMemory:{total_initial_memory_mb:0.2f} mb vs {total_final_memory_mb:0.2f} mb')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Observations**:\n", "\n", "By converting the 2 columns (`raised_amount_usd` to a more space efficient float type and `funded_at` to a datetime type), we saved ~3 MB from the overall data. Not a significant amount. There are opportunities to convert some of the other columns into categories to further save space in the dataframe, however I am not sure how this would translate to more space saved on the SQLite database." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading Chunks with SQLite\n", "\n", "The next step is to load each chunk into a table in a SQLite database so we can query the full data set.\n", "\n", "- Create and connect to a new SQLite database file.\n", "- Expand on the existing chunk processing code to export each chunk to a new table in the SQLite database.\n", "- Query the table and make sure the data types match up to what you had in mind for each column.\n", "- Use the `!wc IPython` command to return the file size of the database." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total Size List: '[(5000, 14), (5000, 14), (5000, 14), (5000, 14), (5000, 14), (5000, 14), (5000, 14), (5000, 14), (5000, 14), (5000, 14), (2870, 14)]'\n" ] } ], "source": [ "chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000, \n", " encoding='ISO-8859-1', usecols=keep_cols)\n", "\n", "total_size_list = []\n", "\n", "for chunk in chunks_it:\n", " total_size_list.append(chunk.shape)\n", "\n", "print(f\"Total Size List: '{total_size_list}'\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We verify that our rows add up to 52,870:\n", "\n", "total_rows = $(5000 * 10) + 2,870$\n", "\n", "We have 14 columns." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading chunk 1...\n", "Loading chunk 2...\n", "Loading chunk 3...\n", "Loading chunk 4...\n", "Loading chunk 5...\n", "Loading chunk 6...\n", "Loading chunk 7...\n", "Loading chunk 8...\n", "Loading chunk 9...\n", "Loading chunk 10...\n", "Loading chunk 11...\n", "Completed loading all chunks to crunchbase.db!\n" ] } ], "source": [ "conn = sqlite3.connect('crunchbase.db')\n", "\n", "chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000, \n", " encoding='ISO-8859-1', usecols=keep_cols)\n", "counter = 0\n", "\n", "for chunk in chunks_it:\n", " #print(chunk.columns.tolist())\n", " counter += 1\n", " print(f\"Loading chunk {counter}...\")\n", " chunk.to_sql(\"investments\", conn, if_exists='append', index=False)\n", " #break\n", "\n", "conn.close()\n", "print(\"Completed loading all chunks to crunchbase.db!\")" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "crunchbase.db\r\n", "crunchbase-investments.csv\r\n", "project03_analyzing_startup_fundraising_deals_from_crunchbase.ipynb\r\n" ] } ], "source": [ "!ls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Successfully created database, `crunchbase.db`." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# SQLite Helper Functions\n", "\n", "def run_query(query):\n", " with sqlite3.connect('crunchbase.db') as conn:\n", " return pd.read_sql(query, conn)\n", "\n", "def show_tables():\n", " query = '''\n", " SELECT\n", " name,\n", " type\n", " FROM sqlite_master\n", " WHERE type IN (\"table\", \"view\");\n", " '''\n", " return run_query(query)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nametype
0investmentstable
\n", "
" ], "text/plain": [ " name type\n", "0 investments table" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "show_tables()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
company_namecompany_category_codecompany_country_codecompany_state_codecompany_regioncompany_cityinvestor_nameinvestor_country_codeinvestor_state_codeinvestor_regioninvestor_cityfunding_round_typefunded_atraised_amount_usd
0AdverCaradvertisingUSACASF BaySan Francisco1-800-FLOWERS.COMUSANYNew YorkNew Yorkseries-a2012-10-302000000.0
1LaunchGramnewsUSACASF BayMountain View10XeleratorUSAOHColumbusColumbusother2012-01-2320000.0
2uTaPmessagingUSANoneUnited States - OtherNone10XeleratorUSAOHColumbusColumbusother2012-01-0120000.0
3ZoopShopsoftwareUSAOHColumbuscolumbus10XeleratorUSAOHColumbusColumbusangel2012-02-1520000.0
4eFuneralwebUSAOHClevelandCleveland10XeleratorUSAOHColumbusColumbusother2011-09-0820000.0
\n", "
" ], "text/plain": [ " company_name company_category_code company_country_code company_state_code \\\n", "0 AdverCar advertising USA CA \n", "1 LaunchGram news USA CA \n", "2 uTaP messaging USA None \n", "3 ZoopShop software USA OH \n", "4 eFuneral web USA OH \n", "\n", " company_region company_city investor_name \\\n", "0 SF Bay San Francisco 1-800-FLOWERS.COM \n", "1 SF Bay Mountain View 10Xelerator \n", "2 United States - Other None 10Xelerator \n", "3 Columbus columbus 10Xelerator \n", "4 Cleveland Cleveland 10Xelerator \n", "\n", " investor_country_code investor_state_code investor_region investor_city \\\n", "0 USA NY New York New York \n", "1 USA OH Columbus Columbus \n", "2 USA OH Columbus Columbus \n", "3 USA OH Columbus Columbus \n", "4 USA OH Columbus Columbus \n", "\n", " funding_round_type funded_at raised_amount_usd \n", "0 series-a 2012-10-30 2000000.0 \n", "1 other 2012-01-23 20000.0 \n", "2 other 2012-01-01 20000.0 \n", "3 angel 2012-02-15 20000.0 \n", "4 other 2011-09-08 20000.0 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = run_query('SELECT * from investments')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 4773 229954 6340608 crunchbase.db\r\n" ] } ], "source": [ "!wc crunchbase.db" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**:\n", "\n", "Earlier, I calculated 52870 rows across all the chunks. The database ends up having only 4773 rows.\n", "\n", "**Update**: After a quick double check, I realized the number of rows returned by `wc` command doesn't correspond to the number of lines in our SQL table. Below we verified our row and column numbers match up with our expectations." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
num_rows
052870
\n", "
" ], "text/plain": [ " num_rows\n", "0 52870" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = run_query('SELECT count(*) as num_rows from investments')\n", "df" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cidnametypenotnulldflt_valuepk
00company_nameTEXT0None0
11company_category_codeTEXT0None0
22company_country_codeTEXT0None0
33company_state_codeTEXT0None0
44company_regionTEXT0None0
55company_cityTEXT0None0
66investor_nameTEXT0None0
77investor_country_codeTEXT0None0
88investor_state_codeTEXT0None0
99investor_regionTEXT0None0
1010investor_cityTEXT0None0
1111funding_round_typeTEXT0None0
1212funded_atTEXT0None0
1313raised_amount_usdREAL0None0
\n", "
" ], "text/plain": [ " cid name type notnull dflt_value pk\n", "0 0 company_name TEXT 0 None 0\n", "1 1 company_category_code TEXT 0 None 0\n", "2 2 company_country_code TEXT 0 None 0\n", "3 3 company_state_code TEXT 0 None 0\n", "4 4 company_region TEXT 0 None 0\n", "5 5 company_city TEXT 0 None 0\n", "6 6 investor_name TEXT 0 None 0\n", "7 7 investor_country_code TEXT 0 None 0\n", "8 8 investor_state_code TEXT 0 None 0\n", "9 9 investor_region TEXT 0 None 0\n", "10 10 investor_city TEXT 0 None 0\n", "11 11 funding_round_type TEXT 0 None 0\n", "12 12 funded_at TEXT 0 None 0\n", "13 13 raised_amount_usd REAL 0 None 0" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Table Columns info\n", "\n", "df = run_query('PRAGMA table_info(investments)')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Exploration and Analysis\n", "\n", "We can use the pandas SQLite workflow we learned in the last mission to explore and analyze startup investments. Remember that each row isn't a unique company, but a unique investment from a single investor. This means that many startups will span multiple rows.\n", "\n", "- Use the pandas SQLite workflow to answer the following questions:\n", " - What proportion of the total amount of funds did the top 10% raise? What about the top 1%? Compare these values to the proportions the bottom 10% and bottom 1% raised.\n", " - Which category of company attracted the most investments?\n", " - Which investor contributed the most money (across all startups)?\n", " - Which investors contributed the most money per startup?\n", " - Which funding round was the most popular? Which was the least popular?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Helper Functions" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "def run_query(query):\n", " with sqlite3.connect('crunchbase.db') as conn:\n", " return pd.read_sql(query, conn)\n", "\n", "def run_command(command):\n", " with sqlite3.connect('crunchbase.db') as conn:\n", " conn.isolation_level = None\n", " conn.execute(command)\n", "\n", "def show_tables():\n", " query = '''\n", " SELECT\n", " name,\n", " type\n", " FROM sqlite_master\n", " WHERE type IN (\"table\", \"view\");\n", " '''\n", " return run_query(query)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nametype
0investmentstable
\n", "
" ], "text/plain": [ " name type\n", "0 investments table" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "show_tables()" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
company_namecompany_category_codecompany_country_codecompany_state_codecompany_regioncompany_cityinvestor_nameinvestor_country_codeinvestor_state_codeinvestor_regioninvestor_cityfunding_round_typefunded_atraised_amount_usd
0AdverCaradvertisingUSACASF BaySan Francisco1-800-FLOWERS.COMUSANYNew YorkNew Yorkseries-a2012-10-302000000.0
1LaunchGramnewsUSACASF BayMountain View10XeleratorUSAOHColumbusColumbusother2012-01-2320000.0
2uTaPmessagingUSANoneUnited States - OtherNone10XeleratorUSAOHColumbusColumbusother2012-01-0120000.0
3ZoopShopsoftwareUSAOHColumbuscolumbus10XeleratorUSAOHColumbusColumbusangel2012-02-1520000.0
4eFuneralwebUSAOHClevelandCleveland10XeleratorUSAOHColumbusColumbusother2011-09-0820000.0
5TackkwebUSAOHClevelandCleveland10XeleratorUSAOHColumbusColumbusother2012-02-0120000.0
6AcclaimdanalyticsUSAOHColumbusColumbus10XeleratorUSAOHColumbusColumbusangel2012-06-0120000.0
7AcclaimdanalyticsUSAOHColumbusColumbus10XeleratorUSAOHColumbusColumbusangel2012-08-0770000.0
8ToVieForecommerceUSANYNew YorkNew York2010 NYU Stern Business Plan CompetitionNoneNoneunknownNoneangel2010-04-0175000.0
9OHK LabssportsUSAFLPalm BeachBoca Raton22Hundred GroupNoneNoneunknownNoneangel2011-09-01100000.0
\n", "
" ], "text/plain": [ " company_name company_category_code company_country_code company_state_code \\\n", "0 AdverCar advertising USA CA \n", "1 LaunchGram news USA CA \n", "2 uTaP messaging USA None \n", "3 ZoopShop software USA OH \n", "4 eFuneral web USA OH \n", "5 Tackk web USA OH \n", "6 Acclaimd analytics USA OH \n", "7 Acclaimd analytics USA OH \n", "8 ToVieFor ecommerce USA NY \n", "9 OHK Labs sports USA FL \n", "\n", " company_region company_city \\\n", "0 SF Bay San Francisco \n", "1 SF Bay Mountain View \n", "2 United States - Other None \n", "3 Columbus columbus \n", "4 Cleveland Cleveland \n", "5 Cleveland Cleveland \n", "6 Columbus Columbus \n", "7 Columbus Columbus \n", "8 New York New York \n", "9 Palm Beach Boca Raton \n", "\n", " investor_name investor_country_code \\\n", "0 1-800-FLOWERS.COM USA \n", "1 10Xelerator USA \n", "2 10Xelerator USA \n", "3 10Xelerator USA \n", "4 10Xelerator USA \n", "5 10Xelerator USA \n", "6 10Xelerator USA \n", "7 10Xelerator USA \n", "8 2010 NYU Stern Business Plan Competition None \n", "9 22Hundred Group None \n", "\n", " investor_state_code investor_region investor_city funding_round_type \\\n", "0 NY New York New York series-a \n", "1 OH Columbus Columbus other \n", "2 OH Columbus Columbus other \n", "3 OH Columbus Columbus angel \n", "4 OH Columbus Columbus other \n", "5 OH Columbus Columbus other \n", "6 OH Columbus Columbus angel \n", "7 OH Columbus Columbus angel \n", "8 None unknown None angel \n", "9 None unknown None angel \n", "\n", " funded_at raised_amount_usd \n", "0 2012-10-30 2000000.0 \n", "1 2012-01-23 20000.0 \n", "2 2012-01-01 20000.0 \n", "3 2012-02-15 20000.0 \n", "4 2011-09-08 20000.0 \n", "5 2012-02-01 20000.0 \n", "6 2012-06-01 20000.0 \n", "7 2012-08-07 70000.0 \n", "8 2010-04-01 75000.0 \n", "9 2011-09-01 100000.0 " ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = '''\n", "SELECT *\n", "FROM investments\n", "LIMIT 10'''\n", "run_query(query)" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(52870, 14)" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = '''SELECT * FROM investments'''\n", "investments = run_query(query)\n", "investments.shape" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
company_namecompany_category_codecompany_country_codecompany_state_codecompany_regioncompany_cityinvestor_nameinvestor_country_codeinvestor_state_codeinvestor_regioninvestor_cityfunding_round_typefunded_atraised_amount_usd
0AdverCaradvertisingUSACASF BaySan Francisco1-800-FLOWERS.COMUSANYNew YorkNew Yorkseries-a2012-10-302000000.0
1LaunchGramnewsUSACASF BayMountain View10XeleratorUSAOHColumbusColumbusother2012-01-2320000.0
2uTaPmessagingUSANoneUnited States - OtherNone10XeleratorUSAOHColumbusColumbusother2012-01-0120000.0
3ZoopShopsoftwareUSAOHColumbuscolumbus10XeleratorUSAOHColumbusColumbusangel2012-02-1520000.0
4eFuneralwebUSAOHClevelandCleveland10XeleratorUSAOHColumbusColumbusother2011-09-0820000.0
\n", "
" ], "text/plain": [ " company_name company_category_code company_country_code company_state_code \\\n", "0 AdverCar advertising USA CA \n", "1 LaunchGram news USA CA \n", "2 uTaP messaging USA None \n", "3 ZoopShop software USA OH \n", "4 eFuneral web USA OH \n", "\n", " company_region company_city investor_name \\\n", "0 SF Bay San Francisco 1-800-FLOWERS.COM \n", "1 SF Bay Mountain View 10Xelerator \n", "2 United States - Other None 10Xelerator \n", "3 Columbus columbus 10Xelerator \n", "4 Cleveland Cleveland 10Xelerator \n", "\n", " investor_country_code investor_state_code investor_region investor_city \\\n", "0 USA NY New York New York \n", "1 USA OH Columbus Columbus \n", "2 USA OH Columbus Columbus \n", "3 USA OH Columbus Columbus \n", "4 USA OH Columbus Columbus \n", "\n", " funding_round_type funded_at raised_amount_usd \n", "0 series-a 2012-10-30 2000000.0 \n", "1 other 2012-01-23 20000.0 \n", "2 other 2012-01-01 20000.0 \n", "3 angel 2012-02-15 20000.0 \n", "4 other 2011-09-08 20000.0 " ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "investments.head()" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "11574" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# verify unique companies\n", "unique_companies = list(investments['company_name'].unique())\n", "len(unique_companies)" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
company_namecompany_category_codecompany_country_codecompany_state_codecompany_regioncompany_cityinvestor_nameinvestor_country_codeinvestor_state_codeinvestor_regioninvestor_cityfunding_round_typefunded_atraised_amount_usd
34225NoneNoneNoneNoneNoneNoneNoneNoneNoneNoneNoneNoneNoneNaN
\n", "
" ], "text/plain": [ " company_name company_category_code company_country_code \\\n", "34225 None None None \n", "\n", " company_state_code company_region company_city investor_name \\\n", "34225 None None None None \n", "\n", " investor_country_code investor_state_code investor_region investor_city \\\n", "34225 None None None None \n", "\n", " funding_round_type funded_at raised_amount_usd \n", "34225 None None NaN " ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Found the additional row - this messed up the\n", "# unique companies list above with the unique startups list\n", "# below\n", "investments[investments['company_name'].isnull()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Group Investments by Startup" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "funds_raised_per_startup = investments.groupby('company_name')['raised_amount_usd'].sum()" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "company_name\n", "#waywire 8750000.0\n", "0xdata 1700000.0\n", "1-800-DENTIST 0.0\n", "1000memories 20190000.0\n", "100Plus 2750000.0\n", "Name: raised_amount_usd, dtype: float64" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "funds_raised_per_startup.head()" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(11573,)" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "funds_raised_per_startup.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reason for the difference between this number and the earlier number (11574) is because there is a row with all values set as None. We can safely ignore this discrepancy. \n", "\n", "**Note**: Obviously pandas `groupby` function would not keep a None value." ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [], "source": [ "#suppress displaying long numbers in scientific notation\n", "pd.set_option('display.float_format', lambda x: '%.2f' % x)" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [], "source": [ "#reset option\n", "pd.reset_option('display.float_format')" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "company_name\n", "Clearwire 2.968000e+10\n", "Groupon 1.018540e+10\n", "Nanosolar 4.505000e+09\n", "Facebook 4.154100e+09\n", "SurveyMonkey 3.250000e+09\n", "Zynga 2.886013e+09\n", "Fisker Automotive 2.788000e+09\n", "Dropbox 2.764400e+09\n", "LivingSocial 2.685000e+09\n", "sigmacare 2.600000e+09\n", "Bloom Energy 2.530000e+09\n", "Pacific Biosciences 2.374160e+09\n", "Fab.com 2.349100e+09\n", "BrightSource Energy 2.219000e+09\n", "TRUECar 2.106500e+09\n", "Wave Broadband 2.100000e+09\n", "Twitter 1.950000e+09\n", "PayPal 1.857000e+09\n", "Solyndra 1.833000e+09\n", "Demand Media 1.819992e+09\n", "KAYAK 1.813500e+09\n", "MiaSolŽ 1.750000e+09\n", "Tesla Motors 1.736500e+09\n", "Pure Storage 1.650000e+09\n", "SunEdison 1.635300e+09\n", "Kosmos Biotherapeutics 1.595000e+09\n", "Gilt Groupe 1.556000e+09\n", "NextG Networks 1.540000e+09\n", "Legendary Entertainment 1.375000e+09\n", "Pinterest 1.356000e+09\n", " ... \n", "Tacit Software 0.000000e+00\n", "Tacere Therapeutics 0.000000e+00\n", "Taaz 0.000000e+00\n", "Sparkcloud 0.000000e+00\n", "Spartan Race 0.000000e+00\n", "Speaktoit 0.000000e+00\n", "Spill Inc 0.000000e+00\n", "mo9 (moKredit) 0.000000e+00\n", "Splitcast Technology 0.000000e+00\n", "SponsorHub 0.000000e+00\n", "Timbuktu Labs 0.000000e+00\n", "Tideland Signal Corporation 0.000000e+00\n", "8020 Media 0.000000e+00\n", "TicketBiscuit 0.000000e+00\n", "g2One 0.000000e+00\n", "for; to (do) Centers 0.000000e+00\n", "Triad Digital Media 0.000000e+00\n", "Tomorrowish 0.000000e+00\n", "iCracked 0.000000e+00\n", "Tippr 0.000000e+00\n", "Affinaquest 0.000000e+00\n", "Thinkspeed 0.000000e+00\n", "Thounds 0.000000e+00\n", "Threat Stack 0.000000e+00\n", "Thumb Friendly 0.000000e+00\n", "AfterCollege 0.000000e+00\n", "Tickengo 0.000000e+00\n", "Ticket Cake 0.000000e+00\n", "The Talk Market 0.000000e+00\n", "zuuka! 0.000000e+00\n", "Name: raised_amount_usd, Length: 11573, dtype: float64" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "funds_raised_per_startup.sort_values(ascending=False)" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "company_name\n", "zuuka! 0.00\n", "Marinexplore 0.00\n", "MarketGid 0.00\n", "Marketbright 0.00\n", "Marketsync 0.00\n", "Marquee 0.00\n", "Massage Envy 0.00\n", "Massively Fun 0.00\n", "Masterson Industries 0.00\n", "Matchmove Games 0.00\n", "Mattermark 0.00\n", "Measureful 0.00\n", "Best Apps Market 0.00\n", "MedSolutions 0.00\n", "Bespoke Global 0.00\n", "MedTest DX 0.00\n", "MediKeeper 0.00\n", "MediProPharma 0.00\n", "MediaSpike 0.00\n", "Medical Solutions 0.00\n", "Manymoon 0.00\n", "Medico.com 0.00\n", "Mangia 0.00\n", "Mandalay Sports Media (MSM) 0.00\n", "Location Based Technologies 0.00\n", "Lockitron 0.00\n", "Look.io 0.00\n", "LoopIt 0.00\n", "Lucent Sky 0.00\n", "LumaSense Technologies 0.00\n", " ... \n", "Pinterest 1356000000.00\n", "Legendary Entertainment 1375000000.00\n", "NextG Networks 1540000000.00\n", "Gilt Groupe 1556000000.00\n", "Kosmos Biotherapeutics 1595000000.00\n", "SunEdison 1635300000.00\n", "Pure Storage 1650000000.00\n", "Tesla Motors 1736500000.00\n", "MiaSolŽ 1750000000.00\n", "KAYAK 1813500000.00\n", "Demand Media 1819992000.00\n", "Solyndra 1833000000.00\n", "PayPal 1857000000.00\n", "Twitter 1950000000.00\n", "Wave Broadband 2100000000.00\n", "TRUECar 2106500000.00\n", "BrightSource Energy 2219000000.00\n", "Fab.com 2349100000.00\n", "Pacific Biosciences 2374160000.00\n", "Bloom Energy 2530000000.00\n", "sigmacare 2600000000.00\n", "LivingSocial 2685000000.00\n", "Dropbox 2764400000.00\n", "Fisker Automotive 2788000000.00\n", "Zynga 2886013000.00\n", "SurveyMonkey 3250000000.00\n", "Facebook 4154100000.00\n", "Nanosolar 4505000000.00\n", "Groupon 10185400000.00\n", "Clearwire 29680000000.00\n", "Name: raised_amount_usd, Length: 11573, dtype: float64" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "investments.groupby('company_name')['raised_amount_usd'].sum().sort_values()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I'm getting 29,680,000,000 raised for Clearwire which seems to be off by a power of 10. As of 2009, [Clearwire had raised $2.8 billion](https://www.ecnmag.com/news/2009/11/clearwires-fundraising-hits-28-billion).\n", "\n", "**NOT $29 billion**" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "scrolled": true }, "outputs": [], "source": [ "clearwire_investments = investments[investments['company_name'] == 'Clearwire']" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
company_namecompany_category_codecompany_country_codecompany_state_codecompany_regioncompany_cityinvestor_nameinvestor_country_codeinvestor_state_codeinvestor_regioninvestor_cityfunding_round_typefunded_atraised_amount_usd
521ClearwiremobileUSAWASeattleKirklandBrightHouseUSACALos AngelesSanta Monicapost-ipo2009-11-101.500000e+09
522ClearwiremobileUSAWASeattleKirklandBrightHouseUSACALos AngelesSanta Monicapost-ipo2008-05-063.200000e+09
714ClearwiremobileUSAWASeattleKirklandComcastUSAPAPhiladelphiaPhiladelphiapost-ipo2009-11-249.200000e+08
715ClearwiremobileUSAWASeattleKirklandComcastUSAPAPhiladelphiaPhiladelphiapost-ipo2009-11-101.500000e+09
716ClearwiremobileUSAWASeattleKirklandComcastUSAPAPhiladelphiaPhiladelphiapost-ipo2008-05-063.200000e+09
1172ClearwiremobileUSAWASeattleKirklandGoogleUSACASF BayMountain Viewpost-ipo2008-05-063.200000e+09
1348ClearwiremobileUSAWASeattleKirklandIntelUSACASF BaySanta Clarapost-ipo2009-11-249.200000e+08
1349ClearwiremobileUSAWASeattleKirklandIntelUSACASF BaySanta Clarapost-ipo2009-11-101.500000e+09
1350ClearwiremobileUSAWASeattleKirklandIntelUSACASF BaySanta Clarapost-ipo2008-05-063.200000e+09
2367ClearwiremobileUSAWASeattleKirklandSprint NextelNoneNoneunknownNonepost-ipo2009-11-249.200000e+08
2368ClearwiremobileUSAWASeattleKirklandSprint NextelNoneNoneunknownNonepost-ipo2009-11-101.500000e+09
2369ClearwiremobileUSAWASeattleKirklandSprint NextelNoneNoneunknownNoneprivate-equity2013-02-278.000000e+07
2851ClearwiremobileUSAWASeattleKirklandTime WarnerUSANYNew YorkNew Yorkpost-ipo2009-11-249.200000e+08
2852ClearwiremobileUSAWASeattleKirklandTime WarnerUSANYNew YorkNew Yorkpost-ipo2009-11-101.500000e+09
2853ClearwiremobileUSAWASeattleKirklandTime WarnerUSANYNew YorkNew Yorkpost-ipo2008-05-063.200000e+09
14252ClearwiremobileUSAWASeattleKirklandEagle River HoldingsUSAWASeattleKirklandpost-ipo2009-11-249.200000e+08
14253ClearwiremobileUSAWASeattleKirklandEagle River HoldingsUSAWASeattleKirklandpost-ipo2009-11-101.500000e+09
\n", "
" ], "text/plain": [ " company_name company_category_code company_country_code \\\n", "521 Clearwire mobile USA \n", "522 Clearwire mobile USA \n", "714 Clearwire mobile USA \n", "715 Clearwire mobile USA \n", "716 Clearwire mobile USA \n", "1172 Clearwire mobile USA \n", "1348 Clearwire mobile USA \n", "1349 Clearwire mobile USA \n", "1350 Clearwire mobile USA \n", "2367 Clearwire mobile USA \n", "2368 Clearwire mobile USA \n", "2369 Clearwire mobile USA \n", "2851 Clearwire mobile USA \n", "2852 Clearwire mobile USA \n", "2853 Clearwire mobile USA \n", "14252 Clearwire mobile USA \n", "14253 Clearwire mobile USA \n", "\n", " company_state_code company_region company_city investor_name \\\n", "521 WA Seattle Kirkland BrightHouse \n", "522 WA Seattle Kirkland BrightHouse \n", "714 WA Seattle Kirkland Comcast \n", "715 WA Seattle Kirkland Comcast \n", "716 WA Seattle Kirkland Comcast \n", "1172 WA Seattle Kirkland Google \n", "1348 WA Seattle Kirkland Intel \n", "1349 WA Seattle Kirkland Intel \n", "1350 WA Seattle Kirkland Intel \n", "2367 WA Seattle Kirkland Sprint Nextel \n", "2368 WA Seattle Kirkland Sprint Nextel \n", "2369 WA Seattle Kirkland Sprint Nextel \n", "2851 WA Seattle Kirkland Time Warner \n", "2852 WA Seattle Kirkland Time Warner \n", "2853 WA Seattle Kirkland Time Warner \n", "14252 WA Seattle Kirkland Eagle River Holdings \n", "14253 WA Seattle Kirkland Eagle River Holdings \n", "\n", " investor_country_code investor_state_code investor_region \\\n", "521 USA CA Los Angeles \n", "522 USA CA Los Angeles \n", "714 USA PA Philadelphia \n", "715 USA PA Philadelphia \n", "716 USA PA Philadelphia \n", "1172 USA CA SF Bay \n", "1348 USA CA SF Bay \n", "1349 USA CA SF Bay \n", "1350 USA CA SF Bay \n", "2367 None None unknown \n", "2368 None None unknown \n", "2369 None None unknown \n", "2851 USA NY New York \n", "2852 USA NY New York \n", "2853 USA NY New York \n", "14252 USA WA Seattle \n", "14253 USA WA Seattle \n", "\n", " investor_city funding_round_type funded_at raised_amount_usd \n", "521 Santa Monica post-ipo 2009-11-10 1.500000e+09 \n", "522 Santa Monica post-ipo 2008-05-06 3.200000e+09 \n", "714 Philadelphia post-ipo 2009-11-24 9.200000e+08 \n", "715 Philadelphia post-ipo 2009-11-10 1.500000e+09 \n", "716 Philadelphia post-ipo 2008-05-06 3.200000e+09 \n", "1172 Mountain View post-ipo 2008-05-06 3.200000e+09 \n", "1348 Santa Clara post-ipo 2009-11-24 9.200000e+08 \n", "1349 Santa Clara post-ipo 2009-11-10 1.500000e+09 \n", "1350 Santa Clara post-ipo 2008-05-06 3.200000e+09 \n", "2367 None post-ipo 2009-11-24 9.200000e+08 \n", "2368 None post-ipo 2009-11-10 1.500000e+09 \n", "2369 None private-equity 2013-02-27 8.000000e+07 \n", "2851 New York post-ipo 2009-11-24 9.200000e+08 \n", "2852 New York post-ipo 2009-11-10 1.500000e+09 \n", "2853 New York post-ipo 2008-05-06 3.200000e+09 \n", "14252 Kirkland post-ipo 2009-11-24 9.200000e+08 \n", "14253 Kirkland post-ipo 2009-11-10 1.500000e+09 " ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clearwire_investments" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1500000000.0,\n", " 3200000000.0,\n", " 920000000.0,\n", " 1500000000.0,\n", " 3200000000.0,\n", " 3200000000.0,\n", " 920000000.0,\n", " 1500000000.0,\n", " 3200000000.0,\n", " 920000000.0,\n", " 1500000000.0,\n", " 80000000.0,\n", " 920000000.0,\n", " 1500000000.0,\n", " 3200000000.0,\n", " 920000000.0,\n", " 1500000000.0]" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(clearwire_investments['raised_amount_usd'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next Steps" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }