{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Premium Primitive Examples\n", "\n", "Examples by type\n", "\n", "* [Geospatial Primitives](#Geospatial-Primitives)\n", "* [NLP Primitives](#Natural-Language-Processing-Primitives)\n", "* [Date of Birth Primitives](#Date-of-Birth-Primitives)\n", "* [Time Primitives](#Time-Primitives)\n", "* [Phone Number Primitives](#Phone-Number-Primitives)\n", "* [ZIP Code Primitives](#ZIP-Code-Primitives)\n", "* [Numeric Primitives](#Numeric-Primitives)\n", "* [Miscellaneous Data Types](#Miscellaneous-Data-Types)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from datetime import datetime\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Geospatial Primitives\n", "\n", "`LatLong` variables store a tuple containing the latitude and longitude of a point on the globe. These primitives transform that location (e.g., determine which country it is in) and can compare multiple `LatLong` variables (e.g., compute the distance between them).\n", "\n", "#### CityblockDistance\n", "\n", "Calculates the distance between points in a city road grid." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 301.518836\n", "1 672.088624\n", "dtype: float64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import CityblockDistance\n", "\n", "cityblock_distance = CityblockDistance()\n", "DC = (38, -77)\n", "Boston = (43, -71)\n", "NYC = (40, -74)\n", "cityblock_distance([DC, DC], [NYC, Boston]) # DC -> NYC, DC -> Boston" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### PathLength\n", "\n", "Determines the length of a path defined by a series of coordinates." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "805.5203180792812" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import PathLength\n", "\n", "path_length_km = PathLength(unit='kilometers')\n", "path_length_km([(41.881832, -87.623177), (38.6270, -90.1994), (39.0997, -94.5786)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### LatLongToCity\n", "\n", "Determines the city or town corresponding to the given latitude and longitude coordinates." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Bayswater\n", "1 Cochin\n", "2 Mountain View\n", "3 None\n", "Name: results, dtype: object" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import LatLongToCity\n", "\n", "latlong_to_city = LatLongToCity()\n", "latlong_to_city([(51.52, -0.17), (9.93, 76.25), (37.38, -122.08), (np.nan, np.nan)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Natural Language Processing Primitives\n", "\n", "NLP primitives apply various natural language processing techniques to text data.\n", "\n", "#### PolarityScore\n", "\n", "Calculates the polarity of a text on a scale from -1 (negative) to 1 (positive)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.677\n", "1 -0.649\n", "2 0.000\n", "3 0.000\n", "dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import PolarityScore\n", "\n", "x = ['He loves dogs', 'She hates cats', 'There is a dog', '']\n", "polarity_score = PolarityScore()\n", "polarity_score(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### PartOfSpeechCount\n", "\n", "Calculates the occurrences of each different part of 
speech." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 [0.0, 0.0]\n", "1 [0.0, 0.0]\n", "2 [0.0, 0.0]\n", "3 [0.0, 0.0]\n", "4 [0.0, 0.0]\n", "5 [1.0, 0.0]\n", "6 [0.0, 0.0]\n", "7 [0.0, 0.0]\n", "8 [0.0, 0.0]\n", "9 [1.0, 0.0]\n", "10 [0.0, 0.0]\n", "11 [0.0, 0.0]\n", "12 [0.0, 0.0]\n", "13 [1.0, 0.0]\n", "14 [0.0, 0.0]\n", "dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import PartOfSpeechCount\n", "\n", "x = ['He was eating cheese', '']\n", "part_of_speech_count = PartOfSpeechCount()\n", "part_of_speech_count(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Date of Birth Primitives\n", "\n", "These primitives transform `DateOfBirth` type variables. They use the time of the feature calculation to compute a person's current age. That time is set by using a [cutoff time](https://featuretools.featurelabs.com/automated_feature_engineering/handling_time.html#what-is-the-cutoff-time).\n", "\n", "#### Age\n", "\n", "Calculates the age in years as a floating point number given a date of birth." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 19.013699\n", "1 35.616438\n", "2 21.221918\n", "dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import Age\n", "\n", "reference_date = pd.to_datetime(\"01-01-2019\")\n", "age = Age()\n", "input_ages = [pd.to_datetime(\"01-01-2000\"),\n", " pd.to_datetime(\"05-30-1983\"),\n", " pd.to_datetime(\"10-17-1997\")]\n", "age(input_ages, time=reference_date)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are also primitives that check whether a birth date falls within a given age range.\n", "\n", "#### AgeOver18\n", "\n", "Determines whether a person is over 18 years old given their date of birth." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 True\n", "2 True\n", "dtype: bool" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import AgeOver18\n", "\n", "over18 = AgeOver18()\n", "over18(input_ages, time=reference_date)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### AgeUnder65\n", "\n", "Determines whether a person is under 65 years old given their date of birth." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 True\n", "2 True\n", "dtype: bool" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import AgeUnder65\n", "\n", "under65 = AgeUnder65()\n", "under65(input_ages, time=reference_date)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Time Primitives\n", "\n", "`Datetime` variables store dates or timestamps. These primitives extract special properties from a `Datetime` field, such as whether it falls on a holiday.\n", "\n", "#### DateToHoliday\n", "\n", "Transforms a date into the name of the holiday that falls on it. If there is no holiday, it returns `NaN`. Currently, it only works for the United States and Canada with dates between 1950 and 2100." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([\"New Year's Day\", nan, 'Memorial Day', 'Independence Day'],\n", " dtype=object)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import DateToHoliday\n", "\n", "date_to_holiday = DateToHoliday()\n", "dates = pd.Series([datetime(2016, 1, 1),\n", " datetime(2016, 2, 27),\n", " datetime(2017, 5, 29, 10, 30, 5),\n", " datetime(2018, 7, 4)])\n", "date_to_holiday(dates)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### NUniqueDaysOfCalendarYear\n", "\n", "Determines the number of unique calendar days, ignoring the year. In the example below, February 1 of 2018 and February 1 of 2019 count as the same day." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import NUniqueDaysOfCalendarYear\n", "\n", "n_unique_days_of_calendar_year = NUniqueDaysOfCalendarYear()\n", "times = [datetime(2019, 2, 1),\n", " datetime(2019, 2, 1),\n", " datetime(2018, 2, 1),\n", " datetime(2019, 1, 1)]\n", "n_unique_days_of_calendar_year(times)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Phone Number Primitives\n", "\n", "The `PhoneNumber` variable stores phone numbers. These primitives transform phone numbers to extract metadata, such as the country or area code, that is general enough for a model to use.\n", "\n", "#### PhoneNumberToCountry\n", "\n", "Determines the country of a phone number." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 BR\n", "1 JP\n", "2 US\n", "dtype: object" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import PhoneNumberToCountry\n", "\n", "phone_number_to_country = PhoneNumberToCountry()\n", "phone_number_to_country(['+55 85 5555555', '+81 55-555-5555', '+1-541-754-3010'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ZIP Code Primitives\n", "\n", "These primitives bring in metadata about a ZIP Code, such as geographic or economic information.\n", "\n", "#### ZIPCodeToState\n", "\n", "Extracts the state from a ZIP Code." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 IL\n", "1 CA\n", "2 MA\n", "dtype: object" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import ZIPCodeToState\n", "\n", "zipcode_to_state = ZIPCodeToState()\n", "zipcode_to_state(['60622', '94120', '02111-1253'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### ZIPCodeToHouseholdIncome\n", "\n", "Determines the median household income for a ZIP Code." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 59000., 103422., 103422.])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import ZIPCodeToHouseholdIncome\n", "\n", "zipcode_to_household_income = ZIPCodeToHouseholdIncome()\n", "zipcode_to_household_income([\"82838\", \"02116\", \"02116-3899\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Numeric Primitives\n", "\n", "The premium primitives include additional numeric primitives that add mathematical transformations and aggregations that aren't present in the open-source library. 
They are frequently useful in time-series analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### NumPeaks\n", "\n", "Determines the number of peaks in a list of numbers." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import NumPeaks\n", "\n", "num_peaks = NumPeaks()\n", "num_peaks([-5, 0, 10, 0, 10, -5, -4, -5, 10, 0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### NumZeroCrossings\n", "\n", "Determines the number of times a list crosses 0." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import NumZeroCrossings\n", "\n", "num_zero_crossings = NumZeroCrossings()\n", "num_zero_crossings([1, -1, 2, -2, 3, -3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Correlation\n", "\n", "Computes the correlation between two columns of values." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9221388919541468" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import Correlation\n", "\n", "correlation = Correlation()\n", "array_1 = [1, 4, 6, 7]\n", "array_2 = [1, 5, 9, 7]\n", "correlation(array_1, array_2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### CountOutsideRange\n", "\n", "Determines the number of values that fall outside a certain range." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import CountOutsideRange\n", "\n", "count_outside_range = CountOutsideRange(lower=1.5, upper=3.6)\n", "count_outside_range([1, 2, 3, 4, 5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Miscellaneous Data Types\n", "\n", "Other variable types include `FullName`, `EmailAddress`, `URL`, `CountryCode`, `SubRegionCode`, and `FilePath`.\n", "\n", "#### FullNameToLastName\n", "\n", "Determines the last name from a person's name." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Spector\n", "1 Oliva y Ocana\n", "2 Ware\n", "3 Peter\n", "4 Brown\n", "Name: last_name, dtype: object" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import FullNameToLastName\n", "\n", "full_name_to_last_name = FullNameToLastName()\n", "names = ['Woolf Spector', 'Oliva y Ocana, Dona. Fermina',\n", " 'Ware, Mr. Frederick', 'Peter, Michael J', 'Mr. Brown']\n", "full_name_to_last_name(names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### IsFreeEmailDomain\n", "\n", "Determines if an email address is from a free email domain." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ True, False])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import IsFreeEmailDomain\n", "\n", "is_free_email_domain = IsFreeEmailDomain()\n", "is_free_email_domain(['name@gmail.com', 'name@featuretools.com'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### URLToDomain\n", "\n", "Determines the domain of a URL." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 play.google.com\n", "1 google.co.in\n", "2 facebook.com\n", "dtype: object" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import URLToDomain\n", "\n", "url_to_domain = URLToDomain()\n", "urls = ['https://play.google.com', 'http://www.google.co.in', 'www.facebook.com']\n", "url_to_domain(urls)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### CountryCodeToIncome\n", "\n", "Transforms a two-letter or three-letter ISO 3166-1 country code into Gross National Income (GNI) per capita." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([58270., 3990., 5920.])" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import CountryCodeToIncome\n", "\n", "country_code_to_income = CountryCodeToIncome()\n", "country_code_to_income(['USA', 'AM', 'EC'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### SubRegionCodeToMedianHouseholdIncome\n", "\n", "Determines the median household income of a US sub-region." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([51113, 63481, 63805, 83382, 57700, 62447])" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import SubRegionCodeToMedianHouseholdIncome\n", "\n", "sub_region_code_to_median_household_income = SubRegionCodeToMedianHouseholdIncome()\n", "subregions = [\"US-AL\", \"US-IA\", \"US-VT\", \"US-DC\", \"US-MI\", \"US-NY\"]\n", "sub_region_code_to_median_household_income(subregions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### FileExtension\n", "\n", "Determines the extension of a file path." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 .txt\n", "1 .json\n", "2 NaN\n", "dtype: object" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from featuretools.primitives import FileExtension\n", "\n", "file_extension = FileExtension()\n", "file_extension(['doc.txt', '~/documents/data.json', 'file'])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }