{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Preprocessing - Putting it all together\n", "> Now that you've learned all about preprocessing, you'll try these techniques out on a dataset that records information on UFO sightings. This is a summary of the lecture \"Preprocessing for Machine Learning in Python\" from DataCamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## UFOs and preprocessing\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Checking column types\n", "Take a look at the UFO dataset's column types using the `dtypes` attribute. Two columns jump out for transformation: the `seconds` column, which holds numeric values but is being read in as `object`, and the `date` column, which can be converted to the `datetime` type. Converting both will make our feature engineering efforts easier later on.\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | date | city | state | country | type | seconds | length_of_time | desc | recorded | lat | long |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11/3/2011 19:21 | woodville | wi | us | unknown | 1209600.0 | 2 weeks | Red blinking objects similar to airplanes or s... | 12/12/2011 | 44.9530556 | -92.291111 |
| 1 | 10/3/2004 19:05 | cleveland | oh | us | circle | 30.0 | 30sec. | Many fighter jets flying towards UFO | 10/27/2004 | 41.4994444 | -81.695556 |
| 2 | 9/25/2009 21:00 | coon rapids | mn | us | cigar | 0.0 | NaN | Green, red, and blue pulses of light tha... | 12/12/2009 | 45.1200000 | -93.287500 |
| 3 | 11/21/2002 05:45 | clemmons | nc | us | triangle | 300.0 | about 5 minutes | It was a large, triangular shaped flying ob... | 12/23/2002 | 36.0213889 | -80.382222 |
| 4 | 8/19/2010 12:55 | calgary (canada) | ab | ca | oval | 0.0 | 2 | A white spinning disc in the shape of an oval. | 8/24/2010 | 51.083333 | -114.083333 |
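
The type conversions described under "Checking column types" can be sketched as follows. This is a minimal illustration, not the notebook's own cell: the three-row `ufo` frame is a hypothetical stand-in built from values in the table above, with `seconds` deliberately stored as strings to mimic a column that was read in as `object`.

```python
import pandas as pd

# Hypothetical three-row sample; the real notebook loads the full dataset
# from a file that is not shown here.
ufo = pd.DataFrame({
    "date": ["11/3/2011 19:21", "10/3/2004 19:05", "9/25/2009 21:00"],
    "seconds": ["1209600.0", "30.0", "0.0"],  # strings, i.e. not yet numeric
})

# Inspect the column types before converting.
print(ufo.dtypes)

# Convert the seconds column to float.
ufo["seconds"] = ufo["seconds"].astype(float)

# Parse the date strings (month/day/year hour:minute) into datetimes.
ufo["date"] = pd.to_datetime(ufo["date"], format="%m/%d/%Y %H:%M")

# Inspect the column types after converting.
print(ufo.dtypes)
```

Once `date` is a real `datetime` column, components such as the month and year can be read off through the `.dt` accessor.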
| | type | seconds_log | country_enc | changing | chevron | cigar | circle | cone | cross | cylinder | ... | light | other | oval | rectangle | sphere | teardrop | triangle | unknown | month | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | triangle | 5.703782 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 11 | 2002 |
| 1 | light | 6.396930 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 2012 |
| 2 | light | 4.787492 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 2013 |
| 3 | light | 4.787492 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 2013 |
| 4 | sphere | 5.703782 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 9 | 2013 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1861 | unknown | 7.901007 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 8 | 2002 |
| 1862 | oval | 5.703782 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 7 | 2013 |
| 1863 | changing | 5.192957 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 2008 |
| 1864 | circle | 5.192957 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 1998 |
| 1865 | other | 4.094345 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 2005 |
1866 rows × 26 columns
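
The column names in the table above (`seconds_log`, `country_enc`, one indicator column per sighting `type`, `month`, `year`) suggest how the engineered features could be derived. The sketch below is one plausible reconstruction under stated assumptions, not the notebook's own code: the sample rows are hypothetical, and the mapping of `us` to 1 in `country_enc` is a guess that the table alone does not confirm. The natural log does match the table, since ln(300) ≈ 5.703782.

```python
import numpy as np
import pandas as pd

# Hypothetical sample. Row 0 mirrors the 11/21/2002 triangle sighting shown
# earlier; the other two rows are made up for illustration.
ufo = pd.DataFrame({
    "date": pd.to_datetime(
        ["2002-11-21 05:45", "2012-06-16 23:00", "2013-06-09 00:00"]
    ),
    "country": ["us", "us", "ca"],
    "type": ["triangle", "light", "light"],
    "seconds": [300.0, 600.0, 120.0],
})

# Log-transform the duration to reduce the variance of a heavily skewed column.
ufo["seconds_log"] = np.log(ufo["seconds"])

# Binary-encode the country (assumption: 1 for "us", 0 otherwise).
ufo["country_enc"] = (ufo["country"] == "us").astype(int)

# One-hot encode the sighting type and append the indicator columns.
type_dummies = pd.get_dummies(ufo["type"], dtype=int)
ufo = pd.concat([ufo, type_dummies], axis=1)

# Extract month and year from the datetime column.
ufo["month"] = ufo["date"].dt.month
ufo["year"] = ufo["date"].dt.year

print(ufo[["type", "seconds_log", "country_enc", "light", "triangle", "month", "year"]])
```

Note that `np.log(0)` evaluates to `-inf`, so rows with a duration of 0 seconds would need to be dropped or otherwise handled before the log transform.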
\n", "