{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dealing with Text Data\n", "> Finally, in this chapter, you will work with unstructured text data and learn ways to engineer columnar features out of a text corpus. You will compare how different approaches affect how much context is extracted from a text, and how to balance the need for context against creating too many features. This is a summary of the lecture \"Feature Engineering for Machine Learning in Python\" on DataCamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "plt.rcParams['figure.figsize'] = (8, 8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Encoding text\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cleaning up your text\n", "Unstructured text data cannot be used directly in most analyses. Several steps are needed to go from a long free-form string to a set of numeric columns in a format that a machine learning model can ingest. The first step of this process is to standardize the data and eliminate any characters that could cause problems later in your analytic pipeline.\n", "\n", "In this chapter you will be working with a new dataset containing the inaugural speeches of the presidents of the United States, loaded as `speech_df`, with the speeches stored in the `text` column." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "speech_df = pd.read_csv('./dataset/inaugural_speeches.csv')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Fellow-Citizens of the Senate and of the House...\n", "1 Fellow Citizens: I AM again called upon by th...\n", "2 WHEN it was first perceived, in early times, t...\n", "3 Friends and Fellow-Citizens: CALLED upon to u...\n", "4 PROCEEDING, fellow-citizens, to that qualifica...\n", "Name: text, dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the first 5 rows of the text column\n", "speech_df['text'].head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 fellow citizens of the senate and of the house...\n", "1 fellow citizens i am again called upon by th...\n", "2 when it was first perceived in early times t...\n", "3 friends and fellow citizens called upon to u...\n", "4 proceeding fellow citizens to that qualifica...\n", "Name: text_clean, dtype: object\n" ] } ], "source": [ "# Replace all non-letter characters with a space (the pattern is a regular expression)\n", "speech_df['text_clean'] = speech_df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)\n", "\n", "# Change to lower case\n", "speech_df['text_clean'] = speech_df['text_clean'].str.lower()\n", "\n", "# Print the first 5 rows of text_clean column\n", "print(speech_df['text_clean'].head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### High level text features\n", "Once the text has been cleaned and standardized, you can begin creating features from the data. The most fundamental information you can calculate about free-form text is its size, such as its length and number of words. 
In this exercise (and the rest of this chapter), you will focus on the cleaned/transformed text column (`text_clean`) you created in the last exercise.\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

| | text_clean | char_cnt | word_cnt | avg_word_length |
|---|---|---|---|---|
| 0 | fellow citizens of the senate and of the house... | 8616 | 1432 | 6.016760 |
| 1 | fellow citizens i am again called upon by th... | 787 | 135 | 5.829630 |
| 2 | when it was first perceived in early times t... | 13871 | 2323 | 5.971158 |
| 3 | friends and fellow citizens called upon to u... | 10144 | 1736 | 5.843318 |
| 4 | proceeding fellow citizens to that qualifica... | 12902 | 2169 | 5.948363 |
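
The size features described above (character count, word count, and average word length) can be derived directly from `text_clean`. The following is a minimal sketch on a small stand-in DataFrame rather than the real `speech_df`; `demo_df` and its two rows are hypothetical, while the column names `char_cnt`, `word_cnt`, and `avg_word_length` are taken from the table above.

```python
import pandas as pd

# Hypothetical stand-in for speech_df; the real data comes from inaugural_speeches.csv
demo_df = pd.DataFrame({'text_clean': ['fellow citizens of the senate',
                                       'when it was first perceived']})

# Number of characters in each text (spaces included)
demo_df['char_cnt'] = demo_df['text_clean'].str.len()

# Number of whitespace-separated words in each text
demo_df['word_cnt'] = demo_df['text_clean'].str.split().str.len()

# Average characters per word; char_cnt includes spaces, so this is an upper estimate
demo_df['avg_word_length'] = demo_df['char_cnt'] / demo_df['word_cnt']

print(demo_df[['text_clean', 'char_cnt', 'word_cnt', 'avg_word_length']])
```

Because `char_cnt` counts spaces as well as letters, `avg_word_length` computed this way slightly overstates the true mean word length. The values in the table above are consistent with this ratio (for example, 8616 / 1432 ≈ 6.0168).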

| | Name | Inaugural Address | Date | text | text_clean | char_cnt | word_cnt | avg_word_length | Counts_abiding | Counts_ability | ... | Counts_women | Counts_words | Counts_work | Counts_wrong | Counts_year | Counts_years | Counts_yet | Counts_you | Counts_young | Counts_your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | George Washington | First Inaugural Address | Thursday, April 30, 1789 | Fellow-Citizens of the Senate and of the House... | fellow citizens of the senate and of the house... | 8616 | 1432 | 6.016760 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 5 | 0 | 9 |
| 1 | George Washington | Second Inaugural Address | Monday, March 4, 1793 | Fellow Citizens: I AM again called upon by th... | fellow citizens i am again called upon by th... | 787 | 135 | 5.829630 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | John Adams | Inaugural Address | Saturday, March 4, 1797 | WHEN it was first perceived, in early times, t... | when it was first perceived in early times t... | 13871 | 2323 | 5.971158 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 2 | 3 | 0 | 0 | 0 | 1 |
| 3 | Thomas Jefferson | First Inaugural Address | Wednesday, March 4, 1801 | Friends and Fellow-Citizens: CALLED upon to u... | friends and fellow citizens called upon to u... | 10144 | 1736 | 5.843318 | 0 | 0 | ... | 0 | 0 | 1 | 2 | 0 | 0 | 2 | 7 | 0 | 7 |
| 4 | Thomas Jefferson | Second Inaugural Address | Monday, March 4, 1805 | PROCEEDING, fellow-citizens, to that qualifica... | proceeding fellow citizens to that qualifica... | 12902 | 2169 | 5.948363 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 4 | 0 | 4 |

5 rows × 826 columns
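
The `Counts_` columns in the table above hold per-document word counts; the code that produced them is not shown here. As a library-free sketch of the underlying idea, the snippet below builds a vocabulary from two toy documents and counts each vocabulary word per document. The names `docs`, `vocab`, and `count_rows` and the toy sentences are hypothetical, and a real pipeline would typically also filter very rare or very common words.

```python
from collections import Counter

# Toy documents standing in for the text_clean column (hypothetical data)
docs = ['we the people', 'the people and the union']

# Vocabulary: every distinct word across all documents, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one 'Counts_<word>' entry per vocabulary word
count_rows = []
for doc in docs:
    counts = Counter(doc.split())  # a missing word yields 0 on lookup
    count_rows.append({f'Counts_{word}': counts[word] for word in vocab})

for row in count_rows:
    print(row)
```

Each document becomes a fixed-length row regardless of its own length, which is why the full table grows to hundreds of columns as the vocabulary grows.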

| | TFIDF_action | TFIDF_administration | TFIDF_america | TFIDF_american | TFIDF_americans | TFIDF_believe | TFIDF_best | TFIDF_better | TFIDF_change | TFIDF_citizens | ... | TFIDF_things | TFIDF_time | TFIDF_today | TFIDF_union | TFIDF_united | TFIDF_war | TFIDF_way | TFIDF_work | TFIDF_world | TFIDF_years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.133415 | 0.000000 | 0.105388 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.229644 | ... | 0.000000 | 0.045929 | 0.0 | 0.136012 | 0.203593 | 0.000000 | 0.060755 | 0.000000 | 0.045929 | 0.052694 |
| 1 | 0.000000 | 0.261016 | 0.266097 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.179712 | ... | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.199157 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 0.000000 | 0.092436 | 0.157058 | 0.073018 | 0.0 | 0.000000 | 0.026112 | 0.060460 | 0.000000 | 0.106072 | ... | 0.032030 | 0.021214 | 0.0 | 0.062823 | 0.070529 | 0.024339 | 0.000000 | 0.000000 | 0.063643 | 0.073018 |
| 3 | 0.000000 | 0.092693 | 0.000000 | 0.000000 | 0.0 | 0.090942 | 0.117831 | 0.045471 | 0.053335 | 0.223369 | ... | 0.048179 | 0.000000 | 0.0 | 0.094497 | 0.000000 | 0.036610 | 0.000000 | 0.039277 | 0.095729 | 0.000000 |
| 4 | 0.041334 | 0.039761 | 0.000000 | 0.031408 | 0.0 | 0.000000 | 0.067393 | 0.039011 | 0.091514 | 0.273760 | ... | 0.082667 | 0.164256 | 0.0 | 0.121605 | 0.030338 | 0.094225 | 0.000000 | 0.000000 | 0.054752 | 0.062817 |

5 rows × 100 columns
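
The `TFIDF_` columns above weight each word by how often it occurs in a document and how rare it is across documents. The sketch below uses the textbook definition, term frequency times log(N / document frequency), on toy data. It is an illustration under stated assumptions, not the code behind the table: library implementations usually add smoothing and row normalization, so their numbers differ. All names (`docs`, `doc_freq`, `tfidf`, `tfidf_rows`) are hypothetical.

```python
import math
from collections import Counter

# Toy documents standing in for the text_clean column (hypothetical data)
docs = ['the union and the people', 'the war', 'the people work']
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears at least once
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Textbook TF-IDF for one document: (count / length) * log(N / df)."""
    counts = Counter(tokens)
    return {f'TFIDF_{word}': (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()}

tfidf_rows = [tfidf(tokens) for tokens in tokenized]

for row in tfidf_rows:
    print(row)
```

A word that appears in every document, such as "the" here, gets a weight of zero under this definition, while a word unique to one document, such as "war", gets the largest weight.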

| | TFIDF_action | TFIDF_administration | TFIDF_america | TFIDF_american | TFIDF_authority | TFIDF_best | TFIDF_business | TFIDF_citizens | TFIDF_commerce | TFIDF_common | ... | TFIDF_subject | TFIDF_support | TFIDF_time | TFIDF_union | TFIDF_united | TFIDF_war | TFIDF_way | TFIDF_work | TFIDF_world | TFIDF_years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.029540 | 0.233954 | 0.082703 | 0.000000 | 0.000000 | 0.000000 | 0.022577 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.115378 | 0.000000 | 0.024648 | 0.079050 | 0.033313 | 0.000000 | 0.299983 | 0.134749 |
| 1 | 0.000000 | 0.000000 | 0.547457 | 0.036862 | 0.000000 | 0.036036 | 0.000000 | 0.015094 | 0.0 | 0.000000 | ... | 0.0 | 0.019296 | 0.092567 | 0.000000 | 0.000000 | 0.052851 | 0.066817 | 0.078999 | 0.277701 | 0.126126 |
| 2 | 0.000000 | 0.000000 | 0.126987 | 0.134669 | 0.000000 | 0.131652 | 0.000000 | 0.000000 | 0.0 | 0.046997 | ... | 0.0 | 0.000000 | 0.075151 | 0.000000 | 0.080272 | 0.042907 | 0.054245 | 0.096203 | 0.225452 | 0.043884 |
| 3 | 0.037094 | 0.067428 | 0.267012 | 0.031463 | 0.039990 | 0.061516 | 0.050085 | 0.077301 | 0.0 | 0.000000 | ... | 0.0 | 0.098819 | 0.210690 | 0.000000 | 0.056262 | 0.030073 | 0.038020 | 0.235998 | 0.237026 | 0.061516 |
| 4 | 0.000000 | 0.000000 | 0.221561 | 0.156644 | 0.028442 | 0.087505 | 0.000000 | 0.109959 | 0.0 | 0.023428 | ... | 0.0 | 0.023428 | 0.187313 | 0.131913 | 0.040016 | 0.021389 | 0.081124 | 0.119894 | 0.299701 | 0.153133 |

5 rows × 100 columns