{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "import re" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# You should download the NLTK data once.\n", "# It can be a bit time consuming.\n", "# nltk.download()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. More about the NLTK library:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# From the US president B. Obama's Nov-4th-2008 speech. \n", "\n", "paragraph = \"\"\"If there is anyone out there who still doubts that America is a place where all things are possible; who still wonders if the dream of our founders is alive in our time; who still questions the power of our democracy: Tonight is your answer.\n", "It's the answer told by lines that stretched around schools and churches in numbers this nation has never seen; by people who waited three hours and four hours, many for the very first time in their lives, because they believed that this time must be different; that their voices could be that difference. \n", "It's the answer spoken by young and old, rich and poor, Democrat and Republican, black, white, Hispanic, Asian, Native American, gay, straight, disabled and not disabled -- Americans who sent a message to the world that we have never been just a collection of individuals or a collection of Red States and Blue States: we are, and always will be, the United States of America!\n", "It's the answer that -- that led those who have been told for so long by so many to be cynical, and fearful, and doubtful about what we can achieve to put their hands on the arc of history and bend it once more toward the hope of a better day.\n", "It's been a long time coming, but tonight, because of what we did on this day, in this election, at this defining moment, change has come to America.\n", "A little bit earlier this evening, I received an extraordinarily gracious call from Senator McCain. Senator McCain fought long and hard in this campaign, and he's fought even longer and harder for the country that he loves. He has endured sacrifices for America that most of us cannot begin to imagine. We are better off for the service rendered by this brave and selfless leader. I congratulate him; I congratulate Governor Palin for all that they've achieved, and I look forward to working with them to renew this nation's promise in the months ahead.\n", "I want to thank my partner in this journey, a man who campaigned from his heart and spoke for the men and women he grew up with on the streets of Scranton and rode with on the train home to Delaware, the Vice President-elect of the United States, Joe Biden. \"\"\" " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1. Tokenization and pre-processing: 토큰화와 전처리" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['if there is anyone out there who still doubts that america is a place where all things are possible who still wonders if the dream of our founders is alive in our time who still questions the power of our democracy tonight is your answer', 'it s the answer told by lines that stretched around schools and churches in numbers this nation has never seen by people who waited three hours and four hours many for the very first time in their lives because they believed that this time must be different that their voices could be that difference', 'it s the answer spoken by young and old rich and poor democrat and republican black white hispanic asian native american gay straight disabled and not disabled americans who sent a message to the world that we have never been just a collection of individuals or a collection of red states and blue states we are and always will be the united states of america', 'it s the answer that that led those who have been told for so long by so many to be cynical and fearful and doubtful about what we can achieve to put their hands on the arc of history and bend it once more toward the hope of a better day', 'it s been a long time coming but tonight because of what we did on this day in this election at this defining moment change has come to america', 'a little bit earlier this evening i received an extraordinarily gracious call from senator mccain', 'senator mccain fought long and hard in this campaign and he s fought even longer and harder for the country that he loves', 'he has endured sacrifices for america that most of us cannot begin to imagine', 'we are better off for the service rendered by this brave and selfless leader', 'i congratulate him i congratulate governor palin for all that they ve achieved and i look forward to working with them to renew this nation s promise in the months ahead', 'i want to thank my partner in this journey a man who campaigned from his heart and spoke for the men and women he grew up with on the streets of scranton and rode with on the train home to delaware the vice president elect of the united states joe biden']\n" ] } ], "source": [ "#문장으로 토큰화\n", "sentences = nltk.sent_tokenize(paragraph)\n", "for i in range(len(sentences)):\n", " sentences[i] = sentences[i].lower()\n", " sentences[i] = re.sub(r'\\W',' ',sentences[i]) # Substitute the non-alphanumeric characters with space. \n", " sentences[i] = re.sub(r'\\s+',' ',sentences[i]) # Remove the excess of white spaces.\n", " sentences[i] = re.sub(r'\\s$','',sentences[i]) # Remove the space at the end of a sentence.\n", "print(sentences)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2. Removal of the stop words: " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from nltk.corpus import stopwords" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "for i in range(len(sentences)):\n", " words = nltk.word_tokenize(sentences[i]) # Tokenize into words.\n", " words = [x for x in words if x not in stopwords.words('english')] # Remove the stop words.\n", " sentences[i] = ' '.join(words) # Rejoin as a sentence." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['anyone still doubts america place things possible still wonders dream founders alive time still questions power democracy tonight answer', 'answer told lines stretched around schools churches numbers nation never seen people waited three hours four hours many first time lives believed time must different voices could difference', 'answer spoken young old rich poor democrat republican black white hispanic asian native american gay straight disabled disabled americans sent message world never collection individuals collection red states blue states always united states america', 'answer led told long many cynical fearful doubtful achieve put hands arc history bend toward hope better day', 'long time coming tonight day election defining moment change come america', 'little bit earlier evening received extraordinarily gracious call senator mccain', 'senator mccain fought long hard campaign fought even longer harder country loves', 'endured sacrifices america us begin imagine', 'better service rendered brave selfless leader', 'congratulate congratulate governor palin achieved look forward working renew nation promise months ahead', 'want thank partner journey man campaigned heart spoke men women grew streets scranton rode train home delaware vice president elect united states joe biden']\n" ] } ], "source": [ "print(sentences)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3. POS tagging: 품사태깅" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Test sentence.\n", "my_sentence = \"The Colosseum was built by the emperor Vespassian\"" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['the', 'colosseum', 'was', 'built', 'by', 'the', 'emperor', 'vespassian']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Simple pre-processing.\n", "my_words = nltk.word_tokenize(my_sentence)\n", "for i in range(len(my_words)):\n", " my_words[i] = my_words[i].lower()\n", "my_words" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('the', 'DT'),\n", " ('colosseum', 'NN'),\n", " ('was', 'VBD'),\n", " ('built', 'VBN'),\n", " ('by', 'IN'),\n", " ('the', 'DT'),\n", " ('emperor', 'NN'),\n", " ('vespassian', 'NN')]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# POS tagging.\n", "# OUTPUT: A list of tuples.\n", "my_words_tagged = nltk.pos_tag(my_words)\n", "my_words_tagged" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'the(DT) colosseum(NN) was(VBD) built(VBN) by(IN) the(DT) emperor(NN) vespassian(NN)'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Join the words + POS as a sentence.\n", "my_words_tagged2 = []\n", "for tw in my_words_tagged:\n", " my_words_tagged2.append(tw[0] + '(' + tw[1] + ')')\n", "my_sentence_tagged = ' '.join(my_words_tagged2)\n", "my_sentence_tagged" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }