{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "logical-italy", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: spacy in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (3.1.1)\n", "Requirement already satisfied: preshed<3.1.0,>=3.0.2 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy) (3.0.5)\n", "Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy) (4.59.0)\n", "Requirement already satisfied: pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy) (1.7.4)\n", "Requirement already satisfied: setuptools in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy) (54.1.1)\n", "Requirement already satisfied: requests<3.0.0,>=2.13.0 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy) (2.25.1)" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING: You are using pip version 21.1.2; however, version 21.2.4 is available." ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Requirement already satisfied: cymem<2.1.0,>=2.0.2 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy) (2.0.5)\n", "Requirement already satisfied: jinja2 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy) (2.11.3)\n", "Requirement already satisfied: srsly<3.0.0,>=2.4.1 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy) (2.4.1)\n", "Requirement already satisfied: catalogue<2.1.0,>=2.0.4 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy) (2.0.4)\n", "Requirement already satisfied: blis<0.8.0,>=0.4.0 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy) (0.7.4)\n", "Requirement already satisfied: thinc<8.1.0,>=8.0.8 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy) (8.0.8)\n", "Requirement already satisfied: numpy>=1.15.0 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy) (1.20.2)\n", "Requirement already satisfied: pathy>=0.3.5 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy) (0.4.0)\n", "Requirement already satisfied: packaging>=20.0 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy) (20.9)\n", "Requirement already satisfied: typer<0.4.0,>=0.3.0 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy) (0.3.2)\n", "Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.7 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy) (3.0.8)\n", "Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy) (1.0.5)\n", "Requirement already satisfied: wasabi<1.1.0,>=0.8.1 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy) (0.8.2)\n", "Requirement already satisfied: pyparsing>=2.0.2 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from packaging>=20.0->spacy) (2.4.7)\n", "Requirement already satisfied: smart-open<4.0.0,>=2.2.0 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from pathy>=0.3.5->spacy) (3.0.0)\n", "Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from requests<3.0.0,>=2.13.0->spacy) (1.26.3)\n", "Requirement already satisfied: certifi>=2017.4.17 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from requests<3.0.0,>=2.13.0->spacy) (2020.12.5)\n", "Requirement already satisfied: idna<3,>=2.5 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from requests<3.0.0,>=2.13.0->spacy) (2.10)\n", "Requirement already satisfied: chardet<5,>=3.0.2 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from requests<3.0.0,>=2.13.0->spacy) (4.0.0)\n", "Requirement already satisfied: click<7.2.0,>=7.1.1 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from typer<0.4.0,>=0.3.0->spacy) (7.1.2)\n", "Requirement already satisfied: MarkupSafe>=0.23 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from jinja2->spacy) (1.1.1)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n", "You should consider upgrading via the 'c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\python.exe -m pip install --upgrade pip' command.\n" ] } ], "source": [ "!pip install spacy" ] }, { "cell_type": "code", "execution_count": 2, "id": "rising-right", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting en-core-web-sm==3.1.0" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2021-09-14 12:51:21.698439: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll\n", "WARNING: You are using pip version 21.1.2; however, version 21.2.4 is available.\n", "You should consider upgrading via the 'C:\\Users\\wma22\\AppData\\Local\\Programs\\Python\\Python39\\python.exe -m pip install --upgrade pip' command.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", " Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)\n", "Requirement already satisfied: spacy<3.2.0,>=3.1.0 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from en-core-web-sm==3.1.0) (3.1.1)\n", "Requirement already satisfied: srsly<3.0.0,>=2.4.1 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (2.4.1)\n", "Requirement already satisfied: typer<0.4.0,>=0.3.0 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (0.3.2)\n", "Requirement already satisfied: numpy>=1.15.0 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (1.20.2)\n", "Requirement already satisfied: blis<0.8.0,>=0.4.0 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (0.7.4)\n", "Requirement already satisfied: requests<3.0.0,>=2.13.0 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (2.25.1)\n", "Requirement already satisfied: catalogue<2.1.0,>=2.0.4 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (2.0.4)\n", "Requirement already satisfied: pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (1.7.4)\n", "Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.7 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (3.0.8)\n", "Requirement already satisfied: packaging>=20.0 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (20.9)\n", "Requirement already satisfied: cymem<2.1.0,>=2.0.2 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (2.0.5)\n", "Requirement already satisfied: wasabi<1.1.0,>=0.8.1 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (0.8.2)\n", "Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (1.0.5)\n", "Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (4.59.0)\n", "Requirement already satisfied: pathy>=0.3.5 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (0.4.0)\n", "Requirement already satisfied: thinc<8.1.0,>=8.0.8 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (8.0.8)\n", "Requirement already satisfied: setuptools in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (54.1.1)\n", "Requirement already satisfied: jinja2 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (2.11.3)\n", "Requirement already satisfied: preshed<3.1.0,>=3.0.2 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (3.0.5)\n", "Requirement already satisfied: pyparsing>=2.0.2 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from packaging>=20.0->spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (2.4.7)\n", "Requirement already satisfied: smart-open<4.0.0,>=2.2.0 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from pathy>=0.3.5->spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (3.0.0)\n", "Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from requests<3.0.0,>=2.13.0->spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (1.26.3)\n", "Requirement already satisfied: certifi>=2017.4.17 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from requests<3.0.0,>=2.13.0->spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (2020.12.5)\n", "Requirement already satisfied: chardet<5,>=3.0.2 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from requests<3.0.0,>=2.13.0->spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (4.0.0)\n", "Requirement already satisfied: idna<3,>=2.5 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from requests<3.0.0,>=2.13.0->spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (2.10)\n", "Requirement already satisfied: click<7.2.0,>=7.1.1 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from typer<0.4.0,>=0.3.0->spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (7.1.2)\n", "Requirement already satisfied: MarkupSafe>=0.23 in c:\\users\\wma22\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (from jinja2->spacy<3.2.0,>=3.1.0->en-core-web-sm==3.1.0) (1.1.1)\n", "[+] Download and installation successful\n", "You can now load the package via spacy.load('en_core_web_sm')\n" ] } ], "source": [ "!python -m spacy download en_core_web_sm" ] }, { "cell_type": "code", "execution_count": 3, "id": "qualified-contest", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Enabling eager execution\n", "INFO:tensorflow:Enabling v2 tensorshape\n", "INFO:tensorflow:Enabling resource variables\n", "INFO:tensorflow:Enabling tensor equality\n", "INFO:tensorflow:Enabling control flow v2\n" ] } ], "source": [ "import spacy" ] }, { "cell_type": "code", "execution_count": 4, "id": "celtic-liverpool", "metadata": {}, "outputs": [], "source": [ "nlp = spacy.load(\"en_core_web_sm\")" ] }, { "cell_type": "code", "execution_count": 5, "id": "advisory-burke", "metadata": {}, "outputs": [], "source": [ "with open (\"data/wiki_us.txt\", \"r\") as f:\n", " text = f.read()" ] }, { "cell_type": "code", "execution_count": 6, "id": "miniature-legislation", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.\n", "\n", "Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies established along the East Coast. Disputes over taxation and political representation with Great Britain led to the American Revolutionary War (1775–1783), which established independence. In the late 18th century, the U.S. began expanding across North America, gradually obtaining new territories, sometimes through war, frequently displacing Native Americans, and admitting new states; by 1848, the United States spanned the continent. Slavery was legal in the southern United States until the second half of the 19th century when the American Civil War led to its abolition. The Spanish–American War and World War I established the U.S. as a world power, a status confirmed by the outcome of World War II.\n", "\n", "During the Cold War, the United States fought the Korean War and the Vietnam War but avoided direct military conflict with the Soviet Union. The two superpowers competed in the Space Race, culminating in the 1969 spaceflight that first landed humans on the Moon. The Soviet Union's dissolution in 1991 ended the Cold War, leaving the United States as the world's sole superpower.\n", "\n", "The United States is a federal republic and a representative democracy with three separate branches of government, including a bicameral legislature. It is a founding member of the United Nations, World Bank, International Monetary Fund, Organization of American States, NATO, and other international organizations. It is a permanent member of the United Nations Security Council. Considered a melting pot of cultures and ethnicities, its population has been profoundly shaped by centuries of immigration. The country ranks high in international measures of economic freedom, quality of life, education, and human rights, and has low levels of perceived corruption. However, the country has received criticism concerning inequality related to race, wealth and income, the use of capital punishment, high incarceration rates, and lack of universal health care.\n", "\n", "The United States is a highly developed country, accounts for approximately a quarter of global GDP, and is the world's largest economy. By value, the United States is the world's largest importer and the second-largest exporter of goods. Although its population is only 4.2% of the world's total, it holds 29.4% of the total wealth in the world, the largest share held by any country. Making up more than a third of global military spending, it is the foremost military power in the world; and it is a leading political, cultural, and scientific force internationally.[23]\n" ] } ], "source": [ "print (text)" ] }, { "cell_type": "code", "execution_count": 7, "id": "spread-banana", "metadata": {}, "outputs": [], "source": [ "doc = nlp(text)" ] }, { "cell_type": "code", "execution_count": 8, "id": "selective-banana", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.\n", "\n", "Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies established along the East Coast. Disputes over taxation and political representation with Great Britain led to the American Revolutionary War (1775–1783), which established independence. In the late 18th century, the U.S. began expanding across North America, gradually obtaining new territories, sometimes through war, frequently displacing Native Americans, and admitting new states; by 1848, the United States spanned the continent. Slavery was legal in the southern United States until the second half of the 19th century when the American Civil War led to its abolition. The Spanish–American War and World War I established the U.S. as a world power, a status confirmed by the outcome of World War II.\n", "\n", "During the Cold War, the United States fought the Korean War and the Vietnam War but avoided direct military conflict with the Soviet Union. The two superpowers competed in the Space Race, culminating in the 1969 spaceflight that first landed humans on the Moon. The Soviet Union's dissolution in 1991 ended the Cold War, leaving the United States as the world's sole superpower.\n", "\n", "The United States is a federal republic and a representative democracy with three separate branches of government, including a bicameral legislature. It is a founding member of the United Nations, World Bank, International Monetary Fund, Organization of American States, NATO, and other international organizations. It is a permanent member of the United Nations Security Council. Considered a melting pot of cultures and ethnicities, its population has been profoundly shaped by centuries of immigration. The country ranks high in international measures of economic freedom, quality of life, education, and human rights, and has low levels of perceived corruption. However, the country has received criticism concerning inequality related to race, wealth and income, the use of capital punishment, high incarceration rates, and lack of universal health care.\n", "\n", "The United States is a highly developed country, accounts for approximately a quarter of global GDP, and is the world's largest economy. By value, the United States is the world's largest importer and the second-largest exporter of goods. Although its population is only 4.2% of the world's total, it holds 29.4% of the total wealth in the world, the largest share held by any country. Making up more than a third of global military spending, it is the foremost military power in the world; and it is a leading political, cultural, and scientific force internationally.[23]\n" ] } ], "source": [ "print (doc)" ] }, { "cell_type": "code", "execution_count": 9, "id": "signed-procedure", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3525\n", "652\n" ] } ], "source": [ "print (len(text))\n", "print (len(doc))" ] }, { "cell_type": "code", "execution_count": 10, "id": "wrong-center", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "T\n", "h\n", "e\n", " \n", "U\n", "n\n", "i\n", "t\n", "e\n", "d\n" ] } ], "source": [ "for token in text[0:10]:\n", " print (token)" ] }, { "cell_type": "code", "execution_count": 11, "id": "religious-logan", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The\n", "United\n", "States\n", "of\n", "America\n", "(\n", "U.S.A.\n", "or\n", "USA\n", ")\n" ] } ], "source": [ "for token in doc[:10]:\n", " print (token)" ] }, { "cell_type": "code", "execution_count": 12, "id": "broadband-income", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The\n", "United\n", "States\n", "of\n", "America\n", "(U.S.A.\n", "or\n", "USA),\n", "commonly\n", "known\n" ] } ], "source": [ "for token in text.split()[:10]:\n", " print (token)" ] }, { "cell_type": "code", "execution_count": 13, "id": "recognized-contract", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.\n", "It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]\n", "At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world.\n", "The national capital is Washington, D.C., and the most populous city is New York.\n", "\n", "\n", "Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.\n", "The United States emerged from the thirteen British colonies established along the East Coast.\n", "Disputes over taxation and political representation with Great Britain led to the American Revolutionary War (1775–1783), which established independence.\n", "In the late 18th century, the U.S. began expanding across North America, gradually obtaining new territories, sometimes through war, frequently displacing Native Americans, and admitting new states; by 1848, the United States spanned the continent.\n", "Slavery was legal in the southern United States until the second half of the 19th century when the American Civil War led to its abolition.\n", "The Spanish–American War and World War I established the U.S. as a world power, a status confirmed by the outcome of World War II.\n", "\n", "\n", "\n", "During the Cold War, the United States fought the Korean War and the Vietnam War but avoided direct military conflict with the Soviet Union.\n", "The two superpowers competed in the Space Race, culminating in the 1969 spaceflight that first landed humans on the Moon.\n", "The Soviet Union's dissolution in 1991 ended the Cold War, leaving the United States as the world's sole superpower.\n", "\n", "\n", "\n", "The United States is a federal republic and a representative democracy with three separate branches of government, including a bicameral legislature.\n", "It is a founding member of the United Nations, World Bank, International Monetary Fund, Organization of American States, NATO, and other international organizations.\n", "It is a permanent member of the United Nations Security Council.\n", "Considered a melting pot of cultures and ethnicities, its population has been profoundly shaped by centuries of immigration.\n", "The country ranks high in international measures of economic freedom, quality of life, education, and human rights, and has low levels of perceived corruption.\n", "However, the country has received criticism concerning inequality related to race, wealth and income, the use of capital punishment, high incarceration rates, and lack of universal health care.\n", "\n", "\n", "\n", "The United States is a highly developed country, accounts for approximately a quarter of global GDP, and is the world's largest economy.\n", "By value, the United States is the world's largest importer and the second-largest exporter of goods.\n", "Although its population is only 4.2% of the world's total, it holds 29.4% of the total wealth in the world, the largest share held by any country.\n", "Making up more than a third of global military spending, it is the foremost military power in the world; and it is a leading political, cultural, and scientific force internationally.[23]\n" ] } ], "source": [ "for sent in doc.sents:\n", " print (sent)" ] }, { "cell_type": "code", "execution_count": 14, "id": "external-letters", "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "'generator' object is not subscriptable", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0msentence1\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mdoc\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msents\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2\u001b[0m \u001b[0mprint\u001b[0m \u001b[1;33m(\u001b[0m\u001b[0msentence1\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mTypeError\u001b[0m: 'generator' object is not subscriptable" ] } ], "source": [ "sentence1 = doc.sents[0]\n", "print (sentence1)" ] }, { "cell_type": "code", "execution_count": 15, "id": "boxed-water", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.\n" ] } ], "source": [ "sentence1 = list(doc.sents)[0]\n", "print (sentence1)" ] }, { "cell_type": "code", "execution_count": 16, "id": "useful-merit", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The\n", "United\n", "States\n", "of\n", "America\n", "(\n", "U.S.A.\n", "or\n", "USA\n", ")\n" ] } ], "source": [ "for token in doc[:10]:\n", " print (token)" ] }, { "cell_type": "code", "execution_count": 17, "id": "colonial-agent", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "States\n" ] } ], "source": [ "token2 = sentence1[2]\n", "print (token2)" ] }, { "cell_type": "code", "execution_count": 18, "id": "separate-cooperation", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'States'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token2.text" ] }, { "cell_type": "code", "execution_count": 19, "id": "fundamental-sensitivity", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "The" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token2.left_edge" ] }, { "cell_type": "code", "execution_count": 20, "id": "impressed-biology", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "America" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token2.right_edge" ] }, { "cell_type": "code", "execution_count": 21, "id": "valued-pollution", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "384" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token2.ent_type" ] }, { "cell_type": "code", "execution_count": 22, "id": "indie-pacific", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'GPE'" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token2.ent_type_" ] }, { "cell_type": "code", "execution_count": 23, "id": "french-disabled", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'I'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token2.ent_iob_" ] }, { "cell_type": "code", "execution_count": 24, "id": "solar-corrections", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'States'" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token2.lemma_" ] }, { "cell_type": "code", "execution_count": 25, "id": "coupled-simon", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'know'" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentence1[12].lemma_" ] }, { "cell_type": "code", "execution_count": 27, "id": "associate-influence", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "known\n" ] } ], "source": [ "print (sentence1[12])" ] }, { "cell_type": "code", "execution_count": 28, "id": "direct-straight", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "NounType=Prop|Number=Sing" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token2.morph" ] }, { "cell_type": "code", "execution_count": 29, "id": "straight-pacific", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Aspect=Perf|Tense=Past|VerbForm=Part" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentence1[12].morph" ] }, { "cell_type": "code", "execution_count": 30, "id": "described-western", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'PROPN'" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token2.pos_" ] }, { "cell_type": "code", "execution_count": 31, "id": "directed-documentary", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'nsubj'" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token2.dep_" ] }, { "cell_type": "code", "execution_count": 32, "id": "surprising-taste", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'en'" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token2.lang_" ] }, { "cell_type": "code", "execution_count": 33, "id": "southern-component", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mike enjoys playing football.\n" ] } ], "source": [ "text = \"Mike enjoys playing football.\"\n", "doc2 = nlp(text)\n", "print (doc2)" ] }, { "cell_type": "code", "execution_count": 35, "id": "suitable-technical", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mike PROPN nsubj\n", "enjoys VERB ROOT\n", "playing VERB xcomp\n", "football NOUN dobj\n", ". PUNCT punct\n" ] } ], "source": [ "for token in doc2:\n", " print (token.text, token.pos_, token.dep_)" ] }, { "cell_type": "code", "execution_count": 36, "id": "banner-height", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " Mike\n", " PROPN\n", "\n", "\n", "\n", " enjoys\n", " VERB\n", "\n", "\n", "\n", " playing\n", " VERB\n", "\n", "\n", "\n", " football.\n", " NOUN\n", "\n", "\n", "\n", " \n", " \n", " nsubj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " xcomp\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " dobj\n", " \n", " \n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from spacy import displacy\n", "displacy.render(doc2, style=\"dep\")" ] }, { "cell_type": "code", "execution_count": 37, "id": "helpful-exposure", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The United States of America GPE\n", "U.S.A. GPE\n", "USA GPE\n", "the United States GPE\n", "U.S. GPE\n", "US GPE\n", "America GPE\n", "North America LOC\n", "50 CARDINAL\n", "five CARDINAL\n", "326 CARDINAL\n", "Indian NORP\n", "3.8 million square miles QUANTITY\n", "9.8 million square kilometers QUANTITY\n", "fourth ORDINAL\n", "The United States GPE\n", "Canada GPE\n", "Mexico GPE\n", "Bahamas GPE\n", "Cuba GPE\n", "more than 331 million CARDINAL\n", "third ORDINAL\n", "Washington GPE\n", "D.C. GPE\n", "New York GPE\n", "Paleo-Indians NORP\n", "Siberia LOC\n", "North American NORP\n", "at least 12,000 years ago DATE\n", "European NORP\n", "the 16th century DATE\n", "The United States GPE\n", "thirteen CARDINAL\n", "British NORP\n", "the East Coast LOC\n", "Great Britain GPE\n", "the American Revolutionary War ORG\n", "the late 18th century DATE\n", "U.S. GPE\n", "North America LOC\n", "Native Americans NORP\n", "1848 DATE\n", "the United States GPE\n", "United States GPE\n", "the second half of the 19th century DATE\n", "the American Civil War EVENT\n", "The Spanish–American War and World War EVENT\n", "U.S. GPE\n", "World War II EVENT\n", "the Cold War EVENT\n", "the United States GPE\n", "the Korean War EVENT\n", "the Vietnam War EVENT\n", "the Soviet Union GPE\n", "two CARDINAL\n", "the Space Race FAC\n", "1969 DATE\n", "first ORDINAL\n", "The Soviet Union's GPE\n", "1991 DATE\n", "the Cold War EVENT\n", "the United States GPE\n", "The United States GPE\n", "three CARDINAL\n", "the United Nations ORG\n", "World Bank ORG\n", "International Monetary Fund ORG\n", "Organization of American States ORG\n", "NATO ORG\n", "the United Nations Security Council ORG\n", "centuries DATE\n", "The United States GPE\n", "approximately a quarter DATE\n", "the United States GPE\n", "second ORDINAL\n", "only 4.2% PERCENT\n", "29.4% PERCENT\n", "more than a third CARDINAL\n" ] } ], "source": [ "for ent in doc.ents:\n", " print (ent.text, ent.label_)" ] }, { "cell_type": "code", "execution_count": 38, "id": "prescribed-sudan", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " The United States of America\n", " GPE\n", "\n", " (\n", "\n", " U.S.A.\n", " GPE\n", "\n", " or \n", "\n", " USA\n", " GPE\n", "\n", "), commonly known as \n", "\n", " the United States\n", " GPE\n", "\n", " (\n", "\n", " U.S.\n", " GPE\n", "\n", " or \n", "\n", " US\n", " GPE\n", "\n", ") or \n", "\n", " America\n", " GPE\n", "\n", ", is a country primarily located in \n", "\n", " North America\n", " LOC\n", "\n", ". It consists of \n", "\n", " 50\n", " CARDINAL\n", "\n", " states, a federal district, \n", "\n", " five\n", " CARDINAL\n", "\n", " major unincorporated territories, \n", "\n", " 326\n", " CARDINAL\n", "\n", " \n", "\n", " Indian\n", " NORP\n", "\n", " reservations, and some minor possessions.[j] At \n", "\n", " 3.8 million square miles\n", " QUANTITY\n", "\n", " (\n", "\n", " 9.8 million square kilometers\n", " QUANTITY\n", "\n", "), it is the world's third- or \n", "\n", " fourth\n", " ORDINAL\n", "\n", "-largest country by total area.[d] \n", "\n", " The United States\n", " GPE\n", "\n", " shares significant land borders with \n", "\n", " Canada\n", " GPE\n", "\n", " to the north and \n", "\n", " Mexico\n", " GPE\n", "\n", " to the south, as well as limited maritime borders with the \n", "\n", " Bahamas\n", " GPE\n", "\n", ", \n", "\n", " Cuba\n", " GPE\n", "\n", ", and Russia.[22] With a population of \n", "\n", " more than 331 million\n", " CARDINAL\n", "\n", " people, it is the \n", "\n", " third\n", " ORDINAL\n", "\n", " most populous country in the world. The national capital is \n", "\n", " Washington\n", " GPE\n", "\n", ", \n", "\n", " D.C.\n", " GPE\n", "\n", ", and the most populous city is \n", "\n", " New York\n", " GPE\n", "\n", ".

\n", "\n", " Paleo-Indians\n", " NORP\n", "\n", " migrated from \n", "\n", " Siberia\n", " LOC\n", "\n", " to the \n", "\n", " North American\n", " NORP\n", "\n", " mainland \n", "\n", " at least 12,000 years ago\n", " DATE\n", "\n", ", and \n", "\n", " European\n", " NORP\n", "\n", " colonization began in \n", "\n", " the 16th century\n", " DATE\n", "\n", ". \n", "\n", " The United States\n", " GPE\n", "\n", " emerged from the \n", "\n", " thirteen\n", " CARDINAL\n", "\n", " \n", "\n", " British\n", " NORP\n", "\n", " colonies established along \n", "\n", " the East Coast\n", " LOC\n", "\n", ". Disputes over taxation and political representation with \n", "\n", " Great Britain\n", " GPE\n", "\n", " led to \n", "\n", " the American Revolutionary War\n", " ORG\n", "\n", " (1775–1783), which established independence. In \n", "\n", " the late 18th century\n", " DATE\n", "\n", ", the \n", "\n", " U.S.\n", " GPE\n", "\n", " began expanding across \n", "\n", " North America\n", " LOC\n", "\n", ", gradually obtaining new territories, sometimes through war, frequently displacing \n", "\n", " Native Americans\n", " NORP\n", "\n", ", and admitting new states; by \n", "\n", " 1848\n", " DATE\n", "\n", ", \n", "\n", " the United States\n", " GPE\n", "\n", " spanned the continent. Slavery was legal in the southern \n", "\n", " United States\n", " GPE\n", "\n", " until \n", "\n", " the second half of the 19th century\n", " DATE\n", "\n", " when \n", "\n", " the American Civil War\n", " EVENT\n", "\n", " led to its abolition. \n", "\n", " The Spanish–American War and World War\n", " EVENT\n", "\n", " I established the \n", "\n", " U.S.\n", " GPE\n", "\n", " as a world power, a status confirmed by the outcome of \n", "\n", " World War II\n", " EVENT\n", "\n", ".

During \n", "\n", " the Cold War\n", " EVENT\n", "\n", ", \n", "\n", " the United States\n", " GPE\n", "\n", " fought \n", "\n", " the Korean War\n", " EVENT\n", "\n", " and \n", "\n", " the Vietnam War\n", " EVENT\n", "\n", " but avoided direct military conflict with \n", "\n", " the Soviet Union\n", " GPE\n", "\n", ". The \n", "\n", " two\n", " CARDINAL\n", "\n", " superpowers competed in \n", "\n", " the Space Race\n", " FAC\n", "\n", ", culminating in the \n", "\n", " 1969\n", " DATE\n", "\n", " spaceflight that \n", "\n", " first\n", " ORDINAL\n", "\n", " landed humans on the Moon. \n", "\n", " The Soviet Union's\n", " GPE\n", "\n", " dissolution in \n", "\n", " 1991\n", " DATE\n", "\n", " ended \n", "\n", " the Cold War\n", " EVENT\n", "\n", ", leaving \n", "\n", " the United States\n", " GPE\n", "\n", " as the world's sole superpower.

\n", "\n", " The United States\n", " GPE\n", "\n", " is a federal republic and a representative democracy with \n", "\n", " three\n", " CARDINAL\n", "\n", " separate branches of government, including a bicameral legislature. It is a founding member of \n", "\n", " the United Nations\n", " ORG\n", "\n", ", \n", "\n", " World Bank\n", " ORG\n", "\n", ", \n", "\n", " International Monetary Fund\n", " ORG\n", "\n", ", \n", "\n", " Organization of American States\n", " ORG\n", "\n", ", \n", "\n", " NATO\n", " ORG\n", "\n", ", and other international organizations. It is a permanent member of \n", "\n", " the United Nations Security Council\n", " ORG\n", "\n", ". Considered a melting pot of cultures and ethnicities, its population has been profoundly shaped by \n", "\n", " centuries\n", " DATE\n", "\n", " of immigration. The country ranks high in international measures of economic freedom, quality of life, education, and human rights, and has low levels of perceived corruption. However, the country has received criticism concerning inequality related to race, wealth and income, the use of capital punishment, high incarceration rates, and lack of universal health care.

\n", "\n", " The United States\n", " GPE\n", "\n", " is a highly developed country, accounts for \n", "\n", " approximately a quarter\n", " DATE\n", "\n", " of global GDP, and is the world's largest economy. By value, \n", "\n", " the United States\n", " GPE\n", "\n", " is the world's largest importer and the \n", "\n", " second\n", " ORDINAL\n", "\n", "-largest exporter of goods. Although its population is \n", "\n", " only 4.2%\n", " PERCENT\n", "\n", " of the world's total, it holds \n", "\n", " 29.4%\n", " PERCENT\n", "\n", " of the total wealth in the world, the largest share held by any country. Making up \n", "\n", " more than a third\n", " CARDINAL\n", "\n", " of global military spending, it is the foremost military power in the world; and it is a leading political, cultural, and scientific force internationally.[23]
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "displacy.render(doc, style=\"ent\")" ] }, { "cell_type": "code", "execution_count": null, "id": "overhead-consultancy", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 5 }