{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"; " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Extraction\n", "\n", "As of v0.28.x, **ktrain** includes the `TextExtractor` class allows you to easily extract text from various file formats such as PDFs and MS Word documents." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2021-10-13 14:43:00-- https://aclanthology.org/N19-1423.pdf\n", "Resolving aclanthology.org (aclanthology.org)... 174.138.37.75\n", "Connecting to aclanthology.org (aclanthology.org)|174.138.37.75|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 786279 (768K) [application/pdf]\n", "Saving to: ‘/tmp/bert_paper.pdf’\n", "\n", "/tmp/bert_paper.pdf 100%[===================>] 767.85K --.-KB/s in 0.08s \n", "\n", "2021-10-13 14:43:00 (9.90 MB/s) - ‘/tmp/bert_paper.pdf’ saved [786279/786279]\n", "\n" ] } ], "source": [ "!wget https://aclanthology.org/N19-1423.pdf -O /tmp/bert_paper.pdf" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from ktrain.text import TextExtractor\n", "te = TextExtractor()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Extract text into single string variable:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BERT: Pre-training of Deep Bidirectional Transformers for\n", "Language Understanding\n", "Jacob Devlin\n", "\n", "Ming-Wei Chang Kenton Lee Kristina Toutanova\n", "Google AI Language\n", "{jacobdevlin,mingweichang,kentonl,kristout}@google.com\n", "\n", "Abstract\n", "We introduce a new language representation model called BERT, which stands for\n", "Bidirectional Encoder Representations from\n", "Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from\n", "unlabeled text by jointly conditioning on both\n", "left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer\n", "to create state-of-the-art models for a wide\n", "range of tasks, such as question answering and\n", "language inference, without substantial taskspecific architecture modifications.\n", "BERT is conceptually simple and empirically\n", "powerful. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Extract text and split into sentences:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['BERT : Pre training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin', 'Ming Wei Chang Kenton Lee Kristina Toutanova Google AI Language { jacobdevlin , mingweichang , kentonl , kristout } @google.com', 'Abstract We introduce a new language representation model called BERT , which stands for Bidirectional Encoder Representations from Transformers .', 'Unlike recent language representation models ( Peters et al . , 2018a ; Radford et al . , 2018 ) , BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers .', 'As a result , the pre trained BERT model can be finetuned with just one additional output layer to create state of the art models for a wide range of tasks , such as question answering and language inference , without substantial taskspecific architecture modifications .']\n" ] } ], "source": [ "sentences = te.extract('/tmp/bert_paper.pdf', return_format='sentences')\n", "print(sentences[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Extract text and split into paragraphs:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "495 paragraphs\n", "Third paragraph from paper is:\n", "\n", "Abstract We introduce a new language representation model called BERT , which stands for Bidirectional Encoder Representations from Transformers . Unlike recent language representation models ( Peters et al . , 2018a ; Radford et al . , 2018 ) , BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers . As a result , the pre trained BERT model can be finetuned with just one additional output layer to create state of the art models for a wide range of tasks , such as question answering and language inference , without substantial taskspecific architecture modifications . BERT is conceptually simple and empirically powerful . It obtains new state of the art results on eleven natural language processing tasks , including pushing the GLUE score to 80.5 % ( 7.7 % point absolute improvement ) , Multi NLI accuracy to 86.7 % ( 4.6 % absolute improvement ) , SQu AD v1.1 question answering Test F1 to 93.2 ( 1.5 point absolute improvement ) and SQu AD v2.0 Test F1 to 83.1 ( 5.1 point absolute improvement ) .\n" ] } ], "source": [ "paragraphs = te.extract('/tmp/bert_paper.pdf', return_format='paragraphs')\n", "print(\"%s paragraphs\" % (len(paragraphs)))\n", "print('Third paragraph from paper is:\\n')\n", "print(paragraphs[2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also feed strings directly to the `TextExtractor` to split them into lists of sentences or paragraphs:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['This is the first sentence .', 'This is the second sentence .']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "two_sentences = 'This is the first sentence. This is the second sentence.'\n", "te.extract(text=two_sentences, return_format='sentences')" ] },
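{ "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of the same idea for paragraphs (the two-paragraph string below is made up for illustration, and we assume blank lines mark the paragraph breaks, as in the extracted PDF text above):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hypothetical example: split a raw string into paragraphs\n", "# the string is illustrative; the blank line marks the paragraph break\n", "two_paragraphs = 'This is the first paragraph.\\n\\nThis is the second paragraph.'\n", "te.extract(text=two_paragraphs, return_format='paragraphs')" ] }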
], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }