{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"; " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Extraction\n", "\n", "As of v0.28.x, **ktrain** includes the `TextExtractor` class allows you to easily extract text from various file formats such as PDFs and MS Word documents." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2021-10-13 14:43:00-- https://aclanthology.org/N19-1423.pdf\n", "Resolving aclanthology.org (aclanthology.org)... 174.138.37.75\n", "Connecting to aclanthology.org (aclanthology.org)|174.138.37.75|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 786279 (768K) [application/pdf]\n", "Saving to: ‘/tmp/bert_paper.pdf’\n", "\n", "/tmp/bert_paper.pdf 100%[===================>] 767.85K --.-KB/s in 0.08s \n", "\n", "2021-10-13 14:43:00 (9.90 MB/s) - ‘/tmp/bert_paper.pdf’ saved [786279/786279]\n", "\n" ] } ], "source": [ "!wget https://aclanthology.org/N19-1423.pdf -O /tmp/bert_paper.pdf" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from ktrain.text import TextExtractor\n", "te = TextExtractor()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Extract text into single string variable:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BERT: Pre-training of Deep Bidirectional Transformers for\n", "Language Understanding\n", "Jacob Devlin\n", "\n", "Ming-Wei Chang Kenton Lee Kristina Toutanova\n", "Google AI Language\n", "{jacobdevlin,mingweichang,kentonl,kristout}@google.com\n", "\n", "Abstract\n", "We introduce a new language representation model called BERT, which stands for\n", "Bidirectional Encoder Representations from\n", "Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from\n", "unlabeled text by jointly conditioning on both\n", "left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer\n", "to create state-of-the-art models for a wide\n", "range of tasks, such as question answering and\n", "language inference, without substantial taskspecific architecture modifications.\n", "BERT is conceptually simple and empirically\n", "powerful. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Extract text and split into sentences:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['BERT : Pre training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin', 'Ming Wei Chang Kenton Lee Kristina Toutanova Google AI Language { jacobdevlin , mingweichang , kentonl , kristout } @google.com', 'Abstract We introduce a new language representation model called BERT , which stands for Bidirectional Encoder Representations from Transformers .', 'Unlike recent language representation models ( Peters et al . , 2018a ; Radford et al . , 2018 ) , BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers .', 'As a result , the pre trained BERT model can be finetuned with just one additional output layer to create state of the art models for a wide range of tasks , such as question answering and language inference , without substantial taskspecific architecture modifications .']\n" ] } ], "source": [ "sentences = te.extract('/tmp/bert_paper.pdf', return_format='sentences')\n", "print(sentences[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Extract text and split into paragraphs:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "495 paragraphs\n", "Third paragraph from paper is:\n", "\n", "Abstract We introduce a new language representation model called BERT , which stands for Bidirectional Encoder Representations from Transformers . Unlike recent language representation models ( Peters et al . , 2018a ; Radford et al . , 2018 ) , BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers . As a result , the pre trained BERT model can be finetuned with just one additional output layer to create state of the art models for a wide range of tasks , such as question answering and language inference , without substantial taskspecific architecture modifications . BERT is conceptually simple and empirically powerful . It obtains new state of the art results on eleven natural language processing tasks , including pushing the GLUE score to 80.5 % ( 7.7 % point absolute improvement ) , Multi NLI accuracy to 86.7 % ( 4.6 % absolute improvement ) , SQu AD v1.1 question answering Test F1 to 93.2 ( 1.5 point absolute improvement ) and SQu AD v2.0 Test F1 to 83.1 ( 5.1 point absolute improvement ) .\n" ] } ], "source": [ "paragraphs = te.extract('/tmp/bert_paper.pdf', return_format='paragraphs')\n", "print(\"%s paragraphs\" % (len(paragraphs)))\n", "print('Third paragraph from paper is:\\n')\n", "print(paragraphs[2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also feed strings directly to the `TextExtractor` to split them into lists of sentences or paragraphs:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['This is the first sentence .', 'This is the second sentence .']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "two_sentences = 'This is the first sentence. This is the second sentence.'\n", "te.extract(text=two_sentences, return_format='sentences')" ] },
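{ "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of the same idea for paragraphs (the two-paragraph string below is made up for illustration, and we assume blank lines mark the paragraph breaks, as in the extracted PDF text above):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hypothetical example: split a raw string into paragraphs\n", "# the string is illustrative; the blank line marks the paragraph break\n", "two_paragraphs = 'This is the first paragraph.\\n\\nThis is the second paragraph.'\n", "te.extract(text=two_paragraphs, return_format='paragraphs')" ] }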
], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }