{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Working with PDF Files\n", "\n", "Welcome back Agent. Often you will have to deal with PDF files. There are [many libraries in Python for working with PDFs](https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167), each with their pros and cons, the most common one being **PyPDF2**. You can install it with (note the case-sensitivity, you need to make sure your capitilization matches):\n", "\n", " pip install PyPDF2\n", " \n", "Keep in mind that not every PDF file can be read with this library. PDFs that are too blurry, have a special encoding, encrypted, or maybe just created with a particular program that doesn't work well with PyPDF2 won't be able to be read. If you find yourself in this situation, try using the libraries linked above, but keep in mind, these may also not work. The reason for this is because of the many different parameters for a PDF and how non-standard the settings can be, text could be shown as an image instead of a utf-8 encoding. There are many parameters to consider in this aspect.\n", "\n", "As far as PyPDF2 is concerned, it can only read the text from a PDF document, it won't be able to grab images or other media files from a PDF.\n", "___\n", "\n", "## Working with PyPDF2\n", "\n", "Let's being showing the basics of the PyPDF2 library." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# note the capitalization\n", "import PyPDF2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading PDFs\n", "\n", "Similar to the csv library, we open a pdf, then create a reader object for it. Notice how we use the binary method of reading , 'rb', instead of just 'r'." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Notice we read it as a binary with 'rb'\n", "f = open('Working_Business_Proposal.pdf','rb')" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "pdf_reader = PyPDF2.PdfFileReader(f)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pdf_reader.numPages" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "page_one = pdf_reader.getPage(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then extract the text:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "page_one_text = page_one.extractText()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Business Proposal\\n The Revolution is Coming\\n Leverage agile frameworks to provide a robust synopsis for high level \\noverviews. Iterative approaches to corporate strategy foster collaborative \\nthinking to further the overall value proposition. Organically grow the \\nholistic world view of disruptive innovation via workplace diversity and \\nempowerment. \\nBring to the table win-win survival strategies to ensure proactive \\ndomination. At the end of the day, going forward, a new normal that has \\nevolved from generation X is on the runway heading towards a streamlined \\ncloud solution. User generated content in real-time will have multiple \\ntouchpoints for offshoring. \\nCapitalize on low hanging fruit to identify a ballpark value added activity to \\nbeta test. Override the digital divide with additional clickthroughs from \\nDevOps. Nanotechnology immersion along the information highway will \\nclose the loop on focusing solely on the bottom line. Podcasting operational change management inside of workßows to \\nestablish a framework. Taking seamless key performance indicators ofßine \\nto maximise the long tail. Keeping your eye on the ball while performing a \\ndeep dive on the start-up mentality to derive convergence on cross-\\nplatform integration. \\nCollaboratively administrate empowered markets via plug-and-play \\nnetworks. Dynamically procrastinate B2C users after installed base \\nbeneÞts. Dramatically visualize customer directed convergence without \\nrevolutionary ROI. \\nEfÞciently unleash cross-media information without cross-media value. \\nQuickly maximize timely deliverables for real-time schemas. Dramatically \\nmaintain clicks-and-mortar solutions without functional solutions. \\nBUSINESS PROPOSAL\\n!1'" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "page_one_text" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding to PDFs\n", "\n", "We can not write to PDFs using Python because of the differences between the single string type of Python, and the variety of fonts, placements, and other parameters that a PDF could have.\n", "\n", "What we can do is copy pages and append pages to the end." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "f = open('Working_Business_Proposal.pdf','rb')\n", "pdf_reader = PyPDF2.PdfFileReader(f)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "first_page = pdf_reader.getPage(0)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "pdf_writer = PyPDF2.PdfFileWriter()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "pdf_writer.addPage(first_page)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "pdf_output = open(\"Some_New_Doc.pdf\",\"wb\")" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "pdf_writer.write(pdf_output)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have copied a page and added it to another new document!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simple Example\n", "\n", "Let's try to grab all the text from this PDF file:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "f = open('Working_Business_Proposal.pdf','rb')\n", "\n", "# List of every page's text.\n", "# The index will correspond to the page number.\n", "pdf_text = []\n", "\n", "pdf_reader = PyPDF2.PdfFileReader(f)\n", "\n", "for p in range(pdf_reader.numPages):\n", " \n", " page = pdf_reader.getPage(p)\n", " \n", " pdf_text.append(page.extractText())\n", " " ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Business Proposal\\n The Revolution is Coming\\n Leverage agile frameworks to provide a robust synopsis for high level \\noverviews. Iterative approaches to corporate strategy foster collaborative \\nthinking to further the overall value proposition. Organically grow the \\nholistic world view of disruptive innovation via workplace diversity and \\nempowerment. \\nBring to the table win-win survival strategies to ensure proactive \\ndomination. At the end of the day, going forward, a new normal that has \\nevolved from generation X is on the runway heading towards a streamlined \\ncloud solution. User generated content in real-time will have multiple \\ntouchpoints for offshoring. \\nCapitalize on low hanging fruit to identify a ballpark value added activity to \\nbeta test. Override the digital divide with additional clickthroughs from \\nDevOps. Nanotechnology immersion along the information highway will \\nclose the loop on focusing solely on the bottom line. Podcasting operational change management inside of workßows to \\nestablish a framework. Taking seamless key performance indicators ofßine \\nto maximise the long tail. Keeping your eye on the ball while performing a \\ndeep dive on the start-up mentality to derive convergence on cross-\\nplatform integration. \\nCollaboratively administrate empowered markets via plug-and-play \\nnetworks. Dynamically procrastinate B2C users after installed base \\nbeneÞts. Dramatically visualize customer directed convergence without \\nrevolutionary ROI. \\nEfÞciently unleash cross-media information without cross-media value. \\nQuickly maximize timely deliverables for real-time schemas. Dramatically \\nmaintain clicks-and-mortar solutions without functional solutions. \\nBUSINESS PROPOSAL\\n!1',\n", " 'Completely synergize resource taxing relationships via premier niche \\nmarkets. Professionally cultivate one-to-one customer service with robust \\nideas. Dynamically innovate resource-leveling customer service for state of \\nthe art customer service. \\nObjectively innovate empowered manufactured products whereas parallel \\nplatforms. Holisticly predominate extensible testing procedures for reliable \\nsupply chains. Dramatically engage top-line web services vis-a-vis \\ncutting-edge deliverables. Proactively envisioned multimedia based expertise and cross-media \\ngrowth strategies. Seamlessly visualize quality intellectual capital without \\nsuperior collaboration and idea-sharing. Holistically pontiÞcate installed \\nbase portals after maintainable products. \\nPhosßuorescently engage worldwide methodologies with web-enabled \\ntechnology. Interactively coordinate proactive e-commerce via process-\\ncentric \"outside the box\" thinking. Completely pursue scalable customer \\nservice through sustainable potentialities. \\nCollaboratively administrate turnkey channels whereas virtual e-tailers. \\nObjectively seize scalable metrics whereas proactive e-services. \\nSeamlessly empower fully researched growth strategies and interoperable \\ninternal or \"organic\" sources. \\nCredibly innovate granular internal or \"organic\" sources whereas high \\nstandards in web-readiness. Energistically scale future-proof core \\ncompetencies vis-a-vis impactful experiences. Dramatically synthesize \\nintegrated schemas with optimal networks. Interactively procrastinate high-payoff content without backward-\\ncompatible data. Quickly cultivate optimal processes and tactical \\narchitectures. Completely iterate covalent strategic theme areas via \\naccurate e-markets. Globally incubate standards compliant channels before scalable beneÞts. \\nQuickly disseminate superior deliverables whereas web-enabled \\nBUSINESS PROPOSAL\\n!2',\n", " 'applications. Quickly drive clicks-and-mortar catalysts for change before \\nvertical architectures. \\nCredibly reintermediate backend ideas for cross-platform models. \\nContinually reintermediate integrated processes through technically sound \\nintellectual capital. Holistically foster superior methodologies without \\nmarket-driven best practices. Distinctively exploit optimal alignments for intuitive bandwidth. Quickly \\ncoordinate e-business applications through revolutionary catalysts for \\nchange. Seamlessly underwhelm optimal testing procedures whereas \\nbricks-and-clicks processes. \\nSynergistically evolve 2.0 technologies rather than just in time initiatives. \\nQuickly deploy strategic networks with compelling e-business. Credibly \\npontiÞcate highly efÞcient manufactured products and enabled data. \\nDynamically target high-payoff intellectual capital for customized \\ntechnologies. Objectively integrate emerging core competencies before \\nprocess-centric communities. Dramatically evisculate holistic innovation \\nrather than client-centric data. Progressively maintain extensive infomediaries via extensible niches. \\nDramatically disseminate standardized metrics after resource-leveling \\nprocesses. Objectively pursue diverse catalysts for change for \\ninteroperable meta-services. \\nProactively fabricate one-to-one materials via effective e-business. \\nCompletely synergize scalable e-commerce rather than high standards in \\ne-services. Assertively iterate resource maximizing products after leading-\\nedge intellectual capital. Distinctively re-engineer revolutionary meta-services and premium \\narchitectures. Intrinsically incubate intuitive opportunities and real-time \\npotentialities. Appropriately communicate one-to-one technology after \\nplug-and-play networks. Quickly aggregate B2B users and worldwide potentialities. Progressively \\nplagiarize resource-leveling e-commerce through resource-leveling core \\nBUSINESS PROPOSAL\\n!3',\n", " 'competencies. Dramatically mesh low-risk high-yield alignments before \\ntransparent e-tailers. \\nAppropriately empower dynamic leadership skills after business portals. \\nGlobally myocardinate interactive supply chains with distinctive quality \\nvectors. Globally revolutionize global sources through interoperable \\nservices. Enthusiastically mesh long-term high-impact infrastructures vis-a-vis \\nefÞcient customer service. Professionally fashion wireless leadership rather \\nthan prospective experiences. Energistically myocardinate clicks-and-\\nmortar testing procedures whereas next-generation manufactured \\nproducts. \\nDynamically reinvent market-driven opportunities and ubiquitous \\ninterfaces. Energistically fabricate an expanded array of niche markets \\nthrough robust products. Appropriately implement visionary e-services vis-\\na-vis strategic web-readiness. \\nCompellingly embrace empowered e-business after user friendly \\nintellectual capital. Interactively actualize front-end processes with \\neffective convergence. Synergistically deliver performance based \\nmethods of empowerment whereas distributed expertise. \\nEfÞciently enable enabled sources and cost effective products. \\nCompletely synthesize principle-centered information after ethical \\ncommunities. EfÞciently innovate open-source infrastructures via \\ninexpensive materials. Objectively integrate enterprise-wide strategic theme areas with \\nfunctionalized infrastructures. Interactively productize premium \\ntechnologies whereas interdependent quality vectors. Rapaciously utilize \\nenterprise experiences via 24/7 markets. Uniquely matrix economically sound value through cooperative \\ntechnology. Competently parallel task fully researched data and enterprise \\nprocess improvements. Collaboratively expedite quality manufactured \\nproducts via client-focused results. \\nBUSINESS PROPOSAL\\n!4',\n", " 'Quickly communicate enabled technology and turnkey leadership skills. \\nUniquely enable accurate supply chains rather than frictionless \\ntechnology. Globally network focused materials vis-a-vis cost effective \\nmanufactured products. \\nBUSINESS PROPOSAL\\n!5']" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pdf_text" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "competencies. Dramatically mesh low-risk high-yield alignments before \n", "transparent e-tailers. \n", "Appropriately empower dynamic leadership skills after business portals. \n", "Globally myocardinate interactive supply chains with distinctive quality \n", "vectors. Globally revolutionize global sources through interoperable \n", "services. Enthusiastically mesh long-term high-impact infrastructures vis-a-vis \n", "efÞcient customer service. Professionally fashion wireless leadership rather \n", "than prospective experiences. Energistically myocardinate clicks-and-\n", "mortar testing procedures whereas next-generation manufactured \n", "products. \n", "Dynamically reinvent market-driven opportunities and ubiquitous \n", "interfaces. Energistically fabricate an expanded array of niche markets \n", "through robust products. Appropriately implement visionary e-services vis-\n", "a-vis strategic web-readiness. \n", "Compellingly embrace empowered e-business after user friendly \n", "intellectual capital. Interactively actualize front-end processes with \n", "effective convergence. Synergistically deliver performance based \n", "methods of empowerment whereas distributed expertise. \n", "EfÞciently enable enabled sources and cost effective products. \n", "Completely synthesize principle-centered information after ethical \n", "communities. EfÞciently innovate open-source infrastructures via \n", "inexpensive materials. Objectively integrate enterprise-wide strategic theme areas with \n", "functionalized infrastructures. Interactively productize premium \n", "technologies whereas interdependent quality vectors. Rapaciously utilize \n", "enterprise experiences via 24/7 markets. Uniquely matrix economically sound value through cooperative \n", "technology. Competently parallel task fully researched data and enterprise \n", "process improvements. Collaboratively expedite quality manufactured \n", "products via client-focused results. \n", "BUSINESS PROPOSAL\n", "!4\n" ] } ], "source": [ "print(pdf_text[3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Excellent work! That is all for PyPDF2 for now, remember that this won't work with every PDF file and is limited in its scope to only text of PDFs." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }