{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "kl_py_scraping01.ipynb", "provenance": [], "collapsed_sections": [], "include_colab_link": true }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "cells": [
{ "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "Open In Colab" ] },
{ "cell_type": "markdown", "metadata": { "id": "bsp3x_zlo74F", "colab_type": "text" }, "source": [ "# Python basics: web mining\n", "\n", "---" ] },
{ "cell_type": "markdown", "metadata": { "colab_type": "text", "collapsed": true, "id": "yXxqgKt6SYTt" }, "source": [ "# Web mining / scraping\n", "\n", "## Collecting links from a web page\n", "\n", "Starting from the given URL, the page is parsed and its links (external and internal alike) are collected and printed to the screen." ] },
{ "cell_type": "code", "metadata": { "colab_type": "code", "id": "WUb_ZSiVSYTw", "outputId": "ea3cee42-3445-4c13-e193-dfb042701121", "colab": { "base_uri": "https://localhost:8080/", "height": 329 } }, "source": [ "from urllib.request import urlopen\n", "from bs4 import BeautifulSoup\n", "\n", "szamlal = 1\n", "html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')\n", "bs = BeautifulSoup(html, 'html.parser')\n", "for link in bs.find_all('a'):\n", "    szamlal += 1\n", "    if 'href' in link.attrs:\n", "        print(link.attrs['href'])\n", "    if szamlal >= 20:  ## list only the first 20\n", "        break" ], "execution_count": 0, "outputs": [ { "output_type": "stream", "text": [ "/wiki/Wikipedia:Protection_policy#semi\n", "#mw-head\n", "#p-search\n", "/wiki/Kevin_Bacon_(disambiguation)\n", "/wiki/File:Kevin_Bacon_SDCC_2014.jpg\n", "/wiki/Philadelphia\n", "/wiki/Pennsylvania\n", "/wiki/Kyra_Sedgwick\n", "/wiki/Sosie_Bacon\n", "#cite_note-1\n", "/wiki/Edmund_Bacon_(architect)\n", "/wiki/Michael_Bacon_(musician)\n", "/wiki/Holly_Near\n", "http://baconbros.com/\n", "#cite_note-2\n", "#cite_note-actor-3\n", "/wiki/Footloose_(1984_film)\n", "/wiki/JFK_(film)\n" ], "name": "stdout" } ] },
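{ "cell_type": "markdown", "metadata": {}, "source": [ "The links above are printed exactly as they appear in the href attributes, so many of them are relative paths. A minimal added sketch (not part of the original run) of resolving them to absolute URLs with urllib.parse.urljoin:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "from urllib.request import urlopen\n", "from urllib.parse import urljoin\n", "from bs4 import BeautifulSoup\n", "\n", "base = 'http://en.wikipedia.org/wiki/Kevin_Bacon'  # same start page as above\n", "bs = BeautifulSoup(urlopen(base), 'html.parser')\n", "for link in bs.find_all('a', href=True)[:10]:  # first 10 links only\n", "    # urljoin leaves absolute URLs untouched and resolves relative ones\n", "    print(urljoin(base, link['href']))" ], "execution_count": null, "outputs": [] },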
{ "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "8PluJRB6SYT0" }, "source": [ "## Collecting only the internal article links\n", "\n", "The links are filtered by local domain, keeping only those that point inside the Wiki itself." ] },
{ "cell_type": "code", "metadata": { "colab_type": "code", "id": "x_-FKIqiSYT1", "outputId": "7456e7c0-6b97-464d-dc4b-9b5310c0c94a", "colab": { "base_uri": "https://localhost:8080/", "height": 347 } }, "source": [ "from urllib.request import urlopen\n", "from bs4 import BeautifulSoup\n", "import re\n", "\n", "szamlal = 1\n", "html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')\n", "bs = BeautifulSoup(html, 'html.parser')\n", "for link in bs.find('div', {'id':'bodyContent'}).find_all(\n", "        'a', href=re.compile('^(/wiki/)((?!:).)*$')):\n", "    szamlal += 1\n", "    if 'href' in link.attrs:\n", "        print(link.attrs['href'])\n", "    if szamlal >= 20:  ## list only the first 20\n", "        break" ], "execution_count": 0, "outputs": [ { "output_type": "stream", "text": [ "/wiki/Kevin_Bacon_(disambiguation)\n", "/wiki/Philadelphia\n", "/wiki/Pennsylvania\n", "/wiki/Kyra_Sedgwick\n", "/wiki/Sosie_Bacon\n", "/wiki/Edmund_Bacon_(architect)\n", "/wiki/Michael_Bacon_(musician)\n", "/wiki/Holly_Near\n", "/wiki/Footloose_(1984_film)\n", "/wiki/JFK_(film)\n", "/wiki/A_Few_Good_Men\n", "/wiki/Apollo_13_(film)\n", "/wiki/Mystic_River_(film)\n", "/wiki/Sleepers\n", "/wiki/The_Woodsman_(2004_film)\n", "/wiki/He_Said,_She_Said_(film)\n", "/wiki/Fox_Broadcasting_Company\n", "/wiki/The_Following\n", "/wiki/HBO\n" ], "name": "stdout" } ] },
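{ "cell_type": "markdown", "metadata": {}, "source": [ "The pattern ^(/wiki/)((?!:).)*$ accepts hrefs that start with /wiki/ and contain no colon, which screens out File:, Special:, Talk: and other non-article namespaces. A small added check (illustration only, not from the original notebook) of what the pattern matches:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "import re\n", "\n", "pattern = re.compile('^(/wiki/)((?!:).)*$')\n", "samples = ['/wiki/Kevin_Bacon',           # article page -> matches\n", "           '/wiki/File:Kevin_Bacon.jpg',  # File: namespace -> rejected\n", "           '/wiki/Special:Random',        # Special: namespace -> rejected\n", "           '#cite_note-1']                # fragment link -> rejected\n", "for s in samples:\n", "    print(s, '->', bool(pattern.match(s)))" ], "execution_count": null, "outputs": [] },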
{ "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "dfdD7c1HSYT4" }, "source": [ "## Random walk\n", "\n", "Traversing the internal article links in random order and printing each step." ] },
{ "cell_type": "code", "metadata": { "colab_type": "code", "id": "D54GA-yiSYT4", "outputId": "c71e5cff-0026-406c-b9a3-8f3262ffe607", "colab": { "base_uri": "https://localhost:8080/", "height": 347 } }, "source": [ "from urllib.request import urlopen\n", "from bs4 import BeautifulSoup\n", "import datetime\n", "import random\n", "import re\n", "\n", "szamlal = 1\n", "random.seed(datetime.datetime.now())\n", "def getLinks(articleUrl):\n", "    html = urlopen('http://en.wikipedia.org{}'.format(articleUrl))\n", "    bs = BeautifulSoup(html, 'html.parser')\n", "    return bs.find('div', {'id':'bodyContent'}).find_all('a', href=re.compile('^(/wiki/)((?!:).)*$'))\n", "\n", "links = getLinks('/wiki/Kevin_Bacon')\n", "while len(links) > 0:\n", "    newArticle = links[random.randint(0, len(links)-1)].attrs['href']\n", "    print(newArticle)\n", "    szamlal += 1\n", "    if szamlal >= 20:  ## list only the first 20\n", "        break\n", "    links = getLinks(newArticle)" ], "execution_count": 0, "outputs": [ { "output_type": "stream", "text": [ "/wiki/Golden_Globe_Award\n", "/wiki/To_Kill_a_Mockingbird_(film)\n", "/wiki/La_Jolla,_California\n", "/wiki/Klamath_Basin\n", "/wiki/Crater_Lake\n", "/wiki/Rhyodacite\n", "/wiki/Latite\n", "/wiki/Petrology\n", "/wiki/Seismology\n", "/wiki/Volcanology\n", "/wiki/INGV\n", "/wiki/CNN\n", "/wiki/Warner_Bros._International_Television_Production\n", "/wiki/DirecTV-5\n", "/wiki/AT%26T_satellite_fleet\n", "/wiki/Red_by_HBO\n", "/wiki/Sky_News_International\n", "/wiki/Digital_television_in_the_United_Kingdom\n", "/wiki/Royal_Television_Society\n" ], "name": "stdout" } ] },
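{ "cell_type": "markdown", "metadata": {}, "source": [ "Every step of the walk issues a live request to Wikipedia, so a crawl like this should be throttled. A minimal added sketch of a rate-limited wrapper (getLinksPolitely is a hypothetical name) around the getLinks function defined above:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "import time\n", "\n", "def getLinksPolitely(articleUrl, delay=1.0):\n", "    # pause before each request so the crawl does not hammer the server\n", "    time.sleep(delay)\n", "    return getLinks(articleUrl)  # reuses getLinks() from the cell above\n", "\n", "# one throttled step of the random walk\n", "links = getLinksPolitely('/wiki/Kevin_Bacon')\n", "print(len(links), 'links found')" ], "execution_count": null, "outputs": [] },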
{ "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "dF2p1NxASYT-" }, "source": [ "## Recursively crawling the whole site\n", "\n", "A self-calling function walks the entire site and stores each newly found page in the pages variable." ] },
{ "cell_type": "code", "metadata": { "colab_type": "code", "id": "RmEeMf8SSYT_", "outputId": "0f0f0f7a-69f8-400b-c3fd-ab97828e7891", "colab": { "base_uri": "https://localhost:8080/", "height": 329 } }, "source": [ "from urllib.request import urlopen\n", "from bs4 import BeautifulSoup\n", "import re\n", "\n", "szamlal = 1\n", "pages = set()\n", "def getLinks(pageUrl):\n", "    global pages, szamlal\n", "    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))\n", "    bs = BeautifulSoup(html, 'html.parser')\n", "    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):\n", "        if 'href' in link.attrs:\n", "            if link.attrs['href'] not in pages:\n", "                szamlal += 1\n", "                if szamlal >= 20:  ## list only the first 20\n", "                    break\n", "                # We have encountered a new page\n", "                newPage = link.attrs['href']\n", "                print(newPage)\n", "                pages.add(newPage)\n", "                getLinks(newPage)\n", "getLinks('')" ], "execution_count": 0, "outputs": [ { "output_type": "stream", "text": [ "/wiki/Wikipedia\n", "/wiki/Wikipedia:Protection_policy#semi\n", "/wiki/Wikipedia:Requests_for_page_protection\n", "/wiki/Wikipedia:Protection_policy#move\n", "/wiki/Wikipedia:Lists_of_protected_pages\n", "/wiki/Wikipedia:Protection_policy\n", "/wiki/Wikipedia:Perennial_proposals\n", "/wiki/Wikipedia:Reliable_sources/Perennial_sources\n", "/wiki/Wikipedia:Reliable_sources\n", "/wiki/Wikipedia:WikiProject_Reliability\n", "/wiki/Wikipedia:WRE\n", "/wiki/File:People_icon.svg\n", "/wiki/Special:WhatLinksHere/File:People_icon.svg\n", "/wiki/Help:What_links_here\n", "/wiki/Wikipedia:Project_namespace#How-to_and_information_pages\n", "/wiki/Wikipedia:Policies_and_guidelines\n", "/wiki/Wikipedia:WikiProject_Politics\n", "/wiki/File:A_coloured_voting_box.svg\n" ], "name": "stdout" } ] },
{ "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "bXWeecq1SYUB" }, "source": [ "## Collecting data across the whole site\n", "\n", "We crawl the web pages with error handling and gather the content of each newly found page, printing its title, first paragraph, and edit link." ] },
{ "cell_type": "code", "metadata": { "colab_type": "code", "id": "6dJLczHlSYUC", "outputId": "9114b6f4-90db-4e52-981c-c9fec08c4e79", "colab": { "base_uri": "https://localhost:8080/", "height": 419 } }, "source": [ "from urllib.request import urlopen\n", "from bs4 import BeautifulSoup\n", "import re\n", "\n", "szamlal = 1\n", "pages = set()\n", "def getLinks(pageUrl):\n", "    global pages, szamlal\n", "    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))\n", "    bs = BeautifulSoup(html, 'html.parser')\n", "    try:\n", "        print(bs.h1.get_text())\n", "        print(bs.find(id='mw-content-text').find_all('p')[0])\n", "        print(bs.find(id='ca-edit').find('span').find('a').attrs['href'])\n", "    except AttributeError:\n", "        print('This page is missing something! Continuing.')\n", "\n", "    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):\n", "        if 'href' in link.attrs:\n", "            if link.attrs['href'] not in pages:\n", "                szamlal += 1\n", "                if szamlal >= 5:  ## list only the first 5\n", "                    break\n", "                # We have encountered a new page\n", "                newPage = link.attrs['href']\n", "                print('-'*20)\n", "                print(newPage)\n", "                pages.add(newPage)\n", "                getLinks(newPage)\n", "getLinks('')" ], "execution_count": 0, "outputs": [ { "output_type": "stream", "text": [ "Main Page\n", "Jean-François-Marie de Surville (1717–1770) was a merchant captain with the French East India Company who commanded a voyage of exploration to the Pacific in 1769 and 1770. Born in Brittany, France, Surville joined the company when he was 10 years old. For the next several years, he sailed on voyages in Indian and Chinese waters. In 1740, he joined the French Navy. He fought in the War of the Austrian Succession and the Seven Years' War, twice becoming a prisoner of war. In 1769, in command of Saint Jean-Baptiste, he sailed from India on an expedition to the Pacific looking for trading opportunities. He explored the seas around the Solomon Islands and anchored in December at Doubtless Bay, New Zealand (commemorative plaque pictured). Part of his route around New Zealand overlapped that of James Cook in Endeavour, who had preceded him by only a few days. Three months later, Surville drowned off the coast of Peru while seeking help for his scurvy-afflicted crew. (Full article...)\n", "This page is missing something! Continuing.\n", "--------------------\n", "/wiki/Wikipedia\n", "Wikipedia\n", "This page is missing something! Continuing.\n", "--------------------\n", "/wiki/Wikipedia:Protection_policy#semi\n", "Wikipedia:Protection policy\n", "This page is missing something! Continuing.\n", "--------------------\n", "/wiki/Wikipedia:Requests_for_page_protection\n", "Wikipedia:Requests for page protection\n", "This page is for requesting that a page, file or template be fully protected, create protected (salted), extended confirmed protected, semi-protected, added to pending changes, move-protected, template protected, upload protected (file-specific), or unprotected. Please read up on the protection policy. Full protection is used to stop edit warring between multiple users or to prevent vandalism to high-risk templates; semi-protection and pending changes are usually used only to prevent IP and new user vandalism (see the rough guide to semi-protection); and move protection is used to stop page-move wars. Extended confirmed protection is used where semi-protection has proved insufficient (see the rough guide to extended confirmed protection)\n", "This page is missing something! Continuing.\n" ], "name": "stdout" } ] },
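{ "cell_type": "markdown", "metadata": {}, "source": [ "urlopen raises HTTPError for pages that return an error status and URLError when the server cannot be reached; the try/except above only guards the parsing step. An added sketch (safeGetSoup is a hypothetical helper) of guarding the fetch itself:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "from urllib.request import urlopen\n", "from urllib.error import HTTPError, URLError\n", "from bs4 import BeautifulSoup\n", "\n", "def safeGetSoup(url):\n", "    # return a BeautifulSoup object, or None if the page cannot be fetched\n", "    try:\n", "        html = urlopen(url)\n", "    except HTTPError as e:   # the server returned an error status, e.g. 404\n", "        print('HTTP error:', e.code, url)\n", "        return None\n", "    except URLError as e:    # the server could not be reached at all\n", "        print('URL error:', e.reason, url)\n", "        return None\n", "    return BeautifulSoup(html, 'html.parser')\n", "\n", "print(safeGetSoup('http://en.wikipedia.org/wiki/Kevin_Bacon') is not None)" ], "execution_count": null, "outputs": [] },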
{ "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "m0H-KSw9SYUE" }, "source": [ "## Crawling across the internet" ] },
{ "cell_type": "code", "metadata": { "colab_type": "code", "id": "HKSWbuhLSYUF", "outputId": "b4aa8909-829b-451c-89dd-7dc9178ceaed", "colab": { "base_uri": "https://localhost:8080/", "height": 52 } }, "source": [ "from urllib.request import urlopen\n", "from urllib.parse import urlparse\n", "from bs4 import BeautifulSoup\n", "import re\n", "import datetime\n", "import random\n", "\n", "pages = set()\n", "random.seed(datetime.datetime.now())\n", "szamlal = 1\n", "\n", "# Retrieves a list of all internal links found on a page\n", "def getInternalLinks(bs, includeUrl):\n", "    includeUrl = '{}://{}'.format(urlparse(includeUrl).scheme, urlparse(includeUrl).netloc)\n", "    internalLinks = []\n", "    # Finds all links that begin with a '/'\n", "    for link in bs.find_all('a', href=re.compile('^(/|.*'+includeUrl+')')):\n", "        if link.attrs['href'] is not None:\n", "            if link.attrs['href'] not in internalLinks:\n", "                if link.attrs['href'].startswith('/'):\n", "                    internalLinks.append(includeUrl+link.attrs['href'])\n", "                else:\n", "                    internalLinks.append(link.attrs['href'])\n", "    return internalLinks\n", "\n", "# Retrieves a list of all external links found on a page\n", "def getExternalLinks(bs, excludeUrl):\n", "    global szamlal\n", "    externalLinks = []\n", "    # Finds all links that start with 'http' and do not contain the current URL\n", "    for link in bs.find_all('a', href=re.compile('^(http|www)((?!'+excludeUrl+').)*$')):\n", "        szamlal += 1\n", "        if szamlal >= 5:  ## list only the first 5\n", "            break\n", "        if link.attrs['href'] is not None:\n", "            if link.attrs['href'] not in externalLinks:\n", "                externalLinks.append(link.attrs['href'])\n", "    return externalLinks\n", "\n", "def getRandomExternalLink(startingPage):\n", "    try:\n", "        html = urlopen(startingPage)\n", "        bs = BeautifulSoup(html, 'html.parser')\n", "        externalLinks = getExternalLinks(bs, urlparse(startingPage).netloc)\n", "        if len(externalLinks) == 0:\n", "            print('No external links, looking around the site for one')\n", "            domain = '{}://{}'.format(urlparse(startingPage).scheme, urlparse(startingPage).netloc)\n", "            internalLinks = getInternalLinks(bs, domain)\n", "            return getRandomExternalLink(internalLinks[random.randint(0, len(internalLinks)-1)])\n", "        else:\n", "            return externalLinks[random.randint(0, len(externalLinks)-1)]\n", "    except Exception:\n", "        return None\n", "\n", "def followExternalOnly(startingSite):\n", "    externalLink = getRandomExternalLink(startingSite)\n", "    if externalLink is None:\n", "        return  # no link could be retrieved, stop the crawl\n", "    else:\n", "        print('Random external link is: {}'.format(externalLink))\n", "        followExternalOnly(externalLink)\n", "\n", "followExternalOnly('https://klajosw.blogspot.com/p/kezdolap.html')" ], "execution_count": 0, "outputs": [ { "output_type": "stream", "text": [ "Random external link is: https://klajosw.blogspot.hu/\n", "No external links, looking around the site for one\n" ], "name": "stdout" } ] },
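{ "cell_type": "markdown", "metadata": {}, "source": [ "Before following links onto arbitrary sites, it is good practice to honour each site's robots.txt. An added sketch using the standard-library urllib.robotparser (allowedByRobots is a hypothetical helper):" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "from urllib.robotparser import RobotFileParser\n", "from urllib.parse import urlparse\n", "\n", "def allowedByRobots(url, userAgent='*'):\n", "    # fetch the site's robots.txt and ask whether this URL may be crawled\n", "    parts = urlparse(url)\n", "    rp = RobotFileParser('{}://{}/robots.txt'.format(parts.scheme, parts.netloc))\n", "    rp.read()\n", "    return rp.can_fetch(userAgent, url)\n", "\n", "print(allowedByRobots('https://en.wikipedia.org/wiki/Kevin_Bacon'))" ], "execution_count": null, "outputs": [] },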
{ "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "IHYJZZMCSYUH" }, "source": [ "## Collect all External Links from a Site" ] },
{ "cell_type": "code", "metadata": { "colab_type": "code", "id": "iiSzM6C8SYUI", "outputId": "2e0d9e10-ac27-47fa-d8e6-faf719a249f2", "colab": { "base_uri": "https://localhost:8080/", "height": 87 } }, "source": [ "from urllib.request import urlopen\n", "from urllib.parse import urlparse\n", "from bs4 import BeautifulSoup\n", "\n", "allExtLinks = set()\n", "allIntLinks = set()\n", "szamlal = 1\n", "\n", "def getAllExternalLinks(siteUrl):\n", "    global szamlal\n", "    html = urlopen(siteUrl)\n", "    domain = '{}://{}'.format(urlparse(siteUrl).scheme, urlparse(siteUrl).netloc)\n", "    bs = BeautifulSoup(html, 'html.parser')\n", "    # getInternalLinks() and getExternalLinks() are defined in the cell above\n", "    internalLinks = getInternalLinks(bs, domain)\n", "    externalLinks = getExternalLinks(bs, domain)\n", "\n", "    for link in externalLinks:\n", "        szamlal += 1\n", "        if szamlal >= 10:  ## list only the first 10\n", "            break\n", "        if link not in allExtLinks:\n", "            allExtLinks.add(link)\n", "            print('Új külsö link : ', link)\n", "    for link in internalLinks:\n", "        szamlal += 1\n", "        if szamlal >= 10:  ## list only the first 10\n", "            break\n", "        if link not in allIntLinks:\n", "            print('Új belső link : ', link)\n", "            allIntLinks.add(link)\n", "            getAllExternalLinks(link)\n", "\n", "allIntLinks.add('https://klajosw.blogspot.com')\n", "getAllExternalLinks('https://klajosw.blogspot.com/p/kezdolap.html')" ], "execution_count": 0, "outputs": [ { "output_type": "stream", "text": [ "Új külsö link : https://klajosw.blogspot.com/\n", "Új külsö link : https://mierdekel.hu/\n", "Új külsö link : https://klajosw.blogspot.hu/\n", "Új belső link : https://klajosw.blogspot.com/\n" ], "name": "stdout" } ] } ] }