{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python Web Scraping [27 exercises with solutions]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Web Scraping:\n", "\n", "- Web scraping, or web data extraction, is data scraping used for extracting data from websites. \n", "- Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. \n", "- While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. \n", "- It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.\n", "\n", "### Python Requests module:\n", "\n", "- Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor. \n", "- There’s no need to manually add query strings to your URLs, or to form-encode your POST data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Write a Python program to test whether a given page is found on the server." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "URL error! Server not found.\n", "\n", "\n", "b'\\n\\n\\n Example Domain\\n\\n \\n \\n \\n \\n\\n\\n\\n
\\n

Example Domain

\\n

This domain is for use in illustrative examples in documents. You may use this\\n domain in literature without prior coordination or asking for permission.

\\n

More information...

\\n
\\n\\n\\n'\n" ] } ], "source": [ "from urllib.request import urlopen\n", "from urllib.error import HTTPError\n", "from urllib.error import URLError\n", "\n", "try:\n", "    html = urlopen('https://abcxyz.com')\n", "except HTTPError as e:\n", "    print('HTTP error!')\n", "except URLError as e:\n", "    print('URL error! Server not found.')\n", "else:\n", "    print(html.read())\n", "    \n", "print('\\n')\n", "\n", "try:\n", "    html = urlopen('http://www.example.com/')\n", "    print(html.read())\n", "except HTTPError as e:\n", "    print('HTTP error!')\n", "except URLError as e:\n", "    print('URL error! Server not found.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Write a Python program to download and display the content of robots.txt for en.wikipedia.org." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# robots.txt for http://www.wikipedia.org/ and friends\n", "#\n", "# Please note: There are a lot of pages on this site, and there are\n", "# some misbehaved spiders out there that go _way_ too fast. If you're\n", "# irresponsible, your access to the site may be blocked.\n", "#\n", "\n", "# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN\n", "# and ignoring 429 ratelimit responses, claims to respect robots:\n", "# http://mj12bot.com/\n", "User-agent: MJ12bot\n", "Disallow: /\n", "\n", "# advertising-related bots:\n", "User-agent: Mediapartners-Google*\n", "Disallow: /\n", "\n", "# Wikipedia work bots:\n", "User-agent: IsraBot\n", "Disallow:\n", "\n", "User-agent: Orthogaffe\n", "Disallow:\n", "\n", "# Crawlers that are kind enough to obey, but which we'd rather not have\n", "# unless they're feeding search engines.\n", "User-agent: UbiCrawler\n", "Disallow: /\n", "\n", "User-agent: DOC\n", "Disallow: /\n", "\n", "User-agent: Zao\n", "Disallow: /\n", "\n", "# Some bots are known to be trouble, particularly those designed to copy\n", "# entire sites. 
Please obey robots.txt.\n", "User-agent: sitecheck.internetseer.com\n", "Disallow: /\n", "\n", "User-agent: Zealbot\n", "Disallow: /\n", "\n", "User-agent: MSIECrawler\n", "Disallow: /\n", "\n", "User-agent: SiteSnagger\n", "Disallow: /\n", "\n", "User-agent: WebStripper\n", "Disallow: /\n", "\n", "User-agent: WebCopier\n", "Disallow: /\n", "\n", "User-agent: Fetch\n", "Disallow: /\n", "\n", "User-agent: Offline Explorer\n", "Disallow: /\n", "\n", "User-agent: Teleport\n", "Disallow: /\n", "\n", "User-agent: TeleportPro\n", "Disallow: /\n", "\n", "User-agent: WebZIP\n", "Disallow: /\n", "\n", "User-agent: linko\n", "Disallow: /\n", "\n", "User-agent: HTTrack\n", "Disallow: /\n", "\n", "User-agent: Microsoft.URL.Control\n", "Disallow: /\n", "\n", "User-agent: Xenu\n", "Disallow: /\n", "\n", "User-agent: larbin\n", "Disallow: /\n", "\n", "User-agent: libwww\n", "Disallow: /\n", "\n", "User-agent: ZyBORG\n", "Disallow: /\n", "\n", "User-agent: Download Ninja\n", "Disallow: /\n", "\n", "# Misbehaving: requests much too fast:\n", "User-agent: fast\n", "Disallow: /\n", "\n", "#\n", "# Sorry, wget in its recursive mode is a frequent problem.\n", "# Please read the man page and use it properly; there is a\n", "# --wait option you can use to set the delay between hits,\n", "# for instance.\n", "#\n", "User-agent: wget\n", "Disallow: /\n", "\n", "#\n", "# The 'grub' distributed client has been *very* poorly behaved.\n", "#\n", "User-agent: grub-client\n", "Disallow: /\n", "\n", "#\n", "# Doesn't follow robots.txt anyway, but...\n", "#\n", "User-agent: k2spider\n", "Disallow: /\n", "\n", "#\n", "# Hits many times per second, not acceptable\n", "# http://www.nameprotect.com/botinfo.html\n", "User-agent: NPBot\n", "Disallow: /\n", "\n", "# A capture bot, downloads gazillions of pages with no public benefit\n", "# http://www.webreaper.net/\n", "User-agent: WebReaper\n", "Disallow: /\n", "\n", "\n", "#\n", "# Friendly, low-speed bots are welcome viewing article pages, but not\n", 
"# dynamically-generated pages please.\n", "#\n", "# Inktomi's \"Slurp\" can read a minimum delay between hits; if your\n", "# bot supports such a thing using the 'Crawl-delay' or another\n", "# instruction, please let us know.\n", "#\n", "# There is a special exception for API mobileview to allow dynamic\n", "# mobile web & app views to load section content.\n", "# These views aren't HTTP-cached but use parser cache aggressively\n", "# and don't expose special: pages etc.\n", "#\n", "# Another exception is for REST API documentation, located at\n", "# /api/rest_v1/?doc.\n", "#\n", "User-agent: *\n", "Allow: /w/api.php?action=mobileview&\n", "Allow: /w/load.php?\n", "Allow: /api/rest_v1/?doc\n", "Disallow: /w/\n", "Disallow: /api/\n", "Disallow: /trap/\n", "Disallow: /wiki/Special:\n", "Disallow: /wiki/Spezial:\n", "Disallow: /wiki/Spesial:\n", "Disallow: /wiki/Special%3A\n", "Disallow: /wiki/Spezial%3A\n", "Disallow: /wiki/Spesial%3A\n", "\n", "#\n", "# ar:\n", "Disallow: /wiki/%D8%AE%D8%A7%D8%B5:Search\n", "Disallow: /wiki/%D8%AE%D8%A7%D8%B5%3ASearch\n", "#\n", "# dewiki:\n", "# T6937\n", "# sensible deletion and meta user discussion pages:\n", "Disallow: /wiki/Wikipedia:L%C3%B6schkandidaten/\n", "Disallow: /wiki/Wikipedia:Löschkandidaten/\n", "Disallow: /wiki/Wikipedia:Vandalensperrung/\n", "Disallow: /wiki/Wikipedia:Benutzersperrung/\n", "Disallow: /wiki/Wikipedia:Vermittlungsausschuss/\n", "Disallow: /wiki/Wikipedia:Administratoren/Probleme/\n", "Disallow: /wiki/Wikipedia:Adminkandidaturen/\n", "Disallow: /wiki/Wikipedia:Qualitätssicherung/\n", "Disallow: /wiki/Wikipedia:Qualit%C3%A4tssicherung/\n", "# 4937#5\n", "Disallow: /wiki/Wikipedia:Vandalismusmeldung/\n", "Disallow: /wiki/Wikipedia:Gesperrte_Lemmata/\n", "Disallow: /wiki/Wikipedia:Löschprüfung/\n", "Disallow: /wiki/Wikipedia:L%C3%B6schprüfung/\n", "Disallow: /wiki/Wikipedia:Administratoren/Notizen/\n", "Disallow: /wiki/Wikipedia:Schiedsgericht/Anfragen/\n", "Disallow: 
/wiki/Wikipedia:L%C3%B6schpr%C3%BCfung/\n", "# T14111\n", "Disallow: /wiki/Wikipedia:Checkuser/\n", "Disallow: /wiki/Wikipedia_Diskussion:Checkuser/\n", "Disallow: /wiki/Wikipedia_Diskussion:Adminkandidaturen/\n", "# T15961\n", "Disallow: /wiki/Wikipedia:Spam-Blacklist-Log\n", "Disallow: /wiki/Wikipedia%3ASpam-Blacklist-Log\n", "Disallow: /wiki/Wikipedia_Diskussion:Spam-Blacklist-Log\n", "Disallow: /wiki/Wikipedia_Diskussion%3ASpam-Blacklist-Log\n", "#\n", "# enwiki:\n", "# Folks get annoyed when VfD discussions end up the number 1 google hit for\n", "# their name. See T6776\n", "Disallow: /wiki/Wikipedia:Articles_for_deletion/\n", "Disallow: /wiki/Wikipedia%3AArticles_for_deletion/\n", "Disallow: /wiki/Wikipedia:Votes_for_deletion/\n", "Disallow: /wiki/Wikipedia%3AVotes_for_deletion/\n", "Disallow: /wiki/Wikipedia:Pages_for_deletion/\n", "Disallow: /wiki/Wikipedia%3APages_for_deletion/\n", "Disallow: /wiki/Wikipedia:Miscellany_for_deletion/\n", "Disallow: /wiki/Wikipedia%3AMiscellany_for_deletion/\n", "Disallow: /wiki/Wikipedia:Miscellaneous_deletion/\n", "Disallow: /wiki/Wikipedia%3AMiscellaneous_deletion/\n", "Disallow: /wiki/Wikipedia:Copyright_problems\n", "Disallow: /wiki/Wikipedia%3ACopyright_problems\n", "Disallow: /wiki/Wikipedia:Protected_titles/\n", "Disallow: /wiki/Wikipedia%3AProtected_titles/\n", "# T15398\n", "Disallow: /wiki/Wikipedia:WikiProject_Spam/\n", "Disallow: /wiki/Wikipedia%3AWikiProject_Spam/\n", "# T16075\n", "Disallow: /wiki/MediaWiki:Spam-blacklist\n", "Disallow: /wiki/MediaWiki%3ASpam-blacklist\n", "Disallow: /wiki/MediaWiki_talk:Spam-blacklist\n", "Disallow: /wiki/MediaWiki_talk%3ASpam-blacklist\n", "# T13261\n", "Disallow: /wiki/Wikipedia:Requests_for_arbitration/\n", "Disallow: /wiki/Wikipedia%3ARequests_for_arbitration/\n", "Disallow: /wiki/Wikipedia:Requests_for_comment/\n", "Disallow: /wiki/Wikipedia%3ARequests_for_comment/\n", "Disallow: /wiki/Wikipedia:Requests_for_adminship/\n", "Disallow: 
/wiki/Wikipedia%3ARequests_for_adminship/\n", "# T12288\n", "Disallow: /wiki/Wikipedia_talk:Articles_for_deletion/\n", "Disallow: /wiki/Wikipedia_talk%3AArticles_for_deletion/\n", "Disallow: /wiki/Wikipedia_talk:Votes_for_deletion/\n", "Disallow: /wiki/Wikipedia_talk%3AVotes_for_deletion/\n", "Disallow: /wiki/Wikipedia_talk:Pages_for_deletion/\n", "Disallow: /wiki/Wikipedia_talk%3APages_for_deletion/\n", "Disallow: /wiki/Wikipedia_talk:Miscellany_for_deletion/\n", "Disallow: /wiki/Wikipedia_talk%3AMiscellany_for_deletion/\n", "Disallow: /wiki/Wikipedia_talk:Miscellaneous_deletion/\n", "Disallow: /wiki/Wikipedia_talk%3AMiscellaneous_deletion/\n", "# T16793\n", "Disallow: /wiki/Wikipedia:Changing_username\n", "Disallow: /wiki/Wikipedia%3AChanging_username\n", "Disallow: /wiki/Wikipedia:Changing_username/\n", "Disallow: /wiki/Wikipedia%3AChanging_username/\n", "Disallow: /wiki/Wikipedia_talk:Changing_username\n", "Disallow: /wiki/Wikipedia_talk%3AChanging_username\n", "Disallow: /wiki/Wikipedia_talk:Changing_username/\n", "Disallow: /wiki/Wikipedia_talk%3AChanging_username/\n", "#\n", "# eswiki:\n", "# T8746\n", "Disallow: /wiki/Wikipedia:Consultas_de_borrado/\n", "Disallow: /wiki/Wikipedia%3AConsultas_de_borrado/\n", "#\n", "# fiwiki:\n", "# T10695\n", "Disallow: /wiki/Wikipedia:Poistettavat_sivut\n", "Disallow: /wiki/K%C3%A4ytt%C3%A4j%C3%A4:\n", "Disallow: /wiki/Käyttäjä:\n", "Disallow: /wiki/Keskustelu_k%C3%A4ytt%C3%A4j%C3%A4st%C3%A4:\n", "Disallow: /wiki/Keskustelu_käyttäjästä:\n", "Disallow: /wiki/Wikipedia:Yll%C3%A4pit%C3%A4j%C3%A4t/\n", "Disallow: /wiki/Wikipedia:Ylläpitäjät/\n", "#\n", "# hewiki:\n", "Disallow: /wiki/%D7%9E%D7%99%D7%95%D7%97%D7%93:Search\n", "Disallow: /wiki/%D7%9E%D7%99%D7%95%D7%97%D7%93%3ASearch\n", "#T11517\n", "Disallow: /wiki/ויקיפדיה:רשימת_מועמדים_למחיקה/\n", "Disallow: /wiki/ויקיפדיה%3Aרשימת_מועמדים_למחיקה/\n", "Disallow: 
/wiki/%D7%95%D7%99%D7%A7%D7%99%D7%A4%D7%93%D7%99%D7%94:%D7%A8%D7%A9%D7%99%D7%9E%D7%AA_%D7%9E%D7%95%D7%A2%D7%9E%D7%93%D7%99%D7%9D_%D7%9C%D7%9E%D7%97%D7%99%D7%A7%D7%94/\n", "Disallow: /wiki/%D7%95%D7%99%D7%A7%D7%99%D7%A4%D7%93%D7%99%D7%94%3A%D7%A8%D7%A9%D7%99%D7%9E%D7%AA_%D7%9E%D7%95%D7%A2%D7%9E%D7%93%D7%99%D7%9D_%D7%9C%D7%9E%D7%97%D7%99%D7%A7%D7%94/\n", "Disallow: /wiki/ויקיפדיה:ערכים_לא_קיימים_ומוגנים\n", "Disallow: /wiki/ויקיפדיה%3Aערכים_לא_קיימים_ומוגנים\n", "Disallow: /wiki/%D7%95%D7%99%D7%A7%D7%99%D7%A4%D7%93%D7%99%D7%94:%D7%A2%D7%A8%D7%9B%D7%99%D7%9D_%D7%9C%D7%90_%D7%A7%D7%99%D7%99%D7%9E%D7%99%D7%9D_%D7%95%D7%9E%D7%95%D7%92%D7%A0%D7%99%D7%9D\n", "Disallow: /wiki/%D7%95%D7%99%D7%A7%D7%99%D7%A4%D7%93%D7%99%D7%94%3A%D7%A2%D7%A8%D7%9B%D7%99%D7%9D_%D7%9C%D7%90_%D7%A7%D7%99%D7%99%D7%9E%D7%99%D7%9D_%D7%95%D7%9E%D7%95%D7%92%D7%A0%D7%99%D7%9D\n", "Disallow: /wiki/ויקיפדיה:דפים_לא_קיימים_ומוגנים\n", "Disallow: /wiki/ויקיפדיה%3Aדפים_לא_קיימים_ומוגנים\n", "Disallow: /wiki/%D7%95%D7%99%D7%A7%D7%99%D7%A4%D7%93%D7%99%D7%94:%D7%93%D7%A4%D7%99%D7%9D_%D7%9C%D7%90_%D7%A7%D7%99%D7%99%D7%9E%D7%99%D7%9D_%D7%95%D7%9E%D7%95%D7%92%D7%A0%D7%99%D7%9D\n", "Disallow: /wiki/%D7%95%D7%99%D7%A7%D7%99%D7%A4%D7%93%D7%99%D7%94%3A%D7%93%D7%A4%D7%99%D7%9D_%D7%9C%D7%90_%D7%A7%D7%99%D7%99%D7%9E%D7%99%D7%9D_%D7%95%D7%9E%D7%95%D7%92%D7%A0%D7%99%D7%9D\n", "#\n", "# huwiki:\n", "Disallow: /wiki/Speci%C3%A1lis:Search\n", "Disallow: /wiki/Speci%C3%A1lis%3ASearch\n", "#\n", "# itwiki:\n", "# T7545\n", "Disallow: /wiki/Wikipedia:Pagine_da_cancellare\n", "Disallow: /wiki/Wikipedia%3APagine_da_cancellare\n", "Disallow: /wiki/Wikipedia:Utenti_problematici\n", "Disallow: /wiki/Wikipedia%3AUtenti_problematici\n", "Disallow: /wiki/Wikipedia:Vandalismi_in_corso\n", "Disallow: /wiki/Wikipedia%3AVandalismi_in_corso\n", "Disallow: /wiki/Wikipedia:Amministratori\n", "Disallow: /wiki/Wikipedia%3AAmministratori\n", "Disallow: /wiki/Wikipedia:Proposte_di_cancellazione_semplificata\n", "Disallow: 
/wiki/Wikipedia%3AProposte_di_cancellazione_semplificata\n", "Disallow: /wiki/Categoria:Da_cancellare_subito\n", "Disallow: /wiki/Categoria%3ADa_cancellare_subito\n", "Disallow: /wiki/Wikipedia:Sospette_violazioni_di_copyright\n", "Disallow: /wiki/Wikipedia%3ASospette_violazioni_di_copyright\n", "Disallow: /wiki/Categoria:Da_controllare_per_copyright\n", "Disallow: /wiki/Categoria%3ADa_controllare_per_copyright\n", "Disallow: /wiki/Progetto:Rimozione_contributi_sospetti\n", "Disallow: /wiki/Progetto%3ARimozione_contributi_sospetti\n", "Disallow: /wiki/Categoria:Da_cancellare_subito_per_violazione_integrale_copyright\n", "Disallow: /wiki/Categoria%3ADa_cancellare_subito_per_violazione_integrale_copyright\n", "Disallow: /wiki/Progetto:Cococo\n", "Disallow: /wiki/Progetto%3ACococo\n", "Disallow: /wiki/Discussioni_progetto:Cococo\n", "Disallow: /wiki/Discussioni_progetto%3ACococo\n", "#\n", "# jawiki\n", "Disallow: /wiki/%E7%89%B9%E5%88%A5:Search\n", "Disallow: /wiki/%E7%89%B9%E5%88%A5%3ASearch\n", "# T7239\n", "Disallow: /wiki/Wikipedia:%E5%89%8A%E9%99%A4%E4%BE%9D%E9%A0%BC/\n", "Disallow: /wiki/Wikipedia%3A%E5%89%8A%E9%99%A4%E4%BE%9D%E9%A0%BC/\n", "Disallow: /wiki/Wikipedia:%E5%88%A9%E7%94%A8%E8%80%85%E3%83%9A%E3%83%BC%E3%82%B8%E3%81%AE%E5%89%8A%E9%99%A4%E4%BE%9D%E9%A0%BC\n", "Disallow: /wiki/Wikipedia%3A%E5%88%A9%E7%94%A8%E8%80%85%E3%83%9A%E3%83%BC%E3%82%B8%E3%81%AE%E5%89%8A%E9%99%A4%E4%BE%9D%E9%A0%BC\n", "# nowiki\n", "# T13432\n", "Disallow: /wiki/Bruker:\n", "Disallow: /wiki/Bruker%3A\n", "Disallow: /wiki/Brukerdiskusjon\n", "Disallow: /wiki/Wikipedia:Administratorer\n", "Disallow: /wiki/Wikipedia%3AAdministratorer\n", "Disallow: /wiki/Wikipedia-diskusjon:Administratorer\n", "Disallow: /wiki/Wikipedia-diskusjon%3AAdministratorer\n", "Disallow: /wiki/Wikipedia:Sletting\n", "Disallow: /wiki/Wikipedia%3ASletting\n", "Disallow: /wiki/Wikipedia-diskusjon:Sletting\n", "Disallow: /wiki/Wikipedia-diskusjon%3ASletting\n", "#\n", "# plwiki\n", "# T10067\n", "Disallow: 
/wiki/Wikipedia:Strony_do_usuni%C4%99cia\n", "Disallow: /wiki/Wikipedia%3AStrony_do_usuni%C4%99cia\n", "Disallow: /wiki/Wikipedia:Do_usuni%C4%99cia\n", "Disallow: /wiki/Wikipedia%3ADo_usuni%C4%99cia\n", "Disallow: /wiki/Wikipedia:SDU/\n", "Disallow: /wiki/Wikipedia%3ASDU/\n", "Disallow: /wiki/Wikipedia:Strony_podejrzane_o_naruszenie_praw_autorskich\n", "Disallow: /wiki/Wikipedia%3AStrony_podejrzane_o_naruszenie_praw_autorskich\n", "#\n", "# ptwiki:\n", "# T7394\n", "Disallow: /wiki/Wikipedia:Páginas_para_eliminar/\n", "Disallow: /wiki/Wikipedia:P%C3%A1ginas_para_eliminar/\n", "Disallow: /wiki/Wikipedia%3AP%C3%A1ginas_para_eliminar/\n", "Disallow: /wiki/Wikipedia_Discussão:Páginas_para_eliminar/\n", "Disallow: /wiki/Wikipedia_Discuss%C3%A3o:P%C3%A1ginas_para_eliminar/\n", "Disallow: /wiki/Wikipedia_Discuss%C3%A3o%3AP%C3%A1ginas_para_eliminar/\n", "#\n", "# rowiki:\n", "# T14546\n", "Disallow: /wiki/Wikipedia:Pagini_de_%C5%9Fters\n", "Disallow: /wiki/Wikipedia%3APagini_de_%C5%9Fters\n", "Disallow: /wiki/Discu%C5%A3ie_Wikipedia:Pagini_de_%C5%9Fters\n", "Disallow: /wiki/Discu%C5%A3ie_Wikipedia%3APagini_de_%C5%9Fters\n", "#\n", "# ruwiki:\n", "Disallow: /wiki/%D0%A1%D0%BF%D0%B5%D1%86%D0%B8%D0%B0%D0%BB%D1%8C%D0%BD%D1%8B%D0%B5:Search\n", "Disallow: /wiki/%D0%A1%D0%BF%D0%B5%D1%86%D0%B8%D0%B0%D0%BB%D1%8C%D0%BD%D1%8B%D0%B5%3ASearch\n", "#\n", "# svwiki:\n", "# T12229\n", "Disallow: /wiki/Wikipedia%3ASidor_f%C3%B6reslagna_f%C3%B6r_radering\n", "Disallow: /wiki/Wikipedia:Sidor_f%C3%B6reslagna_f%C3%B6r_radering\n", "Disallow: /wiki/Wikipedia:Sidor_föreslagna_för_radering\n", "Disallow: /wiki/Användare\n", "Disallow: /wiki/Anv%C3%A4ndare\n", "Disallow: /wiki/Användardiskussion\n", "Disallow: /wiki/Anv%C3%A4ndardiskussion\n", "Disallow: /wiki/Wikipedia:Skyddade_sidnamn\n", "Disallow: /wiki/Wikipedia%3ASkyddade_sidnamn\n", "# T13291\n", "Disallow: /wiki/Wikipedia:Sidor_som_bör_raderas\n", "Disallow: /wiki/Wikipedia:Sidor_som_b%C3%B6r_raderas\n", "Disallow: 
/wiki/Wikipedia%3ASidor_som_b%C3%B6r_raderas\n", "#\n", "# zhwiki:\n", "# T7104\n", "Disallow: /wiki/Wikipedia:删除投票/侵权\n", "Disallow: /wiki/Wikipedia:%E5%88%A0%E9%99%A4%E6%8A%95%E7%A5%A8/%E4%BE%B5%E6%9D%83\n", "Disallow: /wiki/Wikipedia:删除投票和请求\n", "Disallow: /wiki/Wikipedia:%E5%88%A0%E9%99%A4%E6%8A%95%E7%A5%A8%E5%92%8C%E8%AF%B7%E6%B1%82\n", "Disallow: /wiki/Category:快速删除候选\n", "Disallow: /wiki/Category:%E5%BF%AB%E9%80%9F%E5%88%A0%E9%99%A4%E5%80%99%E9%80%89\n", "Disallow: /wiki/Category:维基百科需要翻译的文章\n", "Disallow: /wiki/Category:%E7%BB%B4%E5%9F%BA%E7%99%BE%E7%A7%91%E9%9C%80%E8%A6%81%E7%BF%BB%E8%AF%91%E7%9A%84%E6%96%87%E7%AB%A0\n", "#\n", "# sister projects\n", "#\n", "# enwikinews:\n", "# T7340\n", "Disallow: /wiki/Portal:Prepared_stories/\n", "Disallow: /wiki/Portal%3APrepared_stories/\n", "#\n", "# itwikinews\n", "# T11138\n", "Disallow: /wiki/Wikinotizie:Richieste_di_cancellazione\n", "Disallow: /wiki/Wikinotizie:Sospette_violazioni_di_copyright\n", "Disallow: /wiki/Categoria:Da_cancellare_subito\n", "Disallow: /wiki/Categoria:Da_cancellare_subito_per_violazione_integrale_copyright\n", "Disallow: /wiki/Wikinotizie:Storie_in_preparazione\n", "#\n", "# enwikiquote:\n", "# T17095\n", "Disallow: /wiki/Wikiquote:Votes_for_deletion/\n", "Disallow: /wiki/Wikiquote%3AVotes_for_deletion/\n", "Disallow: /wiki/Wikiquote_talk:Votes_for_deletion/\n", "Disallow: /wiki/Wikiquote_talk%3AVotes_for_deletion/\n", "Disallow: /wiki/Wikiquote:Votes_for_deletion_archive/\n", "Disallow: /wiki/Wikiquote%3AVotes_for_deletion_archive/\n", "Disallow: /wiki/Wikiquote_talk:Votes_for_deletion_archive/\n", "Disallow: /wiki/Wikiquote_talk%3AVotes_for_deletion_archive/\n", "#\n", "# enwikibooks\n", "Disallow: /wiki/Wikibooks:Votes_for_deletion\n", "#\n", "# working...\n", "Disallow: /wiki/Fundraising_2007/comments\n", "#\n", "#\n", "#\n", "#----------------------------------------------------------#\n", "#\n", "#\n", "#\n", " #
\n",
      "#\n",
      "# Localisable part of robots.txt for en.wikipedia.org\n",
      "#\n",
      "# Edit at https://en.wikipedia.org/w/index.php?title=MediaWiki:Robots.txt&action=edit\n",
      "# Don't add newlines here. All rules set here are active for every user-agent.\n",
      "#\n",
      "# Please check any changes using a syntax validator such as http://tool.motoricerca.info/robots-checker.phtml\n",
      "# Enter https://en.wikipedia.org/robots.txt as the URL to check.\n",
      "#\n",
      "# https://phabricator.wikimedia.org/T16075\n",
      "Disallow: /wiki/MediaWiki:Spam-blacklist\n",
      "Disallow: /wiki/MediaWiki%3ASpam-blacklist\n",
      "Disallow: /wiki/MediaWiki_talk:Spam-blacklist\n",
      "Disallow: /wiki/MediaWiki_talk%3ASpam-blacklist\n",
      "Disallow: /wiki/Wikipedia:WikiProject_Spam\n",
      "Disallow: /wiki/Wikipedia_talk:WikiProject_Spam\n",
      "#\n",
      "# Folks get annoyed when XfD discussions end up the number 1 google hit for\n",
      "# their name. \n",
      "# https://phabricator.wikimedia.org/T16075\n",
      "Disallow: /wiki/Wikipedia:Articles_for_deletion\n",
      "Disallow: /wiki/Wikipedia%3AArticles_for_deletion\n",
      "Disallow: /wiki/Wikipedia:Votes_for_deletion\n",
      "Disallow: /wiki/Wikipedia%3AVotes_for_deletion\n",
      "Disallow: /wiki/Wikipedia:Pages_for_deletion\n",
      "Disallow: /wiki/Wikipedia%3APages_for_deletion\n",
      "Disallow: /wiki/Wikipedia:Miscellany_for_deletion\n",
      "Disallow: /wiki/Wikipedia%3AMiscellany_for_deletion\n",
      "Disallow: /wiki/Wikipedia:Miscellaneous_deletion\n",
      "Disallow: /wiki/Wikipedia%3AMiscellaneous_deletion\n",
      "Disallow: /wiki/Wikipedia:Categories_for_discussion\n",
      "Disallow: /wiki/Wikipedia%3ACategories_for_discussion\n",
      "Disallow: /wiki/Wikipedia:Templates_for_deletion\n",
      "Disallow: /wiki/Wikipedia%3ATemplates_for_deletion\n",
      "Disallow: /wiki/Wikipedia:Redirects_for_discussion\n",
      "Disallow: /wiki/Wikipedia%3ARedirects_for_discussion\n",
      "Disallow: /wiki/Wikipedia:Deletion_review\n",
      "Disallow: /wiki/Wikipedia%3ADeletion_review\n",
      "Disallow: /wiki/Wikipedia:WikiProject_Deletion_sorting\n",
      "Disallow: /wiki/Wikipedia%3AWikiProject_Deletion_sorting\n",
      "Disallow: /wiki/Wikipedia:Files_for_deletion\n",
      "Disallow: /wiki/Wikipedia%3AFiles_for_deletion\n",
      "Disallow: /wiki/Wikipedia:Files_for_discussion\n",
      "Disallow: /wiki/Wikipedia%3AFiles_for_discussion\n",
      "Disallow: /wiki/Wikipedia:Possibly_unfree_files\n",
      "Disallow: /wiki/Wikipedia%3APossibly_unfree_files\n",
      "#\n",
      "# https://phabricator.wikimedia.org/T12288\n",
      "Disallow: /wiki/Wikipedia_talk:Articles_for_deletion\n",
      "Disallow: /wiki/Wikipedia_talk%3AArticles_for_deletion\n",
      "Disallow: /wiki/Wikipedia_talk:Votes_for_deletion\n",
      "Disallow: /wiki/Wikipedia_talk%3AVotes_for_deletion\n",
      "Disallow: /wiki/Wikipedia_talk:Pages_for_deletion\n",
      "Disallow: /wiki/Wikipedia_talk%3APages_for_deletion\n",
      "Disallow: /wiki/Wikipedia_talk:Miscellany_for_deletion\n",
      "Disallow: /wiki/Wikipedia_talk%3AMiscellany_for_deletion\n",
      "Disallow: /wiki/Wikipedia_talk:Miscellaneous_deletion\n",
      "Disallow: /wiki/Wikipedia_talk%3AMiscellaneous_deletion\n",
      "Disallow: /wiki/Wikipedia_talk:Templates_for_deletion\n",
      "Disallow: /wiki/Wikipedia_talk%3ATemplates_for_deletion\n",
      "Disallow: /wiki/Wikipedia_talk:Categories_for_discussion\n",
      "Disallow: /wiki/Wikipedia_talk%3ACategories_for_discussion\n",
      "Disallow: /wiki/Wikipedia_talk:Deletion_review\n",
      "Disallow: /wiki/Wikipedia_talk%3ADeletion_review\n",
      "Disallow: /wiki/Wikipedia_talk:WikiProject_Deletion_sorting\n",
      "Disallow: /wiki/Wikipedia_talk%3AWikiProject_Deletion_sorting\n",
      "Disallow: /wiki/Wikipedia_talk:Files_for_deletion\n",
      "Disallow: /wiki/Wikipedia_talk%3AFiles_for_deletion\n",
      "Disallow: /wiki/Wikipedia_talk:Files_for_discussion\n",
      "Disallow: /wiki/Wikipedia_talk%3AFiles_for_discussion\n",
      "Disallow: /wiki/Wikipedia_talk:Possibly_unfree_files\n",
      "Disallow: /wiki/Wikipedia_talk%3APossibly_unfree_files\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Copyright_problems\n",
      "Disallow: /wiki/Wikipedia%3ACopyright_problems\n",
      "Disallow: /wiki/Wikipedia_talk:Copyright_problems\n",
      "Disallow: /wiki/Wikipedia_talk%3ACopyright_problems\n",
      "Disallow: /wiki/Wikipedia:Suspected_copyright_violations\n",
      "Disallow: /wiki/Wikipedia%3ASuspected_copyright_violations\n",
      "Disallow: /wiki/Wikipedia_talk:Suspected_copyright_violations\n",
      "Disallow: /wiki/Wikipedia_talk%3ASuspected_copyright_violations\n",
      "Disallow: /wiki/Wikipedia:Contributor_copyright_investigations\n",
      "Disallow: /wiki/Wikipedia%3AContributor_copyright_investigations\n",
      "Disallow: /wiki/Wikipedia:Contributor_copyright_investigations\n",
      "Disallow: /wiki/Wikipedia%3AContributor_copyright_investigations\n",
      "Disallow: /wiki/Wikipedia_talk:Contributor_copyright_investigations\n",
      "Disallow: /wiki/Wikipedia_talk%3AContributor_copyright_investigations\n",
      "Disallow: /wiki/Wikipedia_talk:Contributor_copyright_investigations\n",
      "Disallow: /wiki/Wikipedia_talk%3AContributor_copyright_investigations\n",
      "Disallow: /wiki/Wikipedia:Protected_titles\n",
      "Disallow: /wiki/Wikipedia%3AProtected_titles\n",
      "Disallow: /wiki/Wikipedia_talk:Protected_titles\n",
      "Disallow: /wiki/Wikipedia_talk%3AProtected_titles\n",
      "Disallow: /wiki/Wikipedia:Articles_for_creation\n",
      "Disallow: /wiki/Wikipedia%3AArticles_for_creation\n",
      "Disallow: /wiki/Wikipedia_talk:Articles_for_creation\n",
      "Disallow: /wiki/Wikipedia_talk%3AArticles_for_creation\n",
      "Disallow: /wiki/Wikipedia_talk:Article_wizard\n",
      "Disallow: /wiki/Wikipedia_talk%3AArticle_wizard\n",
      "#\n",
      "# https://phabricator.wikimedia.org/T13261\n",
      "Disallow: /wiki/Wikipedia:Requests_for_arbitration\n",
      "Disallow: /wiki/Wikipedia%3ARequests_for_arbitration\n",
      "Disallow: /wiki/Wikipedia_talk:Requests_for_arbitration\n",
      "Disallow: /wiki/Wikipedia_talk%3ARequests_for_arbitration\n",
      "Disallow: /wiki/Wikipedia:Requests_for_comment\n",
      "Disallow: /wiki/Wikipedia%3ARequests_for_comment\n",
      "Disallow: /wiki/Wikipedia_talk:Requests_for_comment\n",
      "Disallow: /wiki/Wikipedia_talk%3ARequests_for_comment\n",
      "Disallow: /wiki/Wikipedia:Requests_for_adminship\n",
      "Disallow: /wiki/Wikipedia%3ARequests_for_adminship\n",
      "Disallow: /wiki/Wikipedia_talk:Requests_for_adminship\n",
      "Disallow: /wiki/Wikipedia_talk%3ARequests_for_adminship\n",
      "#\n",
      "# https://phabricator.wikimedia.org/T14111\n",
      "Disallow: /wiki/Wikipedia:Requests_for_checkuser\n",
      "Disallow: /wiki/Wikipedia%3ARequests_for_checkuser\n",
      "Disallow: /wiki/Wikipedia_talk:Requests_for_checkuser\n",
      "Disallow: /wiki/Wikipedia_talk%3ARequests_for_checkuser\n",
      "#\n",
      "# https://phabricator.wikimedia.org/T15398\n",
      "Disallow: /wiki/Wikipedia:WikiProject_Spam\n",
      "Disallow: /wiki/Wikipedia%3AWikiProject_Spam\n",
      "#\n",
      "# https://phabricator.wikimedia.org/T16793\n",
      "Disallow: /wiki/Wikipedia:Changing_username\n",
      "Disallow: /wiki/Wikipedia%3AChanging_username\n",
      "Disallow: /wiki/Wikipedia:Changing_username\n",
      "Disallow: /wiki/Wikipedia%3AChanging_username\n",
      "Disallow: /wiki/Wikipedia_talk:Changing_username\n",
      "Disallow: /wiki/Wikipedia_talk%3AChanging_username\n",
      "Disallow: /wiki/Wikipedia_talk:Changing_username\n",
      "Disallow: /wiki/Wikipedia_talk%3AChanging_username\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Administrators%27_noticeboard\n",
      "Disallow: /wiki/Wikipedia%3AAdministrators%27_noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk:Administrators%27_noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk%3AAdministrators%27_noticeboard\n",
      "Disallow: /wiki/Wikipedia:Community_sanction_noticeboard\n",
      "Disallow: /wiki/Wikipedia%3ACommunity_sanction_noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk:Community_sanction_noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk%3ACommunity_sanction_noticeboard\n",
      "Disallow: /wiki/Wikipedia:Bureaucrats%27_noticeboard\n",
      "Disallow: /wiki/Wikipedia%3ABureaucrats%27_noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk:Bureaucrats%27_noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk%3ABureaucrats%27_noticeboard\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Sockpuppet_investigations\n",
      "Disallow: /wiki/Wikipedia%3ASockpuppet_investigations\n",
      "Disallow: /wiki/Wikipedia_talk:Sockpuppet_investigations\n",
      "Disallow: /wiki/Wikipedia_talk%3ASockpuppet_investigations\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Neutral_point_of_view/Noticeboard\n",
      "Disallow: /wiki/Wikipedia%3ANeutral_point_of_view/Noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk:Neutral_point_of_view/Noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk%3ANeutral_point_of_view/Noticeboard\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:No_original_research/noticeboard\n",
      "Disallow: /wiki/Wikipedia%3ANo_original_research/noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk:No_original_research/noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk%3ANo_original_research/noticeboard\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Fringe_theories/Noticeboard\n",
      "Disallow: /wiki/Wikipedia%3AFringe_theories/Noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk:Fringe_theories/Noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk%3AFringe_theories/Noticeboard\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Conflict_of_interest/Noticeboard\n",
      "Disallow: /wiki/Wikipedia%3AConflict_of_interest/Noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk:Conflict_of_interest/Noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk%3AConflict_of_interest/Noticeboard\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Long-term_abuse\n",
      "Disallow: /wiki/Wikipedia%3ALong-term_abuse\n",
      "Disallow: /wiki/Wikipedia_talk:Long-term_abuse\n",
      "Disallow: /wiki/Wikipedia_talk%3ALong-term_abuse\n",
      "Disallow: /wiki/Wikipedia:Long_term_abuse\n",
      "Disallow: /wiki/Wikipedia%3ALong_term_abuse\n",
      "Disallow: /wiki/Wikipedia_talk:Long_term_abuse\n",
      "Disallow: /wiki/Wikipedia_talk%3ALong_term_abuse\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Wikiquette_assistance\n",
      "Disallow: /wiki/Wikipedia%3AWikiquette_assistance\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Abuse_reports\n",
      "Disallow: /wiki/Wikipedia%3AAbuse_reports\n",
      "Disallow: /wiki/Wikipedia_talk:Abuse_reports\n",
      "Disallow: /wiki/Wikipedia_talk%3AAbuse_reports\n",
      "Disallow: /wiki/Wikipedia:Abuse_response\n",
      "Disallow: /wiki/Wikipedia%3AAbuse_response\n",
      "Disallow: /wiki/Wikipedia_talk:Abuse_response\n",
      "Disallow: /wiki/Wikipedia_talk%3AAbuse_response\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Reliable_sources/Noticeboard\n",
      "Disallow: /wiki/Wikipedia%3AReliable_sources/Noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk:Reliable_sources/Noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk%3AReliable_sources/Noticeboard\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Suspected_sock_puppets\n",
      "Disallow: /wiki/Wikipedia%3ASuspected_sock_puppets\n",
      "Disallow: /wiki/Wikipedia_talk:Suspected_sock_puppets\n",
      "Disallow: /wiki/Wikipedia_talk%3ASuspected_sock_puppets\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Biographies_of_living_persons/Noticeboard\n",
      "Disallow: /wiki/Wikipedia%3ABiographies_of_living_persons/Noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk:Biographies_of_living_persons/Noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk%3ABiographies_of_living_persons/Noticeboard\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Content_noticeboard\n",
      "Disallow: /wiki/Wikipedia%3AContent_noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk:Content_noticeboard\n",
      "Disallow: /wiki/Wikipedia_talk%3AContent_noticeboard\n",
      "#\n",
      "Disallow: /wiki/Template:Editnotices\n",
      "Disallow: /wiki/Template%3AEditnotices\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Arbitration\n",
      "Disallow: /wiki/Wikipedia%3AArbitration\n",
      "Disallow: /wiki/Wikipedia_talk:Arbitration\n",
      "Disallow: /wiki/Wikipedia_talk%3AArbitration\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Arbitration_Committee\n",
      "Disallow: /wiki/Wikipedia%3AArbitration_Committee\n",
      "Disallow: /wiki/Wikipedia_talk:Arbitration_Committee\n",
      "Disallow: /wiki/Wikipedia_talk%3AArbitration_Committee\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Arbitration_Committee_Elections\n",
      "Disallow: /wiki/Wikipedia%3AArbitration_Committee_Elections\n",
      "Disallow: /wiki/Wikipedia_talk:Arbitration_Committee_Elections\n",
      "Disallow: /wiki/Wikipedia_talk%3AArbitration_Committee_Elections\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Mediation_Committee\n",
      "Disallow: /wiki/Wikipedia%3AMediation_Committee\n",
      "Disallow: /wiki/Wikipedia_talk:Mediation_Committee\n",
      "Disallow: /wiki/Wikipedia_talk%3AMediation_Committee\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Mediation_Cabal/Cases\n",
      "Disallow: /wiki/Wikipedia%3AMediation_Cabal/Cases\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Requests_for_bureaucratship\n",
      "Disallow: /wiki/Wikipedia%3ARequests_for_bureaucratship\n",
      "Disallow: /wiki/Wikipedia_talk:Requests_for_bureaucratship\n",
      "Disallow: /wiki/Wikipedia_talk%3ARequests_for_bureaucratship\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Administrator_review\n",
      "Disallow: /wiki/Wikipedia%3AAdministrator_review\n",
      "Disallow: /wiki/Wikipedia_talk:Administrator_review\n",
      "Disallow: /wiki/Wikipedia_talk%3AAdministrator_review\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Editor_review\n",
      "Disallow: /wiki/Wikipedia%3AEditor_review\n",
      "Disallow: /wiki/Wikipedia_talk:Editor_review\n",
      "Disallow: /wiki/Wikipedia_talk%3AEditor_review\n",
      "#\n",
      "Disallow: /wiki/Wikipedia:Article_Incubator\n",
      "Disallow: /wiki/Wikipedia%3AArticle_Incubator\n",
      "Disallow: /wiki/Wikipedia_talk:Article_Incubator\n",
      "Disallow: /wiki/Wikipedia_talk%3AArticle_Incubator\n",
      "#\n",
      "Disallow: /wiki/Category:Noindexed_pages\n",
      "Disallow: /wiki/Category%3ANoindexed_pages\n",
      "#\n",
      "# User sandboxes for modules are placed in these subpages for testing\n",
      "#\n",
      "Disallow: /wiki/Module:Sandbox\n",
      "Disallow: /wiki/Module%3ASandbox\n",
      "#\n",
      "# 
\n" ] } ], "source": [ "import requests\n", "response = requests.get('https://en.wikipedia.org/robots.txt')\n", "print(response.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Write a Python program to get the number of datasets currently listed on data.gov." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `cssselect` module is not part of the Python standard library. I installed it with `conda install cssselect`.\n", "\n", "**lxml** - the most feature-rich and easy-to-use library for processing XML and HTML in the Python language." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of datasets currently listed on 'data.gov' website is: \n", "250,060 datasets\n" ] } ], "source": [ "from lxml import html \n", "import requests\n", "\n", "response = requests.get('http://www.data.gov/') # Output: \n", "# Output of title: '\\n \\n\\n \\n\\n Observations of total phosphorous (TP) to support nearshore nutrient modeling, 2015.\\n \\n \\n\\n\\n\\n\\n\\n '\n", "# The title is padded with whitespace, so strip() is used to remove it:\n", "\n", "print(\"The name of the most recently added dataset on \\'data.gov\\' website:\")\n", "print(title.strip())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6. Write a Python program to extract the h1 tag from example.com." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "

<h1>Example Domain</h1>

\n" ] } ], "source": [ "from urllib.request import urlopen\n", "from bs4 import BeautifulSoup\n", "\n", "html = urlopen('http://www.example.com/')\n", "bsh = BeautifulSoup(html.read(), 'html.parser')\n", "print(bsh.h1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7. Write a Python program to extract and display all the header tags from en.wikipedia.org/wiki/Main_Page. " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "List of all headers tag:\n", "========================= \n", "\n", "

Main Page

\n", "\n", "

From today's featured article

\n", "\n", "

Did you know...

\n", "\n", "

In the news

\n", "\n", "

On this day

\n", "\n", "

From today's featured list

\n", "\n", "

Today's featured picture

\n", "\n", "

Other areas of Wikipedia

\n", "\n", "

Wikipedia's sister projects

\n", "\n", "

Wikipedia languages

\n", "\n", "

Navigation menu

\n", "\n", "

Personal tools

\n", "\n", "

Namespaces

\n", "\n", "

\n", "Variants\n", "

\n", "\n", "

Views

\n", "\n", "

More

\n", "\n", "

\n", "\n", "

\n", "\n", "

Navigation

\n", "\n", "

Interaction

\n", "\n", "

Tools

\n", "\n", "

In other projects

\n", "\n", "

Print/export

\n", "\n", "

Languages

\n" ] } ], "source": [ "from urllib.request import urlopen\n", "from bs4 import BeautifulSoup\n", "\n", "html = urlopen('https://en.wikipedia.org/wiki/Main_Page')\n", "bsh = BeautifulSoup(html.read(), 'html.parser')\n", "headers = bsh.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])\n", "print('List of all headers tag:')\n", "print('=' * 25, '\\n')\n", "print(*headers, sep = '\\n\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8. Write a Python program to extract and display all the image links from en.wikipedia.org/wiki/Peter_Jeffrey_(RAAF_officer)." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "//upload.wikimedia.org/wikipedia/commons/thumb/a/af/NlaJeffrey1942-43.jpg/220px-NlaJeffrey1942-43.jpg \n", "\n", "//upload.wikimedia.org/wikipedia/commons/thumb/c/c5/008315JeffreyTurnbull1941.jpg/260px-008315JeffreyTurnbull1941.jpg \n", "\n", "//upload.wikimedia.org/wikipedia/commons/e/ea/021807CameronJeffrey1941.jpg \n", "\n", "//upload.wikimedia.org/wikipedia/commons/thumb/9/92/AC0072JeffreyTruscottKittyhawks1942.jpg/280px-AC0072JeffreyTruscottKittyhawks1942.jpg \n", "\n", "//upload.wikimedia.org/wikipedia/commons/thumb/2/26/VIC1689Jeffrey1945.jpg/280px-VIC1689Jeffrey1945.jpg \n", "\n" ] } ], "source": [ "from urllib.request import urlopen\n", "from bs4 import BeautifulSoup\n", "import re\n", "\n", "html = urlopen('https://en.wikipedia.org/wiki/Peter_Jeffrey_(RAAF_officer)')\n", "bsh = BeautifulSoup(html.read(), 'html.parser')\n", "# Keep only <img> tags whose src contains '.jpg'; the dot is escaped so it matches a literal '.'\n", "images = bsh.find_all('img', {'src': re.compile(r'\\.jpg')})\n", "for image in images:\n", "    print(image['src'], '\\n')  # print the src attribute; the extra '\\n' argument leaves a blank line after each link" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 9. Write a Python program to get 90 days of visits broken down by browser for all sites on data.gov." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 10. Write a Python program that retrieves an arbitrary Wikipedia page of \"Python\" and creates a list of links on that page." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 11. Write a Python program to check whether a page contains a title or not." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 12. Write a Python program to list all language names and number of related articles in the order they appear in wikipedia.org." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 13. Write a Python program to get the number of people visiting a U.S. government website right now.\n", "\n", "*Source*: [https://analytics.usa.gov/data/live/realtime.json](https://analytics.usa.gov/data/live/realtime.json)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 14. Write a Python program to get the number of security alerts issued by US-CERT in the current year.\n", "\n", "*Source:* [https://www.us-cert.gov/ncas/alerts](https://www.us-cert.gov/ncas/alerts)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 15. Write a Python program to get the number of Pinterest accounts maintained by U.S. 
State Department embassies and missions.\n", "\n", "*Source:* [https://www.state.gov/r/pa/ode/socialmedia/](https://www.state.gov/r/pa/ode/socialmedia/)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 16. Write a Python program to get the number of followers of a given Twitter account." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 17. Write a Python program to get the number of accounts a given Twitter account is following." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 18. Write a Python program to get the number of posts on Twitter liked by a given account." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 19. Write a Python program to count the number of tweets by a given Twitter account." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 20. Write a Python program to scrape the number of tweets of a given Twitter account." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 21. Write a Python program to find the live weather report (temperature, wind speed, description and weather) of a given city." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 22. Write a Python program to display the date, days, title, city and country of the next 25 Hackevents." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 23. Write a Python program to download IMDB's Top 250 data (movie name, initial release, director name and stars)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 24. Write a Python program to get movie name, year and a brief summary of the top 10 random movies." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 25. Write a Python program to get the number of magnitude 4.5+ earthquakes detected worldwide by the USGS." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 26. Write a Python program to display the contents of different attributes, such as status_code, headers, url, history, encoding, reason, cookies, elapsed, request and content, of a specified resource." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 27. Write a Python program to verify SSL certificates for HTTPS requests using the requests module.\n", "\n", "**Note:** Requests verifies SSL certificates for HTTPS requests, just like a web browser. 
By default, SSL verification is enabled, and Requests will raise an SSLError if it is unable to verify the certificate." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Problem source: [https://www.w3resource.com/python-exercises/web-scraping/index.php](https://www.w3resource.com/python-exercises/web-scraping/index.php)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }