{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson 7—Fetching data online" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Version 1.0. Prepared by [Makzan](https://makzan.net). Updated at 2021 March." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this series, we will use 3 lectures to learn fetching data online. This includes:\n", "\n", "- Finding patterns in URL\n", "- Open web URL\n", "- Downloading files in Python\n", "- Fetch data with API\n", "- **Web scraping with Requests and BeautifulSoup**\n", "- Web automation with Selenium\n", "- Converting Wikipedia tabular data into CSV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this lesson, we will learn to download web page and parse the HTML to extract the data we need. We will use `requests` and `BeautifulSoup`. `Requests` downloads the web page HTML file and `BeautifulSoup` parses the HTML into tree structure for us to access and extract data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Web Scraping\n", "\n", "1. Querying web page\n", "1. Parse the DOM tree\n", "1. Get the data we want from the HTML code" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "廉政公署首支親子義工團接受培訓\n", "2020年6月澳門國際性銀行業務統計\n", "文化局對龍環葡韻進行外牆修繕工程 期間照常對外開放\n", "2020年第2季私人建築及不動產交易統計\n", "約12,000個社會房屋租戶將會減租\n", "特區政府將於8月20日起實施社會房屋恆常性申請\n", "第46屆世界技能大賽澳門區選拔賽現正接受報名\n", "新經屋法明(18)日起生效\n", "2020/2021學年“大專助學金計劃”的奬學金及特別助學金 即將截止申請\n", "都更公司呼籲祐漢七樓群業主約洽重建\n", "行政長官賀一誠在北京與國家發展和改革委員會主任何立峰會面\n", "行政長官賀一誠在北京與國務院國有資產監督管理委員會黨委書記、主任郝鵬會面\n", "社會文化司司長歐陽瑜代表行政長官出席澳門城市大學2019/2020學年高等學位頒授典禮\n", "行政長官賀一誠在北京與國家稅務總局局長王軍會面\n", "行政長官賀一誠在北京與海關總署署長倪岳峰會面\n", "行政長官賀一誠在北京與公安部副部長兼出入境管理局局長許甘露會面\n", "【澳門都市更新股份有限公司】「祐漢七棟樓群」入戶調查--問題天天都多,祐漢都更問清楚\n", "【新聞局】行政長官賀一誠在北京與國家發展和改革委員會主任何立峰會面\n", "【新聞局】行政長官賀一誠在北京與國務院國有資產監督管理委員會黨委書記、主任郝鵬會面\n", "【新聞局】行政長官賀一誠與國家稅務總局局長王軍會面\n", "【新聞局】新型冠狀病毒最新疫情及本澳各項防控措施新聞發佈會(14-08)\n", "【新聞局】行政長官賀一誠與水利部部長鄂竟平會面\n", "【新聞局】行政長官賀一誠:暫未打算開放外國人及非中國籍外僱入境\n", "【新聞局】行政長官賀一誠在北京與商務部部長鍾山會見\n", "【新聞局】新型冠狀病毒最新疫情及本澳各項防控措施新聞發佈會(12-08)\n", "【財政局】財政局新推移動支付繳交定期稅\n", "\n", "\n", "\n", "\n", "煙頭化作「百鳥歸巢」 水煙袋大碌竹分貧富\n", "九澳七苦聖母小教堂\n", "木造 古與今\n", "碎木打造文創品 「廢柴」有幸慶新生\n", "行政長官賀一誠率團往北京拜會部委\n", "行政長官賀一誠:恢復內地旅客訪澳有助經濟復甦\n", "【圖文包】8月12日零時起澳門進入內地人員不再實行集中隔離醫學觀察出入境安排\n", "行政長官賀一誠與商務部部長鍾山會面\n", "行政長官在北京拜訪中國人民銀行和國家外匯管理局\n", "行政長官賀一誠:暫未打算開放外國人及非中國籍外僱入境\n", "行政長官賀一誠拜訪公安部與出入境管理局\n", "-記者會快訊(澳門市民進入中國內地各省市,可事先申請核酸檢測的紙本報告)-\n", "行政長官賀一誠與海關總署署長倪岳峰會面\n", "行政長官賀一誠與水利部部長鄂竟平會面\n" ] } ], "source": [ "from bs4 import BeautifulSoup\n", "import requests\n", "\n", "\n", "res = requests.get(\"https://news.gov.mo/home/zh-hant\")\n", "soup = BeautifulSoup(res.text, \"html.parser\")\n", "\n", "for h5 in soup.select(\"h5\"):\n", " print(h5.text.strip())\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extra: Fetching with try-except" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "社會工作局新任正副局長就職\n", "-記者會快訊(過去兩天澳門居民入境珠海豁免隔離醫學觀察的網上申請系統運作良好)-\n", "當事人需就沒有在訴訟程序中適時採取防禦手段承擔後果\n", "新型冠狀病毒感染應變協調中心查詢熱線統計數字(6月22日 08:00至16:00)\n", "-記者會快訊(北京疫情)-\n", "-記者會快訊(出入境及市面情況)-\n", "-記者會快訊(6月19-21日共新增354名入境人士須作醫學觀察)-\n", "-記者會快訊(有867人在指定酒店作醫學觀察)-\n", "托兒所友善措施首日運作暢順\n", "-記者會快訊(本澳最新疫情)-\n", "“消費補貼計劃”中期報告新聞發佈會\n", "行政長官賀一誠與香港特區政府保安局局長李家超一行會面\n", "行政長官賀一誠會見澳門理工學院校董會一行\n", "澳門歷史名人足跡(攝影:周文來)\n", "經香港國際機場返澳的居民今日下午乘坐特別渡輪服務抵達氹仔北安碼頭\n", "新城A區B4地段公共房屋建造工程 - 基礎及地庫公開開標\n", "【新聞局】消費補貼計劃中期報告新聞發佈會22-06\n", "【澳門都市更新股份有限公司】祐漢七棟樓群入戶調查\n", "【新聞局】消費補貼計劃中期報告新聞發佈會22-06\n", "新型冠狀病毒最新疫情及本澳各項防控措施新聞發佈會(19-06)\n", "【新聞局】市政署人員到冷凍倉庫、批發市場和街市採取樣本作新冠病毒核酸檢測\n", "【新聞局】經香港國際機場返澳的居民乘搭首班特別渡輪抵達氹仔北安碼頭\n", "【新聞局】新型冠狀病毒最新疫情及本澳各項防控措施新聞發佈會(17-06)\n", "【新聞局】心出發-遊澳門新聞發佈會16-06\n", "【新聞局】新型冠狀病毒最新疫情及本澳各項防控措施新聞發佈會(15-06)\n", "【新聞局】行政長官 賀一誠 栽種幼樹 宣揚珍惜大自然\n", "\n", "\n", "\n", "\n", "焯公亭 記華商 抗疫貢獻\n", "夜香行業七十年代式微 垃圾處理三十年前變天\n", "病疫影響城市規劃\n", "澳門藝術界 為抗疫打氣\n", "非強制央積金2020年度預算盈餘特別分配款項名單公佈\n", "明起重新開放澳門居民入境珠海豁免隔離預約系統 獲批人士可 7天內入境珠海獲豁免隔離\n", "“心出發‧遊澳門”明日起接受報名 冀逐步恢復旅遊業活動\n", "-記者會快訊(“心出發‧遊澳門”本地遊活動的統籌及詳情)-\n", "明起珠海對本澳合資格人士入境暫不實施集中醫學隔離安排 應變協調中心提醒市民留意獲批的開始日期及提早填報資料\n", "百億抗疫援助基金計劃款項6月16日起發放\n", "兩項防浸設備資助計劃月杪結束,呼籲商戶如有需要及早申請\n", "【圖文包】常住珠海屬內地居民的外地僱員 進入本澳前免14天醫學觀察\n", "-記者會快訊(申請豁免隔離入境珠海的網上系統明日重開)-\n", "行政長官賀一誠與香港保安局局長李家超會面\n" ] } ], "source": [ "from bs4 import BeautifulSoup\n", "import requests\n", "\n", "try:\n", " res = requests.get(\"https://news.gov.mo/home/zh-hant\")\n", "except requests.exceptions.ConnectionError:\n", " print(\"Error: Invalid URL or Connection Lost.\")\n", " exit()\n", "\n", "soup = BeautifulSoup(res.text, \"html.parser\")\n", "\n", "for h5 in soup.select(\"h5\"):\n", " print(h5.text.strip())\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "廉政公署首支親子義工團接受培訓\n", "\n", "\n", "廉政公署組成首支“廉潔義工隊—親子義工團”,並完成首場培訓活動。廉署期望,新組成的親子義工團,進一步壯大廉潔義工隊,擴展深入社區的倡廉力量。\n", "廉署親子義工團於上月展開招募,報名人數超出招募名額逾4倍。由於反應熱烈,廉署決定將原定40個親子組別增加至60個,並以抽籤方式確定入圍者。親子義工團組成後已於本月15日展開首場培訓活動,包括舉辦講座認識廉政公署職能和了解廉潔義工隊的工作等,親子成員並參觀及體驗了黑沙環社區辦事處的多媒體設施。有家長認為親子義工活動富有意義,將會身體力行,帶領小朋友為廉潔社會獻一分力,並期望孩子能在參與義工工作中培養正確價值觀。\n", "廉署自2001年成立“廉潔義工隊"以來,歷年成員超過600人。廉署稍後將為親子團成員繼續提供各種培訓,讓親子義工協助廉署推行各項倡廉活動,例如參與義務工作實踐、探訪社會服務機構及參與公益活動等。\n", "\n", "---\n", "2020年6月澳門國際性銀行業務統計\n", "\n", "\n", "澳門金融管理局今天發佈的統計顯示,在2020年第二季,國際性業務佔澳門整體銀行業務的比重回落。至2020年6月底,國際資產佔銀行體系總資產的比重,從2020年3月底的85.9% 下跌至85.5%;而國際負債佔銀行體系總負債的比重,亦從2020年3月底的82.9% 下跌至82.6%。\n", "外幣是澳門國際性銀行業務的主要交易單位。至2020年6月底,澳門元佔國際性銀行資產及負債的比重均為0.7%,而港元、美元、人民幣及其他外幣在國際資產中分別佔35.0%、47.5%、12.7% 及4.1%,其在國際負債中的比重則分別為40.0%、45.1%、10.9% 及3.3%。\n", "澳門銀行的國際資產         \n", "至2020年6月底,澳門銀行的國際資產總額按季上升3.1%,而按年亦增加12.7%,金額為18,759億澳門元(2,350億美元);其中,對外資產按年上升11.9% 至14,045億澳門元,而本地外幣資產亦增長15.3%,金額為4,713億澳門元。作為國際資產主要組成部份的外地非銀行貸款增加16.7%,金額達6,666億澳門元。\n", "澳門銀行的國際負債\n", "澳門銀行的國際負債總額較三個月前上升3.2%,而按年亦增長13.8%,金額為18,112億澳門元(2,269億美元);其中,對外負債及本地外幣負債分別按年上升18.5% 及9.0%,金額達9,614億及8,497億澳門元。澳門居民及特區政府存放於澳門銀行的各類外幣存款,仍然是國際負債的主要組成部份,這類存款按年上升4.7%,至2020年6月底的7,237億澳門元。\n", "澳門銀行對外資產及負債的地區分佈\n", "澳門的國際性銀行業務主要分佈在亞洲及歐洲。至2020年6月底,澳門銀行體系對外資產中,對中國內地及香港的債權各佔40.1% 及27.8%;而對德國及英國的債權則各佔0.9% 及0.8%。與此同時,對葡語國家及“一帶一路”沿線國家的債權比重分別為1.3% 及 7.0%。在對外負債方面,對香港及中國內地的負債各佔總體對外負債的48.3% 及24.9%;對英國及法國的負債則各佔4.5% 及3.7%;而對葡語國家及“一帶一路”沿線國家的負債比重分別為0.4% 及8.4%。\n", "澳門國際性銀行業務統計是根據國際清算銀行(Bank for International Settlements)所倡導的方法編制,藉以配合澳門特別行政區參與該國際組織的“地區性國際銀行業務統計”計劃。\n", "\n", "---\n", "文化局對龍環葡韻進行外牆修繕工程 期間照常對外開放\n", "\n", "\n", "為更好地保護澳門的文物建築及展館設施,文化局將於8月19日(星期三)起對氹仔龍環葡韻進行外牆修繕工程,期間參觀時間不受影響,敬請公眾留意。\n", "是次修繕工程主要對龍環葡韻五座建築物的外牆和門窗等進行維修保養、粉刷及翻新,屆時建築物外部會搭建排柵及保護圍網,工程預計於11月底完成。\n", "龍環葡韻開放時間為上午10時至下午7時(下午6時30分停止入場),逢星期一休館,免費入場,每周六下午3時至5時設有導賞服務。查詢可於辦公時間致電 8988 4000。\n", "\n", "---\n", "2020年第2季私人建築及不動產交易統計\n", "\n", "\n", "統計暨普查局資料顯示,今年第2季完成繳納印花稅程序的樓宇單位及停車位買賣共2,636個,總值147.1億元(澳門元,下同),按季分別上升81.2%及86.3%。\n", "住宅單位買賣有1,971個,按季增加990個,成交總值126.9億元,上升107.0%;現貨住宅成交1,678個,增加838個,成交金額上升116.5%至105.5億元;住宅樓花買賣增加152個至293個,成交金額21.5億元,上升70.2%。\n", "整體住宅單位每平方米實用面積平均價格(105,134元)錄得4.8%的按季升幅,當中澳門半島(104,888元)與氹仔(102,240元)分別上升7.1%及4.5%,路環(119,587元)則下跌2.9%。\n", "現貨住宅平均價格為98,710元,按季上升4.7%;成交主要集中在氹仔中心區(267個)、黑沙環新填海區(247個)和望廈及水塘(122個),平均價格分別為每平方米95,349元、121,605元及135,418元。\n", "住宅樓花平均價格按季上升13.7%至152,088元;筷子基(77個)、氹仔舊城及馬場(73個)和路環(70個)的平均價格分別為每平方米165,854元、191,853元及132,765元。\n", "辦公室單位每平方米實用面積平均價格為112,216元,按季下跌0.7%,工業單位則上升1.5%至52,367元。\n", "第2季訂立的不動產買賣契約有2,315宗,涉及的不動產數目(2,509個)按季上升34.8%;訂立按揭貸款契約有2,223宗,涉及不動產數目(2,955個)下跌5.3%。\n", "私人建築方面,今年第2季末處於設計階段和在建中的住宅單位分別有8,916個及3,766個,另有689個正進行驗樓。季內獲發動工批示的住宅單位有43個,其中42個位於澳門半島。獲發使用准照的住宅單位共176個,當中無間隔單位佔61.9%。\n", "\n", "---\n", "約12,000個社會房屋租戶將會減租\n", "\n", "\n", "在訂定租金金額時,應考慮社屋的類型,尤其應考慮承租人及其家團成員的每月收入、社屋的實用面積及房屋自由市場的租金水平。\n", "新社屋租金計算方式設兩個級別:\n", "\n", "當家團總收入低於或等於維持生計的開支時,租金為零。\n", "當家團總收入高於維持生計的開支,以家團總收入減去維持生計的開支後之17.5%計算。\n", "\n", "按新方式計算,並配合經調整的社屋家團每月收入上限,約12,000個租戶將會減租。\n", "\n", "---\n", "Done.\n" ] } ], "source": [ "from bs4 import BeautifulSoup\n", "import requests\n", "\n", "\n", "res = requests.get(\"https://news.gov.mo/home/zh-hant\")\n", "soup = BeautifulSoup(res.text, \"html.parser\")\n", "\n", "for h5 in soup.select(\"h5\")[:5]:\n", " print(h5.getText().strip())\n", " \n", " # Fetch the content\n", " href = h5.select_one(\"a\")[\"href\"]\n", " res = requests.get(\"https://news.gov.mo/\" + href)\n", " soup2 = BeautifulSoup(res.text, \"html.parser\")\n", " content = soup2.select_one(\".asideBody p:first-of-type\")\n", " print(content.text)\n", " print(\"---\")\n", "\n", "print(\"Done.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fetching Macao Daily news" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "A01:澳聞\n", "橫琴新口岸明開通\n", "供澳鮮活空車入跨工區快捷\n", "三巴士線經澳方口岸\n", "港珠澳橋口岸貨運通關暢順\n", "\n", "A02:澳聞\n", "本報記者 澳大實習生 鄭詠心 報道\n", "漁業復運當局多措防疫\n", "漁會:漁民漁工具防疫意識\n", "順風順水\n", "居民幫襯街市“首撈”海鮮\n", "粵港澳聯合執法助復工復產\n", "\n", "A03:澳聞\n", "賀:疫情凸顯行業單一弊端\n", "賀一誠冀支持粵澳深度合作\n", "鄭安庭促提前做好交通規劃\n", "氹促會籲科學研判減交通困局\n", "社諮委倡步行系統串連黑沙\n", "城規師:十年變遷都更重查必要\n", "\n", "A04:要聞\n", "鍾:粵疫情不會大擴散\n", "粵八招防秋冬疫情反彈\n", "深採樣逾八萬份均陰性\n", "我首個新冠疫苗專利獲批\n", "疆新增四本土病例\n", "穗停進口疫區冷凍肉水產\n", "\n", "A05:要聞\n", "特朗普被指操縱選舉\n", "特朗普胞弟病逝\n", "特:將考慮是否赦免斯諾登\n", "\n", "A06:特刋\n", "金沙物美嘉年華引十萬人次到場\n", "王英偉:免費展銷平台助內銷經濟\n", "心明治:珍惜參展機會\n", "Finished.\n" ] } ], "source": [ "from bs4 import BeautifulSoup\n", "import requests\n", "import datetime\n", "\n", "today = datetime.date.today()\n", "year = today.year\n", "month = today.month\n", "day = today.day\n", "\n", "month = str(month).zfill(2)\n", "day = str(day).zfill(2) \n", "res = requests.get(f\"http://www.macaodaily.com/html/{year}-{month}/{day}/node_1.htm\")\n", "\n", "res.encoding = \"utf-8\"\n", "\n", "soup = BeautifulSoup(res.text, \"html.parser\") # Be aware that you may need a different parser if \"lxml\" not found.\n", "\n", "links = soup.select(\"#all_article_list a\")\n", "for link in links[:40]:\n", " print(link.text) \n", "\n", "\n", "print(\"Finished.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ✏️ Exercise time: Lab 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Please try to execute the code to see the program result.\n", "1. Please try to change the keyword inside the code to fetch different queries.\n", "1. Please try to make the code more flexible by changing the date and query into input.\n", "1. Please try to save the result into a text file.\n", "1. Please try to change the code to allow multiple searches until user enters \"q\"." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2020-06-01: 穗力推大灣區軌道交通融合\n", "2020-06-07: 穗市長:四領域推進大灣區建設\n", "2020-06-09: 全國政協調研大灣區創新合作\n", "2020-06-09: 大灣區葡語教育聯盟成立\n", "2020-06-14: 粵發佈大灣區文遺遊徑\n", "2020-06-14: 再現大灣區詩歌地圖\n", "Finished.\n" ] } ], "source": [ "from bs4 import BeautifulSoup\n", "import requests\n", "\n", "# Task 1: Change year and month into input\n", "year = \"2020\"\n", "month = \"06\"\n", "\n", "for i in range(1,32):\n", " day = str(i).zfill(2) \n", " res = requests.get(f\"http://www.macaodaily.com/html/{year}-{month}/{day}/node_1.htm\")\n", "\n", " res.encoding = \"utf-8\"\n", "\n", " soup = BeautifulSoup(res.text, \"html.parser\")\n", "\n", " links = soup.select(\"#all_article_list a\")\n", " for link in links:\n", " news_title = link.getText()\n", "\n", " # Task 2: Change keyword into input\n", " if \"大灣區\" in news_title:\n", " # Task 3: Save the result in TXT intead of printing out.\n", " print(f\"{year}-{month}-{day}: {news_title}\")\n", "\n", "print(\"Finished.\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Solution to Lab 3\n", "\n", "https://makclass.com/vimeo_players/335074765" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## When is the next holiday?" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 接下來的公眾假期星期四25六月端午節公眾假期\n" ] } ], "source": [ "import datetime\n", "\n", "url = f\"https://www.gov.mo/zh-hant/public-holidays/year-{datetime.date.today().year}/\"\n", "response = requests.get(url)\n", "soup = BeautifulSoup(response.text, \"html.parser\")\n", "\n", "print(soup.select(\"#public-holidays\")[0].text.replace('\\n',''))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "接下來的公眾假期:端午節, 六月25日星期四\n" ] } ], "source": [ "month = soup.select(\"#public-holidays .month\")[0].text\n", "day = soup.select(\"#public-holidays .day\")[0].text\n", "weekday = soup.select(\"#public-holidays .weekday\")[0].text\n", "description = soup.select(\"#next-holiday-description strong\")[0].text\n", "\n", "print(f\"接下來的公眾假期:{description}, {month}{day}日{weekday}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A list of holidays in Macao" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1月1日: 元旦\n", "1月25日: 農曆正月初一\n", "1月26日: 農曆正月初二\n", "1月27日: 農曆正月初三\n", "4月4日: 清明節\n", "4月10日: 耶穌受難日\n", "4月11日: 復活節前日\n", "4月30日: 佛誕節\n", "5月1日: 勞動節\n", "6月25日: 端午節\n", "10月1日: 中華人民共和國國慶日\n", "10月2日: 中華人民共和國國慶日翌日\n", "10月2日: 中秋節翌日\n", "10月25日: 重陽節\n", "11月2日: 追思節\n", "12月8日: 聖母無原罪瞻禮\n", "12月20日: 澳門特別行政區成立紀念日\n", "12月21日: 冬至\n", "12月24日: 聖誕節前日\n", "12月25日: 聖誕節\n" ] } ], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "\n", "response = requests.get(\"https://www.gov.mo/zh-hant/public-holidays/year-2020/\")\n", "soup = BeautifulSoup(response.text, \"html.parser\")\n", "\n", "tables = soup.select(\".table\")\n", "\n", "for row in tables[0].select(\"tr\"):\n", " if len(row.select(\"td\")) > 0:\n", " date = row.select(\"td\")[1].text\n", " name = row.select(\"td\")[3].text\n", " print(f\"{date}: {name}\")\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only listing obligatory holidays" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1月1日: 元旦\n", "1月25日: 農曆正月初一\n", "1月26日: 農曆正月初二\n", "1月27日: 農曆正月初三\n", "4月4日: 清明節\n", "5月1日: 勞動節\n", "10月1日: 中華人民共和國國慶日\n", "10月2日: 中秋節翌日\n", "10月25日: 重陽節\n", "12月20日: 澳門特別行政區成立紀念日\n" ] } ], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "\n", "response = requests.get(\"https://www.gov.mo/zh-hant/public-holidays/year-2020/\")\n", "soup = BeautifulSoup(response.text, \"html.parser\")\n", "\n", "tables = soup.select(\".table\")\n", "\n", "for row in tables[0].select(\"tr\"):\n", " if len(row.select(\"td\")) > 0:\n", " is_obligatory = (row.select(\"td\")[0].text == \"*\")\n", " if is_obligatory:\n", " date = row.select(\"td\")[1].text\n", " name = row.select(\"td\")[3].text\n", " print(f\"{date}: {name}\")\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Is today government holiday?" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6月22日\n", "今天是星期日,但不是公眾假期。\n" ] } ], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "import datetime\n", "\n", "# Get today's year, month and day\n", "today = datetime.date.today()\n", "year = today.year\n", "month = today.month\n", "day = today.day\n", "today_weekday = today.weekday()\n", "today_date = f\"{month}月{day}日\"\n", "\n", "\n", "# Fetch gov.mo\n", "url = f\"https://www.gov.mo/zh-hant/public-holidays/year-{year}/\"\n", "response = requests.get(url)\n", "soup = BeautifulSoup(response.text, \"html.parser\")\n", "\n", "tables = soup.select(\".table\")\n", "\n", "holidays = {}\n", "\n", "for table in tables:\n", " for row in table.select(\"tr\"):\n", " if len(row.select(\"td\")) > 0: \n", " date = row.select(\"td\")[1].text\n", " weekday = row.select(\"td\")[2].text\n", " name = row.select(\"td\")[3].text\n", " holidays[date] = name\n", "\n", "\n", "# Query holidays\n", "print(today_date)\n", "if today_date in holidays:\n", " holiday = holidays[today_date]\n", " print(f\"今天是公眾假期:{holiday}\")\n", "elif today_weekday == 0:\n", " print(\"今天是星期日,但不是公眾假期。\")\n", "elif today_weekday == 6:\n", " print(\"今天是星期六,但不是公眾假期。\") \n", "else:\n", " print(\"今天不是公眾假期。\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our code is getting longer now. We can group the parts of the code that fetch gov.mo into a function. We name it `is_macao_holiday` and take a date parameter." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def is_macao_holiday(query_date): \n", " # Fetch gov.mo\n", " url = f\"https://www.gov.mo/zh-hant/public-holidays/year-{query_date.year}/\"\n", " response = requests.get(url)\n", " soup = BeautifulSoup(response.text, \"html.parser\")\n", "\n", " tables = soup.select(\".table\")\n", "\n", " holidays = {}\n", "\n", " for table in tables:\n", " for row in table.select(\"tr\"):\n", " if len(row.select(\"td\")) > 0: \n", " date = row.select(\"td\")[1].text\n", " weekday = row.select(\"td\")[2].text\n", " name = row.select(\"td\")[3].text\n", " holidays[date] = name\n", "\n", "\n", " # Query holidays\n", " date_key = f\"{query_date.month}月{query_date.day}日\"\n", "\n", " if date_key in holidays: \n", " holiday = holidays[date_key]\n", " print(f\"{date_key}是公眾假期:{holiday}\")\n", " elif query_date.weekday() == 0:\n", " print(f\"{date_key}是星期日,但不是公眾假期。\")\n", " elif query_date.weekday() == 6:\n", " print(f\"{date_key}是星期六,但不是公眾假期。\") \n", " else:\n", " print(f\"{date_key}不是公眾假期。\")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6月18日不是公眾假期。\n" ] } ], "source": [ "is_macao_holiday(datetime.date.today())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Picking a date other than today" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use parser in `dateutil` to parse a given date in string format into date format." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1月1日是公眾假期:元旦\n" ] } ], "source": [ "import dateutil\n", "date = dateutil.parser.parse(\"2020-01-01\")\n", "is_macao_holiday(date)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10月26日是公眾假期:重陽節的補假\n" ] } ], "source": [ "import dateutil\n", "date = dateutil.parser.parse(\"2020-10-26\")\n", "is_macao_holiday(date)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Futhermore, we can store the result in dictionary for further querying." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "import datetime\n", "\n", "# Get today's year, month and day\n", "today = datetime.date.today()\n", "year = today.year\n", "month = today.month\n", "day = today.day\n", "today_weekday = today.weekday()\n", "today_date = f\"{month}月{day}日\"\n", "\n", "\n", "# Fetch gov.mo\n", "url = f\"https://www.gov.mo/zh-hant/public-holidays/year-{year}/\"\n", "response = requests.get(url)\n", "soup = BeautifulSoup(response.text, \"html.parser\")\n", "\n", "tables = soup.select(\".table\")\n", "\n", "holidays = {}\n", "\n", "for table in tables:\n", " for row in table.select(\"tr\"):\n", " if len(row.select(\"td\")) > 0: \n", " is_obligatory = (row.select(\"td\")[0].text == \"*\")\n", " date = row.select(\"td\")[1].text\n", " weekday = row.select(\"td\")[2].text\n", " name = row.select(\"td\")[3].text\n", " holidays[date] = {\n", " 'date': date,\n", " 'weekday': weekday,\n", " 'name': name,\n", " 'is_obligatory': is_obligatory,\n", " }\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is stored in dictionary `holidays`." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "28" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(holidays)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'1月1日': {'date': '1月1日',\n", " 'weekday': '星期三',\n", " 'name': '元旦',\n", " 'is_obligatory': True},\n", " '1月25日': {'date': '1月25日',\n", " 'weekday': '星期六',\n", " 'name': '農曆正月初一',\n", " 'is_obligatory': True},\n", " '1月26日': {'date': '1月26日',\n", " 'weekday': '星期日',\n", " 'name': '農曆正月初二',\n", " 'is_obligatory': True},\n", " '1月27日': {'date': '1月27日',\n", " 'weekday': '星期一',\n", " 'name': '農曆正月初三',\n", " 'is_obligatory': True},\n", " '4月4日': {'date': '4月4日',\n", " 'weekday': '星期六',\n", " 'name': '清明節',\n", " 'is_obligatory': True},\n", " '4月10日': {'date': '4月10日',\n", " 'weekday': '星期五',\n", " 'name': '耶穌受難日',\n", " 'is_obligatory': False},\n", " '4月11日': {'date': '4月11日',\n", " 'weekday': '星期六',\n", " 'name': '復活節前日',\n", " 'is_obligatory': False},\n", " '4月30日': {'date': '4月30日',\n", " 'weekday': '星期四',\n", " 'name': '佛誕節',\n", " 'is_obligatory': False},\n", " '5月1日': {'date': '5月1日',\n", " 'weekday': '星期五',\n", " 'name': '勞動節',\n", " 'is_obligatory': True},\n", " '6月25日': {'date': '6月25日',\n", " 'weekday': '星期四',\n", " 'name': '端午節',\n", " 'is_obligatory': False},\n", " '10月1日': {'date': '10月1日',\n", " 'weekday': '星期四',\n", " 'name': '中華人民共和國國慶日',\n", " 'is_obligatory': True},\n", " '10月2日': {'date': '10月2日',\n", " 'weekday': '星期五',\n", " 'name': '中秋節翌日',\n", " 'is_obligatory': True},\n", " '10月25日': {'date': '10月25日',\n", " 'weekday': '星期日',\n", " 'name': '重陽節',\n", " 'is_obligatory': True},\n", " '11月2日': {'date': '11月2日',\n", " 'weekday': '星期一',\n", " 'name': '追思節',\n", " 'is_obligatory': False},\n", " '12月8日': {'date': '12月8日',\n", " 'weekday': '星期二',\n", " 'name': '聖母無原罪瞻禮',\n", " 'is_obligatory': False},\n", " '12月20日': {'date': '12月20日',\n", " 'weekday': '星期日',\n", " 'name': '澳門特別行政區成立紀念日',\n", " 'is_obligatory': True},\n", " '12月21日': {'date': '12月21日',\n", " 'weekday': '星期一',\n", " 'name': '冬至',\n", " 'is_obligatory': False},\n", " '12月24日': {'date': '12月24日',\n", " 'weekday': '星期四',\n", " 'name': '聖誕節前日',\n", " 'is_obligatory': False},\n", " '12月25日': {'date': '12月25日',\n", " 'weekday': '星期五',\n", " 'name': '聖誕節',\n", " 'is_obligatory': False},\n", " '1月24日(下午)': {'date': '1月24日(下午)',\n", " 'weekday': '星期五',\n", " 'name': '農曆除夕',\n", " 'is_obligatory': False},\n", " '10月5日': {'date': '10月5日',\n", " 'weekday': '星期一',\n", " 'name': '中華人民共和國國慶日翌日及中秋節翌日重疊',\n", " 'is_obligatory': False},\n", " '12月31日(下午)': {'date': '12月31日(下午)',\n", " 'weekday': '星期四',\n", " 'name': '除夕',\n", " 'is_obligatory': False},\n", " '1月28日': {'date': '1月28日',\n", " 'weekday': '星期二',\n", " 'name': '農曆正月初一的補假',\n", " 'is_obligatory': False},\n", " '1月29日': {'date': '1月29日',\n", " 'weekday': '星期三',\n", " 'name': '農曆正月初二的補假',\n", " 'is_obligatory': False},\n", " '4月6日': {'date': '4月6日',\n", " 'weekday': '星期一',\n", " 'name': '清明節的補假',\n", " 'is_obligatory': False},\n", " '4月13日': {'date': '4月13日',\n", " 'weekday': '星期一',\n", " 'name': '復活節前日的補假',\n", " 'is_obligatory': False},\n", " '10月26日': {'date': '10月26日',\n", " 'weekday': '星期一',\n", " 'name': '重陽節的補假',\n", " 'is_obligatory': False},\n", " '12月22日': {'date': '12月22日',\n", " 'weekday': '星期二',\n", " 'name': '澳門特別行政區成立紀念日的補假',\n", " 'is_obligatory': False}}" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "holidays" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6月22日\n", "今天是星期日,但不是公眾假期。\n" ] } ], "source": [ "# Query holidays\n", "print(today_date)\n", "if today_date in holidays:\n", " holiday = holidays[today_date]\n", " if holiday['is_obligatory']:\n", " print(f\"今天是強制公眾假期:{holiday['name']}\")\n", " else:\n", " print(f\"今天是公眾假期:{holiday['name']}\")\n", "elif today_weekday == 0:\n", " print(\"今天是星期日,但不是公眾假期。\")\n", "elif today_weekday == 6:\n", " print(\"今天是星期六,但不是公眾假期。\") \n", "else:\n", " print(\"今天不是公眾假期。\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "In this lesson, we learned about using BeautifulSoup to extract data from the web." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }