{ "cells": [ { "cell_type": "markdown", "id": "d162b058", "metadata": {}, "source": [ "--- \n", " \n", "\n", "

Department of Data Science

\n", "

Course: Tools and Techniques for Data Science

\n", "\n", "---\n", "

Instructor: Muhammad Arif Butt, Ph.D.

\n" ] }, { "cell_type": "markdown", "id": "3184edc1", "metadata": {}, "source": [ "

Lecture 5.4 (Web Scraping using Selenium - II)

" ] }, { "cell_type": "markdown", "id": "35b552c1", "metadata": {}, "source": [ "\"Open" ] }, { "cell_type": "markdown", "id": "bca7407c", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "90ab85fc", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "2d24678e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c8f244e9", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c7682a1d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "ba77e7b5", "metadata": {}, "source": [ "\n", "


\n", "\n", "## Learning agenda of this notebook\n", "\n", "**Recap of Previous Session**\n", "\n", "\n", "- **Best Practices for Web Scraping & Some Points to Ponder**\n", "\n", "\n", "- **Example 1:** Searching and Downloading Images for ML Classification: https://google.com\n", "\n", "\n", "\n", "- **Example 2:** Scraping Comments from a YouTube Video for NLP: https://www.youtube.com/watch?v=mHONNcZbwDY\n", "\n", "\n", "- **Example 3:** Scraping Jobs from a Job Website: https://pk.indeed.com\n", "\n", "\n", "- **Example 4:** Scraping Tweets of a Celebrity: https://twitter.com/login\n", "\n", "\n", "- **Example 5:** Scraping News Articles from a News Website: https://www.thenews.com.pk/today\n", "\n", "\n", "- **Exercise:**\n" ] }, { "cell_type": "code", "execution_count": null, "id": "5db20d16", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "e6d009c1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "ab3035a4", "metadata": {}, "source": [ "## Best Practices for Web Scraping & Some Points to Ponder" ] }, { "cell_type": "markdown", "id": "573400af", "metadata": {}, "source": [ "### a. Check if Website is Changing Layouts and use Robust Locators\n", "- Locating the correct web element is a prerequisite for web scraping. \n", "- We can use ID, Name, Class, Tag, LinkText and PartialLinkText to locate web elements in Selenium.\n", "- In dynamic environments, web elements often do not have consistent attribute values, so finding a unique, static attribute is tricky. Hence, the six Selenium locators mentioned above might not be able to uniquely identify a web element.\n", "- In such situations, CSS selectors and XPATH should be preferred." 
] }, { "cell_type": "code", "execution_count": null, "id": "c706299d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "9ade05ea", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "2562f856", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "56957001", "metadata": {}, "source": [ "#### CSS SELECTOR\n", "- Basic Syntax: `tag[attribute='value']`\n", "- Using ID: `input[id='username']` or `input#username`\n", "- Using Class: `input[class='form-control']` or `input.form-control`\n", "- Using any attribute: `input[any-attr='attr-value']` \n", "- Combining attributes: `input.form-control[attr='value']`\n", "- Using Parent/Child Hierarchy: \n", "    - Basic Syntax: `parent-locator > child-locator`\n", "    - Direct Parent/Child: `div > input[attr = 'value']`" ] }, { "cell_type": "code", "execution_count": null, "id": "01accc26", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "11fc761b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "a1d5c1fd", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e4c483ca", "metadata": {}, "source": [ "#### XPATH SELECTOR\n", "- Basic Example:\n", "    - Absolute XPATH: `/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input`\n", "    - Relative XPATH: `//input[@title='Search']`\n", "- On dynamic websites, a simple XPATH might return multiple elements. 
To write effective XPATHs, one can use XPATH functions to identify elements uniquely:\n", "    - Using contains(): `//input[contains(@id, 'userN')]` matches a partial attribute value, whereas `//input[@id='userName']` requires an exact match\n", "    - Using starts-with(): `//tagname[starts-with(@attribute, 'initial partial value of attribute')]`\n", "    - Using text(): `//tagname[text()='text of the element']`\n", "- You can use the `and` and `or` operators to identify an element by combining two different conditions or attributes:\n", "    - Using and: `//tagname[@name='value' and @id='value']`\n", "    - Using or: `//tagname[@name='value' or @id='value']`\n", "- You can use XPATH axes, which use the relationship between various nodes to locate a web element in the DOM structure:\n", "    - `ancestor`: Locates the ancestors of the current node, i.e., its parents up to the root node.\n", "    - `descendant`: Locates the descendants of the current node, i.e., its children down to the leaf nodes.\n", "    - `child`: Locates the children of the current node.\n", "    - `parent`: Locates the parent of the current node." ] }, { "cell_type": "code", "execution_count": null, "id": "e3634179", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "ea49ea93", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "d5a8bbad", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "63d39fa6", "metadata": {}, "source": [ "### b. Wait for the WebElement to be Displayed Before you Start Scraping\n", "- These days most web apps use AJAX techniques. When a page is loaded by the browser, the elements within that page may load at different time intervals. This makes locating elements difficult: if an element is not yet present in the DOM, a locate function will raise a `NoSuchElementException`. \n", "- One can use `time.sleep(10)` to make the script wait for exactly 10 seconds before proceeding. 
One should avoid such static waits and instead use the dynamic waits provided by Selenium WebDriver.\n", "- Many times the web elements are not interactable, not clickable, or not visible, and that is where you have to put a wait so that the page gets loaded and your script can find that particular web element and proceed further.\n", "    - **Implicit wait:**\n", "        - An implicit wait applies to every element lookup in the script.\n", "        - You specify a timeout once; each lookup then keeps polling the DOM until the element is found, or raises an exception if the time expires.\n", "        - Example: After `driver.implicitly_wait(30)`, each `find_element()` call will wait for a maximum of 30 seconds before raising a `NoSuchElementException`. If the element is available sooner, control moves to the next line of code immediately.\n", " \n", "    - **Explicit wait:**\n", "        - An explicit wait is used to wait for a specific web element.\n", "        - In an explicit wait, other than specifying the timeout, you also specify a condition to be checked, like checking if the element is present, visible, or clickable and so on.\n", "        - Example: The `element = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, 'xpath')))` will wait for a maximum of 30 seconds before raising a `TimeoutException`. Note that the locator is passed as a single `(By, value)` tuple. If the specific web element becomes present in the DOM within 30 seconds, control moves to the next line of code.\n", " \n", "    - **Fluent Wait** is quite similar to explicit wait, where you can also specify the polling frequency (in Python, via the `poll_frequency` argument of `WebDriverWait`). See the Selenium documentation for details: https://www.selenium.dev/documentation/webdriver/waits/" ] }, { "cell_type": "code", "execution_count": null, "id": "eb69a8e0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "8a7608b5", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "59cd8542", "metadata": {}, "source": [ "### c. 
Robots Exclusion Protocol\n", "- The robots exclusion protocol, or simply `robots.txt`, is a standard used by websites to communicate with web crawlers/bots, informing them about which areas of the website can be scanned or scraped. \n", "- The `robots.txt` file is placed in a website's top-level directory and is publicly available. Each sub-domain of a root domain can have its own `robots.txt` file.\n", "- The `robots.txt` file provides instructions for bots; however, it cannot actually enforce them.\n", "- So a good bot follows those instructions, while a bad bot ignores them." ] }, { "cell_type": "code", "execution_count": null, "id": "cfc14f53", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "7edd4a24", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "ce730b1a", "metadata": {}, "source": [ "### d. Do not Hammer the Webserver \n", "- Web scraping bots fetch data very fast, so it is easy for a website to detect your scraper.\n", "- To make sure that your bot does not hammer the webserver by sending too many requests in a very short span of time, put programmatic sleep calls such as `time.sleep(2)` in between requests, ideally with a randomized duration. " ] }, { "cell_type": "code", "execution_count": null, "id": "0e4ab7a1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c34726a1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "7cd07023", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "9c2404c6", "metadata": {}, "source": [ "### e. Avoid Scraping Data Behind Login\n", "- If a page is protected by login, the scraper has to send some information or cookies along with each request to view the page. \n", "- So be careful: if you get caught, your account might get blocked." 
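, "\n", "\n", "
The advice in sections c and d above can be combined in a small sketch. It is a minimal illustration under stated assumptions, not a definitive implementation: `SAMPLE_ROBOTS_TXT`, the `example.com` URLs, and the helper names `build_parser` and `polite_delay` are hypothetical, and the rules are parsed from an in-memory string so that the code runs without network access. Only the Python standard library is used.

```python
import random
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = '''
User-agent: *
Disallow: /private/
Allow: /
'''


def build_parser(robots_txt):
    # Parse robots.txt rules that are already available as text.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(min_seconds=1.0, max_seconds=3.0):
    # Sleep for a random duration so requests are not sent at a fixed rate.
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


parser = build_parser(SAMPLE_ROBOTS_TXT)
for url in ['https://example.com/jobs', 'https://example.com/private/data']:
    if parser.can_fetch('*', url):
        print('allowed:', url)
        polite_delay(0.1, 0.3)  # short delay for the demo only
    else:
        print('skipped (disallowed by robots.txt):', url)
```

With a real site, one would typically call `set_url(...)` and `read()` on the `RobotFileParser` to fetch the live file instead of parsing a string, and choose delay bounds that suit the server being scraped.
"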
] }, { "cell_type": "code", "execution_count": null, "id": "5033fd2e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "906d8c9d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "1c8e2371", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "a1a018e6", "metadata": {}, "source": [ "### f. Do not Follow Same Crawling Pattern\n", "- When humans browse a website, their view times vary, they are slow, and they perform random clicks. Bots, on the contrary, are very fast and follow the same fixed browsing pattern.\n", "- Some websites have intelligent anti-crawling mechanisms to detect spiders and may block your IP, after which you can no longer visit that website.\n", "- A simple solution is to incorporate some random clicks on the page, mouse movements and random actions that will make your bot look like a human." ] }, { "cell_type": "code", "execution_count": null, "id": "b5aae158", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "193ef976", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "379e2864", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e17ad1e1", "metadata": {}, "source": [ "### g. Beware of Honey Pots\n", "- Honeypots are traps that are used to lure hackers and to detect scraping attempts that try to gain information.\n", "- Some websites install honeypots in the form of links that are invisible to normal users, for example with a color disguised to blend in with the page’s background color. Bots can still see and follow these links, which is one of the ways scrapers get caught.\n", "- So make sure that your bot follows only links that are properly visible and carry no `nofollow` attribute." 
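, "\n", "\n", "
The honeypot check described above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: `SAMPLE_HTML` and the class name `SafeLinkCollector` are hypothetical, and only inline hiding (`hidden`, `display:none`, `visibility:hidden`) and `rel='nofollow'` are detected. Links hidden through external stylesheets or background-matching colors would need a rendered page to detect.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: one normal link, two hidden traps, one nofollow link.
SAMPLE_HTML = '''
<a href='/jobs'>Jobs</a>
<a href='/trap-1' style='display:none'>hidden</a>
<a href='/trap-2' hidden>hidden</a>
<a href='/login' rel='nofollow'>Login</a>
'''


class SafeLinkCollector(HTMLParser):
    # Collects href values of links that are not obviously hidden or nofollow.

    def __init__(self):
        super().__init__()
        self.safe_links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attributes = dict(attrs)
        style = (attributes.get('style') or '').replace(' ', '').lower()
        rel = (attributes.get('rel') or '').lower()
        hidden = (
            'hidden' in attributes
            or 'display:none' in style
            or 'visibility:hidden' in style
        )
        if attributes.get('href') and not hidden and 'nofollow' not in rel:
            self.safe_links.append(attributes['href'])


collector = SafeLinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.safe_links)
```

When the page is driven through Selenium instead, calling `is_displayed()` on a located link before clicking it is one way to apply the same idea to the rendered page.
"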
] }, { "cell_type": "code", "execution_count": null, "id": "52c4648b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "3ffd62c6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "03b8529a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "4863e6a2", "metadata": {}, "source": [ "### h. Rotate User-Agents\n", "- Every request made from a web browser contains a user-agent header, and some websites refuse to serve content if the user agent is not set.\n", "- Using the same user-agent consistently can lead to the detection of a bot. \n", "- A common way to make your requests look more like regular browser traffic is to set a realistic user agent and rotate it between requests.\n", "> You can get your User-Agent by typing `what is my user agent` in Google’s search bar. " ] }, { "cell_type": "code", "execution_count": null, "id": "18d8327a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "dd76f263", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "b2e23c94", "metadata": {}, "source": [ "### i. Make Requests through Proxies and Rotate Them as Needed\n", "- When scraping blindly, multiple requests coming from the same IP will get you blocked.\n", "- So it is better to scrape from behind a proxy server, so that the target website does not see your original IP, making detection harder.\n", "- There are several methods that can change your outgoing IP:\n", "    - TOR\n", "    - VPNs\n", "    - Free Proxies\n", "    - Shared Proxies\n", "    - Private Proxies\n", "    - Data Center Proxies\n", "    - Residential Proxies\n", "> You can get your IP by typing `what is my ip` in Google’s search bar." 
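, "\n", "\n", "
The rotation ideas in sections h and i can be sketched as follows. This is only an illustration under stated assumptions: the `USER_AGENTS` and `PROXIES` pools and the helper `next_chrome_arguments` are hypothetical, the proxy addresses come from a reserved documentation range and will not work as real proxies, and the sketch assumes Chrome's `--user-agent` and `--proxy-server` command-line switches.

```python
import itertools
import random

# Hypothetical pools; replace with user agents and proxies you are allowed to use.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0',
]
PROXIES = ['203.0.113.10:8080', '203.0.113.11:8080']  # documentation-range IPs

proxy_cycle = itertools.cycle(PROXIES)


def next_chrome_arguments():
    # Pick a random user agent and the next proxy in round-robin order.
    user_agent = random.choice(USER_AGENTS)
    proxy = next(proxy_cycle)
    return ['--user-agent=' + user_agent, '--proxy-server=http://' + proxy]


for _ in range(3):
    print(next_chrome_arguments())
```

Each returned string could then be passed to `add_argument()` on the `Options` object before the `Chrome` driver is created, so that every new browser session starts with a different user agent and proxy.
"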
] }, { "cell_type": "code", "execution_count": null, "id": "d862e66d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "d80cee50", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "fbda0b5b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "4f96b7b0", "metadata": {}, "source": [ "### j. Use CAPTCHA Solving Services\n", "- Many websites use CAPTCHAs to keep bots out of their websites.\n", "- If you want to scrape websites that use CAPTCHAs, you can use CAPTCHA services to get past these restrictions.\n", " - https://2captcha.com/\n", " - https://anti-captcha.com/\n", " - https://pypi.org/project/pytesseract/0.1/" ] }, { "cell_type": "code", "execution_count": null, "id": "6c3ed557", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "48421b3d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "7db9952c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e1bb1959", "metadata": {}, "source": [ "## Example 1: Searching and Downloading Images for ML Classification:https://google.com" ] }, { "cell_type": "markdown", "id": "da53cfcc", "metadata": {}, "source": [ "### a. 
Search and Load the Images of Cats" ] }, { "cell_type": "code", "execution_count": 3, "id": "f6891dca", "metadata": {}, "outputs": [], "source": [ "from selenium.webdriver import Chrome\n", "from selenium.webdriver.chrome.service import Service\n", "from selenium.webdriver.chrome.options import Options\n", "from selenium.webdriver.common.by import By\n", "from selenium.webdriver.common.keys import Keys\n", "import time\n", "\n", "# Create an instance of webdriver and load the google webpage\n", "s = Service('/Users/arif/Documents/chromedriver')\n", "myoptions = Options()\n", "myoptions.headless = False # default settings\n", "driver = Chrome(service=s, options=myoptions) \n", "driver.maximize_window()\n", "driver.get('https://google.com') \n", "\n", "\n", "# Locate the search textbox, enter the search string and press the enter key\n", "driver.implicitly_wait(30)\n", "#tbox = driver.find_element(By.XPATH, '/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input')\n", "#tbox = driver.find_element(By.XPATH, \"//input[@title='Search']\")\n", "tbox = driver.find_element(By.CSS_SELECTOR, \"input[title='Search']\")\n", "tbox.send_keys(\"Cat\")\n", "\n", "\n", "# Instead of locating and clicking the search button, you can simply press enter\n", "time.sleep(2)\n", "tbox.send_keys(Keys.ENTER) \n", "\n", "\n", "# Locate the image tab and click it to visit the images tab\n", "driver.implicitly_wait(30)\n", "#menu_img_link = driver.find_element(By.XPATH, '/html/body/div[7]/div/div[4]/div/div[1]/div/div[1]/div/div[2]/a')\n", "menu_img_link = driver.find_element(By.XPATH, '//*[@id=\"hdtb-msb\"]/div[1]/div/div[2]/a')\n", "menu_img_link.click()" ] }, { "cell_type": "code", "execution_count": null, "id": "c21ba6da", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "83d4464e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "7c5a1bbd", "metadata": {}, "source": [ "### b. 
Self-Scroll to the Bottom of the Webpage\n", "- Create an instance of WebDriver\n", "- The `driver.execute_script(JS)` method is used to synchronously execute JavaScript in the current window/frame.\n", "```\n", "driver.execute_script('alert(\"Hello JavaScript\")')\n", "```\n", "- The `window.scrollTo()` method is used to scroll the page. The x-coordinate and y-coordinate (in pixels) of the position to scroll to are passed as parameters to the method." ] }, { "cell_type": "code", "execution_count": 4, "id": "8d997302", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done... Reached the bottom of the page\n" ] } ], "source": [ "# Self-scroll the entire page till you reach the bottom\n", "last_height = driver.execute_script('return document.body.scrollHeight')\n", "while True:\n", "    driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')\n", "    time.sleep(4)  # give newly loaded content time to render\n", "    new_height = driver.execute_script('return document.body.scrollHeight')\n", "    # stop when scrolling no longer increases the page height\n", "    if new_height == last_height:\n", "        break\n", "    last_height = new_height\n", "\n", "print(\"Done... Reached the bottom of the page\")" ] }, { "cell_type": "code", "execution_count": null, "id": "bff86ac7", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "82fd09af", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "9900f33c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "99560332", "metadata": {}, "source": [ "### c. 
Save the Images by using the `screenshot()` method\n", "- Two ways to take a screenshot:\n", "    - `driver.save_screenshot(filename)` saves a screenshot of the current window to a PNG image file and returns a bool value\n", "\n", "    - `element.screenshot(filename)` saves a screenshot of the current element to a PNG image file and returns a bool value" ] }, { "cell_type": "code", "execution_count": 5, "id": "73f1173c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done... Check the folder for images of cats\n" ] } ], "source": [ "from selenium.webdriver.support.ui import WebDriverWait\n", "from selenium.webdriver.support import expected_conditions as EC\n", "\n", "# download and save 40 images of cats\n", "for i in range(1, 41):\n", "    xpath = '//*[@id=\"islrg\"]/div[1]/div[' + str(i) + ']/a[1]/div[1]/img'\n", "    try:\n", "        # wait (up to 5 seconds) for the i-th thumbnail to be present, then screenshot it\n", "        cat_img = WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.XPATH, xpath)))\n", "        cat_img.screenshot('/Users/arif/Downloads/cat_images/cat_img' + str(i) + '.png')\n", "    except Exception:\n", "        # skip thumbnails that are missing or cannot be captured\n", "        continue\n", "\n", "print(\"Done... 
Check the folder for images of cats\")" ] }, { "cell_type": "code", "execution_count": null, "id": "4b3852dc", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "7a793a01", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 6, "id": "4411d96b", "metadata": {}, "outputs": [], "source": [ "driver.quit()" ] }, { "cell_type": "code", "execution_count": 7, "id": "65f966c5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "cat_img1.png cat_img17.png cat_img24.png cat_img32.png cat_img4.png\r\n", "cat_img10.png cat_img18.png cat_img26.png cat_img33.png cat_img40.png\r\n", "cat_img11.png cat_img19.png cat_img27.png cat_img34.png cat_img5.png\r\n", "cat_img12.png cat_img2.png cat_img28.png cat_img35.png cat_img6.png\r\n", "cat_img13.png cat_img20.png cat_img29.png cat_img36.png cat_img7.png\r\n", "cat_img14.png cat_img21.png cat_img3.png cat_img37.png cat_img8.png\r\n", "cat_img15.png cat_img22.png cat_img30.png cat_img38.png cat_img9.png\r\n", "cat_img16.png cat_img23.png cat_img31.png cat_img39.png\r\n" ] } ], "source": [ "!ls /Users/arif/Downloads/cat_images" ] }, { "cell_type": "code", "execution_count": 8, "id": "5b206b38", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from matplotlib import image\n", "from matplotlib import pyplot as plt\n", "\n", "img1 = image.imread(\"/Users/arif/Downloads/cat_images/cat_img20.png\")\n", "plt.imshow(img1);" ] }, { "cell_type": "code", "execution_count": null, "id": "93b1246f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "5e48f371", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "f82bcbe5", "metadata": {}, "source": [ "## Example 2: Scraping Comments from a YouTube Video\n", "- https://www.youtube.com/watch?v=mHONNcZbwDY&t=80s" ] }, { "cell_type": "code", "execution_count": 10, "id": "3e12c690", "metadata": {}, "outputs": [], "source": [ "driver.quit()" ] }, { "cell_type": "code", "execution_count": 11, "id": "f26ef41f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Authors \\\n", "0 NaN \n", "1 Logene Temonio \n", "2 Joelfantastic ASMR \n", "3 NaN \n", "4 Sofia \n", "5 Daiane Santos \n", "6 NaN \n", "7 Marcia Corbin \n", "8 Music Mane \n", "9 dlnnyc64 \n", "10 Edmar Fernandes Couto \n", "11 Yancy Johnson \n", "12 Maria Rosa Helena Do Prado E Silva \n", "13 Nina Schimel \n", "14 Ceciliasantos \n", "15 Sra. Francisco \n", "16 João Marcos Rodrigues \n", "17 Billyn Ivy \n", "18 Cicada3301 \n", "19 Carmen Bazan \n", "20 NaN \n", "21 Elnadrion \n", "22 Volnei Carvalho \n", "23 adeyosola \n", "24 Thisura Yapa \n", "25 Luis Machado \n", "26 DachshundsRule \n", "27 Lourdes. Conceicaodossantos. \n", "28 Rita Silva \n", "29 drharshz \n", "30 Sherry George \n", "31 Gabriel apaixonado music \n", "32 Angela Arrington \n", "33 Marcos Pinto \n", "34 Андрей Баздырев \n", "35 Denise Hedden \n", "36 Saif Messi \n", "37 joshua jennings \n", "38 Joel Jimenez \n", "39 Vilmara Reghini \n", "\n", " Comments \n", "0 You're an absolute legend Lionel \n", "1 It's 2022 and I'm listening to this Masterpiece ♡ \n", "2 Going on 40 and this song is still one of my t... \n", "3 ️ \n", "4 2022 and i'm still addicted to this LEGENDARY ... \n", "5 Sou fã do Lionel \\nEle tem uma voz linda\\nViva... \n", "6 Um anjo está cantando. Uma pausa para alegrar ... \n", "7 Beautiful words...beautiful music...beautiful ... \n", "8 This masterpiece gives me teary eyes every tim... \n", "9 Still hits close to the heart almost 40 years ... \n", "10 Isso que é música de valor.... muito sensivel ... \n", "11 It’s 2022 and THIS SONG STILL GOES STRONG!!! \n", "12 Que hino!!!!!! Amooooo \n", "13 Ano de 1984\\nSó eu sei o que representa essa c... \n", "14 Eu amo este clipe ️ \n", "15 Um dos clipes mais belos do mundo! Sinto-me en... \n", "16 Essa canção é emocionante, Lionel Richie é uma... \n", "17 Don’t worry you’re not the the only one listen... \n", "18 Masterpiece. simply no words can describe the ... 
\n", "19 One morning in 1984 I was getting ready for wo... \n", "20 I don’t care what I’m doing, whenever this son... \n", "21 Um anjo está cantando. Uma pausa para alegrar ... \n", "22 Como estás músicas são perfeitas lembra da min... \n", "23 Lionel Richie is one of the best song writers ... \n", "24 That guitar solo is next level. No wonder tha... \n", "25 Linda demais \n", "26 This song brings me to tears, it was our song.... \n", "27 Amei essa cena, muito liinda! \n", "28 Lionel Richie você é o cara.charmoso maravilho... \n", "29 2022 ..... This is golden! One of the best eve... \n", "30 I grew up in the 80's. Some of the best music ... \n", "31 Música linda \n", "32 It's 2022,and I'm listening.never get tired of... \n", "33 Tempo excelente que não volta mais maravilhos... \n", "34 As they say, this song is for all time! It is ... \n", "35 2022 and I'm still listening to Lionel Richie.... \n", "36 It's simply an eternal song and it will still ... \n", "37 So many memories of this song ..a masterpiece ... \n", "38 This song is enchanting and touches the soul. ... \n", "39 Essa música faz parte da minha infância , meu ... 
" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from selenium.webdriver import Chrome\n", "from selenium.webdriver.chrome.service import Service\n", "from selenium.webdriver.chrome.options import Options\n", "from selenium.webdriver.common.by import By\n", "from selenium.webdriver.common.keys import Keys\n", "from selenium.webdriver.support.ui import WebDriverWait\n", "from selenium.webdriver.support import expected_conditions as EC\n", "import pandas as pd\n", "import time\n", "\n", "#Create an instance of webdriver and load/run the youtube video page\n", "s = Service('/Users/arif/Documents/chromedriver')\n", "myoptions = Options()\n", "myoptions.headless = False # default settings\n", "driver = Chrome(service=s, options=myoptions) \n", "driver.maximize_window()\n", "time.sleep(1)\n", "driver.get('https://www.youtube.com/watch?v=mHONNcZbwDY&t=80s')\n", "driver.implicitly_wait(30)\n", "play = driver.find_element(By.XPATH, '//*[@id=\"movie_player\"]/div[5]/button')\n", "play.click()\n", "\n", "# Perform three scrolls to get around 60 comments\n", "for scroll in range(1, 4): \n", " body = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.TAG_NAME, \"body\")))\n", " body.send_keys(Keys.END)\n", " time.sleep(12)\n", " \n", "\n", "# Scrape the comments\n", "comments = []\n", "comments_list = driver.find_elements(By.CSS_SELECTOR,\"#content-text\" )\n", "for comment in comments_list:\n", " text = comment.text.strip()\n", " comments.append(text)\n", "\n", "# Scrape the authors who made the comments\n", "authors = []\n", "authors_list = driver.find_elements(By.ID,\"author-text\")\n", "for author in authors_list:\n", " text = author.text.strip()\n", " authors.append(text)\n", "\n", "# Save the comments in csv file\n", "data = {'Authors':authors, 'Comments':comments}\n", "df = pd.DataFrame(data, columns=['Authors', 'Comments'])\n", "df.to_csv('hello.csv', index=False)" ] }, { "cell_type": "code", 
"execution_count": null, "id": "b42b93d5", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 12, "id": "3a4b67a4", "metadata": {}, "outputs": [], "source": [ "driver.quit()" ] }, { "cell_type": "code", "execution_count": 2, "id": "da9bff48", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>Authors</th><th>Comments</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>NaN</td><td>You're an absolute legend Lionel</td></tr>\n",
" <tr><th>1</th><td>Logene Temonio</td><td>It's 2022 and I'm listening to this Masterpiece ♡</td></tr>\n",
" <tr><th>2</th><td>Joelfantastic ASMR</td><td>Going on 40 and this song is still one of my top 10. Still gives me goose bumps!</td></tr>\n",
" <tr><th>3</th><td>NaN</td><td>️</td></tr>\n",
" <tr><th>4</th><td>Sofia</td><td>2022 and i'm still addicted to this LEGENDARY SONG.️</td></tr>\n",
" <tr><th>5</th><td>Daiane Santos</td><td>Sou fã do Lionel \\nEle tem uma voz linda\\nViva anos 70,80,90 tempos bons,com músicas boas.\\nHoje em dia só porcarias.</td></tr>\n",
" <tr><th>6</th><td>NaN</td><td>Um anjo está cantando. Uma pausa para alegrar o coração em tempos difíceis.</td></tr>\n",
" <tr><th>7</th><td>Marcia Corbin</td><td>Beautiful words...beautiful music...beautiful emotions...will always be timeless...</td></tr>\n",
" <tr><th>8</th><td>Music Mane</td><td>This masterpiece gives me teary eyes every time, a pleasing pain and joy.</td></tr>\n",
" <tr><th>9</th><td>dlnnyc64</td><td>Still hits close to the heart almost 40 years ago. I can remember details of where I was and what my life was back then. I adored the video but ca...</td></tr>\n",
" <tr><th>10</th><td>Edmar Fernandes Couto</td><td>Isso que é música de valor.... muito sensivel a pessoa q compôs essa letra</td></tr>\n",
" <tr><th>11</th><td>Yancy Johnson</td><td>It’s 2022 and THIS SONG STILL GOES STRONG!!!</td></tr>\n",
" <tr><th>12</th><td>Maria Rosa Helena Do Prado E Silva</td><td>Que hino!!!!!! Amooooo</td></tr>\n",
" <tr><th>13</th><td>Nina Schimel</td><td>Ano de 1984\\nSó eu sei o que representa essa canção pra mim....\\nSaudades!!</td></tr>\n",
" <tr><th>14</th><td>Ceciliasantos</td><td>Eu amo este clipe ️</td></tr>\n",
" <tr><th>15</th><td>Sra. Francisco</td><td>Um dos clipes mais belos do mundo! Sinto-me encantada, apaixonada!</td></tr>\n",
" <tr><th>16</th><td>João Marcos Rodrigues</td><td>Essa canção é emocionante, Lionel Richie é uma lenda</td></tr>\n",
" <tr><th>17</th><td>Billyn Ivy</td><td>Don’t worry you’re not the the only one listening to this masterpiece in 2021</td></tr>\n",
" <tr><th>18</th><td>Cicada3301</td><td>Masterpiece. simply no words can describe the feelings this song evokes .</td></tr>\n",
" <tr><th>19</th><td>Carmen Bazan</td><td>One morning in 1984 I was getting ready for work when I first heard this song . Called the radio station to enquire about the title, once I knew i...</td></tr>\n",
" <tr><th>20</th><td>NaN</td><td>I don’t care what I’m doing, whenever this song plays on the radio, I stop everything I’m doing and focus on Lionels amazing voice</td></tr>\n",
" <tr><th>21</th><td>Elnadrion</td><td>Um anjo está cantando. Uma pausa para alegrar o coração em tempos difíceis.</td></tr>\n",
" <tr><th>22</th><td>Volnei Carvalho</td><td>Como estás músicas são perfeitas lembra da minha infância ,pena que o tempo não volta</td></tr>\n",
" <tr><th>23</th><td>adeyosola</td><td>Lionel Richie is one of the best song writers of all time: this song is timeless ... it reminds me so much of my childhood \\nEndless Love is also ...</td></tr>\n",
" <tr><th>24</th><td>Thisura Yapa</td><td>That guitar solo is next level. No wonder that this went to number one on the UK Singles Chart for six weeks (In 1984).\\n\\nSuch a nice song........</td></tr>\n",
" <tr><th>25</th><td>Luis Machado</td><td>Linda demais</td></tr>\n",
" <tr><th>26</th><td>DachshundsRule</td><td>This song brings me to tears, it was our song. He's been gone long years now, and our children have grown up and have families of their own. Our o...</td></tr>\n",
" <tr><th>27</th><td>Lourdes. Conceicaodossantos.</td><td>Amei essa cena, muito liinda!</td></tr>\n",
" <tr><th>28</th><td>Rita Silva</td><td>Lionel Richie você é o cara.charmoso maravilhoso muito charme cantor otimo voz maravilhosa melodias lindas.gosto muito de ti.rita Silva Souza Salv...</td></tr>\n",
" <tr><th>29</th><td>drharshz</td><td>2022 ..... This is golden! One of the best ever songs</td></tr>\n",
" <tr><th>30</th><td>Sherry George</td><td>I grew up in the 80's. Some of the best music came from the 70's and 80's. Lionel Ritchie's music is so beautiful and romantic.</td></tr>\n",
" <tr><th>31</th><td>Gabriel apaixonado music</td><td>Música linda</td></tr>\n",
" <tr><th>32</th><td>Angela Arrington</td><td>It's 2022,and I'm listening.never get tired of Lionel's voice.</td></tr>\n",
" <tr><th>33</th><td>Marcos Pinto</td><td>Tempo excelente que não volta mais maravilhosas canções e obrigado a cantor que Deus abençoe</td></tr>\n",
" <tr><th>34</th><td>Андрей Баздырев</td><td>As they say, this song is for all time! It is not often that you will find such a harmony of music, words and video in songs and clips. Thank you ...</td></tr>\n",
" <tr><th>35</th><td>Denise Hedden</td><td>2022 and I'm still listening to Lionel Richie. I love this song it's so precious to me️</td></tr>\n",
" <tr><th>36</th><td>Saif Messi</td><td>It's simply an eternal song and it will still an icon of all romantic songs for along time , greetings to the legend Lionel Richie from Iraq</td></tr>\n",
" <tr><th>37</th><td>joshua jennings</td><td>So many memories of this song ..a masterpiece of hard work and dedication</td></tr>\n",
" <tr><th>38</th><td>Joel Jimenez</td><td>This song is enchanting and touches the soul. It brings back memories of a lost love. Recorded in 1983 and released on Feb 13, 1984. This song r...</td></tr>\n",
" <tr><th>39</th><td>Vilmara Reghini</td><td>Essa música faz parte da minha infância , meu Deus como era tudo tao lindo e perfeito , meus pais os melhores do mundo , o amor que eles tiveram e...</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Authors \\\n", "0 NaN \n", "1 Logene Temonio \n", "2 Joelfantastic ASMR \n", "3 NaN \n", "4 Sofia \n", "5 Daiane Santos \n", "6 NaN \n", "7 Marcia Corbin \n", "8 Music Mane \n", "9 dlnnyc64 \n", "10 Edmar Fernandes Couto \n", "11 Yancy Johnson \n", "12 Maria Rosa Helena Do Prado E Silva \n", "13 Nina Schimel \n", "14 Ceciliasantos \n", "15 Sra. Francisco \n", "16 João Marcos Rodrigues \n", "17 Billyn Ivy \n", "18 Cicada3301 \n", "19 Carmen Bazan \n", "20 NaN \n", "21 Elnadrion \n", "22 Volnei Carvalho \n", "23 adeyosola \n", "24 Thisura Yapa \n", "25 Luis Machado \n", "26 DachshundsRule \n", "27 Lourdes. Conceicaodossantos. \n", "28 Rita Silva \n", "29 drharshz \n", "30 Sherry George \n", "31 Gabriel apaixonado music \n", "32 Angela Arrington \n", "33 Marcos Pinto \n", "34 Андрей Баздырев \n", "35 Denise Hedden \n", "36 Saif Messi \n", "37 joshua jennings \n", "38 Joel Jimenez \n", "39 Vilmara Reghini \n", "\n", " Comments \n", "0 You're an absolute legend Lionel \n", "1 It's 2022 and I'm listening to this Masterpiece ♡ \n", "2 Going on 40 and this song is still one of my top 10. Still gives me goose bumps! \n", "3 ️ \n", "4 2022 and i'm still addicted to this LEGENDARY SONG.️ \n", "5 Sou fã do Lionel \\nEle tem uma voz linda\\nViva anos 70,80,90 tempos bons,com músicas boas.\\nHoje em dia só porcarias. \n", "6 Um anjo está cantando. Uma pausa para alegrar o coração em tempos difíceis. \n", "7 Beautiful words...beautiful music...beautiful emotions...will always be timeless... \n", "8 This masterpiece gives me teary eyes every time, a pleasing pain and joy. \n", "9 Still hits close to the heart almost 40 years ago. I can remember details of where I was and what my life was back then. I adored the video but ca... \n", "10 Isso que é música de valor.... muito sensivel a pessoa q compôs essa letra \n", "11 It’s 2022 and THIS SONG STILL GOES STRONG!!! \n", "12 Que hino!!!!!! 
Amooooo \n", "13 Ano de 1984\\nSó eu sei o que representa essa canção pra mim....\\nSaudades!! \n", "14 Eu amo este clipe ️ \n", "15 Um dos clipes mais belos do mundo! Sinto-me encantada, apaixonada! \n", "16 Essa canção é emocionante, Lionel Richie é uma lenda \n", "17 Don’t worry you’re not the the only one listening to this masterpiece in 2021 \n", "18 Masterpiece. simply no words can describe the feelings this song evokes . \n", "19 One morning in 1984 I was getting ready for work when I first heard this song . Called the radio station to enquire about the title, once I knew i... \n", "20 I don’t care what I’m doing, whenever this song plays on the radio, I stop everything I’m doing and focus on Lionels amazing voice \n", "21 Um anjo está cantando. Uma pausa para alegrar o coração em tempos difíceis. \n", "22 Como estás músicas são perfeitas lembra da minha infância ,pena que o tempo não volta \n", "23 Lionel Richie is one of the best song writers of all time: this song is timeless ... it reminds me so much of my childhood \\nEndless Love is also ... \n", "24 That guitar solo is next level. No wonder that this went to number one on the UK Singles Chart for six weeks (In 1984).\\n\\nSuch a nice song........ \n", "25 Linda demais \n", "26 This song brings me to tears, it was our song. He's been gone long years now, and our children have grown up and have families of their own. Our o... \n", "27 Amei essa cena, muito liinda! \n", "28 Lionel Richie você é o cara.charmoso maravilhoso muito charme cantor otimo voz maravilhosa melodias lindas.gosto muito de ti.rita Silva Souza Salv... \n", "29 2022 ..... This is golden! One of the best ever songs \n", "30 I grew up in the 80's. Some of the best music came from the 70's and 80's. Lionel Ritchie's music is so beautiful and romantic. \n", "31 Música linda \n", "32 It's 2022,and I'm listening.never get tired of Lionel's voice. 
\n", "33 Tempo excelente que não volta mais maravilhosas canções e obrigado a cantor que Deus abençoe \n", "34 As they say, this song is for all time! It is not often that you will find such a harmony of music, words and video in songs and clips. Thank you ... \n", "35 2022 and I'm still listening to Lionel Richie. I love this song it's so precious to me️ \n", "36 It's simply an eternal song and it will still an icon of all romantic songs for along time , greetings to the legend Lionel Richie from Iraq \n", "37 So many memories of this song ..a masterpiece of hard work and dedication \n", "38 This song is enchanting and touches the soul. It brings back memories of a lost love. Recorded in 1983 and released on Feb 13, 1984. This song r... \n", "39 Essa música faz parte da minha infância , meu Deus como era tudo tao lindo e perfeito , meus pais os melhores do mundo , o amor que eles tiveram e... " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.read_csv('hello.csv')\n", "pd.set_option('max_colwidth',150)\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "391b9e14", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b40ce802", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "2f531e38", "metadata": {}, "source": [ "## Example 3: Scraping Jobs: \n", "- https://pk.indeed.com" ] }, { "cell_type": "code", "execution_count": 13, "id": "7d502782", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>Company</th><th>Job Title</th><th>Salary</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>Smart Placement (Pvt) Ltd</td><td>Junior/Intern Web Developer</td><td>No Salary</td></tr>\n",
" <tr><th>1</th><td>Digital Media Line</td><td>Web Developer Intern</td><td>Rs 5,000 - Rs 10,000 a month</td></tr>\n",
" <tr><th>2</th><td>ThingTrax</td><td>Jr. Full Stack Developer</td><td>Rs 50,000 a month</td></tr>\n",
" <tr><th>3</th><td>Alliance Solutions</td><td>Web Developer</td><td>Rs 25,000 a month</td></tr>\n",
" <tr><th>4</th><td>iDig Digital</td><td>Full Stack PHP Laravel Developer - Remote Posi...</td><td>Rs 60,000 - Rs 126,159 a month</td></tr>\n",
" <tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>70</th><td>VeriPark Software Solutions</td><td>Software Developer. NET</td><td>No Salary</td></tr>\n",
" <tr><th>71</th><td>Cinco Solutions</td><td>PHP Developers</td><td>Rs 30,000 - Rs 50,000 a month</td></tr>\n",
" <tr><th>72</th><td>FINCA International</td><td>Full Stack Web Developer</td><td>No Salary</td></tr>\n",
" <tr><th>73</th><td>Nextbridge Pvt Ltd</td><td>Senior MERN Developer</td><td>No Salary</td></tr>\n",
" <tr><th>74</th><td>ENTERTAINER FZ LLC</td><td>PHP Full Stack Web Developer</td><td>No Salary</td></tr>\n",
" </tbody>\n", "</table>\n", "<p>75 rows × 3 columns</p>\n", "</div>
" ], "text/plain": [ " Company \\\n", "0 Smart Placement (Pvt) Ltd \n", "1 Digital Media Line \n", "2 ThingTrax \n", "3 Alliance Solutions \n", "4 iDig Digital \n", ".. ... \n", "70 VeriPark Software Solutions \n", "71 Cinco Solutions \n", "72 FINCA International \n", "73 Nextbridge Pvt Ltd \n", "74 ENTERTAINER FZ LLC \n", "\n", " Job Title \\\n", "0 Junior/Intern Web Developer \n", "1 Web Developer Intern \n", "2 Jr. Full Stack Developer \n", "3 Web Developer \n", "4 Full Stack PHP Laravel Developer - Remote Posi... \n", ".. ... \n", "70 Software Developer. NET \n", "71 PHP Developers \n", "72 Full Stack Web Developer \n", "73 Senior MERN Developer \n", "74 PHP Full Stack Web Developer \n", "\n", " Salary \n", "0 No Salary \n", "1 Rs 5,000 - Rs 10,000 a month \n", "2 Rs 50,000 a month \n", "3 Rs 25,000 a month \n", "4 Rs 60,000 - Rs 126,159 a month \n", ".. ... \n", "70 No Salary \n", "71 Rs 30,000 - Rs 50,000 a month \n", "72 No Salary \n", "73 No Salary \n", "74 No Salary \n", "\n", "[75 rows x 3 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from selenium.webdriver import Chrome\n", "from selenium.webdriver.chrome.service import Service\n", "from selenium.webdriver.chrome.options import Options\n", "from selenium.webdriver.common.by import By\n", "from selenium.webdriver.common.keys import Keys\n", "from selenium.webdriver.support.ui import WebDriverWait\n", "from selenium.webdriver.support import expected_conditions as EC\n", "import pandas as pd\n", "import time\n", "\n", "\n", "# Create an instance of webdriver and go to the job search page\n", "s = Service('/Users/arif/Documents/chromedriver')\n", "myoptions = Options()\n", "myoptions.headless = False \n", "driver = Chrome(service=s, options=myoptions) \n", "driver.maximize_window()\n", "driver.get('https://pk.indeed.com')\n", "time.sleep(5)\n", "\n", "# Enter your search parameters and click the Find Jobs button\n", "what_box = 
driver.find_element(By.XPATH,'//*[@id=\"text-input-what\"]')\n", "where_box = driver.find_element(By.XPATH,'//*[@id=\"text-input-where\"]')\n", "button = driver.find_element(By.XPATH,'//*[@id=\"jobsearch\"]/button')\n", "what_box.send_keys('Full Stack Web Developer')\n", "where_box.send_keys('Lahore')\n", "button.click()\n", "time.sleep(5)\n", "\n", "\n", "# Lists that accumulate the scraped fields across all result pages\n", "jobtitles = []\n", "companies = []\n", "salaries = []\n", "\n", "# Scrape the title, company and salary of every job on the current page\n", "def jobs():\n", "    time.sleep(2)\n", "    postings = driver.find_elements(By.CSS_SELECTOR, '.resultContent')\n", "    for posting in postings:\n", "        # Fall back to a placeholder when a field is missing from a posting\n", "        try:\n", "            job_title = posting.find_element(By.CSS_SELECTOR,'h2 a').text\n", "        except Exception:\n", "            job_title = 'No Job title'\n", "        try:\n", "            company = posting.find_element(By.CSS_SELECTOR,'.companyName').text\n", "        except Exception:\n", "            company = \"No company name\"\n", "        try:\n", "            salary = posting.find_element(By.CSS_SELECTOR,'.salary-snippet-container').text\n", "        except Exception:\n", "            salary = \"No Salary\"\n", "        companies.append(company)\n", "        salaries.append(salary)\n", "        jobtitles.append(job_title)\n", "\n", "\n", "# Scrape each results page, then click the Next button in the pagination bar\n", "while True:\n", "    time.sleep(4)\n", "    # Close the pop-up if one appears\n", "    try:\n", "        driver.find_element(By.CSS_SELECTOR,'.popover-x-button-close.icl-CloseButton').click()\n", "    except Exception:\n", "        pass\n", "\n", "    jobs()\n", "\n", "    # Scroll to the pagination bar and move to the next page; stop when there is none\n", "    try:\n", "        pagination = driver.find_element(By.CLASS_NAME,'pagination-list')\n", "        driver.execute_script(\"arguments[0].scrollIntoView();\", pagination)\n", "        driver.find_element(By.XPATH,'//*[@aria-label=\"Next\"]').click()\n", "    except Exception:\n", "        break\n", "\n", "\n", "# Write the scraped data to a CSV file\n", "data = {'Company':companies, 'Job Title':jobtitles, 'Salary':salaries}\n", "df = pd.DataFrame(data, 
columns=['Company', 'Job Title', 'Salary'])\n", "df.to_csv('jobs.csv', index=False)\n", "df = pd.read_csv('jobs.csv')\n", "df" ] }, { "cell_type": "code", "execution_count": 14, "id": "8c079a23", "metadata": {}, "outputs": [], "source": [ "driver.quit()" ] }, { "cell_type": "code", "execution_count": null, "id": "ccfc462e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "3288fef6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "6dba26fb", "metadata": {}, "source": [ "## Example 4: Scraping Tweets of a Celebrity: https://twitter.com/login" ] }, { "cell_type": "code", "execution_count": 15, "id": "c168e37b", "metadata": {}, "outputs": [], "source": [ "from selenium.webdriver import Chrome\n", "from selenium.webdriver.chrome.service import Service\n", "from selenium.webdriver.chrome.options import Options\n", "from selenium.webdriver.common.by import By\n", "from selenium.webdriver.support.ui import WebDriverWait\n", "from selenium.webdriver.common.keys import Keys\n", "from selenium.webdriver.support import expected_conditions as EC\n", "import pandas as pd\n", "import time\n", "import os\n", "\n", "# Create an instance of webdriver and load the Twitter login page\n", "s = Service('/Users/arif/Documents/chromedriver')\n", "myoptions = Options()\n", "driver = Chrome(service=s, options=myoptions) \n", "driver.maximize_window()\n", "driver.get('https://twitter.com/login') \n", "driver.implicitly_wait(30)\n", "\n", "\n", "# Enter username and password\n", "time.sleep(5)\n", "username = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//input[@name=\"text\"]'))) \n", "username.send_keys('username')\n", "username.send_keys(Keys.ENTER) \n", "time.sleep(2)\n", "passwd = WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, '//input[@name=\"password\"]')))\n", "passwd.send_keys(os.environ['yourtwitterpassword']) # actual passwd is saved in an environment variable 
:)\n", "passwd.send_keys(Keys.ENTER)\n", "\n", "\n", "# Enter Celebrity name (Imran Khan) in Search Textbox\n", "time.sleep(2)\n", "search_input = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH,'//input[@aria-label=\"Search query\"]')))\n", "search_input.send_keys(\"Imran Khan\")\n", "time.sleep(2)\n", "search_input.send_keys(Keys.ENTER)\n", "\n", "\n", "## Click on People tab for People Profiles using LINK_TEXT Locator\n", "time.sleep(2)\n", "people = WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.LINK_TEXT, 'People')))\n", "people.click()\n", "\n", "\n", "# Click on the twitter link of Imran Khan\n", "time.sleep(2)\n", "click_imran = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.LINK_TEXT, 'Imran Khan')))\n", "click_imran.click()\n" ] }, { "cell_type": "code", "execution_count": 16, "id": "1e3f1499", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>User</th><th>Times</th><th>Tweets</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>@ImranKhanPTI</td><td>1h</td><td>The nation continues to suffer crushing econom...</td></tr>\n",
" <tr><th>1</th><td>@ImranKhanPTI</td><td>1h</td><td>This was despite the sharp slow down after reg...</td></tr>\n",
" <tr><th>2</th><td>@ImranKhanPTI</td><td>3 Jul</td><td>امپورٹڈ حکومت اوراسکےپشت پناہوں کیلئےمیرا واضح...</td></tr>\n",
" <tr><th>3</th><td>@ImranKhanPTI</td><td>3 Jul</td><td>بدمعاشوں کا یہ ٹولہ جس انداز میں عوام کو مہنگا...</td></tr>\n",
" <tr><th>4</th><td>@ImranKhanPTI</td><td>3 Jul</td><td>My clear message to Imported govt &amp; its backer...</td></tr>\n",
" <tr><th>5</th><td>@ImranKhanPTI</td><td>3 Jul</td><td>wealth, its only a matter of time before we g...</td></tr>\n",
" <tr><th>6</th><td>@ImranKhanPTI</td><td>3 Jul</td><td>نہایت کثیر تعداد میں باہرنکلنے اور مجرموں کی ا...</td></tr>\n",
" <tr><th>7</th><td>@ImranKhanPTI</td><td>1h</td><td>The nation continues to suffer crushing econom...</td></tr>\n",
" <tr><th>8</th><td>@ImranKhanPTI</td><td>1h</td><td>This was despite the sharp slow down after reg...</td></tr>\n",
" <tr><th>9</th><td>@ImranKhanPTI</td><td>Jul 3</td><td>امپورٹڈ حکومت اوراسکےپشت پناہوں کیلئےمیرا واضح...</td></tr>\n",
" <tr><th>10</th><td>@ImranKhanPTI</td><td>Jul 3</td><td>بدمعاشوں کا یہ ٹولہ جس انداز میں عوام کو مہنگا...</td></tr>\n",
" <tr><th>11</th><td>@ImranKhanPTI</td><td>Jul 3</td><td>My clear message to Imported govt &amp; its backer...</td></tr>\n",
" <tr><th>12</th><td>@ImranKhanPTI</td><td>Jul 3</td><td>wealth, its only a matter of time before we g...</td></tr>\n",
" <tr><th>13</th><td>@ImranKhanPTI</td><td>Jul 3</td><td>نہایت کثیر تعداد میں باہرنکلنے اور مجرموں کی ا...</td></tr>\n",
" <tr><th>14</th><td>@ImranKhanPTI</td><td>Jul 3</td><td>I want to thank the people of Islamabad &amp; Pind...</td></tr>\n",
" <tr><th>15</th><td>@ImranKhanPTI</td><td>Jul 2</td><td>میں پاکستان+دنیابھرمیں مقیم پاکستانیوں سےملتمس...</td></tr>\n",
" <tr><th>16</th><td>@ImranKhanPTI</td><td>Jul 2</td><td>I request all Pakistanis living in Pakistan an...</td></tr>\n",
" <tr><th>17</th><td>@ImranKhanPTI</td><td>Jul 1</td><td>لاہورمیں آج سینئرصحافی ایاز امیر پر تشدد کی شد...</td></tr>\n",
" <tr><th>18</th><td>@ImranKhanPTI</td><td>Jul 1</td><td>I condemn in strongest terms the violence agai...</td></tr>\n",
" <tr><th>19</th><td>@ImranKhanPTI</td><td>Jul 1</td><td>امپورٹڈحکومت کےسیاسی عدمِ استحکام اور موسمِ گر...</td></tr>\n",
" <tr><th>20</th><td>@ImranKhanPTI</td><td>Jul 1</td><td>InshaAllah tomorrow will be our historic Islam...</td></tr>\n",
" <tr><th>21</th><td>@ImranKhanPTI</td><td>Jul 1</td><td>روس سےسستا تیل خریدنےکی بجائےتبدیلئ سرکار کی س...</td></tr>\n",
" <tr><th>22</th><td>@ImranKhanPTI</td><td>Jul 1</td><td>Instead of buying cheaper oil from Russia Impo...</td></tr>\n",
" <tr><th>23</th><td>@ImranKhanPTI</td><td>Jun 27</td><td>Congratulations Ahmad Nawaz on being elected P...</td></tr>\n",
" <tr><th>24</th><td>@ImranKhanPTI</td><td>Jun 27</td><td>اپنی ٹائیگر فورس، اپنے نوجوانوں اور اپنی خواتی...</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " User Times Tweets\n", "0 @ImranKhanPTI 1h The nation continues to suffer crushing econom...\n", "1 @ImranKhanPTI 1h This was despite the sharp slow down after reg...\n", "2 @ImranKhanPTI 3 Jul امپورٹڈ حکومت اوراسکےپشت پناہوں کیلئےمیرا واضح...\n", "3 @ImranKhanPTI 3 Jul بدمعاشوں کا یہ ٹولہ جس انداز میں عوام کو مہنگا...\n", "4 @ImranKhanPTI 3 Jul My clear message to Imported govt & its backer...\n", "5 @ImranKhanPTI 3 Jul wealth, its only a matter of time before we g...\n", "6 @ImranKhanPTI 3 Jul نہایت کثیر تعداد میں باہرنکلنے اور مجرموں کی ا...\n", "7 @ImranKhanPTI 1h The nation continues to suffer crushing econom...\n", "8 @ImranKhanPTI 1h This was despite the sharp slow down after reg...\n", "9 @ImranKhanPTI Jul 3 امپورٹڈ حکومت اوراسکےپشت پناہوں کیلئےمیرا واضح...\n", "10 @ImranKhanPTI Jul 3 بدمعاشوں کا یہ ٹولہ جس انداز میں عوام کو مہنگا...\n", "11 @ImranKhanPTI Jul 3 My clear message to Imported govt & its backer...\n", "12 @ImranKhanPTI Jul 3 wealth, its only a matter of time before we g...\n", "13 @ImranKhanPTI Jul 3 نہایت کثیر تعداد میں باہرنکلنے اور مجرموں کی ا...\n", "14 @ImranKhanPTI Jul 3 I want to thank the people of Islamabad & Pind...\n", "15 @ImranKhanPTI Jul 2 میں پاکستان+دنیابھرمیں مقیم پاکستانیوں سےملتمس...\n", "16 @ImranKhanPTI Jul 2 I request all Pakistanis living in Pakistan an...\n", "17 @ImranKhanPTI Jul 1 لاہورمیں آج سینئرصحافی ایاز امیر پر تشدد کی شد...\n", "18 @ImranKhanPTI Jul 1 I condemn in strongest terms the violence agai...\n", "19 @ImranKhanPTI Jul 1 امپورٹڈحکومت کےسیاسی عدمِ استحکام اور موسمِ گر...\n", "20 @ImranKhanPTI Jul 1 InshaAllah tomorrow will be our historic Islam...\n", "21 @ImranKhanPTI Jul 1 روس سےسستا تیل خریدنےکی بجائےتبدیلئ سرکار کی س...\n", "22 @ImranKhanPTI Jul 1 Instead of buying cheaper oil from Russia Impo...\n", "23 @ImranKhanPTI Jun 27 Congratulations Ahmad Nawaz on being elected P...\n", "24 @ImranKhanPTI Jun 27 اپنی ٹائیگر فورس، اپنے نوجوانوں اور اپنی خواتی..." 
] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Scrape username\n", "user_name = driver.find_element(By.XPATH,'((//*[@data-testid=\"UserName\"])//span)[last()]').text\n", "\n", "# Scrape at least 25 tweets with their dates\n", "articles = []\n", "tweets = []\n", "times = []\n", "\n", "while True:\n", "    time.sleep(1)\n", "    article = driver.find_elements(By.TAG_NAME,'article')\n", "    for a in article:\n", "        # Skip tweet elements that have already been processed\n", "        if a not in articles:\n", "            tweet = a.find_element(By.XPATH, './/*[@data-testid=\"tweetText\"]')\n", "            articles.append(a)\n", "            t = a.find_element(By.XPATH,'.//time')\n", "            times.append(t.text)\n", "            tweets.append(tweet.text)\n", "    # Stop scrolling once enough tweets have been collected\n", "    if len(tweets) >= 25:\n", "        break\n", "    driver.execute_script(\"window.scrollBy(0,500);\")\n", "\n", "\n", "# Write scraped data to a CSV file\n", "data = {'User':user_name, 'Times':times,'Tweets':tweets}\n", "df = pd.DataFrame(data, columns=['User', 'Times','Tweets'])\n", "df.to_csv('tweets.csv', index=False)\n", "df = pd.read_csv('tweets.csv')\n", "df" ] }, { "cell_type": "code", "execution_count": 17, "id": "ee1a4bc0", "metadata": {}, "outputs": [], "source": [ "driver.quit()" ] }, { "cell_type": "code", "execution_count": null, "id": "f067e8ef", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "6c1e41b6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "39e51d38", "metadata": {}, "source": [ "## Example 5: Scraping News: https://www.thenews.com.pk/today" ] }, { "cell_type": "code", "execution_count": 18, "id": "a23aef57", "metadata": {}, "outputs": [], "source": [ "from selenium.webdriver import Chrome\n", "from selenium.webdriver.chrome.service import Service\n", "from selenium.webdriver.chrome.options import Options\n", "from selenium.webdriver.common.by import By\n", "from selenium.webdriver.support.ui import WebDriverWait\n", "from selenium.webdriver.support import expected_conditions as EC\n", "import pandas as 
pd\n", "import time\n", "\n", "# Create an instance of webdriver and load the newspaper\n", "s = Service('/Users/arif/Documents/chromedriver')\n", "myoptions = Options()\n", "driver = Chrome(service=s, options=myoptions) \n", "driver.maximize_window()\n", "driver.get('https://www.thenews.com.pk/today') \n", "time.sleep(2)" ] }, { "cell_type": "code", "execution_count": null, "id": "a686eab6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "84754e70", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 19, "id": "9d6b54e4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://www.thenews.com.pk/print/971449-shireen-asks-sc-to-take-notice-of-imran-s-phone-tapping\n", "https://www.thenews.com.pk/print/971615-gogi-pinky-economic-corruption-corridor-nawaz-made-cpec-imran-gpec-maryam\n", "https://www.thenews.com.pk/print/971591-amendments-to-nab-law-imran-files-appeal-in-sc-against-registrar-s-objections\n", "https://www.thenews.com.pk/print/971602-punjab-bypolls-imran-to-start-election-campaign-from-july-7\n", "https://www.thenews.com.pk/print/971539-imran-ismail-made-pti-s-additional-secretary-general\n" ] } ], "source": [ "# Collect the URLs of all links whose text contains \"Imran\"\n", "urls = []\n", "try:\n", "    # Wait until at least one matching link is present on the page\n", "    WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.PARTIAL_LINK_TEXT,\"Imran\")))\n", "    search_urls = driver.find_elements(By.PARTIAL_LINK_TEXT,\"Imran\")\n", "    for i in search_urls:\n", "        urls.append(i.get_attribute(\"href\"))\n", "except Exception:\n", "    print(\"No matching link was found\")\n", "\n", "for url in urls:\n", "    print(url)" ] }, { "cell_type": "code", "execution_count": null, "id": "29ae8a36", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "94a99fa3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 20, "id": "f0025b16", 
"metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shireen Mazari asks Supreme Court to take notice of Imran Khan’s ‘phone tapping’\n", "‘Gogi-Pinky Economic Corruption Corridor’: Nawaz made CPEC, Imran GPEC: Maryam\n", "Amendments to NAB law: Imran files appeal in SC against registrar’s objections\n", "Punjab bypolls: Imran to start election campaign from July 7\n", "Imran Ismail made PTI’s additional secretary general\n" ] } ], "source": [ "# Open each article in a new tab and scrape its heading, author and body\n", "original_window = driver.current_window_handle\n", "news_articles = []\n", "authors = []\n", "headings = []\n", "for url in urls:\n", "    driver.switch_to.new_window('tab')\n", "    driver.get(url)\n", "    try:\n", "        heading = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR,\".detail-heading h1\")))\n", "        author = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR,\".category-source\")))\n", "        article = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR,\".story-detail\")))\n", "        headings.append(heading.text)\n", "        news_articles.append(article.text)\n", "        authors.append(author.text)\n", "    except Exception:\n", "        # Skip articles whose page does not match the expected layout\n", "        pass\n", "    # Close the article tab so tabs do not pile up, then return to the main page\n", "    driver.close()\n", "    driver.switch_to.window(original_window)\n", "\n", "for heading in headings:\n", "    print(heading)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8032fae9", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "0c38410b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 21, "id": "1e0e5c48", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>Headings</th><th>Authors</th><th>News Articles</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>Shireen Mazari asks Supreme Court to take noti...</td><td>By Mumtaz Alvi</td><td>PTI leaders, Shireen Mazari and Fawad Chaudhry...</td></tr>\n",
" <tr><th>1</th><td>‘Gogi-Pinky Economic Corruption Corridor’: Naw...</td><td>By Our Correspondent</td><td>PML-N Vice President Maryam Nawaz addressing a...</td></tr>\n",
" <tr><th>2</th><td>Amendments to NAB law: Imran files appeal in S...</td><td>By Our Correspondent</td><td>ISLAMABAD: Former prime minister and Pakistan ...</td></tr>\n",
" <tr><th>3</th><td>Punjab bypolls: Imran to start election campai...</td><td>By Our Correspondent</td><td>PTI Chairman Imran Khan speaks during a media ...</td></tr>\n",
" <tr><th>4</th><td>Imran Ismail made PTI’s additional secretary g...</td><td>By Our Correspondent</td><td>The Pakistan Tehreek-e-Insaf on Monday appoint...</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Headings Authors \\\n", "0 Shireen Mazari asks Supreme Court to take noti... By Mumtaz Alvi \n", "1 ‘Gogi-Pinky Economic Corruption Corridor’: Naw... By Our Correspondent \n", "2 Amendments to NAB law: Imran files appeal in S... By Our Correspondent \n", "3 Punjab bypolls: Imran to start election campai... By Our Correspondent \n", "4 Imran Ismail made PTI’s additional secretary g... By Our Correspondent \n", "\n", " News Articles \n", "0 PTI leaders, Shireen Mazari and Fawad Chaudhry... \n", "1 PML-N Vice President Maryam Nawaz addressing a... \n", "2 ISLAMABAD: Former prime minister and Pakistan ... \n", "3 PTI Chairman Imran Khan speaks during a media ... \n", "4 The Pakistan Tehreek-e-Insaf on Monday appoint... " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = {'Headings':headings, 'Authors':authors, 'News Articles':news_articles}\n", "df = pd.DataFrame(data, columns=['Headings', 'Authors', 'News Articles'])\n", "df.to_csv('news.csv', index=False)\n", "df = pd.read_csv('news.csv')\n", "df" ] }, { "cell_type": "code", "execution_count": 22, "id": "98ae189b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'PTI leaders, Shireen Mazari and Fawad Chaudhry addressing a press conference in islamabad on July 4, 2022. Photo: Screengrab of a Twitter video. \\nISLAMABAD: Pakistan Tehreek-e-Insaf (PTI) Senior Vice-President and former human rights minister Dr Shireen Mazari said Monday the intelligence agencies illegally tap phones and the Supreme Court (SC) should take a suo motu notice of tapping then PM Imran Khan\\'s phone.\\nReferring to media reports, she said that another phone conversation on the secure line between Imran Khan and his principal secretary Azam Khan is also going to be leaked.\\nMazari said what is in the audio tape is not the issue, insisting the real issue was phone tapping. 
She, along with PTI leader Chaudhry Fawad Hussain, told a hurriedly-called news conference here that an audio tape of the former first lady is circulating, which should be brought to light after a forensic test. She also asked how much assistance was given by the US in phone tapping.\\nRelated Stories\\nPTI warns against leaking Imran Khan’s audio calls\\nGovt proposes forensic audit of Bushra Bibi’s audio\\nDr Mazari emphasised that the real issue is phone tapping. The government of former premier Benazir Bhutto was ousted in 1997 on the issue of phone tapping and the Supreme Court in its judgment held that phone tapping is illegal under Articles 8 and 14 of the Constitution. The court had said in its judgment that official or personal conversations cannot be recorded.\\nCiting an English daily, she pointed out that apart from this decision, there is a decision of Justice Saqib Nisar of 2015 in which one of the most interesting things was admission on part of country\\'s prime intelligence agency that it recorded 6,856 phone calls in the month of May alone.\\nShe said that intelligence agencies illegally tap phones because sensitive agencies have phone tapping technology. She remarked, “I ask why this series of illegal phone tapping continues despite Supreme Court orders as the secure line of Imran Khan\\'s house was tapped. The Supreme Court should take a suo motu notice as to which agencies are there that are, despite court orders, taking illegal steps and violating the apex court order”.\\nMazari contended that it has been reported that another audio is going to be leaked which is based on the conversation between Imran Khan and his principal secretary Azam Khan on the secure line. 
The former minister said if the audio is made public, it will not only violate the Supreme Court order, but also the Official Secret Act, and she will not remain silent on it.\\nThe purpose of such acts, she claimed, was to hide the conspiracy and the nation has accepted the American conspiracy, so the neutrals and those who brought them are doing such acts so that the nation\\'s attention is diverted from the issue.\\nShireen Mazari said that after ‘our successful rallies, such conspirators, their handlers, neutrals and those they have brought, are nervous and trying to divert the attention of the people from this conspiracy and the country\\'s complicated affairs in some way.\\nThe IMF, she noted, was also asking this government to hold accountability against corruption while they (rulers) are not getting a place to flee. “I appeal to the defence institutions of Pakistan as to why they are pushing Pakistan to such difficult situations from where it becomes very difficult to return. The date of elections should be announced so that the nation elects its representatives and then form policies for the betterment of the country,” she contended.\\nShe accused PMLN leader Maryam Nawaz of violating the Official Secret Act during her rallies every day, and wondered if this government and its handler are not violating the Constitution and law by showing official documents to a convicted person. 
“There are many similar questions that we want to ask neutrals and this government,” she said.\\nSpeaking on the occasion, PTI Senior Vice President Chaudhry Fawad Hussain said that the loadshedding crisis in this country is getting serious and now it is being said that loadshedding will take place even during Eid holidays, while the government has admitted that all plants are on imported fuel and they do not have money to buy fuel.\\nFawad said that the government has accepted all IMF demands to make petroleum products costlier, tax the public yet the IMF is not ready to pay, the IMF is saying it is hesitant to issue a package on changing anti-corruption laws.\\nThe government, he claimed, immediately accepted the tax on the people but stuck to the issue of not enacting anti-corruption laws which gives an idea of the mindset of this government.\\nHe alleged that the government changed the laws and gave itself a benefit of Rs1,100 billion as they gave themselves financial benefit under NRO, similarly Asif Zardari has been given NRO in the fake account case. “Therefore, Imran Khan says corruption makes your country poor and corruption worth billions of rupees of Sharif and Zardari family is hollowing out the roots of Pakistan,” he charged.\\nFawad also said that phone tapping is a very important matter and phone tapping is being done in Pakistan and there is no monitoring, these calls are edited, not put to forensic test, human rights are being violated, allegations are being levelled against people.\\nHe said that only yesterday, Farah Gujjar was accused by the PMLN of acquiring a plot in Faisalabad Industrial Zone at a price less than the market value, ‘while according to the documents we have, Ayaz Sadiq also acquired two plots at the same place, so Ayaz Sadiq also committed corruption’.\\nPTI leader noted that Farah Gujjar was reportedly being issued red warrants. 
He asked how warrants could be issued when no case was registered against her and the purpose of these false allegation campaign issued by the PMLN is only that they know that Imran Khan is the leader who has no lust for money.\\nFawad said that ‘once again we repeat that we want good relations with the US, Europe, Russia and Western countries but that does not mean that a country can tell us who will rule our country as we cannot allow this thing. “It does not mean that no one even invited you to the ceremony and still you go there forcibly,” he said.\\n“After the accident that happened to Khawaja Asif at the age of 70-72, he suffers from psychological problems and often talks nonsense. If he has any evidence of contact with Donald Lu, he should make it public,” he said. He also talked on how \\'rigging plan had been readied by PMLN leadership\\' for upcoming by-election.\\nMeanwhile, Minister for Information and Broadcasting Marriyum Aurangzeb Monday said PTI leaders Fawad Chaudhry and Shireen Mazari confirmed that Bushra Bibi is the head of PTI\\'s social media, generating a false narrative.\\nReferring to the news conference held earlier by these two PTI leaders, she said they confirmed Bushra Bibi was the ‘mastermind’ of the campaign of treason certificates and the drive against the institutions. “Fawad Chaudhry and Shireen Mazari have accepted that the audio is of Bushra Bibi and that ‘linking’ political opponents with treason, is her ‘opinion’.\\nThe minister contented that the campaign against institutions was being run by Bushra Bibi, adding it was proved that an immoral press conference with dirty-language speaking spokespersons was being done at her behest.\\n“Bushra Bibi is running a social media campaign against national institutions, the narratives of treason, external conspiracy while hiding in Bani Gala. Bushra Bibi is in the forefront and behind the drive to link political opponents with treason and bad-mouthing. 
She is the mastermind of the campaign against institutions,” the minister charged.\\nMarriyum alleged that Bushra Bibi fabricated the narrative of external conspiracy and treason to hide her corruption. Bushra Bibi campaign against political opponents, journalists and institutions.\\n“PTI admits to making treason cases to hide its corruption. The champions of the false narrative of conspiracy abroad have been caught apologising and cajoling the US. Imran Sahib is delivering \"bygones are bygones\" apology to the US,” she concluded. '" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['News Articles'][0]" ] }, { "cell_type": "code", "execution_count": null, "id": "f1061bcc", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "809330e5", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "ecfa7ff2", "metadata": {}, "source": [ "## Practice Problem: Scraping House Data: https://zameen.com\n", "- For machine learning tasks, we need the following fields for a hundred thousand **houses** in Lahore, spread across the city's different locations/societies:\n", " - City\n", " - Location/Address\n", " - Covered Area\n", " - Number of Bedrooms\n", " - Number of Bathrooms\n", " - Price" ] }, { "cell_type": "code", "execution_count": null, "id": "8bce3dc0", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }