# Web Scraping using Spider API

This notebook shows how to use the open-source
[Spider](https://spider.cloud/) web crawler together with AutoGen agents.

First we need to install the Spider SDK and the AutoGen library.

In [1]:
! pip install -qqq pyautogen spider-client

Next, set up the LLM configuration and the Spider API key. Both keys are read from environment variables.

In [2]:
import os

config_list = [
    {"model": "gpt-4o", "api_key": os.getenv("OPENAI_API_KEY")},
]

spider_api_key = os.getenv("SPIDER_API_KEY")
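
`os.getenv` returns `None` when a variable is missing, so a forgotten key only shows up later as an authentication error. A minimal sketch of an early check, assuming you want the notebook to fail fast; `require_env` is a hypothetical helper, not part of AutoGen or the Spider SDK:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Fail early with a message that names the missing variable.
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value


# Possible usage in place of the plain os.getenv calls:
# spider_api_key = require_env("SPIDER_API_KEY")
```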

Let's define the tool for scraping and crawling data from any website with Spider.
Read more about tool use in this [tutorial chapter](/docs/tutorial/tool-use).

In [3]:
from typing import Any, Dict, List

from spider import Spider
from typing_extensions import Annotated


def scrape_page(
    url: Annotated[str, "The URL of the web page to scrape"],
    params: Annotated[dict, "Dictionary of additional params."] = None,
) -> Annotated[Dict[str, Any], "Scraped content"]:
    # Initialize the Spider client with your API key. If no API key is passed,
    # the client looks for SPIDER_API_KEY in your environment variables.
    client = Spider(spider_api_key)

    if params is None:
        params = {"return_format": "markdown"}

    scraped_data = client.scrape_url(url, params)
    return scraped_data[0]


def crawl_page(
    url: Annotated[str, "The url of the domain to be crawled"],
    params: Annotated[dict, "Dictionary of additional params."] = None,
) -> Annotated[List[Dict[str, Any]], "Scraped content"]:
    # Initialize the Spider client with your API key. If no API key is passed,
    # the client looks for SPIDER_API_KEY in your environment variables.
    client = Spider(spider_api_key)

    if params is None:
        params = {"return_format": "markdown"}

    crawled_data = client.crawl_url(url, params)
    return crawled_data
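
In the two tools above, the markdown default applies only when `params` is `None`. If the model passes a dictionary that omits `return_format`, the default is dropped. One possible way to keep it is to merge the supplied parameters over a set of defaults. The sketch below uses a hypothetical helper, `with_defaults`, and makes no network calls:

```python
from typing import Any, Dict, Optional

# Defaults used by the tools in this notebook.
DEFAULT_PARAMS: Dict[str, Any] = {"return_format": "markdown"}


def with_defaults(params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge caller-supplied params over the defaults without mutating either."""
    merged = dict(DEFAULT_PARAMS)
    merged.update(params or {})
    return merged
```

Inside `scrape_page` or `crawl_page`, `params = with_defaults(params)` could then replace the `if params is None` check. Which keys Spider accepts besides `return_format` should be confirmed against the Spider API documentation.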



Create the agents and register the tool.

In [4]:
from autogen import ConversableAgent, register_function

# Create web scraper agent.
scraper_agent = ConversableAgent(
    "WebScraper",
    llm_config={"config_list": config_list},
    system_message="You are a web scraper and you can scrape any web page to retrieve its contents. "
    "Return 'TERMINATE' when the scraping is done.",
)

# Create web crawler agent.
crawler_agent = ConversableAgent(
    "WebCrawler",
    llm_config={"config_list": config_list},
    system_message="You are a web crawler and you can crawl any page, following subpages for deeper crawling. "
    "Return 'TERMINATE' when the crawling is done.",
)

# Create user proxy agent.
user_proxy_agent = ConversableAgent(
    "UserProxy",
    llm_config=False,  # No LLM for this agent.
    human_input_mode="NEVER",
    code_execution_config=False,  # No code execution for this agent.
    # Treat a missing or None "content" field as an empty string.
    is_termination_msg=lambda x: "terminate" in (x.get("content") or "").lower(),
    default_auto_reply="Please continue if not finished, otherwise return 'TERMINATE'.",
)

# Register the functions with the agents.
register_function(
    scrape_page,
    caller=scraper_agent,
    executor=user_proxy_agent,
    name="scrape_page",
    description="Scrape a web page and return the content.",
)

register_function(
    crawl_page,
    caller=crawler_agent,
    executor=user_proxy_agent,
    name="crawl_page",
    description="Crawl an entire domain, following subpages, and return the content.",
)
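
The termination check passed to the user proxy can also be written as a named function, which is easier to test in isolation. A minimal sketch with a hypothetical function name, assuming messages are dictionaries whose `content` may be missing, `None`, or a non-string value:

```python
from typing import Any, Dict


def is_termination_msg(message: Dict[str, Any]) -> bool:
    """Return True when the message text contains 'terminate' (case-insensitive)."""
    content = message.get("content")
    if not isinstance(content, str):
        # Missing, None, or structured content never terminates the chat.
        return False
    return "terminate" in content.lower()
```

It could be passed as `is_termination_msg=is_termination_msg` when constructing the user proxy agent.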

Start the conversation for scraping web data. We use the
`reflection_with_llm` summary method to have the LLM
format the output as requested in the summary prompt.
The summary method is called after the conversation completes
and receives the complete conversation history.

In [5]:
# Scrape page
scraped_chat_result = user_proxy_agent.initiate_chat(
    scraper_agent,
    message="Can you scrape william-espegren.com for me?",
    summary_method="reflection_with_llm",
    summary_args={"summary_prompt": """Summarize the scraped content"""},
)

[33mUserProxy[0m (to WebScraper):

Can you scrape william-espegren.com for me?

--------------------------------------------------------------------------------
[31m
>>>>>>>> USING AUTO REPLY...[0m
[33mWebScraper[0m (to UserProxy):

[32m***** Suggested tool call (call_qCNYeQCfIPZkUCKejQmm5EhC): scrape_page *****[0m
Arguments: 
{"url":"https://www.william-espegren.com"}
[32m****************************************************************************[0m

--------------------------------------------------------------------------------
[35m
>>>>>>>> EXECUTING FUNCTION scrape_page...[0m
[33mUserProxy[0m (to WebScraper):

[33mUserProxy[0m (to WebScraper):

[32m***** Response from calling tool (call_qCNYeQCfIPZkUCKejQmm5EhC) *****[0m
[{"content": "William Espegren - Portfoliokeep scrollingMADE WITHCSS, JSMADE BYUppsalaWilliam EspegrenWith \u00b7LoveOpen For Projects[CONTACT ME](https://www.linkedin.com/in/william-espegren/)[Instagram](https://www.instagram.com/williamespegre

In [6]:
# Crawl page
crawled_chat_result = user_proxy_agent.initiate_chat(
    crawler_agent,
    message="Can you crawl william-espegren.com for me, I want the whole domains information?",
    summary_method="reflection_with_llm",
    summary_args={"summary_prompt": """Summarize the crawled content"""},
)

[33mUserProxy[0m (to WebCrawler):

Can you crawl william-espegren.com for me, I want the whole domains information?

--------------------------------------------------------------------------------
[31m
>>>>>>>> USING AUTO REPLY...[0m
[33mWebCrawler[0m (to UserProxy):

[32m***** Suggested tool call (call_0FkTtsxBtA0SbChm1PX085Vk): crawl_page *****[0m
Arguments: 
{"url":"http://www.william-espegren.com"}
[32m***************************************************************************[0m

--------------------------------------------------------------------------------
[35m
>>>>>>>> EXECUTING FUNCTION crawl_page...[0m
[33mUserProxy[0m (to WebCrawler):

[33mUserProxy[0m (to WebCrawler):

[32m***** Response from calling tool (call_0FkTtsxBtA0SbChm1PX085Vk) *****[0m
[{"content": "William Espegren - Portfoliokeep scrollingMADE WITHCSS, JSMADE BYUppsalaWilliam EspegrenWith \u00b7LoveOpen For Projects[CONTACT ME](https://www.linkedin.com/in/william-espegren/)[Instagram](https:/

The formatted output is stored in the `summary` attribute of each chat result.

In [7]:
print(scraped_chat_result.summary)
# print(crawled_chat_result.summary)  # Only one summary is shown to keep the output short.

The website belongs to William Espegren, who is based in Uppsala and possesses skills in CSS and JavaScript. He is open to new projects. You can contact him through the following links:

- [LinkedIn](https://www.linkedin.com/in/william-espegren/)
- [Instagram](https://www.instagram.com/williamespegren/)
- [Twitter](https://twitter.com/WilliamEspegren)
- [GitHub](https://github.com/WilliamEspegren)

Feel free to reach out to him for project collaborations.
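
The summary holds only the LLM-written digest. If you also need the raw scraped text, it should still be in the conversation history. The sketch below assumes that the chat result exposes the history as a list of message dictionaries (`chat_history`) and that tool results appear as messages with role `"tool"`. Both are assumptions about the AutoGen version in use, and `tool_outputs` is a hypothetical helper:

```python
from typing import Any, Dict, List


def tool_outputs(chat_history: List[Dict[str, Any]]) -> List[str]:
    """Collect the text content of all tool-response messages, in order."""
    outputs = []
    for message in chat_history:
        # Assumption: tool results are stored with role "tool".
        if message.get("role") == "tool":
            outputs.append(message.get("content") or "")
    return outputs


# Possible usage, if the assumptions above hold:
# raw_pages = tool_outputs(scraped_chat_result.chat_history)
```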
