{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "First, we must ensure that the Watson Python SDK is installed and ready to use. We'll then import the SDK as well as the pandas library.\n", "\n", "**Note: `%%capture` just suppresses the output. You can remove that line if you want to see the output from pip.**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "!pip install watson_developer_cloud==2.5.1" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from watson_developer_cloud import DiscoveryV1\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Initialize Watson Discovery\n", "\n", "Now we'll initialize Watson Discovery using our credentials. To obtain these, create a Watson Discovery service with your IBM Cloud account and generate new credentials." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Enter your own credentials below - be sure to wrap each in single quotation marks\n", "credentials = {\n", "    'IAM_API_KEY': 'YOUR DISCOVERY API KEY HERE'\n", "}" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "discovery = DiscoveryV1(\n", "    version='2018-08-01',\n", "    iam_api_key=credentials['IAM_API_KEY'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Creating the query\n", "\n", "There are a few elements to querying Watson Discovery News. I'll break down each of them.\n", "\n", "`environment_id`: `system` denotes that we're using the system environment\n", "\n", "`collection_id`: We want to query the news collection\n", "\n", "`query`: We will be using a filter to isolate the articles that we want, so a query is not necessary in this case\n", "\n", "`offset`: The number of results (documents) to skip\n", "\n", 
"`count`: The number of results (documents) to return\n", "\n", "***Note: `count` and `offset` together implement pagination of results. The total (offset + count) cannot exceed 1,000.***\n", "\n", "`deduplicate`: This is a beta feature to have Watson remove duplicate articles\n", "\n", "`aggregation`: This is an analytic query over the result set - in this case, the entire collection of news articles, filtered by Company (here 'Bitcoin' is the company in question)\n", "\n", "`filter`: The query for matching documents\n", "\n", "`return_fields`: Which fields to return for each document\n", "\n", "**For more information, check out the [query reference](https://cloud.ibm.com/docs/services/discovery/query-reference.html#query-reference)**\n", "\n", "\n", "We use a `DEFAULT_COUNT` of 50, the maximum you can query at once, and advance the offset by that amount on each pass until we've captured all available documents or hit 1,000 (the maximum)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "all_results = []\n", "\n", "DEFAULT_COUNT = 50\n", "offset = 0\n", "\n", "while offset + DEFAULT_COUNT <= 1000:\n", "    try:\n", "        result = discovery.query(environment_id='system',\n", "                                 collection_id='news-en',\n", "                                 query='',\n", "                                 offset=offset,\n", "                                 count=DEFAULT_COUNT,\n", "                                 deduplicate=True,\n", "                                 aggregation='filter(enriched_title.entities.type::Company).term(enriched_title.entities.text).timeslice(crawl_date,1day).term(enriched_text.sentiment.document.score)',\n", "                                 filter='Bitcoin',\n", "                                 return_fields=['publication_date', 'enriched_text.sentiment.document'])\n", "\n", "        # If the results are empty, stop querying\n", "        if not result['results']:\n", "            break\n", "\n", "        # Add this page of results to all_results\n", "        all_results.extend(result['results'])\n", "\n", "        # Advance by exactly one page so that no documents are skipped\n", "        offset += DEFAULT_COUNT\n", "    except Exception as err:\n", "        # Stop paging on any API or network error, but report it\n", "        print('Query failed at offset {}: {}'.format(offset, err))\n", "        break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Using the data\n", "\n", "Now 
that we've queried Watson Discovery, we need to make the data usable. The response is an object with a few different fields. The one we're concerned with is `results`, which holds the items we asked for in `return_fields` of the query. After each query in the previous `while` loop, we added `result['results']` to a list of all results. Now we'll work with them.\n", "\n", "First we'll create a pandas DataFrame. Think of it as putting the data into a spreadsheet. Then we'll iterate over the list of results and add them to our new DataFrame." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "_date = []\n", "_sentiment_label = []\n", "_sentiment_score = []\n", "\n", "for r in all_results:\n", "    _date.append(r['publication_date'])\n", "    _sentiment_label.append(r['enriched_text']['sentiment']['document']['label'])\n", "    _sentiment_score.append(r['enriched_text']['sentiment']['document']['score'])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "df['publication_date'] = _date\n", "df['sentiment_label'] = _sentiment_label\n", "df['sentiment_score'] = _sentiment_score\n", "df.index = df['publication_date']" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "df['publication_date'] = pd.to_datetime(df['publication_date'])\n", "df.index = df['publication_date']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Looking at the data\n", "\n", "When we call the `df.head()` method, we can see the first few rows of our data - remember the concept of a spreadsheet? Here we can see how the sentiment label correlates with the score. Sentiment scores range from -1 (negative) to 1 (positive), with 0 being neutral. 
Watson has already done the legwork of assigning a label, but you could easily do this yourself as well." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| publication_date (index) | publication_date | sentiment_label | sentiment_score |
|---|---|---|---|
| 2018-09-23 12:34:00 | 2018-09-23 12:34:00 | negative | -0.029183 |
| 2018-09-23 11:46:00 | 2018-09-23 11:46:00 | negative | -0.560177 |
| 2018-09-23 12:42:00 | 2018-09-23 12:42:00 | negative | -0.022305 |
| 2018-09-23 10:00:00 | 2018-09-23 10:00:00 | negative | -0.368113 |
| 2018-09-23 12:42:00 | 2018-09-23 12:42:00 | negative | -0.167054 |