# Obsei Tutorial 04
## This example shows the following Obsei workflow
 1. Observe: Search for and fetch news articles via Google News
 2. Cleaner: Clean the article text
 3. Analyze: Classify the article text by splitting it into small chunks and then computing the final inference with the configured aggregation function

## To install Obsei, perform these steps
- Select a GPU runtime type for faster computation
- Restart the runtime after installation

In [1]:
!pip install obsei[all]
!pip install trafilatura



## Configure Google News Observer

In [10]:
from obsei.source.google_news_source import GoogleNewsConfig, GoogleNewsSource

source_config = GoogleNewsConfig(
    query="bitcoin",
    max_results=10,
    fetch_article=True,
    lookup_period="1d",
)

source = GoogleNewsSource()

## Configure TextCleaner as a pre-processor to clean the article text
These cleaning functions run serially, in the order listed

In [14]:
from obsei.preprocessor.text_cleaner import TextCleaner, TextCleanerConfig
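# Import only the cleaning functions used below instead of a wildcard import
from obsei.preprocessor.text_cleaning_function import (
    DecodeUnicode,
    RemovePunctuation,
    RemoveSpecialChars,
    RemoveStopWords,
    RemoveWhiteSpaceAndEmptyToken,
    ToLowerCase,
)

text_cleaner_config = TextCleanerConfig(
    cleaning_functions=[
        ToLowerCase(),
        RemoveWhiteSpaceAndEmptyToken(),
        RemovePunctuation(),
        RemoveSpecialChars(),
        DecodeUnicode(),
        RemoveStopWords(),
        RemoveWhiteSpaceAndEmptyToken(),
    ]
)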

text_cleaner = TextCleaner()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
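
Because the cleaning functions run one after another, their order matters: for example, lowercasing before stop-word removal lets a lowercase stop-word list match `The` as well as `the`. The sketch below illustrates this serial chaining in plain Python. It does not use Obsei's classes; `run_serially`, the four helper functions, and the tiny stop-word set are hypothetical stand-ins chosen for illustration.

```python
import string
from typing import Callable, List

Tokens = List[str]

# Hypothetical stand-ins for cleaning functions; not Obsei's implementations.
def to_lower(tokens: Tokens) -> Tokens:
    return [t.lower() for t in tokens]

def remove_punctuation(tokens: Tokens) -> Tokens:
    table = str.maketrans("", "", string.punctuation)
    return [t.translate(table) for t in tokens]

def remove_stop_words(tokens: Tokens) -> Tokens:
    stop_words = {"the", "of", "is"}  # illustrative list only
    return [t for t in tokens if t not in stop_words]

def remove_empty(tokens: Tokens) -> Tokens:
    return [t.strip() for t in tokens if t.strip()]

def run_serially(tokens: Tokens, functions: List[Callable[[Tokens], Tokens]]) -> Tokens:
    # Each function receives the output of the previous one.
    for fn in functions:
        tokens = fn(tokens)
    return tokens

cleaned = run_serially(
    "The price of Bitcoin is UP , again !".split(),
    [to_lower, remove_punctuation, remove_stop_words, remove_empty],
)
print(cleaned)
```

Removing punctuation leaves empty tokens behind, which is one reason a final empty-token pass at the end of the chain is useful.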


## Configure Classification Analyzer

- `labels`: the list of candidate categories
- `TextSplitterConfig` with suitable `max_split_length` and `split_stride` values
- `InferenceAggregatorConfig` with the required `aggregate_function`; two are currently supported (average and most frequent class)
- `ClassificationMaxCategories` needs a `score_threshold`, the minimum probability a class must reach to be taken into consideration

**Note**: To try a different model, select one from https://huggingface.co/models?pipeline_tag=zero-shot-classification
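
How `max_split_length` and `split_stride` interact can be sketched in plain Python. The function below is not Obsei's `TextSplitter`; it is a simplified character-based sliding window that assumes `split_stride` is the number of characters shared by consecutive chunks. The name `split_with_overlap` is hypothetical.

```python
from typing import List

def split_with_overlap(text: str, max_split_length: int, split_stride: int) -> List[str]:
    """Cut text into chunks of at most max_split_length characters.

    Assumption: split_stride is the overlap (in characters) between
    one chunk and the next.
    """
    if split_stride >= max_split_length:
        raise ValueError("split_stride must be smaller than max_split_length")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + max_split_length, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by the stride so neighbouring chunks share some context.
        start = end - split_stride
    return chunks

chunks = split_with_overlap("abcdefghij", max_split_length=4, split_stride=1)
print(chunks)
```

A small overlap reduces the chance that a phrase relevant to one label is cut in half at a chunk boundary.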

In [11]:
from obsei.analyzer.classification_analyzer import ClassificationAnalyzerConfig, ZeroShotClassificationAnalyzer
from obsei.postprocessor.inference_aggregator import InferenceAggregatorConfig
from obsei.postprocessor.inference_aggregator_function import ClassificationMaxCategories
from obsei.preprocessor.text_splitter import TextSplitterConfig

analyzer_config = ClassificationAnalyzerConfig(
    labels=["buy", "sell", "going up", "going down"],
    use_splitter_and_aggregator=True,
    splitter_config=TextSplitterConfig(
        max_split_length=300,
        split_stride=3
    ),
    aggregator_config=InferenceAggregatorConfig(
        aggregate_function=ClassificationMaxCategories(
            score_threshold=0.3
        )
    )
)

text_analyzer = ZeroShotClassificationAnalyzer(
    model_name_or_path="typeform/mobilebert-uncased-mnli",
    device="auto"
)
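
To make the role of `score_threshold` concrete, here is a plain-Python sketch of a max-categories style aggregation over per-chunk scores. It is an assumption about the general idea, not Obsei's `ClassificationMaxCategories` code: every label that reaches the threshold in a chunk is counted once, and the highest score seen per label is kept. The function name `aggregate_max_categories` and the chunk scores are made up for illustration; the `category_count` and `max_scores` keys mirror the shape of the aggregator output this notebook prints.

```python
from collections import Counter
from typing import Dict, List

def aggregate_max_categories(
    chunk_scores: List[Dict[str, float]], score_threshold: float
) -> Dict[str, Dict[str, float]]:
    category_count: Counter = Counter()
    max_scores: Dict[str, float] = {}
    for scores in chunk_scores:
        for label, score in scores.items():
            if score < score_threshold:
                continue  # labels below the threshold are ignored for this chunk
            category_count[label] += 1
            max_scores[label] = max(score, max_scores.get(label, 0.0))
    return {"category_count": dict(category_count), "max_scores": max_scores}

# Hypothetical scores for two chunks of one article.
chunk_scores = [
    {"buy": 0.10, "sell": 0.51, "going up": 0.56},
    {"buy": 0.20, "sell": 0.25, "going up": 0.40},
]
result = aggregate_max_categories(chunk_scores, score_threshold=0.3)
print(result)
```

Raising the threshold makes the aggregate stricter: with `score_threshold=0.55`, only the first chunk's `going up` score would be counted.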

## Search for and fetch news articles

In [12]:
source_response_list = source.lookup(source_config)

07/29/2021 19:08:37 - INFO - urllib3.poolmanager - Redirecting https://www.bloomberg.com/news/articles/2021-07-29/brokers-sought-for-78-million-bitcoin-stash-from-finland-bust -> https://www.bloomberg.com/tosv2.html?vid=&uuid=60bfaca0-f0a0-11eb-9e4f-53da6d852d97&url=L25ld3MvYXJ0aWNsZXMvMjAyMS0wNy0yOS9icm9rZXJzLXNvdWdodC1mb3ItNzgtbWlsbGlvbi1iaXRjb2luLXN0YXNoLWZyb20tZmlubGFuZC1idXN0
07/29/2021 19:08:37 - INFO - trafilatura.core - using custom extraction: None
07/29/2021 19:08:37 - INFO - trafilatura.core - not enough comments None
07/29/2021 19:08:37 - INFO - urllib3.poolmanager - Redirecting https://www.bloomberg.com/news/articles/2021-07-29/new-ira-product-allows-for-tax-free-bitcoin-mining -> https://www.bloomberg.com/tosv2.html?vid=&uuid=60f22e50-f0a0-11eb-90b4-6d36db3c27b3&url=L25ld3MvYXJ0aWNsZXMvMjAyMS0wNy0yOS9uZXctaXJhLXByb2R1Y3QtYWxsb3dzLWZvci10YXgtZnJlZS1iaXRjb2luLW1pbmluZw==
07/29/2021 19:08:37 - INFO - trafilatura.core - using custom extraction: None
07/29/2021 19:08:37 - ERRO

## Preprocess the text to clean it

In [15]:
cleaner_response_list = text_cleaner.preprocess_input(
    input_list=source_response_list,
    config=text_cleaner_config
)

## Analyze the articles to perform classification
**Note**: This is a compute-heavy step

In [16]:
analyzer_response_list = text_analyzer.analyze_input(
    source_response_list=cleaner_response_list,
    analyzer_config=analyzer_config
)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


## Print Result

In [17]:
for analyzer_response in analyzer_response_list:
    print(vars(analyzer_response))

{'segmented_data': {'aggregator_data': {'category_count': {'positive': 2, 'going up': 2, 'sell': 1}, 'max_scores': {'positive': 0.806824266910553, 'going up': 0.5611677169799805, 'sell': 0.5141412019729614}, 'aggregator_name': 'ClassificationMaxCategories'}}, 'meta': {'title': 'Bitcoin (BTC USD) Cryptocurrency Price News: Finland Seeks Broker to Sell Stash - Bloomberg', 'description': 'Bitcoin (BTC USD) Cryptocurrency Price News: Finland Seeks Broker to Sell Stash Bloomberg', 'published date': 'Thu, 29 Jul 2021 13:11:21 GMT', 'url': 'https://www.bloomberg.com/news/articles/2021-07-29/brokers-sought-for-78-million-bitcoin-stash-from-finland-bust', 'publisher': {'href': 'https://www.bloomberg.com', 'title': 'Bloomberg'}, 'extracted_data': {'title': 'Bloomberg', 'author': None, 'hostname': None, 'date': None, 'categories': '', 'tags': '', 'fingerprint': 'BecpvREYR0Bqj6DjTeoRthAFuAs=', 'id': '6e25ac22', 'source': None, 'source-hostname': 'Are you a robot?', 'excerpt': None}}, 'source_name'
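
In the result above, `extracted_data` has `'source-hostname': 'Are you a robot?'`, which suggests the publisher served a bot-check page instead of the article, so the classification for that item may be based on the wrong text. The sketch below shows one way to flag such items from the `meta` dictionary alone. The helper `looks_blocked` and the marker list are hypothetical, and the sample dictionaries are abbreviated or invented for illustration.

```python
from typing import Any, Dict

# Hypothetical marker list; extend it with phrases seen in your own runs.
BLOCK_MARKERS = ("are you a robot",)

def looks_blocked(meta: Dict[str, Any]) -> bool:
    """Return True if the extracted metadata looks like a bot-check page."""
    extracted = meta.get("extracted_data") or {}
    for key in ("title", "source-hostname"):
        value = extracted.get(key)
        if isinstance(value, str) and any(m in value.lower() for m in BLOCK_MARKERS):
            return True
    return False

# Abbreviated from the printed result above.
blocked_meta = {
    "publisher": {"href": "https://www.bloomberg.com", "title": "Bloomberg"},
    "extracted_data": {"title": "Bloomberg", "source-hostname": "Are you a robot?"},
}
# Invented example of a normally extracted article.
ok_meta = {"extracted_data": {"title": "Some article", "source-hostname": "example.com"}}

print(looks_blocked(blocked_meta), looks_blocked(ok_meta))
```

Such a check could be applied to each response's `meta` before trusting its aggregated categories.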