{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"import copy\n",
"import random\n",
"import hashlib\n",
"import itertools\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"import operator\n",
"import json\n",
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.utils import shuffle\n",
"from sklearn.linear_model import SGDClassifier\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix, average_precision_score\n",
"from sklearn.model_selection import validation_curve, learning_curve\n",
"from IPython.display import Image\n",
"import seaborn as sns\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data can be downloaded at \n",
"\n",
"https://drive.google.com/open?id=1H7N3Y7PEm0_442koQeffVfodlOiR0W_o\n",
"\n",
"https://drive.google.com/open?id=1YbI2DFpuR_689OcqTwvRvEgcpu6BXSVt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('sites_markup.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1. Dataset and features description"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset was gathered from grocery retail sites.\n",
"\n",
"They all share a similar structure, so it should be possible to parse these sites not with hand-written rules but with a crawler that can distinguish between page elements.\n",
"\n",
"Let's have a look at the sites"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Image(url='https://habrastorage.org/webt/xe/ek/vu/xeekvu2aogn6mljs-obb4y7nns0.png', width=700)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Image(url='https://habrastorage.org/webt/vz/vz/u8/vzvzu80khqbtlr3cow9xmj8kgwq.png', width=700)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Image(url='https://habrastorage.org/webt/ng/s3/8w/ngs38wzj4gtl4bnhikmamyjtory.png', width=700)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So they are indeed similar.\n",
"\n",
"Let's look at the dataset and features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Features were constructed from common sense and a couple of papers with a similar goal (https://medium.com/contentsquare-engineering-blog/automatic-zone-recognition-in-webpages-68fb2efab822 , https://arxiv.org/pdf/1210.6113.pdf).\n",
"\n",
"- childs_tags - tags of the nested elements\n",
"- depth - depth in the HTML markup tree\n",
"- element_classes_count - how many classes the element has\n",
"- element_classes_words - the element's class names\n",
"- href_seen - whether there is an href somewhere inside the child elements\n",
"- img_seen - whether there is an img somewhere inside the child elements\n",
"- inner_elements - how many nested elements there are\n",
"- is_displayed - the is_displayed flag from Selenium\n",
"- is_href - whether the current element is an href\n",
"- location - location of the element's upper-left corner on the page\n",
"- parent_tag - the tag of the element's ancestor\n",
"- screenshot - path to the element's screenshot\n",
"- shop - the shop's name\n",
"- siblings_count - how many other elements lie on the same level\n",
"- size - width and height of the element\n",
"- tag - the element's tag\n",
"- text - text inside the element\n",
"- y - the target: whether it is a \"product card\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our goal is to find the \"product card\" objects"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Image(url='https://habrastorage.org/webt/if/yo/r2/ifyor2xutwypu-eyqfufqnh2g5o.png', width=900)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, there are many products on the same page. Each product card has many descendant elements and a specific structure: picture, price, product name, rating, to_cart, etc.\n",
"\n",
"Data was gathered not just by downloading pages to obtain raw HTML, but by actually rendering the pages, in order to get element sizes and locations.\n",
"\n",
"Once we have automatic product card recognition, our crawlers will be able to extract products from a wide range of sites without explicit rules. It is possible to crawl all sites by hand, but that is tedious, and beyond 20+ sites it becomes hard to maintain, because site structure changes over time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part 1.5 Raw data processing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"'childs_tags', 'location', 'parent_tag' and 'size' need to be corrected: array-like values will become strings with a \" \" separator, and integer components will become separate fields."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df[['loc_x', 'loc_y']] = df['location'].str.replace(\"'\", '\"').apply(json.loads).apply(pd.Series)\n",
"df[['size_h', 'size_w']] = df['size'].str.replace(\"'\", '\"').apply(json.loads).apply(pd.Series)\n",
"df.size_h = df.size_h.astype(int)\n",
"df.size_w = df.size_w.astype(int)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['parent_tag'] = df.parent_tag.str.split('_')\n",
"df['childs_tags'] = df.childs_tags.apply(lambda x: x.replace('[', '').replace(']','').strip())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Drop some columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean = df.copy()\n",
"df_clean.drop(['location', 'size', 'Unnamed: 0'], axis=1, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 2-3-4. Exploratory data analysis and visual analysis of the features. Patterns, insights, peculiarities of data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### target distribution\n",
"First things first. Let's find our target feature's distribution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean['y'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Only about 3k positive values, a 42:1 class ratio. No surprise, though: every element of the DOM tree, starting from body, became a row."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I've made a screenshot of every element, not for deep learning or image recognition, but to make it easier to see what an element looks like. (A screenshot for every element is too many images, so I'll take one screenshot of a grid of screenshots =))\n",
"\n",
"This is how the target elements look:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Image(url='https://habrastorage.org/webt/bz/xy/ds/bzxydsgez9atzsqavsatzgibrlm.png', width=900)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Indeed, very similar"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Overall stats"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean.head(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"continuous_vars = ['depth', 'element_classes_count', 'inner_elements', 'siblings_count', 'loc_x', 'loc_y', 'size_h', 'size_w']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[continuous_vars].mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[df_clean.y == 1][continuous_vars].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### depth\n",
"At what depth do the target elements sit?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[df_clean.y == 1].groupby('shop')['depth'].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Strange behavior for 'Окей'. Let's investigate."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[(df_clean.shop == 'Окей') & (df_clean.y == 1) & (df_clean.depth == 13)].shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[(df_clean.shop == 'Окей') & (df_clean.y == 1) & (df_clean.depth != 13)].shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Almost all product cards from 'Окей' have depth == 13; let's look at them.\n",
"\n",
"Images for df_clean[(df_clean.shop == 'Окей') & (df_clean.y == 1) & (df_clean.depth == 13)]:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Image(url='https://habrastorage.org/webt/vf/uz/rd/vfuzrdfzhrzihvarhglnfk4zdje.png', width=900)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"images for df_clean[(df_clean.shop == 'Окей') & (df_clean.y == 1) & (df_clean.depth != 13)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Image(url='https://habrastorage.org/webt/32/ee/ah/32eeahfoleo-ijciyzz1dvbv8hs.png', width=900)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks the same"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[(df_clean.shop == 'Окей') & (df_clean.y == 1) & (df_clean.depth == 13)].head(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[(df_clean.shop == 'Окей') & (df_clean.y == 1) & (df_clean.depth != 13)].head(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Same tags and class names; inner_elements varies a little. Everything is correct, just the depth varies."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.barplot(x=df_clean.depth, y=df_clean.y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Outlier: depth == 16"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[(df_clean.y == 1) & (df_clean.depth == 16)].head(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Image(url='https://habrastorage.org/webt/gm/fz/hi/gmfzhiial3a_fxgj5lfq2oaa-ac.png', width=900)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most cards in 'Перекрёсток' are at depth 9"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[(df_clean.y == 1) & (df_clean.shop == 'Перекрёсток')].head(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Image(url='https://habrastorage.org/webt/na/us/e4/nause4he1lpfur_l-hvvjug12tk.png', width=900)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Different product cards within one store: either a bug in the crawler, or the store really has different card types. One of the types has no rating or to_cart elements."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Element sizes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[df_clean.y == 1].groupby('shop')[['size_w', 'size_h']].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"'Комус' differs a lot.\n",
"\n",
"Let's plot all the product cards' sizes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tmp_df = df_clean[(df_clean.y == 1)]\n",
"sns.scatterplot(x=tmp_df.size_w, y=tmp_df.size_h)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tmp_df = df_clean[(df_clean.y == 1)]\n",
"g = sns.FacetGrid(tmp_df, col='shop', hue='y')\n",
"g.map(sns.scatterplot, 'size_w', 'size_h')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At first we expected one size per shop, but that is not true: for 'Перекрёсток' and 'Комус' the size varies a lot. When we plot all card sizes, we see they are mostly located in the upper-left corner of the plot, with sizes around 400x220. But what's wrong with 'Комус'? Let's visit the site."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Image(url='https://habrastorage.org/webt/ua/pg/l6/uapgl6ei_c1knebnz9ls8rvlbdq.png', width=900)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The structure is different: most shops use a tile layout, but here it is a list layout. For the sake of simplicity, I'll exclude 'Комус' from our dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f'unique shops - {df_clean.shop.unique()}')\n",
"df_clean = df_clean[df_clean.shop != 'Комус']\n",
"print(f'unique shops after drop - {df_clean.shop.unique()}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's see how the target sizes compare with all the others"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tmp_df = df_clean\n",
"g = sns.FacetGrid(tmp_df, col='shop', hue='y')\n",
"g.map(sns.scatterplot, 'size_w', 'size_h')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Outliers disturb the picture"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tmp_df = df_clean[(df_clean.size_w < 1000) & (df_clean.size_h < 4000)]\n",
"g = sns.FacetGrid(tmp_df, col='shop', hue='y')\n",
"g.map(sns.scatterplot, 'size_w', 'size_h')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zoom closer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tmp_df = df_clean[(df_clean.size_w > 200) &(df_clean.size_w < 300) & (df_clean.size_h < 600)]\n",
"g = sns.FacetGrid(tmp_df, col='shop', hue='y')\n",
"g.map(sns.scatterplot, 'size_w', 'size_h')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks like a great feature: consistent across shops and well separated from other DOM elements"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### location"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tmp_df = df_clean\n",
"g = sns.FacetGrid(tmp_df, col='shop', hue='y')\n",
"g.map(sns.scatterplot, 'loc_x', 'loc_y' )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Location at negative positions?!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tmp_df = df_clean[(df_clean.loc_x > 0) & (df_clean.loc_y > 0)]\n",
"g = sns.FacetGrid(tmp_df, col='shop', hue='y')\n",
"g.map(sns.scatterplot, 'loc_x', 'loc_y' )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Positions are as we expected: a strong tabular structure. Maybe we can make a new feature from this fact.\n",
"\n",
"There are outliers in 'Перекрёсток', let's see"
]
},
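{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of that idea (a hypothetical feature, not part of the current dataset): count how many elements on the same page share an element's loc_x, so cards aligned in grid columns get high values. Illustrated on a tiny synthetic frame; the 'page' column is made up, since the real data has no page identifier:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy frame: three elements aligned at x == 100 (a grid column), one stray element.\n",
"toy = pd.DataFrame({'page': [1, 1, 1, 1],\n",
"                    'loc_x': [100, 100, 100, 37]})\n",
"# For every row, count how many rows on the same page share its loc_x.\n",
"toy['same_x_count'] = toy.groupby(['page', 'loc_x'])['loc_x'].transform('size')\n",
"toy['same_x_count'].tolist()  # [3, 3, 3, 1]"
]
},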
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[(df_clean.loc_x > 900) & (df_clean.y == 1) & (df_clean.shop == 'Перекрёсток')].head(2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Image(url='https://habrastorage.org/webt/hz/oe/e5/hzoee5akuroyw3hedikww33qg1k.png', width=900)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"?????\n",
"\n",
"Don't know what it is. Better drop."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f'rows in - {df_clean.shape[0]}')\n",
"df_clean = df_clean.drop(df_clean[(df_clean.loc_x > 900) & (df_clean.y == 1) & (df_clean.shop == 'Перекрёсток')].index)\n",
"print(f'rows after drop - {df_clean.shape[0]}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### is_displayed"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean.groupby('is_displayed')['y'].sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All rows have is_displayed == True - a redundant feature"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean.drop(labels='is_displayed', axis=1, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### inner_elements"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean['inner_elements'].mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[df_clean.y == 1]['inner_elements'].mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.distplot(df_clean.inner_elements);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.distplot(df_clean[(df_clean.inner_elements > 2) & (df_clean.inner_elements < 100) ].inner_elements);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Maybe the feature was poorly designed. For each element we count inner elements all the way down to the leaves, so for \"body\" every other element on the page is treated as inner."
]
},
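{
"cell_type": "markdown",
"metadata": {},
"source": [
"A variant that only counts descendants down to a limited depth might be more informative. A minimal sketch on a toy DOM (the helper and the snippet are illustrative, not the crawler's actual code):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import xml.etree.ElementTree as ET\n",
"\n",
"# Toy DOM: body > div > (img, a > span)\n",
"root = ET.fromstring('<body><div><img/><a><span/></a></div></body>')\n",
"\n",
"def count_descendants(el, max_depth):\n",
"    \"\"\"Count descendants of `el` no deeper than `max_depth` levels below it.\"\"\"\n",
"    if max_depth == 0:\n",
"        return 0\n",
"    return sum(1 + count_descendants(child, max_depth - 1) for child in el)\n",
"\n",
"print(count_descendants(root, 1))   # 1: only the direct child div\n",
"print(count_descendants(root, 10))  # 4: every element under body"
]
},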
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### tag"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean['tag'].unique()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[df_clean.y == 1]['tag'].unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our cards are all 'div' elements. How many other divs are there?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[df_clean.tag == 'div'].groupby('y').size()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean.groupby('tag').size()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Still, there are a lot of divs that aren't our target, but the feature is good."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### siblings_count"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[df_clean.y == 1]['siblings_count'].unique()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[df_clean.y == 1].groupby('shop')['siblings_count'].agg(['unique'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The feature was designed to show how many \"siblings\" there are around an element. Product cards are supposed to lie in lists, so they are at the same level: with 30 products per page, siblings_count should be 29. But something is wrong. Let's check the original sites, 'Перекрёсток' for example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Image(url='https://habrastorage.org/webt/it/gk/7t/itgk7tfxsnyzsnda-zmiejtnu5o.png', width=900)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Image(url='https://habrastorage.org/webt/em/i-/v9/emi-v9grsih0_i5rhbpnmhuloec.png', width=900)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Image(url='https://habrastorage.org/webt/vr/in/0y/vrin0yedugz7xudlg7bpmhm8oi0.png', width=900)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see there is a list 'ul' containing 'li' elements, each containing a 'div'. So the 'li' element behaves as we would expect - the right siblings_count - but for the 'div' inside the 'li', siblings_count == 1. The feature needs to be reworked, or a potentially fruitful feature is dead."
]
},
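{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way the feature could be reworked (a sketch; effective_siblings is a made-up helper, not something the crawler computes): when an element is an only child, inherit the sibling count from its wrapper:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import xml.etree.ElementTree as ET\n",
"\n",
"# Toy markup: a ul with three li items, each wrapping a single div (a \"card\").\n",
"root = ET.fromstring('<ul><li><div/></li><li><div/></li><li><div/></li></ul>')\n",
"\n",
"def effective_siblings(parent_map, el):\n",
"    \"\"\"Siblings of `el`, falling back to the parent's siblings for an only child.\"\"\"\n",
"    parent = parent_map.get(el)\n",
"    if parent is None:\n",
"        return 0\n",
"    own = len(list(parent)) - 1\n",
"    return own if own > 0 else effective_siblings(parent_map, parent)\n",
"\n",
"parent_map = {child: parent for parent in root.iter() for child in parent}\n",
"div = root.find('li/div')\n",
"print(effective_siblings(parent_map, div))  # 2: inherited from the wrapping li"
]
},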
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### is_href"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[df_clean.y == 1]['is_href'].mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[df_clean.y == 0]['is_href'].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our product cards are not clickable themselves"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### element_classes_words"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for shop, tmp_df in df_clean[df_clean.y == 1].groupby('shop'):\n",
" print(shop)\n",
" print(tmp_df['element_classes_words'].unique())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"CSS class naming depends entirely on the developers, but in some cases we see names like \"product\" and \"item\" that could be helpful. There isn't enough data to say for sure."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### parent_tag"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for shop, tmp_df in df_clean[df_clean.y == 1].groupby('shop'):\n",
" print(shop)\n",
" print(tmp_df['parent_tag'].head(1))\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### childs_tags"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for shop, tmp_df in df_clean[df_clean.y == 1].groupby('shop'):\n",
" print(shop)\n",
" print(tmp_df['childs_tags'].head(1))\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parent-child markup is simple and similar for all shops except 'Перекрёсток'. But we should be ready for that. In their current state, we can't use these features."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### text"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[df_clean.text.notna()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 'text' feature is empty, obviously due to a bug in the crawler. The feature could be very important: if we see words like \"cheese\", \"milk\", \"meat\" inside, it's definitely a grocery. But such a model would be not just single-language, it would be domain-specific: when we crawl a construction materials shop, there won't be any \"cheese\". I'd like to build a model robust to domain and language variations - I want to parse shops, that's all."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### img_seen"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[df_clean.y == 1]['img_seen'].mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[df_clean.y == 0]['img_seen'].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Yes, there is an image inside each product card"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### href_seen"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[df_clean.y == 1]['href_seen'].mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[df_clean.y == 0]['href_seen'].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And there is an href inside each product card"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 5. Feature engineering and description"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some feature processing was done above, where we obtained the size and location fields."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some parts of the feature engineering were done at the crawler stage. Features like href_seen and img_seen mean that some child element has an img or href inside, however deep. Perhaps only 2-3-4 levels of descendants should count as \"an image is near\", not all the way down.\n",
"\n",
"The main limitation right now is that the dataset does not store relations between elements. I saved each row's id, but not a reference saying that this id is a parent of another row. Fixing this in the next iteration will allow me to create more relational features.\n",
"\n",
"But even now we can recreate the order of parent tags."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean['parent_1'] = df_clean.parent_tag.apply(lambda x: x[-1])\n",
"df_clean['parent_2'] = df_clean.parent_tag.apply(lambda x: x[-2] if len(x) > 1 else '')\n",
"df_clean['parent_3'] = df_clean.parent_tag.apply(lambda x: x[-3] if len(x) > 2 else '')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[df_clean.y == 1].sample(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With a parent-id reference I'll be able to add features like uncle_count (the parent's siblings =)), parent_size, parent_location, parent_class, etc. For now it's not feasible."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another potentially good feature might be the number of \"similar\" objects on the page - similar in terms of CSS class names, or in terms of size and location. But there is no \"url\" feature in the dataset, so I can't implement it now.\n",
"\n",
"You might wonder why I haven't fixed these issues in the crawler. A quick pipeline is a crucial property of building an ML system: until I have metric results, I won't let myself get stuck endlessly improving the earlier steps."
]
},
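{
"cell_type": "markdown",
"metadata": {},
"source": [
"If a page identifier were available, the \"similar objects\" count could be sketched like this (toy data; 'page_id' is a hypothetical column the current dataset lacks):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy data: three near-identical cards plus one unrelated element on one page.\n",
"toy = pd.DataFrame({\n",
"    'page_id': [1, 1, 1, 1],\n",
"    'size_w': [250, 252, 251, 600],\n",
"    'size_h': [400, 401, 405, 80],\n",
"})\n",
"# Bucket sizes to a 10px grid so near-identical cards fall into the same group.\n",
"toy['w_bin'] = toy['size_w'] // 10\n",
"toy['h_bin'] = toy['size_h'] // 10\n",
"toy['similar_count'] = toy.groupby(['page_id', 'w_bin', 'h_bin'])['w_bin'].transform('size')\n",
"toy['similar_count'].tolist()  # [3, 3, 3, 1]"
]
},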
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use element_classes_words, but since this is close to working with a vocabulary, we should build it only after the validation split."
]
},
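{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of that later step (illustrative strings, not our real class names): fit the vocabulary on the training fold only, then transform both folds, so words unseen in training are simply ignored:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"# Fit the vocabulary on the training fold only, then transform both folds;\n",
"# tokens unseen during training ('tile' here) are ignored at validation time.\n",
"train_words = ['product card item', 'catalog grid']\n",
"valid_words = ['product tile']\n",
"vec = CountVectorizer()\n",
"X_tr = vec.fit_transform(train_words)\n",
"X_val = vec.transform(valid_words)\n",
"print(sorted(vec.vocabulary_))  # ['card', 'catalog', 'grid', 'item', 'product']\n",
"print(X_val.toarray())          # [[0 0 0 0 1]] - only 'product' is counted"
]
},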
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 6. Metrics selection"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean.y.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, I want to point out that the loss function and the metric score may differ."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### metric"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have a classification problem with unbalanced classes, so using the simplest possible metric, accuracy, is not a good idea. Neither false positives nor false negatives have priority here: when the new crawler misses correct product cards, some information is lost; when it wrongly identifies a product card, garbage gets stored in the database. On top of this product card identifier we will later build product name, price, image and category identifiers, so garbage causes problems down the line. For now I treat FP and FN as having the same weight.\n",
"\n",
"So we probably want to check a few metrics to understand our model's behavior: ROC AUC, PR AUC, F1 score and the confusion matrix. It's impossible to get better at all metrics at once, so we will pick exactly one later."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I like PR AUC the most. PR is short for Precision-Recall; the curves are somewhat similar to ROC curves.\n",
"\n",
"A ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR).\n",
"\n",
"A PR curve plots Precision against Recall.\n",
"\n",
"PR does not account for true negatives (TN is not a component of either Precision or Recall). So if there are many more negatives than positives (a characteristic of the class imbalance problem), we should use PR.\n",
"\n",
"http://www.chioka.in/differences-between-roc-auc-and-pr-auc/"
]
},
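{
"cell_type": "markdown",
"metadata": {},
"source": [
"A tiny numeric illustration of this point (synthetic scores, not our model's output): one positive outranked by a single negative. Adding more easy negatives inflates ROC AUC, while average precision does not move."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Synthetic scores: one positive outranked by a single negative.\n",
"y_small = np.array([1, 0, 0, 0, 0])\n",
"s_small = np.array([0.8, 0.9, 0.3, 0.2, 0.1])\n",
"print(roc_auc_score(y_small, s_small), average_precision_score(y_small, s_small))  # 0.75 0.5\n",
"\n",
"# The same ranking mistake with four extra easy negatives at the bottom.\n",
"y_big = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0])\n",
"s_big = np.array([0.8, 0.9, 0.3, 0.2, 0.1, 0.09, 0.08, 0.07, 0.06])\n",
"print(roc_auc_score(y_big, s_big), average_precision_score(y_big, s_big))  # 0.875 0.5"
]
},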
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### loss"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Logarithmic loss is the default option, but it may be the wrong one here. Let me remind you why."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Image(url='https://habrastorage.org/webt/p7/u6/cp/p7u6cppqgqauzlsenfrsfdymygw.png', width=900)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These are two different elements, one marked as True and the other as False. It's a \"div\" inside an \"li\": almost exactly the same shape, almost exactly the same position. Would it be fair to penalize our model for predicting both as product cards?"
]
},
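{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make that concrete, a minimal illustration (toy numbers, not our model): log loss punishes a confident score on one of two contradictory-labeled twins very heavily."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"# Per-example log loss: -log(p) if y == 1 else -log(1 - p).\n",
"def log_loss_one(y, p):\n",
"    return -math.log(p) if y == 1 else -math.log(1 - p)\n",
"\n",
"# Two near-identical elements with contradictory labels; a confident model says 0.95 for both.\n",
"print(round(log_loss_one(1, 0.95), 3))  # 0.051 - tiny penalty on the True one\n",
"print(round(log_loss_one(0, 0.95), 3))  # 2.996 - huge penalty on its False twin"
]
},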
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 7. Model selection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We don't have tons of features or tons of data, so it's a firm NO to neural nets and boosting. The data is rather simple, so the model should match. From the previous analysis we saw that the parent tag in combination with siblings_count can mean a lot (if our target \"div\" is inside an \"li\", siblings mean nothing; if a div is inside a div, siblings correlate with the target); such complicated feature interactions can't be captured by a linear model. Random forests should work like a charm (I'll give the linear model a try as well)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 8. Data preprocessing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean.head(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\"childs_tags\" can't be used. \"text\" is always NaN. \"screenshot\" is used only for visual validation. \"shop\" can't be used as a feature, since we are targeting new shops (but I need it for a while). \"element_classes_words\" will be used a little later. All other fields were dropped earlier. I'm holding y inside X for a while, bear with me."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"features_to_train = ['depth', 'element_classes_count', 'href_seen', 'img_seen',\n",
" 'inner_elements', 'is_href', 'siblings_count', 'tag', 'loc_x',\n",
" 'loc_y', 'size_h', 'size_w', 'parent_1','parent_2', 'parent_3', 'shop', 'y']\n",
"categorical_columns = ['tag', 'parent_1','parent_2','parent_3']\n",
"continuous_columns = ['depth', 'element_classes_count','inner_elements','siblings_count',\n",
" 'loc_x','loc_y', 'size_h', 'size_w']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X = df_clean[features_to_train]\n",
"y = df_clean.y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For RF we don't have to do one-hot encoding, so I'll just convert the strings to ints."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Build the vocabulary from every column that holds tags, so that parent_2 and\n",
"# parent_3 values are guaranteed to be present in the mapping as well.\n",
"all_tags = sorted(set(X['tag']) | set(X['parent_1']) | set(X['parent_2']) | set(X['parent_3']))\n",
"num_to_category = {i: cat for i, cat in enumerate(all_tags)}\n",
"category_to_num = {cat: i for i, cat in num_to_category.items()}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X['tag'] = X['tag'].apply(lambda x: category_to_num[x])\n",
"X['parent_1'] = X['parent_1'].apply(lambda x: category_to_num[x] if x else -1)\n",
"X['parent_2'] = X['parent_2'].apply(lambda x: category_to_num[x] if x else -1)\n",
"X['parent_3'] = X['parent_3'].apply(lambda x: category_to_num[x] if x else -1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X.head(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Preprocessing for linear model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_linear = df_clean[features_to_train]\n",
"y = df_clean.y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For linear models, we need scaled numeric features and one-hot encoding for the categorical ones."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_linear = pd.get_dummies(X_linear, columns=categorical_columns, drop_first=True,\n",
" prefix=categorical_columns, sparse=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"scaler = StandardScaler()\n",
"X_linear[continuous_columns] = scaler.fit_transform(X_linear[continuous_columns])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_linear.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have data prepared for both the linear and the RF model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 9. Cross-validation and adjustment of model hyperparameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def print_scores(y_true, preds, verbose=True):\n",
" f1 = f1_score(y_true, preds)\n",
" roc_auc = roc_auc_score(y_true, preds)\n",
" pr_auc = average_precision_score(y_true, preds)\n",
" if verbose:\n",
" print(f'F1 score is {f1}')\n",
" print(f'ROC_AUC score is {roc_auc}')\n",
" print(f'PR_AUC is {pr_auc}')\n",
" print(f'confusion_matrix \\n {confusion_matrix(y_true, preds)}')\n",
" return f1, roc_auc, pr_auc"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The usual approach to splitting the data looks like this. Don't forget to drop the target variable and the shop column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train,X_val,y_train,y_val = train_test_split(X.drop(['shop', 'y'],axis=1),y, random_state=42)\n",
"X_train.shape,X_val.shape,y_train.shape,y_val.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Random forest"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rf = RandomForestClassifier()\n",
"rf.fit(X_train,y_train)\n",
"preds = rf.predict(X_val)\n",
"print_scores(y_val, preds);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OMG, we are perfect! Actually, no. The only way to build a reasonable test is to hold out all the examples from one shop. It doesn't matter how many examples we leave out, 5% or 50%: in production the model will be trained on the shops we have and applied to shops we have never crawled before, so validation must simulate exactly that.\n",
"\n",
"So let's do it that way."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I'll implement the cross-validation score by hand."
]
},
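{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, scikit-learn ships the same leave-one-shop-out idea as LeaveOneGroupOut; a minimal sketch on toy data (the arrays below are illustrative stand-ins for our features, labels and shop names):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import LeaveOneGroupOut\n",
"\n",
"X_toy = np.arange(12).reshape(6, 2)\n",
"y_toy = np.array([0, 1, 0, 1, 0, 1])\n",
"groups_toy = np.array(['A', 'A', 'B', 'B', 'C', 'C'])  # shop labels\n",
"\n",
"folds = list(LeaveOneGroupOut().split(X_toy, y_toy, groups_toy))\n",
"for train_idx, val_idx in folds:\n",
"    # every fold holds out exactly one 'shop'\n",
"    print(set(groups_toy[val_idx]), '->', len(train_idx), 'train rows')"
]
},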
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"diff_shops_df = [ (shop_name, shop_df) for shop_name, shop_df in X.groupby('shop')]\n",
"for shop_name, shop_df in diff_shops_df:\n",
" y_cross_val = shop_df.y\n",
" X_cross_val = shop_df.drop(['y', 'shop'],axis=1) # validation only 1 shop\n",
" X_cross_train = X.drop(X_cross_val.index) # train all X without 1 shop\n",
" y_cross_train = X_cross_train.y\n",
" X_cross_train = X_cross_train.drop(['shop', 'y'],axis=1)\n",
" \n",
" rf = RandomForestClassifier(n_estimators=100,max_features='sqrt', criterion='entropy', min_samples_leaf=5,n_jobs=-1,)\n",
" rf.fit(X_cross_train,y_cross_train)\n",
" preds = rf.predict(X_cross_val)\n",
" print(f'**********shop - {shop_name}*************')\n",
" print(f'Train shape {X_cross_train.shape}')\n",
" print(f'Val shape {X_cross_val.shape}')\n",
" print_scores(y_cross_val, preds);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The results are a disaster."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Almost every time we get 100% False predictions; a couple of times there are false positives instead, so let's take a look at them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train_analize = X.drop(X[X.shop == 'Европа'].index)\n",
"X_val_analize = X[X.shop == 'Европа']\n",
"y_train_analize = X_train_analize.y\n",
"y_val_analize = X_val_analize.y\n",
"X_train_analize.drop(['shop','y'],axis=1, inplace=True)\n",
"X_val_analize.drop(['shop','y'],axis=1, inplace=True)\n",
"\n",
"rf = RandomForestClassifier(n_estimators=100,max_features='sqrt', criterion='entropy', min_samples_leaf=5,n_jobs=-1,)\n",
"rf.fit(X_train_analize,y_train_analize)\n",
"preds = rf.predict(X_val_analize)\n",
"print_scores(y_val_analize, preds);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_val_analize[y_val_analize == True].head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_val_analize[preds == True].head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compare the indexes of the predicted positives with the true ones. Suspiciously, they differ by 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[df_clean.index.isin(X_val_analize[preds == True].index)].head(3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# screen for df_clean[df_clean.index.isin(X_val_analize[preds == True].index)].screenshot\n",
"Image(url='https://habrastorage.org/webt/x1/no/c5/x1noc54flobtv8lyf3ip66te05k.png', width=900)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Yep, these are the correct elements, but some confusion with the nested div-div-li-article structure arose. To a human the classification looks correct, but the machine sees it differently.\n",
"\n",
"From the confusion matrix we can see FN == 360 and FP == 369, so maybe only 9 elements were genuinely misclassified. I think that's because we label as correct only the product cards in the main catalog, while there are always some promos on the sides."
]
},
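{
"cell_type": "markdown",
"metadata": {},
"source": [
"A reminder on how to read those FN/FP numbers: sklearn's confusion_matrix is laid out as [[TN, FP], [FN, TP]]. A tiny sanity check with made-up labels:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"\n",
"y_true_toy = [0, 0, 1, 1, 1]\n",
"y_pred_toy = [0, 1, 1, 1, 0]\n",
"tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()\n",
"print(tn, fp, fn, tp)  # 1 1 1 2"
]
},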
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's check 'Метро' as well."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train_analize = X.drop(X[X.shop == 'Метро'].index)\n",
"X_val_analize = X[X.shop == 'Метро']\n",
"y_train_analize = X_train_analize.y\n",
"y_val_analize = X_val_analize.y\n",
"X_train_analize.drop(['shop','y'],axis=1, inplace=True)\n",
"X_val_analize.drop(['shop','y'],axis=1, inplace=True)\n",
"\n",
"rf = RandomForestClassifier(n_estimators=100,max_features='sqrt', criterion='entropy', min_samples_leaf=5,n_jobs=-1,)\n",
"rf.fit(X_train_analize,y_train_analize)\n",
"preds = rf.predict(X_val_analize)\n",
"print_scores(y_val_analize, preds);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_val_analize[y_val_analize == True].head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_val_analize[preds == True].head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean[df_clean.index.isin(X_val_analize[preds == True].index)].head(3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# screen for df_clean[df_clean.index.isin(X_val_analize[preds == True].index)].screenshot\n",
"Image(url='https://habrastorage.org/webt/to/4g/uw/to4guwglcb6epg6pf12bjehd69k.png', width=900)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Same story with 'Метро': correct cards, but the wrong elements.\n",
"\n",
"However, the confusion matrix is worse: FN == 520, FP == 209. We are missing a lot."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All the other shops do much worse: every single prediction for them is False, which makes them harder to analyze."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train_analize = X.drop(X[X.shop == 'Окей'].index)\n",
"X_val_analize = X[X.shop == 'Окей']\n",
"y_train_analize = X_train_analize.y\n",
"y_val_analize = X_val_analize.y\n",
"X_train_analize.drop(['shop','y'],axis=1, inplace=True)\n",
"X_val_analize.drop(['shop','y'],axis=1, inplace=True)\n",
"\n",
"rf = RandomForestClassifier(n_estimators=100,max_features='sqrt', criterion='entropy', min_samples_leaf=5,n_jobs=-1,)\n",
"rf.fit(X_train_analize,y_train_analize)\n",
"preds = rf.predict(X_val_analize)\n",
"print_scores(y_val_analize, preds);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extra module pip install treeinterpreter\n",
"from treeinterpreter import treeinterpreter as ti"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_example = X_val_analize[y_val_analize == False].sample(1)\n",
"print(rf.predict_proba(test_example))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_example = X_val_analize[y_val_analize == True].sample(1)\n",
"print(rf.predict_proba(test_example))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the probabilities we do see a difference between the correct class and the incorrect one. The forest picks the most probable class, but maybe we can leverage another approach."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prediction, bias, contributions = ti.predict(rf, test_example)\n",
"prediction, bias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The forest is heavily biased towards False, so let's pick the elements whose probability of being True is not > 0.5 but > bias."
]
},
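{
"cell_type": "markdown",
"metadata": {},
"source": [
"The bias term returned by treeinterpreter is essentially the class prior of the training set, so the idea can be sketched without the library: threshold predict_proba at the prior instead of 0.5. A toy illustration on synthetic data (all names and numbers below are made up):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"rng = np.random.RandomState(0)\n",
"X_syn = rng.rand(200, 3)\n",
"y_syn = np.zeros(200, dtype=bool)\n",
"y_syn[:10] = True  # 10 positives out of 200: prior = 0.05\n",
"\n",
"clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_syn, y_syn)\n",
"prior = y_syn.mean()  # stands in for treeinterpreter's bias[0][1]\n",
"probs = clf.predict_proba(X_syn)[:, 1]\n",
"# thresholding at the prior flags everything scored above the base rate,\n",
"# which recovers at least as many candidates as the default 0.5 cut-off\n",
"print((probs > prior).sum(), (probs > 0.5).sum())"
]
},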
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"preds_proba = rf.predict_proba(X_val_analize)\n",
"true_class_probs = preds_proba[:,1]\n",
"(true_class_probs > 0.023).sum()  # 0.023 is roughly the bias term printed above"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1890 elements. Too many for one shop, but never mind; let's analyze further."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# screen for df_clean[df_clean.index.isin(X_val_analize[true_class_probs > 0.023].index)].screenshot\n",
"Image(url='https://habrastorage.org/webt/yk/sk/t5/ykskt5lziqsgazssbkkk8vgm3w8.png', width=900)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# screen for df_clean[df_clean.index.isin(X_val_analize[true_class_probs > 0.023].index)].screenshot\n",
"Image(url='https://habrastorage.org/webt/yd/aj/kz/ydajkzlfdllxmdmus-v5vunfwcm.png', width=900)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are 778 true elements for 'Окей', while we predict 1890 above the bias. Look closer at the pics of the correct elements: some of them are ~330 px high, some ~300. It's our favorite div-in-div problem again, and I assume there are 2 predictions for every correct element. So we have 778*2 = 1556 \"correct\" predictions and 1890-1556 = 334 FP. Of course, I could tweak the confidence level however I want, but the bias term gives me a clear signal about the shop's distribution.\n",
"\n",
"Let's adopt predictions above the bias level."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"diff_shops_df = [ (shop_name, shop_df) for shop_name, shop_df in X.groupby('shop')]\n",
"f1s = list()\n",
"rocs = list()\n",
"prs = list()\n",
"for shop_name, shop_df in diff_shops_df:\n",
" y_cross_val = shop_df.y\n",
" X_cross_val = shop_df.drop(['y', 'shop'],axis=1) # validation only 1 shop\n",
" X_cross_train = X.drop(X_cross_val.index) # train all X without 1 shop\n",
" y_cross_train = X_cross_train.y\n",
" X_cross_train = X_cross_train.drop(['shop', 'y'],axis=1)\n",
" \n",
" rf = RandomForestClassifier(n_estimators=100,max_features='sqrt', criterion='entropy', min_samples_leaf=5,n_jobs=-1,)\n",
" rf.fit(X_cross_train,y_cross_train)\n",
"    _, bias, _ = ti.predict(rf, test_example)  # any single example works: we only need the bias term\n",
" bias_treshold = bias[0][1]\n",
" preds_proba = rf.predict_proba(X_cross_val)\n",
" preds = (preds_proba[:,1] > bias_treshold)\n",
" print(f'**********shop - {shop_name}*************')\n",
" f1, roc_auc, pr_auc = print_scores(y_cross_val, preds);\n",
" f1s.append(f1)\n",
" rocs.append(roc_auc)\n",
" prs.append(pr_auc)\n",
"print(f'Mean f1 - {np.array(f1s).mean()}')\n",
"print(f'Mean ROC AUC - {np.array(rocs).mean()}')\n",
"print(f'Mean PR AUC - {np.array(prs).mean()}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Linear Classification"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"diff_shops_df = [ (shop_name, shop_df) for shop_name, shop_df in X_linear.groupby('shop')]\n",
"f1s = list()\n",
"rocs = list()\n",
"prs = list()\n",
"for shop_name, shop_df in diff_shops_df:\n",
" y_cross_val = shop_df.y\n",
" X_cross_val = shop_df.drop(['y', 'shop'],axis=1) # validation only 1 shop\n",
" X_cross_train = X_linear.drop(X_cross_val.index) # train all X without 1 shop\n",
" y_cross_train = X_cross_train.y\n",
" X_cross_train = X_cross_train.drop(['shop', 'y'],axis=1)\n",
" \n",
" sgd_l1 = SGDClassifier(loss=\"hinge\", penalty=\"l2\", max_iter=5)\n",
" sgd_l1.fit(X_cross_train,y_cross_train)\n",
" preds = sgd_l1.predict(X_cross_val)\n",
" print(f'**********shop - {shop_name}*************')\n",
" f1, roc_auc, pr_auc = print_scores(y_cross_val, preds);\n",
" f1s.append(f1)\n",
" rocs.append(roc_auc)\n",
" prs.append(pr_auc)\n",
"print(f'Mean f1 - {np.array(f1s).mean()}')\n",
"print(f'Mean ROC AUC - {np.array(rocs).mean()}')\n",
"print(f'Mean PR AUC - {np.array(prs).mean()}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This seems more promising out of the box. Let's tune the hyperparameters.\n",
"\n",
"I need a special way to do CV: the train and test folds are supposed to contain different shops. GroupShuffleSplit offers the required functionality; let's see how it works."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import GroupShuffleSplit"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_tmp = X.y\n",
"X_tmp = X.drop(['y', 'shop'], axis=1)\n",
"g = GroupShuffleSplit(n_splits=5,random_state=56)\n",
"itr = g.split(X_tmp, y=y_tmp, groups=X.shop.values)\n",
"for tr, tst in itr:\n",
" print(tr.shape, tst.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are repeated splits, and not all 5 shops show up as validation data. I can't sacrifice even 1 shop for validation; there are too few of them.\n",
"\n",
"So here is a quick and dirty grid search over leave-one-shop-out splits instead."
]
},
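{
"cell_type": "markdown",
"metadata": {},
"source": [
"The triple-nested loop over hyperparameters below can also be written with itertools.product, which the notebook already imports; a quick sketch of just the grid enumeration:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import itertools\n",
"\n",
"losses = ['hinge', 'log', 'perceptron']\n",
"penalties = ['l2', 'l1', 'elasticnet']\n",
"max_iters = [5, 10, 20, 40]\n",
"\n",
"grid = list(itertools.product(losses, penalties, max_iters))\n",
"print(len(grid))   # 3 * 3 * 4 = 36 combinations\n",
"print(grid[0])     # ('hinge', 'l2', 5)"
]
},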
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"diff_shops_df = [ (shop_name, shop_df) for shop_name, shop_df in X_linear.groupby('shop')]\n",
"best_scores = dict()\n",
"losses = ['hinge', 'log', 'perceptron']\n",
"penalties = ['l2', 'l1', 'elasticnet']\n",
"max_iter = [5,10,20,40]\n",
"for l in losses:\n",
"    for p in penalties:\n",
"        for it in max_iter:\n",
"            # reset the score lists per configuration, otherwise the means\n",
"            # accumulate scores across all previously tried configurations\n",
"            f1s, rocs, prs = list(), list(), list()\n",
"            for shop_name, shop_df in diff_shops_df:\n",
"                y_cross_val = shop_df.y\n",
"                X_cross_val = shop_df.drop(['y', 'shop'],axis=1) # validation only 1 shop\n",
"                X_cross_train = X_linear.drop(X_cross_val.index) # train all X without 1 shop\n",
"                y_cross_train = X_cross_train.y\n",
"                X_cross_train = X_cross_train.drop(['shop', 'y'],axis=1)\n",
"\n",
"                sgd_l1 = SGDClassifier(loss=l, penalty=p, max_iter=it)\n",
"                sgd_l1.fit(X_cross_train,y_cross_train)\n",
"                preds = sgd_l1.predict(X_cross_val)\n",
"                f1, roc_auc, pr_auc = print_scores(y_cross_val, preds, verbose=False)\n",
"                f1s.append(f1)\n",
"                rocs.append(roc_auc)\n",
"                prs.append(pr_auc)\n",
"            f1_mean = np.array(f1s).mean()\n",
"            roc_mean = np.array(rocs).mean()\n",
"            pr_mean = np.array(prs).mean()\n",
"            best_scores[l+p+str(it)] = [f1_mean, roc_mean, pr_mean]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f'best by f1 {max(best_scores.items(), key=lambda item: item[1][0])}')\n",
"print(f'best by roc {max(best_scores.items(), key=lambda item: item[1][1])}')\n",
"print(f'best by pr {max(best_scores.items(), key=lambda item: item[1][2])}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The best hyperparameters are log loss, elastic net regularization, and 40 iterations. 40 was the maximum in our grid, so we could probably push it even higher."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train_analize = X_linear.drop(X_linear[X_linear.shop == 'Окей'].index)\n",
"X_val_analize = X_linear[X_linear.shop == 'Окей']\n",
"y_train_analize = X_train_analize.y\n",
"y_val_analize = X_val_analize.y\n",
"X_train_analize.drop(['shop','y'],axis=1, inplace=True)\n",
"X_val_analize.drop(['shop','y'],axis=1, inplace=True)\n",
"\n",
"sgd_l1 = SGDClassifier(loss='log', penalty='elasticnet', max_iter=40)\n",
"sgd_l1.fit(X_train_analize,y_train_analize)\n",
"preds = sgd_l1.predict(X_val_analize)\n",
"print_scores(y_val_analize, preds);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train_analize = X_linear.drop(X_linear[X_linear.shop == 'Метро'].index)\n",
"X_val_analize = X_linear[X_linear.shop == 'Метро']\n",
"y_train_analize = X_train_analize.y\n",
"y_val_analize = X_val_analize.y\n",
"X_train_analize.drop(['shop','y'],axis=1, inplace=True)\n",
"X_val_analize.drop(['shop','y'],axis=1, inplace=True)\n",
"\n",
"sgd_l1 = SGDClassifier(loss='log', penalty='elasticnet', max_iter=40)\n",
"sgd_l1.fit(X_train_analize,y_train_analize)\n",
"preds = sgd_l1.predict(X_val_analize)\n",
"print_scores(y_val_analize, preds);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"'Метро' still fails hard, but out of habit let's check what it predicts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_val_analize[preds == True].head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_val_analize[y_val_analize == True].head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After all the transformations it's hard to deduce the pattern behind the wrong predictions from the features alone. Luckily, we have our screenshots."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# screen for df_clean[df_clean.index.isin(X_val_analize[preds == True].index)].screenshot\n",
"Image(url='https://habrastorage.org/webt/5j/jb/hg/5jjbhgbqzophvn8pdl_2wxicpky.png', width=900)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have two models, RF and LR, with the following scores:\n",
"\n",
"| - | RF | LR |\n",
"|-------|------|-----|\n",
"| f1 | 0.449|0.255|\n",
"|ROC AUC| 0.907|0.632|\n",
"| PR AUC| 0.280|0.240|"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The good thing is that our models work pretty well; the bad thing is that you can't tell that from the metric scores."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 10. Plotting training and validation curves"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For now I'll pick PR AUC as the metric."
]
},
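{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why PR AUC? Under heavy class imbalance it punishes a single high-scoring negative much harder than ROC AUC does. A small deterministic illustration with made-up scores (8 negatives, 2 positives, one positive outranked by a negative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.metrics import average_precision_score, roc_auc_score\n",
"\n",
"y_true_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])\n",
"scores_toy = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.5, 0.6, 0.7, 0.65, 0.9])\n",
"\n",
"roc = roc_auc_score(y_true_toy, scores_toy)\n",
"ap = average_precision_score(y_true_toy, scores_toy)  # 'PR AUC'\n",
"print(roc, ap)  # 0.9375 vs ~0.833: the one slip hurts PR AUC more"
]
},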
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### RF parameter n_estimators"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train_analize = X.drop(X[X.shop == 'Метро'].index)\n",
"X_val_analize = X[X.shop == 'Метро']\n",
"y_train_analize = X_train_analize.y\n",
"y_val_analize = X_val_analize.y\n",
"X_train_analize.drop(['shop','y'],axis=1, inplace=True)\n",
"X_val_analize.drop(['shop','y'],axis=1, inplace=True)\n",
"\n",
"trees_num = np.linspace(1,500, dtype=int)\n",
"pr_score_val = list()\n",
"pr_score_train = list()\n",
"for tr_num in trees_num:\n",
" rf = RandomForestClassifier(n_estimators=tr_num,max_features='sqrt', criterion='entropy', min_samples_leaf=5,n_jobs=-1,)\n",
" rf.fit(X_train_analize,y_train_analize)\n",
" _, bias, _ = ti.predict(rf,test_example) \n",
" bias_treshold = bias[0][1]\n",
" preds_proba_val = rf.predict_proba(X_val_analize)\n",
" preds_val = (preds_proba_val[:,1] > bias_treshold)\n",
"    _,_, pr_auc_val = print_scores(y_val_analize, preds_val, verbose=False)\n",
" \n",
" preds_proba_train = rf.predict_proba(X_train_analize)\n",
" preds_train = (preds_proba_train[:,1] > bias_treshold)\n",
"    _,_, pr_auc_train = print_scores(y_train_analize, preds_train, verbose=False)\n",
" \n",
" pr_score_train.append(pr_auc_train)\n",
" pr_score_val.append(pr_auc_val)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots()\n",
"ax.plot(trees_num,pr_score_train, label='train')\n",
"ax.plot(trees_num,pr_score_val, label='valid')\n",
"ax.legend()\n",
"ax.set(xlabel='N_estimators', ylabel='PR AUC', );"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that the forest is heavily overfitted. Let's try to reduce the overfitting with min_samples_leaf."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### RF parameter min_samples_leaf"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train_analize = X.drop(X[X.shop == 'Метро'].index)\n",
"X_val_analize = X[X.shop == 'Метро']\n",
"y_train_analize = X_train_analize.y\n",
"y_val_analize = X_val_analize.y\n",
"X_train_analize.drop(['shop','y'],axis=1, inplace=True)\n",
"X_val_analize.drop(['shop','y'],axis=1, inplace=True)\n",
"\n",
"min_leaf_list = np.linspace(1,100, dtype=int)\n",
"pr_score_val = list()\n",
"pr_score_train = list()\n",
"for leaf_num in min_leaf_list:\n",
" rf = RandomForestClassifier(n_estimators=500,max_features='sqrt', criterion='entropy', min_samples_leaf=leaf_num,n_jobs=-1,)\n",
" rf.fit(X_train_analize,y_train_analize)\n",
" _, bias, _ = ti.predict(rf,test_example) \n",
" bias_treshold = bias[0][1]\n",
" preds_proba_val = rf.predict_proba(X_val_analize)\n",
" preds_val = (preds_proba_val[:,1] > bias_treshold)\n",
"    _,_, pr_auc_val = print_scores(y_val_analize, preds_val, verbose=False)\n",
" \n",
" preds_proba_train = rf.predict_proba(X_train_analize)\n",
" preds_train = (preds_proba_train[:,1] > bias_treshold)\n",
"    _,_, pr_auc_train = print_scores(y_train_analize, preds_train, verbose=False)\n",
" \n",
" pr_score_train.append(pr_auc_train)\n",
" pr_score_val.append(pr_auc_val)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots()\n",
"ax.plot(min_leaf_list,pr_score_train, label='train')\n",
"ax.plot(min_leaf_list,pr_score_val, label='valid')\n",
"ax.legend()\n",
"ax.set(xlabel='min_samples_leaf', ylabel='PR AUC', );"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So min_samples_leaf doesn't help much."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Maybe we need more data?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_tmp = X.y\n",
"X_tmp = X.drop(['y', 'shop'], axis=1)\n",
"groups = X.shop.values\n",
"\n",
"def plot_with_err(x, data, **kwargs):\n",
" mu, std = data.mean(1), data.std(1)\n",
" lines = plt.plot(x, mu, '-', **kwargs)\n",
" plt.fill_between(x, mu - std, mu + std, edgecolor='none',\n",
" facecolor=lines[0].get_color(), alpha=0.2)\n",
"\n",
"def plot_learning_curve(classifier = 'rf'):\n",
" if classifier == 'rf':\n",
" classif = RandomForestClassifier(n_estimators=100,max_features='sqrt', criterion='entropy', min_samples_leaf=25,n_jobs=-1,)\n",
" else:\n",
" classif = SGDClassifier(loss='log', penalty='elasticnet', max_iter=40)\n",
"    N_train, val_train, val_test = learning_curve(classif, X_tmp, y_tmp,\n",
"                                                  # a plain cv=5 would ignore groups; split by shop explicitly\n",
"                                                  cv=GroupShuffleSplit(n_splits=5, random_state=56),\n",
"                                                  groups=groups, shuffle=True, scoring='average_precision')\n",
" plot_with_err(N_train, val_train, label='training scores')\n",
" plot_with_err(N_train, val_test, label='validation scores')\n",
" plt.xlabel('Training Set Size'); plt.ylabel('PR AUC')\n",
" plt.legend()\n",
" plt.grid(True);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plot_learning_curve(classifier='rf')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plot_learning_curve(classifier='lr')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These graphs may be misleading. I can't really interpret them; the variance is too high.\n",
"\n",
"There are ~100k training examples, but in fact we only have 5 shops, and each of them has only 1 type of product card. So it's not really 100k examples; it's more like 5. Additional data would be very, very helpful."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 11. Prediction for test or hold-out samples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a test, we are going to crawl one more site: Auchan. There won't be any feature checking; we'll just look at the pics of the product cards to make sure the crawler worked correctly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_test = pd.read_csv('test_sites_markup.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# screen for df_test[(df_test.y == 1)].screenshot\n",
"Image(url='https://habrastorage.org/webt/sv/-m/ac/sv-macheei3wqmgkpdojbuky734.png', width=900)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Data preparation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_test[['loc_x', 'loc_y']] = df_test['location'].str.replace(\"'\", '\"').apply(json.loads).apply(pd.Series)\n",
"df_test[['size_h', 'size_w']] = df_test['size'].str.replace(\"'\", '\"').apply(json.loads).apply(pd.Series)\n",
"df_test.size_h = df_test.size_h.astype(int)\n",
"df_test.size_w = df_test.size_w.astype(int)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_test['parent_tag'] = df_test.parent_tag.str.split('_')\n",
"df_test['childs_tags'] = df_test.childs_tags.apply(lambda x: x.replace('[', '').replace(']','').strip())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean_test = df_test.copy()\n",
"df_clean_test.drop(['location', 'size', 'Unnamed: 0'], axis=1, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean_test['parent_1'] = df_clean_test.parent_tag.apply(lambda x: x[-1])\n",
"df_clean_test['parent_2'] = df_clean_test.parent_tag.apply(lambda x: x[-2] if len(x) > 1 else '')\n",
"df_clean_test['parent_3'] = df_clean_test.parent_tag.apply(lambda x: x[-3] if len(x) > 2 else '')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_test = df_clean_test[features_to_train]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_test['tag'] = X_test['tag'].apply(lambda x: category_to_num[x] if x in category_to_num else -1)\n",
"X_test['parent_1'] = X_test['parent_1'].apply(lambda x: category_to_num[x] if (x) and (x in category_to_num) else -1)\n",
"X_test['parent_2'] = X_test['parent_2'].apply(lambda x: category_to_num[x] if (x) and (x in category_to_num) else -1)\n",
"X_test['parent_3'] = X_test['parent_3'].apply(lambda x: category_to_num[x] if (x) and (x in category_to_num) else -1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Prediction"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train = X\n",
"y_train = X_train.y\n",
"X_train = X_train.drop(['shop', 'y'],axis=1)\n",
"\n",
"y_test = X_test.y\n",
"X_test = X_test.drop(['shop', 'y'],axis=1)\n",
"\n",
"test_example = X_test.head(1)\n",
"rf = RandomForestClassifier(n_estimators=400,max_features='sqrt', criterion='entropy', min_samples_leaf=25,n_jobs=-1,)\n",
"rf.fit(X_train,y_train)\n",
"_, bias, _ = ti.predict(rf,test_example) \n",
"bias_treshold = bias[0][1]\n",
"preds_proba = rf.predict_proba(X_test)\n",
"preds = (preds_proba[:,1] > bias_treshold)\n",
"f1, roc_auc, pr_auc = print_scores(y_test, preds);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# screen for df_test[df_test.index.isin(X_test[preds == True].index)].screenshot\n",
"Image(url='https://habrastorage.org/webt/_5/_n/uk/_5_nuk1iphmid6sfplw7hmcph1c.png', width=900)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sweet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 12. Conclusions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We tried ML for recognizing elements in HTML markup, and it worked out.\n",
"\n",
"Along the way we understood how to improve data collection: which features we are lacking and which were constructed incorrectly.\n",
"\n",
"At the start we couldn't say which errors matter more, false positives or false negatives. After our journey we understand that FN are more critical and that we'll have to live with some FP, which means we'll build our data pipeline with this requirement in mind: be robust to FP.\n",
"\n",
"The 100k-example training set turned out to be effectively a ~5-example one, because the correct cards within a single shop are almost identical. It doesn't matter that we have 10k cards per shop; it would be better to have 1 card from each of 10k shops (though that's impossible).\n",
"\n",
"Our metrics didn't work out (except the confusion matrix), because the training data contains examples labeled as wrong that are in fact identical to the correct ones (the div-div-ul issue). It's a good opportunity to design a custom loss function that penalizes softly when a prediction is close to the target element. Maybe it's even worth reformulating the problem as regression, where we would predict the 'distance' to the correct element.\n",
"\n",
"In model selection RF won, but it's too early to call it, because we still have problems with our metrics and dataset."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}