{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Mercari Price Estimates/Suggestions\n", "\n", "This is a nice dataset that combines textual descriptions and pricing. Almost like a hedonic pricing model, but not quite. Actually it's almost entirely different. But it's similar in that we are converting text into quantifiable features, which is kind of like hedonic pricing again.\n", "\n", "I don't use Mercari, but the data is useful for prototyping the sort of model that could be used on Amazon, Best Buy, and other marketplaces and retailers." ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "from sklearn.preprocessing import LabelEncoder\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n", "from keras.preprocessing.text import Tokenizer\n", "from keras.preprocessing.sequence import pad_sequences\n", "from keras.layers import (Input, Dropout, Dense, concatenate, GRU, Embedding, Flatten,\n", " Activation, SpatialDropout1D, GlobalMaxPooling1D)\n", "from keras.optimizers import Adam\n", "from keras.models import Model\n", "from keras import backend as K\n", "from nltk.corpus import stopwords\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# root mean squared error, assuming all y values are already log transformed\n", "def rmse (y_true, y_pred):\n", " return np.sqrt(np.mean((y_pred-y_true)**2))" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train = pd.read_table('d:/data/price/train.tsv')\n", "test = pd.read_table('d:/data/price/test.tsv')" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { 
"text/plain": [ "Index(['train_id', 'name', 'item_condition_id', 'category_name', 'brand_name',\n", " 'price', 'shipping', 'item_description'],\n", " dtype='object')" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.columns" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1482535 693359\n" ] } ], "source": [ "print (len(train),len(test))" ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cols = ['name', 'item_condition_id', 'category_name', 'brand_name','shipping', 'item_description']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Missing values in train and test sets." ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "name : 0 0\n", "item_condition_id : 0 0\n", "category_name : 6327 3058\n", "brand_name : 632682 295525\n", "shipping : 0 0\n", "item_description : 4 0\n" ] } ], "source": [ "for c in cols:\n", " print(c, ': ', train[c].isnull().sum(), test[c].isnull().sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A lot of items without brand names, which in itself is very informative. The lack of category names for some items could be a hassle, but they represent less than 1% of all observations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 874 0 prices and nothing $\\in$ (0,3), so we should remove these since they are incorrect (Mercari has a $3 lower limit)." ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "874" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(train[train['price']==0])" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
train_idnameitem_condition_idcategory_namebrand_namepriceshippingitem_description
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [train_id, name, item_condition_id, category_name, brand_name, price, shipping, item_description]\n", "Index: []" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train[(train['price']<3) & (train['price']>0)]" ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train.drop(train[train['price']<3.0].index, inplace=True)" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1481661, 8)" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.shape" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train['log_price'] = np.log(train['price'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature quantification/engineering\n", "\n", "Some will need vectorization, some new features will be created." 
] }, { "cell_type": "code", "execution_count": 110, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def char_count(text):\n", " try:\n", " # not a real description\n", " if text == 'No description yet':\n", " return 0\n", " else:\n", " chars = text.lower().replace(' ', '')\n", " return len(chars)\n", " except:\n", " return 0\n", "\n", "def word_count(text):\n", " try:\n", " if text == 'No description yet':\n", " return 0\n", " else:\n", " words = [w for w in text.lower().split()]\n", " return len(words)\n", " except:\n", " return 0" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [], "source": [ "train['desc_words'] = train['item_description'].apply(lambda s: word_count(s))\n", "train['desc_chars'] = train['item_description'].apply(lambda s: char_count(s))\n", "test['desc_words'] = test['item_description'].apply(lambda s: word_count(s))\n", "test['desc_chars'] = test['item_description'].apply(lambda s: char_count(s))\n", "\n", "train['name_words'] = train['name'].apply(lambda s: word_count(s))\n", "train['name_chars'] = train['name'].apply(lambda s: char_count(s))\n", "test['name_words'] = test['name'].apply(lambda s: word_count(s))\n", "test['name_chars'] = test['name'].apply(lambda s: char_count(s))\n", "\n", "train.loc[train['item_description']=='No description yet', 'item_description'] = 'missing'\n", "test.loc[test['item_description']=='No description yet', 'item_description'] = 'missing'" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
train_idnameitem_condition_idcategory_namebrand_namepriceshippingitem_descriptionlog_pricedesc_wordsdesc_charsname_wordsname_chars
00MLB Cincinnati Reds T Shirt Size XL3Men/Tops/T-shirtsNaN10.01missing2.30258500729
11Razer BlackWidow Chroma Keyboard3Electronics/Computers & Tablets/Components & P...Razer52.00This keyboard is in great condition and works ...3.95124436153429
22AVA-VIV Blouse1Women/Tops & Blouses/BlouseTarget10.01Adorable top with a hint of lace and a key hol...2.3025852996213
33Leather Horse Statues1Home/Home Décor/Home Décor AccentsNaN35.01New with tags. Leather horses. Retail for [rm]...3.55534832142319
4424K GOLD plated rose1Women/Jewelry/NecklacesNaN44.00Complete with certificate of authenticity3.784190537417
\n", "
" ], "text/plain": [ " train_id name item_condition_id \\\n", "0 0 MLB Cincinnati Reds T Shirt Size XL 3 \n", "1 1 Razer BlackWidow Chroma Keyboard 3 \n", "2 2 AVA-VIV Blouse 1 \n", "3 3 Leather Horse Statues 1 \n", "4 4 24K GOLD plated rose 1 \n", "\n", " category_name brand_name price \\\n", "0 Men/Tops/T-shirts NaN 10.0 \n", "1 Electronics/Computers & Tablets/Components & P... Razer 52.0 \n", "2 Women/Tops & Blouses/Blouse Target 10.0 \n", "3 Home/Home Décor/Home Décor Accents NaN 35.0 \n", "4 Women/Jewelry/Necklaces NaN 44.0 \n", "\n", " shipping item_description log_price \\\n", "0 1 missing 2.302585 \n", "1 0 This keyboard is in great condition and works ... 3.951244 \n", "2 1 Adorable top with a hint of lace and a key hol... 2.302585 \n", "3 1 New with tags. Leather horses. Retail for [rm]... 3.555348 \n", "4 0 Complete with certificate of authenticity 3.784190 \n", "\n", " desc_words desc_chars name_words name_chars \n", "0 0 0 7 29 \n", "1 36 153 4 29 \n", "2 29 96 2 13 \n", "3 32 142 3 19 \n", "4 5 37 4 17 " ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Checking for missing brand names. I have mentioned above that the lack of a brand is itself important information, but let's make sure that we don't have missing brands.\n", "\n", "First get all the unique brand names, ignoring \"None\"." 
] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8709" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(train.brand_name.unique()) + len(test.brand_name.unique())" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(None, None)" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train['brand_name'].fillna('missing',inplace=True), test['brand_name'].fillna('missing',inplace=True)\n", "train['category_name'].fillna('missing',inplace=True), test['category_name'].fillna('missing',inplace=True)\n", "train['item_description'].fillna('missing',inplace=True), test['item_description'].fillna('missing',inplace=True)" ] }, { "cell_type": "code", "execution_count": 115, "metadata": { "collapsed": true }, "outputs": [], "source": [ "all_brands = set(list(train.brand_name.unique()) + list(test['brand_name'].unique()))\n", "all_brands = [b for b in all_brands if b is not 'missing']\n", "# I could use pop... but list comprehensions are more fun, no?" ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5287" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(all_brands)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we're going to check the names and descriptions for brand name information. 632336 \"None\" brands, let's see what we end up with." 
] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(train[train['brand_name']=='missing'])" ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def assign_brand(line):\n", " name_words = line[0].split()\n", " brand = line[1]\n", " \n", " # these are okay\n", " if brand != 'missing':\n", " return brand\n", " \n", " # let's see if we can find the brand name for currently unlabelled items\n", " # If a word is in all_brands, return just the word rather than the full name, or we're just creating new brands...\n", " else:\n", " for word in name_words:\n", " if word in all_brands:\n", " return word\n", " else:\n", " return 'missing'" ] }, { "cell_type": "code", "execution_count": 119, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train['new_brand_name'] = train[['name','brand_name']].apply(lambda l:assign_brand(l), axis=1)\n", "test['new_brand_name'] = test[['name','brand_name']].apply(lambda l:assign_brand(l), axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This isn't perfect, but we've assigned over 70,000 new brands. We can also see that brand_name was used kind of loosely in the first place, and sometimes it was more of an extra description rather than a trademarked brand." ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
train_idnameitem_condition_idcategory_namebrand_namepriceshippingitem_descriptionlog_pricedesc_wordsdesc_charsname_wordsname_charsnew_brand_name
00MLB Cincinnati Reds T Shirt Size XL3Men/Tops/T-shirtsmissing10.01missing2.30258500729MLB
4949Younique 3d fiber lash mascara1Beauty/Makeup/Eyesmissing9.01Younique 3d fiber lash mascara will quickly be...2.19722532166526Younique
5555Vintage wood jewelry lot3Vintage & Collectibles/Jewelry/Broochmissing5.01All are made out of wood. Necklace, earrings b...1.6094381160421Vintage
6666Silver choker Italy 9253Women/Jewelry/Necklacesmissing15.01Signed Italy and 925 Necklace Vintage, lobster...2.7080501263420Silver
7171Partners In Crime Necklace ShipfromChina1Women/Jewelry/Necklacesmissing4.01\"Fine or Fashion: Fashion Item Type: Necklace ...1.38629422118536Partners
\n", "
" ], "text/plain": [ " train_id name item_condition_id \\\n", "0 0 MLB Cincinnati Reds T Shirt Size XL 3 \n", "49 49 Younique 3d fiber lash mascara 1 \n", "55 55 Vintage wood jewelry lot 3 \n", "66 66 Silver choker Italy 925 3 \n", "71 71 Partners In Crime Necklace ShipfromChina 1 \n", "\n", " category_name brand_name price shipping \\\n", "0 Men/Tops/T-shirts missing 10.0 1 \n", "49 Beauty/Makeup/Eyes missing 9.0 1 \n", "55 Vintage & Collectibles/Jewelry/Brooch missing 5.0 1 \n", "66 Women/Jewelry/Necklaces missing 15.0 1 \n", "71 Women/Jewelry/Necklaces missing 4.0 1 \n", "\n", " item_description log_price desc_words \\\n", "0 missing 2.302585 0 \n", "49 Younique 3d fiber lash mascara will quickly be... 2.197225 32 \n", "55 All are made out of wood. Necklace, earrings b... 1.609438 11 \n", "66 Signed Italy and 925 Necklace Vintage, lobster... 2.708050 12 \n", "71 \"Fine or Fashion: Fashion Item Type: Necklace ... 1.386294 22 \n", "\n", " desc_chars name_words name_chars new_brand_name \n", "0 0 7 29 MLB \n", "49 166 5 26 Younique \n", "55 60 4 21 Vintage \n", "66 63 4 20 Silver \n", "71 118 5 36 Partners " ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train[(train['brand_name'] == 'missing') & (train['new_brand_name'] != 'missing')].head()" ] }, { "cell_type": "code", "execution_count": 121, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
train_idnameitem_condition_idcategory_namebrand_namepriceshippingitem_descriptionlog_pricedesc_wordsdesc_charsname_wordsname_charsnew_brand_name
174314174314Silver jeans size 18 reg3Women/Jeans/Straight LegSilver20.01Absolutely love these jeans! Smoke free, pet f...2.995732947520Silver
2279092279096 Total NICE Silver Kennedy Half Dollars2Vintage & Collectibles/Collectibles/OtherSilver20.01The first pic you see are some good (not scrap...2.99573255225734Silver
13974201397420Sterling Silver bracelet accessories1Men/Other/OtherSilver56.00missing4.02535200433Silver
\n", "
" ], "text/plain": [ " train_id name \\\n", "174314 174314 Silver jeans size 18 reg \n", "227909 227909 6 Total NICE Silver Kennedy Half Dollars \n", "1397420 1397420 Sterling Silver bracelet accessories \n", "\n", " item_condition_id category_name \\\n", "174314 3 Women/Jeans/Straight Leg \n", "227909 2 Vintage & Collectibles/Collectibles/Other \n", "1397420 1 Men/Other/Other \n", "\n", " brand_name price shipping \\\n", "174314 Silver 20.0 1 \n", "227909 Silver 20.0 1 \n", "1397420 Silver 56.0 0 \n", "\n", " item_description log_price \\\n", "174314 Absolutely love these jeans! Smoke free, pet f... 2.995732 \n", "227909 The first pic you see are some good (not scrap... 2.995732 \n", "1397420 missing 4.025352 \n", "\n", " desc_words desc_chars name_words name_chars new_brand_name \n", "174314 9 47 5 20 Silver \n", "227909 55 225 7 34 Silver \n", "1397420 0 0 4 33 Silver " ] }, "execution_count": 121, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train[train['brand_name'] == 'Silver']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also see that category_name can be made more granular, as there are up to five actual categories (but mostly three) for each category_name.\n", "\n", "But do we create 5 categories or just 3? The latter will avoid increasing sparsity, while using the former will give us extra information for only 7 out of over a million observations. This is probably not worth the extra computational cost. And if we look at the 2 observations with 5 categories, we can be fairly confident that the item name and the first three categories can give us enough information, unless there exists some secret iPad that's not a tablet and can't read eBooks." 
] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Electronics', 'Computers & Tablets', 'iPad', 'Tablet', 'eBook Readers']\n", "['Sports & Outdoors', 'Exercise', 'Dance', 'Ballet']\n", "['Electronics', 'Computers & Tablets', 'iPad', 'Tablet', 'eBook Access']\n", "['Sports & Outdoors', 'Outdoors', 'Indoor', 'Outdoor Games']\n", "['Men', 'Coats & Jackets', 'Varsity', 'Baseball']\n", "['Men', 'Coats & Jackets', 'Flight', 'Bomber']\n", "['Handmade', 'Housewares', 'Entertaining', 'Serving']\n", "Maximum categories: 5\n", "Minimum categories: 1\n" ] } ], "source": [ "cat_len = []\n", "for cat in train['category_name'].unique():\n", " cat_len.append(len(cat.split('/')))\n", " if len(cat.split('/')) > 3:\n", " print(cat.split('/'))\n", "print ('Maximum categories: ', np.max(cat_len))\n", "print ('Minimum categories: ', np.min(cat_len))" ] }, { "cell_type": "code", "execution_count": 123, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def granular_cat (line):\n", " splits = line.split('/')\n", " cats = len(splits)\n", " if cats == 1:\n", " return (splits[0],'missing','missing')\n", " elif cats == 2:\n", " return (splits[0],splits[1],'missing')\n", " elif cats >= 3:\n", " return (splits[0],splits[1],splits[2])\n", " else:\n", " return ('missing', 'missing','missing')" ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train['cat_1'],train['cat_2'],train['cat_3'] = zip(*train['category_name'].apply(lambda l:granular_cat(l)))\n", "test['cat_1'],test['cat_2'],test['cat_3'] = zip(*test['category_name'].apply(lambda l:granular_cat(l)))" ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
train_idnameitem_condition_idcategory_namebrand_namepriceshippingitem_descriptionlog_pricedesc_wordsdesc_charsname_wordsname_charsnew_brand_namecat_1cat_2cat_3
00MLB Cincinnati Reds T Shirt Size XL3Men/Tops/T-shirtsmissing10.01missing2.30258500729MLBMenTopsT-shirts
11Razer BlackWidow Chroma Keyboard3Electronics/Computers & Tablets/Components & P...Razer52.00This keyboard is in great condition and works ...3.95124436153429RazerElectronicsComputers & TabletsComponents & Parts
22AVA-VIV Blouse1Women/Tops & Blouses/BlouseTarget10.01Adorable top with a hint of lace and a key hol...2.3025852996213TargetWomenTops & BlousesBlouse
33Leather Horse Statues1Home/Home Décor/Home Décor Accentsmissing35.01New with tags. Leather horses. Retail for [rm]...3.55534832142319missingHomeHome DécorHome Décor Accents
4424K GOLD plated rose1Women/Jewelry/Necklacesmissing44.00Complete with certificate of authenticity3.784190537417missingWomenJewelryNecklaces
\n", "
" ], "text/plain": [ " train_id name item_condition_id \\\n", "0 0 MLB Cincinnati Reds T Shirt Size XL 3 \n", "1 1 Razer BlackWidow Chroma Keyboard 3 \n", "2 2 AVA-VIV Blouse 1 \n", "3 3 Leather Horse Statues 1 \n", "4 4 24K GOLD plated rose 1 \n", "\n", " category_name brand_name price \\\n", "0 Men/Tops/T-shirts missing 10.0 \n", "1 Electronics/Computers & Tablets/Components & P... Razer 52.0 \n", "2 Women/Tops & Blouses/Blouse Target 10.0 \n", "3 Home/Home Décor/Home Décor Accents missing 35.0 \n", "4 Women/Jewelry/Necklaces missing 44.0 \n", "\n", " shipping item_description log_price \\\n", "0 1 missing 2.302585 \n", "1 0 This keyboard is in great condition and works ... 3.951244 \n", "2 1 Adorable top with a hint of lace and a key hol... 2.302585 \n", "3 1 New with tags. Leather horses. Retail for [rm]... 3.555348 \n", "4 0 Complete with certificate of authenticity 3.784190 \n", "\n", " desc_words desc_chars name_words name_chars new_brand_name cat_1 \\\n", "0 0 0 7 29 MLB Men \n", "1 36 153 4 29 Razer Electronics \n", "2 29 96 2 13 Target Women \n", "3 32 142 3 19 missing Home \n", "4 5 37 4 17 missing Women \n", "\n", " cat_2 cat_3 \n", "0 Tops T-shirts \n", "1 Computers & Tablets Components & Parts \n", "2 Tops & Blouses Blouse \n", "3 Home Décor Home Décor Accents \n", "4 Jewelry Necklaces " ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.head()" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
test_idnameitem_condition_idcategory_namebrand_nameshippingitem_descriptiondesc_wordsdesc_charsname_wordsname_charsnew_brand_namecat_1cat_2cat_3
00Breast cancer \"I fight like a girl\" ring1Women/Jewelry/Ringsmissing1Size 725833missingWomenJewelryRings
1125 pcs NEW 7.5\"x12\" Kraft Bubble Mailers1Other/Office supplies/Shipping Suppliesmissing125 pcs NEW 7.5\"x12\" Kraft Bubble Mailers Lined...38214734missingOtherOffice suppliesShipping Supplies
22Coach bag1Vintage & Collectibles/Bags and Purses/HandbagCoach1Brand new coach bag. Bought for [rm] at a Coac...114528CoachVintage & CollectiblesBags and PursesHandbag
33Floral Kimono2Women/Sweaters/Cardiganmissing0-floral kimono -never worn -lightweight and pe...1058212missingWomenSweatersCardigan
44Life after Death3Other/Books/Religion & Spiritualitymissing1Rediscovering life after the loss of a loved o...29139314missingOtherBooksReligion & Spirituality
\n", "
" ], "text/plain": [ " test_id name item_condition_id \\\n", "0 0 Breast cancer \"I fight like a girl\" ring 1 \n", "1 1 25 pcs NEW 7.5\"x12\" Kraft Bubble Mailers 1 \n", "2 2 Coach bag 1 \n", "3 3 Floral Kimono 2 \n", "4 4 Life after Death 3 \n", "\n", " category_name brand_name shipping \\\n", "0 Women/Jewelry/Rings missing 1 \n", "1 Other/Office supplies/Shipping Supplies missing 1 \n", "2 Vintage & Collectibles/Bags and Purses/Handbag Coach 1 \n", "3 Women/Sweaters/Cardigan missing 0 \n", "4 Other/Books/Religion & Spirituality missing 1 \n", "\n", " item_description desc_words desc_chars \\\n", "0 Size 7 2 5 \n", "1 25 pcs NEW 7.5\"x12\" Kraft Bubble Mailers Lined... 38 214 \n", "2 Brand new coach bag. Bought for [rm] at a Coac... 11 45 \n", "3 -floral kimono -never worn -lightweight and pe... 10 58 \n", "4 Rediscovering life after the loss of a loved o... 29 139 \n", "\n", " name_words name_chars new_brand_name cat_1 \\\n", "0 8 33 missing Women \n", "1 7 34 missing Other \n", "2 2 8 Coach Vintage & Collectibles \n", "3 2 12 missing Women \n", "4 3 14 missing Other \n", "\n", " cat_2 cat_3 \n", "0 Jewelry Rings \n", "1 Office supplies Shipping Supplies \n", "2 Bags and Purses Handbag \n", "3 Sweaters Cardigan \n", "4 Books Religion & Spirituality " ] }, "execution_count": 126, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model creation and validation\n", "\n", "Description analysis is a little arcane for now, but we can still vectorize them and we do have their lengths in words and characters, which might give us some information." 
] }, { "cell_type": "code", "execution_count": 127, "metadata": { "collapsed": true }, "outputs": [], "source": [ "combined = pd.concat((train,test),axis=0)" ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [], "source": [ "lab = LabelEncoder()\n", "# it would be nice if this had fit_transform\n", "lab.fit(combined['category_name'])\n", "combined['category_name_final'] = lab.transform(combined['category_name'])\n", "\n", "lab.fit(combined['new_brand_name'])\n", "combined['new_brand_name_final'] = lab.transform(combined['new_brand_name'])\n", "\n", "lab.fit(combined['cat_1'])\n", "combined['cat_1_final'] = lab.transform(combined['cat_1'])\n", "\n", "lab.fit(combined['cat_2'])\n", "combined['cat_2_final'] = lab.transform(combined['cat_2'])\n", "\n", "lab.fit(combined['cat_3'])\n", "combined['cat_3_final'] = lab.transform(combined['cat_3'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tokenize the descriptions." ] }, { "cell_type": "code", "execution_count": 129, "metadata": { "collapsed": true }, "outputs": [], "source": [ "tok = Tokenizer()\n", "all_text = np.hstack((combined['item_description'].str.lower(), combined['name'].str.lower()))\n", "\n", "tok.fit_on_texts(all_text)\n", "\n", "combined['item_description_seq'] = tok.texts_to_sequences(combined['item_description'].str.lower())\n", "combined['name_seq'] = tok.texts_to_sequences(combined['name'].str.lower())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Features:\n", "" ] }, { "cell_type": "code", "execution_count": 130, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def data_prep(df):\n", " X = {\n", " 'name_final': pad_sequences(df['name_seq'], maxlen=10),\n", " 'name_words' : np.array(df[['name_words']]),\n", " #'name_chars' : np.array(df[['name_chars']]),\n", " 'new_brand_name_final': np.array(df['new_brand_name_final']),\n", " 'category_name_final': np.array(df['category_name_final']),\n", " 'cat_1_final': 
np.array(df['cat_1_final']),\n", " 'cat_2_final': np.array(df['cat_2_final']),\n", " 'cat_3_final': np.array(df['cat_3_final']),\n", " 'item_description_final': pad_sequences(df['item_description_seq'], maxlen=75),\n", " 'desc_words': np.array(df[['desc_words']]),\n", " #'desc_chars': np.array(df[['desc_chars']]),\n", " 'item_condition': np.array(df['item_condition_id']),\n", " 'shipping': np.array(df[[\"shipping\"]]),\n", " }\n", " return X" ] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [], "source": [ "def rnn_model (lr=0.005, decay=0.0):\n", " # Inputs\n", " name_final = Input(shape=[X_train[\"name_final\"].shape[1]], name=\"name_final\")\n", " name_words = Input(shape=[1], name=\"name_words\")\n", " #name_chars = Input(shape=[1], name='name_chars')\n", " new_brand_name_final = Input(shape=[1], name='new_brand_name_final')\n", " category_name_final = Input(shape=[1], name='category_name_final')\n", " cat_1_final = Input(shape=[1], name=\"cat_1_final\")\n", " cat_2_final = Input(shape=[1], name=\"cat_2_final\")\n", " cat_3_final = Input(shape=[1], name=\"cat_3_final\")\n", " item_description_final = Input(shape=[X_train['item_description_final'].shape[1]], name=\"item_description_final\")\n", " desc_words = Input(shape=[1], name='desc_words')\n", " #desc_chars = Input(shape=[1], name='desc_chars')\n", " item_condition = Input(shape=[1], name=\"item_condition\")\n", " shipping = Input(shape=[X_train['shipping'].shape[1]], name=\"shipping\")\n", "\n", " # input dimensions are always slightly larger than the maximum values in vectorized features or\n", " # maximum number of words/characters\n", " emb_name = Embedding(350000, 20)(name_final)\n", " emb_name_words = Embedding(18, 5)(name_words)\n", " #emb_name_chars = Embedding(41, 5)(name_chars)\n", " emb_brand_name = Embedding(5288, 10)(new_brand_name_final)\n", " emb_category = Embedding( 1311, 10)(category_name_final)\n", " emb_cat_1 = Embedding(11, 10)(cat_1_final)\n", " emb_cat_2 
= Embedding(114, 10)(cat_2_final)\n", " emb_cat_3 = Embedding(883, 10)(cat_3_final)\n", " emb_item_desc = Embedding(350000, 60)(item_description_final)\n", " emb_desc_words = Embedding(250, 5)(desc_words)\n", " #emb_desc_chars = Embedding(900, 5)(desc_chars)\n", " emb_item_condition = Embedding(6, 5)(item_condition)\n", "\n", " rnn_layer1 = GRU(16) (emb_item_desc)\n", " rnn_layer2 = GRU(8) (emb_name)\n", "\n", " layer = concatenate([\n", " Flatten()(emb_name_words),\n", " #Flatten()(emb_name_chars),\n", " Flatten()(emb_brand_name),\n", " Flatten()(emb_category),\n", " Flatten()(emb_cat_1),\n", " Flatten()(emb_cat_2),\n", " Flatten()(emb_cat_3),\n", " Flatten()(emb_desc_words),\n", " #Flatten()(emb_desc_chars),\n", " Flatten()(emb_item_condition),\n", " rnn_layer1,\n", " rnn_layer2,\n", " shipping # only 2 possible values, so it's ok\n", " ])\n", " \n", " layer = Dropout(0.25)(Dense(512,kernel_initializer='normal',activation='relu') (layer))\n", " layer = Dropout(0.20)(Dense(256,kernel_initializer='normal',activation='relu') (layer))\n", " layer = Dropout(0.15)(Dense(128,kernel_initializer='normal',activation='relu') (layer))\n", " layer = Dropout(0.10)(Dense(64,kernel_initializer='normal',activation='relu') (layer))\n", "\n", " # scalar output for each set of features, linear model\n", " output = Dense(1, activation=\"linear\") (layer)\n", " \n", " model = Model([name_final, name_words, new_brand_name_final,\n", " category_name_final,\n", " cat_1_final, cat_2_final, cat_3_final,\n", " item_description_final, desc_words,\n", " item_condition, shipping], output)\n", "\n", " optimizer = Adam(lr=lr, decay=decay)\n", " model.compile(loss = 'mse', optimizer = optimizer)\n", "\n", " return model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It turns out that using characters instead of words, or combining characters and words for description and name length decreases predictive power. 
This makes sense, as words actually give information about a product, while characters are informational only insofar as they form words. For example, \"cool\" and \"awesome\" in item descriptions probably have the same effect, so the 3-character difference doesn't really mean much." ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [], "source": [ "train = combined[:len(train)]\n", "test = combined[len(train):]" ] }, { "cell_type": "code", "execution_count": 133, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# clear up some RAM...\n", "del lab, combined, tok" ] }, { "cell_type": "code", "execution_count": 134, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X = train\n", "y = train['log_price'].values.reshape(-1, 1)" ] }, { "cell_type": "code", "execution_count": 135, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=10101)" ] }, { "cell_type": "code", "execution_count": 136, "metadata": {}, "outputs": [], "source": [ "X_train = data_prep(X_train)\n", "X_val = data_prep(X_val)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_test = data_prep(test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train on 1333494 samples, validate on 148167 samples\n", "Epoch 1/2\n", "1333494/1333494 [==============================] - 457s 342us/step - loss: 0.3008 - val_loss: 0.2219\n", "Epoch 2/2\n", "1214464/1333494 [==========================>...] 
- ETA: 38s - loss: 0.2085" ] } ], "source": [ "rnn = rnn_model(lr=0.005, decay=1e-6)\n", "rnn.fit(X_train, y_train, epochs=2, batch_size=512,validation_data=(X_val, y_val), verbose=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pred = rnn.predict(X_val, batch_size=512)\n", "print(\"RMSLE:\", rmse(y_val, pred))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "final_pred = rnn.predict(X_test, batch_size=512, verbose=1)\n", "final_pred = np.exp(final_pred)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "submission = pd.DataFrame({\"test_id\": test['test_id'], \"price\": final_pred.reshape(-1)})\n", "submission.to_csv(\"sub.csv\", index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }