{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A little introduction.\n",
"\n",
"One of my tasks at a previous job was to supply medical equipment to the hospitals in Russia.
\n",
"At this moment there is a goverment regulation of the possible options for the purchase of medical equipment in hospitals.
\n",
"The procedure calls \"Tender\" and Tenders placed on the public site. Everyone can apply for participation and win Tender. The one who offers the lowest price will be a winner.\n",
"In an ideal world, technical specifications for the necessary equipment should not imply a specific brand of manufacturer. (as an example - analogy: costumer can’t write - we want to buy \"The lastest model of iMac.\" It will be correct to write: we need a PC with 8 cores, with a frequency of 2.2 megahertz, 8 GB RAM. A hard disk of at least 250 GB, etc.)
\n",
"But in fact, many manufacturers have their own set of unique technical characteristics that uniquely define the company. And the technical task of the tender is designed so that only one company-brand corresponds to the description.
\n",
"That was in my case - medical equipment.
\n",
"I'am, as a supplier, very interested in quickly identifying a specific manufacturer in order to be the first to negotiate and get the minimum price. This is a guarantee of victory in the tender.
\n",
"I have compiled a parser that pulls data from the site (zakupki.gov.ru), where the results of public tenders are published. Here, by the way, several options are possible. Parse from the page, or from ftp, or take a json from the guys who share the already marked up information (Проект КГИ “Госзатраты” (https://clearspending.ru)
\n",
"I took the data for 2017 year. And in the Sverdlovsk region (geographical entity in Russia).\n",
"Results are a contract with a large number of details. Since, initially, it was not planned to use ML. I've scrapepd data just for only a one-time analytics. The following indicators were unloaded:\n",
"- customer.inn - unique identifier of the customer (hospital)\n",
"- regNum - contract unique identifier\n",
"- signDate - date of contract signing\n",
"- Name - the actual specification of the proposed to the delivery of equipment.\n",
"- product_price - price of equipment\n",
"- Quantity - the amount of equipment\n",
"- inn - the unique identifier of the winner\n",
"- Manufacturer - target. The names of Manufacturers\n",
"\n",
"\n",
"link to the data - https://drive.google.com/open?id=1S9X_B9Vayev_mu9co8mVR0acdeU5uBTw
\n",
"This task is similar to Medium competition.
\n",
"Instead of the content of articles - those task.
\n",
"And various features for a possible improvement in the speed.
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"import seaborn as sns\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.model_selection import train_test_split, TimeSeriesSplit\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.linear_model import LogisticRegression\n",
"from scipy.sparse import csr_matrix\n",
"from scipy.sparse import hstack\n",
"from sklearn.metrics import confusion_matrix"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = pd.read_csv('Urology_department.csv', delimiter=';', converters={'customer.inn':str,\\\n",
" 'signDate':pd.to_datetime})\n",
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.isnull().any()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = data.dropna()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok. There are no NAN values. all types are correct"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look on our target variable"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.Manufacturer.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many \"0\" values. It's mean that we don't know manufacturer. We'll drop this rows. And one manufacturer duplicated (\"Coloplast\" and \"COLOPLAST\"). We'll lowecase all target manufacturers names. And we don't interesting in counts less then 10. (They are so rare that they can be neglected.)\n",
"\n",
"By the way - the most time i spent to prepare data. Because published contracts don't have information about manufacturers. Either it is, but it very unstructured. I've create target variable myself."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.Manufacturer = data.Manufacturer.str.lower()\n",
"cnt = data.Manufacturer.value_counts()\n",
"data = data.loc[(data.Manufacturer != '0') & (data.Manufacturer.isin(cnt.index[cnt >= 10]).values)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(20,10))\n",
"sns.countplot(y=\"Manufacturer\", data=data, order = data.Manufacturer.value_counts().index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"My task is to create a classifier for 14 classes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look on other categorial variables. Costumer.inn and inn of winner"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(20,10))\n",
"sns.countplot(y=\"customer.inn\", data=data, order = data[\"customer.inn\"].value_counts().index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And the winner.inn"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(20,10))\n",
"sns.countplot(y=\"inn\", data=data, order = data[\"inn\"].value_counts().index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have leaders in terms of purchases and supplies.
\n",
"Let's take a look what brands of Manufacturers are they sell into hospitals"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cnt_l_c = data[\"customer.inn\"].value_counts()\n",
"df_leaders_costumers = data.loc[data[\"customer.inn\"].isin(cnt_l_c.index[cnt_l_c >= 100]).values]\n",
"plt.figure(figsize=(20,10))\n",
"sns.countplot(y=\"Manufacturer\", data=df_leaders_costumers, order = df_leaders_costumers.Manufacturer.value_counts().index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I don't think this feature will be useful.."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cnt_l_s = data[\"inn\"].value_counts()\n",
"df_leaders_suppliers = data.loc[data[\"inn\"].isin(cnt_l_s.index[cnt_l_s >= 100]).values]\n",
"plt.figure(figsize=(20,10))\n",
"sns.countplot(y=\"Manufacturer\", data=df_leaders_suppliers, order = df_leaders_suppliers.Manufacturer.value_counts().index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that some target manufacturers are not presented. But distribution is look alike to our start. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And let's look to the date of contract"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data['day'] = data['signDate'].apply(pd.datetime.weekday)\n",
"data['month'] = data['signDate'].apply(lambda x: x.month)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(20,10))\n",
"sns.countplot(y=\"Manufacturer\", data=data, hue='month')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(20,10))\n",
"sns.countplot(y=\"Manufacturer\", data=data, hue='day')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At the first time let's build a model with only Description of Manufacturer as feature"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_part, test_part = train_test_split(data[['name','Manufacturer']], test_size=0.2, random_state=21, stratify=data[\"Manufacturer\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pipeline_tfidf_lr = Pipeline([('tfidf', TfidfVectorizer()),\n",
" ('lr', LogisticRegression())])\n",
"\n",
"pipeline_tfidf_lr.fit(train_part['name'], train_part[\"Manufacturer\"])\n",
"\n",
"predicted = pipeline_tfidf_lr.predict(test_part[\"name\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Confusion matrix"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_classes_counts = test_part[\"Manufacturer\"].value_counts()\n",
"test_classes_names = np.array(test_classes_counts.index)\n",
"total_classes = len(test_classes_counts)\n",
"\n",
"\n",
"cm = confusion_matrix(y_true=test_part[\"Manufacturer\"], y_pred=predicted, labels=test_classes_names)\n",
"for true_class_id in range(total_classes):\n",
" true_class_name = test_classes_names[true_class_id]\n",
" true_class_count = test_classes_counts[true_class_name]\n",
" \n",
" print('For Manufacturer \"{0}\" ({1} test examples) were predicted:'.format(true_class_name, true_class_count))\n",
" for pred_class_id in range(total_classes):\n",
" percent = int(cm[true_class_id, pred_class_id].item()) / int(true_class_count.item()) * 100\n",
" if percent >= 5:\n",
" pred_class_name = test_classes_names[pred_class_id]\n",
" print('\\t\"{0}\" в {1:.2f} % ({2} раз)'.format(pred_class_name, percent, cm[true_class_id, pred_class_id]))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"time_split = TimeSeriesSplit(n_splits=5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"parameters_lr = {'tfidf__ngram_range': [(1, 1), (1, 2)],\n",
" 'tfidf__use_idf': (True, False),\n",
" 'tfidf__max_features': [50000, 100000],\n",
" 'lr__C': np.logspace(-2, 2, 10),\n",
" }\n",
"\n",
"gs_lr = GridSearchCV(pipeline_tfidf_lr, parameters_lr, scoring=\"accuracy\", n_jobs=4, cv=time_split, verbose=10,\n",
" return_train_score=True)\n",
"gs_lr = gs_lr.fit(data[\"name\"], data[\"Manufacturer\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gs_lr.best_params_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try to prepare text data with removing special characters.
\n",
"And stemm it"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import nltk\n",
"from nltk.stem import SnowballStemmer\n",
"import re\n",
"stemmer = SnowballStemmer('russian')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def remove_spec_char(string):\n",
" return re.sub('[?|#|$|.|!|0-9|²|)|(|,|–|+|”|—|’|/]', '', string)\n",
"\n",
"def steming(string):\n",
" singles = [stemmer.stem(word) for word in string.split()]\n",
" return \" \".join(singles)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.name = data.name.apply(remove_spec_char)\n",
"data.name = data.name.apply(steming)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Train our model on prepared data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pipeline_tfidf_lr_prep = Pipeline(\n",
" [('tfidf', TfidfVectorizer(ngram_range=(1, 2), use_idf=True, max_features=50000)),\n",
" ('lr', LogisticRegression(C=12.915496650148826)),\n",
" ])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_part, test_part = train_test_split(data[['name','Manufacturer']], test_size=0.2, random_state=21, stratify=data[\"Manufacturer\"])\n",
"pipeline_tfidf_lr_prep.fit(train_part[\"name\"], train_part[\"Manufacturer\"])\n",
"predicted = pipeline_tfidf_lr_prep.predict(test_part[\"name\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_classes_counts = test_part[\"Manufacturer\"].value_counts()\n",
"test_classes_names = np.array(test_classes_counts.index)\n",
"total_classes = len(test_classes_counts)\n",
"\n",
"cm = confusion_matrix(y_true=test_part[\"Manufacturer\"], y_pred=predicted, labels=test_classes_names)\n",
"for true_class_id in range(total_classes):\n",
" true_class_name = test_classes_names[true_class_id]\n",
" true_class_count = test_classes_counts[true_class_name]\n",
" \n",
" print('For Manufacturer \"{0}\" ({1} test examples) were predicted:'.format(true_class_name, true_class_count))\n",
" for pred_class_id in range(total_classes):\n",
" percent = int(cm[true_class_id, pred_class_id].item()) / int(true_class_count.item()) * 100\n",
" if percent >= 5:\n",
" pred_class_name = test_classes_names[pred_class_id]\n",
" print('\\t\"{0}\" в {1:.2f} % ({2} раз)'.format(pred_class_name, percent, cm[true_class_id, pred_class_id]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's add other features, and take a look on results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"scaler = StandardScaler()\n",
"tfidf = TfidfVectorizer(ngram_range=(1, 2), use_idf=True, max_features=50000)\n",
"lr = LogisticRegression(C=12.915496650148826)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = pd.get_dummies(data, columns=['customer.inn', 'inn', 'day', 'month'])\n",
"data.drop(columns=['regNum','signDate'], inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tmp = StandardScaler().fit_transform(data[['product_price','quantity']])\n",
"text = tfidf.fit_transform(data.name)\n",
"features = data.iloc[:,4:].values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X = csr_matrix(hstack([text, tmp, features]))\n",
"y = data.Manufacturer\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21, stratify=y)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"lr.fit(X_train, y_train)\n",
"predicted = lr.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_classes_counts = test_part[\"Manufacturer\"].value_counts()\n",
"test_classes_names = np.array(test_classes_counts.index)\n",
"total_classes = len(test_classes_counts)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cm = confusion_matrix(y_true=test_part[\"Manufacturer\"], y_pred=predicted, labels=test_classes_names)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for true_class_id in range(total_classes):\n",
" true_class_name = test_classes_names[true_class_id]\n",
" true_class_count = test_classes_counts[true_class_name]\n",
" \n",
" print('For Manufacturer \"{0}\" ({1} test examples) were predicted:'.format(true_class_name, true_class_count))\n",
" for pred_class_id in range(total_classes):\n",
" percent = int(cm[true_class_id, pred_class_id].item()) / int(true_class_count.item()) * 100\n",
" if percent >= 5:\n",
" pred_class_name = test_classes_names[pred_class_id]\n",
" print('\\t\"{0}\" в {1:.2f} % ({2} раз)'.format(pred_class_name, percent, cm[true_class_id, pred_class_id]))"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"So, we increase our score with new features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The main reason for completing this task is:\n",
"- Reduced time to process incoming tenders
\n",
"- Receiving profitable conditions from manufacturers (remember, who is the first to request - the one has the maximum discount)
\n",
"- And this is the key to winning the tender.
\n",
"- Reduction or switching to other tasks of a qualified staff of the organization.
\n",
"- Cost reduction and profit growth organization."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What can be done next:\n",
"- The main direction I would choose - enrichment with new data. Grabing new contracts, marking on unknown manufacturers.\n",
"- The use of other models (random forest, svm and others ...)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}