{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data clean | Dados de vendas de veículos Brasal"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[notebook \"Regressão\" -->](https://github.com/piegu/fastai-projects/blob/master/vendedor_IA_vendas_veiculos_brasal_REGRESSAO.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Autor: [Pierre Guillou](https://www.linkedin.com/in/pierreguillou)\n",
"- Data: Maio 2019\n",
"- Hackathon Brasal/PCTec-UnB 2019: http://www.brasal.com.br/inovacao/hackathon/\n",
"- Datasets do Brasal: http://www.brasal.com.br/inovacao/hackathon/wp-content/uploads/2019/05/oficial.zip\n",
"- Biblioteca de Deep Learning usada: [Fastai v1](https://docs.fast.ai/) (Deep Learning com PyTorch)\n",
"- Inspiração do Fastai notebook \"Rossmann\": https://github.com/fastai/course-v3/blob/master/nbs/dl1/rossman_data_clean.ipynb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"O **Hackathon Brasal/PCTec-UnB 2019** foi uma maratona de dados (dias 9 e 10 de maio de 2019), que reuniu estudantes, profissionais e comunidade, com o desafio de em dois dias, realizaram um projeto de Bussiness Intelligence para um cliente real: [Brasal Veículos](http://vw.brasal.com.br/). Aconteceu no [CDT](http://www.cdt.unb.br/) da Universidade de Brasília (UnB) no Brasil.\n",
"\n",
"Nesse contexto, minha equipe desenvolveu o projeto **\"Vendedor IA\" (VIA), um conjunto de modelos de Inteligência Artificial (IA) usando o Deep Learning** cujo princípio é descrito nos parágrafos seguintes.\n",
"\n",
"**Vendedor IA (VIA)**\n",
"\n",
"O VIA é uma ferramenta de IA (um conjunto de modelos usando o Deep Learning) que **ajuda os vendedores de Brasal Veículos a atingir a suas metas de vendas**. \n",
"\n",
"Ele não é um avatar que vende diretamente os veículos. Graças ao treinamento com os dados de vendas do passado, **faz previsões personalisadas em função do perfil de cada cliente**. Por exemplo: estimativa do orçamento que o cliente está disposto a gastar na compra de um veículo, previsão da categoria de veículo que o cliente mais gosta, previsão da data de compra mais provável, etc.).\n",
"\n",
"**VIA modeliza o mundo cliente-vendedor**\n",
"\n",
"Graças ao treinamento com os dados de vendas do passado, **as redes neurais do VIA modelizam o mundo cliente-vendedor no contexto das vendas de veículos Brasal**. \n",
"\n",
"**Treinamento dos modelos de IA**\n",
"\n",
"Os modelos de IA são treinados a partir dos dados de vendas de veículos Brasal dos últimos anos e são atualizados todas as noites com os dados de vendas do dia. \n",
"\n",
"**Modelos desenvolvidos no contexto do hackathon**\n",
"\n",
"Muitos modelos AI podem ser desenvolvidos para o VIA (veja exemplos dados). Apenas para mostrar como, um **modelo de regressão fornecendo o orçamento que o cliente está disposto a gastar na compra de um veículo** está disponível online no github ([vendedor_IA_vendas_veiculos_brasal_REGRESSAO.ipynb](https://github.com/piegu/fastai-projects/blob/master/vendedor_IA_vendas_veiculos_brasal_REGRESSAO.ipynb)).\n",
"\n",
"**AI4BI (Artificial Intelligence for Business Intelligence)**\n",
"\n",
"O VIA é também um **projeto AI4BI**: as previsões do VIA vão para um aplicativo de BI que permite aos vendedores de Brasal Veículos de visualizá-las em gráficos interativos."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Notebooks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2 jupyter notebooks foram criados:\n",
"1. **Data clean** ([vendas_veiculos_brasal_data_clean.ipynb](https://github.com/piegu/fastai-projects/blob/master/vendas_veiculos_brasal_data_clean.ipynb)): é o notebook de preparação da tabela de dados de vendas para treinar os modelos do VIA.\n",
"2. **Regressão** ([vendedor_IA_vendas_veiculos_brasal_REGRESSAO.ipynb](https://github.com/piegu/fastai-projects/blob/master/vendedor_IA_vendas_veiculos_brasal_REGRESSAO.ipynb)): é o notebook de treinamento do modelo que fornece o orçamento que o cliente está disposto a gastar na compra de um veículo."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialisation"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%reload_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from fastai.basics import *"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fastai: 1.0.52\n",
"cuda: True\n"
]
}
],
"source": [
"import fastai\n",
"print(f'fastai: {fastai.__version__}')\n",
"print(f'cuda: {torch.cuda.is_available()}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data preparation / Feature engineering"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"PATH = Config.data_path()/'brasal/'\n",
"# PATH.ls()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Para criar o modelo de probabilidades de vendas por categoria de veículo, usamos os seguintes arquivos.\n",
"\n",
"Nota: o arquivo `MODELOS_VEICULOS.CSV` foi criado a partir do `VEICULOS_BRAVOS.CSV` com a coluna `CLASSIFICACAO` em adição, o que dá a categoria do veículo.\n",
"\n",
"1. Popular\n",
"2. Utilitário\n",
"3. OFF ROAD\n",
"4. SUV (Sport Utility Vehicle)\n",
"5. Luxo"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# list of tables\n",
"table_names = ['CLIENTES_BRAVOS',\n",
" 'MODELOS_VEICULOS',\n",
" 'VENDAS_VEICULOS_BRAVOS', \n",
" 'VENDEDOR_BRAVOS']"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"tables = [pd.read_csv(PATH/f'{fname}.csv', low_memory=False, sep=\";\", encoding='cp1252') for fname in table_names]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"4 tabelas: clientes, veiculos, vendas, vendedores\n",
"Número de vendas: 7284\n"
]
}
],
"source": [
"clientes, veiculos, vendas, vendedores = tables\n",
"print(f'{len(tables)} tabelas: clientes, veiculos, vendas, vendedores')\n",
"print(f'Número de vendas: {len(vendas)}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table Clientes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Processo**:\n",
"\n",
"1. mantendo apenas as colunas necessárias para treinamento\n",
"2. adicionando novas colunas se necessário (ex: CEP para CEP1, CEP2, CEP3)\n",
"3. substituindo datas erradas e ausentes\n",
"4. alterendo dtype das colunas quando necessário (ex: ESTADO_CIVIL para int32) e substituindo dados ausentes ou errados (ex: 0 quando não há dado de ESTADO_CIVIL)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"df = clientes"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['CLIENTE_COD', 'NOME_CLIENTE', 'DATA_NASCIMENTO', 'SEXO', 'ESTADO_CIVIL', 'DESC_ESTADO_CIVIL',\n",
" 'SITUACAO_CREDITO', 'CLIENTE_TIPO', 'CLIENTE_CATEGORIA', 'DES_CATEGORIA_CLIENTE', 'ENDERECO_BAIRRO',\n",
" 'ENDERECO_MUNICIPIO', 'ENDERECO_UF', 'CEP', 'CLIENTE_DTCADASTRO', 'CLIENTE_DTULTIMOMOVIMENTO',\n",
" 'DTULTIMAALTERACAO', 'DTPRIMEIRAFATURA', 'DTULTIMAFATURA', 'LIMITE_CREDITO', 'CLIENTE_VIP', 'EMPRESA_ORIGEM',\n",
" 'REVENDA_ORIGEM', 'GRUPO_CLIENTE', 'CADEIRANTE', 'CLIENTE_INATIVO', 'TIPO_PESSOA'], dtype=object)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns.values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 1"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# List of fields by client\n",
"\n",
"# client_tipo means\n",
"# 1 -> Ouro 2 -> Prata 3 -> Bronze E -> Especial L -> Lista Negra N -> Normal P -> Prospecção\n",
"\n",
"columns_clientes = array(['CLIENTE_COD', 'DATA_NASCIMENTO', 'SEXO', \n",
" 'ESTADO_CIVIL',\n",
" 'SITUACAO_CREDITO', \n",
" 'CEP',\n",
" 'CLIENTE_TIPO', 'TIPO_PESSOA', # J / F\n",
" 'CLIENTE_DTCADASTRO', 'EMPRESA_ORIGEM','REVENDA_ORIGEM', \n",
" 'DTPRIMEIRAFATURA', 'DTULTIMAFATURA', 'LIMITE_CREDITO'\n",
" ])"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"df_clientes = df[columns_clientes].copy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 2"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# keep informations about client postal address by keeping the 3 first digits of CEP\n",
"\n",
"df_clientes['CEP'] = df_clientes['CEP'].fillna(0.0).astype(np.int32)\n",
" \n",
"for i in range(3):\n",
" f = lambda x: int(list(str(x))[i]) if len(list(str(x)))>=3 else 0\n",
" n = i+1\n",
" df_clientes[f'CEP{n}'] = df_clientes['CEP'].apply(f)\n",
" \n",
"df_clientes = df_clientes.drop(['CEP'], axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 3"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# Remplace all date with error by '1900/01/01 00:00:00.000000000'\n",
"\n",
"# check if value is NaN\n",
"def isNaN(num):\n",
" return num != num\n",
"\n",
"def indexes_date_correction(date_fldname, df):\n",
" idxs_erros = list()\n",
" idxs_empty = list()\n",
" for idx,dt in df[date_fldname].iteritems():\n",
" if isNaN(dt) == False:\n",
" y = int(dt.split(\"/\")[0])\n",
" m = int(dt.split(\"/\")[1])\n",
" if (y < 1900) or (y > 2019) or (m < 1) or (m > 12) or (y == 2019 and m > 4):\n",
" idxs_erros.append(idx)\n",
" else:\n",
" idxs_empty.append(idx)\n",
" return idxs_erros, idxs_empty\n",
"\n",
"def date_correction(idxs, date_fldname, df, new_date='1900/01/01 00:00:00.000000000', flag='error'):\n",
" df.loc[idxs, date_fldname] = new_date\n",
" if flag == 'error':\n",
" print(f'Todas as datas de {date_fldname} têm sido corrigidas.')\n",
" elif flag == 'empty':\n",
" print(f'Todas as cédulas sem data de {date_fldname} têm sido preenchidas.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### DATA_NASCIMENTO"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Número de cédulas sem datas DATA_NASCIMENTO: 69531\n",
"Número de erros nas datas DATA_NASCIMENTO: 7764\n",
"\n",
"7 2040/12/10 00:00:00.000000000\n",
"13 2045/11/08 00:00:00.000000000\n",
"16 2032/03/19 00:00:00.000000000\n",
"23 2049/11/18 00:00:00.000000000\n",
"24 2040/08/28 00:00:00.000000000\n",
"Name: DATA_NASCIMENTO, dtype: object\n",
"\n",
"188002 0977/11/18 00:00:00.000000000\n",
"188217 0001/01/01 00:00:00.000000000\n",
"188253 2031/03/25 00:00:00.000000000\n",
"188405 2019/08/18 00:00:00.000000000\n",
"188407 2033/01/19 00:00:00.000000000\n",
"Name: DATA_NASCIMENTO, dtype: object\n"
]
}
],
"source": [
"df = df_clientes\n",
"date_fldname = 'DATA_NASCIMENTO'\n",
"idxs_erros, idxs_empty = indexes_date_correction(date_fldname,df)\n",
"errors_data = len(idxs_erros)\n",
"empty_data = len(idxs_empty)\n",
"\n",
"print(f'Número de cédulas sem datas {date_fldname}: {empty_data}')\n",
"print(f'Número de erros nas datas {date_fldname}: {errors_data}')\n",
"if errors_data != 0:\n",
" print()\n",
" print(df.loc[idxs_erros, date_fldname].head())\n",
" print()\n",
" print(df.loc[idxs_erros, date_fldname].tail())"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Todas as datas de DATA_NASCIMENTO têm sido corrigidas.\n",
"Todas as cédulas sem data de DATA_NASCIMENTO têm sido preenchidas.\n"
]
}
],
"source": [
"# Correction of date errors\n",
"date_correction(idxs_erros,date_fldname,df)\n",
"# Fillin empty cells with '1900/01/01 00:00:00.000000000'\n",
"date_correction(idxs_empty,date_fldname,df,flag='empty')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### CLIENTE_DTCADASTRO"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Número de cédulas sem datas CLIENTE_DTCADASTRO: 75946\n",
"Número de erros nas datas CLIENTE_DTCADASTRO: 247\n",
"\n",
"94093 2038/05/23 00:00:00.000000000\n",
"101627 2043/06/24 00:00:00.000000000\n",
"101901 2047/12/15 00:00:00.000000000\n",
"101948 2043/12/14 00:00:00.000000000\n",
"101954 2049/10/20 00:00:00.000000000\n",
"Name: CLIENTE_DTCADASTRO, dtype: object\n",
"\n",
"162595 2049/06/15 00:00:00.000000000\n",
"172822 2046/06/25 00:00:00.000000000\n",
"182667 2037/02/20 00:00:00.000000000\n",
"182675 2046/01/28 00:00:00.000000000\n",
"185408 2047/06/03 00:00:00.000000000\n",
"Name: CLIENTE_DTCADASTRO, dtype: object\n"
]
}
],
"source": [
"df = df_clientes\n",
"date_fldname = 'CLIENTE_DTCADASTRO'\n",
"idxs_erros, idxs_empty = indexes_date_correction(date_fldname,df)\n",
"errors_data = len(idxs_erros)\n",
"empty_data = len(idxs_empty)\n",
"\n",
"print(f'Número de cédulas sem datas {date_fldname}: {empty_data}')\n",
"print(f'Número de erros nas datas {date_fldname}: {errors_data}')\n",
"if errors_data != 0:\n",
" print()\n",
" print(df.loc[idxs_erros, date_fldname].head())\n",
" print()\n",
" print(df.loc[idxs_erros, date_fldname].tail())"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Todas as datas de CLIENTE_DTCADASTRO têm sido corrigidas.\n",
"Todas as cédulas sem data de CLIENTE_DTCADASTRO têm sido preenchidas.\n"
]
}
],
"source": [
"# Correction of date errors\n",
"date_correction(idxs_erros,date_fldname,df)\n",
"# Fillin empty cells with '1900/01/01 00:00:00.000000000'\n",
"date_correction(idxs_empty,date_fldname,df,flag='empty')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### DTPRIMEIRAFATURA"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Número de cédulas sem datas DTPRIMEIRAFATURA: 177795\n",
"Número de erros nas datas DTPRIMEIRAFATURA: 0\n"
]
}
],
"source": [
"df = df_clientes\n",
"date_fldname = 'DTPRIMEIRAFATURA'\n",
"idxs_erros, idxs_empty = indexes_date_correction(date_fldname,df)\n",
"errors_data = len(idxs_erros)\n",
"empty_data = len(idxs_empty)\n",
"\n",
"print(f'Número de cédulas sem datas {date_fldname}: {empty_data}')\n",
"print(f'Número de erros nas datas {date_fldname}: {errors_data}')\n",
"if errors_data != 0:\n",
" print()\n",
" print(df.loc[idxs_erros, date_fldname].head())\n",
" print()\n",
" print(df.loc[idxs_erros, date_fldname].tail())"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Todas as datas de DTPRIMEIRAFATURA têm sido corrigidas.\n",
"Todas as cédulas sem data de DTPRIMEIRAFATURA têm sido preenchidas.\n"
]
}
],
"source": [
"# Correction of date errors\n",
"date_correction(idxs_erros,date_fldname,df)\n",
"# Fillin empty cells with '1900/01/01 00:00:00.000000000'\n",
"date_correction(idxs_empty,date_fldname,df,flag='empty')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### DTULTIMAFATURA"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Número de cédulas sem datas DTULTIMAFATURA: 177759\n",
"Número de erros nas datas DTULTIMAFATURA: 0\n"
]
}
],
"source": [
"df = df_clientes\n",
"date_fldname = 'DTULTIMAFATURA'\n",
"idxs_erros, idxs_empty = indexes_date_correction(date_fldname,df)\n",
"errors_data = len(idxs_erros)\n",
"empty_data = len(idxs_empty)\n",
"\n",
"print(f'Número de cédulas sem datas {date_fldname}: {empty_data}')\n",
"print(f'Número de erros nas datas {date_fldname}: {errors_data}')\n",
"if errors_data != 0:\n",
" print()\n",
" print(df.loc[idxs_erros, date_fldname].head())\n",
" print()\n",
" print(df.loc[idxs_erros, date_fldname].tail())"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Todas as datas de DTULTIMAFATURA têm sido corrigidas.\n",
"Todas as cédulas sem data de DTULTIMAFATURA têm sido preenchidas.\n"
]
}
],
"source": [
"# Correction of date errors\n",
"date_correction(idxs_erros,date_fldname,df)\n",
"# Fillin empty cells with '1900/01/01 00:00:00.000000000'\n",
"date_correction(idxs_empty,date_fldname,df,flag='empty')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 4"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"CLIENTE_COD int64\n",
"DATA_NASCIMENTO object\n",
"SEXO object\n",
"ESTADO_CIVIL float64\n",
"SITUACAO_CREDITO object\n",
"CLIENTE_TIPO object\n",
"TIPO_PESSOA object\n",
"CLIENTE_DTCADASTRO object\n",
"EMPRESA_ORIGEM int64\n",
"REVENDA_ORIGEM int64\n",
"DTPRIMEIRAFATURA object\n",
"DTULTIMAFATURA object\n",
"LIMITE_CREDITO float64\n",
"CEP1 int64\n",
"CEP2 int64\n",
"CEP3 int64\n",
"dtype: object"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_clientes.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"# Delete espace in cells\n",
"f = lambda x: x.strip()\n",
"df_clientes.SEXO = df_clientes.SEXO.apply(f)\n",
"\n",
"# Replace empty values by nan\n",
"df_clientes.SEXO.replace('', np.nan, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"df_clientes['CLIENTE_COD'] = df_clientes['CLIENTE_COD'].astype(np.int32)\n",
"df_clientes['SEXO'] = df_clientes['SEXO'].fillna('N') # N = neutral when there is no sex\n",
"df_clientes['ESTADO_CIVIL'] = df_clientes['ESTADO_CIVIL'].fillna(0.0).astype(np.int32)\n",
"df_clientes['EMPRESA_ORIGEM'] = df_clientes['EMPRESA_ORIGEM'].astype(np.int32)\n",
"df_clientes['REVENDA_ORIGEM'] = df_clientes['REVENDA_ORIGEM'].astype(np.int32)\n",
"df_clientes['LIMITE_CREDITO'] = df_clientes['LIMITE_CREDITO'].fillna(0.0).astype(np.float32)\n",
"df_clientes['CEP1'] = df_clientes['CEP1'].astype(np.int32)\n",
"df_clientes['CEP2'] = df_clientes['CEP2'].astype(np.int32)\n",
"df_clientes['CEP3'] = df_clientes['CEP3'].astype(np.int32)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"df_clientes['DATA_NASCIMENTO'] = pd.to_datetime(df_clientes['DATA_NASCIMENTO'], infer_datetime_format=True)\n",
"df_clientes['CLIENTE_DTCADASTRO'] = pd.to_datetime(df_clientes['CLIENTE_DTCADASTRO'], infer_datetime_format=True)\n",
"df_clientes['DTPRIMEIRAFATURA'] = pd.to_datetime(df_clientes['DTPRIMEIRAFATURA'], infer_datetime_format=True)\n",
"df_clientes['DTULTIMAFATURA'] = pd.to_datetime(df_clientes['DTULTIMAFATURA'], infer_datetime_format=True)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"CLIENTE_COD int32\n",
"DATA_NASCIMENTO datetime64[ns]\n",
"SEXO object\n",
"ESTADO_CIVIL int32\n",
"SITUACAO_CREDITO object\n",
"CLIENTE_TIPO object\n",
"TIPO_PESSOA object\n",
"CLIENTE_DTCADASTRO datetime64[ns]\n",
"EMPRESA_ORIGEM int32\n",
"REVENDA_ORIGEM int32\n",
"DTPRIMEIRAFATURA datetime64[ns]\n",
"DTULTIMAFATURA datetime64[ns]\n",
"LIMITE_CREDITO float32\n",
"CEP1 int32\n",
"CEP2 int32\n",
"CEP3 int32\n",
"dtype: object"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_clientes.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" \n",
" \n",
" CLIENTE_COD | \n",
" 1 | \n",
" 70 | \n",
" 71 | \n",
" 72 | \n",
" 73 | \n",
"
\n",
" \n",
" DATA_NASCIMENTO | \n",
" 1900-01-01 00:00:00 | \n",
" 1966-04-24 00:00:00 | \n",
" 1966-05-21 00:00:00 | \n",
" 1964-09-09 00:00:00 | \n",
" 1968-09-07 00:00:00 | \n",
"
\n",
" \n",
" SEXO | \n",
" N | \n",
" M | \n",
" F | \n",
" M | \n",
" M | \n",
"
\n",
" \n",
" ESTADO_CIVIL | \n",
" 0 | \n",
" 2 | \n",
" 1 | \n",
" 2 | \n",
" 2 | \n",
"
\n",
" \n",
" SITUACAO_CREDITO | \n",
" Aprovado | \n",
" Aprovado | \n",
" Aprovado | \n",
" Aprovado | \n",
" Aprovado | \n",
"
\n",
" \n",
" CLIENTE_TIPO | \n",
" N | \n",
" N | \n",
" N | \n",
" N | \n",
" N | \n",
"
\n",
" \n",
" TIPO_PESSOA | \n",
" J | \n",
" F | \n",
" F | \n",
" F | \n",
" F | \n",
"
\n",
" \n",
" CLIENTE_DTCADASTRO | \n",
" 2015-02-24 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
"
\n",
" \n",
" EMPRESA_ORIGEM | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" REVENDA_ORIGEM | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" DTPRIMEIRAFATURA | \n",
" 2015-04-01 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
"
\n",
" \n",
" DTULTIMAFATURA | \n",
" 2016-09-30 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
"
\n",
" \n",
" LIMITE_CREDITO | \n",
" 1e+07 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" CEP1 | \n",
" 7 | \n",
" 7 | \n",
" 7 | \n",
" 7 | \n",
" 7 | \n",
"
\n",
" \n",
" CEP2 | \n",
" 7 | \n",
" 7 | \n",
" 7 | \n",
" 7 | \n",
" 7 | \n",
"
\n",
" \n",
" CEP3 | \n",
" 0 | \n",
" 5 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 \\\n",
"CLIENTE_COD 1 70 \n",
"DATA_NASCIMENTO 1900-01-01 00:00:00 1966-04-24 00:00:00 \n",
"SEXO N M \n",
"ESTADO_CIVIL 0 2 \n",
"SITUACAO_CREDITO Aprovado Aprovado \n",
"CLIENTE_TIPO N N \n",
"TIPO_PESSOA J F \n",
"CLIENTE_DTCADASTRO 2015-02-24 00:00:00 1900-01-01 00:00:00 \n",
"EMPRESA_ORIGEM 1 1 \n",
"REVENDA_ORIGEM 1 1 \n",
"DTPRIMEIRAFATURA 2015-04-01 00:00:00 1900-01-01 00:00:00 \n",
"DTULTIMAFATURA 2016-09-30 00:00:00 1900-01-01 00:00:00 \n",
"LIMITE_CREDITO 1e+07 0 \n",
"CEP1 7 7 \n",
"CEP2 7 7 \n",
"CEP3 0 5 \n",
"\n",
" 2 3 \\\n",
"CLIENTE_COD 71 72 \n",
"DATA_NASCIMENTO 1966-05-21 00:00:00 1964-09-09 00:00:00 \n",
"SEXO F M \n",
"ESTADO_CIVIL 1 2 \n",
"SITUACAO_CREDITO Aprovado Aprovado \n",
"CLIENTE_TIPO N N \n",
"TIPO_PESSOA F F \n",
"CLIENTE_DTCADASTRO 1900-01-01 00:00:00 1900-01-01 00:00:00 \n",
"EMPRESA_ORIGEM 1 1 \n",
"REVENDA_ORIGEM 1 1 \n",
"DTPRIMEIRAFATURA 1900-01-01 00:00:00 1900-01-01 00:00:00 \n",
"DTULTIMAFATURA 1900-01-01 00:00:00 1900-01-01 00:00:00 \n",
"LIMITE_CREDITO 0 0 \n",
"CEP1 7 7 \n",
"CEP2 7 7 \n",
"CEP3 0 0 \n",
"\n",
" 4 \n",
"CLIENTE_COD 73 \n",
"DATA_NASCIMENTO 1968-09-07 00:00:00 \n",
"SEXO M \n",
"ESTADO_CIVIL 2 \n",
"SITUACAO_CREDITO Aprovado \n",
"CLIENTE_TIPO N \n",
"TIPO_PESSOA F \n",
"CLIENTE_DTCADASTRO 1900-01-01 00:00:00 \n",
"EMPRESA_ORIGEM 1 \n",
"REVENDA_ORIGEM 1 \n",
"DTPRIMEIRAFATURA 1900-01-01 00:00:00 \n",
"DTULTIMAFATURA 1900-01-01 00:00:00 \n",
"LIMITE_CREDITO 0 \n",
"CEP1 7 \n",
"CEP2 7 \n",
"CEP3 0 "
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_clientes.head().T"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Número de clientes: 188432\n"
]
}
],
"source": [
"print(f'Número de clientes: {len(df_clientes)}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table Veiculos"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Processo**: mantendo apenas as colunas necessárias para treinamento."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"df = veiculos"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['EMPRESA', 'MODELO', 'DES_MODELO', 'FAMILIA', 'DES_FAMILIA', 'MARCA', 'CLASSIFICACAO'], dtype=object)"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns.values"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"# List of fields by car\n",
"\n",
"# We did not keep the empresa that owns the car and we classify in the same category new and second-hand cars\n",
"\n",
"columns_veiculos = array(['MODELO', 'DES_MODELO', 'FAMILIA', 'MARCA', 'CLASSIFICACAO'])"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"df_veiculos = df[columns_veiculos].copy()"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"# Delete espace in cells\n",
"f = lambda x: x.strip()\n",
"df_veiculos.MODELO = df_veiculos.MODELO.apply(f)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MODELO object\n",
"DES_MODELO object\n",
"FAMILIA int64\n",
"MARCA object\n",
"CLASSIFICACAO int64\n",
"dtype: object"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_veiculos.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"df_veiculos['FAMILIA'] = df_veiculos['FAMILIA'].astype(np.int32)\n",
"df_veiculos['CLASSIFICACAO'] = df_veiculos['CLASSIFICACAO'].fillna(0).astype(np.int32)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MODELO object\n",
"DES_MODELO object\n",
"FAMILIA int32\n",
"MARCA object\n",
"CLASSIFICACAO int32\n",
"dtype: object"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_veiculos.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" \n",
" \n",
" MODELO | \n",
" .... | \n",
" 1623H3 | \n",
" 1624N3 | \n",
" 1624U3 | \n",
" 1632F2 | \n",
"
\n",
" \n",
" DES_MODELO | \n",
" FOX CONNECT SB | \n",
" VIRTUS MSI 1.6 | \n",
" VW/GOL 1.0 | \n",
" GOL 1.6 | \n",
" GOL 1,6 | \n",
"
\n",
" \n",
" FAMILIA | \n",
" 11 | \n",
" 107 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" MARCA | \n",
" VW | \n",
" VW | \n",
" VW | \n",
" VW | \n",
" VW | \n",
"
\n",
" \n",
" CLASSIFICACAO | \n",
" 1 | \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 \\\n",
"MODELO .... \n",
"DES_MODELO FOX CONNECT SB \n",
"FAMILIA 11 \n",
"MARCA VW \n",
"CLASSIFICACAO 1 \n",
"\n",
" 1 \\\n",
"MODELO 1623H3 \n",
"DES_MODELO VIRTUS MSI 1.6 \n",
"FAMILIA 107 \n",
"MARCA VW \n",
"CLASSIFICACAO 3 \n",
"\n",
" 2 \\\n",
"MODELO 1624N3 \n",
"DES_MODELO VW/GOL 1.0 \n",
"FAMILIA 1 \n",
"MARCA VW \n",
"CLASSIFICACAO 1 \n",
"\n",
" 3 \\\n",
"MODELO 1624U3 \n",
"DES_MODELO GOL 1.6 \n",
"FAMILIA 1 \n",
"MARCA VW \n",
"CLASSIFICACAO 1 \n",
"\n",
" 4 \n",
"MODELO 1632F2 \n",
"DES_MODELO GOL 1,6 \n",
"FAMILIA 1 \n",
"MARCA VW \n",
"CLASSIFICACAO 1 "
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_veiculos.head().T"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Número de veiculos: 592\n"
]
}
],
"source": [
"print(f'Número de veiculos: {len(df_veiculos)}')"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Número de família de veiculos: 37\n"
]
}
],
"source": [
"print(f'Número de família de veiculos: {len(df_veiculos.FAMILIA.unique())}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table Vendedores"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Este arquivo de vendedores deveria fornecer mais informações sobre eles como sexo, idade etc. e como o local de trabalho do vendedor é indicado na tabela de vendas, **não é necessário usar esse arquivo**."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"df = vendedores"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['VENDEDOR', 'NOME', 'EMPRESA', 'REVENDA', 'DEPARTAMENTO', 'FUNCAO', 'COD_FUNCAO', 'ATIVO'], dtype=object)"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns.values"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"# List of fields by seller\n",
"columns_vendedores = array(['VENDEDOR',\n",
" 'EMPRESA', 'REVENDA'\n",
" ])"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"df_vendedores = df[columns_vendedores].copy()"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"VENDEDOR int64\n",
"EMPRESA int64\n",
"REVENDA int64\n",
"dtype: object"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_vendedores.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" \n",
" \n",
" VENDEDOR | \n",
" 110001 | \n",
" 110002 | \n",
" 110003 | \n",
" 110004 | \n",
" 110005 | \n",
"
\n",
" \n",
" EMPRESA | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" REVENDA | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2 3 4\n",
"VENDEDOR 110001 110002 110003 110004 110005\n",
"EMPRESA 1 1 1 1 1\n",
"REVENDA 1 1 1 1 1"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_vendedores.head().T"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Número de vendedores: 447\n"
]
}
],
"source": [
"print(f'Número de vendedores: {len(df_vendedores)}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table Vendas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Processo**:\n",
"\n",
"1. mantendo apenas as colunas necessárias para treinamento\n",
"2. substituindo datas erradas e ausentes\n",
"3. alterendo dtype das colunas quando necessário (ex: DTA_ENTRADA_SAIDA para datetime) "
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"df = vendas"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['EMPRESA', 'REVENDA', 'NUMERO_NOTA_FISCAL', 'SERIE_NOTA_FISCAL', 'TIPO_TRANSACAO', 'DTA_ENTRADA_SAIDA',\n",
" 'DTA_DOCUMENTO', 'CLIENTE', 'DEPARTAMENTO', 'USUARIO', 'REVENDA_ORIGEM', 'QUANTIDADE', 'VAL_FRETE',\n",
" 'VAL_DESCONTO', 'VAL_SEGURO', 'VAL_IPI', 'VAL_PIS', 'VAL_COFINS', 'VAL_TOTAL_REAL_ITEM', 'VAL_TOTAL_NOTA_ITEM',\n",
" 'VAL_ICMS_RETIDO', 'VAL_CUSTO_CONTABIL', 'VAL_ICMS', 'VENDEDOR', 'VAL_ISS_RETIDO', 'COD_SETOR_VENDA',\n",
" 'VAL_BONUS', 'MODELO'], dtype=object)"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns.values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 1"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"# List of fields by sale\n",
"columns_vendas = array(['EMPRESA', 'REVENDA',\n",
" 'DTA_ENTRADA_SAIDA',\n",
" 'CLIENTE', \n",
" 'VAL_TOTAL_REAL_ITEM',\n",
" 'VAL_TOTAL_NOTA_ITEM', \n",
" 'VAL_CUSTO_CONTABIL',\n",
" 'VENDEDOR', \n",
" 'VAL_BONUS',\n",
" 'MODELO'\n",
" ])"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"df_vendas = df[columns_vendas].copy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 2"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"# Rename columns names to avoid problem when merging with table of sales\n",
"df_vendas.rename(columns={\"EMPRESA\": 'EMPRESA_VENDAS', 'REVENDA': 'REVENDA_VENDAS'}, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Número de cédulas sem datas DTA_ENTRADA_SAIDA: 0\n",
"Número de erros nas datas DTA_ENTRADA_SAIDA: 3\n",
"\n",
"6600 2201/11/16 00:00:00.000\n",
"7035 2019/12/20 00:00:00.000\n",
"7109 2019/12/07 00:00:00.000\n",
"Name: DTA_ENTRADA_SAIDA, dtype: object\n",
"\n",
"6600 2201/11/16 00:00:00.000\n",
"7035 2019/12/20 00:00:00.000\n",
"7109 2019/12/07 00:00:00.000\n",
"Name: DTA_ENTRADA_SAIDA, dtype: object\n"
]
}
],
"source": [
"df = df_vendas\n",
"date_fldname = 'DTA_ENTRADA_SAIDA'\n",
"idxs_erros, idxs_empty = indexes_date_correction(date_fldname,df)\n",
"errors_data = len(idxs_erros)\n",
"empty_data = len(idxs_empty)\n",
"\n",
"print(f'Número de cédulas sem datas {date_fldname}: {empty_data}')\n",
"print(f'Número de erros nas datas {date_fldname}: {errors_data}')\n",
"if errors_data != 0:\n",
" print()\n",
" print(df.loc[idxs_erros, date_fldname].head())\n",
" print()\n",
" print(df.loc[idxs_erros, date_fldname].tail())"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"# Correction of date errors\n",
"df.loc[6600,'DTA_ENTRADA_SAIDA'] = '2001/11/16 00:00:00.000'\n",
"df.loc[7035,'DTA_ENTRADA_SAIDA'] = '2009/12/20 00:00:00.000'\n",
"df.loc[7109,'DTA_ENTRADA_SAIDA'] = '2009/12/07 00:00:00.000'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 3"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"df_vendas['VAL_TOTAL_REAL_ITEM'] = df_vendas['VAL_TOTAL_REAL_ITEM'].abs()\n",
"df_vendas['VAL_TOTAL_NOTA_ITEM'] = df_vendas['VAL_TOTAL_NOTA_ITEM'].abs()\n",
"df_vendas['VAL_CUSTO_CONTABIL'] = df_vendas['VAL_CUSTO_CONTABIL'].abs()"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(df_vendas.VAL_TOTAL_REAL_ITEM < 0.).sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 4"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"EMPRESA_VENDAS int64\n",
"REVENDA_VENDAS int64\n",
"DTA_ENTRADA_SAIDA object\n",
"CLIENTE int64\n",
"VAL_TOTAL_REAL_ITEM float64\n",
"VAL_TOTAL_NOTA_ITEM float64\n",
"VAL_CUSTO_CONTABIL float64\n",
"VENDEDOR int64\n",
"VAL_BONUS float64\n",
"MODELO object\n",
"dtype: object"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_vendas.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [],
"source": [
"df_vendas['EMPRESA_VENDAS'] = df_vendas['EMPRESA_VENDAS'].astype(np.int32)\n",
"df_vendas['REVENDA_VENDAS'] = df_vendas['REVENDA_VENDAS'].astype(np.int32)\n",
"df_vendas['CLIENTE'] = df_vendas['CLIENTE'].astype(np.int32)\n",
"df_vendas['VENDEDOR'] = df_vendas['VENDEDOR'].astype(np.int32)"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"df_vendas['DTA_ENTRADA_SAIDA'] = pd.to_datetime(df_vendas['DTA_ENTRADA_SAIDA'], infer_datetime_format=True)"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"EMPRESA_VENDAS int32\n",
"REVENDA_VENDAS int32\n",
"DTA_ENTRADA_SAIDA datetime64[ns]\n",
"CLIENTE int32\n",
"VAL_TOTAL_REAL_ITEM float64\n",
"VAL_TOTAL_NOTA_ITEM float64\n",
"VAL_CUSTO_CONTABIL float64\n",
"VENDEDOR int32\n",
"VAL_BONUS float64\n",
"MODELO object\n",
"dtype: object"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_vendas.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" \n",
" \n",
" EMPRESA_VENDAS | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 5 | \n",
" 1 | \n",
"
\n",
" \n",
" REVENDA_VENDAS | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" DTA_ENTRADA_SAIDA | \n",
" 2015-03-25 00:00:00 | \n",
" 2015-03-31 00:00:00 | \n",
" 2015-04-30 00:00:00 | \n",
" 2018-09-18 00:00:00 | \n",
" 2015-05-14 00:00:00 | \n",
"
\n",
" \n",
" CLIENTE | \n",
" 52302 | \n",
" 1 | \n",
" 1 | \n",
" 51804 | \n",
" 1 | \n",
"
\n",
" \n",
" VAL_TOTAL_REAL_ITEM | \n",
" 48800 | \n",
" 37008.1 | \n",
" 37043.2 | \n",
" 41869.3 | \n",
" 127944 | \n",
"
\n",
" \n",
" VAL_TOTAL_NOTA_ITEM | \n",
" 48800 | \n",
" 37008.1 | \n",
" 37043.2 | \n",
" 41869.3 | \n",
" 127944 | \n",
"
\n",
" \n",
" VAL_CUSTO_CONTABIL | \n",
" 47713.8 | \n",
" 37008.1 | \n",
" 37043.2 | \n",
" 41869.3 | \n",
" 127944 | \n",
"
\n",
" \n",
" VENDEDOR | \n",
" 110004 | \n",
" 110011 | \n",
" 110011 | \n",
" 4000020 | \n",
" 110011 | \n",
"
\n",
" \n",
" VAL_BONUS | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" MODELO | \n",
" 5U4FE4 | \n",
" 6A53K4 | \n",
" 5U3PN4 | \n",
" 5U7TA4 | \n",
" 2HBB3A | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 \\\n",
"EMPRESA_VENDAS 1 1 \n",
"REVENDA_VENDAS 1 1 \n",
"DTA_ENTRADA_SAIDA 2015-03-25 00:00:00 2015-03-31 00:00:00 \n",
"CLIENTE 52302 1 \n",
"VAL_TOTAL_REAL_ITEM 48800 37008.1 \n",
"VAL_TOTAL_NOTA_ITEM 48800 37008.1 \n",
"VAL_CUSTO_CONTABIL 47713.8 37008.1 \n",
"VENDEDOR 110004 110011 \n",
"VAL_BONUS NaN NaN \n",
"MODELO 5U4FE4 6A53K4 \n",
"\n",
" 2 3 \\\n",
"EMPRESA_VENDAS 1 5 \n",
"REVENDA_VENDAS 1 1 \n",
"DTA_ENTRADA_SAIDA 2015-04-30 00:00:00 2018-09-18 00:00:00 \n",
"CLIENTE 1 51804 \n",
"VAL_TOTAL_REAL_ITEM 37043.2 41869.3 \n",
"VAL_TOTAL_NOTA_ITEM 37043.2 41869.3 \n",
"VAL_CUSTO_CONTABIL 37043.2 41869.3 \n",
"VENDEDOR 110011 4000020 \n",
"VAL_BONUS NaN NaN \n",
"MODELO 5U3PN4 5U7TA4 \n",
"\n",
" 4 \n",
"EMPRESA_VENDAS 1 \n",
"REVENDA_VENDAS 1 \n",
"DTA_ENTRADA_SAIDA 2015-05-14 00:00:00 \n",
"CLIENTE 1 \n",
"VAL_TOTAL_REAL_ITEM 127944 \n",
"VAL_TOTAL_NOTA_ITEM 127944 \n",
"VAL_CUSTO_CONTABIL 127944 \n",
"VENDEDOR 110011 \n",
"VAL_BONUS NaN \n",
"MODELO 2HBB3A "
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_vendas.head().T"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Número de vendas: 7284\n"
]
}
],
"source": [
"print(f'Número de vendas: {len(df_vendas)}')"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [],
"source": [
"# save tables\n",
"df_tables = [df_clientes, df_veiculos, df_vendas, df_vendedores]\n",
"\n",
"with open(PATH/'df_tables', 'wb') as f:\n",
" pickle.dump(df_tables, f)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1 table with all informations"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
"# load file\n",
"file = open(PATH/'df_tables','rb')\n",
"df_tables = pickle.load(file)\n",
"df_clientes, df_veiculos, df_vendas, df_vendedores = df_tables\n",
"file.close()"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"# function to merge 2 dataframes\n",
"def join_df(left, right, left_on, right_on=None, suffix='_y'):\n",
" if right_on is None: right_on = left_on\n",
" return left.merge(right, how='left', left_on=left_on, right_on=right_on, \n",
" suffixes=(\"\", suffix))"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
"df_vendas = join_df(df_vendas, df_clientes, \"CLIENTE\", \"CLIENTE_COD\")\n",
"df_vendas = join_df(df_vendas, df_veiculos, \"MODELO\", \"MODELO\")"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"df_vendas = df_vendas.drop(['CLIENTE_COD'], axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"EMPRESA_VENDAS int32\n",
"REVENDA_VENDAS int32\n",
"DTA_ENTRADA_SAIDA datetime64[ns]\n",
"CLIENTE int32\n",
"VAL_TOTAL_REAL_ITEM float64\n",
"VAL_TOTAL_NOTA_ITEM float64\n",
"VAL_CUSTO_CONTABIL float64\n",
"VENDEDOR int32\n",
"VAL_BONUS float64\n",
"MODELO object\n",
"DATA_NASCIMENTO datetime64[ns]\n",
"SEXO object\n",
"ESTADO_CIVIL int32\n",
"SITUACAO_CREDITO object\n",
"CLIENTE_TIPO object\n",
"TIPO_PESSOA object\n",
"CLIENTE_DTCADASTRO datetime64[ns]\n",
"EMPRESA_ORIGEM int32\n",
"REVENDA_ORIGEM int32\n",
"DTPRIMEIRAFATURA datetime64[ns]\n",
"DTULTIMAFATURA datetime64[ns]\n",
"LIMITE_CREDITO float32\n",
"CEP1 int32\n",
"CEP2 int32\n",
"CEP3 int32\n",
"DES_MODELO object\n",
"FAMILIA float64\n",
"MARCA object\n",
"CLASSIFICACAO float64\n",
"dtype: object"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_vendas.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" \n",
" \n",
" EMPRESA_VENDAS | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 5 | \n",
" 1 | \n",
"
\n",
" \n",
" REVENDA_VENDAS | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" DTA_ENTRADA_SAIDA | \n",
" 2015-03-25 00:00:00 | \n",
" 2015-03-31 00:00:00 | \n",
" 2015-04-30 00:00:00 | \n",
" 2018-09-18 00:00:00 | \n",
" 2015-05-14 00:00:00 | \n",
"
\n",
" \n",
" CLIENTE | \n",
" 52302 | \n",
" 1 | \n",
" 1 | \n",
" 51804 | \n",
" 1 | \n",
"
\n",
" \n",
" VAL_TOTAL_REAL_ITEM | \n",
" 48800 | \n",
" 37008.1 | \n",
" 37043.2 | \n",
" 41869.3 | \n",
" 127944 | \n",
"
\n",
" \n",
" VAL_TOTAL_NOTA_ITEM | \n",
" 48800 | \n",
" 37008.1 | \n",
" 37043.2 | \n",
" 41869.3 | \n",
" 127944 | \n",
"
\n",
" \n",
" VAL_CUSTO_CONTABIL | \n",
" 47713.8 | \n",
" 37008.1 | \n",
" 37043.2 | \n",
" 41869.3 | \n",
" 127944 | \n",
"
\n",
" \n",
" VENDEDOR | \n",
" 110004 | \n",
" 110011 | \n",
" 110011 | \n",
" 4000020 | \n",
" 110011 | \n",
"
\n",
" \n",
" VAL_BONUS | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" MODELO | \n",
" 5U4FE4 | \n",
" 6A53K4 | \n",
" 5U3PN4 | \n",
" 5U7TA4 | \n",
" 2HBB3A | \n",
"
\n",
" \n",
" DATA_NASCIMENTO | \n",
" 1977-09-09 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
"
\n",
" \n",
" SEXO | \n",
" M | \n",
" N | \n",
" N | \n",
" N | \n",
" N | \n",
"
\n",
" \n",
" ESTADO_CIVIL | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" SITUACAO_CREDITO | \n",
" Aprovado | \n",
" Aprovado | \n",
" Aprovado | \n",
" Aprovado | \n",
" Aprovado | \n",
"
\n",
" \n",
" CLIENTE_TIPO | \n",
" N | \n",
" N | \n",
" N | \n",
" N | \n",
" N | \n",
"
\n",
" \n",
" TIPO_PESSOA | \n",
" F | \n",
" J | \n",
" J | \n",
" J | \n",
" J | \n",
"
\n",
" \n",
" CLIENTE_DTCADASTRO | \n",
" 1900-01-01 00:00:00 | \n",
" 2015-02-24 00:00:00 | \n",
" 2015-02-24 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
" 2015-02-24 00:00:00 | \n",
"
\n",
" \n",
" EMPRESA_ORIGEM | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" REVENDA_ORIGEM | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" DTPRIMEIRAFATURA | \n",
" 2015-03-02 00:00:00 | \n",
" 2015-04-01 00:00:00 | \n",
" 2015-04-01 00:00:00 | \n",
" 2015-03-25 00:00:00 | \n",
" 2015-04-01 00:00:00 | \n",
"
\n",
" \n",
" DTULTIMAFATURA | \n",
" 2015-03-25 00:00:00 | \n",
" 2016-09-30 00:00:00 | \n",
" 2016-09-30 00:00:00 | \n",
" 2019-03-07 00:00:00 | \n",
" 2016-09-30 00:00:00 | \n",
"
\n",
" \n",
" LIMITE_CREDITO | \n",
" 0 | \n",
" 1e+07 | \n",
" 1e+07 | \n",
" 0 | \n",
" 1e+07 | \n",
"
\n",
" \n",
" CEP1 | \n",
" 7 | \n",
" 7 | \n",
" 7 | \n",
" 7 | \n",
" 7 | \n",
"
\n",
" \n",
" CEP2 | \n",
" 7 | \n",
" 7 | \n",
" 7 | \n",
" 7 | \n",
" 7 | \n",
"
\n",
" \n",
" CEP3 | \n",
" 5 | \n",
" 0 | \n",
" 0 | \n",
" 8 | \n",
" 0 | \n",
"
\n",
" \n",
" DES_MODELO | \n",
" FRONTIER | \n",
" FIAT SIENA | \n",
" VOYAGE 1.6 | \n",
" VW/FOX 1.6 PIME GII | \n",
" PEUGEOT 207 PASSION XS A | \n",
"
\n",
" \n",
" FAMILIA | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 58 | \n",
"
\n",
" \n",
" MARCA | \n",
" VW | \n",
" VW | \n",
" VW | \n",
" VW | \n",
" PEUG | \n",
"
\n",
" \n",
" CLASSIFICACAO | \n",
" 3 | \n",
" 1 | \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 \\\n",
"EMPRESA_VENDAS 1 \n",
"REVENDA_VENDAS 1 \n",
"DTA_ENTRADA_SAIDA 2015-03-25 00:00:00 \n",
"CLIENTE 52302 \n",
"VAL_TOTAL_REAL_ITEM 48800 \n",
"VAL_TOTAL_NOTA_ITEM 48800 \n",
"VAL_CUSTO_CONTABIL 47713.8 \n",
"VENDEDOR 110004 \n",
"VAL_BONUS NaN \n",
"MODELO 5U4FE4 \n",
"DATA_NASCIMENTO 1977-09-09 00:00:00 \n",
"SEXO M \n",
"ESTADO_CIVIL 2 \n",
"SITUACAO_CREDITO Aprovado \n",
"CLIENTE_TIPO N \n",
"TIPO_PESSOA F \n",
"CLIENTE_DTCADASTRO 1900-01-01 00:00:00 \n",
"EMPRESA_ORIGEM 1 \n",
"REVENDA_ORIGEM 1 \n",
"DTPRIMEIRAFATURA 2015-03-02 00:00:00 \n",
"DTULTIMAFATURA 2015-03-25 00:00:00 \n",
"LIMITE_CREDITO 0 \n",
"CEP1 7 \n",
"CEP2 7 \n",
"CEP3 5 \n",
"DES_MODELO FRONTIER \n",
"FAMILIA 1 \n",
"MARCA VW \n",
"CLASSIFICACAO 3 \n",
"\n",
" 1 \\\n",
"EMPRESA_VENDAS 1 \n",
"REVENDA_VENDAS 1 \n",
"DTA_ENTRADA_SAIDA 2015-03-31 00:00:00 \n",
"CLIENTE 1 \n",
"VAL_TOTAL_REAL_ITEM 37008.1 \n",
"VAL_TOTAL_NOTA_ITEM 37008.1 \n",
"VAL_CUSTO_CONTABIL 37008.1 \n",
"VENDEDOR 110011 \n",
"VAL_BONUS NaN \n",
"MODELO 6A53K4 \n",
"DATA_NASCIMENTO 1900-01-01 00:00:00 \n",
"SEXO N \n",
"ESTADO_CIVIL 0 \n",
"SITUACAO_CREDITO Aprovado \n",
"CLIENTE_TIPO N \n",
"TIPO_PESSOA J \n",
"CLIENTE_DTCADASTRO 2015-02-24 00:00:00 \n",
"EMPRESA_ORIGEM 1 \n",
"REVENDA_ORIGEM 1 \n",
"DTPRIMEIRAFATURA 2015-04-01 00:00:00 \n",
"DTULTIMAFATURA 2016-09-30 00:00:00 \n",
"LIMITE_CREDITO 1e+07 \n",
"CEP1 7 \n",
"CEP2 7 \n",
"CEP3 0 \n",
"DES_MODELO FIAT SIENA \n",
"FAMILIA 1 \n",
"MARCA VW \n",
"CLASSIFICACAO 1 \n",
"\n",
" 2 \\\n",
"EMPRESA_VENDAS 1 \n",
"REVENDA_VENDAS 1 \n",
"DTA_ENTRADA_SAIDA 2015-04-30 00:00:00 \n",
"CLIENTE 1 \n",
"VAL_TOTAL_REAL_ITEM 37043.2 \n",
"VAL_TOTAL_NOTA_ITEM 37043.2 \n",
"VAL_CUSTO_CONTABIL 37043.2 \n",
"VENDEDOR 110011 \n",
"VAL_BONUS NaN \n",
"MODELO 5U3PN4 \n",
"DATA_NASCIMENTO 1900-01-01 00:00:00 \n",
"SEXO N \n",
"ESTADO_CIVIL 0 \n",
"SITUACAO_CREDITO Aprovado \n",
"CLIENTE_TIPO N \n",
"TIPO_PESSOA J \n",
"CLIENTE_DTCADASTRO 2015-02-24 00:00:00 \n",
"EMPRESA_ORIGEM 1 \n",
"REVENDA_ORIGEM 1 \n",
"DTPRIMEIRAFATURA 2015-04-01 00:00:00 \n",
"DTULTIMAFATURA 2016-09-30 00:00:00 \n",
"LIMITE_CREDITO 1e+07 \n",
"CEP1 7 \n",
"CEP2 7 \n",
"CEP3 0 \n",
"DES_MODELO VOYAGE 1.6 \n",
"FAMILIA 1 \n",
"MARCA VW \n",
"CLASSIFICACAO 3 \n",
"\n",
" 3 \\\n",
"EMPRESA_VENDAS 5 \n",
"REVENDA_VENDAS 1 \n",
"DTA_ENTRADA_SAIDA 2018-09-18 00:00:00 \n",
"CLIENTE 51804 \n",
"VAL_TOTAL_REAL_ITEM 41869.3 \n",
"VAL_TOTAL_NOTA_ITEM 41869.3 \n",
"VAL_CUSTO_CONTABIL 41869.3 \n",
"VENDEDOR 4000020 \n",
"VAL_BONUS NaN \n",
"MODELO 5U7TA4 \n",
"DATA_NASCIMENTO 1900-01-01 00:00:00 \n",
"SEXO N \n",
"ESTADO_CIVIL 0 \n",
"SITUACAO_CREDITO Aprovado \n",
"CLIENTE_TIPO N \n",
"TIPO_PESSOA J \n",
"CLIENTE_DTCADASTRO 1900-01-01 00:00:00 \n",
"EMPRESA_ORIGEM 1 \n",
"REVENDA_ORIGEM 1 \n",
"DTPRIMEIRAFATURA 2015-03-25 00:00:00 \n",
"DTULTIMAFATURA 2019-03-07 00:00:00 \n",
"LIMITE_CREDITO 0 \n",
"CEP1 7 \n",
"CEP2 7 \n",
"CEP3 8 \n",
"DES_MODELO VW/FOX 1.6 PIME GII \n",
"FAMILIA 1 \n",
"MARCA VW \n",
"CLASSIFICACAO 1 \n",
"\n",
" 4 \n",
"EMPRESA_VENDAS 1 \n",
"REVENDA_VENDAS 1 \n",
"DTA_ENTRADA_SAIDA 2015-05-14 00:00:00 \n",
"CLIENTE 1 \n",
"VAL_TOTAL_REAL_ITEM 127944 \n",
"VAL_TOTAL_NOTA_ITEM 127944 \n",
"VAL_CUSTO_CONTABIL 127944 \n",
"VENDEDOR 110011 \n",
"VAL_BONUS NaN \n",
"MODELO 2HBB3A \n",
"DATA_NASCIMENTO 1900-01-01 00:00:00 \n",
"SEXO N \n",
"ESTADO_CIVIL 0 \n",
"SITUACAO_CREDITO Aprovado \n",
"CLIENTE_TIPO N \n",
"TIPO_PESSOA J \n",
"CLIENTE_DTCADASTRO 2015-02-24 00:00:00 \n",
"EMPRESA_ORIGEM 1 \n",
"REVENDA_ORIGEM 1 \n",
"DTPRIMEIRAFATURA 2015-04-01 00:00:00 \n",
"DTULTIMAFATURA 2016-09-30 00:00:00 \n",
"LIMITE_CREDITO 1e+07 \n",
"CEP1 7 \n",
"CEP2 7 \n",
"CEP3 0 \n",
"DES_MODELO PEUGEOT 207 PASSION XS A \n",
"FAMILIA 58 \n",
"MARCA PEUG \n",
"CLASSIFICACAO 1 "
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_vendas.head().T"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Número de vendas: 7284\n"
]
}
],
"source": [
"print(f'Número de vendas: {len(df_vendas)}')"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"# save file\n",
"df_vendas.to_pickle(PATH/'df_vendas')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dates"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following extracts particular date fields from a complete datetime for the purpose of constructing categoricals.\n",
"\n",
"You should *always* consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can't capture any trend/cyclical behavior as a function of time at any of these granularities. We'll add to every table with a date field."
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [],
"source": [
"# load file\n",
"file = open(PATH/'df_vendas','rb')\n",
"df_vendas = pickle.load(file)\n",
"file.close()"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [],
"source": [
"def add_datepart(df, fldname, drop=True, time=False):\n",
" \"Helper function that adds columns relevant to a date.\"\n",
" fld = df[fldname]\n",
" fld_dtype = fld.dtype\n",
" if isinstance(fld_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):\n",
" fld_dtype = np.datetime64\n",
"\n",
" if not np.issubdtype(fld_dtype, np.datetime64):\n",
" df[fldname] = fld = pd.to_datetime(fld, infer_datetime_format=True)\n",
" targ_pre = re.sub('[Dd]ate$', '', fldname)\n",
" attr = ['Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',\n",
" 'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start']\n",
" if time: attr = attr + ['Hour', 'Minute', 'Second']\n",
" for n in attr: df[targ_pre + '_' + n] = getattr(fld.dt, n.lower())\n",
" df[targ_pre + '_' + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9\n",
" if drop: df.drop(fldname, axis=1, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [],
"source": [
"add_datepart(df_vendas, \"DTA_ENTRADA_SAIDA\", drop=False)\n",
"add_datepart(df_vendas, \"DATA_NASCIMENTO\", drop=False)\n",
"add_datepart(df_vendas, \"CLIENTE_DTCADASTRO\", drop=False)\n",
"add_datepart(df_vendas, \"DTPRIMEIRAFATURA\", drop=False)\n",
"add_datepart(df_vendas, \"DTULTIMAFATURA\", drop=False)"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" \n",
" \n",
" EMPRESA_VENDAS | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 5 | \n",
" 1 | \n",
"
\n",
" \n",
" REVENDA_VENDAS | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" DTA_ENTRADA_SAIDA | \n",
" 2015-03-25 00:00:00 | \n",
" 2015-03-31 00:00:00 | \n",
" 2015-04-30 00:00:00 | \n",
" 2018-09-18 00:00:00 | \n",
" 2015-05-14 00:00:00 | \n",
"
\n",
" \n",
" CLIENTE | \n",
" 52302 | \n",
" 1 | \n",
" 1 | \n",
" 51804 | \n",
" 1 | \n",
"
\n",
" \n",
" VAL_TOTAL_REAL_ITEM | \n",
" 48800 | \n",
" 37008.1 | \n",
" 37043.2 | \n",
" 41869.3 | \n",
" 127944 | \n",
"
\n",
" \n",
" VAL_TOTAL_NOTA_ITEM | \n",
" 48800 | \n",
" 37008.1 | \n",
" 37043.2 | \n",
" 41869.3 | \n",
" 127944 | \n",
"
\n",
" \n",
" VAL_CUSTO_CONTABIL | \n",
" 47713.8 | \n",
" 37008.1 | \n",
" 37043.2 | \n",
" 41869.3 | \n",
" 127944 | \n",
"
\n",
" \n",
" VENDEDOR | \n",
" 110004 | \n",
" 110011 | \n",
" 110011 | \n",
" 4000020 | \n",
" 110011 | \n",
"
\n",
" \n",
" VAL_BONUS | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" MODELO | \n",
" 5U4FE4 | \n",
" 6A53K4 | \n",
" 5U3PN4 | \n",
" 5U7TA4 | \n",
" 2HBB3A | \n",
"
\n",
" \n",
" DATA_NASCIMENTO | \n",
" 1977-09-09 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
"
\n",
" \n",
" SEXO | \n",
" M | \n",
" N | \n",
" N | \n",
" N | \n",
" N | \n",
"
\n",
" \n",
" ESTADO_CIVIL | \n",
" 2 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" SITUACAO_CREDITO | \n",
" Aprovado | \n",
" Aprovado | \n",
" Aprovado | \n",
" Aprovado | \n",
" Aprovado | \n",
"
\n",
" \n",
" CLIENTE_TIPO | \n",
" N | \n",
" N | \n",
" N | \n",
" N | \n",
" N | \n",
"
\n",
" \n",
" TIPO_PESSOA | \n",
" F | \n",
" J | \n",
" J | \n",
" J | \n",
" J | \n",
"
\n",
" \n",
" CLIENTE_DTCADASTRO | \n",
" 1900-01-01 00:00:00 | \n",
" 2015-02-24 00:00:00 | \n",
" 2015-02-24 00:00:00 | \n",
" 1900-01-01 00:00:00 | \n",
" 2015-02-24 00:00:00 | \n",
"
\n",
" \n",
" EMPRESA_ORIGEM | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" REVENDA_ORIGEM | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" DTPRIMEIRAFATURA | \n",
" 2015-03-02 00:00:00 | \n",
" 2015-04-01 00:00:00 | \n",
" 2015-04-01 00:00:00 | \n",
" 2015-03-25 00:00:00 | \n",
" 2015-04-01 00:00:00 | \n",
"
\n",
" \n",
" DTULTIMAFATURA | \n",
" 2015-03-25 00:00:00 | \n",
" 2016-09-30 00:00:00 | \n",
" 2016-09-30 00:00:00 | \n",
" 2019-03-07 00:00:00 | \n",
" 2016-09-30 00:00:00 | \n",
"
\n",
" \n",
" LIMITE_CREDITO | \n",
" 0 | \n",
" 1e+07 | \n",
" 1e+07 | \n",
" 0 | \n",
" 1e+07 | \n",
"
\n",
" \n",
" CEP1 | \n",
" 7 | \n",
" 7 | \n",
" 7 | \n",
" 7 | \n",
" 7 | \n",
"
\n",
" \n",
" CEP2 | \n",
" 7 | \n",
" 7 | \n",
" 7 | \n",
" 7 | \n",
" 7 | \n",
"
\n",
" \n",
" CEP3 | \n",
" 5 | \n",
" 0 | \n",
" 0 | \n",
" 8 | \n",
" 0 | \n",
"
\n",
" \n",
" DES_MODELO | \n",
" FRONTIER | \n",
" FIAT SIENA | \n",
" VOYAGE 1.6 | \n",
" VW/FOX 1.6 PIME GII | \n",
" PEUGEOT 207 PASSION XS A | \n",
"
\n",
" \n",
" FAMILIA | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 58 | \n",
"
\n",
" \n",
" MARCA | \n",
" VW | \n",
" VW | \n",
" VW | \n",
" VW | \n",
" PEUG | \n",
"
\n",
" \n",
" CLASSIFICACAO | \n",
" 3 | \n",
" 1 | \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" DTA_ENTRADA_SAIDA_Year | \n",
" 2015 | \n",
" 2015 | \n",
" 2015 | \n",
" 2018 | \n",
" 2015 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" CLIENTE_DTCADASTRO_Is_quarter_start | \n",
" True | \n",
" False | \n",
" False | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
" CLIENTE_DTCADASTRO_Is_year_end | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" CLIENTE_DTCADASTRO_Is_year_start | \n",
" True | \n",
" False | \n",
" False | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
" CLIENTE_DTCADASTRO_Elapsed | \n",
" -2208988800 | \n",
" 1424736000 | \n",
" 1424736000 | \n",
" -2208988800 | \n",
" 1424736000 | \n",
"
\n",
" \n",
" DTPRIMEIRAFATURA_Year | \n",
" 2015 | \n",
" 2015 | \n",
" 2015 | \n",
" 2015 | \n",
" 2015 | \n",
"
\n",
" \n",
" DTPRIMEIRAFATURA_Month | \n",
" 3 | \n",
" 4 | \n",
" 4 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" DTPRIMEIRAFATURA_Week | \n",
" 10 | \n",
" 14 | \n",
" 14 | \n",
" 13 | \n",
" 14 | \n",
"
\n",
" \n",
" DTPRIMEIRAFATURA_Day | \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
" 25 | \n",
" 1 | \n",
"
\n",
" \n",
" DTPRIMEIRAFATURA_Dayofweek | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
"
\n",
" \n",
" DTPRIMEIRAFATURA_Dayofyear | \n",
" 61 | \n",
" 91 | \n",
" 91 | \n",
" 84 | \n",
" 91 | \n",
"
\n",
" \n",
" DTPRIMEIRAFATURA_Is_month_end | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" DTPRIMEIRAFATURA_Is_month_start | \n",
" False | \n",
" True | \n",
" True | \n",
" False | \n",
" True | \n",
"
\n",
" \n",
" DTPRIMEIRAFATURA_Is_quarter_end | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" DTPRIMEIRAFATURA_Is_quarter_start | \n",
" False | \n",
" True | \n",
" True | \n",
" False | \n",
" True | \n",
"
\n",
" \n",
" DTPRIMEIRAFATURA_Is_year_end | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" DTPRIMEIRAFATURA_Is_year_start | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" DTPRIMEIRAFATURA_Elapsed | \n",
" 1425254400 | \n",
" 1427846400 | \n",
" 1427846400 | \n",
" 1427241600 | \n",
" 1427846400 | \n",
"
\n",
" \n",
" DTULTIMAFATURA_Year | \n",
" 2015 | \n",
" 2016 | \n",
" 2016 | \n",
" 2019 | \n",
" 2016 | \n",
"
\n",
" \n",
" DTULTIMAFATURA_Month | \n",
" 3 | \n",
" 9 | \n",
" 9 | \n",
" 3 | \n",
" 9 | \n",
"
\n",
" \n",
" DTULTIMAFATURA_Week | \n",
" 13 | \n",
" 39 | \n",
" 39 | \n",
" 10 | \n",
" 39 | \n",
"
\n",
" \n",
" DTULTIMAFATURA_Day | \n",
" 25 | \n",
" 30 | \n",
" 30 | \n",
" 7 | \n",
" 30 | \n",
"
\n",
" \n",
" DTULTIMAFATURA_Dayofweek | \n",
" 2 | \n",
" 4 | \n",
" 4 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" DTULTIMAFATURA_Dayofyear | \n",
" 84 | \n",
" 274 | \n",
" 274 | \n",
" 66 | \n",
" 274 | \n",
"
\n",
" \n",
" DTULTIMAFATURA_Is_month_end | \n",
" False | \n",
" True | \n",
" True | \n",
" False | \n",
" True | \n",
"
\n",
" \n",
" DTULTIMAFATURA_Is_month_start | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" DTULTIMAFATURA_Is_quarter_end | \n",
" False | \n",
" True | \n",
" True | \n",
" False | \n",
" True | \n",
"
\n",
" \n",
" DTULTIMAFATURA_Is_quarter_start | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" DTULTIMAFATURA_Is_year_end | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" DTULTIMAFATURA_Is_year_start | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" DTULTIMAFATURA_Elapsed | \n",
" 1427241600 | \n",
" 1475193600 | \n",
" 1475193600 | \n",
" 1551916800 | \n",
" 1475193600 | \n",
"
\n",
" \n",
"
\n",
"
94 rows × 5 columns
\n",
"
"
],
"text/plain": [
" 0 \\\n",
"EMPRESA_VENDAS 1 \n",
"REVENDA_VENDAS 1 \n",
"DTA_ENTRADA_SAIDA 2015-03-25 00:00:00 \n",
"CLIENTE 52302 \n",
"VAL_TOTAL_REAL_ITEM 48800 \n",
"VAL_TOTAL_NOTA_ITEM 48800 \n",
"VAL_CUSTO_CONTABIL 47713.8 \n",
"VENDEDOR 110004 \n",
"VAL_BONUS NaN \n",
"MODELO 5U4FE4 \n",
"DATA_NASCIMENTO 1977-09-09 00:00:00 \n",
"SEXO M \n",
"ESTADO_CIVIL 2 \n",
"SITUACAO_CREDITO Aprovado \n",
"CLIENTE_TIPO N \n",
"TIPO_PESSOA F \n",
"CLIENTE_DTCADASTRO 1900-01-01 00:00:00 \n",
"EMPRESA_ORIGEM 1 \n",
"REVENDA_ORIGEM 1 \n",
"DTPRIMEIRAFATURA 2015-03-02 00:00:00 \n",
"DTULTIMAFATURA 2015-03-25 00:00:00 \n",
"LIMITE_CREDITO 0 \n",
"CEP1 7 \n",
"CEP2 7 \n",
"CEP3 5 \n",
"DES_MODELO FRONTIER \n",
"FAMILIA 1 \n",
"MARCA VW \n",
"CLASSIFICACAO 3 \n",
"DTA_ENTRADA_SAIDA_Year 2015 \n",
"... ... \n",
"CLIENTE_DTCADASTRO_Is_quarter_start True \n",
"CLIENTE_DTCADASTRO_Is_year_end False \n",
"CLIENTE_DTCADASTRO_Is_year_start True \n",
"CLIENTE_DTCADASTRO_Elapsed -2208988800 \n",
"DTPRIMEIRAFATURA_Year 2015 \n",
"DTPRIMEIRAFATURA_Month 3 \n",
"DTPRIMEIRAFATURA_Week 10 \n",
"DTPRIMEIRAFATURA_Day 2 \n",
"DTPRIMEIRAFATURA_Dayofweek 0 \n",
"DTPRIMEIRAFATURA_Dayofyear 61 \n",
"DTPRIMEIRAFATURA_Is_month_end False \n",
"DTPRIMEIRAFATURA_Is_month_start False \n",
"DTPRIMEIRAFATURA_Is_quarter_end False \n",
"DTPRIMEIRAFATURA_Is_quarter_start False \n",
"DTPRIMEIRAFATURA_Is_year_end False \n",
"DTPRIMEIRAFATURA_Is_year_start False \n",
"DTPRIMEIRAFATURA_Elapsed 1425254400 \n",
"DTULTIMAFATURA_Year 2015 \n",
"DTULTIMAFATURA_Month 3 \n",
"DTULTIMAFATURA_Week 13 \n",
"DTULTIMAFATURA_Day 25 \n",
"DTULTIMAFATURA_Dayofweek 2 \n",
"DTULTIMAFATURA_Dayofyear 84 \n",
"DTULTIMAFATURA_Is_month_end False \n",
"DTULTIMAFATURA_Is_month_start False \n",
"DTULTIMAFATURA_Is_quarter_end False \n",
"DTULTIMAFATURA_Is_quarter_start False \n",
"DTULTIMAFATURA_Is_year_end False \n",
"DTULTIMAFATURA_Is_year_start False \n",
"DTULTIMAFATURA_Elapsed 1427241600 \n",
"\n",
" 1 \\\n",
"EMPRESA_VENDAS 1 \n",
"REVENDA_VENDAS 1 \n",
"DTA_ENTRADA_SAIDA 2015-03-31 00:00:00 \n",
"CLIENTE 1 \n",
"VAL_TOTAL_REAL_ITEM 37008.1 \n",
"VAL_TOTAL_NOTA_ITEM 37008.1 \n",
"VAL_CUSTO_CONTABIL 37008.1 \n",
"VENDEDOR 110011 \n",
"VAL_BONUS NaN \n",
"MODELO 6A53K4 \n",
"DATA_NASCIMENTO 1900-01-01 00:00:00 \n",
"SEXO N \n",
"ESTADO_CIVIL 0 \n",
"SITUACAO_CREDITO Aprovado \n",
"CLIENTE_TIPO N \n",
"TIPO_PESSOA J \n",
"CLIENTE_DTCADASTRO 2015-02-24 00:00:00 \n",
"EMPRESA_ORIGEM 1 \n",
"REVENDA_ORIGEM 1 \n",
"DTPRIMEIRAFATURA 2015-04-01 00:00:00 \n",
"DTULTIMAFATURA 2016-09-30 00:00:00 \n",
"LIMITE_CREDITO 1e+07 \n",
"CEP1 7 \n",
"CEP2 7 \n",
"CEP3 0 \n",
"DES_MODELO FIAT SIENA \n",
"FAMILIA 1 \n",
"MARCA VW \n",
"CLASSIFICACAO 1 \n",
"DTA_ENTRADA_SAIDA_Year 2015 \n",
"... ... \n",
"CLIENTE_DTCADASTRO_Is_quarter_start False \n",
"CLIENTE_DTCADASTRO_Is_year_end False \n",
"CLIENTE_DTCADASTRO_Is_year_start False \n",
"CLIENTE_DTCADASTRO_Elapsed 1424736000 \n",
"DTPRIMEIRAFATURA_Year 2015 \n",
"DTPRIMEIRAFATURA_Month 4 \n",
"DTPRIMEIRAFATURA_Week 14 \n",
"DTPRIMEIRAFATURA_Day 1 \n",
"DTPRIMEIRAFATURA_Dayofweek 2 \n",
"DTPRIMEIRAFATURA_Dayofyear 91 \n",
"DTPRIMEIRAFATURA_Is_month_end False \n",
"DTPRIMEIRAFATURA_Is_month_start True \n",
"DTPRIMEIRAFATURA_Is_quarter_end False \n",
"DTPRIMEIRAFATURA_Is_quarter_start True \n",
"DTPRIMEIRAFATURA_Is_year_end False \n",
"DTPRIMEIRAFATURA_Is_year_start False \n",
"DTPRIMEIRAFATURA_Elapsed 1427846400 \n",
"DTULTIMAFATURA_Year 2016 \n",
"DTULTIMAFATURA_Month 9 \n",
"DTULTIMAFATURA_Week 39 \n",
"DTULTIMAFATURA_Day 30 \n",
"DTULTIMAFATURA_Dayofweek 4 \n",
"DTULTIMAFATURA_Dayofyear 274 \n",
"DTULTIMAFATURA_Is_month_end True \n",
"DTULTIMAFATURA_Is_month_start False \n",
"DTULTIMAFATURA_Is_quarter_end True \n",
"DTULTIMAFATURA_Is_quarter_start False \n",
"DTULTIMAFATURA_Is_year_end False \n",
"DTULTIMAFATURA_Is_year_start False \n",
"DTULTIMAFATURA_Elapsed 1475193600 \n",
"\n",
" 2 \\\n",
"EMPRESA_VENDAS 1 \n",
"REVENDA_VENDAS 1 \n",
"DTA_ENTRADA_SAIDA 2015-04-30 00:00:00 \n",
"CLIENTE 1 \n",
"VAL_TOTAL_REAL_ITEM 37043.2 \n",
"VAL_TOTAL_NOTA_ITEM 37043.2 \n",
"VAL_CUSTO_CONTABIL 37043.2 \n",
"VENDEDOR 110011 \n",
"VAL_BONUS NaN \n",
"MODELO 5U3PN4 \n",
"DATA_NASCIMENTO 1900-01-01 00:00:00 \n",
"SEXO N \n",
"ESTADO_CIVIL 0 \n",
"SITUACAO_CREDITO Aprovado \n",
"CLIENTE_TIPO N \n",
"TIPO_PESSOA J \n",
"CLIENTE_DTCADASTRO 2015-02-24 00:00:00 \n",
"EMPRESA_ORIGEM 1 \n",
"REVENDA_ORIGEM 1 \n",
"DTPRIMEIRAFATURA 2015-04-01 00:00:00 \n",
"DTULTIMAFATURA 2016-09-30 00:00:00 \n",
"LIMITE_CREDITO 1e+07 \n",
"CEP1 7 \n",
"CEP2 7 \n",
"CEP3 0 \n",
"DES_MODELO VOYAGE 1.6 \n",
"FAMILIA 1 \n",
"MARCA VW \n",
"CLASSIFICACAO 3 \n",
"DTA_ENTRADA_SAIDA_Year 2015 \n",
"... ... \n",
"CLIENTE_DTCADASTRO_Is_quarter_start False \n",
"CLIENTE_DTCADASTRO_Is_year_end False \n",
"CLIENTE_DTCADASTRO_Is_year_start False \n",
"CLIENTE_DTCADASTRO_Elapsed 1424736000 \n",
"DTPRIMEIRAFATURA_Year 2015 \n",
"DTPRIMEIRAFATURA_Month 4 \n",
"DTPRIMEIRAFATURA_Week 14 \n",
"DTPRIMEIRAFATURA_Day 1 \n",
"DTPRIMEIRAFATURA_Dayofweek 2 \n",
"DTPRIMEIRAFATURA_Dayofyear 91 \n",
"DTPRIMEIRAFATURA_Is_month_end False \n",
"DTPRIMEIRAFATURA_Is_month_start True \n",
"DTPRIMEIRAFATURA_Is_quarter_end False \n",
"DTPRIMEIRAFATURA_Is_quarter_start True \n",
"DTPRIMEIRAFATURA_Is_year_end False \n",
"DTPRIMEIRAFATURA_Is_year_start False \n",
"DTPRIMEIRAFATURA_Elapsed 1427846400 \n",
"DTULTIMAFATURA_Year 2016 \n",
"DTULTIMAFATURA_Month 9 \n",
"DTULTIMAFATURA_Week 39 \n",
"DTULTIMAFATURA_Day 30 \n",
"DTULTIMAFATURA_Dayofweek 4 \n",
"DTULTIMAFATURA_Dayofyear 274 \n",
"DTULTIMAFATURA_Is_month_end True \n",
"DTULTIMAFATURA_Is_month_start False \n",
"DTULTIMAFATURA_Is_quarter_end True \n",
"DTULTIMAFATURA_Is_quarter_start False \n",
"DTULTIMAFATURA_Is_year_end False \n",
"DTULTIMAFATURA_Is_year_start False \n",
"DTULTIMAFATURA_Elapsed 1475193600 \n",
"\n",
" 3 \\\n",
"EMPRESA_VENDAS 5 \n",
"REVENDA_VENDAS 1 \n",
"DTA_ENTRADA_SAIDA 2018-09-18 00:00:00 \n",
"CLIENTE 51804 \n",
"VAL_TOTAL_REAL_ITEM 41869.3 \n",
"VAL_TOTAL_NOTA_ITEM 41869.3 \n",
"VAL_CUSTO_CONTABIL 41869.3 \n",
"VENDEDOR 4000020 \n",
"VAL_BONUS NaN \n",
"MODELO 5U7TA4 \n",
"DATA_NASCIMENTO 1900-01-01 00:00:00 \n",
"SEXO N \n",
"ESTADO_CIVIL 0 \n",
"SITUACAO_CREDITO Aprovado \n",
"CLIENTE_TIPO N \n",
"TIPO_PESSOA J \n",
"CLIENTE_DTCADASTRO 1900-01-01 00:00:00 \n",
"EMPRESA_ORIGEM 1 \n",
"REVENDA_ORIGEM 1 \n",
"DTPRIMEIRAFATURA 2015-03-25 00:00:00 \n",
"DTULTIMAFATURA 2019-03-07 00:00:00 \n",
"LIMITE_CREDITO 0 \n",
"CEP1 7 \n",
"CEP2 7 \n",
"CEP3 8 \n",
"DES_MODELO VW/FOX 1.6 PIME GII \n",
"FAMILIA 1 \n",
"MARCA VW \n",
"CLASSIFICACAO 1 \n",
"DTA_ENTRADA_SAIDA_Year 2018 \n",
"... ... \n",
"CLIENTE_DTCADASTRO_Is_quarter_start True \n",
"CLIENTE_DTCADASTRO_Is_year_end False \n",
"CLIENTE_DTCADASTRO_Is_year_start True \n",
"CLIENTE_DTCADASTRO_Elapsed -2208988800 \n",
"DTPRIMEIRAFATURA_Year 2015 \n",
"DTPRIMEIRAFATURA_Month 3 \n",
"DTPRIMEIRAFATURA_Week 13 \n",
"DTPRIMEIRAFATURA_Day 25 \n",
"DTPRIMEIRAFATURA_Dayofweek 2 \n",
"DTPRIMEIRAFATURA_Dayofyear 84 \n",
"DTPRIMEIRAFATURA_Is_month_end False \n",
"DTPRIMEIRAFATURA_Is_month_start False \n",
"DTPRIMEIRAFATURA_Is_quarter_end False \n",
"DTPRIMEIRAFATURA_Is_quarter_start False \n",
"DTPRIMEIRAFATURA_Is_year_end False \n",
"DTPRIMEIRAFATURA_Is_year_start False \n",
"DTPRIMEIRAFATURA_Elapsed 1427241600 \n",
"DTULTIMAFATURA_Year 2019 \n",
"DTULTIMAFATURA_Month 3 \n",
"DTULTIMAFATURA_Week 10 \n",
"DTULTIMAFATURA_Day 7 \n",
"DTULTIMAFATURA_Dayofweek 3 \n",
"DTULTIMAFATURA_Dayofyear 66 \n",
"DTULTIMAFATURA_Is_month_end False \n",
"DTULTIMAFATURA_Is_month_start False \n",
"DTULTIMAFATURA_Is_quarter_end False \n",
"DTULTIMAFATURA_Is_quarter_start False \n",
"DTULTIMAFATURA_Is_year_end False \n",
"DTULTIMAFATURA_Is_year_start False \n",
"DTULTIMAFATURA_Elapsed 1551916800 \n",
"\n",
" 4 \n",
"EMPRESA_VENDAS 1 \n",
"REVENDA_VENDAS 1 \n",
"DTA_ENTRADA_SAIDA 2015-05-14 00:00:00 \n",
"CLIENTE 1 \n",
"VAL_TOTAL_REAL_ITEM 127944 \n",
"VAL_TOTAL_NOTA_ITEM 127944 \n",
"VAL_CUSTO_CONTABIL 127944 \n",
"VENDEDOR 110011 \n",
"VAL_BONUS NaN \n",
"MODELO 2HBB3A \n",
"DATA_NASCIMENTO 1900-01-01 00:00:00 \n",
"SEXO N \n",
"ESTADO_CIVIL 0 \n",
"SITUACAO_CREDITO Aprovado \n",
"CLIENTE_TIPO N \n",
"TIPO_PESSOA J \n",
"CLIENTE_DTCADASTRO 2015-02-24 00:00:00 \n",
"EMPRESA_ORIGEM 1 \n",
"REVENDA_ORIGEM 1 \n",
"DTPRIMEIRAFATURA 2015-04-01 00:00:00 \n",
"DTULTIMAFATURA 2016-09-30 00:00:00 \n",
"LIMITE_CREDITO 1e+07 \n",
"CEP1 7 \n",
"CEP2 7 \n",
"CEP3 0 \n",
"DES_MODELO PEUGEOT 207 PASSION XS A \n",
"FAMILIA 58 \n",
"MARCA PEUG \n",
"CLASSIFICACAO 1 \n",
"DTA_ENTRADA_SAIDA_Year 2015 \n",
"... ... \n",
"CLIENTE_DTCADASTRO_Is_quarter_start False \n",
"CLIENTE_DTCADASTRO_Is_year_end False \n",
"CLIENTE_DTCADASTRO_Is_year_start False \n",
"CLIENTE_DTCADASTRO_Elapsed 1424736000 \n",
"DTPRIMEIRAFATURA_Year 2015 \n",
"DTPRIMEIRAFATURA_Month 4 \n",
"DTPRIMEIRAFATURA_Week 14 \n",
"DTPRIMEIRAFATURA_Day 1 \n",
"DTPRIMEIRAFATURA_Dayofweek 2 \n",
"DTPRIMEIRAFATURA_Dayofyear 91 \n",
"DTPRIMEIRAFATURA_Is_month_end False \n",
"DTPRIMEIRAFATURA_Is_month_start True \n",
"DTPRIMEIRAFATURA_Is_quarter_end False \n",
"DTPRIMEIRAFATURA_Is_quarter_start True \n",
"DTPRIMEIRAFATURA_Is_year_end False \n",
"DTPRIMEIRAFATURA_Is_year_start False \n",
"DTPRIMEIRAFATURA_Elapsed 1427846400 \n",
"DTULTIMAFATURA_Year 2016 \n",
"DTULTIMAFATURA_Month 9 \n",
"DTULTIMAFATURA_Week 39 \n",
"DTULTIMAFATURA_Day 30 \n",
"DTULTIMAFATURA_Dayofweek 4 \n",
"DTULTIMAFATURA_Dayofyear 274 \n",
"DTULTIMAFATURA_Is_month_end True \n",
"DTULTIMAFATURA_Is_month_start False \n",
"DTULTIMAFATURA_Is_quarter_end True \n",
"DTULTIMAFATURA_Is_quarter_start False \n",
"DTULTIMAFATURA_Is_year_end False \n",
"DTULTIMAFATURA_Is_year_start False \n",
"DTULTIMAFATURA_Elapsed 1475193600 \n",
"\n",
"[94 rows x 5 columns]"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_vendas.head().T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Missing Values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we'll fill in missing values to avoid complications with `NA`'s. `NA` (not available) is how Pandas indicates missing values; many models have problems when missing values are present, so it's always important to think about how to deal with them. In these cases, we are picking an arbitrary *signal value* that doesn't otherwise appear in the data."
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [],
"source": [
"list_col = list(df_vendas.columns.values)"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"VAL_BONUS\n",
"DES_MODELO\n",
"FAMILIA\n",
"MARCA\n",
"CLASSIFICACAO\n"
]
}
],
"source": [
"# list columns with missing values\n",
"for col in list_col:\n",
" s = df_vendas[col].isna().sum()\n",
"# n = (df_vendas[col] == '').sum()\n",
" if s != 0 :\n",
" print(col)"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [],
"source": [
"df_vendas['VAL_BONUS'] = df_vendas.VAL_BONUS.fillna(0.0).astype(np.float32)\n",
"df_vendas['FAMILIA'] = df_vendas['FAMILIA'].fillna(1.0).astype(np.int32)\n",
"df_vendas['CLASSIFICACAO'] = df_vendas['CLASSIFICACAO'].fillna(1.0).astype(np.int32)"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"DES_MODELO\n",
"MARCA\n"
]
}
],
"source": [
"# list columns with missing values\n",
"for col in list_col:\n",
" s = df_vendas[col].isna().sum()\n",
"# n = (df_vendas[col] == '').sum()\n",
" if s != 0 :\n",
" print(col)"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [],
"source": [
"# save file\n",
"df_vendas.to_pickle(PATH/'df_vendas_clean')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nota: como não usaremos as colunas `DES_MODELO` e `MARCA`, seus processamentos não valem a pena."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Durations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"É comum trabalhar com dados de séries temporais para extrair dados que explicam os relacionamentos entre linhas, em oposição a colunas, por exemplo:\n",
"* Média de funcionamento\n",
"* Tempo até o próximo evento\n",
"* Tempo desde o último evento\n",
"\n",
"Isso geralmente é difícil de fazer com a maioria das estruturas de manipulação de tabelas, pois elas são projetadas para trabalhar com relacionamentos entre colunas. \n",
"\n",
"Assim, criamos uma nova coluna `DTA_VENDA_CADASTRO_DaysElapsed` que fornece o número de dias entre o dia de venda (`DTA_ENTRADA_SAIDA`) e o de cadastro (`CLIENTE_DTCADASTRO`)."
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [],
"source": [
"# load file\n",
"df_vendas = pd.read_pickle(PATH/'df_vendas_clean')"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [],
"source": [
"df_vendas['DTA_VENDA_CADASTRO_DaysElapsed'] = ((df_vendas.DTA_ENTRADA_SAIDA_Elapsed - df_vendas.CLIENTE_DTCADASTRO_Elapsed)/(24*60*60)).copy()\n",
"\n",
"f = lambda x: int(x)\n",
"df_vendas['DTA_VENDA_CADASTRO_DaysElapsed'] = df_vendas['DTA_VENDA_CADASTRO_DaysElapsed'].apply(f)\n",
"\n",
"df_vendas['DTA_VENDA_CADASTRO_DaysElapsed'] = df_vendas['DTA_VENDA_CADASTRO_DaysElapsed'].astype(np.int32)"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [],
"source": [
"# save file\n",
"df_vendas.to_pickle(PATH/'df_vendas_clean_DaysElapsed')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "fastai test v1",
"language": "python",
"name": "fastai_test_v1"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}