{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Caterpillar Tube Pricing\n", "\n", "In this competition we have data about tube assemblies which suppliers deliver to Caterpillar. The data is comprised of many files describing these tube assemblies.\n", "\n", "- train_set and test_set contain information on price quotes from suppliers. Prices can be quoted in 2 ways: bracket and non-bracket pricing. Bracket pricing has multiple levels of purchase based on quantity (in other words, the cost is given assuming a purchase of quantity tubes). Non-bracket pricing has a minimum order amount (min_order) for which the price would apply. Each quote is issued with an annual_usage, an estimate of how many tube assemblies will be purchased in a given year;\n", "- tube contains information on physical parameters of tube assemblies;\n", "- bill_of_materials contains the list of components, and their quantities, used on each tube assembly;\n", "- specs contains the list of unique specifications for the tube assembly;\n", "- tube_end_form contains list of end types which are physically formed utilizing only the wall of the tube;\n", "- components contains the list of all of the components used;\n", "- type... contain the names for each feature;\n", "- comp... contains information on physical parameters of components by their type;\n", "\n", "The goal of competition if to predict prices of tube assemblies based on this information.\n", "\n", "The main challenge of this competition is combining the data and choosing/creating features. After this I use XGBoost for prediction.\n", "\n", "The metric to calculate the accuracy of predictions is Root Mean Squared Logarithmic Error (it penalizes an under-predicted estimate greater than an over-predicted estimate).\n", "\n", "The RMSLE is calculated as\n", "\n", "$$\n", "\\epsilon = \\sqrt{\\frac{1}{n} \\sum_{i=1}^n (\\log(p_i + 1) - \\log(a_i+1))^2 }\n", "$$\n", "\n", "Where:\n", "\n", "\\\\(\\epsilon\\\\) is the RMSLE value (score);\n", "\\\\(n\\\\) is the number of observations;\n", "\\\\(p_i\\\\) is prediction;\n", "\\\\(a_i\\\\) is the actual response for \\\\(i\\\\);\n", "\\\\(\\log(x)\\\\) is the natural logarithm of \\\\(x\\\\)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. [Data exploration](#datex)\n", "\n", " 1.1 [Bill of materials](#bom)\n", " \n", " 1.2 [Comp Adaptor](#comad)\n", " \n", " 1.3 [Comp Boss](#combo)\n", " \n", " 1.4 [Comp Hfl](#comhf)\n", " \n", " 1.5 [Comp Elbow](#comel)\n", " \n", " 1.6 [Comp Float](#comfl)\n", " \n", " 1.7 [Comp Nut](#comnu)\n", " \n", " 1.8 [Comp Other](#comot)\n", " \n", " 1.9 [Comp Sleeve](#comsl)\n", " \n", " 1.10 [Comp Straight](#comst)\n", " \n", " 1.11 [Comp Tee](#comte)\n", " \n", " 1.12 [Comp Threaded](#comth)\n", " \n", " 1.13 [Tube](#tube)\n", " \n", "2. [Data preparation](#datprep)\n", "\n", "3. [Model](#model)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import xgboost as xgb\n", "from sklearn import preprocessing\n", "from sklearn.preprocessing import LabelEncoder\n", "from scipy.stats import skew\n", "import glob" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data exploration\n", "\n", "An easy way to read all the files in separate variables is to get the list of files and loop through it." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\bill_of_materials.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\components.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\comp_adaptor.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\comp_boss.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\comp_elbow.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\comp_float.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\comp_hfl.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\comp_nut.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\comp_other.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\comp_sleeve.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\comp_straight.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\comp_tee.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\comp_threaded.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\sample_submission.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\specs.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\test_set.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\train_set.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\tube.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\tube_end_form.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\type_component.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\type_connection.csv',\n", " 'Kaggle/Data_Kaggle/Caterpillar Tube Pricing\\\\type_end_form.csv']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "glob.glob('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/*.csv')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "#Read each file in a separate data frame.\n", "bill_of_materials = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/bill_of_materials.csv')\n", "components = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/components.csv')\n", "comp_adaptor = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_adaptor.csv')\n", "comp_boss = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_boss.csv')\n", "comp_elbow = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_elbow.csv')\n", "comp_float = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_float.csv')\n", "comp_hfl = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_hfl.csv')\n", "comp_nut = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_nut.csv')\n", "comp_other = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_other.csv')\n", "comp_sleeve = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_sleeve.csv')\n", "comp_straight = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_straight.csv')\n", "comp_tee = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_tee.csv')\n", "comp_threaded = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/comp_threaded.csv')\n", "specs = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/specs.csv')\n", "tube = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/tube.csv')\n", "tube_end_form = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/tube_end_form.csv')\n", "type_component = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/type_component.csv')\n", "type_connection = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/type_connection.csv')\n", "type_end_form = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/type_end_form.csv')\n", "train = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/train_set.csv', parse_dates=[2,])\n", "test = pd.read_csv('Kaggle/Data_Kaggle/Caterpillar Tube Pricing/test_set.csv', parse_dates=[3,])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bill of materials" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 21198 entries, 0 to 21197\n", "Data columns (total 17 columns):\n", "tube_assembly_id 21198 non-null object\n", "component_id_1 19149 non-null object\n", "quantity_1 19149 non-null float64\n", "component_id_2 14786 non-null object\n", "quantity_2 14786 non-null float64\n", "component_id_3 4791 non-null object\n", "quantity_3 4798 non-null float64\n", "component_id_4 607 non-null object\n", "quantity_4 608 non-null float64\n", "component_id_5 92 non-null object\n", "quantity_5 92 non-null float64\n", "component_id_6 26 non-null object\n", "quantity_6 26 non-null float64\n", "component_id_7 7 non-null object\n", "quantity_7 7 non-null float64\n", "component_id_8 1 non-null object\n", "quantity_8 1 non-null float64\n", "dtypes: float64(8), object(9)\n", "memory usage: 2.7+ MB\n" ] } ], "source": [ "#The file contains information about components of tube assemblies. All information is necessary.\n", "#Missing values could be filled only with 0, but it isn't necessary.\n", "bill_of_materials.info()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tube_assembly_idcomponent_id_1quantity_1component_id_2quantity_2component_id_3quantity_3component_id_4quantity_4component_id_5quantity_5component_id_6quantity_6component_id_7quantity_7component_id_8quantity_8
11523TA-11524C-12531.0C-14962.0C-15001.0C-17081.0C-18111.0C-19201.0C-19211.0C-19811.0
\n", "
" ], "text/plain": [ " tube_assembly_id component_id_1 quantity_1 component_id_2 quantity_2 \\\n", "11523 TA-11524 C-1253 1.0 C-1496 2.0 \n", "\n", " component_id_3 quantity_3 component_id_4 quantity_4 component_id_5 \\\n", "11523 C-1500 1.0 C-1708 1.0 C-1811 \n", "\n", " quantity_5 component_id_6 quantity_6 component_id_7 quantity_7 \\\n", "11523 1.0 C-1920 1.0 C-1921 1.0 \n", "\n", " component_id_8 quantity_8 \n", "11523 C-1981 1.0 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Simply to see the line with all 8 components.\n", "bill_of_materials.loc[bill_of_materials.quantity_8.notnull() == True]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comp Adaptor" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 25 entries, 0 to 24\n", "Data columns (total 20 columns):\n", "component_id 25 non-null object\n", "component_type_id 25 non-null object\n", "adaptor_angle 1 non-null float64\n", "overall_length 24 non-null float64\n", "end_form_id_1 25 non-null object\n", "connection_type_id_1 24 non-null object\n", "length_1 1 non-null float64\n", "thread_size_1 17 non-null float64\n", "thread_pitch_1 17 non-null float64\n", "nominal_size_1 8 non-null float64\n", "end_form_id_2 25 non-null object\n", "connection_type_id_2 24 non-null object\n", "length_2 1 non-null float64\n", "thread_size_2 23 non-null float64\n", "thread_pitch_2 23 non-null float64\n", "nominal_size_2 2 non-null float64\n", "hex_size 17 non-null float64\n", "unique_feature 25 non-null object\n", "orientation 25 non-null object\n", "weight 23 non-null float64\n", "dtypes: float64(12), object(8)\n", "memory usage: 4.0+ KB\n" ] } ], "source": [ "#There are columns with too few non-null values. But it is necessary to see more.\n", "comp_adaptor.info()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
component_idcomponent_type_idadaptor_angleoverall_lengthend_form_id_1connection_type_id_1length_1thread_size_1thread_pitch_1nominal_size_1end_form_id_2connection_type_id_2length_2thread_size_2thread_pitch_2nominal_size_2hex_sizeunique_featureorientationweight
0C-0005CP-028NaN58.40A-001B-001NaN1.31212.0NaNA-001B-004NaN1.00011.5NaN34.93NoNo0.206
1C-0006CP-028NaN34.80A-001B-001NaN0.43720.0NaNA-001B-005NaN0.75016.0NaN22.20NoNo0.083
2C-1435CP-028NaN20.30A-007B-004NaNNaNNaN15.88A-001B-007NaN0.87518.0NaN22.22NoNo0.023
3C-1546CP-028NaN26.40A-007B-004NaN0.12527.0NaNA-001B-004NaN0.12527.0NaN15.88NoNo0.026
4C-1583CP-028NaN44.50A-001B-005NaN1.31212.0NaNA-007B-005NaN1.06212.0NaN38.10NoNo0.256
5C-1634CP-028NaN34.50A-001B-005NaN0.75016.0NaNA-001B-002NaN0.68716.0NaN22.23NoNo0.060
6C-1975CP-028NaN13.20A-007B-007NaNNaNNaN3.18A-001B-007NaN0.31228.0NaNNaNNoNo0.005
7C-0428CP-028NaN26.99A-001B-004NaN0.25018.0NaNA-007NaNNaNNaNNaN9.5217.46NoNo0.032
8C-0443CP-028NaN22.35A-007B-007NaNNaNNaN19.0599999999NaN1.06216.0NaN26.97NoNoNaN
9C-0823CP-028NaN16.80A-007B-007NaNNaNNaN9.52A-0019999NaN0.62518.09.5215.75NoNo0.014
10C-0354CP-028NaN36.30A-001B-004NaN1.25011.5NaNA-007B-004NaN1.00011.5NaN44.45NoNo0.136
11C-0361CP-028NaN41.00A-005NaNNaNNaNNaN6.35A-006B-002NaN0.56218.0NaN15.88NoNo0.005
12C-1312CP-028NaN14.20A-002B-007NaN0.43724.0NaN99999999NaNNaNNaNNaN11.11NoNo0.009
13C-1313CP-028NaN15.70A-007B-007NaNNaNNaN7.94A-001B-007NaN0.50020.0NaN12.57NoNo0.010
14C-1194CP-028NaN15.00A-007B-004NaN0.12527.0NaNA-007B-004NaN0.12527.0NaNNaNNoNo0.014
15C-1195CP-028NaN30.00A-007B-004NaN0.25018.0NaNA-007B-004NaN0.25018.0NaNNaNYesNo0.031
16C-1196CP-028NaN40.00A-007B-004NaN0.50014.0NaNA-007B-004NaN0.50014.0NaNNaNNoNo0.095
17C-1197CP-028NaN41.00A-007B-004NaN0.75014.0NaNA-007B-004NaN0.75014.0NaNNaNNoNo0.091
18C-1198CP-028NaN51.00A-007B-004NaN1.00011.5NaNA-007B-004NaN1.00011.5NaNNaNNoNo0.181
19C-1229CP-028NaN18.80A-007B-004NaNNaNNaN12.70A-001B-007NaN0.75018.0NaN18.92NoNo0.009
20C-1230CP-028NaN14.20A-007B-007NaNNaNNaN4.76A-001B-005NaN0.37524.0NaN9.40NoNo0.009
21C-1695CP-028NaN30.00A-001B-006NaN0.75016.0NaNA-0079999NaN9999.0009999.0NaN25.40NoNoNaN
22C-1812CP-028NaN35.10A-007B-004NaN0.37518.0NaNA-007B-004NaN0.37518.0NaNNaNNoNo0.091
23C-1828CP-028NaN28.40A-001B-005NaN1.31212.0NaNA-007B-005NaN0.56218.0NaN38.10NoNo0.181
24C-1868CP-02990.0NaNA-001B-00565.50.75016.0NaNA-001B-00228.00.81216.0NaNNaNNoYes0.226
\n", "
" ], "text/plain": [ " component_id component_type_id adaptor_angle overall_length \\\n", "0 C-0005 CP-028 NaN 58.40 \n", "1 C-0006 CP-028 NaN 34.80 \n", "2 C-1435 CP-028 NaN 20.30 \n", "3 C-1546 CP-028 NaN 26.40 \n", "4 C-1583 CP-028 NaN 44.50 \n", "5 C-1634 CP-028 NaN 34.50 \n", "6 C-1975 CP-028 NaN 13.20 \n", "7 C-0428 CP-028 NaN 26.99 \n", "8 C-0443 CP-028 NaN 22.35 \n", "9 C-0823 CP-028 NaN 16.80 \n", "10 C-0354 CP-028 NaN 36.30 \n", "11 C-0361 CP-028 NaN 41.00 \n", "12 C-1312 CP-028 NaN 14.20 \n", "13 C-1313 CP-028 NaN 15.70 \n", "14 C-1194 CP-028 NaN 15.00 \n", "15 C-1195 CP-028 NaN 30.00 \n", "16 C-1196 CP-028 NaN 40.00 \n", "17 C-1197 CP-028 NaN 41.00 \n", "18 C-1198 CP-028 NaN 51.00 \n", "19 C-1229 CP-028 NaN 18.80 \n", "20 C-1230 CP-028 NaN 14.20 \n", "21 C-1695 CP-028 NaN 30.00 \n", "22 C-1812 CP-028 NaN 35.10 \n", "23 C-1828 CP-028 NaN 28.40 \n", "24 C-1868 CP-029 90.0 NaN \n", "\n", " end_form_id_1 connection_type_id_1 length_1 thread_size_1 \\\n", "0 A-001 B-001 NaN 1.312 \n", "1 A-001 B-001 NaN 0.437 \n", "2 A-007 B-004 NaN NaN \n", "3 A-007 B-004 NaN 0.125 \n", "4 A-001 B-005 NaN 1.312 \n", "5 A-001 B-005 NaN 0.750 \n", "6 A-007 B-007 NaN NaN \n", "7 A-001 B-004 NaN 0.250 \n", "8 A-007 B-007 NaN NaN \n", "9 A-007 B-007 NaN NaN \n", "10 A-001 B-004 NaN 1.250 \n", "11 A-005 NaN NaN NaN \n", "12 A-002 B-007 NaN 0.437 \n", "13 A-007 B-007 NaN NaN \n", "14 A-007 B-004 NaN 0.125 \n", "15 A-007 B-004 NaN 0.250 \n", "16 A-007 B-004 NaN 0.500 \n", "17 A-007 B-004 NaN 0.750 \n", "18 A-007 B-004 NaN 1.000 \n", "19 A-007 B-004 NaN NaN \n", "20 A-007 B-007 NaN NaN \n", "21 A-001 B-006 NaN 0.750 \n", "22 A-007 B-004 NaN 0.375 \n", "23 A-001 B-005 NaN 1.312 \n", "24 A-001 B-005 65.5 0.750 \n", "\n", " thread_pitch_1 nominal_size_1 end_form_id_2 connection_type_id_2 \\\n", "0 12.0 NaN A-001 B-004 \n", "1 20.0 NaN A-001 B-005 \n", "2 NaN 15.88 A-001 B-007 \n", "3 27.0 NaN A-001 B-004 \n", "4 12.0 NaN A-007 B-005 \n", "5 16.0 NaN A-001 B-002 \n", "6 NaN 3.18 A-001 B-007 \n", "7 18.0 NaN A-007 NaN \n", "8 NaN 19.05 9999 9999 \n", "9 NaN 9.52 A-001 9999 \n", "10 11.5 NaN A-007 B-004 \n", "11 NaN 6.35 A-006 B-002 \n", "12 24.0 NaN 9999 9999 \n", "13 NaN 7.94 A-001 B-007 \n", "14 27.0 NaN A-007 B-004 \n", "15 18.0 NaN A-007 B-004 \n", "16 14.0 NaN A-007 B-004 \n", "17 14.0 NaN A-007 B-004 \n", "18 11.5 NaN A-007 B-004 \n", "19 NaN 12.70 A-001 B-007 \n", "20 NaN 4.76 A-001 B-005 \n", "21 16.0 NaN A-007 9999 \n", "22 18.0 NaN A-007 B-004 \n", "23 12.0 NaN A-007 B-005 \n", "24 16.0 NaN A-001 B-002 \n", "\n", " length_2 thread_size_2 thread_pitch_2 nominal_size_2 hex_size \\\n", "0 NaN 1.000 11.5 NaN 34.93 \n", "1 NaN 0.750 16.0 NaN 22.20 \n", "2 NaN 0.875 18.0 NaN 22.22 \n", "3 NaN 0.125 27.0 NaN 15.88 \n", "4 NaN 1.062 12.0 NaN 38.10 \n", "5 NaN 0.687 16.0 NaN 22.23 \n", "6 NaN 0.312 28.0 NaN NaN \n", "7 NaN NaN NaN 9.52 17.46 \n", "8 NaN 1.062 16.0 NaN 26.97 \n", "9 NaN 0.625 18.0 9.52 15.75 \n", "10 NaN 1.000 11.5 NaN 44.45 \n", "11 NaN 0.562 18.0 NaN 15.88 \n", "12 NaN NaN NaN NaN 11.11 \n", "13 NaN 0.500 20.0 NaN 12.57 \n", "14 NaN 0.125 27.0 NaN NaN \n", "15 NaN 0.250 18.0 NaN NaN \n", "16 NaN 0.500 14.0 NaN NaN \n", "17 NaN 0.750 14.0 NaN NaN \n", "18 NaN 1.000 11.5 NaN NaN \n", "19 NaN 0.750 18.0 NaN 18.92 \n", "20 NaN 0.375 24.0 NaN 9.40 \n", "21 NaN 9999.000 9999.0 NaN 25.40 \n", "22 NaN 0.375 18.0 NaN NaN \n", "23 NaN 0.562 18.0 NaN 38.10 \n", "24 28.0 0.812 16.0 NaN NaN \n", "\n", " unique_feature orientation weight \n", "0 No No 0.206 \n", "1 No No 0.083 \n", "2 No No 0.023 \n", "3 No No 0.026 \n", "4 No No 0.256 \n", "5 No No 0.060 \n", "6 No No 0.005 \n", "7 No No 0.032 \n", "8 No No NaN \n", "9 No No 0.014 \n", "10 No No 0.136 \n", "11 No No 0.005 \n", "12 No No 0.009 \n", "13 No No 0.010 \n", "14 No No 0.014 \n", "15 Yes No 0.031 \n", "16 No No 0.095 \n", "17 No No 0.091 \n", "18 No No 0.181 \n", "19 No No 0.009 \n", "20 No No 0.009 \n", "21 No No NaN \n", "22 No No 0.091 \n", "23 No No 0.181 \n", "24 No Yes 0.226 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_adaptor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "component_type_id, end_form_id_1, connection_type_id_1, end_form_id_2, connection_type_id_2 - descriptive information, it is useless in all files. Drop.\n", "\n", "adaptor_angle - only one value. Drop.\n", "\n", "unique_feature, orientation - categorical description. Drop.\n", "\n", "C-1695 has abnormal values. And only one tube assembly has it. So I'll drop it.\n", "\n", "C-0443 has NaN weight. And it belongs to only one tube assembly, which has two components, both with NaN weight. So drop.\n", "\n", "C-1868. It seems that for some reason overall_length wasn't calculated. I'll calculate it as a sum of length_1 and length_2.\n", "Drop length_1 and length_2 after using." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
component_idoverall_lengththread_size_1thread_pitch_1nominal_size_1thread_size_2thread_pitch_2nominal_size_2hex_sizeweight
0C-000558.401.31212.0NaN1.00011.5NaN34.930.206
1C-000634.800.43720.0NaN0.75016.0NaN22.200.083
2C-143520.30NaNNaN15.880.87518.0NaN22.220.023
3C-154626.400.12527.0NaN0.12527.0NaN15.880.026
4C-158344.501.31212.0NaN1.06212.0NaN38.100.256
5C-163434.500.75016.0NaN0.68716.0NaN22.230.060
6C-197513.20NaNNaN3.180.31228.0NaNNaN0.005
7C-042826.990.25018.0NaNNaNNaN9.5217.460.032
9C-082316.80NaNNaN9.520.62518.09.5215.750.014
10C-035436.301.25011.5NaN1.00011.5NaN44.450.136
11C-036141.00NaNNaN6.350.56218.0NaN15.880.005
12C-131214.200.43724.0NaNNaNNaNNaN11.110.009
13C-131315.70NaNNaN7.940.50020.0NaN12.570.010
14C-119415.000.12527.0NaN0.12527.0NaNNaN0.014
15C-119530.000.25018.0NaN0.25018.0NaNNaN0.031
16C-119640.000.50014.0NaN0.50014.0NaNNaN0.095
17C-119741.000.75014.0NaN0.75014.0NaNNaN0.091
18C-119851.001.00011.5NaN1.00011.5NaNNaN0.181
19C-122918.80NaNNaN12.700.75018.0NaN18.920.009
20C-123014.20NaNNaN4.760.37524.0NaN9.400.009
22C-181235.100.37518.0NaN0.37518.0NaNNaN0.091
23C-182828.401.31212.0NaN0.56218.0NaN38.100.181
24C-186893.500.75016.0NaN0.81216.0NaNNaN0.226
\n", "
" ], "text/plain": [ " component_id overall_length thread_size_1 thread_pitch_1 \\\n", "0 C-0005 58.40 1.312 12.0 \n", "1 C-0006 34.80 0.437 20.0 \n", "2 C-1435 20.30 NaN NaN \n", "3 C-1546 26.40 0.125 27.0 \n", "4 C-1583 44.50 1.312 12.0 \n", "5 C-1634 34.50 0.750 16.0 \n", "6 C-1975 13.20 NaN NaN \n", "7 C-0428 26.99 0.250 18.0 \n", "9 C-0823 16.80 NaN NaN \n", "10 C-0354 36.30 1.250 11.5 \n", "11 C-0361 41.00 NaN NaN \n", "12 C-1312 14.20 0.437 24.0 \n", "13 C-1313 15.70 NaN NaN \n", "14 C-1194 15.00 0.125 27.0 \n", "15 C-1195 30.00 0.250 18.0 \n", "16 C-1196 40.00 0.500 14.0 \n", "17 C-1197 41.00 0.750 14.0 \n", "18 C-1198 51.00 1.000 11.5 \n", "19 C-1229 18.80 NaN NaN \n", "20 C-1230 14.20 NaN NaN \n", "22 C-1812 35.10 0.375 18.0 \n", "23 C-1828 28.40 1.312 12.0 \n", "24 C-1868 93.50 0.750 16.0 \n", "\n", " nominal_size_1 thread_size_2 thread_pitch_2 nominal_size_2 hex_size \\\n", "0 NaN 1.000 11.5 NaN 34.93 \n", "1 NaN 0.750 16.0 NaN 22.20 \n", "2 15.88 0.875 18.0 NaN 22.22 \n", "3 NaN 0.125 27.0 NaN 15.88 \n", "4 NaN 1.062 12.0 NaN 38.10 \n", "5 NaN 0.687 16.0 NaN 22.23 \n", "6 3.18 0.312 28.0 NaN NaN \n", "7 NaN NaN NaN 9.52 17.46 \n", "9 9.52 0.625 18.0 9.52 15.75 \n", "10 NaN 1.000 11.5 NaN 44.45 \n", "11 6.35 0.562 18.0 NaN 15.88 \n", "12 NaN NaN NaN NaN 11.11 \n", "13 7.94 0.500 20.0 NaN 12.57 \n", "14 NaN 0.125 27.0 NaN NaN \n", "15 NaN 0.250 18.0 NaN NaN \n", "16 NaN 0.500 14.0 NaN NaN \n", "17 NaN 0.750 14.0 NaN NaN \n", "18 NaN 1.000 11.5 NaN NaN \n", "19 12.70 0.750 18.0 NaN 18.92 \n", "20 4.76 0.375 24.0 NaN 9.40 \n", "22 NaN 0.375 18.0 NaN NaN \n", "23 NaN 0.562 18.0 NaN 38.10 \n", "24 NaN 0.812 16.0 NaN NaN \n", "\n", " weight \n", "0 0.206 \n", "1 0.083 \n", "2 0.023 \n", "3 0.026 \n", "4 0.256 \n", "5 0.060 \n", "6 0.005 \n", "7 0.032 \n", "9 0.014 \n", "10 0.136 \n", "11 0.005 \n", "12 0.009 \n", "13 0.010 \n", "14 0.014 \n", "15 0.031 \n", "16 0.095 \n", "17 0.091 \n", "18 0.181 \n", "19 0.009 \n", "20 0.009 \n", "22 0.091 \n", "23 0.181 \n", "24 0.226 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_adaptor.drop(['adaptor_angle', 'component_type_id', 'end_form_id_1', 'connection_type_id_1', 'length_1', 'length_2',\n", " 'unique_feature', 'orientation', 'end_form_id_2', 'connection_type_id_2'], axis=1, inplace=True)\n", "\n", "#Could input a formula, but it single value.\n", "comp_adaptor.loc[comp_adaptor['overall_length'].isnull(), 'overall_length'] = 93.5\n", "comp_adaptor.drop(comp_adaptor.index[[8, 21]], inplace=True)\n", "comp_adaptor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comp Boss" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 147 entries, 0 to 146\n", "Data columns (total 15 columns):\n", "component_id 147 non-null object\n", "component_type_id 147 non-null object\n", "type 124 non-null object\n", "connection_type_id 147 non-null object\n", "outside_shape 124 non-null object\n", "base_type 124 non-null object\n", "height_over_tube 147 non-null float64\n", "bolt_pattern_long 23 non-null float64\n", "bolt_pattern_wide 17 non-null float64\n", "groove 147 non-null object\n", "base_diameter 57 non-null float64\n", "shoulder_diameter 30 non-null float64\n", "unique_feature 147 non-null object\n", "orientation 147 non-null object\n", "weight 145 non-null float64\n", "dtypes: float64(6), object(9)\n", "memory usage: 17.3+ KB\n" ] } ], "source": [ "#Descriptive and categorical features will be dropped.\n", "comp_boss.info()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 147 entries, 0 to 146\n", "Data columns (total 3 columns):\n", "component_id 147 non-null object\n", "height_over_tube 147 non-null float64\n", "weight 145 non-null float64\n", "dtypes: float64(2), object(1)\n", "memory usage: 3.5+ KB\n" ] } ], "source": [ "#Use only important information.\n", "comp_boss = comp_boss[['component_id', 'height_over_tube', 'weight']]\n", "comp_boss.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I wrote the following to check for outliers. Sometimes there are values which are too big. It is either mistake or it is due to unknown reason. I'll drop such values. Threshold is arbitrary." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "height_over_tube\n", " component_id height_over_tube weight\n", "31 C-0307 9999.0 1.05\n", "weight\n", " component_id height_over_tube weight\n", "21 C-0243 36.0 4.034\n" ] } ], "source": [ "for i in comp_boss.columns:\n", " if comp_boss[i].dtype != object:\n", " if comp_boss[i][comp_boss[i] > 4 * comp_boss[i].mean()].any() == True:\n", " print(i)\n", " print(comp_boss.loc[comp_boss[i] == comp_boss[i].max()])" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
component_idheight_over_tubeweight
0C-000817.00.032
1C-000913.00.033
2C-002028.40.070
3C-005427.10.180
4C-007120.00.080
\n", "
" ], "text/plain": [ " component_id height_over_tube weight\n", "0 C-0008 17.0 0.032\n", "1 C-0009 13.0 0.033\n", "2 C-0020 28.4 0.070\n", "3 C-0054 27.1 0.180\n", "4 C-0071 20.0 0.080" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Drop row with too big value. I don't drop weight, because it could be reasonable\n", "comp_boss.drop(comp_boss.index[31], inplace=True)\n", "comp_boss.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comp Hfl" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 6 entries, 0 to 5\n", "Data columns (total 9 columns):\n", "component_id 6 non-null object\n", "component_type_id 6 non-null object\n", "hose_diameter 6 non-null float64\n", "corresponding_shell 6 non-null object\n", "coupling_class 6 non-null object\n", "material 6 non-null object\n", "plating 6 non-null object\n", "orientation 6 non-null object\n", "weight 6 non-null float64\n", "dtypes: float64(2), object(7)\n", "memory usage: 512.0+ bytes\n" ] } ], "source": [ "comp_hfl.info()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
component_idcomponent_type_idhose_diametercorresponding_shellcoupling_classmaterialplatingorientationweight
0C-0872CP-0234.8C-0855SP-0098SP-0016YesNo0.010
1C-0873CP-0234.8C-0856SP-0098SP-0016YesNo0.010
2C-0874CP-0234.8C-0857SP-0098SP-0038YesNo0.001
3C-1039CP-02315.9C-1040SP-0097SP-0095NoNo0.052
4C-1041CP-02315.9C-1042SP-0099SP-0095NoNo0.065
5C-1043CP-02325.4C-1044SP-0099SP-0095NoNo0.196
\n", "
" ], "text/plain": [ " component_id component_type_id hose_diameter corresponding_shell \\\n", "0 C-0872 CP-023 4.8 C-0855 \n", "1 C-0873 CP-023 4.8 C-0856 \n", "2 C-0874 CP-023 4.8 C-0857 \n", "3 C-1039 CP-023 15.9 C-1040 \n", "4 C-1041 CP-023 15.9 C-1042 \n", "5 C-1043 CP-023 25.4 C-1044 \n", "\n", " coupling_class material plating orientation weight \n", "0 SP-0098 SP-0016 Yes No 0.010 \n", "1 SP-0098 SP-0016 Yes No 0.010 \n", "2 SP-0098 SP-0038 Yes No 0.001 \n", "3 SP-0097 SP-0095 No No 0.052 \n", "4 SP-0099 SP-0095 No No 0.065 \n", "5 SP-0099 SP-0095 No No 0.196 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_hfl" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
component_idhose_diameterweight
0C-08724.80.010
1C-08734.80.010
2C-08744.80.001
3C-103915.90.052
4C-104115.90.065
5C-104325.40.196
\n", "
" ], "text/plain": [ " component_id hose_diameter weight\n", "0 C-0872 4.8 0.010\n", "1 C-0873 4.8 0.010\n", "2 C-0874 4.8 0.001\n", "3 C-1039 15.9 0.052\n", "4 C-1041 15.9 0.065\n", "5 C-1043 25.4 0.196" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#It seems that only three columns are necessary.\n", "comp_hfl = comp_hfl[['component_id', 'hose_diameter', 'weight']]\n", "comp_hfl" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comp Elbow" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 178 entries, 0 to 177\n", "Data columns (total 16 columns):\n", "component_id 178 non-null object\n", "component_type_id 178 non-null object\n", "bolt_pattern_long 171 non-null float64\n", "bolt_pattern_wide 138 non-null float64\n", "extension_length 170 non-null float64\n", "overall_length 175 non-null float64\n", "thickness 171 non-null float64\n", "drop_length 171 non-null float64\n", "elbow_angle 130 non-null float64\n", "mj_class_code 41 non-null object\n", "mj_plug_class_code 40 non-null object\n", "plug_diameter 7 non-null float64\n", "groove 178 non-null object\n", "unique_feature 178 non-null object\n", "orientation 178 non-null object\n", "weight 176 non-null float64\n", "dtypes: float64(9), object(7)\n", "memory usage: 22.3+ KB\n" ] } ], "source": [ "comp_elbow.info()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
component_idcomponent_type_idbolt_pattern_longbolt_pattern_wideextension_lengthoverall_lengththicknessdrop_lengthelbow_anglemj_class_codemj_plug_class_codeplug_diametergrooveunique_featureorientationweight
0C-0013CP-008152.492.08105.0185.0113.075.090.0NaNNaNNaNYesNoYes8.890
1C-0016CP-00957.227.8042.069.044.024.090.0NaNNaNNaNNoNoYes1.172
2C-0017CP-00957.227.8042.069.047.026.090.0NaNNaNNaNYesNoYes1.245
3C-0018CP-00966.631.8050.080.057.031.590.0NaNNaNNaNYesNoYes1.863
4C-0021CP-01075.0NaN31.570.025.012.590.0NaNNaNNaNNoYesYes0.903
\n", "
" ], "text/plain": [ " component_id component_type_id bolt_pattern_long bolt_pattern_wide \\\n", "0 C-0013 CP-008 152.4 92.08 \n", "1 C-0016 CP-009 57.2 27.80 \n", "2 C-0017 CP-009 57.2 27.80 \n", "3 C-0018 CP-009 66.6 31.80 \n", "4 C-0021 CP-010 75.0 NaN \n", "\n", " extension_length overall_length thickness drop_length elbow_angle \\\n", "0 105.0 185.0 113.0 75.0 90.0 \n", "1 42.0 69.0 44.0 24.0 90.0 \n", "2 42.0 69.0 47.0 26.0 90.0 \n", "3 50.0 80.0 57.0 31.5 90.0 \n", "4 31.5 70.0 25.0 12.5 90.0 \n", "\n", " mj_class_code mj_plug_class_code plug_diameter groove unique_feature \\\n", "0 NaN NaN NaN Yes No \n", "1 NaN NaN NaN No No \n", "2 NaN NaN NaN Yes No \n", "3 NaN NaN NaN Yes No \n", "4 NaN NaN NaN No Yes \n", "\n", " orientation weight \n", "0 Yes 8.890 \n", "1 Yes 1.172 \n", "2 Yes 1.245 \n", "3 Yes 1.863 \n", "4 Yes 0.903 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_elbow.head()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Most of the columns aren't necessary.\n", "comp_elbow.drop(['component_type_id', 'mj_class_code', 'mj_plug_class_code', 'plug_diameter', 'groove', 'unique_feature',\n", " 'orientation',], axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "drop_length\n", " component_id bolt_pattern_long bolt_pattern_wide extension_length \\\n", "52 C-0537 78.2 NaN 32.0 \n", "\n", " overall_length thickness drop_length elbow_angle weight \n", "52 64.0 36.0 9999.0 169.25 0.42 \n", "weight\n", " component_id bolt_pattern_long bolt_pattern_wide extension_length \\\n", "62 C-0646 106.38 61.93 55.5 \n", "\n", " overall_length thickness drop_length elbow_angle weight \n", "62 111.0 115.0 57.5 90.0 10.19 \n" ] } ], "source": [ "for i in comp_elbow.columns:\n", " if comp_elbow[i].dtype != object:\n", " if comp_elbow[i][comp_elbow[i] > 4 * comp_elbow[i].mean()].any() == True:\n", " print(i)\n", " print(comp_elbow.loc[comp_elbow[i] == comp_elbow[i].max()])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "comp_elbow.drop(comp_elbow.index[52], inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comp Float" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 16 entries, 0 to 15\n", "Data columns (total 7 columns):\n", "component_id 16 non-null object\n", "component_type_id 16 non-null object\n", "bolt_pattern_long 16 non-null float64\n", "bolt_pattern_wide 16 non-null float64\n", "thickness 16 non-null float64\n", "orientation 16 non-null object\n", "weight 16 non-null float64\n", "dtypes: float64(4), object(3)\n", "memory usage: 976.0+ bytes\n" ] } ], "source": [ "comp_float.info()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
component_idbolt_pattern_longbolt_pattern_widethicknessweight
0C-0027148.0096.0018.002.230
1C-045458.7230.1828.000.590
2C-045558.7230.1828.000.525
3C-049452.4026.2015.850.230
4C-049658.8030.2014.200.284
5C-050877.7642.8825.401.144
6C-057269.8535.7115.700.447
7C-0797120.6569.8520.001.983
8C-089147.6222.2228.000.465
9C-109676.4069.9025.002.565
10C-1102130.2077.8036.504.060
11C-1135116.0017.0022.001.880
12C-114952.3626.1936.000.783
13C-145252.3826.1922.000.276
14C-145358.7230.1822.000.389
15C-1584106.3861.9325.400.516
\n", "
" ], "text/plain": [ " component_id bolt_pattern_long bolt_pattern_wide thickness weight\n", "0 C-0027 148.00 96.00 18.00 2.230\n", "1 C-0454 58.72 30.18 28.00 0.590\n", "2 C-0455 58.72 30.18 28.00 0.525\n", "3 C-0494 52.40 26.20 15.85 0.230\n", "4 C-0496 58.80 30.20 14.20 0.284\n", "5 C-0508 77.76 42.88 25.40 1.144\n", "6 C-0572 69.85 35.71 15.70 0.447\n", "7 C-0797 120.65 69.85 20.00 1.983\n", "8 C-0891 47.62 22.22 28.00 0.465\n", "9 C-1096 76.40 69.90 25.00 2.565\n", "10 C-1102 130.20 77.80 36.50 4.060\n", "11 C-1135 116.00 17.00 22.00 1.880\n", "12 C-1149 52.36 26.19 36.00 0.783\n", "13 C-1452 52.38 26.19 22.00 0.276\n", "14 C-1453 58.72 30.18 22.00 0.389\n", "15 C-1584 106.38 61.93 25.40 0.516" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Drop description.\n", "comp_float.drop(['component_type_id', 'orientation'], axis=1, inplace=True)\n", "comp_float" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comp Nut" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 65 entries, 0 to 64\n", "Data columns (total 11 columns):\n", "component_id 65 non-null object\n", "component_type_id 65 non-null object\n", "hex_nut_size 42 non-null float64\n", "seat_angle 15 non-null float64\n", "length 65 non-null float64\n", "thread_size 65 non-null object\n", "thread_pitch 65 non-null float64\n", "diameter 23 non-null float64\n", "blind_hole 23 non-null object\n", "orientation 65 non-null object\n", "weight 64 non-null float64\n", "dtypes: float64(6), object(5)\n", "memory usage: 5.7+ KB\n" ] } ], "source": [ "comp_nut.info()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
component_idhex_nut_sizelengththread_sizethread_pitchweight
0C-162120.6417.0.68716.00.015
1C-162434.9226.51.18712.00.035
2C-162328.5823.51.00014.00.044
3C-162223.8120.0.81216.00.036
4C-162541.2827.51.43712.00.129
\n", "
" ], "text/plain": [ " component_id hex_nut_size length thread_size thread_pitch weight\n", "0 C-1621 20.64 17.0 .687 16.0 0.015\n", "1 C-1624 34.92 26.5 1.187 12.0 0.035\n", "2 C-1623 28.58 23.5 1.000 14.0 0.044\n", "3 C-1622 23.81 20.0 .812 16.0 0.036\n", "4 C-1625 41.28 27.5 1.437 12.0 0.129" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_nut.drop(['component_type_id', 'seat_angle', 'diameter', 'blind_hole', 'orientation'], axis=1, inplace=True)\n", "comp_nut.head()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "weight\n", " component_id hex_nut_size length thread_size thread_pitch weight\n", "27 C-1442 57.15 36.1 1.875 12.0 0.343\n" ] } ], "source": [ "for i in comp_nut.columns:\n", " if comp_nut[i].dtype != object:\n", " if comp_nut[i][comp_nut[i] > 4 * comp_nut[i].mean()].any() == True:\n", " print(i)\n", " print(comp_nut.loc[comp_nut[i] == comp_nut[i].max()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comp Other" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 1001 entries, 0 to 1000\n", "Data columns (total 3 columns):\n", "component_id 1001 non-null object\n", "part_name 1001 non-null object\n", "weight 945 non-null float64\n", "dtypes: float64(1), object(2)\n", "memory usage: 23.5+ KB\n" ] } ], "source": [ "comp_other.info()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
component_idweight
0C-13850.014
1C-13860.005
2C-13690.003
3C-04220.003
4C-18170.014
\n", "
" ], "text/plain": [ " component_id weight\n", "0 C-1385 0.014\n", "1 C-1386 0.005\n", "2 C-1369 0.003\n", "3 C-0422 0.003\n", "4 C-1817 0.014" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Dtop description.\n", "comp_other.drop(['part_name'], axis=1, inplace=True)\n", "comp_other.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comp Sleeve" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 50 entries, 0 to 49\n", "Data columns (total 10 columns):\n", "component_id 50 non-null object\n", "component_type_id 50 non-null object\n", "connection_type_id 50 non-null object\n", "length 50 non-null float64\n", "intended_nut_thread 50 non-null float64\n", "intended_nut_pitch 50 non-null int64\n", "unique_feature 50 non-null object\n", "plating 50 non-null object\n", "orientation 50 non-null object\n", "weight 50 non-null float64\n", "dtypes: float64(3), int64(1), object(6)\n", "memory usage: 4.0+ KB\n" ] } ], "source": [ "comp_sleeve.info()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
component_idlengthintended_nut_threadintended_nut_pitchweight
0C-000117.31.062120.013
1C-000211.20.500200.005
2C-000319.31.187120.014
3C-00489.50.562180.006
4C-00499.50.812160.012
\n", "
" ], "text/plain": [ " component_id length intended_nut_thread intended_nut_pitch weight\n", "0 C-0001 17.3 1.062 12 0.013\n", "1 C-0002 11.2 0.500 20 0.005\n", "2 C-0003 19.3 1.187 12 0.014\n", "3 C-0048 9.5 0.562 18 0.006\n", "4 C-0049 9.5 0.812 16 0.012" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_sleeve.drop(['component_type_id', 'connection_type_id', 'unique_feature', 'plating', 'orientation'], axis=1, inplace=True)\n", "comp_sleeve.head()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "length\n", " component_id length intended_nut_thread intended_nut_pitch weight\n", "28 C-1627 9999.0 0.562 18 0.005\n", "29 C-1628 9999.0 0.687 16 0.006\n", "30 C-1629 9999.0 0.812 16 0.012\n", "31 C-1630 9999.0 1.000 14 0.018\n", "32 C-1631 9999.0 1.187 12 0.026\n", "33 C-1632 9999.0 1.437 12 0.033\n", "34 C-1633 9999.0 1.687 12 0.040\n", "48 C-1866 9999.0 2.000 12 0.035\n" ] } ], "source": [ "for i in comp_sleeve.columns:\n", " if comp_sleeve[i].dtype != object:\n", " if comp_sleeve[i][comp_sleeve[i] > 4 * comp_sleeve[i].mean()].any() == True:\n", " print(i)\n", " print(comp_sleeve.loc[comp_sleeve[i] == comp_sleeve[i].max()])" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "comp_sleeve.drop(comp_sleeve.index[[28, 29, 30, 31, 32, 33, 34, 48]], inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comp Straight" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 361 entries, 0 to 360\n", "Data columns (total 12 columns):\n", "component_id 361 non-null object\n", "component_type_id 361 non-null object\n", "bolt_pattern_long 291 non-null float64\n", "bolt_pattern_wide 204 non-null float64\n", "head_diameter 70 non-null float64\n", "overall_length 41 non-null float64\n", "thickness 361 non-null float64\n", "mj_class_code 120 non-null object\n", "groove 361 non-null object\n", "unique_feature 361 non-null object\n", "orientation 361 non-null object\n", "weight 354 non-null float64\n", "dtypes: float64(6), object(6)\n", "memory usage: 33.9+ KB\n" ] } ], "source": [ "comp_straight.info()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
component_idbolt_pattern_longbolt_pattern_widethicknessweight
0C-001266.6831.7520.00.788
1C-001447.6022.2015.00.339
2C-001566.7031.8020.00.788
3C-001977.8042.9036.51.533
4C-002947.6322.2316.00.286
\n", "
" ], "text/plain": [ " component_id bolt_pattern_long bolt_pattern_wide thickness weight\n", "0 C-0012 66.68 31.75 20.0 0.788\n", "1 C-0014 47.60 22.20 15.0 0.339\n", "2 C-0015 66.70 31.80 20.0 0.788\n", "3 C-0019 77.80 42.90 36.5 1.533\n", "4 C-0029 47.63 22.23 16.0 0.286" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_straight.drop(['component_type_id', 'overall_length', 'mj_class_code', 'head_diameter', 'unique_feature', 'groove',\n", " 'orientation'], axis=1, inplace=True)\n", "comp_straight.head()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "weight\n", " component_id bolt_pattern_long bolt_pattern_wide thickness weight\n", "171 C-1066 38.1 NaN 13.0 9.693\n" ] } ], "source": [ "for i in comp_straight.columns:\n", " if comp_straight[i].dtype != object:\n", " if comp_straight[i][comp_straight[i] > 4 * comp_straight[i].mean()].any() == True:\n", " print(i)\n", " print(comp_straight.loc[comp_straight[i] == comp_straight[i].max()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comp Tee" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 4 entries, 0 to 3\n", "Data columns (total 14 columns):\n", "component_id 4 non-null object\n", "component_type_id 4 non-null object\n", "bolt_pattern_long 4 non-null float64\n", "bolt_pattern_wide 4 non-null float64\n", "extension_length 4 non-null float64\n", "overall_length 4 non-null float64\n", "thickness 4 non-null int64\n", "drop_length 4 non-null float64\n", "mj_class_code 4 non-null object\n", "mj_plug_class_code 4 non-null object\n", "groove 4 non-null object\n", "unique_feature 4 non-null object\n", "orientation 4 non-null object\n", "weight 4 non-null float64\n", "dtypes: float64(6), int64(1), object(7)\n", "memory usage: 528.0+ bytes\n" ] } ], "source": [ "comp_tee.info()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
component_idbolt_pattern_longbolt_pattern_wideextension_lengthoverall_lengththicknessdrop_lengthweight
0C-027158.7030.2057.1093.05728.51.526
1C-180958.7230.1857.09108.05728.52.184
2C-183052.4026.2043.5078.55125.51.135
3C-186558.7030.2057.10107.05728.51.953
\n", "
" ], "text/plain": [ " component_id bolt_pattern_long bolt_pattern_wide extension_length \\\n", "0 C-0271 58.70 30.20 57.10 \n", "1 C-1809 58.72 30.18 57.09 \n", "2 C-1830 52.40 26.20 43.50 \n", "3 C-1865 58.70 30.20 57.10 \n", "\n", " overall_length thickness drop_length weight \n", "0 93.0 57 28.5 1.526 \n", "1 108.0 57 28.5 2.184 \n", "2 78.5 51 25.5 1.135 \n", "3 107.0 57 28.5 1.953 " ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_tee.drop(['component_type_id', 'mj_class_code', 'mj_plug_class_code', 'groove', 'unique_feature', 'orientation'],\n", " axis=1, inplace=True)\n", "comp_tee" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "for i in comp_tee.columns:\n", " if comp_tee[i].dtype != object:\n", " if comp_tee[i][comp_tee[i] > 4 * comp_tee[i].mean()].any() == True:\n", " print(i)\n", " print(comp_tee.loc[comp_tee[i] == comp_tee[i].max()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comp Threaded" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 194 entries, 0 to 193\n", "Data columns (total 32 columns):\n", "component_id 194 non-null object\n", "component_type_id 194 non-null object\n", "adaptor_angle 54 non-null float64\n", "overall_length 121 non-null float64\n", "hex_size 112 non-null float64\n", "end_form_id_1 194 non-null object\n", "connection_type_id_1 135 non-null object\n", "length_1 73 non-null float64\n", "thread_size_1 135 non-null float64\n", "thread_pitch_1 135 non-null float64\n", "nominal_size_1 60 non-null object\n", "end_form_id_2 194 non-null object\n", "connection_type_id_2 63 non-null object\n", "length_2 73 non-null float64\n", "thread_size_2 63 non-null float64\n", "thread_pitch_2 63 non-null float64\n", "nominal_size_2 131 non-null float64\n", "end_form_id_3 19 non-null object\n", "connection_type_id_3 10 non-null object\n", "length_3 19 non-null float64\n", "thread_size_3 10 non-null float64\n", "thread_pitch_3 10 non-null float64\n", "nominal_size_3 9 non-null float64\n", "end_form_id_4 1 non-null object\n", "connection_type_id_4 1 non-null object\n", "length_4 1 non-null float64\n", "thread_size_4 1 non-null float64\n", "thread_pitch_4 1 non-null float64\n", "nominal_size_4 0 non-null float64\n", "unique_feature 194 non-null object\n", "orientation 194 non-null object\n", "weight 193 non-null float64\n", "dtypes: float64(19), object(13)\n", "memory usage: 48.6+ KB\n" ] } ], "source": [ "comp_threaded.info()" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
component_idoverall_lengthhex_sizelength_1thread_size_1thread_pitch_1nominal_size_1length_2thread_size_2thread_pitch_2nominal_size_2length_3thread_size_3thread_pitch_3nominal_size_3length_4thread_size_4thread_pitch_4weight
0C-000724.412.70NaN0.43720.0NaNNaNNaNNaN6.35NaNNaNNaNNaNNaNNaNNaN0.028
1C-0030NaNNaN47.0NaNNaN4247.0NaNNaN42.00NaNNaNNaNNaNNaNNaNNaN0.914
2C-004128.032.00NaNNaNNaN21.7NaN0.81216.0NaNNaNNaNNaNNaNNaNNaNNaN0.120
3C-004338.027.00NaNNaNNaN21.7NaN0.81216.0NaNNaNNaNNaNNaNNaNNaNNaN0.090
4C-004430.541.28NaNNaNNaN34NaN1.43712.0NaNNaNNaNNaNNaNNaNNaNNaN0.182
\n", "
" ], "text/plain": [ " component_id overall_length hex_size length_1 thread_size_1 \\\n", "0 C-0007 24.4 12.70 NaN 0.437 \n", "1 C-0030 NaN NaN 47.0 NaN \n", "2 C-0041 28.0 32.00 NaN NaN \n", "3 C-0043 38.0 27.00 NaN NaN \n", "4 C-0044 30.5 41.28 NaN NaN \n", "\n", " thread_pitch_1 nominal_size_1 length_2 thread_size_2 thread_pitch_2 \\\n", "0 20.0 NaN NaN NaN NaN \n", "1 NaN 42 47.0 NaN NaN \n", "2 NaN 21.7 NaN 0.812 16.0 \n", "3 NaN 21.7 NaN 0.812 16.0 \n", "4 NaN 34 NaN 1.437 12.0 \n", "\n", " nominal_size_2 length_3 thread_size_3 thread_pitch_3 nominal_size_3 \\\n", "0 6.35 NaN NaN NaN NaN \n", "1 42.00 NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN NaN \n", "4 NaN NaN NaN NaN NaN \n", "\n", " length_4 thread_size_4 thread_pitch_4 weight \n", "0 NaN NaN NaN 0.028 \n", "1 NaN NaN NaN 0.914 \n", "2 NaN NaN NaN 0.120 \n", "3 NaN NaN NaN 0.090 \n", "4 NaN NaN NaN 0.182 " ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comp_threaded.drop(['component_type_id', 'adaptor_angle', 'end_form_id_1', 'connection_type_id_1', 'end_form_id_2',\n", " 'connection_type_id_2', 'end_form_id_3', 'connection_type_id_3', 'end_form_id_4', 'connection_type_id_4',\n", " 'nominal_size_4', 'unique_feature', 'orientation'], axis=1, inplace=True)\n", "comp_threaded.head()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "#There are five columns with length. So I fill NA with 0, summarize length and drop excessive columns.\n", "comp_threaded['length_1'] = comp_threaded['length_1'].fillna(0)\n", "comp_threaded['length_2'] = comp_threaded['length_2'].fillna(0)\n", "comp_threaded['length_3'] = comp_threaded['length_3'].fillna(0)\n", "comp_threaded['length_4'] = comp_threaded['length_4'].fillna(0)\n", "comp_threaded['overall_length'] = comp_threaded['overall_length'].fillna(0)\n", "comp_threaded['overall_length'] = comp_threaded['overall_length'] + comp_threaded['length_1'] + comp_threaded['length_2'] \\\n", "+ comp_threaded['length_3'] + comp_threaded['length_4']\n", "\n", "comp_threaded.drop(['length_1', 'length_2', 'length_3', 'length_4'], axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "nominal_size_2\n", " component_id overall_length hex_size thread_size_1 thread_pitch_1 \\\n", "40 C-0386 42.7 60.3 2.0 12.0 \n", "\n", " nominal_size_1 thread_size_2 thread_pitch_2 nominal_size_2 \\\n", "40 NaN NaN NaN 9999.0 \n", "\n", " thread_size_3 thread_pitch_3 nominal_size_3 thread_size_4 \\\n", "40 NaN NaN NaN NaN \n", "\n", " thread_pitch_4 weight \n", "40 NaN 0.543 \n", "nominal_size_3\n", " component_id overall_length hex_size thread_size_1 thread_pitch_1 \\\n", "90 C-0971 124.1 NaN 1.187 12.0 \n", "\n", " nominal_size_1 thread_size_2 thread_pitch_2 nominal_size_2 \\\n", "90 NaN 1.187 12.0 NaN \n", "\n", " thread_size_3 thread_pitch_3 nominal_size_3 thread_size_4 \\\n", "90 NaN NaN 9999.0 NaN \n", "\n", " thread_pitch_4 weight \n", "90 NaN 0.545 \n", "weight\n", " component_id overall_length hex_size thread_size_1 thread_pitch_1 \\\n", "91 C-0974 112.7 NaN 1.437 12.0 \n", "\n", " nominal_size_1 thread_size_2 thread_pitch_2 nominal_size_2 \\\n", "91 NaN NaN NaN 31.75 \n", "\n", " thread_size_3 thread_pitch_3 nominal_size_3 thread_size_4 \\\n", "91 NaN NaN NaN NaN \n", "\n", " thread_pitch_4 weight \n", "91 NaN 1.17 \n" ] } ], "source": [ "for i in comp_threaded.columns:\n", " if comp_threaded[i].dtype != object:\n", " if comp_threaded[i][comp_threaded[i] > 4 * comp_threaded[i].mean()].any() == True:\n", " print(i)\n", " print(comp_threaded.loc[comp_threaded[i] == comp_threaded[i].max()])" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": true }, "outputs": [], "source": [ "comp_threaded.drop(comp_threaded.index[[40, 90]], inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tube" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 21198 entries, 0 to 21197\n", "Data columns (total 16 columns):\n", "tube_assembly_id 21198 non-null object\n", "material_id 20919 non-null object\n", "diameter 21198 non-null float64\n", "wall 21198 non-null float64\n", "length 21198 non-null float64\n", "num_bends 21198 non-null int64\n", "bend_radius 21198 non-null float64\n", "end_a_1x 21198 non-null object\n", "end_a_2x 21198 non-null object\n", "end_x_1x 21198 non-null object\n", "end_x_2x 21198 non-null object\n", "end_a 21198 non-null object\n", "end_x 21198 non-null object\n", "num_boss 21198 non-null int64\n", "num_bracket 21198 non-null int64\n", "other 21198 non-null int64\n", "dtypes: float64(4), int64(4), object(8)\n", "memory usage: 2.6+ MB\n" ] } ], "source": [ "tube.info()" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tube_assembly_iddiameterwalllengthnum_bendsbend_radius
0TA-0000112.701.65164.0538.10
1TA-000026.350.71137.0819.05
2TA-000036.350.71127.0719.05
3TA-000046.350.71137.0919.05
4TA-0000519.051.24109.0450.80
\n", "
" ], "text/plain": [ " tube_assembly_id diameter wall length num_bends bend_radius\n", "0 TA-00001 12.70 1.65 164.0 5 38.10\n", "1 TA-00002 6.35 0.71 137.0 8 19.05\n", "2 TA-00003 6.35 0.71 127.0 7 19.05\n", "3 TA-00004 6.35 0.71 137.0 9 19.05\n", "4 TA-00005 19.05 1.24 109.0 4 50.80" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tube.drop(['material_id', 'end_a_1x', 'end_a_2x', 'end_x_1x', 'end_x_2x', 'end_a', 'end_x', 'num_boss', 'num_bracket', 'other'],\n", " axis=1, inplace=True)\n", "tube.head()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "diameter\n", " tube_assembly_id diameter wall length num_bends bend_radius\n", "11306 TA-11307 203.2 1.9 62.0 0 0.0\n", "11309 TA-11310 203.2 1.9 48.0 0 0.0\n", "11341 TA-11342 203.2 1.9 40.0 0 0.0\n", "13049 TA-13050 203.2 1.9 70.0 0 0.0\n", "14709 TA-14710 203.2 1.9 64.0 0 0.0\n", "15619 TA-15620 203.2 1.9 26.0 0 0.0\n", "16762 TA-16763 203.2 1.9 129.0 2 203.2\n", "17421 TA-17422 203.2 1.9 22.0 0 0.0\n", "wall\n", " tube_assembly_id diameter wall length num_bends bend_radius\n", "389 TA-00390 66.7 7.9 310.0 2 177.80\n", "390 TA-00391 66.7 7.9 393.0 2 177.80\n", "391 TA-00392 66.7 7.9 174.0 2 177.80\n", "392 TA-00393 66.7 7.9 188.0 2 177.80\n", "431 TA-00432 66.7 7.9 146.0 2 177.80\n", "432 TA-00433 66.7 7.9 146.0 2 177.80\n", "1228 TA-01229 66.7 7.9 430.0 5 177.80\n", "1620 TA-01621 66.7 7.9 160.0 2 177.80\n", "3835 TA-03836 66.7 7.9 114.0 3 180.00\n", "4184 TA-04185 66.7 7.9 466.0 5 177.80\n", "4185 TA-04186 66.7 7.9 149.0 2 177.80\n", "4186 TA-04187 66.7 7.9 130.0 2 177.80\n", "4613 TA-04614 66.7 7.9 139.0 2 177.80\n", "5289 TA-05290 66.7 7.9 225.0 0 0.00\n", "5290 TA-05291 66.7 7.9 161.0 2 180.00\n", "5644 TA-05645 66.7 7.9 134.0 1 180.00\n", "6446 TA-06447 66.7 7.9 211.0 2 180.00\n", "6447 TA-06448 66.7 7.9 211.0 2 180.00\n", "6448 TA-06449 66.7 7.9 351.0 2 180.00\n", "6449 TA-06450 66.7 7.9 351.0 2 180.00\n", "6518 TA-06519 66.7 7.9 141.0 2 180.00\n", "8602 TA-08603 66.7 7.9 37.0 1 180.00\n", "8932 TA-08933 66.7 7.9 221.0 2 180.00\n", "9026 TA-09027 66.7 7.9 68.0 2 171.45\n", "9027 TA-09028 66.7 7.9 68.0 2 171.45\n", "11542 TA-11543 66.7 7.9 158.0 2 180.00\n", "11543 TA-11544 66.7 7.9 158.0 2 180.00\n", "11546 TA-11547 66.7 7.9 78.0 2 180.00\n", "11637 TA-11638 66.7 7.9 78.0 2 180.00\n", "11690 TA-11691 66.7 7.9 112.0 2 180.00\n", "11691 TA-11692 66.7 7.9 110.0 2 180.00\n", "15323 TA-15324 66.7 7.9 195.0 3 180.00\n", "17688 TA-17689 66.7 7.9 88.0 2 9999.00\n", "17689 TA-17690 66.7 7.9 71.0 2 9999.00\n", "18002 TA-18003 66.7 7.9 63.0 2 9999.00\n", "18003 TA-18004 66.7 7.9 81.0 2 9999.00\n", "18093 TA-18094 66.7 7.9 172.0 2 180.00\n", "18094 TA-18095 66.7 7.9 96.0 2 180.00\n", "18095 TA-18096 66.7 7.9 83.0 2 180.00\n", "18096 TA-18097 66.7 7.9 124.0 2 180.00\n", "18097 TA-18098 66.7 7.9 130.0 2 180.00\n", "18098 TA-18099 66.7 7.9 124.0 2 180.00\n", "18099 TA-18100 66.7 7.9 130.0 2 180.00\n", "length\n", " tube_assembly_id diameter wall length num_bends bend_radius\n", "4216 TA-04217 15.88 1.65 1333.0 6 31.75\n", "num_bends\n", " tube_assembly_id diameter wall length num_bends bend_radius\n", "20087 TA-20089 6.35 2.000 93.0 17 19.05\n", "20105 TA-20107 6.35 2.375 83.0 17 19.05\n", "bend_radius\n", " tube_assembly_id diameter wall length num_bends bend_radius\n", "15132 TA-15133 31.75 1.65 49.0 2 9999.0\n", "15174 TA-15175 31.75 1.65 36.0 1 9999.0\n", "15175 TA-15176 31.75 1.65 76.0 1 9999.0\n", "17688 TA-17689 66.70 7.90 88.0 2 9999.0\n", "17689 TA-17690 66.70 7.90 71.0 2 9999.0\n", "18002 TA-18003 66.70 7.90 63.0 2 9999.0\n", "18003 TA-18004 66.70 7.90 81.0 2 9999.0\n", "19320 TA-19321 6.35 0.71 32.0 3 9999.0\n" ] } ], "source": [ "for i in tube.columns:\n", " if tube[i].dtype != object:\n", " if tube[i][tube[i] > 4 * tube[i].mean()].any() == True:\n", " print(i)\n", " print(tube.loc[tube[i] == tube[i].max()])" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "tube.drop(tube.index[[15132, 15174, 15175, 17688, 17689, 18002, 18003, 19320]], inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These files contain only descriptions, so I don't use them:\n", "\n", "tube_end_form\n", "\n", "type_component\n", "\n", "type_connection\n", "\n", "type_end_form\n", "\n", "components" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data preparation" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "#Create several features from dates for additional information.\n", "train['year'] = train.quote_date.dt.year\n", "train['month'] = train.quote_date.dt.month\n", "train['dayofyear'] = train.quote_date.dt.dayofyear\n", "train['dayofweek'] = train.quote_date.dt.dayofweek\n", "train['day'] = train.quote_date.dt.day\n", "\n", "test['year'] = test.quote_date.dt.year\n", "test['month'] = test.quote_date.dt.month\n", "test['dayofyear'] = test.quote_date.dt.dayofyear\n", "test['dayofweek'] = test.quote_date.dt.dayofweek\n", "test['day'] = test.quote_date.dt.day\n", "\n", "train = train.drop('quote_date',axis=1)\n", "test = test.drop('quote_date',axis=1)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#I combine all files with info on components in one file.\n", "all_comp = pd.concat([comp_adaptor, comp_boss, comp_elbow, comp_float, comp_hfl, comp_nut, comp_other,\n", " comp_sleeve, comp_straight, comp_tee, comp_threaded])" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 2033 entries, 0 to 193\n", "Data columns (total 29 columns):\n", "bolt_pattern_long 481 non-null float64\n", "bolt_pattern_wide 362 non-null float64\n", "component_id 2033 non-null object\n", "drop_length 174 non-null float64\n", "elbow_angle 129 non-null float64\n", "extension_length 173 non-null float64\n", "height_over_tube 146 non-null float64\n", "hex_nut_size 42 non-null float64\n", "hex_size 126 non-null float64\n", "hose_diameter 6 non-null float64\n", "intended_nut_pitch 42 non-null float64\n", "intended_nut_thread 42 non-null float64\n", "length 107 non-null float64\n", "nominal_size_1 67 non-null object\n", "nominal_size_2 132 non-null float64\n", "nominal_size_3 8 non-null float64\n", "overall_length 393 non-null float64\n", "thickness 551 non-null float64\n", "thread_pitch 65 non-null float64\n", "thread_pitch_1 149 non-null float64\n", "thread_pitch_2 83 non-null float64\n", "thread_pitch_3 10 non-null float64\n", "thread_pitch_4 1 non-null float64\n", "thread_size 65 non-null object\n", "thread_size_1 149 non-null float64\n", "thread_size_2 83 non-null float64\n", "thread_size_3 10 non-null float64\n", "thread_size_4 1 non-null float64\n", "weight 1964 non-null float64\n", "dtypes: float64(26), object(3)\n", "memory usage: 476.5+ KB\n" ] } ], "source": [ "all_comp.info()" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 2033 entries, 0 to 193\n", "Data columns (total 5 columns):\n", "component_id 2033 non-null object\n", "weight 1964 non-null float64\n", "length 107 non-null float64\n", "overall_length 393 non-null float64\n", "thickness 551 non-null float64\n", "dtypes: float64(4), object(1)\n", "memory usage: 95.3+ KB\n" ] } ], "source": [ "#Some columns have little values, some have strings and integers, so I use only general parameters\n", "all_comp = all_comp[['component_id', 'weight', 'length', 'overall_length', 'thickness']]\n", "all_comp.info()" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "#Combine two length columns.\n", "all_comp['overall_length'] = all_comp['overall_length'].fillna(0)\n", "all_comp['length'] = all_comp['length'].fillna(0)\n", "all_comp['length'] = all_comp['length'] + all_comp['overall_length']\n", "all_comp = all_comp.drop(['overall_length'], axis=1)\n", "\n", "all_comp['weight'] = all_comp['weight'].fillna(0)\n", "all_comp['thickness'] = all_comp['thickness'].fillna(0)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
component_idweightlengththickness
0C-00050.20658.40.0
1C-00060.08334.80.0
2C-14350.02320.30.0
3C-15460.02626.40.0
4C-15830.25644.50.0
\n", "
" ], "text/plain": [ " component_id weight length thickness\n", "0 C-0005 0.206 58.4 0.0\n", "1 C-0006 0.083 34.8 0.0\n", "2 C-1435 0.023 20.3 0.0\n", "3 C-1546 0.026 26.4 0.0\n", "4 C-1583 0.256 44.5 0.0" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#This is how file with components looks like\n", "all_comp.head()" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "#Add information about tube itself and the list of components to main files.\n", "train = pd.merge(train, tube, on='tube_assembly_id', how='left')\n", "train = pd.merge(train, bill_of_materials, on ='tube_assembly_id', how='left')\n", "test = pd.merge(test, tube, on='tube_assembly_id', how='left')\n", "test = pd.merge(test, bill_of_materials, on ='tube_assembly_id', how='left')" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "#Rename columns so that they will be different from length of components.\n", "train.rename(columns={'length': 'length_t'}, inplace = True)\n", "test.rename(columns={'length': 'length_t'}, inplace = True)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "#Merging to get information about components\n", "for i in range(1, 9, 2):\n", " suffix1 = '_' + str(i)\n", " suffix2 = '_' + str(i + 1)\n", " component_1 = 'component_id' + suffix1\n", " component_2 = 'component_id' + suffix2\n", " \n", " train = pd.merge(train, all_comp, left_on = component_1, right_on = 'component_id', how='left')\n", " train = pd.merge(train, all_comp, left_on = component_2, right_on = 'component_id', suffixes=(suffix1, suffix2), how='left')\n", " \n", " test = pd.merge(test, all_comp, left_on = component_1, right_on = 'component_id', how='left')\n", " test = pd.merge(test, all_comp, left_on = component_2, right_on = 'component_id', suffixes=(suffix1, suffix2), how='left')" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tube_assembly_idsupplierannual_usagemin_order_quantitybracket_pricingquantitycostyearmonthdayofyear...thickness_5weight_6length_6thickness_6weight_7length_7thickness_7weight_8length_8thickness_8
0TA-00002S-006600Yes121.90593320137188...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
1TA-00002S-006600Yes212.34121420137188...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2TA-00002S-006600Yes56.60182620137188...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
3TA-00002S-006600Yes104.68777020137188...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
4TA-00002S-006600Yes253.54156120137188...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

5 rows × 49 columns

\n", "
" ], "text/plain": [ " tube_assembly_id supplier annual_usage min_order_quantity bracket_pricing \\\n", "0 TA-00002 S-0066 0 0 Yes \n", "1 TA-00002 S-0066 0 0 Yes \n", "2 TA-00002 S-0066 0 0 Yes \n", "3 TA-00002 S-0066 0 0 Yes \n", "4 TA-00002 S-0066 0 0 Yes \n", "\n", " quantity cost year month dayofyear ... thickness_5 \\\n", "0 1 21.905933 2013 7 188 ... NaN \n", "1 2 12.341214 2013 7 188 ... NaN \n", "2 5 6.601826 2013 7 188 ... NaN \n", "3 10 4.687770 2013 7 188 ... NaN \n", "4 25 3.541561 2013 7 188 ... NaN \n", "\n", " weight_6 length_6 thickness_6 weight_7 length_7 thickness_7 weight_8 \\\n", "0 NaN NaN NaN NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN NaN NaN NaN \n", "4 NaN NaN NaN NaN NaN NaN NaN \n", "\n", " length_8 thickness_8 \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN \n", "\n", "[5 rows x 49 columns]" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Drop unnecessary columns\n", "train.drop(['component_id_1', 'component_id_2', 'component_id_3', 'component_id_4', 'component_id_5', 'component_id_6',\n", " 'component_id_7', 'component_id_8'], axis=1, inplace=True)\n", "test.drop(['component_id_1', 'component_id_2', 'component_id_3', 'component_id_4', 'component_id_5', 'component_id_6',\n", " 'component_id_7', 'component_id_8'], axis=1, inplace=True)\n", "train.head()" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "#Add descriptive information about specs.\n", "train = pd.merge(train, specs, on='tube_assembly_id', how='left')\n", "test = pd.merge(test, specs, on='tube_assembly_id', how='left')" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "#Maybe it is strange, but it turned out that tube id is quite a good feature. It seems to be data leak\n", "train['ta_id'] = train['tube_assembly_id'].apply(lambda x: int(x.split('-')[1]))\n", "test['ta_id'] = test['tube_assembly_id'].apply(lambda x: int(x.split('-')[1]))\n", "train.drop(['tube_assembly_id'], axis=1, inplace=True)\n", "test.drop(['tube_assembly_id'], axis=1, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I created a lot of derivative features adn tried them. The following features turned out to be good" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['thickness_1',\n", " 'thickness_2',\n", " 'thickness_3',\n", " 'thickness_4',\n", " 'thickness_5',\n", " 'thickness_6',\n", " 'thickness_7',\n", " 'thickness_8']" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[col for col in list(train.columns) if 'thickness' in col]" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Calculate various additional features on physical parameters. They turned out to be useful.\n", "length_columns = [col for col in list(train.columns) if 'length' in col]\n", "weight_columns = [col for col in list(train.columns) if 'weight' in col]\n", "thickness_columns = [col for col in list(train.columns) if 'thickness' in col]\n", "train['avg_w'] = train[weight_columns].mean(axis=1)\n", "train['avg_l'] = train[length_columns].mean(axis=1)\n", "train['avg_th'] = train[thickness_columns].mean(axis=1)\n", "train['min_w'] = train[weight_columns].min(axis=1)\n", "train['min_l'] = train[length_columns].min(axis=1)\n", "train['min_th'] = train[thickness_columns].min(axis=1)\n", "train['max_w'] = train[weight_columns].max(axis=1)\n", "train['max_l'] = train[length_columns].max(axis=1)\n", "train['max_th'] = train[thickness_columns].max(axis=1)\n", "test['avg_w'] = test[weight_columns].mean(axis=1)\n", "test['avg_l'] = test[length_columns].mean(axis=1)\n", "test['avg_th'] = test[thickness_columns].mean(axis=1)\n", "test['min_w'] = test[weight_columns].min(axis=1)\n", "test['min_l'] = test[length_columns].min(axis=1)\n", "test['min_th'] = test[thickness_columns].min(axis=1)\n", "test['max_w'] = test[weight_columns].max(axis=1)\n", "test['max_l'] = test[length_columns].max(axis=1)\n", "test['max_th'] = test[thickness_columns].max(axis=1)\n", "train['tot_w'] = train[weight_columns].sum(axis=1)\n", "train['tot_l'] = train[length_columns].sum(axis=1)\n", "test['tot_w'] = test[weight_columns].sum(axis=1)\n", "test['tot_l'] = test[length_columns].sum(axis=1)" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Take log of skewered columns to smooth them and fill NA.\n", "for col in train.columns:\n", " if train[col].dtype != 'object':\n", " if skew(train[col]) > 0.75:\n", " train[col] = np.log1p(train[col])\n", " train[col] = train[col].apply(lambda x: 0 if x == -np.inf else x)\n", "\n", " train[col] = train[col].fillna(0)\n", " \n", "for col in test.columns:\n", " if test[col].dtype != 'object':\n", " if skew(test[col]) > 0.75:\n", " test[col] = np.log1p(test[col])\n", " test[col] = test[col].apply(lambda x: 0 if x == -np.inf else x)\n", "\n", " test[col] = test[col].fillna(0)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "for col in train.columns:\n", " if train[col].dtype == 'object':\n", " train[col].replace(np.nan,' ', regex=True, inplace= True)\n", "for col in test.columns:\n", " if test[col].dtype == 'object':\n", " test[col].replace(np.nan,' ', regex=True, inplace= True)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "X_train = train.drop('cost',axis=1)\n", "Y_train = train['cost']\n", "X_test = test.drop('id', axis=1)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Check that the columns are the same\n", "(X_test.columns == X_train.columns).all()" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Convert to arrays for easier transformation\n", "X_train = np.array(X_train)\n", "X_test = np.array(X_test)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "#Label encode the categorical variables\n", "for i in range(X_train.shape[1]):\n", " if i in [0, 3, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56]:\n", " lbl = preprocessing.LabelEncoder()\n", " lbl.fit(list(X_train[:,i]) + list(X_test[:,i]))\n", " X_train[:,i] = lbl.transform(X_train[:,i])\n", " X_test[:,i] = lbl.transform(X_test[:,i])" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "#XGB need float.\n", "X_train = X_train.astype(float)\n", "X_test = X_test.astype(float)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": true }, "outputs": [], "source": [ "params = {}\n", "params['objective'] = 'reg:linear'\n", "params['eta'] = 0.1\n", "params['min_child_weight'] = 5\n", "params['subsample'] = 1.0\n", "params['scale_pos_weight'] = 1.0\n", "params['silent'] = 1\n", "params['max_depth'] = 7\n", "\n", "param = list(params.items())" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": true }, "outputs": [], "source": [ "xgtrain = xgb.DMatrix(X_train, label=Y_train)\n", "xgtest = xgb.DMatrix(X_test)" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "num_rounds = 1200\n", "model = xgb.train(param, xgtrain, num_rounds)\n", "preds = np.expm1(model.predict(xgtest))" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "preds_df = pd.DataFrame({'id': test['id'], 'cost': preds})\n", "preds_df.to_csv('Caterpillar.csv', index=False)\n", "#0.229153 from ~0.19" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This competition has already ended, but people still can submit their solutions and see their scores. First places have a score ~0.19.\n", "\n", "My model got a score of 0.229153." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 1 }