"
],
"text/plain": [
" Speed Direction Temperature Power\n",
"780 NaN NaN NaN NaN"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\n",
"Dataset 2 ~ Example of line where all values are missing\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Speed
\n",
"
Direction
\n",
"
Temperature
\n",
"
Power
\n",
"
\n",
" \n",
" \n",
"
\n",
"
902
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Speed Direction Temperature Power\n",
"902 NaN NaN NaN NaN"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"f = lambda x : x[x.isna().all(axis=1)].head(1) \n",
"display_two(f,\"Example of line where all values are missing\",data1,data2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-10T11:11:14.434751Z",
"start_time": "2021-10-10T11:11:14.430840Z"
}
},
"source": [
"We remove these kind of lines."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-17T21:24:17.607001Z",
"start_time": "2021-10-17T21:24:17.597709Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset 1: (6816, 4) \tDataset 2: (6934, 4)\n"
]
}
],
"source": [
"def remove_empty_lines(data) :\n",
" data = data.dropna(how='all')\n",
" data.reset_index(inplace=True, drop=True)\n",
" return data\n",
" \n",
"data1 = remove_empty_lines(data1)\n",
"data2 = remove_empty_lines(data2)\n",
"\n",
"print(\"Dataset 1:\", data1.shape, \"\\tDataset 2:\",data2.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Lines with missing values\n",
"\n",
"Now we check the lines where there is at least one missing value."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-17T21:24:17.620036Z",
"start_time": "2021-10-17T21:24:17.608912Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset 1 ~ Proportion of missing values (by line)\n",
"Speed 0.000000\n",
"Direction 0.000147\n",
"Temperature 0.000000\n",
"Power 0.000000\n",
"dtype: float64\n",
"\n",
"\n",
"\n",
"Dataset 2 ~ Proportion of missing values (by line)\n",
"Speed 0.007263\n",
"Direction 0.000144\n",
"Temperature 0.000000\n",
"Power 0.000000\n",
"dtype: float64\n"
]
}
],
"source": [
"f = lambda x : x.isna().sum(axis=0)/x.count()\n",
"print_two(f, \"Proportion of missing values (by line)\", data1, data2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For each column, the rows with at least one missing value correspond to less than 0.8% of the data. We will remove these lines (because it will not really change the distribution of our data)."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-17T21:24:17.637942Z",
"start_time": "2021-10-17T21:24:17.622295Z"
},
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset 1 ~ Shape before removing missing values:\n",
"(6816, 4)\n",
"\n",
"\n",
"\n",
"Dataset 2 ~ Shape before removing missing values:\n",
"(6934, 4)\n",
"\n",
"\n",
"\n",
"\n",
"Dataset 1 ~ Shape after removing missing values:\n",
"(6815, 4)\n",
"\n",
"\n",
"\n",
"Dataset 2 ~ Shape after removing missing values:\n",
"(6883, 4)\n"
]
}
],
"source": [
"def remove_missing(data) :\n",
" data = data.dropna(how='any')\n",
" data.reset_index(inplace=True, drop=True)\n",
" return data\n",
"\n",
"f = lambda x : x.shape\n",
"print_two(f ,\"Shape before removing missing values:\",data1,data2)\n",
"\n",
"data1 = remove_missing(data1)\n",
"data2 = remove_missing(data2)\n",
"print(\"\\n\\n\\n\")\n",
"print_two(f,\"Shape after removing missing values:\",data1,data2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What about negative power"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-10T13:19:13.947603Z",
"start_time": "2021-10-10T13:19:13.944262Z"
}
},
"source": [
"We could see that there were data with negative electrical powers. On peut afficher leur proportion :"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-17T21:24:17.651878Z",
"start_time": "2021-10-17T21:24:17.640106Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset 1 ~ Proportion of negative electrical power\n",
"0.10975788701393983\n",
"\n",
"\n",
"\n",
"Dataset 2 ~ Proportion of negative electrical power\n",
"0.16998401859654222\n"
]
}
],
"source": [
"f = lambda x : ( x[x.Power < 0].count()/x.count() ).Power\n",
"print_two(f,\"Proportion of negative electrical power\", data1, data2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This corresponds to about 10 to 20% of the data. This number is not negligible. We can ask ourselves if these negative values make sense. After research and reflection, we have concluded that they do. Indeed, it is possible that a wind turbine consumes more electricity than it produces (in very low wind conditions). \n",
"\n",
"So we will not apply any particular pre-treatment for the lines where the power is negative. Especially since the power is the label to predict. If it is negative, it must be taken into account for the training of the model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Standardization\n",
"\n",
"Now we standardize our features in order to put each feature on the same scale. We will center-reduce each value with the following formula :\n",
"\n",
"$z_{i_f}=\\dfrac{x_{i_f}-\\mu_f}{\\sigma_f}$\n",
"\n",
"with : \n",
"- $x_{i_f}$ the value of the feature $f$, for the indivual $i$.\n",
"- $\\mu_f$ the mean of the values for the feature $f$.\n",
"- $\\sigma_f$ the standard deviation of the values for the feature $f$.\n",
"- $z_{i_f}$ the standardized value that we want.\n",
" \n",
"For an instance of `StandardScaler`, the method `fit` compute the mean and the standard deviation to use and the method `transform()` will apply the above formula on each value."
]
},
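{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small worked illustration of the formula above, take made-up values (not taken from the datasets): suppose a feature $f$ is $4$, $6$ and $8$ for three individuals. We assume here the population standard deviation (division by $n$), which is the convention used by `StandardScaler`.\n",
"\n",
"$\\mu_f = \\dfrac{4+6+8}{3} = 6 \\qquad \\sigma_f = \\sqrt{\\dfrac{(4-6)^2+(6-6)^2+(8-6)^2}{3}} = \\sqrt{\\dfrac{8}{3}} \\approx 1.633$\n",
"\n",
"$z_{1_f} = \\dfrac{4-6}{1.633} \\approx -1.225 \\qquad z_{2_f} = \\dfrac{6-6}{1.633} = 0 \\qquad z_{3_f} = \\dfrac{8-6}{1.633} \\approx 1.225$\n",
"\n",
"The standardized values have mean $0$ and standard deviation $1$, whatever the original unit of the feature."
]
},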
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-17T21:24:17.662701Z",
"start_time": "2021-10-17T21:24:17.653918Z"
}
},
"outputs": [],
"source": [
"# Separing X and y\n",
"def separe_Xy(data) :\n",
" data_X = data[ X_cols ]\n",
" try :\n",
" data_y = data[ Y_col ]\n",
" except :\n",
" data_y = None\n",
" return data_X, data_y\n",
"\n",
"data1_X, data1_y = separe_Xy(data1)\n",
"data2_X, data2_y = separe_Xy(data2)\n",
"\n",
"# Prevent copy problems\n",
"data1 = None\n",
"data2 = None"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-17T21:24:17.670414Z",
"start_time": "2021-10-17T21:24:17.664472Z"
}
},
"outputs": [],
"source": [
"# Standardize features X\n",
"def standardize(data_X, standard_scaler, fit) :\n",
" if fit : data_X = standard_scaler.fit_transform(data_X) # standard_scaler.fit(...) ; standard_scaler.tranfsorm(...)\n",
" else : data_X = standard_scaler.transform(data_X)\n",
" data_X = pd.DataFrame(data_X)\n",
" data_X.columns = X_cols\n",
" return data_X\n",
"\n",
"def destandardize(data_X, standard_scaler) :\n",
" return data_X * np.sqrt(standard_scaler.var_) + standard_scaler.mean_"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-17T21:24:17.688680Z",
"start_time": "2021-10-17T21:24:17.672349Z"
},
"scrolled": false
},
"outputs": [],
"source": [
"standard_scaler1 = StandardScaler()\n",
"data1_X = standardize(data1_X,standard_scaler1,fit=True)\n",
"\n",
"standard_scaler2 = StandardScaler()\n",
"data2_X = standardize(data2_X,standard_scaler2,fit=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preprocessing summary"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have pre-processed our training data, it is ready to be used to train our model. We now define a function `preprocess` that performs all these preprocessing steps. The reasons is as follows.\n",
"\n",
"After we have train our model on the training dataset, we want to predict labels for a supposed unknown similar dataset. \n",
"\n",
"This supposed unknown dataset is not already preprocessed. We have to: \n",
"\n",
" 1) Rename our columns (only for our convenience)\n",
" 2) Clean our data (removing missing values)\n",
" 3) Standardize our data\n",
"\n",
"\n",
"
\n",
" \n",
"⚠️ Note that standardization use parameters estimated from our training dataset (mean, standard deviation). The unknown similar dataset is supposed to have the same distribution. We don't have to re-estimated these parameters (especially since the training data set is representative). In other terms, we dont't have to apply `fit` method of the `StandardScaler` but only the `transform` method\n",
" \n",
"
\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-17T21:24:17.699359Z",
"start_time": "2021-10-17T21:24:17.690279Z"
}
},
"outputs": [],
"source": [
"def preprocess_no_scale(data, verbose=False) :\n",
" data = rename_cols(data)\n",
" if verbose : \n",
" print(\"Rename\")\n",
" display(data)\n",
" \n",
" data = remove_empty_lines(data)\n",
" if verbose : \n",
" print(\"Remove empty\")\n",
" display(data)\n",
" \n",
" data = remove_missing(data) \n",
" if verbose : \n",
" print(\"Remove missing\")\n",
" display(data)\n",
" \n",
" data_X, data_y = separe_Xy(data)\n",
" if verbose : \n",
" print(\"Separe X Y\")\n",
" display(data_X)\n",
" display(data_y)\n",
" \n",
" return data_X, data_y\n",
" \n",
"def preprocess_scale(data_X, data_y, standard_scaler, fit, verbose=False) :\n",
" \n",
" data_X = standardize(data_X,standard_scaler,fit)\n",
" if verbose : \n",
" print(\"Standardize X\")\n",
" display(data_X)\n",
" display(data_y)\n",
" \n",
" return data_X, data_y\n",
"\n",
"def preprocess(data, standard_scaler, fit, verbose=False) :\n",
" data_X, data_y = preprocess_no_scale(data, verbose)\n",
" return preprocess_scale(data_X, data_y, standard_scaler, fit, verbose)\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# A first model\n",
"\n",
"We will create our models: Random Forest regressors (one model for each turbine). To do this we will instantiate our models, then we will train it with our training data via the `fit` method\n",
"\n",
"## Models creation and training"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-17T21:24:21.349847Z",
"start_time": "2021-10-17T21:24:17.702755Z"
}
},
"outputs": [],
"source": [
"def create_and_fit_model(data_X, data_y, model_name, **hyperparameters) :\n",
" regressor = model_name(**hyperparameters)\n",
" regressor.fit(data_X, data_y.values.ravel())\n",
" return regressor\n",
"\n",
"regressor1 = create_and_fit_model(data1_X, data1_y , RandomForestRegressor)\n",
"regressor2 = create_and_fit_model(data2_X, data2_y , RandomForestRegressor)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Testing our models\n",
"\n",
"Now it's time to test our models. For this we will use the test data that we have ignored so far. We do not know this data. The only thing we assume is that the test data set for turbine 1 (respectively, for turbine 2) follows the same probability law as the training data for turbine 1 (respectively, for turbine 2). These are the laws that we have tried to estimate by our models. This is what will allow us to make predictions on the unknown test data.\n",
"\n",
"### Pre-processing of test data\n",
"\n",
"Our test data has not been pre-processed. However, this data must be in the same format as the training data that fed our model. In other words, our test data must not contain any missing values, and the feature values must be standardized using the training parameters (the means and the standard deviations of the training data, assumed to be representative of any data from the wind turbine in question)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-17T21:24:21.379811Z",
"start_time": "2021-10-17T21:24:21.352161Z"
}
},
"outputs": [],
"source": [
"# Preprocess our test data (no fit !)\n",
"data1_X_test, data1_y_test = preprocess(data1_test, standard_scaler1, fit=False)\n",
"data2_X_test, data2_y_test = preprocess(data2_test, standard_scaler2, fit=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Making predictions\n",
"\n",
"We can now make our predictions. "
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-17T21:24:21.541655Z",
"start_time": "2021-10-17T21:24:21.388446Z"
}
},
"outputs": [],
"source": [
"# Make predictions\n",
"data1_y_test_pred = regressor1.predict(data1_X_test)\n",
"data2_y_test_pred = regressor2.predict(data2_X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Performance evaluation.\n",
"\n",
"\n",
"### Defining the metrics\n",
"\n",
"In order to evaluate the success of our predictions, we need to define performance metrics. Since we are dealing with a regression problem, our metrics will measure a distance between the prediction and the true value of the label. For $n$ individuals, with $y_i$ a true label of the individual $i$, $\\hat{y}_i$ the predicted label for $i$ and $\\bar{y}$ the mean of values, we will choose the following metrics: \n",
"\n",
"- R2 Score : score between -1 and 1 (the best possible score is 1)\n",
"\n",
"$R_2 = 1 - \\dfrac{\\sum^{n}_{i=1}{(y_i - \\hat{y}_i)^2}}{\\sum^{n}_{i=1}{(y_i - \\bar{y})^2}}$ \n",
"- Mean Absolute Error : error in the same unit as the label (the best possible score is 0)\n",
"\n",
"$MAE = \\dfrac{\\sum^{n}_{i=1}{|y_i - \\hat{y}_i|}}{n} $\n",
"- Root Mean Squared Error : error in the same unit as the label (the best possible score is 0) \n",
"\n",
"$RMSE = \\displaystyle\\sqrt{\\dfrac{\\sum^{n}_{i=1}{(y_i - \\hat{y}_i)^2}}{n}}$ \n",
"\n"
]
},
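{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make these definitions concrete, here is a worked example with made-up values (not taken from the datasets). Assume $n=3$ true labels $y = (100, 200, 300)$ kW and predictions $\\hat{y} = (110, 190, 330)$ kW, so that $\\bar{y} = 200$ and the errors $y_i - \\hat{y}_i$ are $-10$, $10$ and $-30$.\n",
"\n",
"$R_2 = 1 - \\dfrac{(-10)^2 + 10^2 + (-30)^2}{(100-200)^2 + (200-200)^2 + (300-200)^2} = 1 - \\dfrac{1100}{20000} = 0.945$\n",
"\n",
"$MAE = \\dfrac{10 + 10 + 30}{3} \\approx 16.67 \\text{ kW}$\n",
"\n",
"$RMSE = \\sqrt{\\dfrac{1100}{3}} \\approx 19.15 \\text{ kW}$\n",
"\n",
"Unlike the MAE and the RMSE, the R2 score is not symmetric in $y$ and $\\hat{y}$. Exchanging their roles replaces the denominator by the spread of the predictions around their own mean $210$, that is $(110-210)^2 + (190-210)^2 + (330-210)^2 = 24800$, which gives $1 - \\dfrac{1100}{24800} \\approx 0.956$ instead of $0.945$. This is why the true labels must be passed first to `r2_score`."
]
},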
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Compute scores\n",
"\n",
"We will now calculate the scores of our predictions for each of the models."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-17T21:24:21.573742Z",
"start_time": "2021-10-17T21:24:21.545053Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset 1 ~ Scores\n",
"R2 Score [-1,1] : \t0.958806564723879\n",
"Mean Absolute Error (kW) : \t40.729163940482145\n",
"Root Mean Squared Error (kW) : \t98.46943075204126\n",
"\n",
"\n",
"\n",
"\n",
"Dataset 2 ~ Scores\n",
"R2 Score [-1,1] : \t0.9861313673199962\n",
"Mean Absolute Error (kW) : \t26.844404390773516\n",
"Root Mean Squared Error (kW) : \t52.913120698991825\n",
"\n"
]
}
],
"source": [
"scores_keys = ['R2 Score [-1,1] ','Mean Absolute Error (kW) ', 'Root Mean Squared Error (kW) ']\n",
"\n",
"def compute_scores(y_pred,y_true) :\n",
" return {\n",
" scores_keys[0] : r2_score(y_pred,y_true),\n",
" scores_keys[1] : mean_absolute_error(y_pred,y_true),\n",
" scores_keys[2] : mean_squared_error(y_pred,y_true,squared=False)\n",
" }\n",
"\n",
"def scores_tostr(y_pred,y_true) :\n",
" s = ''\n",
" scores = compute_scores(y_pred,y_true)\n",
" for idx,score in scores.items() :\n",
" s = s+str(idx)+': \\t'+str(score)+'\\n'\n",
" return s\n",
"\n",
"sco = lambda x : scores_tostr(x[0],x[1])\n",
"print_two(sco,'Scores', (data1_y_test,data1_y_test_pred) , (data2_y_test,data2_y_test_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plot predictions\n",
"\n",
"Finally, we will plot the actual labels of the test data and the labels predicted by our model as a function of the \"speed\" feature. We choose speed because it is the feature most correlated to the label value."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-17T21:24:22.263111Z",
"start_time": "2021-10-17T21:24:21.578000Z"
}
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"def plot_predictions(scores1, data1_X_test,data1_y_test,data1_y_test_pred,standard_scaler1, model_name1,\n",
" scores2, data2_X_test,data2_y_test,data2_y_test_pred,standard_scaler2, model_name2, \n",
" feature=\"Speed\") :\n",
" \n",
" fig, (ax1,ax2) = plt.subplots(1,2, figsize=(14,8))\n",
" \n",
" if (standard_scaler1 != None) :\n",
" df1 = destandardize(data1_X_test,standard_scaler1)\n",
" else :\n",
" df1 = data1_X_test\n",
" if (standard_scaler2 != None) :\n",
" df2 = destandardize(data2_X_test,standard_scaler2)\n",
" else : \n",
" df2 = data2_X_test\n",
" s = 5\n",
"\n",
" #for col in df1.columns :\n",
"\n",
" ax1.set_title(f\"Dataset 1\")\n",
" ax1.scatter(df1[feature],data1_y_test, c=\"blue\", alpha=0.7, label=\"Real values\", s=s)\n",
" ax1.scatter(df1[feature],data1_y_test_pred, c=\"green\", alpha=0.7, label=\"Predicted values\", s=s)\n",
" ax1.set_ylabel(\"Power\")\n",
" ax1.set_xlabel(\"Speed\")\n",
"\n",
" ax1.text(13,200,f\"R2 Score : { round(scores1[scores_keys[0]],3) }\")\n",
" ax1.text(13,100,f\"MAE : { round(scores1[scores_keys[1]],3) } kW\")\n",
" ax1.text(13,0,f\"RMSE : { round(scores1[scores_keys[2]],3) } kW\")\n",
"\n",
" ax2.set_title(f\"Dataset 2\")\n",
" ax2.scatter(df2[feature],data2_y_test, c=\"red\",alpha=0.7, s=s, label=\"Real values\")\n",
" ax2.scatter(df2[feature],data2_y_test_pred, c=\"green\", alpha=0.7, s=s)\n",
" ax2.set_xlabel(\"Speed\")\n",
" ax2.set_ylabel(\"Power\")\n",
"\n",
" ax2.text(13,200,f\"R2 Score : { round(scores2[scores_keys[0]],3) }\")\n",
" ax2.text(13,100,f\"MAE : { round(scores2[scores_keys[1]],3) } kW\")\n",
" ax2.text(13,0,f\"RMSE : { round(scores2[scores_keys[2]],3) } kW\")\n",
"\n",
" fig.suptitle(f\"{model_name} predictions\")\n",
" fig.legend()\n",
"\n",
" fig.show()\n",
"\n",
"\n",
"model_name = 'Random Forest'\n",
"plot_predictions(compute_scores(data1_y_test_pred,data1_y_test), data1_X_test,data1_y_test,data1_y_test_pred,standard_scaler1, model_name,\n",
" compute_scores(data2_y_test_pred,data2_y_test), data2_X_test,data2_y_test,data2_y_test_pred,standard_scaler2, model_name ) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The K-Folds technique\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Measured performance may be biased. Indeed, the separation of the training data and the test data was done in a random way. Maybe the test data and the training data were by chance well chosen in order to obtain such a score. To correct this, we will use the K-folds technique. \n",
"\n",
"First, the data set is randomly separated into k folds. The model will be tested on 1 folds, while the remaining k-1 folds will serve as training data. We reiterate by changing each time the fold that will be used as test data and by noting each time the scores obtained.\n",
"\n",
"
\n",
"\n",
"⚠️ **Important remark:** we do not preprocess the data before separating them into $k$ folds. Indeed, if we preprocess our data before the separation, the standardization would estimate parameters $\\mu$ and $\\sigma$ from the data set. Once this is done, we would separate the data into $k$ folds. One of the folds would serve as a test fold. However, the data of the test fold (supposedly unknown) were taken into account in the calculation of $\\mu$ and $\\sigma$. There would then be a data leak!\n",
" \n",
"Thus, in our `KFolds_validation` function we use non-preprocessed data that we separate into $k$ folds. Within each iteration, we preprocess the training folds: they are standardized. There is no information leakage because the standardization is done on the training folds. After that, we preprocess the test fold by standardizing from the parameters of the training folds (in order to have the test data in the same format), then we make our predictions.\n",
"\n",
"