{"cells":[{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"48e4f2cb-f8c1-4511-a4ff-5cb614a5a435","showTitle":false,"title":""}},"source":["MRMR Feature Selection by Maykon Schots & Matheus Rugollo"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"95086c54-0214-4ffd-b75c-0b181f9f494c","showTitle":false,"title":""}},"source":["

Numerical Feature Selection by MRMR

\n","
\n","\n","Experimenting fast is key to success in Data Science. When experimenting we're going to bump with huge datasets that require special attention when feature selecting and engineering. In a profit driven context, it's important to quickly test the potential value of an idea rather than exploring the best way to use your data or parametrize a machine learning model. It not only takes time that we usually can't afford but also increase financial costs. \n","\n","Herewit we describe an efficient solution to reduce dimensionality of your dataset, by identifying and creating clusters of redundant features and selecting the most relevant one. This has potential to speed up your experimentation process and reduce costs.

\n","\n","
\n","
Case
\n"," \n","You might be wondering how this applies to a real use case and why we had to come up with such technique. Hear this story:\n","Consider a project in a financial company that we try to understand how likely a client is to buy a product through Machine Learning. Other then profile features, we usually end up with many financial transactions history features of the clients. With that in mind we can assume that probably many of them are highly correlated, e.g in order to buy something of x value, the client probably received a value > x in the past, and since we're going to extract aggregation features from such events we're going to end up with a lot of correlation between them. \n","\n","\n","The solution was to come up with an efficient \"automatic\" way to wipe redundant features from the training set, that can vary from time to time, maintaining our model performance. With this we can always consider at the start of our pipeline all of our \"raw\" features and select the most relevant of them that are not highly correlated in given moment.\n","\n","Based on a published [article](https://arxiv.org/abs/1908.05376) we developed an implementation using [feature_engine](https://github.com/feature-engine/feature_engine) and [sklearn](https://scikit-learn.org/stable/). Follow the step-by-step to understand our approach."]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"0d9b8756-389d-41f1-a0b3-2cb157ca53ea","showTitle":false,"title":""}},"source":["

Classification Example

\n","
\n","\n","In order to demonstrate, use the [make_classification](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html) helper function from sklearn to create a set of features making sure that some of them are redundant. Convert both X and y returned by it to be pandas DataFrames for further compatibility with sklearn api."]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"7a1ffdaf-13bd-4efe-92df-11c20cd5eabe","showTitle":false,"title":""}},"outputs":[{"data":{"text/html":["\n","
"]},"metadata":{"application/vnd.databricks.v1+output":{"addedWidgets":{},"arguments":{},"data":"
","datasetInfos":[],"metadata":{},"removedWidgets":[],"type":"html"}},"output_type":"display_data"}],"source":["import warnings\n","\n","import pandas as pd\n","from sklearn.datasets import make_classification\n","\n","warnings.filterwarnings('ignore')\n","\n","X, y = make_classification(\n"," n_samples=5000,\n"," n_features=30,\n"," n_redundant=15,\n"," n_clusters_per_class=1,\n"," weights=[0.50],\n"," class_sep=2,\n"," random_state=42\n",")\n","\n","cols = []\n","for i in range(len(X[0])):\n"," cols.append(f\"feat_{i}\")\n","X = pd.DataFrame(X, columns=cols)\n","y = pd.DataFrame({\"y\": y})\n"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"a1110f89-1a4a-4df0-bf2c-7095d3222295","showTitle":false,"title":""}},"outputs":[{"data":{"text/html":["\n","
Out[32]:
"]},"metadata":{"application/vnd.databricks.v1+output":{"addedWidgets":{},"arguments":{},"data":"
Out[32]:
","datasetInfos":[],"metadata":{},"removedWidgets":[],"type":"html"}},"output_type":"display_data"},{"data":{"text/html":["
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
feat_0feat_1feat_2feat_3feat_4feat_5feat_6feat_7feat_8feat_9feat_10feat_11feat_12feat_13feat_14feat_15feat_16feat_17feat_18feat_19feat_20feat_21feat_22feat_23feat_24feat_25feat_26feat_27feat_28feat_29
0-0.8554840.1472680.9317790.514342-1.1605671.548771-0.814841-0.551259-0.559991-0.588341-0.050275-0.418050-1.8836180.627457-1.4308582.4149680.965552-0.4272751.5751641.3860961.6354520.171738-0.9882982.1292111.5026581.451028-0.1596741.999985-0.0886172.173897
1-0.233143-0.2537650.7401470.8168850.1699521.594853-0.432744-0.118465-0.2456800.318794-0.307883-0.3911241.137396-1.150796-1.6424372.1084501.168464-0.8406391.7880390.134206-0.498645-0.199357-1.2603402.2027381.8593761.392850-0.5215721.7332630.3998962.562269
2-1.004689-2.525201-0.0663701.708870-1.836983-0.2008952.6175200.4944412.118569-1.165920-1.5766500.674405-0.6874770.020001-0.859114-2.6521880.9514800.6466140.8218661.885669-0.361392-2.347812-1.371678-0.2132881.733786-0.8147340.175997-2.2760263.047572-1.306679
3-0.039887-2.002593-0.1370591.352156-0.283016-0.1669582.0790970.3168521.6822860.208227-1.249645-0.460059-0.315972-0.031462-0.673953-2.1145340.7494900.4409440.6437080.300287-1.195257-1.862089-1.082490-0.1796671.366995-0.6530950.799519-1.8142682.4164130.303902
41.6508100.5943250.2684300.154022-0.6260831.439684-1.215922-1.373762-0.8939850.0435170.2422370.462302-1.067515-0.677758-1.1393882.6717880.7010121.0023421.2769110.1941601.5288780.584114-0.6449201.9677601.0447791.4631810.4614692.227199-0.6365400.948157
\n","
"]},"metadata":{"application/vnd.databricks.v1+output":{"addedWidgets":{},"arguments":{},"data":"
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
feat_0feat_1feat_2feat_3feat_4feat_5feat_6feat_7feat_8feat_9feat_10feat_11feat_12feat_13feat_14feat_15feat_16feat_17feat_18feat_19feat_20feat_21feat_22feat_23feat_24feat_25feat_26feat_27feat_28feat_29
0-0.8554840.1472680.9317790.514342-1.1605671.548771-0.814841-0.551259-0.559991-0.588341-0.050275-0.418050-1.8836180.627457-1.4308582.4149680.965552-0.4272751.5751641.3860961.6354520.171738-0.9882982.1292111.5026581.451028-0.1596741.999985-0.0886172.173897
1-0.233143-0.2537650.7401470.8168850.1699521.594853-0.432744-0.118465-0.2456800.318794-0.307883-0.3911241.137396-1.150796-1.6424372.1084501.168464-0.8406391.7880390.134206-0.498645-0.199357-1.2603402.2027381.8593761.392850-0.5215721.7332630.3998962.562269
2-1.004689-2.525201-0.0663701.708870-1.836983-0.2008952.6175200.4944412.118569-1.165920-1.5766500.674405-0.6874770.020001-0.859114-2.6521880.9514800.6466140.8218661.885669-0.361392-2.347812-1.371678-0.2132881.733786-0.8147340.175997-2.2760263.047572-1.306679
3-0.039887-2.002593-0.1370591.352156-0.283016-0.1669582.0790970.3168521.6822860.208227-1.249645-0.460059-0.315972-0.031462-0.673953-2.1145340.7494900.4409440.6437080.300287-1.195257-1.862089-1.082490-0.1796671.366995-0.6530950.799519-1.8142682.4164130.303902
41.6508100.5943250.2684300.154022-0.6260831.439684-1.215922-1.373762-0.8939850.0435170.2422370.462302-1.067515-0.677758-1.1393882.6717880.7010121.0023421.2769110.1941601.5288780.584114-0.6449201.9677601.0447791.4631810.4614692.227199-0.6365400.948157
\n
","datasetInfos":[],"metadata":{},"removedWidgets":[],"textData":null,"type":"htmlSandbox"}},"output_type":"display_data"}],"source":["X.head()"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"e9baf0bd-c8f0-4e05-8d66-b8f7952b83db","showTitle":false,"title":""}},"source":["

Get Redundant Clusters

\n","
\n","\n","Now that we have our master table example set up, we can start by taking advantage of [SmartCorrelatedSelection](https://feature-engine.readthedocs.io/en/1.0.x/selection/SmartCorrelatedSelection.html) implementation by feature_egine. Let's check it's parameters:\n","\n","
Correlation Threshold
\n","This can be a hot topic of discussion for each case, in order to keep as much useful data as possible the correlation threshold set was very conservative .97. \n","p.s: This demonstration will only have one value set, but a good way of improving this pipeline would be to attempt multiple iterations lowering the threshold, then you could measure performance of given model with different sets of selected features.\n","\n","
Method
\n","The best option here was spearman, identifying both linear and non-linear numerical features correlated clusters to make it less redundant as possible through rank correlation threshold.\n","\n","
Selection Method
\n","This is not relevant for this implementation, because we're not going to use features selected by the SmartCorrelatedSelection. Use variance , it's faster.\n","\n","\n","
\n","
Quick Comment
\n","You might be wondering why we don't just use feature_engine methods, and we definitely considered and tried it, finally it inspired us to come up with some tweaks for our process. It's a very similar idea, but instead of variance we use mutual information to select one feature out of each cluster, it's also the ground work for optimal parametrization and further development of the pipeline for ad hoc usage."]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"1b032cbb-2de5-44c7-9922-62485e02ad49","showTitle":false,"title":""}},"outputs":[{"data":{"text/html":["\n","
"]},"metadata":{"application/vnd.databricks.v1+output":{"addedWidgets":{},"arguments":{},"data":"
","datasetInfos":[],"metadata":{},"removedWidgets":[],"type":"html"}},"output_type":"display_data"}],"source":["from feature_engine.selection import SmartCorrelatedSelection\n","\n","\n","MODEL_TYPE = \"classifier\" ## Or \"regressor\"\n","CORRELATION_THRESHOLD = .97\n","\n","# Setup Smart Selector /// Tks feature_engine\n","feature_selector = SmartCorrelatedSelection(\n"," variables=None,\n"," method=\"spearman\",\n"," threshold=CORRELATION_THRESHOLD,\n"," missing_values=\"ignore\",\n"," selection_method=\"variance\",\n"," estimator=None,\n",")\n","\n","\n","feature_selector.fit_transform(X)\n","\n","### Setup a list of correlated clusters as lists and a list of uncorrelated features\n","correlated_sets = feature_selector.correlated_feature_sets_\n","\n","correlated_clusters = [list(feature) for feature in correlated_sets]\n","\n","correlated_features = [feature for features in correlated_clusters for feature in features]\n","\n","uncorrelated_features = [feature for feature in X if feature not in correlated_features]\n"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"b9cfa5a9-5fe7-4308-bd86-32d92727ebef","showTitle":false,"title":""}},"source":["

Wiping Redundancy considering Relevance

\n","\n","Now we're going to extract the best feature from each correlated cluster using [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html) from sklearn.feature_selection. Here we use [mutual_info_classif](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif) implementation as our score_func for this classifier example, there are other options like mutual_info_regression be sure to select it according to your use case.\n","\n","The relevance of each selected feature is considered when we use mutual info of the samples against the target Y, this will be important so we do not lose any predictive power of our features.\n","\n","
\n","\n","We end up with a set of selected features that considering our correlation threshold of .97, probably will have similar performance. In a context where you want to prioritize reduction of dimensionality, you can check how the selection will perform to make a good decision about it.\n","\n","I don't want to believe, I want to know."]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"f982eac0-4c5d-47ce-afe9-63bb595d3b58","showTitle":false,"title":""}},"outputs":[{"data":{"text/html":["\n","
"]},"metadata":{"application/vnd.databricks.v1+output":{"addedWidgets":{},"arguments":{},"data":"
","datasetInfos":[],"metadata":{},"removedWidgets":[],"type":"html"}},"output_type":"display_data"}],"source":["from sklearn.feature_selection import (\n"," SelectKBest,\n"," mutual_info_classif,\n"," mutual_info_regression,\n",")\n","\n","\n","mutual_info = {\n"," \"classifier\": mutual_info_classif,\n"," \"regressor\": mutual_info_regression,\n","}\n","\n","top_features_cluster = []\n","for cluster in correlated_clusters:\n"," selector = SelectKBest(score_func=mutual_info[MODEL_TYPE], k=1) # selects the top feature (k=1) regarding target mutual information\n"," selector = selector.fit(X[cluster], y)\n"," top_features_cluster.append(\n"," list(selector.get_feature_names_out())[0]\n"," )\n","\n","selected_features = top_features_cluster + uncorrelated_features"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"880ecb32-d9f0-41ae-8eeb-e9988d8ae37e","showTitle":false,"title":""}},"source":["

Evaluating the set of features

\n","\n","Now that we have our set it's time to decide if we're going with it or not. In this demonstration, the idea was to use a GridSearch to find the best hyperparameters for a RandomForestClassifier providing us with the best possible estimator. \n","\n","If we attempt to fit many grid searches in a robust way, it would take too long and be very costy. Since we're just experimenting, initally we can use basic cross_validate with the chosen estimator, and we can quickly discard \"gone wrong\" selections, specially when we lower down our correlation threshold for the clusters.\n","\n","It's an efficient way to approach experimenation with this method, although I highly recommend going for a more robust evaluation with grid searches or other approaches, and a deep discussion on the impact of the performance threshold for your use cause, sometimes 1% can be a lot of $."]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"277ab00e-0de7-461c-9e63-8424b4d305b0","showTitle":false,"title":""}},"outputs":[{"data":{"text/html":["\n","
model improved with feature selection, dif = 0.0\n","
"]},"metadata":{"application/vnd.databricks.v1+output":{"addedWidgets":{},"arguments":{},"data":"
model improved with feature selection, dif = 0.0\n
","datasetInfos":[],"metadata":{},"removedWidgets":[],"type":"html"}},"output_type":"display_data"}],"source":["import os\n","import multiprocessing\n","\n","from sklearn.ensemble import RandomForestClassifier\n","from sklearn.model_selection import StratifiedKFold, cross_validate\n","\n","\n","cv = StratifiedKFold(shuffle=True, random_state=42)\n","\n","baseline_raw = cross_validate(\n"," RandomForestClassifier(\n"," max_samples=1.0,\n"," n_jobs=int(os.getenv(\"N_CORES\", 0.50 * multiprocessing.cpu_count())), # simplifica isso aqui pro artigo, bota -1.\n"," random_state=42\n"," ),\n"," X,\n"," y,\n"," cv=cv,\n"," scoring=\"f1\", # or any other metric that you want.\n"," groups=None\n",")\n","\n","baseline_selected_features = cross_validate(\n"," RandomForestClassifier(),\n"," X[selected_features],\n"," y,\n"," cv=cv,\n"," scoring=\"f1\",\n"," groups=None,\n"," error_score=\"raise\",\n"," )\n","\n","score_raw = baseline_raw[\"test_score\"].mean()\n","score_baseline = baseline_selected_features[\"test_score\"].mean()\n","\n","# Define a threshold to decide whether to reduce or not the dimensionality for your test case\n","dif = round(((score_raw - score_baseline) / score_raw), 3)\n","\n","# 5% is our limit (ponder how it will impact your product $)\n","performance_threshold = -0.050\n","\n","if dif >= performance_threshold:\n"," print(f\"It's worth to go with the selected set =D\")\n","elif dif < performance_threshold:\n"," print(f\"The performance reduction is not acceptable!!!! >.<\")\n"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"39f04f57-88d0-4bd8-9ce8-66d1a3dd0152","showTitle":false,"title":""}},"source":["

Make it better!

\n","\n","

Going further on implementing a robust feature selection with MRMR, we can use the process explained above to iterate over a range of thresholds and choose what's best for our needs, instead of relying on a single performance score!

"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"f07796e9-8eb2-4462-86df-9bd309b8360f","showTitle":false,"title":""}},"outputs":[{"data":{"text/html":["\n","
"]},"metadata":{"application/vnd.databricks.v1+output":{"addedWidgets":{},"arguments":{},"data":"
","datasetInfos":[],"metadata":{},"removedWidgets":[],"type":"html"}},"output_type":"display_data"}],"source":["# Repeat df from example.\n","\n","import warnings\n","\n","import pandas as pd\n","from sklearn.datasets import make_classification\n","\n","warnings.filterwarnings('ignore')\n","\n","X, y = make_classification(\n"," n_samples=5000,\n"," n_features=30,\n"," n_redundant=15,\n"," n_clusters_per_class=1,\n"," weights=[0.50],\n"," class_sep=2,\n"," random_state=42\n",")\n","\n","cols = []\n","for i in range(len(X[0])):\n"," cols.append(f\"feat_{i}\")\n","X = pd.DataFrame(X, columns=cols)\n","y = pd.DataFrame({\"y\": y})\n"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"abb44eed-4177-4525-a609-8b03d2b3c687","showTitle":false,"title":""}},"outputs":[{"data":{"text/html":["\n","
"]},"metadata":{"application/vnd.databricks.v1+output":{"addedWidgets":{},"arguments":{},"data":"
","datasetInfos":[],"metadata":{},"removedWidgets":[],"type":"html"}},"output_type":"display_data"}],"source":["# Functions to iterate over accepted threshold\n","from sklearn.feature_selection import (\n"," SelectKBest,\n"," mutual_info_classif,\n"," mutual_info_regression,\n",")\n","import os\n","import multiprocessing\n","\n","from sklearn.ensemble import RandomForestClassifier\n","from sklearn.model_selection import StratifiedKFold, cross_validate\n","\n","import pandas as pd\n","from feature_engine.selection import SmartCorrelatedSelection\n","\n","\n","def select_features_clf(X: pd.DataFrame, y: pd.DataFrame, corr_threshold: float) -> list:\n"," \"\"\" Function will select a set of features with minimum redundance and maximum relevante based on the set correlation threshold \"\"\"\n"," # Setup Smart Selector /// Tks feature_engine\n"," feature_selector = SmartCorrelatedSelection(\n"," variables=None,\n"," method=\"spearman\",\n"," threshold=corr_threshold,\n"," missing_values=\"ignore\",\n"," selection_method=\"variance\",\n"," estimator=None,\n"," )\n"," feature_selector.fit_transform(X)\n"," ### Setup a list of correlated clusters as lists and a list of uncorrelated features\n"," correlated_sets = feature_selector.correlated_feature_sets_\n"," correlated_clusters = [list(feature) for feature in correlated_sets]\n"," correlated_features = [feature for features in correlated_clusters for feature in features]\n"," uncorrelated_features = [feature for feature in X if feature not in correlated_features]\n"," top_features_cluster = []\n"," for cluster in correlated_clusters:\n"," selector = SelectKBest(score_func=mutual_info_classif, k=1) # selects the top feature (k=1) regarding target mutual information\n"," selector = selector.fit(X[cluster], y)\n"," top_features_cluster.append(\n"," list(selector.get_feature_names_out())[0]\n"," )\n"," return top_features_cluster + uncorrelated_features\n","\n","def get_clf_model_scores(X: pd.DataFrame, y: pd.DataFrame, scoring: str, selected_features:list):\n"," \"\"\" \"\"\"\n"," cv = StratifiedKFold(shuffle=True, random_state=42) \n"," model_result = cross_validate(\n"," RandomForestClassifier(),\n"," X[selected_features],\n"," y,\n"," cv=cv,\n"," scoring=scoring,\n"," groups=None,\n"," error_score=\"raise\",\n"," )\n"," return model_result[\"test_score\"].mean(), model_result[\"fit_time\"].mean(), model_result[\"score_time\"].mean()\n","\n","def evaluate_clf_feature_selection_range(X: pd.DataFrame, y: pd.DataFrame, scoring:str, corr_range: int, corr_starting_point: float = .98) -> pd.DataFrame:\n"," \"\"\" Evaluates feature selection for every .01 on corr threshold \"\"\"\n"," evaluation_data = {\n"," \"corr_threshold\": [],\n"," scoring: [],\n"," \"n_features\": [],\n"," \"fit_time\": [],\n"," \"score_time\": []\n"," }\n"," for i in range(corr_range):\n"," current_corr_threshold = corr_starting_point - (i / 100) ## Reduces .01 on corr_threshold for every iteration\n"," selected_features = select_features_clf(X, y, corr_threshold=current_corr_threshold)\n"," score, fit_time, score_time = get_clf_model_scores(X, y, scoring, selected_features)\n"," evaluation_data[\"corr_threshold\"].append(current_corr_threshold)\n"," evaluation_data[scoring].append(score)\n"," evaluation_data[\"n_features\"].append(len(selected_features))\n"," evaluation_data[\"fit_time\"].append(fit_time)\n"," evaluation_data[\"score_time\"].append(score_time)\n"," \n"," return pd.DataFrame(evaluation_data)\n"," 
\n"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"4bcd731c-c905-4480-bd63-88ba2871e7a1","showTitle":false,"title":""}},"outputs":[{"data":{"text/html":["\n","
"]},"metadata":{"application/vnd.databricks.v1+output":{"addedWidgets":{},"arguments":{},"data":"
","datasetInfos":[],"metadata":{},"removedWidgets":[],"type":"html"}},"output_type":"display_data"}],"source":["evaluation_df = evaluate_clf_feature_selection_range(X, y, \"f1\", 15)"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"10a32ffa-c69d-49f3-abe5-b16ba4994022","showTitle":false,"title":""}},"outputs":[],"source":["%pip install hiplot"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"bf6d3834-7549-4e7a-80b0-69e59ee6da5a","showTitle":false,"title":""}},"outputs":[{"data":{"text/html":["\n","\n","\n","\n","\n","HiPlot\n","\n","\n","\n","
Loading HiPlot...
\n","\n","
\n","\n","\n","\n",""]},"metadata":{"application/vnd.databricks.v1+output":{"addedWidgets":{},"arguments":{},"data":"\n\n\n\n\nHiPlot\n\n\n\n
Loading HiPlot...
\n\n
\n\n\n\n","datasetInfos":[],"metadata":{},"removedWidgets":[],"textData":null,"type":"htmlSandbox"}},"output_type":"display_data"}],"source":["import hiplot\n","\n","html = hiplot.Experiment.from_dataframe(evaluation_df).to_html()\n","displayHTML(html)"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"0a75695b-7363-4fa3-aa99-7d347eae1a56","showTitle":false,"title":""}},"outputs":[],"source":[]}],"metadata":{"application/vnd.databricks.v1+notebook":{"dashboards":[],"language":"python","notebookMetadata":{"pythonIndentUnit":4},"notebookName":"mrmr","notebookOrigID":4007811193739156,"widgets":{}},"kernelspec":{"display_name":"Python 3.9.13 64-bit (windows store)","language":"python","name":"python3"},"language_info":{"name":"python","version":"3.9.13"},"vscode":{"interpreter":{"hash":"d54fcd2c4d35bd1c7b8b00e1f691e4b4b7db4785f9c538ccc14ac1cc0dc01d8a"}}},"nbformat":4,"nbformat_minor":0}