{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Column Transformer with Heterogeneous Data Sources\n\nDatasets can often contain components that require different feature\nextraction and processing pipelines. This scenario might occur when:\n\n1. your dataset consists of heterogeneous data types (e.g. raster images and\n text captions),\n2. your dataset is stored in a :class:`pandas.DataFrame` and different columns\n require different processing pipelines.\n\nThis example demonstrates how to use\n:class:`~sklearn.compose.ColumnTransformer` on a dataset containing\ndifferent types of features. The choice of features is not particularly\nhelpful, but serves to illustrate the technique.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport numpy as np\n\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.datasets import fetch_20newsgroups\nfrom sklearn.decomposition import PCA\nfrom sklearn.feature_extraction import DictVectorizer\nfrom sklearn.feature_extraction.text import TfidfVectorizer\nfrom sklearn.metrics import classification_report\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import FunctionTransformer\nfrom sklearn.svm import LinearSVC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 20 newsgroups dataset\n\nWe will use the `20 newsgroups dataset <20newsgroups_dataset>`, which\ncomprises posts from newsgroups on 20 topics. This dataset is split\ninto train and test subsets based on messages posted before and after\na specific date. We will only use posts from 2 categories to speed up running\ntime.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "categories = [\"sci.med\", \"sci.space\"]\nX_train, y_train = fetch_20newsgroups(\n random_state=1,\n subset=\"train\",\n categories=categories,\n remove=(\"footers\", \"quotes\"),\n return_X_y=True,\n)\nX_test, y_test = fetch_20newsgroups(\n random_state=1,\n subset=\"test\",\n categories=categories,\n remove=(\"footers\", \"quotes\"),\n return_X_y=True,\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each feature comprises meta information about that post, such as the subject,\nand the body of the news post.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(X_train[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating transformers\n\nFirst, we would like a transformer that extracts the subject and\nbody of each post. Since this is a stateless transformation (does not\nrequire state information from training data), we can define a function that\nperforms the data transformation then use\n:class:`~sklearn.preprocessing.FunctionTransformer` to create a scikit-learn\ntransformer.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def subject_body_extractor(posts):\n # construct object dtype array with two columns\n # first column = 'subject' and second column = 'body'\n features = np.empty(shape=(len(posts), 2), dtype=object)\n for i, text in enumerate(posts):\n # temporary variable `_` stores '\\n\\n'\n headers, _, body = text.partition(\"\\n\\n\")\n # store body text in second column\n features[i, 1] = body\n\n prefix = \"Subject:\"\n sub = \"\"\n # save text after 'Subject:' in first column\n for line in headers.split(\"\\n\"):\n if line.startswith(prefix):\n sub = line[len(prefix) :]\n break\n features[i, 0] = sub\n\n return features\n\n\nsubject_body_transformer = FunctionTransformer(subject_body_extractor)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will also create a transformer that extracts the\nlength of the text and the number of sentences.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def text_stats(posts):\n return [{\"length\": len(text), \"num_sentences\": text.count(\".\")} for text in posts]\n\n\ntext_stats_transformer = FunctionTransformer(text_stats)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classification pipeline\n\nThe pipeline below extracts the subject and body from each post using\n``SubjectBodyExtractor``, producing a (n_samples, 2) array. This array is\nthen used to compute standard bag-of-words features for the subject and body\nas well as text length and number of sentences on the body, using\n``ColumnTransformer``. We combine them, with weights, then train a\nclassifier on the combined set of features.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pipeline = Pipeline(\n [\n # Extract subject & body\n (\"subjectbody\", subject_body_transformer),\n # Use ColumnTransformer to combine the subject and body features\n (\n \"union\",\n ColumnTransformer(\n [\n # bag-of-words for subject (col 0)\n (\"subject\", TfidfVectorizer(min_df=50), 0),\n # bag-of-words with decomposition for body (col 1)\n (\n \"body_bow\",\n Pipeline(\n [\n (\"tfidf\", TfidfVectorizer()),\n (\"best\", PCA(n_components=50, svd_solver=\"arpack\")),\n ]\n ),\n 1,\n ),\n # Pipeline for pulling text stats from post's body\n (\n \"body_stats\",\n Pipeline(\n [\n (\n \"stats\",\n text_stats_transformer,\n ), # returns a list of dicts\n (\n \"vect\",\n DictVectorizer(),\n ), # list of dicts -> feature matrix\n ]\n ),\n 1,\n ),\n ],\n # weight above ColumnTransformer features\n transformer_weights={\n \"subject\": 0.8,\n \"body_bow\": 0.5,\n \"body_stats\": 1.0,\n },\n ),\n ),\n # Use a SVC classifier on the combined features\n (\"svc\", LinearSVC(dual=False)),\n ],\n verbose=True,\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we fit our pipeline on the training data and use it to predict\ntopics for ``X_test``. Performance metrics of our pipeline are then printed.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pipeline.fit(X_train, y_train)\ny_pred = pipeline.predict(X_test)\nprint(\"Classification report:\\n\\n{}\".format(classification_report(y_test, y_pred)))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }