{ "cells": [ { "cell_type": "markdown", "id": "a11671c4-2b96-4fde-949a-6defb05980b8", "metadata": {}, "source": [ "# ARIM-Academy: Fundamentals, Scikit-learn (Dimensionality Reduction and Cluster Analysis)" ] }, { "cell_type": "markdown", "id": "3488f1a0-297e-4dc3-928b-6121c344fc13", "metadata": {}, "source": [ "## Goals of This Tutorial\n", "In this exercise, we use the **tea elemental analysis dataset** to learn data analysis with **dimensionality reduction** and **clustering** techniques.\n", "\n", "### Contents\n", "We will work through the following topics.\n", "\n", "1. **Learning dimensionality reduction techniques**: We learn dimensionality reduction methods that map high-dimensional data to a low-dimensional space, such as principal component analysis (PCA), t-SNE, and UMAP. These make it possible to visualize multidimensional data, extract patterns from it, and simplify the analysis of complex datasets.\n", "\n", "2. **Understanding clustering algorithms**: We use clustering methods (hierarchical clustering and K-means) to identify similarities, groups, and patterns within the dataset. Clustering reveals the internal structure of the data and shows how the samples group together.\n", "\n", "3. **Combining dimensionality reduction and clustering**: Combining the two deepens our understanding of the data's structure. We visualize the clustering results on the low-dimensional data obtained by dimensionality reduction, which makes the cluster patterns easier to grasp and interpret.\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "90a1da8d-ec4c-4226-b471-76559b4cfd68", "metadata": {}, "source": [ "### Connecting to the Course Materials\n", "Run this cell when working online in Google Colab. (It is not needed if you are not using Google Colab.)" ] }, { "cell_type": "code", "execution_count": null, "id": "9b157a8b-02c3-4eee-b6b0-ccec10147db3", "metadata": {}, "outputs": [], "source": [ "%pip install umap-learn\n", "\n", "!git clone https://github.com/ARIM-Academy/Advanced_Tutorial_1.git\n", "%cd Advanced_Tutorial_1" ] }, { "cell_type": "markdown", "id": "e2a436ed-a3bb-4c3d-b665-4ea4111c4679", "metadata": {}, "source": [ "# 1. Loading and Preprocessing the Dataset" ] }, { "cell_type": "markdown", "id": "cb560e23-95f3-43e2-baa7-bba39d98735a", "metadata": {}, "source": [ "### Importing Libraries\n", "Load the Python libraries used in this curriculum with `import` statements. The machine learning library scikit-learn is imported later in the notebook." ] }, { "cell_type": "code", "execution_count": 1, "id": "6a0ad6b8-484b-44cd-b6f9-d861b08d6fd5", "metadata": {}, "outputs": [], "source": [ "# Libraries\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "import warnings\n", 
"warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "id": "32f6ea9d-5911-452d-9bb5-de559d57c4c2", "metadata": {}, "source": [ "### サンプルファイルの読み込み\n", "pandasライブラリの`read_csv()`はcsvファイルを読み込むメソッドであり、指定したファイルの読み込みます。ここでは[data]フォルダーに格納されているIris.csvのファイルのをデータフレームとして読み込み、そのデータフレームはdfという変数に格納します。" ] }, { "cell_type": "code", "execution_count": 4, "id": "4e96a46b-2e92-45f9-b290-b3577929be34", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"|  | Al | Ca | Cu | Fe | K | Mg | Mn | Na | Zn | tea |\n",
"|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 3.297 | 4.356 | 0.031290 | 0.067 | 99.06 | 3.531 | 1.455 | 0.541 | 0.131 | BT |\n",
"| 1 | 4.267 | 4.118 | 0.031290 | 0.079 | 106.50 | 3.378 | 1.542 | 0.603 | 0.126 | BT |\n",
"| 2 | 4.088 | 4.763 | 0.033370 | 0.084 | 114.00 | 4.763 | 1.838 | 1.058 | 0.156 | BT |\n",
"| 3 | 4.338 | 4.556 | 0.033370 | 0.091 | 122.60 | 5.005 | 2.269 | 0.958 | 0.162 | BT |\n",
"| 4 | 4.732 | 5.138 | 0.035514 | 0.110 | 132.40 | 5.626 | 2.998 | 1.510 | 0.165 | BT |\n",
"| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |\n",
"| 163 | 16.690 | 8.895 | 0.153000 | 0.236 | 323.40 | 20.450 | 10.420 | 6.360 | 0.335 | GC |\n",
"| 164 | 17.620 | 8.909 | 0.177000 | 0.261 | 334.20 | 23.486 | 11.330 | 7.133 | 0.351 | GC |\n",
"| 165 | 17.920 | 9.056 | 0.180000 | 0.266 | 332.30 | 22.840 | 11.290 | 7.609 | 0.358 | GC |\n",
"| 166 | 17.820 | 9.128 | 0.175000 | 0.273 | 367.30 | 24.560 | 12.110 | 8.537 | 0.372 | GC |\n",
"| 167 | 17.650 | 9.048 | 0.197000 | 0.285 | 358.40 | 24.340 | 12.310 | 8.631 | 0.378 | GC |\n",
"\n",
"168 rows × 10 columns
\n",
"\n",
"|  | Al | Ca | Cu | Fe | K | Mg | Mn | Na | Zn |\n",
"|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 3.297 | 4.356 | 0.031290 | 0.067 | 99.06 | 3.531 | 1.455 | 0.541 | 0.131 |\n",
"| 1 | 4.267 | 4.118 | 0.031290 | 0.079 | 106.50 | 3.378 | 1.542 | 0.603 | 0.126 |\n",
"| 2 | 4.088 | 4.763 | 0.033370 | 0.084 | 114.00 | 4.763 | 1.838 | 1.058 | 0.156 |\n",
"| 3 | 4.338 | 4.556 | 0.033370 | 0.091 | 122.60 | 5.005 | 2.269 | 0.958 | 0.162 |\n",
"| 4 | 4.732 | 5.138 | 0.035514 | 0.110 | 132.40 | 5.626 | 2.998 | 1.510 | 0.165 |\n",
"| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |\n",
"| 163 | 16.690 | 8.895 | 0.153000 | 0.236 | 323.40 | 20.450 | 10.420 | 6.360 | 0.335 |\n",
"| 164 | 17.620 | 8.909 | 0.177000 | 0.261 | 334.20 | 23.486 | 11.330 | 7.133 | 0.351 |\n",
"| 165 | 17.920 | 9.056 | 0.180000 | 0.266 | 332.30 | 22.840 | 11.290 | 7.609 | 0.358 |\n",
"| 166 | 17.820 | 9.128 | 0.175000 | 0.273 | 367.30 | 24.560 | 12.110 | 8.537 | 0.372 |\n",
"| 167 | 17.650 | 9.048 | 0.197000 | 0.285 | 358.40 | 24.340 | 12.310 | 8.631 | 0.378 |\n",
"\n",
"168 rows × 9 columns
\n", "