{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploratory Data Analysis\n", "Revision 4" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "If we do not know what is actually inside our data, how can we get predictions out of it? That is the reason for this exploration. Digging into the data a little gives us a good sense of what it contains, which helps with model selection and with understanding the features. We already did some \"exploratory data analysis\" in the previous chapter; here we spell it out in a bit more detail." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Shape of the data: how many instances?" ] },
{ "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# iris is assumed to be the dataset object loaded earlier in this notebook with sklearn.datasets.load_iris()\n", "n_samples, n_features = iris.data.shape" ] },
{ "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "150" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n_samples" ] },
{ "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n_features" ] },
{ "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape of data: (150, 4)\n" ] } ], "source": [ "print(\"Shape of data:\", iris['data'].shape)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "No labels are missing: the target has exactly as many entries as there are samples." ] },
{ "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(iris.target) == n_samples" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Feature names\n", "\n", "We saw the names of the four features in the figure above. Let's look at them in our dataset object. After iris we use dot notation to access the value stored under a \"key\". feature_names is one of the keys listed by iris.keys(), and it is also available as an attribute." ] },
{ "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['sepal length (cm)',\n", " 'sepal width (cm)',\n", " 'petal length (cm)',\n", " 'petal width (cm)']" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.feature_names" ] },
{ "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']\n" ] } ], "source": [ "print(iris['feature_names'])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### The target: what do we want to predict?\n", "\n", "There are several ways to display it, but print gives cleaner formatting." ] },
{ "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['setosa', 'versicolor', 'virginica'],\n", " dtype='<U10')" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.target_names" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'numpy.ndarray'>\n", "<class 'numpy.ndarray'>\n" ] } ], "source": [ "print(type(iris.data))\n", "print(type(iris.target))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "What is the shape of the feature matrix? (1st dimension = number of observations, 2nd = number of features)" ] },
{ "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(150, 4)\n" ] } ], "source": [ "print(iris.data.shape)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "What is the shape of the target vector? 
(its single dimension holds one label, also called target or response, per observation)" ] },
{ "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(150,)\n" ] } ], "source": [ "print(iris.target.shape)" ] },
{ "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape of target: (150,)\n" ] } ], "source": [ "print(\"Shape of target:\", iris['target'].shape)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Rules for handling data in scikit-learn\n", "\n", "1. \"Features\" and \"response\" are two separate objects \n", "(As you can see here, the \"features\" and the \"response\", i.e. the \"target\", are separate objects)\n", "\n", "2. Both \"features\" and \"response\" must be numeric \n", "(Here both are numeric; the feature matrix has shape (150 x 4) and the target is a one-dimensional array of 150 values, shape (150,))\n", "\n", "3. Both \"features\" and \"response\" must be \"NumPy arrays\" \n", "(Both of ours are already \"NumPy arrays\"; any other dataset we need would also have to be loaded into \"NumPy arrays\")\n", "\n", "4. Both \"features\" and \"response\" must have specific shapes \n", "\n", "* 150 x 4 -> the full feature matrix \n", "* 150 values for the target, shape (150,) \n", "* 4 for the features (one value per feature in each observation) \n", "* If we want, we can build a differently shaped matrix to suit our needs. For example np.tile(a, [4, 1]), where a is the matrix and [4, 1] gives the number of repetitions along each axis: a is repeated 4 times along the first axis and once along the second." ] },
{ "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Store the feature matrix in capital \"X\"; remember f(x)=y? x is the input and y is the output\n", "X = iris.data\n", "\n", "# Store the response vector in \"y\"\n", "y = iris.target" ] },
{ "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 5.1, 3.5, 1.4, 0.2],\n", " [ 4.9, 3. , 1.4, 0.2],\n", " [ 4.7, 3.2, 1.3, 0.2],\n", " [ 4.6, 3.1, 1.5, 0.2],\n", " [ 5. 
, 3.6, 1.4, 0.2],\n", " [ 5.4, 3.9, 1.7, 0.4],\n", " [ 4.6, 3.4, 1.4, 0.3],\n", " [ 5. , 3.4, 1.5, 0.2],\n", " [ 4.4, 2.9, 1.4, 0.2],\n", " [ 4.9, 3.1, 1.5, 0.1],\n", " [ 5.4, 3.7, 1.5, 0.2],\n", " [ 4.8, 3.4, 1.6, 0.2],\n", " [ 4.8, 3. , 1.4, 0.1],\n", " [ 4.3, 3. , 1.1, 0.1],\n", " [ 5.8, 4. , 1.2, 0.2],\n", " [ 5.7, 4.4, 1.5, 0.4],\n", " [ 5.4, 3.9, 1.3, 0.4],\n", " [ 5.1, 3.5, 1.4, 0.3],\n", " [ 5.7, 3.8, 1.7, 0.3],\n", " [ 5.1, 3.8, 1.5, 0.3],\n", " [ 5.4, 3.4, 1.7, 0.2],\n", " [ 5.1, 3.7, 1.5, 0.4],\n", " [ 4.6, 3.6, 1. , 0.2],\n", " [ 5.1, 3.3, 1.7, 0.5],\n", " [ 4.8, 3.4, 1.9, 0.2],\n", " [ 5. , 3. , 1.6, 0.2],\n", " [ 5. , 3.4, 1.6, 0.4],\n", " [ 5.2, 3.5, 1.5, 0.2],\n", " [ 5.2, 3.4, 1.4, 0.2],\n", " [ 4.7, 3.2, 1.6, 0.2],\n", " [ 4.8, 3.1, 1.6, 0.2],\n", " [ 5.4, 3.4, 1.5, 0.4],\n", " [ 5.2, 4.1, 1.5, 0.1],\n", " [ 5.5, 4.2, 1.4, 0.2],\n", " [ 4.9, 3.1, 1.5, 0.1],\n", " [ 5. , 3.2, 1.2, 0.2],\n", " [ 5.5, 3.5, 1.3, 0.2],\n", " [ 4.9, 3.1, 1.5, 0.1],\n", " [ 4.4, 3. , 1.3, 0.2],\n", " [ 5.1, 3.4, 1.5, 0.2],\n", " [ 5. , 3.5, 1.3, 0.3],\n", " [ 4.5, 2.3, 1.3, 0.3],\n", " [ 4.4, 3.2, 1.3, 0.2],\n", " [ 5. , 3.5, 1.6, 0.6],\n", " [ 5.1, 3.8, 1.9, 0.4],\n", " [ 4.8, 3. , 1.4, 0.3],\n", " [ 5.1, 3.8, 1.6, 0.2],\n", " [ 4.6, 3.2, 1.4, 0.2],\n", " [ 5.3, 3.7, 1.5, 0.2],\n", " [ 5. , 3.3, 1.4, 0.2],\n", " [ 7. , 3.2, 4.7, 1.4],\n", " [ 6.4, 3.2, 4.5, 1.5],\n", " [ 6.9, 3.1, 4.9, 1.5],\n", " [ 5.5, 2.3, 4. , 1.3],\n", " [ 6.5, 2.8, 4.6, 1.5],\n", " [ 5.7, 2.8, 4.5, 1.3],\n", " [ 6.3, 3.3, 4.7, 1.6],\n", " [ 4.9, 2.4, 3.3, 1. ],\n", " [ 6.6, 2.9, 4.6, 1.3],\n", " [ 5.2, 2.7, 3.9, 1.4],\n", " [ 5. , 2. , 3.5, 1. ],\n", " [ 5.9, 3. , 4.2, 1.5],\n", " [ 6. , 2.2, 4. , 1. ],\n", " [ 6.1, 2.9, 4.7, 1.4],\n", " [ 5.6, 2.9, 3.6, 1.3],\n", " [ 6.7, 3.1, 4.4, 1.4],\n", " [ 5.6, 3. , 4.5, 1.5],\n", " [ 5.8, 2.7, 4.1, 1. ],\n", " [ 6.2, 2.2, 4.5, 1.5],\n", " [ 5.6, 2.5, 3.9, 1.1],\n", " [ 5.9, 3.2, 4.8, 1.8],\n", " [ 6.1, 2.8, 4. 
, 1.3],\n", " [ 6.3, 2.5, 4.9, 1.5],\n", " [ 6.1, 2.8, 4.7, 1.2],\n", " [ 6.4, 2.9, 4.3, 1.3],\n", " [ 6.6, 3. , 4.4, 1.4],\n", " [ 6.8, 2.8, 4.8, 1.4],\n", " [ 6.7, 3. , 5. , 1.7],\n", " [ 6. , 2.9, 4.5, 1.5],\n", " [ 5.7, 2.6, 3.5, 1. ],\n", " [ 5.5, 2.4, 3.8, 1.1],\n", " [ 5.5, 2.4, 3.7, 1. ],\n", " [ 5.8, 2.7, 3.9, 1.2],\n", " [ 6. , 2.7, 5.1, 1.6],\n", " [ 5.4, 3. , 4.5, 1.5],\n", " [ 6. , 3.4, 4.5, 1.6],\n", " [ 6.7, 3.1, 4.7, 1.5],\n", " [ 6.3, 2.3, 4.4, 1.3],\n", " [ 5.6, 3. , 4.1, 1.3],\n", " [ 5.5, 2.5, 4. , 1.3],\n", " [ 5.5, 2.6, 4.4, 1.2],\n", " [ 6.1, 3. , 4.6, 1.4],\n", " [ 5.8, 2.6, 4. , 1.2],\n", " [ 5. , 2.3, 3.3, 1. ],\n", " [ 5.6, 2.7, 4.2, 1.3],\n", " [ 5.7, 3. , 4.2, 1.2],\n", " [ 5.7, 2.9, 4.2, 1.3],\n", " [ 6.2, 2.9, 4.3, 1.3],\n", " [ 5.1, 2.5, 3. , 1.1],\n", " [ 5.7, 2.8, 4.1, 1.3],\n", " [ 6.3, 3.3, 6. , 2.5],\n", " [ 5.8, 2.7, 5.1, 1.9],\n", " [ 7.1, 3. , 5.9, 2.1],\n", " [ 6.3, 2.9, 5.6, 1.8],\n", " [ 6.5, 3. , 5.8, 2.2],\n", " [ 7.6, 3. , 6.6, 2.1],\n", " [ 4.9, 2.5, 4.5, 1.7],\n", " [ 7.3, 2.9, 6.3, 1.8],\n", " [ 6.7, 2.5, 5.8, 1.8],\n", " [ 7.2, 3.6, 6.1, 2.5],\n", " [ 6.5, 3.2, 5.1, 2. ],\n", " [ 6.4, 2.7, 5.3, 1.9],\n", " [ 6.8, 3. , 5.5, 2.1],\n", " [ 5.7, 2.5, 5. , 2. ],\n", " [ 5.8, 2.8, 5.1, 2.4],\n", " [ 6.4, 3.2, 5.3, 2.3],\n", " [ 6.5, 3. , 5.5, 1.8],\n", " [ 7.7, 3.8, 6.7, 2.2],\n", " [ 7.7, 2.6, 6.9, 2.3],\n", " [ 6. , 2.2, 5. , 1.5],\n", " [ 6.9, 3.2, 5.7, 2.3],\n", " [ 5.6, 2.8, 4.9, 2. ],\n", " [ 7.7, 2.8, 6.7, 2. ],\n", " [ 6.3, 2.7, 4.9, 1.8],\n", " [ 6.7, 3.3, 5.7, 2.1],\n", " [ 7.2, 3.2, 6. , 1.8],\n", " [ 6.2, 2.8, 4.8, 1.8],\n", " [ 6.1, 3. , 4.9, 1.8],\n", " [ 6.4, 2.8, 5.6, 2.1],\n", " [ 7.2, 3. , 5.8, 1.6],\n", " [ 7.4, 2.8, 6.1, 1.9],\n", " [ 7.9, 3.8, 6.4, 2. ],\n", " [ 6.4, 2.8, 5.6, 2.2],\n", " [ 6.3, 2.8, 5.1, 1.5],\n", " [ 6.1, 2.6, 5.6, 1.4],\n", " [ 7.7, 3. , 6.1, 2.3],\n", " [ 6.3, 3.4, 5.6, 2.4],\n", " [ 6.4, 3.1, 5.5, 1.8],\n", " [ 6. , 3. 
, 4.8, 1.8],\n", " [ 6.9, 3.1, 5.4, 2.1],\n", " [ 6.7, 3.1, 5.6, 2.4],\n", " [ 6.9, 3.1, 5.1, 2.3],\n", " [ 5.8, 2.7, 5.1, 1.9],\n", " [ 6.8, 3.2, 5.9, 2.3],\n", " [ 6.7, 3.3, 5.7, 2.5],\n", " [ 6.7, 3. , 5.2, 2.3],\n", " [ 6.3, 2.5, 5. , 1.9],\n", " [ 6.5, 3. , 5.2, 2. ],\n", " [ 6.2, 3.4, 5.4, 2.3],\n", " [ 5.9, 3. , 5.1, 1.8]])" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", " 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n", " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n", " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }