{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "accelerator": "GPU", "colab": { "name": "final-sentiment-analysis_v2", "provenance": [], "collapsed_sections": [], "include_colab_link": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "vpaLrN0mteAS" }, "source": [ "## Bengali Sentiment Analysis with TensorFlow Hub\n", "\n", "This is still a draft. " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "GhN2WtIrBQ4y" }, "source": [ "We are almost at the end of the book. This final tutorial draws a little on everything we have learned so far. Here we discuss how the general-purpose libraries we used in the earlier sentiment analysis can be improved with pre-trained models and transfer learning. Along the way we worked with TensorFlow Serving and its API. We also discussed TensorFlow Hub and the “SavedModel” format, in which models can be stored for transfer learning. TensorFlow Hub keeps pre-trained models as modules that can be connected through various APIs. Most of today's story is about TensorFlow Hub. " ] }, { "cell_type": "markdown", "metadata": { "id": "4pJhKIERYWTl", "colab_type": "text" }, "source": [ "## Today's tasks\n", "\n", "1. Download the pre-trained Word2Vec embeddings. \n", "\n", "2. Download two labeled datasets for training: one file of positive sentiment and one of negative sentiment. Take a look at what is inside.\n", "\n", "3. Download the ready-made word embedding converter/exporter script from TensorFlow Hub, which exports the word embeddings downloaded in step 1 to a TensorFlow Hub text embedding module. \n", "\n", "4. 
Keep the pre-trained Word2Vec embedding file and the converter/exporter script in the same directory. The exporter script reads the embedding vectors from our .txt or .vec (fastText in particular) word embedding file and exports them to a “SavedModel”. \n", "\n", "5. TensorFlow Hub loads this “SavedModel” as a module, which our model uses for sentiment analysis. \n", "\n", "6. Build a sequential model. Because the dataset is large, use a generator function with random shuffling and the necessary batching, through the tf.data.Dataset.from_generator method. This is needed for training.\n", "\n", "7. Train and ‘save’ the model. The text embedding module can be added to this small sequential model with model.add, like any other layer. \n", "\n", "8. Finally, testing. Pass a few sentences to the predict method and it tells us which of the two classes (positive/negative) each belongs to." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Q4DN769E2O_R" }, "source": [ "## Load or install the required libraries \n" ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "zA07b51AGF5l", "outputId": "f94cab5b-79a1-4fa1-a0de-ab90e6ec1274", "colab": { "base_uri": "https://localhost:8080/", "height": 70 } }, "source": [ "!pip install -q tensorflow-gpu==2.0.0-beta1\n", "# !pip install -q tensorflow-gpu==1.15" ], "execution_count": 1, "outputs": [ { "output_type": "stream", "text": [ "\u001b[K |████████████████████████████████| 348.9MB 48kB/s \n", "\u001b[K |████████████████████████████████| 3.1MB 49.5MB/s \n", "\u001b[K |████████████████████████████████| 501kB 71.6MB/s \n", "\u001b[?25h" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "zSeyZMq-BYsu", "outputId": "434480e0-1cbd-4096-97cf-a6d694e0236c", "colab": { "base_uri": "https://localhost:8080/", "height": 534 } }, "source": [ "import tensorflow as tf\n", "import tensorflow_hub as 
hub\n", "import numpy as np\n", "import os\n", "from sklearn.metrics import classification_report\n", "from gensim.models import Word2Vec\n", "\n", "# Check what is actually installed\n", "print(\"Version: \", tf.__version__)\n", "print(\"Eager mode: \", tf.executing_eagerly())\n", "print(\"Hub version: \", hub.__version__)\n", "print(\"GPU is\", \"available\" if tf.test.is_gpu_available() else \"NOT AVAILABLE\")" ], "execution_count": 2, "outputs": [ { "output_type": "stream", "text": [ "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_qint8 = np.dtype([(\"qint8\", np.int8, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_quint8 = np.dtype([(\"quint8\", np.uint8, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_qint16 = np.dtype([(\"qint16\", np.int16, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_quint16 = np.dtype([(\"quint16\", np.uint16, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_qint32 = np.dtype([(\"qint32\", 
np.int32, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " np_resource = np.dtype([(\"resource\", np.ubyte, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_qint8 = np.dtype([(\"qint8\", np.int8, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_quint8 = np.dtype([(\"quint8\", np.uint8, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_qint16 = np.dtype([(\"qint16\", np.int16, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_quint16 = np.dtype([(\"quint16\", np.uint16, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_qint32 = np.dtype([(\"qint32\", np.int32, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: 
Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " np_resource = np.dtype([(\"resource\", np.ubyte, 1)])\n" ], "name": "stderr" }, { "output_type": "stream", "text": [ "Version: 2.0.0-beta1\n", "Eager mode: True\n", "Hub version: 0.7.0\n", "GPU is available\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "_SF78GWg74cW", "colab_type": "code", "outputId": "98d465f8-cdb7-4c22-b22b-6f63e55189b2", "colab": { "base_uri": "https://localhost:8080/", "height": 34 } }, "source": [ "tf.__version__" ], "execution_count": 3, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'2.0.0-beta1'" ] }, "metadata": { "tags": [] }, "execution_count": 3 } ] }, { "cell_type": "code", "metadata": { "id": "_qRk_Ff7EGGc", "colab_type": "code", "colab": {} }, "source": [ "# Suppress extra warnings; you will not need this in your own work \n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "KaQVUacCFMhE", "colab_type": "text" }, "source": [ "In this sentiment analysis we use a “SavedModel” inside a Keras layer so that we can reuse pre-trained models whose embeddings have already been built. The TensorFlow Hub (tensorflow_hub) library provides the hub.KerasLayer class, which fetches the computation and pre-trained weights of a “SavedModel” from a URL or from the file system. Here we reuse a TensorFlow 2 “SavedModel” through the low-level hub.load() API and its hub.KerasLayer wrapper.\n", "\n", "To keep things simple, we use pre-trained word embeddings for Bengali. We have used Word2Vec and fastText for pre-trained word embeddings before. As you know, Facebook released pre-trained fastText word vectors for 157 languages, including Bengali, some time ago. Still, Word2Vec is not that 
bad. You can also try fastText in place of Word2Vec in this model.\n", "\n", "We could have used BERT or Flair if we wanted, but once you understand these frameworks, working with the others becomes easier." ] }, { "cell_type": "code", "metadata": { "id": "En9nhQ0elBEQ", "colab_type": "code", "outputId": "46b79737-ef46-4205-8194-d4634698766b", "colab": { "base_uri": "https://localhost:8080/", "height": 34 } }, "source": [ "hub_url = \"https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1\"\n", "embed = hub.KerasLayer(hub_url)\n", "embeddings = embed([\"A long sentence.\", \"single-word\", \"http://example.com\"])\n", "print(embeddings.shape, embeddings.dtype)" ], "execution_count": 5, "outputs": [ { "output_type": "stream", "text": [ "(3, 128) \n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "ldyAlgRHlI2R", "colab_type": "text" }, "source": [ "Building a text classifier from ordinary Keras layers is straightforward here." ] }, { "cell_type": "code", "metadata": { "id": "UIVnQxk_lDEP", "colab_type": "code", "colab": {} }, "source": [ "model = tf.keras.Sequential([\n", " embed,\n", " tf.keras.layers.Dense(16, activation=\"relu\"),\n", " tf.keras.layers.Dense(1, activation=\"sigmoid\"),\n", "])" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "9FB7gLU4F54l" }, "source": [ "# Downloading the Word2Vec word embeddings \n", "\n", "I am still looking for a place to host a file this large. Since the file is about 2.5 GB, I split it into two parts; compressed, it comes to about 1 GB. We download the two files and join them." ] }, { "cell_type": "code", "metadata": { "id": "hqwIWFs_BAvb", "colab_type": "code", "outputId": "3450f148-3dd8-4906-9f01-3238903ec9b8", "colab": { "base_uri": "https://localhost:8080/", "height": 605 } }, "source": [ "!wget https://bitbucket.org/r_hassan/datasets/raw/9caa4f67c34540e601cbad4de68d4786271b782c/bn-wiki-word2vec-300.txt.tgz.aa\n", "!wget 
https://bitbucket.org/r_hassan/datasets/raw/9caa4f67c34540e601cbad4de68d4786271b782c/bn-wiki-word2vec-300.txt.tgz.ab" ], "execution_count": 7, "outputs": [ { "output_type": "stream", "text": [ "--2019-11-23 03:19:45-- https://bitbucket.org/r_hassan/datasets/raw/9caa4f67c34540e601cbad4de68d4786271b782c/bn-wiki-word2vec-300.txt.tgz.aa\n", "Resolving bitbucket.org (bitbucket.org)... 18.205.93.1, 18.205.93.0, 18.205.93.2, ...\n", "Connecting to bitbucket.org (bitbucket.org)|18.205.93.1|:443... connected.\n", "HTTP request sent, awaiting response... 302 Found\n", "Location: https://api.media.atlassian.com/file/af96b4a4-e409-43e6-9423-d69a7ba4df91/binary?token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiIyYmI2MzgwYS01NTZhLTRhOTMtYWI0ZS02OTc0Y2NmNTUyMjciLCJhY2Nlc3MiOnsidXJuOmZpbGVzdG9yZTpmaWxlOmFmOTZiNGE0LWU0MDktNDNlNi05NDIzLWQ2OWE3YmE0ZGY5MSI6WyJyZWFkIl19LCJuYmYiOjE1NzQ0NzkxMjUsImV4cCI6MTU3NDQ3OTU0NX0.vTwdFQgbjaS8LiNMK--w_vLKLxNGqTFvFDrZV9T1uXQ&client=2bb6380a-556a-4a93-ab4e-6974ccf55227&dl=1&name=bn-wiki-word2vec-300.txt.tgz.aa [following]\n", "--2019-11-23 03:19:45-- https://api.media.atlassian.com/file/af96b4a4-e409-43e6-9423-d69a7ba4df91/binary?token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiIyYmI2MzgwYS01NTZhLTRhOTMtYWI0ZS02OTc0Y2NmNTUyMjciLCJhY2Nlc3MiOnsidXJuOmZpbGVzdG9yZTpmaWxlOmFmOTZiNGE0LWU0MDktNDNlNi05NDIzLWQ2OWE3YmE0ZGY5MSI6WyJyZWFkIl19LCJuYmYiOjE1NzQ0NzkxMjUsImV4cCI6MTU3NDQ3OTU0NX0.vTwdFQgbjaS8LiNMK--w_vLKLxNGqTFvFDrZV9T1uXQ&client=2bb6380a-556a-4a93-ab4e-6974ccf55227&dl=1&name=bn-wiki-word2vec-300.txt.tgz.aa\n", "Resolving api.media.atlassian.com (api.media.atlassian.com)... 18.246.31.164, 18.246.31.165, 18.246.31.166\n", "Connecting to api.media.atlassian.com (api.media.atlassian.com)|18.246.31.164|:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 492830720 (470M) [application/x-gzip]\n", "Saving to: ‘bn-wiki-word2vec-300.txt.tgz.aa’\n", "\n", "bn-wiki-word2vec-30 100%[===================>] 470.00M 24.3MB/s in 20s \n", "\n", "2019-11-23 03:20:05 (23.8 MB/s) - ‘bn-wiki-word2vec-300.txt.tgz.aa’ saved [492830720/492830720]\n", "\n", "--2019-11-23 03:20:06-- https://bitbucket.org/r_hassan/datasets/raw/9caa4f67c34540e601cbad4de68d4786271b782c/bn-wiki-word2vec-300.txt.tgz.ab\n", "Resolving bitbucket.org (bitbucket.org)... 18.205.93.1, 18.205.93.0, 18.205.93.2, ...\n", "Connecting to bitbucket.org (bitbucket.org)|18.205.93.1|:443... connected.\n", "HTTP request sent, awaiting response... 302 Found\n", "Location: https://api.media.atlassian.com/file/ecab056b-19f1-44c3-a0ee-7a681d8f65c2/binary?token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiIyYmI2MzgwYS01NTZhLTRhOTMtYWI0ZS02OTc0Y2NmNTUyMjciLCJhY2Nlc3MiOnsidXJuOmZpbGVzdG9yZTpmaWxlOmVjYWIwNTZiLTE5ZjEtNDRjMy1hMGVlLTdhNjgxZDhmNjVjMiI6WyJyZWFkIl19LCJuYmYiOjE1NzQ0NzkxNDcsImV4cCI6MTU3NDQ3OTU2N30.iHKKVa9f_t1qfGXJZyJXqvJGBmC7ZGSeO3p2FL_lBZY&client=2bb6380a-556a-4a93-ab4e-6974ccf55227&dl=1&name=bn-wiki-word2vec-300.txt.tgz.ab [following]\n", "--2019-11-23 03:20:06-- https://api.media.atlassian.com/file/ecab056b-19f1-44c3-a0ee-7a681d8f65c2/binary?token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiIyYmI2MzgwYS01NTZhLTRhOTMtYWI0ZS02OTc0Y2NmNTUyMjciLCJhY2Nlc3MiOnsidXJuOmZpbGVzdG9yZTpmaWxlOmVjYWIwNTZiLTE5ZjEtNDRjMy1hMGVlLTdhNjgxZDhmNjVjMiI6WyJyZWFkIl19LCJuYmYiOjE1NzQ0NzkxNDcsImV4cCI6MTU3NDQ3OTU2N30.iHKKVa9f_t1qfGXJZyJXqvJGBmC7ZGSeO3p2FL_lBZY&client=2bb6380a-556a-4a93-ab4e-6974ccf55227&dl=1&name=bn-wiki-word2vec-300.txt.tgz.ab\n", "Resolving api.media.atlassian.com (api.media.atlassian.com)... 18.246.31.164, 18.246.31.165, 18.246.31.166\n", "Connecting to api.media.atlassian.com (api.media.atlassian.com)|18.246.31.164|:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 486748692 (464M) [application/x-dosexec]\n", "Saving to: ‘bn-wiki-word2vec-300.txt.tgz.ab’\n", "\n", "bn-wiki-word2vec-30 100%[===================>] 464.20M 22.7MB/s in 20s \n", "\n", "2019-11-23 03:20:27 (22.8 MB/s) - ‘bn-wiki-word2vec-300.txt.tgz.ab’ saved [486748692/486748692]\n", "\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "2lOaqH55P9BJ", "colab_type": "code", "colab": {} }, "source": [ "# Concatenate the two parts and extract the tar archive \n", "\n", "!cat bn-wiki-word2vec-300.txt.tgz.* | tar xzf -" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "pCdVTe2Yu8GI", "colab_type": "code", "outputId": "7cdbcfe3-59c8-484b-e47d-eaafcfc15c5b", "colab": { "base_uri": "https://localhost:8080/", "height": 161 } }, "source": [ "# The first few lines show the format: word embedding vectors\n", "# Each line starts with the Bengali word, followed by its vector\n", "\n", "!head -7 bn-wiki-word2vec-300.txt" ], "execution_count": 9, "outputs": [ { "output_type": "stream", "text": [ "669605 300\n", "এবং -0.93040687 0.60418844 0.6206399 0.15345214 -0.5920706 1.4105053 -0.19125229 1.9424365 -0.28456318 0.86637765 -0.34657523 0.008341969 0.91945136 0.33013958 -1.6456839 -1.6953105 1.9161752 1.1476667 0.17091753 0.3958588 1.0207202 -0.8163486 0.32261878 -0.30720857 -0.6554219 -1.7145324 -1.6113459 0.29473424 -0.8452265 0.18330733 1.047255 0.22511762 0.822286 0.16025306 0.66336554 1.0438149 0.6023638 -0.64874256 1.5032426 1.5895689 0.75842565 -1.2870961 0.079544045 0.3080709 0.32782224 -0.7009649 0.15249959 -1.027652 0.8451291 -0.32714248 0.42230263 -1.4003234 -0.59839815 -0.67217594 1.072765 0.2526819 0.16195725 -1.2569925 -0.5837513 -1.1979657 -0.6138971 0.79471904 -0.9409709 1.2761021 0.89106756 0.53292865 2.2675922 -0.13259064 0.15469204 1.3745763 -0.5177524 0.41830626 0.5299528 -0.40102947 -0.42628673 -1.0313057 0.55274475 -0.88331276 0.21075027 1.387416 0.5721329 0.35013482 -0.21881458 -2.7000587 -1.14341 
1.7165354 1.2415577 -0.13076034 0.9847175 -1.4681516 0.35087734 0.7275639 -0.9640771 0.3047465 1.379611 -0.7907444 0.60839903 1.2384896 -0.28551388 -1.2486242 0.43692696 1.3337344 1.4157426 -0.5497216 -1.0586624 1.1485555 -0.2848548 -0.052209064 -0.27139845 -0.9404369 -0.5050181 -0.41314003 -0.28034905 -1.5697598 -0.59607816 -0.7769144 -1.5711054 -0.590155 0.04364686 0.10375001 0.97881234 0.44856653 1.7305022 1.0183886 0.78831923 1.2242231 0.4937382 -1.0819402 -1.3215535 -1.01117 0.206751 0.6223832 1.0083629 -0.59937185 1.3992922 -1.7688543 -0.62926817 0.9333828 -0.20508951 -0.18137959 1.6263956 -0.26525566 -0.70443696 -1.6312447 -1.3012207 1.5594385 0.18196549 0.10253664 -0.27073783 -0.57916814 0.08798229 0.7529285 0.27043918 0.13748497 -0.7652544 1.4045902 -0.2284309 -0.017713083 -0.16723397 1.4385971 1.075745 1.4567398 -0.3620912 -0.049863935 0.21818072 0.494385 0.26240674 0.47549942 -0.40527856 -2.5260077 -0.93502015 -0.6997547 0.66674054 -0.32764944 -0.51164323 0.15243222 -0.2710508 1.4720447 -0.9499978 1.5680584 -0.49819544 0.9979104 1.3278234 0.28267214 -1.9825058 -0.5250971 -1.3529805 1.0538665 -2.9742591 1.3853692 -1.0246351 -0.45788342 -0.58545464 -0.59052104 -0.07944148 0.3599861 -0.09542372 0.48706493 -0.30248284 0.91800874 0.22113384 -1.6033583 0.611133 -1.6167171 0.8257279 1.511753 0.07948396 -0.42566258 1.413907 1.2193689 -2.1537566 0.9580351 0.88082844 0.85505676 -0.097767904 1.9956189 -1.4581429 -0.40138552 -1.5197046 0.895531 0.43802926 1.9927098 0.18570288 -0.1933191 0.37472582 1.6219916 -1.3226445 1.504367 0.7569192 0.3736352 1.9224497 -0.81562334 1.1996975 0.81546986 1.5816469 0.37639666 -1.4780647 0.60364085 -0.2537704 0.10160284 -1.1119928 0.9420247 0.33656976 -0.6338852 0.2734657 -0.3280644 1.0076349 0.31781343 -1.4982619 0.24992752 1.289109 -1.1146048 -0.38495338 -1.235198 0.69176793 -0.73397833 0.8294925 -0.3333072 -0.75034577 -0.30954805 0.51477766 -0.3312381 1.8786335 -0.95206314 1.3874136 0.13236618 -2.1544163 -1.9921167 -0.68547404 
-1.6311249 -2.8701363 1.0035746 0.84304595 -0.046721794 -0.5423261 -1.3923548 1.0062038 1.3781087 0.6179311 0.1195227 1.5426967 1.2713842 -0.87397003 1.1154585 -1.1934513 0.54639715 0.25530308 -0.8559359 -1.4125086 1.0636417 0.4095158 -0.21225604 0.14483044 0.043718226 0.3257524 0.74142253 0.6205646 1.7895927 0.6101109 0.2586147\n", "ও -0.58524036 0.45447958 -0.271059 0.903281 -0.93572026 0.781395 0.13197729 2.583347 0.017617458 0.9133457 -0.20172201 0.5319752 0.045481898 0.48530668 -2.8445883 -2.1672294 1.7199044 1.5424261 -0.23204179 -0.47215548 0.23906806 -2.0918553 0.3764463 0.11041565 0.9503702 -1.0620172 -1.6744974 0.51625335 -0.26014885 -0.9991527 0.34031266 -0.43819448 -0.21267189 -0.7827624 0.5974515 -0.671627 1.1221253 -0.16315356 0.66311425 0.79359466 0.36301354 -0.1322366 1.1541231 0.0052844356 -0.16176642 0.24536197 0.26942456 -0.8911422 1.5041916 -0.099623024 0.13907145 0.6387255 -0.97795665 -0.20738949 1.0964372 1.3952187 1.1587632 1.845541 -0.61203593 0.031217841 -0.15978587 0.82115567 -0.5647423 0.99255794 -0.3691354 -0.43174398 1.0422002 0.2552028 0.91485304 1.1568944 -0.49558777 0.84087986 0.17164658 1.0703882 -0.8077708 -2.308471 0.29645053 -1.1976753 -0.1050389 0.48258373 0.7099251 -0.3150525 -0.52110106 -1.5863286 -0.5118045 2.1650927 1.2914363 -0.94399244 0.78685063 -0.57375664 -0.015934566 0.53221756 -2.271201 -0.20666255 0.6977974 -1.0001872 0.19057953 1.5204624 1.2299569 -1.5117029 0.7973736 1.0480093 1.1071234 0.10898301 -0.3522087 2.949563 -0.66522825 0.5497931 -0.06854835 -0.45992488 -0.38099083 -1.4241983 1.7252287 -0.7620656 0.90273964 0.01332941 -1.9926864 -1.3374759 -0.030660542 -0.12929039 1.8411474 1.0526581 2.9473522 1.3345705 0.5947804 1.2005383 0.07082191 -0.23855963 -0.92356974 -0.7214136 0.09630342 1.3055235 1.1290928 0.16470808 1.0403022 -0.33776775 -0.9134136 0.44647256 0.6388682 -0.5466339 0.8665062 -1.8086394 -1.3353461 -0.5004335 -1.0874145 1.2467664 -0.94416803 0.14157577 0.5646803 -1.2637936 -2.4817054 -0.43978366 
-0.9711801 -0.11627038 0.2651847 0.22482847 -0.646065 0.6627505 0.18591863 1.3862606 0.26508737 1.4358656 -0.1356123 0.30023885 -1.0071484 0.14609484 -0.40067545 0.11614575 -1.1211672 -2.9948976 -1.1640134 -0.8893208 0.69647986 0.5226507 0.43557674 0.4416255 -0.1976004 1.3055359 -0.35186562 0.96851355 0.6488393 1.613607 0.6422151 0.039758135 -1.3041908 -1.0441707 -0.33136213 0.42566493 -1.8446702 0.8682943 -0.4355695 -0.05710234 -1.3763376 0.44872513 -0.18696697 0.7705693 0.20046796 -0.49683523 -0.9298872 0.19058146 -1.049943 -0.9615622 0.43946669 -1.7700518 0.18221591 1.9623158 1.4556637 -1.4133968 2.1814926 -0.36896977 -0.26596007 0.7020955 0.7601751 0.9486567 0.12267504 1.7857215 1.3161366 -0.007625853 -0.19064592 2.14113 -0.06724504 1.9534035 0.3407787 0.12543564 0.71115464 0.050714646 -0.59355587 -0.03792645 0.23806724 -0.26249117 1.2605692 -1.0072167 0.9429875 0.76088065 1.6205138 -1.8962756 -2.7197406 0.089888565 0.5498877 -1.2580627 -1.8090764 -0.91368014 0.72881436 -0.22046834 -0.69258153 -1.5590016 0.33359516 0.3590895 -0.19181904 0.7086726 0.8025603 -1.6143092 -1.9493469 0.31151736 0.8470133 -1.691301 1.0816429 0.71139693 -0.2311525 -1.1807052 0.87001324 -0.3568379 1.6725506 -0.50945103 -0.27646524 0.37701982 -2.2516382 -2.0910823 0.4179324 -1.0593878 -1.7708141 0.87476987 -0.030148244 1.4041723 -1.1686617 -0.19899243 0.8683499 1.4706208 0.95210135 0.8617375 1.2374552 0.26896238 -0.35295638 1.6243384 0.05577474 1.4881486 -0.75438017 1.0217514 -1.022703 0.840268 0.9699244 -0.30531976 -0.9636556 0.18336996 0.85473806 1.6347116 -0.83497536 0.63057977 1.2308702 -0.32897514\n", "হয় -1.8247691 -2.0915492 1.5521196 0.9817031 -0.71198153 -1.1406553 0.20706734 2.002229 0.4222637 1.0221027 1.9313294 0.20992161 -0.22667134 0.5778148 -1.9811527 -1.6068003 0.51935846 4.190498 1.6591264 3.696415 -0.8972124 -0.9695677 -1.0911344 -2.5159888 0.64899313 -3.9777596 1.2032437 1.5543182 -0.5659898 -1.1380624 0.7247645 -0.81457055 2.3773313 -0.10186057 0.2803823 -1.8328085 
1.8070668 0.49927655 3.7146556 2.3095276 1.4903845 0.24114813 -1.6105928 2.377896 -1.8299401 -1.6903756 -0.045043427 -3.1909945 -0.9577706 0.89430904 2.3162541 1.4072512 0.15841399 -0.17395592 1.276816 0.14795898 0.7110485 0.8493568 -0.89166087 1.3985461 -0.2607159 -1.2707427 -0.5525633 1.288889 -1.2450587 2.4014404 -1.647329 0.3628347 -0.36183968 1.8255006 -0.49935928 -3.7446706 3.0025551 1.1856798 -1.0313323 -1.7755738 -3.3956897 -0.0071685165 0.46419826 0.79227895 1.6122632 -1.1335179 0.69419193 -2.092566 -1.2973521 3.7132552 -0.12574294 1.7159435 0.5323431 -1.177301 -0.5910223 1.8957894 1.3004191 0.043679 -2.1573641 -1.2192295 -0.5794048 0.5401791 0.09259804 -0.884269 -0.8885932 -1.6349372 2.9712481 -0.5136787 0.47402793 3.2185366 -1.0303059 1.8862349 -1.2980344 1.3100715 2.224686 2.2730892 0.2538476 -0.10844603 -0.13062906 2.7270248 0.6029394 0.61437994 2.1098824 0.831568 1.8434451 -0.96293265 -1.2397857 1.6735845 -1.2131585 0.26517543 0.81909645 0.6247328 -1.4168934 1.3242841 -0.7107275 1.4562027 0.64761376 2.464252 -1.7311655 -1.0090224 0.26995316 0.8978482 1.0040481 -0.9774874 -0.14152434 0.90454286 1.0976329 -0.26051864 -0.67981464 -0.14273185 -0.59774244 -1.2385356 -0.5086274 1.9503305 -0.4984986 3.2457983 -2.5104663 0.71335363 1.2941341 0.24610522 2.3594325 -2.3504922 1.334544 1.430389 2.6565323 2.0589359 0.16488439 1.20627 0.86995834 -0.9385664 -0.70905566 -1.2333251 0.31907684 -2.1176155 -0.5103191 0.038472503 0.6015968 -1.3714979 0.21941918 -2.0026355 0.5943965 -2.8778977 0.27973834 1.6905153 0.015640106 -1.3693813 0.27766258 1.8742485 -5.211662 0.48750126 -0.27569988 1.3009589 -3.8204072 1.46969 1.0037109 -2.82306 -0.746158 1.1147379 -3.249578 2.260624 1.354588 -1.1860611 -0.014181979 -0.19925839 -0.3954283 -0.6466107 -1.9974515 0.4805865 -2.211764 0.52255344 -0.18123792 -1.6121762 3.2712665 1.7167118 1.1680586 1.1729398 -0.28861085 1.5550202 1.0895033 2.2843733 0.7231674 0.7465411 -1.0611109 -0.49648127 0.4654247 2.1616533 -1.2523402 1.0771954 
0.74913055 1.0157934 2.9522753 -0.76349473 -0.84537077 0.82048655 1.1640248 0.2368476 0.34759012 2.1007953 0.7149294 -0.21828659 -0.11407231 -0.1643775 1.2118605 3.0619822 -1.5080625 -0.93636864 0.0018499214 1.2086605 0.26163357 0.56321836 -1.9514295 2.2925124 -0.45035234 -3.6912563 -2.088268 0.5641842 -2.1283545 -2.0851738 0.36083636 0.95293146 -0.7159297 0.26238683 -0.0973097 0.28054848 -0.34115073 -0.52036345 1.0155922 -1.0751973 0.42513663 -0.052052822 -1.1211832 -0.41907045 0.707179 -0.69412476 -2.3132749 -1.1080768 -0.09723622 -2.4235508 -3.4284568 0.8030779 1.1016521 -2.4200463 -1.4302769 0.3497601 1.2805206 0.71379256 0.8369469 1.64142 0.6716727 -1.4271523 0.48001695 -3.4348445 -0.2225095 1.2764535 -0.8893114 -0.31440404 -1.0537965 -0.5511297 -0.7446707 -2.1130662 1.9561703 -0.7298195 -0.45455173 0.49335495\n", "করে 1.4638041 0.46013883 0.4770293 2.1610985 -0.09711245 -0.5382342 -1.3283437 0.70485425 -0.5950621 1.2856623 -0.8577408 -0.70547193 -1.6236017 0.6296531 -1.9744449 -2.508508 1.4924573 1.1503576 1.5973461 0.5769804 -0.6069721 0.13309194 -2.1690218 -0.3986674 -0.1678412 -2.250409 -0.9179335 -1.0952939 -0.88789827 0.09512321 1.7706733 0.11340081 2.1630971 -1.0658263 -0.24395598 -1.1115603 1.5150295 1.1675327 2.7483194 3.5537095 -0.59933996 0.81488013 -0.6739365 1.773015 1.0559928 0.6247622 -0.3478868 -2.965228 -0.21492709 0.89353067 2.9318707 1.9227598 -0.11769063 -2.4075592 0.6904535 2.3752434 -0.037284724 -0.63309926 -3.128802 2.9791856 1.5833476 0.6449195 -0.32738364 2.6401837 0.44933286 1.1283821 2.565787 1.5990252 2.102574 1.1839719 -0.08801095 -1.3336484 0.4691175 2.7238932 -0.07542253 -1.1330373 -1.0942403 -1.2601542 1.1726574 0.8933551 1.5275543 0.8165056 -0.5239747 -2.4554758 0.36391234 0.6503322 2.8399084 1.4098556 -0.51982903 -2.1042693 -0.05306528 -0.56123155 1.1982063 1.2563752 0.7620389 -1.2276828 -0.84249467 1.4393364 0.77440304 -2.104103 -2.059731 -0.21191682 -0.19533132 0.9014223 -0.13766639 1.7979478 -1.5070478 -0.44969973 
0.33912385 0.5111161 0.42914236 2.8019922 -0.593548 1.031023 -0.35696626 1.9381701 0.61078405 -0.1258579 1.0306083 0.8604857 -0.13915882 -1.4226767 3.048097 1.00559 -0.22615053 1.128138 1.1582717 -2.166985 -0.65041953 -0.08019211 1.2970815 1.0033172 0.635273 1.8792266 -1.8587356 -0.4351632 -1.0432365 0.3433936 1.877641 -0.6134908 1.7805911 0.44998482 0.6516786 0.08570023 -0.25780004 1.9028347 1.5127552 0.44377133 -1.0347323 0.073662594 0.6528598 1.7802001 -0.08616957 0.10228003 0.90827394 1.8746697 3.0865862 -1.5397671 -2.0126138 1.3180072 2.6859086 2.988848 -0.47898462 1.8480692 -0.8682913 1.2637501 0.38463044 -2.0548644 1.1856201 -2.8460965 -2.0303233 -0.34533274 2.5564435 0.85207915 -0.07946124 -2.363868 -1.1239123 -1.2606779 -0.930255 1.6201055 -0.9506335 -1.3052654 1.1424946 2.498399 -2.8332803 2.473882 -1.9490753 1.0908501 -2.2823808 1.3368375 1.2825803 -1.5106692 1.9418434 0.6338872 -4.218184 0.1354068 0.54564655 -0.613364 0.8976207 -0.33982697 0.43815103 -1.8296937 0.3118065 -1.1050173 -1.0681049 3.0734577 -0.06152384 0.04139191 2.8266747 1.7856406 -0.37226483 -1.1455252 0.07255044 3.1156726 -0.9318064 2.2944505 -0.4834573 -0.22268529 -1.4305617 -0.71662635 -1.4057091 3.9090807 0.3661221 -2.7341263 0.43841648 -0.7899564 0.7970126 1.2823138 -0.35770118 0.21682079 2.389844 -3.4076505 2.0682027 -0.01840818 -1.5271113 0.3105775 -1.7938234 1.9284834 -1.203957 0.54004896 -0.6391245 -0.95948666 0.7867698 1.0515132 -1.1418717 0.61196977 0.50458574 -0.9365569 -0.6977027 -1.6735121 -1.8234655 -0.65726954 0.51447946 -0.74598676 0.1281155 -1.5880537 1.9741507 0.3338909 0.6053169 -1.1810766 0.7080426 -0.96108526 1.6863154 -0.9004534 1.7303089 -0.5598361 -1.5497233 0.27824214 0.29501814 -1.0096308 -2.839824 -0.3321661 1.1058911 -1.7745312 -2.4715776 -0.32964313 1.6028548 2.7005281 -0.5799925 0.34175193 2.3922627 1.4408594 -0.21198568 2.0125268 0.73433226 0.11654882 -1.5469825 0.052762344 -0.81855565 0.42038527 1.0864041 -1.083027 -0.4222766 0.56002355 -1.7107468 
-1.171553 2.6199453 -0.6048778 1.6012655 -0.3749908\n", "তিনি -1.1738678 0.37016252 -0.19781187 -0.16410504 -0.753571 -0.4210705 0.01682054 -0.26354325 1.439832 1.9638042 -0.18513128 0.21441272 -0.369283 0.7091637 -1.6222771 -1.0937014 -0.4071048 -1.758798 1.9133385 -1.0898916 0.5836275 1.6742417 1.6410733 -0.34128645 1.5230937 0.21261688 -1.420016 -2.4655473 -1.3401339 0.018569082 -0.034572214 -2.9821966 -0.91251194 0.7565992 0.45008317 0.99885845 -0.6392834 -0.44748396 1.8189455 1.6601168 0.24081528 -0.15245743 1.1164213 -0.60606116 1.1505754 -0.5297351 -1.8496084 -0.7881723 0.8304548 -1.2648691 -0.39852887 1.6382383 0.0046687606 -2.6636164 1.0559351 -0.14164782 0.47628072 0.52358997 -0.18506156 0.5733556 -0.7096374 -0.1830227 -1.5379386 0.60600984 -1.4188213 2.2663279 0.99508315 -0.9390181 -0.06525691 0.7604507 0.40174294 -0.936488 0.581012 0.8795788 1.8943496 -1.1697103 0.046171933 -0.9759555 0.64223045 -0.30895722 1.4993834 1.8423985 -0.5992019 1.4912438 -0.6711824 0.67837816 2.2673793 0.6526759 -0.2134826 1.2043437 -1.3382409 2.451539 -0.5747329 -0.4633366 1.9642861 -2.5575483 2.2893333 -1.731404 -2.2986598 -1.814027 -0.20896345 0.7783642 1.1070861 -0.6239703 1.4139019 1.3592714 0.6100163 2.6752806 0.5042902 -0.0152576035 0.2540001 -1.4778836 -0.08349057 -1.4187082 1.3294297 0.06202054 0.5349524 0.6188579 0.011822534 1.363343 1.0386257 0.7101702 -0.48664105 -1.2881335 1.5449836 2.4187858 1.3230757 -1.7545801 -1.2919799 -1.8938507 0.88096726 -0.3899972 -0.051461186 0.95661414 -0.9052122 -2.28024 -0.8510748 -0.36411715 0.55613613 0.7228644 2.2993028 1.2776306 0.57384104 0.0894299 -1.6698265 -0.76141024 -0.2615726 0.22468947 -0.90536505 0.8760995 -1.0942718 -1.7030396 1.5688227 -0.015965315 3.0786073 2.171664 -1.9684086 -0.71591437 0.84308565 -0.023123678 1.0272353 1.6805394 0.9659495 -0.693886 -0.19121507 0.69738144 -0.81845325 1.4621712 -0.64663017 -0.48601997 -1.8073735 1.9365724 1.0728244 -1.675593 -2.48825 -2.2549763 0.99883205 0.011717423 -0.968143 
2.3676248 1.1581804 -1.1019763 -0.4052338 0.259193 -1.6883879 0.5301176 1.6926105 -0.27319026 -0.227297 -0.4368029 -0.7653405 -1.510421 -0.08304167 0.7327884 -0.72956216 0.6924532 0.12786134 -0.44742504 0.68064266 0.46439362 0.13899907 -0.03652357 -0.78007084 -1.0838907 0.9138987 1.2062742 -0.2172192 0.4082591 1.4207231 2.244966 -1.5546291 0.605055 -1.2843417 -1.4083458 -0.21145965 -0.23477666 -2.2664113 -0.28401485 0.66756046 0.76596564 -0.7703791 1.983706 1.670782 0.95740604 0.98075515 -0.97029024 0.8108358 0.14063218 1.0512149 0.65278953 -0.098545246 -1.505401 1.5409887 0.7402301 2.480546 0.08127695 -0.9955454 1.9439696 -2.7986643 -0.8857946 -1.4971809 -2.7917898 -0.047041085 -0.7761789 0.13264154 1.9617302 -1.6107641 -1.6877475 1.0520048 0.12970483 -0.24276163 -0.41984153 -0.15497293 0.9001811 -0.3241018 -1.3436453 0.5630198 -0.3871916 0.71962655 0.23657644 -0.84953403 -0.42115983 1.8571 0.9960627 -0.78075564 -1.2766498 0.030295808 -1.1820985 -0.26983517 -1.1656342 2.0044823 0.17986952 -0.1600661 -0.20971248 0.38795295 1.4175266 0.21813197 -0.25860766 -1.3707263 2.8051105 1.9417973 -1.3433391 -0.5040571 0.2990665 -1.5229963 -0.37078992 0.70896035 -0.019158987 0.23528913 1.5369213 1.0056368 -1.689032 -0.5990417 1.3212885 -0.85864425 -0.46737942 -0.8508794 -0.39553878 -0.35939643 2.2599432\n", "করেন 0.19318587 -0.9337364 0.5890889 1.1284809 -0.85182375 -1.8623679 -1.0837123 0.8337486 -1.4050269 2.0234854 0.36548066 0.61430967 -2.1424985 0.93722534 -1.5154264 -0.38403997 0.61476415 0.43370208 4.0862966 0.13589169 -1.4761454 0.43438548 -1.9887968 -1.0147327 1.4183593 -3.1219664 -0.8659663 -1.6563139 -1.2397468 -1.0541365 0.16262296 0.38919505 3.6184697 0.5404631 0.45725846 -0.3386037 0.83069044 1.3954304 2.4237938 2.018734 -0.30144882 1.15298 0.26642394 0.88249487 0.2900979 0.3875404 -1.026831 -3.002415 0.11618454 -0.6558714 3.1870656 3.3008626 -0.15362315 -2.0912647 2.6626966 2.197552 0.48283696 -0.026623448 -4.7704725 2.4603713 -0.71151155 -0.12804835 1.4783677 
3.0028138 -1.366305 2.4201086 2.1641617 -0.50529355 1.9294802 2.5331008 -0.33527365 -0.54821765 -0.66211706 2.3656108 1.3793545 -1.7108806 -1.4458313 -0.48866358 0.75904554 -1.0844674 2.1751697 -0.2692254 -1.6566877 -1.1986219 1.2887886 -0.21022478 2.4642634 1.1712825 -1.7327492 -0.3192753 -0.80545455 -0.08996122 0.98601866 1.9941684 0.4085885 0.17396548 1.135127 0.38655835 0.7577267 -2.6354032 -1.3291075 0.8205332 -0.8832689 1.5700159 -0.14703543 2.9848032 -1.8170913 0.25200543 1.4294518 -0.7820231 0.10771451 1.7591317 -1.1159538 0.89680964 1.0042003 0.785278 1.7250232 0.19296032 1.5285679 2.7183332 0.78775007 -0.92460287 1.057212 0.3105081 0.2049064 2.5118272 1.5450809 -2.0605302 -3.5001185 -0.68298286 1.9003531 1.4539047 -0.8866524 1.5103813 -3.7033958 0.41399 -0.9688876 2.19605 0.39015073 0.92533594 3.3174067 1.1528869 2.2846165 0.89071685 -0.7841825 1.0828366 1.044191 0.32648712 -1.9048945 0.37079397 -0.01011078 -0.7614668 -0.5495317 1.1998264 1.8840824 1.5695889 3.3184748 0.4852694 1.0154525 -0.021934023 4.4868765 1.6185771 -0.3153896 -0.36514628 -1.9076306 1.3816717 -2.1277153 -1.9272645 -1.3931488 -0.57421815 -1.1602892 -0.20142998 1.9718063 1.2130727 -2.745902 -3.8931425 -0.1855184 -2.4099195 -0.8878476 2.6704414 0.5670274 0.38503766 0.35302347 4.551226 -2.0581865 0.9459339 0.337567 0.21248995 -2.9210923 -0.36402667 0.63711524 -0.952968 1.1995943 1.3776119 -3.9117785 0.34826854 2.3526456 0.92054784 2.5165944 -0.98528296 -1.0495394 0.52338076 -0.78099805 -0.33064157 0.33150956 1.875219 1.2452867 -1.6603845 2.65914 3.361523 -0.9170144 -1.004725 -0.31350675 2.7215428 -1.635537 1.1541646 1.7256192 0.12370112 0.48722214 -1.6641299 -2.1762908 4.199462 -0.2883563 -0.16597326 -0.553747 -2.022318 1.4750097 1.1440145 0.46725643 1.014653 3.361021 -3.4376462 1.2877408 -0.5148107 -1.4415078 -0.7464355 -1.5502417 1.7877142 -0.5613597 2.2749197 -0.9590551 -0.6896642 -0.6645099 0.541241 0.5105197 0.9254783 -0.7677737 -0.45314562 0.72046924 -1.971568 -2.613329 -0.49380887 
-0.83681905 0.29184273 0.7707044 -0.73559433 2.3954475 -0.07767087 -0.30188093 -2.1767392 2.1759086 -2.0669603 -0.019899957 -0.19439699 0.54339635 0.1168371 -0.6867408 0.21940033 0.90612113 -1.8674718 -0.22612645 -1.3948021 -0.25322914 -2.3842711 -2.5536478 -0.26761347 2.2381825 1.4699225 -0.17046195 0.07034144 1.3627189 0.8321233 1.3198925 2.1762233 1.9735737 0.19121808 -0.40528092 1.0960783 -3.6836426 0.32600078 -2.6550748 -1.9942055 -2.7730727 0.3876345 -0.06702569 -1.1657287 2.536466 -2.1727517 2.0637708 0.57837164\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "AT0UqnHW-MlC", "colab_type": "text" }, "source": [ "## Training files for sentiment analysis\n", "\n", "There are two files, one negative and one positive: bangla-sentiment.neg holds only negative sentences and bangla-sentiment.pos holds only positive ones. The labeling was done by people like us. I am still looking for a better dataset. For now, this dataset was kindly provided by Socian, and in particular by Tareq Al Muntasir. We have no control over what people write on social media, so we will not worry about what the sentences in this dataset say; we treat it purely as research material." ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "bYv6LqlEChO1", "outputId": "680d99be-d609-4cd6-f982-fcd7fb58f3d7", "colab": { "base_uri": "https://localhost:8080/", "height": 585 } }, "source": [ "!wget https://github.com/raqueeb/datasets/raw/master/bangla-sentiment.pos\n", "!wget https://github.com/raqueeb/datasets/raw/master/bangla-sentiment.neg\n" ], "execution_count": 10, "outputs": [ { "output_type": "stream", "text": [ "--2019-11-23 03:21:23-- https://github.com/raqueeb/datasets/raw/master/bangla-sentiment.pos\n", "Resolving github.com (github.com)... 192.30.255.112\n", "Connecting to github.com (github.com)|192.30.255.112|:443... connected.\n", "HTTP request sent, awaiting response... 
302 Found\n", "Location: https://raw.githubusercontent.com/raqueeb/datasets/master/bangla-sentiment.pos [following]\n", "--2019-11-23 03:21:23-- https://raw.githubusercontent.com/raqueeb/datasets/master/bangla-sentiment.pos\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 220062 (215K) [text/plain]\n", "Saving to: ‘bangla-sentiment.pos’\n", "\n", "\rbangla-sentiment.po 0%[ ] 0 --.-KB/s \rbangla-sentiment.po 100%[===================>] 214.90K --.-KB/s in 0.03s \n", "\n", "2019-11-23 03:21:23 (7.06 MB/s) - ‘bangla-sentiment.pos’ saved [220062/220062]\n", "\n", "--2019-11-23 03:21:38-- https://github.com/raqueeb/datasets/raw/master/bangla-sentiment.neg\n", "Resolving github.com (github.com)... 192.30.255.112\n", "Connecting to github.com (github.com)|192.30.255.112|:443... connected.\n", "HTTP request sent, awaiting response... 302 Found\n", "Location: https://raw.githubusercontent.com/raqueeb/datasets/master/bangla-sentiment.neg [following]\n", "--2019-11-23 03:21:38-- https://raw.githubusercontent.com/raqueeb/datasets/master/bangla-sentiment.neg\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 363162 (355K) [text/plain]\n", "Saving to: ‘bangla-sentiment.neg’\n", "\n", "bangla-sentiment.ne 100%[===================>] 354.65K --.-KB/s in 0.03s \n", "\n", "2019-11-23 03:21:38 (12.3 MB/s) - ‘bangla-sentiment.neg’ saved [363162/363162]\n", "\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "FE_7dSquwfvm", "colab_type": "code", "outputId": "4d10fefb-89d9-4f27-af62-8844fd341d8b", "colab": { "base_uri": "https://localhost:8080/", "height": 105 } }, "source": [ "# What is in the first 5 lines of this file?\n", "\n", "!head -5 bangla-sentiment.pos" ], "execution_count": 11, "outputs": [ { "output_type": "stream", "text": [ "বাংলাদেশের সবাই শান্তিতে আছে থাকবে\n", "ভারতে সব বাংলাদেশী বৈধ ভাবে থাকে\n", "ওদের দেশে সোনার অভাব নাই\n", "গ্রামীণফোন এর মত সু্বিধা পাই নি সবসময় অাছি গ্রামীণফোন এর সাথে ভালবাসি গ্রামীণফোন কে\n", "গ্রামীণফোন থেকে বিভিন্ন সময় বিভিন্ন অফার দেয়া হয়ে থাকে\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "B1tqdoSXw0_0", "colab_type": "code", "outputId": "8f39332d-388c-4e0a-dbff-7e480d321029", "colab": { "base_uri": "https://localhost:8080/", "height": 105 } }, "source": [ "# And in the other file?\n", "\n", "!head -5 bangla-sentiment.neg" ], "execution_count": 12, "outputs": [ { "output_type": "stream", "text": [ "আর দোষিরা কোনদিন বিচার পাবে না\n", "সীমের এমন জটিল সমস্যায় রীতিমত হয়রানির শিকার\n", "নেটওয়ার্ক ভাল না আপনাদের\n", "আমার তো এখন নেটওয়ার্ক খুবই কম....\n", "কোন বিদ্বেষের কারনেই কাউকে খুন করার লাইসেন্স দেওয়া হয়না\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "0SA-_IImNerZ", "colab_type": "code", "outputId": "7cfc6079-85da-4bbc-acf5-4fa1e6d45df6", "colab": { "base_uri": "https://localhost:8080/", "height": 194 } }, "source": [ "# What have we downloaded so far?\n", "\n", "!ls -al" ], "execution_count": 13, "outputs": [ { "output_type": "stream", "text": [ "total 3395692\n", "drwxr-xr-x 1 root root 4096 Nov 23 03:21 .\n", "drwxr-xr-x 1 root root 4096 Nov 23 03:06 
..\n", "-rw-r--r-- 1 root root 363162 Nov 23 03:21 bangla-sentiment.neg\n", "-rw-r--r-- 1 root root 220062 Nov 23 03:21 bangla-sentiment.pos\n", "-rw-r--r-- 1 root root 2496996336 Nov 14 09:58 bn-wiki-word2vec-300.txt\n", "-rw-r--r-- 1 root root 492830720 Nov 23 03:20 bn-wiki-word2vec-300.txt.tgz.aa\n", "-rw-r--r-- 1 root root 486748692 Nov 23 03:20 bn-wiki-word2vec-300.txt.tgz.ab\n", "drwxr-xr-x 1 root root 4096 Nov 21 16:30 .config\n", "drwxr-xr-x 1 root root 4096 Nov 21 16:30 sample_data\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "QIviAyFUJkRi", "colab_type": "text" }, "source": [ "## How many sentences are in the two files?\n", "\n", "We can count them in two ways. The first uses plain Python file operations." ] }, { "cell_type": "code", "metadata": { "id": "m3su_fMhBIKR", "colab_type": "code", "colab": {} }, "source": [ "preprocessed_text_file_path = 'bangla-sentiment.pos'" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "8SEVl_jQBBEd", "colab_type": "code", "colab": {} }, "source": [ "lines_from_file = []\n", "with open(preprocessed_text_file_path, encoding='utf8') as text_file:\n", "    for line in text_file:\n", "        lines_from_file.append(line)" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "QUQpFBqFCALn", "colab_type": "code", "outputId": "bf84017e-5406-4862-ec99-56f59abe707a", "colab": { "base_uri": "https://localhost:8080/", "height": 34 } }, "source": [ "# Number of lines in the positive file\n", "\n", "len(lines_from_file)" ], "execution_count": 16, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "2039" ] }, "metadata": { "tags": [] }, "execution_count": 16 } ] }, { "cell_type": "code", "metadata": { "id": "8bWPdVWfBUzD", "colab_type": "code", "colab": {} }, "source": [ "preprocessed_text_file_path = 'bangla-sentiment.neg'" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "yBKXVx4gBU9B", "colab_type": "code", "colab": {} }, "source": [ 
"lines_from_file = []\n", "with open(preprocessed_text_file_path, encoding='utf8') as text_file:\n", "    for line in text_file:\n", "        lines_from_file.append(line)" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "bLm0slDbCB_h", "colab_type": "code", "outputId": "a39da83a-9b74-4df7-9399-c8f5b5ec2708", "colab": { "base_uri": "https://localhost:8080/", "height": 34 } }, "source": [ "# Number of lines in the negative file\n", "\n", "len(lines_from_file)" ], "execution_count": 19, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "2520" ] }, "metadata": { "tags": [] }, "execution_count": 19 } ] }, { "cell_type": "code", "metadata": { "id": "vSpOZEzgwf5E", "colab_type": "code", "colab": {} }, "source": [ "# Bring everything together in one list of (sentence, label) pairs\n", "\n", "all_sentences = []\n", "with open('bangla-sentiment.pos', encoding='utf8') as f:\n", "    all_sentences.extend([(line.strip(), 'positive') for line in f])\n", "\n", "with open('bangla-sentiment.neg', encoding='utf8') as f:\n", "    all_sentences.extend([(line.strip(), 'negative') for line in f])" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "ifzqGZ88LD_K", "colab_type": "code", "outputId": "3e5c14b5-2707-4ecd-eb4b-c0cc47b9b2f9", "colab": { "base_uri": "https://localhost:8080/", "height": 123 } }, "source": [ "# First five entries of all_sentences\n", "\n", "all_sentences[:5]" ], "execution_count": 21, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('বাংলাদেশের সবাই শান্তিতে আছে থাকবে', 'positive'),\n", " ('ভারতে সব বাংলাদেশী বৈধ ভাবে থাকে', 'positive'),\n", " ('ওদের দেশে সোনার অভাব নাই', 'positive'),\n", " ('গ্রামীণফোন এর মত সু্বিধা পাই নি সবসময় অাছি গ্রামীণফোন এর সাথে ভালবাসি গ্রামীণফোন কে',\n", " 'positive'),\n", " ('গ্রামীণফোন থেকে বিভিন্ন সময় বিভিন্ন অফার দেয়া হয়ে থাকে', 'positive')]" ] }, "metadata": { "tags": [] }, "execution_count": 21 } ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": 
"4BNXFrkotAYu", "outputId": "7e5f0094-d216-4c78-b1de-94578b31e9f3", "colab": { "base_uri": "https://localhost:8080/", "height": 52 } }, "source": [ "# How many positives are there, and at which index do the negatives start?\n", "\n", "pos_count = 0\n", "neg_count = 0\n", "for sentence, label in all_sentences:\n", "    if label == 'positive':\n", "        pos_count += 1\n", "    else:\n", "        neg_count += 1\n", "print(pos_count)\n", "print(neg_count)" ], "execution_count": 22, "outputs": [ { "output_type": "stream", "text": [ "2039\n", "2520\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "tpJczOxSLjgg", "colab_type": "code", "outputId": "c6e185a3-507b-4bc0-9fdd-f73eab1715dc", "colab": { "base_uri": "https://localhost:8080/", "height": 105 } }, "source": [ "# A look at five negative sentences. The negatives start at index 2039 (= pos_count), so this slice skips the first one.\n", "\n", "all_sentences[2040:2045]" ], "execution_count": 23, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('সীমের এমন জটিল সমস্যায় রীতিমত হয়রানির শিকার', 'negative'),\n", " ('নেটওয়ার্ক ভাল না আপনাদের', 'negative'),\n", " ('আমার তো এখন নেটওয়ার্ক খুবই কম....', 'negative'),\n", " ('কোন বিদ্বেষের কারনেই কাউকে খুন করার লাইসেন্স দেওয়া হয়না', 'negative'),\n", " ('জিপি নেট চলছে না কেন', 'negative')]" ] }, "metadata": { "tags": [] }, "execution_count": 23 } ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "js75OARBF_B8" }, "source": [ "## Downloading the pre-trained embedding exporter script" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "-uAicYA6vLsf" }, "source": [ "A major advantage of TensorFlow is that learning can be transferred between models. A module (see below) is a piece of a model's graph saved to disk that can be exported and reused elsewhere.\n", "\n", "As mentioned at the start, we will use a pre-trained embedding exporter script from TensorFlow Hub to turn the word embeddings into a text embedding module, and then pass that module on to train the classifier. Another example that uses fastText 
(text classification rather than sentiment analysis) is available at the link below. If we wanted to use fastText instead, it would be enough to swap in the fastText vector file.\n", "\n", "The exporter script lives at https://github.com/tensorflow/hub/tree/master/examples/text_embeddings_v2; let us download it into the same directory. \n", "\n", "What does a SavedModel contain? The data TensorFlow needs, together with the model's weights and graph, so that the model can be rebuilt. We will get the word embeddings from this SavedModel. TensorFlow Hub's job is to load the SavedModel as a module, which we then wrap in [hub.KerasLayer](https://www.tensorflow.org/hub/api_docs/python/hub/KerasLayer). This Keras layer works well inside a Sequential model.\n" ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "5DY5Ze6pO1G5", "outputId": "e1818c8b-37bd-4909-e8bf-166322df8fc1", "colab": { "base_uri": "https://localhost:8080/", "height": 212 } }, "source": [ "!wget https://raw.githubusercontent.com/tensorflow/hub/master/examples/text_embeddings_v2/export_v2.py\n", "# !wget https://raw.githubusercontent.com/tensorflow/hub/master/examples/text_embeddings/export.py\n" ], "execution_count": 24, "outputs": [ { "output_type": "stream", "text": [ "--2019-11-23 03:22:39-- https://raw.githubusercontent.com/tensorflow/hub/master/examples/text_embeddings_v2/export_v2.py\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 7603 (7.4K) [text/plain]\n", "Saving to: ‘export_v2.py’\n", "\n", "\rexport_v2.py 0%[ ] 0 --.-KB/s \rexport_v2.py 100%[===================>] 7.42K --.-KB/s in 0s \n", "\n", "2019-11-23 03:22:39 (170 MB/s) - ‘export_v2.py’ saved [7603/7603]\n", "\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "PAzdNZaHmdl1" }, "source": [ "When passing the embedding file to the exporter, we can skip the header at the top of a word2vec or fastText text file; the --num_lines_to_ignore=1 flag below does that. The embedding file itself is very large, which is an extra burden on a local machine or on Google Colab." ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "Tkv5acr_Q9UU", "outputId": "d7bef653-415c-46f1-902e-3ea773b61913", "colab": { "base_uri": "https://localhost:8080/", "height": 1000 } }, "source": [ "!python export_v2.py --embedding_file=/content/bn-wiki-word2vec-300.txt --export_path=text_embedding --num_lines_to_ignore=1 \n", "# !python export.py --embedding_file=/content/bn-wiki-word2vec-300.txt --export_path=text_embedding --num_lines_to_ignore=1 --preprocess_text=True" ], "execution_count": 25, "outputs": [ { "output_type": "stream", "text": [ "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_qint8 = np.dtype([(\"qint8\", np.int8, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_quint8 = np.dtype([(\"quint8\", np.uint8, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / 
'(1,)type'.\n", " _np_qint16 = np.dtype([(\"qint16\", np.int16, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_quint16 = np.dtype([(\"quint16\", np.uint16, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_qint32 = np.dtype([(\"qint32\", np.int32, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " np_resource = np.dtype([(\"resource\", np.ubyte, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_qint8 = np.dtype([(\"qint8\", np.int8, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_quint8 = np.dtype([(\"quint8\", np.uint8, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_qint16 = np.dtype([(\"qint16\", np.int16, 1)])\n", 
"/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_quint16 = np.dtype([(\"quint16\", np.uint16, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_qint32 = np.dtype([(\"qint32\", np.int32, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " np_resource = np.dtype([(\"resource\", np.ubyte, 1)])\n", "tcmalloc: large alloc 1607057408 bytes == 0x8a454000 @ 0x7efd4ae4a1e7 0x7efd473e9f71 0x7efd4744d55d 0x7efd47450e28 0x7efd474513e5 0x7efd474e7fc2 0x50abc5 0x50c549 0x509ce8 0x50aa1d 0x50c549 0x5081d5 0x509647 0x5951c1 0x54a11f 0x551761 0x5aa69c 0x50ab53 0x50c549 0x509ce8 0x50aa1d 0x50c549 0x509ce8 0x50aa1d 0x50c549 0x509ce8 0x50aa1d 0x50c549 0x5081d5 0x50a020 0x50aa1d\n", "2019-11-23 03:26:14.587876: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1\n", "2019-11-23 03:26:14.591650: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2019-11-23 03:26:14.592212: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: \n", "name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285\n", "pciBusID: 0000:00:04.0\n", "2019-11-23 03:26:14.595051: I 
tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0\n", "2019-11-23 03:26:14.605363: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0\n", "2019-11-23 03:26:14.609008: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0\n", "2019-11-23 03:26:14.617034: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0\n", "2019-11-23 03:26:14.626157: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0\n", "2019-11-23 03:26:14.633868: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0\n", "2019-11-23 03:26:14.647332: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7\n", "2019-11-23 03:26:14.647447: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2019-11-23 03:26:14.647989: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2019-11-23 03:26:14.648455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0\n", "2019-11-23 03:26:14.648908: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA\n", "2019-11-23 03:26:14.748087: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node 
zero\n", "2019-11-23 03:26:14.748796: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2d87f80 executing computations on platform CUDA. Devices:\n", "2019-11-23 03:26:14.748823: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Tesla P100-PCIE-16GB, Compute Capability 6.0\n", "2019-11-23 03:26:14.750680: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200000000 Hz\n", "2019-11-23 03:26:14.751011: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2d899c0 executing computations on platform Host. Devices:\n", "2019-11-23 03:26:14.751044: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): , \n", "2019-11-23 03:26:14.751220: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2019-11-23 03:26:14.751817: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: \n", "name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285\n", "pciBusID: 0000:00:04.0\n", "2019-11-23 03:26:14.751872: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0\n", "2019-11-23 03:26:14.751888: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0\n", "2019-11-23 03:26:14.751901: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0\n", "2019-11-23 03:26:14.751914: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0\n", "2019-11-23 03:26:14.751927: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0\n", "2019-11-23 03:26:14.751939: I 
tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0\n", "2019-11-23 03:26:14.751954: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7\n", "2019-11-23 03:26:14.752006: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2019-11-23 03:26:14.752572: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2019-11-23 03:26:14.753095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0\n", "2019-11-23 03:26:14.753156: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0\n", "2019-11-23 03:26:14.754156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:\n", "2019-11-23 03:26:14.754179: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 \n", "2019-11-23 03:26:14.754188: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N \n", "2019-11-23 03:26:14.754270: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2019-11-23 03:26:14.754796: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2019-11-23 03:26:14.755251: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. 
Original config value was 0.\n", "2019-11-23 03:26:14.755282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13922 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0)\n", "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/lookup_ops.py:1159: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.where in 2.0, which has the same broadcast rule as np.where\n", "W1123 03:26:16.215225 139626352727936 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/lookup_ops.py:1159: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.where in 2.0, which has the same broadcast rule as np.where\n", "2019-11-23 03:26:16.826141: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 1607054400 exceeds 10% of system memory.\n", "tcmalloc: large alloc 1607057408 bytes == 0xa334000 @ 0x7efd4ae2cb6b 0x7efd4ae4c379 0x7efd05ecd6e4 0x7efd05cf6d2a 0x7efd05bba011 0x7efd05bccf68 0x7efd0bd46ce3 0x7efd0bd3caf8 0x7efd09731557 0x7efd096a9481 0x7efd096ab7fd 0x50abc5 0x50d320 0x5081d5 0x50a020 0x50aa1d 0x50c549 0x5081d5 0x50a020 0x50aa1d 0x50c549 0x5081d5 0x5895e1 0x5a04ce 0x50d8f5 0x5081d5 0x50a020 0x50aa1d 0x50c549 0x5081d5 0x50a020\n", "INFO:tensorflow:Assets written to: text_embedding/assets\n", "I1123 03:26:23.348451 139626352727936 builder_impl.py:770] Assets written to: text_embedding/assets\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "pDGGOY0EQ3XQ", "colab_type": "code", "outputId": "942852c4-4691-4c16-db14-5554baeceb72", "colab": { "base_uri": "https://localhost:8080/", "height": 658 } }, "source": [ "# সেভড 
মডেলের একটা কমান্ড লাইন ইন্টারফেস আছে দেখার জন্য, এখনো কিছু আসেনি এখানে\n", "\n", "!saved_model_cli show --dir text_embedding --all" ], "execution_count": 26, "outputs": [ { "output_type": "stream", "text": [ "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_qint8 = np.dtype([(\"qint8\", np.int8, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_quint8 = np.dtype([(\"quint8\", np.uint8, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_qint16 = np.dtype([(\"qint16\", np.int16, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_quint16 = np.dtype([(\"quint16\", np.uint16, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_qint32 = np.dtype([(\"qint32\", np.int32, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " 
np_resource = np.dtype([(\"resource\", np.ubyte, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_qint8 = np.dtype([(\"qint8\", np.int8, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_quint8 = np.dtype([(\"quint8\", np.uint8, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_qint16 = np.dtype([(\"qint16\", np.int16, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_quint16 = np.dtype([(\"quint16\", np.uint16, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " _np_qint32 = np.dtype([(\"qint32\", np.int32, 1)])\n", "/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", " np_resource = np.dtype([(\"resource\", np.ubyte, 1)])\n", "\n", "MetaGraphDef with tag-set: 'serve' contains the 
following SignatureDefs:\n", "\n", "signature_def['__saved_model_init_op']:\n", " The given SavedModel SignatureDef contains the following input(s):\n", " The given SavedModel SignatureDef contains the following output(s):\n", " outputs['__saved_model_init_op'] tensor_info:\n", " dtype: DT_INVALID\n", " shape: unknown_rank\n", " name: NoOp\n", " Method name is: \n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "iiRXv5VexLWN", "colab_type": "text" }, "source": [ "We load the module with hub.KerasLayer. With trainable=False the embedding weights stay frozen and are not updated during training; with trainable=True they are fine-tuned together with the rest of the model. The cell below uses trainable=True and keeps the frozen variant commented out, so both can be tested. " ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "k9WEpmedF_3_", "colab": {} }, "source": [ "# Remember transfer learning? trainable=False freezes the module,\n", "# trainable=True (used here) fine-tunes the embedding weights.\n", "# Main hub.KerasLayer arguments:\n", "#   handle: path or URL of the SavedModel to load\n", "#   trainable: False by default\n", "\n", "embedding_path = \"text_embedding\"\n", "embedding_layer = hub.KerasLayer(embedding_path, trainable=True)\n", "# embedding_layer = hub.KerasLayer(embedding_path, trainable=False)" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "p6ZuhfQWPFzv", "colab_type": "code", "outputId": "caaca1cc-eed4-4369-c78c-4e2c7eabaecc", "colab": { "base_uri": "https://localhost:8080/", "height": 34 } }, "source": [ "print(embedding_layer)" ], "execution_count": 28, "outputs": [ { "output_type": "stream", "text": [ "\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "fQHbmS_D4YIo" }, "source": [ "## Sending Bengali words through embedding_layer\n", "\n", "How do we send Bengali words to this new module? 
See the example below. The module splits each sentence into words on spaces. A batch containing one sentence comes back with shape [1, 300]: a single 300-dimensional vector for the whole sentence. There is a point to consider here about sentence embeddings versus word embeddings; it comes up in the example below.\n", "\n", "```\n", "tf.nn.embedding_lookup_sparse(\n", " params,\n", " sp_ids,\n", " sp_weights,\n", " combiner=None,\n", " max_norm=None,\n", " name=None\n", ")\n", "```\n", "This means embedding_layer takes Bengali words as input and returns their embeddings correctly. " ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "Z1MBnaBUihWn", "outputId": "8dc0cff7-fc2d-4b44-e65d-0780f24d57fb", "colab": { "base_uri": "https://localhost:8080/", "height": 34 } }, "source": [ "embedding_layer(['ভালো আছি']).shape" ], "execution_count": 29, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "TensorShape([1, 300])" ] }, "metadata": { "tags": [] }, "execution_count": 29 } ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "4KY8LiFOHmcd" }, "source": [ "# Building a dataset for TensorFlow \n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "pNguCDNe6bvz" }, "source": [ "Recall that all_sentences holds the positive-sentiment sentences first and the negative ones last. Training on the data in that order would be unbalanced, so we shuffle it randomly first. We will work with a generator. " ] }, { "cell_type": "markdown", "metadata": { "id": "gdiu6WojE_QI", "colab_type": "text" }, "source": [ "To build a dataset from a generator, we first write a generator function that turns each sentence and its corresponding label into one complete example (data + label). We then pass it to tf.data.Dataset.from_generator and state what output types we want. Let's look at a generator example: we define a generator and pass it to from_generator. \n", "\n", "\n", "```\n", "@staticmethod\n", "from_generator(\n", " generator,\n", " output_types,\n", " 
output_shapes=None,\n", " args=None\n", ")\n", "```\n", "How is it used?\n", "\n", "```\n", "import itertools\n", "tf.compat.v1.enable_eager_execution()\n", "\n", "def gen():\n", "  for i in itertools.count(1):\n", "    yield (i, [1] * i)\n", "\n", "ds = tf.data.Dataset.from_generator(\n", "    gen, (tf.int64, tf.int64), (tf.TensorShape([]), tf.TensorShape([None])))\n", "\n", "for i, ones in ds.take(2):\n", "  print(i.numpy(), ones.numpy())\n", "# 1 [1]\n", "# 2 [1 1]\n", "```\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "eZRGTzEhUi7Q", "colab": {} }, "source": [ "import random\n", "\n", "# Shuffle once, outside the generator. from_generator calls generator() again on\n", "# every pass over the data, so reshuffling inside it would let validation\n", "# sentences leak into the training split.\n", "random.shuffle(all_sentences)\n", "\n", "def generator():\n", "  for sentence, label in all_sentences:\n", "    # one-hot label: positive -> [0., 1.], negative -> [1., 0.]\n", "    if label == 'positive':\n", "      label = tf.keras.utils.to_categorical(1, num_classes=2)\n", "    else:\n", "      label = tf.keras.utils.to_categorical(0, num_classes=2)\n", "    sentence_tensor = tf.constant(sentence, dtype=tf.dtypes.string)\n", "    yield sentence_tensor, label" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "A7zmgGSiIte0", "colab_type": "text" }, "source": [ "Each example here is a tuple: a sentence with dtype=tf.dtypes.string and a one-hot encoded label. We need a training set and a validation set out of this dataset. How can that be done?\n", "```\n", "train_data = data.take(train_size)\n", "validation_data = data.skip(train_size)\n", "```\n", "\n" ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "2g4nRflB7fbF", "colab": {} }, "source": [ "def make_dataset(train_size):\n", "  data = tf.data.Dataset.from_generator(generator=generator,\n", "                                        output_types=(tf.string, tf.float32))\n", "  # the first train_size examples are for training, the rest for validation\n", "  train_data = data.take(train_size)\n", "  validation_data = data.skip(train_size)\n", "  return train_data, validation_data" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "8PuuN6el8tv9", "outputId": "4fbca3fc-3027-4896-9138-7befecf22d17", "colab": { "base_uri": "https://localhost:8080/", "height": 498 } }, "source": [ "# Use the first 4000 examples for training and the rest for validation\n", "\n", "train_data, validation_data = make_dataset(4000)" ], "execution_count": 32, "outputs": [ { "output_type": "stream", "text": [ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py:505: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "tf.py_func is deprecated in TF V2. Instead, there are two\n", " options available in V2.\n", " - tf.py_function takes a python function which manipulates tf eager\n", " tensors instead of numpy arrays. It's easy to convert a tf eager tensor to\n", " an ndarray (just call tensor.numpy()) but having access to eager tensors\n", " means `tf.py_function`s can use accelerators such as GPUs as well as\n", " being differentiable using a gradient tape.\n", " - tf.numpy_function maintains the semantics of the deprecated tf.py_func\n", " (it is not differentiable, and manipulates numpy arrays). It drops the\n", " stateful argument making all functions stateful.\n", " \n" ], "name": "stdout" }, { "output_type": "stream", "text": [ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py:505: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "tf.py_func is deprecated in TF V2. Instead, there are two\n", " options available in V2.\n", " - tf.py_function takes a python function which manipulates tf eager\n", " tensors instead of numpy arrays. 
It's easy to convert a tf eager tensor to\n", " an ndarray (just call tensor.numpy()) but having access to eager tensors\n", " means `tf.py_function`s can use accelerators such as GPUs as well as\n", " being differentiable using a gradient tape.\n", " - tf.numpy_function maintains the semantics of the deprecated tf.py_func\n", " (it is not differentiable, and manipulates numpy arrays). It drops the\n", " stateful argument making all functions stateful.\n", " \n" ], "name": "stderr" } ] }, { "cell_type": "code", "metadata": { "id": "G0CyNOl1yajF", "colab_type": "code", "outputId": "e278885f-d57e-4ce3-dafc-61fda5a04482", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "source": [ "# Look at one batch holding 2 elements from train_data\n", "# A few more examples like this follow below\n", "\n", "next(iter(train_data.batch(2)))" ], "execution_count": 33, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(, )" ] }, "metadata": { "tags": [] }, "execution_count": 33 } ] }, { "cell_type": "code", "metadata": { "id": "5cRCljsMCDmP", "colab_type": "code", "colab": {} }, "source": [ "sentences_in_a_single_batch, labels_in_a_single_batch = next(iter(train_data.batch(2)))" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "s9yZYRItCNWR", "colab_type": "code", "outputId": "62526a70-783a-4d5b-dddb-27116f729169", "colab": { "base_uri": "https://localhost:8080/", "height": 107 } }, "source": [ "sentences_in_a_single_batch" ], "execution_count": 35, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": { "tags": [] }, "execution_count": 35 } ] }, { "cell_type": "code", "metadata": { "id": "gU2HDStGCbfi", "colab_type": "code", "outputId": "c1196e52-44c6-4f36-d62d-7785494d9195", "colab": { "base_uri": "https://localhost:8080/", "height": 34 } }, "source": [ "sentences_in_a_single_batch.shape" ], "execution_count": 36, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "TensorShape([2])" ] }, "metadata": { "tags": [] }, "execution_count": 36 } ] }, { "cell_type": "code", "metadata": { "id": "igZFiVMqChDo", "colab_type": "code", "outputId": "fdc96a78-6748-49e7-c384-ff2136f1ea39", "colab": { "base_uri": "https://localhost:8080/", "height": 34 } }, "source": [ "labels_in_a_single_batch.shape" ], "execution_count": 37, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "TensorShape([2, 2])" ] }, "metadata": { "tags": [] }, "execution_count": 37 } ] }, { "cell_type": "code", "metadata": { "id": "zFcaGNrIzABb", "colab_type": "code", "colab": {} }, "source": [ "sentence, label = next(iter(train_data.take(1)))" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "eu12m7AS2YVC", "colab_type": "code", "outputId": "9158fa05-3762-4f51-dc99-d98dacccdde1", "colab": { "base_uri": "https://localhost:8080/", "height": 34 } }, "source": [ "# numpy() returns bytes; decode them as UTF-8, otherwise the string is shown as raw bytes\n", "\n", "sentence.numpy().decode('utf8')" ], "execution_count": 39, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'অর্থ থাকলে যে কোন এলাকার সুস্থ ও সুন্দর উন্নয়ন হয়'" ] }, "metadata": { "tags": [] }, "execution_count": 39 } ] }, { "cell_type": "code", "metadata": { "id": "q7KcfmLC2ceA", "colab_type": "code", "outputId": "cc9ff68c-3072-4db7-9e1f-eed4deaa1f88", "colab": { "base_uri": "https://localhost:8080/", "height": 34 } }, "source": [ "# The label after the to_categorical() conversion\n", "label.numpy() " ], "execution_count": 40, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([0., 1.], dtype=float32)" ] }, "metadata": { "tags": [] }, "execution_count": 40 } ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "MrdZI6FqPJNP" }, "source": [ "## Model training " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "jgr7YScGVS58" }, "source": [ "We have built this model before. Here the embedding layer goes in as the very first layer. \n", "```\n", "model.add(embedding_layer)\n", "```\n", "Samples from tf.data are batched and fed to the model.\n", "We will work with LSTM later on." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "WhCqbDK2uUV5" }, "source": [ "## A model with dense layers\n", "\n", "Getting this to work with LSTM is still in progress." ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "nHUw807XPPM9", "colab": {} }, "source": [ "def create_model():\n", "  model = tf.keras.Sequential()\n", "  # the TF Hub text embedding module maps each sentence to one vector\n", "  model.add(embedding_layer)\n", "  # model.add(tf.keras.layers.Flatten())\n", "  # model.add(tf.keras.layers.SpatialDropout1D(0.2))\n", "  # model.add(tf.keras.layers.LSTM(100, dropout=0.2, recurrent_dropout=0.2))\n", "  # model.add(Dense(13, activation='softmax'))\n", "  model.add(tf.keras.layers.Dense(256, activation=\"relu\"))\n", "  model.add(tf.keras.layers.Dense(128, activation=\"relu\"))\n", "  # two output classes: index 0 = negative, index 1 = positive\n", "  model.add(tf.keras.layers.Dense(2, activation=\"softmax\"))\n", "  model.compile(optimizer=\"adam\", loss=\"categorical_crossentropy\", metrics=['acc'])\n", "  return model" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "CUyeg6WhhliR", "colab_type": "code", "colab": {} }, "source": [ "from tensorflow.keras.callbacks import TensorBoard\n", "log_dir = \"logs/fit/\"\n", "tensorboard_callback = TensorBoard(log_dir=log_dir, histogram_freq=1)\n" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "5J4EXJUmPVNG", "colab": {} }, "source": [ "model = create_model()" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ZZ7XJLg2u2No" }, "source": [ "## Training for 10 epochs" ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "OoBkN2tAaXWD", "outputId": "52b7eb38-b734-4382-b1c0-6a54df08580e", "colab": { "base_uri": "https://localhost:8080/", "height": 
498 } }, "source": [ "batch_size = 256\n", "history = model.fit(train_data.batch(batch_size), \n", " validation_data=validation_data.batch(batch_size), \n", " epochs=10,callbacks=[tensorboard_callback])" ], "execution_count": 44, "outputs": [ { "output_type": "stream", "text": [ "Epoch 1/10\n", "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:1250: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.where in 2.0, which has the same broadcast rule as np.where\n" ], "name": "stdout" }, { "output_type": "stream", "text": [ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:1250: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.where in 2.0, which has the same broadcast rule as np.where\n" ], "name": "stderr" }, { "output_type": "stream", "text": [ "16/16 [==============================] - 20s 1s/step - loss: 0.5734 - acc: 0.6204 - val_loss: 0.0000e+00 - val_acc: 0.0000e+00\n", "Epoch 2/10\n", "16/16 [==============================] - 18s 1s/step - loss: 0.3544 - acc: 0.8444 - val_loss: 0.3404 - val_acc: 0.8587\n", "Epoch 3/10\n", "16/16 [==============================] - 17s 1s/step - loss: 0.2773 - acc: 0.8886 - val_loss: 0.2155 - val_acc: 0.9106\n", "Epoch 4/10\n", "16/16 [==============================] - 17s 1s/step - loss: 0.2196 - acc: 0.9184 - val_loss: 0.2215 - val_acc: 0.9177\n", "Epoch 5/10\n", "16/16 [==============================] - 17s 1s/step - loss: 0.1762 - acc: 0.9350 - val_loss: 0.1520 - val_acc: 0.9499\n", "Epoch 6/10\n", "16/16 [==============================] - 17s 1s/step - loss: 0.1290 - acc: 0.9599 - val_loss: 0.0951 - val_acc: 0.9857\n", "Epoch 7/10\n", "16/16 [==============================] - 17s 1s/step - loss: 
0.0924 - acc: 0.9743 - val_loss: 0.0622 - val_acc: 0.9875\n", "Epoch 8/10\n", "16/16 [==============================] - 17s 1s/step - loss: 0.0645 - acc: 0.9823 - val_loss: 0.0331 - val_acc: 0.9982\n", "Epoch 9/10\n", "16/16 [==============================] - 17s 1s/step - loss: 0.0441 - acc: 0.9922 - val_loss: 0.0298 - val_acc: 0.9946\n", "Epoch 10/10\n", "16/16 [==============================] - 17s 1s/step - loss: 0.0302 - acc: 0.9921 - val_loss: 0.0249 - val_acc: 0.9946\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "lrR7SmiTk9t3", "colab_type": "code", "outputId": "58dd1cb6-ba8d-4c01-c0d7-3f7024d774b0", "colab": { "base_uri": "https://localhost:8080/", "height": 301 } }, "source": [ "model.summary()" ], "execution_count": 45, "outputs": [ { "output_type": "stream", "text": [ "Model: \"sequential_1\"\n", "_________________________________________________________________\n", "Layer (type)                 Output Shape              Param #   \n", "=================================================================\n", "keras_layer_1 (KerasLayer)   multiple                  200881800 \n", "_________________________________________________________________\n", "dense_2 (Dense)              multiple                  77056     \n", "_________________________________________________________________\n", "dense_3 (Dense)              multiple                  32896     \n", "_________________________________________________________________\n", "dense_4 (Dense)              multiple                  258       \n", "=================================================================\n", "Total params: 200,992,010\n", "Trainable params: 200,992,010\n", "Non-trainable params: 0\n", "_________________________________________________________________\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "9DeGZFXsJt5g" }, "source": [ "## Saving the model for later use" ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "rIO_CseWJtJP", "outputId": "8e59a254-524b-4d53-eeb7-a69b0fd1cb3b", "colab": { "base_uri": "https://localhost:8080/", "height": 
52 } }, "source": [ "tf.saved_model.save(model, export_dir=\"my_model\")" ], "execution_count": 46, "outputs": [ { "output_type": "stream", "text": [ "INFO:tensorflow:Assets written to: my_model/assets\n" ], "name": "stdout" }, { "output_type": "stream", "text": [ "INFO:tensorflow:Assets written to: my_model/assets\n" ], "name": "stderr" } ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "D54IXLqcG8Cq" }, "source": [ "## Prediction\n", "\n", "Let's see what the predict method returns. Try it with new words: write a sentence of your choice into sents below and run the cell. 1 means positive, 0 means negative." ] }, { "cell_type": "code", "metadata": { "id": "ISbX3GzPoth8", "colab_type": "code", "outputId": "e0d5c26f-a8cf-4a10-f717-0a71237716cc", "colab": { "base_uri": "https://localhost:8080/", "height": 212 } }, "source": [ "sents = ['আমারা খুবি খুশি অফারটির জন্য', 'বই পড়তে অনেক পছন্দ করি', 'আজকের ঘটনা আমাকে মনে কষ্ট দিয়েছে', 'কাজটা খুব খারাপ হয়েছে', \n", "         'আমি দেশকে খুব ভালবাসি', 'এই বইটা বেশ ভালো লাগছে', 'একটা দুর্ঘটনা ঘটে গেল',\n", "         'আজকে একটা অসাধারণ অভিজ্ঞতা হলো', 'আমাদের কাজ করতে বেশ কষ্ট হয়', 'বিদ্যুতের ঘাটতি হলে কারখানার কাজ কমে যায়',\n", "         'ঢাকা-সিলেটসহ আশপাশের সড়কের যানবাহন চলাচল বন্ধ হয়ে যায়',]\n", "prediction = model.predict(np.array(sents))\n", "\n", "# argmax over the two softmax outputs: 1 = positive, 0 = negative\n", "for sentence, pred_sentiment in zip(sents, prediction.argmax(axis=1)):\n", "  print(\"Sentence:{} - predicted: {}\".format(sentence, pred_sentiment))" ], "execution_count": 47, "outputs": [ { "output_type": "stream", "text": [ "Sentence:আমারা খুবি খুশি অফারটির জন্য - predicted: 1\n", "Sentence:বই পড়তে অনেক পছন্দ করি - predicted: 1\n", "Sentence:আজকের ঘটনা আমাকে মনে কষ্ট দিয়েছে - predicted: 0\n", "Sentence:কাজটা খুব খারাপ হয়েছে - predicted: 0\n", "Sentence:আমি দেশকে খুব ভালবাসি - predicted: 1\n", "Sentence:এই বইটা বেশ ভালো লাগছে - predicted: 1\n", "Sentence:একটা দুর্ঘটনা ঘটে গেল - predicted: 0\n", "Sentence:আজকে একটা অসাধারণ অভিজ্ঞতা হলো - predicted: 1\n", "Sentence:আমাদের কাজ করতে বেশ কষ্ট হয় - predicted: 0\n", "Sentence:বিদ্যুতের ঘাটতি হলে কারখানার কাজ কমে যায় - predicted: 0\n", "Sentence:ঢাকা-সিলেটসহ আশপাশের সড়কের যানবাহন চলাচল বন্ধ হয়ে যায় - predicted: 0\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "b07jbrPw-1-_", "colab_type": "text" }, "source": [ "## Launching TensorBoard" ] }, { "cell_type": "code", "metadata": { "id": "CvY_M886_DOK", "colab_type": "code", "colab": {} }, "source": [ "%reload_ext tensorboard\n", "%tensorboard --logdir logs/fit" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "p5e9m3bV6oXK" }, "source": [ "This notebook draws on ideas from many other notebooks; the three below are worth a look. Although they deal with text classification, the underlying work is much the same. You can bookmark them as learning material and adapt them to your own projects. \n", "\n", "1. https://github.com/tensorflow/hub/blob/master/examples/colab/bangla_article_classifier.ipynb\n", "\n", "2. https://github.com/rezacsedu/BengFastText/blob/master/SentimentAnalysis_Multichannel_CNN_LSTM/Multichannel_CNN_Bengali_Sentiment.ipynb\n", "\n", "3. https://github.com/tanvirfahim15/BARD-Bangla-Article-Classifier/\n", "\n" ] } ] }