{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "authorship_tag": "ABX9TyPs7/hAMsmej3X/KNftrZFT",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/cyrus723/my-first-binder/blob/main/embedding_gensim_sentiment1.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# with aid from chat gpt"
      ],
      "metadata": {
        "id": "2DYBnFoCX9g6"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "lp_9n4bohNL4"
      },
      "outputs": [],
      "source": [
        "from IPython.core.interactiveshell import InteractiveShell\n",
        "InteractiveShell.ast_node_interactivity = \"all\""
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "#pip install gensim"
      ],
      "metadata": {
        "id": "yKQqxxJKhQAW"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "from sklearn.model_selection import train_test_split\n",
        "from sklearn.linear_model import LogisticRegression\n",
        "from sklearn.metrics import accuracy_score\n",
        "from gensim.models import Word2Vec\n",
        "import nltk\n",
        "from nltk.corpus import movie_reviews\n",
        "nltk.download('movie_reviews')\n",
        "nltk.download('punkt')"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Gy8Y5Dk3hP78",
        "outputId": "1496ef84-2149-4ee0-cd04-2d03f140ec9c"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "[nltk_data] Downloading package movie_reviews to /root/nltk_data...\n",
            "[nltk_data]   Unzipping corpora/movie_reviews.zip.\n",
            "[nltk_data] Downloading package punkt to /root/nltk_data...\n",
            "[nltk_data]   Unzipping tokenizers/punkt.zip.\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "True"
            ]
          },
          "metadata": {},
          "execution_count": 2
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "2. Prepare the Data\n",
        "Load the movie reviews from NLTK, preprocess them, and prepare training and test datasets."
      ],
      "metadata": {
        "id": "GNrl4BGhmbIM"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Load movie reviews from nltk\n",
        "documents = [(list(movie_reviews.words(fileid)), category)\n",
        "             for category in movie_reviews.categories()\n",
        "             for fileid in movie_reviews.fileids(category)]\n",
        "\n",
        "# Shuffle the documents\n",
        "np.random.shuffle(documents)\n"
      ],
      "metadata": {
        "id": "GvPz4LofhP5j"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "type(documents)\n",
        "len(documents)\n",
        "print(documents[0])\n",
        "type(documents[0])\n",
        "len(documents[0])\n",
        "\n",
        "type(documents[0][0])\n",
        "type(documents[0][1])\n",
        "\n",
        "print(documents[0][0])\n",
        "print(documents[0][1])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "1ukZSvt6j49e",
        "outputId": "aa169c52-b5cb-43c2-f5e3-5471924c67c5"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "list"
            ]
          },
          "metadata": {},
          "execution_count": 28
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "2000"
            ]
          },
          "metadata": {},
          "execution_count": 28
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "(['contact', '(', 'pg', ')', 'there', \"'\", 's', 'a', 'moment', 'late', 'in', 'robert', 'zemeckis', \"'\", 's', 'contact', 'where', 'i', 'was', 'reminded', 'of', 'why', 'i', 'started', 'writing', 'movie', 'reviews', 'in', 'the', 'first', 'place', '.', 'we', 'see', 'a', 'scientist', ',', 'dressed', 'in', 'a', 'silvery', 'space', 'suit', ',', 'walking', 'tentatively', 'across', 'a', 'narrow', 'walkway', 'leading', 'inside', 'a', 'compact', ',', 'spherical', 'space', 'pod', ',', 'unaware', 'of', 'what', 'awaits', 'when', 'the', 'ball', 'literally', 'drops', '.', 'anticipation', ',', 'excitement', ',', 'anxiety', ',', 'fear', '--', 'the', 'audience', 'experiences', 'it', 'all', 'the', 'emotional', 'tension', 'right', 'with', 'the', 'character', ',', 'nervously', ',', 'breathlessly', 'eager', 'to', 'see', 'what', 'lies', 'ahead', '.', 'it', 'is', 'this', 'sense', 'of', 'discovery', ',', 'the', 'anticipation', 'of', 'which', 'and', 'its', 'accompanying', 'exhilaration', ',', 'that', 'makes', 'this', 'adaptation', 'of', 'the', 'carl', 'sagan', 'novel', 'such', 'magical', ',', 'captivating', 'entertainment', '.', 'jodie', 'foster', 'stars', 'as', 'dr', '.', 'ellie', 'arroway', ',', 'a', 'brilliant', 'astronomer', 'who', 'dedicates', 'her', 'entire', 'life', 'to', 'searching', 'outer', 'space', 'for', 'extraterrestrial', 'radio', 'signals', '.', 'and', 'i', 'mean', 'life', '--', 'after', 'losing', 'her', 'entire', 'family', 'when', 'she', 'was', 'young', ',', 'the', 'only', 'thing', 'occupying', 'ellie', \"'\", 's', 'world', 'is', 'this', 'quest', 'to', 'discover', 'life', 'beyond', 'this', 'earth', '.', 'after', 'dealing', 'with', 'much', 'skepticism', 'on', 'the', 'part', 'of', 'government', 'officials', 'and', 'wealthy', 'financiers', ',', 'ellie', 'receives', 'her', 'vindication', 'when', 'she', 'stumbles', 'upon', 'an', 'incoming', 'radio', 'transmission', 'from', 'the', 'distant', 'star', 'vega', ',', 'which', 'includes', 'instructions', 'on', 'building', 'an', 'interstellar', 'transport', '.', 'from', 'this', 'synopsis', ',', 'contact', 'does', 'not', 'sound', 'too', 'different', 'to', 'most', 'films', 'about', 'alien', 'contact', ',', 'but', 'there', 'is', 'a', 'whole', 'lot', 'more', 'to', 'this', 'intelligent', 'film', 'than', 'the', 'sci', '-', 'fi', 'hook', '.', 'the', 'alien', 'contact', 'angle', 'generates', 'a', 'great', 'amount', 'of', 'suspense', 'and', 'awe', ',', 'but', 'perhaps', 'more', 'than', 'anything', 'else', ',', 'contact', 'is', 'a', 'character', 'study', 'of', 'ellie', ',', 'whose', 'obsession', 'with', 'empirical', ',', 'scientific', 'evidence', 'has', 'erased', 'all', 'belief', 'in', 'a', 'higher', 'power', '.', 'the', 'irony', 'is', 'that', ',', 'while', 'admitting', 'to', 'having', 'no', 'religious', 'faith', ',', 'she', 'holds', 'onto', 'her', 'belief', 'in', 'extraterrestrial', 'life', 'with', 'such', 'passion', 'and', 'conviction', 'that', 'it', 'becomes', ',', 'in', 'a', 'sense', ',', 'a', 'religion', 'in', 'its', 'own', 'right', '.', 'it', 'would', 'be', 'easy', 'for', 'scripters', 'james', 'v', '.', 'hart', 'and', 'michael', 'goldenberg', ',', 'in', 'trying', 'to', 'paint', 'a', 'positive', 'image', 'of', 'the', 'heroine', ',', 'to', 'champion', 'her', 'scientific', 'beliefs', 'over', 'religious', 'ones', ',', 'but', 'they', 'wisely', 'eschew', 'easy', 'answers', ',', 'giving', 'equal', 'time', 'to', 'both', 'sides', ',', 'and', 'in', 'so', 'doing', 'depict', 'ellie', 'as', 'not', 'completely', 'sane', '.', 'in', 'the', 'end', ',', 'there', 'is', 'no', 'right', 'or', 'wrong', ',', 'nor', 'is', 'there', 'one', 'side', 'that', 'comes', 'off', 'more', 'positive', 'in', 'the', 'other', ',', 'even', 'slightly', 'so', '--', 'there', 'are', 'just', 'two', 'very', 'viable', 'points', 'of', 'view', ',', 'each', 'with', 'their', 'own', 'merits', ',', 'each', 'with', 'their', 'own', 'faults', '.', 'the', 'complex', 'role', 'of', 'ellie', 'is', 'an', 'actress', \"'\", 's', 'dream', ',', 'and', 'foster', ',', 'a', 'virtual', 'shoo', '-', 'in', 'for', 'yet', 'another', 'best', 'actress', 'oscar', 'nomination', 'next', 'year', ',', 'more', 'than', 'rises', 'to', 'the', 'challenge', '.', 'she', 'conveys', 'intelligence', ',', 'determination', ',', 'warmth', ',', 'and', ',', 'in', 'a', 'gutsy', 'move', ',', 'always', 'on', 'edge', '.', 'we', 'root', 'for', 'ellie', 'and', 'feel', 'for', 'her', ',', 'but', 'we', 'also', 'feel', 'at', 'times', 'that', 'she', 'goes', 'too', 'far', '.', 'contact', 'is', 'clearly', 'foster', \"'\", 's', 'vehicle', ',', 'but', 'others', 'are', 'given', 'their', 'chance', 'to', 'shine', 'in', 'smaller', 'roles', '.', 'matthew', 'mcconaughey', ',', 'who', 'receives', 'outrageously', 'high', 'billing', 'for', 'his', 'smallish', 'role', ',', 'holds', 'his', 'own', 'as', 'the', 'religious', 'counterpoint', 'to', 'ellie', ',', 'spiritual', 'scholar', 'and', 'government', 'adviser', 'palmer', 'joss', '(', 'however', ',', 'his', 'main', 'storyline', ',', 'the', 'tentative', 'palmer', '-', 'ellie', 'romance', ',', 'is', 'the', 'film', \"'\", 's', 'weakest', 'subplot', ')', '.', 'john', 'hurt', 'is', 'effectively', 'creepy', 'as', 's', '.', 'r', '.', 'hadden', ',', 'the', 'wealthy', 'eccentric', 'who', 'provides', 'ellie', 'with', 'her', 'research', 'funding', '.', 'angela', 'bassett', 'continues', 'to', 'impress', 'in', 'her', 'bit', 'role', 'as', 'white', 'house', 'aide', 'rachel', 'constantine', '.', 'most', 'memorable', 'of', 'all', ',', 'though', ',', 'are', 'tom', 'skerritt', 'and', 'james', 'woods', ',', 'who', 'play', 'rival', 'scientist', 'dr', '.', 'david', 'drumlin', 'and', 'national', 'security', 'adviser', 'michael', 'litz', ',', 'respectively', ';', 'both', ',', 'especially', 'skerritt', ',', 'embody', 'these', 'asshole', 'characters', 'that', 'the', 'audience', 'hissed', 'just', 'about', 'every', 'single', 'one', 'of', 'their', 'appearances', '.', 'zemeckis', 'comes', 'off', 'of', 'his', 'three', '-', 'year', 'break', 'in', 'top', 'shape', '.', 'always', 'known', 'as', 'a', 'director', 'of', 'effects', '-', 'laden', 'extravaganzas', ',', 'it', 'comes', 'as', 'no', 'surprise', 'that', 'contact', \"'\", 's', 'visual', 'effects', 'are', 'quite', 'stunning', '.', 'the', 'central', 'space', 'journey', 'is', 'more', 'than', 'a', 'little', 'reminiscent', 'of', 'the', 'close', 'of', '2001', ':', 'a', 'space', 'odyssey', ',', 'but', 'with', 'more', 'advanced', 'technology', 'at', 'his', 'disposal', ',', 'zemeckis', \"'\", 's', 'voyage', 'is', 'even', 'trippier', 'than', 'stanley', 'kubrick', \"'\", 's', 'yet', 'more', 'wondrously', 'pure', '.', 'and', 'zemeckis', 'doesn', \"'\", 't', 'resist', 'the', 'urge', 'to', 'use', 'the', 'always', '-', 'interesting', 'incorporate', '-', 'actors', '-', 'into', '-', 'existing', '-', 'film', '-', 'footage', 'effect', ',', 'which', 'is', 'every', 'bit', 'as', 'seamless', 'here', 'as', 'it', 'was', 'in', 'forrest', 'gump', '.', 'effects', ',', 'however', ',', 'are', 'confined', 'to', 'only', 'a', 'few', 'scenes', 'and', 'clearly', 'take', 'a', 'back', 'seat', 'to', 'the', 'drama', ',', 'emotion', ',', 'and', 'pure', 'wonder', ',', 'which', 'zemeckis', 'proved', 'to', 'be', 'quite', 'adept', 'at', 'in', 'gump', '.', 'it', 'says', 'a', 'lot', 'that', ',', 'in', 'a', 'summer', 'science', 'fiction', 'film', 'such', 'as', 'this', ',', 'it', \"'\", 's', 'not', 'so', 'much', 'the', 'effects', 'that', 'stay', 'with', 'you', 'as', 'it', 'is', 'the', 'drama', 'and', 'the', 'issues', 'that', 'are', 'raised', '.', 'the', 'thought', '-', 'provoking', ',', 'two', '-', 'hour', '-', 'plus', 'contact', 'is', 'a', 'much', '-', 'welcome', 'change', 'of', 'pace', 'from', 'summer', 'no', '-', 'brainers', ',', 'but', 'the', 'fact', 'that', 'it', 'is', 'a', 'smart', 'film', 'does', 'not', 'mean', 'that', 'it', 'also', 'isn', \"'\", 't', 'entertaining', '.', 'for', 'all', 'the', 'interesting', 'questions', 'it', 'asks', ',', 'the', 'film', 'is', 'still', 'what', 'it', \"'\", 's', 'being', 'sold', 'as', '--', '\"', 'a', 'journey', 'to', 'the', 'heart', 'of', 'the', 'universe', '.', '\"', 'and', 'what', 'a', 'fascinating', ',', 'unforgettable', 'journey', 'it', 'is', '.'], 'pos')\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "tuple"
            ]
          },
          "metadata": {},
          "execution_count": 28
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "2"
            ]
          },
          "metadata": {},
          "execution_count": 28
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "list"
            ]
          },
          "metadata": {},
          "execution_count": 28
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "str"
            ]
          },
          "metadata": {},
          "execution_count": 28
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "['contact', '(', 'pg', ')', 'there', \"'\", 's', 'a', 'moment', 'late', 'in', 'robert', 'zemeckis', \"'\", 's', 'contact', 'where', 'i', 'was', 'reminded', 'of', 'why', 'i', 'started', 'writing', 'movie', 'reviews', 'in', 'the', 'first', 'place', '.', 'we', 'see', 'a', 'scientist', ',', 'dressed', 'in', 'a', 'silvery', 'space', 'suit', ',', 'walking', 'tentatively', 'across', 'a', 'narrow', 'walkway', 'leading', 'inside', 'a', 'compact', ',', 'spherical', 'space', 'pod', ',', 'unaware', 'of', 'what', 'awaits', 'when', 'the', 'ball', 'literally', 'drops', '.', 'anticipation', ',', 'excitement', ',', 'anxiety', ',', 'fear', '--', 'the', 'audience', 'experiences', 'it', 'all', 'the', 'emotional', 'tension', 'right', 'with', 'the', 'character', ',', 'nervously', ',', 'breathlessly', 'eager', 'to', 'see', 'what', 'lies', 'ahead', '.', 'it', 'is', 'this', 'sense', 'of', 'discovery', ',', 'the', 'anticipation', 'of', 'which', 'and', 'its', 'accompanying', 'exhilaration', ',', 'that', 'makes', 'this', 'adaptation', 'of', 'the', 'carl', 'sagan', 'novel', 'such', 'magical', ',', 'captivating', 'entertainment', '.', 'jodie', 'foster', 'stars', 'as', 'dr', '.', 'ellie', 'arroway', ',', 'a', 'brilliant', 'astronomer', 'who', 'dedicates', 'her', 'entire', 'life', 'to', 'searching', 'outer', 'space', 'for', 'extraterrestrial', 'radio', 'signals', '.', 'and', 'i', 'mean', 'life', '--', 'after', 'losing', 'her', 'entire', 'family', 'when', 'she', 'was', 'young', ',', 'the', 'only', 'thing', 'occupying', 'ellie', \"'\", 's', 'world', 'is', 'this', 'quest', 'to', 'discover', 'life', 'beyond', 'this', 'earth', '.', 'after', 'dealing', 'with', 'much', 'skepticism', 'on', 'the', 'part', 'of', 'government', 'officials', 'and', 'wealthy', 'financiers', ',', 'ellie', 'receives', 'her', 'vindication', 'when', 'she', 'stumbles', 'upon', 'an', 'incoming', 'radio', 'transmission', 'from', 'the', 'distant', 'star', 'vega', ',', 'which', 'includes', 'instructions', 'on', 'building', 'an', 'interstellar', 'transport', '.', 'from', 'this', 'synopsis', ',', 'contact', 'does', 'not', 'sound', 'too', 'different', 'to', 'most', 'films', 'about', 'alien', 'contact', ',', 'but', 'there', 'is', 'a', 'whole', 'lot', 'more', 'to', 'this', 'intelligent', 'film', 'than', 'the', 'sci', '-', 'fi', 'hook', '.', 'the', 'alien', 'contact', 'angle', 'generates', 'a', 'great', 'amount', 'of', 'suspense', 'and', 'awe', ',', 'but', 'perhaps', 'more', 'than', 'anything', 'else', ',', 'contact', 'is', 'a', 'character', 'study', 'of', 'ellie', ',', 'whose', 'obsession', 'with', 'empirical', ',', 'scientific', 'evidence', 'has', 'erased', 'all', 'belief', 'in', 'a', 'higher', 'power', '.', 'the', 'irony', 'is', 'that', ',', 'while', 'admitting', 'to', 'having', 'no', 'religious', 'faith', ',', 'she', 'holds', 'onto', 'her', 'belief', 'in', 'extraterrestrial', 'life', 'with', 'such', 'passion', 'and', 'conviction', 'that', 'it', 'becomes', ',', 'in', 'a', 'sense', ',', 'a', 'religion', 'in', 'its', 'own', 'right', '.', 'it', 'would', 'be', 'easy', 'for', 'scripters', 'james', 'v', '.', 'hart', 'and', 'michael', 'goldenberg', ',', 'in', 'trying', 'to', 'paint', 'a', 'positive', 'image', 'of', 'the', 'heroine', ',', 'to', 'champion', 'her', 'scientific', 'beliefs', 'over', 'religious', 'ones', ',', 'but', 'they', 'wisely', 'eschew', 'easy', 'answers', ',', 'giving', 'equal', 'time', 'to', 'both', 'sides', ',', 'and', 'in', 'so', 'doing', 'depict', 'ellie', 'as', 'not', 'completely', 'sane', '.', 'in', 'the', 'end', ',', 'there', 'is', 'no', 'right', 'or', 'wrong', ',', 'nor', 'is', 'there', 'one', 'side', 'that', 'comes', 'off', 'more', 'positive', 'in', 'the', 'other', ',', 'even', 'slightly', 'so', '--', 'there', 'are', 'just', 'two', 'very', 'viable', 'points', 'of', 'view', ',', 'each', 'with', 'their', 'own', 'merits', ',', 'each', 'with', 'their', 'own', 'faults', '.', 'the', 'complex', 'role', 'of', 'ellie', 'is', 'an', 'actress', \"'\", 's', 'dream', ',', 'and', 'foster', ',', 'a', 'virtual', 'shoo', '-', 'in', 'for', 'yet', 'another', 'best', 'actress', 'oscar', 'nomination', 'next', 'year', ',', 'more', 'than', 'rises', 'to', 'the', 'challenge', '.', 'she', 'conveys', 'intelligence', ',', 'determination', ',', 'warmth', ',', 'and', ',', 'in', 'a', 'gutsy', 'move', ',', 'always', 'on', 'edge', '.', 'we', 'root', 'for', 'ellie', 'and', 'feel', 'for', 'her', ',', 'but', 'we', 'also', 'feel', 'at', 'times', 'that', 'she', 'goes', 'too', 'far', '.', 'contact', 'is', 'clearly', 'foster', \"'\", 's', 'vehicle', ',', 'but', 'others', 'are', 'given', 'their', 'chance', 'to', 'shine', 'in', 'smaller', 'roles', '.', 'matthew', 'mcconaughey', ',', 'who', 'receives', 'outrageously', 'high', 'billing', 'for', 'his', 'smallish', 'role', ',', 'holds', 'his', 'own', 'as', 'the', 'religious', 'counterpoint', 'to', 'ellie', ',', 'spiritual', 'scholar', 'and', 'government', 'adviser', 'palmer', 'joss', '(', 'however', ',', 'his', 'main', 'storyline', ',', 'the', 'tentative', 'palmer', '-', 'ellie', 'romance', ',', 'is', 'the', 'film', \"'\", 's', 'weakest', 'subplot', ')', '.', 'john', 'hurt', 'is', 'effectively', 'creepy', 'as', 's', '.', 'r', '.', 'hadden', ',', 'the', 'wealthy', 'eccentric', 'who', 'provides', 'ellie', 'with', 'her', 'research', 'funding', '.', 'angela', 'bassett', 'continues', 'to', 'impress', 'in', 'her', 'bit', 'role', 'as', 'white', 'house', 'aide', 'rachel', 'constantine', '.', 'most', 'memorable', 'of', 'all', ',', 'though', ',', 'are', 'tom', 'skerritt', 'and', 'james', 'woods', ',', 'who', 'play', 'rival', 'scientist', 'dr', '.', 'david', 'drumlin', 'and', 'national', 'security', 'adviser', 'michael', 'litz', ',', 'respectively', ';', 'both', ',', 'especially', 'skerritt', ',', 'embody', 'these', 'asshole', 'characters', 'that', 'the', 'audience', 'hissed', 'just', 'about', 'every', 'single', 'one', 'of', 'their', 'appearances', '.', 'zemeckis', 'comes', 'off', 'of', 'his', 'three', '-', 'year', 'break', 'in', 'top', 'shape', '.', 'always', 'known', 'as', 'a', 'director', 'of', 'effects', '-', 'laden', 'extravaganzas', ',', 'it', 'comes', 'as', 'no', 'surprise', 'that', 'contact', \"'\", 's', 'visual', 'effects', 'are', 'quite', 'stunning', '.', 'the', 'central', 'space', 'journey', 'is', 'more', 'than', 'a', 'little', 'reminiscent', 'of', 'the', 'close', 'of', '2001', ':', 'a', 'space', 'odyssey', ',', 'but', 'with', 'more', 'advanced', 'technology', 'at', 'his', 'disposal', ',', 'zemeckis', \"'\", 's', 'voyage', 'is', 'even', 'trippier', 'than', 'stanley', 'kubrick', \"'\", 's', 'yet', 'more', 'wondrously', 'pure', '.', 'and', 'zemeckis', 'doesn', \"'\", 't', 'resist', 'the', 'urge', 'to', 'use', 'the', 'always', '-', 'interesting', 'incorporate', '-', 'actors', '-', 'into', '-', 'existing', '-', 'film', '-', 'footage', 'effect', ',', 'which', 'is', 'every', 'bit', 'as', 'seamless', 'here', 'as', 'it', 'was', 'in', 'forrest', 'gump', '.', 'effects', ',', 'however', ',', 'are', 'confined', 'to', 'only', 'a', 'few', 'scenes', 'and', 'clearly', 'take', 'a', 'back', 'seat', 'to', 'the', 'drama', ',', 'emotion', ',', 'and', 'pure', 'wonder', ',', 'which', 'zemeckis', 'proved', 'to', 'be', 'quite', 'adept', 'at', 'in', 'gump', '.', 'it', 'says', 'a', 'lot', 'that', ',', 'in', 'a', 'summer', 'science', 'fiction', 'film', 'such', 'as', 'this', ',', 'it', \"'\", 's', 'not', 'so', 'much', 'the', 'effects', 'that', 'stay', 'with', 'you', 'as', 'it', 'is', 'the', 'drama', 'and', 'the', 'issues', 'that', 'are', 'raised', '.', 'the', 'thought', '-', 'provoking', ',', 'two', '-', 'hour', '-', 'plus', 'contact', 'is', 'a', 'much', '-', 'welcome', 'change', 'of', 'pace', 'from', 'summer', 'no', '-', 'brainers', ',', 'but', 'the', 'fact', 'that', 'it', 'is', 'a', 'smart', 'film', 'does', 'not', 'mean', 'that', 'it', 'also', 'isn', \"'\", 't', 'entertaining', '.', 'for', 'all', 'the', 'interesting', 'questions', 'it', 'asks', ',', 'the', 'film', 'is', 'still', 'what', 'it', \"'\", 's', 'being', 'sold', 'as', '--', '\"', 'a', 'journey', 'to', 'the', 'heart', 'of', 'the', 'universe', '.', '\"', 'and', 'what', 'a', 'fascinating', ',', 'unforgettable', 'journey', 'it', 'is', '.']\n",
            "pos\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Tokenization and preprocessing could also be more sophisticated\n",
        "X, y = zip(*[(words, 1 if category == 'pos' else 0) for words, category in documents])\n",
        "\n",
        "# Split the data into training and test sets\n",
        "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)"
      ],
      "metadata": {
        "id": "T22Qlzatjj51"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "type(X_train)\n",
        "len(X_train)\n",
        "print(X_train[0])\n",
        "type(X_train[0])\n",
        "len(X_train[0])\n",
        "\n",
        "type(X_train[0][0])\n",
        "type(X_train[0][1])\n",
        "\n",
        "len(X_test)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Q_kssNrrlf9E",
        "outputId": "155ce8e6-279a-49ae-a455-30a8d2ac2c6a"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "list"
            ]
          },
          "metadata": {},
          "execution_count": 32
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "1500"
            ]
          },
          "metadata": {},
          "execution_count": 32
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "['susan', 'granger', \"'\", 's', 'review', 'of', '\"', 'bread', 'and', 'tulips', '\"', '(', 'first', 'look', 'pictures', ')', 'in', 'this', 'delightfully', 'frothy', 'italian', 'romantic', 'comedy', ',', 'after', 'accidentally', 'being', 'left', 'behind', 'by', 'a', 'tour', 'bus', 'while', 'on', 'a', 'family', 'vacation', 'with', 'her', 'cranky', 'husband', 'and', 'two', 'cynical', 'teenagers', ',', 'rosalba', '(', 'licia', 'maglietta', ')', ',', 'an', 'unhappy', 'housewife', 'from', 'pescara', ',', 'finds', 'herself', '-', 'and', 'love', '-', 'in', 'venice', '.', 'for', 'the', 'first', 'time', 'in', 'years', ',', 'rosalba', \"'\", 's', 'on', 'her', 'own', 'when', 'she', \"'\", 's', 'abandoned', 'at', 'a', 'highway', 'rest', 'area', '.', 'although', 'her', 'philandering', 'husband', '(', 'antonio', 'catania', ')', ',', 'a', 'plumbing', '-', 'supply', 'dealer', ',', 'orders', 'her', 'to', 'stay', 'there', 'until', 'she', \"'\", 's', 'picked', 'up', ',', 'she', 'impulsively', 'accepts', 'a', 'ride', 'to', 'venice', ',', 'a', 'bohemian', 'paradise', 'which', 'she', \"'\", 's', 'never', 'visited', '.', 'rosalba', 'finds', 'refuge', 'and', 'romance', 'with', 'fernando', '(', 'bruno', 'ganz', ')', ',', 'a', 'gruff', 'icelandic', 'waiter', 'who', 'offers', 'her', 'a', 'spare', 'room', 'in', 'his', 'modest', 'apartment', 'and', 'prepares', 'breakfast', 'for', 'her', 'each', 'morning', '.', 'to', 'support', 'herself', ',', 'she', 'gets', 'a', 'job', 'working', 'with', 'a', 'florist', '(', 'antonio', 'catania', ')', '.', 'film', '-', 'maker', 'silvio', 'soldini', 'gently', 'explores', 'the', 'blossoming', 'of', 'this', 'bored', ',', 'middle', '-', 'aged', ',', 'middle', '-', 'class', 'woman', 'with', 'warmth', 'and', 'affection', ',', 'savoring', 'special', 'moments', 'such', 'as', 'when', 'rosalba', 'starts', 'playing', 'the', 'accordion', 'again', 'and', 'abandons', 'her', 'maroon', 'stretch', 'pants', ',', 'silver', 'jacket', 'and', 'orange', 'sneakers', 'for', 'a', 'simple', ',', 'new', 'red', '-', 'and', '-', 'white', 'dress', 'with', 'platform', '-', 'soled', 'espadrilles', '.', 'the', 'superb', 'actors', 'slip', 'into', 'their', 'roles', 'seamlessly', ',', 'particularly', 'luminous', 'licia', 'maglietta', 'and', 'low', '-', 'key', 'bruno', 'ganz', ',', 'along', 'with', 'marina', 'massironi', 'as', 'her', 'nosy', 'massage', '-', 'therapist', 'neighbor', 'and', 'giuseppe', 'massironi', 'as', 'the', 'inept', 'plumber', '-', 'turned', '-', 'private', 'eye', 'who', \"'\", 's', 'sent', 'to', 'retrieve', 'her', 'on', 'orders', 'from', 'her', 'frantic', 'husband', '-', 'who', \"'\", 's', 'discovered', 'that', 'his', 'mistress', 'has', 'no', 'interest', 'in', 'doing', 'his', 'laundry', 'or', 'cleaning', 'the', 'house', '.', 'on', 'the', 'granger', 'movie', 'gauge', 'of', '1', 'to', '10', ',', '\"', 'bread', 'and', 'tulips', '\"', 'is', 'a', 'beguiling', ',', 'escapist', '8', '.', 'as', 'the', 'summer', 'ends', ',', 'it', \"'\", 's', 'a', 'magical', 'getaway', 'for', 'mature', 'audiences', '.', '.']\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "list"
            ]
          },
          "metadata": {},
          "execution_count": 32
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "369"
            ]
          },
          "metadata": {},
          "execution_count": 32
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "str"
            ]
          },
          "metadata": {},
          "execution_count": 32
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "str"
            ]
          },
          "metadata": {},
          "execution_count": 32
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "500"
            ]
          },
          "metadata": {},
          "execution_count": 32
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Train a Word2Vec model\n",
        "model = Word2Vec(sentences=X_train, vector_size=100, window=5, min_count=1, workers=4)"
      ],
      "metadata": {
        "id": "778kDvIKhSV4"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "The aove code initializes and trains a Word2Vec model using the `gensim` library in Python. Let's break down what each parameter means and how the model is trained:\n",
        "\n",
        "### Parameters of `Word2Vec`\n",
        "\n",
        "1. **sentences (`X_train`)**: This is the input data for the model. It should be a list of lists of tokens, i.e., each inner list is a sequence of words (tokens) from a single document. In the context of your code, `X_train` contains the tokenized text data that will be used to train the Word2Vec model.\n",
        "\n",
        "2. **vector_size (100)**: This parameter sets the dimensionality of the word vectors. In this case, each word will be represented as a 100-dimensional vector. The size of the vector is a key hyperparameter and can affect both the quality of the representations and the performance of the model, with larger sizes generally capturing more information about the word but requiring more data to train effectively.\n",
        "\n",
        "3. **window (5)**: This parameter specifies the maximum distance between the current and predicted word within a sentence. In other words, for a given target word, the model will consider up to 5 words before and after it as context words. The window size affects how much contextual information the model considers and can influence how well the model learns relationships between words that occur in broader contexts.\n",
        "\n",
        "4. **min_count (1)**: This parameter determines the minimum frequency count of words. Here, it is set to 1, meaning that the model will consider all words that appear at least once in the corpus. Setting this threshold helps in reducing the size of the model by ignoring rare words, which often occur as noise or are too infrequent to provide meaningful information.\n",
        "\n",
        "5. **workers (4)**: This is the number of worker threads to use for training the model, which can speed up the training process on multi-core machines. Here, it is set to use 4 threads, enabling parallel processing to train the model faster.\n",
        "\n",
        "### Functionality of `Word2Vec`\n",
        "\n",
        "- **Training**: The Word2Vec model uses the provided corpus (in this case, `X_train`) to learn vector representations for each word in a way that similar words (in terms of context) have similar vectors. It does this either through the CBOW or Skip-Gram architecture, depending on how it's configured (default is CBOW if not specified).\n",
        "- **Output**: After training, the model (`model`) will have learned word embeddings for all words in the corpus that meet the `min_count` criterion. These embeddings can be accessed via `model.wv`, and can be used to check word similarities, as input features for machine learning models, or for any task that might benefit from a numerical representation of text data.\n",
        "\n",
        "The trained Word2Vec model is particularly useful for NLP tasks where the semantic meaning of words and their relationships significantly influence the outcome, such as sentiment analysis, recommendation systems, and more."
      ],
      "metadata": {
        "id": "wqz6LDPhqCWV"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(model)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "aSXKSR9MmDc7",
        "outputId": "abad45db-3900-43d2-da5b-b06a0cf28fbb"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Word2Vec<vocab=35363, vector_size=100, alpha=0.025>\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "4. Define a Function to Convert Reviews into Feature Vectors\n",
        "Convert each review into a vector by averaging the vectors of the words in the review."
      ],
      "metadata": {
        "id": "3xu1YSX9mUHe"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def document_vector(doc):\n",
        "    # remove out-of-vocabulary words\n",
        "    doc = [word for word in doc if word in model.wv.index_to_key]\n",
        "    return np.mean(model.wv[doc], axis=0) if doc else np.zeros(model.vector_size)\n",
        "\n",
        "# Apply the function to each document\n",
        "X_train_vec = np.array([document_vector(doc) for doc in X_train])\n",
        "X_test_vec = np.array([document_vector(doc) for doc in X_test])\n"
      ],
      "metadata": {
        "id": "PG31gNbxhSTn"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "This Python code defines a function `document_vector(doc)` and then applies this function to each document in `X_train` and `X_test` datasets. Let's break down the code step by step:\n",
        "\n",
        "1. **Defining the `document_vector` function**:\n",
        "   - This function takes a single argument `doc`, which represents a document. In the context of natural language processing (NLP), a document could be a sentence, a paragraph, or any other unit of text.\n",
        "   - Inside the function:\n",
        "     - It first filters out any words in the document that are not present in the vocabulary of a pre-trained Word2Vec model (`model`). This is done to ensure that only words known to the model are used to compute the document vector.\n",
        "     - It then calculates the mean vector of the remaining words in the document. This is achieved by retrieving the word vectors (`model.wv[doc]`) for each word in the document and taking the mean along the rows (axis=0) to obtain a single vector representation for the entire document.\n",
        "     - If the document is empty (i.e., all words are out-of-vocabulary), it returns a zero vector with the same dimensionality as the word vectors in the model (`model.vector_size`).\n",
        "\n",
        "2. **Applying the function to each document**:\n",
        "   - After defining the `document_vector` function, the code applies this function to each document in both the training (`X_train`) and testing (`X_test`) datasets.\n",
        "   - It creates a new array `X_train_vec` to store the document vectors for the training set and another array `X_test_vec` to store the document vectors for the testing set.\n",
        "   - For each document in `X_train` and `X_test`, the `document_vector` function is called, and the resulting document vector is added to the respective array using list comprehension.\n",
        "   - Finally, the arrays are converted to NumPy arrays using `np.array()` to facilitate further processing or analysis.\n",
        "\n",
        "In summary, this code snippet calculates the document vectors for each document in the training and testing datasets using a pre-trained Word2Vec model. The document vectors represent the semantic meaning of the documents in a continuous vector space, which can be useful for various NLP tasks such as document classification, clustering, or similarity analysis."
      ],
      "metadata": {
        "id": "Sl62wxmbYWsO"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "5. Train a Classifier\n",
        "Train a logistic regression classifier using the document vectors."
      ],
      "metadata": {
        "id": "1EqHv2o0mmvc"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Train a logistic regression classifier\n",
        "classifier = LogisticRegression(max_iter=1000)\n",
        "classifier.fit(X_train_vec, y_train)\n",
        "\n",
        "# Predict the labels for the test set\n",
        "y_pred = classifier.predict(X_test_vec)\n",
        "\n",
        "# Calculate the accuracy of the predictions\n",
        "accuracy = accuracy_score(y_test, y_pred)\n",
        "print(f\"Accuracy: {accuracy:.2f}\")\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "TWz0nF1MhSQ7",
        "outputId": "a9138726-580d-40c4-d9f1-6632fff06e18"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Accuracy: 0.69\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#Summary\n",
        "In this example, we loaded a dataset of movie reviews, trained a Word2Vec model on the training part of the dataset, and used the averaged word vectors to create feature vectors for each review. A logistic regression model was then trained on these features to perform sentiment analysis. This example shows the power of Word2Vec in capturing semantic properties of words and using them in downstream tasks like sentiment analysis.\n",
        "\n",
        "Keep in mind that the performance can be significantly improved with more sophisticated preprocessing, hyperparameter tuning, and possibly using more advanced models for both word embeddings and classification."
      ],
      "metadata": {
        "id": "BnhgpUJHmrgP"
      }
    },
    {
      "cell_type": "code",
      "source": [],
      "metadata": {
        "id": "iE0dMGOhhR_6"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [],
      "metadata": {
        "id": "IuWL8cQJhR8q"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}