{ "cells": [ { "cell_type": "markdown", "id": "e4d57d53", "metadata": {}, "source": [ "# Divorce" ] },
{ "cell_type": "code", "execution_count": null, "id": "9G1AOK-PWzwR", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 25114, "status": "ok", "timestamp": 1746448292946, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "9G1AOK-PWzwR", "outputId": "eebdb5f0-4e07-4e23-edbd-0bf27f883f60" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: jieba in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (0.42.1)\n", "Requirement already satisfied: gensim==4.3.3 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (4.3.3)\n", "Successfully installed annotated-types-0.7.0 blis-0.7.11 catalogue-2.0.10 cloudpathlib-0.16.0 confection-0.1.5 cymem-2.0.11 langcodes-3.5.0 language-data-1.3.0 marisa-trie-1.2.1 murmurhash-1.0.12 preshed-3.0.9 pydantic-2.11.4 pydantic-core-2.33.2 smart-open-6.4.0 spacy-3.7.2 spacy-legacy-3.0.12 spacy-loggers-1.0.5 srsly-2.5.1 thinc-8.2.2 typer-0.9.4 typing-inspection-0.4.0 wasabi-1.1.3 weasel-0.3.4\n" ] } ], "source": [ "#!pip install jieba\n", "#!pip install \"gensim==4.3.3\" \"spacy==3.7.2\" \"thinc==8.2.2\"" ] },
{ "cell_type": "code", "execution_count": null, "id": "5bdb9c23", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Successfully installed narwhals-1.38.2 plotly-6.0.1\n" ] } ], "source": [ "#!pip install plotly\n" ] },
{ "cell_type": "code", "execution_count": 1, "id": "qVkh_Y7HWj9w", "metadata": { "executionInfo": { "elapsed": 10133, "status": "ok", "timestamp": 1746448329851, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "qVkh_Y7HWj9w" }, "outputs": [], "source": [ "import pandas as pd\n", "import jieba\n", "import jieba.analyse\n", "import re\n", "import numpy as np\n", "from collections import defaultdict\n", "import multiprocessing\n", "\n", "from gensim.models.phrases import Phrases, Phraser\n", "from gensim.models import Word2Vec, KeyedVectors\n", "\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.decomposition import PCA\n", "from sklearn.manifold import TSNE\n", "\n", "import seaborn as sns\n", "import torch\n", "\n", "from matplotlib.font_manager import fontManager\n", "import plotly.express as px\n", "\n", "sns.set_style(\"darkgrid\")" ] },
{ "cell_type": "code", "execution_count": null, "id": "076c6263", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 5331, "status": "ok", "timestamp": 1746448337568, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "076c6263", "outputId": "20be4efe-6dc5-4f3c-be04-3527cfe7323c" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Building prefix dict from c:\\Users\\rolya\\Desktop\\DIV\\divoce_project2\\dict\\dict.txt.big ...\n", "Dumping model to file cache C:\\Users\\rolya\\AppData\\Local\\Temp\\jieba.u2ecd0c6dc6535871c5cc6cd50f65ab67.cache\n", "Loading model cost 1.600 seconds.\n", "Prefix dict has been built successfully.\n" ] } ], "source": [ "# Load the Traditional Chinese dictionary and the custom user dictionary\n", "jieba.set_dictionary('/Users/rolya/Desktop/DIV/divoce_project2/dict/dict.txt.big')\n", "jieba.load_userdict('/Users/rolya/Desktop/DIV/divoce_project2/dict/user_dict.txt')\n", "\n", "# Load the stopwords into a set for fast membership tests\n", "with open('/Users/rolya/Desktop/DIV/divoce_project2/dict/stopwords.txt',encoding=\"utf-8\") as f:\n", "    stopWords = {line.strip() for line in f}" ] },
{ "cell_type": "code", "execution_count": 4, "id": "CLy-xuUYW8Mb", "metadata": { "executionInfo": { "elapsed": 19, "status": "ok", "timestamp": 1746448340493, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "CLy-xuUYW8Mb" }, "outputs": [], "source": [ "# Tokenization function\n", "def getToken(row):\n", "    seg_list = jieba.lcut(row)\n", "    seg_list = [w for w in seg_list if w not in stopWords and len(w)>1] # drop stopwords and single-character tokens\n", "\n", "    return seg_list" ] },
{ "cell_type": "code", "execution_count": 5, "id": "UH4Gyx2zW8O3", "metadata": { "executionInfo": { "elapsed": 5350, "status": "ok", "timestamp": 1746448346821, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "UH4Gyx2zW8O3" }, "outputs": [], "source": [ "# Load the Chinese demo dataset\n", "origin_data = pd.read_csv('/Users/rolya/Desktop/DIV/divoce_project2/text.csv')" ] }, { "cell_type": "code", "execution_count": 
6, "id": "I1D6HJoyW8RK", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "executionInfo": { "elapsed": 55259, "status": "ok", "timestamp": 1746448404395, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "I1D6HJoyW8RK", "outputId": "faaa04de-fe22-4611-d081-de1d5f4fefb5" }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>system_id</th><th>artUrl</th><th>artTitle</th><th>artDate</th><th>artContent</th><th>sentence</th><th>word</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>1</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>偷看手機是不對的</td><td>[偷看, 手機]</td></tr>\n",
" <tr><th>2</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>但如果已經結婚了</td><td>[結婚]</td></tr>\n",
" <tr><th>3</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>太太想看你手機</td><td>[太太, 手機]</td></tr>\n",
" <tr><th>4</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>真的可以拒絕嗎</td><td>[真的, 拒絕]</td></tr>\n",
" <tr><th>5</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>感覺你拒絕</td><td>[感覺, 拒絕]</td></tr>\n",
" <tr><th>6</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>就是心裡有鬼</td><td>[有鬼]</td></tr>\n",
" <tr><th>7</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>只是讓太太猜忌</td><td>[太太, 猜忌]</td></tr>\n",
" <tr><th>8</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>自己日子難過</td><td>[日子, 難過]</td></tr>\n",
" <tr><th>10</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>我手機都是隨便太太看</td><td>[手機, 隨便, 太太]</td></tr>\n",
" <tr><th>11</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>但每次都被看的提心吊膽</td><td>[每次, 提心吊膽]</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " system_id artUrl \\\n", "1 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "2 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "3 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "4 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "5 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "6 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "7 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "8 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "10 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "11 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "\n", " artTitle artDate \\\n", "1 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "2 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "3 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "4 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "5 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "6 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "7 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "8 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "10 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "11 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "\n", " artContent sentence \\\n", "1 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 偷看手機是不對的 \n", "2 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 但如果已經結婚了 \n", "3 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 太太想看你手機 \n", "4 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 真的可以拒絕嗎 \n", "5 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 感覺你拒絕 \n", "6 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 就是心裡有鬼 \n", "7 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 只是讓太太猜忌 \n", "8 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 自己日子難過 \n", "10 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 我手機都是隨便太太看 \n", "11 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 
但每次都被看的提心吊膽 \n", "\n", " word \n", "1 [偷看, 手機] \n", "2 [結婚] \n", "3 [太太, 手機] \n", "4 [真的, 拒絕] \n", "5 [感覺, 拒絕] \n", "6 [有鬼] \n", "7 [太太, 猜忌] \n", "8 [日子, 難過] \n", "10 [手機, 隨便, 太太] \n", "11 [每次, 提心吊膽] " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Data preprocessing\n", "\n", "# Drop the columns that are not needed\n", "metaData = origin_data.drop(['artPoster', 'artCatagory', 'artComment', 'e_ip', 'insertedDate', 'dataSource'], axis=1)\n", "\n", "# Turn blank lines into sentence breaks and remove the remaining newlines\n", "metaData['sentence'] = metaData['artContent'].str.replace(r'\\n\\n','。', regex=True)\n", "metaData['sentence'] = metaData['sentence'].str.replace(r'\\n','', regex=True)\n", "\n", "# Split on full-width and half-width punctuation, one sentence per row\n", "metaData['sentence'] = metaData['sentence'].str.split(\"[,,。!!??]+\")\n", "metaData = metaData.explode('sentence').reset_index(drop=True)\n", "\n", "# Keep only Chinese characters\n", "metaData['sentence'] = metaData['sentence'].apply(lambda x: re.sub('[^\\u4e00-\\u9fff]+', '',x))\n", "\n", "# Tokenize each sentence and drop rows that end up with no tokens\n", "metaData['word'] = metaData.sentence.apply(getToken)\n", "\n", "metaData = metaData[metaData['word'].apply(len) > 0]\n", "\n", "metaData.head(10)" ] }, { "cell_type": "code", "execution_count": 7, "id": "8jPUK7SdXSqx", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 920 }, "executionInfo": { "elapsed": 1755, "status": "ok", "timestamp": 1746448413451, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "8jPUK7SdXSqx", "outputId": "10e01979-80cb-49d2-f737-1eb34a779264" }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>system_id</th><th>artUrl</th><th>artTitle</th><th>artDate</th><th>artContent</th><th>sentence</th><th>word</th><th>word_list_bigrams</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>1</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>偷看手機是不對的</td><td>[偷看, 手機]</td><td>[偷看, 手機]</td></tr>\n",
" <tr><th>2</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>但如果已經結婚了</td><td>[結婚]</td><td>[結婚]</td></tr>\n",
" <tr><th>3</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>太太想看你手機</td><td>[太太, 手機]</td><td>[太太, 手機]</td></tr>\n",
" <tr><th>4</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>真的可以拒絕嗎</td><td>[真的, 拒絕]</td><td>[真的, 拒絕]</td></tr>\n",
" <tr><th>5</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>感覺你拒絕</td><td>[感覺, 拒絕]</td><td>[感覺, 拒絕]</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " system_id artUrl \\\n", "1 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "2 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "3 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "4 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "5 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "\n", " artTitle artDate \\\n", "1 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "2 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "3 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "4 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "5 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "\n", " artContent sentence word \\\n", "1 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 偷看手機是不對的 [偷看, 手機] \n", "2 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 但如果已經結婚了 [結婚] \n", "3 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 太太想看你手機 [太太, 手機] \n", "4 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 真的可以拒絕嗎 [真的, 拒絕] \n", "5 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 
感覺你拒絕 [感覺, 拒絕] \n", "\n", " word_list_bigrams \n", "1 [偷看, 手機] \n", "2 [結婚] \n", "3 [太太, 手機] \n", "4 [真的, 拒絕] \n", "5 [感覺, 拒絕] " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sents = metaData['word'].to_list()\n", "bigrams = Phrases(sents,min_count=1, threshold=1000)\n", "bigram_phrasers = Phraser(bigrams)\n", "metaData['word_list_bigrams'] = list(bigram_phrasers[sents])\n", "\n", "metaData.head()" ] }, { "cell_type": "code", "execution_count": 8, "id": "dzOfLkazXbES", "metadata": { "executionInfo": { "elapsed": 151, "status": "ok", "timestamp": 1746448416220, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "dzOfLkazXbES" }, "outputs": [], "source": [ "word_freq = defaultdict(int)\n", "# 計算詞頻\n", "sents = metaData['word_list_bigrams'].tolist()\n", "for sent in sents: # sent 中的每個句子\n", " for i in sent: # i 是句子中的每個字\n", " word_freq[i] += 1" ] }, { "cell_type": "code", "execution_count": 9, "id": "BadyXcj5XbG6", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 12, "status": "ok", "timestamp": 1746448417825, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "BadyXcj5XbG6", "outputId": "e53931fe-27d1-45c3-9c05-f8b2c01b112f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total unique words in sentences: 39393\n" ] }, { "data": { "text/plain": [ "['小孩', '離婚', '老婆', '老公', '真的', '孩子', '工作', '婚姻', '結婚', '太太']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(f\"total unique words in sentences: {len(word_freq)}\")\n", "sorted(word_freq, key=word_freq.get, reverse=True)[:10]" ] }, { "cell_type": "code", "execution_count": 10, "id": "bQxIwrRPXp8L", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 27, "status": "ok", "timestamp": 1746448439640, "user": { "displayName": "章茗鈞", 
"userId": "09803695490438841130" }, "user_tz": -480 }, "id": "bQxIwrRPXp8L", "outputId": "43c3a5e7-bf56-47e5-ccd8-5bcc39805e1e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sentence number of corpus: 95013\n", "average length of sentences: 3.452327576226411\n" ] } ], "source": [ "print(f\"sentence number of corpus: {len(sents)}\")\n", "i = 0\n", "for sent in sents:\n", " i = i + len(sent)\n", "print(f\"average length of sentences: {i/len(sents)}\")" ] }, { "cell_type": "code", "execution_count": 11, "id": "M8JXl54lX1B8", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 14, "status": "ok", "timestamp": 1746448441412, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "M8JXl54lX1B8", "outputId": "4ffee132-4a9b-44f2-c8cc-0bc45217870d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: PYTHONHASHSEED=2025\n" ] } ], "source": [ "# Set the environment variable\n", "%env PYTHONHASHSEED=2025" ] }, { "cell_type": "code", "execution_count": 12, "id": "-dlYF-HbX1Ew", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 10, "status": "ok", "timestamp": 1746448442578, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "-dlYF-HbX1Ew", "outputId": "4d6246f8-a6e3-494e-af2d-e0f9d8386f7f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "number of cores: 8\n" ] } ], "source": [ "# Check the number of CPU cores on this machine\n", "cores = multiprocessing.cpu_count()\n", "print(f\"number of cores: {cores}\")" ] }, { "cell_type": "code", "execution_count": 13, "id": "HYUphL5uX1Gx", "metadata": { "executionInfo": { "elapsed": 14626, "status": "ok", "timestamp": 1746448458777, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "HYUphL5uX1Gx" }, "outputs": [], "source": [ "# Build the model\n", "w2v_model = Word2Vec(sents,\n", " min_count=30,# drop words that occur fewer than 30 times\n", "
window=2,# context window of 2 tokens on each side\n", " vector_size=128,# dimensionality of the word vectors\n", " sample=0.005,# downsampling threshold: the smaller it is, the less often high-frequency words are kept\n", " alpha=0.001,# initial learning rate\n", " min_alpha=0.0005, # final learning rate; the rate decays from alpha to min_alpha during training\n", " negative=0,\n", " workers=cores-1, # number of worker threads\n", " seed=8787,\n", " sg = 1,# 1 = skip-gram, 0 = CBOW\n", " epochs= 30,\n", " hs=1 , # hierarchical softmax\n", " )" ] }, { "cell_type": "code", "execution_count": 14, "id": "F4aco4pYX1I6", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 9, "status": "ok", "timestamp": 1746448460904, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "F4aco4pYX1I6", "outputId": "db21bd03-797e-4505-a82b-fed12df4b761" }, "outputs": [ { "data": { "text/plain": [ "[('思考', 0.7755464315414429),\n", " ('解決', 0.7708988785743713),\n", " ('期待', 0.6676611304283142),\n", " ('嘗試', 0.6662196516990662),\n", " ('這是', 0.6276875734329224),\n", " ('衝突', 0.6261622309684753),\n", " ('表達', 0.6218885183334351),\n", " ('事件', 0.6023396253585815),\n", " ('想法', 0.5949125289916992),\n", " ('需求', 0.5907225012779236)]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Inspect the most similar words\n", "w2v_model.wv.most_similar('溝通',topn=10)" ] }, { "cell_type": "code", "execution_count": 15, "id": "DQBpLVepX-eW", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 10, "status": "ok", "timestamp": 1746448462758, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "DQBpLVepX-eW", "outputId": "7f0f0ac5-b93e-422f-bd9b-02aeb36c0933" }, "outputs": [ { "data": { "text/plain": [ "[('提出', 0.8008367419242859),\n", " ('原因', 0.7528184652328491),\n", " ('不爽', 0.7414447665214539),\n", " ('分手', 0.7389088869094849),\n", " ('實在', 0.732452392578125),\n", " ('念頭', 0.7268778681755066),\n", " ('出軌', 0.7090317606925964),\n", " ('乾脆', 0.700692892074585),\n", " ('平靜', 0.693835437297821),\n", "
('理由', 0.6839743852615356)]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "w2v_model.wv.most_similar('外遇',topn=10)" ] }, { "cell_type": "code", "execution_count": 16, "id": "TGj28f87X-gx", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 44, "status": "ok", "timestamp": 1746448464700, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "TGj28f87X-gx", "outputId": "551f9ab0-5c20-48ea-e59b-aab175a35b6c" }, "outputs": [ { "data": { "text/plain": [ "[('平靜', 0.7731956839561462),\n", " ('解決', 0.7368565797805786),\n", " ('思考', 0.7182171940803528),\n", " ('值得', 0.715507447719574),\n", " ('理由', 0.70955491065979),\n", " ('想法', 0.6994882225990295),\n", " ('提出', 0.6934615969657898),\n", " ('尊重', 0.6934552788734436),\n", " ('道理', 0.6815539598464966),\n", " ('原因', 0.6742791533470154)]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "w2v_model.wv.most_similar(['溝通','外遇'],topn=10)" ] }, { "cell_type": "code", "execution_count": 17, "id": "Ah-Z2aBOYTWu", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 8, "status": "ok", "timestamp": 1746448466738, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "Ah-Z2aBOYTWu", "outputId": "2224e8fc-3314-4fee-e707-8363e1c0be2f" }, "outputs": [ { "data": { "text/plain": [ "[('休息', 0.6033270359039307),\n", " ('白天', 0.5987855792045593),\n", " ('在家', 0.5936435461044312),\n", " ('大人', 0.5754421353340149),\n", " ('足夠', 0.5643977522850037),\n", " ('接送', 0.5636379718780518),\n", " ('下班', 0.5620827078819275),\n", " ('心力', 0.5549818277359009),\n", " ('育兒', 0.5517842173576355),\n", " ('旁邊', 0.5506572723388672)]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Words least related to the two terms\n", "w2v_model.wv.most_similar(negative=['外遇','離婚'],topn=10)" ] }, { 
"cell_type": "code", "execution_count": 18, "id": "ZYb3cV1GYdn-", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 8, "status": "ok", "timestamp": 1746448472457, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "ZYb3cV1GYdn-", "outputId": "97a5fe78-5dd4-4e1a-d718-76489e0a53bb" }, "outputs": [ { "data": { "text/plain": [ "-0.18695961" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Compute the similarity between two words\n", "w2v_model.wv.similarity(\"財產\",\"家庭\")" ] }, { "cell_type": "code", "execution_count": 19, "id": "rfT2fDmZYTZZ", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 8, "status": "ok", "timestamp": 1746448509442, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "rfT2fDmZYTZZ", "outputId": "81522c70-ce3a-4ab1-f006-34c71752ff24" }, "outputs": [ { "data": { "text/plain": [ "0.55098337" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "w2v_model.wv.similarity(\"孩子\",\"照顧\")" ] }, { "cell_type": "code", "execution_count": 20, "id": "ap-85hspYw2O", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "executionInfo": { "elapsed": 22, "status": "ok", "timestamp": 1746448563929, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "ap-85hspYw2O", "outputId": "a3a5955f-62ea-4241-ff21-f83790a42027" }, "outputs": [ { "data": { "text/plain": [ "'財產'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find the word that least matches the others (the outlier)\n", "w2v_model.wv.doesnt_match([\"孩子\", \"照顧\", '財產'])" ] }, { "cell_type": "code", "execution_count": 21, "id": "4nJ3n4dUYw4s", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 9, "status": "ok", "timestamp": 1746448571806, "user": { "displayName": 
"章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "4nJ3n4dUYw4s", "outputId": "fe67bec2-9262-4630-bfd6-dc8499cc8790" }, "outputs": [ { "data": { "text/plain": [ "[('痛苦', 0.5897860527038574),\n", " ('故事', 0.5843417048454285),\n", " ('紀錄', 0.5227385759353638),\n", " ('妻子', 0.5202898979187012),\n", " ('個性', 0.5193036198616028)]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Relative relation: words close to 孩子 but far from 照顧\n", "w2v_model.wv.most_similar(positive=[\"孩子\"], negative=[\"照顧\"], topn=5)" ] }, { "cell_type": "code", "execution_count": 22, "id": "bh-G_76TZFo8", "metadata": { "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1746448575071, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "bh-G_76TZFo8" }, "outputs": [], "source": [ "# Get every word in the vocabulary\n", "words = w2v_model.wv.key_to_index.keys()" ] }, { "cell_type": "code", "execution_count": 23, "id": "vMtqPxNDZFrQ", "metadata": { "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1746448577427, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "vMtqPxNDZFrQ" }, "outputs": [], "source": [ "# Dimensionality reduction with PCA or t-SNE\n", "\n", "def reduceDim(mat,method:str='PCA',dim:int=2,perplexity = 25,learning_rate = 400):\n", "\n", " method_dict = {\n", " \"PCA\":PCA(n_components=dim,iterated_power = 1000,random_state=0),\n", " \"TSNE\":TSNE(n_components=dim,random_state=0,perplexity=perplexity,learning_rate=learning_rate),\n", " }\n", " new_feat = method_dict[method].fit_transform(mat)\n", "\n", " return new_feat" ] }, { "cell_type": "code", "execution_count": 24, "id": "RKMymLiIZL_6", "metadata": { "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1746448579681, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "RKMymLiIZL_6" }, "outputs": [], "source": [ "# Get the vectors for a list of words\n", "def getVecs(model,words:list):\n", " vecs = []\n", " for i in 
words:\n", " vecs.append(model.wv[i])\n", " return np.vstack(vecs)" ] }, { "cell_type": "code", "execution_count": 25, "id": "IhkKOa7nZOoC", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 10, "status": "ok", "timestamp": 1746448582558, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "IhkKOa7nZOoC", "outputId": "ee073901-e574-432c-f907-3b177cda3c32" }, "outputs": [ { "data": { "text/plain": [ "array([[ 0.05552926, 0.03847976, -0.23192777, 0.06041394, -0.03632559,\n", " -0.04825368, -0.09273008, 0.06537188, 0.07540878, 0.11345442,\n", " 0.10255916, -0.077728 , 0.03348193, -0.02929123, 0.04965089,\n", " 0.04994745, -0.02007875, 0.06827826, -0.00043407, -0.01652355,\n", " 0.07447162, 0.06956953, 0.04445267, -0.06862568, -0.01181999,\n", " -0.03680974, -0.09098391, -0.02162968, 0.0925396 , -0.03726147,\n", " 0.03511068, -0.0015918 , -0.04098483, -0.08775292, 0.02748766,\n", " -0.0189511 , 0.09938143, -0.00548935, -0.16770932, 0.03284688,\n", " 0.05587117, -0.09926225, -0.09662215, 0.02078925, -0.0978763 ,\n", " -0.10456892, -0.08619943, -0.07782117, -0.00280136, -0.02179379,\n", " 0.10193834, 0.0503955 , -0.00199134, 0.06327718, 0.05917658,\n", " -0.00107532, -0.03456983, -0.12342957, 0.02685144, -0.00024162,\n", " -0.08171521, 0.02174407, 0.03526353, -0.0224024 , 0.01412679,\n", " -0.03493559, -0.07929365, -0.11893469, -0.07531871, 0.04449066,\n", " -0.10216135, 0.10212398, -0.09488969, 0.03615135, -0.03160409,\n", " 0.10345982, 0.03115953, 0.06215408, -0.1881784 , -0.03869899,\n", " 0.0145611 , 0.08423696, -0.00474182, 0.01020931, -0.05811653,\n", " -0.08193173, -0.09165919, 0.00748084, -0.01750713, 0.02589018,\n", " -0.07287675, 0.05250846, 0.02563178, -0.08972724, -0.02054487,\n", " -0.03332768, 0.07814557, -0.10100207, 0.00216089, 0.13747457,\n", " 0.07158327, 0.02981087, 0.02596068, 0.00235181, 0.01135799,\n", " 0.03610549, -0.03554225, 0.1099498 , -0.07154053, 
0.02411028,\n", " 0.10968643, -0.0749239 , 0.1138766 , 0.01662243, 0.00771771,\n", " -0.07943476, -0.06997006, -0.04681144, -0.02465238, -0.04719803,\n", " 0.09851346, -0.05417875, 0.05610305, -0.16223566, -0.08995478,\n", " 0.08608418, 0.057287 , -0.014308 ],\n", " [ 0.06940024, 0.02854187, 0.0235089 , 0.0113638 , -0.05424808,\n", " -0.03968925, 0.01986084, 0.04322903, -0.04480281, -0.02419144,\n", " 0.08776139, -0.07677491, 0.03418854, 0.04797035, 0.02194501,\n", " 0.08888641, -0.02592358, 0.00501392, 0.0174946 , -0.02264285,\n", " -0.00230185, -0.0468002 , 0.01820106, 0.04159217, -0.02537901,\n", " -0.05261362, 0.00505259, 0.00644564, -0.01651222, -0.04771264,\n", " -0.03318613, 0.01206908, -0.00102497, 0.05461093, 0.06607899,\n", " -0.005135 , 0.06350551, 0.07255646, -0.01323139, 0.00854335,\n", " -0.02009638, 0.03134184, -0.03179 , -0.04898661, -0.07075465,\n", " -0.05226701, 0.00413447, 0.05127696, -0.04384896, -0.05382123,\n", " 0.05206602, 0.04317016, -0.02729745, 0.03092107, 0.00511999,\n", " 0.02858668, -0.0191604 , -0.05599732, -0.02721906, 0.01753448,\n", " 0.04740141, -0.00360852, 0.01634916, -0.07158028, -0.04917749,\n", " -0.01737433, -0.01424111, -0.02998464, 0.03152172, 0.08365235,\n", " -0.02340383, 0.05740769, 0.01968239, -0.02477934, 0.03875388,\n", " 0.05084869, 0.00639911, 0.03731444, -0.06243524, 0.00063726,\n", " -0.07853391, 0.02793852, -0.03870003, 0.00755745, 0.0322638 ,\n", " -0.02977583, -0.02146657, 0.05491862, 0.02703132, -0.04046306,\n", " 0.04141882, -0.019448 , -0.03544722, 0.05065524, 0.02812297,\n", " -0.00278463, 0.00866573, -0.02489917, 0.0713686 , 0.01389261,\n", " 0.00163449, 0.05256031, -0.03146213, -0.02890553, -0.00148923,\n", " -0.02173272, 0.06234517, 0.07063311, 0.01675576, 0.04644315,\n", " -0.01290863, -0.04317336, 0.06099079, -0.04302061, -0.02277633,\n", " -0.02732737, 0.03801377, -0.0165253 , 0.00495038, -0.04724422,\n", " -0.00122255, 0.05922864, -0.0054045 , -0.01885483, 0.03823422,\n", " 0.01087835, -0.01277172, 
0.05238632]], dtype=float32)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "getVecs(w2v_model,['溝通','外遇'])" ] }, { "cell_type": "code", "execution_count": 26, "id": "4DxkIg3rZTpX", "metadata": { "executionInfo": { "elapsed": 8, "status": "ok", "timestamp": 1746448593237, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "4DxkIg3rZTpX" }, "outputs": [], "source": [ "# Expand a word list with similar words\n", "def expandPosWord(model, words:list, top_n:int, split = True):\n", "\n", " if split == False:\n", " wp = model.wv.most_similar(words,topn = top_n)\n", " return wp\n", " expand = []\n", "\n", " for w in words:\n", " wp = model.wv.most_similar(w,topn = top_n)\n", " for i in wp:\n", " expand.append(i[0])\n", "\n", " return list(set(expand))" ] }, { "cell_type": "code", "execution_count": 27, "id": "0VpyujvPZTr7", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 7, "status": "ok", "timestamp": 1746448594941, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "0VpyujvPZTr7", "outputId": "b58643d0-c102-4e48-b07f-fc55aa8f4cb9" }, "outputs": [ { "data": { "text/plain": [ "['期待',\n", " '解決',\n", " '提出',\n", " '實在',\n", " '表達',\n", " '想法',\n", " '乾脆',\n", " '需求',\n", " '事件',\n", " '原因',\n", " '嘗試',\n", " '思考',\n", " '分手',\n", " '理由',\n", " '出軌',\n", " '衝突',\n", " '不爽',\n", " '這是',\n", " '平靜',\n", " '念頭']" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "expandPosWord(w2v_model,['溝通','外遇'],top_n = 10)" ] }, { "cell_type": "markdown", "id": "gLuu-WhtZmg-", "metadata": { "id": "gLuu-WhtZmg-" }, "source": [ "The query words above are taken from the top ten tf-idf terms of the second project:\n", "4828\t工作\n", "3219\t問題\n", "9604\t美國\n", "4481\t家庭\n", "10732\t財產\n", "7951\t溝通\n", "3646\t外遇\n", "6672\t改變\n", "8172\t照顧\n", "1882\t公公" ] }, { "cell_type": "code", "execution_count": 34, "id": "K6ldPzJtaIWl", "metadata": { "executionInfo": { 
"elapsed": 3, "status": "ok", "timestamp": 1746448627096, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "K6ldPzJtaIWl" }, "outputs": [], "source": [ "# Draw a 2D scatter plot\n", "def plotScatter(vec_df):\n", " \"\"\"\n", " vec_df: words and their values on the two dimensions\n", " \"\"\"\n", " plt.figure(figsize=(15,15))\n", " fontManager.addfont('/Users/rolya/Desktop/DIV/divoce_project2/TaipeiSansTCBeta-Regular.ttf')\n", " plt.rcParams['font.sans-serif'] = ['Taipei Sans TC Beta']\n", " plt.rcParams['font.size'] = '16'\n", "\n", " p = sns.scatterplot(x=\"dim1\", y=\"dim2\",\n", " data=vec_df)\n", " for line in range(0, vec_df.shape[0]):\n", " p.text(vec_df[\"dim1\"][line],\n", " vec_df['dim2'][line],\n", " ' ' + vec_df[\"word\"][line].title(),\n", " horizontalalignment='left',\n", " verticalalignment='bottom', size='medium',\n", " weight='normal'\n", " ).set_size(15)\n", " plt.show()\n", "\n", "# Draw a 3D scatter plot\n", "def plotScatter3D(vec_df):\n", " vec_df['size'] = .5\n", " if 'color' not in vec_df.columns:\n", " vec_df['color'] = 'blue'\n", " fig = px.scatter_3d(\n", " vec_df,'dim1','dim2','dim3',text = 'word',width=800, height=800,color = 'color',size = 'size'\n", "\n", " )\n", "\n", " fig.show()" ] }, { "cell_type": "code", "execution_count": 35, "id": "ZnaUyfFKaIZa", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 1058, "status": "ok", "timestamp": 1746449496210, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "ZnaUyfFKaIZa", "outputId": "bb1d6728-6674-48fc-82c6-bb86a4bec9bb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(150, 128)\n", "(150, 2)\n" ] } ], "source": [ "sample_words = np.random.choice(list(words),150,replace=False).tolist()\n", "\n", "feat = getVecs(model=w2v_model,words=sample_words)\n", "print(feat.shape)\n", "new_feat = reduceDim(feat,method='TSNE',perplexity=20,learning_rate = 800)\n", "print(new_feat.shape)" ] }, { "cell_type": 
"code", "execution_count": 36, "id": "kP2XUbI_bIUj", "metadata": { "executionInfo": { "elapsed": 11, "status": "ok", "timestamp": 1746449496791, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "kP2XUbI_bIUj" }, "outputs": [], "source": [ "word_df = pd.DataFrame({\n", " \"word\":sample_words,\n", " \"dim1\":new_feat[:,0],\n", " \"dim2\":new_feat[:,1],\n", "})" ] }, { "cell_type": "code", "execution_count": 37, "id": "-falARSubKP7", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 940 }, "executionInfo": { "elapsed": 559, "status": "ok", "timestamp": 1746449499304, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "-falARSubKP7", "outputId": "e80f6271-5376-47ab-b5f1-d5bcd331f1a4" }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plotScatter(word_df)" ] }, { "cell_type": "code", "execution_count": 39, "id": "mBuzpve_bMFr", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 834 }, "executionInfo": { "elapsed": 88, "status": "ok", "timestamp": 1746449505942, "user": { "displayName": "章茗鈞", "userId": "09803695490438841130" }, "user_tz": -480 }, "id": "mBuzpve_bMFr", "outputId": "088b6722-7c64-430a-8ab2-a99da4bac2bd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(150, 3)\n" ] }, { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "hovertemplate": "color=blue
dim1=%{x}
dim2=%{y}
dim3=%{z}
size=%{marker.size}
word=%{text}", "legendgroup": "blue", "marker": { "color": "#636efa", "size": { "bdata": "AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/", "dtype": "f8" }, "sizemode": "area", "sizeref": 0.00125, "symbol": "circle" }, "mode": "markers+text", "name": "blue", "scene": "scene", "showlegend": true, "text": [ "抱抱", "扭曲", "禮拜", "實際上", "扣掉", "例子", "男友", "證明", "好好", "消失", "可能性", "打算", "碰到", "分配", "提早", "諮詢", "相關", "照片", "專業", "我能", "更是", "老公", "依舊", 
"不在乎", "吞下去", "勇敢", "體力", "那位", "墮胎", "假日", "本質", "對話", "法官", "電動", "單方面", "支撐", "這不", "同事", "努力", "講好", "在乎", "項目", "婚姻生活", "居住", "婆媳", "工廠", "生完", "分手", "互動", "要生", "律師", "敘述", "賣掉", "沒差", "停止", "安全感", "盡力", "繼承", "透天", "盡量", "生產", "三次", "陌生人", "老二", "沒事", "來回", "心疼", "發現", "安慰", "所有人", "留言", "表面", "道歉", "接近", "扶養", "頭期", "謝謝", "上次", "第一個", "偷吃", "貸款", "大小事", "男人", "有用", "身邊", "男生", "反駁", "阿公", "無限", "背叛", "困難", "分析", "做出", "丈夫", "未婚夫", "奶瓶", "脾氣", "義務", "你好", "無所謂", "花費", "心情", "做好", "情境", "這份", "拿到", "電視", "養育", "北京", "參與", "扶持", "人才", "犧牲", "失望", "財產", "一路", "觀念", "最終", "危險", "虐待", "怨恨", "改善", "釐清", "磨合", "言論", "情緒", "部份", "我並", "女同事", "提供", "感謝", "贍養費", "打開", "一步", "婚姻", "做法", "姐姐", "言語", "分擔", "長時間", "兩人", "作法", "當你", "拿出", "走路", "撫養費", "伴侶", "管理", "睡眠", "住家" ], "type": "scatter3d", "x": { "bdata": "T6lJvTRlzLx7SCW+VeVqPJj2Fr1Lgp66HsAkvfI1M72Veu48UhEGvdF7o7vhovW9niPMPJ1KBb7d4l6854MVPpK61T3YGaS6jbMVPghHH71GdKW8fGF8vpEeDr0rZRu8nc/IOgSU3rwsFHq9jiVwvZhhob00FpS+w4OmPIeFpzsoxOS87F8ivnhnhD04gQG9X3pavXWpnr2NS1k+GRZVvX8nqTsxxqe8z2GJPJBqnb3ErR+8cBhhvT4v5L2URxE+FK+EPeT1Er6Hzhw+bg3cvO7Av72bMYO9Q/AhvMilOD0/qze9Q3QBvkCDqr0mKYK8q3wWvrjDjLwzMQC9mJ2Fvd/zdL3IDgW8CMXbvZKGpr2nr1i8/xuJPNjdKjtB3Dw9qjYzPc3IC73Gm8S9UrXAvLYKsT6MFKS8J0IUvrO9jb3UFem9RQb8vG/Su7wWzlk9qAHLvdDnuzz/4MC8Lr2VvSY9W7x/ygM9wqaFPbRntT173TI+P48NvDpxj72q94W9E6MVvD7yuLxYOyA7gEGPvalZHz3LgDo+CjmmvQlYazwcVKa9Hw8JvuxCOLyF7Am+UTSgvdrCYr1NPqo7iPJvvNmrzLsoqbm8cPVCvqn41b0Njc+8pIuzO4hxXL2ndFW9laeZvb3ofz6ZDhQ9/KGJPQyfLj2L/4w/78ecvTxbgrppmUi7m4yrPvmzQT7nlt+9FbPtuuQzgb2HNsU/d6vMPaMKE76YpG290csTvniGjb239Dw+gJw7PeC6R73oDzw8rqyIvYvXQr5gMYk+4gYyPT9DW7x4IsC9", "dtype": "f4" }, "y": { "bdata": 
"v2GRPF0GNr102029sL8GOu0E1Lwduo08p9HvPefMybqT9Ui+xuedvCbdT72akUe9/vHAvIVQoD0DEEw9clbXvR8Ba72x+QY9RsDNvbQKXD0XFDu7JkZsPvVjiryzeHq8cDG2vMsXxbucQHq84OgOPfDWJr2UNKc9Yc6MvHODRr0qhRE8C92NPb4r27pAgOy9y8FrvViIPT34xUy+OFTAvA6v5b33aDm8m8qOvHkKAb2/vK29q0GpvAMm3DxwTVG9oyBfvRAr8btTz+k8Zi/dvEG2ZD5hiXq8draZPOU8sb0sR2g6omSBPMlNqLz5hbU9r+5svVhM8jwB45w8l9TjvGailD1Pejk8Wq4uvUA9Yz1isSy9+SOKvWu0+DsdPVw8ZuylPahdhzzQjWK8YuUcvJj1VD3wrQ+8tNr4u6o/WbzSC4o9BsVIvRPJBb4+0649Gv+vPa3/hz4Twdc5N3djvey8+DzaOti9thyRvfFQFD2M26m9jRpmvVQSw7xERns94bgRPptDNL0g4IE8BRciPHe3PLyrvU8+vQzqO6xsx7va9AS8sYz/vFsFGjvRW8Q8wAGWOpNbt7wSWiG9Szexuxm0/72NGXW9KlNhPrjsnLx9WJe9AkCgvYGQKbzCmwk9wR60O5Lp5rz8dMO7k1TXva6K7zyRMag/K6KJvL9xeD3Z4ng9SIosveNUqDz/ULe8Go9OvFznnT22GkC/ewWQPZwKIz0/uby9clqaPckZ0LxtTIO+RnD3PM6r0DxC0wc9IccLO7fhxbyYawO+ablzPQHSI71xf0k6", "dtype": "f4" }, "z": { "bdata": "ktpKvUtiZ718Nxm+AU6JPNz7kL1JpwY9hQyAPmnTgDxuLMi+mCaovDp6Lz1v8RU+hgckvPZlUD5Z9pO9+1z6Pabvr7yYovS9u2u/vHxokL0fSSa9ACVJPspqwLwVgbu7+59GPe0I7LvhDBe9/IiMPS18gz0RJoK+njK7PV34370vCZm6twA+vjxi2zzp57i9k/0WPUTnwD0Diua9aa+4PR+ATDytcWU8w0VsvOQuizxKP+C7Zbo1vT0NaL2oUqI+H5bFvV9cFb30GFk+fOodO9z38j6IopO8k0WJOeHLMr3dRjo9GTxFPeyfjzy3W2a+vPCNvegi7LywZpc8k/m9vYJj0ry11p08X2GmvISD8z3D/Um9lxDOPd9+ZLyaAdO8eGq8PYxCrb3EO+q9eWTkPb4FUz4J9UI8qHKBPXYDGD5vlnA+NTSJvUTlvj1ZGXQ9YZ3nuhxacD3jIig8Z/bgvaF8Jz1iPtk8r12NvdXANbyDoBC88aAdO1EmZT1Zy4C9qozLveEJYbxq2a09GpVbPUoTtb2xw4e+R6uVufL7ob2gFaO9yeLsvH1BOL47hl6+710/vfuU4r2qDlC7dMzkvJQ1FL3nhe89N7pKP/8ZDT2rnis9ITUTOwaa8Dtn8fm9x3JyO9L2FT191GO9y1YQvuEn0zvNgXW+IgCDvAfnjjxJydQ8pMukPb8QWD4wUQg8BICQvFKyjb20xBQ++t1rvT3DdzxHMLk89ugNvgvXnr1qhg2+6LKXvWc0Cj0qc589e1uXvUK7AL41v5U9oLaePMpPEL4ykQi8", "dtype": "f4" } } ], "layout": { "height": 800, "legend": { "itemsizing": "constant", "title": { "text": "color" }, "tracegroupgap": 0 }, "margin": { "t": 60 }, "scene": { "domain": { "x": [ 0, 1 ], "y": [ 0, 1 ] }, "xaxis": { "title": { "text": "dim1" } }, "yaxis": { "title": { "text": "dim2" } }, "zaxis": { "title": { "text": "dim3" } } }, "template": { 
"data": { "bar": [ { "error_x": { "color": "#2a3f5f" }, "error_y": { "color": "#2a3f5f" }, "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "baxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "heatmap" } ], "histogram": [ { "marker": { "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, 
"#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "fillpattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergl" } ], "scattermap": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermap" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": 
"scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "#EBF0F8" }, "line": { "color": "white" } }, "header": { "fill": { "color": "#C8D4E3" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowcolor": "#2a3f5f", "arrowhead": 0, "arrowwidth": 1 }, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "colorscale": { "diverging": [ [ 0, "#8e0152" ], [ 0.1, "#c51b7d" ], [ 0.2, "#de77ae" ], [ 0.3, "#f1b6da" ], [ 0.4, "#fde0ef" ], [ 0.5, "#f7f7f7" ], [ 0.6, "#e6f5d0" ], [ 0.7, "#b8e186" ], [ 0.8, "#7fbc41" ], [ 0.9, "#4d9221" ], [ 1, "#276419" ] ], "sequential": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "sequentialminus": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ] }, "colorway": [ "#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52" ], "font": { 
"color": "#2a3f5f" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "#E5ECF6", "showlakes": true, "showland": true, "subunitcolor": "white" }, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "paper_bgcolor": "white", "plot_bgcolor": "#E5ECF6", "polar": { "angularaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "radialaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "scene": { "xaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "yaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "zaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" } }, "shapedefaults": { "line": { "color": "#2a3f5f" } }, "ternary": { "aaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "baxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "caxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "title": { "x": 0.05 }, "xaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 }, "yaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 } } }, "width": 800 } } }, "metadata": {}, "output_type": "display_data" } ], "source": [ "new_feat = reduceDim(feat,dim = 3,method = 'PCA' )\n", "print(new_feat.shape)\n", "word_df = pd.DataFrame({\n", " \"word\":sample_words,\n", " \"dim1\":new_feat[:,0],\n", " \"dim2\":new_feat[:,1],\n", " \"dim3\":new_feat[:,2],\n", "})\n", 
"plotScatter3D(word_df)" ] }, { "cell_type": "markdown", "id": "UjwOlPoXdjjh", "metadata": { "id": "UjwOlPoXdjjh" }, "source": [ "The 3D plot can be zoomed in and out." ] }, { "cell_type": "markdown", "id": "bc846a67", "metadata": {}, "source": [ "Cluster the words\n" ] }, { "cell_type": "code", "execution_count": null, "id": "c5fe440a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting scikit-learn-extra\n", " Downloading scikit_learn_extra-0.3.0-cp311-cp311-win_amd64.whl.metadata (3.7 kB)\n", "Requirement already satisfied: numpy>=1.13.3 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from scikit-learn-extra) (1.24.3)\n", "Requirement already satisfied: scipy>=0.19.1 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from scikit-learn-extra) (1.12.0)\n", "Requirement already satisfied: scikit-learn>=0.23.0 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from scikit-learn-extra) (1.4.0)\n", "Requirement already satisfied: joblib>=1.2.0 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from scikit-learn>=0.23.0->scikit-learn-extra) (1.4.2)\n", "Requirement already satisfied: threadpoolctl>=2.0.0 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from scikit-learn>=0.23.0->scikit-learn-extra) (3.6.0)\n", "Downloading scikit_learn_extra-0.3.0-cp311-cp311-win_amd64.whl (340 kB)\n", "Installing collected packages: scikit-learn-extra\n", "Successfully installed scikit-learn-extra-0.3.0\n" ] } ], "source": [ "#!pip install scikit-learn-extra" ] }, { "cell_type": "code", "execution_count": 41, "id": "bca0112f", "metadata": {}, "outputs": [], "source": [ "# Clustering\n", "from sklearn.cluster import KMeans\n", "from sklearn_extra.cluster import KMedoids\n", "# Cluster using only the word vectors\n", "def cluster(X, method='kmeans', n=2):\n", "\n", "    # 'kmedos' is the key this notebook uses to select KMedoids\n", "    method_dict = {\n", "        'kmeans': KMeans(n_clusters=n, random_state=0),\n", "        'kmedos': KMedoids(n_clusters=n, random_state=0)\n", "    }\n", "    model = method_dict[method]\n", "    model.fit(X)\n", "    result = model.predict(X)\n", "    return result" ] }, { "cell_type": "code", "execution_count": 42, "id": "8fa7ba74", "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "hovertemplate": "dim1=%{x}
dim2=%{y}
dim3=%{z}
size=%{marker.size}
word=%{text}
color=%{marker.color}", "legendgroup": "", "marker": { "color": { "bdatadtype": "i4" }, "coloraxis": "coloraxis", "size": { "bdata": "AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/AAAAAAAA4D8AAAAAAADgPwAAAAAAAOA/", "dtype": "f8" }, "sizemode": "area", "sizeref": 0.00125, "symbol": "circle" }, "mode": "markers+text", "name": "", "scene": "scene", "showlegend": false, "text": [ "抱抱", "扭曲", "禮拜", "實際上", "扣掉", "例子", "男友", "證明", "好好", "消失", "可能性", "打算", "碰到", "分配", "提早", "諮詢", 
"相關", "照片", "專業", "我能", "更是", "老公", "依舊", "不在乎", "吞下去", "勇敢", "體力", "那位", "墮胎", "假日", "本質", "對話", "法官", "電動", "單方面", "支撐", "這不", "同事", "努力", "講好", "在乎", "項目", "婚姻生活", "居住", "婆媳", "工廠", "生完", "分手", "互動", "要生", "律師", "敘述", "賣掉", "沒差", "停止", "安全感", "盡力", "繼承", "透天", "盡量", "生產", "三次", "陌生人", "老二", "沒事", "來回", "心疼", "發現", "安慰", "所有人", "留言", "表面", "道歉", "接近", "扶養", "頭期", "謝謝", "上次", "第一個", "偷吃", "貸款", "大小事", "男人", "有用", "身邊", "男生", "反駁", "阿公", "無限", "背叛", "困難", "分析", "做出", "丈夫", "未婚夫", "奶瓶", "脾氣", "義務", "你好", "無所謂", "花費", "心情", "做好", "情境", "這份", "拿到", "電視", "養育", "北京", "參與", "扶持", "人才", "犧牲", "失望", "財產", "一路", "觀念", "最終", "危險", "虐待", "怨恨", "改善", "釐清", "磨合", "言論", "情緒", "部份", "我並", "女同事", "提供", "感謝", "贍養費", "打開", "一步", "婚姻", "做法", "姐姐", "言語", "分擔", "長時間", "兩人", "作法", "當你", "拿出", "走路", "撫養費", "伴侶", "管理", "睡眠", "住家" ], "type": "scatter3d", "x": { "bdata": "T6lJvTRlzLx7SCW+VeVqPJj2Fr1Lgp66HsAkvfI1M72Veu48UhEGvdF7o7vhovW9niPMPJ1KBb7d4l6854MVPpK61T3YGaS6jbMVPghHH71GdKW8fGF8vpEeDr0rZRu8nc/IOgSU3rwsFHq9jiVwvZhhob00FpS+w4OmPIeFpzsoxOS87F8ivnhnhD04gQG9X3pavXWpnr2NS1k+GRZVvX8nqTsxxqe8z2GJPJBqnb3ErR+8cBhhvT4v5L2URxE+FK+EPeT1Er6Hzhw+bg3cvO7Av72bMYO9Q/AhvMilOD0/qze9Q3QBvkCDqr0mKYK8q3wWvrjDjLwzMQC9mJ2Fvd/zdL3IDgW8CMXbvZKGpr2nr1i8/xuJPNjdKjtB3Dw9qjYzPc3IC73Gm8S9UrXAvLYKsT6MFKS8J0IUvrO9jb3UFem9RQb8vG/Su7wWzlk9qAHLvdDnuzz/4MC8Lr2VvSY9W7x/ygM9wqaFPbRntT173TI+P48NvDpxj72q94W9E6MVvD7yuLxYOyA7gEGPvalZHz3LgDo+CjmmvQlYazwcVKa9Hw8JvuxCOLyF7Am+UTSgvdrCYr1NPqo7iPJvvNmrzLsoqbm8cPVCvqn41b0Njc+8pIuzO4hxXL2ndFW9laeZvb3ofz6ZDhQ9/KGJPQyfLj2L/4w/78ecvTxbgrppmUi7m4yrPvmzQT7nlt+9FbPtuuQzgb2HNsU/d6vMPaMKE76YpG290csTvniGjb239Dw+gJw7PeC6R73oDzw8rqyIvYvXQr5gMYk+4gYyPT9DW7x4IsC9", "dtype": "f4" }, "y": { "bdata": 
"v2GRPF0GNr102029sL8GOu0E1Lwduo08p9HvPefMybqT9Ui+xuedvCbdT72akUe9/vHAvIVQoD0DEEw9clbXvR8Ba72x+QY9RsDNvbQKXD0XFDu7JkZsPvVjiryzeHq8cDG2vMsXxbucQHq84OgOPfDWJr2UNKc9Yc6MvHODRr0qhRE8C92NPb4r27pAgOy9y8FrvViIPT34xUy+OFTAvA6v5b33aDm8m8qOvHkKAb2/vK29q0GpvAMm3DxwTVG9oyBfvRAr8btTz+k8Zi/dvEG2ZD5hiXq8draZPOU8sb0sR2g6omSBPMlNqLz5hbU9r+5svVhM8jwB45w8l9TjvGailD1Pejk8Wq4uvUA9Yz1isSy9+SOKvWu0+DsdPVw8ZuylPahdhzzQjWK8YuUcvJj1VD3wrQ+8tNr4u6o/WbzSC4o9BsVIvRPJBb4+0649Gv+vPa3/hz4Twdc5N3djvey8+DzaOti9thyRvfFQFD2M26m9jRpmvVQSw7xERns94bgRPptDNL0g4IE8BRciPHe3PLyrvU8+vQzqO6xsx7va9AS8sYz/vFsFGjvRW8Q8wAGWOpNbt7wSWiG9Szexuxm0/72NGXW9KlNhPrjsnLx9WJe9AkCgvYGQKbzCmwk9wR60O5Lp5rz8dMO7k1TXva6K7zyRMag/K6KJvL9xeD3Z4ng9SIosveNUqDz/ULe8Go9OvFznnT22GkC/ewWQPZwKIz0/uby9clqaPckZ0LxtTIO+RnD3PM6r0DxC0wc9IccLO7fhxbyYawO+ablzPQHSI71xf0k6", "dtype": "f4" }, "z": { "bdata": "ktpKvUtiZ718Nxm+AU6JPNz7kL1JpwY9hQyAPmnTgDxuLMi+mCaovDp6Lz1v8RU+hgckvPZlUD5Z9pO9+1z6Pabvr7yYovS9u2u/vHxokL0fSSa9ACVJPspqwLwVgbu7+59GPe0I7LvhDBe9/IiMPS18gz0RJoK+njK7PV34370vCZm6twA+vjxi2zzp57i9k/0WPUTnwD0Diua9aa+4PR+ATDytcWU8w0VsvOQuizxKP+C7Zbo1vT0NaL2oUqI+H5bFvV9cFb30GFk+fOodO9z38j6IopO8k0WJOeHLMr3dRjo9GTxFPeyfjzy3W2a+vPCNvegi7LywZpc8k/m9vYJj0ry11p08X2GmvISD8z3D/Um9lxDOPd9+ZLyaAdO8eGq8PYxCrb3EO+q9eWTkPb4FUz4J9UI8qHKBPXYDGD5vlnA+NTSJvUTlvj1ZGXQ9YZ3nuhxacD3jIig8Z/bgvaF8Jz1iPtk8r12NvdXANbyDoBC88aAdO1EmZT1Zy4C9qozLveEJYbxq2a09GpVbPUoTtb2xw4e+R6uVufL7ob2gFaO9yeLsvH1BOL47hl6+710/vfuU4r2qDlC7dMzkvJQ1FL3nhe89N7pKP/8ZDT2rnis9ITUTOwaa8Dtn8fm9x3JyO9L2FT191GO9y1YQvuEn0zvNgXW+IgCDvAfnjjxJydQ8pMukPb8QWD4wUQg8BICQvFKyjb20xBQ++t1rvT3DdzxHMLk89ugNvgvXnr1qhg2+6LKXvWc0Cj0qc589e1uXvUK7AL41v5U9oLaePMpPEL4ykQi8", "dtype": "f4" } } ], "layout": { "coloraxis": { "colorbar": { "title": { "text": "color" } }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 
0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ] }, "height": 800, "legend": { "itemsizing": "constant", "tracegroupgap": 0 }, "margin": { "t": 60 }, "scene": { "domain": { "x": [ 0, 1 ], "y": [ 0, 1 ] }, "xaxis": { "title": { "text": "dim1" } }, "yaxis": { "title": { "text": "dim2" } }, "zaxis": { "title": { "text": "dim3" } } }, "template": { "data": { "bar": [ { "error_x": { "color": "#2a3f5f" }, "error_y": { "color": "#2a3f5f" }, "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "baxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 
0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "heatmap" } ], "histogram": [ { "marker": { "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "fillpattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergl" } ], "scattermap": [ { "marker": { 
"colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermap" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "#EBF0F8" }, "line": { "color": "white" } }, "header": { "fill": { "color": "#C8D4E3" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowcolor": "#2a3f5f", "arrowhead": 0, "arrowwidth": 1 }, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "colorscale": { "diverging": [ [ 0, "#8e0152" ], [ 0.1, "#c51b7d" ], [ 0.2, "#de77ae" ], [ 0.3, "#f1b6da" ], [ 0.4, "#fde0ef" ], [ 0.5, "#f7f7f7" ], [ 0.6, "#e6f5d0" ], [ 0.7, "#b8e186" ], [ 0.8, "#7fbc41" ], [ 0.9, "#4d9221" ], [ 1, "#276419" ] ], "sequential": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "sequentialminus": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 
0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ] }, "colorway": [ "#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52" ], "font": { "color": "#2a3f5f" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "#E5ECF6", "showlakes": true, "showland": true, "subunitcolor": "white" }, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "paper_bgcolor": "white", "plot_bgcolor": "#E5ECF6", "polar": { "angularaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "radialaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "scene": { "xaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "yaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "zaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" } }, "shapedefaults": { "line": { "color": "#2a3f5f" } }, "ternary": { "aaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "baxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "caxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "title": { "x": 0.05 }, "xaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 }, "yaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { 
"standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 } } }, "width": 800 } } }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Cluster on a 20-dimensional PCA projection; the 3-dimensional projection is used only for plotting\n", "new_feat = reduceDim(feat,method='PCA',dim = 20)\n", "d3_feat = reduceDim(feat,method='PCA',dim = 3)\n", "word_df = pd.DataFrame({\n", "    \"word\":sample_words,\n", "    \"color\":cluster(new_feat,n=4),\n", "    \"dim1\":d3_feat[:,0],\n", "    \"dim2\":d3_feat[:,1],\n", "    \"dim3\":d3_feat[:,2],\n", "\n", "})\n", "plotScatter3D(word_df)" ] }, { "cell_type": "markdown", "id": "5ac5d686", "metadata": {}, "source": [ "### Transformers Embeddings" ] }, { "cell_type": "markdown", "id": "c2e45969", "metadata": {}, "source": [ "#### Using the Sentence-Transformers package\n", "Reference: https://www.sbert.net/index.html" ] }, { "cell_type": "code", "execution_count": null, "id": "244afb48", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting sentence-transformers\n", " Downloading sentence_transformers-4.1.0-py3-none-any.whl.metadata (13 kB)\n", "Collecting transformers<5.0.0,>=4.41.0 (from sentence-transformers)\n", " Downloading transformers-4.51.3-py3-none-any.whl.metadata (38 kB)\n", "Requirement already satisfied: tqdm in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from sentence-transformers) (4.67.1)\n", "Requirement already satisfied: torch>=1.11.0 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from sentence-transformers) (2.4.1)\n", "Requirement already satisfied: scikit-learn in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from sentence-transformers) (1.4.0)\n", "Requirement already satisfied: scipy in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from sentence-transformers) (1.12.0)\n", "Requirement already satisfied: huggingface-hub>=0.20.0 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from sentence-transformers) (0.25.1)\n", "Requirement already satisfied: Pillow in 
c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from sentence-transformers) (11.1.0)\n", "Requirement already satisfied: typing_extensions>=4.5.0 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from sentence-transformers) (4.12.2)\n", "Requirement already satisfied: filelock in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from huggingface-hub>=0.20.0->sentence-transformers) (3.18.0)\n", "Requirement already satisfied: fsspec>=2023.5.0 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from huggingface-hub>=0.20.0->sentence-transformers) (2025.3.2)\n", "Requirement already satisfied: packaging>=20.9 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from huggingface-hub>=0.20.0->sentence-transformers) (24.2)\n", "Requirement already satisfied: pyyaml>=5.1 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from huggingface-hub>=0.20.0->sentence-transformers) (6.0.2)\n", "Requirement already satisfied: requests in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from huggingface-hub>=0.20.0->sentence-transformers) (2.32.3)\n", "Requirement already satisfied: sympy in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from torch>=1.11.0->sentence-transformers) (1.13.3)\n", "Requirement already satisfied: networkx in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from torch>=1.11.0->sentence-transformers) (3.4.2)\n", "Requirement already satisfied: jinja2 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from torch>=1.11.0->sentence-transformers) (3.1.6)\n", "Requirement already satisfied: colorama in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from tqdm->sentence-transformers) (0.4.6)\n", "Collecting huggingface-hub>=0.20.0 (from sentence-transformers)\n", " Downloading huggingface_hub-0.31.1-py3-none-any.whl.metadata (13 kB)\n", "Requirement already satisfied: numpy>=1.17 in 
c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from transformers<5.0.0,>=4.41.0->sentence-transformers) (1.24.3)\n", "Requirement already satisfied: regex!=2019.12.17 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from transformers<5.0.0,>=4.41.0->sentence-transformers) (2024.11.6)\n", "Collecting tokenizers<0.22,>=0.21 (from transformers<5.0.0,>=4.41.0->sentence-transformers)\n", " Downloading tokenizers-0.21.1-cp39-abi3-win_amd64.whl.metadata (6.9 kB)\n", "Collecting safetensors>=0.4.3 (from transformers<5.0.0,>=4.41.0->sentence-transformers)\n", " Downloading safetensors-0.5.3-cp38-abi3-win_amd64.whl.metadata (3.9 kB)\n", "Requirement already satisfied: joblib>=1.2.0 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from scikit-learn->sentence-transformers) (1.4.2)\n", "Requirement already satisfied: threadpoolctl>=2.0.0 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from scikit-learn->sentence-transformers) (3.6.0)\n", "Requirement already satisfied: MarkupSafe>=2.0 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from jinja2->torch>=1.11.0->sentence-transformers) (3.0.2)\n", "Requirement already satisfied: charset-normalizer<4,>=2 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from requests->huggingface-hub>=0.20.0->sentence-transformers) (3.3.2)\n", "Requirement already satisfied: idna<4,>=2.5 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from requests->huggingface-hub>=0.20.0->sentence-transformers) (3.7)\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from requests->huggingface-hub>=0.20.0->sentence-transformers) (2.3.0)\n", "Requirement already satisfied: certifi>=2017.4.17 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from requests->huggingface-hub>=0.20.0->sentence-transformers) (2025.1.31)\n", "Requirement already 
satisfied: mpmath<1.4,>=1.1.0 in c:\\users\\rolya\\anaconda3\\envs\\syllabus\\lib\\site-packages (from sympy->torch>=1.11.0->sentence-transformers) (1.3.0)\n", "Downloading sentence_transformers-4.1.0-py3-none-any.whl (345 kB)\n", "Downloading transformers-4.51.3-py3-none-any.whl (10.4 MB)\n", " ---------------------------------------- 0.0/10.4 MB ? eta -:--:--\n", " - -------------------------------------- 0.3/10.4 MB ? eta -:--:--\n", " --- ------------------------------------ 0.8/10.4 MB 2.6 MB/s eta 0:00:04\n", " ------ --------------------------------- 1.6/10.4 MB 3.0 MB/s eta 0:00:03\n", " ------- -------------------------------- 1.8/10.4 MB 2.5 MB/s eta 0:00:04\n", " ---------- ----------------------------- 2.6/10.4 MB 2.7 MB/s eta 0:00:03\n", " ------------ --------------------------- 3.1/10.4 MB 2.8 MB/s eta 0:00:03\n", " --------------- ------------------------ 3.9/10.4 MB 2.9 MB/s eta 0:00:03\n", " ------------------ --------------------- 4.7/10.4 MB 2.9 MB/s eta 0:00:02\n", " -------------------- ------------------- 5.2/10.4 MB 3.0 MB/s eta 0:00:02\n", " ------------------------ --------------- 6.3/10.4 MB 3.1 MB/s eta 0:00:02\n", " --------------------------- ------------ 7.1/10.4 MB 3.2 MB/s eta 0:00:02\n", " ----------------------------- ---------- 7.6/10.4 MB 3.2 MB/s eta 0:00:01\n", " ------------------------------- -------- 8.1/10.4 MB 3.1 MB/s eta 0:00:01\n", " --------------------------------- ------ 8.7/10.4 MB 3.1 MB/s eta 0:00:01\n", " ------------------------------------- -- 9.7/10.4 MB 3.2 MB/s eta 0:00:01\n", " ---------------------------------------- 10.4/10.4 MB 3.2 MB/s eta 0:00:00\n", "Downloading huggingface_hub-0.31.1-py3-none-any.whl (484 kB)\n", "Downloading safetensors-0.5.3-cp38-abi3-win_amd64.whl (308 kB)\n", "Downloading tokenizers-0.21.1-cp39-abi3-win_amd64.whl (2.4 MB)\n", " ---------------------------------------- 0.0/2.4 MB ? eta -:--:--\n", " ---- ----------------------------------- 0.3/2.4 MB ? 
eta -:--:--\n", " -------- ------------------------------- 0.5/2.4 MB 989.2 kB/s eta 0:00:02\n", " ------------ --------------------------- 0.8/2.4 MB 1.4 MB/s eta 0:00:02\n", " ----------------- ---------------------- 1.0/2.4 MB 1.3 MB/s eta 0:00:02\n", " ------------------------- -------------- 1.6/2.4 MB 1.5 MB/s eta 0:00:01\n", " ---------------------------------- ----- 2.1/2.4 MB 1.7 MB/s eta 0:00:01\n", " ---------------------------------------- 2.4/2.4 MB 1.7 MB/s eta 0:00:00\n", "Installing collected packages: safetensors, huggingface-hub, tokenizers, transformers, sentence-transformers\n", " Attempting uninstall: huggingface-hub\n", " Found existing installation: huggingface-hub 0.25.1\n", " Uninstalling huggingface-hub-0.25.1:\n", " Successfully uninstalled huggingface-hub-0.25.1\n", "Successfully installed huggingface-hub-0.31.1 safetensors-0.5.3 sentence-transformers-4.1.0 tokenizers-0.21.1 transformers-4.51.3\n" ] } ], "source": [ "#!pip install -U sentence-transformers" ] }, { "cell_type": "code", "execution_count": 50, "id": "9fd5635e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From c:\\Users\\rolya\\anaconda3\\envs\\Syllabus\\Lib\\site-packages\\keras\\src\\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.\n", "\n" ] } ], "source": [ "from sentence_transformers import SentenceTransformer, models, util" ] }, { "cell_type": "markdown", "id": "fe816241", "metadata": {}, "source": [ "#### Small model, using BERT as an example" ] }, { "cell_type": "markdown", "id": "473a459b", "metadata": {}, "source": [ "Chinese: bert-base-chinese" ] }, { "cell_type": "code", "execution_count": null, "id": "3fb7ca1c", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "No sentence-transformers model found with name google-bert/bert-base-chinese. 
Creating a new one with mean pooling.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "1e010457b0b44786a271e7a2c7032dbe", "version_major": 2, "version_minor": 0 }, "text/plain": [ "config.json: 0%| | 0.00/624 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
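<table><tr><th></th><th>sentence1</th><th>sentence2</th><th>score</th></tr>
<tr><th>0</th><td>今天天氣很好。</td><td>今天是個晴空萬里的好天氣。</td><td>0.914067</td></tr>
<tr><th>1</th><td>今天天氣很好。</td><td>我晚上想去公園散步。</td><td>0.787589</td></tr>
<tr><th>2</th><td>今天是個晴空萬里的好天氣。</td><td>我晚上想去公園散步。</td><td>0.759642</td></tr></table>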
\n", "" ], "text/plain": [ " sentence1 sentence2 score\n", "0 今天天氣很好。 今天是個晴空萬里的好天氣。 0.914067\n", "1 今天天氣很好。 我晚上想去公園散步。 0.787589\n", "2 今天是個晴空萬里的好天氣。 我晚上想去公園散步。 0.759642" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Example sentences\n", "sentences = [\n", "    \"今天天氣很好。\",\n", "    \"今天是個晴空萬里的好天氣。\",\n", "    \"我晚上想去公園散步。\"\n", "]\n", "\n", "# Use encode() to embed the sentences\n", "embeddings_ch = bert_ch.encode(sentences)\n", "\n", "# Compute cosine-similarities\n", "cosine_scores = util.cos_sim(embeddings_ch, embeddings_ch)\n", "\n", "# Collect the cosine similarity score of every sentence pair\n", "result = []\n", "for i in range(len(sentences)):\n", "    for j in range(i+1, len(sentences)):\n", "        result.append([sentences[i], sentences[j], cosine_scores[i][j].item()])\n", "\n", "result_df = pd.DataFrame(result, columns=[\"sentence1\", \"sentence2\", \"score\"])\n", "result_df.sort_values(\"score\", ascending = False)" ] }, { "cell_type": "markdown", "id": "3417de4d", "metadata": {}, "source": [ "### Using embeddings for NLP tasks" ] }, { "cell_type": "markdown", "id": "24e475d1", "metadata": {}, "source": [ "#### Similar documents" ] }, { "cell_type": "code", "execution_count": 58, "id": "20e70d68", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\rolya\\AppData\\Local\\Temp\\ipykernel_13724\\852741673.py:2: SettingWithCopyWarning:\n", "\n", "\n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", "\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
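<table><tr><th></th><th>system_id</th><th>artTitle</th><th>artContent</th></tr>
<tr><th>0</th><td>1</td><td>[求助]真的可以不給看手機嗎?</td><td>常常看大家說偷看手機是不對的但如果已經結婚了太太想看你手機真的可以拒絕嗎感覺你拒絕就是心裡有...</td></tr>
<tr><th>1</th><td>2</td><td>Re:老公工作不穩定</td><td>因為男方工作不穩定房東才不肯租只好換女生承租對吧很殘忍的說這無緣的孩子聰明來到這世間只是苦難...</td></tr>
<tr><th>2</th><td>3</td><td>Re:[求助]真的可以不給看手機嗎?</td><td>手機要看就給看啊先帝爺不是說一隻不夠不能辦兩隻嗎兩隻不夠不能辦三隻嗎三隻四隻不夠可以辦五隻十...</td></tr>
<tr><th>3</th><td>4</td><td>[心情]我搞不懂老公到底在想甚麼</td><td>其實都是小事但都可以吵到離婚可能我們就是幾歲的小孩昨晚上床睡覺後一直覺得很冷老公也喊冷想說睡...</td></tr>
<tr><th>4</th><td>5</td><td>Re:[心情]我搞不懂老公到底在想甚麼</td><td>把棉被翻好正面嗯嗯有嗯那你幹嘛抱怨你老公不是誰上床睡覺還會檢查棉被正反的嗎我我也覺得你半夜叫...</td></tr></table>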
\n", "" ], "text/plain": [ " system_id artTitle \\\n", "0 1 [求助]真的可以不給看手機嗎? \n", "1 2 Re:老公工作不穩定 \n", "2 3 Re:[求助]真的可以不給看手機嗎? \n", "3 4 [心情]我搞不懂老公到底在想甚麼 \n", "4 5 Re:[心情]我搞不懂老公到底在想甚麼 \n", "\n", " artContent \n", "0 常常看大家說偷看手機是不對的但如果已經結婚了太太想看你手機真的可以拒絕嗎感覺你拒絕就是心裡有... \n", "1 因為男方工作不穩定房東才不肯租只好換女生承租對吧很殘忍的說這無緣的孩子聰明來到這世間只是苦難... \n", "2 手機要看就給看啊先帝爺不是說一隻不夠不能辦兩隻嗎兩隻不夠不能辦三隻嗎三隻四隻不夠可以辦五隻十... \n", "3 其實都是小事但都可以吵到離婚可能我們就是幾歲的小孩昨晚上床睡覺後一直覺得很冷老公也喊冷想說睡... \n", "4 把棉被翻好正面嗯嗯有嗯那你幹嘛抱怨你老公不是誰上床睡覺還會檢查棉被正反的嗎我我也覺得你半夜叫... " ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Take an explicit copy so the assignment below does not raise SettingWithCopyWarning\n", "df_similar = origin_data[['system_id','artTitle', 'artContent']].copy()\n", "# Keep only CJK characters (U+4E00 to U+9FFF) in the article text\n", "df_similar['artContent'] = df_similar['artContent'].apply(lambda x: re.sub('[^\\u4e00-\\u9fff]+', '',x))\n", "\n", "df_similar.head(5)" ] }, { "cell_type": "markdown", "id": "0d5fac26", "metadata": {}, "source": [ "Use bert-base-chinese for the demonstration." ] }, { "cell_type": "markdown", "id": "e91d233d", "metadata": {}, "source": [ "Compute embeddings for the entire corpus." ] }, { "cell_type": "code", "execution_count": 60, "id": "fc238105", "metadata": {}, "outputs": [], "source": [ "corpus_embeddings = bert_ch.encode(\n", "    df_similar['artContent'],\n", "    convert_to_tensor=True,\n", "    batch_size=32\n", ")" ] }, { "cell_type": "code", "execution_count": 61, "id": "3781570f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "======================\n", "\n", "\n", "Query: Re:[閒聊]離婚的掙扎\n", "\n", " 資料集中前五相似的文章:\n", "Re:[閒聊]離婚的掙扎 (Score: 1.0000)\n", "Re:兩人的溝通與目前的情況 (Score: 0.9643)\n", "Re:[心情]人生再重來一次,我不會生小孩 (Score: 0.9635)\n", "Re:[求助]先生的女助理 (Score: 0.9628)\n", "Re:[心情]老公喝酒不懂克制 (Score: 0.9627)\n", "\n", "\n", "======================\n", "\n", "\n" ] } ], "source": [ "query_num = 6 # index of the query article\n", "\n", "# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity\n", "top_k = 5\n", "\n", "\n", "query_embedding = bert_ch.encode(df_similar['artContent'][query_num], 
convert_to_tensor=True)\n", "\n", "# We use cosine-similarity and torch.topk to find the highest 5 scores\n", "cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]\n", "top_results = torch.topk(cos_scores, k=top_k)\n", "\n", "print(\"\\n\\n======================\\n\\n\")\n", "print(\"Query:\", df_similar['artTitle'][query_num])\n", "print(\"\\n 資料集中前五相似的文章:\")\n", "\n", "# The query article is itself in the corpus, so it is returned as the top match\n", "for score, idx in zip(top_results[0], top_results[1]):\n", "    print(df_similar['artTitle'][idx.item()], \"(Score: {:.4f})\".format(score))\n", "\n", "print(\"\\n\\n======================\\n\\n\")" ] }, { "cell_type": "code", "execution_count": 62, "id": "1ff9051d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "======================\n", "\n", "\n", "Query: [閒聊]美女朋友婚後一直找人約砲\n", "\n", " 資料集中前五相似的文章:\n", "[閒聊]美女朋友婚後一直找人約砲 (Score: 1.0000)\n", "該繼續挽回,還是該放手了? (Score: 0.9864)\n", "[求助]精神出軌後成功修復感情經驗? (Score: 0.9856)\n", "[閒聊]人妻外食被搞懷孕後,繼續外食? (Score: 0.9849)\n", "Re:[閒聊]老公有異性友人 (Score: 0.9847)\n", "\n", "\n", "======================\n", "\n", "\n" ] } ], "source": [ "query_num = 30\n", "\n", "top_k = 5\n", "\n", "query_embedding = bert_ch.encode(df_similar['artContent'][query_num], convert_to_tensor=True)\n", "\n", "# We use cosine-similarity and torch.topk to find the highest 5 scores\n", "cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]\n", "top_results = torch.topk(cos_scores, k=top_k)\n", "\n", "print(\"\\n\\n======================\\n\\n\")\n", "print(\"Query:\", df_similar['artTitle'][query_num])\n", "print(\"\\n 資料集中前五相似的文章:\")\n", "\n", "for score, idx in zip(top_results[0], top_results[1]):\n", "    print(df_similar['artTitle'][idx.item()], \"(Score: {:.4f})\".format(score))\n", "\n", "print(\"\\n\\n======================\\n\\n\")" ] }, { "cell_type": "markdown", "id": "00bb07f0", "metadata": {}, "source": [ "### Classification task\n", "Use the bert-base-chinese model to compute embeddings for the PTT corpus, then train a classifier (see the week 7 code)." ] }, { "cell_type": "code", "execution_count": 63, "id": "c2e04f29", "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LogisticRegression" ] }, { "cell_type": "code", "execution_count": 64, "id": "75cdf073", "metadata": {}, "outputs": [], "source": [ "from sentence_transformers import SentenceTransformer, models, util" ] }, { "cell_type": "markdown", "id": "fe581be5", "metadata": {}, "source": [ "- Boards: marriage (婚姻版), GetMarry (結婚版), Gossiping (八卦版)\n", "- Search/exclusion keywords on the workflow platform:\n", "\n", "Article counts per board:\n", "\n", "- Gossiping: 3380\n", "- GetMarry: 2531\n", "- marriage: 2204" ] }, { "cell_type": "code", "execution_count": 71, "id": "17045635", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Read the three CSV files\n", "df1 = pd.read_csv('/Users/rolya/Desktop/DIV/divoce_project2/text.csv')\n", "df2 = pd.read_csv('/Users/rolya/Desktop/DIV/divoce_project2/GetMarry.csv')\n", "df3 = pd.read_csv('/Users/rolya/Desktop/DIV/divoce_project2/Gossiping.csv')\n", "\n", "# Concatenate them into a single DataFrame\n", "merged_df = pd.concat([df1, df2, df3], ignore_index=True)\n", "\n", "# Save the result as merge.csv\n", "merged_df.to_csv('merge.csv', index=False)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "981028f4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " system_id artUrl \\\n", "0 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "1 2 https://www.ptt.cc/bbs/marriage/M.1610162736.A... \n", "2 3 https://www.ptt.cc/bbs/marriage/M.1610190309.A... \n", "\n", " artTitle artDate artPoster artCatagory \\\n", "0 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 safelove marriage \n", "1 Re:老公工作不穩定 2021-01-09 11:25:34 mayko marriage \n", "2 Re:[求助]真的可以不給看手機嗎? 2021-01-09 19:05:00 loser1 marriage \n", "\n", " artContent \\\n", "0 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... \n", "1 因為男方工作不穩定,房東才不肯租,只好換女生承租,對吧\\n很殘忍的說...這無緣的孩子聰明,... \n", "2 手機要看就給看啊!\\n先帝爺不是說,\\n一隻不夠,不能辦兩隻嗎?\\n兩隻不夠,不能辦三隻嗎?... \n", "\n", " artComment e_ip \\\n", "0 [{\"cmtStatus\": \"推\", \"cmtPoster\": \"FlyOncidium\"... 114.137.169.105 \n", "1 [] 36.229.84.229 \n", "2 [{\"cmtStatus\": \"噓\", \"cmtPoster\": \"mark0204\", \"... 118.170.238.138 \n", "\n", " insertedDate dataSource \n", "0 2021-01-10 01:20:56 ptt \n", "1 2021-01-10 01:20:56 ptt \n", "2 2021-01-10 01:20:56 ptt " ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "div = pd.read_csv(\"merge.csv\")\n", "div.head(3)" ] }, { "cell_type": "markdown", "id": "7dec8c37", "metadata": {}, "source": [ "## Data cleaning\n", "Drop rows with a missing title or body, strip URLs, and keep only Chinese characters.\n", "\n", "The title and body of each article are then concatenated into a single `content` column for analysis." ] }, { "cell_type": "code", "execution_count": null, "id": "f566c38c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " content \\\n", "0 求助真的可以不給看手機嗎常常看大家說偷看手機是不對的但如果已經結婚了太太想看你手機真的可以拒... \n", "1 老公工作不穩定因為男方工作不穩定房東才不肯租只好換女生承租對吧很殘忍的說這無緣的孩子聰明來到... \n", "2 求助真的可以不給看手機嗎手機要看就給看啊先帝爺不是說一隻不夠不能辦兩隻嗎兩隻不夠不能辦三隻嗎... \n", "3 心情我搞不懂老公到底在想甚麼其實都是小事但都可以吵到離婚可能我們就是幾歲的小孩昨晚上床睡覺後... \n", "4 心情我搞不懂老公到底在想甚麼把棉被翻好正面嗯嗯有嗯那你幹嘛抱怨你老公不是誰上床睡覺還會檢查棉... \n", "\n", " artUrl artCatagory \n", "0 https://www.ptt.cc/bbs/marriage/M.1610159827.A... marriage \n", "1 https://www.ptt.cc/bbs/marriage/M.1610162736.A... marriage \n", "2 https://www.ptt.cc/bbs/marriage/M.1610190309.A... marriage \n", "3 https://www.ptt.cc/bbs/marriage/M.1610193770.A... marriage \n", "4 https://www.ptt.cc/bbs/marriage/M.1610203445.A... marriage " ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Drop rows with a missing title or body\n", "div = div.dropna(subset=['artTitle'])\n", "div = div.dropna(subset=['artContent'])\n", "# Strip URLs\n", "div[\"artContent\"] = div.artContent.apply(\n", "    lambda x: re.sub(\"(http|https)://.*\", \"\", x)\n", ")\n", "div[\"artTitle\"] = div[\"artTitle\"].apply(\n", "    lambda x: re.sub(\"(http|https)://.*\", \"\", x)\n", ")\n", "# Keep only Chinese characters\n", "div[\"artContent\"] = div.artContent.apply(\n", "    lambda x: re.sub(\"[^\\u4e00-\\u9fa5]+\", \"\", x)\n", ")\n", "div[\"artTitle\"] = div[\"artTitle\"].apply(\n", "    lambda x: re.sub(\"[^\\u4e00-\\u9fa5]+\", \"\", x)\n", ")\n", "\n", "# Build the content column from title + body\n", "div[\"content\"] = div[\"artTitle\"] + div[\"artContent\"]\n", "div = div[[\"content\", \"artUrl\", \"artCatagory\"]]  # article text, article URL, board label\n", "div.head()" ] }, { "cell_type": "markdown", "id": "5b0a2414", "metadata": {}, "source": [ "### Encode with BERT" ] }, { "cell_type": "code", "execution_count": null, "id": "78700878", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " content \\\n", "0 求助真的可以不給看手機嗎常常看大家說偷看手機是不對的但如果已經結婚了太太想看你手機真的可以拒... \n", "1 老公工作不穩定因為男方工作不穩定房東才不肯租只好換女生承租對吧很殘忍的說這無緣的孩子聰明來到... \n", "2 求助真的可以不給看手機嗎手機要看就給看啊先帝爺不是說一隻不夠不能辦兩隻嗎兩隻不夠不能辦三隻嗎... \n", "\n", " artUrl artCatagory \\\n", "0 https://www.ptt.cc/bbs/marriage/M.1610159827.A... marriage \n", "1 https://www.ptt.cc/bbs/marriage/M.1610162736.A... marriage \n", "2 https://www.ptt.cc/bbs/marriage/M.1610190309.A... marriage \n", "\n", " embeddings \n", "0 [0.8504211, -0.23771943, -0.14535092, 0.156168... \n", "1 [0.7396465, -0.14757201, -0.057030175, 0.26861... \n", "2 [0.4144313, -0.095341206, -0.2705029, 0.380942... " ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "div[\"embeddings\"] = div.content.apply(lambda x: bert_ch.encode(x))\n", "div.head(3)" ] }, { "cell_type": "code", "execution_count": 75, "id": "48df0ffb", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from ast import literal_eval" ] }, { "cell_type": "code", "execution_count": null, "id": "51a5d912", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0 1 2 3 4 5 6 \\\n", "5325 0.578834 -0.050640 -0.503452 0.114320 0.008045 0.006949 -0.056759 \n", "6304 0.393490 0.062839 -0.422073 0.157619 0.078032 -0.188103 0.018693 \n", "1659 0.629233 -0.231260 -0.266918 0.320200 -0.087541 -0.173598 -0.265297 \n", "1280 0.670054 -0.171460 -0.248393 0.444933 -0.170084 -0.130054 -0.090347 \n", "291 0.621767 -0.093077 -0.131523 0.232634 -0.262020 -0.321169 -0.187148 \n", "\n", " 7 8 9 ... 758 759 760 \\\n", "5325 0.107957 -0.071790 -0.321657 ... -0.201785 -0.409468 0.127512 \n", "6304 0.168009 -0.057865 -0.318298 ... -0.059410 -0.456915 0.209086 \n", "1659 0.012857 -0.264234 -0.477913 ... 0.013435 -0.425337 0.112885 \n", "1280 -0.076344 -0.472568 -0.343588 ... -0.273166 -0.459275 0.273230 \n", "291 -0.086306 -0.223639 -0.403096 ... 
-0.066033 -0.425651 0.224642 \n", "\n", " 761 762 763 764 765 766 767 \n", "5325 0.211407 -0.226111 0.095762 -0.286687 0.348344 0.171068 0.239167 \n", "6304 -0.012599 -0.240362 0.116647 -0.470287 0.465576 0.112342 0.218261 \n", "1659 0.025797 -0.122397 0.304335 -0.274302 0.218617 0.328430 0.069736 \n", "1280 -0.121515 0.074967 0.099266 -0.279123 0.051970 0.352049 -0.089615 \n", "291 -0.060458 -0.125518 0.139101 -0.270676 0.286504 0.216595 0.045877 \n", "\n", "[5 rows x 768 columns]\n", "5325 Gossiping\n", "6304 Gossiping\n", "1659 marriage\n", "1280 marriage\n", "291 marriage\n", "Name: artCatagory, dtype: object\n" ] } ], "source": [ "data = div.copy()\n", "\n", "X = data[\"embeddings\"].apply(pd.Series)\n", "y = data[\"artCatagory\"]\n", "\n", "# Split the dataset 70/30 into training and test sets\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    X, y, test_size=0.3, random_state=777\n", ")\n", "\n", "print(X_train.head())\n", "print(y_train.head())" ] }, { "cell_type": "code", "execution_count": 77, "id": "ef7fcdd6", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "c:\\Users\\rolya\\anaconda3\\envs\\Syllabus\\Lib\\site-packages\\sklearn\\linear_model\\_logistic.py:469: ConvergenceWarning:\n", "\n", "lbfgs failed to converge (status=1):\n", "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n", "\n", "Increase the number of iterations (max_iter) or scale the data as shown in:\n", " https://scikit-learn.org/stable/modules/preprocessing.html\n", "Please also refer to the documentation for alternative solver options:\n", " https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n", "\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "LogisticRegression()" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Default settings (max_iter=100) produce the ConvergenceWarning shown in this cell's output;\n", "# raising max_iter or scaling the features is the remedy the warning itself suggests.\n", "clf = LogisticRegression()\n", "clf.fit(X_train, y_train)\n", "clf" ] }, { "cell_type": "code", "execution_count": 78, "id": "53f8dd9e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Gossiping' 'Gossiping' 'Gossiping' 'marriage' 'GetMarry' 'marriage'\n", " 'marriage' 'GetMarry' 'marriage' 'Gossiping']\n" ] } ], "source": [ "y_pred = clf.predict(X_test)\n", "y_pred_proba = clf.predict_proba(X_test)\n", "print(y_pred[:10])" ] }, { "cell_type": "code", "execution_count": 79, "id": "ef647d78", "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import classification_report" ] }, { "cell_type": "code", "execution_count": 80, "id": "707f94a0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " GetMarry 1.00 1.00 1.00 768\n", " Gossiping 0.99 0.99 0.99 1011\n", " marriage 0.99 0.99 0.99 655\n", "\n", " accuracy 0.99 2434\n", " macro avg 0.99 0.99 0.99 2434\n", "weighted avg 0.99 0.99 0.99 2434\n", "\n" ] } ], "source": [ "# Accuracy, precision, recall, and F1-score on the held-out test set\n", "print(classification_report(y_test, y_pred))" ] }, { "cell_type": "markdown", "id": "bb0e8f1f", "metadata": {}, "source": [ "## Use the trained classifier to predict the board of articles from a different time period" ] }, { "cell_type": "code", "execution_count": 100, "id": "08776b55", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " system_id artUrl \\\n", "0 1 https://www.ptt.cc/bbs/GetMarry/M.1653965335.A... \n", "1 2 https://www.ptt.cc/bbs/GetMarry/M.1653974818.A... \n", "2 3 https://www.ptt.cc/bbs/GetMarry/M.1654041655.A... \n", "3 4 https://www.ptt.cc/bbs/GetMarry/M.1654042316.A... \n", "4 5 https://www.ptt.cc/bbs/GetMarry/M.1654050290.A... \n", "... ... ... \n", "5222 1375 https://www.ptt.cc/bbs/marriage/M.1672325600.A... \n", "5223 1376 https://www.ptt.cc/bbs/marriage/M.1672369383.A... \n", "5224 1377 https://www.ptt.cc/bbs/marriage/M.1672444694.A... \n", "5225 1378 https://www.ptt.cc/bbs/marriage/M.1672455232.A... \n", "5226 1379 https://www.ptt.cc/bbs/marriage/M.1672479128.A... \n", "\n", " artTitle artDate artPoster \\\n", "0 [推薦]彰化_DeerHer客製手工喜餅 2022-05-31 10:48:52 pieme \n", "1 [分享]台北/荳蔻攝影工作室婚紗照 2022-05-31 13:26:56 ajjhhj \n", "2 [請益]Oohlalove喜餅品項選擇 2022-06-01 08:00:53 love07erika \n", "3 [贈送]白色及粉色小禮服 2022-06-01 08:11:54 cherishposse \n", "4 [廣宣]OohLaLove喜餅 2022-06-01 10:24:48 michael9586 \n", "... ... ... ... \n", "5222 Re:[心情]果然跟版上說的一樣,還是得離婚... 2022-12-29 22:53:18 aass5566 \n", "5223 Re:[求助]老公偷拍女同事腿 2022-12-30 11:03:00 GunOfWind \n", "5224 Re:[閒聊]另一半的家庭觀念 2022-12-31 07:58:12 magicbook123 \n", "5225 Re:[閒聊]另一半的家庭觀念 2022-12-31 10:53:50 lamabclamabc \n", "5226 [閒聊]離婚,關於結婚金飾 2022-12-31 17:32:06 penchlin \n", "\n", " artCatagory artContent \\\n", "0 GetMarry 廠商所在地區:彰化\\n\\n是什麼場合用到:訂結婚_111/5\\n\\n廠商名稱:\\n喜餅:\\n... \n", "1 GetMarry 剛開始查婚紗資訊,真的好討厭傳統婚紗店的組數限制,或是各種無止境加購方案,也好\\n怕遇到纏人... \n", "2 GetMarry 選擇障礙的新娘來求助了!\\n\\nC區的選擇掙扎到要給品項的deadline…\\n目前確定抹茶... \n", "3 GetMarry 《洽中,暫勿來信》\\n\\n已結婚完幾年了才發現還有兩件小禮服擱置在家裡,因家人暫時也用不到了... \n", "4 GetMarry 新人or廠商所在地區:台中\\n\\n是屬於新人哪種場合:結婚 2022/12/18\\n\\n\\n... \n", "... ... ... \n", "5222 marriage 感謝這個決定?\\n\\n我是覺得根本是被這個諮商害到了吧?\\n\\n如果不是弄這個諮商\\n\\n早... \n", "5223 marriage 先說結論 離婚吧\\n沒有小孩 你不能接受 就離婚 +1 吧\\n程度問題\\n就像有些人會... \n", "5224 marriage 首先,有問題不要上來發文\\n\\n這裡是離婚板 不會給什麼好建議 開口先喊離婚\\n\\n就算有正... 
\n", "5225 marriage 原PO我有看你舊文,也有留意你的補充和回文裏的推文,看得出來你真的很困擾。\\n\\n很高興你決... \n", "5226 marriage 想請教各位,怎樣比較合理\\n1、男女各拿回自己買的\\n2、男女各拿回送給對方的\\n3、其他\\n \n", "\n", " artComment e_ip \\\n", "0 [] 36.232.149.35 \n", "1 [{\"cmtStatus\": \"推\", \"cmtPoster\": \"exorcist1\", ... 223.141.4.215 \n", "2 [{\"cmtStatus\": \"推\", \"cmtPoster\": \"mimiwei955\",... 111.82.79.25 \n", "3 [] 101.10.0.149 \n", "4 [{\"cmtStatus\": \"推\", \"cmtPoster\": \"kenkao25\", \"... 223.138.172.86 \n", "... ... ... \n", "5222 [{\"cmtStatus\": \"推\", \"cmtPoster\": \"wts4832\", \"c... 49.158.132.119 \n", "5223 [{\"cmtStatus\": \"推\", \"cmtPoster\": \"robertdelun\"... 125.227.145.31 \n", "5224 [{\"cmtStatus\": \"推\", \"cmtPoster\": \"mtyc\", \"cmtC... 223.137.86.71 \n", "5225 [{\"cmtStatus\": \"推\", \"cmtPoster\": \"lastever\", \"... 155.137.208.19 \n", "5226 [{\"cmtStatus\": \"噓\", \"cmtPoster\": \"wnwe\", \"cmtC... 180.217.44.247 \n", "\n", " insertedDate dataSource \n", "0 2022-06-01 01:14:50 ptt \n", "1 2022-06-01 01:14:50 ptt \n", "2 2022-06-02 01:14:43 ptt \n", "3 2022-06-02 01:14:43 ptt \n", "4 2022-06-02 01:14:43 ptt \n", "... ... ... \n", "5222 2022-12-30 01:57:14 ptt \n", "5223 2022-12-31 01:52:50 ptt \n", "5224 2023-01-01 01:51:58 ptt \n", "5225 2023-01-01 01:51:58 ptt \n", "5226 2023-01-01 01:51:58 ptt \n", "\n", "[5156 rows x 11 columns]" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "file_list = [\"GetMarry_t.csv\", \"Gossiping_t.csv\", \"marriage_t.csv\"]\n", "dfs = [pd.read_csv(f) for f in file_list]\n", "ct = pd.concat(dfs, ignore_index=True)\n", "ct.dropna(inplace=True)\n", "ct" ] }, { "cell_type": "code", "execution_count": 101, "id": "e783ecec", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " content \\\n", "0 推薦彰化客製手工喜餅廠商所在地區彰化是什麼場合用到訂結婚廠商名稱喜餅手工喜餅聯絡資訊官網訂購... \n", "1 分享台北荳蔻攝影工作室婚紗照剛開始查婚紗資訊真的好討厭傳統婚紗店的組數限制或是各種無止境加購... \n", "2 請益喜餅品項選擇選擇障礙的新娘來求助了區的選擇掙扎到要給品項的目前確定抹茶鹽之花另一個想要檸... \n", "3 贈送白色及粉色小禮服洽中暫勿來信已結婚完幾年了才發現還有兩件小禮服擱置在家裡因家人暫時也用不... \n", "4 廣宣喜餅新人廠商所在地區台中是屬於新人哪種場合結婚以上三項依照要推薦的廠商類別而填寫不得不填... \n", "\n", " artUrl artCatagory \n", "0 https://www.ptt.cc/bbs/GetMarry/M.1653965335.A... GetMarry \n", "1 https://www.ptt.cc/bbs/GetMarry/M.1653974818.A... GetMarry \n", "2 https://www.ptt.cc/bbs/GetMarry/M.1654041655.A... GetMarry \n", "3 https://www.ptt.cc/bbs/GetMarry/M.1654042316.A... GetMarry \n", "4 https://www.ptt.cc/bbs/GetMarry/M.1654050290.A... GetMarry " ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Drop rows with a missing title or body\n", "ct = ct.dropna(subset=['artTitle'])\n", "ct = ct.dropna(subset=['artContent'])\n", "# Strip URLs\n", "ct[\"artContent\"] = ct.artContent.apply(\n", "    lambda x: re.sub(\"(http|https)://.*\", \"\", x)\n", ")\n", "ct[\"artTitle\"] = ct[\"artTitle\"].apply(\n", "    lambda x: re.sub(\"(http|https)://.*\", \"\", x)\n", ")\n", "# Keep only Chinese characters\n", "ct[\"artContent\"] = ct.artContent.apply(\n", "    lambda x: re.sub(\"[^\\u4e00-\\u9fa5]+\", \"\", x)\n", ")\n", "ct[\"artTitle\"] = ct[\"artTitle\"].apply(\n", "    lambda x: re.sub(\"[^\\u4e00-\\u9fa5]+\", \"\", x)\n", ")\n", "\n", "# Build the content column from title + body\n", "ct[\"content\"] = ct[\"artTitle\"] + ct[\"artContent\"]\n", "ct = ct[[\"content\", \"artUrl\", \"artCatagory\"]]  # article text, article URL, board label\n", "ct.head()\n" ] }, { "cell_type": "code", "execution_count": 102, "id": "755411b3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "artCatagory\n", "GetMarry 2301\n", "Gossiping 1508\n", "marriage 1347\n", "Name: count, dtype: int64" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ct['artCatagory'].value_counts()" ] }, { "cell_type": "code", "execution_count": 103, "id": "186f0382", "metadata": {}, "outputs": [ { "data": { 
"text/html": [ "
" ], "text/plain": [ " content \\\n", "0 推薦彰化客製手工喜餅廠商所在地區彰化是什麼場合用到訂結婚廠商名稱喜餅手工喜餅聯絡資訊官網訂購... \n", "1 分享台北荳蔻攝影工作室婚紗照剛開始查婚紗資訊真的好討厭傳統婚紗店的組數限制或是各種無止境加購... \n", "2 請益喜餅品項選擇選擇障礙的新娘來求助了區的選擇掙扎到要給品項的目前確定抹茶鹽之花另一個想要檸... \n", "\n", " artUrl artCatagory \\\n", "0 https://www.ptt.cc/bbs/GetMarry/M.1653965335.A... GetMarry \n", "1 https://www.ptt.cc/bbs/GetMarry/M.1653974818.A... GetMarry \n", "2 https://www.ptt.cc/bbs/GetMarry/M.1654041655.A... GetMarry \n", "\n", " embeddings \n", "0 [0.5550073, -0.20386253, -0.3193656, 0.1848941... \n", "1 [0.50719243, -0.0016513392, -0.3523839, 0.2919... \n", "2 [0.5617253, -0.14796321, -0.42836052, 0.094054... " ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ct[\"embeddings\"] = ct.content.apply(lambda x: bert_ch.encode(x))\n", "ct.head(3)" ] }, { "cell_type": "code", "execution_count": 104, "id": "b2934c33", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " GetMarry 1.00 0.99 1.00 2301\n", " Gossiping 0.99 0.97 0.98 1508\n", " marriage 0.97 0.99 0.98 1347\n", "\n", " accuracy 0.99 5156\n", " macro avg 0.99 0.99 0.99 5156\n", "weighted avg 0.99 0.99 0.99 5156\n", "\n" ] } ], "source": [ "X = ct[\"embeddings\"].apply(pd.Series)\n", "y = ct['artCatagory']\n", "\n", "y_pred = clf.predict(X)\n", "print(classification_report(y, y_pred))\n", "\n", "# Store the predictions so the misclassified rows can be queried in the following cells\n", "ct[\"pred\"] = y_pred" ] }, { "cell_type": "code", "execution_count": 109, "id": "72aebfa9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " artCatagory pred\n", "82 GetMarry marriage\n", "968 GetMarry Gossiping\n", "975 GetMarry marriage\n", "1103 GetMarry Gossiping\n", "1108 GetMarry Gossiping\n", "1168 GetMarry marriage\n", "1396 GetMarry Gossiping\n", "1789 GetMarry Gossiping\n", "1948 GetMarry marriage\n", "2046 GetMarry Gossiping\n", "2109 GetMarry marriage\n", "2176 GetMarry marriage\n", "2381 Gossiping marriage\n", "2437 Gossiping marriage\n", "2454 Gossiping marriage\n", "2456 Gossiping marriage\n", "2465 Gossiping GetMarry\n", "2466 Gossiping marriage\n", "2474 Gossiping marriage\n", "2545 Gossiping marriage\n", "2607 Gossiping marriage\n", "2646 Gossiping GetMarry\n", "2718 Gossiping marriage\n", "2838 Gossiping marriage\n", "2860 Gossiping marriage\n", "2896 Gossiping marriage\n", "2946 Gossiping marriage\n", "2992 Gossiping marriage\n", "3221 Gossiping marriage\n", "3226 Gossiping GetMarry\n", "3317 Gossiping marriage\n", "3383 Gossiping marriage\n", "3408 Gossiping GetMarry\n", "3425 Gossiping marriage\n", "3428 Gossiping marriage\n", "3501 Gossiping marriage\n", "3521 Gossiping GetMarry\n", "3566 Gossiping GetMarry\n", "3615 Gossiping marriage\n", "3617 Gossiping GetMarry\n", "3630 Gossiping marriage\n", "3657 Gossiping marriage\n", "3664 Gossiping marriage\n", "3669 Gossiping marriage\n", "3671 Gossiping GetMarry\n", "3714 Gossiping marriage\n", "3757 Gossiping marriage\n", "3775 Gossiping marriage\n", "3801 Gossiping marriage\n", "3805 Gossiping marriage\n", "3855 marriage Gossiping\n", "4180 marriage Gossiping\n", "4510 marriage Gossiping\n", "4532 marriage Gossiping\n", "4773 marriage Gossiping\n", "4846 marriage Gossiping\n", "5226 marriage GetMarry" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "false_pred = ct.query(\"artCatagory != pred\").loc[:,['artCatagory',\"pred\"]]\n", "false_pred" ] }, { "cell_type": "code", "execution_count": 110, "id": "e406f543", "metadata": {}, "outputs": [ { "data": { "text/html": [ 
"
" ], "text/plain": [ " artCatagory pred\n", "2381 Gossiping marriage\n", "2437 Gossiping marriage\n", "2454 Gossiping marriage\n", "2456 Gossiping marriage\n", "2465 Gossiping GetMarry\n", "2466 Gossiping marriage\n", "2474 Gossiping marriage\n", "2545 Gossiping marriage\n", "2607 Gossiping marriage\n", "2646 Gossiping GetMarry\n", "2718 Gossiping marriage\n", "2838 Gossiping marriage\n", "2860 Gossiping marriage\n", "2896 Gossiping marriage\n", "2946 Gossiping marriage\n", "2992 Gossiping marriage\n", "3221 Gossiping marriage\n", "3226 Gossiping GetMarry\n", "3317 Gossiping marriage\n", "3383 Gossiping marriage" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "false_pred.loc[false_pred['artCatagory']=='Gossiping', :].head(20)" ] }, { "cell_type": "code", "execution_count": 115, "id": "cdae96d4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "<table>
<tr><th></th><th>artCatagory</th><th>pred</th><th>content</th><th>embeddings</th></tr>
<tr><th>2381</th><td>Gossiping</td><td>marriage</td><td>問卦少子化女生的問題佔大部分吧台女白天要上班下班要養家假日要讀書沒辦法到外面聯誼之前我在臉書...</td><td>[0.8725247, -0.08411743, -0.203732, 0.27536434...</td></tr>
<tr><th>2437</th><td>Gossiping</td><td>marriage</td><td>新聞工程師尪被孕妻抓包上按摩店性交易簽工程師尪被孕妻抓包上按摩店性交易簽完協議再犯下場曝光年...</td><td>[0.46804097, -0.016441723, -0.2565862, 0.06258...</td></tr>
<tr><th>2454</th><td>Gossiping</td><td>marriage</td><td>問卦你各位買得起房就真的會結婚生小孩結婚可能但生小孩很難買不起房然後呢別忘了還有房貸再來是小...</td><td>[0.65234625, -0.09828554, -0.12775144, 0.26947...</td></tr>
<tr><th>2456</th><td>Gossiping</td><td>marriage</td><td>問卦台女找尋結婚對象有人留言說不生幹嘛結婚我只是能說太天真了婚姻是個強大道德武器之後至於要不...</td><td>[0.79514945, -0.067468375, -0.14028932, 0.4278...</td></tr>
<tr><th>2465</th><td>Gossiping</td><td>GetMarry</td><td>問卦台女找尋結婚對象小妹我的親辜徵友條件跟低卡上這位列的有相似但小妹的親辜身高體重而且比較年...</td><td>[0.928948, -0.1616967, 0.04163605, 0.56533563,...</td></tr>
<tr><th>2466</th><td>Gossiping</td><td>marriage</td><td>問卦家長是巨嬰嗎一般來說親子關係是要培養的不管是爸爸還是媽媽不是只負責生不負責養那以前都會說...</td><td>[0.8342646, -0.11413084, -0.100559324, 0.21266...</td></tr>
<tr><th>2474</th><td>Gossiping</td><td>marriage</td><td>新聞色人妻新婚不久就出軌誘惑男同事小色人妻新婚不久就出軌誘惑男同事小朋友才戴套把你榨乾記者柯...</td><td>[0.58158386, 0.022150613, -0.3089661, 0.124260...</td></tr>
<tr><th>2545</th><td>Gossiping</td><td>marriage</td><td>問卦老婆只想生一個建議你先跟你老婆借一下手機然後把你老婆的手機格式化再跟她講解備份的重要性有...</td><td>[0.5313501, -0.26143715, -0.0760047, 0.5591018...</td></tr>
<tr><th>2607</th><td>Gossiping</td><td>marriage</td><td>問卦男人不婚不生把錢花在出國也爽以前常有人說台女沒啥在存錢總是把錢花在出國玩樂享受美景美食上...</td><td>[0.5529267, 0.08784643, -0.022090623, 0.515583...</td></tr>
<tr><th>2646</th><td>Gossiping</td><td>GetMarry</td><td>問卦有萬為什麼不娶台灣要娶越南老婆對聽說去年有位造型師娶越南老婆但是價格驚人全部流程加上婚禮...</td><td>[0.5280451, -0.01208675, 0.1979833, 0.43109772...</td></tr>
</table>
\n", "
" ], "text/plain": [ " artCatagory pred content \\\n", "2381 Gossiping marriage 問卦少子化女生的問題佔大部分吧台女白天要上班下班要養家假日要讀書沒辦法到外面聯誼之前我在臉書... \n", "2437 Gossiping marriage 新聞工程師尪被孕妻抓包上按摩店性交易簽工程師尪被孕妻抓包上按摩店性交易簽完協議再犯下場曝光年... \n", "2454 Gossiping marriage 問卦你各位買得起房就真的會結婚生小孩結婚可能但生小孩很難買不起房然後呢別忘了還有房貸再來是小... \n", "2456 Gossiping marriage 問卦台女找尋結婚對象有人留言說不生幹嘛結婚我只是能說太天真了婚姻是個強大道德武器之後至於要不... \n", "2465 Gossiping GetMarry 問卦台女找尋結婚對象小妹我的親辜徵友條件跟低卡上這位列的有相似但小妹的親辜身高體重而且比較年... \n", "2466 Gossiping marriage 問卦家長是巨嬰嗎一般來說親子關係是要培養的不管是爸爸還是媽媽不是只負責生不負責養那以前都會說... \n", "2474 Gossiping marriage 新聞色人妻新婚不久就出軌誘惑男同事小色人妻新婚不久就出軌誘惑男同事小朋友才戴套把你榨乾記者柯... \n", "2545 Gossiping marriage 問卦老婆只想生一個建議你先跟你老婆借一下手機然後把你老婆的手機格式化再跟她講解備份的重要性有... \n", "2607 Gossiping marriage 問卦男人不婚不生把錢花在出國也爽以前常有人說台女沒啥在存錢總是把錢花在出國玩樂享受美景美食上... \n", "2646 Gossiping GetMarry 問卦有萬為什麼不娶台灣要娶越南老婆對聽說去年有位造型師娶越南老婆但是價格驚人全部流程加上婚禮... \n", "\n", " embeddings \n", "2381 [0.8725247, -0.08411743, -0.203732, 0.27536434... \n", "2437 [0.46804097, -0.016441723, -0.2565862, 0.06258... \n", "2454 [0.65234625, -0.09828554, -0.12775144, 0.26947... \n", "2456 [0.79514945, -0.067468375, -0.14028932, 0.4278... \n", "2465 [0.928948, -0.1616967, 0.04163605, 0.56533563,... \n", "2466 [0.8342646, -0.11413084, -0.100559324, 0.21266... \n", "2474 [0.58158386, 0.022150613, -0.3089661, 0.124260... \n", "2545 [0.5313501, -0.26143715, -0.0760047, 0.5591018... \n", "2607 [0.5529267, 0.08784643, -0.022090623, 0.515583... \n", "2646 [0.5280451, -0.01208675, 0.1979833, 0.43109772... 
" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ct_wrong = ct[(ct[\"artCatagory\"] == \"Gossiping\") & (ct[\"pred\"] != \"Gossiping\")]\n", "ct_wrong[[\"artCatagory\", \"pred\", \"content\",\"embeddings\"]].head(10)\n" ] }, { "cell_type": "markdown", "id": "5aae1db1", "metadata": {}, "source": [ "## 第三次讀書會 BERT (Encoder-only-model)\n", "---" ] }, { "cell_type": "markdown", "id": "c7c09a4c", "metadata": {}, "source": [ "**前處理常用套件**\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ddc61652", "metadata": {}, "outputs": [], "source": [ "!pip install jieba" ] }, { "cell_type": "code", "execution_count": null, "id": "9e0abbd2", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import re\n", "import numpy as np\n", "from collections import defaultdict\n", "import multiprocessing\n", "import jieba\n", "import matplotlib.pyplot as plt\n", "from matplotlib.font_manager import fontManager\n", "\n", "# 設定字體\n", "fontManager.addfont('./TaipeiSansTCBeta-Regular.ttf')\n", "plt.rcParams['font.sans-serif'] = ['Taipei Sans TC Beta']\n", "plt.rcParams['font.size'] = '16'" ] }, { "cell_type": "markdown", "id": "e2ab3234", "metadata": {}, "source": [ "**Transformers 和 Sentence-transformers (使用 huggingface 模型)**" ] }, { "cell_type": "code", "execution_count": null, "id": "8b6b7b4d", "metadata": {}, "outputs": [], "source": [ "!pip install sentence_transformers\n", "!pip install ckip_transformers" ] }, { "cell_type": "code", "execution_count": null, "id": "5f1e1edd", "metadata": {}, "outputs": [], "source": [ "from transformers import BertTokenizerFast, AutoTokenizer, AutoModelForTokenClassification, AutoModelForSequenceClassification, pipeline\n", "from sentence_transformers import SentenceTransformer\n", "from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker" ] }, { "cell_type": "markdown", "id": "fd53b4c2", "metadata": {}, "source": [ "**BERTopic套件**" ] }, { "cell_type": 
"code", "execution_count": null, "id": "c14979a8", "metadata": {}, "outputs": [], "source": [ "!pip install bertopic" ] }, { "cell_type": "code", "execution_count": null, "id": "59cd4033", "metadata": {}, "outputs": [], "source": [ "from bertopic import BERTopic\n", "from bertopic.vectorizers import ClassTfidfTransformer\n", "from hdbscan import HDBSCAN\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.cluster import KMeans" ] }, { "cell_type": "markdown", "id": "9bd0c6ae", "metadata": {}, "source": [ "### 2. 資料前處理" ] }, { "cell_type": "markdown", "id": "e28698a4", "metadata": {}, "source": [ "中文資料集:載入離婚資料集" ] }, { "cell_type": "code", "execution_count": null, "id": "1b699d88", "metadata": {}, "outputs": [], "source": [ "# 讀入中文示範資料集\n", "# origin_data = pd.read_csv('./raw_data/zh_buffet_20_22.csv')\n", "origin_data = pd.read_csv('./raw_data/text_marriage.csv')" ] }, { "cell_type": "code", "execution_count": null, "id": "a261617a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "<table>
<tr><th></th><th>system_id</th><th>artUrl</th><th>artTitle</th><th>artDate</th><th>artContent</th><th>sentence</th></tr>
<tr><th>0</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>常常看大家說</td></tr>
<tr><th>1</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>偷看手機是不對的</td></tr>
<tr><th>2</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>但如果已經結婚了</td></tr>
<tr><th>3</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>太太想看你手機</td></tr>
<tr><th>4</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>真的可以拒絕嗎</td></tr>
<tr><th>5</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>感覺你拒絕</td></tr>
<tr><th>6</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>就是心裡有鬼</td></tr>
<tr><th>7</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>只是讓太太猜忌</td></tr>
<tr><th>8</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>自己日子難過</td></tr>
<tr><th>9</th><td>1</td><td>https://www.ptt.cc/bbs/marriage/M.1610159827.A...</td><td>[求助]真的可以不給看手機嗎?</td><td>2021-01-09 10:37:05</td><td>常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\...</td><td>長期以來</td></tr>
</table>
\n", "
" ], "text/plain": [ " system_id artUrl \\\n", "0 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "1 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "2 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "3 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "4 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "5 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "6 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "7 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "8 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "9 1 https://www.ptt.cc/bbs/marriage/M.1610159827.A... \n", "\n", " artTitle artDate \\\n", "0 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "1 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "2 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "3 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "4 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "5 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "6 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "7 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "8 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "9 [求助]真的可以不給看手機嗎? 2021-01-09 10:37:05 \n", "\n", " artContent sentence \n", "0 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 常常看大家說 \n", "1 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 偷看手機是不對的 \n", "2 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 但如果已經結婚了 \n", "3 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 太太想看你手機 \n", "4 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 真的可以拒絕嗎 \n", "5 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 感覺你拒絕 \n", "6 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 就是心裡有鬼 \n", "7 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 只是讓太太猜忌 \n", "8 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 自己日子難過 \n", "9 常常看大家說,偷看手機是不對的,\\n但如果已經結婚了,太太想看你手機,\\n真的可以拒絕嗎?\\... 
長期以來 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Drop columns that are not needed\n", "metaData = origin_data.drop(['artPoster', 'artCatagory', 'artComment', 'e_ip', 'insertedDate', 'dataSource'], axis=1)\n", "\n", "# Treat blank lines as sentence breaks, then remove the remaining newlines\n", "metaData['sentence'] = metaData['artContent'].str.replace(r'\\n\\n','。', regex=True)\n", "metaData['sentence'] = metaData['sentence'].str.replace(r'\\n','', regex=True)\n", "\n", "# Split on full-width and half-width punctuation so that each row holds one sentence\n", "metaData['sentence'] = metaData['sentence'].str.split(\"[,,。!!??]{1,}\")\n", "metaData = metaData.explode('sentence').reset_index(drop=True)\n", "\n", "# Keep only Chinese characters\n", "metaData['sentence'] = metaData['sentence'].apply(lambda x: re.sub('[^\\u4e00-\\u9fff]+', '',x))\n", "\n", "metaData.head(10)" ] }, { "cell_type": "markdown", "id": "4fb7a426", "metadata": {}, "source": [ "## 3. Token classification" ] }, { "cell_type": "markdown", "id": "7a49f49f", "metadata": {}, "source": [ "### NER\n", "Implemented with a BERT model from Hugging Face that has already been fine-tuned for the NER task
\n", "List of Hugging Face models: https://huggingface.co/models?sort=trending" ] }, { "cell_type": "markdown", "id": "531530fc", "metadata": {}, "source": [ "#### 3.1 Chinese NER:
\n", "- Package: transformers
\n", "- NER model: https://huggingface.co/ckiplab/bert-base-chinese-ner" ] }, { "cell_type": "code", "execution_count": null, "id": "6fe27d64", "metadata": {}, "outputs": [], "source": [ "# Load the Chinese NER model\n", "model_name = 'ckiplab/bert-base-chinese-ner'\n", "tokenizer = BertTokenizerFast.from_pretrained(model_name)\n", "model = AutoModelForTokenClassification.from_pretrained(model_name)" ] }, { "cell_type": "markdown", "id": "8cadce14", "metadata": {}, "source": [ "Alternatively, use ckip_transformers, the NLP package developed by CKIP:
\n", "- WS (word segmentation) model: https://huggingface.co/ckiplab/bert-base-chinese-ws
\n", "- POS (part-of-speech) model: https://huggingface.co/ckiplab/bert-base-chinese-pos
\n", "- NER model: https://huggingface.co/ckiplab/bert-base-chinese-ner" ] }, { "cell_type": "code", "execution_count": null, "id": "91f1cd65", "metadata": {}, "outputs": [], "source": [ "# Initialise the CKIP drivers: device=0 runs on the GPU, device=-1 runs on the CPU (much slower)\n", "# On a Mac, device=torch.device(\"mps\") can be used to run on the GPU\n", "ws_driver = CkipWordSegmenter(model_name=\"ckiplab/bert-base-chinese-ws\", device=0) # word segmentation\n", "pos_driver = CkipPosTagger(model_name=\"ckiplab/bert-base-chinese-pos\", device=0) # part-of-speech tagging\n", "ner_driver = CkipNerChunker(model_name=\"ckiplab/bert-base-chinese-ner\", device=0) # named-entity recognition\n" ] }, { "cell_type": "markdown", "id": "65fd99d1", "metadata": {}, "source": [ "**Apply CKIP to the dataset preprocessed above**" ] }, { "cell_type": "code", "execution_count": null, "id": "0ad6ce38", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Tokenization: 100%|██████████| 50/50 [00:00\n", "
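<table>
<tr><th></th><th>sentence</th><th>packed_sentence</th><th>entities</th></tr>
<tr><th>0</th><td>常常看大家說</td><td>常常(D) 看(VC) 大家(Nh) 說(VE)</td><td>[]</td></tr>
<tr><th>1</th><td>偷看手機是不對的</td><td>偷看(VC) 手機(Na) 是(SHI) 不(D) 對(VH) 的(DE)</td><td>[]</td></tr>
<tr><th>2</th><td>但如果已經結婚了</td><td>但(Cbb) 如果(Cbb) 已經(D) 結婚(VA) 了(Di)</td><td>[]</td></tr>
<tr><th>3</th><td>太太想看你手機</td><td>太太(Na) 想(VE) 看(VC) 你(Nh) 手機(Na)</td><td>[]</td></tr>
<tr><th>4</th><td>真的可以拒絕嗎</td><td>真的(D) 可以(D) 拒絕(VF) 嗎(T)</td><td>[]</td></tr>
<tr><th>5</th><td>感覺你拒絕</td><td>感覺(VK) 你(Nh) 拒絕(VF)</td><td>[]</td></tr>
<tr><th>6</th><td>就是心裡有鬼</td><td>就(D) 是(SHI) 心(Na) 裡(Ng) 有(V_2) 鬼(Na)</td><td>[]</td></tr>
<tr><th>7</th><td>只是讓太太猜忌</td><td>只是(D) 讓(VL) 太太(Na) 猜忌(VJ)</td><td>[]</td></tr>
<tr><th>8</th><td>自己日子難過</td><td>自己(Nh) 日子(Na) 難過(VK)</td><td>[]</td></tr>
<tr><th>9</th><td>長期以來</td><td>長期(Na) 以來(Ng)</td><td>[]</td></tr>
</table>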
\n", "" ], "text/plain": [ " sentence packed_sentence entities\n", "0 常常看大家說 常常(D) 看(VC) 大家(Nh) 說(VE) []\n", "1 偷看手機是不對的 偷看(VC) 手機(Na) 是(SHI) 不(D) 對(VH) 的(DE) []\n", "2 但如果已經結婚了 但(Cbb) 如果(Cbb) 已經(D) 結婚(VA) 了(Di) []\n", "3 太太想看你手機 太太(Na) 想(VE) 看(VC) 你(Nh) 手機(Na) []\n", "4 真的可以拒絕嗎 真的(D) 可以(D) 拒絕(VF) 嗎(T) []\n", "5 感覺你拒絕 感覺(VK) 你(Nh) 拒絕(VF) []\n", "6 就是心裡有鬼 就(D) 是(SHI) 心(Na) 裡(Ng) 有(V_2) 鬼(Na) []\n", "7 只是讓太太猜忌 只是(D) 讓(VL) 太太(Na) 猜忌(VJ) []\n", "8 自己日子難過 自己(Nh) 日子(Na) 難過(VK) []\n", "9 長期以來 長期(Na) 以來(Ng) []" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Use the first 50 sentences as an example\n", "text = metaData['sentence'].tolist()\n", "text = text[:50]\n", "\n", "# Run the three CKIP drivers\n", "ws = ws_driver(text) # word segmentation\n", "pos = pos_driver(ws) # part-of-speech tagging\n", "ner = ner_driver(text) # named-entity recognition\n", "\n", "# Combine each word with its POS tag for display\n", "def pack_ws_pos_sentence(sentence_ws, sentence_pos):\n", "    assert len(sentence_ws) == len(sentence_pos) # segmentation and POS output must align\n", "    res = []\n", "    for word_ws, word_pos in zip(sentence_ws, sentence_pos):\n", "        res.append(f\"{word_ws}({word_pos})\") # e.g. word(tag)\n", "    return \"\\u3000\".join(res)\n", "\n", "sentences, packed_sentences, entities = [], [], []\n", "\n", "# Collect the results sentence by sentence\n", "for sentence, sentence_ws, sentence_pos, sentence_ner in zip(text, ws, pos, ner):\n", "    sentences.append(sentence)\n", "    packed_sentences.append(pack_ws_pos_sentence(sentence_ws, sentence_pos))\n", "    entities.append([str(entity) for entity in sentence_ner])\n", "\n", "# Store the results in a DataFrame\n", "ner_results = pd.DataFrame({\n", "    'sentence': sentences,\n", "    'packed_sentence': packed_sentences,\n", "    'entities': entities\n", "})\n", "\n", "ner_results.head(10)\n" ] }, { "cell_type": "markdown", "id": "b6e0f4af", "metadata": {}, "source": [ "The table above shows the word segmentation, part-of-speech tagging, and named-entity recognition produced by the CKIP Transformers models. None of the first ten sentences contains a named entity, so every `entities` list is empty." ] }, { "cell_type": "markdown", "id": "7b281430", "metadata": {}, "source": [ "## 4. 
Sequence classification" ] }, { "cell_type": "markdown", "id": "a7fb6e18", "metadata": {}, "source": [ "### 4.1 Sentiment Classification\n", "Implemented with a BERT model from Hugging Face that has already been fine-tuned for sentiment classification
\n", "Model used: https://huggingface.co/techthiyanes/chinese_sentiment

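The labels this notebook gets back from the model are strings of the form star N. As a minimal sketch, and not part of the original notebook, the two hypothetical helpers below (star_to_rating and rating_to_polarity) turn such a label into an integer and then into a coarse polarity. The cut-offs (1 and 2 negative, 3 neutral, 4 and 5 positive) are an assumption chosen for illustration; the two example rows reuse labels and scores that appear in the results table of this section.

```python
import re


def star_to_rating(label):
    '''Return the integer rating (1 to 5) encoded in a label such as 'star 4'.'''
    match = re.fullmatch(r'star ([1-5])', label)
    if match is None:
        raise ValueError('unexpected label: ' + repr(label))
    return int(match.group(1))


def rating_to_polarity(rating):
    '''Collapse a 1-5 rating into a coarse polarity (cut-offs are an assumption).'''
    if rating <= 2:
        return 'negative'
    if rating == 3:
        return 'neutral'
    return 'positive'


# Rows shaped like the pipeline output: a label string plus a confidence score.
example_rows = [
    {'label': 'star 4', 'score': 0.509952},
    {'label': 'star 1', 'score': 0.475942},
]

for row in example_rows:
    rating = star_to_rating(row['label'])
    print(row['label'], rating, rating_to_polarity(rating), row['score'])
```

Keeping the integer rating next to the coarse polarity makes it possible to average ratings per article later without losing the original five-level label.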
\n", "Sentiment labels (star 1 to star 5):
\n", "1. Negative
\n", "2. Semi-negative
\n", "3. Neutral
\n", "4. Semi-positive
\n", "5. Positive" ] }, { "cell_type": "code", "execution_count": null, "id": "0d166115", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cuda:0\n" ] } ], "source": [ "# Load the fine-tuned BERT model\n", "model_name = \"techthiyanes/chinese_sentiment\" # replace with any other model you want to use\n", "# CPU alternative: model = pipeline('sentiment-analysis', model=model_name)\n", "model = pipeline('sentiment-analysis', model=model_name, device=0)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "7fb5e035", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "<table>
<tr><th></th><th>sentence</th><th>label</th><th>score</th></tr>
<tr><th>0</th><td>常常看大家說</td><td>star 4</td><td>0.509952</td></tr>
<tr><th>1</th><td>偷看手機是不對的</td><td>star 2</td><td>0.314251</td></tr>
<tr><th>2</th><td>但如果已經結婚了</td><td>star 3</td><td>0.319404</td></tr>
<tr><th>3</th><td>太太想看你手機</td><td>star 2</td><td>0.288831</td></tr>
<tr><th>4</th><td>真的可以拒絕嗎</td><td>star 1</td><td>0.465014</td></tr>
<tr><th>5</th><td>感覺你拒絕</td><td>star 5</td><td>0.336695</td></tr>
<tr><th>6</th><td>就是心裡有鬼</td><td>star 2</td><td>0.328898</td></tr>
<tr><th>7</th><td>只是讓太太猜忌</td><td>star 2</td><td>0.311495</td></tr>
<tr><th>8</th><td>自己日子難過</td><td>star 1</td><td>0.475942</td></tr>
<tr><th>9</th><td>長期以來</td><td>star 4</td><td>0.445570</td></tr>
</table>
\n", "
" ], "text/plain": [ " sentence label score\n", "0 常常看大家說 star 4 0.509952\n", "1 偷看手機是不對的 star 2 0.314251\n", "2 但如果已經結婚了 star 3 0.319404\n", "3 太太想看你手機 star 2 0.288831\n", "4 真的可以拒絕嗎 star 1 0.465014\n", "5 感覺你拒絕 star 5 0.336695\n", "6 就是心裡有鬼 star 2 0.328898\n", "7 只是讓太太猜忌 star 2 0.311495\n", "8 自己日子難過 star 1 0.475942\n", "9 長期以來 star 4 0.445570" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Collect every sentence\n", "sentences = metaData['sentence'].tolist()\n", "\n", "# Pass the whole list to the pipeline at once; it batches internally (batch_size=8).\n", "# truncation=True with max_length=512 lets the tokenizer cap the input length,\n", "# so the earlier per-row apply() with a manual x[:500] character slice is no longer needed.\n", "results = model(sentences, truncation=True, max_length=512, batch_size=8)\n", "\n", "# Assemble the results into a DataFrame\n", "results_df = pd.DataFrame({\n", "    'sentence': sentences,\n", "    'label': [res['label'] for res in results],\n", "    'score': [res['score'] for res in results]\n", "})\n", "\n", "# Show the first rows\n", "results_df.head(10)\n", "\n" ] }, { "cell_type": "markdown", "id": "8f5f2732", "metadata": {}, "source": [ "For example, \"常常看大家說\" (roughly: I often see people say) is labeled star 4, which leans positive, while \"自己日子難過\" (roughly: you make your own life hard) is labeled star 1, the most negative class. Both scores are modest (about 0.51 and 0.48), so labels on such short fragments should be read with caution." ] }, { "cell_type": "markdown", "id": "b14a1bdd", "metadata": {}, "source": [ "## 5. 
Text Clustering" ] }, { "cell_type": "markdown", "id": "bb120602", "metadata": {}, "source": [ "#### Applying BERTopic to Chinese text
\n", "To work on Chinese posts, each component has to be swapped for one that supports Chinese, mainly the embedding model and the tokenizer" ] }, { "cell_type": "code", "execution_count": null, "id": "a378ecfe", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "103443\n" ] } ], "source": [ "print(len(metaData))" ] }, { "cell_type": "code", "execution_count": null, "id": "60d99703", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "No sentence-transformers model found with name google-bert/bert-base-chinese. Creating a new one with mean pooling.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Encoding batch 1 / 21\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f33dc7ca51604dbaa535a643089a49ac", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Batches: 0%| | 0/157 [00:00\n", "(103443, 768)\n" ] } ], "source": [ "print(type(all_embeddings))\n", "print(all_embeddings.shape)" ] }, { "cell_type": "code", "execution_count": null, "id": "bff28fe9", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-05-10 03:34:32,454 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm\n", "2025-05-10 03:35:01,087 - BERTopic - Dimensionality - Completed ✓\n", "2025-05-10 03:35:01,088 - BERTopic - Cluster - Start clustering the reduced embeddings\n", "2025-05-10 03:35:05,307 - BERTopic - Cluster - Completed ✓\n", "2025-05-10 03:35:05,315 - BERTopic - Representation - Fine-tuning topics using representation models.\n", "2025-05-10 03:35:09,015 - BERTopic - Representation - Completed ✓\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " Topic Count Name \\\n", "0 -1 77800 -1_ _說_小孩_妳 \n", "1 0 7361 0_ _老婆_妳_婚姻 \n", "2 1 6016 1_ _真的_說_情況 \n", "3 2 4244 2_房子_ _房貸_萬 \n", "4 3 3396 3_ _小孩_睡_下班 \n", "5 4 2522 4_ _文章_版友_建議 \n", "6 5 2104 5_離婚_ _結婚_外遇 \n", "\n", " Representation \\\n", "0 [ , 說, 小孩, 妳, 老婆, 想, 老公, 真的, 做, 離婚, 孩子, 工作, 太太... 
\n", "1 [ , 老婆, 妳, 婚姻, 老公, 做, 說, 離婚, 改變, 想, 人生, 小孩, 真的... \n", "2 [ , 真的, 說, 情況, 想, 太, 答案, 妳, 不想, 方法, 痛苦, 做, 喔, ... \n", "3 [房子, , 房貸, 萬, 錢, 薪水, 財產, 買, 工作, 貸款, 費用, 買房, 收... \n", "4 [ , 小孩, 睡, 下班, 回家, 吃, 晚上, 洗, 時間, 睡覺, 假日, 煮, 上班... \n", "5 [ , 文章, 版友, 建議, 謝謝, 發文, 分享, 推文, 參考, 想, 文, 回文, ... \n", "6 [離婚, , 結婚, 外遇, 交往, 想, 提, 分手, 建議, 真的, 說, 談, 不想... \n", "\n", " Representative_Docs \n", "0 [但有小孩的話, 阿不然你說說他有哪些好, 而不是忽然你父母說要帶小孩] \n", "1 [最後我想說的是我懂妳那種在一個地方孤立無援的感覺如果跟先生一直找不到共識那就自己先做出改變... \n", "2 [我說, 像大說的, 那真的先不要] \n", "3 [而房子就歸我了, 房子和, 沒人知道房子是我的] \n", "4 [他想好好休息我陪睡的時候我也不會拒絕我都是哄完小孩睡覺才會去睡要小時顧才不廢, 媽媽自己做... \n", "5 [你的文章只看到, 看了你文章的推文, 感謝大家的回覆與建議沒想到會這麼多人回覆早知道很多年... \n", "6 [也是直接離婚就好, 趕快離婚, 也差不多可以離婚了] \n" ] } ], "source": [ "\n", "\n", "# ---------- jieba tokenizer function ----------\n", "def tokenize_zh(text):\n", "    words = jieba.lcut(text)\n", "    return words\n", "\n", "# ---------- CountVectorizer with jieba tokenization and stop words ----------\n", "# token_pattern is omitted: scikit-learn ignores it when a custom tokenizer is supplied\n", "jieba_vectorizer = CountVectorizer(\n", "    tokenizer=tokenize_zh,\n", "    stop_words=stopwords,\n", "    analyzer='word'\n", ")\n", "\n", "# ---------- HDBSCAN parameters (tune these to change cluster granularity) ----------\n", "hdbscan_model = HDBSCAN(min_cluster_size=2000, min_samples=10)\n", "\n", "# ---------- Build the BERTopic model ----------\n", "zh_topic_model = BERTopic(\n", "    embedding_model=bert_sentence_model,\n", "    vectorizer_model=jieba_vectorizer,\n", "    hdbscan_model=hdbscan_model,\n", "    verbose=True,\n", "    top_n_words=30\n", ")\n", "\n", "# ---------- Fit the topic model on the precomputed embeddings ----------\n", "topics, probs = zh_topic_model.fit_transform(docs_zh, all_embeddings)\n", "\n", "# ---------- Inspect the topics ----------\n", "topic_info = zh_topic_model.get_topic_info()\n", "print(topic_info)\n" ] }, { "cell_type": "markdown", "id": "3551b3f0", "metadata": {}, "source": [ 
"The model returns seven groups. Topic -1 is not a real topic: it collects the 77,800 sentences that HDBSCAN could not assign to any cluster. Topic 0 concerns interaction within a marriage, with keywords such as 老公 (husband), 老婆 (wife), 小孩 (children), 婚姻 (marriage), and 離婚 (divorce). Topic 1 gathers more abstract emotional expressions such as 真的 (really), 不想 (do not want), and 痛苦 (pain). Topic 2 covers housing and money, mainly discussion of financial pressure, mortgages, and how household finances are divided. Topic 3 covers children and family life, with many descriptions of childcare, daily schedules, and home routines. Topic 4 reflects interaction on the forum itself (posts, replies, thanks to other users); it has little to do with our subject but still forms its own cluster. Topic 5 concerns relationship and marriage breakdown, with keywords such as 外遇 (affair), 分手 (breakup), and 離婚 (divorce). A blank token also appears in every keyword list, which suggests that whitespace is not being filtered out during tokenization.
\n", "Overall, the model did pick out several clear topics, for example marital problems, mortgage pressure, and family life." ] }, { "cell_type": "code", "execution_count": null, "id": "5cbbd497", "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "customdata": [ [ 0, " | 老婆 | 妳 | 婚姻 | 老公", 7361 ], [ 1, " | 真的 | 說 | 情況 | 想", 6016 ], [ 2, "房子 | | 房貸 | 萬 | 錢", 4244 ], [ 3, " | 小孩 | 睡 | 下班 | 回家", 3396 ], [ 4, " | 文章 | 版友 | 建議 | 謝謝", 2522 ], [ 5, "離婚 | | 結婚 | 外遇 | 交往", 2104 ] ], "hovertemplate": "Topic %{customdata[0]}
%{customdata[1]}
Size: %{customdata[2]}", "legendgroup": "", "marker": { "color": "#B0BEC5", "line": { "color": "DarkSlateGrey", "width": 2 }, "size": { "bdata": "wRyAF5QQRA3aCTgI", "dtype": "i2" }, "sizemode": "area", "sizeref": 4.600625, "symbol": "circle" }, "mode": "markers", "name": "", "orientation": "v", "showlegend": false, "type": "scatter", "x": { "bdata": "m7ihQa5TpEHz0ptBxOCqQcOUo0FiCJ9B", "dtype": "f4" }, "xaxis": "x", "y": { "bdata": "SkVEwTtPNcGM30rBo/lawfSWTsEF4zfB", "dtype": "f4" }, "yaxis": "y" } ], "layout": { "annotations": [ { "showarrow": false, "text": "D1", "x": 16.55630216598511, "y": -12.685452651977538, "yshift": 10 }, { "showarrow": false, "text": "D2", "x": 20.560006666183472, "xshift": 10, "y": -9.632066869735718 } ], "height": 650, "hoverlabel": { "bgcolor": "white", "font": { "family": "Rockwell", "size": 16 } }, "legend": { "itemsizing": "constant", "tracegroupgap": 0 }, "margin": { "t": 60 }, "shapes": [ { "line": { "color": "#CFD8DC", "width": 2 }, "type": "line", "x0": 20.560006666183472, "x1": 20.560006666183472, "y0": -15.73883843421936, "y1": -9.632066869735718 }, { "line": { "color": "#9E9E9E", "width": 2 }, "type": "line", "x0": 16.55630216598511, "x1": 24.563711166381836, "y0": -12.685452651977538, "y1": -12.685452651977538 } ], "sliders": [ { "active": 0, "pad": { "t": 50 }, "steps": [ { "args": [ { "marker.color": [ [ "red", "#B0BEC5", "#B0BEC5", "#B0BEC5", "#B0BEC5", "#B0BEC5" ] ] } ], "label": "Topic 0", "method": "update" }, { "args": [ { "marker.color": [ [ "#B0BEC5", "red", "#B0BEC5", "#B0BEC5", "#B0BEC5", "#B0BEC5" ] ] } ], "label": "Topic 1", "method": "update" }, { "args": [ { "marker.color": [ [ "#B0BEC5", "#B0BEC5", "red", "#B0BEC5", "#B0BEC5", "#B0BEC5" ] ] } ], "label": "Topic 2", "method": "update" }, { "args": [ { "marker.color": [ [ "#B0BEC5", "#B0BEC5", "#B0BEC5", "red", "#B0BEC5", "#B0BEC5" ] ] } ], "label": "Topic 3", "method": "update" }, { "args": [ { "marker.color": [ [ "#B0BEC5", "#B0BEC5", "#B0BEC5", "#B0BEC5", 
"red", "#B0BEC5" ] ] } ], "label": "Topic 4", "method": "update" }, { "args": [ { "marker.color": [ [ "#B0BEC5", "#B0BEC5", "#B0BEC5", "#B0BEC5", "#B0BEC5", "red" ] ] } ], "label": "Topic 5", "method": "update" } ] } ], "template": { "data": { "bar": [ { "error_x": { "color": "rgb(36,36,36)" }, "error_y": { "color": "rgb(36,36,36)" }, "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "rgb(36,36,36)", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "rgb(36,36,36)" }, "baxis": { "endlinecolor": "rgb(36,36,36)", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "rgb(36,36,36)" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 
0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "heatmap" } ], "histogram": [ { "marker": { "line": { "color": "white", "width": 0.6 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "fillpattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } 
}, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergl" } ], "scattermap": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattermap" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "rgb(237,237,237)" }, "line": { "color": "white" } }, "header": { "fill": { "color": "rgb(217,217,217)" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowhead": 0, "arrowwidth": 1 }, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "colorscale": { "diverging": [ [ 0, "rgb(103,0,31)" ], [ 0.1, "rgb(178,24,43)" 
], [ 0.2, "rgb(214,96,77)" ], [ 0.3, "rgb(244,165,130)" ], [ 0.4, "rgb(253,219,199)" ], [ 0.5, "rgb(247,247,247)" ], [ 0.6, "rgb(209,229,240)" ], [ 0.7, "rgb(146,197,222)" ], [ 0.8, "rgb(67,147,195)" ], [ 0.9, "rgb(33,102,172)" ], [ 1, "rgb(5,48,97)" ] ], "sequential": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "sequentialminus": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ] }, "colorway": [ "#1F77B4", "#FF7F0E", "#2CA02C", "#D62728", "#9467BD", "#8C564B", "#E377C2", "#7F7F7F", "#BCBD22", "#17BECF" ], "font": { "color": "rgb(36,36,36)" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "white", "showlakes": true, "showland": true, "subunitcolor": "white" }, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "paper_bgcolor": "white", "plot_bgcolor": "white", "polar": { "angularaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "radialaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" } }, "scene": { "xaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "backgroundcolor": "white", 
"gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "zaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } }, "shapedefaults": { "fillcolor": "black", "line": { "width": 0 }, "opacity": 0.3 }, "ternary": { "aaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "baxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "caxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" } }, "title": { "x": 0.05 }, "xaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } } }, "title": { "font": { "color": "Black", "size": 22 }, "text": "Intertopic Distance Map", "x": 0.5, "xanchor": "center", "y": 0.95, "yanchor": "top" }, "width": 650, "xaxis": { "anchor": "y", "domain": [ 0, 1 ], "range": [ 16.55630216598511, 24.563711166381836 ], "title": { "text": "" }, "visible": false }, "yaxis": { "anchor": "x", "domain": [ 0, 1 ], "range": [ -15.73883843421936, -9.632066869735718 ], "title": { "text": "" }, "visible": false } } } }, "metadata": {}, "output_type": "display_data" } ], 
"source": [ "zh_topic_model.visualize_topics()" ] }, { "cell_type": "markdown", "id": "81e7bbe0", "metadata": {}, "source": [ "我們將第一個主題去掉,而剩下的主題,從這個分布圖我們可以以看出情緒表達詞彙與婚姻關係破裂的詞會較相關,而與孩子、家庭生活相關的這個主題就與其他主體相關性較低,語意較少與其他主題重疊。" ] }, { "cell_type": "code", "execution_count": null, "id": "3c888d9d", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 104/104 [00:15<00:00, 6.68it/s]\n" ] } ], "source": [ "# 估算每個文件對BERTopic每個主題的機率分布\n", "topic_distr, _ = zh_topic_model.approximate_distribution(docs_zh)" ] }, { "cell_type": "code", "execution_count": null, "id": "c9ad4f31", "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "marker": { "color": "#C8D2D7", "line": { "color": "#6E8484", "width": 1 } }, "orientation": "h", "type": "bar", "x": [ 0.2030228180373761, 0.2904274787816493, 0.13669468061291898, 0.09567379287617377, 0.08618057255827931, 0.18800065713360267 ], "y": [ "Topic 0: _老婆_妳_婚姻_老公", "Topic 1: _真的_說_情況_想", "Topic 2: 房子_ _房貸_萬_錢", "Topic 3: _小孩_睡_下班_回家", "Topic 4: _文章_版友_建議_謝謝", "Topic 5: 離婚_ _結婚_外遇_交往" ] } ], "layout": { "height": 600, "hoverlabel": { "bgcolor": "white", "font": { "family": "Rockwell", "size": 16 } }, "template": { "data": { "bar": [ { "error_x": { "color": "rgb(36,36,36)" }, "error_y": { "color": "rgb(36,36,36)" }, "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "rgb(36,36,36)", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "rgb(36,36,36)" }, "baxis": { "endlinecolor": "rgb(36,36,36)", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": 
"rgb(36,36,36)" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "heatmap" } ], "histogram": [ { "marker": { "line": { "color": "white", "width": 0.6 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], 
[ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "fillpattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergl" } ], "scattermap": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattermap" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { 
"outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "rgb(237,237,237)" }, "line": { "color": "white" } }, "header": { "fill": { "color": "rgb(217,217,217)" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowhead": 0, "arrowwidth": 1 }, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "colorscale": { "diverging": [ [ 0, "rgb(103,0,31)" ], [ 0.1, "rgb(178,24,43)" ], [ 0.2, "rgb(214,96,77)" ], [ 0.3, "rgb(244,165,130)" ], [ 0.4, "rgb(253,219,199)" ], [ 0.5, "rgb(247,247,247)" ], [ 0.6, "rgb(209,229,240)" ], [ 0.7, "rgb(146,197,222)" ], [ 0.8, "rgb(67,147,195)" ], [ 0.9, "rgb(33,102,172)" ], [ 1, "rgb(5,48,97)" ] ], "sequential": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "sequentialminus": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ] }, "colorway": [ 
"#1F77B4", "#FF7F0E", "#2CA02C", "#D62728", "#9467BD", "#8C564B", "#E377C2", "#7F7F7F", "#BCBD22", "#17BECF" ], "font": { "color": "rgb(36,36,36)" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "white", "showlakes": true, "showland": true, "subunitcolor": "white" }, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "paper_bgcolor": "white", "plot_bgcolor": "white", "polar": { "angularaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "radialaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" } }, "scene": { "xaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "zaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } }, "shapedefaults": { "fillcolor": "black", "line": { "width": 0 }, "opacity": 0.3 }, "ternary": { "aaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "baxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "caxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, 
"showline": true, "ticks": "outside" } }, "title": { "x": 0.05 }, "xaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } } }, "title": { "font": { "color": "Black", "size": 22 }, "text": "Topic Probability Distribution", "x": 0.5, "xanchor": "center", "y": 0.95, "yanchor": "top" }, "width": 800, "xaxis": { "title": { "text": "Probability" } } } } }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 以第1777個文件為例,列出這份文件對每個主題的機率分布\n", "zh_topic_model.visualize_distribution(topic_distr[1777])" ] }, { "cell_type": "markdown", "id": "295f3c21", "metadata": {}, "source": [ "我們以資料集中的其中一篇文章作為範例,查看每個主題的機率分布,其中與情緒詞主題相關的詞彙是最多的,再來是與婚姻關係相關的兩個主題的詞彙較多,討論到小孩和家庭生活的詞彙則較少。" ] }, { "cell_type": "code", "execution_count": null, "id": "79800333", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('房子', 0.08832980862301461),\n", " (' ', 0.08206604884387665),\n", " ('房貸', 0.05096740115727702),\n", " ('萬', 0.047356587645044315),\n", " ('錢', 0.0450103800079604),\n", " ('薪水', 0.04135365237991898),\n", " ('財產', 0.03950471267694048),\n", " ('買', 0.03455065077692002),\n", " ('工作', 0.03227272405986625),\n", " ('貸款', 0.03135481554556126),\n", " ('費用', 0.02734955195859806),\n", " ('買房', 0.02721027073860453),\n", " ('收入', 0.024952198329710087),\n", " ('存款', 0.02376174368712919),\n", " ('名下', 0.02236766307496004),\n", " ('小孩', 0.022094595487329578),\n", " ('開銷', 0.021967702675287745),\n", " ('負擔', 0.01928951669898701),\n", " ('住', 0.018815296571449015),\n", " ('一半', 0.018162093209078425),\n", " ('老公', 0.017709298496047788),\n", " ('老婆', 0.01752294289544913),\n", " ('薪資', 
0.017494089796292113),\n", " ('投資', 0.01704722423177808),\n", " ('一個月', 0.01670727403634324),\n", " ('付', 0.016667227857624485),\n", " ('賺', 0.01634891501423118),\n", " ('年薪', 0.01633017406466438),\n", " ('家用', 0.016122946393990072),\n", " ('說', 0.015515271195159285)]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# List the topic's representative words and their weights\n", "zh_topic_model.get_topic(2)" ] }, { "cell_type": "markdown", "id": "1ab7f9e8", "metadata": {}, "source": [ "We inspect the representative words and weights of the topic about housing and property. The top words include 房子 (house), 房貸 (mortgage), 錢 (money), 薪水 (salary), 工作 (work), and 貸款 (loan). They center on buying a home and on earned income, which plausibly mirrors the financial issues that dominate public discussion of marriage today." ] }, { "cell_type": "markdown", "id": "1fa4f4bd", "metadata": {}, "source": [ "View the word distribution of a specific topic" ] }, { "cell_type": "code", "execution_count": null, "id": "2c83e5fd", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 視覺化顯示主題-詞彙分佈\n", "topic_n = 2\n", "data = zh_topic_model.get_topic(topic_n)\n", "\n", "# 轉換為DataFrame\n", "df = pd.DataFrame(data, columns=['word', 'prob'])\n", "df = df[df['word'] != ' ']\n", "\n", "# 根據prob排序並選出前10名\n", "top_10 = df.sort_values('prob', ascending=False).head(10)\n", "\n", "# 畫出長條圖\n", "plt.figure(figsize=(10,6))\n", "plt.barh(top_10['word'], top_10['prob'], color='navy')\n", "plt.xlabel('機率')\n", "plt.title(f'主題 {topic_n} 詞彙機率前10名')\n", "plt.gca().invert_yaxis()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "6733a2b3", "metadata": {}, "source": [ "我們也列出了前十高的詞彙,從分布長條圖能看出,房子是提及最多的,其次是房貸,還有如萬、錢這類與金錢相關的用字,說明這個主題中討論最多的還是多圍繞在房產及金錢的劃分,可能可以反映出婚姻關係中與房屋相關的經濟負擔、收入財產分配的爭議是最常見的。" ] }, { "cell_type": "code", "execution_count": null, "id": "f815a762", "metadata": {}, "outputs": [], "source": [ "from bertopic.representation import KeyBERTInspired\n", "# KeyBERT\n", "keybert = KeyBERTInspired()\n", "\n", "# 設定HDBscan模型\n", "hdbscan_model = HDBSCAN(min_cluster_size=5, min_samples=30)\n", "\n", "# 定義我們要用到的representation model\n", "representation_model = {\n", " \"KeyBERT\": keybert,\n", "}" ] }, { "cell_type": "code", "execution_count": null, "id": "91e02591", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-05-10 03:41:55,606 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm\n", "2025-05-10 03:43:13,490 - BERTopic - Dimensionality - Completed ✓\n", "2025-05-10 03:43:13,492 - BERTopic - Cluster - Start clustering the reduced embeddings\n", "2025-05-10 03:43:19,574 - BERTopic - Cluster - Completed ✓\n", "2025-05-10 03:43:19,586 - BERTopic - Representation - Fine-tuning topics using representation models.\n", "2025-05-10 03:43:32,595 - BERTopic - Representation - Completed ✓\n" ] } ], "source": [ "# 建立BERTopic模型\n", "representation_topic_model = BERTopic(\n", " # Sub-models\n", " embedding_model=bert_sentence_model,\n", 
" vectorizer_model=jieba_vectorizer,\n", " # 設定Representation model\n", " representation_model=representation_model,\n", " # Hyperparameters\n", " top_n_words=30,\n", " verbose=True\n", ")\n", "\n", "# Train model\n", "topics, probs = representation_topic_model.fit_transform(docs_zh, all_embeddings) " ] }, { "cell_type": "code", "execution_count": null, "id": "cf202515", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Topic</th><th>Count</th><th>Name</th><th>Representation</th><th>KeyBERT</th><th>Representative_Docs</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>-1</td><td>78153</td><td>-1_說_小孩_妳_</td><td>[說, 小孩, 妳, , 老婆, 想, 老公, 真的, 做, 工作, 離婚, 孩子, 生活...</td><td>[有沒有, 好好, 回家, 我媽, 房子, 希望, 開心, 婆家, 幫忙, 孩子]</td><td>[當她想吃東西時, 所以原又有回公婆家, 無意中從爸爸手機中的訊息內容發現跟一個女生聊天所以...</td></tr>\n",
"    <tr><th>1</th><td>0</td><td>1852</td><td>0_ ___</td><td>[ , , , , , , , , , , , , , , , , , , , , , , ...</td><td>[, , , , , , , , , ]</td><td>[, , ]</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>883</td><td>1_洗_衣服_地板_乾淨</td><td>[洗, 衣服, 地板, 乾淨, 煮, 洗碗, 冰箱, 尿布, 浴室, 消毒, 空間, 洗碗機...</td><td>[洗碗機, 曬衣服, 洗衣機, 洗衣服, 晾衣服, 洗手, 拖地, 吸塵器, 機器人, 洗碗]</td><td>[我常常跟她說洗不乾淨, 現在兩人衣服也都我在洗, 只有洗自己的衣服]</td></tr>\n",
"    <tr><th>3</th><td>2</td><td>681</td><td>2_老婆_老公_岳父_你媽</td><td>[老婆, 老公, 岳父, 你媽, 婆婆, 公婆, 喊停, 老爺, 不爽, 媳婦, 老馬, 喊...</td><td>[老姑婆, 你媽, 我愛我, 噴你媽, 老娘, 你家, 老公, 掛你, 玩不動, 老婆]</td><td>[他老婆不可以, 所以才處處覺得老婆的不是, 你老婆根本有問題]</td></tr>\n",
"    <tr><th>4</th><td>3</td><td>489</td><td>3_個性_脾氣_性格_很會</td><td>[個性, 脾氣, 性格, 很會, 喜歡, 善良, 冷淡, 強勢, 溫和, 很強, 衛生習慣,...</td><td>[個性, 性格, 大男人主義, 好脾氣, 衛生習慣, 口頭禪, 和藹可親, 很強, 沒什麼,...</td><td>[這樣的個性, 對上這種個性的人, 同時她的個性也很好]</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>658</th><td>657</td><td>10</td><td>657_迴圈_無窮_循環_無限</td><td>[迴圈, 無窮, 循環, 無限, 沒學, 情況, 反感, 後援, 差異, 反正, , , ...</td><td>[迴圈, 無限, 無窮, 循環, 情況, 後援, 沒學, 反正, 差異, 反感]</td><td>[無限迴圈, 反正無窮迴圈, 無窮迴圈]</td></tr>\n",
"    <tr><th>659</th><td>658</td><td>10</td><td>658_偷不著_如來_偷_一把</td><td>[偷不著, 如來, 偷, 一把, 拚, 有趣, 不給, 玩, 做, , , , , , ,...</td><td>[偷不著, 不給, 有趣, 偷, 一把, 拚, 玩, 如來, 做, ]</td><td>[不如拚一把, 不如來的有趣, 偷不如偷不著]</td></tr>\n",
"    <tr><th>660</th><td>659</td><td>10</td><td>659_做得還_盡善盡美_面面俱到_幅度</td><td>[做得還, 盡善盡美, 面面俱到, 幅度, 自食其力, 有作, 太小, 到位, 承認, 終究...</td><td>[面面俱到, 盡善盡美, 做得還, 到位, 打算, 真的, 自食其力, 幅度, 我會, 算是]</td><td>[我自己很多地方有作到位, 我認為她進步幅度太小, 短期間內真的是很難做到盡善盡美]</td></tr>\n",
"    <tr><th>661</th><td>660</td><td>10</td><td>660_韓劇_電影_裡看_整部</td><td>[韓劇, 電影, 裡看, 整部, 聽相聲, 聽陸劇, 過劇, 有累, 偶然, 動畫, 看劇,...</td><td>[聽陸劇, 看電視, 聽相聲, 小時候, 休息時間, 偶然, 抬頭, 天天, 看劇, 那天]</td><td>[而且兩次就把整部韓劇看完, 而且說實在的整天晾在家裡看韓劇休息, 我跟老公會一起看韓劇電影動畫]</td></tr>\n",
"    <tr><th>662</th><td>661</td><td>10</td><td>661_人身攻擊_論點_言論自由_言論</td><td>[人身攻擊, 論點, 言論自由, 言論, 攻擊, 恰當, 酸言酸語, 片面, 指出, 無辜,...</td><td>[言論自由, 人身攻擊, 酸言酸語, 尊重, 理性, 攻擊, 恰當, 反對, 指出, 傷害]</td><td>[只要我感受到人身攻擊或是不受尊重, 不過很多人對人不對事對我的人身攻擊, 我沒有人身攻擊]</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>663 rows × 6 columns</p>\n",
"</div>
" ], "text/plain": [ " Topic Count Name \\\n", "0 -1 78153 -1_說_小孩_妳_ \n", "1 0 1852 0_ ___ \n", "2 1 883 1_洗_衣服_地板_乾淨 \n", "3 2 681 2_老婆_老公_岳父_你媽 \n", "4 3 489 3_個性_脾氣_性格_很會 \n", ".. ... ... ... \n", "658 657 10 657_迴圈_無窮_循環_無限 \n", "659 658 10 658_偷不著_如來_偷_一把 \n", "660 659 10 659_做得還_盡善盡美_面面俱到_幅度 \n", "661 660 10 660_韓劇_電影_裡看_整部 \n", "662 661 10 661_人身攻擊_論點_言論自由_言論 \n", "\n", " Representation \\\n", "0 [說, 小孩, 妳, , 老婆, 想, 老公, 真的, 做, 工作, 離婚, 孩子, 生活... \n", "1 [ , , , , , , , , , , , , , , , , , , , , , , ... \n", "2 [洗, 衣服, 地板, 乾淨, 煮, 洗碗, 冰箱, 尿布, 浴室, 消毒, 空間, 洗碗機... \n", "3 [老婆, 老公, 岳父, 你媽, 婆婆, 公婆, 喊停, 老爺, 不爽, 媳婦, 老馬, 喊... \n", "4 [個性, 脾氣, 性格, 很會, 喜歡, 善良, 冷淡, 強勢, 溫和, 很強, 衛生習慣,... \n", ".. ... \n", "658 [迴圈, 無窮, 循環, 無限, 沒學, 情況, 反感, 後援, 差異, 反正, , , ... \n", "659 [偷不著, 如來, 偷, 一把, 拚, 有趣, 不給, 玩, 做, , , , , , ,... \n", "660 [做得還, 盡善盡美, 面面俱到, 幅度, 自食其力, 有作, 太小, 到位, 承認, 終究... \n", "661 [韓劇, 電影, 裡看, 整部, 聽相聲, 聽陸劇, 過劇, 有累, 偶然, 動畫, 看劇,... \n", "662 [人身攻擊, 論點, 言論自由, 言論, 攻擊, 恰當, 酸言酸語, 片面, 指出, 無辜,... \n", "\n", " KeyBERT \\\n", "0 [有沒有, 好好, 回家, 我媽, 房子, 希望, 開心, 婆家, 幫忙, 孩子] \n", "1 [, , , , , , , , , ] \n", "2 [洗碗機, 曬衣服, 洗衣機, 洗衣服, 晾衣服, 洗手, 拖地, 吸塵器, 機器人, 洗碗] \n", "3 [老姑婆, 你媽, 我愛我, 噴你媽, 老娘, 你家, 老公, 掛你, 玩不動, 老婆] \n", "4 [個性, 性格, 大男人主義, 好脾氣, 衛生習慣, 口頭禪, 和藹可親, 很強, 沒什麼,... \n", ".. ... \n", "658 [迴圈, 無限, 無窮, 循環, 情況, 後援, 沒學, 反正, 差異, 反感] \n", "659 [偷不著, 不給, 有趣, 偷, 一把, 拚, 玩, 如來, 做, ] \n", "660 [面面俱到, 盡善盡美, 做得還, 到位, 打算, 真的, 自食其力, 幅度, 我會, 算是] \n", "661 [聽陸劇, 看電視, 聽相聲, 小時候, 休息時間, 偶然, 抬頭, 天天, 看劇, 那天] \n", "662 [言論自由, 人身攻擊, 酸言酸語, 尊重, 理性, 攻擊, 恰當, 反對, 指出, 傷害] \n", "\n", " Representative_Docs \n", "0 [當她想吃東西時, 所以原又有回公婆家, 無意中從爸爸手機中的訊息內容發現跟一個女生聊天所以... \n", "1 [, , ] \n", "2 [我常常跟她說洗不乾淨, 現在兩人衣服也都我在洗, 只有洗自己的衣服] \n", "3 [他老婆不可以, 所以才處處覺得老婆的不是, 你老婆根本有問題] \n", "4 [這樣的個性, 對上這種個性的人, 同時她的個性也很好] \n", ".. ... 
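\n", "658 [無限迴圈, 反正無窮迴圈, 無窮迴圈] \n", "659 [不如拚一把, 不如來的有趣, 偷不如偷不著] \n", "660 [我自己很多地方有作到位, 我認為她進步幅度太小, 短期間內真的是很難做到盡善盡美] \n", "661 [而且兩次就把整部韓劇看完, 而且說實在的整天晾在家裡看韓劇休息, 我跟老公會一起看韓劇電影動畫] \n", "662 [只要我感受到人身攻擊或是不受尊重, 不過很多人對人不對事對我的人身攻擊, 我沒有人身攻擊] \n", "\n", "[663 rows x 6 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Inspect the topic representations after KeyBERT fine-tuning\n", "representation_topic_model.get_topic_info()" ] }, { "cell_type": "markdown", "id": "dddf4700", "metadata": {}, "source": [ "With this model the number of topics jumps from six to 662, plus the outlier topic -1, which absorbs 78,153 documents. KeyBERTInspired only re-ranks the keywords that describe each topic and does not change how documents are clustered, so the increase comes from the clustering step rather than from the fine-tuning itself. Note that the `hdbscan_model` defined above is never passed to `BERTopic(...)`, so the library's default clustering settings apply here. The result is a much finer partition: possibly because the documents differ little in content, they are split into many small topics. This does surface more specific marital issues, such as the division of housework, parent-child interaction, and leisure activities, but such fine-grained fragmentation may make it harder to summarize the overall themes." ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "kernelspec": { "display_name": "Syllabus", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 5 }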