{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Licensed under the Apache License, Version 2.0 (the \"License\"); you may\n", "# not use this file except in compliance with the License. You may obtain\n", "# a copy of the License at\n", "#\n", "# http://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT\n", "# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the\n", "# License for the specific language governing permissions and limitations\n", "# under the License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Watson Studioで文字データを可視化しよう\n", "\n", "twitterから気になる情報を取得し、 Watson Natural Languege UnderstandingでKeyword抽出、WorldCloudをPixieDust を使用して表示してみましょう!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Part 1 - 分析データ作成\n", "\n", "## 1. Setup\n", "### 1.1 最新の Watson Developer Cloud, requests-oauthlib パッケージの導入\n", "Natural Languge Understandingに使用するWatson Developer CloudとTwitterのOAuth認証に使用する requests-oauthlibパッケージを導入します。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install --upgrade ibm-watson\n", "\n", "!pip install requests requests_oauthlib" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### 1.2 PixieDust Libraryの導入\n", "このノートブックでは、PixieDustライブラリを使用してデータセットを分析および視覚化します。\n", "\n", "PixieDustの詳細は[Introductory Notebook](https://dataplatform.cloud.ibm.com/exchange/public/entry/view/5b000ed5abda694232eb5be84c3dd7c1) または [PixieDust Github](https://ibm-cds-labs.github.io/pixiedust/) を参照してください。\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "次のセルを実行して、最新バージョンのPixieDustを実行していることを確認します。 ローカルのjupyter notyebookを使用し、PixieDustをローカルにインストール済みで、それを使用したい場合は、このセルを実行しないでください。\n", "\n", "尚、正式リリース前の`https://github.com/pixiedust/pixiedust.git@va-working-branch#egg=pixiedust`はフォント指定が可能なモジュールで、正式リリースまで一時的に利用しています。日本語表示を可能にするために使用しています。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# To confirm you have the latest version of PixieDust on your system, run this cell\n", "#!pip install -U --no-deps pixiedust\n", "!pip install --upgrade --no-deps git+https://github.com/pixiedust/pixiedust.git@va-working-branch#egg=pixiedust" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PixieDustをインポートし、カーネルをRestartが必要な場合はRestartさせます。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pixiedust" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " #### 1.4 Option\n", " Pixiedust runtime updated. 
Please restart kernel と表示された場合は、上のメニューの`Kernel`-> `Restart`からカーネルをRestartさせてください。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### 1.3 wordcloud Libraryの導入\n", "\n", "日本語が表示できるように、日本語フォントも導入します。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install --user wordcloud" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#日本語フォントの導入\n", "jp_font_path ='/home/dsxuser/work/ipaexg00301/ipaexg.ttf'\n", "\n", "import os\n", "if not os.path.exists(jp_font_path):\n", " !wget https://oscdl.ipa.go.jp/IPAexfont/ipaexg00301.zip\n", " !unzip ipaexg00301.zip\n", "else:\n", " print('IPA font has been already installed')\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### 1.4 PixieDust にWordCloudの設定\n", "PixieDustにWordCloud形式でデータが表示できる チャートを追加しましょう。\n", "PixieDustは自分で設定したフォーマットでデータ表示するチャートを設定することができます。\n", "\n", "今回は1列目にwordcloudに表示する文字、2列目にその表示Volume数を入れたpandasのDataFrameを渡すと、wordcloudを表示するチャートを設定します。\n", "例えば下記のようなデータです:\n", "\n", "```\n", "import pandas as pd\n", "df = pd.DataFrame([[\"四月\", 26],[\"May\", 10],[\"June\", 5]], columns=['key', 'value'])\n", "```\n", "|   | Key | Value |\n", "| ---- | ---- | ---- |\n", "| 0 | 四月 | 26 |\n", "| 1 | May | 10 |\n", "| 2 | June | 5 |" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pixiedust.display.display import *\n", "import io\n", "import base64\n", "from wordcloud import WordCloud\n", "\n", "class SimpleWordCloudDisplay(Display):\n", " def doRender(self, handlerId):\n", " # convert from dataframe to dict\n", " dfdict = {}\n", " # df = self.entity.toPandas()\n", " df = self.entity\n", " for x in range(len(df)):\n", " currentid = df.iloc[x,0] or 'NoKey'\n", " currentvalue = df.iloc[x,1]\n", " dfdict.setdefault(currentid, 0)\n", " dfdict[currentid] = dfdict[currentid] + currentvalue\n", " \n", " ##remove 'Others' stopwordsオプションが効かないのでマニュアル削除\n", " #if ('Others' in dfdict.keys())==True:\n", " # dfdict.pop('Others' )\n", "\n", " # create word cloud from dict\n", " wc = WordCloud(background_color=\"white\", width=800, height=400, max_font_size=140, font_path=jp_font_path).fit_words(dfdict)\n", " #wc = WordCloud(background_color=\"white\", max_font_size=140, font_path=jp_font_path).fit_words(dfdict)\n", "\n", "\n", " # encode word cloud image to base64 string\n", " img = wc.to_image()\n", " buffer =io.BytesIO()\n", " img.save(buffer,format=\"JPEG\") #Enregistre l'image dans le buffer\n", " myimage = buffer.getvalue() \n", " img_str = base64.b64encode(myimage)\n", " \n", "\n", " self._addHTMLTemplateString(\n", "\"\"\"\n", "
<center><img src=\"data:image/jpeg;base64,{0}\"></center>\n",
    "
\n", "\"\"\".format(img_str.decode(\"ascii\"))\n", " \n", " )\n", " \n", "@PixiedustDisplay()\n", "class SimpleWordCloudMeta(DisplayHandlerMeta):\n", " @addId\n", " def getMenuInfo(self,entity,dataHandler):\n", " if entity.__class__.__name__ == \"DataFrame\":\n", " return [\n", " {\n", " \"categoryId\": \"Chart\",\n", " \"title\": \"Simple Word Cloud\",\n", " \"icon\": \"fa-cloud\",\n", " \"id\": \"mySimpleWordCloud\"\n", " }\n", " ]\n", " else:\n", " return []\n", "\n", " def newDisplayHandler(self,options,entity):\n", " return SimpleWordCloudDisplay(options,entity)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.5 WordCloudのTest\n", "表示できるか確認してみましょう!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pixiedust": { "displayParams": { "handlerId": "mySimpleWordCloud", "keyFields": "key", "rendererId": "matplotlib", "valueFields": "value" } } }, "outputs": [], "source": [ "#Test Code\n", "import pandas as pd\n", "\n", "df = pd.DataFrame([[\"四月\", 26],[\"May\", 10],[\"June\", 5]], columns=['key', 'value'])\n", "display(df, font_path='/home/dsxuser/work/ipaexg00301/ipaexg.ttf')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2. 環境準備\n", "\n", "以下の3つができるように環境の設定を行います。\n", "\n", "1: Twitter APIを使用してTweet Dataを取得します。\n", "Twitter API KEYの取得が必要です。ここではその取得方法は説明しませんので、お持ちでない方は「Twitter API KEY 取得」等で検索し、取得お願いします。以下の4つの値が必要です。\n", "\n", "* AccessToken\n", "* AccessTokenSecret\n", "* ConsumerKey\n", "* ConsumerSecret\n", "\n", "2: IBM Cloud Natural Language Understanding サービスでTweetを分析します。\n", "Natural Language Understanding サービスの作成しAPIKEY,URLの値が必要です。\n", "Natural Language Understanding サービスが未作成の方は、[こちら](https://cloud.ibm.com/catalog/services/natural-language-understanding)より作成してください。\n", "\n", "3: 設定ファイルや作成したファイルをIBM Cloud Object Strageを使用してて読み込み、保存します。\n", "IBM Cloud Object Strageを使用する設定を以下で行います。\n", "\n", "### 2.1 apikeys.iniの作成\n", "[apikeys.in](https://raw.githubusercontent.com/kyokonishito/pixiedust-twitter-analysis/master/apikeys.ini)をダウンロードします。\n", "リンクを右クリックし、「リンク先を別名で保存」をしてください。\n", "\n", "ダウンロードしたapikeys.iniを以下のように準備した1、2の値を記入して保存してください。\n", "[ ]でかこまれた部分を自分のKEYに変更します。[ ]は不要ですので残さないようにしてください。\n", "```\n", "[TWITTER]\n", "TWITTER_AccessToken=[AccessTokenの値]\n", "TWITTER_AccessTokenSecret=[AccessTokenSecretの値]\n", "TWITTER_ConsumerKey=[ConsumerKeyの値]\n", "TWITTER_ConsumerSecret=[ConsumerSecreの値]\n", "[IBM]\n", "NLU_APIKEY=[Natural Language UnderstandingサービスのAPIの値]\n", "NLU_URL=[Natural Language UnderstandingサービスのURLの値]\n", "```\n", "\n", "### 2.2 apikeys.iniのアップロード\n", "Watson Studioのjupyter note book上、右上の`0101`というアイコンをクリックし、Fileアップロードの画面を出します。`Drop your file here or browse your files to add a new file` と書いてある場所に、2.1で作成したファイルをドロップし、Watson Studio上のProjectにアップロードします。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "### 2.3 apikeys.iniの読み込み\n", "\n", " 1. **下のセルを選択して、空の行にカーソルを置いてください。** \n", "\n", "2. アップロードしたTweeterAPIKey.iniファイルの下にある(見えない場合は右上の10/01アイコンをクリック) `Insert to code`の下にある`Insert StreamingBody object`をクリックしてください。\n", "\n", "3. ファイルを読み込むストリーム`streaming_body_2`をセットするコードが挿入されます。\n", "\n", "4. 4箇所ある`streaming_body_2`は 全て`streaming_body_1`に変更します。(後のコードで使用するため)\n", "\n", "5. 編集が終わったらセルを実行します。\n", "\n", "

\n", "# Your data file was loaded into a botocore.response.StreamingBody object.
\n", "# Please read the documentation of ibm_boto3 and pandas to learn more about your possibilities to load the data.
\n", "# ibm_boto3 documentation: https://ibm.github.io/ibm-cos-sdk-python/
\n", "# pandas documentation: http://pandas.pydata.org/
\n", "streaming_body_2 = client_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.get_object(Bucket='wordcloud-donotdelete-pr-wyztdevipqhfkt', Key='apikeys.ini')['Body']

\n", "

\n", " # add missing __iter__ method, so pandas accepts body as file-like object\n", "if not hasattr(streaming_body_2, \"__iter__\"): streaming_body_2.__iter__ = types.MethodType( __iter__, streaming_body_2 ) \n", "

\n", "\n", "\n", "## この下に入力 " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# この行の下にカーソルを置いて、Insert StreamingBody objectをクリック\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 ConfigParserへの設定情報の読み込み" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import configparser\n", "\n", "inifile = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())\n", "inifile.read_string(streaming_body_1.read().decode('utf-8'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.5 IBM Objerct StrageのCredentialセット\n", "結果のファイル保存に使用するため、IBM Objerct StrageのCredentialをセットします。\n", "\n", " 1. **下のセルを選択して、空の行にカーソルを置いてください。** \n", "\n", "2. アップロードしたTweeterAPIKey.iniファイルの下にある(見えない場合は右上の10/01アイコンをクリック) `Insert to code`の下にある`Insert Credentials`をクリックしてください。\n", "\n", "3. ファイルを読み込むストリーム`credentials_2`をセットするコードが挿入されます。\n", "\n", "4. `credentials_2`は 全てcredentials_1`に変更します。(後のコードで使用するため)\n", "5. 編集が終わったらセルを実行します。\n", "\n", "

\n", "# @hidden_cell
\n", "# The following code contains the credentials for a file in your IBM Cloud Object Storage.
\n", "# You might want to remove those credentials before you share your notebook.
\n", "
\n", " credentials_2 = {\n", " 'IAM_SERVICE_ID': 'iam-ServiceId-xxxxxxxx-xxxx-xxxx-xxxx-1234567890xx',
\n", " 'IBM_API_KEY_ID': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
\n", " 'ENDPOINT': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
\n", " 'IBM_AUTH_ENDPOINT': 'https://iam.bluemix.net/oidc/token',
\n", " 'BUCKET': 'wordcloud-donotdelete-pr-wyztdevipqhfkt',
\n", " 'FILE': 'apikeys.ini'
\n", "}

\n", "\n", "## この下に入力 " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# この行の下にカーソルを置いて、Insert Credentialsをクリック\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3. Tweet情報の取得と保存" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 パラメータの設定\n", "検索したい文字列、取得したいTweet数、検索の繰り返し回数、結果をサマリーするKeyword出現数をセット。\n", "1回の検索で最大100 件しか取得できないので、それ以上取得したい場合は、繰り返し回数をセットする。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#以下のパラメータをセットする\n", "#検索したい文字列\n", "#基本は最新1週間分\n", "search_key = \"IBM\"\n", "\n", "#1回の検索で取得したいTweet数, 最大は100\n", "search_count=100\n", "\n", "#検索の繰り返し回数\n", "repeat_count=5\n", "\n", "#以下の数以下のKeywordはOthersでまとめる。Othersはwordcloudには表示しない\n", "summary_num = 5\n", "\n", "#search_until=\"2019-04-07\" #期間を指定したい場合はコメントをはずす" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Tweet情報の取得\n", "Twitter Search APIを使って、Tweet情報を取得します。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import urllib\n", "import calendar\n", "import time\n", "from requests_oauthlib import OAuth1Session\n", "\n", "CK = inifile['TWITTER']['TWITTER_ConsumerKey']\n", "CS = inifile['TWITTER']['TWITTER_ConsumerSecret']\n", "AT = inifile['TWITTER']['TWITTER_AccessToken']\n", "ATS = inifile['TWITTER']['TWITTER_AccessTokenSecret']\n", "twitter = OAuth1Session(CK, CS, AT, ATS) \n", "search_endpoint = \"https://api.twitter.com/1.1/search/tweets.json\"\n", "\n", "\n", "\n", "def search_tw(tw, params, search_endpoint):\n", "\n", " url = \"https://api.twitter.com/1.1/search/tweets.json\" # 取得エンドポイント\n", " res = tw.get(search_endpoint, params = params)\n", "\n", " if res.status_code == 200: # 正常通信出来た場合\n", " timelines = json.loads(res.text) # レスポンスからタイムラインリストを取得\n", " df_tweets = pd.DataFrame.from_dict(timelines.get('statuses'), dtype='object')\n", " next_param = timelines.get('search_metadata').get('next_results')\n", " if next_param is not None:\n", " next_param = urllib.parse.parse_qs(next_param.replace('?', ''))\n", " return df_tweets, next_param\n", "\n", " else: # 正常通信出来なかった場合\n", " print(\"Failed: %d\" % res.status_code)\n", " raise ValueError(\"Failed: %d\" % res.status_code)\n", " \n", "def YmdHMS(created_at):\n", " time_utc = time.strptime(created_at, '%a %b %d %H:%M:%S +0000 %Y')\n", " unix_time = calendar.timegm(time_utc)\n", " time_local = time.localtime(unix_time)\n", " return str(time.strftime(\"%Y-%m-%d %H:%M:%S\", time_local))\n", " \n", "param_dict = {} \n", "param_dict['q'] = search_key\n", "#param_dict['until'] = search_until #期間を指定したい場合はコメントをはずす\n", "param_dict['result_type'] = 'recent'\n", "param_dict['count'] = search_count\n", "param_dict['lang'] = 'ja'\n", "\n", "next_params = param_dict\n", "df_tweets= pd.DataFrame()\n", "last_id = ''\n", "for i in range(int(repeat_count)):\n", " print('repaet#:{}'.format(i))\n", " df, next_params = search_tw(twitter, next_params, search_endpoint)\n", " df_tweets = pd.concat([df_tweets, df], ignore_index=True,)\n", " if len(df) == 0:\n", " break\n", " if next_params is None:\n", " before_last_id = last_id\n", " last_id = df.tail(1)['id'].astype(str).values[0]\n", " if before_last_id == last_id:\n", " break\n", " param_dict['max_id'] = last_id\n", " next_params = param_dict\n", "if len(df_tweets) > 0:\n", " df_tweets['create_at_jst'] = df_tweets.apply(lambda x: YmdHMS(x['created_at']) if x.dtype == \"object\" else x, axis=1)\n", " df_tweets = 
df_tweets.sort_values(by=[\"create_at_jst\"], ascending=True)\n", " col_list = list(df_tweets.columns.values)\n", " del col_list[-1]\n", " col_list.insert(0, 'create_at_jst')\n", " df_tweets = df_tweets.reindex(columns=col_list)\n", "else:\n", " print('No Data')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3 Tweet情報の取得\n", "結果を一旦ファイルに保存します。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#IBM Object Storageへの検索保存\n", "tweet_result_filename = 'tweets_results.csv'\n", "\n", "cos = ibm_boto3.client(service_name='s3',\n", " ibm_api_key_id=credentials_1['IBM_API_KEY_ID'],\n", " ibm_service_instance_id=credentials_1['IAM_SERVICE_ID'],\n", " ibm_auth_endpoint=credentials_1['IBM_AUTH_ENDPOINT'],\n", " config=Config(signature_version='oauth'),\n", " endpoint_url=credentials_1['ENDPOINT'])\n", "\n", "\n", "# Write a CSV file from the enriched pandas DataFrame.\n", "df_tweets.to_csv(tweet_result_filename, index=False)\n", "\n", "# Use the above put_file method with credentials to put the file in Object Storage.\n", "cos.upload_file(tweet_result_filename, Bucket=credentials_1['BUCKET'],Key=tweet_result_filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 4. TweetからNatural Language Understandingを使用してKeyword抽出\n", "\n", "### 4.1 前処理後NLU呼び出しと合計処理" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from ibm_watson import NaturalLanguageUnderstandingV1\n", "from ibm_watson.natural_language_understanding_v1 import Features, KeywordsOptions\n", "from ibm_watson import ApiException\n", "\n", "\n", "df_tweets['refineText']=df_tweets['text']\n", "\n", "# 改行削除\n", "df_tweets['refineText'] = df_tweets.apply(lambda x: x['refineText'].replace('\\n', ' '), axis=1)\n", "df_tweets['refineText'] = df_tweets.apply(lambda x: x['refineText'].strip(), axis=1)\n", "\n", "# RTマーク削除\n", "df_tweets['refineText'] = df_tweets['refineText'].replace(r'^RT.*?:', '', regex=True)\n", "\n", "# hasshutag記号(#)削除\n", "df_tweets['refineText'] = df_tweets.apply(lambda x: x['refineText'].replace('#', ' '), axis=1)\n", "\n", "# linkの削除\n", "df_tweets['refineText'] = df_tweets['refineText'].replace(r'(http:\\/\\/[\\x21-\\x7e]+)', '', regex=True)\n", "df_tweets['refineText'] = df_tweets['refineText'].replace(r'(https:\\/\\/[\\x21-\\x7e]+)', '', regex=True)\n", "\n", "# カッコ削除\n", "df_tweets['refineText'] = df_tweets.apply(lambda x: x['refineText'].replace('(', ' '), axis=1)\n", "df_tweets['refineText'] = df_tweets.apply(lambda x: x['refineText'].replace(')', ' '), axis=1)\n", "df_tweets['refineText'] = df_tweets.apply(lambda x: x['refineText'].replace('【', ' '), axis=1)\n", "df_tweets['refineText'] = df_tweets.apply(lambda x: x['refineText'].replace('】', ' '), axis=1)\n", "df_tweets['refineText'] = df_tweets.apply(lambda x: x['refineText'].replace('[', ' '), axis=1)\n", "df_tweets['refineText'] = df_tweets.apply(lambda x: x['refineText'].replace(']', ' '), axis=1)\n", "df_tweets['refineText'] = df_tweets.apply(lambda x: x['refineText'].replace('「', ' '), axis=1)\n", "df_tweets['refineText'] = df_tweets.apply(lambda x: x['refineText'].replace('」', ' '), axis=1)\n", "df_tweets['refineText'] = df_tweets.apply(lambda x: x['refineText'].replace('『', ' '), axis=1)\n", "df_tweets['refineText'] = df_tweets.apply(lambda x: x['refineText'].replace('』', ' '), axis=1)\n", "\n", "\n", "nlu = NaturalLanguageUnderstandingV1(version='2018-11-16', url=inifile['IBM']['NLU_URL'], 
iam_apikey=inifile['IBM']['NLU_APIKEY'])\n", "\n", "features = Features(keywords=KeywordsOptions())\n", "keywords = []\n", "df_pixiedust = pd.DataFrame(columns=['Key', 'Value'])\n", "\n", "\n", "for text, i in zip(df_tweets['refineText'], df_tweets.index):\n", " if not text:\n", " keywords.append(' ')\n", " continue\n", "\n", " try:\n", " enriched_json = nlu.analyze(text=text, features=features, language='ja') #NLU 呼び出し\n", " except ApiException as ex:\n", " print('{}:text: {} -- Method failed with status code {}: {}'.format(i, text, str(ex.code), ex.message))\n", " keywords.append(' ')\n", " continue\n", "\n", "\n", " # Iterate and get KEYWORDS with a confidence of over 70%\n", " if 'keywords' in enriched_json.result:\n", " tmpkw = []\n", " for kw in enriched_json.result[\"keywords\"]:\n", " if (float(kw[\"relevance\"]) >= 0.7):\n", " tmpkw.append(kw[\"text\"])\n", " df_pixiedust = df_pixiedust.append({'Key': kw[\"text\"], 'Value': 1}, ignore_index=True)\n", " # Convert multiple keywords in a list to a string and append the string\n", " keywords.append(', '.join(tmpkw))\n", " else:\n", " keywords.append(\"\")\n", " print('{}:keyword: {}'.format(i, tmpkw))\n", "\n", "df_tweets['Keywords'] = keywords\n", "\n", "df_pixiedust = df_pixiedust.groupby('Key').count()\n", "df_pixiedust = df_pixiedust.sort_values(by=[\"Value\"], ascending=False)\n", "df_pixiedust = df_pixiedust.reset_index()\n", "\n", "# Write a CSV file from the enriched pandas DataFrame.\n", "df_tweets.to_csv(tweet_result_filename, index=False)\n", "\n", "# Use the above put_file method with credentials to put the file in Object Storage.\n", "cos.upload_file(tweet_result_filename, Bucket=credentials_1['BUCKET'],Key=tweet_result_filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 抽出データ確認" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "df_pixiedust" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 <<オプション>> 抽出データのうち、小さなものはまとめる\n", "Keywordの出現回数が少ないものが多数を占めると、グラフが見にくくなる場合は、`3.1 パラメータの設定の設定`で設定したsummary_num以下の場合はOthersにまとめる。\n", " 以下を実行し表示は`4.3 <<オプション>>WordCloudで表示してみましょう! :Summary`の方を実施\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_pixiedust_sum = df_pixiedust[df_pixiedust['Value'] > summary_num -1]\n", "# サマリーしたデータをOthersとして表示させたい場合は以下のコメントを外す\n", "#df_pixiedust_others = df_pixiedust[df_pixiedust['Value']< summary_num]\n", "#df_pixiedust_others = df_pixiedust_others['Value'].sum()\n", "#df_pixiedust_sum = df_pixiedust_sum.append({'Key': 'Others', 'Value': df_pixiedust_others}, ignore_index=True)\n", "df_pixiedust_sum = df_pixiedust_sum.sort_values(by=[\"Value\"], ascending=False)\n", "df_pixiedust_sum" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.3 WordCloudで表示してみましょう!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(df_pixiedust, font_path = '/home/dsxuser/work/ipaexg00301/ipaexg.ttf')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.3 <<オプション>>WordCloudで表示してみましょう! 
:Summary" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(df_pixiedust_sum, font_path = '/home/dsxuser/work/ipaexg00301/ipaexg.ttf') #font_path指定はtemporary fix\n" ] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3.5", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.5" } }, "nbformat": 4, "nbformat_minor": 1 }