{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.simplefilter(action='ignore')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os\n", "import re\n", "import tarfile\n", "import urllib.request\n", "\n", "import tensorflow as tf\n", "from tensorflow.keras.preprocessing import sequence\n", "from tensorflow.keras.preprocessing.text import Tokenizer\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Data preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.1 Define a function that strips HTML tags from text" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def rm_tags(text):\n", "    re_tag = re.compile(r'<[^>]+>')\n", "    return re_tag.sub('', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.2 Define a function that reads the IMDb file directories" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def read_files(file_type):\n", "    path = 'data/aclImdb/'\n", "    file_list = []\n", "\n", "    positive_path = path + file_type + '/pos/'\n", "    for f in os.listdir(positive_path):\n", "        file_list += [positive_path + f]\n", "    n_positive = len(file_list)\n", "\n", "    negative_path = path + file_type + '/neg/'\n", "    for f in os.listdir(negative_path):\n", "        file_list += [negative_path + f]\n", "\n", "    print('read', file_type, 'files:', len(file_list))\n", "\n", "    # Label positive reviews 1 and negative reviews 0, in file_list order\n", "    all_labels = [1] * n_positive + [0] * (len(file_list) - n_positive)\n", "    all_texts = []\n", "    for file in file_list:\n", "        with open(file, encoding='utf8') as f:\n", "            all_texts += [rm_tags(\" \".join(f.readlines()))]\n", "\n", "    return all_labels, all_texts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.3 Load the dataset" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "read train files: 25000\n", "read test files: 25000\n" ] } ], 
"source": [ "y_train, train_text = read_files('train')\n", "y_test, test_text = read_files('test')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.4 Build the tokenizer" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "token = Tokenizer(num_words=2500)  # vocabulary size raised to 2500\n", "token.fit_on_texts(train_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.5 Convert the review text into integer sequences" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "x_train_seq = token.texts_to_sequences(train_text)\n", "x_test_seq = token.texts_to_sequences(test_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.6 Truncate or pad every integer sequence to length 150" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "x_train = sequence.pad_sequences(x_train_seq, maxlen=150)  # sequence length raised to 150\n", "x_test = sequence.pad_sequences(x_test_seq, maxlen=150)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. 
Build the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1 Build a Sequential model" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "model = tf.keras.models.Sequential([\n", "\n", "    # Embedding layer\n", "    tf.keras.layers.Embedding(input_length=150, input_dim=2500, output_dim=32),\n", "    tf.keras.layers.Dropout(0.2),\n", "\n", "    # RNN layer (16 units)\n", "    tf.keras.layers.SimpleRNN(units=16),\n", "\n", "    # Hidden layer\n", "    tf.keras.layers.Dense(units=256, activation='relu'),\n", "    tf.keras.layers.Dropout(0.2),\n", "\n", "    # Output layer\n", "    tf.keras.layers.Dense(units=1, activation='sigmoid')\n", "])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2 Inspect the model summary" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_________________________________________________________________\n", "Layer (type)                 Output Shape              Param #\n", "=================================================================\n", "embedding (Embedding)        (None, 150, 32)           80000\n", "_________________________________________________________________\n", "dropout (Dropout)            (None, 150, 32)           0\n", "_________________________________________________________________\n", "simple_rnn (SimpleRNN)       (None, 16)                784\n", "_________________________________________________________________\n", "dense (Dense)                (None, 256)               4352\n", "_________________________________________________________________\n", "dropout_1 (Dropout)          (None, 256)               0\n", "_________________________________________________________________\n", "dense_1 (Dense)              (None, 1)                 257\n", "=================================================================\n", "Total params: 85,393\n", "Trainable params: 85,393\n", "Non-trainable params: 0\n", "_________________________________________________________________\n", "None\n" ] } ], "source": [ "print(model.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. 
Train the model" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train on 20000 samples, validate on 5000 samples\n", "Epoch 1/10\n", " - 11s - loss: 0.6622 - acc: 0.6248 - val_loss: 0.9930 - val_acc: 0.0000e+00\n", "Epoch 2/10\n", " - 11s - loss: 0.6443 - acc: 0.6301 - val_loss: 0.8296 - val_acc: 0.5188\n", "Epoch 3/10\n", " - 11s - loss: 0.5821 - acc: 0.6883 - val_loss: 1.1692 - val_acc: 0.0000e+00\n", "Epoch 4/10\n", " - 13s - loss: 0.6347 - acc: 0.6368 - val_loss: 0.8651 - val_acc: 0.2194\n", "Epoch 5/10\n", " - 14s - loss: 0.6052 - acc: 0.6611 - val_loss: 0.7075 - val_acc: 0.5838\n", "Epoch 6/10\n", " - 12s - loss: 0.5669 - acc: 0.6980 - val_loss: 0.6381 - val_acc: 0.6754\n", "Epoch 7/10\n", " - 12s - loss: 0.5157 - acc: 0.7483 - val_loss: 0.9720 - val_acc: 0.3192\n", "Epoch 8/10\n", " - 11s - loss: 0.5116 - acc: 0.7441 - val_loss: 1.4460 - val_acc: 0.1446\n", "Epoch 9/10\n", " - 12s - loss: 0.4598 - acc: 0.7867 - val_loss: 0.4969 - val_acc: 0.8100\n", "Epoch 10/10\n", " - 12s - loss: 0.4330 - acc: 0.8026 - val_loss: 0.2970 - val_acc: 0.8818\n" ] } ], "source": [ "# Caution: validation_split holds out the last 20% of the samples, before any shuffling.\n", "# read_files returns every positive review first, so the 5000 validation samples here\n", "# are all negative, which is why val_acc swings so widely in the training log.\n", "train_history = model.fit(x=x_train, y=y_train, validation_split=0.2,\n", "                          epochs=10, batch_size=100, verbose=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. 
Plot the training history" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def show_train_history(train_history, train, validation):\n", "    plt.plot(train_history.history[train])\n", "    plt.plot(train_history.history[validation])\n", "    plt.title('Train History')\n", "    plt.xlabel('Epoch')\n", "    plt.ylabel(train)\n", "    plt.legend(['train', 'validation'], loc='upper left')\n", "    plt.show()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# On TensorFlow 2.x the history keys are 'accuracy' and 'val_accuracy'\n", "show_train_history(train_history, 'acc', 'val_acc')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_train_history(train_history, 'loss', 'val_loss')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5. Evaluate the model's accuracy" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "25000/25000 [==============================] - 12s 495us/step\n", "\n", "accuracy: 0.77672\n" ] } ], "source": [ "scores = model.evaluate(x_test, y_test)\n", "print()\n", "print('accuracy:', scores[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6. Make predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.1 Run the prediction" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# predict_classes was removed in TensorFlow 2.6; on newer versions use\n", "# (model.predict(x_test) > 0.5).astype('int32') instead\n", "predictions = model.predict_classes(x_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.2 Flatten the predictions into a 1-D array" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "predictions = predictions.reshape(-1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.3 Prediction results" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 0, 1, 1, 1, 0, 0, 1, 0, 0], dtype=int32)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictions[:10]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], dtype=int32)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictions[12500:12510]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.4 Compare the review text with the prediction" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "Sentiment_Dict = {1: 'positive', 0: 'negative'}\n", "\n", "def display_test_sentiment(idx):\n", "    print('Review text:')\n", "    print(test_text[idx])\n", 
"    print()\n", "    print('Label:', Sentiment_Dict[y_test[idx]])\n", "    print('Prediction:', Sentiment_Dict[predictions[idx]])" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Review text:\n", "I really like this show. It has drama, romance, and comedy all rolled into one. I am 28 and I am a married mother, so I can identify both with Lorelei's and Rory's experiences in the show. I have been watching mostly the repeats on the Family Channel lately, so I am not up-to-date on what is going on now. I think females would like this show more than males, but I know some men out there would enjoy it! I really like that is an hour long and not a half hour, as th hour seems to fly by when I am watching it! Give it a chance if you have never seen the show! I think Lorelei and Luke are my favorite characters on the show though, mainly because of the way they are with one another. How could you not see something was there (or take that long to see it I guess I should say)? Happy viewing!\n", "\n", "Label: positive\n", "Prediction: positive\n" ] } ], "source": [ "display_test_sentiment(2)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Review text:\n", "In Los Angeles, the alcoholic and lazy Hank Chinaski (Matt Dillon) performs a wide range of non-qualified functions just to get enough money to drink and gamble in horse races. His primary and only objective is writing and having sexy with dirty women.\"Factotum\" is an uninteresting, pointless and extremely boring movie about an irresponsible drunken vagrant that works a couple of days or weeks just to get enough money to buy spirits and gamble, being immediately fired due to his reckless behavior. In accordance with IMDb, this character would be the fictional alter-ego of the author Charles Bukowski, and based on this story, I will certainly never read any of his novels. 
Honestly, if the viewer likes this theme of alcoholic couples, better off watching the touching and heartbreaking Hector Babenco's \"Ironweed\" or Marco Ferreri's \"Storie di Ordinaria Follia\" that is based on the life of the same writer. My vote is four.Title (Brazil): \"Factotum – Sem Destino\" (\"Factotum – Without Destiny\")\n", "\n", "Label: negative\n", "Prediction: negative\n" ] } ], "source": [ "display_test_sentiment(12502)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7. Predict new reviews" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 7.1 Define a function that predicts the sentiment of a review text" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def predict_preview(input_text):\n", "    input_seq = token.texts_to_sequences([input_text])\n", "    pad_input_seq = sequence.pad_sequences(input_seq, maxlen=150)\n", "    predict_result = model.predict_classes(pad_input_seq)\n", "    print(Sentiment_Dict[predict_result[0][0]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 7.2 Predict some new reviews" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "negative\n" ] } ], "source": [ "input_text = '''\n", "It's hard to believe that the same talented director who made the influential cult action classic The Road Warrior had anything to do with this disaster.\n", "Road Warrior was raw, gritty, violent and uncompromising, and this movie is the exact opposite. It's like Road Warrior for kids who need constant action in their movies.\n", "This is the movie. The good guys get into a fight with the bad guys, outrun them, they break down in their vehicle and fix it. Rinse and repeat. The second half of the movie is the first half again just done faster.\n", "The Road Warrior may have been a simple premise but it made you feel something, even with it's opening narration before any action was even shown. 
And the supporting characters were given just enough time for each of them to be likable or relatable.\n", "In this movie there is absolutely nothing and no one to care about. We're supposed to care about the characters because... well we should. George Miller just wants us to, and in one of the most cringe worthy moments Charlize Theron's character breaks down while dramatic music plays to try desperately to make us care.\n", "Tom Hardy is pathetic as Max. One of the dullest leading men I've seen in a long time. There's not one single moment throughout the entire movie where he comes anywhere near reaching the same level of charisma Mel Gibson did in the role. Gibson made more of an impression just eating a tin of dog food. I'm still confused as to what accent Hardy was even trying to do.\n", "I was amazed that Max has now become a cartoon character as well. Gibson's Max was a semi-realistic tough guy who hurt, bled, and nearly died several times. Now he survives car crashes and tornadoes with ease?\n", "In the previous movies, fuel and guns and bullets were rare. Not anymore. It doesn't even seem Post-Apocalyptic. There's no sense of desperation anymore and everything is too glossy looking. And the main villain's super model looking wives with their perfect skin are about as convincing as apocalyptic survivors as Hardy's Australian accent is. They're so boring and one-dimensional, George Miller could have combined them all into one character and you wouldn't miss anyone.\n", "Some of the green screen is very obvious and fake looking, and the CGI sandstorm is laughably bad. It wouldn't look out of place in a Pixar movie.\n", "There's no tension, no real struggle, or any real dirt and grit that Road Warrior had. Everything George Miller got right with that masterpiece he gets completely wrong here. 
\n", "'''\n", "\n", "predict_preview(input_text)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "negative\n" ] } ], "source": [ "input_text = '''\n", "Sure, I'm a huge film snob who (on the surface) only likes artsy-fartsy foreign films from before the 60's, but that hasn't stopped me from loving Disney's Beauty & The Beast; in fact, it's probably my favorite American animated film and is easily Disney's finest work. It's beautiful, it's breathtaking, it's warm, it's hilarious, it's captivating, and, in Disney fashion, it's magical. When I learned that Disney would be remaking their classic films, B&TB was undeniably the best wrapped package. How could they go wrong?\n", "Oh man, they went wrong.\n", "First thing's first: this film is so flat. The directing was dull and uninteresting throughout the entire film and it honestly felt like one of the Twilight sequels...and then I looked it up and found out that, yes, director Bill Condon was the man behind Breaking Dawn parts 1 & 2. Every shot looks bored and uninterested, which contrasts heavily with the original animated film that was constantly popping with vibrancy. The script too is boring because it's almost a complete remake of the original, though I guess most people won't mind that.\n", "Next: the CGI is horrid. Although I didn't care for The Jungle Book from last year, I could at least admit that the CGI was breathtaking. The same cant be said for this film. Characters like Lumière, Cogsworth, Mrs Potts, and most of the cursed appliances have very strange, lifeless faces that are pretty off putting to be looking at for such a long time. All of the sets too look artificial and fake, especially the town towards the beginning. However, the biggest offender is easily and infuriatingly the character that mattered most: The Beast. The CGI on the Beast's face is so distracting that it completely takes you out of the film. 
His eyes are completely devoid of soul, and his mouth is a gaping video game black hole of fiction. Klaus Kinski looked much better in the Faerie Tale Theatre episode of Beauty & The Beast, and that was a 1984 TV show episode. But do you know why it looked better? Because it was an actual face with actual eyes, not some video game computerized synthetic monstrosity. When will studios learn that practical effects will always top CGI?\n", "Finally: wasted casting. Emma Watson is beautiful, but she's no Belle. She is completely devoid of the warmth and humanity that made the animated Belle so beloved. Instead, she is cold and heartless throughout most of the film. Kevin Kline is 100% wasted and does nothing except look old. Ian McKellan, Ewan McGregor, Emma Thompson, and even Dan Stevens as the Beast are very expendable and could've been played by anyone else. The only good characters are Gaston and LeFou, mostly because they are fun and played by actors who breathe new life into their original shapes. If anything, this film should've been about Gaston and LeFou, but that would never happen because that would mean Disney couldn't cater to blind nostalgic 90's kids.\n", "Overall, this film is a complete bore. It could've been better if even the special effects were good, but the CGI in particular is horrendous. I'm all for Disney remaking their nostalgia- catering 90's films, but they need to be interesting. This film, sadly, is not. Even the Christmas sequel is better than this film because it's at least something. \n", "'''\n", "\n", "predict_preview(input_text)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "positive\n" ] } ], "source": [ "input_text = '''\n", "I was really looking forward to this film. 
Not only has Disney recently made excellent live-action versions of their animated masterpieces (Jungle Book, Cinderella), but the cast alone (Emma Watson, Ian McKellen, Kevin Kline) already seemed to make this one a sure hit. Well, not so much as it turns out.\n", "Some of the animation is fantastic, but because characters like Cogsworth (the clock), Lumière (the candelabra) and Chip (the little tea cup) now look \"realistic\", they lose a lot of their animated predecessors' charm and actually even look kind of creepy at times. And ironically - unlike in the animated original - in this new realistic version they only have very limited facial expressions (which is a creative decision I can't for the life of me understand).\n", "Even when it works: there can be too much of a good thing. The film is overstuffed with lush production design and cgi (which is often weirdly artificial looking though) but sadly lacking in charm and genuine emotion. If this were a music album, I'd say it is \"over-produced\" and in need of more soul and swing. The great voice talent in some cases actually seems wasted, because it drowns in a sea of visual effects that numbs all senses. The most crucial thing that didn't work for me, though, is the Beast. He just never looks convincing. The eyes somehow don't look like real eyes and they're always slightly off.\n", "On the positive side, I really liked Gaston, and the actor who played him, Luke Evans, actually gave the perhaps most energized performance of all. Kevin Kline as Belle's father has little to do but to look fatherly and old, but he makes the most of his part. Speaking of Belle, now that I've seen the film, I think her role was miscast. I think someone like Rachel McAdams would actually have been a more natural, lively and perhaps a bit more feisty Belle than Emma Watson.\n", "If you love the original, you might want to give this one a pass, it's really not that good (although at least the songs were OK). 
Also, I'd think twice before bringing small children; without cute animated faces, all those \"realistic\" looking creatures and devices can be rather frightening for a child. \n", "'''\n", "\n", "predict_preview(input_text)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "negative\n" ] } ], "source": [ "input_text = '''\n", "Full disclosure, I didn't think the first movie was as bad as it was made out to be. It wasn't good in almost any sense, but it was to be expected given the combination of source material, resources and constraints.\n", "That said, this sequel is 20x better than the first. Having established the characters in the first movie, the actors seem to be able to act now comfortably in their parts. The story becomes much more nuanced with plenty of dynamics on the go.\n", "SPOILERS from now on\n", "Can they maintain a \"vanilla\" relationship? Is he going to become controlling again and ruin things? Will she let it get out of control and ruin things also or stay on it? Who is that stalky girl and what happened to her exactly? what about his mother? and that ex of his? Will something occur with her infatuated boss?\n", "On top of all of this, I realised while watching that the series was never about a bizarre sadist control freak, it's actually about all men and the story of a woman trying to find the balance between accepting or desiring the dominant behaviour of the male archetype and maintaining strength and independence in such a relationship.\n", "While of course the fact that he is rich, while possibly relating to the power struggle, looks like it is going to be more and more used for generating further drama. 
The romance is much more evident in this movie to\n", "'''\n", "\n", "predict_preview(input_text)" ] } ], "metadata": { "kernelspec": { "display_name": "tensorflow-keras-practice", "language": "python", "name": "tensorflow-keras-practice" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }