{ "cells": [ { "cell_type": "markdown", "id": "55a5f31d", "metadata": {}, "source": [ "# GOODREADS CONTENT-BASED BOOK RECOMMENDATION SYSTEM" ] }, { "cell_type": "code", "execution_count": 1, "id": "8119de64", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import pickle\n", "import fasttext\n", "from rake_nltk import Rake\n", "from sklearn.base import BaseEstimator, TransformerMixin\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.metrics.pairwise import linear_kernel\n", "from sklearn.pipeline import Pipeline\n", "from tqdm import tqdm\n", "\n", "tqdm.pandas()" ] }, { "cell_type": "markdown", "id": "3a999759", "metadata": {}, "source": [ "## GOAL\n", "\n", "Building a good recommender typically needs a combination of content and user data. Here, instead of using user data, I will use only content data and recommend the items that are most similar to the item entered by the user.\n", "\n", "## DATASET\n", "\n", "The dataset I used for this experiment is the [Goodreads' Best Book Dataset](https://www.kaggle.com/datasets/meetnaren/goodreads-best-books), which is available on Kaggle. The dataset contains 54,301 rows and 12 columns." ] }, { "cell_type": "code", "execution_count": 2, "id": "0e3a3fb3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " book_authors \\\n", "0 Suzanne Collins \n", "1 J.K. Rowling|Mary GrandPré \n", "2 Harper Lee \n", "3 Jane Austen|Anna Quindlen|Mrs. Oliphant|George... \n", "4 Stephenie Meyer \n", "\n", " book_desc \\\n", "0 Winning will make you famous. Losing means cer... \n", "1 There is a door at the end of a silent corrido... \n", "2 The unforgettable novel of a childhood in a sl... \n", "3 «È cosa ormai risaputa che a uno scapolo in po... \n", "4 About three things I was absolutely positive.F... \n", "\n", " book_edition book_format book_isbn book_pages \\\n", "0 NaN Hardcover 9.78044E+12 374 pages \n", "1 US Edition Paperback 9.78044E+12 870 pages \n", "2 50th Anniversary Paperback 9.78006E+12 324 pages \n", "3 Modern Library Classics, USA / CAN Paperback 9.78068E+12 279 pages \n", "4 NaN Paperback 9.78032E+12 498 pages \n", "\n", " book_rating book_rating_count book_review_count \\\n", "0 4.33 5519135 160706 \n", "1 4.48 2041594 33264 \n", "2 4.27 3745197 79450 \n", "3 4.25 2453620 54322 \n", "4 3.58 4281268 97991 \n", "\n", " book_title \\\n", "0 The Hunger Games \n", "1 Harry Potter and the Order of the Phoenix \n", "2 To Kill a Mockingbird \n", "3 Pride and Prejudice \n", "4 Twilight \n", "\n", " genres \\\n", "0 Young Adult|Fiction|Science Fiction|Dystopia|F... \n", "1 Fantasy|Young Adult|Fiction \n", "2 Classics|Fiction|Historical|Historical Fiction... \n", "3 Classics|Fiction|Romance \n", "4 Young Adult|Fantasy|Romance|Paranormal|Vampire... \n", "\n", " image_url \n", "0 https://images.gr-assets.com/books/1447303603l... \n", "1 https://images.gr-assets.com/books/1255614970l... \n", "2 https://images.gr-assets.com/books/1361975680l... \n", "3 https://images.gr-assets.com/books/1320399351l... \n", "4 https://images.gr-assets.com/books/1361039443l... 
" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('data/book_data.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 3, "id": "6224e89f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "54301" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "markdown", "id": "99a6652e", "metadata": {}, "source": [ "## Preprocessing" ] }, { "cell_type": "markdown", "id": "f7c2843b", "metadata": {}, "source": [ "As mentioned above, we need to calculate the similarity between items to recommend books to users. The similarity is calculated simply by looking at **how similar the items' keywords are**. In this Goodreads recommendation system, the keywords contain:\n", "- The book's first author\n", "- Important keywords from the description\n", "- The genres of the book\n", "\n", "In the `KeywordsTransformer` class, we gather each book's keywords into one column. We also filter the dataset so that it contains only books written in English, detected from the title and description using fastText." 
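, "\n", "\n", "As a rough illustration of the keyword string described above, here is a minimal sketch. `build_keywords` is a hypothetical helper that is not part of this notebook, and the description phrase is made up; in the notebook the phrases come from RAKE.\n", "\n", "```python\n", "def build_keywords(authors, description_phrases, genres):\n", "    # hypothetical helper: joins the three keyword sources into one string\n", "    first_author = authors.split('|')[0].replace(' ', '')  # fuse the name into one token\n", "    parts = [first_author] + list(description_phrases) + genres.split('|')\n", "    return ' '.join(parts)\n", "\n", "# the phrase 'silent corridor' is an invented example, not real RAKE output\n", "print(build_keywords('J.K. Rowling|Mary GrandPré', ['silent corridor'], 'Fantasy|Young Adult|Fiction'))\n", "# J.K.Rowling silent corridor Fantasy Young Adult Fiction\n", "```"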
] }, { "cell_type": "code", "execution_count": 4, "id": "861f7430", "metadata": {}, "outputs": [], "source": [ "class KeywordsTransformer(BaseEstimator, TransformerMixin):\n", "    \"\"\"\n", "    Gathers the keywords of each book into one column and keeps only English-language books.\n", "    The keywords contain the book's first author, its genres, and important keywords from the description.\n", "    \"\"\"\n", "\n", "    def __init__(self):\n", "        print('keywords transformer called...')\n", "\n", "    def fit(self, X, y=None):\n", "        return self\n", "\n", "    def transform(self, X, y=None):\n", "        X_ = X.copy()\n", "\n", "        # flag English books for the recommendation system\n", "        print('filtering English books only...')\n", "        self.lang_model = fasttext.load_model('lid.176.ftz')\n", "        X_['lang_status'] = X_.progress_apply(lambda x: self.lang_detect(x), axis=1)\n", "\n", "        # get the first author of each book and start the keywords with it\n", "        print('getting first author...')\n", "        X_['first_author'] = X_.apply(lambda x: self.set_author(x), axis=1)\n", "        X_['keywords'] = X_.progress_apply(lambda x: ''.join(x['first_author'].split(' ')), axis=1)\n", "\n", "        # add important keywords from the description\n", "        print('getting description...')\n", "        rake = Rake()\n", "        X_['keywords'] = X_.progress_apply(lambda x: self.set_description(rake, x), axis=1)\n", "\n", "        # add the genres of the book\n", "        print('getting genre...')\n", "        X_['keywords'] = X_.progress_apply(lambda x: self.set_genres(x), axis=1)\n", "\n", "        # drop the books that are not written in English\n", "        X_ = X_[X_.lang_status=='en']\n", "\n", "        # drop duplicate books based on the combination of first author and book title\n", "        X_ = X_.drop_duplicates(['first_author', 'book_title'], keep='first').reset_index(drop=True)\n", "\n", "        return X_\n", "\n", "    def lang_detect(self, x):\n", "        \"\"\"\n", "        detects the book's language using fastText, based on the book's description and title\n", "        \"\"\"\n", "        try:\n", "            status = 
(self.lang_model.predict(x['book_desc'])[0][0] == '__label__en' or isinstance(x['book_desc'], float)) and self.lang_model.predict(x['book_title'])[0][0] == '__label__en'\n", "            if status:\n", "                return 'en'\n", "            return 'other'\n", "        except Exception:\n", "            # any error raised during prediction marks the book as 'other'\n", "            return 'other'\n", "\n", "    def set_author(self, x):\n", "        \"\"\"\n", "        returns the first author of the book\n", "        \"\"\"\n", "        return x['book_authors'].split('|')[0]\n", "\n", "    def set_description(self, rake, x):\n", "        \"\"\"\n", "        appends important keywords from the book's description (extracted with RAKE) to the keywords\n", "        \"\"\"\n", "        DESCRIPTION_KEYWORDS_COUNT = 2\n", "        if isinstance(x['book_desc'], str):\n", "            rake.extract_keywords_from_text(x['book_desc'])\n", "            key = rake.get_ranked_phrases()[:DESCRIPTION_KEYWORDS_COUNT]\n", "            return x['keywords']+' '+' '.join(key)\n", "        return x['keywords']\n", "\n", "    def set_genres(self, x):\n", "        \"\"\"\n", "        appends the book's genres to the keywords\n", "        \"\"\"\n", "        if isinstance(x['genres'], str):\n", "            return x['keywords'] + ' ' + ' '.join(x['genres'].split('|'))\n", "        return x['keywords']" ] }, { "cell_type": "markdown", "id": "dfab8e1b", "metadata": {}, "source": [ "## Creating the book recommendation system" ] }, { "cell_type": "markdown", "id": "585c5eb4", "metadata": {}, "source": [ "After getting each book's keywords, the next step is to build the recommendation system from the book dataset.\n", "- First, we create a TF-IDF matrix from the dataset's keywords using `TfidfVectorizer`. TF-IDF, or Term Frequency-Inverse Document Frequency, is a numerical statistic that reflects how important a word is to a document in a collection of documents.[*](https://en.wikipedia.org/wiki/Tf–idf)\n", "- Then, using the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between the books' TF-IDF vectors, we can return the books that are most similar to the book entered by the user." 
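, "\n", "\n", "The two steps above can be sketched on a toy corpus. The three keyword strings below are invented for illustration and do not come from the dataset. Because `TfidfVectorizer` L2-normalizes each row by default, the dot product returned by `linear_kernel` should match the cosine similarity:\n", "\n", "```python\n", "import numpy as np\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.metrics.pairwise import cosine_similarity, linear_kernel\n", "\n", "# toy keyword strings, invented for illustration only\n", "keywords = [\n", "    'AgathaChristie Mystery Crime Classics',\n", "    'AgathaChristie Mystery Thriller',\n", "    'DonnaTartt Fiction Contemporary',\n", "]\n", "\n", "tfidf = TfidfVectorizer().fit_transform(keywords)\n", "\n", "# rows are unit-length, so the plain dot product is already the cosine similarity\n", "sim = linear_kernel(tfidf, tfidf)\n", "assert np.allclose(sim, cosine_similarity(tfidf))\n", "\n", "# the first two strings share terms, so they score higher than the unrelated pair\n", "assert sim[0, 1] > sim[0, 2]\n", "```"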
] }, { "cell_type": "code", "execution_count": 5, "id": "071ef3db", "metadata": {}, "outputs": [], "source": [ "class BookRecommender():\n", "    def __init__(self, data):\n", "        self.data = data\n", "\n", "    def transform(self):\n", "        \"\"\"\n", "        transforms the raw dataset, builds the TF-IDF matrix from its keywords, and computes the cosine similarity between the TF-IDF vectors\n", "        \"\"\"\n", "        self.preprocessed()\n", "\n", "        print('preprocess completed. getting tfidf...')\n", "        vectorizer = TfidfVectorizer()\n", "        tfidf = vectorizer.fit_transform(self.processed_data['keywords'])\n", "\n", "        # TfidfVectorizer L2-normalizes each row by default, so the linear kernel equals the cosine similarity\n", "        print('getting cosine similarity...')\n", "        self.cosine_similarity = linear_kernel(tfidf, tfidf)\n", "\n", "        print('transformation completed.')\n", "\n", "    def preprocessed(self):\n", "        \"\"\"\n", "        preprocesses the dataset using KeywordsTransformer\n", "        \"\"\"\n", "        kt = KeywordsTransformer()\n", "        kt.fit(self.data)\n", "        self.processed_data = kt.transform(self.data)\n", "\n", "    def book_recommender(self, title):\n", "        \"\"\"\n", "        returns the books that are most similar to the title entered by the user\n", "        \"\"\"\n", "        try:\n", "            ids = self.processed_data.loc[self.processed_data.book_title.str.lower()==title.lower()].index[0]\n", "\n", "            BOOK_RECOMMENDATION_COUNT = 10\n", "\n", "            # get the ids of the most similar books; take one extra result\n", "            # because the entered book itself is expected to rank first\n", "            sim_scores = list(enumerate(self.cosine_similarity[ids]))\n", "            sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[0:BOOK_RECOMMENDATION_COUNT+1]\n", "\n", "            res_dic = {}\n", "            for x in sim_scores:\n", "                res_dic[x] = {'book_id': x[0],\n", "                              'book_title': self.processed_data.iloc[x[0]].book_title,\n", "                              'book_author': self.processed_data.iloc[x[0]].first_author,\n", "                              'similarity': x[1]}\n", "\n", "            return self.book_recommender_printer(sorted(res_dic.items(), key=lambda x: x[1]['similarity'], reverse=True))\n", "        except IndexError:\n", "            return 'Book is not found!'\n", "\n", "    def book_recommender_printer(self, books):\n", "        
\"\"\"\n", "        printing the recommended book results\n", "        \"\"\"\n", "        print('---------title input : ', books[0][1]['book_title'], ' by', books[0][1]['book_author'], '\\n')\n", "        print('---------recommendations: \\n')\n", "        for b in books[1:]:\n", "            print(b[1]['book_title'], 'by', b[1]['book_author'])\n", "            print('similarity: ', b[1]['similarity'])\n", "            print('\\n')" ] }, { "cell_type": "code", "execution_count": 6, "id": "61f40510", "metadata": {}, "outputs": [], "source": [ "# load the dataset\n", "df = pd.read_csv('data/book_data.csv')" ] }, { "cell_type": "code", "execution_count": 7, "id": "24f237c4", "metadata": {}, "outputs": [], "source": [ "# initiate the book recommender\n", "recommender = BookRecommender(df)" ] }, { "cell_type": "code", "execution_count": 8, "id": "8d3ad2a1", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "keywords transformer called...\n", "filtering English books only...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|███████████████████████████████████| 54301/54301 [00:09<00:00, 5520.73it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "getting first author...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|█████████████████████████████████| 54301/54301 [00:00<00:00, 111068.85it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "getting description...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|███████████████████████████████████| 54301/54301 [00:32<00:00, 1660.22it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "getting genre...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████████████████████████████| 54301/54301 [00:00<00:00, 66708.60it/s]\n" ] }, { "name": 
"stdout", "output_type": "stream", "text": [ "preprocess completed. getting tfidf...\n", "getting cosine similarity...\n", "transformation completed.\n" ] } ], "source": [ "# transform the data and create the book recommender\n", "recommender.transform()" ] }, { "cell_type": "markdown", "id": "e23f2203", "metadata": {}, "source": [ "## Testing the recommendation system" ] }, { "cell_type": "markdown", "id": "e34148c4", "metadata": {}, "source": [ "Let's test the recommendation system by entering one of my favorite books: **And Then There Were None by Agatha Christie!**" ] }, { "cell_type": "code", "execution_count": 9, "id": "c9d13083", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "---------title input : And Then There Were None by Agatha Christie \n", "\n", "---------recommendations: \n", "\n", "The Murder on the Links by Agatha Christie\n", "similarity: 0.44428498622376605\n", "\n", "\n", "Evil Under the Sun by Agatha Christie\n", "similarity: 0.35410649242110337\n", "\n", "\n", "N or M? by Agatha Christie\n", "similarity: 0.33615155083710596\n", "\n", "\n", "Sleeping Murder by Agatha Christie\n", "similarity: 0.33395712265493865\n", "\n", "\n", "The Mystery of the Blue Train by Agatha Christie\n", "similarity: 0.3297086837196586\n", "\n", "\n", "Witness for the Prosecution and Selected Plays by Agatha Christie\n", "similarity: 0.3160301751208531\n", "\n", "\n", "How Does Your Garden Grow? 
and Other Stories by Agatha Christie\n", "similarity: 0.3062958948772468\n", "\n", "\n", "Dead Simple by Peter James\n", "similarity: 0.3033607052474662\n", "\n", "\n", "Faithful Place by Tana French\n", "similarity: 0.302753760386371\n", "\n", "\n", "In The Dark by Brian Freeman\n", "similarity: 0.30251517321055743\n", "\n", "\n" ] } ], "source": [ "recommender.book_recommender('and then there were none')" ] }, { "cell_type": "markdown", "id": "ee80db7d", "metadata": {}, "source": [ "As expected, the recommendation system returns several books that are highly similar to the entered title, in this case And Then There Were None by Agatha Christie. Since I'm a big fan of hers, I have already read The Murder on the Links and N or M?, and both are pretty solid books. So I was very excited when the system gave me Evil Under the Sun, a title that I haven't read yet. It's already on my to-be-read list and I can't wait to read it!" ] }, { "cell_type": "markdown", "id": "c35c5a99", "metadata": {}, "source": [ "Next, I will enter a book that I'm currently reading: **The Secret History by Donna Tartt**" ] }, { "cell_type": "code", "execution_count": 10, "id": "f3cae190", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "---------title input : The Secret History by Donna Tartt \n", "\n", "---------recommendations: \n", "\n", "The Pugilist at Rest by Thom Jones\n", "similarity: 0.29270328313073773\n", "\n", "\n", "The Goldfinch by Donna Tartt\n", "similarity: 0.2666224989294841\n", "\n", "\n", "Losing It by Cora Carmack\n", "similarity: 0.2430829833151303\n", "\n", "\n", "The Lady That I Love by Crystal Linn\n", "similarity: 0.22847475855102853\n", "\n", "\n", "The Destiny of Violet & Luke by Jessica Sorensen\n", "similarity: 0.22359497790387248\n", "\n", "\n", "Sweet Girl by Sierra Hill\n", "similarity: 0.22280415744966814\n", "\n", "\n", "A Winter Haunting by Dan Simmons\n", "similarity: 0.2220611264340409\n", "\n", "\n", 
"The Little Friend by Donna Tartt\n", "similarity: 0.2127591170494097\n", "\n", "\n", "Elite by Rachel Van Dyken\n", "similarity: 0.21057750312655546\n", "\n", "\n", "Dark Matter by Blake Crouch\n", "similarity: 0.18968665820726552\n", "\n", "\n" ] } ], "source": [ "recommender.book_recommender('the secret history')" ] }, { "cell_type": "markdown", "id": "d7d978f8", "metadata": {}, "source": [ "I'm embarrassed to say that I've never heard of The Pugilist at Rest by Thom Jones. It's a short story collection, and I rarely read that sort of book, so that's kind of understandable. The Goldfinch by Donna Tartt is the only book that I recognize from this list, and it's already on my to-be-read list. But it's still very exciting to explore some new books and maybe find a new all-time favorite!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }