{"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.3"},"colab":{"name":"2022-01-26-social.ipynb","provenance":[{"file_id":"https://github.com/recohut/nbs/blob/main/raw/T354054%20%7C%20Predicting%20Future%20Friendships%20from%20Social%20Network%20data.ipynb","timestamp":1644673008974},{"file_id":"https://github.com/sparsh-ai/general-recsys/blob/T426474/Case_Study5/Section23.ipynb","timestamp":1637511334076}],"collapsed_sections":[]}},"cells":[{"cell_type":"markdown","metadata":{"id":"UR8sEDb_SEVh"},"source":["# Predicting Future Friendships from Social Network data\n","\n","## Predicting future friendships from social network data\n","\n","Welcome to FriendHook, Silicon Valley’s hottest new startup. FriendHook is a social networking app for college undergrads. To join, an undergrad must scan their college ID to prove their affiliation. After approval, undergrads can create a FriendHook profile, which lists their dorm name and scholastic interests. Once a profile is created, an undergrad can send *friend requests* to other students at their college. A student who receives a friend request can either approve or reject it. When a friend request is approved, the pair of students are officially *FriendHook friends*. Using their new digital connection, FriendHook friends can share photographs, collaborate on coursework, and keep each other up to date on the latest campus gossip.\n","\n","The FriendHook app is a hit. It’s utilized on hundreds of college campuses worldwide. The user base is growing, and so is the company. You are FriendHook’s first data science hire! 
Your first challenging task will be to work on FriendHook’s friend recommendation algorithm.\n","\n","Sometimes FriendHook users have trouble finding their real-life friends on the digital app. To facilitate more connections, the engineering team has implemented a simple friend-recommendation engine. Once a week, all users receive an email recommending a new friend who is not yet in their network. The users can ignore the email, or they can send a friend request. That request is then either accepted or rejected/ignored.\n","\n","Currently, the recommendation engine follows a simple algorithm called the *friend-of-a-friend recommendation algorithm*. The algorithm works like this. Suppose we want to recommend a new friend for student A. We pick a random student B who is already friends with student A. We then pick a random student C who is friends with student B but not student A. Student C is then selected as the recommended friend for student A."]},{"cell_type":"markdown","metadata":{"id":"rzhafgGjFh_8"},"source":["*[Figure: the friend-of-a-friend recommendation algorithm]*"]},{"cell_type":"markdown","metadata":{"id":"FeF_omkZFjwc"},"source":["*The friend-of-a-friend recommendation algorithm in action. Mary has two friends: Fred and Bob. One of these friends (Bob) is randomly selected. Bob has two additional friends: Marty and Alice. Neither Alice nor Marty are friends with Mary. A friend of a friend (Marty) is randomly selected. Mary receives an email suggesting that she should send a friend request to Marty.*"]},{"cell_type":"markdown","metadata":{"id":"4l0MExh7Fmxx"},"source":["Essentially, the algorithm assumes that a friend of your friend is also likely to be your friend. This assumption is reasonable but also a bit simplistic. How well does this assumption hold? Nobody knows! However, as the company’s first data scientist, it’s your job to find out. 
You have been tasked with building a model that predicts student behavior in response to the recommendation algorithm.\n","\n","Your task is to build a model that predicts user behavior based on user profiles and social network data. The model must generalize to other colleges and universities. This generalizability is very important—a model that cannot be utilized at other colleges is worthless to the product team. Consider, for example, a model that accurately predicts behavior in one or two of the dorms at the sampled university. In other words, it requires specific dorm names to make accurate predictions. Such a model is not useful because other universities will have different dormitory names. Ideally, the model should generalize to all dormitories across all universities worldwide.\n","\n","Once you’ve built the generalized model, you should explore its inner workings. Your goal is to gain insights into how university life facilitates new FriendHook connections.\n","\n","The project goals are ambitious but also very doable. You can complete them by carrying out the following tasks:\n","\n","1. Load the three datasets pertaining to user behavior, user profiles, and the user friendship network. Explore each dataset, and clean it as required.\n","2. Build and evaluate a model that predicts user behavior based on user profiles and established friendship connections. You can optionally split this task into two subtasks: build a model using just the friendship network, and then add the profile information and test whether this improves the model’s performance.\n","3. Determine whether the model generalizes well to other universities.\n","4. Explore the inner workings of the model to gain better insights into student behavior.\n","\n","Our data contains three files stored in a friendhook directory. These files are CSV tables and are named Profiles.csv, Observations.csv, and Friendships.csv. 
Let’s discuss each table individually.\n","\n","**The Profiles table**\n","\n","Profiles.csv contains profile information for all the students at the chosen university. This information is distributed across six columns: `Profile_ID`, `Sex`, `Relationship_Status`, `Major`, `Dorm`, and `Year`. Maintaining student privacy is very important to the FriendHook team, so all the profile information has been carefully encrypted. FriendHook’s encryption algorithm takes in descriptive text and returns a unique, scrambled 12-character code known as a *hash code*. Suppose, for example, that a student lists their major as physics. The word *physics* is then scrambled and replaced with a hash code such as *b90a1221d2bc*. If another student lists their major as art history, a different hash code is returned (for example, *983a9b1dc2ef*). In this manner, we can check whether two students share the same major without necessarily knowing the identity of that major. All six profile columns have been encrypted as a precautionary measure. Let’s discuss the separate columns in detail:\n","\n","- `Profile_ID`—A unique identifier used to track each student. The identifier can be linked to the user behaviors in the `Observations` table. It can also be linked to FriendHook connections in the `Friendships` table.\n","- `Sex`—This optional field describes the sex of a student as `Male` or `Female`. Students who don’t wish to specify a gender can leave the `Sex` field blank. Blank inputs are stored as empty values in the table.\n","- `Relationship_Status`—This optional field specifies the relationship status of the student. Each student has three relationship categories to choose from: `Single`, `In a Relationship`, or `It’s Complicated`. All students have a fourth option of leaving this field blank. Blank inputs are stored as empty values in the table.\n","- `Major`—The chosen area of study for the student, such as physics, history, economics, etc. 
This field is required to activate a FriendHook account. Students who have not yet picked their major can select `Undecided` from among the options.\n","- `Dorm`—The name of the dormitory where the student resides. This field is required to activate a FriendHook account. Students who reside in off-campus housing can select `Off-Campus Housing` from among the options.\n","- `Year`—The undergraduate student’s year. This field must be set to one of four options: `Freshman`, `Sophomore`, `Junior`, or `Senior`.\n","\n","**The Observations table**\n","\n","Observations.csv contains the observed user behavior in response to the emailed friend recommendation. It includes the following five fields:\n","\n","- `Profile_ID`—The ID of the user who received a friend recommendation. The ID corresponds to the profile ID in the `Profiles` table.\n","- `Selected_Friend`—An existing friend of the user in the `Profile_ID` column.\n","- `Selected_Friend_of_Friend`—A randomly chosen friend of `Selected_Friend` who is not yet a friend of `Profile_ID`. This random friend of a friend is emailed as a friend recommendation for the user.\n","- `Friend_Request_Sent`—A Boolean column that is `True` if a user sends a friend request to the suggested friend of a friend or `False` otherwise.\n","- `Friend_Request_Accepted`—A Boolean column that is `True` only if a user sends a friend request and that request is accepted.\n","\n","This table stores all the observed user behaviors in response to the weekly recommendation email. Our goal is to predict the Boolean outputs of the final two table columns based on the profile and social networking data.\n","\n","**The Friendships table**\n","\n","Friendships.csv contains the FriendHook friendship network corresponding to the selected university. This network was used as input into the friend-of-a-friend recommendation algorithm. The `Friendships` table has just two columns: `Friend_A` and `Friend_B`. 
These columns contain profile IDs that map to the `Profile_ID` columns of the `Profiles` and `Observations` tables. Each row corresponds to a pair of FriendHook friends. For instance, the first row contains IDs `b8bc075e54b9` and `49194b3720b6`. From these IDs, we can infer that the associated students have an established FriendHook connection. Using the IDs, we can look up the profile of each student. The profiles then allow us to explore whether the friends share the same major or reside together in the same dorm.\n","\n","To address the problem at hand, we need to know how to do the following:\n","\n","- Analyze network data using Python\n","- Discover friendship clusters in social networks\n","- Train and evaluate supervised machine learning models\n","- Probe the inner workings of trained models to draw insights from our data\n","\n","FriendHook is a popular social networking app designed for college campuses. Students can connect as friends in the FriendHook network. A recommendation engine emails users weekly with new friend suggestions based on their existing connections; students can ignore these recommendations, or they can send out friend requests. We have been provided with one week’s worth of data pertaining to friend recommendations and student responses. That data is stored in the friendhook/Observations.csv file. We’re provided with two additional files: friendhook/Profiles.csv and friendhook/Friendships.csv, containing user profile information and the friendship graph, respectively. The user profiles have been encrypted to protect student privacy. Our goal is to build a model that predicts user behavior in response to the friend recommendations. We will do so by following these steps:\n","\n","1. Load the three datasets containing the observations, user profiles, and friendship connections.\n","2. Train and evaluate a supervised model that predicts behavior based on network features and profile features. 
We can optionally split this task into two subtasks: training a model using network features, and then adding profile features and evaluating the shift in model performance.\n","3. Check to ensure that the model generalizes well to other universities.\n","4. Explore the inner workings of our model to gain better insights into student behavior."]},{"cell_type":"code","metadata":{"id":"5Qv4ngnqR5sf"},"source":["import warnings\n","warnings.filterwarnings('ignore')"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"Z9S5jC3VSibR","executionInfo":{"status":"ok","timestamp":1637511500127,"user_tz":-330,"elapsed":1688,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"f3473e09-508e-465f-852a-3e649822333a"},"source":["!wget -q --show-progress https://github.com/sparsh-ai/general-recsys/raw/T426474/Case_Study5/friendhook/Friendships.csv\n","!wget -q --show-progress https://github.com/sparsh-ai/general-recsys/raw/T426474/Case_Study5/friendhook/Observations.csv\n","!wget -q --show-progress https://github.com/sparsh-ai/general-recsys/raw/T426474/Case_Study5/friendhook/Profiles.csv"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["Friendships.csv 100%[===================>] 2.19M --.-KB/s in 0.02s \n","Observations.csv 100%[===================>] 196.39K --.-KB/s in 0.004s \n","Profiles.csv 100%[===================>] 302.93K --.-KB/s in 0.004s \n"]}]},{"cell_type":"markdown","metadata":{"id":"Rq1BGCEHR5sn"},"source":["## Exploring the Data\n","\n","Let's separately explore the _Profiles_, _Observations_ and _Friendships_ tables.\n","\n","### Examining the Profiles\n","We'll start by loading the _Profiles_ table into Pandas and summarizing the table's contents.\n","\n","**Loading the Profiles 
table**"]},{"cell_type":"code","metadata":{"id":"lGpfo2Q1R5sq","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511510773,"user_tz":-330,"elapsed":452,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"38efdd5e-26b5-4a32-d84e-27ded04a8d66"},"source":["import pandas as pd\n","\n","def summarize_table(df):\n","    # Print the table shape, followed by a describe() summary of its columns\n","    n_rows, n_columns = df.shape\n","    summary = df.describe()\n","    print(f\"The table contains {n_rows} rows and {n_columns} columns.\")\n","    print(\"Table Summary:\\n\")\n","    print(summary.to_string())\n","\n","df_profile = pd.read_csv('Profiles.csv')\n","summarize_table(df_profile)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["The table contains 4039 rows and 6 columns.\n","Table Summary:\n","\n"," Profile_ID Sex Relationship_Status Dorm Major Year\n","count 4039 4039 3631 4039 4039 4039\n","unique 4039 2 3 15 30 4\n","top 03b574e6cef3 e807eb960650 ac0b88e46e20 a8e6e404d1b3 141d4cdd5aaf c1a648750a4b\n","freq 1 2020 1963 2739 1366 1796\n"]}]},{"cell_type":"markdown","metadata":{"id":"UGwQZxlNR5ss"},"source":["There is one place in the table summary where the numbers are off: the _Relationship Status_ column. Pandas has detected three _Relationship Status_ categories across 3631 of 4039 table rows. The remaining 400 or so rows are null. They don't contain any assigned relationship status. 
Let's count the total number of empty rows.\n","\n","**Counting empty Relationship Status profiles**"]},{"cell_type":"code","metadata":{"id":"p9k-ODzhR5st","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511517033,"user_tz":-330,"elapsed":421,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"e120f3f6-2bc2-493b-c500-aa47f8c924ae"},"source":["is_null = df_profile.Relationship_Status.isnull()\n","num_null = df_profile[is_null].shape[0]\n","print(f\"{num_null} profiles are missing the Relationship Status field.\")"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["408 profiles are missing the Relationship Status field.\n"]}]},{"cell_type":"markdown","metadata":{"id":"hZyWW76fR5su"},"source":["408 profiles do not contain a listed relationship status. We can treat the lack of status as a fourth _unspecified_ relationship status category. Hence, we should assign these rows a category id. What id value should we choose? 
Before we answer the question, let's examine all unique ids within the _Relationship Status_ column.\n","\n","**Checking unique Relationship Status values**"]},{"cell_type":"code","metadata":{"id":"Sdlzm8CRR5sv","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511519520,"user_tz":-330,"elapsed":7,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"c6bc7dcf-5cab-4beb-f2cc-b26bed9b70c1"},"source":["unique_ids = set(df_profile.Relationship_Status.values)\n","print(unique_ids)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["{nan, '188f9a32c360', 'ac0b88e46e20', '9cea719429e9'}\n"]}]},{"cell_type":"markdown","metadata":{"id":"zS5YSLPwR5sx"},"source":["The Scikit-Learn library cannot process hash-codes or null values directly. Its models expect numeric input. Hence, we'll eventually need to convert the categories to numeric values. 
\n","\n","**Mapping Relationship Status values to numbers**"]},{"cell_type":"code","metadata":{"id":"5a7Y6WJvR5sy"},"source":["import numpy as np\n","category_map = {'9cea719429e9': 0, np.nan: 1, '188f9a32c360': 2, \n"," 'ac0b88e46e20': 3}"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"iluKRgqkR5s0"},"source":["Next, we'll replace the contents of the _Relationship Status_ column with the appropriate numeric values.\n","\n","**Updating the Relationship Status column**"]},{"cell_type":"code","metadata":{"id":"9x0FZu3zR5s0","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511558269,"user_tz":-330,"elapsed":10,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"2bd5bd0a-de9b-4f18-a912-c4f37bf4a326"},"source":["nums = [category_map[hash_code] \n"," for hash_code in df_profile.Relationship_Status.values]\n","df_profile['Relationship_Status'] = nums\n","print(df_profile.Relationship_Status)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["0 0\n","1 3\n","2 3\n","3 3\n","4 0\n"," ..\n","4034 3\n","4035 0\n","4036 3\n","4037 3\n","4038 0\n","Name: Relationship_Status, Length: 4039, dtype: int64\n"]}]},{"cell_type":"markdown","metadata":{"id":"zStc1B59R5s1"},"source":["We've transformed _Relationship Status_ into a numeric variable. However, the remaining five columns in the table still contain hash-codes. Let's create a category mapping between hash-codes and numbers in each column. We'll track the category mappings in each column with a `col_to_mapping` dictionary. 
We'll also use the mappings to replace all hash-codes with numbers in `df_profile`.\n","\n","**Replacing all Profile hash-codes with numeric values**"]},{"cell_type":"code","metadata":{"id":"yDm83aBZR5s1","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511561919,"user_tz":-330,"elapsed":433,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"110fa181-5c97-4182-bdef-6dadf77ce0dd"},"source":["col_to_mapping = {'Relationship_Status': category_map}\n","\n","for column in df_profile.columns:\n","    # Relationship_Status has already been mapped to numbers\n","    if column in col_to_mapping:\n","        continue\n","\n","    unique_ids = sorted(set(df_profile[column].values))\n","    category_map = {id_: i for i, id_ in enumerate(unique_ids)}\n","    col_to_mapping[column] = category_map\n","    nums = [category_map[hash_code]\n","            for hash_code in df_profile[column].values]\n","    df_profile[column] = nums\n","\n","head = df_profile.head()\n","print(head.to_string(index=False))"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":[" Profile_ID Sex Relationship_Status Dorm Major Year\n"," 2899 0 0 5 13 2\n"," 1125 0 3 12 6 1\n"," 3799 0 3 12 29 2\n"," 3338 0 3 4 25 0\n"," 2007 1 0 12 2 0\n"]}]},{"cell_type":"markdown","metadata":{"id":"s-ToK8jUR5s3"},"source":["We've finished tweaking `df_profile`. 
Now let's turn our attention to the table of experimental observations.\n","\n","### Exploring the Experimental Observations\n","We'll start by loading the _Observations_ table into Pandas and summarizing the table's contents.\n","\n","**Loading the Observations table**"]},{"cell_type":"code","metadata":{"id":"IKHOylu-R5s3","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511580047,"user_tz":-330,"elapsed":425,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"3205f7cc-6adb-483e-be43-1759cf47d799"},"source":["df_obs = pd.read_csv('Observations.csv')\n","summarize_table(df_obs)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["The table contains 4039 rows and 5 columns.\n","Table Summary:\n","\n"," Profile_ID Selected_Friend Selected_Friend_of_Friend Friend_Request_Sent Friend_Request_Accepted\n","count 4039 4039 4039 4039 4039\n","unique 4039 2219 2327 2 2\n","top 43928bbe27fd 89581f99fa1e 6caa597f13cc True True\n","freq 1 77 27 2519 2460\n"]}]},{"cell_type":"markdown","metadata":{"id":"SL0z-jZyR5s4"},"source":["The five table columns all consistently show 4039 filled rows. There are no empty values in the table. This is good. However, the actual column names are hard to read. The names are very descriptive, but also very long. We should shorten some of the names in order to ease our cognitive load. 
\n","\n","**Renaming the observation columns**"]},{"cell_type":"code","metadata":{"id":"0QykAd_YR5s5","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511582221,"user_tz":-330,"elapsed":9,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"53fe4b1b-f161-4284-d5a5-57517a8430f2"},"source":["new_names = {'Selected_Friend': 'Friend',\n","             'Selected_Friend_of_Friend': 'FoF',\n","             'Friend_Request_Sent': 'Sent',\n","             'Friend_Request_Accepted': 'Accepted'}\n","df_obs = df_obs.rename(columns=new_names)\n","summarize_table(df_obs)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["The table contains 4039 rows and 5 columns.\n","Table Summary:\n","\n"," Profile_ID Friend FoF Sent Accepted\n","count 4039 4039 4039 4039 4039\n","unique 4039 2219 2327 2 2\n","top 43928bbe27fd 89581f99fa1e 6caa597f13cc True True\n","freq 1 77 27 2519 2460\n"]}]},{"cell_type":"markdown","metadata":{"id":"Ou5Xqi_aR5s5"},"source":["Approximately 62% (2519) of the friend suggestions led to a friend request being sent. This is very promising; the friend-of-a-friend suggestions are quite effective. Furthermore, approximately 61% (2460) of sampled instances led to a friend request getting accepted. Hence, sent friend requests were ignored or rejected in just 59 cases (2519 - 2460 = 59), which is roughly 2% of the requests sent. Of course, our numbers assume that there are no observations where _Sent_ is False and _Accepted_ is True. This scenario is not possible, because a friend-request cannot be accepted if it has not yet been sent. 
Still, as a sanity check, let's test the integrity of the data by confirming that the scenario does not take place.\n","\n","**Ensuring that Sent is `True` for all accepted requests**"]},{"cell_type":"code","metadata":{"id":"6lbK8y0-R5s6"},"source":["condition = (df_obs.Sent == False) & (df_obs.Accepted == True)\n","assert not df_obs[condition].shape[0]"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"p8brARl6R5s6"},"source":["Based on our observations, user behavior follows three possible scenarios: no friend request is sent, a request is sent and accepted, or a request is sent but then rejected or ignored. Hence, we can encode this categorical behavior by assigning numbers `0`, `1`, and `2` to the behavior patterns.\n","\n","**Assigning classes of behavior to the user observations**"]},{"cell_type":"code","metadata":{"id":"Rm1IbPpDR5s7"},"source":["behaviors = []\n","for sent, accepted in df_obs[['Sent', 'Accepted']].values:\n","    # 0: no request sent, 1: sent and accepted, 2: sent but not accepted\n","    behavior = 2 if (sent and not accepted) else int(sent) * int(accepted)\n","    behaviors.append(behavior)\n","df_obs['Behavior'] = behaviors"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"uEqNg7SqR5s7"},"source":["Additionally, we must transform the profile ids in the first three columns from hash-codes to numeric ids that are consistent with `df_profile.Profile_ID`. 
\n","\n","**Replacing all Observation hash-codes with numeric values**"]},{"cell_type":"code","metadata":{"id":"nnVtphvHR5s7","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511585199,"user_tz":-330,"elapsed":6,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"0c9abbe6-fe38-421c-839a-7ecced2909b4"},"source":["for col in ['Profile_ID', 'Friend', 'FoF']:\n","    nums = [col_to_mapping['Profile_ID'][hash_code]\n","            for hash_code in df_obs[col]]\n","    df_obs[col] = nums\n","\n","head = df_obs.head()\n","print(head.to_string(index=False))"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":[" Profile_ID Friend FoF Sent Accepted Behavior\n"," 2485 2899 2847 False False 0\n"," 2690 2899 3528 False False 0\n"," 3904 2899 3528 False False 0\n"," 709 2899 3403 False False 0\n"," 502 2899 345 True True 1\n"]}]},{"cell_type":"markdown","metadata":{"id":"RKYhXx1qR5s8"},"source":["`df_obs` now aligns with `df_profile`. Only a single data table remains unanalyzed. 
Let's proceed to explore the friendship linkages within the remaining _Friendships_ table.\n","\n","\n","### Exploring the Friendships Linkage Table\n","We'll start by loading the _Friendships_ table into Pandas and summarizing the table's contents.\n","\n","**Loading the Friendships table**"]},{"cell_type":"code","metadata":{"id":"6rfPDqBNR5s8","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511594631,"user_tz":-330,"elapsed":581,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"680d7d2b-c617-457e-8368-5eed9a2709ae"},"source":["df_friends = pd.read_csv('Friendships.csv')\n","summarize_table(df_friends)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["The table contains 88234 rows and 2 columns.\n","Table Summary:\n","\n"," Friend_A Friend_B\n","count 88234 88234\n","unique 3646 4037\n","top 89581f99fa1e 97ba93d9b169\n","freq 1043 251\n"]}]},{"cell_type":"markdown","metadata":{"id":"CQZ8eJ5wR5s9"},"source":["In order to carry out a more detailed analysis, we should load the friendship data into a NetworkX graph. 
\n","\n","**Loading the social graph into NetworkX**"]},{"cell_type":"code","metadata":{"id":"mGF4ZMpsR5s9","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511595734,"user_tz":-330,"elapsed":620,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"8a4b0ce4-457c-44cf-b442-8660b405842c"},"source":["import networkx as nx\n","G = nx.Graph()\n","for id1, id2 in df_friends.values:\n","    node1 = col_to_mapping['Profile_ID'][id1]\n","    node2 = col_to_mapping['Profile_ID'][id2]\n","    G.add_edge(node1, node2)\n","\n","nodes = list(G.nodes)\n","num_nodes = len(nodes)\n","print(f\"The social graph contains {num_nodes} nodes.\")"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["The social graph contains 4039 nodes.\n"]}]},{"cell_type":"markdown","metadata":{"id":"0kLVaYBsR5s9"},"source":["Let's try to gain more insights into the graph structure by visualizing it with `nx.draw`. Please note the graph is rather large, so the visualization might take a minute or so to render.\n","\n","**Visualizing the social graph**"]},{"cell_type":"code","metadata":{"id":"xmk3JhjQR5s9","colab":{"base_uri":"https://localhost:8080/","height":319},"executionInfo":{"status":"ok","timestamp":1637511667288,"user_tz":-330,"elapsed":69058,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"b0f7ab28-f207-4190-bc50-02a56d76bda5"},"source":["import matplotlib.pyplot as plt\n","import numpy as np\n","np.random.seed(0)\n","nx.draw(G, node_size=5)\n","plt.show()"],"execution_count":null,"outputs":[{"output_type":"display_data","data":{"image/png":"\n","text/plain":["
"]},"metadata":{}}]},{"cell_type":"markdown","metadata":{"id":"DA4vLks5R5s-"},"source":["Tightly clustered social groups are clearly visible within the network. Let's extract these groups using Markov clustering. \n","\n","**Finding social groups using Markov clustering**"]},{"cell_type":"code","metadata":{"id":"szZqOhZZTLzI"},"source":["!pip install -q markov_clustering"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"j2lf-aypR5s-","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511675632,"user_tz":-330,"elapsed":2648,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"47952b39-0748-45b1-d21a-d6a2e80d9cd3"},"source":["import markov_clustering as mc\n","# Note: nx.to_scipy_sparse_matrix requires NetworkX < 3.0\n","matrix = nx.to_scipy_sparse_matrix(G)\n","result = mc.run_mcl(matrix)\n","clusters = mc.get_clusters(result)\n","num_clusters = len(clusters)\n","print(f\"{num_clusters} clusters were found in the social graph.\")"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["10 clusters were found in the social graph.\n"]}]},{"cell_type":"markdown","metadata":{"id":"Aaz8KRiLR5s-"},"source":["Ten clusters were found in the social graph. Let's visualize these clusters by coloring each node based on cluster id. 
To start, we'll need to iterate over `clusters` and assign a `cluster_id` attribute to every node.\n","\n","**Assigning cluster attributes to nodes**"]},{"cell_type":"code","metadata":{"id":"jKBZmK0CR5s-"},"source":["for cluster_id, node_indices in enumerate(clusters):\n","    for i in node_indices:\n","        # Matrix index i maps back to the node at position i in the node list\n","        node = nodes[i]\n","        G.nodes[node]['cluster_id'] = cluster_id"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"aSXRgvFAR5s_"},"source":["Next, we'll color the nodes based on their attribute assignment.\n","\n","**Coloring the nodes by cluster assignment**"]},{"cell_type":"code","metadata":{"id":"5ahEfgRFR5s_","colab":{"base_uri":"https://localhost:8080/","height":319},"executionInfo":{"status":"ok","timestamp":1637511748316,"user_tz":-330,"elapsed":69108,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"6a083d3c-198b-4b25-910a-00c8e6fe660d"},"source":["np.random.seed(0)\n","colors = [G.nodes[n]['cluster_id'] for n in G.nodes]\n","nx.draw(G, node_size=5, node_color=colors, cmap=plt.cm.tab20)\n","plt.show()"],"execution_count":null,"outputs":[{"output_type":"display_data","data":{"image/png":"\n","text/plain":["
"]},"metadata":{}}]},{"cell_type":"markdown","metadata":{"id":"ezbAq8OSR5tA"},"source":["The cluster colors clearly correspond to tight social groups. Our clustering thus has been effective. Hence, the assigned `cluster_id` attributes should prove useful during the model-building process. In this same manner, it might be useful to store all five profile features as attributes within the student nodes. \n","\n","**Assigning profile attributes to nodes**"]},{"cell_type":"code","metadata":{"id":"s5x-wrTzR5tA","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511749092,"user_tz":-330,"elapsed":783,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"dcdfe7f4-7a28-4626-97e0-949bca1ba0f2"},"source":["attribute_names = df_profile.columns\n","for attributes in df_profile.values:\n","    profile_id = attributes[0]\n","    for name, att in zip(attribute_names[1:], attributes[1:]):\n","        G.nodes[profile_id][name] = att\n","\n","first_node = nodes[0]\n","print(f\"Attributes of node {first_node}:\")\n","print(G.nodes[first_node])"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["Attributes of node 2899:\n","{'cluster_id': 0, 'Sex': 0, 'Relationship_Status': 0, 'Dorm': 5, 'Major': 13, 'Year': 2}\n"]}]},{"cell_type":"markdown","metadata":{"id":"hz2gpSS8R5tB"},"source":["We have finished exploring our input data. Now, we'll proceed to train a model that predicts user behavior. We'll start by constructing a simple model that only utilizes network features.\n","\n","## Training a Predictive Model Using Network Features\n","\n","Our goal is to train a supervised ML model on our dataset, in order to predict user behavior. Currently, all possible classes of behavior are stored within the _Behavior_ column of `df_obs`. 
We'll set our training class-label array equal to the `df_obs.Behavior` column.\n","\n","**Assigning the class-label array `y`**"]},{"cell_type":"code","metadata":{"id":"RpMLjsHwR5tB","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511749093,"user_tz":-330,"elapsed":44,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"c7b9d27d-30a4-496b-b193-76f2d70b6ea8"},"source":["y = df_obs.Behavior.values\n","print(y)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["[0 0 0 ... 1 1 1]\n"]}]},{"cell_type":"markdown","metadata":{"id":"A1uWDrQHR5tC"},"source":["Now that we have class-labels, we'll need to create a feature-matrix `X`. Let's create an initial version of `X`, and populate it with some very basic features. The simplest question we can ask about a person within a social network is this: how many friends does the person have? Of course, that value equals the edge count associated with the person's node within the social graph. Let's make the edge count our first feature in the feature-matrix. We'll iterate over all the rows in `df_obs` and assign an edge count to each profile that's referenced within each row. \n","\n","**Creating a feature-matrix from edge counts**"]},{"cell_type":"code","metadata":{"id":"S5mqLCFyR5tD"},"source":["cols = ['Profile_ID', 'Friend', 'FoF']\n","features = {f'{col}_Edge_Count': [] for col in cols}\n","for node_ids in df_obs[cols].values:\n","    for node, feature_name in zip(node_ids, features.keys()):\n","        degree = G.degree(node)\n","        features[feature_name].append(degree)\n","\n","df_features = pd.DataFrame(features)\n","X = df_features.values"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"g29SJDPtR5tD"},"source":["Let's quickly check the quality of the signal in the training data by training and testing a simple model. 
We have multiple possible models to choose from. One sensible choice is a decision tree classifier. Decision trees can handle non-linear decision boundaries and are easily interpretable. \n","\n","**Training and evaluating a decision tree classifier**"]},{"cell_type":"code","metadata":{"id":"u8g5qI7qR5tD","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511922904,"user_tz":-330,"elapsed":4,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"2d41bdd3-2e1e-4d54-904a-531788a64ed9"},"source":["np.random.seed(0)\n","from sklearn.tree import DecisionTreeClassifier\n","from sklearn.model_selection import train_test_split\n","from sklearn.metrics import f1_score\n","\n","def evaluate(X, y, model_type=DecisionTreeClassifier, **kwargs):\n","    np.random.seed(0)\n","    X_train, X_test, y_train, y_test = train_test_split(X, y)\n","    clf = model_type(**kwargs)\n","    clf.fit(X_train, y_train)\n","    pred = clf.predict(X_test)\n","    # f1_score expects the true labels first, then the predictions.\n","    f_measure = f1_score(y_test, pred, average='macro')\n","    return f_measure, clf\n","\n","f_measure, clf = evaluate(X, y)\n","print(f\"The f-measure is {f_measure:0.2f}\")"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["The f-measure is 0.37\n"]}]},{"cell_type":"markdown","metadata":{"id":"U8UWaIfIR5tE"},"source":["Our f-measure is terrible! Clearly, friend count by itself is not a sufficient signal for predicting user behavior. Perhaps adding PageRank values to our training set will yield improved results? 
\n","\n","**Adding PageRank features**"]},{"cell_type":"code","metadata":{"id":"LueVuNuUR5tE","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511925750,"user_tz":-330,"elapsed":1074,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"0bb4aa49-a632-4b5d-f096-43e8219bfd0d"},"source":["np.random.seed(0)\n","node_to_pagerank = nx.pagerank(G)\n","features = {f'{col}_PageRank': [] for col in cols}\n","for node_ids in df_obs[cols].values:\n","    for node, feature_name in zip(node_ids, features.keys()):\n","        pagerank = node_to_pagerank[node]\n","        features[feature_name].append(pagerank)\n","\n","def update_features(new_features):\n","    for feature_name, values in new_features.items():\n","        df_features[feature_name] = values\n","    return df_features.values\n","\n","X = update_features(features)\n","f_measure, clf = evaluate(X, y)\n","\n","print(f\"The f-measure is {f_measure:0.2f}\")"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["The f-measure is 0.38\n"]}]},{"cell_type":"markdown","metadata":{"id":"miyHKkVmR5tF"},"source":["The f-measure remains approximately the same. Basic centrality measures are insufficient. We need to expand `X` to include the social groups uncovered by Markov Clustering. One approach is to consider the following binary question: are two people in the same social group? Yes or no? If they are, then perhaps they are more likely to eventually become friends on FriendHook. We can make this binary comparison between each pair of profile ids within a single row of observations. 
\n","\n","**Adding social group features**"]},{"cell_type":"code","metadata":{"id":"Z5sHBEI7R5tF","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511926180,"user_tz":-330,"elapsed":6,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"ac73b828-4108-4aac-c1c4-2ab6dc25591b"},"source":["np.random.seed(0)\n","features = {f'Shared_Cluster_{e}': []\n","            for e in ['id_f', 'id_fof', 'f_fof']}\n","\n","for node_ids in df_obs[cols].values:\n","    c_id, c_f, c_fof = [G.nodes[n]['cluster_id']\n","                        for n in node_ids]\n","    features['Shared_Cluster_id_f'].append(int(c_id == c_f))\n","    features['Shared_Cluster_id_fof'].append(int(c_id == c_fof))\n","    features['Shared_Cluster_f_fof'].append(int(c_f == c_fof))\n","\n","X = update_features(features)\n","f_measure, clf = evaluate(X, y)\n","print(f\"The f-measure is {f_measure:0.2f}\")"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["The f-measure is 0.43\n"]}]},{"cell_type":"markdown","metadata":{"id":"Fbp2DmCJR5tG"},"source":["Our f-measure has noticeably improved, from 0.38 to 0.43. Performance is still poor. Nonetheless, including the social groups has led to a slight enhancement of our model. How important are the new social group features relative to the model's current performance? 
We can check using the `feature_importances_` attribute of our trained classifier.\n","\n","**Ranking features by their importance score**"]},{"cell_type":"code","metadata":{"id":"8HzVCA25R5tG","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511927730,"user_tz":-330,"elapsed":8,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"ac0094ae-c96c-47e3-ddd5-b9be90dbbd5b"},"source":["def view_top_features(clf, feature_names):\n","    for i in np.argsort(clf.feature_importances_)[::-1]:\n","        feature_name = feature_names[i]\n","        importance = clf.feature_importances_[i]\n","        if not round(importance, 2):\n","            break\n","\n","        print(f\"{feature_name}: {importance:0.2f}\")\n","\n","feature_names = df_features.columns\n","view_top_features(clf, feature_names)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["Shared_Cluster_id_fof: 0.18\n","FoF_PageRank: 0.17\n","Profile_ID_PageRank: 0.17\n","Friend_PageRank: 0.15\n","FoF_Edge_Count: 0.12\n","Profile_ID_Edge_Count: 0.11\n","Friend_Edge_Count: 0.10\n"]}]},{"cell_type":"markdown","metadata":{"id":"Tketm2ZTR5tH"},"source":["Social graph centrality plays some role in friendship determination. Of course, our model's performance is still poor, so we should be cautious with our inferences on how the features drive predictions. What other graph-based features could we utilize? Perhaps the network cluster size can impact the predictions? We can find out. However, we should be cautious in our efforts to keep our model generalizable. Cluster size can inadvertently act as a stand-in for cluster id, making the model very specific to this university. 
\n","\n","**Adding cluster-size features**"]},{"cell_type":"code","metadata":{"id":"23WZ_Zr5R5tH","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511937961,"user_tz":-330,"elapsed":447,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"214acf3c-2834-4061-bf72-4673500cbd8c"},"source":["np.random.seed(0)\n","cluster_sizes = [len(cluster) for cluster in clusters]\n","features = {f'{col}_Cluster_Size': [] for col in cols}\n","for node_ids in df_obs[cols].values:\n","    for node, feature_name in zip(node_ids, features.keys()):\n","        c_id = G.nodes[node]['cluster_id']\n","        features[feature_name].append(cluster_sizes[c_id])\n","\n","X = update_features(features)\n","f_measure, clf = evaluate(X, y)\n","print(f\"The f-measure is {f_measure:0.2f}\")"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["The f-measure is 0.43\n"]}]},{"cell_type":"markdown","metadata":{"id":"_8adpXgZR5tH"},"source":["The cluster size did not improve the model. As a precaution, let's delete it from our feature set.\n","\n","**Deleting cluster-size features**"]},{"cell_type":"code","metadata":{"id":"iKZyxJjIR5tI"},"source":["import re\n","def delete_features(df_features, regex=r'Cluster_Size'):\n","    df_features.drop(columns=[name for name in df_features.columns\n","                              if re.search(regex, name)], inplace=True)\n","    return df_features.values\n","\n","X = delete_features(df_features)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"SuB6EnwoR5tI"},"source":["The f-measure remains at 0.43. Perhaps we should try thinking outside the box. In what ways can social connections drive real-world behavior? Consider the following scenario. Suppose we analyze a student named Alex, whose node id in network `G` is `n`. Alex has 50 FriendHook friends. These are accessible through `G[n]`. 
We randomly sample two of the friends in `G[n]`. Their node ids are `a` and `b`. We then check if `a` and `b` are friends. They are! It seems that `a` is in `list(G[b])`. We then repeat this 100 times. In 95% of sampled instances, `a` is a friend of `b`. Basically, there's a 95% likelihood that any pair of Alex's friends are also friends with each other. We'll refer to this probability as the **friend-sharing likelihood**. Let's try incorporating this likelihood into our features. We'll start by computing the likelihood for every node in `G`. \n","\n","**Computing friend-sharing likelihoods**"]},{"cell_type":"code","metadata":{"id":"6sddUPr3R5tI"},"source":["friend_sharing_likelihood = {}\n","for node in nodes:\n","    neighbors = list(G[node])\n","    friendship_count = 0\n","    total_possible = 0\n","    for i, node1 in enumerate(neighbors[:-1]):\n","        for node2 in neighbors[i + 1:]:\n","            if node1 in G[node2]:\n","                friendship_count += 1\n","\n","            total_possible += 1\n","\n","    prob = friendship_count / total_possible if total_possible else 0\n","    friend_sharing_likelihood[node] = prob"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"_pf4DSThR5tJ"},"source":["Next, we'll generate a friend-sharing likelihood feature for each of our three profile ids. 
After adding the features, we'll re-evaluate the trained model's performance.\n","\n","**Adding friend-sharing likelihood features**"]},{"cell_type":"code","metadata":{"id":"E0CaSuAkR5tJ","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511959812,"user_tz":-330,"elapsed":31,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"e96b9192-271f-4fb2-a0bf-4305bfcaffd1"},"source":["np.random.seed(0)\n","features = {f'{col}_Friend_Sharing_Likelihood': [] for col in cols}\n","for node_ids in df_obs[cols].values:\n","    for node, feature_name in zip(node_ids, features.keys()):\n","        sharing_likelihood = friend_sharing_likelihood[node]\n","        features[feature_name].append(sharing_likelihood)\n","\n","X = update_features(features)\n","f_measure, clf = evaluate(X, y)\n","print(f\"The f-measure is {f_measure:0.2f}\")"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["The f-measure is 0.49\n"]}]},{"cell_type":"markdown","metadata":{"id":"lzSCCbP-R5tK"},"source":["Performance has increased from 0.43 to 0.49! It's still not great, but it's progressively getting better. How does the friend-sharing likelihood compare to other features in the model? 
Let's find out!\n","\n","**Ranking features by their importance score**"]},{"cell_type":"code","metadata":{"id":"xQ0MxIKWR5tK","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511959813,"user_tz":-330,"elapsed":25,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"93ceb01c-1fbe-414b-9e5f-274a0f13f7ef"},"source":["feature_names = df_features.columns\n","view_top_features(clf, feature_names)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["Shared_Cluster_id_fof: 0.18\n","Friend_Friend_Sharing_Likelihood: 0.13\n","FoF_PageRank: 0.11\n","Profile_ID_PageRank: 0.11\n","Profile_ID_Friend_Sharing_Likelihood: 0.10\n","FoF_Friend_Sharing_Likelihood: 0.10\n","FoF_Edge_Count: 0.08\n","Friend_PageRank: 0.07\n","Profile_ID_Edge_Count: 0.07\n","Friend_Edge_Count: 0.06\n"]}]},{"cell_type":"markdown","metadata":{"id":"IYdi9vJ7R5tK"},"source":["One of our new features ranks quite highly. Nonetheless, the model is incomplete. An f-measure of 0.49 is not acceptable. We need to incorporate features from the profiles stored within `df_profile`.\n","\n","## Adding Profile Features to the Model\n","\n","\n","Our aim is to incorporate the profile attributes of _Sex_, _Relationship_Status_, _Major_, _Dorm_ and _Year_ into our feature matrix. Based on our experience with the network data, there are three ways in which we can do this:\n","1. Exact Value Extraction:\n","    * We can store the exact value of the profile feature associated with each of the three profile-id columns in `df_obs`.\n","\n","2. Equivalence Comparison:\n","    * Given a profile attribute, we can carry out a pairwise comparison of the attribute across all three profile-id columns in `df_obs`. For each comparison we would return a boolean feature indicating whether the attribute is equal within the two columns.\n","\n","3. 
Size:\n","    * Given a profile attribute, we can return the number of profiles that share that attribute.\n","\n","\n","Let's apply Exact Value Extraction to _Sex_, _Relationship_Status_ and _Year_. \n","\n","**Adding exact-value profile features**"]},{"cell_type":"code","metadata":{"id":"7Z1N3ya7R5tL","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511961540,"user_tz":-330,"elapsed":5,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"34098105-1dde-4f33-ac2a-9330be9558cb"},"source":["attributes = ['Sex', 'Relationship_Status', 'Year']\n","for attribute in attributes:\n","    features = {f'{col}_{attribute}_Value': [] for col in cols}\n","    for node_ids in df_obs[cols].values:\n","        for node, feature_name in zip(node_ids, features.keys()):\n","            att_value = G.nodes[node][attribute]\n","            features[feature_name].append(att_value)\n","\n","    X = update_features(features)\n","\n","f_measure, clf = evaluate(X, y)\n","print(f\"The f-measure is {f_measure:0.2f}\")"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["The f-measure is 0.74\n"]}]},{"cell_type":"markdown","metadata":{"id":"1DyXSFThR5tL"},"source":["Wow! The f-measure dramatically increased from 0.49 to 0.74! The profile features have provided a very valuable signal, but we can still do better. We need to incorporate information from the _Major_ and _Dorm_ attributes. 
Equivalence Comparison is an excellent way to do this.\n","\n","**Adding equivalence-comparison profile features**"]},{"cell_type":"code","metadata":{"id":"P33DntlfR5tL","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511962707,"user_tz":-330,"elapsed":700,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"6cd41d91-8838-4457-9d67-332d44420ba0"},"source":["attributes = ['Major', 'Dorm']\n","for attribute in attributes:\n","    features = {f'Shared_{attribute}_{e}': []\n","                for e in ['id_f', 'id_fof', 'f_fof']}\n","    for node_ids in df_obs[cols].values:\n","        att_id, att_f, att_fof = [G.nodes[n][attribute]\n","                                  for n in node_ids]\n","        features[f'Shared_{attribute}_id_f'].append(int(att_id == att_f))\n","        features[f'Shared_{attribute}_id_fof'].append(int(att_id == att_fof))\n","        features[f'Shared_{attribute}_f_fof'].append(int(att_f == att_fof))\n","\n","    X = update_features(features)\n","\n","f_measure, clf = evaluate(X, y)\n","print(f\"The f-measure is {f_measure:0.2f}\")"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["The f-measure is 0.82\n"]}]},{"cell_type":"markdown","metadata":{"id":"gg7jh64TR5tM"},"source":["Incorporating the _Major_ and _Dorm_ attributes has improved model performance. Now let's consider adding _Major_ and _Dorm_ size into the mix. We can count the number of students associated with each major and dorm, and include this count as one of our features. However, we need to be careful. As we previously discussed, our trained model can cheat by utilizing size as a substitute for a category id. 
\n","\n","**Adding size-related profile features**"]},{"cell_type":"code","metadata":{"id":"xeuLHwSPR5tM","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511963980,"user_tz":-330,"elapsed":585,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"10d8f7a3-7386-4843-a0a5-0215e7c5a8fe"},"source":["from collections import Counter\n","\n","for attribute in ['Major', 'Dorm']:\n","    counter = Counter(df_profile[attribute].values)\n","    att_to_size = dict(counter)\n","    features = {f'{col}_{attribute}_Size': [] for col in cols}\n","    for node_ids in df_obs[cols].values:\n","        for node, feature_name in zip(node_ids, features.keys()):\n","            size = att_to_size[G.nodes[node][attribute]]\n","            features[feature_name].append(size)\n","\n","    X = update_features(features)\n","\n","f_measure, clf = evaluate(X, y)\n","print(f\"The f-measure is {f_measure:0.2f}\")"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["The f-measure is 0.85\n"]}]},{"cell_type":"markdown","metadata":{"id":"HbdlsHoxR5tM"},"source":["Performance has increased from 0.82 to 0.85. The introduction of size has impacted our model. Let's dive deeper into that impact. 
We'll start by printing out the feature importance scores.\n","\n","**Ranking features by their importance score**"]},{"cell_type":"code","metadata":{"id":"r-eKmdWtR5tM","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511963981,"user_tz":-330,"elapsed":9,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"3572c480-c5b2-4374-c0b4-04fc9387102b"},"source":["feature_names = df_features.columns.values\n","view_top_features(clf, feature_names)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["FoF_Dorm_Size: 0.25\n","Shared_Cluster_id_fof: 0.16\n","Shared_Dorm_id_fof: 0.05\n","FoF_PageRank: 0.04\n","Profile_ID_Major_Size: 0.04\n","FoF_Major_Size: 0.04\n","FoF_Edge_Count: 0.04\n","Profile_ID_PageRank: 0.03\n","Profile_ID_Friend_Sharing_Likelihood: 0.03\n","Friend_Friend_Sharing_Likelihood: 0.03\n","Friend_Edge_Count: 0.03\n","Shared_Major_id_fof: 0.03\n","FoF_Friend_Sharing_Likelihood: 0.02\n","Friend_PageRank: 0.02\n","Profile_ID_Dorm_Size: 0.02\n","Profile_ID_Edge_Count: 0.02\n","Profile_ID_Sex_Value: 0.02\n","Friend_Major_Size: 0.02\n","Profile_ID_Relationship_Status_Value: 0.02\n","FoF_Sex_Value: 0.01\n","Friend_Dorm_Size: 0.01\n","Profile_ID_Year_Value: 0.01\n","Friend_Sex_Value: 0.01\n","Shared_Major_id_f: 0.01\n","Friend_Relationship_Status_Value: 0.01\n","Friend_Year_Value: 0.01\n"]}]},{"cell_type":"markdown","metadata":{"id":"mLJ29m_nR5tM"},"source":["The feature importance scores are dominated by two features: _FoF_Dorm_Size_ and _Shared_Cluster_id_fof_. The presence of _FoF_Dorm_Size_ is a bit concerning. As we've discussed, a single dorm accounts for over 50% of the network data. Is our model simply memorizing that dorm based on its size? We can find out by actually visualizing a trained decision tree. 
\n","\n","**Displaying the top branches of the tree**"]},{"cell_type":"code","metadata":{"id":"45d4i0-qR5tN","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511964952,"user_tz":-330,"elapsed":6,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"8a6c0b62-cb32-4306-bf84-4423b6d1f9ec"},"source":["from sklearn import tree\n","\n","clf_depth2 = DecisionTreeClassifier(max_depth=2)\n","clf_depth2.fit(X, y)\n","text_tree = tree.export_text(clf_depth2, feature_names=list(feature_names))\n","print(text_tree)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["|--- FoF_Dorm_Size <= 278.50\n","| |--- Shared_Cluster_id_fof <= 0.50\n","| | |--- class: 0\n","| |--- Shared_Cluster_id_fof > 0.50\n","| | |--- class: 0\n","|--- FoF_Dorm_Size > 278.50\n","| |--- Shared_Cluster_id_fof <= 0.50\n","| | |--- class: 0\n","| |--- Shared_Cluster_id_fof > 0.50\n","| | |--- class: 1\n","\n"]}]},{"cell_type":"markdown","metadata":{"id":"Pt_XS2LFR5tN"},"source":["According to the tree, the most important signal is whether _FoF_Dorm_Size_ is less than 279. This raises the question: how many dorms contain at least 279 students? 
\n","\n","**Checking dorms with at least 279 students**"]},{"cell_type":"code","metadata":{"id":"IJYLfO6vR5tN","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511967128,"user_tz":-330,"elapsed":7,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"92494656-88b6-4060-c6c9-f27e110a2b2d"},"source":["counter = Counter(df_profile.Dorm.values)\n","for dorm, count in counter.items():\n","    if count < 279:\n","        continue\n","\n","    print(f\"Dorm {dorm} holds {count} students.\")"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["Dorm 12 holds 2739 students.\n","Dorm 1 holds 413 students.\n"]}]},{"cell_type":"markdown","metadata":{"id":"9gWCZH6sR5tN"},"source":["Just two of the 15 dorms contain at least 279 FriendHook-registered students. Essentially, our model relies on the two most-populous dorms to make its decisions. This might not generalize to other college campuses. For instance, consider a campus whose dormitories are smaller, and hold 200 students at the most. The model will completely fail to predict user behavior in this instance. \n","\n","Perhaps we could try deleting the size-related features while also adjusting our choice of classifier. There is a slight chance that we'll achieve comparable performance without relying on dorm size. This is unlikely but is still worth trying. \n","\n","**Deleting all size-related features**"]},{"cell_type":"code","metadata":{"id":"zwO2A2hZR5tQ"},"source":["X_with_sizes = X.copy()\n","X = delete_features(df_features, regex=r'_Size')"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"eB27CUnCR5tQ"},"source":["## Optimizing Performance Across a Steady Set of Features\n","\n","Will switching the model type from a decision tree to a random forest improve performance? 
Let's find out.\n","\n","**Training and evaluating a Random Forest classifier**"]},{"cell_type":"code","metadata":{"id":"cw7-0shbR5tQ","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511976583,"user_tz":-330,"elapsed":1449,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"4e558a1c-f0d0-4fbe-fa91-4b5fd8212cf8"},"source":["np.random.seed(0)\n","from sklearn.ensemble import RandomForestClassifier\n","f_measure, clf = evaluate(X, y, model_type=RandomForestClassifier)\n","print(f\"The f-measure is {f_measure:0.2f}\")"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["The f-measure is 0.75\n"]}]},{"cell_type":"markdown","metadata":{"id":"8AlbUwndR5tR"},"source":["\n","Switching the type of model has not helped. Perhaps instead we can boost performance by optimizing the hyperparameters? Within this book, we've focused on a single decision tree hyperparameter: max depth. Will limiting the depth improve our predictions? Let's quickly check using a simple grid search. 
\n","\n","**Optimizing max depth using grid search**"]},{"cell_type":"code","metadata":{"id":"7G7rA7LFR5tR","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511980785,"user_tz":-330,"elapsed":4208,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"d20f5aa8-8bcf-4412-89af-1b30a2f6c2fc"},"source":["from sklearn.model_selection import GridSearchCV\n","np.random.seed(0)\n","\n","hyperparams = {'max_depth': list(range(1, 100)) + [None]}\n","clf_grid = GridSearchCV(DecisionTreeClassifier(), hyperparams,\n","                        scoring='f1_macro', cv=2)\n","clf_grid.fit(X, y)\n","best_f = clf_grid.best_score_\n","best_depth = clf_grid.best_params_['max_depth']\n","print(f\"A maximized f-measure of {best_f:.2f} is achieved when \"\n","      f\"max_depth equals {best_depth}\")"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["A maximized f-measure of 0.84 is achieved when max_depth equals 5\n"]}]},{"cell_type":"markdown","metadata":{"id":"GqNB0WiMR5tR"},"source":["Setting `max_depth` to 5 improves the f-measure from 0.82 to 0.84. This level of performance is comparable with our dorm-size-dependent model. Of course, we cannot make a fair comparison without first running a grid search on the size-inclusive `X_with_sizes` feature-matrix. Will optimizing on `X_with_sizes` yield an even better classifier? 
Let's find out.\n","\n","**Applying grid search to size-dependent training data**"]},{"cell_type":"code","metadata":{"id":"ZE-wPb7NR5tR","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511985998,"user_tz":-330,"elapsed":5217,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"0b52b3de-fedf-49c6-a60b-7f2409643d4d"},"source":["np.random.seed(0)\n","clf_grid.fit(X_with_sizes, y)\n","best_f = clf_grid.best_score_\n","best_depth = clf_grid.best_params_['max_depth']\n","print(f\"A maximized f-measure of {best_f:.2f} is achieved when \"\n","      f\"max_depth equals {best_depth}\")"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["A maximized f-measure of 0.85 is achieved when max_depth equals 6\n"]}]},{"cell_type":"markdown","metadata":{"id":"JxspNBOWR5tS"},"source":["The grid search did not improve performance on `X_with_sizes`. Thus, we can conclude that with the right choice of max depth, both the size-dependent and size-independent models perform with approximately equal quality. Consequently, we can train a generalizable, size-independent model without sacrificing performance. 
\n","\n","**Training a decision tree with `max_depth` set to 5**"]},{"cell_type":"code","metadata":{"id":"ZWDYceBcR5tS","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511986000,"user_tz":-330,"elapsed":25,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"773c8799-e098-4b3c-f027-676d05853521"},"source":["clf = DecisionTreeClassifier(max_depth=5)\n","clf.fit(X, y)"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["DecisionTreeClassifier(max_depth=5)"]},"metadata":{},"execution_count":52}]},{"cell_type":"markdown","metadata":{"id":"w1MMvAJBR5tS"},"source":["## Interpreting the Trained Model\n","\n","Let's print our model's feature importance scores.\n","\n","**Ranking features by their importance score**"]},{"cell_type":"code","metadata":{"id":"zf2YWB5HR5tS","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511991446,"user_tz":-330,"elapsed":400,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"9a942d24-ed85-4650-85f6-0d82c54cfb6c"},"source":["feature_names = df_features.columns\n","view_top_features(clf, feature_names)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["Shared_Dorm_id_fof: 0.42\n","Shared_Cluster_id_fof: 0.29\n","Shared_Major_id_fof: 0.10\n","Shared_Dorm_f_fof: 0.06\n","Profile_ID_Relationship_Status_Value: 0.04\n","Profile_ID_Sex_Value: 0.04\n","Friend_Edge_Count: 0.02\n","Friend_PageRank: 0.01\n","Shared_Dorm_id_f: 0.01\n"]}]},{"cell_type":"markdown","metadata":{"id":"IQ2x600eR5tT"},"source":["Only nine important features remain. Only three features have an importance score that's at or above 0.10. These are `Shared_Dorm_id_fof`, `Shared_Cluster_id_fof`, and `Shared_Major_id_fof`. 
Thus, the model is primarily driven by the following three questions:\n","\n","1. Do the user and the friend-of-a-friend share a dormitory? Yes or no?\n","2. Do the user and the friend-of-a-friend share a social group? Yes or no?\n","3. Do the user and the friend-of-a-friend share a major? Yes or no?\n","\n","Intuitively, if the answers to all three questions are _Yes_, then the user and the friend-of-a-friend are more likely to connect on FriendHook. Let's test this intuition by displaying the tree. \n","\n","**Displaying the top branches of the tree**"]},{"cell_type":"code","metadata":{"id":"EsbpNSHhR5tT","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511991984,"user_tz":-330,"elapsed":8,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"2d67aaf9-26e9-4319-c7d9-5f5bedee9790"},"source":["clf_depth3 = DecisionTreeClassifier(max_depth=3)\n","clf_depth3.fit(X, y)\n","text_tree = tree.export_text(clf_depth3,\n","                             feature_names=list(feature_names))\n","print(text_tree)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["|--- Shared_Dorm_id_fof <= 0.50\n","| |--- Shared_Cluster_id_fof <= 0.50\n","| | |--- Shared_Major_id_fof <= 0.50\n","| | | |--- class: 0\n","| | |--- Shared_Major_id_fof > 0.50\n","| | | |--- class: 0\n","| |--- Shared_Cluster_id_fof > 0.50\n","| | |--- Shared_Major_id_fof <= 0.50\n","| | | |--- class: 0\n","| | |--- Shared_Major_id_fof > 0.50\n","| | | |--- class: 1\n","|--- Shared_Dorm_id_fof > 0.50\n","| |--- Shared_Cluster_id_fof <= 0.50\n","| | |--- Profile_ID_Sex_Value <= 0.50\n","| | | |--- class: 0\n","| | |--- Profile_ID_Sex_Value > 0.50\n","| | | |--- class: 2\n","| |--- Shared_Cluster_id_fof > 0.50\n","| | |--- Shared_Dorm_f_fof <= 0.50\n","| | | |--- class: 1\n","| | |--- Shared_Dorm_f_fof > 0.50\n","| | | |--- class: 
1\n","\n"]}]},{"cell_type":"markdown","metadata":{"id":"xaLpaqqGR5tT"},"source":["\n","The model's logic is very straightforward. Users who share social groups and either living spaces or study schedules are more likely to connect. There's nothing surprising about that. What is surprising is how the _Sex_ feature drives _Class 2_ label prediction. According to our tree, rejection is more likely when:\n","\n","1. The users share a dorm but are not in the same social group.\n","2. The request sender is of a certain specific sex.\n","\n","Of course, we know that _Class 2_ labels are fairly sparse within our data. Perhaps the model's predictions are caused by random noise arising from the sparse sampling? Let's quickly check how well we predict rejection.\n","\n","**Evaluating a rejection classifier**"]},{"cell_type":"code","metadata":{"id":"UZdrRIOOR5tT","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511993082,"user_tz":-330,"elapsed":7,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"27da7635-14a0-4bad-a6d9-ad50646372b0"},"source":["y_reject = y * (y == 2)\n","f_measure, clf_reject = evaluate(X, y_reject, max_depth=5)\n","print(f\"The f-measure is {f_measure:0.2f}\")"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["The f-measure is 0.97\n"]}]},{"cell_type":"markdown","metadata":{"id":"UuGMqc8CR5tU"},"source":["Wow, the f-measure is actually very high! We can predict rejection very well, despite the sparsity of data. What features drive rejection? 
Let's check by printing the new feature importance scores.\n","\n","**Ranking features by their importance score**"]},{"cell_type":"code","metadata":{"id":"yjMg-212R5tU","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511994583,"user_tz":-330,"elapsed":10,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"f086a958-3cff-4fd8-954d-b9b69a30b307"},"source":["view_top_features(clf_reject, feature_names)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["Profile_ID_Sex_Value: 0.40\n","Profile_ID_Relationship_Status_Value: 0.24\n","Shared_Major_id_fof: 0.21\n","Shared_Cluster_id_fof: 0.10\n","Shared_Dorm_id_fof: 0.05\n"]}]},{"cell_type":"markdown","metadata":{"id":"Vvi2S_xRR5tU"},"source":["Interesting! Rejection is primarily driven by the user's _Sex_ and _Relationship_Status_ attributes. Let's visualize the trained tree to learn more.\n","\n","**Displaying the rejection-predicting tree**"]},{"cell_type":"code","metadata":{"id":"iY3lwtWWR5tU","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1637511995784,"user_tz":-330,"elapsed":9,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"84153145-1545-4ddc-9f00-335d216675ca"},"source":["text_tree = tree.export_text(clf_reject, \n"," feature_names=list(feature_names))\n","print(text_tree)"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["|--- Shared_Cluster_id_fof <= 0.50\n","| |--- Shared_Major_id_fof <= 0.50\n","| | |--- Shared_Dorm_id_fof <= 0.50\n","| | | |--- class: 0\n","| | |--- Shared_Dorm_id_fof > 0.50\n","| | | |--- Profile_ID_Relationship_Status_Value <= 2.50\n","| | | | |--- class: 0\n","| | | |--- Profile_ID_Relationship_Status_Value > 2.50\n","| | | | |--- 
Profile_ID_Sex_Value <= 0.50\n","| | | | | |--- class: 0\n","| | | | |--- Profile_ID_Sex_Value > 0.50\n","| | | | | |--- class: 2\n","| |--- Shared_Major_id_fof > 0.50\n","| | |--- Profile_ID_Sex_Value <= 0.50\n","| | | |--- class: 0\n","| | |--- Profile_ID_Sex_Value > 0.50\n","| | | |--- Profile_ID_Relationship_Status_Value <= 2.50\n","| | | | |--- class: 0\n","| | | |--- Profile_ID_Relationship_Status_Value > 2.50\n","| | | | |--- class: 2\n","|--- Shared_Cluster_id_fof > 0.50\n","| |--- class: 0\n","\n"]}]},{"cell_type":"markdown","metadata":{"id":"c844Hq5RR5tV"},"source":["According to the tree, individuals with _Sex Category 1_ and _Relationship Status Category 3_ are sending friend requests to people outside their social group. These friend requests are likely to get rejected. Of course, we don't know the identity of the categories that lead to rejection. However, as scientists, we can still speculate. Given what we know about human nature, it wouldn't be surprising if this behavior were driven by single men. Perhaps they are trying to connect with women outside their social circle in order to get a date? If so, then the request will probably be rejected. Again, all this is speculation. However, this hypothesis is worth discussing with the product managers at FriendHook. If our hypothesis is correct, then certain changes should be introduced to the product. More steps could be taken to limit unwanted dating requests. Alternatively, new product changes could be added that make it easier for single people to connect."]},{"cell_type":"markdown","metadata":{"id":"lYhpcF49VW8Q"},"source":["In this case study, we have agonized over keeping our model generalizable. A model that does not generalize beyond the training set is worthless, even if the performance score seems high. Unfortunately, it’s hard to know whether a model can generalize until it’s tested on external data. But we can try to remain aware of hidden biases that won’t generalize well to other datasets. 
Failure to do so can yield serious consequences. Consider the following true story.\n","\n","For many years, machine learning researchers have tried to automate the field of radiology. In radiology, trained doctors examine medical images (such as X-rays) to diagnose disease. This can be treated as a supervised learning problem in which the images are features and the diagnoses are class labels. By the year 2016, multiple radiology models were published in the scientific literature. Each published model was supposed to be highly accurate based on internal evaluation. That year, leading machine learning researchers publicly declared that “we should stop training radiologists” and that “radiologists should be worried about their jobs.” Four years later, the negative publicity had led to a worldwide radiologist shortage—medical students were reluctant to enter a field that seemed destined for full automation. But by 2020, the promise of automation had failed to materialize. Most of the published models performed very poorly on new data. Why? Well, it turns out that imaging outputs differ from hospital to hospital. Different hospitals use slightly different lighting and different settings on their imaging machines. Thus, a model trained at Hospital A could not generalize well to Hospital B. Despite their seemingly high performance scores, the models were not fit for generalized use. The machine learning researchers had been too optimistic; they failed to take into account the biases inherent in their data. These failures inadvertently led to a crisis in the medical community. A more thoughtful evaluation of generalizability could have prevented this from happening."]},{"cell_type":"markdown","metadata":{"id":"Y7fRvG64F0YF"},"source":["## Summary\n","\n","- Superior machine learning algorithms do not necessarily work in every situation. Our decision tree model outperformed our random forest model, even though random forests are considered superior in the literature. 
We should never blindly assume that a model will always work well in every possible scenario. Instead, we should intelligently calibrate our model choice based on the specifics of the problem.\n","- Proper feature selection is less of a science and more of an art. We cannot always know in advance which features will boost a model’s performance. However, the commonsense integration of diverse and interesting features into our model should eventually improve prediction quality.\n","- We should pay careful attention to the features we feed into our model. Otherwise, the model may not generalize to other datasets.\n","- Proper hyperparameter optimization can sometimes significantly boost a model’s performance.\n","- Occasionally, it seems like nothing is working and our data is simply insufficient. However, with grit and perseverance, we can eventually yield meaningful results. Remember, a good data scientist should never give up until they have exhausted every possible avenue of analysis."]},{"cell_type":"markdown","metadata":{"id":"a9khCVuLUnzd"},"source":["---"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"Mq6Q1HzjUnzd","executionInfo":{"status":"ok","timestamp":1637512032091,"user_tz":-330,"elapsed":3739,"user":{"displayName":"Sparsh Agarwal","photoUrl":"https://lh3.googleusercontent.com/a/default-user=s64","userId":"13037694610922482904"}},"outputId":"579d9d3d-a4d7-498f-c8b6-280a436d464f"},"source":["!pip install -q watermark\n","%reload_ext watermark\n","%watermark -a \"Sparsh A.\" -m -iv -u -t -d"],"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["Author: Sparsh A.\n","\n","Last updated: 2021-11-21 16:27:10\n","\n","Compiler : GCC 7.5.0\n","OS : Linux\n","Release : 5.4.104+\n","Machine : x86_64\n","Processor : x86_64\n","CPU cores : 2\n","Architecture: 64bit\n","\n","re : 2.2.1\n","numpy : 1.19.5\n","IPython : 5.5.0\n","markov_clustering: 0.0.6.dev0\n","networkx : 2.6.3\n","pandas : 
1.1.5\n","sklearn : 0.0\n","matplotlib : 3.2.2\n","\n"]}]},{"cell_type":"markdown","metadata":{"id":"SLUgXbkjUnze"},"source":["---"]}]}