{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\" " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "using Keras version: 2.2.4\n" ] } ], "source": [ "import ktrain\n", "from ktrain import graph as gr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Node Classification in Graphs\n", "\n", "Consider a social network (e.g., Facebook, LinkedIn, Twitter) where each node is a person and links represent friendships. Each node (or person) in the graph can be described by various attributes such as their location, alma mater, organizational memberships, gender, relationship status, children, etc. Suppose we know the U.S. political affiliation (e.g., Democrat, Republican, Libertarian, Green Party) of only a small subset of nodes, with the affiliation of the remaining nodes being unknown. Here, node classification involves predicting the political affiliation of the unknown nodes based only on the small subset of nodes for which we know the political affiliation. \n", "\n", "Whereas traditional tabular-based models (e.g., logistic regression, SVM) utilize only the node's attributes to predict a node's label, graph neural networks utilize both the node's attributes and the graph's structure. For instance, to predict the political affiliation of a person, it is helpful to look not only at the person's attributes but also at the attributes of other people in the vicinity of this person in the social network. Birds of a feather typically flock together. By exploiting graph structure, graph neural networks require much less labeled ground truth than non-graph approaches. 
For instance, in the example below, we will consider the labels of only a very small fraction of all nodes to build our model.\n", "\n", "## Hateful Twitter Users\n", "In this notebook, we will use *ktrain* to perform node classification on a Twitter graph to predict hateful users. Each Twitter user is described by various attributes related to both their profile and their tweet behavior. Examples include number of tweets and retweets, status length, etc. \n", "\n", "The dataset can be downloaded from [here](https://www.kaggle.com/manoelribeiro/hateful-users-on-twitter).\n", "\n", "For node classification, *ktrain* requires two files formatted in a specific way:\n", "- A CSV or tab-delimited file containing the links (or edges) in the graph. Each row contains two node IDs representing an edge.\n", "- A CSV or tab-delimited file describing the attributes and label associated with each node. The first column is the node ID and the last column should be the label or target (as string labels such as \"hate\" or \"normal\"). All other columns should contain numerical features and are assumed to be standardized or transformed as necessary. If the last column representing the target has missing values, these are treated as a holdout set for which predictions can be made after training the model. The numeric feature columns should not have any missing values.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clean and Prepare Data\n", "We must first transform the raw dataset into the file formats described above. We consider two files: `users.edges`, which describes the graph structure, and `users_neighborhood_anon.csv`, which contains each node's label and attributes. The file `users.edges` is the edge list and is, for the most part, already in the format expected by *ktrain*. We must clean and prepare `users_neighborhood_anon.csv` into the format expected by *ktrain*. 
We will drop unused columns, normalize numeric attributes, re-order/transform the target column `hate` into an interpretable string label, and save the data as a tab-delimited file." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "| | statuses_count | followers_count | followees_count | favorites_count | listed_count | negotiate_empath | vehicle_empath | science_empath | timidity_empath | gain_empath | ... | tweet number | retweet number | quote number | status length | number urls | baddies | mentions | time_diff | time_diff_median | hate |\n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n", "
| 0 | 1.541150 | 0.046773 | 1.104767 | 1.869391 | 0.017835 | -1.752256 | 0.164900 | 0.181173 | 0.875069 | 1.130523 | ... | -0.049013 | 0.321929 | -0.369992 | -1.036127 | -0.796091 | 0.047430 | 0.356495 | -1.888186 | -1.299249 | normal |\n", "
| 1 | -0.700240 | 0.772450 | -0.526061 | -1.434183 | 0.613187 | -0.735320 | -0.864337 | 0.599279 | 1.610977 | -1.203049 | ... | 1.479066 | -1.999580 | -1.545285 | -0.188945 | -1.875745 | -0.626192 | -1.972207 | 0.160925 | -1.512603 | unknown |\n", "
| 2 | -1.077284 | -0.127775 | 0.767345 | -0.669050 | -0.523882 | -0.118440 | -1.573040 | 1.211083 | -0.154213 | 0.932754 | ... | -0.201320 | 0.452537 | -1.545285 | 0.637869 | 0.884530 | -0.096918 | 0.348954 | 0.698841 | 0.122176 | unknown |\n", "
| 3 | 1.908494 | -0.021575 | -0.548705 | 0.078540 | 0.017835 | -0.472125 | 1.281633 | -0.544862 | 1.259492 | -0.456470 | ... | -1.018822 | 1.085858 | -0.662393 | -0.701835 | 0.088472 | -0.626192 | -1.254997 | -1.576801 | -1.311031 | unknown |\n", "
| 4 | -0.778589 | 0.729918 | 2.296049 | -0.725089 | 0.700128 | -1.488804 | -1.573040 | -0.969812 | 0.199834 | -1.203049 | ... | -0.427866 | 0.638106 | -1.545285 | 1.370832 | 0.655433 | 0.955922 | -1.914894 | 0.803553 | 1.472247 | unknown |\n", "
5 rows × 205 columns
\n", "\n", "| | UserID | Predicted |\n", "
|---|---|---|\n", "
| 52 | 56 | hateful |\n", "
| 129 | 140 | hateful |\n", "
| 243 | 259 | hateful |\n", "
| 475 | 499 | hateful |\n", "
| 501 | 526 | hateful |\n", "