{ "metadata": { "name": "", "signature": "sha256:9510ae37a0e9ee094694c921732ad26abdf6bbb32599c22b0aa3cd368114a9ee" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "import json\n", "import csv\n", "import requests\n", "from collections import Set\n", "from pprint import pprint\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Picmaglia Data import \n", "\n", "Often, when building applications with Datamaglia a large amount of user data needs to be imported through the Datmaglia API. This is typically the case when integrating Datamaglia with an existing application where we haven't been posting user information and actions to the Datamaglia API as they occur.\n", "\n", "This notebook will show how we can use the Datamaglia API to import data from our Picmaglia application. We have three CSV files that we have exported from our database: `users.csv` (simply a list of usernames), `out_pics.csv` (pictures, including some metadata about what user created them and the location where they were taken), and `out_likes.csv` (a list of username, picture pairs where the user has liked the picture). We would like to import this data into Datamaglia so that we can start generating recommendationed pictures for users, based on their previous liked photos. This import will be done in 5 steps:\n", "\n", "1. Configure the application in the management console\n", "1. Define helpers and constants\n", "1. Import users\n", "1. Import pictures\n", "1. Import user-LIKES->picture actions\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Management Console Configuration\n", "\n", "We first must define our data model in the [Datamaglia Managment Console](https://console.datamaglia.com). Ultimately we are interested in users liking photos, so our data model looks likes this:\n", "\n", "![Picmaglia Admin](img/picmaglia_admin.png)\n", "\n", "Note the configuration of the Subgraph here, `person(id)-[:likes]->content(url)`. This specifies the types of the data objects we will be inserting into the Datamaglia API and what user actions we will use to generate recommendations. For more information on the concept of a Subgraph and configuration Datamagli data models see our [Getting Started Tutorial](http://nbviewer.ipython.org/github/Datamaglia/examples/blob/master/datamaglia-demo/datamaglia-demo.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define helpers and constants\n", "\n", "Here we will define some helper functions and some constants. These are documented below, but pay special attention to the URLS that we define." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# subgraph ID from management console. 
\n", "SUBGRAPH_ID = \"6044a63406e748bc9c1cd54c1a77f4da\"\n", "\n", "# Our API key, also from te management console to authenticate our requests.\n", "API_KEY = \"f88edf839cad4600b139bed5d6184efb\"\n", "\n", "# Specify Auth-Token header\n", "HEADERS = {'Auth-Token': API_KEY, 'Content-Type': 'application/json'}\n", "\n", "# Base URL for Datamaglia API\n", "BASE_URL = 'https://api.datamaglia.com/v1{}'\n", "\n", "# URL for inserting sources (users)\n", "SRC_URL = BASE_URL.format('/subgraphs/' + SUBGRAPH_ID + '/data/sources/')\n", "\n", "# URL for inserting targets (pictures)\n", "TARGET_URL = BASE_URL.format('/subgraphs/' + SUBGRAPH_ID + '/data/targets')\n", "\n", "# URL for inserting relationships (user -[likes]-> photo)\n", "REL_URL = BASE_URL.format('/subgraphs/' + SUBGRAPH_ID + '/data/relationships/')\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "# helper function to iterate through a list in chunks of a specified size\n", "def chunker(seq, size):\n", " return (seq[pos:pos + size] for pos in xrange(0, len(seq), size))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing users\n", "\n", "Our users are defined in a `data/users.csv` as simple list of usernames, one username per row. We will load this file and make `POST` requests to the Datamaglia API to add these users as **Sources** in our Subgraph." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# init empty list to hold our users\n", "users = []\n", "\n", "# use csv.DictReader to parse the file and append to our users list\n", "with open('data/users.csv') as f:\n", " for line in csv.DictReader(f):\n", " users.append(line)\n", " \n", "print users[0:2]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[{'users': 'jacubert'}, {'users': 'namirte_stevens'}]\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we define a function to take a list of users and create the structure that will be serialized to JSON and sent as the body of our POST request to insert these users in the Datamaglia API. Note that `subgraphs/{subgraphId}/sources` takes an `entities` JSON array for batch inserts. 
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def createSources(users):\n", " # list comprehension to create a dict for each user in the users list of the form {id: username}\n", " entities = [\n", " {\n", " 'id': user['users']\n", " } for user in users\n", " ]\n", " payload = {'entities': entities} # This payload will be serialized to JSON and sent with our request\n", " resp = requests.post(SRC_URL, headers=HEADERS, data=json.dumps(payload)) # Make the POST request\n", " print resp # 204 status " ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we call the `createSources` function with the usernames we loaded previously in batches of 100 " ] }, { "cell_type": "code", "collapsed": false, "input": [ "for chunk in chunker(users[7000:], 100):\n", " createSources(chunk)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ 
"\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing Pictures\n", "\n", "Next we import our pictures which represent the **Target** piece of our Subgraph. These are the items that we will be recommending to users based on previous photos they have liked. This piece is very similar to the Importing Users section above, however here we demonstrate how we can store arbitrary properties for a given Source or Target." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# init empty pictures list\n", "pictures = []\n", "\n", "# read csv file and add each picture dict to the pictures list\n", "with open('data/out_pic.csv') as f:\n", " for line in csv.DictReader(f):\n", " pictures.append(line)\n", " \n", "pprint(pictures[0:2])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[{'lat': '27.74613857589044',\n", " 'lon': '-15.585136413574219',\n", " 'text': 'my travel',\n", " 'url': 'https://ppcdn.500px.org/95503699/7cf68738631dba1d1b946b3e0a90ab21264c238f/3.jpg?v=11',\n", " 'user': 'danno_80'},\n", " {'lat': '-34.829728',\n", " 'lon': '19.984637',\n", " 'text': 'Another shot from a shipwreck at the coast of Cape Agulhas in South Africa.',\n", " 'url': 'https://ppcdn.500px.org/95503635/39614c9852eadb4c5dd63a75a764eb7df3a5e6c8/3.jpg?v=10',\n", " 'user': 'AndreasKunz1'}]\n" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we define the `createTargets` function which will create the JSON structure for our pictures and make the POST request to the Datamaglia API. Note that we can define a `properties` list of `key, value` pairs. This data will be returned to us when we generate recommendations. A common use case for this is storing data necessary to render the client views for recommendations, without making a separate request to hydrate the recommended objects from the client backend." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def createTargets(pictures):\n", " entities = [\n", " {\n", " 'id': pic['url'],\n", " 'properties': [\n", " {'key': 'lat', 'value': pic['lat']},\n", " {'key': 'lon', 'value': pic['lon']},\n", " {'key': 'text', 'value': pic['text']},\n", " {'key': 'user', 'value': pic['user']}\n", " ]\n", " } for pic in pictures\n", " ]\n", " payload = {'entities': entities}\n", " resp = requests.post(TARGET_URL, headers=HEADERS, data=json.dumps(payload))\n", " print resp # Should be 204" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "for chunk in chunker(pictures, 100):\n", " createTargets(chunk)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] } ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import relationships\n", "Now that we have created our Sources and Targets using the Datamaglia API we can post the relationships, or actions, that define our user preferences. These relationships form the data that will be used for generating recommendations." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "likes = []\n", "with open('data/out_like.csv') as f:\n", " for line in csv.DictReader(f):\n", " likes.append(line)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 28 }, { "cell_type": "code", "collapsed": false, "input": [ "def createLikes(likes):\n", " entities = [\n", " {\n", " 'weight': 0,\n", " 'source': like['user'],\n", " 'target': like['pic']\n", " } for like in likes\n", " ]\n", " payload = {'entities': entities}\n", " resp = requests.post(REL_URL, headers=HEADERS, data=json.dumps(payload))\n", " print resp # Should be 204" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 29 }, { "cell_type": "code", "collapsed": false, "input": [ "for chunk in chunker(likes, 100):\n", " createLikes(chunk)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n", "" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 30 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've now imported all our data and are ready to query for recommendations!" ] } ], "metadata": {} } ] }