{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Montreal Python 69 - Tutorial\n", "### By Abbas Taher\n", "### GoFlek Inc.\n", "### Monday Feb 5th 2018\n", "### Three Methods to Aggregate Subscriber's Interest - Aggregating Data\n", "### AGENDA:\n", "\n", "> a- Recipe 1- Using Python Dictionary\n", " \n", "> b- Short Overview of PySpark\n", " \n", "> c- Recipe 2- Using Apache Spark - GroupBy Transformation\n", " \n", "> d- Recipe 3- Using Apache Spark - ReduceBy Transformation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem Statement: \"Aggregate Interest by ID\"\n", "> For Subscribers of an Online Magazine\n", "> Aggregate each subscriber's interest/likes into a (key,value) pair in one record\n", "\n", "\n", "### Original Data:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | ID | Interest |
|---|---|---|
| 0 | 1001 | Sports |
| 1 | 1001 | Techno |
| 2 | 1002 | Auto |
| 3 | 1002 | Food |
| 4 | 1002 | Sports |
| 5 | 1002 | Techno |
| 6 | 1003 | Food |
| ... | ... | ... |
| 37 | 1018 | Auto |
| 38 | 1018 | Food |
| 39 | 1018 | Techno |
| 40 | 1019 | Food |
| 41 | 1019 | Techno |
| 42 | 1020 | Auto |
| 43 | 1020 | Food |

44 rows × 2 columns

|  | ID | Interest |
|---|---|---|
| 0 | 1001 | Sports, Techno |
| 1 | 1002 | Auto, Food, Sports, Techno |
| 2 | 1003 | Food |
| 3 | 1004 | Auto, Food |
| 4 | 1005 | Auto, Food, Techno |
| 5 | 1006 | Food, Sports, Techno |
| 6 | 1007 | Sports, Techno |
| 7 | 1008 | Auto, Techno |
| 8 | 1009 | Food |
| 9 | 1010 | Auto, Food |
| 10 | 1011 | Food,Sport |
| 11 | 1012 | Auto, Food, Sports, Techno |
| 12 | 1013 | Auto |
| 13 | 1014 | Sports |
| 14 | 1015 | Auto, Food |
| 15 | 1016 | Auto, Food, Sports, Techno |
| 16 | 1017 | Sports |
| 17 | 1018 | Auto, Food, Techno |
| 18 | 1019 | Food, Techno |
| 19 | 1020 | Auto, Food |
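
The step from the original data to the aggregated table can be sketched in plain Python with a dictionary keyed by subscriber ID. This is only a minimal illustration of the idea, not necessarily the tutorial's own Recipe 1: the names `rows` and `aggregate_interests` are hypothetical, and the sample holds only the first seven rows of the original data.

```python
from collections import defaultdict

# Hypothetical sample: the first seven (ID, Interest) rows of the original data.
rows = [
    (1001, "Sports"),
    (1001, "Techno"),
    (1002, "Auto"),
    (1002, "Food"),
    (1002, "Sports"),
    (1002, "Techno"),
    (1003, "Food"),
]


def aggregate_interests(records):
    """Collect the interests of each subscriber ID into one (key, value) pair."""
    grouped = defaultdict(list)
    for subscriber_id, interest in records:
        grouped[subscriber_id].append(interest)
    # Sort each list so the result does not depend on input order,
    # then join it into a single comma-separated string per ID.
    return {
        subscriber_id: ", ".join(sorted(interests))
        for subscriber_id, interests in grouped.items()
    }


aggregated = aggregate_interests(rows)
for subscriber_id, interests in sorted(aggregated.items()):
    print(subscriber_id, interests)
```

For these seven rows the result has one record per subscriber, matching the first three rows of the aggregated table. Sorting the interests is a design choice made here for deterministic output; it is an assumption, not something the source specifies.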