{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Montreal Python 69 - Tutorial\n", "### By Abbas Taher\n", "### GoFlek Inc.\n", "### Monday Feb 5th 2018\n", "### Three Methods to Aggregate Subscriber's Interest - Aggregating Data\n", "### AGENDA:\n", "\n", "> a- Recipe 1- Using Python Dictionary\n", " \n", "> b- Short Overview of PySpark\n", " \n", "> c- Recipe 2- Using Apache Spark - GroupBy Transformation\n", " \n", "> d- Recipe 3- Using Apache Spark - ReduceBy Transformation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem Statement: \"Aggregate Interest by ID\"\n", "> For Subscribers of an Online Magazine\n", "> Aggregate each subscriber's interest/likes into a (key,value) pair in one record\n", "\n", "\n", "### Original Data:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|     | ID   | Interest |
|-----|------|----------|
| 0   | 1001 | Sports   |
| 1   | 1001 | Techno   |
| 2   | 1002 | Auto     |
| 3   | 1002 | Food     |
| 4   | 1002 | Sports   |
| 5   | 1002 | Techno   |
| 6   | 1003 | Food     |
| ... | ...  | ...      |
| 37  | 1018 | Auto     |
| 38  | 1018 | Food     |
| 39  | 1018 | Techno   |
| 40  | 1019 | Food     |
| 41  | 1019 | Techno   |
| 42  | 1020 | Auto     |
| 43  | 1020 | Food     |

44 rows × 2 columns

|    | ID   | Interest                   |
|----|------|----------------------------|
| 0  | 1001 | Sports, Techno             |
| 1  | 1002 | Auto, Food, Sports, Techno |
| 2  | 1003 | Food                       |
| 3  | 1004 | Auto, Food                 |
| 4  | 1005 | Auto, Food, Techno         |
| 5  | 1006 | Food, Sports, Techno       |
| 6  | 1007 | Sports, Techno             |
| 7  | 1008 | Auto, Techno               |
| 8  | 1009 | Food                       |
| 9  | 1010 | Auto, Food                 |
| 10 | 1011 | Food,Sport                 |
| 11 | 1012 | Auto, Food, Sports, Techno |
| 12 | 1013 | Auto                       |
| 13 | 1014 | Sports                     |
| 14 | 1015 | Auto, Food                 |
| 15 | 1016 | Auto, Food, Sports, Techno |
| 16 | 1017 | Sports                     |
| 17 | 1018 | Auto, Food, Techno         |
| 18 | 1019 | Food, Techno               |
| 19 | 1020 | Auto, Food                 |
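
The step from the original table to the aggregated one can be sketched with a plain Python dictionary, in the spirit of Recipe 1 from the agenda. This is only a minimal illustration, not necessarily the code used in the tutorial: the `records` list reuses the first seven rows of the original data shown above, and the function name `aggregate_interests` is a hypothetical helper introduced here.

```python
from collections import defaultdict

# Sample input: rows 0-6 of the original (ID, Interest) table.
records = [
    (1001, "Sports"),
    (1001, "Techno"),
    (1002, "Auto"),
    (1002, "Food"),
    (1002, "Sports"),
    (1002, "Techno"),
    (1003, "Food"),
]


def aggregate_interests(rows):
    """Group interests by subscriber ID (hypothetical helper).

    Returns a dict mapping each ID to a comma-separated string
    of that subscriber's interests, sorted alphabetically.
    """
    grouped = defaultdict(list)
    for subscriber_id, interest in rows:
        grouped[subscriber_id].append(interest)
    # Join each subscriber's interests into a single value per key.
    return {sid: ", ".join(sorted(vals)) for sid, vals in grouped.items()}


aggregated = aggregate_interests(records)
for sid, interests in aggregated.items():
    print(sid, interests)
```

Each subscriber ID becomes a key and its interests are collected into one value, which matches the (key,value) shape asked for in the problem statement.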