{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Word Count Example with Apache Spark (PySpark)\n", "\n", "In this notebook we will go through the traditional Word Count example but we will cover map, flatmap, filter, count, reduceByKey, sortByKey and enhanced word count.\n", "\n", "`\n", "@author: Anindya Saha \n", "@email: mail.anindya@gmail.com\n", "`" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from pyspark.sql import SparkSession" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "spark = SparkSession.builder.master('local[*]').appName('wordcount-pyspark').getOrCreate()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "

SparkSession - in-memory

\n", " \n", "
\n", "

SparkContext

\n", "\n", "

Spark UI

\n", "\n", "
\n", "
Version
\n", "
v2.3.0
\n", "
Master
\n", "
local[*]
\n", "
AppName
\n", "
wordcount-pyspark
\n", "
\n", "
\n", " \n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "infile ='data/wordcount.txt'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Read the file:**" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# read the file\n", "word_rdd = spark.sparkContext.textFile(infile).cache()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**1. Read the file, print every line:**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = word_rdd" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Management (or managing) is the administration of an organization, whether it be a business, a not-for-profit organization, or government body. Management includes the activities of setting the strategy of an organization and coordinating the efforts of its employees or volunteers to accomplish its objectives through the application of available resources, such as financial, natural, technological, and human resources. The term \"management\" may also refer to the people who manage an organization.\n", "Management is also an academic discipline, a social science whose objective is to study social organization and organizational leadership. Management is studied at colleges and universities; some important degrees in management are the Bachelor of Commerce (B.Com.) and Master of Business Administration (M.B.A.) and, for the public sector, the Master of Public Administration (MPA) degree. Individuals who aim at becoming management researchers or professors may complete the Doctor of Business Administration (DBA) or the PhD in business administration or management.\n", "In larger organizations, there are generally three levels of managers, which are typically organized in a hierarchical, pyramid structure. Senior managers, such as the Board of Directors, Chief Executive Officer (CEO) or President of an organization, set the strategic goals of the organization and make decisions on how the overall organization will operate. Senior managers provide direction to the middle managers who report to them. Middle managers, examples of which would include branch managers, regional managers and section managers, provide direction to front-line managers. Middle managers communicate the strategic goals of senior management to the front-line managers. Lower managers, such as supervisors and front-line team leaders, oversee the work of regular employees (or volunteers, in some voluntary organizations) and provide direction on their work.\n", "In smaller organizations, the roles of managers have much wider scopes. A manager can perform several roles or even all of the roles commonly observed in a large organization. There are many more smaller organizations than larger ones.\n" ] } ], "source": [ "# print each line of the book\n", "for line in data.collect():\n", " print(line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2. Read the file, print every word:**" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = word_rdd.flatMap(lambda line: line.split(\" \"))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Management', '(or', 'managing)', 'is', 'the', 'administration', 'of', 'an', 'organization,', 'whether', 'it', 'be', 'a', 'business,', 'a', 'not-for-profit', 'organization,', 'or', 'government', 'body.', 'Management', 'includes', 'the', 'activities', 'of', 'setting', 'the', 'strategy', 'of', 'an', 'organization', 'and', 'coordinating', 'the', 'efforts', 'of', 'its', 'employees', 'or', 'volunteers', 'to', 'accomplish', 'its', 'objectives', 'through', 'the', 'application', 'of', 'available', 'resources,', 'such', 'as', 'financial,', 'natural,', 'technological,', 'and', 'human', 'resources.', 'The', 'term', '\"management\"', 'may', 'also', 'refer', 'to', 'the', 'people', 'who', 'manage', 'an', 'organization.', 'Management', 'is', 'also', 'an', 'academic', 'discipline,', 'a', 'social', 'science', 'whose', 'objective', 'is', 'to', 'study', 'social', 'organization', 'and', 'organizational', 'leadership.', 'Management', 'is', 'studied', 'at', 'colleges', 'and', 'universities;', 'some', 'important', 'degrees', 'in', 'management', 'are', 'the', 'Bachelor', 'of', 'Commerce', '(B.Com.)', 'and', 'Master', 'of', 'Business', 'Administration', '(M.B.A.)', 'and,', 'for', 'the', 'public', 'sector,', 'the', 'Master', 'of', 'Public', 'Administration', '(MPA)', 'degree.', 'Individuals', 'who', 'aim', 'at', 'becoming', 'management', 'researchers', 'or', 'professors', 'may', 'complete', 'the', 'Doctor', 'of', 'Business', 'Administration', '(DBA)', 'or', 'the', 'PhD', 'in', 'business', 'administration', 'or', 'management.', 'In', 'larger', 'organizations,', 'there', 'are', 'generally', 'three', 'levels', 'of', 'managers,', 'which', 'are', 'typically', 'organized', 'in', 'a', 'hierarchical,', 'pyramid', 'structure.', 'Senior', 'managers,', 'such', 'as', 'the', 'Board', 'of', 'Directors,', 'Chief', 'Executive', 'Officer', '(CEO)', 'or', 'President', 'of', 'an', 'organization,', 'set', 'the', 'strategic', 'goals', 'of', 'the', 'organization', 'and', 'make', 'decisions', 'on', 'how', 'the', 'overall', 'organization', 'will', 'operate.', 'Senior', 'managers', 'provide', 'direction', 'to', 'the', 'middle', 'managers', 'who', 'report', 'to', 'them.', 'Middle', 'managers,', 'examples', 'of', 'which', 'would', 'include', 'branch', 'managers,', 'regional', 'managers', 'and', 'section', 'managers,', 'provide', 'direction', 'to', 'front-line', 'managers.', 'Middle', 'managers', 'communicate', 'the', 'strategic', 'goals', 'of', 'senior', 'management', 'to', 'the', 'front-line', 'managers.', 'Lower', 'managers,', 'such', 'as', 'supervisors', 'and', 'front-line', 'team', 'leaders,', 'oversee', 'the', 'work', 'of', 'regular', 'employees', '(or', 'volunteers,', 'in', 'some', 'voluntary', 'organizations)', 'and', 'provide', 'direction', 'on', 'their', 'work.', 'In', 'smaller', 'organizations,', 'the', 'roles', 'of', 'managers', 'have', 'much', 'wider', 'scopes.', 'A', 'manager', 'can', 'perform', 'several', 'roles', 'or', 'even', 'all', 'of', 'the', 'roles', 'commonly', 'observed', 'in', 'a', 'large', 'organization.', 'There', 'are', 'many', 'more', 'smaller', 'organizations', 'than', 'larger', 'ones.']\n" ] } ], "source": [ "# print each and every word\n", "print(data.collect())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3. Read the file, generate words, filter out \"empty\" word and print each word:**" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = (word_rdd\n", " .flatMap(lambda line: line.split(\" \"))\n", " .filter(lambda word: len(word) > 0)\n", " )" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Management', '(or', 'managing)', 'is', 'the', 'administration', 'of', 'an', 'organization,', 'whether', 'it', 'be', 'a', 'business,', 'a', 'not-for-profit', 'organization,', 'or', 'government', 'body.', 'Management', 'includes', 'the', 'activities', 'of', 'setting', 'the', 'strategy', 'of', 'an', 'organization', 'and', 'coordinating', 'the', 'efforts', 'of', 'its', 'employees', 'or', 'volunteers', 'to', 'accomplish', 'its', 'objectives', 'through', 'the', 'application', 'of', 'available', 'resources,', 'such', 'as', 'financial,', 'natural,', 'technological,', 'and', 'human', 'resources.', 'The', 'term', '\"management\"', 'may', 'also', 'refer', 'to', 'the', 'people', 'who', 'manage', 'an', 'organization.', 'Management', 'is', 'also', 'an', 'academic', 'discipline,', 'a', 'social', 'science', 'whose', 'objective', 'is', 'to', 'study', 'social', 'organization', 'and', 'organizational', 'leadership.', 'Management', 'is', 'studied', 'at', 'colleges', 'and', 'universities;', 'some', 'important', 'degrees', 'in', 'management', 'are', 'the', 'Bachelor', 'of', 'Commerce', '(B.Com.)', 'and', 'Master', 'of', 'Business', 'Administration', '(M.B.A.)', 'and,', 'for', 'the', 'public', 'sector,', 'the', 'Master', 'of', 'Public', 'Administration', '(MPA)', 'degree.', 'Individuals', 'who', 'aim', 'at', 'becoming', 'management', 'researchers', 'or', 'professors', 'may', 'complete', 'the', 'Doctor', 'of', 'Business', 'Administration', '(DBA)', 'or', 'the', 'PhD', 'in', 'business', 'administration', 'or', 'management.', 'In', 'larger', 'organizations,', 'there', 'are', 'generally', 'three', 'levels', 'of', 'managers,', 'which', 'are', 'typically', 'organized', 'in', 'a', 'hierarchical,', 'pyramid', 'structure.', 'Senior', 'managers,', 'such', 'as', 'the', 'Board', 'of', 'Directors,', 'Chief', 'Executive', 'Officer', '(CEO)', 'or', 'President', 'of', 'an', 'organization,', 'set', 'the', 'strategic', 'goals', 'of', 'the', 'organization', 'and', 'make', 'decisions', 'on', 'how', 'the', 'overall', 'organization', 'will', 'operate.', 'Senior', 'managers', 'provide', 'direction', 'to', 'the', 'middle', 'managers', 'who', 'report', 'to', 'them.', 'Middle', 'managers,', 'examples', 'of', 'which', 'would', 'include', 'branch', 'managers,', 'regional', 'managers', 'and', 'section', 'managers,', 'provide', 'direction', 'to', 'front-line', 'managers.', 'Middle', 'managers', 'communicate', 'the', 'strategic', 'goals', 'of', 'senior', 'management', 'to', 'the', 'front-line', 'managers.', 'Lower', 'managers,', 'such', 'as', 'supervisors', 'and', 'front-line', 'team', 'leaders,', 'oversee', 'the', 'work', 'of', 'regular', 'employees', '(or', 'volunteers,', 'in', 'some', 'voluntary', 'organizations)', 'and', 'provide', 'direction', 'on', 'their', 'work.', 'In', 'smaller', 'organizations,', 'the', 'roles', 'of', 'managers', 'have', 'much', 'wider', 'scopes.', 'A', 'manager', 'can', 'perform', 'several', 'roles', 'or', 'even', 'all', 'of', 'the', 'roles', 'commonly', 'observed', 'in', 'a', 'large', 'organization.', 'There', 'are', 'many', 'more', 'smaller', 'organizations', 'than', 'larger', 'ones.']\n" ] } ], "source": [ "# print each and every word\n", "print(data.collect())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**4. Read the file, generate words, trim each word, put in lowercase, replace special char, filter out \"empty\" word and print each word:**" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import re\n", "data = (word_rdd\n", " .flatMap(lambda line: line.split(\" \"))\n", " .map(lambda word: word.strip().lower())\n", " .map(lambda word: re.sub(\"[,.:;'\\\"\\\\?\\\\-!\\\\(\\\\)]\", \"\", word))\n", " .filter(lambda word: len(word) > 2)\n", " )" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['management', 'managing', 'the', 'administration', 'organization', 'whether', 'business', 'notforprofit', 'organization', 'government', 'body', 'management', 'includes', 'the', 'activities', 'setting', 'the', 'strategy', 'organization', 'and', 'coordinating', 'the', 'efforts', 'its', 'employees', 'volunteers', 'accomplish', 'its', 'objectives', 'through', 'the', 'application', 'available', 'resources', 'such', 'financial', 'natural', 'technological', 'and', 'human', 'resources', 'the', 'term', 'management', 'may', 'also', 'refer', 'the', 'people', 'who', 'manage', 'organization', 'management', 'also', 'academic', 'discipline', 'social', 'science', 'whose', 'objective', 'study', 'social', 'organization', 'and', 'organizational', 'leadership', 'management', 'studied', 'colleges', 'and', 'universities', 'some', 'important', 'degrees', 'management', 'are', 'the', 'bachelor', 'commerce', 'bcom', 'and', 'master', 'business', 'administration', 'mba', 'and', 'for', 'the', 'public', 'sector', 'the', 'master', 'public', 'administration', 'mpa', 'degree', 'individuals', 'who', 'aim', 'becoming', 'management', 'researchers', 'professors', 'may', 'complete', 'the', 'doctor', 'business', 'administration', 'dba', 'the', 'phd', 'business', 'administration', 'management', 'larger', 'organizations', 'there', 'are', 'generally', 'three', 'levels', 'managers', 'which', 'are', 'typically', 'organized', 'hierarchical', 'pyramid', 'structure', 'senior', 'managers', 'such', 'the', 'board', 'directors', 'chief', 'executive', 'officer', 'ceo', 'president', 'organization', 'set', 'the', 'strategic', 'goals', 'the', 'organization', 'and', 'make', 'decisions', 'how', 'the', 'overall', 'organization', 'will', 'operate', 'senior', 'managers', 'provide', 'direction', 'the', 'middle', 'managers', 'who', 'report', 'them', 'middle', 'managers', 'examples', 'which', 'would', 'include', 'branch', 'managers', 'regional', 'managers', 'and', 'section', 'managers', 'provide', 'direction', 'frontline', 'managers', 'middle', 'managers', 'communicate', 'the', 'strategic', 'goals', 'senior', 'management', 'the', 'frontline', 'managers', 'lower', 'managers', 'such', 'supervisors', 'and', 'frontline', 'team', 'leaders', 'oversee', 'the', 'work', 'regular', 'employees', 'volunteers', 'some', 'voluntary', 'organizations', 'and', 'provide', 'direction', 'their', 'work', 'smaller', 'organizations', 'the', 'roles', 'managers', 'have', 'much', 'wider', 'scopes', 'manager', 'can', 'perform', 'several', 'roles', 'even', 'all', 'the', 'roles', 'commonly', 'observed', 'large', 'organization', 'there', 'are', 'many', 'more', 'smaller', 'organizations', 'than', 'larger', 'ones']\n" ] } ], "source": [ "# print each and every word\n", "print(data.collect())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**5. Read the file, generate words, trim each word, put in lower case, replace special char, filter out \"empty\" word, count each word and print the word and the count:**" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import re\n", "data = (word_rdd\n", " .flatMap(lambda line: line.split(\" \"))\n", " .map(lambda word: word.strip().lower())\n", " .map(lambda word: re.sub(\"[,.:;'\\\"\\\\?\\\\-!\\\\(\\\\)]\", \"\", word))\n", " .filter(lambda word: len(word) > 2)\n", " .map(lambda word: (word, 1))\n", " .reduceByKey(lambda a, b: a + b)\n", " )" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('leadership', 1), ('human', 1), ('bcom', 1), ('provide', 3), ('even', 1), ('perform', 1), ('set', 1), ('than', 1), ('several', 1), ('levels', 1), ('science', 1), ('officer', 1), ('ceo', 1), ('typically', 1), ('large', 1), ('becoming', 1), ('commerce', 1), ('resources', 2), ('pyramid', 1), ('examples', 1), ('voluntary', 1), ('objectives', 1), ('larger', 2), ('may', 2), ('managers', 13), ('leaders', 1), ('make', 1), ('manager', 1), ('middle', 3), ('there', 2), ('strategic', 2), ('financial', 1), ('chief', 1), ('overall', 1), ('researchers', 1), ('have', 1), ('volunteers', 2), ('social', 2), ('more', 1), ('roles', 3), ('work', 2), ('efforts', 1), ('operate', 1), ('degrees', 1), ('communicate', 1), ('term', 1), ('professors', 1), ('would', 1), ('three', 1), ('include', 1), ('refer', 1), ('team', 1), ('bachelor', 1), ('master', 2), ('doctor', 1), ('whose', 1), ('individuals', 1), ('sector', 1), ('academic', 1), ('discipline', 1), ('public', 2), ('management', 9), ('organizations', 4), ('lower', 1), ('scopes', 1), ('board', 1), ('senior', 3), ('mpa', 1), ('direction', 3), ('regional', 1), ('them', 1), ('executive', 1), ('organizational', 1), ('regular', 1), ('are', 4), ('business', 4), ('coordinating', 1), ('oversee', 1), ('activities', 1), ('dba', 1), ('commonly', 1), ('important', 1), ('can', 1), ('frontline', 3), ('how', 1), ('for', 1), ('all', 1), ('smaller', 2), ('its', 2), ('wider', 1), ('natural', 1), ('the', 22), ('who', 3), ('complete', 1), ('president', 1), ('observed', 1), ('managing', 1), ('report', 1), ('available', 1), ('many', 1), ('supervisors', 1), ('government', 1), ('colleges', 1), ('will', 1), ('degree', 1), ('structure', 1), ('such', 3), ('strategy', 1), ('branch', 1), ('phd', 1), ('and', 10), ('some', 2), ('which', 2), ('hierarchical', 1), ('goals', 2), ('includes', 1), ('accomplish', 1), ('much', 1), ('organized', 1), ('administration', 5), ('aim', 1), ('also', 2), ('body', 1), ('directors', 1), ('mba', 1), ('study', 1), ('section', 1), ('decisions', 1), ('through', 1), ('setting', 1), ('generally', 1), ('their', 1), ('objective', 1), ('universities', 1), ('whether', 1), ('employees', 2), ('technological', 1), ('organization', 9), ('ones', 1), ('people', 1), ('manage', 1), ('studied', 1), ('application', 1), ('notforprofit', 1)]\n" ] } ], "source": [ "# print each and every word\n", "print(data.collect())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**6. Read the file, generate words, trim each word, put in lower case, replace special char, filter out \"empty\" word, count each word, sort by count ASC and print the word and the count:**" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import re\n", "data = (word_rdd\n", " .flatMap(lambda line: line.split(\" \"))\n", " .map(lambda word: word.strip().lower())\n", " .map(lambda word: re.sub(\"[,.:;'\\\"\\\\?\\\\-!\\\\(\\\\)]\", \"\", word))\n", " .filter(lambda word: len(word) > 2)\n", " .map(lambda word: (word, 1))\n", " .reduceByKey(lambda a, b: a + b)\n", " .map(lambda wc: (wc[1], wc[0]))\n", " .sortByKey(False)\n", " )" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(22, 'the'), (13, 'managers'), (10, 'and'), (9, 'management'), (9, 'organization'), (5, 'administration'), (4, 'organizations'), (4, 'are'), (4, 'business'), (3, 'provide'), (3, 'middle'), (3, 'roles'), (3, 'senior'), (3, 'direction'), (3, 'frontline'), (3, 'who'), (3, 'such'), (2, 'resources'), (2, 'larger'), (2, 'may'), (2, 'there'), (2, 'strategic'), (2, 'volunteers'), (2, 'social'), (2, 'work'), (2, 'master'), (2, 'public'), (2, 'smaller'), (2, 'its'), (2, 'some'), (2, 'which'), (2, 'goals'), (2, 'also'), (2, 'employees'), (1, 'leadership'), (1, 'human'), (1, 'bcom'), (1, 'even'), (1, 'perform'), (1, 'set'), (1, 'than'), (1, 'several'), (1, 'levels'), (1, 'science'), (1, 'officer'), (1, 'ceo'), (1, 'typically'), (1, 'large'), (1, 'becoming'), (1, 'commerce'), (1, 'pyramid'), (1, 'examples'), (1, 'voluntary'), (1, 'objectives'), (1, 'leaders'), (1, 'make'), (1, 'manager'), (1, 'financial'), (1, 'chief'), (1, 'overall'), (1, 'researchers'), (1, 'have'), (1, 'more'), (1, 'efforts'), (1, 'operate'), (1, 'degrees'), (1, 'communicate'), (1, 'term'), (1, 'professors'), (1, 'would'), (1, 'three'), (1, 'include'), (1, 'refer'), (1, 'team'), (1, 'bachelor'), (1, 'doctor'), (1, 'whose'), (1, 'individuals'), (1, 'sector'), (1, 'academic'), (1, 'discipline'), (1, 'lower'), (1, 'scopes'), (1, 'board'), (1, 'mpa'), (1, 'regional'), (1, 'them'), (1, 'executive'), (1, 'organizational'), (1, 'regular'), (1, 'coordinating'), (1, 'oversee'), (1, 'activities'), (1, 'dba'), (1, 'commonly'), (1, 'important'), (1, 'can'), (1, 'how'), (1, 'for'), (1, 'all'), (1, 'wider'), (1, 'natural'), (1, 'complete'), (1, 'president'), (1, 'observed'), (1, 'managing'), (1, 'report'), (1, 'available'), (1, 'many'), (1, 'supervisors'), (1, 'government'), (1, 'colleges'), (1, 'will'), (1, 'degree'), (1, 'structure'), (1, 'strategy'), (1, 'branch'), (1, 'phd'), (1, 'hierarchical'), (1, 'includes'), (1, 'accomplish'), (1, 'much'), (1, 'organized'), (1, 'aim'), (1, 'body'), (1, 'directors'), (1, 'mba'), (1, 'study'), (1, 'section'), (1, 'decisions'), (1, 'through'), (1, 'setting'), (1, 'generally'), (1, 'their'), (1, 'objective'), (1, 'universities'), (1, 'whether'), (1, 'technological'), (1, 'ones'), (1, 'people'), (1, 'manage'), (1, 'studied'), (1, 'application'), (1, 'notforprofit')]\n" ] } ], "source": [ "# print each and every word\n", "print(data.collect())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**\n", "7 . \n", "a. Read the file, generate words, replace special char, trim and lowercase, filter out \"empty\", count each word, sort by count ASC \n", "b. store the result in memory \n", "c. print the word and the count \n", "d. print the word only \n", "e. print the number of word that start with each char for only the word with more than 5 occurrences. The result is sorted by count ASC \n", "**" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# a. generate words, replace special char, trim and lowercase, filter out \"empty\", count each word, sort by count ASC\n", "import re\n", "data = (word_rdd\n", " .flatMap(lambda line: line.split(\" \"))\n", " .map(lambda word: word.strip().lower())\n", " .map(lambda word: re.sub(\"[,.:;'\\\"\\\\?\\\\-!\\\\(\\\\)]\", \"\", word))\n", " .filter(lambda word: len(word) > 2)\n", " .map(lambda word: (word, 1))\n", " .reduceByKey(lambda a, b: a + b)\n", " .map(lambda wc: (wc[1], wc[0]))\n", " .sortByKey(False)\n", " )" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PythonRDD[31] at RDD at PythonRDD.scala:48" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# b. store the result in memory\n", "from pyspark import StorageLevel\n", "data.persist(StorageLevel.MEMORY_ONLY)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('the', 22), ('managers', 13), ('and', 10), ('management', 9), ('organization', 9), ('administration', 5), ('organizations', 4), ('are', 4), ('business', 4), ('provide', 3), ('middle', 3), ('roles', 3), ('senior', 3), ('direction', 3), ('frontline', 3), ('who', 3), ('such', 3), ('resources', 2), ('larger', 2), ('may', 2), ('there', 2), ('strategic', 2), ('volunteers', 2), ('social', 2), ('work', 2), ('master', 2), ('public', 2), ('smaller', 2), ('its', 2), ('some', 2), ('which', 2), ('goals', 2), ('also', 2), ('employees', 2), ('leadership', 1), ('human', 1), ('bcom', 1), ('even', 1), ('perform', 1), ('set', 1), ('than', 1), ('several', 1), ('levels', 1), ('science', 1), ('officer', 1), ('ceo', 1), ('typically', 1), ('large', 1), ('becoming', 1), ('commerce', 1), ('pyramid', 1), ('examples', 1), ('voluntary', 1), ('objectives', 1), ('leaders', 1), ('make', 1), ('manager', 1), ('financial', 1), ('chief', 1), ('overall', 1), ('researchers', 1), ('have', 1), ('more', 1), ('efforts', 1), ('operate', 1), ('degrees', 1), ('communicate', 1), ('term', 1), ('professors', 1), ('would', 1), ('three', 1), ('include', 1), ('refer', 1), ('team', 1), ('bachelor', 1), ('doctor', 1), ('whose', 1), ('individuals', 1), ('sector', 1), ('academic', 1), ('discipline', 1), ('lower', 1), ('scopes', 1), ('board', 1), ('mpa', 1), ('regional', 1), ('them', 1), ('executive', 1), ('organizational', 1), ('regular', 1), ('coordinating', 1), ('oversee', 1), ('activities', 1), ('dba', 1), ('commonly', 1), ('important', 1), ('can', 1), ('how', 1), ('for', 1), ('all', 1), ('wider', 1), ('natural', 1), ('complete', 1), ('president', 1), ('observed', 1), ('managing', 1), ('report', 1), ('available', 1), ('many', 1), ('supervisors', 1), ('government', 1), ('colleges', 1), ('will', 1), ('degree', 1), ('structure', 1), ('strategy', 1), ('branch', 1), ('phd', 1), ('hierarchical', 1), ('includes', 1), ('accomplish', 1), ('much', 1), ('organized', 1), ('aim', 1), ('body', 1), ('directors', 1), ('mba', 1), ('study', 1), ('section', 1), ('decisions', 1), ('through', 1), ('setting', 1), ('generally', 1), ('their', 1), ('objective', 1), ('universities', 1), ('whether', 1), ('technological', 1), ('ones', 1), ('people', 1), ('manage', 1), ('studied', 1), ('application', 1), ('notforprofit', 1)]\n" ] } ], "source": [ "# c. print the word and the count\n", "print(data.map(lambda wc: (wc[1], wc[0])).collect())" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['the', 'managers', 'and', 'management', 'organization', 'administration', 'organizations', 'are', 'business', 'provide', 'middle', 'roles', 'senior', 'direction', 'frontline', 'who', 'such', 'resources', 'larger', 'may', 'there', 'strategic', 'volunteers', 'social', 'work', 'master', 'public', 'smaller', 'its', 'some', 'which', 'goals', 'also', 'employees', 'leadership', 'human', 'bcom', 'even', 'perform', 'set', 'than', 'several', 'levels', 'science', 'officer', 'ceo', 'typically', 'large', 'becoming', 'commerce', 'pyramid', 'examples', 'voluntary', 'objectives', 'leaders', 'make', 'manager', 'financial', 'chief', 'overall', 'researchers', 'have', 'more', 'efforts', 'operate', 'degrees', 'communicate', 'term', 'professors', 'would', 'three', 'include', 'refer', 'team', 'bachelor', 'doctor', 'whose', 'individuals', 'sector', 'academic', 'discipline', 'lower', 'scopes', 'board', 'mpa', 'regional', 'them', 'executive', 'organizational', 'regular', 'coordinating', 'oversee', 'activities', 'dba', 'commonly', 'important', 'can', 'how', 'for', 'all', 'wider', 'natural', 'complete', 'president', 'observed', 'managing', 'report', 'available', 'many', 'supervisors', 'government', 'colleges', 'will', 'degree', 'structure', 'strategy', 'branch', 'phd', 'hierarchical', 'includes', 'accomplish', 'much', 'organized', 'aim', 'body', 'directors', 'mba', 'study', 'section', 'decisions', 'through', 'setting', 'generally', 'their', 'objective', 'universities', 'whether', 'technological', 'ones', 'people', 'manage', 'studied', 'application', 'notforprofit']\n" ] } ], "source": [ "# d. print the word only\n", "print(data.map(lambda wc: wc[1]).collect())" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "[(9, 'o'), (10, 'a'), (22, 'm'), (22, 't')]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# e. print the number of word that start with each char for only the word with more than 5 occurrences. \n", "# The result is sorted by count ASC\n", "data.filter(lambda cw: cw[0] > 5).map(lambda cw: (cw[1][0], cw[0])).reduceByKey(lambda a, b: a + b).map(lambda cw: (cw[1], cw[0])).sortByKey().collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**\n", "8 . \n", "a. Read the file, generate words, count each word \n", "b. store the word splits in a file \n", "c. store the word counts in a file \n", "**" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# a. generate words, replace special char, trim and lowercase, filter out \"empty\", count each word, sort by count ASC\n", "import re\n", "splits = (word_rdd\n", " .flatMap(lambda line: line.split(\" \"))\n", " .map(lambda word: word.strip().lower())\n", " .map(lambda word: re.sub(\"[,.:;'\\\"\\\\?\\\\-!\\\\(\\\\)]\", \"\", word))\n", " .filter(lambda word: len(word) > 2)\n", " .map(lambda word: (word, 1))\n", " )" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# b. store the word splits in a file\n", "splits.coalesce(1).saveAsTextFile('splitoutput')" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "counts = splits.reduceByKey(lambda a, b: a + b)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# c. store the word counts in a file\n", "counts.coalesce(1).saveAsTextFile('countoutput')" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "spark.stop()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.3" } }, "nbformat": 4, "nbformat_minor": 2 }