{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Ngrams with pyspark\n", "\n", "### Create a Spark context\n", "\n", "A Spark context (or a session, which wraps a context) is the entry point to Spark. \n", "It represents the Spark engine (whether on the local machine or on a cluster) and provides an API for creating and running data pipelines.\n", "\n", "In this example, we're going to load a text file into an RDD, split the text into ngrams, and count the frequency of each ngram." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from pyspark import SparkContext\n", "from operator import add" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "filename = \"wiki429MB\"\n", "\n", "sc = SparkContext(\n", " appName = \"Ngrams with pyspark \" + filename\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### View Spark context" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
SparkContext
\n", "\n", " \n", "\n", "v2.4.0-cdh6.3.2
yarn
Ngrams with pyspark wiki429MB