{ "cells": [ { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": false }, "source": [ "# Initial definitions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%env HADOOP_VERSION 2.9.2\n", "%env HADOOP_PATH hadoop-2.9.2" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": false }, "source": [ "# Preparing the environment" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": false }, "source": [ "## Downloading Hadoop" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget http://ftp.unicamp.br/pub/apache/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz -q --show-progress" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": false }, "source": [ "## Extracting the archive and removing the .tar.gz file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# !rm ${HADOOP_PATH} -r\n", "!tar -xvf hadoop-${HADOOP_VERSION}.tar.gz >/dev/null \n", "!rm hadoop-${HADOOP_VERSION}.tar.gz" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": false }, "source": [ "## Discovering the Java path" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!dirname $(dirname $(readlink -f $(which javac)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting the Java path envvar\n", "\n", "We also add it to the user's `.bashrc` so that it is loaded when the nodes open ssh connections." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%env JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!echo \"export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \" >> ~/.bashrc\n", "!echo \"export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \" >> ~/.profile\n", "!echo \"export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \" >> ${HADOOP_PATH}/etc/hadoop/hadoop-env.sh" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": false }, "source": [ "# Hadoop in Standalone Mode (local)" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": true }, "source": [ "## MapReduce in the local filesystem - word count example" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!${HADOOP_PATH}/bin/hadoop jar ${HADOOP_PATH}/share/hadoop/mapreduce/hadoop-mapreduce-examples-${HADOOP_VERSION}.jar wordcount \\\n", " ./resources/examples/newyorknewyork.txt ./output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Listing files in the output folder" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls ./output/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading output file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! 
cat ./output/part-r-00000" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": false }, "source": [ "# Hadoop in Pseudo-Distributed Mode" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": true }, "source": [ "## Preparing the environment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Starting sshd server\n", "\n", "See the `/binder/postBuild` and `/resources/configs/ssh/sshd_config` files for more details." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!/usr/sbin/sshd -f resources/configs/ssh/sshd_config " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding names to known hosts\n", "\n", "The commands below establish ssh connections to the host names/IPs in use. This step avoids the interactive yes/no host confirmation later." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ssh -o \"StrictHostKeyChecking no\" $USER@localhost -p 8822 -C \"exit\" \n", "!ssh -o \"StrictHostKeyChecking no\" $USER@ -p 8822 -C \"exit\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding ssh options to Hadoop via envvar\n", "\n", "* connecting on a different port (`-p 8822`)\n", "* avoiding host key checking (`-o StrictHostKeyChecking=no`)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%env HADOOP_SSH_OPTS= -o StrictHostKeyChecking=no -p 8822" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%env PDSH_RCMD_TYPE ssh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Copying configuration files to the Hadoop folder\n", "\n", "Check that the configuration files match the Hadoop version. \n", "Refer to `/resources/configs/hadoop/`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cp resources/configs/hadoop/${HADOOP_VERSION}/core-site.xml ${HADOOP_PATH}/etc/hadoop/\n", "!cp resources/configs/hadoop/${HADOOP_VERSION}/hdfs-site.xml ${HADOOP_PATH}/etc/hadoop/" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": true }, "source": [ "## Formatting the filesystem" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!${HADOOP_PATH}/bin/hdfs namenode -format -force -nonInteractive" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Starting DFS (NameNode, SecondaryNameNode, and DataNode daemons)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!${HADOOP_PATH}/sbin/start-dfs.sh\n", "!jps" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": true }, "source": [ "## MapReduce - Word count example " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating folders in the distributed file system" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!${HADOOP_PATH}/bin/hdfs dfs -mkdir /user/\n", "!${HADOOP_PATH}/bin/hdfs dfs -mkdir /user/matheus/\n", "!${HADOOP_PATH}/bin/hdfs dfs -mkdir /user/matheus/input/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Copying a file to a folder in the distributed file system" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!${HADOOP_PATH}/bin/hdfs dfs -put ./resources/examples/newyorknewyork.txt /user/matheus/input/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Listing files in a folder of the distributed file system" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!${HADOOP_PATH}/bin/hdfs dfs -ls /user/matheus/input/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Retrieving the 
contents of a file in the distributed file system" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!${HADOOP_PATH}/bin/hdfs dfs -cat /user/matheus/input/newyorknewyork.txt" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": true }, "source": [ "### Running MapReduce job in Pseudo-Distributed Mode" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!./${HADOOP_PATH}/bin/hadoop jar ./${HADOOP_PATH}/share/hadoop/mapreduce/hadoop-mapreduce-examples-${HADOOP_VERSION}.jar wordcount \\\n", " /user/matheus/input /user/matheus/output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Listing files in the output folder" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!./${HADOOP_PATH}/bin/hdfs dfs -ls /user/matheus/output/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading output file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!./${HADOOP_PATH}/bin/hdfs dfs -cat /user/matheus/output/part-r-00000" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": false }, "source": [ "# Starting YARN in Pseudo-Distributed Mode" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": true }, "source": [ "## Preparing the environment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Copying configuration files to the Hadoop folder" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cp resources/configs/hadoop/${HADOOP_VERSION}/mapred-site.xml ${HADOOP_PATH}/etc/hadoop/\n", "!cp resources/configs/hadoop/${HADOOP_VERSION}/yarn-site.xml ${HADOOP_PATH}/etc/hadoop/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Starting YARN" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ 
"!${HADOOP_PATH}/sbin/start-yarn.sh\n", "!jps" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": true }, "source": [ "## MapReduce via YARN - /user/matheus/input /user/matheus/output2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Listing files in the output folder" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!./${HADOOP_PATH}/bin/hdfs dfs -ls /user/matheus/output2/" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": true }, "source": [ "### Reading output file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!./${HADOOP_PATH}/bin/hdfs dfs -cat /user/matheus/output2/part-r-00000" ] }