{"paragraphs":[{"text":"%md\n\n## Intro to Machine Learning\n#### with Linear Regression\n\n**Level**: Beginner\n**Language**: Scala\n**Requirements**: \n- [HDP 2.6](http://hortonworks.com/products/sandbox/) (or later) or [HDCloud](https://hortonworks.github.io/hdp-aws/)\n- Spark 2.x\n\n**Author**: Robert Hryniewicz\n**Follow** [@RobH8z](https://twitter.com/RobertH8z)","user":"admin","dateUpdated":"2017-06-13T18:48:29+0000","config":{"tableHide":false,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"title":false,"results":[{"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}}}],"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
Level: Beginner
Language: Scala
Requirements:
- HDP 2.6 (or later) or HDCloud
- Spark 2.x
Author: Robert Hryniewicz
Follow @RobH8z
In this lab we’ll cover the basics of building a Linear Regression model using the Apache Spark ML Pipeline API.
\nIn this lab we will use basic Scala syntax. If you would like to learn more about Scala, here’s an excellent Tutorial.
\nTo run a paragraph in a Zeppelin notebook, you can either click the play button (blue triangle) on the right-hand side or simply press Shift + Enter.
A model is a mathematical formula with a number of parameters that need to be learned from the data. Fitting a model to the data is a process known as model training.
\nTake, for instance, linear regression with a single feature (variable), where the goal is to fit a line (described by the well-known equation y = ax + b) to a set of scattered data points.
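
To make "learning the parameters" concrete, here is a minimal sketch in plain Scala that fits a and b with the closed-form least-squares formulas. It is only an illustration under stated assumptions: the helper name `fitLine` and the four data points are hypothetical, and this is not how Spark trains its model internally.

```scala
// Illustrative data (an assumption): four points chosen to lie exactly on y = 2x + 5.
val xs = Seq(1.0, 0.0, 7.0, 2.0)
val ys = Seq(7.0, 5.0, 19.0, 9.0)

// Ordinary least squares for a single feature:
//   a = sum((x - meanX) * (y - meanY)) / sum((x - meanX)^2)
//   b = meanY - a * meanX
// `fitLine` is a hypothetical helper, not part of any library.
def fitLine(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
  require(xs.nonEmpty && xs.size == ys.size, "need equally sized, non-empty inputs")
  val meanX = xs.sum / xs.size
  val meanY = ys.sum / ys.size
  val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val sxx = xs.map(x => (x - meanX) * (x - meanX)).sum
  require(sxx != 0.0, "x values must not all be identical")
  val a = sxy / sxx
  (a, meanY - a * meanX)
}

// Learned slope (a) and intercept (b).
val (a, b) = fitLine(xs, ys)
```

In a Zeppelin paragraph or the Scala REPL the values are echoed as they are defined; for these points the fit recovers a slope of 2 and an intercept of 5.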
For example, assume that once model training is complete we get the model equation y = 2x + 5. Then for a set of inputs [1, 0, 7, 2, …] we would get a set of outputs [7, 5, 19, 9, …]. That’s it!
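
The same arithmetic can be checked with a few lines of plain Scala. This is just a sketch of applying an already-trained model; the names `slope`, `intercept`, and `predict` are illustrative and not part of the Spark API.

```scala
// Parameters of the trained model from the example: y = 2x + 5.
val slope = 2.0
val intercept = 5.0

// Applying the model is just evaluating the line at x.
def predict(x: Double): Double = slope * x + intercept

// The first four inputs from the example above.
val inputs = Seq(1.0, 0.0, 7.0, 2.0)

// Evaluates to Seq(7.0, 5.0, 19.0, 9.0), matching the outputs listed in the text.
val outputs = inputs.map(predict)
```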
In this notebook you will learn, step by step, how to train a one-variable linear regression model with Spark.
\nWe’re introducing Machine Learning with Linear Regression because it’s one of the more basic and commonly used predictive analytics methods. It’s also easy to explain and to grasp intuitively as you make your way through the examples.
\nNote that we will not cover the details of how the underlying Linear Regression algorithm works. We will focus only on applying the algorithm and generating a model. If you would like to learn more about Linear Regression and other algorithms, check out this excellent Coursera Machine Learning Course taught by Andrew Ng.
\nNote: The following paragraphs require the Python Pandas library, which is not installed by default. For that reason, we’ve already run the paragraphs for you and disabled running them, so you won’t hit any errors.
\nIn this lab we have looked at Linear Regression, but there are other popular algorithms. In the following labs we’ll begin exploring:
\n\nWe hope you’ve enjoyed this introductory lab. Below are additional resources that you should find useful:
\n