---
pinned: true
title: "Learning PySpark with Google Colab"
description: "PySpark on Google Colab is an efficient way to manipulate and explore the data, and a good fit for a group of AI learners."
authors: ["glegoux"]
time_reading_minutes: 10
category: "Data"
---

Learning [**Apache Spark**](https://spark.apache.org/) quickly is challenging. Discover **distributed computation** and **machine learning** with [PySpark](https://www.databricks.com/glossary/pyspark) through **several tutorials** that lead up to building your own movie recommendation engine.

**Links:** [GitHub](https://github.com/criteo-research/master-iasd/tree/master/module4)  \|  [Tutorials quick start](https://colab.research.google.com/github/criteo-research/master-iasd/blob/master/module4/td1/td1-rdd-questions.ipynb)  \|  [Dataset](https://grouplens.org/datasets/movielens/)

{% include content/image.html
src="https://miro.medium.com/v2/resize:fit:359/1*HZloIx45zZtPrtcEksmeRw.png"
abs_url=true
%}

Let’s discover how to use PySpark on Google Colab with accessible tutorials.

As a teaching fellow alongside [David Diebold](https://www.linkedin.com/in/david-diebold-51249977/) for the course [Systems, paradigms, and algorithms for Big Data](https://www.lamsade.dauphine.fr/wp/iasd/en/programme/options/systemes-paradigmes-et-langages-pour-les-big-data/) in the international [Master IASD](https://www.lamsade.dauphine.fr/wp/iasd/en/) ([graduate degree M2](https://www.universite-paris-saclay.fr/en/education/french-higher-education-system)) at [Dauphine Paris University](https://dauphine.psl.eu/), a member of [PSL University](https://psl.eu/en), I needed to organize tutorial sessions for the students on distributed computation with [**Apache Spark**](https://spark.apache.org/).

**Fast, flexible, and developer-friendly**, Spark has become **one of the world’s most significant** frameworks for **distributed data processing**. Before teaching the features Spark provides, we had to choose **the language and platform** on which our learners would run the tutorials we prepared. We chose the following tech stack: **PySpark with Google Colab**.

# PySpark vs. Spark

[**PySpark**](https://www.databricks.com/glossary/pyspark) lets you interact with Spark in **Python**. It offers a **gentler learning curve** than Spark's native API (Spark was originally written in Scala). Because it relies on [Py4J](https://www.py4j.org/) to communicate with Spark's JVM, it is less performant than Scala or Java in production, but it still delivers **sufficient performance** (sometimes close to Spark with Java/Scala) for experimenting with distributed data science and machine learning.

{% include content/image.html
src="https://miro.medium.com/v2/resize:fit:571/1*jLK8saUaKj8KuovUgXumUg.png"
abs_url=true
title="PySpark is an interaction of Spark with Python"
source_author=true
%}
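To make this Python-to-JVM interaction concrete, here is a minimal sketch, assuming PySpark is installed (`pip install pyspark`) and a Java runtime is available. The names `square` and `run_demo`, the application name, and the `local[*]` master are illustrative choices, not taken from the tutorials.

```python
def square(x):
    """Plain Python function applied to each element of the dataset."""
    return x * x


def run_demo():
    """Start a local Spark session, square a few numbers, and return them."""
    # Imported here so the function definitions load even without PySpark.
    from pyspark.sql import SparkSession

    # Assumption: a local master using all available cores, as on a single
    # notebook machine. This call starts the JVM that hosts Spark.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("pyspark-sketch")  # hypothetical application name
        .getOrCreate()
    )
    try:
        # Distribute a small Python range as an RDD.
        numbers = spark.sparkContext.parallelize(range(1, 6))
        # map() is lazy; collect() triggers the job and returns a Python list.
        return numbers.map(square).collect()
    finally:
        # Always release the session, even if the job fails.
        spark.stop()
```

Calling `run_demo()` in a notebook cell should return the squares of 1 through 5. The driver-side calls (`builder`, `parallelize`, `collect`) are forwarded to the JVM through Py4J, while `square` itself runs in Python worker processes.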

Python, through PySpark, remains by a wide margin the **preferred language** for **notebooks**.

{% include content/image.html
src="https://miro.medium.com/v2/resize:fit:344/1*L8jzF1__70sSDfflDM5CBg.png"
abs_url=true
title="Percent of commands on the Databricks platform in each Spark language"
source="https://www.databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html"
%}

PySpark **support** is already **excellent** and continues to improve with [**Spark 3+**](https://www.databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html).

Python is  **easier to learn**  than Scala and has a  **mature ecosystem for applied mathematics**.

For these reasons, **we preferred teaching PySpark over Spark**. Moving from one to the other is easy: the only cost for learners is becoming familiar with the Scala/Java ecosystem for advanced use.

{% include article/read-more.md
src="https://medium.com/@glegoux/apache-spark-pyspark-with-google-colab-for-data-science-63478138a63e"
%}