---
pinned: false
title: "Apache Spark: Ecosystem overview with Apache Hadoop YARN and HDFS"
description: "Through my experience as a Software Engineer @Criteo, spending several years importing and enriching e-commerce catalogs, I would like to share a broad overview of how Spark works and how it can be used in the data industry to manage data processing at scale."
authors: ["glegoux"]
time_reading_minutes: 10
category: "Data"
---

[Apache Spark™](https://spark.apache.org/) is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters. It is widely used in the industry for distributed computation on data at scale. The project is [open-source](https://github.com/apache/spark) under the [Apache Software Foundation](https://apache.org/) and was created by the founders of the company [Databricks](https://www.databricks.com/), which provides clusters and support for Spark.

Let's take a deep dive into the architecture of **Apache Spark on YARN** in a distributed ecosystem of containers and **Java VMs**. This architecture needs to address both **computation** and **storage** at **high scale**, in order to manipulate large volumes of data with efficient processing.

**Table of contents**

- Spark ecosystem
- Spark architecture through Hadoop
- Spark engine for computation and storage
- Go with Spark in production
- Example of Spark application at scale
- Go further
- What’s next
- Thanks to
- Glossary
- References

Let's start with the basics: you can jump straight to the sections that interest you, or read from top to bottom to see how this whole ecosystem works.

# Spark ecosystem

## Data pipelines

Spark allows building data pipelines in **batch mode** and in **streaming mode**.

{% include content/image.html src="https://miro.medium.com/v2/resize:fit:408/0*zqTHvZ2Uckv_Hjz3.png" abs_url=true title="Lambda architecture with Spark" source=true %}

Here you see a **lambda architecture**, but to combine online and offline processing you can also use a **kappa** or **zeta architecture** for your data pipeline. That will be covered in another article.

{% include content/image.html src="https://miro.medium.com/v2/resize:fit:394/1*4QrgYLPJto1TBvMcs45iqg.png" abs_url=true title="Spark vs Spark Streaming" source_author="https://www.databricks.com/blog/2020/11/20/delta-vs-lambda-why-simplicity-trumps-complexity-for-data-pipelines.html" %}

Spark's streaming mode is really a micro-batch approach. Other frameworks like [Kafka Streams](https://kafka.apache.org/documentation/streams/), [Faust Streaming](https://faust-streaming.github.io/faust/) or [Apache Flink](https://flink.apache.org/) are more commonly used to process true data streams. Above all, Spark excels at processing data in batch.

## APIs

Spark mainly exposes several APIs, one for each of the following programming languages: [Scala](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html)/[Java](https://spark.apache.org/docs/latest/api/java/), [Python](https://spark.apache.org/docs/latest/api/python/) and [R](https://spark.apache.org/docs/latest/api/R/), respectively with **Spark**, **PySpark** and **SparkR**.

{% include content/image.html src="https://miro.medium.com/v2/resize:fit:700/1*zXXVD_S-pBs8mWBi3OcR7w.png" abs_url=true title="Different types of Spark engines" source_author=true %}

In practice, Spark and PySpark are used much more than SparkR. For experimentation, PySpark is more convenient, while Scala Spark gives better performance.
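To make this concrete, here is a minimal PySpark sketch of a batch job, as it could look in an e-commerce catalog pipeline. The input path, column names and aggregation are hypothetical, and the same logic could be written almost line for line with the Scala API.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create (or reuse) a SparkSession, the entry point of the DataFrame API.
spark = SparkSession.builder.appName("catalog-batch-job").getOrCreate()

# Hypothetical input: product events stored as Parquet on HDFS.
events = spark.read.parquet("hdfs:///data/events/2024-01-01")

# Batch transformation: count distinct viewed products per partner.
report = (
    events
    .where(F.col("event_type") == "view")
    .groupBy("partner_id")
    .agg(F.countDistinct("product_id").alias("nb_products"))
)

# Write the result back to HDFS; transformations are lazy, so nothing
# runs on the cluster until this action triggers the job.
report.write.mode("overwrite").parquet("hdfs:///data/reports/products_per_partner")
```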
## Connectors

Spark has many **connectors** to read from and write **to different types of storage**.

{% include content/image.html src="https://miro.medium.com/v2/resize:fit:684/0*uQtzEnoXUORb2dgi" abs_url=true title="Apache Spark’s ecosystem of connectors" source="https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf" %}

Here, we will focus a bit more on the connection with Apache Hadoop HDFS.

## Spark SQL

In addition, you can also run Spark with [SQL](https://spark.apache.org/docs/latest/api/sql/). With a SQL query, you can access heterogeneous data sources and combine them in memory:

{% include content/image.html src="https://miro.medium.com/v2/resize:fit:720/format:webp/0*Ts-DH76DiRFZh4p1.png" abs_url=true title="Spark SQL overview" source="http://www.gatorsmile.io/sparksqloverview/" %}

The [Catalyst & Tungsten](https://www.linkedin.com/pulse/catalyst-tungsten-apache-sparks-speeding-engine-deepak-rajak/) engines allow optimizing the execution plan of the query:

{% include content/image.html src="https://miro.medium.com/v2/resize:fit:700/0*CRcfOe8vbDWNGJhV.png" abs_url=true title="Catalyst" source="https://www.databricks.com/glossary/catalyst-optimizer" %}

## PySpark

In the Python driver program, the SparkContext uses [Py4J](https://www.py4j.org/) to launch a JVM and create a JavaSparkContext. Py4J is only used on the driver for local communication between the Python and Java SparkContext objects; large data transfers are performed through a different mechanism. A minimal sketch tying this together with the previous sections is given after the link below.

{% include article/read-more.md src="https://medium.com/@glegoux/apache-spark-ecosystem-with-hadoop-apache-yarn-and-hdfs-8e64eeba68c0" %}
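As announced above, here is a minimal PySpark sketch tying together the Connectors, Spark SQL and PySpark sections. The HDFS paths, view names and columns are invented for illustration; creating the SparkSession from Python is what launches the driver JVM through Py4J, and `explain()` prints the plan produced by the Catalyst and Tungsten engines.

```python
from pyspark.sql import SparkSession

# Creating the session from Python starts the driver JVM via Py4J
# and builds the underlying Java SparkContext.
spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Hypothetical heterogeneous sources read through two connectors:
# a Parquet dataset and a JSON file, both on HDFS.
spark.read.parquet("hdfs:///data/products").createOrReplaceTempView("products")
spark.read.json("hdfs:///data/partners.json").createOrReplaceTempView("partners")

# Combine both sources in memory with plain SQL.
report = spark.sql("""
    SELECT pa.name, COUNT(*) AS nb_products
    FROM products pr
    JOIN partners pa ON pr.partner_id = pa.id
    GROUP BY pa.name
""")

# Show the execution plan chosen by the Catalyst optimizer, then the result.
report.explain()
report.show()
```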