---
layout: global
title: Third-Party Projects
type: "page singular"
navigation:
  weight: 5
  show: true
---

This page tracks external software projects that supplement Apache Spark and add to its ecosystem.

## Popular libraries with PySpark integrations

- [great-expectations](https://github.com/great-expectations/great_expectations) - Always know what to expect from your data
- [Apache Airflow](https://github.com/apache/airflow) - A platform to programmatically author, schedule, and monitor workflows
- [xgboost](https://github.com/dmlc/xgboost) - Scalable, portable and distributed gradient boosting
- [shap](https://github.com/shap/shap) - A game theoretic approach to explain the output of any machine learning model
- [python-deequ](https://github.com/awslabs/python-deequ) - Measures data quality in large datasets
- [datahub](https://github.com/datahub-project/datahub) - Metadata platform for the modern data stack
- [dbt-spark](https://github.com/dbt-labs/dbt-spark) - Enables dbt to work with Apache Spark
- [Hamilton](https://github.com/DAGWorks-Inc/hamilton) - Lets you describe PySpark transformations declaratively, which helps keep code testable, modular, and logically visualizable.
- [ScaleDP](https://stabrise.com/scaledp/) - An open-source library for processing documents using AI/ML in Apache Spark.

## Connectors

- [spark-redshift](https://github.com/spark-redshift-community/spark-redshift) - Performant Redshift data source for Apache Spark
- [sql-spark-connector](https://github.com/microsoft/sql-spark-connector) - Apache Spark Connector for SQL Server and Azure SQL
- [azure-cosmos-spark](https://github.com/Azure/azure-cosmosdb-spark) - Apache Spark Connector for Azure Cosmos DB
- [azure-event-hubs-spark](https://github.com/Azure/azure-event-hubs-spark) - Enables continuous data processing with Apache Spark and Azure Event Hubs
- [azure-kusto-spark](https://github.com/Azure/azure-kusto-spark) - Apache Spark connector for Azure Kusto
- [mongo-spark](https://github.com/mongodb/mongo-spark) - The MongoDB Spark connector
- [couchbase-spark-connector](https://github.com/couchbase/couchbase-spark-connector) - The official Couchbase Spark connector
- [spark-cassandra-connector](https://github.com/datastax/spark-cassandra-connector) - DataStax connector for Apache Spark to Apache Cassandra
- [elasticsearch-hadoop](https://github.com/elastic/elasticsearch-hadoop) - Elasticsearch real-time search and analytics natively integrated with Spark
- [neo4j-spark-connector](https://github.com/neo4j-contrib/neo4j-spark-connector) - Neo4j Connector for Apache Spark
- [starrocks-connector-for-apache-spark](https://github.com/StarRocks/starrocks-connector-for-apache-spark) - StarRocks Apache Spark connector
- [tispark](https://github.com/pingcap/tispark) - TiSpark is built for running Apache Spark on top of TiDB/TiKV
- [spark-pdf](https://stabrise.com/spark-pdf/) - PDF data source for Apache Spark
- [spark-connector-oceanbase](https://github.com/oceanbase/spark-connector-oceanbase) - Apache Spark connectors for OceanBase
- [lance-spark](https://github.com/lancedb/lance-spark) - Apache Spark connector for Lance datasets
- [spark-clickhouse-connector](https://github.com/ClickHouse/spark-clickhouse-connector) - Apache Spark connector for ClickHouse

## Open table formats

- Delta Lake - Storage layer that provides ACID transactions and scalable metadata handling for Apache Spark workloads
- [Hudi](https://github.com/apache/hudi) - Upserts, deletes, and incremental processing on big data
- [Iceberg](https://github.com/apache/iceberg) - Open table format for analytic datasets
- [Lance](https://github.com/lancedb/lance) - Modern columnar data format for ML and LLMs

## Infrastructure projects

- [Kyuubi](https://github.com/apache/kyuubi) - Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses
- REST Job Server for Apache Spark - REST interface for managing and submitting Spark jobs on the same cluster.
- Apache Mesos - Cluster management system that supports running Spark
- Alluxio (née Tachyon) - Memory speed virtual distributed storage system that supports running Spark
- FiloDB - A Spark-integrated analytical/columnar database with an in-memory option capable of sub-second concurrent queries
- Zeppelin - Multi-purpose notebook which supports 20+ language backends, including Apache Spark
- Kubeflow Spark Operator - Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
- IBM Spectrum Conductor - Cluster management software that integrates with Spark and modern computing frameworks.
- MLflow - Open source platform to manage the machine learning lifecycle, including deploying models from diverse machine learning libraries on Apache Spark.
- Apache DataFu - A collection of utilities and user-defined functions for working with large-scale data in Apache Spark, as well as making Scala-Python interoperability easier.

## Applications using Spark

- Apache Mahout - Previously on Hadoop MapReduce, Mahout has switched to using Spark as the backend
- ADAM - A framework and CLI for loading, transforming, and analyzing genomic data using Apache Spark
- TransmogrifAI - AutoML library for building modular, reusable, strongly typed machine learning workflows on Spark with minimal hand tuning
- Natural Language Processing for Apache Spark - A library to provide simple, performant, and accurate NLP annotations for machine learning pipelines
- Rumble for Apache Spark - A JSONiq engine to query, with a functional language, large, nested, and heterogeneous JSON datasets that do not fit in dataframes.
- Lightning Catalog - A data catalog for running ad-hoc queries, wrangling data by federating enterprise data assets, and building a unified semantic layer with data quality checks.

## Performance, monitoring, and debugging tools for Spark

- Data Mechanics Delight - Delight is a free, hosted, cross-platform Spark UI alternative backed by an open-source Spark agent. It features new metrics and visualizations to simplify Spark monitoring and performance tuning.
- DataFlint - DataFlint is a Spark UI replacement, installed via an open-source library, that updates in real time and alerts on performance issues.

## Additional language bindings

### C# / .NET

- Mobius - C# and F# language binding and extensions to Apache Spark

### Clojure

- Geni - A Clojure dataframe library that runs on Apache Spark with a focus on optimizing the REPL experience.

### Julia

- Spark.jl

### Kotlin

- Kotlin for Apache Spark

## Adding new projects

To add a project, open a pull request against the [spark-website](https://github.com/apache/spark-website) repository. Add an entry to [this markdown file](https://github.com/apache/spark-website/blob/asf-site/third-party-projects.md), then run `jekyll build` to generate the HTML too. Include both in your pull request. See the README in this repo for more information.

Note that all project and product names should follow [trademark guidelines](/trademarks.html).