---
layout: global
title: Third-Party Projects
type: "page singular"
navigation:
  weight: 5
  show: true
---

This page tracks external software projects that supplement Apache Spark and add to its ecosystem.

## Popular libraries with PySpark integrations

- [great-expectations](https://github.com/great-expectations/great_expectations) - Always know what to expect from your data
- [Apache Airflow](https://github.com/apache/airflow) - A platform to programmatically author, schedule, and monitor workflows
- [xgboost](https://github.com/dmlc/xgboost) - Scalable, portable and distributed gradient boosting
- [shap](https://github.com/shap/shap) - A game theoretic approach to explain the output of any machine learning model
- [python-deequ](https://github.com/awslabs/python-deequ) - Measures data quality in large datasets
- [datahub](https://github.com/datahub-project/datahub) - Metadata platform for the modern data stack
- [dbt-spark](https://github.com/dbt-labs/dbt-spark) - Enables dbt to work with Apache Spark
- [Hamilton](https://github.com/DAGWorks-Inc/hamilton) - Declaratively describes PySpark transformations, which helps keep code testable, modular, and easy to visualize
- [ScaleDP](https://stabrise.com/scaledp/) - An open-source library for processing documents using AI/ML in Apache Spark

## Connectors

- [spark-redshift](https://github.com/spark-redshift-community/spark-redshift) - Performant Redshift data source for Apache Spark
- [sql-spark-connector](https://github.com/microsoft/sql-spark-connector) - Apache Spark Connector for SQL Server and Azure SQL
- [azure-cosmos-spark](https://github.com/Azure/azure-cosmosdb-spark) - Apache Spark Connector for Azure Cosmos DB
- [azure-event-hubs-spark](https://github.com/Azure/azure-event-hubs-spark) - Enables continuous data processing with Apache Spark and Azure Event Hubs
- [azure-kusto-spark](https://github.com/Azure/azure-kusto-spark) - Apache Spark connector for Azure Kusto
- [mongo-spark](https://github.com/mongodb/mongo-spark) - The MongoDB Spark connector
- [couchbase-spark-connector](https://github.com/couchbase/couchbase-spark-connector) - The Official Couchbase Spark connector
- [spark-cassandra-connector](https://github.com/datastax/spark-cassandra-connector) - DataStax connector for Apache Spark to Apache Cassandra
- [elasticsearch-hadoop](https://github.com/elastic/elasticsearch-hadoop) - Elasticsearch real-time search and analytics natively integrated with Spark
- [neo4j-spark-connector](https://github.com/neo4j-contrib/neo4j-spark-connector) - Neo4j Connector for Apache Spark
- [starrocks-connector-for-apache-spark](https://github.com/StarRocks/starrocks-connector-for-apache-spark) - StarRocks Apache Spark connector
- [tispark](https://github.com/pingcap/tispark) - TiSpark is built for running Apache Spark on top of TiDB/TiKV
- [spark-pdf](https://stabrise.com/spark-pdf/) - PDF Datasource for Apache Spark
- [spark-connector-oceanbase](https://github.com/oceanbase/spark-connector-oceanbase) - Apache Spark Connectors for OceanBase
- [lance-spark](https://github.com/lancedb/lance-spark) - Apache Spark connector for Lance datasets
- [spark-clickhouse-connector](https://github.com/ClickHouse/spark-clickhouse-connector) - Apache Spark connector for ClickHouse

## Open table formats

- Delta Lake - Storage layer that provides ACID transactions and scalable metadata handling for Apache Spark workloads
- [Hudi](https://github.com/apache/hudi) - Upserts, Deletes And Incremental Processing on Big Data
- [Iceberg](https://github.com/apache/iceberg) - Open table format for analytic datasets
- [Lance](https://github.com/lancedb/lance) - Modern columnar data format for ML and LLMs