aid: apache-spark
name: Apache Spark
description: >-
  Apache Spark is a unified analytics engine for large-scale data processing.
  It provides high-level APIs in Java, Scala, Python, and R, and an optimized
  engine that supports general execution graphs. Spark offers a comprehensive
  suite of APIs for batch processing, SQL queries, streaming analytics,
  machine learning, and graph computation, governed by the Apache Software
  Foundation.
type: Index
position: Consumer
access: 3rd-Party
image: https://spark.apache.org/images/spark-logo-trademark.png
tags:
  - Analytics
  - Big Data
  - Distributed Computing
  - Machine Learning
  - Open Source
  - Streaming
created: '2024-01-01'
modified: '2026-05-19'
url: https://raw.githubusercontent.com/api-evangelist/apache-spark/refs/heads/main/apis.yml
specificationVersion: '0.19'
apis:
  - aid: apache-spark:apache-spark-rest-api
    name: Apache Spark REST API
    description: >-
      REST API for monitoring Spark applications, accessing cluster
      information, and managing Spark jobs through the Spark UI backend.
      Exposes endpoints for applications, jobs, stages, tasks, storage,
      environment, executors, and streaming statistics on port 4040 (or 18080
      for Spark History Server).
    humanURL: https://spark.apache.org/docs/latest/monitoring.html#rest-api
    tags:
      - Jobs
      - Metrics
      - Monitoring
      - Stages
    properties:
      - type: Documentation
        url: https://spark.apache.org/docs/latest/monitoring.html#rest-api
      - url: openapi/apache-spark-openapi.yml
        type: OpenAPI
      - type: NaftikoCapability
        url: capabilities/apache-spark.yaml
  - aid: apache-spark:apache-spark-sql-api
    name: Apache Spark SQL API
    description: >-
      Spark module for structured data processing with DataFrame and Dataset
      APIs. Provides a SQL interface and supports various data sources
      including Parquet, ORC, JSON, CSV, JDBC, Hive, and Delta Lake. The Spark
      SQL API supports Scala, Python, Java, and R bindings.
    humanURL: https://spark.apache.org/docs/latest/sql-programming-guide.html
    tags:
      - DataFrames
      - SQL
      - Structured Data
    properties:
      - type: Documentation
        url: https://spark.apache.org/docs/latest/sql-programming-guide.html
      - type: SDK
        url: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/index.html
        title: Scala API Reference
      - type: SDK
        url: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/index.html
        title: Python API Reference
      - type: SDK
        url: https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/package-summary.html
        title: Java API Reference
  - aid: apache-spark:apache-spark-streaming-api
    name: Apache Spark Streaming API
    description: >-
      Scalable, high-throughput, fault-tolerant stream processing of live data
      streams. Supports Structured Streaming (the newer API built on the Spark
      SQL engine, which supersedes the legacy DStream API) with end-to-end
      exactly-once semantics in micro-batch mode, an experimental continuous
      processing mode, and integration with Kafka, Kinesis, HDFS, and other
      sources.
    humanURL: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
    tags:
      - Data Processing
      - Real-Time
      - Streaming
    properties:
      - type: Documentation
        url: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
      - type: SDK
        url: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/streaming/index.html
        title: Scala Streaming API
      - type: SDK
        url: https://spark.apache.org/docs/latest/api/python/reference/pyspark.streaming/index.html
        title: Python Streaming API
  - aid: apache-spark:apache-spark-mllib-api
    name: Apache Spark MLlib API
    description: >-
      Spark's scalable machine learning library consisting of common learning
      algorithms and utilities, including classification, regression,
      clustering, collaborative filtering, dimensionality reduction, and
      feature engineering. Supports pipeline-based ML workflows through the
      spark.ml package.
    humanURL: https://spark.apache.org/docs/latest/ml-guide.html
    tags:
      - Algorithms
      - Data Science
      - Machine Learning
      - ML
    properties:
      - type: Documentation
        url: https://spark.apache.org/docs/latest/ml-guide.html
      - type: SDK
        url: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/index.html
        title: Scala MLlib API
      - type: SDK
        url: https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html
        title: Python MLlib API
  - aid: apache-spark:apache-spark-graphx-api
    name: Apache Spark GraphX API
    description: >-
      Spark API for graphs and graph-parallel computation with a collection of
      graph algorithms and builders, including PageRank, Connected Components,
      Triangle Counting, and shortest paths.
    humanURL: https://spark.apache.org/docs/latest/graphx-programming-guide.html
    tags:
      - Analytics
      - Graph Processing
      - Graphs
    properties:
      - type: Documentation
        url: https://spark.apache.org/docs/latest/graphx-programming-guide.html
      - type: SDK
        url: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/graphx/index.html
        title: Scala GraphX API
common:
  - type: LinkedIn
    url: https://www.linkedin.com/company/apachespark
  - type: GitHubRepository
    url: https://github.com/apache/spark
  - type: Portal
    url: https://spark.apache.org/
  - type: Documentation
    url: https://spark.apache.org/docs/latest/
  - type: GettingStarted
    url: https://spark.apache.org/docs/latest/quick-start.html
  - type: Blog
    url: https://spark.apache.org/news/
  - type: Support
    url: https://spark.apache.org/community.html
  - type: TermsOfService
    url: https://www.apache.org/licenses/LICENSE-2.0
  - type: StackOverflow
    url: https://stackoverflow.com/questions/tagged/apache-spark
  - type: SDK
    url: https://pypi.org/project/pyspark/
    title: PySpark (Python)
  - type: SDK
    url: https://search.maven.org/search?q=g:org.apache.spark
    title: Maven (Scala/Java)
  - type: Features
    data:
      - name: Unified Analytics Engine
        description: Single engine for batch, streaming, SQL, ML, and graph processing workloads.
      - name: Lazy Evaluation and DAG Execution
        description: Optimized execution plans with Catalyst optimizer and DAG scheduling.
      - name: In-Memory Processing
        description: Up to 100x faster than Hadoop MapReduce for iterative algorithms via in-memory caching.
      - name: Structured Streaming
        description: Unified streaming and batch processing with exactly-once semantics and Kafka integration.
      - name: Multi-Language Support
        description: High-level APIs in Scala, Java, Python (PySpark), and R (SparkR).
      - name: Delta Lake Integration
        description: ACID transactions, schema evolution, and time travel for data lakes.
      - name: Kubernetes Native
        description: Native Kubernetes scheduling for cloud-native deployment of Spark workloads.
  - type: UseCases
    data:
      - name: Large-Scale ETL
        description: Extract, transform, and load petabytes of data across distributed clusters.
      - name: Real-Time Analytics
        description: Streaming analytics on live event data with sub-second latency.
      - name: Machine Learning Pipelines
        description: Distributed ML training and feature engineering at scale with MLlib.
      - name: Data Lake Processing
        description: Query and transform data stored in cloud object stores and HDFS.
      - name: Interactive SQL Analytics
        description: Interactive SQL queries on structured and semi-structured data at scale.
  - type: Integrations
    data:
      - name: Apache Hadoop
        description: HDFS storage, YARN cluster manager, and Hadoop ecosystem integration.
      - name: Apache Kafka
        description: Structured Streaming source and sink for real-time event processing.
      - name: Delta Lake
        description: Open-source storage layer with ACID transactions for data lakes.
      - name: Apache Iceberg
        description: Open table format for huge analytic datasets on cloud storage.
      - name: Apache Hive
        description: Hive metastore integration for table catalog and metadata management.
      - name: Kubernetes
        description: Native Kubernetes scheduling for cloud-native Spark deployments.
      - name: Apache Airflow
        description: Workflow orchestration for scheduling and managing Spark jobs.
maintainers:
  - FN: Kin Lane
    email: info@apievangelist.com