# Architecture

Apache Pulsar is a distributed pub-sub messaging and streaming platform. The codebase is performance-critical, heavily asynchronous, and concurrency-sensitive (brokers, storage, networking).

The authoritative documentation lives on the Pulsar website; see the [Architecture Overview](https://pulsar.apache.org/docs/4.2.x/concepts-architecture-overview/) for the conceptual model. For a deeper, generated architecture description see [DeepWiki](https://deepwiki.com/apache/pulsar); coding agents can install the [DeepWiki MCP](https://docs.devin.ai/work-with-devin/deepwiki-mcp) for richer coverage of Pulsar's architecture.

This file is a map of the repository for contributors (and AI coding agents) who need to find their way around the modules quickly.

## Big picture

Pulsar separates a **stateless serving layer** (brokers) from **durable storage** (Apache BookKeeper) and a **metadata store** (Oxia / ZooKeeper). The Gradle modules layer accordingly:

- **`pulsar-client-api`, `pulsar-client-admin-api`**: public, backward-compatible interfaces only. `pulsar-client-api-v5` / `pulsar-client-v5` are the newer V5 client API (PIP-466/468).
- **`pulsar-client` (`:pulsar-client-original`)**: the Java client implementation (producer/consumer/reader, connection pooling). `pulsar-client-admin` implements the admin REST client.
- **`pulsar-common`**: wire protocol and shared types. Protobuf / lightproto messages are **generated** into `generated-lightproto/` / `generated-sources/` (excluded from checkstyle and spotless).
- **`pulsar-metadata`**: pluggable metadata store abstraction (Oxia / ZooKeeper, plus RocksDB and memory) used by the broker and BookKeeper.
- **`managed-ledger`**: the storage abstraction over **Apache BookKeeper**, made of append-only ledgers plus cursors that track consumer/subscription positions. This is the durability layer the broker reads and writes through.
- **`pulsar-broker`**: the server.
  `PulsarService` is the composition root wiring everything together; `BrokerService` manages topics, subscriptions, and client connections. Entry points: `PulsarBrokerStarter` (broker), `PulsarStandalone` / `PulsarStandaloneStarter` (all-in-one), `PulsarClusterMetadataSetup` (cluster init).
- **`pulsar-proxy`**: optional proxy/gateway in front of brokers.
- **`pulsar-functions/*`**: serverless compute (Functions), with submodules `proto`, `api-java`, `instance`, `runtime`, `worker`, `localrun`.
- **`pulsar-io/*`**: connector framework core only; most built-in connectors were moved to the separate `pulsar-connectors` repo (PIP-465).
- **`pulsar-transaction/*`**: transaction coordinator and common types.
- **`tiered-storage/*`, `offloaders/`**: offload ledger data to cloud/filesystem storage.
- **`pulsar-websocket`**: WebSocket-to-Pulsar bridge. **`pulsar-client-tools`**: the `pulsar-admin` / `pulsar-client` CLIs.
- **Shaded / distribution**: `pulsar-client-shaded`, `pulsar-client-all`, `pulsar-client-admin-shaded` produce relocated fat jars; `distribution/*` assembles server/shell/offloader tarballs.

## Pulsar Improvement Proposals (`pip/`)

The **`pip/`** directory holds **Pulsar Improvement Proposals** (`pip-.md`), the design documents for significant changes, referenced as `PIP-` throughout commit messages and code (e.g. PIP-463 = Maven→Gradle migration, PIP-465 = IO connectors moved out, PIP-466/468 = V5 client). `pip/README.md` describes the process and `pip/TEMPLATE.md` is the proposal template. Consult the relevant PIP for the rationale behind a non-trivial feature or architectural decision.

A PIP **number is reserved by the first `dev@pulsar.apache.org` thread that uses it**; start the discussion to claim the next free number.

## Concurrency model (a known gap)

Pulsar does **not** have a clearly established, documented concurrency model, which makes it hard to evaluate whether a given piece of code is correct by construction.
(Contrast Netty, which has a clear rule: all handling on the IO thread is non-blocking, which by extension means avoiding synchronization and locks on that path.) Pulsar does not strictly follow such a rule; modern JVMs and hardware optimize `synchronized` code well enough that this has not blocked high performance, but it does make reasoning about correctness harder than it needs to be.

Conventions that **should** be documented (and largely are not yet):

- which work belongs on the network-connection **event loop** vs. other threads;
- how the various **thread pools** are intended to be used, and what kind of work belongs on each;
- how threads are expected to **hand off state** to each other;
- when a `CompletableFuture`'s **completion thread should be switched** to another thread, and which one;
- **concurrency limits** for asynchronous tasks;
- preferring the **single-writer principle** to avoid concurrent state mutation.

Until such a model is written down, follow the surrounding code's conventions and the Java-Memory-Model rules in [`CODING.md`](CODING.md#concurrency). Once a model is defined, it becomes far more tractable to "lift and shift" existing code toward it and enforce the rules consistently rather than having each contributor rediscover the conventions case by case.

## Backpressure

Closely tied to the concurrency model is **backpressure**: how the system avoids accepting more work than it can handle, particularly with respect to memory. The memory side is described in [PIP-442 "Existing Broker Memory Management"](pip/pip-442.md#existing-broker-memory-management). Broader backpressure (beyond memory) is not yet documented and would benefit from being defined alongside the concurrency model.

## Build infrastructure

Apache Pulsar uses a **Gradle** build (migrated from Maven via PIP-463; some older tooling and docs elsewhere still reference Maven). The wrapper `./gradlew` requires **JDK 21 or 25** (bytecode targets Java 17).
See [`CONTRIBUTING.md` → Building](CONTRIBUTING.md#building) for the build and lint commands.

- `settings.gradle.kts`: all modules, organized in dependency tiers (Tier 0 has no internal deps, higher tiers build on lower ones).
- `build-logic/conventions/`: convention plugins (`pulsar.java-conventions`, `pulsar.code-quality-conventions`, `pulsar.shadow-conventions`, etc.) applied by modules. Shared compile/test/dependency config lives here; edit it here rather than duplicating across modules.
- `gradle/libs.versions.toml`: version catalog (single source of truth for dependency versions; referenced as `libs.*` in build scripts).
- `pulsar-dependencies`: enforced platform (BOM) pinning all dependency versions; applied to every module.

The build enables both the **configuration cache** (`org.gradle.configuration-cache=true`) and **configure-on-demand** (`org.gradle.configureondemand=true`).

### Module name vs. directory name gotcha

Several Gradle project paths do **not** match their directory because the Maven artifactId is preserved. Most importantly:

- Directory `pulsar-client/` → project **`:pulsar-client-original`**
- Directory `pulsar-client-admin/` → project **`:pulsar-client-admin-original`**
- Directory `pulsar-functions/localrun/` → project `:pulsar-functions:pulsar-functions-local-runner-original`

Always use the Gradle project path (left of any `--tests`), e.g. `./gradlew :pulsar-client-original:test`. Check `settings.gradle.kts` when a path is ambiguous.

### Changing the build

When editing `build-logic/`, `settings.gradle.kts`, a module `build.gradle.kts`, `gradle.properties`, `gradle/libs.versions.toml`, or the `pulsar-dependencies` platform:

- **Edit shared config in `build-logic/conventions/`**, not per-module.
- **Versions come from `gradle/libs.versions.toml`** (`libs.*` / `pulsar-dependencies`); never hardcode a version in a build script.
- **Keep tasks configuration-cache and configure-on-demand compatible** (both are enabled): no reading of mutable state at execution time and no `Project` access in task actions; use `Provider` / value sources, and verify with `--configuration-cache`. Tasks reached by the common flows (`assemble`, `test`, `integrationTest`, `rat` / `spotlessCheck` / `checkstyle*`, `checkBinaryLicense`, `docker*`) must be compatible; one-off tooling tasks not part of those flows (e.g. `verifyTestGroups`, ad-hoc report tasks) may be exempt.
- **Published modules must not depend on internal modules** at compile/runtime scope, because the artifact would be unresolvable from Maven Central. A module is published only when it applies `pulsar.public-java-library-conventions`.
- **After a dependency change**, run `./gradlew checkBinaryLicense` and update the distribution `LICENSE`/`NOTICE`; justify any genuinely new dependency (see [`CODING.md` → Dependencies](CODING.md#dependencies)).
- **Follow the [Gradle best practices](https://docs.gradle.org/current/userguide/best_practices_index.html)**. AI agents should read the [AsciiDoc source](https://github.com/gradle/gradle/blob/master/platforms/documentation/docs/src/docs/userguide/best-practices/best_practices_index.adoc), which is plain text and cheaper to parse than the rendered HTML.

Before finishing a build change, confirm the affected task and `./gradlew help` run clean with `--configuration-cache`, and that `assemble` and `rat spotlessCheck checkstyleMain checkstyleTest` pass (plus `checkBinaryLicense` if a dependency changed).
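
As a rough sketch, the closing checks could be run in the order below. The `:pulsar-client-original:test` path is only a stand-in for whichever task the change affects, and splitting the checks into separate invocations is an assumption of this sketch, not a project requirement.

```shell
# Stand-in for the affected task; substitute the task your change touches.
./gradlew :pulsar-client-original:test --configuration-cache

# Confirm the build still configures cleanly with the configuration cache on.
./gradlew help --configuration-cache

# Build and lint checks named in the checklist.
./gradlew assemble
./gradlew rat spotlessCheck checkstyleMain checkstyleTest

# Only needed when a dependency changed.
./gradlew checkBinaryLicense
```

Run these from the repository root so that the `./gradlew` wrapper resolves.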