Architecture

The following sections describe the StackSaga-Kafka architecture progressively across three deployment stages. Each stage builds on the previous one, adding production-readiness capabilities incrementally. This staged approach makes it possible to start with a minimal setup and harden the system as requirements grow.

  1. Stage 1: Basic Setup — orchestrator + worker communication over Kafka.

  2. Stage 2: Retry-Ready Setup — distributed retry via ring coordinator.

  3. Stage 3: Monitoring Setup — saga-level observability via trace window.

Service Classification: Orchestrator, Worker, and Standard Utility

Before going any further into the architecture, it is important to first recognize the different types of services within the StackSaga ecosystem. Understanding how a service is classified makes everything that follows — the staged architecture, the dependency tables, the request flow — far easier to map back onto your own system.

Orchestrator and worker are StackSaga-ecosystem role labels, not new kinds of deployments or separate services. Every microservice in a system — order-service, payment-service, user-service, and so on — is, first and foremost, a standard utility service: it keeps exposing its own REST/gRPC endpoints, running its own business logic, and serving its own consumers exactly as it did before StackSaga-Kafka entered the picture.

Adding a StackSaga dependency to one of these services does not replace or restrict what it already does — it overlays an additional role on top. For example, if order-service adds stacksaga-kafka-orchestrator, it keeps doing everything it already did (exposing its own endpoints, calling other internal services, etc.) and additionally gains the orchestrator role for driving a specific saga. Within the StackSaga ecosystem, that service is now referred to as the orchestrator service — but only in the context of that saga. The same applies to a worker: payment-service keeps functioning as a normal utility service while also picking up saga commands through KafkaWorkerEndpoints, earning the worker service label.

This is one of the biggest advantages of StackSaga: it can be introduced into any existing microservice with zero disruption. There is no need to peel a service out of its domain, stand up a new dedicated orchestration service, or rewrite existing endpoints and business logic to make room for it. You simply add the relevant StackSaga dependency to a service that already lives in your system, and it takes on the orchestrator or worker role alongside its existing responsibilities. Adoption is incremental and additive, not a redesign.

These labels exist purely to make the saga topology easier to reason about. Every service in the system falls into one (or two, for orchestrator/worker-with-utility-duties) of three categories:

Role How to Identify It

Orchestrator service

Any existing service that has been given the additional role of driving a saga’s business transaction end-to-end. Identified by the stacksaga-kafka-orchestrator dependency (plus stacksaga-database-support for event store persistence). Within the saga, it owns the StackSagaKafkaTemplate, the SagaEventNavigator, and all onNext() / onNextRevert() step logic — outside the saga, it is unchanged.

Worker service

Any existing service that has been given the additional role of participating in a saga by exchanging command/reply messages with the orchestrator. Identified by the stacksaga-kafka-worker dependency and a registered KafkaWorkerEndpoints handler. Outside of handling saga commands, it continues to operate as a normal standard utility service.

Standard utility service

The baseline role every service has. A service that has no StackSaga dependency and is never invoked by the orchestrator remains purely a standard utility service — its behavior is completely unaffected by StackSaga-Kafka.

A single service is almost always a standard utility service and one of the StackSaga roles at the same time — e.g., payment-service keeps its normal APIs while also handling saga commands via KafkaWorkerEndpoints. The orchestrator/worker labels describe a service’s involvement in a given saga, not a separate deployment unit. Only one service per saga acts as the orchestrator.

See the Stage 1 Dependencies table for the exact dependency artifacts associated with each role.

Stage 1 — Basic Setup

This stage establishes the minimal viable topology: an orchestrator service that can start and drive sagas, and one or more worker services that can receive commands and return results over Kafka. No retry coordination or external monitoring is configured at this stage.

StackSaga-Kafka Architecture: Stage 1 — Basic Setup

Dependencies

Service Dependency Purpose

Orchestrator

stacksaga-kafka-orchestrator

Provides the saga engine, StackSagaKafkaTemplate, SagaEventNavigator, and the Kafka producer for outbound command topics.

Orchestrator

stacksaga-database-support

Provides the event store adapter for persisting SagaDomainEntity state and execution history.

Worker

stacksaga-kafka-worker

Provides the Kafka consumer infrastructure and KafkaWorkerEndpoints abstraction.

Request and Execution Flow

  1. An inbound HTTP request (e.g., POST /order) reaches the OrderController on the orchestrator service.

  2. The controller instantiates a PlaceOrderDomainEntity — the SagaDomainEntity subclass for the order placement saga — populates its initial payload, and calls StackSagaKafkaTemplate.init(…​).startWith(..).execute();. From this point the saga engine takes full control.

  3. The engine evaluates the PlaceOrderEventNavigator to determine the first step and calls onNext() with the current domain entity.

  4. The onNext() implementation produces a command message to the designated outbound Kafka topic for the target service (e.g., the payment service’s command topic). The message carries the saga step identifier and the serialized domain entity payload.

  5. The worker service, which has the stacksaga-kafka-worker dependency, consumes the message from its designated topic via the registered KafkaWorkerEndpoints handler. It executes the business operation (e.g., charge the payment) and produces a reply message to the per-domain reply topic managed by the framework. For the PlaceOrderDomainEntity domain, this is a single, dedicated reply topic created automatically by the framework.

  6. The orchestrator’s internal Kafka consumer reads the reply from the domain reply topic. If the step succeeded, the engine updates the domain entity state, persists the new snapshot to the event store, and advances to the next step by calling onNext() again.

  7. If the worker replies with a failure, the engine transitions the saga to FAILED, persists the failure state, and begins compensation by calling onNextRevert() on each previously completed step in reverse order. Each onNextRevert() produces a compensation command to the relevant worker topic. Workers handle compensation commands through their dedicated KafkaWorkerEndpoints compensation handler and reply via the same domain reply topic.

  8. Each state transition — step completion, failure, compensation start, compensation completion — is written to the event store via stacksaga-database-support.

The diagram above shows a single span (MakePaymentSpan) for clarity. In practice, a saga can traverse any number of spans across different worker services, each with its own command topic. All replies — regardless of which worker sent them — flow back to the single per-domain reply topic. The reply topic consumer is configured with auto.offset.reset=earliest and at-least-once delivery semantics. This is safe because the framework’s built-in idempotency check (via the per-span idempotency key) detects any duplicate re-delivery that may occur after a consumer rebalance or an offset commit delay, and discards the duplicate without re-executing the saga step.

Stage 2 — Retry-Ready Setup

Stage 2 introduces the distributed retry capability. Without this stage, a saga that stalls due to a transient infrastructure failure (e.g., a worker service being temporarily unavailable, a Kafka broker partition becoming unavailable, or a timeout) will remain in an incomplete state indefinitely. Stage 2 makes the system self-healing for such failures.

StackSaga-Kafka Architecture: Stage 2 — Retry-Ready Setup

New Components at This Stage

Service New Component Purpose

Ring Coordinator Service

stacksaga-ring-coordinator

A standalone service that manages the token ring. It tracks available orchestrator instances, distributes Murmur3 token sub-ranges among them, and handles range rebalancing when instances join or leave the cluster. see Transaction Retry Architecture With Retry Coordinator

Orchestrator

stacksaga-ring-coordinator-connector

Connects the orchestrator instance to the ring coordinator (via RSocket request-stream), receives and holds its assigned token sub-range, and enables the local retry scheduler to scan the event store for transactions whose Murmur3 hash falls within the owned range.

Retry Mechanism

By adding stacksaga-ring-coordinator-connector to the orchestrator service, the instance is promoted to a Retry Node. The ring coordinator assigns each registered orchestrator instance a contiguous sub-range of the Murmur3 token ring. The retry scheduler on each instance periodically scans the event store for transactions that are in a non-terminal state (e.g., IN_PROGRESS, COMPENSATING) and whose transaction ID hashes into the locally owned token range. When such a transaction is found, the scheduler re-submits it to the saga engine for re-execution from the last incomplete step.

This partitioning ensures that in a multi-instance orchestrator deployment, each stuck transaction is retried by exactly one instance — there is no duplication of retry work and no need for distributed locking. When an instance restarts or a new instance joins, the ring coordinator rebalances token ranges and the new assignment is delivered over the existing RSocket stream.

The ring coordinator is a separate service (stacksaga-ring-coordinator-spring-boot-starter) that must be deployed independently. It is a lightweight coordination service and does not participate in the business logic or the Kafka message flow.
It is recommended read Transaction Retry Architecture With Retry Coordinator to understand the retry mechanism in depth and how the ring coordinator works.

Stage 3 — Monitoring and Observability Setup

Stage 3 adds the observability layer, enabling the StackSaga Trace Window to query real-time and historical saga execution data from the orchestrator service.

StackSaga-Kafka Architecture: Stage 3 — Monitoring + Retry-Ready

New Component at This Stage

Service New Component Purpose

Orchestrator

stacksaga-trace-window-connector

Exposes a set of internal APIs (consumed by the StackSaga Trace Window UI) that surface per-transaction execution traces, step-level timelines, failure details, retry histories, and compensation status from the event store.

What Becomes Visible

With the trace window connector in place, the StackSaga Trace Window provides:

  • Per-saga execution graphs showing each step, its execution timestamp, duration, status, and any error payload.

  • Compensation traces showing which onNextRevert() calls were made, in what order, and whether they succeeded.

  • Retry audit logs showing how many retry attempts were made, which retry node handled each attempt, and the outcome.

  • Live transaction status for in-progress sagas.

This layer does not change the Kafka topology or the retry behavior — it is a passive observability connector that reads from the existing event store and exposes the data through an API consumed by the Trace Window UI.

With Stage 3 in place, the full StackSaga-Kafka deployment is production-ready: it supports asynchronous saga execution, distributed retry with token ring partitioning, comprehensive observability via the Trace Window.

See more about how the application is deployed in environments.

Saga Transaction State Machine

Every saga instance transitions through a well-defined set of states. Understanding these states and their transition triggers is essential for interpreting transaction history in the Trace Window and for implementing correct exception handling in endpoints and the EventManager.

State Type Triggered By

STARTED

Intermediate

StackSagaKafkaTemplate.execute() or executeAsync() is called successfully and the first span is queued for dispatch.

IN_PROGRESS

Intermediate

The first worker reply is received and the engine advances to the next span.

COMPLETED

Terminal — success

SagaPrimaryEventAction.complete() is returned from onNext() after the last span completes successfully.

FAILED

Intermediate

A NonRetryableExecutorException is thrown from a worker’s doProcess(), or any exception is thrown from onNext() in the EventManager. Immediately triggers the compensation sequence.

COMPENSATING

Intermediate

Follows FAILED. The engine begins calling onNextRevert() and dispatching compensation commands to workers in reverse step order.

COMPENSATED

Terminal — success

All compensation spans complete successfully.

Compensation Failed

Terminal — failure

An unhandled exception is thrown from onNextRevert() in the EventManager, or a compensation step exhausts its configured retry limit. See stacksaga-database-support for retry limit and dead-letter configuration.

A saga in any non-terminal state (STARTED, IN_PROGRESS, FAILED, COMPENSATING) is eligible for recovery by the ring coordinator retry scheduler if it stalls. Only the three terminal states — COMPLETED, COMPENSATED, and Compensation Failed — are considered permanently resolved and will not be retried.