StackSaga-Kafka Framework (Asynchronous)

Overview

StackSaga-Kafka Framework (stacksaga-kafka-spring-boot-starter) is the asynchronous transport implementation of the StackSaga ecosystem. It replaces synchronous inter-service calls with Kafka as the messaging backbone, enabling fully event-driven saga execution for long-running transactions (LRT) in a microservices system.

Unlike a simple choreography approach where each service publishes and reacts to domain events independently, StackSaga-Kafka follows a centralized orchestration model: a single orchestrator service owns the full saga lifecycle, emits commands to target worker services via dedicated Kafka topics, and collects responses through a framework-managed per-domain callback topic. This separation gives you the scalability and decoupling of Kafka while retaining the deterministic control flow and auditability of the saga orchestration pattern.

Announcement — Kafka Share Group support (KIP-932)

StackSaga now supports Kafka 4 Queues with Share Groups (KIP-932) end to end, for both the orchestrator and worker sides of the framework. Share groups let multiple consumer instances cooperatively read from the same topic with per-record acknowledgment, so saga processing throughput is no longer capped by the topic’s partition count.

Enable it per endpoint or per event manager with the groupType attribute — no repartitioning or topic redesign required.

The framework is built on top of Spring Boot and integrates with StackSaga’s core engine for event sourcing, state management, retry scheduling, and cluster coordination. Each saga execution is fully persisted — every state transition of the SagaDomainEntity (the saga’s aggregate root and payload carrier) is written to the event store through stacksaga-database-support — so that any in-progress transaction can be recovered, retried, or inspected at any point in time.

Key capabilities:

Asynchronous saga orchestration — the orchestrator issues commands to worker services via Kafka and advances the saga state only upon receiving the worker’s reply, with no blocking thread held during the wait.
Built for Kafka 4 with KIP-932 share groups — both the orchestrator and the worker can consume either through the classic consumer-group protocol or through the Kafka 4 share-group protocol (KIP-932), a queue-style consumption model whose parallelism is not capped by the partition count. The protocol is selectable per endpoint and per event manager through the groupType attribute, so you can scale saga processing beyond the number of partitions without repartitioning topics.
Forward and backward recovery — built-in support for compensating transactions: when a saga step fails, the framework invokes onNextRevert() on the EventManager in reverse step order, triggering rollback actions on each previously completed service.
Per-domain callback topology — outbound command topics are defined per worker action; inbound replies use one callback topic per saga domain. A system with three saga domains uses exactly three callback topics, keeping Kafka partition management predictable.
Durable state via event sourcing — the full SagaDomainEntity payload and status (STARTED → IN_PROGRESS → COMPLETED | FAILED → COMPENSATING → COMPENSATED) is persisted on every state transition, enabling point-in-time recovery and complete audit trails.
Distributed retry with token ring partitioning — failed or stalled transactions are retried by the owning orchestrator instance as determined by Murmur3-based token ring partitioning, coordinated through the stacksaga-ring-coordinator.
Reactive and imperative support — the framework’s worker-side @SagaEndpoint abstraction supports both reactive (Mono/Flux) and blocking handler implementations, giving application developers freedom to choose their programming model without affecting saga coordination.

Glossary

The following terms are used throughout this documentation. Familiarising yourself with them before reading the technical sections will significantly reduce the learning curve.

Term Definition

Term	Definition
LRT (Long-Running Transaction)	A business transaction that spans multiple services and may take seconds, minutes, or longer to complete. LRTs require durable state management, distributed coordination, and compensation support for partial failures.
Saga Domain	A specific type of LRT identified by its `DomainEntity` subclass. All saga instances of the same type (e.g., every `PlaceOrder` transaction) belong to the same saga domain. Each domain has its own `EventManager`, Kafka callback topic, and set of spans.
Span	A single atomic execution step within a saga. Each span targets one worker service via a dedicated Kafka topic. A saga is composed of one or more sequential spans in the primary flow, each with an optional corresponding compensation span.
SEC (Saga Execution Coordinator)	The internal engine component responsible for driving a saga forward: evaluating the `EventManager`, dispatching commands to workers, receiving replies, persisting state transitions, and triggering compensation when needed. `StackSagaKafkaTemplate` is the developer-facing entry point to the SEC.
Domain Entity	The aggregate root for a single saga instance. Carries the full accumulated business payload and current execution state throughout the saga lifecycle. Every state transition is persisted as a snapshot to the event store.
EventManager	The orchestrator-side component that defines the routing logic for a saga domain. Determines which topic to trigger next after each successful span (`onNext()`) and provides hooks for the compensation sequence (`onNextRevert()`).
Executor	A worker-side handler that executes the business logic for a specific saga step. Implemented as a `QueryEndpoint` (read-only, no compensation) or a `CommandEndpoint` (state-changing, has a compensation action).
Compensation	The reverse process that undoes previously completed steps when a saga fails. Executed in reverse order via `onNextRevert()` in the `EventManager` and `undoProcess()` in `CommandEndpoint` implementations.
Event Store	The persistent storage layer for all saga state transitions and domain entity snapshots. Provided by `stacksaga-database-support`. Enables point-in-time recovery, retry, and full audit trails.
Revert Hint Store	A key-value store that carries metadata forward through the compensation sequence. Values written during one `undoProcess()` call are available to the next compensation step.
Topic Key	A unique `float` value assigned to each saga topic. Used in Kafka message headers instead of the raw topic name, keeping messages compact and decoupling the wire format from topic name refactoring.
Ring Coordinator	A standalone service (`stacksaga-ring-coordinator`) that manages token ring partitioning for retry ownership. Distributes Murmur3 token sub-ranges among orchestrator instances so each stalled transaction is retried by exactly one instance without distributed locking.

LRT (Long-Running Transaction)

A business transaction that spans multiple services and may take seconds, minutes, or longer to complete. LRTs require durable state management, distributed coordination, and compensation support for partial failures.

Saga Domain

A specific type of LRT identified by its DomainEntity subclass. All saga instances of the same type (e.g., every PlaceOrder transaction) belong to the same saga domain. Each domain has its own EventManager, Kafka callback topic, and set of spans.

Span

A single atomic execution step within a saga. Each span targets one worker service via a dedicated Kafka topic. A saga is composed of one or more sequential spans in the primary flow, each with an optional corresponding compensation span.

SEC (Saga Execution Coordinator)

The internal engine component responsible for driving a saga forward: evaluating the EventManager, dispatching commands to workers, receiving replies, persisting state transitions, and triggering compensation when needed. StackSagaKafkaTemplate is the developer-facing entry point to the SEC.

Domain Entity

The aggregate root for a single saga instance. Carries the full accumulated business payload and current execution state throughout the saga lifecycle. Every state transition is persisted as a snapshot to the event store.

EventManager

The orchestrator-side component that defines the routing logic for a saga domain. Determines which topic to trigger next after each successful span (onNext()) and provides hooks for the compensation sequence (onNextRevert()).

Executor

A worker-side handler that executes the business logic for a specific saga step. Implemented as a QueryEndpoint (read-only, no compensation) or a CommandEndpoint (state-changing, has a compensation action).

Compensation

The reverse process that undoes previously completed steps when a saga fails. Executed in reverse order via onNextRevert() in the EventManager and undoProcess() in CommandEndpoint implementations.

Event Store

The persistent storage layer for all saga state transitions and domain entity snapshots. Provided by stacksaga-database-support. Enables point-in-time recovery, retry, and full audit trails.

Revert Hint Store

A key-value store that carries metadata forward through the compensation sequence. Values written during one undoProcess() call are available to the next compensation step.

Topic Key

A unique float value assigned to each saga topic. Used in Kafka message headers instead of the raw topic name, keeping messages compact and decoupling the wire format from topic name refactoring.

Ring Coordinator

A standalone service (stacksaga-ring-coordinator) that manages token ring partitioning for retry ownership. Distributes Murmur3 token sub-ranges among orchestrator instances so each stalled transaction is retried by exactly one instance without distributed locking.

Why StackSaga-Kafka Over Raw Kafka?

Using Kafka alone for service-to-service communication in a saga does not automatically give you saga orchestration. A raw Kafka topology requires every participating service to understand the broader transaction context: which step it is in, what has already succeeded, what to roll back if something goes wrong, and how to track overall saga progress. This logic inevitably leaks into every service, making the system fragile and difficult to evolve.

The following comparison explains the specific gaps StackSaga-Kafka closes.

Centralized Transaction State Management

Concern Detail

Concern	Detail
Raw Kafka	Kafka delivers messages durably but has no concept of a business transaction. There is no built-in mechanism to know that "step 2 of 5 has completed" or that a transaction is in a compensating state.
StackSaga-Kafka	The orchestrator tracks the full saga lifecycle through well-defined states: `STARTED → IN_PROGRESS → COMPLETED \| FAILED → COMPENSATING → COMPENSATED`. The `SagaDomainEntity` is the single source of truth: it carries the transaction payload (all accumulated data from previous steps) and the current execution cursor. State transitions are persisted atomically to the event store on every step completion or failure.
Advantage	No saga coordination logic needs to be implemented in individual worker services. The orchestrator is the sole authority on saga state.

Raw Kafka

Kafka delivers messages durably but has no concept of a business transaction. There is no built-in mechanism to know that "step 2 of 5 has completed" or that a transaction is in a compensating state.

StackSaga-Kafka

The orchestrator tracks the full saga lifecycle through well-defined states: STARTED → IN_PROGRESS → COMPLETED | FAILED → COMPENSATING → COMPENSATED. The SagaDomainEntity is the single source of truth: it carries the transaction payload (all accumulated data from previous steps) and the current execution cursor. State transitions are persisted atomically to the event store on every step completion or failure.

Advantage

No saga coordination logic needs to be implemented in individual worker services. The orchestrator is the sole authority on saga state.

Resilience and Fault Tolerance

Concern Detail

Concern	Detail
Raw Kafka	Kafka provides delivery durability and at-least-once guarantees, but it does not know what to do when the message is consumed and the downstream operation fails. Retry logic, timeouts, and compensation decisions must be built separately.
StackSaga-Kafka	The framework implements configurable retry windows, boundary guard thresholds, and automatic compensation invocation. If a worker service returns a failure response, the saga engine immediately begins the reverse traversal of the `EventManager`, calling `onNextRevert()` for each completed forward step in reverse order. For transient failures (timeouts, infrastructure errors), the retry subsystem re-invokes the saga from the last unacknowledged step using the Murmur3 token ring scheduler.
Advantage	Data consistency is maintained across services even under partial failures, without embedding retry or rollback logic in each service.

Raw Kafka

Kafka provides delivery durability and at-least-once guarantees, but it does not know what to do when the message is consumed and the downstream operation fails. Retry logic, timeouts, and compensation decisions must be built separately.

StackSaga-Kafka

The framework implements configurable retry windows, boundary guard thresholds, and automatic compensation invocation. If a worker service returns a failure response, the saga engine immediately begins the reverse traversal of the EventManager, calling onNextRevert() for each completed forward step in reverse order. For transient failures (timeouts, infrastructure errors), the retry subsystem re-invokes the saga from the last unacknowledged step using the Murmur3 token ring scheduler.

Advantage

Data consistency is maintained across services even under partial failures, without embedding retry or rollback logic in each service.

Simplified Development

Concern Detail

Concern	Detail
Raw Kafka	Developers must implement state machine logic, idempotency checks, saga step routing, and compensation orchestration in every service that participates in a business transaction.
StackSaga-Kafka	The orchestrator-side abstraction (`StackSagaKafkaTemplate`, `EventManager`, `SagaDomainEntity`) and the worker-side abstraction (`@SagaEndpoint`) provide clear contracts for each role. Developers define what the steps do and in what order; the framework handles routing, persistence, retrying, and compensation sequencing.
Advantage	Reduces boilerplate significantly and confines distributed transaction concerns to a small, testable layer.

Raw Kafka

Developers must implement state machine logic, idempotency checks, saga step routing, and compensation orchestration in every service that participates in a business transaction.

StackSaga-Kafka

The orchestrator-side abstraction (StackSagaKafkaTemplate, EventManager, SagaDomainEntity) and the worker-side abstraction (@SagaEndpoint) provide clear contracts for each role. Developers define what the steps do and in what order; the framework handles routing, persistence, retrying, and compensation sequencing.

Advantage

Reduces boilerplate significantly and confines distributed transaction concerns to a small, testable layer.

Operational Observability

Concern Detail

Concern	Detail
Raw Kafka	Consumer lag metrics and broker-level monitoring are available, but there is no concept of "transaction X is stuck at step 3" or "compensation for order Y failed".
StackSaga-Kafka	The `stacksaga-trace-window-connector` exposes APIs consumed by the StackSaga Trace Window UI, providing per-transaction step-level traces, execution timelines, failure points, retry counts, and compensation status.
Advantage	Operations teams can diagnose stalled or failed sagas at the business-transaction level, not just at the Kafka consumer level.

Raw Kafka

Consumer lag metrics and broker-level monitoring are available, but there is no concept of "transaction X is stuck at step 3" or "compensation for order Y failed".

StackSaga-Kafka

The stacksaga-trace-window-connector exposes APIs consumed by the StackSaga Trace Window UI, providing per-transaction step-level traces, execution timelines, failure points, retry counts, and compensation status.

Advantage

Operations teams can diagnose stalled or failed sagas at the business-transaction level, not just at the Kafka consumer level.

Components

StackSaga-Kafka Framework Component Overview

The StackSaga-Kafka Framework is split into two artifacts with distinct responsibilities. They communicate with each other exclusively through Kafka: the orchestrator sends outbound command messages to worker-designated topics, and workers send reply messages back to the framework-managed per-domain callback topic.

stacksaga-kafka-orchestrator
stacksaga-kafka-worker

stacksaga-kafka-orchestrator

stacksaga-kafka-orchestrator is the core runtime dependency added to the orchestrator service — the single service responsible for initiating and driving the saga lifecycle.

It exposes:

StackSagaKafkaTemplate — the entry point for starting a new saga execution or resuming a recovered one. Accepts an initialized SagaDomainEntity and hands it off to the saga engine.
EventManager — defines the ordered steps and compensation steps of a saga. The engine calls onNext() sequentially for forward progress and onNextRevert() in reverse order during compensation.
SagaDomainEntity — the aggregate root for a saga instance. Carries the accumulated business payload and the current execution state. It is serialized and persisted to the event store on every state transition.

To run your application as an orchestrator service, first add the stacksaga-kafka-orchestrator-starter dependency to your project.

<dependencyManagement>
    <dependencies>
        <dependency> <!--Only for stacksaga dependencies version management-->
            <groupId>org.stacksaga</groupId>
            <artifactId>stacksaga-bom</artifactId>
            <version1.0.0-SNAPSHOT</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <dependency>
        <groupId>org.stacksaga</groupId>
        <artifactId>stacksaga-kafka-orchestrator-starter</artifactId>
    </dependency>
</dependencies>

The framework targets Kafka 4 and Spring Boot 4 and integrates directly with Spring for Apache Kafka’s listener containers. It fully supports the Kafka 4 share-group protocol (KIP-932) for queue-style consumption in addition to the classic consumer-group protocol, selectable per event manager and per endpoint via the groupType attribute. Application code can be written in either a reactive (Mono/Flux) or a non-reactive (imperative/blocking) style — both are first-class and fully supported.

If your orchestrator service should work as the worker for another saga domain owned by a different orchestrator, no need to add the worker dependency (stacksaga-kafka-worker-starter) separately. because the orchestrator dependency already includes the worker dependency transitively.

Additionally, the orchestrator service must include:

stacksaga-database-support — provides the event store integration (MySQL, Cassandra, or ScyllaDB depending on configuration) for persisting SagaDomainEntity snapshots and state transitions.
stacksaga-ring-coordinator-connector (optional, required for retry) — connects the orchestrator instance to the Ring Coordinator Service, registers it as a retry node, and receives its assigned Murmur3 token sub-range. Enables the retry subsystem to re-invoke failed transactions owned by this instance.

stacksaga-kafka-worker

stacksaga-kafka-worker is the dependency added to every worker service (also called a target service or utility service) that participates in sagas initiated by an orchestrator.

The worker dependency provides:

A Kafka listener infrastructure that subscribes to the topic(s) designated for that service by the framework’s topic configuration.
@SagaEndpoint — the abstraction through which application developers implement the forward processing logic and the compensation logic for each saga step handled by this service.
Automatic response dispatch — after the endpoint handler completes (successfully or with a failure result), the worker sends the outcome back to the framework via the orchestrator’s per-domain callback topic. The application developer does not manage this reply flow directly.

Worker services do not interact with the event store or the ring coordinator. They are stateless with respect to the saga: they receive a command, execute the relevant business operation, and return a result.

A single worker service can handle steps for multiple orchestrator domains (i.e., multiple saga types), each routed to the correct @SagaEndpoint implementation by the framework.

To add the worker dependency, include stacksaga-kafka-worker-starter in your worker project as below.

<dependencyManagement>
    <dependencies>
        <dependency> <!--Only for stacksaga dependencies version management-->
            <groupId>org.stacksaga</groupId>
            <artifactId>stacksaga-bom</artifactId>
            <version1.0.0-SNAPSHOT</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>
<dependencies>
    <dependency>
        <groupId>org.stacksaga</groupId>
        <artifactId>stacksaga-kafka-worker-starter</artifactId>
    </dependency>
</dependencies>