Transaction Retry Architecture With Retry Coordinator
This article is the reference guide for the Transaction Retry Architecture of the StackSaga framework. It is intended for architects, DevOps engineers, and developers who integrate, deploy, or operate StackSaga-based microservices.
Introduction
In a distributed microservices architecture, transactions often span multiple services and databases, making them susceptible to failures, timeouts, and pauses. The StackSaga framework addresses this challenge with a robust transaction retry subsystem that ensures reliability and consistency without risking duplicate processing or conflicts. This subsystem is designed to detect paused transactions and re-invoke them safely across a cluster of microservice instances, using a combination of token ring partitioning, RSocket communication, and time-windowed publishing. This architecture allows StackSaga to maintain high availability and resilience, even in the face of transient failures or network issues, while ensuring that transactions are processed exactly once.
Scope of This Document
This document focuses exclusively on the transaction retry subsystem — the mechanism by which StackSaga detects paused transactions and re-invokes them safely, without duplication or conflict, across a cluster of microservice instances.
Topics covered:

- Identifying retryable vs non-retryable errors
- The three-component retry ecosystem: Orchestrator, Agent Slave, and Retry-Coordinator-Master
- Token ring partitioning using Murmur3
- RSocket communication patterns between components
- Time-windowed token publishing and conflict avoidance
- Failure modes, reconnection behaviour, and deployment considerations
- Multi-region deployments
- Virtual cluster partitioning for massive-scale systems
- Database support modules
Retryable vs Non-Retryable Errors
Identifying the nature of errors is crucial for effective transaction management.
StackSaga distinguishes two classes of errors:

- Non-retryable error — a business-logic failure or permanent error condition. These errors are not transient and cannot be resolved by retrying; examples include validation failures, authorization errors, or any condition that indicates a fundamental problem with the transaction itself. They can occur in both the primary flow and the compensation flow: in the primary flow they trigger compensation, and during compensation they produce a failed transaction. In either case, the transaction is marked as failed and will not be retried.
- Retryable error — a transient condition, such as a downstream service being unavailable, a database connection timeout, or a network partition. The transaction is paused and will be replayed automatically. Retryable errors can occur in both the primary flow and the compensation flow. This is where the retry subsystem comes into play, ensuring that the transaction is retried safely, without duplication or conflict, to overcome the eventual-consistency challenges of distributed systems.
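To make the distinction concrete, a rough classification sketch might look like the following. The exception types chosen here are illustrative assumptions, not StackSaga's actual classification rules:

```java
import java.net.ConnectException;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch only: StackSaga's real retryable/non-retryable decision
// is framework-defined; these exception types are illustrative examples.
class ErrorClassSketch {
    static boolean isRetryable(Throwable t) {
        // Transient infrastructure failures -> pause the transaction and replay later.
        // Anything else (validation, authorization, business-rule violations) is
        // non-retryable and triggers compensation or a failed transaction.
        return t instanceof ConnectException || t instanceof TimeoutException;
    }
}
```

A validation error such as an IllegalArgumentException would fall through to the non-retryable branch and lead to compensation or a failed transaction.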
The Challenge of Retrying Transactions on the Same Node
At first glance, it may seem easy to manage retrying by re-invoking the transaction on the same node that originally processed it. However, this approach can lead to issues.
The main concern is that standard nodes are ephemeral by nature: they can become unresponsive or fail, especially in a dynamic microservices environment where instances are scaled up or down based on traffic demand. When the time comes to expose a transaction for retrying, the node that originally processed it may no longer be available, leaving the transaction paused indefinitely.
For instance, imagine 3 instances are running and 7 transactions have been saved for retrying due to network issues. If each instance takes care of only its own transactions, then by the time the retry scheduler fires, the instance count may have changed due to scaling. Suppose the instance that created some of those transactions — say order-service-3400001 — is no longer running when the scheduler is triggered. The other instances never touch transactions created by that instance, so those transactions are never exposed for retrying, as the diagram below shows.
As a solution, StackSaga proposes a retry architecture that decouples transaction retrying from the original processing node. It consists of 3 components; let's explore them one by one.
Microservice Deploying Strategy In StackSaga
As mentioned above, standard nodes are ephemeral: short-lived, temporary, and designed to be created, destroyed, or replaced on demand. Based on this, nodes in the StackSaga architecture can be deployed in two different modes:

- Standard-Node — the regular orchestrator-service (with StackSaga added to enable the saga flow) that processes transactions and handles business logic.
- Retry-Node — a specialized node that manages transaction retries in addition to processing transactions. It is designed to be more stable and long-lived than standard nodes, ensuring that it can reliably manage retries even in the face of failures or scaling events.

For instance, in a typical deployment you might have anywhere from 5 to 1000 standard nodes handling transaction processing, plus 2 retry-nodes that manage retries while handling transaction processing as well. The standard nodes can be scaled up or down based on traffic demand, while the 2 retry-nodes remain stable so that transaction retries are managed effectively; it is not necessary to scale the retry-nodes with traffic, though nothing prevents scaling them up or down if needed. Retry-nodes are kept stable and long-lived because retrying is the worst-case path for a transaction, not the regular flow. Read the article Proportional Analysis of Failure Modes in Long-Running Transactions Using the Saga Pattern to understand more about the need for retrying and its relative weight.
The next challenge is how to manage retrying across the cluster of retry-nodes without risking duplicate processing or conflicts, because the nodes have no awareness of which transactions each of them should retry.
This is where Token Ring Partitioning comes into the picture. It is the key mechanism that allows StackSaga to safely distribute retry responsibility across multiple Orchestrator instances without any risk of overlap or conflict.
Token Ring Partitioning
Overview
StackSaga uses the Murmur3 consistent hashing algorithm to assign every transaction to a deterministic position in a 64-bit token ring. This token is computed when the transaction is initialised (by any orchestrator) and stored alongside the transaction record in the database, together with the cluster name and region of the originating Orchestrator. The retry subsystem uses this stored token, cluster, and region to determine which orchestrator instance is responsible for retrying a given transaction at any point in time.
The full token space spans from \(-2^{63}\) to \(2^{63}-1\):
-9,223,372,036,854,775,808 → 9,223,372,036,854,775,807
The diagram below illustrates how the retry-nodes are assigned non-overlapping segments of the token ring. There are 4 retry-nodes in this example, each responsible for a distinct quarter of the token space:
The full Murmur3 hash space is divided equally across 4 Retry-Node instances.
- Hash space: -9223372036854775808 → 9223372036854775807
- Total range size: 18446744073709551616
- Partition size per node: 4611686018427387904
| Node | Slot | Token Range Start | Token Range End |
|---|---|---|---|
| A | 1 | -9223372036854775808 | -4611686018427387905 |
| B | 2 | -4611686018427387904 | -1 |
| C | 3 | 0 | 4611686018427387903 |
| D | 4 | 4611686018427387904 | 9223372036854775807 |
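The equal division above can be reproduced with a few lines of arithmetic. This is a minimal sketch (the class and method names are hypothetical, not StackSaga's API); BigInteger is used because the full ring size, 2^64, overflows a signed long:

```java
import java.math.BigInteger;

// Hypothetical sketch: divide the signed 64-bit token ring into n equal slices.
class TokenRingSketch {
    static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE);

    // Returns {startInclusive, endInclusive} of slice `index` out of `n`.
    static long[] slice(int index, int n) {
        BigInteger total = MAX.subtract(MIN).add(BigInteger.ONE); // 2^64
        BigInteger size = total.divide(BigInteger.valueOf(n));
        BigInteger start = MIN.add(size.multiply(BigInteger.valueOf(index)));
        // The last slice absorbs any rounding remainder so the ring stays fully covered.
        BigInteger end = (index == n - 1) ? MAX : start.add(size).subtract(BigInteger.ONE);
        return new long[]{start.longValueExact(), end.longValueExact()};
    }
}
```

With n = 4, slice 0 yields -9223372036854775808 → -4611686018427387905 — the first quarter of the ring.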
Transaction Retry Subsystem
As per the architecture of StackSaga, ring partitioning and transaction retrying are a collaborative effort of three components.
Token Distribution
As we have seen in the previous section, a slot in the token ring is assigned to each Orchestrator instance. The next concern is how the token ring is partitioned and how the Orchestrator instances receive their assigned token ranges.
Two components are responsible for the token ring partitioning and distribution:
The Master is the single authority that divides the full token ring among registered Slaves. The slaves take their assigned slice and further divide it among their connected Orchestrator instances.
Note: It's not required to have multiple physical Ring-coordinators to achieve partitioning.
How the Three Work Together
Before diving into each component, here is a brief preview of how they collaborate to manage transaction retries. All three components work together to safely share the Murmur3 token ring, which partitions the transaction space and ensures that retries are distributed without conflict. In the simplest possible summary:
| Who | What they contribute |
|---|---|
| Retry-Coordinator-Master | Divides the full token ring equally across Slaves at the 30th second of every minute. Tells new Orchestrators which Slave to connect to. |
| Agent Slave | Takes its ring slice from the Master and divides it further across its connected Orchestrators. Delivers each Orchestrator its personal sub-range every minute. |
| Orchestrator | Holds its sub-range for the current minute. Polls the database for paused transactions whose token falls in that sub-range. Re-invokes them — without ever conflicting with another instance. |
Each component does its own small job cleanly. No component reaches into another's responsibility. The result is a retry system that scales horizontally, tolerates partial failures, and guarantees that every paused transaction is eventually retried — by exactly one instance at a time.
Retry-Coordinator-Master
What is Retry-Coordinator-Master?
- The Retry-Coordinator-Master is the single authority for retry coordination within a microservice domain. There is exactly one Master per domain (or per virtual cluster, if you use that feature).
- The Master has no knowledge of your business logic and never touches the transaction database.
- Its entire responsibility is coordination: dividing the token ring, keeping track of which Slaves are alive, and pointing new Orchestrators in the right direction when they start up.
It is a small, purpose-built service — but it is the cornerstone that makes the rest of the retry system work correctly and conflict-free.
The Retry-Coordinator-Master is responsible for managing the global token ring for the microservice domain. It partitions the 64-bit Murmur3 token ring equally among registered ring-coordinator-slave nodes, publishes token range updates at the 30th second of every minute, and acts as the load balancer directing Orchestrator instances to Slave nodes (Round Robin). It is built on non-blocking Netty via RSocket and is capable of handling thousands of concurrent Slave connections.
What Does It Do?
The Master has three distinct responsibilities.
Responsibility 1 — Maintain the Slave registry.
Every Slave that starts up connects to the Master via a persistent RSocket request-stream and registers itself.
The Master keeps a live registry of all connected Slaves.
When a Slave disconnects (crashes, restarts), the Master notices immediately and updates its registry.
Responsibility 2 — Publish token ring partitions at the 30th second of every minute. At the 30th second of every minute the Master fires a timer. It looks at the current list of registered Slaves, divides the full 64-bit token ring equally among them using the Murmur3 algorithm, and pushes each Slave its assigned range. These ranges are tagged as valid for the next full minute — giving the entire cluster 30 seconds to receive and prepare the new assignments before they go live. This is the heartbeat that keeps the whole retry system ticking.
Responsibility 3 — Direct new Orchestrators to a Slave.
When a new Orchestrator instance starts up, it sends a one-shot Request-Response message to the Master asking: "Which Slave should I connect to?"
The Master picks a Slave using a Round Robin strategy — spreading Orchestrators evenly across the available Slaves — and responds with the Slave’s host and port.
After this single exchange, the Orchestrator never contacts the Master again.
All future communication goes through the Slave directly.
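A Round Robin selection of this kind can be sketched as follows (a hypothetical illustration; the class name and the Master's actual selection code are assumptions):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the Master's Round Robin Slave lookup.
class RoundRobinSketch {
    private final AtomicLong counter = new AtomicLong();

    // Returns the next Slave address for a newly started Orchestrator.
    String assign(List<String> slaves) {
        int idx = (int) (counter.getAndIncrement() % slaves.size());
        return slaves.get(idx);
    }
}
```

Successive lookups cycle through the registered Slaves, spreading Orchestrators evenly across them over time.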
Why Only One Master per Domain?
A single Master keeps coordination simple and predictable. There is one source of truth for the token ring — no need for consensus between multiple Masters, no risk of two Masters publishing conflicting ranges, no split-brain scenarios.
The Master is also built on a fully non-blocking Netty stack via RSocket. Because it never blocks a thread waiting for I/O, a single Master instance can comfortably handle thousands of concurrent Slave connections — far beyond what most deployments will ever need.
For systems that require even stronger fault isolation, StackSaga supports virtual clusters — essentially running multiple independent Masters (each with their own Slave group) within the same physical deployment. If one Master goes down, only its virtual cluster’s retry activity pauses; all other virtual clusters continue normally.
What Does It Know About?
The Retry-Coordinator-Master knows about:

- The registered Slave nodes and their connection status.
- How to divide the 64-bit token ring equally using Murmur3 partitioning.
- Which Slave to assign to each new Orchestrator (Round Robin).

The Retry-Coordinator-Master does not know about:

- Individual Orchestrator instances after the initial lookup.
- The transaction database.
- Your business logic or transaction structure.
- Other domain Masters — each Master is isolated to its own domain.
Master Node Failure
When the Retry-Coordinator-Master fails:
- All registered Slaves lose their connection to the Master.
- Slaves stop receiving token range updates. Their last known range remains cached but expires at the next minute boundary.
- Orchestrators, which do not communicate directly with the Master during normal operation, continue retrying using their last received sub-range until it expires.
- New Orchestrator instances starting up will fail to obtain a Slave assignment.
Note: Because each microservice domain has its own dedicated Master, a Master failure affects only that domain's retry subsystem. Other microservice domains continue operating normally. Primary transaction execution (new transactions) is unaffected by Master availability. For systems where even this scoped failure window is unacceptable, see Virtual Clusters for how to deploy multiple independent master-slave groups within the same domain to achieve fault isolation at the retry layer.
Agent Slave (ring-coordinator-slave)
What is the Agent Slave?
The Agent Slave is a dedicated infrastructure service — a small, lightweight process that you deploy alongside your microservice. It has no awareness of your business logic whatsoever. It does not touch the transaction database. It does not execute any transactions.
Its entire purpose is to act as a distribution bridge between the Retry-Coordinator-Master and the Orchestrator instances. Think of it as a relay station: it receives a large token range from the Master and breaks it into smaller, non-overlapping sub-ranges, one for each Orchestrator instance it is serving.
What Does It Do?
The Agent Slave has a very focused daily routine.
Step 1 — Register with the Master.
When the Slave starts up it opens a persistent connection to the Retry-Coordinator-Master using RSocket request-stream.
It says "I am here, I am ready" and then keeps that connection open indefinitely, listening for updates.
Step 2 — Receive a token range slice. At the 30th second of every minute the Master publishes updated token range assignments. The Slave receives its slice — a contiguous portion of the full 64-bit token ring — and is told which minute that slice is valid for.
Step 3 — Accept Orchestrator registrations. When an Orchestrator instance starts up, it connects to the Slave and subscribes. The Slave records this Orchestrator in its local registry.
Step 4 — Divide and deliver. With its own slice in hand and a list of connected Orchestrators, the Slave divides the slice equally — one non-overlapping sub-range per Orchestrator — and pushes each sub-range to the respective Orchestrator instance over the persistent stream.
That is it. The Slave repeats this cycle every minute, keeping every connected Orchestrator informed of its current retry ownership window.
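The divide-and-deliver step reduces to a plain range split. The sketch below is a hypothetical illustration (not StackSaga's implementation); the last sub-range absorbs any rounding remainder so the whole slice stays covered:

```java
import java.math.BigInteger;

// Hypothetical sketch: split a Slave's [start, end] slice into one
// non-overlapping sub-range per connected Orchestrator.
class SubRangeSketch {
    static long[][] split(long start, long end, int orchestrators) {
        BigInteger s = BigInteger.valueOf(start);
        BigInteger e = BigInteger.valueOf(end);
        BigInteger size = e.subtract(s).add(BigInteger.ONE)
                           .divide(BigInteger.valueOf(orchestrators));
        long[][] subRanges = new long[orchestrators][2];
        for (int i = 0; i < orchestrators; i++) {
            BigInteger subStart = s.add(size.multiply(BigInteger.valueOf(i)));
            BigInteger subEnd = (i == orchestrators - 1)
                    ? e // last sub-range takes the rounding remainder
                    : subStart.add(size).subtract(BigInteger.ONE);
            subRanges[i][0] = subStart.longValueExact();
            subRanges[i][1] = subEnd.longValueExact();
        }
        return subRanges;
    }
}
```

Because the sub-ranges are contiguous and non-overlapping, no two Orchestrators can ever own the same token in the same window.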
Why Have a Slave Layer at All?
You might wonder: why not have the Master talk directly to every Orchestrator? The answer comes down to stability and scale.
Your service pods (Orchestrators) come and go frequently — they restart on deployments, scale up under load, scale down at night. If the Master had to track every individual pod, it would be overwhelmed with registration and deregistration events and would constantly be recalculating the entire ring.
The Agent Slave acts as a buffer. The Master only needs to track a small, stable set of Slave nodes. All the volatility of your application pods is absorbed within the Slave, which quietly adjusts its sub-range distribution whenever an Orchestrator joins or leaves — without disturbing the Master at all.
What Does It Know About?
The Agent Slave knows about:

- The Retry-Coordinator-Master it is registered with (configured via stacksaga.agent.slave.target-master.host and stacksaga.agent.slave.target-master.port).
- The Orchestrator instances currently connected to it.
- The token range it has been assigned and how to divide it.

The Agent Slave does not know about:

- Your business logic or transaction structure.
- The transaction database.
- Other Slave nodes — each Slave works independently.
Slave Node Failure
When a Slave node crashes or becomes unreachable:
- All Orchestrators connected to that Slave lose their stream. Their sub-ranges become stale and retry polling is paused for those instances.
- The Master detects the lost connection.
- The Master applies a lazy rebalance strategy:
  - If the lost Slave is not the last index in the registry, the Master assumes it will recover shortly (especially true in Kubernetes) and does not immediately rebalance.
  - The Master continues sending cached token ranges to the remaining Slaves.
  - The crashed Slave's token range is frozen — no Orchestrator covers it during the outage.
- When the Slave restarts, it reconnects to the Master and is treated as a new registration (the Master assigns it a new identity).
- At the next 30-second publish cycle, the Master recalculates and redistributes ranges across all currently registered Slaves, including the restarted one.
Note: During a Slave outage, transactions whose tokens fall within the frozen range are not retried until the range is covered again. These transactions remain safely stored in the database with status FAILED_WITH_RETRYABLE_ERROR.
Retry-node (Retry support microservice)
As described under Microservice Deploying Strategy In StackSaga above, nodes can be deployed either as Standard-Nodes (regular orchestrator-services that process transactions and handle business logic) or as Retry-Nodes (more stable, long-lived nodes that manage transaction retries in addition to processing transactions). A typical deployment pairs a dynamically scaled fleet of standard nodes with a small, stable set of retry-nodes.
What Does It Do?
The Orchestrator has two jobs running side by side at all times.
Job 1 — Execute transactions. (regular job as an orchestrator service ) When a new business operation arrives (a customer places an order, for example), the Orchestrator creates a transaction, breaks it into individual steps called spans, and executes them one by one — calling downstream services, updating databases, firing events. If a span fails with a permanent error, it starts a compensation sequence to undo what was already done. This is the primary job — the one your business logic cares about.
Job 2 — Retry paused transactions. Some spans fail not because of a business error, but because a resource was temporarily unavailable — a downstream service was restarting, a database was under heavy load, a network hiccup occurred. StackSaga does not discard these transactions. Instead, it marks them as paused and saves them safely in the database.
The Orchestrator’s second job is to periodically check the database for these paused transactions and try them again. But it does not check all paused transactions — only the ones it is currently responsible for, based on the token sub-range it holds for the current time window. This is how StackSaga ensures that multiple running instances of the same service never accidentally retry the same transaction twice.
What Does It Know About?
The Orchestrator knows about:

- Its own business logic and transaction spans.
- The token sub-range it currently holds (delivered by its Agent Slave).
- The transaction database — it reads and writes transaction records directly.

The Orchestrator does not know about:

- Other Orchestrator instances — there is no peer-to-peer communication between service instances.
- The Retry-Coordinator-Master — after the initial startup handshake, the Orchestrator never contacts the Master again unless the given Slave connection is lost.
- How the token ring works internally — it simply receives a range and uses it.
Retry-node Failure
When an Orchestrator instance fails and restarts:

- It performs a fresh Request-Response lookup to the Master for an available Slave (potentially receiving a different Slave due to Round Robin).
- It opens a new persistent stream to the newly assigned Slave.
- It receives the current sub-range at the next 30-second cycle and resumes retrying.

Any transactions that were being actively retried at the moment of failure will be re-attempted in the next window by whichever Orchestrator instance acquires that token range.
Registration & Token Distribution Flow
Step-by-Step Description
Slave Registration
- On startup, each Agent Slave opens a persistent RSocket request-stream connection to the Retry-Coordinator-Master.
- The Master records the Slave in its registry.
- The Slave remains connected and passively waits for token range updates.
Orchestrator Registration
- On startup, each Orchestrator sends an RSocket Request-Response message to the Master requesting an available Slave node assignment.
- The Master applies a Round Robin strategy to select a Slave and returns the Slave's connection details (host and port) to the Orchestrator. If the same Orchestrator instance restarts, it will receive a different Slave assignment on the next lookup, as the Round Robin pointer advances. This ensures balanced distribution of Orchestrators across Slaves over time.
- The Orchestrator then opens a persistent RSocket request-stream connection to its assigned Slave and begins listening for sub-range updates.
Token Range Distribution
- At the 30th second of every minute the Master fires its publish timer.
- The Master partitions the full token ring equally among all registered Slaves and sends each Slave its range, tagged with the target minute (validForMinute = T+1).
- Each Slave receives its range, divides it equally among its registered Orchestrators, and pushes sub-range updates downstream.
- Each Orchestrator stores the received sub-range and activates it at the start of minute T+1.
Retry Execution Loop
Once the time window is active, each Orchestrator enters its retry polling loop:

- Query the transaction store for records where:
  - token is within [subRangeStart, subRangeEnd]
  - cluster matches the Orchestrator's configured cluster name
  - region matches the Orchestrator's configured region
  - status = FAILED_WITH_RETRYABLE_ERROR
- For each matching transaction, re-invoke the next pending span.
- Repeat until the time window expires (the sub-range is superseded by the next update).
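The ownership check behind this loop can be expressed as a simple predicate. In StackSaga the equivalent filter runs as a database query; the record fields and method name below are illustrative assumptions:

```java
// Hypothetical in-memory sketch of the retry-poll ownership predicate.
class RetryPollSketch {
    record Tx(long token, String cluster, String region, String status) {}

    static boolean ownedForRetry(Tx tx, long subRangeStart, long subRangeEnd,
                                 String myCluster, String myRegion) {
        return tx.token() >= subRangeStart && tx.token() <= subRangeEnd
                && tx.cluster().equals(myCluster)
                && tx.region().equals(myRegion)
                && tx.status().equals("FAILED_WITH_RETRYABLE_ERROR");
    }
}
```

Because every instance holds a disjoint [subRangeStart, subRangeEnd] for the current window, no two instances can both return true for the same transaction at the same time.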
Communication Patterns
StackSaga uses RSocket for all inter-component communication. RSocket is chosen for its support of persistent, reactive, bidirectional streams backed by a non-blocking Netty transport.
| Connection | Interaction Model | Description |
|---|---|---|
| Slave → Master | Request-Stream | Persistent subscription. Slave registers with Master and keeps the stream open to receive token range updates. |
| Orchestrator → Master | Request-Response | One-shot lookup. Orchestrator requests an available Slave node assignment from Master on startup. |
| Orchestrator → Slave | Request-Stream | Persistent subscription. Orchestrator subscribes to its assigned Slave and keeps the stream open to receive sub-range updates. |
Token Ring Partitioning
Overview
As described in the earlier Token Ring Partitioning section, every transaction is assigned a deterministic position in the 64-bit Murmur3 token ring (spanning \(-2^{63}\) to \(2^{63}-1\)) when it is initialised, and the token is stored alongside the transaction record together with the cluster name and region of the originating Orchestrator. This section details how that ring is divided, level by level.
Master → Slave Partitioning
The Master divides the full token ring into equal segments, one per registered Slave node. With three Slave nodes the ring is divided as follows:
| Slave Node | Start Token (Inclusive) | End Token (Inclusive) |
|---|---|---|
| Slave Node 1 | -9223372036854775808 | -3074457345618258604 |
| Slave Node 2 | -3074457345618258603 | 3074457345618258601 |
| Slave Node 3 | 3074457345618258602 | 9223372036854775807 |
Note: The number of partitions and their boundaries are recalculated automatically each time the number of registered Slaves changes (at the next 30-second publish cycle).
Slave → Orchestrator Sub-Partitioning
Each Slave further divides its received token range equally among all Orchestrator instances registered with it.
The following example shows how Slave Node 1 divides its range among two Order Service instances:
| Orchestrator Instance | Start Token (Inclusive) | End Token (Inclusive) |
|---|---|---|
| Order Service Instance 1 | -9223372036854775808 | -6148914691236517206 |
| Order Service Instance 2 | -6148914691236517205 | -3074457345618258604 |
All other Slave nodes apply the same logic for their own registered Orchestrator instances.
Token Time Window & Conflict Avoidance
The Publish Cycle
The Retry-Coordinator-Master runs a repeating timer that fires at the 30th second of every minute. At each trigger the Master:
- Recalculates the token ring partition based on currently registered Slaves.
- Publishes the new token range to each Slave.
- Each Slave recalculates its sub-ranges and pushes updates to its registered Orchestrators.
The published range is labelled as valid for the next full minute (minute T+1), not the current minute.
Why the 30-Second Offset?
Publishing at the 30th second and marking the range as valid for the next minute creates a 30-second preparation window. This window is intentionally allocated to account for worst-case communication delays: Slave distribution, Orchestrator updates, and network latency in large clusters.
By the time minute T+1 begins, all Orchestrators are guaranteed to have received their new sub-ranges and are ready to start polling.
Timeline example:

Minute T, second 30   → Master publishes ranges (valid for T+1)
Minute T, second 31–59 → Slaves and Orchestrators receive and cache new ranges
Minute T+1, second 0  → Orchestrators begin polling with new ranges
Minute T+1, second 30 → Master publishes ranges for T+2
...
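The timing rule behind this cycle — a range published at the 30th second of minute T goes live at the start of minute T+1 — can be sketched as a small calculation (the helper names are hypothetical, not framework code):

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch of the publish-cycle timing rule.
class PublishWindowSketch {
    // When does a range published at `publishedAt` become active?
    static Instant activationTime(Instant publishedAt) {
        return publishedAt.truncatedTo(ChronoUnit.MINUTES).plus(1, ChronoUnit.MINUTES);
    }

    // The preparation window between publishing and activation, in seconds.
    static long preparationSeconds(Instant publishedAt) {
        return ChronoUnit.SECONDS.between(publishedAt, activationTime(publishedAt));
    }
}
```

A publish at 10:15:30 activates at 10:16:00, leaving the intended 30-second preparation window.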
Multi-Region Deployment
For systems deployed across multiple geographic regions, StackSaga uses the region property to scope retry ownership.
Every component — Master, Slave, and Orchestrator — is stamped with the region it belongs to.
How Region Scoping Works
When a transaction is created it inherits the region of the Orchestrator that created it. That region value is stored in the transaction record in the database.
During the retry polling loop, each Orchestrator adds region as a mandatory filter alongside the token range:
SELECT *
FROM transactions
WHERE token BETWEEN :subRangeStart AND :subRangeEnd
AND region = :region
AND cluster = :cluster
AND status = 'FAILED_WITH_RETRYABLE_ERROR';
This guarantees that an Orchestrator in us-central never picks up and retries a transaction that was originally created by an Orchestrator in asia-south — even if both regions share the same database (e.g., a globally replicated Cassandra cluster).
Virtual Clusters
Why Virtual Clusters?
In very large deployments the single-master-per-domain model, while operationally straightforward, presents two concerns as the system grows:
- Scale — a single Master must manage all Slave connections for the domain. Although the Master uses a fully non-blocking Netty stack via RSocket and is capable of handling thousands of concurrent Slave connections, some extreme-scale deployments may want to distribute this coordination load.
- Fault isolation — if the domain's single Master goes down, the entire domain's retry subsystem pauses until it recovers. While the Master is designed to restart quickly under Kubernetes, some systems require that a partial infrastructure failure never affects more than a defined fraction of retry capacity.
Virtual clusters address both concerns by allowing you to divide a single physical deployment into multiple independent master-slave groups, each operating as a completely separate retry coordination unit, all within the same region.
What is a Virtual Cluster?
A virtual cluster is a named group of Retry-Coordinator-Master, Agent Slave, and Orchestrator instances that operate in isolation from other groups.
The group identity is declared via the stacksaga.instance.cluster property.
All three components that share the same cluster name form one logical retry cluster:
- The Master only accepts registrations from Slaves with a matching cluster name.
- The Slave connects only to the Master of its cluster, declared via stacksaga.agent.slave.target-master.host and stacksaga.agent.slave.target-master.port.
- The Orchestrator connects only to the Master of its cluster for the initial Slave lookup, and subsequently to the Slave assigned by that Master.
- Retry queries are filtered by cluster name in addition to token range and region.
Two virtual clusters running in the same physical Kubernetes cluster are completely isolated from each other — they share no state, no connections, and no retry responsibility.
Multi-Region Single-Cluster Deployment
The simplest multi-region setup is one virtual cluster per region. This is the baseline configuration for geographic distribution.
Multi-Region Multi-Virtual-Cluster Deployment
For the highest levels of scale and fault isolation, each physical region can host multiple virtual clusters. In this topology, if any single Master fails, only the fraction of transactions belonging to that virtual cluster are affected — the remaining virtual clusters in the same region continue retrying normally.
Transaction Ownership Across Virtual Clusters
A transaction is permanently bound to the region and cluster of the Orchestrator that created it. No other virtual cluster — even within the same region — will ever attempt to retry it.
The retry query always includes both region and cluster as mandatory filters:
SELECT *
FROM transactions
WHERE token BETWEEN :subRangeStart AND :subRangeEnd
AND region = 'us-central' -- physical region
AND cluster = 'us-central-c1' -- virtual cluster
AND status = 'FAILED_WITH_RETRYABLE_ERROR';
This makes the system particularly well-suited to Apache Cassandra deployments where region and cluster form part of the partition key, enabling the database to route the query directly to the correct nodes with zero full-table scans.
Glossary
| Term | Definition |
|---|---|
| Span | A single, trackable unit of work within a distributed transaction. A transaction is composed of one or more sequential spans. |
| Token | A 64-bit integer derived from the Murmur3 hash of a transaction ID. Used to deterministically assign the transaction to a retry owner. |
| Token Ring | The full 64-bit integer space divided into contiguous, non-overlapping ranges. StackSaga uses the Murmur3Partitioner model, identical in principle to Apache Cassandra's token ring. |
| Token Range | A contiguous subset of the token ring, assigned to a Slave or Orchestrator for a given time window. |
| Time Window | A one-minute interval during which an Orchestrator holds exclusive ownership of a sub-token range. |
| Publish Cycle | The repeating event at the 30th second of every minute in which the Master redistributes token ranges. |
| Compensation | The rollback sequence executed when a non-retryable error occurs in the primary transaction flow. |
| Murmur3Partitioner | A consistent hashing scheme using the MurmurHash3 algorithm to evenly distribute keys across the token ring. |
| Region | A physical deployment boundary (e.g., us-central, asia-south). |
| Virtual Cluster | A named logical group of Master, Slave, and Orchestrator instances that operate in complete isolation from other groups. Declared via stacksaga.instance.cluster. |
| RSocket | A binary application-level protocol supporting multiple interaction models, including Request-Response and Request-Stream, over a non-blocking transport. |
| Lazy Rebalance | The Master's strategy of deferring token ring recalculation until the next scheduled publish cycle, rather than reacting immediately to Slave disconnections. |