Transaction Retry Architecture With Retry Coordinator
This article is the reference guide for the Transaction Retry Architecture of the StackSaga framework. It is intended for architects, DevOps engineers, and developers who integrate, deploy, or operate StackSaga-based microservices.
Introduction
In a distributed microservices architecture, transactions often span multiple services and databases, making them susceptible to failures, timeouts, and pauses. The StackSaga framework addresses this challenge with a robust transaction retry subsystem that ensures reliability and consistency without risking duplicate processing or conflicts.
This subsystem detects paused transactions and re-invokes them safely across a cluster of microservice instances, using a combination of token ring partitioning, RSocket communication, and time-windowed publishing. The result is high availability and resilience even in the face of transient failures, while ensuring that any paused transaction is re-invoked by only one instance at a time.
Scope of This Document
This document focuses exclusively on the transaction retry subsystem.
Topics covered:
- Identifying retryable vs non-retryable errors
- Why instance-level retrying is insufficient
- Deployment modes: Standard-Node vs Retry-Node
- The three-component retry ecosystem: Orchestrator, Retry-Coordinator-Slave, and Retry-Coordinator-Master
- Token ring partitioning using Murmur3
- RSocket communication patterns between components
- Time-windowed token publishing and conflict avoidance
- Failure modes, reconnection behaviour, and deployment considerations
- Multi-region deployments
- Virtual cluster partitioning for massive-scale systems
Foundational Concepts
Before diving into the architecture, two foundational concepts must be clear: what kinds of errors trigger retrying, and why retrying cannot simply be handled by the originating node.
Retryable vs Non-Retryable Errors
StackSaga distinguishes two classes of errors:
Non-retryable error — a business logic failure or permanent error condition. These types of errors are not transient and cannot be resolved by retrying. Examples include validation failures, authorization errors, or any condition that indicates a fundamental issue with the transaction itself. These errors can occur in both the primary flow and the compensation flow. In the primary flow, a non-retryable error triggers compensation; if it occurs during compensation, the transaction is marked as permanently failed. In either case, the transaction will not be retried.
Retryable error — a transient condition, such as a downstream service being unavailable, a database connection timeout, or a network partition. The transaction is paused and will be replayed automatically. Retryable errors can occur in both the primary flow and the compensation flow. This is precisely where the retry subsystem comes into play, ensuring that the transaction is retried safely without duplication or conflict.
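To make the distinction concrete, the sketch below shows the kind of decision an error classifier makes. The class and exception names are illustrative placeholders, not StackSaga's actual API; the point is only that transient infrastructure failures lead to a paused, retryable transaction, while permanent failures lead to compensation.

```java
import java.net.ConnectException;
import java.util.concurrent.TimeoutException;

// Illustrative only: hypothetical types, not StackSaga's actual API.
final class ErrorClassifier {

    enum Outcome { PAUSE_FOR_RETRY, COMPENSATE }

    static Outcome classify(Throwable error) {
        // Transient conditions: downstream unavailable, timeouts, connection loss.
        if (error instanceof TimeoutException || error instanceof ConnectException) {
            return Outcome.PAUSE_FOR_RETRY;   // retryable: replay the transaction later
        }
        // Everything else (validation, authorization, business rules) is permanent.
        return Outcome.COMPENSATE;            // non-retryable: undo the completed spans
    }
}
```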
Why Instance-Level Retrying Is Insufficient
At first glance it may seem easy to retry a transaction on the same node that originally processed it. However, this approach has a critical flaw.
Standard nodes are ephemeral — they are short-lived, temporary, and designed to be created, destroyed, or replaced on-demand based on traffic. This means that when the time comes to retry a transaction, the node that originally processed it may no longer be running. The transaction then remains paused indefinitely with no node to reclaim it.
To make this concrete: imagine 3 instances are running and each has saved some transactions for retrying due to network issues. If a scheduler is triggered on each instance to replay its own transactions, but one instance has since been terminated due to scale-down, the transactions it owned will never be picked up by the other instances.
StackSaga solves this by decoupling transaction retrying from the originating node entirely, using a three-component architecture described in the next sections.
Deployment Modes
Because standard nodes are ephemeral, StackSaga defines two distinct deployment modes for an orchestrator service:
- Standard-Node: The regular orchestrator service (with StackSaga enabled) that processes transactions and handles business logic. Standard nodes can be freely scaled up or down based on traffic.
- Retry-Node: A specialised node that handles transaction retries in addition to regular transaction processing. Retry-Nodes are intended to be more stable and long-lived than Standard-Nodes, ensuring reliable retry management even as the rest of the cluster scales.
In a typical deployment you might have 5–1000 Standard-Nodes handling regular traffic and 2 Retry-Nodes that remain stable to manage retries. There is no technical restriction on scaling Retry-Nodes up or down — the distinction is operational: they are simply the nodes you choose to keep stable.
Note: Retrying is the worst-case scenario for a transaction, not the regular flow. See Failure Probability and Retry Necessity in Saga-Based Distributed Transactions for an analysis of retrying frequency and its weight relative to normal processing.
The next challenge is how retry responsibility is safely distributed across the cluster of Retry-Nodes without any two nodes conflicting over the same transaction. This is where Token Ring Partitioning comes in.
Token Ring Partitioning
Overview
StackSaga uses the Murmur3 consistent hashing algorithm to assign every transaction to a deterministic position in a 64-bit token ring. This token is computed when the transaction is initialised (by any Orchestrator) and stored alongside the transaction record in the database, together with the cluster name and region of the originating Orchestrator. The retry subsystem uses this stored token, cluster, and region to determine which Orchestrator instance is responsible for retrying a given transaction at any point in time.
The full token space spans from \(-2^{63}\) to \(2^{63}-1\):
-9,223,372,036,854,775,808 → 9,223,372,036,854,775,807
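As a rough illustration of how a transaction ID maps onto this space, the sketch below hashes an ID to a signed 64-bit token using Guava's Murmur3 implementation. This is an assumption for illustration only; StackSaga's exact hash variant, seed, and input encoding may differ.

```java
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

final class TokenCalculator {

    // Maps a transaction ID to a position on the 64-bit token ring.
    // Illustrative only; StackSaga's internal hashing may differ in detail.
    static long tokenFor(String transactionId) {
        return Hashing.murmur3_128()
                .hashString(transactionId, StandardCharsets.UTF_8)
                .asLong(); // covers the full signed range -2^63 .. 2^63-1
    }
}
```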
The diagram below illustrates how Retry-Nodes are assigned non-overlapping segments of the token ring. In this example there are 4 Retry-Nodes, each responsible for a distinct quarter of the token space:
- Hash space: -9223372036854775808 → 9223372036854775807
- Total range size: 18446744073709551616
- Partition size per node (4 nodes): 4611686018427387904
| Node | Slot | Token Range Start | Token Range End |
|---|---|---|---|
| A | 1 | -9223372036854775808 | -4611686018427387905 |
| B | 2 | -4611686018427387904 | -1 |
| C | 3 | 0 | 4611686018427387903 |
| D | 4 | 4611686018427387904 | 9223372036854775807 |
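A minimal sketch of this equal division is shown below, using BigInteger to avoid 64-bit overflow. It assumes the last slice simply absorbs any remainder; for n = 4 it reproduces the boundaries in the table above.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

final class RingPartitioner {

    record TokenRange(long start, long end) {}   // both ends inclusive

    // Splits the full signed 64-bit ring into n contiguous, non-overlapping ranges.
    static List<TokenRange> partition(int n) {
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        BigInteger max = BigInteger.valueOf(Long.MAX_VALUE);
        BigInteger size = max.subtract(min).add(BigInteger.ONE); // 2^64 tokens in total
        BigInteger step = size.divide(BigInteger.valueOf(n));

        List<TokenRange> ranges = new ArrayList<>(n);
        BigInteger start = min;
        for (int i = 0; i < n; i++) {
            // The last slice absorbs any remainder so the ring is fully covered.
            BigInteger end = (i == n - 1) ? max : start.add(step).subtract(BigInteger.ONE);
            ranges.add(new TokenRange(start.longValueExact(), end.longValueExact()));
            start = end.add(BigInteger.ONE);
        }
        return ranges;
    }
}
```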
Two-Level Partitioning
The token ring is divided in two levels:
1. Master → Retry-Coordinator-Slave: The Retry-Coordinator-Master divides the full ring equally across registered Retry-Coordinator-Slave nodes.
2. Retry-Coordinator-Slave → Orchestrator: Each Retry-Coordinator-Slave further divides its slice equally across its registered Orchestrator instances.
Master → Retry-Coordinator-Slave Example (3 Retry-Coordinator-Slaves)
(Boundary values below are illustrative, assuming equal integer division of the ring with the final slice absorbing the remainder.)

| Retry-Coordinator-Slave Node | Start Token (Inclusive) | End Token (Inclusive) |
|---|---|---|
| Retry-Coordinator-Slave Node 1 | -9223372036854775808 | -3074457345618258604 |
| Retry-Coordinator-Slave Node 2 | -3074457345618258603 | 3074457345618258601 |
| Retry-Coordinator-Slave Node 3 | 3074457345618258602 | 9223372036854775807 |
Note: Partition boundaries are recalculated automatically each time the number of registered Retry-Coordinator-Slaves changes (at the next 30-second publish cycle).
Retry-Coordinator-Slave → Orchestrator Example (Retry-Coordinator-Slave Node 1, 2 Orchestrators)
(Node 1's slice from the table above is split in half under the same assumption.)

| Orchestrator Instance | Start Token (Inclusive) | End Token (Inclusive) |
|---|---|---|
| Order Service Instance 1 | -9223372036854775808 | -6148914691236517207 |
| Order Service Instance 2 | -6148914691236517206 | -3074457345618258604 |
All other Retry-Coordinator-Slave nodes apply the same logic for their own registered Orchestrator instances.
Note: It is not required to have multiple physical Ring-Coordinator machines to achieve partitioning.
The Three-Component Retry Architecture
Token ring partitioning is managed collaboratively by three components. Here is a brief orientation before the detailed descriptions:
| Component | Responsibility |
|---|---|
| Retry-Coordinator-Master | Single authority for the domain. Divides the full token ring equally across Retry-Coordinator-Slaves every 30 seconds. Tells new Orchestrators which Retry-Coordinator-Slave to connect to (Round Robin). |
| Retry-Coordinator-Slave | Relay between Master and Orchestrators. Receives its ring slice from the Master and divides it further across its connected Orchestrators. Delivers each Orchestrator its personal sub-range every minute. |
| Orchestrator (Retry-Node) | Holds its sub-range for the current minute. Polls the database for paused transactions whose token falls in that sub-range and re-invokes them, without ever conflicting with another instance. |
Each component does one small job cleanly. No component reaches into another’s responsibility. The result is a retry system that scales horizontally, tolerates partial failures, and guarantees that every paused transaction is eventually retried, with exactly one owner at any given time.
Retry-Coordinator-Master
What It Is
The Retry-Coordinator-Master is the single authority for retry coordination within a microservice domain. There is exactly one Master per domain (or per virtual cluster — see Virtual Clusters).
The Master has no knowledge of your business logic and never touches the transaction database. Its entire responsibility is coordination: dividing the token ring, tracking which Retry-Coordinator-Slaves are alive, and pointing new Orchestrators to the right Retry-Coordinator-Slave when they start up.
It is built on a fully non-blocking Netty stack via RSocket, which means a single Master instance can comfortably handle thousands of concurrent Retry-Coordinator-Slave connections — far beyond what most deployments will ever need.
Three Responsibilities
Responsibility 1 — Maintain the Retry-Coordinator-Slave registry.
Every Retry-Coordinator-Slave that starts up connects to the Master via a persistent RSocket request-stream and registers itself.
The Master keeps a live registry of all connected Retry-Coordinator-Slaves.
When a Retry-Coordinator-Slave disconnects (crashes, restarts), the Master notices immediately and updates its registry.
Responsibility 2 — Publish token ring partitions every 30 seconds. At the 30th second of every minute the Master fires a timer. It looks at the current list of registered Retry-Coordinator-Slaves, divides the full 64-bit token ring equally among them, and pushes each Retry-Coordinator-Slave its assigned range, tagged as valid for the next full minute. This 30-second lead gives the entire cluster time to receive and prepare the new assignments before they go live.
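A sketch of such a publish timer is shown below, assuming a Spring scheduler with scheduling enabled; StackSaga's internal mechanism may differ. The six-field cron expression fires at second 30 of every minute, and the ranges produced are tagged as valid from the start of the following minute.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Illustrative sketch only; not StackSaga's actual scheduler class.
@Component
class PublishCycle {

    // Fires at second 30 of every minute.
    @Scheduled(cron = "30 * * * * *")
    void publishRanges() {
        // Ranges published now become valid at the start of the next minute.
        Instant validFrom = Instant.now()
                .truncatedTo(ChronoUnit.MINUTES)
                .plus(1, ChronoUnit.MINUTES);

        // 1. Divide the full ring across the currently registered Retry-Coordinator-Slaves.
        // 2. Push each Retry-Coordinator-Slave its range, tagged with validFrom.
    }
}
```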
Responsibility 3 — Direct new Orchestrators to a Retry-Coordinator-Slave.
When a new Orchestrator instance starts up, it sends a one-shot Request-Response message to the Master asking: "Which Retry-Coordinator-Slave should I connect to?"
The Master picks a Retry-Coordinator-Slave using a Round Robin strategy — spreading Orchestrators evenly across available Retry-Coordinator-Slaves — and responds with the Retry-Coordinator-Slave’s host and port.
After this single exchange the Orchestrator never contacts the Master again; all future communication goes through the Retry-Coordinator-Slave directly.
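The Round Robin selection itself can be as simple as the sketch below; the SlaveInfo type and the way the registry is supplied are hypothetical placeholders.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of the Master's Round Robin assignment; names are illustrative.
final class SlaveAssigner {

    record SlaveInfo(String host, int port) {}

    private final AtomicInteger cursor = new AtomicInteger();

    // Spreads Orchestrators evenly by advancing a shared cursor on every lookup.
    SlaveInfo nextFor(List<SlaveInfo> registeredSlaves) {
        int index = Math.floorMod(cursor.getAndIncrement(), registeredSlaves.size());
        return registeredSlaves.get(index);
    }
}
```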
What It Knows and Does Not Know
Knows:

- The registered Retry-Coordinator-Slave nodes and their connection status.
- How to divide the 64-bit token ring equally using Murmur3 partitioning.
- Which Retry-Coordinator-Slave to assign to each new Orchestrator (Round Robin).

Does not know:

- Individual Orchestrator instances after the initial lookup.
- The transaction database.
- Your business logic or transaction structure.
- Other domain Masters — each Master is isolated to its own domain.
Master Node Failure
When the Master fails:
- All registered Retry-Coordinator-Slaves lose their connection to the Master.
- Retry-Coordinator-Slaves stop receiving token range updates. Their last known range remains cached but expires at the next minute boundary.
- Orchestrators continue retrying using their last received sub-range until it expires.
- New Orchestrator instances starting up will fail to obtain a Retry-Coordinator-Slave assignment.
Note: A Master failure affects only that domain’s retry subsystem. Other microservice domains continue operating normally. Primary transaction execution (new transactions) is unaffected by Master availability. For systems where even this scoped failure window is unacceptable, see Virtual Clusters.
Retry-Coordinator-Slave
What It Is
The Retry-Coordinator-Slave is a dedicated infrastructure service — a small, lightweight process deployed alongside your microservice. It has no awareness of your business logic whatsoever. It does not touch the transaction database. It does not execute transactions.
Its entire purpose is to act as a distribution bridge between the Retry-Coordinator-Master and the Orchestrator instances: it receives a large token range from the Master and breaks it into smaller, non-overlapping sub-ranges, one for each connected Orchestrator.
Four-Step Routine
Step 1 — Register with the Master.
On startup, the Retry-Coordinator-Slave opens a persistent RSocket request-stream connection to the Master, announces itself, and waits for token range updates.
Step 2 — Receive a token range slice. Every 30 seconds the Master publishes updated ranges. The Retry-Coordinator-Slave receives its slice — a contiguous portion of the full 64-bit ring — and is told which minute that slice is valid for.
Step 3 — Accept Orchestrator registrations. When an Orchestrator starts up and connects, the Retry-Coordinator-Slave records it in its local registry.
Step 4 — Divide and deliver. With its slice in hand and a list of connected Orchestrators, the Retry-Coordinator-Slave divides the slice equally — one non-overlapping sub-range per Orchestrator — and pushes each sub-range to the respective Orchestrator over the persistent stream.
This cycle repeats every minute, keeping every connected Orchestrator informed of its current retry ownership window.
Why Have a Retry-Coordinator-Slave Layer At All?
Your service pods (Orchestrators) come and go frequently — they restart on deployments, scale up under load, scale down at night. If the Master had to track every individual pod, it would be overwhelmed with registration and deregistration events and would constantly be recalculating the entire ring.
The Retry-Coordinator-Slave acts as a buffer. The Master only needs to track a small, stable set of Retry-Coordinator-Slave nodes. All the volatility of your application pods is absorbed by the Retry-Coordinator-Slave, which quietly adjusts its sub-range distribution whenever an Orchestrator joins or leaves — without disturbing the Master at all.
What It Knows and Does Not Know
Knows:

- The Master it is registered with (stacksaga.agent.slave.target-master.host / .port).
- The Orchestrator instances currently connected to it.
- The token range it has been assigned and how to divide it.

Does not know:

- Your business logic or transaction structure.
- The transaction database.
- Other Retry-Coordinator-Slave nodes — each Retry-Coordinator-Slave works independently.
Retry-Coordinator-Slave Node Failure
When a Retry-Coordinator-Slave crashes or becomes unreachable:
- All Orchestrators connected to that Retry-Coordinator-Slave lose their stream. Their sub-ranges become stale; retry polling pauses for those instances.
- The Master detects the lost connection and applies a lazy rebalance strategy:
  - If the lost Retry-Coordinator-Slave is not the last index in the registry, the Master assumes it will recover shortly (especially true in Kubernetes) and does not immediately rebalance.
  - The Master continues sending cached token ranges to the remaining Retry-Coordinator-Slaves.
  - The crashed Retry-Coordinator-Slave’s token range is frozen — no Orchestrator covers it during the outage.
- When the Retry-Coordinator-Slave restarts it reconnects to the Master and is treated as a new registration.
- At the next 30-second publish cycle the Master recalculates and redistributes ranges across all currently registered Retry-Coordinator-Slaves, including the restarted one.
Note: During a Retry-Coordinator-Slave outage, transactions whose tokens fall within the frozen range are not retried until the range is covered again. These transactions remain safely stored in the database with status FAILED_WITH_RETRYABLE_ERROR.
Orchestrator (Retry-Node)
What It Is
The Orchestrator is your microservice, augmented by StackSaga to execute distributed transactions. When deployed as a Retry-Node, it also manages transaction retrying. It runs two jobs side by side at all times.
Two Concurrent Jobs
Job 1 — Execute transactions (regular orchestrator role). When a new business operation arrives (a customer places an order, for example), the Orchestrator creates a transaction, breaks it into spans, and executes them one by one — calling downstream services, updating databases, firing events. If a span fails with a permanent error, it starts a compensation sequence to undo what was already done.
Job 2 — Retry paused transactions. Some spans fail not because of a business error but because a resource was temporarily unavailable. StackSaga does not discard these transactions — it marks them as paused and saves them in the database. The Orchestrator periodically checks the database for these paused transactions and re-invokes them. Crucially, it only checks the transactions it is currently responsible for, based on the token sub-range it holds for the current time window. This is how StackSaga guarantees that multiple running instances never accidentally retry the same transaction twice.
What It Knows and Does Not Know
Knows:

- Its own business logic and transaction spans.
- The token sub-range it currently holds (delivered by its Retry-Coordinator-Slave).
- The transaction database — it reads and writes transaction records directly.

Does not know:

- Other Orchestrator instances — there is no peer-to-peer communication between service instances.
- The Retry-Coordinator-Master — after the initial startup handshake, the Orchestrator never contacts the Master again (unless the Retry-Coordinator-Slave connection is lost and must be re-established).
- How the token ring works internally — it simply receives a range and uses it.
Orchestrator (Retry-Node) Failure
When an Orchestrator instance fails and restarts:
- It performs a fresh Request-Response lookup to the Master for an available Retry-Coordinator-Slave (potentially receiving a different Retry-Coordinator-Slave due to Round Robin).
- It opens a new persistent stream to the newly assigned Retry-Coordinator-Slave.
- It receives the current sub-range at the next 30-second cycle and resumes retrying.
Any transactions being actively retried at the moment of failure will be re-attempted in the next window by whichever Orchestrator instance acquires that token range.
Registration & Token Distribution Flow
The sequence diagram below ties the three components together, showing the full lifecycle from startup to active retry execution.
Step-by-Step Description
Phase 1 — Retry-Coordinator-Slave Registration
1. On startup, each Retry-Coordinator-Slave opens a persistent RSocket request-stream connection to the Retry-Coordinator-Master.
2. The Master records the Retry-Coordinator-Slave in its registry.
3. The Retry-Coordinator-Slave remains connected and passively waits for token range updates.
Phase 2 — Orchestrator Registration
1. On startup, each Orchestrator sends an RSocket Request-Response message to the Master requesting an available Retry-Coordinator-Slave assignment.
2. The Master applies Round Robin to select a Retry-Coordinator-Slave and returns that Retry-Coordinator-Slave’s host and port. If the same Orchestrator instance restarts, it may receive a different Retry-Coordinator-Slave assignment, as the Round Robin pointer advances. This ensures balanced Orchestrator distribution across Retry-Coordinator-Slaves over time.
3. The Orchestrator opens a persistent RSocket request-stream connection to its assigned Retry-Coordinator-Slave and begins listening for sub-range updates.
Phase 3 — Token Range Distribution
1. At the 30th second of every minute the Master fires its publish timer.
2. The Master partitions the full token ring equally among all registered Retry-Coordinator-Slaves and sends each Retry-Coordinator-Slave its range, tagged with validForMinute = T+1.
3. Each Retry-Coordinator-Slave divides its range equally among its registered Orchestrators and pushes sub-range updates downstream.
4. Each Orchestrator stores the received sub-range and activates it at the start of minute T+1.
Phase 4 — Retry Execution Loop
Once the time window is active, each Orchestrator enters its retry polling loop:
1. Query the transaction store for records where:
   - token is within [subRangeStart, subRangeEnd]
   - cluster matches the Orchestrator’s configured cluster name
   - region matches the Orchestrator’s configured region
   - status = FAILED_WITH_RETRYABLE_ERROR
2. For each matching transaction, re-invoke the next pending span.
3. Repeat until the time window expires (the sub-range is superseded by the next update).
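A condensed sketch of this loop is shown below. TransactionStore, PausedTransaction, and the method names are hypothetical placeholders for the event-store access layer; the essential point is that every query is bounded by the sub-range, cluster, and region the instance currently owns.

```java
import java.util.List;

// Sketch of the Phase 4 polling loop; names are illustrative placeholders.
final class RetryPoller {

    interface TransactionStore {
        List<PausedTransaction> findRetryable(long startToken, long endToken,
                                              String cluster, String region);
    }

    record PausedTransaction(String transactionId) {}

    private final TransactionStore store;
    private final String cluster;
    private final String region;

    RetryPoller(TransactionStore store, String cluster, String region) {
        this.store = store;
        this.cluster = cluster;
        this.region = region;
    }

    // Only transactions inside the currently held sub-range are considered,
    // so no other instance can claim the same records during this window.
    void pollOnce(long subRangeStart, long subRangeEnd) {
        for (PausedTransaction tx : store.findRetryable(subRangeStart, subRangeEnd, cluster, region)) {
            reInvokeNextPendingSpan(tx);
        }
    }

    private void reInvokeNextPendingSpan(PausedTransaction tx) {
        // Hand the transaction back to the engine to resume from the failed span.
    }
}
```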
RSocket Communication Patterns
StackSaga uses RSocket for all inter-component communication, chosen for its support for persistent, reactive, bidirectional streams backed by a non-blocking Netty transport.
| Connection | Interaction Model | Description |
|---|---|---|
| Retry-Coordinator-Slave → Master | Request-Stream | Persistent subscription. The Retry-Coordinator-Slave registers with the Master and keeps the stream open to receive token range updates. |
| Orchestrator → Master | Request-Response | One-shot lookup. The Orchestrator requests an available Retry-Coordinator-Slave node assignment from the Master on startup. |
| Orchestrator → Retry-Coordinator-Slave | Request-Stream | Persistent subscription. The Orchestrator subscribes to its assigned Retry-Coordinator-Slave and keeps the stream open to receive sub-range updates. |
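The snippet below illustrates the two interaction models from the table using the io.rsocket Java client. The host name, port, and payload contents are placeholders; StackSaga's actual message formats are not shown.

```java
import io.rsocket.RSocket;
import io.rsocket.core.RSocketConnector;
import io.rsocket.transport.netty.client.TcpClientTransport;
import io.rsocket.util.DefaultPayload;

// Illustrative use of the Request-Response and Request-Stream models.
final class RSocketPatterns {

    public static void main(String[] args) {
        RSocket master = RSocketConnector.create()
                .connect(TcpClientTransport.create("retry-coordinator-master", 7000))
                .block();

        // Request-Response: one-shot lookup of an available Retry-Coordinator-Slave.
        master.requestResponse(DefaultPayload.create("which-slave?"))
                .doOnNext(p -> System.out.println("assigned: " + p.getDataUtf8()))
                .block();

        // Request-Stream: persistent subscription for token (sub-)range updates.
        master.requestStream(DefaultPayload.create("register"))
                .doOnNext(p -> System.out.println("range update: " + p.getDataUtf8()))
                .blockLast();
    }
}
```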
Token Time Window & Conflict Avoidance
The Publish Cycle
The Retry-Coordinator-Master runs a repeating timer that fires at the 30th second of every minute. At each trigger the Master:
1. Recalculates the token ring partition based on currently registered Retry-Coordinator-Slaves.
2. Publishes the new token range to each Retry-Coordinator-Slave.
3. Each Retry-Coordinator-Slave recalculates its sub-ranges and pushes updates to its registered Orchestrators.
The published range is labelled as valid for the next full minute (T+1), not the current minute.
Why the 30-Second Offset?
Publishing at the 30th second and marking the range as valid for the next minute creates a 30-second preparation window. This window accounts for worst-case communication delays: Retry-Coordinator-Slave distribution, Orchestrator updates, and network latency in large clusters.
By the time minute T+1 begins, all Orchestrators are guaranteed to have received their new sub-ranges and are ready to start polling.
Timeline example:

Minute T, second 30      → Master publishes ranges (valid for T+1)
Minute T, seconds 31–59  → Retry-Coordinator-Slaves and Orchestrators receive and cache the new ranges
Minute T+1, second 0     → Orchestrators begin polling with the new ranges
Minute T+1, second 30    → Master publishes ranges for T+2
...
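The sketch below shows how an Orchestrator might guard its polling with the window semantics above; the class and field names are illustrative, not StackSaga's internal API.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Illustrative sketch of a time-windowed sub-range held by an Orchestrator.
final class TokenWindow {

    private final long startToken;
    private final long endToken;
    private final Instant validFromMinute;  // start of minute T+1, as tagged by the Master

    TokenWindow(long startToken, long endToken, Instant validFromMinute) {
        this.startToken = startToken;
        this.endToken = endToken;
        this.validFromMinute = validFromMinute;
    }

    // A range published at minute T, second 30 only becomes usable at minute T+1;
    // it is superseded as soon as the next update arrives.
    boolean isActive(Clock clock) {
        Instant currentMinute = Instant.now(clock).truncatedTo(ChronoUnit.MINUTES);
        return !currentMinute.isBefore(validFromMinute);
    }

    boolean owns(long token) {
        return token >= startToken && token <= endToken;
    }
}
```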
Multi-Region Deployment
For systems deployed across multiple geographic regions, StackSaga uses the region property to scope retry ownership.
Every component — Master, Retry-Coordinator-Slave, and Orchestrator — is stamped with the region it belongs to.
How Region Scoping Works
When a transaction is created it inherits the region of the Orchestrator that created it. That region value is stored in the transaction record in the database.
During the retry polling loop, each Orchestrator adds region as a mandatory filter alongside the token range:
SELECT *
FROM transactions
WHERE token BETWEEN :subRangeStart AND :subRangeEnd
AND region = :region
AND cluster = :cluster
AND status = 'FAILED_WITH_RETRYABLE_ERROR';
This guarantees that an Orchestrator in us-central never picks up and retries a transaction that was originally created by an Orchestrator in asia-south — even if both regions share the same database (e.g., a globally replicated Cassandra cluster).
Virtual Clusters
Why Virtual Clusters?
In very large deployments the single-master-per-domain model presents two concerns as the system grows:
1. Scale — a single Master must manage all Retry-Coordinator-Slave connections for the domain. Although the Master’s non-blocking Netty stack via RSocket can handle thousands of concurrent Retry-Coordinator-Slave connections, some extreme-scale deployments may want to distribute this coordination load further.
2. Fault isolation — if the domain’s single Master goes down, the entire domain’s retry subsystem pauses until it recovers. Some systems require that a partial infrastructure failure never affects more than a defined fraction of retry capacity.
Virtual clusters address both concerns by dividing a single physical deployment into multiple independent master-slave groups, each operating as a completely separate retry coordination unit within the same region.
What Is a Virtual Cluster?
A virtual cluster is a named group of Retry-Coordinator-Master, Retry-Coordinator-Slave, and Orchestrator instances that operate in full isolation from other groups.
The group identity is declared via the stacksaga.instance.cluster property.
All three components sharing the same cluster name form one logical retry cluster:
- The Master only accepts registrations from Retry-Coordinator-Slaves with a matching cluster name.
- The Retry-Coordinator-Slave connects only to the Master of its cluster (stacksaga.agent.slave.target-master.host / .port).
- The Orchestrator connects only to the Master of its cluster for the initial Retry-Coordinator-Slave lookup, and subsequently to the Retry-Coordinator-Slave assigned by that Master.
- Retry queries are filtered by cluster name in addition to token range and region.
Two virtual clusters running in the same physical Kubernetes cluster are completely isolated — they share no state, no connections, and no retry responsibility.
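As a small illustration of that isolation rule, a Master might guard registrations as sketched below; the class and method names are hypothetical and only demonstrate matching on the cluster name.

```java
// Sketch of the isolation rule: a Master only accepts Retry-Coordinator-Slave
// registrations that carry its own cluster name. Names are illustrative.
final class RegistrationGuard {

    private final String clusterName; // e.g. the value of stacksaga.instance.cluster

    RegistrationGuard(String clusterName) {
        this.clusterName = clusterName;
    }

    void register(String slaveClusterName, String slaveHost) {
        if (!clusterName.equals(slaveClusterName)) {
            // Slaves from other virtual clusters are rejected; the groups share no state.
            throw new IllegalArgumentException("Slave " + slaveHost + " belongs to cluster "
                    + slaveClusterName + ", not " + clusterName);
        }
        // ...add the slave to this Master's registry...
    }
}
```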
Deployment Topologies
- Multi-Region Single-Cluster: The simplest multi-region setup — one virtual cluster per region. This is the baseline configuration for geographic distribution.
- Multi-Region Multi-Virtual-Cluster: For the highest levels of scale and fault isolation, each physical region can host multiple virtual clusters. If any single Master fails, only the fraction of transactions belonging to that virtual cluster is affected — the remaining virtual clusters in the same region continue retrying normally.
Transaction Ownership Across Virtual Clusters
A transaction is permanently bound to the region and cluster of the Orchestrator that created it. No other virtual cluster — even within the same region — will ever attempt to retry it.
The retry query always includes both region and cluster as mandatory filters:
SELECT *
FROM transactions
WHERE token BETWEEN :subRangeStart AND :subRangeEnd
AND region = 'us-central' -- physical region
AND cluster = 'us-central-c1' -- virtual cluster
AND status = 'FAILED_WITH_RETRYABLE_ERROR';
This makes the system particularly well-suited to Apache Cassandra deployments where region and cluster form part of the partition key, enabling the database to route queries directly to the correct nodes with zero full-table scans.
Glossary
| Term | Definition |
|---|---|
| Span | A single, trackable unit of work within a distributed transaction. A transaction is composed of one or more sequential spans. |
| Token | A 64-bit integer derived from the Murmur3 hash of a transaction ID. Used to deterministically assign the transaction to a retry owner. |
| Token Ring | The full 64-bit integer space divided into contiguous, non-overlapping ranges. StackSaga uses the Murmur3Partitioner model, identical in principle to Apache Cassandra’s token ring. |
| Token Range | A contiguous subset of the token ring, assigned to a Retry-Coordinator-Slave or Orchestrator for a given time window. |
| Time Window | A one-minute interval during which an Orchestrator holds exclusive ownership of a sub-token range. |
| Publish Cycle | The repeating event at the 30th second of every minute in which the Master redistributes token ranges. |
| Compensation | The rollback sequence executed when a non-retryable error occurs in the primary transaction flow. |
| Murmur3Partitioner | A consistent hashing scheme using the MurmurHash3 algorithm to evenly distribute keys across the token ring. |
| Region | A physical deployment boundary (e.g., us-central, asia-south) that scopes retry ownership, so instances in one region never retry transactions created in another region. |
| Virtual Cluster | A named logical group of Master, Retry-Coordinator-Slave, and Orchestrator instances that operate in complete isolation from other groups. Declared via the stacksaga.instance.cluster property. |
| RSocket | A binary application-level protocol supporting multiple interaction models: request-response, fire-and-forget, request-stream, and request-channel. |
| Lazy Rebalance | The Master’s strategy of deferring token ring recalculation until the next scheduled publish cycle, rather than reacting immediately to Retry-Coordinator-Slave disconnections. |
| Standard-Node | An ephemeral Orchestrator instance that handles transaction processing and is scaled freely based on traffic. |
| Retry-Node | A stable, long-lived Orchestrator instance that handles both transaction processing and transaction retry management. |
StackSaga Module Interaction for Transaction Re-Invocation
The diagram below shows, at a high level, how the StackSaga modules interact with each other for transaction re-invocation. It is simplified to show only the modules and components relevant to transaction re-invocation and does not include all details of the interactions.
| Step | Description |
|---|---|
| 1 | stacksaga-ring-coordinator (Slave) publishes the Orchestrator’s assigned token sub-range over a persistent RSocket stream to the Orchestrator’s stacksaga-ring-connector module. |
| 2 | ReInvokeTaskManager, in one of the stacksaga-{database}-support implementations, long-polls the event-store for transactions that should be re-invoked based on the assigned token sub-range. |
| 3 | The transactions that are eligible for re-invocation are handed over to the TransactionReInvokeManager in one of the stacksaga-{impl}-support modules. |
| 4 | TransactionReInvokeManager rebuilds the transaction snapshot to the point of failure by requesting data via the event-store service from the stacksaga-{database}-support module. |
| 5 | Once the transaction snapshot is rebuilt, TransactionReInvokeManager sends the snapshot to the ExecutionManager within the same module. |
| 6 | Finally, the transaction is executed by the engine as usual. |
Even though stacksaga-env-{impl}-support is not shown in the diagram, it is responsible for providing the deployment-environment-specific metadata.