Transaction Retrying.
In a microservices architecture, retrying is a mechanism used to handle transient transaction failures by automatically reattempting an operation that has failed. Transient failures are short-lived issues that are expected to resolve themselves after a short delay, such as network timeouts, temporary resource unavailability, or overloaded systems.
Transaction Retrying approaches in StackSaga
In StackSaga, you can retry in two different ways.
-
Immediate Retry:
The system attempts to retry the request several times in quick succession, possibly with minimal or no delay between retries. Immediate Retry is the approach you get with Spring Retry, Resilience4j, or any other general-purpose retry library. It is the retry strategy that handles retries synchronously, immediately after the initial failure. You can use Immediate Retry in StackSaga as usual, without any disruption.
If, even after Immediate Retry, the request cannot be completed due to a resource-unavailable issue, you can hand the request over to the StackSaga framework to be handled with Asynchronous Retry.
-
Asynchronous Retry:
If the request still fails even after immediate retries, it is scheduled for retrying later in a background process with the help of the StackSaga agent. In this section, we discuss the asynchronous retry approach provided by StackSaga.
Why Asynchronous Retrying
You already know that, in a microservice environment, a single transaction (business transaction) usually consists of multiple atomic transactions (multiple subtasks). That means that even a single request may internally consist of multiple subtasks that must be completed by communicating with other microservices.
While executing all the atomic transactions, if one of the subtasks fails due to a Resource-Unavailable error, it can usually be retried instead of stopping the task with an exception. Giving up the task because of an exception is not as easy in a microservice environment as it is in a standalone application with a single database: it usually causes data inconsistency, because one task consists of multiple sub-tasks (multiple atomic transactions). If one of the atomic transactions fails, all the other atomic transactions that have been executed successfully so far must be rolled back (undone).
If an exception occurs during execution, the main consideration is identifying the exception type: whether it is a Resource-Unavailable exception or not. If it is a Resource-Unavailable exception, the underlying issue will be resolved soon and the execution can then complete successfully. As mentioned above, if we can identify the exception as a Resource-Unavailable exception, we can wait some time and re-invoke the execution. Replaying tasks (transactions/operations) helps to achieve eventual consistency without giving up the transaction.
To identify whether an exception is retryable or not, NonRetryableExecutorException and RetryableExecutorException are used.
If the executor throws a RetryableExecutorException, the saga engine keeps the transaction in the event-store for retrying.
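As a rough illustration, an executor can map infrastructure failures to a retryable exception and business failures to a non-retryable one. The sketch below is hypothetical: the payment client, the helper types, and the exception constructors are assumptions for illustration, not the exact StackSaga API.

// Hypothetical sketch (not the exact StackSaga API): how an executor could signal
// a retryable vs. a non-retryable failure to the saga engine.
public class MakePaymentExecutor {

    public void doProcess(Order order) {
        try {
            chargeCard(order);                              // call to the payment-service
        } catch (PaymentServiceUnavailableException e) {
            // Resource-unavailable failure: the engine keeps the transaction in the
            // event-store and the StackSaga agent retries it later.
            throw new RetryableExecutorException(e);        // assumed constructor
        } catch (PaymentDeclinedException e) {
            // Business failure: retrying will not help, so compensation should start.
            throw new NonRetryableExecutorException(e);     // assumed constructor
        }
    }

    // --- hypothetical helpers, for illustration only ---
    record Order(String orderId, java.math.BigDecimal amount) {}
    static class PaymentServiceUnavailableException extends RuntimeException {}
    static class PaymentDeclinedException extends RuntimeException {}
    void chargeCard(Order order) { /* HTTP call to the payment-service */ }
}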
StackSaga Agent
The StackSaga agent is the application that invokes the transactions for retrying. You already know that if a transaction fails with a network exception (Resource Unavailable), it can be replayed. That retrying part is done by the StackSaga agent service.
Retrying Transactions with StackSaga Agent
You know that the transactions are executed by the StackSaga framework. If a transaction cannot proceed due to a Resource-Unavailable exception, it is kept in the event-store for retrying. For instance, during the make-order process an exception occurs in the MakePaymentExecutor because the payment-service is unavailable for some reason. The order-service then saves the transaction in the event-store, with the help of the StackSaga framework, for retrying. Based on the given configuration, the agent application triggers schedulers that gather the transactions that should be retried. The transactions that are eligible for retrying are collected by the agent, and the agent distributes them to the available instances for retrying.
Sometimes you might wonder why the transactions cannot be re-invoked by the instance itself, without an agent. The short answer is that it can't be done, because instances are ephemeral in nature in a microservice architecture. For instance, imagine there are 10 instances running and 100 transactions have been saved temporarily by them due to some network issues. After a while, the scheduler is triggered on each instance to replay the transactions. At that moment the instance count may be different, because of scaling up or scaling down based on the load. Imagine there was an instance called "a" that created some retryable transactions before the scheduler triggered, and that instance is no longer running when the scheduler fires. Then the transactions created by instance "a" can never be run, because each instance is identified by a unique ID and the transaction is stored with that instance ID when it is initiated. Some transactions would therefore go missing. This is one of the major reasons why replaying cannot involve all available instances at the same time.
In the diagram below, you can see there are 7 transactions that should be exposed for retrying, created by 3 instances. The problem is that when the scheduler is triggered for retrying on each instance, only 2 instances are still running, instance:3491465 and instance:3491425. instance:3400001 was destroyed, so its transactions can never be exposed for retrying.
Here you can see, at a high level, how the StackSaga Agent gets involved and distributes the transactions among the available instances.
Note that it does not matter which instance initiated the transaction: all the transactions that should be retried are scanned by the agent and distributed to the available service instances.
The communication happens via the built-in HTTP endpoints provided by stacksaga-spring-boot-starter.
Ephemeral behavior of the instances.
In the context of microservices, ephemeral refers to the principle that a microservice can be created, destroyed, and replenished on demand, easily, quickly, and with no side effects.
Cluster grouping of StackSaga agent
You can isolate (group) the StackSaga agent applications at different isolation levels. The reason for isolating the agent applications is to share the load between the available clusters without any collision when the data is accessed by the agents. A cluster is the logical entity that shares the token range: if the cluster has only one node, the entire range is acquired by that single node; if the cluster has multiple nodes, the token range is divided among the available nodes.
You already know that there are mainly two parts the service-agent performs in retrying.
-
Fetching transactions from the event-store for retrying.
-
Sharing the data with the available utility services.
Cluster grouping only concerns how the data is accessed from the event-store. It is not about how the agent shares the data with the available utility services.
-
Single Node Cluster
-
Multi Node Cluster
Regional Clustering
The service agent application(s) are grouped by region in the Regional Clustering architecture. In this way, all the service agent nodes spread across the entire region are grouped as one cluster. Even if one region has multiple zones, it is considered as one cluster.
For instance, if your application is deployed in two different regions, you will have two clusters, and the transaction data is saved with the relevant region. If there are 10,000 transactions in region-1 and 12,000 transactions in region-2, the agent node(s) in the region-1 cluster will take care of the 10,000 transactions in region-1, and the agent node(s) in the region-2 cluster will take care of the 12,000 transactions in region-2.
Regional Clustering is the default mode in StackSaga, but you can change the Cluster Grouping Mode in the agent application. If you are using the Cassandra implementation, you have to change the mode in both the service and the agent application.
You can change the Cluster grouping mode as you want, without any extra effort, in the relational database implementations. But if you are using the Cassandra implementation, make sure to decide on the Cluster grouping mode first, because changing it later takes some effort due to the data structure.
Regional Clustering With Single Node
You can deploy one single service agent node for the entire region; that agent node then acquires all the transactions (the ones that should be retried) that were initiated under that particular region. That means the entire token range belongs to that single instance in the region.
The following diagram shows how it works.
You can see here that the order-service application has been deployed in the us-east-region. The instances are spread across both zones, us-east-region-zone-1 and us-east-region-zone-2. But there is only one order-service-agent for the entire region (it can be in either zone-1 or zone-2). This order-service-agent will take care of all the transactions that were initiated under this region.
Steps:
-
Fetch the transactions (in bulk) that should be retried and that were initiated under us-east-region. The agent knows which region it is running in.
-
After fetching the transactions, they are shared among the available instances in the region.
-
Each order-service instance receives the retry notification through an HTTP request and retries the transactions by accessing the event-store.
Zonal Clustering
Agent application(s) are grouped by zone in the Zonal Clustering architecture. For instance, if you deploy the order-service in 2 different zones within the same region, you have to deploy two order-service-agent applications, one for each zone.
If you deploy two agent nodes and, by mistake, set the mode in one agent application to Zonal and the other to Regional, it leads to collisions, because one agent accesses only the relevant zone's transactions while the other accesses all the transactions in the region. Then some transactions are exposed to two clusters. The rule is that one transaction can be exposed to only one cluster.
StackSaga agent deployment modes in the cloud environment
The concepts discussed here assume a database-per-service architecture.
If you are using a shared database across multiple services as the event-store, there may be some limitations or different approaches depending on your database implementation, especially if you are using the Cassandra implementation. [Cassandra Service Agent, MYSQL Service Agent]
You know that in a cloud environment we have to adapt our deployment to a geographically distributed system, which means we mainly deal with regions and zones. In the StackSaga ecosystem, it is very important to identify the geographically distributed deployments, to enhance performance and balance the load between the target nodes.
Cluster Mode
If multiple instances of the StackSaga service agent application are deployed in one isolation group, it is called cluster mode. It helps to scale the system horizontally based on the demand for retrying transactions. The difference between standalone mode and cluster mode is that in cluster mode, all transactions that should be retried are divided among the available service agent instances based on the token range each agent instance is responsible for. The token is generated from the transaction-id using the Murmur3 hashing algorithm, so the token is a hash between -2^63 and +2^63-1.
For instance, if you have 4 instances in the cluster (within the transaction isolation group), the entire range is divided evenly into 4 ranges, as follows (a small sketch of how this split can be computed is shown after the table):
Node (Agent Instance) | Token Range
---|---
order-service-agent-0 | -9223372036854775808 TO -4611686018427387905
order-service-agent-1 | -4611686018427387904 TO -1
order-service-agent-2 | 0 TO 4611686018427387903
order-service-agent-3 | 4611686018427387904 TO 9223372036854775807
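The following is a minimal, self-contained sketch (not the StackSaga implementation) of how such an even split of the 64-bit token space can be computed; hashing the transaction-id to a Murmur3 token is omitted and only hinted at in a comment.

// Minimal sketch (not the StackSaga implementation): divide the 64-bit Murmur3 token
// space evenly among N agent instances.
import java.math.BigInteger;

public class TokenRangeSketch {

    static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE); // -2^63
    static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE); // +2^63 - 1

    // Returns {start, end} (both inclusive) for the node at `index` out of `nodeCount` nodes.
    static BigInteger[] rangeFor(int index, int nodeCount) {
        BigInteger span = MAX.subtract(MIN).add(BigInteger.ONE)
                .divide(BigInteger.valueOf(nodeCount));
        BigInteger start = MIN.add(span.multiply(BigInteger.valueOf(index)));
        BigInteger end = (index == nodeCount - 1) ? MAX : start.add(span).subtract(BigInteger.ONE);
        return new BigInteger[]{start, end};
    }

    public static void main(String[] args) {
        int nodes = 4; // e.g. order-service-agent-0 .. order-service-agent-3
        for (int i = 0; i < nodes; i++) {
            BigInteger[] r = rangeFor(i, nodes);
            System.out.printf("order-service-agent-%d  %s TO %s%n", i, r[0], r[1]);
        }
        // The transaction-id is hashed to a long token with Murmur3 (hashing omitted here);
        // the agent whose range contains that token is responsible for retrying the transaction.
    }
}

Running this with 4 nodes prints exactly the four ranges listed in the table above.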
Cluster mode can also be deployed in two different ways: with regional isolation or zonal isolation.
Follow the links below to see how the token range is allocated to each instance in cluster mode in different environments.
-
Eureka environment
-
Kubernetes environment
Regional Isolation In Cluster Mode
Regional isolation means that the transactions are managed per region. Transactions are saved and retried based on the region where your service instance is located, and if there are transactions to retry, the retrying is managed per region. For instance, if you deploy a service called order-service in the us-region and the asia-region, the transactions are saved in the tables related to that specific region, and you have to deploy service-agent applications for both regions separately. If you deploy the service-agent application in only one region, the other region's transactions are not exposed to that service-agent application, because a service-agent application only considers the data related to the region it is running in.
By default, StackSaga supports Regional isolation architecture.
Zonal Isolation In Cluster Mode
Zonal isolation refers to isolating the transactions within a zone. For instance, if you deploy your order-service application in different zones in the same region, by default StackSaga isolates all the services within the region as one group. But if you use the Zonal isolation architecture, the instances are isolated per zone, and you have to deploy at least one service-agent application for each zone.
StackSaga agent supported environments
The StackSaga agent application can be run in both a Eureka service discovery environment and a Kubernetes service discovery environment.
StackSaga agent in the Eureka environment.
How the token range is shared among the available agents dynamically in the Eureka environment
You already know that the StackSaga Agent can be deployed in the Kubernetes environment as well as in the Eureka environment. If you are in the Eureka environment, you can use eureka-mode when deploying the agent application. In the Eureka environment the agent application can also be scaled horizontally based on the load, just like in the Kubernetes environment, and the entire token range is then shared among the available agent instances in the cluster. That task is done by the master instance of the cluster. In other words, in eureka-mode the agent applications are deployed in a master-slave architecture. The master agent service is responsible for updating the token ranges for each available instance in the cluster, and it also acts as a slave.
How does the master server calculate the token range?
All agent applications are registered with the Eureka server in the Eureka environment, so the master service has all other agent instances' details through the Eureka server. The master server periodically checks for instance changes based on the local Eureka service registry cache and updates the database with the relevant token range for each instance. The position of each instance is sorted based on the instance start time. For instance, if there are 5 StackSaga-agent instances in the cluster, the token range is divided with the help of the Murmur3 partitioning algorithm as follows:
Steps:
1. The master fetches the registry cache from the Eureka server. (It can be a single Eureka server or peers.)
2. The master periodically calculates the range for each instance based on their start timestamps and updates the ranges in the database.
Even though the scheduler is triggered from time to time, the database update is done only if the instance state has changed compared to the previous state.
For instance, if there is no instance change on the next schedule, the database is not updated; if some instances have changed, the ranges are re-arranged and updated in the database by the master agent instance.
3. All agent instances fetch their token range from the database periodically, based on the given configurations.
Finally, all the agent applications fetch the transactions that should be retried from the event-store, based on their token range.
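The following is a rough sketch of that master-side recalculation, under the assumption of hypothetical registry and repository interfaces (this is not the StackSaga or Eureka client API).

// Hypothetical sketch of the master's periodic range recalculation; the registry and
// repository interfaces are illustrative, not the StackSaga or Eureka client API.
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class MasterRangeScheduler {

    public interface AgentRegistry {          // e.g. backed by the local Eureka registry cache
        List<AgentInstance> currentAgents();
    }
    public interface RangeRepository {        // e.g. backed by the event-store database
        void saveRanges(List<String> orderedInstanceIds);
    }
    public record AgentInstance(String instanceId, long startedAt) {}

    private final AgentRegistry registry;
    private final RangeRepository repository;
    private List<String> previousOrder = List.of();

    public MasterRangeScheduler(AgentRegistry registry, RangeRepository repository) {
        this.registry = registry;
        this.repository = repository;
    }

    // Triggered periodically by the master instance.
    public void recalculate() {
        // Sort the instances by start time; each position maps to one even token range.
        List<String> order = registry.currentAgents().stream()
                .sorted(Comparator.comparingLong(AgentInstance::startedAt))
                .map(AgentInstance::instanceId)
                .collect(Collectors.toList());
        // Update the database only when the instance set has actually changed.
        if (!order.equals(previousOrder)) {
            repository.saveRanges(order);
            previousOrder = order;
        }
    }
}

The key design point, as described in the steps above, is that the database is written only when the sorted instance list differs from the previous run.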
How to configure the application as master and slave?
-
In the Master instance:
Set the instance-type as MASTER and set the Eureka instance ID to a fixed (static) value in the property file, like this:
stacksaga.cloud.agent.eureka.instance-type=MASTER
eureka.instance.instance-id=order-service-agent-us-east-master
It is recommended to use the following format for the master instance ID.
Format: ${service-name}-agent-${region}-master
Using the service name in the master instance ID helps to avoid collisions if you are using the same event-store for multiple services, because the slaves identify the master instance in the database by the master instance ID. Adding the region to the master instance ID guarantees region-based uniqueness.
-
In the Slave instance:
Set the instance-type as SLAVE and set the master's static ID for stacksaga.cloud.agent.eureka.master-id in the slave instance's property file, like this:
stacksaga.cloud.agent.eureka.instance-type=SLAVE
stacksaga.cloud.agent.eureka.master-id=order-service-agent-us-east-master
eureka.instance.instance-id=${spring.application.name}:${random.uuid}
For the slave instances, you can use any random ID as the instance id as you prefer.
Filtering Retryable transactions from the event-store.
You already know that the replay process is done by running schedulers. When the scheduler is triggered, the master node fetches the transactions that should be retried from the event-store.
When filtering the retryable transactions, the following things are considered.
-
Region: The transactions are filtered for the region of the master instance.
-
Transaction status: The transaction status should be reverting or processing.
Transaction Lifetime
All transactions are retried within a specific duration that you configure. After that duration, the transactions expire. This ensures that transactions that cannot be invoked successfully, even after many attempts, do not keep accumulating.
In the admin dashboard, you can see the expired transactions. After fixing the underlying issue, you can also extend the time so they are exposed to the retrying process again.
The diagram below illustrates this. After the transaction is initiated, it can be exposed to the schedulers within the specified time period. After that time period, the transaction is no longer exposed for re-invoking.
Transaction Leisure time
After a transaction is exposed for retrying, it is immediately shared with one of the available instances for execution.
If the transaction fails again due to a network issue after being executed by that particular instance, it could be exposed to the same scheduler again almost immediately.
There is no point in executing the transaction again and again within a short period of time while the target service is unavailable.
You can configure how long a transaction should be kept idle without being exposed to the scheduler.
That time is called the Transaction Leisure time.
The diagram below illustrates this. After the transaction is initiated, it is exposed for retrying if it fails due to a resource-unavailable issue. After being exposed, the transaction is frozen for a while (based on your configuration) as leisure time. During that time, no one can access the data for retrying. After the leisure time ends, the transaction is exposed for replaying again if it is still in one of the running statuses (processing or reverting).
Transaction Restore Retention Time
This determines how long the transaction should be kept waiting before deciding whether it unexpectedly crashed.
The value should be in hours.
Consider transactions in the event-store that have been shared for replaying but, even after 12 hours (the configured time), have not been retried with that token.
This is a very rare case.
For instance, after one of the available instances receives the transaction for replaying, that instance goes down due to a power cut without executing the transaction.
But the leader has already been updated to record that the transaction was shared with an instance for replay.
Due to that, the leader does not invoke those transactions again until the transaction is updated by the receiving instance or the crashedTransactionRestoreRetentionHours is exceeded.
Before collecting the transactions that should be retried, the leader checks whether there are transactions that exceed the crashedTransactionRestoreRetentionHours time, and those transactions are updated again to be eligible for retrying.
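A rough sketch of that restore check, using hypothetical names rather than the StackSaga implementation, could look like this.

// Hypothetical sketch of the leader's crash-restore check; names are illustrative,
// not the StackSaga implementation.
import java.time.Duration;
import java.time.Instant;
import java.util.List;

public class CrashedTransactionRestorer {

    public interface EventStore {
        // Transactions already handed to an instance but not updated since `cutoff`.
        List<String> findSharedButNotRetriedSince(Instant cutoff);
        // Marks a transaction as eligible for retrying again.
        void markEligibleForRetry(String transactionId);
    }

    private final EventStore eventStore;
    private final Duration retention; // e.g. Duration.ofHours(12)

    public CrashedTransactionRestorer(EventStore eventStore, Duration retention) {
        this.eventStore = eventStore;
        this.retention = retention;
    }

    // Runs before the leader collects the next batch of retryable transactions.
    public void restoreCrashedTransactions() {
        Instant cutoff = Instant.now().minus(retention);
        for (String transactionId : eventStore.findSharedButNotRetriedSince(cutoff)) {
            eventStore.markEligibleForRetry(transactionId);
        }
    }
}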
What happens if a transaction is retried after being declared as crashed?
That means that, because the retention time has been exceeded, the engine decides to expose the transaction for retrying again.
The transaction will then be shared with one of the available instances.
Meanwhile, the instance that received the transaction for retrying previously (before the latest exposure) invokes the transaction accidentally.
Just imagine that the instance has been stuck for 10 hours due to memory issues or a similar situation.
After those 10 hours, the transaction is executed by the instance that was stuck.
Then two situations can happen.
-
The transaction is still in a replaying status (even though it has been exposed many times after the retention time).
-
The transaction has already been executed successfully (by other instance(s)).
In the first scenario, you might think that the transaction could be executed twice, because the old instance has started to execute the transaction it collected into its queue, and the same transaction can also be in another instance's queue for execution, since the engine exposed the transaction for retrying after the retention time was exceeded. Even though there is a one-in-a-billion chance of that happening, it is never invoked twice.
That is because a token is passed along with every retry notification when the execution is shared. The token is an integer number that is increased by one every time the transaction is exposed for retrying.
In our case, the old instance's queue has a lower token number for the retry notification event than the new instance's queue. Therefore, the engine allows only the most recently issued token, and the old execution is rejected. The diagram shows how it works: only the execution that carries the latest token value is executed (the latest token must match the value in the event-store); any other executions are rejected.
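As a rough illustration of this token check (the names below are hypothetical, not the StackSaga API), an instance could compare the token carried by the retry notification with the latest token recorded in the event-store and execute only when they match.

// Hypothetical sketch of the retry-token check; names are illustrative, not the StackSaga API.
public final class RetryTokenGuard {

    private final EventStore eventStore; // hypothetical event-store accessor

    public RetryTokenGuard(EventStore eventStore) {
        this.eventStore = eventStore;
    }

    // Executes the retry only if the notification carries the latest token.
    public boolean tryExecute(String transactionId, int notificationToken, Runnable retryAction) {
        int latestToken = eventStore.latestRetryToken(transactionId);
        if (notificationToken != latestToken) {
            // A newer exposure exists (or this notification is stale): reject the execution.
            return false;
        }
        retryAction.run();
        return true;
    }

    // Hypothetical event-store view used by the sketch.
    public interface EventStore {
        int latestRetryToken(String transactionId);
    }
}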