Proportional Analysis of Long-Running Transactions in Saga

Overview

Proportional Analysis of Long-Running Transactions in Saga refers to assessing potential failures by comparing the different outcomes of long-running transactions in a Saga-based system, and to the solutions that StackSaga provides for those potential failures. Even in a well-structured microservice architecture using the Saga pattern, failures can still occur. While the Saga pattern ensures eventual consistency, it does not eliminate all failures: at the implementation level, certain rare but critical failures can still be observed, emphasizing the need for robust failure-handling strategies and continuous monitoring.

Because such a system is well managed and well configured, most transactions succeed in general. However, some rare but critical failures can still be observed, such as resource-unavailable issues, network outages, etc. These edge cases, though rare, demand careful consideration during system implementation to ensure overall robustness and resilience.

Funnel theory of distributed transactions

The diagram shows that, because well-managed distributed systems are used, most transactions are successful in general. Nevertheless, the following failures can still occur, even if rarely. This does not mean these errors occur every time, but there is always a possibility.

1. Successful transactions.

Successful transactions represent the majority of all transactions, because healthy and well-managed systems are being used. According to the Saga design pattern, these are the transactions that have executed all of their primary executions successfully.

2. Primary-Execution failed transactions

If one of the primary executions fails while going forward and the compensation process completes successfully after the primary-execution failure, the transaction is considered a compensation-success transaction. According to the Saga design pattern, such transactions are not considered failed transactions, because the transaction has still reached eventual consistency.
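To make the two outcomes above concrete, the following is a minimal, framework-agnostic Java sketch of a business transaction with a primary execution and its compensation. The service classes and method names are illustrative placeholders, not StackSaga APIs.

    // A minimal, framework-agnostic sketch of a saga step with a primary
    // execution and its compensation. The collaborator classes below are
    // hypothetical placeholders, not StackSaga APIs.
    public class PlaceOrderSaga {

        private final PaymentService paymentService = new PaymentService();
        private final InventoryService inventoryService = new InventoryService();

        public void run(String orderId) {
            paymentService.charge(orderId);              // primary execution 1
            try {
                inventoryService.reserveStock(orderId);  // primary execution 2
            } catch (RuntimeException primaryFailure) {
                // Primary execution failed: compensate the already completed
                // step so the system converges to a consistent state.
                paymentService.refund(orderId);          // compensation for step 1
                throw primaryFailure;
            }
        }

        // Hypothetical collaborators kept in-file so the sketch compiles.
        static class PaymentService {
            void charge(String orderId) { /* call the payment microservice */ }
            void refund(String orderId) { /* compensating call */ }
        }

        static class InventoryService {
            void reserveStock(String orderId) { /* call the inventory microservice */ }
        }
    }

If the call to reserveStock succeeds, the transaction falls into category 1; if it fails and the refund succeeds, the transaction falls into category 2 and is still not considered failed.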

Both of the situations above are accepted by the Saga design pattern. The following failures, however, are not accepted by the pattern; yet they can still occur, and we should manage them to avoid inconsistency in the system. These unexpected failures are a headache for developers, but there is no need to worry: StackSaga takes care of them with reliable solutions.

3. Compensation failed transactions.

If one of the compensation executions fails (for any reason other than a resource-unavailable issue), the entire transaction becomes stuck. As a developer, you therefore have to guarantee that the compensation executions are implemented without errors.

StackSaga solution. Even if these transactions are frozen due to a compensation error and are no longer exposed for scheduling, you can restore them through manual intervention after fixing the relevant issue. You can fetch the affected transactions batch wise from the event store via the endpoint that StackSaga provides. After collecting the transactions, they can be restored by calling the restore endpoint that StackSaga provides. Those transactions are then exposed to the next schedule for re-execution from the point where the transaction stopped.
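As a rough illustration of that manual-intervention flow, the sketch below fetches a batch of frozen transactions and then submits them for restoration using plain Java HTTP calls. The host, endpoint paths, query parameters, and payload handling are assumptions for illustration only; refer to the StackSaga documentation for the actual admin endpoints.

    // A hedged sketch of the manual-intervention flow: fetch the
    // compensation-failed transactions batch wise, then ask for them to be
    // restored. The URLs below are illustrative placeholders.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RestoreStuckTransactions {

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // 1. Collect the frozen transactions (batch wise) from the event store.
            HttpRequest fetch = HttpRequest.newBuilder()
                    .uri(URI.create("http://order-service:8080/admin/transactions?status=COMPENSATION_FAILED&page=0"))
                    .GET()
                    .build();
            String batch = client.send(fetch, HttpResponse.BodyHandlers.ofString()).body();
            System.out.println("Frozen transactions: " + batch);

            // 2. After fixing the underlying issue, restore the batch so the
            //    transactions are exposed to the next retry schedule.
            HttpRequest restore = HttpRequest.newBuilder()
                    .uri(URI.create("http://order-service:8080/admin/transactions/restore"))
                    .POST(HttpRequest.BodyPublishers.ofString(batch))
                    .header("Content-Type", "application/json")
                    .build();
            int status = client.send(restore, HttpResponse.BodyHandlers.ofString()).statusCode();
            System.out.println("Restore request returned HTTP " + status);
        }
    }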

4. Asynchronous retry timeouts

While communicating with other services, a service may sometimes be unreachable due to connection issues. As a best practice, if you are using a resilience and fault-tolerance pattern such as a circuit breaker in your application, the transaction is frozen for some period of time after a few failed attempts and is then scheduled for asynchronous retrying. The problem arises when the transaction keeps failing even after several asynchronous retries and the asynchronous retrying period of the transaction is exceeded; the transaction will then no longer be exposed for re-invocation. In simple terms, every transaction has a time period for retrying asynchronously, and once that time is exceeded, the transaction is frozen.
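The retry-window behavior described above can be pictured with the following sketch. This is not StackSaga's implementation; the class name, the retry period, and the outcomes are assumptions that only illustrate how a transaction stops being re-invoked once its asynchronous retrying period is exceeded.

    // Illustrative sketch of an asynchronous retry window: a transaction is
    // re-invoked on a schedule only while it is still inside its retry period;
    // once the period is exceeded it is frozen and needs manual restoration.
    import java.time.Duration;
    import java.time.Instant;

    public class AsyncRetryWindow {

        // How long a transaction is allowed to keep retrying asynchronously
        // (an assumed value; in practice this would be configurable).
        private static final Duration RETRY_PERIOD = Duration.ofHours(6);

        enum Outcome { RETRY_LATER, FROZEN, COMPLETED }

        Outcome onScheduledRetry(String transactionId, Instant firstFailureAt) {
            if (Instant.now().isAfter(firstFailureAt.plus(RETRY_PERIOD))) {
                // Retry period exceeded: stop exposing the transaction to the
                // scheduler; it now requires manual intervention to be restored.
                return Outcome.FROZEN;
            }
            try {
                invokeRemainingExecutions(transactionId); // re-invoke where it stopped
                return Outcome.COMPLETED;
            } catch (RuntimeException stillUnavailable) {
                return Outcome.RETRY_LATER;               // picked up by the next schedule
            }
        }

        private void invokeRemainingExecutions(String transactionId) {
            // Placeholder for calling the downstream service(s) again.
        }
    }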

StackSaga solution. Even if these transactions are frozen because the retry threshold has been reached, you can restore them through manual intervention after fixing the relevant issue, in the same way as compensation-failed transactions. Fetch the transactions batch wise from the event store via the endpoint that StackSaga provides, then call the restore endpoint; the transactions are then exposed to the next schedule for re-execution from the point where they stopped.

5. Crashed transactions.

In a microservices architecture, one business transaction consists of multiple sub-transactions (atomic transactions). In addition, an extra atomic transaction is made by the SEC behind the scenes to store the state of the transaction in the event store.

So if one atomic transaction crashes without any update (fallback), the entire transaction becomes stuck, because the atomic transactions of the business transaction are generally executed in sequential order. (A crash can occur for various reasons, such as a power outage or a hardware failure.)

These kinds of failures occur mainly in two forms.

  1. The application crashes while the SEC is saving the state of an atomic execution to the event store. This leads to a dual-consistency problem.

  2. The application crashes while processing one of the executors (one of the atomic transactions of the business transaction).

StackSaga solution. In the case of a crashed transaction, the SEC has no idea exactly where the transaction got stuck, because the execution process was killed without reporting anything back to the SEC. StackSaga is ready to handle these crashes as well. Each time the transaction is updated in the event store, the SEC also updates the Transaction Restore Retention Time; if the transaction crashes, it will be caught for retrying once the Transaction Restore Retention Time has been exceeded.
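The retention-time mechanism can be pictured with a simplified sketch like the one below: every state update pushes the restore deadline forward, and a periodic sweep picks up any transaction whose deadline has passed. The in-memory map and method names are illustrative assumptions, not StackSaga internals (as described above, the framework tracks this state through the event store).

    // Simplified sketch of the Transaction Restore Retention Time idea: a
    // transaction that stops receiving state updates (crashed or lost) is
    // re-exposed for retrying once its retention deadline passes.
    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class RetentionSweep {

        private static final Duration RETENTION = Duration.ofMinutes(30); // assumed value

        // transactionId -> time after which the transaction may be restored
        private final Map<String, Instant> restoreAfter = new ConcurrentHashMap<>();

        // Called each time a state change is persisted for a transaction.
        void onStateUpdated(String transactionId) {
            restoreAfter.put(transactionId, Instant.now().plus(RETENTION));
        }

        // Periodic job: anything whose retention time has passed received no
        // further updates, so it is treated as crashed/missing and retried.
        void sweep() {
            Instant now = Instant.now();
            restoreAfter.forEach((transactionId, deadline) -> {
                if (now.isAfter(deadline)) {
                    scheduleForRetry(transactionId);
                }
            });
        }

        private void scheduleForRetry(String transactionId) {
            // Placeholder: expose the transaction to the next retry schedule.
        }
    }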

6. Missing transactions

In asynchronous retry processes, transactions are transferred for execution to available services through mechanisms such as queues, HTTP requests, or other communication channels. However, during this process, a transaction may be lost or fail to execute due to issues like message loss, queue mismanagement, or communication failures. These missing transactions can lead to inconsistencies and require careful monitoring and recovery strategies.

StackSaga solution. In this case, the same Transaction Restore Retention Time logic is applied as the solution. If the transaction shows no sign of activity after being sent for retrying, it is automatically re-exposed for retrying once the configured Transaction Restore Retention Time has elapsed.