Retry with Backoff Pattern in OCI

Jun 12, 2024

A standard pattern for improving application stability is to retry failed operations with backoff. Let's look at how to do this in OCI using the PL/SQL API from Oracle Autonomous Database and OCI Data Integration. The same approach applies to other services and to many other languages (Python, Java, and so on).

In distributed architectures there is a class of transient errors, such as those caused by service throttling or a temporary loss of network connectivity. Automatically retrying operations that fail because of these errors improves application stability and resilience. Retrying too often, however, can overload network bandwidth and cause contention. Exponential backoff is a technique where operations are retried with an increasing wait between attempts, typically doubling each time.
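To make the growth in wait time concrete, here is a minimal Python sketch that computes an exponential backoff schedule. The function name `backoff_delays` and its parameters (`base`, `factor`, `cap`) are illustrative assumptions and not part of any OCI SDK.

```python
def backoff_delays(retries, base=1.0, factor=2.0, cap=60.0):
    """Return the wait (in seconds) before each retry attempt.

    The wait starts at `base`, is multiplied by `factor` after every
    attempt, and never exceeds `cap`. All names and defaults here are
    illustrative assumptions.
    """
    delays = []
    delay = base
    for _ in range(retries):
        delays.append(min(delay, cap))
        delay *= factor
    return delays


print(backoff_delays(5))          # [1.0, 2.0, 4.0, 8.0, 16.0]
print(backoff_delays(8, cap=30))  # waits stop growing once they reach the cap
```

The cap keeps a long run of failures from producing very long waits. In practice a small random jitter is often added to each wait so that many clients do not retry at the same instant.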

The retry with backoff pattern is useful when a dependent service throttles requests to prevent overload. As a caller you then receive a 429 Too Many Requests response.

Retry with Backoff Pattern

The following code shows an implementation of the retry with backoff pattern. The function submits the task run and, if it gets a 429, retries up to 10 times with a simple, linearly increasing backoff (5 seconds, then 10 seconds, then 15 seconds, and so on). If every attempt is throttled, the last 429 response is returned to the caller.

CREATE OR REPLACE FUNCTION OCI_DI_RUN_TASK (
    APPLICATION VARCHAR2,
    TASK        VARCHAR2,
    WSID        VARCHAR2,
    REGION      VARCHAR2,
    CRED        VARCHAR2
) RETURN DBMS_CLOUD_OCI_DI_DATA_INTEGRATION_CREATE_TASK_RUN_RESPONSE_T AS
    CREATE_TASK_RUN_DETAILS DBMS_CLOUD_OCI_DATAINTEGRATION_CREATE_TASK_RUN_DETAILS_T;
    REGISTRY_METADATA       DBMS_CLOUD_OCI_DATAINTEGRATION_REGISTRY_METADATA_T;
    TASK_RUN                DBMS_CLOUD_OCI_DI_DATA_INTEGRATION_CREATE_TASK_RUN_RESPONSE_T;
BEGIN
    REGISTRY_METADATA := DBMS_CLOUD_OCI_DATAINTEGRATION_REGISTRY_METADATA_T(TASK, NULL, NULL, NULL, NULL);
    CREATE_TASK_RUN_DETAILS := DBMS_CLOUD_OCI_DATAINTEGRATION_CREATE_TASK_RUN_DETAILS_T(NULL, NULL, NULL, NULL, NULL,
                                                                                       NULL, NULL, NULL, NULL, NULL,
                                                                                       NULL, REGISTRY_METADATA);

    TASK_RUN := DBMS_CLOUD_OCI_DI_DATA_INTEGRATION.CREATE_TASK_RUN(WORKSPACE_ID => WSID, APPLICATION_KEY => APPLICATION, CREATE_TASK_RUN_DETAILS => CREATE_TASK_RUN_DETAILS
    , REGION => REGION, CREDENTIAL_NAME => CRED);

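    -- Retry only when the service throttles the request (HTTP 429).
    IF TASK_RUN.STATUS_CODE = 429 THEN
        FOR I IN 1..10 LOOP
            -- Wait a little longer before each attempt: 5, 10, 15, ... seconds.
            DBMS_SESSION.SLEEP(5 * I);
            TASK_RUN := DBMS_CLOUD_OCI_DI_DATA_INTEGRATION.CREATE_TASK_RUN(WORKSPACE_ID => WSID, APPLICATION_KEY => APPLICATION, CREATE_TASK_RUN_DETAILS => CREATE_TASK_RUN_DETAILS
            , REGION => REGION, CREDENTIAL_NAME => CRED);

            -- Stop retrying as soon as the request is no longer throttled.
            IF TASK_RUN.STATUS_CODE != 429 THEN
                EXIT;
            END IF;
        END LOOP;
    END IF;

    -- Return the last response; it is still a 429 if every retry was throttled.
    RETURN TASK_RUN;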

END;

This simple approach lets you handle multiple applications submitting requests from one tenancy. Throttling is generally applied at the tenancy level, and a tenancy may have many different applications or users calling the service, so each caller should be prepared to receive a 429.
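As noted at the start, the same pattern carries over to other languages. Below is a minimal Python sketch of the retry loop used in the PL/SQL function. It is not an OCI SDK call: `call_with_backoff` is a hypothetical helper, and `call` stands in for any function that returns an object with a `status_code` attribute.

```python
import time
from types import SimpleNamespace


def call_with_backoff(call, max_retries=10, step_seconds=5, sleep=time.sleep):
    """Invoke `call()` and retry while it reports HTTP 429.

    `call` is a hypothetical zero-argument function returning an object
    with a `status_code` attribute. The wait grows linearly (step,
    2*step, 3*step, ...), mirroring the PL/SQL loop. `sleep` is
    injectable so the logic can be exercised without really waiting.
    """
    response = call()
    for attempt in range(1, max_retries + 1):
        if response.status_code != 429:
            break
        sleep(step_seconds * attempt)
        response = call()
    return response


# Simulated service: throttled twice, then succeeds.
responses = iter([429, 429, 200])
waits = []
result = call_with_backoff(
    lambda: SimpleNamespace(status_code=next(responses)),
    sleep=waits.append,  # record the waits instead of sleeping
)
print(result.status_code, waits)  # 200 [5, 10]
```

Passing the service call in as a function keeps the retry policy in one place, so the same helper can wrap both the task submission and the status check.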

Bulk API Pattern

You can check task run status one run at a time, but this will hit the 429 throttle limits quite easily. A better approach is to use bulk APIs. For example, OCI Data Integration has an API that returns the status of a list of task run keys in a single call. The function below takes a table of task run keys and applies the same backoff when calling the API (3 seconds, then 6 seconds, then 9 seconds, and so on).

CREATE OR REPLACE FUNCTION GET_RUNS_STATUS (
    CRED            VARCHAR2,
    REGION          VARCHAR2,
    WORKSPACE_OCID  VARCHAR2,
    APPLICATION_KEY VARCHAR2,
    KEYS            DBMS_CLOUD_OCI_DATAINTEGRATION_VARCHAR2_TBL
) RETURN DBMS_CLOUD_OCI_DI_DATA_INTEGRATION_LIST_TASK_RUNS_RESPONSE_T AS
    RESULT DBMS_CLOUD_OCI_DI_DATA_INTEGRATION_LIST_TASK_RUNS_RESPONSE_T;
BEGIN
    RESULT := DBMS_CLOUD_OCI_DI_DATA_INTEGRATION.LIST_TASK_RUNS(WORKSPACE_ID => WORKSPACE_OCID, APPLICATION_KEY => APPLICATION_KEY, KEY => KEYS
    , REGION => REGION, CREDENTIAL_NAME => CRED);

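    -- Retry only when the service throttles the request (HTTP 429).
    IF RESULT.STATUS_CODE = 429 THEN
        FOR I IN 1..10 LOOP
            -- Wait a little longer before each attempt: 3, 6, 9, ... seconds.
            DBMS_SESSION.SLEEP(3 * I);
            RESULT := DBMS_CLOUD_OCI_DI_DATA_INTEGRATION.LIST_TASK_RUNS(WORKSPACE_ID => WORKSPACE_OCID, APPLICATION_KEY => APPLICATION_KEY
            , KEY => KEYS, REGION => REGION, CREDENTIAL_NAME => CRED);

            -- Stop retrying as soon as the request is no longer throttled.
            IF RESULT.STATUS_CODE != 429 THEN
                EXIT;
            END IF;
        END LOOP;
    END IF;

    -- Return the last response; it is still a 429 if every retry was throttled.
    RETURN RESULT;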
END;

You can then use the list runs response to check the status of each run. The response includes other attributes as well, such as status, bytesProcessed, and recordsWritten; check the documentation for the full list.
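The benefit of the bulk route is fewer requests for the same information. The Python sketch below illustrates the idea with a fake service. Everything in it is hypothetical: `list_runs` stands in for a bulk status call, the returned mapping of key to status is a simplification of the real response, and the batch size of 50 is an arbitrary assumption rather than a documented limit.

```python
def statuses_in_bulk(list_runs, keys, batch_size=50):
    """Collect run statuses using one call per batch of keys.

    `list_runs` is a hypothetical function that accepts a list of task
    run keys and returns a dict mapping each key to its status.
    """
    statuses = {}
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        statuses.update(list_runs(batch))
    return statuses


# Fake bulk API for demonstration: records the size of each request
# and reports every run as SUCCESS.
calls = []

def fake_list_runs(batch):
    calls.append(len(batch))
    return {key: "SUCCESS" for key in batch}

keys = [f"run-{n}" for n in range(120)]
result = statuses_in_bulk(fake_list_runs, keys)
print(len(result), calls)  # 120 [50, 50, 20]
```

Here 120 runs are covered by 3 requests instead of 120, which leaves far more headroom before the throttle limit is reached.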

Conclusion

Cloud computing offers unparalleled flexibility, scalability, and cost efficiency. However, to maximize its benefits, applications must be designed to handle transient failures gracefully. The retry backoff strategy is a powerful technique to enhance the reliability and robustness of cloud-based systems. By understanding and implementing appropriate backoff strategies, developers can build applications that remain resilient in the face of temporary disruptions.

Written by David Allan