Skip to content

Parallelization in GÉANT

In GÉANT, there are a few tasks and workflows that make use of parallel execution of workflows. These are mostly used for mass-migration or -reconciliation. Parallelization takes place outside the orchestrator-core however, as we add items to the Celery queue directly. This only allows us to run multiple workflows in parallel, and not separate steps.

Hierarchy

In this section, we will refer to mass-deploy workflows, but their behaviour is identical to how mass-migrate and -reconcile workflows behave. When running the workflow redeploy_base_config, the standard set of configuration statements is applied to an existing router. It can happen that there is a change in this set of statements, and our base config is updated.

To prevent all router subscriptions from falling out of sync during their nightly validation, we want to redeploy base config on all routers. For this, we use a special task redeploy_base_config. The input form of this task allows the operator to select multiple routers for which the redeploy_base_config workflow should run.

@shared_task(ignore_result=False)
def bulk_wf_task(
    wf_payloads: list[dict],
    callback_route: str,
    success_key: str,
    failure_key: str,
) -> None:
    """Kicks off one Celery task per workflow, then runs the final callback."""
    identifiers = {wf_payload["identifier"] for wf_payload in wf_payloads}

    header = [process_one_wf.s(p) for p in wf_payloads]

    chord(header)(
        finalize_bulk_wf.s(
            callback_route,
            list(identifiers),
            success_key,
            failure_key
        )
    )

The code snippet shows how we submit a task of multiple instances of the same workflow that gets submitted to Celery. This causes the workflows to be structured like a parent with multiple child workflows. The parent workflow is responsible for submitting multiple child workflows, one for each router that is redeployed. Upon completion of all child workflows, Celery will call back to the parent to signal that the processes have completed. The parent will then validate whether all child processes have completed successfully. If this is not the case, the parent workflow will fail.

In the parent workflow, we have the following callback step defined, with an action and an evaluation sub-step.

@step("Start worker to redeploy base config on selected routers")
def start_redeploy_workflows(tt_number: TTNumber, selected_routers: list[UUID], callback_route: str) -> State:
    """Start the massive ``redeploy_base_config_task`` with the selected routers."""
    wf_payloads = []
    for subscription_id in selected_routers:
        wf_payload = BulkWfPayload(
            workflow_key="redeploy_base_config",
            user_inputs=[
                {"subscription_id": str(subscription_id)},
                {"tt_number": tt_number, "is_massive_redeploy": True},
            ],
            callback_route=callback_route,
        )
        wf_payloads.append(wf_payload.model_dump())

    bulk_wf_task.apply_async(args=(wf_payloads, callback_route, _SUCCESS_KEY, _FAILURE_KEY), countdown=5)
    return {_FAILURE_KEY: {}, _SUCCESS_KEY: {}}


@step("Evaluate provisioning proxy result")
def evaluate_results(callback_result: dict) -> State:
    """Evaluate the result of the provisioning proxy callback."""
    failed_wfs = callback_result.pop(_FAILURE_KEY, {})
    successful_wfs = callback_result.pop(_SUCCESS_KEY, {})
    return {"callback_result": callback_result, _FAILURE_KEY: failed_wfs, _SUCCESS_KEY: successful_wfs}

Then in the StepList of the workflow:

callback_step(
    name="Start running redeploy workflows on selected routers",
    action_step=start_redeploy_workflows,
    validate_step=evaluate_results,
)

The parent workflow does not handle retrying child workflows. This is an exercise left to the operator.

sequenceDiagram
  actor P as Parent<br/>workflow
  participant B@{ "type" : "control" } as Celery
  participant C@{ "type" : "collections" } as Child<br/>workflows
  P->>B: Submit bulk workflow task
  par For each selected subscription
    B-)C: Start workflow
  end
  C->>B: Complete workflow
  alt At least one child workflow has failed
    B->>P: Fail the parent workflow
    Note right of C: If the child workflow has failed,<br/>the operator has to resolve<br/>this manually.
    alt
      C->>C: Retry workflow
    else
      C--xC: Abort workflow
    end
  else All child workflows have completed
    B->>P: Complete parent workflow
  end

Once all child workflows are in a state that is not failed, the evaluation step of the parent workflow will report a success. Note that we do not check if the state is success. This allows for the operator to make the call to abort the child workflow for whatever reason. If this happens, the operator is responsible for any potential manual cleanup, but the parent workflow can now finish successfully.