Modeling a merge queue with TLA+

In this article, we use TLA+ to create a basic merge queue system.
Software engineer @ Aviator

Modeling a merge queue with TLA+

In this article, we use TLA+ to create a basic merge queue system.

When building a complex software system, it is hard to reason its behavior. Formal methods are a computer science research area that assists this situation by using a mathematical basis.

Model checking is one of the formal method techniques. By creating a small version of a system, it explores possible system states and checks if its desired properties are satisfied. Computer science students and graduates might be familiar with model-checking tools such as Alloy, SPIN, and NuSMV.

TLA+ is a specification language used for model checking created by Leslie Lamport. It is known for industrial usage in companies like Microsoft and AWS. In this article, we use TLA+ to create a basic merge queue system.

TLA+

TLA+ was created by Leslie Lamport. You might have heard of his other works such as Lamport Clock and Paxos. The language is used for system specification; you write the system state and processes that interact with the state.

With this description, you might imagine that it is like a database schema and a programming language. To some degree it is correct. Both are used for describing a system. Specification languages, however, are specialized to describe expected “properties” of systems.

You can think of properties as features of a system. Let’s take a merge queue as an example. Merge queue is a queueing system that takes a pull request one by one, tests them, and merges them into the mainline in order. In this system, you expect that a PR is merged in a queueing order, and a PR gets rejected when a test fails.

When writing such a system, you can run a few test cases to see if it’s merging PRs in order. This is limited in the sense that it only tests certain situations. There can be a timing bug in that a PR gets merged earlier than previous PRs.

Model checkers like TLA+ address this problem by doing an exhaustive search. Any non-trivial system has way larger state space, so we need to write a smaller model that captures the core part of the system. We can get insights by formally describing a system or describing the desired property, and the model checkers assist us with their consistency.

TLA+ itself is a specification language, but there is another language called PlusCal that is close to programming languages. Specifications written in PlusCal can be translated into TLA+. For most software engineers, PlusCal would be easier to use. There are some good TLA+ and PlusCal tutorials. The author used Learn TLA+. In addition to TLA+’s IDE, there is also VSCode extension.

Modeling a bare-bones merge queue

Let’s start with a small merge queue system. In this system, we want to model the following actors:

  • Developers who enqueue PRs into the system.
  • CI system that tests a PR.
  • Merge queue system that merges a tested PR.

In this first merge queue system, we ignore the “queue” part and merge the first PR that finished testing. Later we will add the “queue” part.

---- MODULE mergequeue ----
EXTENDS TLC, Sequences, Integers, FiniteSets

PRNumber == 1..3
PRState == {
    "pending",
    "queued-waiting-validation",
    "queued-validated",
    "merged"
}

(* --algorithm mergequeue

variables
    prs = [
        prn in PRNumber |->
        [
           state |-> "pending"
        ]
    ]

define
    Merged == <> A idx in PRNumber :
        prs[idx]["state"] = "merged"
end define;

fair process worker in {"w1", "w2"}
begin
    ProcessPR:
        with prn in PRNumber do
            await prs[prn]["state"] = "queued-validated";
            prs[prn]["state"] := "merged";
            time := time + 1;
        end with;
        goto ProcessPR;
end process;

fair process queuer in {"q1", "q2"}
begin
    QueuePR:
        with prn in PRNumber do
            await prs[prn]["state"] = "pending";
            prs[prn]["state"] := "queued-waiting-validation";
        end with;
        goto QueuePR;
end process;

fair process ci_worker in {"ci1", "ci2"}
begin
    RunCI:
        with prn in PRNumber do
            await prs[prn]["state"] = "queued-waiting-validation";
            prs[prn]["state"] := "queued-validated";
        end with;
        goto RunCI;
end process;

end algorithm; *)

There are 6 processes and 3 pull requests in the system. They are working in parallel and eventually, they get merged. As a property, it defines one that eventually all PRs get merged. This is a very trivial system, but it’s a good starting point.

Running this passes the validation. To test if the validation is actually validating what we want, we can change the merged property to expect everything to be “queued-validated”. Running that version produces a counter-example.

At step 10, it finds all states to be “merged”, not “queued-validated”.

Let’s modify the model so that it can handle test failures. In this modification, the CI system produces either failed or passed, and the merge queue system marks the PR to be blocked for the failed ones.

PRState == {
    "pending",
    "queued-waiting-validation",
    "queued-validated",
    "queued-failed",
    "merged",
    "blocked",
}

The CI workers nondeterministically mark the PRs as failed or passed.

RunCI:
    with prn in PRNumber do
        await prs[prn]["state"] = "queued-waiting-validation";
        either
            prs[prn]["state"] := "queued-validated";
        or
            prs[prn]["state"] := "queued-failed";
        end either;
    end with;
    goto RunCI;

The merge queue workers merge the PRs or mark them as blocked.

ProcessPR:
    with prn in PRNumber do
        either
            await prs[prn]["state"] = "queued-validated";
            prs[prn]["state"] := "merged";
        or
            await prs[prn]["state"] = "queued-failed";
            prs[prn]["state"] := "blocked";
        end either;
    end with;
    goto ProcessPR;

Running this modified model will actually produce a counter-example.

We need to change the expected property for the model.

MergedOrBlocked == <> A idx in PRNumber :
    prs[idx]["state"] in {"merged", "blocked"}

In this model, eventually, every PR gets merged or blocked.

Adding the “queue” part to the merge queue

The described merge queue system merges the finished PR and rejects them if the tests fail, but they are merged out of order. We want them to get merged in the queuing order. Let’s add the queuing part to the system.

In this case, we change the MergeQueue to behave like this:

  • MergeQueue system merges a PR if the PR that passed the test and the previously queued PRs are merged.

To represent the queue order, we need to add a concept of “time”. Let’s add that to PRs.

variables
    time = 1,
    prs = [
        prn in PRNumber |->
        [
           state |-> "pending",
           queuedAt |-> 0
        ]
    ]

And queuing a PR sets this queued at and increments the clock.

QueuePR:
    with prn in PRNumber do
        await prs[prn]["state"] = "pending";
        prs[prn]["state"] := "queued-waiting-validation" || prs[prn]["queuedAt"] := time;
        time := time + 1;
    end with;
    goto QueuePR;

How do we make the merge queue workers merge only if the prior PRs get merged? Let’s create a predicate to test if prior PRs are already merged.

PriorPRMerged(prn) ==
    A idx in PRNumber:
        (/ prs[idx]["queuedAt"] # 0
         / prs[idx]["queuedAt"] < prs[prn]["queuedAt"])
            => prs[idx]["state"] = "merged"

This condition checks if the PR’s queuedAt is already filled and it’s earlier than the PR in question. If so, it should be marked “merged”. Otherwise, this condition won’t become true.

With this condition, we can guard the merge process.

await prs[prn]["state"] = "queued-validated" / PriorPRMerged(prn);

This modification will make the queue stuck. Why? Let’s see the counter example produced by the model checker.

In the counter-example, the PRs are queued from 1 to 3. However, the first PR gets blocked. Other PRs are stuck because PriorPRMerged condition will never be satisfied.

What do we want to do in this situation? We can remove the blocked PR from the queue and let the other PRs get merged. This can get more complicated if the MergeQueue system supports ChangeSets, where the system merges related PRs together. This is not in the current scope, so let’s punt on that. Let’s remove the blocked PR from the queue.

await prs[prn]["state"] = "queued-failed";
prs[prn]["state"] := "blocked" || prs[prn]["queuedAt"] := 0;

Changing this makes the model checker happy.

As we see in this extension, the model checker presents a possible system state that won’t satisfy the expected property. In order to fix the problem, we need to think about what extra behavior is needed for the system. In this case, if a PR gets blocked by a CI, it can clog the entire queue.

This time, we simply remove the offending PR from the queue, but there can be other actions, such as enqueuing the blocked PR to the back of the queue. This iterative process inspires you to the high-level spec of the system in a verifiable way.

Final remarks

We went through a system modeling process by creating a small merge queue system. Starting from a small one, we expand it to have more specifications. In the process, we discovered an example situation where an expected property won’t be held, and what can be missing in the system. The resulting system is still small, but this is a good starting point. We can iterate the process until the system has enough features.

Model checking is relatively easy to start. For people who are more interested in this area, we suggest reading Software Abstractions. You should be able to find relevant classes in undergraduate CS courses in universities. Some model checkers are based on SAT solvers. Occasionally, you will find a problem that can be converted into a SAT problem in software engineering. Adding this as a toolbox allows you to have more options for problems at your hand.

TLA+ is fun and easy to learn. AWS reports that engineers at various levels could get useful results in 2-3 weeks after they started learning. The author of this article also learned TLA+ for the first time and was able to write this small MergeQueue model in half a day. Being able to get a useful result in a few weeks is totally believable. If you haven’t tried it out, we recommend it.

Aviator: Automate your cumbersome processes

Aviator automates tedious developer workflows by managing git Pull Requests (PRs) and continuous integration test (CI) runs to help your team avoid broken builds, streamline cumbersome merge processes, manage cross-PR dependencies, and handle flaky tests while maintaining their security compliance.

There are 4 key components to Aviator:

  1. MergeQueue – an automated queue that manages the merging workflow for your GitHub repository to help protect important branches from broken builds. The Aviator bot uses GitHub Labels to identify Pull Requests (PRs) that are ready to be merged, validates CI checks, processes semantic conflicts, and merges the PRs automatically.
  2. ChangeSets – workflows to synchronize validating and merging multiple PRs within the same repository or multiple repositories. Useful when your team often sees groups of related PRs that need to be merged together, or otherwise treated as a single broader unit of change.
  3. TestDeck – a tool to automatically detect, take action on, and process results from flaky tests in your CI infrastructure.
  4. Stacked PRs CLI – a command line tool that helps developers manage cross-PR dependencies. This tool also automates syncing and merging of stacked PRs. Useful when your team wants to promote a culture of smaller, incremental PRs instead of large changes, or when your workflows involve keeping multiple, dependent PRs in sync.

Try it for free.

Aviator.co | Blog

Subscribe

Be the first to know once we publish a new blog post

Join our Discord

Learn best practices from modern engineering teams

Get a free 30-min consultation with the Aviator team to improve developer experience across your organization.