Reducing flaky Maestro tests

Diagnose and eliminate flaky Maestro flows — timing, network, animation, and shared-state failures.

A flaky test is one that both passes and fails on the same code with no changes. Flaky tests erode trust in CI faster than any other class of bug, because the natural response is "rerun it" — which works, until it stops working and nobody believes the failure when it's real. This guide is a triage protocol.

TL;DR

Flakes have four root causes: (1) racing the UI before it's ready, (2) network non-determinism, (3) shared backend state, (4) animation timing. Fix in that order — don't add sleep and call it done. Tag flaky flows so they don't block CI while you investigate.

Step 1: confirm it's flaky

Before fixing, confirm it's actually flaky:

for i in {1..20}; do
  maestro test path/to/flow.yaml --debug-output ./debug-$i || echo "fail $i"
done

If 0 of 20 runs fail, the original failure was likely environmental (a one-off CI hiccup). If all 20 fail, the flow is deterministically broken — not flaky. Genuine flakes typically fail 1–4 times out of 20.
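If you'd rather get a count than scan the log afterwards, the same loop can tally failures explicitly. This is a sketch: `path/to/flow.yaml` is the same placeholder as above, and `run_flow` is just a local wrapper function, not a Maestro feature.

```shell
# Tally failures across 20 runs instead of grepping the log afterwards.
# run_flow is a local helper wrapping the real maestro invocation.
run_flow() {
  maestro test path/to/flow.yaml --debug-output "./debug-$1"
}

fails=0
for i in $(seq 1 20); do
  run_flow "$i" >/dev/null 2>&1 || fails=$((fails + 1))
done
echo "failed $fails/20"
```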

Step 2: diagnose

Compare the debug output of a passing run vs. a failing run for the same flow:

  • Failing run shows the screenshot one step early? UI race.
  • Failing run shows a network spinner that never resolved? Network issue.
  • Failing run shows a different user's data? Shared state.
  • Failing run shows the right screen but tapOn missed? Animation timing.

This 30-second comparison saves hours. Don't skip it.

Cause 1: UI race

The flow asserts on or interacts with an element before it's hittable. Most common cause of flakes overall.

Bad:

- tapOn: "Sign in"
- inputText: "qa@example.com"   # form may not be visible yet

Good:

- tapOn: "Sign in"
- extendedWaitUntil:
    visible:
      id: "auth.email.input"
    timeout: 10000
- inputText: "qa@example.com"

extendedWaitUntil polls until the element is visible or the timeout hits — far better than a fixed sleep.
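The same command also supports the inverse condition, `notVisible` — useful when a loading overlay covers the form and you need it gone before typing. The spinner id here is a hypothetical example:

```yaml
- extendedWaitUntil:
    notVisible:
      id: "auth.loading.spinner"   # hypothetical id for the loading overlay
    timeout: 10000
- inputText: "qa@example.com"
```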

Cause 2: Network

The app waits on a backend request that sometimes takes 3 s, sometimes 30. Three options, in order of preference:

  1. Mock the network at the app level (a debug build that hits a fixture server). Best for CI; eliminates the dependency.
  2. Increase timeouts on the relevant extendedWaitUntil. A blunt fix, but acceptable.
  3. Run against a faster environment. If staging is the bottleneck, a local mock is faster.

Avoid: hitting production from CI. Even if it works, you'll be the team that DDoS'd checkout the day before launch.
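One way to wire up option 1 is a debug build that reads its API base URL from a launch argument, which Maestro can pass via launchApp. The appId, argument name, and port are assumptions about your app — the app itself must be built to honor the flag:

```yaml
appId: com.example.app   # placeholder app id
---
- launchApp:
    arguments:
      apiBaseUrl: "http://localhost:8080"   # hypothetical flag pointing at a local fixture server
```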

Cause 3: Shared backend state

Two CI runs hit the same staging database concurrently. Run A creates a user named "Alice"; Run B asserts there is exactly one user named "Alice"; Run A's insert makes Run B's assertion fail.

Fix at the data layer, not the flow:

  • Generate per-run unique IDs (UUID, timestamp).
  • Reset state at the start of the suite, not between flows.
  • Where possible, run against an ephemeral DB seeded fresh per CI job.
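The first bullet can be handled inside the flow itself with Maestro's inline JavaScript, which can stash a per-run value on the shared `output` object. The element id and email format here are hypothetical:

```yaml
- evalScript: ${output.email = 'qa+' + Date.now() + '@example.com'}   # unique per run
- tapOn:
    id: "signup.email.input"   # hypothetical input id
- inputText: ${output.email}
```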

The test data management guide goes deeper.

Cause 4: Animation

tapOn fires while a button is mid-transition. The tap registers against the element's old position (the source view), not where it ends up, and "nothing happens".

Fix:

- waitForAnimationToEnd
- tapOn: "Continue"

waitForAnimationToEnd waits until the screen content stops changing — no in-flight transitions — then returns. Cheap and effective.
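If the screen can legitimately keep moving (a looping spinner or autoplaying video would otherwise stall the flow), cap the wait with the optional timeout:

```yaml
- waitForAnimationToEnd:
    timeout: 5000   # stop waiting after 5 s and continue the flow
- tapOn: "Continue"
```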

Quarantine, don't ignore

While you investigate, tag the flow and exclude it from blocking CI:

# in the flow file's config section
tags: [checkout, flaky]

# in CI
maestro test --exclude-tags=flaky .maestro/    # blocking job
maestro test --include-tags=flaky .maestro/    # non-blocking, separate job

Crucially: track the flaky list. A flow that has been quarantined for 6 months is a deleted flow with extra steps. Either fix or delete.
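For reference, tags are declared in the flow file's config section, above the `---` separator. The appId here is a placeholder:

```yaml
appId: com.example.checkout   # placeholder app id
tags:
  - checkout
  - flaky
---
- launchApp
```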

Retries — when and why not

Maestro supports --retries N. Use sparingly:

  • OK: retries on the suite as a whole during a known-flaky migration.
  • Not OK: retries set to 3 by default forever. This is how you ship a real bug.

If you find yourself bumping retries to make CI green, the flake is a symptom of something worse — likely cause 2 or 3 above.

Common pitfalls

  • sleep: 5 everywhere. Slows the suite, doesn't fix the race condition.
  • "It works locally." Local is faster, often less loaded, with different data. Reproduce on the CI VM before declaring victory.
  • Treating retries as a fix. They're a workaround.
