The failure mode of linear scripts
A linear automation script looks like a recipe: click, wait 2 seconds, click, wait 3 seconds, fill form, submit. It breaks the first time the page does literally anything you didn’t expect. A background fetch delays a control by 500ms and a step fails. An A/B test injects an extra confirmation dialog and the next selector is stale. A refresh is required because state didn’t settle and now you have to unwind whatever was mid-flight.
Each individual fix is a one-line patch. The cumulative effect is a fragile script nobody wants to touch.
The page as a state machine
The reframe is to model each page as a small state machine with explicit, observable states: idle, pending, actionable, completed, error. These states are derived from the DOM (a spinner being present, a submit button being enabled, a specific data attribute appearing), not from a clock.
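The derivation can be sketched as a pure function from DOM observations to a state symbol. The `dom` hash and its keys below are hypothetical stand-ins for whatever your driver actually exposes; the ordering of the checks is the point — a visible spinner means pending even if the submit button happens to be enabled:

```ruby
# Derive page state from DOM observations, not from a clock.
# The `dom` hash is an illustrative stub for real driver queries.
def observe_state(dom)
  return :error      if dom[:error_banner_visible]
  return :completed  if dom[:confirmation_marker_present]
  return :pending    if dom[:spinner_visible]    # checked before :actionable on purpose
  return :actionable if dom[:submit_enabled]
  :idle
end

observe_state(spinner_visible: true)                       # => :pending
observe_state(submit_enabled: true)                        # => :actionable
observe_state(spinner_visible: true, submit_enabled: true) # => :pending
```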
Events on a dispatch queue, not steps in a script
Once the page has observable states, actions become small composable events (click, wait-for, verify, screenshot, reconcile) placed on a dispatch queue. A lightweight event loop pulls the next event, checks the current page state, and decides whether to execute, re-route, cancel, or retry.
loop do
  event = queue.next
  state = Page.observe # read current DOM state
  next queue.defer(event) if event.precondition_unmet?(state)
  next queue.drop(event)  if event.superseded_by?(state)
  result = event.execute(driver)
  raise event.error_for(result) unless event.postcondition_met?(driver)
  queue.enqueue(event.followups(result)) # fan out only after the effect is verified
end
Every event carries three things: a predicate on the current state saying whether it should run right now, the action itself, and a postcondition check that confirms the action actually had the effect the caller expected. Events are idempotent by design: retrying is safe because the precondition gate either admits the action or skips it when its effect is already in place.
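That three-part contract can be made concrete in a few lines. The `Event` struct and the `click_submit` example below are illustrative sketches, not a real framework API, and the driver is stubbed as a plain hash:

```ruby
# Each event bundles a precondition, an action, and a postcondition.
# Names here (Event, click_submit) are hypothetical, for illustration.
Event = Struct.new(:name, :precondition, :action, :postcondition) do
  def runnable?(state)
    precondition.call(state)
  end

  def run(driver, state)
    raise "precondition unmet for #{name}" unless runnable?(state)
    action.call(driver)
    raise "postcondition failed for #{name}" unless postcondition.call(driver)
    :ok
  end
end

click_submit = Event.new(
  :click_submit,
  ->(state)  { state == :actionable },    # only run when the button is live
  ->(driver) { driver[:clicked] = true }, # the action itself (stubbed)
  ->(driver) { driver[:clicked] }         # verify the click actually landed
)

driver = {}
click_submit.run(driver, :actionable) # => :ok
```

Retrying `click_submit` against a page that is still `:pending` raises at the precondition gate instead of blindly clicking, which is what makes queue-level retries safe.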
Concrete patterns
- Observable waits beat blind sleeps. Wait on visibility, attribute changes, network idleness, or a timestamped DOM marker, not on a wall-clock duration.
- Layered selectors. Prefer stable data-* attributes or ARIA roles, fall back to semantic paths, and only resort to positional lookup when there is no other option.
- Precondition / postcondition on every event. If the precondition fails, defer. If the postcondition fails, raise. No action completes silently.
- Throttle by queue, not by sleeping. A queue-level rate limiter adds natural variability and prevents bursts from outrunning the DOM.
- Instrument heavily. Attempt counts per event, median completion times, and a rolling log of the last N events make debugging timing issues much cheaper than stepping through headful sessions.
- Use CDP where the DOM isn’t enough. Early script injection or network inspection sidesteps whole classes of race conditions that Selenium alone can’t handle.
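The throttling point can be sketched as a queue that owns pacing and jitter, so no handler ever sleeps on its own. The class name and interval numbers are illustrative placeholders:

```ruby
# A queue-level rate limiter: the dispatcher paces dispatches with a
# minimum interval plus random jitter, keeping handlers free of sleeps.
class ThrottledQueue
  def initialize(min_interval: 0.5, jitter: 0.3)
    @events = []
    @min_interval = min_interval
    @jitter = jitter
    @last_dispatch = nil
  end

  def enqueue(event)
    @events.push(event)
  end

  def next
    wait_until_allowed
    @last_dispatch = Time.now
    @events.shift
  end

  private

  def wait_until_allowed
    return if @last_dispatch.nil?           # first dispatch goes immediately
    gap = @min_interval + rand * @jitter    # natural variability between actions
    elapsed = Time.now - @last_dispatch
    sleep(gap - elapsed) if elapsed < gap
  end
end

queue = ThrottledQueue.new(min_interval: 0.5, jitter: 0.3)
queue.enqueue(:click_submit)
queue.enqueue(:verify_confirmation)
queue.next # dispatches immediately; subsequent calls pace themselves
```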
What changed in practice
Stale-element errors dropped from “constant” to “almost never” because the dispatch layer re-reads the DOM before every action. Page refreshes disappeared because refresh was a band-aid for racing against state that was never actually observed. Adding a new task variation became a one-line enqueue or a small handler beside existing code, instead of a rewrite of the central flow.
Most importantly, debugging got interactive. When an event fails, its log line carries the state it observed, the precondition it checked, and the postcondition that failed. You can replay recent queue activity locally without re-running the whole session.
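A failure record of that shape might look like the following minimal sketch; the field names are hypothetical:

```ruby
require "time"

# Build a structured record for a failed event: the observed state and
# both checks travel with the failure. Field names are illustrative.
def failure_record(event_name, observed_state, precondition_ok, postcondition_ok, attempt)
  {
    event: event_name,
    observed_state: observed_state,
    precondition: precondition_ok ? "passed" : "failed",
    postcondition: postcondition_ok ? "passed" : "failed",
    attempt: attempt,
    at: Time.now.utc.iso8601
  }
end

failure_record(:click_submit, :pending, false, true, 3)
# => { event: :click_submit, observed_state: :pending, precondition: "failed", ... }
```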
Takeaways
- Make state observable. If you can’t query it from the DOM, you can’t reliably wait for it.
- Treat each action as a verifiable event with explicit preconditions and postconditions.
- Keep events small, composable, and idempotent.
- Put throttling and retries in the queue, not in each handler.
- The script stops being a recipe and starts being a tiny runtime. That’s the point.