Modernizing Selenium Support in D


WebKit GTK on Windows: it doesn't exist

I spent about a month building Insurgent, a companion app for a real-time strategy game that provides active monitoring of matches and an in-depth catalog of game data. The frontend was built around WebKit GTK because it ships with D's GtkD bindings and was the obvious choice on Linux and macOS. It allowed the game to run in a separate window where I could hook WebSockets and inject JavaScript. Then I tried to build on Windows.

The Windows port was abandoned years ago. Not buggy. Abandoned. There is no path forward. I needed a cross-platform browser engine, and I needed it without rewriting the project in another language.

Selenium was the replacement candidate

ChromeDriver is maintained by Google and runs everywhere the project needs to ship. I also have experience with Selenium from my work on LinkedIn Learning scraping and workflow automation, so the protocol and API patterns were already familiar.

The obstacle was that D had no working Selenium bindings at all.

The only option was from 2018

The existing library, selenium.d, targeted the old JsonWire protocol. Selenium deprecated JsonWire in favor of W3C WebDriver years ago, and modern ChromeDriver simply rejects JsonWire session requests outright. Every endpoint path, every payload shape, and every error response format had changed.

In JsonWire, you POST to /session with a flat desiredCapabilities object. In W3C, the same endpoint expects a nested capabilities object that holds an alwaysMatch object and an optional firstMatch array of alternative capability sets. A JsonWire session request against a modern ChromeDriver returns a 500 with no session ID, and every subsequent command fails.
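
A minimal sketch of the two request bodies, built with std.json. The function names jsonWirePayload and w3cPayload are hypothetical and exist only for illustration; they are not part of the library.

```d
import std.json;

/// Legacy JsonWire shape: one flat object under "desiredCapabilities".
JSONValue jsonWirePayload(string browserName)
{
    JSONValue caps = JSONValue.emptyObject;
    caps["browserName"] = browserName;

    JSONValue payload = JSONValue.emptyObject;
    payload["desiredCapabilities"] = caps;
    return payload;
}

/// W3C shape: "capabilities" wraps an "alwaysMatch" object and a
/// "firstMatch" array of alternative capability objects.
JSONValue w3cPayload(string browserName)
{
    JSONValue always = JSONValue.emptyObject;
    always["browserName"] = browserName;

    JSONValue caps = JSONValue.emptyObject;
    caps["alwaysMatch"] = always;
    // A single empty alternative means "no extra constraints".
    caps["firstMatch"] = JSONValue([JSONValue.emptyObject]);

    JSONValue payload = JSONValue.emptyObject;
    payload["capabilities"] = caps;
    return payload;
}

unittest
{
    JSONValue w3c = w3cPayload("chrome");
    assert(w3c["capabilities"]["alwaysMatch"]["browserName"].str == "chrome");
    assert(w3c["capabilities"]["firstMatch"].type == JSONType.array);

    JSONValue legacy = jsonWirePayload("chrome");
    assert(legacy["desiredCapabilities"]["browserName"].str == "chrome");
}
```

Compiling with dmd -unittest -main -run runs the checks without a browser or driver.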

The protocol gap was not a version mismatch. It was a different wire format, different capability negotiation, and different error handling. Every request path and every payload shape had to be revisited.

This is not my first time with WebDriver

The Ruby automation I wrote for AutoHandshake was a dispatch-oriented runtime with tab queuing, JavaScript-level request interception, and companion browser extensions. That experience mattered here because I did not have to learn what good bindings feel like while also learning the W3C protocol. I already knew what the surface should look like.

Ruby's metaprogramming also carries over well to D's. Both languages reward building small, composable abstractions, and that shaped how I structured the API.

Session negotiation

Session creation was the big one. The fix started at the entry point. The Bridge now builds a W3C-compliant capabilities payload and parses the standardized response shape:

void start(Options options)
{
    if (!running)
        throw new WebDriverConnectionError("Bridge is not running.");

    JSONValue payload = JSONValue.emptyObject;
    payload["capabilities"] = options.toJSONValue();

    HTTP http = HTTP();
    Response response = send(http, HTTP.Method.post, serverUrl~"/session", payload);
    JSONValue json = checkAndParse(response);

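    // JsonWire puts sessionId at the top level; W3C nests it under "value".
    if ("sessionId" in json)
        sessionId = json["sessionId"].str;
    else if ("value" in json && "sessionId" in json["value"])
        sessionId = json["value"]["sessionId"].str;
    else
        throw new WebDriverConnectionError("Driver response did not contain a session ID.");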
}

The response parser had to handle both the old JsonWire sessionId top-level field and the new W3C value.sessionId nested field. The same dual-protocol handling applies to element references. W3C uses a constant key, element-6066-11e4-a52e-4f735466cecf, while legacy drivers use ELEMENT:

static string parseElementId(JSONValue json)
{
    enum W3C_KEY = "element-6066-11e4-a52e-4f735466cecf";
    JSONValue value = ("value" in json) ? json["value"] : json;

    if (value.type == JSONType.object)
    {
        if (W3C_KEY in value && value[W3C_KEY].type == JSONType.string)
            return value[W3C_KEY].str;
        if ("ELEMENT" in value && value["ELEMENT"].type == JSONType.string)
            return value["ELEMENT"].str;
    }

    return null;
}

Modeling the API after Ruby

The D version follows the same conventions as Ruby's Selenium bindings: a Driver owns the session, Element wraps node references, and locators are strongly typed instead of raw strings.

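// The W3C spec defines five strategies: "css selector", "link text",
// "partial link text", "tag name", and "xpath". "class name", "id", and
// "name" are JsonWire-era strategies that a strict W3C driver may reject
// unless they are rewritten as CSS selectors before the request is sent.
enum Locator : string
{
    ClassName = "class name",
    CssSelector = "css selector",
    Id = "id",
    Name = "name",
    LinkText = "link text",
    PartialLinkText = "partial link text",
    TagName = "tag name",
    XPath = "xpath"
}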
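
One way to keep the JsonWire-era strategies usable against a strict W3C driver is to rewrite them as CSS selectors before the request goes out. The sketch below works on the raw strategy strings; W3CLocator, escapeAttr, and toW3C are hypothetical names, and this is not necessarily how the library handles it.

```d
import std.array : replace;
import std.typecons : Tuple;

alias W3CLocator = Tuple!(string, "strategy", string, "value");

/// Escapes backslashes and double quotes for use inside [attr="..."].
string escapeAttr(string value)
{
    return value.replace(`\`, `\\`).replace(`"`, `\"`);
}

/// Maps a (strategy, value) pair onto one of the five W3C strategies.
W3CLocator toW3C(string strategy, string value)
{
    switch (strategy)
    {
        case "id":
            return W3CLocator("css selector", `[id="` ~ escapeAttr(value) ~ `"]`);
        case "name":
            return W3CLocator("css selector", `[name="` ~ escapeAttr(value) ~ `"]`);
        case "class name":
            // Assumes a plain identifier; class names that need CSS
            // escaping (for example a leading digit) are not handled.
            return W3CLocator("css selector", "." ~ value);
        default:
            // Already a W3C strategy: pass through unchanged.
            return W3CLocator(strategy, value);
    }
}

unittest
{
    assert(toW3C("id", "main") == W3CLocator("css selector", `[id="main"]`));
    assert(toW3C("class name", "btn") == W3CLocator("css selector", ".btn"));
    assert(toW3C("xpath", "//a") == W3CLocator("xpath", "//a"));
}
```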

Element lookup returns an Element that carries its own element ID, so the caller never manages opaque UUID strings directly:

class Element
{
    Bridge bridge;
    string id;

    this(Bridge bridge, string id)
    {
        this.bridge = bridge;
        this.id = id;
    }

    void click()
    {
        bridge.request(HTTP.Method.post, path("/click"));
    }

    void sendKeys(string[] keys)
    {
        bridge.request(HTTP.Method.post, path("/value"), ["text": keys.join()]);
    }

    void sendKeys(string keys)
    {
        sendKeys([keys]);
    }

    string text()
        => bridge.request!string(HTTP.Method.get, path("/text"));

    string attribute(string name)
        => bridge.request!string(HTTP.Method.get, path("/attribute/"~name));

    bool displayed()
        => bridge.request!bool(HTTP.Method.get, path("/displayed"));

private:
    string path(string suffix)
        => "/element/"~id~suffix;
}

The API surface is small and intentionally familiar if you have used Selenium in another language:

auto driver = Driver.start();
driver.navigate("https://example.com");
auto element = driver.find(Locator.CssSelector, "input[name='q']");
element.sendKeys("query");

Driver auto-detection and lifecycle

One quality-of-life feature from the Ruby ecosystem is that the bindings often find the driver executable for you. The D port replicates that with a path search and executable inference from the filename:

static string autoDetectExecutable(DriverType type = DriverType.Any)
{
    string[] candidates;
    switch (type)
    {
        case DriverType.Chrome:
            candidates = ["chromedriver"];
            break;
        case DriverType.Firefox:
            candidates = ["geckodriver"];
            break;
        case DriverType.Edge:
            candidates = ["msedgedriver"];
            break;
        case DriverType.Safari:
            candidates = ["safaridriver"];
            break;
        default:
            candidates = [
                "chromedriver",
                "msedgedriver",
                "safaridriver",
                "geckodriver"
            ];
            break;
    }

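    // `which` does not exist on Windows; `where` is the closest equivalent.
    version (Windows)
        enum lookup = "where";
    else
        enum lookup = "which";

    foreach (candidate; candidates)
    {
        auto result = std.process.execute([lookup, candidate]);
        // `where` can print several matches, one per line; take the first.
        // splitLines and strip come from std.string.
        if (result.status == 0 && result.output.length > 0)
            return result.output.splitLines[0].strip;
    }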

    throw new WebDriverConnectionError(
        "Could not auto-detect executable for "~type.to!string
    );
}

The Driver class spawns the child process, waits for the HTTP port to become ready, and shuts it down cleanly on quit() or stop():

void launch(ushort requestedPort = 0)
{
    if (running)
        return;

    port = requestedPort == 0 ? findFreePort() : requestedPort;
    pid = spawnProcess([executablePath, "--port="~port.to!string]);
    serverUrl = "http://127.0.0.1:"~port.to!string;
    waitForServer(5000);
    running = true;
}

void stop()
{
    if (!running || pid is Pid.init)
        return;

    tryKill(pid);
    running = false;
    pid = Pid.init;
}
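
launch relies on findFreePort and waitForServer, which are not shown here. A rough sketch of how such helpers could be written with std.socket follows. The bodies are assumptions, not the library's code, and waitForPort (a hypothetical name) only checks that something accepts TCP connections on the port; it does not query the driver's /status endpoint.

```d
import core.thread : Thread;
import core.time : Duration, MonoTime, msecs;
import std.socket : InternetAddress, SocketException, TcpSocket;

/// Asks the OS for an unused port by binding to port 0, then releases it.
/// Another process could grab the port before the driver starts.
ushort findFreePort()
{
    auto socket = new TcpSocket();
    scope (exit) socket.close();
    socket.bind(new InternetAddress("127.0.0.1", InternetAddress.PORT_ANY));
    return (cast(InternetAddress) socket.localAddress).port;
}

/// Polls until a TCP connection to 127.0.0.1:port succeeds or the
/// timeout elapses. Returns true if the port accepted a connection.
bool waitForPort(ushort port, Duration timeout)
{
    immutable deadline = MonoTime.currTime + timeout;
    while (MonoTime.currTime < deadline)
    {
        auto socket = new TcpSocket();
        scope (exit) socket.close();
        try
        {
            socket.connect(new InternetAddress("127.0.0.1", port));
            return true;
        }
        catch (SocketException)
        {
            // Nothing is listening yet; back off briefly and retry.
            Thread.sleep(50.msecs);
        }
    }
    return false;
}

unittest
{
    // Stand in for a driver with a plain listening socket.
    auto listener = new TcpSocket();
    scope (exit) listener.close();
    listener.bind(new InternetAddress("127.0.0.1", InternetAddress.PORT_ANY));
    listener.listen(1);

    immutable port = (cast(InternetAddress) listener.localAddress).port;
    assert(waitForPort(port, 1000.msecs));
}
```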

Error handling for both protocols

Because the library may still encounter legacy drivers in the wild, the error mapper handles both W3C error strings and legacy JsonWire status codes:

static WebDriverError mapError(ushort status, JSONValue json)
{
    string message = extractMessage(json);

    if ("value" in json && json["value"].type == JSONType.object)
    {
        JSONValue value = json["value"];
        if ("error" in value && value["error"].type == JSONType.string)
        {
            switch (value["error"].str)
            {
                case "no such element":
                    return new NoSuchElementError(message);
                case "stale element reference":
                    return new StaleElementReferenceError(message);
                case "invalid element state":
                    return new InvalidElementStateError(message);
                case "timeout":
                    return new WebDriverTimeoutError(message);
                case "session not created":
                    return new WebDriverConnectionError(message);
                default:
                    return new WebDriverError(message);
            }
        }
    }

    if ("status" in json)
    {
        switch (json["status"].type == JSONType.integer
            ? json["status"].get!long : 0)
        {
            case 7:
                return new NoSuchElementError(message);
            case 10:
                return new StaleElementReferenceError(message);
            case 12:
                return new InvalidElementStateError(message);
            case 21:
                return new WebDriverTimeoutError(message);
            case 33:
                return new WebDriverConnectionError(message);
            default:
                return new WebDriverError(message);
        }
    }

    return new WebDriverError(message);
}
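
mapError calls extractMessage, which is not shown. In both protocols the human-readable text normally sits at value.message, so a stand-in could look like the sketch below. messageOf is a hypothetical name, the sample bodies are hand-written for illustration, and the real helper may behave differently.

```d
import std.json;

/// Returns value.message if present, otherwise a generic fallback.
string messageOf(JSONValue json)
{
    enum fallback = "unknown error";

    if (json.type != JSONType.object)
        return fallback;

    if (auto value = "value" in json)
    {
        if (value.type == JSONType.object)
        {
            if (auto message = "message" in *value)
            {
                if (message.type == JSONType.string)
                    return message.str;
            }
        }
    }
    return fallback;
}

unittest
{
    // W3C error body: the error code is a string under value.error.
    auto w3c = parseJSON(
        `{"value":{"error":"no such element","message":"not found","stacktrace":""}}`);
    // Legacy JsonWire error body: the error code is a numeric status.
    auto legacy = parseJSON(`{"status":7,"value":{"message":"not found"}}`);

    assert(messageOf(w3c) == "not found");
    assert(messageOf(legacy) == "not found");
    assert(messageOf(parseJSON(`{"value":null}`)) == "unknown error");
}
```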

Implicit waits and synchronization

A raw Selenium script without waits breaks immediately on slow-loading pages. The library exposes implicit-wait configuration on the bridge and syncs it lazily before element lookups:

void ensureImplicitWaitSynced()
{
    if (implicitWait == syncedImplicitWait)
        return;

    JSONValue body_ = JSONValue.emptyObject;
    body_["implicit"] = JSONValue(implicitWait.total!"msecs");
    try
    {
        request(HTTP.Method.post, "/timeouts", body_);
        syncedImplicitWait = implicitWait;
    }
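    catch (Exception)
    {
        // Best effort: syncedImplicitWait is left unchanged on failure,
        // so the next element lookup tries the sync again.
    }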
}

JavaScript execution and screenshot capture

Two features that Insurgent needed heavily were script injection and screenshot capture for debugging. Both map cleanly to W3C endpoints. execute is generic over the return type and handles serialization automatically:

T execute(T = string)(string script, JSONValue args = JSONValue.emptyArray)
{
    return bridge.request!T(HTTP.Method.post, "/execute/sync", [
        "script": JSONValue(script),
        "args": args,
    ]);
}

string screenshot()
    => bridge.request!string(HTTP.Method.get, "/screenshot");
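
The W3C screenshot endpoint returns the image as a Base64-encoded PNG string, so it has to be decoded before it can be opened in an image viewer. A minimal sketch, assuming screenshot() hands back that string untouched; savePng is a hypothetical helper, not part of the library.

```d
import std.base64 : Base64;
import std.file : write;

/// Decodes a Base64 PNG string and writes the raw bytes to `path`.
/// Base64.decode throws if the input is not valid Base64.
void savePng(string base64Png, string path)
{
    ubyte[] bytes = Base64.decode(base64Png);
    write(path, bytes);
}
```

With a running session this would be called as savePng(driver.screenshot(), "debug.png").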

What is still missing

The library covers the core WebDriver surface that Insurgent needed: navigation, element interaction, form input, script execution, screenshots, and session management. It does not yet implement the Actions API for low-level mouse and keyboard input, nor does it support WebDriver BiDi for network interception or console log streaming. Remote Grid support and mobile WebDriver endpoints are also absent.

Those gaps are acceptable for now. The goal was not parity with the Java or Python bindings. The goal was a working W3C foundation in a language that had none, and one that is easy to write unittests for.

Where it stands now

The library is working and I am continually expanding its functionality. It is forked from gedaiu/selenium.d and lives at github.com/cetio/selenium-sdk. It is not complete, but it is enough to drive a browser from D and support the majority of basic Selenium functionality, and a month ago that was not possible at all.

I will keep making changes, so the code will drift from what this article shows. Even so, bringing modern Selenium to D is a big step for the ecosystem, and I look forward to seeing what others do with it.