Chemica II: identity, integration, and one shared Page type


The problem: five schemas that disagree about "a compound"

Each scientific source models the world differently. PubChem keys on a numeric CID and a sprawling REST dialect. MediaWiki-family sources (Wikipedia, PsychonautWiki) speak Wikitext plus their own templates. PubMed and PMC live in NCBI’s Entrez ecosystem and return JATS XML. None of them agree on what counts as a compound, what its canonical name is, or whether it even deserves its own page.

The job of the library is to make the caller never have to learn any of that.

Identity is the load-bearing wall

The single decision that collapsed most of the complexity was picking one source of truth for identity and treating every other integration as a resolution problem layered on top of it. PubChem got that role because PubChem is the only source that is both deterministic and exhaustive for small molecules.

Every other source is resolved from a PubChem CID, never toward one. That turns fuzzy string matching into structured lookup, and it means the resolution layer fails in one place instead of five.

Wikipedia resolution stops being “fuzzy-match a title” and becomes “take the canonical name and known synonyms, rank candidate pages, and confirm against the chemistry infobox.” The confirmation is not DOM scraping. It is CID extraction from the raw Wikitext source, using four independent regex patterns because chemistry editors do not agree on how to cite PubChem:

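import std.regex : ctRegex, matchAll;

// Collects every PubChem CID cited in the page's raw Wikitext.
// Duplicates are possible when a page cites the same CID in more than one form.
string[] extractCIDs(Page page)
{
    string[] cids;

    // Infobox parameter: | PubChem = 1234
    enum pubChemRe = ctRegex!`PubChem\s*=\s*(\d+)`;
    foreach (m; matchAll(page.raw, pubChemRe))
        cids ~= m[1];

    // Infobox parameter: | PubChemCID = 1234
    enum pubChemCidRe = ctRegex!`PubChemCID\s*=\s*(\d+)`;
    foreach (m; matchAll(page.raw, pubChemCidRe))
        cids ~= m[1];

    // Inline template: {{PubChem|1234}}
    enum pubChemTemplateRe = ctRegex!`\{\{\s*PubChem\s*\|\s*(\d+)\s*\}\}`;
    foreach (m; matchAll(page.raw, pubChemTemplateRe))
        cids ~= m[1];

    // Bare compound URL in a reference or external link
    enum pubChemUrlRe = ctRegex!
        `https?://(?:www\.)?pubchem\.ncbi\.nlm\.nih\.gov/compound/(\d+)`;
    foreach (m; matchAll(page.raw, pubChemUrlRe))
        cids ~= m[1];

    return cids;
}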
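
To make the confirmation step concrete, here is a minimal, self-contained sketch of how a candidate page might be accepted or rejected once its Wikitext is in hand. The helper name citesCid and the sample infobox are illustrative assumptions, not part of the library, and the patterns are compiled at run time only to keep the example short.

```d
import std.conv : to;
import std.regex : matchAll, regex;

/// Hypothetical helper: true when `wikitext` cites `expected` as a PubChem CID.
bool citesCid(string wikitext, int expected)
{
    // The same four shapes as in extractCIDs, compiled at run time for brevity.
    auto patterns = [
        `PubChem\s*=\s*(\d+)`,
        `PubChemCID\s*=\s*(\d+)`,
        `\{\{\s*PubChem\s*\|\s*(\d+)\s*\}\}`,
        `https?://(?:www\.)?pubchem\.ncbi\.nlm\.nih\.gov/compound/(\d+)`,
    ];

    foreach (p; patterns)
        foreach (m; matchAll(wikitext, regex(p)))
            if (m[1].to!int == expected)
                return true;
    return false;
}

void main()
{
    // Made-up infobox fragment; the numbers are placeholders.
    string sample = "{{Chembox\n| PubChem = 3345\n| ChemSpiderID = 3228\n}}";
    assert(citesCid(sample, 3345));  // cited as a PubChem CID
    assert(!citesCid(sample, 3228)); // a different identifier, not a CID
}
```

Comparing parsed integers rather than strings means a page citing the same CID in two different forms confirms just as well as one citing it once.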

PsychonautWiki follows the same pattern with a graceful fallback through parent compounds. Most psychoactives do not have their own PsychonautWiki page; they redirect to a parent (LSD instead of ALD-52, for example). The resolution pipeline handles this by querying for structurally similar compounds from PubChem, filtering by XLogP and molecular weight, and then matching exact synonyms against PsychonautWiki page titles.

The Compound registry: flyweight by necessity

PubChem compounds are referenced everywhere: properties, assays, proteins, conformers, dosage lookups, similarity searches. Creating a new object every time the same CID appears would mean fetching the same title and synonyms repeatedly. Instead, Compound uses a static registry with getOrCreate:

class Compound
{
    // One shared instance per CID (flyweight registry).
    static Compound[int] registry;

    static Compound getOrCreate(int cid)
    {
        assert(cid != 0, "CID must not be zero");
        if (auto p = cid in registry)
            return *p;

        Compound c = new Compound(cid);
        registry[cid] = c;
        return c;
    }

    int cid;
    Properties properties;

    this(int cid) { this.cid = cid; }

    // Fetches properties on first access, then serves the cached title.
    ref string name()
    {
        if (properties.title.length == 0)
            properties = getProperties!"cid"(cid)[0].properties;
        return properties.title;
    }
}

Every property is lazy. The first call to name() triggers a fetch, and smiles(), iupac(), and inchi() behave the same way. synonyms(), description(), and conformer3D are fetched and cached on first access. The caller gets a rich object that looks fully hydrated, but no request is made until a value is actually needed.
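
The combination of a registry and lazy properties can be exercised without any network at all. The sketch below is a stand-in under stated assumptions: Entry, fetchTitle, and fetchCount are hypothetical names, and the "fetch" is a counter rather than a PubChem request.

```d
import std.conv : to;

int fetchCount; // how many simulated fetches have run

/// Stand-in for a network call (hypothetical).
string fetchTitle(int cid)
{
    ++fetchCount;
    return "compound-" ~ cid.to!string;
}

final class Entry
{
    private static Entry[int] registry;
    private string _title;
    immutable int cid;

    private this(int cid) { this.cid = cid; }

    /// Returns the one shared instance for this CID.
    static Entry getOrCreate(int cid)
    {
        if (auto p = cid in registry)
            return *p;
        auto e = new Entry(cid);
        registry[cid] = e;
        return e;
    }

    /// Fetches on first access, then serves the cached value.
    string title()
    {
        if (_title is null)
            _title = fetchTitle(cid);
        return _title;
    }
}

void main()
{
    auto a = Entry.getOrCreate(2244);
    auto b = Entry.getOrCreate(2244);
    assert(a is b);          // same object, not merely an equal one
    assert(fetchCount == 0); // nothing fetched yet
    assert(a.title == "compound-2244");
    assert(b.title == "compound-2244");
    assert(fetchCount == 1); // the second read hit the cache
}
```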

One Page type, five source formats

Wikipedia and PsychonautWiki emit Wikitext. PubMed and PMC emit JATS XML. The frontend does not care. Every text source materializes as the same object with the same interface:

class Page
{
    string title;
    string source;
    string url;

    ref string raw();        // lazy fetch via delegate
    ref Document document(); // parsed AST (Wikitext or XML)
    Section[] sections();
    string fulltext();
    string preamble();
}

The lazy fetch is a delegate stored at construction time and nulled after first use:

ref string raw()
{
    if (_raw is null && _fetchContent !is null)
    {
        _fetchContent(this);
        _fetchContent = null;
    }
    return _raw;
}

Each source module defines its own fetch delegate. Wikipedia uses the MediaWiki action=parse API. PsychonautWiki does the same. PubMed uses NCBI EUtils efetch with a regex-based article splitter because the returned XML contains multiple PubmedArticle records. The caller never sees any of this.
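
How a source module might wire its delegate into a page can be sketched in a few lines. LazyPage and the inline delegate are hypothetical; a real adapter would perform an HTTP request where this one builds a string.

```d
class LazyPage
{
    string title;
    private string _raw;
    private void delegate(LazyPage) _fetchContent;

    this(string title, void delegate(LazyPage) fetch)
    {
        this.title = title;
        _fetchContent = fetch;
    }

    /// Runs the stored delegate at most once, then serves the cached text.
    ref string raw()
    {
        if (_raw is null && _fetchContent !is null)
        {
            _fetchContent(this);
            _fetchContent = null;
        }
        return _raw;
    }
}

void main()
{
    int calls;
    // The "adapter": fills in _raw when asked (same module, so private is visible).
    auto page = new LazyPage("Caffeine", (LazyPage p) {
        ++calls;
        p._raw = "'''" ~ p.title ~ "''' is a stimulant.";
    });

    assert(calls == 0); // construction alone fetches nothing
    assert(page.raw == "'''Caffeine''' is a stimulant.");
    assert(page.raw.length > 0);
    assert(calls == 1); // the second access reused the cached text
}
```

One consequence of nulling the delegate after the first call is that a failed fetch is not retried; whether that is desirable depends on how the adapter reports errors.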

A zero-copy AST for both Wikitext and XML

Existing D Wikitext options either lost formatting, choked on templates, or pulled in dependencies I did not want. So Akashi has its own parser. The AST is a flat array of nodes with child index ranges, not a tree of pointers. Every node’s text field is a slice into the original source string. No copies during parsing.

struct Node
{
    NodeType type;
    string text;     // slice into original source (zero-copy)
    NodeFlags flags; // Bold | Italic, combined freely
    ubyte level;     // Section heading depth or list nesting
    string target;   // Link URL or page name
    uint childStart; // index into Document.nodes
    uint childEnd;
}

struct Document
{
    Node[] nodes;   // flat storage; nodes[0] is always root
    string source;  // kept alive so slices remain valid
}

The flat array makes tree traversal cache-friendly and lets the parser emit nodes in a single pass without allocating per-node children vectors. Walking siblings is just incrementing an index. Walking subtrees uses the child range. Dropping non-renderable node types (Templates, References, Comments, Categories, Images) is a single filter pass that rebuilds the flat array while preserving the remaining structure:

ref Document document()
{
    if (!_parsed)
    {
        if (source == "pubmed" || source == "pmc")
            _doc = parseXml(raw);
        else
            _doc = parseWikitext(raw);

        _doc.drop(Document.dropFlag(
            NodeType.Template,
            NodeType.Reference,
            NodeType.Comment,
            NodeType.Category,
            NodeType.Image,
        ));
        _parsed = true;
    }
    return _doc;
}
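
The child-range layout is easier to see on a tiny hand-built document. The sketch below uses simplified stand-ins (FlatNode, Kind, collect are hypothetical names, and the nodes are written out by hand rather than parsed) to show a depth-first walk and to check that node text really is a slice of the source.

```d
enum Kind : ubyte { Root, Section, Paragraph, Text }

struct FlatNode
{
    Kind kind;
    string text;     // slice into the source string
    uint childStart; // first child index (inclusive)
    uint childEnd;   // one past the last child
}

/// Appends the text of every Text node under `index`, depth first.
void collect(FlatNode[] nodes, uint index, ref string[] found)
{
    auto n = nodes[index];
    if (n.kind == Kind.Text)
        found ~= n.text;
    foreach (child; n.childStart .. n.childEnd)
        collect(nodes, child, found);
}

void main()
{
    string src = "== Effects ==\nCaffeine is a stimulant.";
    FlatNode[] nodes = [
        FlatNode(Kind.Root, null, 1, 3),            // children: nodes[1 .. 3]
        FlatNode(Kind.Section, src[3 .. 10], 0, 0), // "Effects", no children
        FlatNode(Kind.Paragraph, null, 3, 4),       // children: nodes[3 .. 4]
        FlatNode(Kind.Text, src[14 .. $], 0, 0),
    ];

    string[] found;
    collect(nodes, 0, found);
    assert(found == ["Caffeine is a stimulant."]);
    // Zero-copy: the collected text points into the original buffer.
    assert(found[0].ptr == src.ptr + 14);
}
```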

Parsing dosage from Wikitext templates

PsychonautWiki stores dosage data in Template:SubstanceBox/<compound> pages, not in the article text. The template parameters look like this:

| OralROA_Threshold = 50 mg
| OralROA_Light     = 50 - 100 mg
| OralROA_Common    = 100 - 150 mg
| OralROA_Strong    = 150 - 200 mg
| OralROA_Heavy     = 200+ mg

Extracting this requires fetching the template page separately, then running a compiled regex over the raw Wikitext. The regex captures route name, rank, and value. Values are cleaned by stripping nested [[Effect::...]] semantic links, [[Page|Display]] wikilinks, and <ref> citation tags:

enum roaRe = ctRegex!(
    `\|\s*(\w+)ROA_(Bioavailability|Threshold|Light|Common|Strong|Heavy)`
   ~`\s*=\s*([^\n|]+)`);

foreach (m; matchAll(raw, roaRe))
{
    string range = cleanValue(m[3], m[2]);
    if (range.length == 0 || range.indexOf(" x ") != -1)
        continue;
    routes[m[1]][m[2]] = range;
}
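
Running the same pattern over the sample parameters shows what the captures look like. This is a sketch, not the library's code: parseRoutes is a hypothetical wrapper, the regex is compiled at run time, and plain whitespace stripping stands in for cleanValue.

```d
import std.regex : matchAll, regex;
import std.string : strip;

/// Hypothetical wrapper: route name -> dosage tier -> cleaned value.
string[string][string] parseRoutes(string raw)
{
    auto roaRe = regex(
        `\|\s*(\w+)ROA_(Bioavailability|Threshold|Light|Common|Strong|Heavy)`
       ~ `\s*=\s*([^\n|]+)`);

    string[string][string] routes;
    foreach (m; matchAll(raw, roaRe))
        routes[m[1]][m[2]] = m[3].strip; // strip stands in for cleanValue
    return routes;
}

void main()
{
    string sample =
        "| OralROA_Threshold = 50 mg\n"
      ~ "| OralROA_Light     = 50 - 100 mg\n"
      ~ "| OralROA_Heavy     = 200+ mg\n";

    auto routes = parseRoutes(sample);
    assert(routes.length == 1); // only the Oral route appears
    assert(routes["Oral"]["Threshold"] == "50 mg");
    assert(routes["Oral"]["Light"] == "50 - 100 mg");
    assert(routes["Oral"]["Heavy"] == "200+ mg");
}
```

Note that the value capture stops at the first pipe character, so a value containing a piped wikilink is cut at that pipe before any cleaning runs.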

PsychonautWiki resolution: similarity, not names

When a compound lacks its own PsychonautWiki page, Akashi falls back to structural similarity. The process is deliberately conservative:

  1. Query PubChem for structurally similar compounds (Tanimoto > 90%).
  2. Fetch full properties for each candidate.
  3. Filter by XLogP (within ±0.5 units) and molecular weight (within ±8%).
  4. Query PsychonautWiki for each filtered candidate by name.
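  5. Match the returned page title against the candidate’s complete synonym list from PubChem.

import std.algorithm : map;
import std.array : array;
import std.math : abs, isNaN;

// Reference values for the compound being resolved
// (assumed here to come from its already-fetched properties).
auto refXLogP = compound.properties.xlogp;
auto refMW = compound.properties.weight;

Compound[] similar = similaritySearch(compound.cid, 90, 5);
int[] simCids = similar.map!(c => c.cid).array;
Compound[] withProps = getProperties!"cid"(simCids);
Compound[] filtered;

foreach (sim; withProps)
{
    // A check is skipped when either side has no value (NaN).
    if (!isNaN(refXLogP) && !isNaN(sim.properties.xlogp))
        if (abs(sim.properties.xlogp - refXLogP) > 0.5)
            continue;

    if (!isNaN(refMW) && refMW > 0 && !isNaN(sim.properties.weight))
        if (abs(sim.properties.weight - refMW) / refMW > 0.08)
            continue;

    filtered ~= sim;
}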

Only exact synonym matches are accepted. A close structural analog with a different name will not silently pass as the target compound. Every inferred dosage is flagged with the source compound’s name so the caller knows where it came from.
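
The two gates, property tolerance and exact synonym match, can be written as small predicates and checked against made-up numbers. Candidate, withinTolerance, and titleMatches are hypothetical names, and the case-sensitive comparison is an assumption about what "exact" means here.

```d
import std.algorithm.searching : canFind;
import std.math : abs, isNaN;

struct Candidate
{
    string name;
    double xlogp;  // NaN when no value is available
    double weight; // molecular weight
}

/// Mirrors the filter above: XLogP within 0.5 units, weight within 8%.
bool withinTolerance(Candidate reference, Candidate c)
{
    if (!isNaN(reference.xlogp) && !isNaN(c.xlogp))
        if (abs(c.xlogp - reference.xlogp) > 0.5)
            return false;
    if (!isNaN(reference.weight) && reference.weight > 0 && !isNaN(c.weight))
        if (abs(c.weight - reference.weight) / reference.weight > 0.08)
            return false;
    return true;
}

/// Exact (case-sensitive, by assumption) match of a page title against synonyms.
bool titleMatches(string title, string[] synonyms)
{
    return synonyms.canFind(title);
}

void main()
{
    auto target = Candidate("target", 2.0, 300.0);
    assert(withinTolerance(target, Candidate("a", 2.3, 310.0)));         // both close
    assert(!withinTolerance(target, Candidate("b", 2.8, 300.0)));        // XLogP off by 0.8
    assert(!withinTolerance(target, Candidate("c", 2.0, 330.0)));        // weight off by 10%
    assert(withinTolerance(target, Candidate("d", double.nan, 305.0)));  // missing XLogP is skipped

    assert(titleMatches("LSD", ["Lysergide", "LSD"]));
    assert(!titleMatches("LSD-25", ["Lysergide", "LSD"])); // near miss is rejected
}
```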

PubMed and PMC: XML into the same AST

NCBI EUtils returns JATS XML, which is verbose, deeply nested, and full of metadata tags the renderer does not care about. Akashi has a second parser that reads this XML into the same AST structure as the Wikitext parser. article-title becomes a Section node. abstract and sec become sections with headings. p becomes Paragraph. bold and italic become Styled nodes with bitflags. Tables, figures, and reference lists are skipped entirely.

Document parseXml(string src)
{
    XmlParser parser = XmlParser(src);
    parser.parse();
    return Document(parser.nodes, src);
}

// Inside the parser:
if (matchTag("article-title")) { parseSimpleElement("article-title", NodeType.Section, 1); }
if (matchTag("abstract"))      { parseSectionElement("abstract", "Abstract", 2); }
if (matchTag("sec"))           { parseSec(); }
if (matchTag("p"))             { parseParagraphElement(); }
if (matchTag("bold") || matchTag("b"))
    { parseInlineElement("bold", NodeType.Styled, NodeFlags.Bold); }
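
The tag-to-node mapping described above can also be read as a lookup table. The following is an illustrative sketch only: Kind, Style, Mapping, and mappingFor are hypothetical, the fixed heading level for sec is an assumption (real depth would depend on nesting), and anything not listed is treated as skipped.

```d
enum Kind { Skip, Section, Paragraph, Styled }
enum Style : ubyte { None = 0, Bold = 1, Italic = 2 }

struct Mapping
{
    Kind kind;
    ubyte level; // heading depth for sections, 0 otherwise
    Style style;
}

/// Hypothetical lookup from an XML tag name to the AST node it becomes.
Mapping mappingFor(string tag)
{
    switch (tag)
    {
        case "article-title": return Mapping(Kind.Section, 1);
        case "abstract":      return Mapping(Kind.Section, 2);
        case "sec":           return Mapping(Kind.Section, 2); // assumed fixed level
        case "p":             return Mapping(Kind.Paragraph);
        case "bold", "b":     return Mapping(Kind.Styled, 0, Style.Bold);
        case "italic":        return Mapping(Kind.Styled, 0, Style.Italic);
        default:              return Mapping(Kind.Skip); // tables, figures, reference lists
    }
}

void main()
{
    assert(mappingFor("article-title") == Mapping(Kind.Section, 1));
    assert(mappingFor("b").style == Style.Bold);
    assert(mappingFor("italic").kind == Kind.Styled);
    assert(mappingFor("table-wrap").kind == Kind.Skip);
}
```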

Network hygiene: per-source rate limiting

Every adapter uses the conductor HTTP library, but each source carries its own rate-limiting state. PubChem PUG REST is generous. NCBI EUtils allows three requests per second without an API key, which works out to 333 ms between requests. MediaWiki gets 200 ms. These are not global locks; they are per-module clocks. Two modules hitting two different providers never share a bottleneck.

static Orchestrator psychonautOrchestrator =
    Orchestrator("https://psychonautwiki.org/w", 200);

static Orchestrator entrezOrchestrator =
    Orchestrator("https://eutils.ncbi.nlm.nih.gov/entrez/eutils", 333);

JSONValue query(string[string] params)
{
    JSONValue json; // filled in by the response callback below

    params["format"] = "json";
    params["formatversion"] = "2";
    orchestrator.rateLimit(); // blocks until this source's interval has elapsed
    orchestrator.client.get(
        orchestrator.buildURL("/api.php", params),
        (ubyte[] data) { json = parseJSON(data.assumeUTF); },
        null
    );
    return json;
}
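
A per-source clock needs very little machinery. The sketch below is not the Orchestrator from the library; Throttle is a hypothetical stand-in that keeps one timestamp per instance and sleeps only for the remainder of its own interval.

```d
import core.thread : Thread;
import core.time : Duration, MonoTime, msecs;

struct Throttle
{
    Duration minInterval;
    private MonoTime last;
    private bool used;

    /// Sleeps just long enough to keep `minInterval` between calls.
    void wait()
    {
        if (used)
        {
            const elapsed = MonoTime.currTime - last;
            if (elapsed < minInterval)
                Thread.sleep(minInterval - elapsed);
        }
        last = MonoTime.currTime;
        used = true;
    }
}

void main()
{
    auto wiki = Throttle(200.msecs);
    auto entrez = Throttle(333.msecs);

    const start = MonoTime.currTime;
    wiki.wait();   // first call returns immediately
    entrez.wait(); // separate clock, so no wait here either
    wiki.wait();   // blocks until roughly 200 ms after the first wiki call

    const total = MonoTime.currTime - start;
    assert(total >= 190.msecs); // small margin for timer granularity
    assert(total < 333.msecs);  // the entrez interval never delayed wiki
}
```

Because each instance owns its timestamp, adding a new provider means adding one more Throttle, with no shared lock to contend on.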

The PubMed adapter also switches from GET to POST automatically when the ID list exceeds 200 items or 2000 characters, because NCBI EUtils rejects oversized GET requests without a helpful error.
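
The switch itself is a small predicate over the ID list. In this sketch the function name needsPost and the constant names are hypothetical; the two thresholds are the ones stated above, applied to the comma-joined list.

```d
import std.algorithm.iteration : map;
import std.array : join;
import std.conv : to;

enum maxGetIds = 200;    // more IDs than this: use POST
enum maxGetChars = 2000; // longer joined list than this: use POST

/// True when the request should be sent as POST rather than GET.
bool needsPost(const(int)[] ids)
{
    if (ids.length > maxGetIds)
        return true;
    auto joined = ids.map!(i => i.to!string).join(",");
    return joined.length > maxGetChars;
}

void main()
{
    int[] few = [1, 2, 3];
    auto many = new int[201]; // only the count matters here
    auto wide = new int[200];
    wide[] = 1_000_000_000;   // 200 ten-digit IDs plus 199 commas = 2199 chars

    assert(!needsPost(few));
    assert(needsPost(many)); // too many items
    assert(needsPost(wide)); // few enough items, but too long
}
```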

The extension test

After Chemica shipped I added a pubchem/bio module (genes, proteins, orthologs, pathways, disease associations) hydrated from PubChem’s PUG REST and PUG View gene endpoints. Chemica does not use any of it yet. But it took an afternoon, not a refactor. That is the test I care about. Not “does it work today,” but “how much does it cost me to extend.”

What I’d carry into the next one