The problem: five schemas that disagree about "a compound"
Each scientific source models the world differently. PubChem keys everything on a numeric CID and speaks a sprawling REST dialect. MediaWiki-family sources (Wikipedia, PsychonautWiki) speak Wikitext plus their own templates. PubMed and PMC live in NCBI's Entrez ecosystem and return XML, JATS in PMC's case. None of them agree on what counts as a compound, what its canonical name is, or whether it even deserves its own page.
The library's job is to make sure the caller never has to learn any of that.
Identity is the load-bearing wall
The single decision that collapsed most of the complexity was picking one source of truth for identity and treating every other integration as a resolution problem layered on top of it. PubChem got that role because it is the only source that is both deterministic and exhaustive for small molecules.
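A minimal sketch of the shape that decision gives everything else (the names here are illustrative, not the library's actual API): identity is a PubChem CID plus the names PubChem knows, and every other source is a resolver from that identity to its own pages.

struct Identity
{
    long cid;              // PubChem CID: the one source of truth
    string canonicalName;  // PubChem's preferred name for the compound
    string[] synonyms;     // known synonyms, handed to downstream resolvers
}

interface Resolver
{
    // Returns this source's page title for the compound, or null if it has none.
    string resolve(const Identity id);
}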
Wikipedia resolution stops being “fuzzy-match a title” and becomes “take the canonical name and known synonyms, rank candidate pages, and confirm against the chemistry infobox.” PsychonautWiki follows the same pattern with a graceful fallback through parent compounds for the many psychoactives that don’t have their own page. The core resolution layer was rewritten three times before I accepted that the right answer was to stop being clever and anchor everything to one trustworthy identifier.
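What that ranking step can look like, as a hedged sketch (the scoring and the function name are mine, not the library's; the real flow also confirms the winner against the chemistry infobox before accepting it):

string resolveWikipedia(string canonicalName, string[] synonyms, string[] candidates)
{
    import std.uni : icmp;

    int score(string title)
    {
        if (icmp(title, canonicalName) == 0) return 2;  // exact canonical match
        foreach (syn; synonyms)
            if (icmp(title, syn) == 0) return 1;        // synonym match
        return 0;
    }

    string best;
    int bestScore;
    foreach (title; candidates)
    {
        const s = score(title);
        if (s > bestScore) { best = title; bestScore = s; }
    }
    return best;  // null when nothing matched; the caller then falls back or gives up
}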
One Page type to rule five sources
Every text source (Wikipedia, PsychonautWiki, PubMed, PMC) materializes as the same object:
struct Page
{
    string raw();         // untouched source payload
    Document document();  // parsed AST (Wikitext or XML)
    Section[] sections(); // structured sections, in document order
    string preamble();    // lead text before the first section
    string fulltext();    // flattened plain text
}
The frontend never learns which source produced a given page. That one abstraction erased a huge amount of special-case code in Chemica and made “render whatever we have about caffeine” a single code path instead of five.
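A small illustration of what that buys (render() is hypothetical, not Chemica's code): the frontend writes one loop over Page values and never branches on provenance.

void render(Page[] pages)
{
    import std.stdio : writeln;

    foreach (page; pages)
        writeln(page.fulltext());  // same call whether the page came from
                                   // Wikipedia, PsychonautWiki, PubMed, or PMC
}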
Lazy hydration, no caching
Fields are fetched when they are first touched, and nothing is cached across calls. That sounds heavier than it is: in practice, lazy hydration is simpler to reason about, easier to test, and honest about the fact that scientific data changes under you. If the caller wants caching, they add it at the call site, where they actually know the correct lifetime.
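A minimal sketch of the idea, assuming the conductor-backed fetch is injected as a delegate (LazyPage and its fields are illustrative):

struct LazyPage
{
    private string url;
    private string delegate(string) fetch;  // conductor-backed HTTP GET
    private string payload;
    private bool hydrated;

    // The network hit happens on first touch; nothing outlives this object.
    string raw()
    {
        if (!hydrated)
        {
            payload = fetch(url);
            hydrated = true;
        }
        return payload;
    }
}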
conductor: network hygiene lives in one place
HTTP, retries, exponential backoff, and per-source rate limiting all live in the conductor library. Every source adapter is written against it. Adding a sixth source later is a matter of writing a thin adapter, not relearning network hygiene.
Each module owns its own rate-limiting state (a clock and a minimum interval) instead of sharing a global singleton. Two modules hitting two providers never share a bottleneck, and when a provider changes policy, the fix is local.
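A hedged sketch of what that per-module state amounts to (the type and field names are mine, not conductor's API): each adapter owns one of these and calls waitTurn() before every request.

import core.thread : Thread;
import core.time : Duration, MonoTime;

struct RateLimiter
{
    Duration minInterval;  // provider-specific spacing between requests
    MonoTime lastRequest;  // this module's own clock, not a global singleton

    void waitTurn()
    {
        const elapsed = MonoTime.currTime - lastRequest;
        if (lastRequest != MonoTime.init && elapsed < minInterval)
            Thread.sleep(minInterval - elapsed);
        lastRequest = MonoTime.currTime;
    }
}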
Writing the parsers I didn’t want to write
Every existing D option either lost formatting, choked on templates, or pulled in an unwanted dependency, so I wrote my own. The Wikitext parser is a small AST with style bitflags instead of per-style node types. That keeps the tree compact, the renderer boring, and the transform layer able to do structural passes without switching on dozens of node kinds.
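A hedged sketch of the bitflag idea (node and flag names are illustrative): a single inline node carries a style mask, so a new style is one more flag instead of one more node kind.

enum Style : uint
{
    none   = 0,
    bold   = 1 << 0,
    italic = 1 << 1,
    code   = 1 << 2,
    strike = 1 << 3,
}

struct WikiNode
{
    string text;          // leaf text; empty for purely structural nodes
    uint styles;          // mask of Style flags covering this span
    WikiNode[] children;  // structural nesting: sections, lists, templates
}

// The renderer tests masks instead of switching on dozens of node kinds.
bool isBold(const WikiNode n) { return (n.styles & Style.bold) != 0; }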
PMC returns JATS XML, which is its own flavor of painful. Sharing utilities between the Wikitext and XML paths was one of those small cleanups that paid for itself many times over.
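One example of the kind of helper that ends up shared, sketched under the assumption that nodes on both sides expose text and children (the real utilities differ): a single templated flatten works for the Wikitext AST and the parsed JATS tree alike.

string flatten(N)(const N node)
{
    string result = node.text;
    foreach (child; node.children)
        result ~= flatten(child);
    return result;
}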
The extension test
After Chemica shipped I added a pubchem/bio module (genes, proteins, orthologs, pathways, disease associations) hydrated from PubChem’s PUG REST and PUG View gene endpoints. Chemica doesn’t use any of it yet. But it took an afternoon, not a refactor. That is the test I care about. Not “does it work today,” but “how much does it cost me to extend.”
What I’d carry into the next one
- Pick the identity anchor early. A bad identity model will cost more than a bad parser ever will.
- Treat unreliable sources as unreliable. Build retries, rate limits, and fallbacks before you need them.
- Write the boring abstraction. A shared Page type was worth more than any clever feature on top of it.
- Own your parsers when you have to. A small purpose-built AST is often less code than integrating a generic one.