The problem: five schemas that disagree about "a compound"
Each scientific source models the world differently. PubChem keys everything on a numeric CID and speaks a sprawling REST dialect. MediaWiki-family sources (Wikipedia, PsychonautWiki) speak Wikitext plus their own templates. PubMed and PMC live in NCBI's Entrez ecosystem and return XML, JATS in PMC's case. None of them agree on what counts as a compound, what its canonical name is, or whether it even deserves its own page.
The library's job is to make sure the caller never has to learn any of that.
Identity is the load-bearing wall
The single decision that collapsed most of the complexity was picking one source of truth for identity and treating every other integration as a resolution problem layered on top of it. PubChem got that role because it is the only source that is both deterministic and exhaustive for small molecules.
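A minimal sketch of the shape that decision gives everything else (the names here are illustrative, not the library's actual API): identity is a PubChem CID plus the names PubChem knows, and every other source is a resolver from that identity to its own pages.

struct Identity
{
    long cid;              // PubChem CID: the one source of truth
    string canonicalName;  // PubChem's preferred name for the compound
    string[] synonyms;     // known synonyms, handed to downstream resolvers
}

interface Resolver
{
    // Returns this source's page title for the compound, or null if it has none.
    string resolve(const Identity id);
}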
Wikipedia resolution stops being “fuzzy-match a title” and becomes “take the canonical name and known synonyms, rank candidate pages, and confirm against the chemistry infobox.” PsychonautWiki follows the same pattern with a graceful fallback through parent compounds for the many psychoactives that don’t have their own page. The core resolution layer was rewritten three times before I accepted that the right answer was to stop being clever and anchor everything to one trustworthy identifier.
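What that ranking step can look like, as a hedged sketch (the scoring and the function name are mine, not the library's; the real flow also confirms the winner against the chemistry infobox before accepting it):

string resolveWikipedia(string canonicalName, string[] synonyms, string[] candidates)
{
    import std.uni : icmp;

    int score(string title)
    {
        if (icmp(title, canonicalName) == 0) return 2;  // exact canonical match
        foreach (syn; synonyms)
            if (icmp(title, syn) == 0) return 1;        // synonym match
        return 0;
    }

    string best;
    int bestScore;
    foreach (title; candidates)
    {
        const s = score(title);
        if (s > bestScore) { best = title; bestScore = s; }
    }
    return best;  // null when nothing matched; the caller then falls back or gives up
}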
One Page type to rule five sources
Every text source (Wikipedia, PsychonautWiki, PubMed, PMC) materializes as the same object:
struct Page
{
    string raw();         // untouched source payload
    Document document();  // parsed AST (Wikitext or XML)
    Section[] sections(); // structured sections, in document order
    string preamble();    // lead text before the first section
    string fulltext();    // flattened plain text
}
The frontend never learns which source produced a given page. That one abstraction erased a huge amount of special-case code in Chemica and made “render whatever we have about caffeine” a single code path instead of five.
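A small illustration of what that buys (render() is hypothetical, not Chemica's code): the frontend writes one loop over Page values and never branches on provenance.

void render(Page[] pages)
{
    import std.stdio : writeln;

    foreach (page; pages)
        writeln(page.fulltext());  // same call whether the page came from
                                   // Wikipedia, PsychonautWiki, PubMed, or PMC
}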
Lazy hydration, no caching
Fields are fetched when they are first touched, and nothing is cached across calls. That sounds heavier than it is: in practice, lazy hydration is simpler to reason about, easier to test, and honest about the fact that scientific data changes under you. If the caller wants caching, they add it at the call site, where they actually know the correct lifetime.
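A minimal sketch of the idea, assuming the conductor-backed fetch is injected as a delegate (LazyPage and its fields are illustrative):

struct LazyPage
{
    private string url;
    private string delegate(string) fetch;  // conductor-backed HTTP GET
    private string payload;
    private bool hydrated;

    // The network hit happens on first touch; nothing outlives this object.
    string raw()
    {
        if (!hydrated)
        {
            payload = fetch(url);
            hydrated = true;
        }
        return payload;
    }
}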
conductor: network hygiene lives in one place
HTTP, retries, exponential backoff, and per-source rate limiting all live in the conductor library. Every source adapter is written against it. Adding a sixth source later is a matter of writing a thin adapter, not relearning network hygiene.
Each module owns its own rate-limiting state (a clock and a minimum interval) instead of sharing a global singleton. Two modules hitting two providers never share a bottleneck, and when a provider changes policy, the fix is local.
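A hedged sketch of what that per-module state amounts to (the type and field names are mine, not conductor's API): each adapter owns one of these and calls waitTurn() before every request.

import core.thread : Thread;
import core.time : Duration, MonoTime;

struct RateLimiter
{
    Duration minInterval;  // provider-specific spacing between requests
    MonoTime lastRequest;  // this module's own clock, not a global singleton

    void waitTurn()
    {
        const elapsed = MonoTime.currTime - lastRequest;
        if (lastRequest != MonoTime.init && elapsed < minInterval)
            Thread.sleep(minInterval - elapsed);
        lastRequest = MonoTime.currTime;
    }
}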
Writing the parsers I didn’t want to write
Every existing D option either lost formatting, choked on templates, or pulled in an unwanted dependency, so I wrote my own. The Wikitext parser is a small AST with style bitflags instead of per-style node types. That keeps the tree compact, the renderer boring, and the transform layer able to do structural passes without switching on dozens of node kinds.
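A hedged sketch of the bitflag idea (node and flag names are illustrative): a single inline node carries a style mask, so a new style is one more flag instead of one more node kind.

enum Style : uint
{
    none   = 0,
    bold   = 1 << 0,
    italic = 1 << 1,
    code   = 1 << 2,
    strike = 1 << 3,
}

struct WikiNode
{
    string text;          // leaf text; empty for purely structural nodes
    uint styles;          // mask of Style flags covering this span
    WikiNode[] children;  // structural nesting: sections, lists, templates
}

// The renderer tests masks instead of switching on dozens of node kinds.
bool isBold(const WikiNode n) { return (n.styles & Style.bold) != 0; }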
PMC returns JATS XML, which is its own flavor of painful. Sharing utilities between the Wikitext and XML paths was one of those small cleanups that paid for itself many times over.
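One example of the kind of helper that ends up shared, sketched under the assumption that nodes on both sides expose text and children (the real utilities differ): a single templated flatten works for the Wikitext AST and the parsed JATS tree alike.

string flatten(N)(const N node)
{
    string result = node.text;
    foreach (child; node.children)
        result ~= flatten(child);
    return result;
}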
The extension test
After Chemica shipped I added a pubchem/bio module (genes, proteins, orthologs, pathways, disease associations) hydrated from PubChem’s PUG REST and PUG View gene endpoints. Chemica doesn’t use any of it yet. But it took an afternoon, not a refactor. That is the test I care about. Not “does it work today,” but “how much does it cost me to extend.”
What I’d carry into the next one
- Pick the identity anchor early. A bad identity model will cost more than a bad parser ever will.
- Treat unreliable sources as unreliable. Build retries, rate limits, and fallbacks before you need them.
- Write the boring abstraction. A shared Page type was worth more than any clever feature on top of it.
- Own your parsers when you have to. A small purpose-built AST is often less code than integrating a generic one.