Eosin IV: Emergent Systems

2026-02-24

Lately I've been thinking about why machine learning with biological data feels harder than it should. The research moves slowly not because scientists lack ideas, but because the infrastructure around data is fractured. Every archive, every consortium, every lab uses a different schema, format, or ontology. ML engineers spend more time writing loaders and patching metadata than training models. Scientists wait months or years to see what others can do with their data.

Nothing but Friction

A typical workflow looks like this:

  1. A lab generates data.
  2. Months pass while the paper and supplemental materials are prepared.
  3. Eventually, the dataset appears on SRA, GEO, TCGA, or a challenge platform.
  4. ML researchers discover it, rebuild schemas from scratch, and run private experiments.
  5. Results circulate weeks or months later, usually disconnected from the original authors.

None of this delay is scientific - it's just friction baked into today's bioinformatics infrastructure.

A Simple Idea

What if uploading data were the start of the analysis instead of the end of it? Not via a closed ecosystem, but through a shared, stable representation of biological entities:

cyto://ncbi/gene/brca2
cyto://jhu/slide/CRCC003-P0001
cyto://mit/eeg/subject001-sessionA

If raw data from many vendors and modalities can be normalized into a consistent, lossless schema, then ML systems no longer need bespoke tooling for each dataset. A single automatic workflow becomes possible, and standardized metadata makes multi-omics ML dramatically easier.
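The identifiers above follow a `cyto://<authority>/<type>/<id>` shape. As a minimal sketch (the `CytoRef` type and `parse_cyto_uri` function are illustrative, not part of any published API), parsing one might look like:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CytoRef:
    """A parsed cyto:// identifier: authority, entity type, and entity ID."""
    authority: str
    entity_type: str
    entity_id: str

def parse_cyto_uri(uri: str) -> CytoRef:
    """Split a cyto://<authority>/<type>/<id> URI into its three parts."""
    prefix = "cyto://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a cyto URI: {uri}")
    authority, entity_type, entity_id = uri[len(prefix):].split("/", 2)
    return CytoRef(authority, entity_type, entity_id)

ref = parse_cyto_uri("cyto://ncbi/gene/brca2")
print(ref.authority, ref.entity_type, ref.entity_id)  # ncbi gene brca2
```

A stable, parseable identifier is what lets every downstream tool agree on which entity it is talking about.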

Consequences of Normalized Data

Imagine that normalized data appears the moment a scientist uploads it. An ML engineer could subscribe to a dataset type—slides, variants, expression matrices—and automatically dispatch their training or inference jobs. Results could flow back into Cyto without coordination or manual integration.

A lab could upload a dataset and come back the next morning to a list of models, baselines, visualizations, and embeddings produced independently by people they've never met. This isn't speculative: researchers already do similar things informally in Kaggle competitions, internal Slack groups, and ad-hoc GitHub projects. The only missing piece is consistent structure.
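The subscribe-and-dispatch idea above can be sketched as a tiny in-process event bus. Everything here is hypothetical (the dataset-type names, the `cyto://example/...` URI, and the bus itself are illustrative, assuming uploads arrive as plain dicts):

```python
from collections import defaultdict
from typing import Callable

# Hypothetical event bus: handlers subscribe by dataset type,
# and each upload is dispatched to every matching handler.
_subscribers: defaultdict[str, list[Callable[[dict], None]]] = defaultdict(list)

def subscribe(dataset_type: str, handler: Callable[[dict], None]) -> None:
    _subscribers[dataset_type].append(handler)

def publish_upload(dataset: dict) -> None:
    for handler in _subscribers[dataset["type"]]:
        handler(dataset)

# An ML engineer registers a job to run on every new expression matrix.
results: list[str] = []
subscribe("expression_matrix",
          lambda d: results.append(f"trained baseline on {d['uri']}"))

publish_upload({"type": "expression_matrix",
                "uri": "cyto://example/expr/ds42"})
print(results)  # ['trained baseline on cyto://example/expr/ds42']
```

In a real deployment the bus would be a message queue rather than an in-memory dict, but the contract is the same: normalized uploads in, automatic jobs out.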

Why This Works

The effort is in unifying vendor schemas into a stable core with a polymorphic extension section: critical fields are reconciled into a coherent, vendor-agnostic core, while vendor-specific fields are preserved losslessly in the extension. A clean schema is what allows fragmented records to resolve to a single canonical entity, and clean plumbing is what enables automation across scientific data pipelines.
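One way to picture the core-plus-extension split, as a sketch only (the field names, vendor name, and values below are all invented for illustration):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class SlideRecord:
    """Hypothetical normalized record: a stable, vendor-agnostic core
    plus a free-form extension block for vendor-specific fields."""
    uri: str                  # canonical cyto:// identifier
    modality: str             # e.g. "slide", "variant", "expression"
    sample_id: str
    acquired_at: str          # ISO 8601 timestamp
    vendor: str
    extension: dict[str, Any] = field(default_factory=dict)  # lossless extras

record = SlideRecord(
    uri="cyto://jhu/slide/CRCC003-P0001",
    modality="slide",
    sample_id="CRCC003",
    acquired_at="2026-01-15T09:30:00Z",
    vendor="aperio",
    extension={"scanner_model": "GT450", "mpp": 0.25},
)
# Downstream ML code reads only the core fields; the extension is
# preserved for anyone who needs the vendor-specific details.
print(record.modality, record.extension["mpp"])
```

The design choice is that generic tooling never has to understand the extension block, but nothing is thrown away when a vendor's format is normalized.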

A Two-Sided Benefit

Scientists get faster analysis, and ML engineers get clean data with zero glue code. ML for biology becomes delightful if:

  • data arrives in predictable shapes
  • compute workflows can fire automatically
  • outputs attach back to the originating dataset
  • collaboration happens by default, not by invitation

None of this requires new scientific insight, just reliable infrastructure. Predictable inputs turn manual preprocessing into automated ML workflows and save hours of redundant engineering.

A Note on Trust

Not all data is equal. Cyto will support verification layers: institutions, labs, or individuals can become "trusted" sources. This lets ML automation run only on data with known provenance. It's optional, not hierarchical - just a filter for people who want it. For example, competitive ML teams might choose to analyze only data from authoritative institutions, or from individual researchers with strong reputations.
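The trust layer reduces to a provenance filter over incoming datasets. A minimal sketch, assuming each dataset carries its uploading source and that the trusted-source names below are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    uri: str
    source: str  # uploading institution, lab, or individual

# Hypothetical registry of verified sources; the names are made up.
TRUSTED_SOURCES = {"ncbi", "jhu-pathology"}

def trusted_only(datasets: list[Dataset]) -> list[Dataset]:
    """Optional provenance filter: keep only datasets from verified sources."""
    return [d for d in datasets if d.source in TRUSTED_SOURCES]

uploads = [
    Dataset("cyto://ncbi/gene/brca2", "ncbi"),
    Dataset("cyto://anon/expr/x1", "anonymous-user"),
]
print([d.uri for d in trusted_only(uploads)])  # ['cyto://ncbi/gene/brca2']
```

Because the filter sits outside the data itself, anyone who doesn't care about provenance can simply skip it.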

Early Publication Without Penalty

There's a long-standing fear in academia that releasing data early risks being scooped. In a system with traceable uploads and automatic analysis on arrival, the opposite becomes true: the first uploader is the first author of record. Releasing early becomes safer, not riskier. This could inspire new and interesting workflows.

The ultimate goal is to shorten the distance between data and insight.

Contributions are welcome: github.com/eosin-platform

-Tom