Eosin V: The Registry // Thomas Havlik

Modern omics workflows require pulling thousands of files from dozens of vendors, each with its own identifiers and formats. We solve this by treating every external resource as a vendor artifact behind a consistent, OCI-like interface.

Artifacts

Here we define an artifact as any entity that can be pushed to or pulled from the registry. Following the OCI paradigm, an artifact has a single metadata object, but it may have multiple payloads, each addressed by a tag (corresponding to OCI's :tag).
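
As a rough illustration of this model, an artifact can be pictured as one metadata object plus a mapping from tag to payload. The class names, field names, and media type below are hypothetical and are not Cyto's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Payload:
    """One downloadable blob attached to an artifact (hypothetical shape)."""
    media_type: str   # illustrative value only, e.g. "text/x-fasta"
    size_bytes: int


@dataclass
class Artifact:
    """A registry entity: exactly one metadata object, zero or more tagged payloads."""
    ref: str                                                   # e.g. "ncbi/gene/675"
    metadata: dict
    payloads: Dict[str, Payload] = field(default_factory=dict)  # keyed by :tag

    def add_payload(self, tag: str, payload: Payload) -> None:
        # Each tag names at most one payload on a given artifact.
        if tag in self.payloads:
            raise ValueError(f"tag {tag!r} already exists on {self.ref}")
        self.payloads[tag] = payload


brca2 = Artifact(ref="ncbi/gene/675", metadata={"symbol": "BRCA2"})
brca2.add_payload("NC_000013.11", Payload("text/x-fasta", 0))  # size is a placeholder
```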

WSI Pull Example

Pushing/pulling large files like whole-slide images is conceptually trivial:

cyto pull camelyon17/slide/patient_000_node_0

This command will download both the metadata and the 2.6 GB patient_000_node_0.tif whole-slide image file from CAMELYON17.

Gene Pull Example

The edge cases arise from -omics data. These examples use the human (Homo sapiens) gene BRCA2 (BRCA2 DNA repair associated), to which NCBI assigns Gene ID 675.

The metadata returned by NCBI includes references to genomic sequences with start/end indices. Resolving these references via nuccore gives us the FASTA that we want for our analysis workflows. We can't be opinionated about canonical sequences (they might not exist), so we'll use the :tag to control this aspect. Let's consider some examples.

To download payloads (FASTA) for all available genomic sequences:

cyto pull ncbi/gene/675

This is equivalent to:

cyto pull ncbi/gene/675:full

And to download the FASTA for a specific genomic sequence, use the relevant NCBI RefSeq accession number as the tag:

cyto pull ncbi/gene/675:NC_000013.11

Refer to the BRCA2 Gene entry for visibility here. The described behavior (FASTA export) is demonstrated under the "Genomic regions, transcripts, and products" section.

Note that genes are genomic regions, not sequences. The FASTA payloads come from the genomic accessions (e.g. NC_000013.11) that define the region boundaries. Cyto simply automates the NCBI → nuccore → slice → FASTA pipeline, which is central to many computational workflows.
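
The slice → FASTA step can be sketched in a few lines. This is a minimal sketch, not Cyto's implementation: the function name, the header format, and the 1-based inclusive coordinate convention are all assumptions, and it ignores strand (a gene on the minus strand would also need reverse-complementing). The sequence is toy data rather than a real accession.

```python
def slice_to_fasta(accession: str, sequence: str, start: int, end: int, width: int = 70) -> str:
    """Cut [start, end] out of `sequence` and format it as a FASTA record.

    Coordinates are assumed to be 1-based and inclusive (an assumption here).
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("coordinates fall outside the sequence")
    region = sequence[start - 1:end]
    # Wrap the region into fixed-width lines, as FASTA files conventionally do.
    lines = [region[i:i + width] for i in range(0, len(region), width)]
    header = f">{accession}:{start}-{end}"  # illustrative header format
    return "\n".join([header, *lines]) + "\n"


record = slice_to_fasta("toy_accession", "ACGTACGTAC", start=3, end=8, width=4)
print(record, end="")
# >toy_accession:3-8
# GTAC
# GT
```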

The role of the :tag will depend on the entity type. For community-pushed artifacts, it can be used as a version, echoing the OCI registry paradigm.
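
To make the reference shape concrete, here is a small parser for the vendor/type/id[:tag] form used in the commands above. It is a sketch with hypothetical names (ArtifactRef, parse_ref). It assumes a missing tag defaults to full, as in the gene example; whether that default applies to every entity type is not something this post specifies.

```python
from typing import NamedTuple


class ArtifactRef(NamedTuple):
    vendor: str
    kind: str    # entity type, e.g. "gene" or "slide"
    ident: str
    tag: str


def parse_ref(ref: str, default_tag: str = "full") -> ArtifactRef:
    """Split 'vendor/type/id[:tag]' into its parts (hypothetical helper)."""
    path, sep, tag = ref.partition(":")
    parts = path.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected vendor/type/id[:tag], got {ref!r}")
    if sep and not tag:
        raise ValueError(f"empty tag in {ref!r}")
    vendor, kind, ident = parts
    # Fall back to the default tag only when no ':' was given at all.
    return ArtifactRef(vendor, kind, ident, tag if sep else default_tag)


print(parse_ref("ncbi/gene/675"))
print(parse_ref("ncbi/gene/675:NC_000013.11"))
```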

The Annotation Layer

Ontology entities and other similar annotations lack payloads and thus don't belong as pullable artifacts. Given that most users won't need ontology data, it also doesn't make sense to include them as attached fields - this would needlessly inflate metadata for most artifacts of interest. Ontologies belong in a dynamic query layer, not as artifacts with payloads.

The solution here is an annotation layer that sits on top of the artifact registry. Ontologies and similar annotations are then included as optional add-ons to cyto pull. Further details are beyond the scope of this blog post, but will be covered later.

Datasets

Instead of hand-writing lists of 5,000 gene IDs, datasets can declare rules for inclusion and exclusion. Cyto resolves these rules into concrete vendor-defined references, yielding a reproducible, versioned feature index. Pulling a dataset artifact will pull all of its child artifacts, analogous to pulling the individual image layers in OCI registries.

The ops win here is the ability to implicitly define datasets. This avoids manually curated lists, brittle spreadsheets, mismatched Ensembl/RefSeq releases, and all sorts of other chaos in ML pipelines.

For illustrative purposes, here's an example dataset config:

features:
  # Positive selectors - everything that should be included before exclusions are applied
  include:
    # 1. Direct gene symbols
    - type: symbol
      species: human
      symbols:
        - BRCA1
        - BRCA2
        - TP53
        - RAD51
        - PALB2

    # 2. All genes annotated to a GO term (with descendants)
    - type: go_term
      id: "GO:0000724"           # Double-strand break repair via homologous recombination
      include_descendants: true

    # 3. Reactome pathway
    - type: pathway
      vendor: reactome
      id: "R-HSA-5693567"        # Homology Directed Repair

    # 4. Include a predefined geneset stored in Cyto
    - type: panel
      ref: "geneset/msigdb/hallmark_DNA_REPAIR"

    # 5. Dataset-driven top variable genes for RNA-seq
    - type: top_variable
      modality: rna
      top_n: 2000

    # 6. Dataset-driven top variable proteins
    - type: top_variable
      modality: protein
      top_n: 1000

  # Negative selectors - remove these from the candidate set
  exclude:
    # 1. Cyto internal housekeeping gene panel
    - type: panel
      ref: "geneset/std/housekeeping_human"

    # 2. Mitochondrial genes by biotype
    - type: biotype
      biotypes:
        - Mt_rRNA
        - Mt_tRNA
        - Mt_protein

    # 3. Lowly expressed genes (dataset-driven)
    - type: low_expression
      modality: rna
      threshold_tpm: 1

    # 4. Genes with extremely low variance after normalization
    - type: low_variance
      modality: rna
      bottom_percent: 10
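
The resolution order implied by the comments in this config (collect every include, then subtract every exclude) can be sketched as below. This is an illustration under stated assumptions, not Cyto's resolver: resolve_features and the resolver table are hypothetical, the panel contents are toy data, and the dataset-driven selectors (top_variable, low_expression, low_variance) are left out because they need expression data to evaluate.

```python
from typing import Callable, Dict, FrozenSet, List

# A resolver turns one rule into a concrete set of feature identifiers.
Resolver = Callable[[dict], FrozenSet[str]]


def resolve_features(config: dict, resolvers: Dict[str, Resolver]) -> List[str]:
    """Union of all include rules minus union of all exclude rules."""
    def expand(rules: List[dict]) -> set:
        out: set = set()
        for rule in rules:
            out |= resolvers[rule["type"]](rule)
        return out

    features = config["features"]
    included = expand(features.get("include", []))
    excluded = expand(features.get("exclude", []))
    # Sorting makes the resulting feature index independent of rule order.
    return sorted(included - excluded)


# Toy resolvers standing in for vendor lookups (hypothetical data).
toy_resolvers: Dict[str, Resolver] = {
    "symbol": lambda rule: frozenset(rule["symbols"]),
    "panel": lambda rule: frozenset({"ACTB", "GAPDH"}),
}

config = {
    "features": {
        "include": [{"type": "symbol", "species": "human",
                     "symbols": ["BRCA2", "GAPDH", "BRCA1"]}],
        "exclude": [{"type": "panel", "ref": "geneset/std/housekeeping_human"}],
    }
}
print(resolve_features(config, toy_resolvers))  # ['BRCA1', 'BRCA2']
```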

Kernels

Now that our entire multi-omics dataset can be downloaded with a single cyto pull command, we want to train models with it. This process is simplified with kernel configuration, which describes precisely how these artifacts will be served up by your Dataset class in PyTorch/TensorFlow/etc.

Kernels are specified declaratively (similar to the dataset example above). They describe the feature axis (gene/transcript/protein), how multiple modalities are fused, how missing data is handled, and what the final tensor shape should be. Lysis then materializes a PyTorch/TensorFlow-ready dataset that matches a kernel exactly.
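
To give a feel for what a kernel pins down, the sketch below builds fixed-shape rows from per-sample, per-modality values. It is not the kernel format or the materialization code: the function name and data layout are hypothetical, fusion by concatenation and a constant fill for missing values are assumptions chosen for simplicity, and plain Python lists stand in for tensors.

```python
from typing import Dict, List, Sequence

# sample -> modality -> feature -> measured value
Sample = Dict[str, Dict[str, float]]


def materialize(samples: Sequence[Sample],
                feature_axis: Sequence[str],
                modalities: Sequence[str],
                fill: float = 0.0) -> List[List[float]]:
    """Return rows of shape (n_samples, n_modalities * n_features).

    Modalities are fused by concatenation; missing values become `fill`.
    """
    rows = []
    for sample in samples:
        row: List[float] = []
        for modality in modalities:
            values = sample.get(modality, {})  # a whole modality may be absent
            row.extend(float(values.get(f, fill)) for f in feature_axis)
        rows.append(row)
    return rows


samples = [
    {"rna": {"BRCA1": 5.0, "BRCA2": 2.5}, "protein": {"BRCA1": 1.0}},
    {"rna": {"BRCA2": 0.5}},  # no protein measurements for this sample
]
rows = materialize(samples, feature_axis=["BRCA1", "BRCA2"], modalities=["rna", "protein"])
print(rows)  # [[5.0, 2.5, 1.0, 0.0], [0.0, 0.5, 0.0, 0.0]]
```

Every row has the same length regardless of which measurements a sample actually has, which is the property a training loop depends on.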

Filling the Niche

Cyto is not a database aggregator; it doesn't store raw payloads, and the design intent is radically different. Official vendor websites (NCBI/Ensembl/etc.) remain as a data visibility layer. Cyto is the missing ops layer that unifies cross-vendor data in bulk using terse config files. Multi-omics researchers need only modify a single config file to determine dataset composition, and the kernel delivers the data to Python in predictable shapes.

By treating vendor resources as typed artifacts, separating annotation from payloads, and allowing datasets and kernels to be described declaratively, Cyto reduces a large amount of operational overhead that normally accumulates around multi-omics projects.

Contributions welcome: github.com/eosin-platform

-Tom