Curate AnnData based on the CELLxGENE schema¶
This guide shows how to curate an AnnData object against the CELLxGENE schema v5.1.0 with the help of laminlabs/cellxgene.
Load your instance where you want to register the curated AnnData object:
# !pip install 'lamindb[bionty,jupyter]' cellxgene-lamin
# cellxgene-schema has pinned dependencies. Therefore we recommend installing it into a separate environment using `uv` or `pipx`
# uv tool install cellxgene-schema==5.1.0
!lamin init --storage ./test-cellxgene-curate --name test-cellxgene-curate --schema bionty
→ initialized lamindb: testuser1/test-cellxgene-curate
import lamindb as ln
import cellxgene_lamin as cxg
→ connected lamindb: testuser1/test-cellxgene-curate
Let’s start with an AnnData object that we’d like to inspect and curate. We write it to disk so that we can run CZI’s cellxgene-schema CLI tool, which verifies whether an on-disk h5ad dataset adheres to the CELLxGENE schema.
adata = cxg.datasets.anndata_human_immune_cells()
AnnData object with n_obs × n_vars = 1626 × 36503
obs: 'donor', 'tissue', 'cell_type', 'assay', 'sex_ontology_term_id', 'organism', 'sex'
var: 'feature_is_filtered'
uns: 'default_embedding'
obsm: 'X_umap'
Initially, CZI’s cellxgene-schema validator does not pass, so we need to curate the dataset.
!MPLBACKEND=agg uvx cellxgene-schema validate anndata_human_immune_cells.h5ad || exit 1
Loading dependencies
Loading validator modules
Starting validation...
Unable to open 'anndata_human_immune_cells.h5ad' with AnnData
Validate and curate metadata¶
We create a Curator object that references the AnnData object. During instantiation, any lamindb.Feature records are saved.
curator = cxg.Curator(adata, organism="human", schema_version="5.1.0")
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/lamindb/ FutureWarning: `name` will be removed soon, please pass 'Transfer from `laminlabs/cellxgene`' to `description` instead
transform = Transform(
✓ added 5 records from laminlabs/cellxgene with for "columns": 'assay', 'cell_type', 'tissue', 'organism', 'sex_ontology_term_id'
✓ added 1 record from laminlabs/cellxgene with for "columns": 'sex'
Let’s fix the “donor_id” column name:
adata.obs.rename(columns={"donor": "donor_id"}, inplace=True)
validated = curator.validate()
✗ missing required obs columns development_stage, disease, self_reported_ethnicity, suspension_type, tissue_type
• consider initializing a Curate object like 'Curate(adata, defaults=cxg.CellxGeneFields.OBS_FIELD_DEFAULTS)'to automatically add these columns with default values.
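The check behind the "missing required obs columns" message can be illustrated with plain pandas. This is only a sketch: `REQUIRED_OBS_COLUMNS` and `missing_obs_columns` are hypothetical names, the list is assembled from the columns mentioned in this guide rather than from the official schema definition, and it is not how cellxgene-lamin implements the check.

```python
import pandas as pd

# Hypothetical list of required obs columns, assembled from the columns named
# in this guide; it is NOT the official CELLxGENE schema definition.
REQUIRED_OBS_COLUMNS = [
    "assay",
    "cell_type",
    "development_stage",
    "disease",
    "donor_id",
    "organism",
    "self_reported_ethnicity",
    "sex",
    "suspension_type",
    "tissue",
    "tissue_type",
]


def missing_obs_columns(obs: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required column names that are absent from obs, sorted."""
    return sorted(set(required) - set(obs.columns))


# Empty toy obs table with the columns of the example dataset
# after renaming "donor" to "donor_id".
obs = pd.DataFrame(
    columns=[
        "donor_id",
        "tissue",
        "cell_type",
        "assay",
        "sex_ontology_term_id",
        "organism",
        "sex",
    ]
)

missing = missing_obs_columns(obs, REQUIRED_OBS_COLUMNS)
print(missing)
```

With this toy table, the function reports the same five columns that the validator lists as missing.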
For the missing columns, we can pass the default values suggested by CELLxGENE, which are then added to the AnnData object automatically:
cxg.CellxGeneFields.OBS_FIELD_DEFAULTS
{'cell_type': 'unknown',
'development_stage': 'unknown',
'disease': 'normal',
'donor_id': 'unknown',
'self_reported_ethnicity': 'unknown',
'sex': 'unknown',
'suspension_type': 'cell',
'tissue_type': 'tissue'}
CELLxGENE requires the columns tissue, organism, and assay to have existing values from the ontologies. Therefore, these columns need to be added and populated manually.
curator = cxg.Curator(adata, defaults=cxg.CellxGeneFields.OBS_FIELD_DEFAULTS, organism="human", schema_version="5.1.0")
→ added default value 'unknown' to the adata.obs['development_stage']
→ added default value 'normal' to the adata.obs['disease']
→ added default value 'unknown' to the adata.obs['self_reported_ethnicity']
→ added default value 'cell' to the adata.obs['suspension_type']
→ added default value 'tissue' to the adata.obs['tissue_type']
✓ added 6 records from laminlabs/cellxgene with for "columns": 'development_stage', 'disease', 'donor_id', 'self_reported_ethnicity', 'tissue_type', 'suspension_type'
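To make the effect of the defaults concrete, here is a minimal pandas sketch of the idea: every default whose column is absent is added as a constant column, and existing columns are left untouched. The helper `add_default_columns` and the toy table are hypothetical, and the actual behavior of the curator for existing columns may differ.

```python
import pandas as pd

# Subset of the default values shown above.
defaults = {
    "development_stage": "unknown",
    "disease": "normal",
    "self_reported_ethnicity": "unknown",
    "suspension_type": "cell",
    "tissue_type": "tissue",
}


def add_default_columns(obs: pd.DataFrame, defaults: dict) -> tuple:
    """Return a copy of obs with absent default columns added, plus their names."""
    obs = obs.copy()
    added = []
    for column, value in defaults.items():
        if column not in obs.columns:
            obs[column] = value  # a scalar is broadcast to every row
            added.append(column)
    return obs, added


# Toy obs table: "suspension_type" already exists and must not be overwritten.
obs = pd.DataFrame(
    {
        "tissue": ["blood", "spleen"],
        "suspension_type": ["nucleus", "nucleus"],
    },
    index=["cell_1", "cell_2"],
)

filled, added = add_default_columns(obs, defaults)
print(added)
print(filled)
```

Only the absent columns are filled; tissue, organism, and assay have no sensible default and still have to be supplied by hand.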
validated = curator.validate()
→ validating metadata using registries of instance laminlabs/cellxgene
• saving validated records of 'var_index'
✓ added 36390 records from public with Gene.ensembl_gene_id for "var_index": 'ENSG00000243485', 'ENSG00000237613', 'ENSG00000186092', 'ENSG00000238009', 'ENSG00000239945', 'ENSG00000239906', 'ENSG00000241860', 'ENSG00000241599', 'ENSG00000286448', 'ENSG00000236601', 'ENSG00000284733', 'ENSG00000235146', 'ENSG00000284662', 'ENSG00000229905', 'ENSG00000237491', 'ENSG00000177757', 'ENSG00000228794', 'ENSG00000225880', 'ENSG00000230368', 'ENSG00000272438', ...
• saving validated records of 'assay'
✓ added 3 records from public with for "assay": '10x 5' v2', '10x 5' v1', '10x 3' v3'
• saving validated records of 'cell_type'
✓ added 31 records from public with for "cell_type": 'CD4-positive helper T cell', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'non-classical monocyte', 'memory B cell', 'CD16-negative, CD56-bright natural killer cell, human', 'plasmacytoid dendritic cell', 'lymphocyte', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'dendritic cell, human', 'T follicular helper cell', 'progenitor cell', 'mucosal invariant T cell', 'germinal center B cell', 'alveolar macrophage', 'group 3 innate lymphoid cell', 'naive B cell', 'megakaryocyte', 'mast cell', ...
✓ added 1 record from laminlabs/cellxgene with for "development_stage": 'unknown'
✓ added 1 record from laminlabs/cellxgene with for "disease": 'normal'
✓ added 1 record from laminlabs/cellxgene with for "self_reported_ethnicity": 'unknown'
✓ added 1 record from laminlabs/cellxgene with Phenotype.ontology_id for "sex_ontology_term_id": 'PATO:0000384'
✓ added 1 record from laminlabs/cellxgene with for "suspension_type": 'cell'
• saving validated records of 'tissue'
✓ added 16 records from public with for "tissue": 'thymus', 'liver', 'mesenteric lymph node', 'bone marrow', 'spleen', 'caecum', 'blood', 'skeletal muscle tissue', 'jejunal epithelium', 'duodenum', 'sigmoid colon', 'ileum', 'transverse colon', 'lamina propria', 'thoracic lymph node', 'omentum'
✓ added 1 record from laminlabs/cellxgene with for "tissue_type": 'tissue'
• mapping "var_index" on Gene.ensembl_gene_id
! 113 terms are not validated: 'ENSG00000269933', 'ENSG00000261737', 'ENSG00000259834', 'ENSG00000256374', 'ENSG00000263464', 'ENSG00000203812', 'ENSG00000272196', 'ENSG00000272880', 'ENSG00000270188', 'ENSG00000287116', 'ENSG00000237133', 'ENSG00000224739', 'ENSG00000227902', 'ENSG00000239467', 'ENSG00000272551', 'ENSG00000280374', 'ENSG00000236886', 'ENSG00000229352', 'ENSG00000286601', 'ENSG00000227021', ...
→ fix typos, remove non-existent values, or save terms via .add_new_from_var_index()
✓ "assay" is validated against
✓ "cell_type" is validated against
✓ "development_stage" is validated against
✓ "disease" is validated against
• mapping "donor_id" on
! 12 terms are not validated: 'D496-1', '621B-1', 'A29-1', 'A36-1', 'A35-1', '637C-1', 'A52-1', 'A37-1', 'D503-1', '640C-1', 'A31-1', '582C-1'
→ fix typos, remove non-existent values, or save terms via .add_new_from("donor_id")
✓ "self_reported_ethnicity" is validated against
✓ "sex_ontology_term_id" is validated against Phenotype.ontology_id
✓ "suspension_type" is validated against
• mapping "tissue" on
! 1 term is not validated: 'lungg'
→ fix typos, remove non-existent values, or save terms via .add_new_from("tissue")
✓ "tissue_type" is validated against
✓ "organism" is validated against
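Conceptually, the validation output above partitions the unique values of each column into terms found in a registry and terms that are not. The following pure-Python sketch shows that idea with a hypothetical mini registry; `split_validated` is an illustrative helper, not part of lamindb or cellxgene-lamin.

```python
def split_validated(values, registry):
    """Partition the unique values into (validated, non_validated) lists."""
    unique = list(dict.fromkeys(values))  # deduplicate, keep first-seen order
    validated = [v for v in unique if v in registry]
    non_validated = [v for v in unique if v not in registry]
    return validated, non_validated


# Hypothetical mini registry of tissue names.
tissue_registry = {"lung", "blood", "spleen", "thymus"}

# Toy column values including the typo from the output above.
tissues = ["blood", "lungg", "spleen", "blood", "lungg"]

validated, non_validated = split_validated(tissues, tissue_registry)
print(validated)
print(non_validated)
```

Each non-validated term then needs one of the remedies suggested in the log: fix the typo, remove the value, or register it as a new label.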
Remove unvalidated values¶
We remove all unvalidated genes. These genes may exist in a different Ensembl release but are not valid for the Ensembl version used by CELLxGENE schema 5.1.0 (Ensembl release 110).
adata = adata[:, ~adata.var.index.isin(curator.non_validated["var_index"])].copy()
if adata.raw is not None:
    # Apply the same gene filter to the raw layer so that it stays consistent
    raw_data = adata.raw.to_adata()
    raw_data = raw_data[
        :, ~raw_data.var_names.isin(curator.non_validated["var_index"])
    ].copy()
    adata.raw = raw_data
# We must create the Curator object again to ensure that it references the correct AnnData object
curator = cxg.Curator(adata, organism="human", schema_version="5.1.0")
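The gene filter above relies on a negated `isin` mask over the variable index. The same pattern can be tried on a small pandas DataFrame standing in for the expression matrix; the table and the `non_validated` list here are toy assumptions, not the real dataset.

```python
import pandas as pd

# Toy expression table: rows are cells, columns are Ensembl gene IDs.
counts = pd.DataFrame(
    [[1, 0, 3], [0, 2, 5]],
    index=["cell_1", "cell_2"],
    columns=["ENSG00000243485", "ENSG00000269933", "ENSG00000186092"],
)

# IDs flagged as not validated (stand-in for curator.non_validated["var_index"]).
non_validated = ["ENSG00000269933"]

# isin() yields True for flagged columns; ~ inverts the mask to keep the rest.
keep = ~counts.columns.isin(non_validated)
filtered = counts.loc[:, keep].copy()

print(filtered)
```

Taking a copy after slicing mirrors the `.copy()` in the AnnData code, which turns the sliced view into an independent object.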
Register new metadata labels¶
Following the suggestions above, we register the genes and labels that aren’t present in the current instance.
(Note that our instance is rather empty. Once you have filled up the registries, registering new labels won’t be needed frequently.)
For donors, we register the new labels:
curator.add_new_from("donor_id")
✓ added 12 records with for "donor_id": 'A36-1', 'A35-1', 'A37-1', 'A29-1', 'D496-1', 'D503-1', 'A31-1', 'A52-1', '582C-1', '640C-1', '637C-1', '621B-1'
The validator flags the tissue label “lungg”, which is a typo for “lung”. Let’s fix it:
tissues = curator.lookup().tissue
adata.obs["tissue"] = adata.obs["tissue"].cat.rename_categories({"lungg": tissues.lung.name})
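Renaming a category changes every occurrence of that label at once while keeping the column categorical. A standalone sketch on a toy pandas Series (the values are illustrative only):

```python
import pandas as pd

# Toy categorical column containing the misspelled label.
tissue = pd.Series(["blood", "lungg", "spleen", "lungg"], dtype="category")

# Map the misspelled category to the corrected name; other categories are unchanged.
fixed = tissue.cat.rename_categories({"lungg": "lung"})

print(fixed.tolist())
print(list(fixed.cat.categories))
```

One caveat: pandas requires categories to be unique, so this rename would fail if "lung" were already a category of the column. In that case, replacing the values and removing the unused category is an alternative.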
Let’s validate the object again:
validated = curator.validate()
→ validating metadata using registries of instance laminlabs/cellxgene
• saving validated records of 'tissue'
✓ added 1 record from public with for "tissue": 'lung'
✓ "var_index" is validated against Gene.ensembl_gene_id
✓ "assay" is validated against
✓ "cell_type" is validated against
✓ "development_stage" is validated against
✓ "disease" is validated against
✓ "donor_id" is validated against
✓ "self_reported_ethnicity" is validated against
✓ "sex_ontology_term_id" is validated against Phenotype.ontology_id
✓ "suspension_type" is validated against
✓ "tissue" is validated against
✓ "tissue_type" is validated against
✓ "organism" is validated against
| | donor_id | tissue | cell_type | assay | sex_ontology_term_id | organism | sex | development_stage | disease | self_reported_ethnicity | suspension_type | tissue_type |
| CZINY-0109_CTGGTCTAGTCTGTAC | D496-1 | blood | classical monocyte | 10x 3' v3 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| CZI-IA10244332+CZI-IA10244434_CCTTCGACATACTCTT | 621B-1 | thoracic lymph node | T follicular helper cell | 10x 5' v2 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| Pan_T7935491_CTGGTCTGTACATGTC | A29-1 | spleen | memory B cell | 10x 5' v1 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| Pan_T7980367_GGGCATCCAGGTGGAT | A36-1 | lung | alveolar macrophage | 10x 5' v1 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| Pan_T7935494_ATCATGGTCTACCTGC | A29-1 | mesenteric lymph node | naive thymus-derived CD4-positive, alpha-beta ... | 10x 5' v1 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
Save artifact¶
artifact = curator.save_artifact(
    description=f"dataset curated against cellxgene schema {curator.schema_version}"
)
! run input wasn't tracked, call `ln.track()` and re-run
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'dKuaK5lXG2OVCTQ20000'
│   ├── .size = 54670616
│   ├── .hash = 'VYhEnkViOhtD-7kN2odUGw'
│   ├── .n_observations = 1626
│   ├── .path = /home/runner/work/cellxgene-lamin/cellxgene-lamin/docs/test-cellxgene-curate/.lamindb/dKuaK5lXG2OVCTQ20000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-01-24 13:59:11
├── Dataset features/schema
│   ├── var • 36390 [bionty.Gene]
│   │   MIR1302-2HG              float
│   │   FAM138A                  float
│   │   OR4F5                    float
│   │   OR4F29                   float
│   │   OR4F16                   float
│   │   LINC01409                float
│   │   FAM87B                   float
│   │   LINC01128                float
│   │   LINC00115                float
│   │   FAM41C                   float
│   └── obs • 12 [Feature]
│       assay                    cat[bionty.ExperimentalF…  10x 3' v3, 10x 5' v1, 10x 5' v2
│       cell_type                cat[bionty.CellType]       CD16-negative, CD56-bright natural kille…
│       development_stage        cat[bionty.Developmental…  unknown
│       disease                  cat[bionty.Disease]        normal
│       donor_id                 cat[ULabel]                582C-1, 621B-1, 637C-1, 640C-1, A29-1, A…
│       organism                 cat[bionty.Organism]       human
│       self_reported_ethnicity  cat[bionty.Ethnicity]      unknown
│       sex_ontology_term_id     cat[bionty.Phenotype]      male
│       suspension_type          cat[ULabel]                cell
│       tissue                   cat[bionty.Tissue]         blood, bone marrow, caecum, duodenum, il…
│       tissue_type              cat[ULabel]                tissue
│       sex                      cat[bionty.Phenotype]
└── Labels
    └── .organisms               bionty.Organism            human
        .tissues                 bionty.Tissue              thymus, liver, mesenteric lymph node, bo…
        .cell_types              bionty.CellType            CD4-positive helper T cell, CD8-positive…
        .diseases                bionty.Disease             normal
        .phenotypes              bionty.Phenotype           male
        .experimental_factors    bionty.ExperimentalFactor  10x 5' v2, 10x 5' v1, 10x 3' v3
        .developmental_stages    bionty.DevelopmentalStage  unknown
        .ethnicities             bionty.Ethnicity           unknown
        .ulabels                 ULabel                     cell, tissue, A36-1, A35-1, A37-1, A29-1…
The following step is optional: it mimics how CELLxGENE groups AnnData objects into collections to link them to studies.
# register a new collection
title = "Cross-tissue immune cell analysis reveals tissue-specific features in humans (for test demo only)"
collection = ln.Collection(
    [artifact],  # registered artifact above, can also pass a list of artifacts
    name=title,  # title of the publication
    description="10.1126/science.abl5197",  # DOI of the publication
    reference="E-MTAB-11536",  # accession number (e.g. GSE#, E-MTAB#, etc.)
    reference_type="ArrayExpress",  # source type (e.g. GEO, ArrayExpress, SRA, etc.)
).save()
! no run & transform got linked, call `ln.track()` & re-run
! run input wasn't tracked, call `ln.track()` and re-run
/tmp/ipykernel_3490/ FutureWarning: argument `name` will be removed, please pass Cross-tissue immune cell analysis reveals tissue-specific features in humans (for test demo only) to `key` instead
collection = ln.Collection(
Return an input h5ad file for cellxgene-schema¶
adata_cxg = curator.to_cellxgene_anndata(is_primary_data=True, title=title)
AnnData object with n_obs × n_vars = 1626 × 36390
obs: 'donor_id', 'sex_ontology_term_id', 'suspension_type', 'tissue_type', 'tissue_ontology_term_id', 'cell_type_ontology_term_id', 'assay_ontology_term_id', 'organism_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'is_primary_data'
var: 'feature_is_filtered'
uns: 'default_embedding', 'title', 'cxg_lamin_schema_reference', 'cxg_lamin_schema_version'
obsm: 'X_umap'
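Comparing this obs layout with the curated object shows that name columns such as tissue have been replaced by `*_ontology_term_id` columns. The sketch below illustrates that kind of conversion for a single column with pandas. The lookup table `TISSUE_NAME_TO_ID` is a hand-written assumption covering three UBERON terms; the real conversion draws on the ontology registries and is not reproduced here.

```python
import pandas as pd

# Hypothetical, hand-written lookup from tissue name to UBERON term ID.
TISSUE_NAME_TO_ID = {
    "blood": "UBERON:0000178",
    "lung": "UBERON:0002048",
    "spleen": "UBERON:0002106",
}

# Toy obs table with a name-based tissue column.
obs = pd.DataFrame(
    {"tissue": ["blood", "lung", "spleen", "lung"]},
    index=["cell_1", "cell_2", "cell_3", "cell_4"],
)

# Add the ID column, then drop the name column it replaces.
obs["tissue_ontology_term_id"] = obs["tissue"].map(TISSUE_NAME_TO_ID)
obs = obs.drop(columns=["tissue"])

print(obs)
```

A name without an entry in the lookup would map to a missing value, which is one reason to validate all labels before converting.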
!MPLBACKEND=agg uvx cellxgene-schema validate anndata_human_immune_cells_cxg.h5ad || exit 1
Loading dependencies
Loading validator modules
Starting validation...
Unable to open 'anndata_human_immune_cells_cxg.h5ad' with AnnData
The Curator class is designed to validate all metadata for adherence to ontologies. It does not reimplement all rules of the CELLxGENE schema, so we recommend running cellxgene-schema if full adherence beyond metadata is required.