Analysis flow
Here, we’ll track typical data transformations, such as subsetting, that occur during analysis.
# !pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./analysis-flow --schema bionty
→ initialized lamindb: testuser1/analysis-flow
import lamindb as ln
import bionty as bt
→ connected lamindb: testuser1/analysis-flow
Save an initial dataset
register_example_file.py
import lamindb as ln
import bionty as bt
ln.track("K4wsS5DTYdFp0000")
# an example dataset that has a few cell type, tissue and disease annotations
adata = ln.core.datasets.anndata_with_obs()
# validate and register features
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "cell_type_id": bt.CellType.ontology_id,
        "tissue": bt.Tissue.name,
        "disease": bt.Disease.name,
    },
    organism="human",
)
curate.add_new_from("cell_type")
curate.validate()
curate.save_artifact(description="anndata with obs")
ln.finish()
!python analysis-flow-scripts/register_example_file.py
→ connected lamindb: testuser1/analysis-flow
→ created Transform('K4wsS5DTYdFp0000'), started new Run('99QB36hR...') at 2025-01-24 14:02:42 UTC
✓ added 4 records with Feature.name for "columns": 'cell_type', 'cell_type_id', 'tissue', 'disease'
• saving validated records of 'cell_type'
✓ added 3 records from public with CellType.name for "cell_type": 'hepatocyte', 'T cell', 'hematopoietic stem cell'
✓ added 1 record with CellType.name for "cell_type": 'my new cell type'
• saving validated records of 'var_index'
✓ added 99 records from public with Gene.ensembl_gene_id for "var_index": 'ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419', 'ENSG00000000457', 'ENSG00000000460', 'ENSG00000000938', 'ENSG00000000971', 'ENSG00000001036', 'ENSG00000001084', 'ENSG00000001167', 'ENSG00000001460', 'ENSG00000001461', 'ENSG00000001497', 'ENSG00000001561', 'ENSG00000001617', 'ENSG00000001626', 'ENSG00000001629', 'ENSG00000001630', 'ENSG00000001631', 'ENSG00000002016', ...
• saving validated records of 'tissue'
✓ added 4 records from public with Tissue.name for "tissue": 'brain', 'heart', 'kidney', 'liver'
• saving validated records of 'disease'
✓ added 4 records from public with Disease.name for "disease": 'chronic kidney disease', 'liver lymphoma', 'Alzheimer disease', 'cardiac ventricle disorder'
✓ "var_index" is validated against Gene.ensembl_gene_id
✓ "cell_type" is validated against CellType.name
✓ "cell_type_id" is validated against CellType.ontology_id
✓ "tissue" is validated against Tissue.name
✓ "disease" is validated against Disease.name
→ finished Run('99QB36hR') after 4s at 2025-01-24 14:02:46 UTC
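The log above reflects a validate-then-register pattern: values in each categorical column are checked against a controlled vocabulary, and values missing from it (here 'my new cell type') have to be added explicitly before validation passes. A minimal sketch of that pattern in plain Python follows. The in-memory `registry` dict and the helpers `non_validated` and `add_new_from` are hypothetical stand-ins for illustration, not LaminDB's API or implementation:

```python
# Toy illustration of validating categorical values against a vocabulary.
# `registry`, `non_validated`, and `add_new_from` are hypothetical names.

registry = {
    "cell_type": {"hepatocyte", "T cell", "hematopoietic stem cell"},
    "tissue": {"brain", "heart", "kidney", "liver"},
}

columns = {
    "cell_type": ["T cell", "hepatocyte", "my new cell type"],
    "tissue": ["liver", "kidney", "brain"],
}


def non_validated(columns, registry):
    """Return, per column, the sorted values missing from the registry."""
    missing = {}
    for name, values in columns.items():
        unknown = sorted(set(values) - registry.get(name, set()))
        if unknown:
            missing[name] = unknown
    return missing


def add_new_from(column, columns, registry):
    """Register every value of `column` that is not yet in the registry."""
    registry.setdefault(column, set()).update(columns[column])


before = non_validated(columns, registry)     # unknown values before adding
add_new_from("cell_type", columns, registry)  # explicitly accept new values
after = non_validated(columns, registry)      # nothing left to validate

print(before)
print(after)
```

Making the "add new values" step explicit keeps typos from silently entering the vocabulary: only columns that are deliberately opened up accept new terms.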
Open a dataset, subset it, and register the result
Track the current notebook:
ln.track("eNef4Arw8nNM0000")
→ created Transform('eNef4Arw8nNM0000'), started new Run('8TcyWE2S...') at 2025-01-24 14:02:47 UTC
→ notebook imports: bionty==1.0.0 lamindb==1.0.5
artifact = ln.Artifact.get(description="anndata with obs")
artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'b3zR63lUYgctquOK0000'
│   ├── .size = 46992
│   ├── .hash = 'IJORtcQUSS11QBqD-nTD0A'
│   ├── .n_observations = 40
│   ├── .path =
│   │   /home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-flow/.lamindb/b3zR63lUYgctquOK0000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2025-01-24 14:02:46
│   └── .transform = 'register_example_file.py'
├── Dataset features/schema
│   ├── var • 99 [bionty.Gene]
│   │   TSPAN6 float
│   │   TNMD float
│   │   DPM1 float
│   │   SCYL3 float
│   │   FIRRM float
│   │   FGR float
│   │   CFH float
│   │   FUCA2 float
│   │   GCLC float
│   │   NFYA float
│   │   STPG1 float
│   │   NIPAL3 float
│   │   LAS1L float
│   │   ENPP4 float
│   │   SEMA3F float
│   │   CFTR float
│   │   ANKIB1 float
│   │   CYP51A1 float
│   │   KRIT1 float
│   │   RAD52 float
│   └── obs • 4 [Feature]
│       cell_type cat[bionty.CellType] T cell, hematopoietic stem cell, hepatoc…
│       cell_type_id cat[bionty.CellType] T cell, hematopoietic stem cell, hepatoc…
│       disease cat[bionty.Disease] Alzheimer disease, cardiac ventricle dis…
│       tissue cat[bionty.Tissue] brain, heart, kidney, liver
└── Labels
    └── .tissues bionty.Tissue brain, heart, kidney, liver
        .cell_types bionty.CellType hepatocyte, T cell, hematopoietic stem c…
        .diseases bionty.Disease chronic kidney disease, liver lymphoma, …
Get a backed AnnData object
adata = artifact.open()
adata
AnnDataAccessor object with n_obs × n_vars = 40 × 100
  constructed for the AnnData object b3zR63lUYgctquOK0000.h5ad
    obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
    var: ['_index']
Subset dataset to specific cell types and diseases
cell_types = artifact.cell_types.all().distinct().lookup(return_field="name")
diseases = artifact.diseases.all().distinct().lookup(return_field="name")
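The `lookup()` objects expose record names as attributes, which is why `cell_types.t_cell` can be used further down in place of the string "T cell". The sketch below shows one way such an object could be built from a list of names. The normalization rule and the `make_lookup` helper are assumptions for illustration, not LaminDB's implementation:

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Map each name to a lowercase, underscore-separated attribute."""
    attrs = {}
    for name in names:
        key = re.sub(r"\W+", "_", name).strip("_").lower()
        if key and key[0].isdigit():
            key = "_" + key  # attribute names cannot start with a digit
        attrs[key] = name
    return SimpleNamespace(**attrs)


# Hypothetical inputs; the real lookups are built from registry records.
toy_cell_types = make_lookup(["hepatocyte", "T cell", "hematopoietic stem cell"])
toy_diseases = make_lookup(["chronic kidney disease", "liver lymphoma"])

print(toy_cell_types.t_cell)
print(toy_diseases.liver_lymphoma)
```

Attribute access of this kind supports tab completion and catches misspelled terms early, since an unknown attribute raises an `AttributeError` instead of silently matching nothing.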
Create the subset:
subset_obs = adata.obs.cell_type.isin(
    [cell_types.t_cell, cell_types.hematopoietic_stem_cell]
) & (adata.obs.disease.isin([diseases.liver_lymphoma, diseases.chronic_kidney_disease]))
adata_subset = adata[subset_obs]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
  obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
  var: ['_index']
adata_subset.obs[["cell_type", "disease"]].value_counts()
cell_type                disease               
T cell                   chronic kidney disease    10
hematopoietic stem cell  liver lymphoma            10
Name: count, dtype: int64
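The subset combines two membership tests with a logical AND, so an observation is kept only if it matches both a selected cell type and a selected disease. The same logic can be written in plain Python on a small hypothetical table. The rows below are made up for illustration and are not the example dataset:

```python
from collections import Counter

# Hypothetical observations as (cell_type, disease) pairs.
obs = [
    ("T cell", "chronic kidney disease"),
    ("T cell", "chronic kidney disease"),
    ("hematopoietic stem cell", "liver lymphoma"),
    ("hepatocyte", "liver lymphoma"),
    ("T cell", "Alzheimer disease"),
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# One boolean per row, True only when both conditions hold.
mask = [ct in keep_cell_types and d in keep_diseases for ct, d in obs]
subset = [row for row, keep in zip(obs, mask) if keep]

# Count each (cell_type, disease) combination, similar to value_counts().
counts = Counter(subset)
for (cell_type, disease), n in counts.most_common():
    print(cell_type, disease, n, sep=" | ")
```

The hepatocyte row fails the cell type test and the Alzheimer disease row fails the disease test, so both are dropped even though each satisfies one of the two conditions.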
Register the subsetted AnnData:
curate = ln.Curator.from_anndata(
    adata_subset.to_memory(),
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "disease": bt.Disease.name,
        "tissue": bt.Tissue.name,
    },
    organism="human",
)
curate.validate()
✓ "var_index" is validated against Gene.ensembl_gene_id
✓ "cell_type" is validated against CellType.name
✓ "disease" is validated against Disease.name
✓ "tissue" is validated against Tissue.name
/opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
True
artifact = curate.save_artifact(description="anndata with obs subset")
artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'Ti0zq3KRRASJnRfV0000'
│   ├── .size = 38992
│   ├── .hash = 'RgGUx7ndRplZZSmalTAWiw'
│   ├── .n_observations = 20
│   ├── .path =
│   │   /home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-flow/.lamindb/Ti0zq3KRRASJnRfV0000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2025-01-24 14:02:48
│   └── .transform = 'Analysis flow'
├── Dataset features/schema
│   ├── var • 99 [bionty.Gene]
│   │   TSPAN6 float
│   │   TNMD float
│   │   DPM1 float
│   │   SCYL3 float
│   │   FIRRM float
│   │   FGR float
│   │   CFH float
│   │   FUCA2 float
│   │   GCLC float
│   │   NFYA float
│   │   STPG1 float
│   │   NIPAL3 float
│   │   LAS1L float
│   │   ENPP4 float
│   │   SEMA3F float
│   │   CFTR float
│   │   ANKIB1 float
│   │   CYP51A1 float
│   │   KRIT1 float
│   │   RAD52 float
│   └── obs • 4 [Feature]
│       cell_type cat[bionty.CellType] T cell, hematopoietic stem cell
│       disease cat[bionty.Disease] chronic kidney disease, liver lymphoma
│       tissue cat[bionty.Tissue] kidney, liver
│       cell_type_id cat[bionty.CellType]
└── Labels
    └── .tissues bionty.Tissue kidney, liver
        .cell_types bionty.CellType T cell, hematopoietic stem cell
        .diseases bionty.Disease chronic kidney disease, liver lymphoma
Examine data lineage
Query for a subsetted .h5ad artifact annotated with the cell types “hematopoietic stem cell” or “T cell”:
cell_types = bt.CellType.lookup()
my_subset = ln.Artifact.filter(
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=[
        cell_types.hematopoietic_stem_cell,
        cell_types.t_cell,
    ],
).first()
my_subset
Artifact(uid='Ti0zq3KRRASJnRfV0000', is_latest=True, description='anndata with obs subset', suffix='.h5ad', kind='dataset', otype='AnnData', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', n_observations=20, space_id=1, storage_id=1, run_id=2, created_by_id=1, created_at=2025-01-24 14:02:48 UTC)
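The filter uses Django-style field lookups: the part after the double underscore names the comparison (`endswith`, `in`), and the keyword arguments are combined with AND. Below is a minimal sketch of how such keyword filters can be interpreted over plain dictionaries. The `matches` and `filter_records` helpers and the records are hypothetical, and for simplicity `in` on a list-valued field is treated as "any overlap":

```python
def matches(record, **lookups):
    """Return True if `record` satisfies every keyword lookup."""
    for key, expected in lookups.items():
        field, _, op = key.partition("__")
        value = record[field]
        if op == "":
            ok = value == expected
        elif op == "endswith":
            ok = value.endswith(expected)
        elif op == "in":
            # For list-valued fields, any shared element counts as a match.
            values = value if isinstance(value, list) else [value]
            ok = any(v in expected for v in values)
        else:
            raise ValueError(f"unsupported lookup: {op}")
        if not ok:
            return False
    return True


def filter_records(records, **lookups):
    """Keep the records that satisfy all lookups (logical AND)."""
    return [r for r in records if matches(r, **lookups)]


# Hypothetical records standing in for the two saved artifacts.
records = [
    {"description": "anndata with obs", "suffix": ".h5ad",
     "cell_types": ["hepatocyte", "T cell", "hematopoietic stem cell"]},
    {"description": "anndata with obs subset", "suffix": ".h5ad",
     "cell_types": ["T cell", "hematopoietic stem cell"]},
]

hits = filter_records(
    records,
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=["hematopoietic stem cell", "T cell"],
)
print([r["description"] for r in hits])
```

In this toy data both records pass the suffix and cell type conditions, so the `description__endswith` condition is what distinguishes the subset from the original dataset.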
Common questions that might arise are:
- What is the history of this artifact? 
- Which features and labels are associated with it? 
- Which notebook analyzed and registered this artifact? 
- By whom? 
- And which artifact is its parent? 
Let’s answer these questions using LaminDB:
print("--> What is the lineage of this artifact?\n")
artifact.view_lineage()
print("\n\n--> Which features and labels are associated with it?\n")
print(artifact.features)
print(artifact.labels)
print("\n\n--> Which notebook analyzed and saved this artifact?\n")
print(artifact.transform)
print("\n\n--> Who saved this artifact?\n")
print(artifact.created_by)
print("\n\n--> Which artifacts were inputs?\n")
display(artifact.run.input_artifacts.df())
--> What is the lineage of this artifact?
--> Which features and labels are associated with it?
Artifact .h5ad/AnnData
└── Dataset features/schema
    ├── var • 99 [bionty.Gene]
    │   TSPAN6 float
    │   TNMD float
    │   DPM1 float
    │   SCYL3 float
    │   FIRRM float
    │   FGR float
    │   CFH float
    │   FUCA2 float
    │   GCLC float
    │   NFYA float
    │   STPG1 float
    │   NIPAL3 float
    │   LAS1L float
    │   ENPP4 float
    │   SEMA3F float
    │   CFTR float
    │   ANKIB1 float
    │   CYP51A1 float
    │   KRIT1 float
    │   RAD52 float
    └── obs • 4 [Feature]
        cell_type cat[bionty.CellType] T cell, hematopoietic stem cell
        disease cat[bionty.Disease] chronic kidney disease, liver lymphoma
        tissue cat[bionty.Tissue] kidney, liver
        cell_type_id cat[bionty.CellType]
Artifact .h5ad/AnnData
└── Labels
    └── .tissues bionty.Tissue kidney, liver
        .cell_types bionty.CellType T cell, hematopoietic stem cell
        .diseases bionty.Disease chronic kidney disease, liver lymphoma
--> Which notebook analyzed and saved this artifact?
Transform(uid='eNef4Arw8nNM0000', is_latest=True, key='analysis-flow.ipynb', description='Analysis flow', type='notebook', space_id=1, created_by_id=1, created_at=2025-01-24 14:02:47 UTC)
--> Who saved this artifact?
User object (1)
--> Which artifacts were inputs?
| id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | b3zR63lUYgctquOK0000 | None | anndata with obs | .h5ad | dataset | AnnData | 46992 | IJORtcQUSS11QBqD-nTD0A | None | 40 | md5 | True | False | 1 | 1 | None | None | True | 1 | 2025-01-24 14:02:46.312000+00:00 | 1 | None | 1 |
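The lineage questions above reduce to following links between an artifact, the run that created it, and the inputs of that run. The toy model below sketches that structure. The `Run` and `Artifact` dataclasses and the `parents` and `lineage` helpers are hypothetical, illustrative only, and unrelated to LaminDB's actual classes:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Run:
    transform: str   # name of the script or notebook that was executed
    created_by: str  # user who started the run
    input_artifacts: List["Artifact"] = field(default_factory=list)


@dataclass
class Artifact:
    description: str
    run: Optional[Run] = None  # the run that created this artifact


def parents(artifact):
    """Artifacts that were inputs to the run that created `artifact`."""
    return artifact.run.input_artifacts if artifact.run else []


def lineage(artifact):
    """Walk back through parents; return descriptions, newest first."""
    chain = [artifact.description]
    for parent in parents(artifact):
        chain.extend(lineage(parent))
    return chain


source = Artifact(
    "anndata with obs",
    run=Run("register_example_file.py", "testuser1"),
)
subset = Artifact(
    "anndata with obs subset",
    run=Run("Analysis flow", "testuser1", [source]),
)

print(lineage(subset))
print(subset.run.transform, subset.run.created_by)
```

Because each artifact points to exactly one creating run, and each run lists its inputs, the full history can be recovered by recursion without storing it separately.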
!rm -r ./analysis-flow
!lamin delete --force analysis-flow
• deleting instance testuser1/analysis-flow