lamindb.Curator
- class lamindb.Curator
- Bases: BaseCurator
- Dataset curator.
- A Curator object makes it easy to save validated & annotated artifacts.
- Example:
>>> import lamindb as ln
>>> curator = ln.Curator.from_df(
...     df,
...     # define validation criteria as mappings
...     columns=ln.Feature.name,  # map column names
...     categoricals={"perturbation": ln.ULabel.name},  # map categories
... )
>>> curator.validate()  # validate the data in df
>>> artifact = curator.save_artifact(description="my RNA-seq")
>>> artifact.describe()  # see annotations
- curator.validate() maps values within df according to the mapping criteria and logs validated & problematic values.
- If you find non-validated values, you have several options:
  - new values found in the data can be registered using add_new_from()
  - non-validated values can be accessed using non_validated() and addressed manually
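The validate-then-resolve loop described above can be pictured with a small, library-free sketch. It does not use lamindb and is not how the library is implemented: the registry is a plain set, and split_validated and the sample values are hypothetical names chosen for illustration.

```python
# Illustrative only: a plain-Python stand-in for the idea behind validate().
# `registry` plays the role of a registry field such as ln.ULabel.name.

def split_validated(values, registry):
    """Partition unique values into (validated, non_validated), keeping order."""
    validated, non_validated = [], []
    for value in dict.fromkeys(values):  # de-duplicate, keep first-seen order
        (validated if value in registry else non_validated).append(value)
    return validated, non_validated


registry = {"DMSO", "IFNG"}                  # hypothetical registered labels
column = ["DMSO", "IFNG", "drug-x", "DMSO"]  # hypothetical column values

validated, non_validated = split_validated(column, registry)
is_valid = not non_validated  # analogous to validate() returning a bool

# One way to resolve: register the new values, then validate again.
registry.update(non_validated)
_, remaining = split_validated(column, registry)
```

Registering the new values corresponds loosely to add_new_from(); inspecting non_validated and fixing the data by hand is the other option listed above.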
Class methods
- classmethod from_anndata(data, var_index, categoricals=None, obs_columns=FieldAttr(Feature.name), using_key=None, verbosity='hint', organism=None, sources=None)
- Curation flow for AnnData.
- See also Curator.
- Note that if genes are removed from the AnnData object, the object should be recreated using from_anndata().
- See Curate AnnData based on the CELLxGENE schema for instructions on how to curate against a specific cellxgene schema version.
- Parameters:
- data (ad.AnnData | UPathStr) – The AnnData object or an AnnData-like path.
- var_index (FieldAttr) – The registry field for mapping the .var index.
- categoricals (dict[str, FieldAttr] | None, default: None) – A dictionary mapping .obs.columns to a registry field.
- obs_columns (FieldAttr, default: FieldAttr(Feature.name)) – The registry field for mapping the .obs.columns.
- using_key (str | None, default: None) – A reference LaminDB instance.
- verbosity (str, default: 'hint') – The verbosity level.
- organism (str | None, default: None) – The organism name.
- sources (dict[str, Record] | None, default: None) – A dictionary mapping .obs.columns to Source records.
- exclude – A dictionary mapping column names to values to exclude from validation. When specific Source instances are pinned and may lack default values (e.g., “unknown” or “na”), using the exclude parameter ensures they are not validated.
 
- Return type:
- AnnDataCurator 
- Examples
>>> import lamindb as ln
>>> import bionty as bt
>>> curator = ln.Curator.from_anndata(
...     adata,
...     var_index=bt.Gene.ensembl_gene_id,
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name
...     },
...     organism="human",
... )
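The effect of the exclude parameter described above can be sketched without lamindb. This is a conceptual illustration under stated assumptions, not the library's code: values_to_validate and the sample data are hypothetical, and accepting either a single value or a list mirrors the type shown for from_tiledbsoma.

```python
# Illustrative only: skipping placeholder values before validation.
# This is a conceptual sketch, not lamindb's implementation.

def values_to_validate(column_values, excluded):
    """Return unique values that still need validation, skipping excluded ones."""
    if isinstance(excluded, str):  # accept a single value or a list of values
        excluded = [excluded]
    skip = set(excluded)
    return [v for v in dict.fromkeys(column_values) if v not in skip]


obs = {"cell_type": ["T cell", "unknown", "B cell", "unknown"]}  # hypothetical data
exclude = {"cell_type": ["unknown", "na"]}  # column name -> values to skip

to_check = {
    column: values_to_validate(values, exclude.get(column, []))
    for column, values in obs.items()
}
```

Only "T cell" and "B cell" remain to be checked against the pinned source; the placeholder "unknown" never reaches validation.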
- classmethod from_df(df, categoricals=None, columns=FieldAttr(Feature.name), using_key=None, verbosity='hint', organism=None)
- Curation flow for a DataFrame object.
- See also Curator.
- Parameters:
- df (DataFrame) – The DataFrame object to curate.
- columns (DeferredAttribute, default: FieldAttr(Feature.name)) – The field attribute for the feature column.
- categoricals (dict[str, DeferredAttribute] | None, default: None) – A dictionary mapping column names to a registry field.
- using_key (str | None, default: None) – The reference instance containing registries to validate against.
- verbosity (str, default: 'hint') – The verbosity level.
- organism (str | None, default: None) – The organism name.
- sources – A dictionary mapping column names to Source records.
- exclude – A dictionary mapping column names to values to exclude from validation. When specific Source instances are pinned and may lack default values (e.g., “unknown” or “na”), using the exclude parameter ensures they are not validated.
 
- Return type:
- Returns:
- A curator object. 
- Examples
>>> import lamindb as ln
>>> import bionty as bt
>>> curator = ln.Curator.from_df(
...     df,
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name
...     }
... )
- classmethod from_mudata(mdata, var_index, categoricals=None, using_key=None, verbosity='hint', organism=None)
- Curation flow for a MuData object.
- See also Curator.
- Note that if genes or other measurements are removed from the MuData object, the object should be recreated using from_mudata().
- Parameters:
- mdata (MuData) – The MuData object to curate.
- var_index (dict[str, dict[str, DeferredAttribute]]) – The registry field for mapping the .var index for each modality. For example: {"modality_1": bt.Gene.ensembl_gene_id, "modality_2": bt.CellMarker.name}
- categoricals (dict[str, DeferredAttribute] | None, default: None) – A dictionary mapping .obs.columns to a registry field. Use modality keys to specify categoricals for MuData slots, for example {"rna:cell_type": bt.CellType.name}.
- using_key (str | None, default: None) – A reference LaminDB instance.
- verbosity (str, default: 'hint') – The verbosity level.
- organism (str | None, default: None) – The organism name.
- sources – A dictionary mapping .obs.columns to Source records.
- exclude – A dictionary mapping column names to values to exclude from validation. When specific Source instances are pinned and may lack default values (e.g., “unknown” or “na”), using the exclude parameter ensures they are not validated.
 
- Return type:
- Examples
>>> import lamindb as ln
>>> import bionty as bt
>>> curator = ln.Curator.from_mudata(
...     mdata,
...     var_index={
...         "rna": bt.Gene.ensembl_gene_id,
...         "adt": bt.CellMarker.name
...     },
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name
...     },
...     organism="human",
... )
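The modality-prefixed keys mentioned for categoricals (such as "rna:cell_type") can be read as a modality name and a column name joined by a colon. The sketch below shows one plausible way to split such keys; it is an assumption for illustration, not lamindb's parsing code. split_modality_key is a hypothetical helper, unprefixed keys are treated here as top-level columns, and registry fields are written as strings so the block runs without lamindb or bionty.

```python
# Illustrative only: splitting "modality:column" keys such as "rna:cell_type".
# Keys without a prefix are treated here as top-level columns (an assumption).

def split_modality_key(key):
    """Return (modality, column); modality is None when no prefix is given."""
    modality, sep, column = key.partition(":")
    return (modality, column) if sep else (None, key)


categoricals = {  # registry fields are shown as strings for simplicity
    "rna:cell_type": "bt.CellType.name",
    "donor_id": "ln.ULabel.name",
}

# Group the column -> field mappings by modality.
by_modality = {}
for key, field in categoricals.items():
    modality, column = split_modality_key(key)
    by_modality.setdefault(modality, {})[column] = field
```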
- classmethod from_spatialdata(sdata, var_index, categoricals=None, using_key=None, organism=None, sources=None, exclude=None, verbosity='hint', *, sample_metadata_key='sample')
- Curation flow for a SpatialData object.
- See also Curator.
- Note that if genes or other measurements are removed from the SpatialData object, the object should be recreated.
- In the following docstring, an accessor refers to either a .table key or the sample_metadata_key.
- Parameters:
- sdata – The SpatialData object to curate.
- var_index (dict[str, DeferredAttribute]) – A dictionary mapping table keys to the .var indices.
- categoricals (dict[str, dict[str, DeferredAttribute]] | None, default: None) – A nested dictionary mapping an accessor to dictionaries that map columns to a registry field.
- using_key (str | None, default: None) – A reference LaminDB instance.
- organism (str | None, default: None) – The organism name.
- sources (dict[str, dict[str, Record]] | None, default: None) – A dictionary mapping an accessor to dictionaries that map columns to Source records.
- exclude (dict[str, dict] | None, default: None) – A dictionary mapping an accessor to dictionaries of column names to values to exclude from validation. When specific Source instances are pinned and may lack default values (e.g., “unknown” or “na”), using the exclude parameter ensures they are not validated.
- verbosity (str, default: 'hint') – The verbosity level of the logger.
- sample_metadata_key (str, default: 'sample') – The key in .attrs that stores the sample level metadata.
 
- Examples
>>> import lamindb as ln
>>> import bionty as bt
>>> curator = ln.Curator.from_spatialdata(
...     sdata,
...     var_index={
...         "table_1": bt.Gene.ensembl_gene_id,
...     },
...     categoricals={
...         "table_1":
...             {"cell_type_ontology_id": bt.CellType.ontology_id, "donor_id": ln.ULabel.name},
...         "sample":
...             {"experimental_factor": bt.ExperimentalFactor.name},
...     },
...     organism="human",
... )
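Because an accessor is either a table key or the sample_metadata_key, a nested categoricals dictionary mixes two kinds of entries. The sketch below separates them; it is a library-free illustration under stated assumptions, not lamindb's implementation. split_accessors is a hypothetical helper and registry fields are written as strings.

```python
# Illustrative only: separating table accessors from the sample-metadata accessor.

def split_accessors(categoricals, sample_metadata_key="sample"):
    """Return (table_level, sample_level) mappings from a nested categoricals dict."""
    table_level = {k: v for k, v in categoricals.items() if k != sample_metadata_key}
    sample_level = categoricals.get(sample_metadata_key, {})
    return table_level, sample_level


categoricals = {  # registry fields are shown as strings for simplicity
    "table_1": {"cell_type_ontology_id": "bt.CellType.ontology_id"},
    "sample": {"experimental_factor": "bt.ExperimentalFactor.name"},
}
tables, sample = split_accessors(categoricals)
```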
- classmethod from_tiledbsoma(experiment_uri, var_index, categoricals=None, obs_columns=FieldAttr(Feature.name), using_key=None, organism=None, sources=None, exclude=None)
- Curation flow for tiledbsoma.
- See also Curator.
- Parameters:
- experiment_uri (lamindb.core.types.UPathStr) – A local or cloud path to a tiledbsoma.Experiment.
- var_index (dict[str, tuple[str, DeferredAttribute]]) – The registry fields for mapping the .var indices for measurements. Should be in the form {"measurement name": ("var column", field)}. These keys should be used in the flattened form ('{measurement name}__{column name in .var}') in .standardize or .add_new_from, see the output of .var_index.
- categoricals (dict[str, DeferredAttribute] | None, default: None) – A dictionary mapping categorical .obs columns to a registry field.
- obs_columns (DeferredAttribute, default: FieldAttr(Feature.name)) – The registry field for mapping the names of the .obs columns.
- organism (str | None, default: None) – The organism name.
- sources (dict[str, Record] | None, default: None) – A dictionary mapping .obs columns to Source records.
- exclude (dict[str, str | list[str]] | None, default: None) – A dictionary mapping column names to values to exclude from validation. When specific Source instances are pinned and may lack default values (e.g., “unknown” or “na”), using the exclude parameter ensures they are not validated.
 
- Return type:
- Examples
>>> import lamindb as ln
>>> import bionty as bt
>>> curator = ln.Curator.from_tiledbsoma(
...     "./my_array_store.tiledbsoma",
...     var_index={"RNA": ("var_id", bt.Gene.symbol)},
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name
...     },
...     organism="human",
... )
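The flattened key form described for var_index, '{measurement name}__{column name in .var}', can be derived mechanically from the mapping. The sketch below does this in plain Python; flatten_var_index is a hypothetical helper, not a lamindb function, and the registry field is written as a string so the block runs on its own.

```python
# Illustrative only: deriving the flattened key form
# '{measurement name}__{column name in .var}' from a var_index mapping.

def flatten_var_index(var_index):
    """Map 'measurement__column' keys to their registry field."""
    return {
        f"{measurement}__{column}": field
        for measurement, (column, field) in var_index.items()
    }


var_index = {"RNA": ("var_id", "bt.Gene.symbol")}  # field shown as a string
flat = flatten_var_index(var_index)
flat_keys = list(flat)
```

Under this reading, the key to pass to .standardize or .add_new_from for the example above would be RNA__var_id; the output of .var_index on the curator remains the authoritative source.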
Methods
- save_artifact(description=None, key=None, revises=None, run=None)
- Save the dataset as an artifact.
- Parameters:
- description (str | None, default: None) – A description of the DataFrame object.
- key (str | None, default: None) – A path-like key to reference artifact in default storage, e.g., "myfolder/myfile.fcs". Artifacts with the same key form a revision family.
- revises (Artifact | None, default: None) – Previous version of the artifact. Triggers a revision.
- run (Run | None, default: None) – The run that creates the artifact.
 
- Return type:
- Returns:
- A saved artifact record. 
 
- standardize(key)
- Replace synonyms with standardized values.
- Modifies the dataset in place.
- Parameters:
- key (str) – The name of the column to standardize.
- Return type:
- None
- Returns:
- None 
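The in-place behaviour of standardize() can be illustrated with a minimal sketch. It assumes a hand-written synonym table, whereas in lamindb the mapping would come from the registries; standardize_in_place and the sample values are hypothetical.

```python
# Illustrative only: replacing synonyms with standardized names, in place.
# The synonym table is hypothetical and stands in for registry lookups.

def standardize_in_place(values, synonyms):
    """Overwrite each synonym in `values` with its standardized name; return None."""
    for i, value in enumerate(values):
        values[i] = synonyms.get(value, value)


synonyms = {"T-cell": "T cell", "B-cell": "B cell"}  # synonym -> standardized name
cell_types = ["T-cell", "B cell", "NK cell"]

result = standardize_in_place(cell_types, synonyms)
```

As with standardize(), the function returns None and the caller's data is modified directly; values without a known synonym are left unchanged.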
 
- validate()
- Validate dataset.
- This method also registers the validated records in the current instance.
- Return type:
- bool
- Returns:
- Boolean indicating whether the dataset is validated.