lamindb.Artifact¶

Bases: Record, IsVersioned, TracksRun, TracksUpdates

Datasets & models stored as files, folders, or arrays.

Artifacts manage data in local or remote storage.

Some artifacts are array-like, e.g., when stored as .parquet, .h5ad, .zarr, or .tiledb.

Parameters:

data – UPathStr A path to a local or remote folder or file.
type – Literal["dataset", "model"] | None = None The artifact type.
key – str | None = None A path-like key to reference artifact in default storage, e.g., "myfolder/myfile.fcs". Artifacts with the same key form a revision family.
description – str | None = None A description.
revises – Artifact | None = None Previous version of the artifact. Triggers a revision.
run – Run | None = None The run that creates the artifact.

See also

Storage: Storage locations for artifacts.
Collection: Collections of artifacts.
from_df(): Create an artifact from a DataFrame.
from_anndata(): Create an artifact from an AnnData.

Examples

Create an artifact from a file path and pass description:

>>> artifact = ln.Artifact("s3://my_bucket/my_folder/my_file.csv", description="My file")
>>> artifact = ln.Artifact("./my_local_file.jpg", description="My image")

You can also pass key to create a virtual filepath hierarchy:

>>> artifact = ln.Artifact("./my_local_file.jpg", key="example_datasets/dataset1.jpg")

What works for files also works for folders:

>>> artifact = ln.Artifact("s3://my_bucket/my_folder", description="My folder")
>>> artifact = ln.Artifact("./my_local_folder", description="My local folder")
>>> artifact = ln.Artifact("./my_local_folder", key="project1/my_target_folder")

Make a new version of an artifact:

>>> artifact = ln.Artifact.from_df(df, key="example_datasets/dataset1.parquet").save()
>>> artifact_v2 = ln.Artifact.from_df(df_updated, key="example_datasets/dataset1.parquet").save()

Alternatively, if you don’t want to provide a value for key, you can use revises:

>>> artifact = ln.Artifact.from_df(df, description="My dataframe").save()
>>> artifact_v2 = ln.Artifact.from_df(df_updated, revises=artifact).save()

Attributes¶

property feature_sets: QuerySet¶: Feature sets linked to this artifact.

features: FeatureManager¶

Feature manager.

Features denote dataset dimensions, i.e., the variables that measure labels & numbers.

Annotate with features & values:

artifact.features.add_values({
     "species": organism,  # here, organism is an Organism record
     "scientist": ['Barbara McClintock', 'Edgar Anderson'],
     "temperature": 27.6,
     "study": "Candidate marker study"
})

Query for features & values:

ln.Artifact.features.filter(scientist="Barbara McClintock")

Features may or may not be part of the artifact content in storage. For instance, the Curator flow validates the columns of a DataFrame-like artifact and annotates it with features corresponding to these columns. artifact.features.add_values, by contrast, does not validate the content of the artifact.

property labels: LabelManager¶

Label manager.

To annotate with labels, you typically use the registry-specific accessors, for instance ulabels:

candidate_marker_study = ln.ULabel(name="Candidate marker study").save()
artifact.ulabels.add(candidate_marker_study)

Similarly, you query based on these accessors:

ln.Artifact.filter(ulabels__name="Candidate marker study").all()

Unlike the registry-specific accessors, the .labels accessor provides a way of associating labels with features:

study = ln.Feature(name="study", dtype="cat").save()
artifact.labels.add(candidate_marker_study, feature=study)

Note that the above is equivalent to:

artifact.features.add_values({"study": candidate_marker_study})

property n_objects: int¶

params: ParamManager¶

Param manager.

Example:

artifact.params.add_values({
    "hidden_size": 32,
    "bottleneck_size": 16,
    "batch_size": 32,
    "preprocess_params": {
        "normalization_type": "cool",
        "subset_highlyvariable": True,
    },
})

property path: Path | UPath¶

Path.

File in cloud storage, here AWS S3:

>>> artifact = ln.Artifact("s3://my-bucket/my-file.csv").save()
>>> artifact.path
S3Path('s3://my-bucket/my-file.csv')

File in local storage:

>>> ln.Artifact("./myfile.csv", key="myfile").save()
>>> artifact = ln.Artifact.get(key="myfile")
>>> artifact.path
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/myfile.csv')

property stem_uid: str¶

Universal id characterizing the version family.

The full uid of a record is obtained via concatenating the stem uid and version information:

stem_uid = random_base62(n_char)  # a random base62 sequence of length 12 (transform) or 16 (artifact, collection)
version_uid = "0000"  # an auto-incrementing 4-digit base62 number
uid = f"{stem_uid}{version_uid}"  # concatenate the stem_uid & version_uid
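
The pseudocode above can be turned into a small runnable sketch. This is an illustration under stated assumptions, not LaminDB's implementation: the body of random_base62, the helper make_uid, and the alphabet ordering (digits, then lowercase, then uppercase) are all hypothetical.

```python
import secrets
import string

# Assumption: the ordering of the 62 characters is illustrative only.
BASE62_ALPHABET = string.digits + string.ascii_letters


def random_base62(n_char: int) -> str:
    """Return a random base62 string of length n_char (hypothetical helper)."""
    return "".join(secrets.choice(BASE62_ALPHABET) for _ in range(n_char))


def make_uid(stem_uid: str, version_index: int = 0) -> str:
    """Append a 4-character base62 version suffix to a stem uid."""
    digits = []
    n = version_index
    for _ in range(4):
        n, remainder = divmod(n, 62)
        digits.append(BASE62_ALPHABET[remainder])
    if n:
        raise ValueError("version index does not fit into 4 base62 digits")
    return stem_uid + "".join(reversed(digits))


stem_uid = random_base62(16)    # 16 characters for an artifact or collection
uid_v1 = make_uid(stem_uid, 0)  # ends in "0000"
uid_v2 = make_uid(stem_uid, 1)  # same stem, next version suffix
```

Both uids share the first 16 characters, which is what identifies them as members of one version family.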

property transform: Transform | None¶: Transform whose run created the artifact.

property type: str¶

property versions: QuerySet¶

Lists all records of the same version family.

>>> new_artifact = ln.Artifact.from_df(df2, revises=artifact).save()
>>> new_artifact.versions.df()

Simple fields¶

uid: str¶: A universal random id.

key: str | None¶

A (virtual) relative file path within the artifact’s storage location.

Setting a key is useful to automatically group artifacts into a version family.

LaminDB defaults to a virtual file path to make renaming of data in object storage easy.

If you register existing files in a storage location, the key equals the actual filepath on the underlying filesystem or object store.

description: str | None¶: A description.

suffix: str¶

Path suffix or empty string if no canonical suffix exists.

This is either a file suffix (".csv", ".h5ad", etc.) or the empty string "".

kind: Literal['dataset', 'model'] | None¶: ArtifactKind (default None).

otype: str | None¶: Default Python object type, e.g., DataFrame, AnnData.

size: int | None¶

Size in bytes.

Examples: 1KB is 1e3 bytes, 1MB is 1e6, 1GB is 1e9, 1TB is 1e12 etc.
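
The decimal units above can be applied to a stored size with a small helper. This is a minimal sketch; format_size is a hypothetical name and is not part of lamindb.

```python
def format_size(n_bytes: int) -> str:
    """Format a byte count with decimal units, where 1 KB is 1e3 bytes."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(n_bytes)
    for unit in units[:-1]:
        if value < 1000:
            return f"{value:g} {unit}"
        value /= 1000  # move on to the next larger unit
    return f"{value:g} {units[-1]}"


small = format_size(512)
medium = format_size(1_500_000)
large = format_size(10**12)
```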

hash: str | None¶

Hash or pseudo-hash of artifact content.

Useful to ascertain integrity and avoid duplication.
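
To illustrate why a content hash helps with integrity checks and deduplication, here is a minimal sketch that hashes files in chunks. The choice of SHA-256 and the helper file_digest are assumptions for illustration; this page does not state which hash or pseudo-hash LaminDB computes.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash file content in chunks (SHA-256 is an assumption, see lead-in)."""
    hasher = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.csv"
    b = Path(tmp) / "b.csv"
    c = Path(tmp) / "c.csv"
    a.write_text("x,y\n1,2\n")
    b.write_text("x,y\n1,2\n")  # same content as a, different file name
    c.write_text("x,y\n1,3\n")  # different content
    digest_a, digest_b, digest_c = (file_digest(p) for p in (a, b, c))

is_duplicate = digest_a == digest_b  # identical content yields identical hashes
```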

n_files: int | None¶

Number of files for folder-like artifacts, None for file-like artifacts.

Note that some arrays are also stored as folders, e.g., .zarr or .tiledbsoma.

Changed in version 1.0: Renamed from n_objects to n_files.

n_observations: int | None¶

Number of observations.

Typically, this denotes the first array dimension.

version: str | None¶

Version (default None).

Defines version of a family of records characterized by the same stem_uid.

Consider using semantic versioning with Python versioning.

is_latest: bool¶: Boolean flag that indicates whether a record is the latest in its version family.
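
The interplay of key, version family, and is_latest can be pictured with a toy in-memory registry. This is a sketch under stated assumptions, not LaminDB's implementation: ToyRecord, ToyRegistry, and the integer version_index are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToyRecord:
    key: str
    version_index: int
    is_latest: bool = True


class ToyRegistry:
    """Hypothetical registry: records with the same key form one family."""

    def __init__(self) -> None:
        self.records: list[ToyRecord] = []

    def versions(self, key: str) -> list[ToyRecord]:
        return [r for r in self.records if r.key == key]

    def save(self, key: str) -> ToyRecord:
        family = self.versions(key)
        for record in family:
            record.is_latest = False  # only the newest record stays flagged
        new_record = ToyRecord(key=key, version_index=len(family))
        self.records.append(new_record)
        return new_record


registry = ToyRegistry()
v1 = registry.save("example_datasets/dataset1.parquet")
v2 = registry.save("example_datasets/dataset1.parquet")
other = registry.save("example_datasets/dataset2.parquet")
```

After the second save, v1.is_latest is False and v2.is_latest is True, while the record with a different key starts its own family.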

created_at: datetime¶: Time of creation of record.

updated_at: datetime¶: Time of last update to record.

Relational fields¶

space: Space¶: The space in which the record lives.

storage: Storage¶: Storage location, e.g. an S3 or GCP bucket or a local directory.

run: Run | None¶: Run that created the artifact.

schema: Schema | None¶: The schema of the artifact (to be populated in lamindb 1.1).

created_by: User¶: Creator of record.

ulabels: ULabel¶: The ulabels measured in the artifact (ULabel).

input_of_runs: Run¶: Runs that use this artifact as an input.

collections: Collection¶: The collections that this artifact is part of.

projects: Project¶: Associated projects.

references: Reference¶: Associated references.

Class methods¶

classmethod df(include=None, features=False, limit=100)¶

Convert to pd.DataFrame.

By default, shows all direct fields, except updated_at.

Use arguments include or features to include other data.

Parameters:

include (str | list[str] | None, default: None) – Related fields to include as columns. Takes strings of form "ulabels__name", "cell_types__name", etc. or a list of such strings.
features (bool | list[str], default: False) – If True, map all features of the Feature registry onto the resulting DataFrame. Only available for Artifact.
limit (int, default: 100) – Maximum number of rows to display from a Pandas DataFrame. Defaults to 100 to reduce database load.

Return type:

DataFrame

Examples

Include the name of the creator in the DataFrame:

>>> ln.ULabel.df(include="created_by__name")

Include display of features for Artifact:

>>> df = ln.Artifact.df(features=True)
>>> ln.view(df)  # visualize with type annotations

Only include select features:

>>> df = ln.Artifact.df(features=["cell_type_by_expert", "cell_type_by_model"])

classmethod filter(*queries, **expressions)¶

Query records.

Parameters:

queries – One or multiple Q objects.
expressions – Fields and values passed as Django query expressions.

Return type:

QuerySet

Returns:

A QuerySet.

See also

Guide: Query & search registries
Django documentation: Queries

Examples

>>> ln.ULabel(name="my label").save()
>>> ln.ULabel.filter(name__startswith="my").df()
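
To make the shape of a query expression such as name__startswith="my" concrete, the following toy matcher interprets field__lookup keywords against plain dictionaries. It is a simplified sketch, not Django's or LaminDB's query engine: matches is a hypothetical helper, it supports only three lookups, and it ignores relation traversal such as ulabels__name.

```python
def matches(record: dict, **expressions) -> bool:
    """Return True if the record satisfies every field__lookup expression."""
    for expression, expected in expressions.items():
        # Split "name__startswith" into the field and the lookup type.
        field, _, lookup = expression.partition("__")
        value = record[field]
        if lookup in ("", "exact"):
            ok = value == expected
        elif lookup == "startswith":
            ok = value.startswith(expected)
        elif lookup == "contains":
            ok = expected in value
        else:
            raise ValueError(f"unsupported lookup: {lookup}")
        if not ok:
            return False
    return True


records = [{"name": "my label"}, {"name": "other label"}]
selected = [r for r in records if matches(r, name__startswith="my")]
```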

classmethod from_anndata(adata, key=None, description=None, run=None, revises=None, **kwargs)¶

Create from AnnData, validate & link features.

Parameters:

adata (AnnData | UPathStr) – An AnnData object or a path to an AnnData-like file.
key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.h5ad".
description (str | None, default: None) – A description.
revises (Artifact | None, default: None) – An old version of the artifact.
run (Run | None, default: None) – The run that creates the artifact.

Return type:

Artifact

See also

Collection(): Track collections.
Feature: Track features.

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> adata = ln.core.datasets.anndata_with_obs()
>>> artifact = ln.Artifact.from_anndata(adata, description="mini anndata with obs")
>>> artifact.save()

classmethod from_df(df, key=None, description=None, run=None, revises=None, **kwargs)¶

Create from DataFrame, validate & link features.

Parameters:

df (DataFrame) – A DataFrame object.
key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.parquet".
description (str | None, default: None) – A description.
revises (Artifact | None, default: None) – An old version of the artifact.
run (Run | None, default: None) – The run that creates the artifact.

Return type:

Artifact

See also

Collection(): Track collections.
Feature: Track features.

Examples

>>> df = ln.core.datasets.df_iris_in_meter_batch1()
>>> df.head()
  sepal_length sepal_width petal_length petal_width iris_organism_code
0        0.051       0.035        0.014       0.002                 0
1        0.049       0.030        0.014       0.002                 0
2        0.047       0.032        0.013       0.002                 0
3        0.046       0.031        0.015       0.002                 0
4        0.050       0.036        0.014       0.002                 0
>>> artifact = ln.Artifact.from_df(df, description="Iris flower collection batch1")
>>> artifact.save()

classmethod from_dir(path, key=None, *, run=None)¶

Create a list of artifact objects from a directory.

Hint

If you have a high number of files (several 100k) and don’t want to track them individually, create a single Artifact via Artifact(path) for them. See, e.g., RxRx: cell imaging.

Parameters:

path (lamindb.core.types.UPathStr) – Source path of folder.
key (str | None, default: None) – Key for storage destination. If None and directory is in a registered location, the inferred key will reflect the relative position. If None and directory is outside of a registered storage location, the inferred key defaults to path.name.
run (Run | None, default: None) – A Run object.

Return type:

list[Artifact]

Examples

>>> dir_path = ln.core.datasets.generate_cell_ranger_files("sample_001", ln.settings.storage)
>>> artifacts = ln.Artifact.from_dir(dir_path)
>>> ln.save(artifacts)
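
The key inference described for the key parameter can be sketched with pathlib. This is an illustration of the documented rule only: infer_key and storage_root are hypothetical names, and the real logic may handle additional cases.

```python
from pathlib import PurePosixPath
from typing import Optional


def infer_key(path: str, storage_root: str, key: Optional[str] = None) -> str:
    """Sketch of the documented rule for inferring a key from a directory."""
    if key is not None:
        return key  # an explicit key always wins
    directory = PurePosixPath(path)
    try:
        # Inside the registered location: key reflects the relative position.
        return directory.relative_to(PurePosixPath(storage_root)).as_posix()
    except ValueError:
        # Outside the registered location: fall back to the directory name.
        return directory.name


inside = infer_key("/data/storage/project1/sample_001", "/data/storage")
outside = infer_key("/tmp/sample_001", "/data/storage")
explicit = infer_key("/tmp/sample_001", "/data/storage", key="runs/sample_001")
```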

classmethod from_mudata(mdata, key=None, description=None, run=None, revises=None, **kwargs)¶

Create from MuData, validate & link features.

Parameters:

mdata (MuData) – A MuData object.
key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.h5mu".
description (str | None, default: None) – A description.
revises (Artifact | None, default: None) – An old version of the artifact.
run (Run | None, default: None) – The run that creates the artifact.

Return type:

Artifact

See also

Collection(): Track collections.
Feature: Track features.

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> mdata = ln.core.datasets.mudata_papalexi21_subset()
>>> artifact = ln.Artifact.from_mudata(mdata, description="a mudata object")
>>> artifact.save()

classmethod get(idlike=None, **expressions)¶

Get a single record.

Parameters:

idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.
expressions – Fields and values passed as Django query expressions.

Return type:

Record

Returns:

A record.

Raises:

lamindb.core.exceptions.DoesNotExist – In case no matching record is found.

See also

Guide: Query & search registries
Django documentation: Queries

Examples

>>> ulabel = ln.ULabel.get("FvtpPJLJ")
>>> ulabel = ln.ULabel.get(name="my-label")

classmethod lookup(field=None, return_field=None)¶

Return an auto-complete object for a field.

Parameters:

field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.
return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.

Return type:

NamedTuple

Returns:

A NamedTuple of lookup information of the field values with a dictionary converter.

See also

search()

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> bt.Gene.from_source(symbol="ADGB-DT").save()
>>> lookup = bt.Gene.lookup()
>>> lookup.adgb_dt
>>> lookup_dict = lookup.dict()
>>> lookup_dict['ADGB-DT']
>>> lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
>>> lookup_by_ensembl_id.ensg00000002745
>>> lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
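
The idea of an auto-complete object with a dictionary converter can be sketched with a namedtuple subclass. This is a toy model, not the lamindb implementation: build_lookup and to_attribute_name are hypothetical helpers, records are plain dictionaries, and name collisions or Python keywords are not handled.

```python
import re
from collections import namedtuple


def to_attribute_name(value: str) -> str:
    """Normalize a value such as 'ADGB-DT' into an attribute name 'adgb_dt'."""
    return re.sub(r"\W+", "_", value).strip("_").lower()


def build_lookup(records, field):
    """Return a namedtuple whose attributes are the normalized field values."""
    by_attr = {to_attribute_name(r[field]): r for r in records}
    base = namedtuple("Lookup", list(by_attr))

    class Lookup(base):
        __slots__ = ()

        def dict(self):
            # Converter: map the original, non-normalized values to records.
            return {record[field]: record for record in self}

    return Lookup(**by_attr)


records = [{"symbol": "ADGB-DT"}, {"symbol": "TP53"}]
lookup = build_lookup(records, field="symbol")
record = lookup.adgb_dt               # attribute access, tab-completable
same_record = lookup.dict()["ADGB-DT"]  # access by the original value
```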

classmethod search(string, *, field=None, limit=20, case_sensitive=False)¶

Search.

Parameters:

string (str) – The input string to match against the field ontology values.
field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.
limit (int | None, default: 20) – Maximum number of top results to return.
case_sensitive (bool, default: False) – Whether the match is case sensitive.

Return type:

QuerySet

Returns:

A QuerySet of matching records, sorted by relevance.

See also

filter()
lookup()

Examples

>>> ulabels = ln.ULabel.from_values(["ULabel1", "ULabel2", "ULabel3"], field="name")
>>> ln.save(ulabels)
>>> ln.ULabel.search("ULabel2")

classmethod using(instance)¶

Use a non-default LaminDB instance.

Parameters: instance (str | None) – An instance identifier of the form "account_handle/instance_name".
Return type: QuerySet

Examples

>>> ln.ULabel.using("account_handle/instance_name").search("ULabel7", field="name")
            uid    score
name
ULabel7  g7Hk9b2v  100.0
ULabel5  t4Jm6s0q   75.0
ULabel6  r2Xw8p1z   75.0

Methods¶

cache(is_run_input=None)¶

Download cloud artifact to local cache.

Follows syncing logic: only caches an artifact if it’s outdated in the local cache.

Returns a path to a locally cached on-disk object (say a .jpg file).

Return type: Path

Examples

Sync file from cloud and return the local path of the cache:

>>> artifact.cache()
PosixPath('/home/runner/work/Caches/lamindb/lamindb-ci/lndb-storage/pbmc68k.h5ad')
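
The "only if outdated" rule can be illustrated with two local directories standing in for cloud storage and the cache. This is a sketch under stated assumptions: sync_to_cache is a hypothetical helper, and comparing modification times is an assumed criterion that this page does not confirm.

```python
import os
import shutil
import tempfile
from pathlib import Path


def sync_to_cache(source: Path, cache_path: Path) -> bool:
    """Copy source to cache_path if the cache is missing or older.

    Returns True if a copy happened. Comparing mtimes is an assumption.
    """
    if cache_path.exists() and cache_path.stat().st_mtime >= source.stat().st_mtime:
        return False  # cache is up to date, nothing to do
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, cache_path)  # copy2 also preserves the mtime
    return True


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "remote" / "data.csv"
    cached = Path(tmp) / "cache" / "data.csv"
    source.parent.mkdir()
    source.write_text("a,b\n1,2\n")
    os.utime(source, (1_000_000_000, 1_000_000_000))
    first = sync_to_cache(source, cached)   # cache missing, so it copies
    second = sync_to_cache(source, cached)  # cache current, so it skips
    source.write_text("a,b\n3,4\n")
    os.utime(source, (1_000_000_100, 1_000_000_100))
    third = sync_to_cache(source, cached)   # source is newer, so it copies
    cached_content = cached.read_text()
```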

delete(permanent=None, storage=None, using_key=None)¶

Trash or permanently delete.

A first call to .delete() puts an artifact into the trash (sets _branch_code to -1). A second call permanently deletes the artifact. If it is a folder artifact with multiple versions, deleting a non-latest version will not delete the underlying storage by default (if storage=True is not specified). Deleting the latest version will delete all the versions for folder artifacts.

FAQ: Storage FAQ

Parameters:

permanent (bool | None, default: None) – Permanently delete the artifact (skip trash).
storage (bool | None, default: None) – Indicate whether you want to delete the artifact in storage.

Return type:

None

Examples

For an Artifact object artifact, call:

>>> artifact = ln.Artifact.filter(key="some.csv").one()
>>> artifact.delete() # delete a single file artifact

>>> artifact = ln.Artifact.filter(key="some.tiledbsoma", is_latest=False).first()
>>> artifact.delete() # delete an old version, the data will not be deleted

>>> artifact = ln.Artifact.filter(key="some.tiledbsoma", is_latest=True).one()
>>> artifact.delete() # delete all versions, the data will be deleted or prompted for deletion
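
The two-step trash semantics described above (first call trashes, second call deletes permanently, restore() undoes trashing) can be modeled with a toy class. This is a sketch, not lamindb code: ToyArtifact is hypothetical, the default branch code of 1 is an assumption, and storage handling is omitted.

```python
class ToyArtifact:
    """Hypothetical stand-in that models only the trash state."""

    DEFAULT_BRANCH = 1  # assumption: branch code of a regular record
    TRASH_BRANCH = -1   # documented: trashed records have branch code -1

    def __init__(self, key: str) -> None:
        self.key = key
        self.branch_code = self.DEFAULT_BRANCH
        self.exists = True

    def delete(self, permanent=None) -> None:
        if permanent or self.branch_code == self.TRASH_BRANCH:
            self.exists = False  # second call, or permanent=True: gone
        else:
            self.branch_code = self.TRASH_BRANCH  # first call: move to trash

    def restore(self) -> None:
        self.branch_code = self.DEFAULT_BRANCH


artifact = ToyArtifact("some.csv")
artifact.delete()
in_trash = artifact.exists and artifact.branch_code == ToyArtifact.TRASH_BRANCH
artifact.restore()
restored = artifact.branch_code == ToyArtifact.DEFAULT_BRANCH
artifact.delete()
artifact.delete()
permanently_deleted = not artifact.exists
```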

describe(print_types=False)¶

Describe relations of record.

Examples

>>> artifact.describe()

load(is_run_input=None, **kwargs)¶

Cache and load into memory.

See all loaders.

Return type: Any

Examples

Load a DataFrame-like artifact:

>>> artifact.load().head()
sepal_length sepal_width petal_length petal_width iris_organism_code
      0.051       0.035        0.014       0.002                 0
      0.049       0.030        0.014       0.002                 0
      0.047       0.032        0.013       0.002                 0
      0.046       0.031        0.015       0.002                 0
      0.050       0.036        0.014       0.002                 0

Load an AnnData-like artifact:

>>> artifact.load()
AnnData object with n_obs × n_vars = 70 × 765

Fall back to cache() if no in-memory representation is configured:

>>> artifact.load()
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/.lamindb/jb7BY5UJoQVGMUOKiLcn.jpg')

open(mode='r', is_run_input=None)¶

Return a cloud-backed data object.

Works for AnnData (.h5ad and .zarr), generic hdf5 and zarr, tiledbsoma objects (.tiledbsoma), and pyarrow-compatible formats.

Parameters: mode (str, default: 'r') – Use "w" (write mode) only for tiledbsoma stores; otherwise, always use "r" (read-only mode).
Return type: AnnDataAccessor | BackedAccessor | SOMACollection | SOMAExperiment | PyArrowDataset

Notes

For more info, see tutorial: Slice arrays.

Examples

Read AnnData in backed mode from cloud:

>>> artifact = ln.Artifact.get(key="lndb-storage/pbmc68k.h5ad")
>>> artifact.open()
AnnDataAccessor object with n_obs × n_vars = 70 × 765
    constructed for the AnnData object pbmc68k.h5ad
    ...

replace(data, run=None, format=None)¶

Replace artifact content.

Parameters:

data (UPathStr | pd.DataFrame | AnnData | MuData) – A file path or an in-memory data object.
run (Run | None, default: None) – The run that created the artifact; it is auto-linked if ln.track() was called.

Return type:

None

Examples

Say we made a change to the content of an artifact, e.g., edited the image paradisi05_laminopathic_nuclei.jpg.

This is how we replace the old file in storage with the new file:

>>> artifact.replace("paradisi05_laminopathic_nuclei.jpg")
>>> artifact.save()

Note that this neither changes the storage key nor the filename.

However, it will update the suffix if it changes.

restore()¶

Restore from trash.

Return type: None

Examples

For any Artifact object artifact, call:

>>> artifact.restore()

save(upload=None, **kwargs)¶

Save to database & storage.

Parameters: upload (bool | None, default: None) – Trigger upload to cloud storage in instances with hybrid storage mode.
Return type: Artifact

Examples

>>> artifact = ln.Artifact("./myfile.csv", description="myfile")
>>> artifact.save()

view_lineage(with_children=True)¶

Graph of data flow.

Return type: None

Notes

For more info, see use cases: Data lineage.

Examples

>>> collection.view_lineage()
>>> artifact.view_lineage()