RxRx: cell imaging
rxrx.ai hosts high-throughput cell imaging datasets generated by Recursion.
Large sets of fluorescence microscopy images characterize cellular phenotypes in vitro, based on morphology and protein expression (5-10 stains), across a range of conditions.
- In this guide, you’ll see how to query some of these data using LaminDB. 
- If you’d like to transfer data into your own LaminDB instance, see the transfer guide. 
# !pip install 'lamindb[bionty,jupyter,gcp]' wetlab
!lamin connect laminlabs/lamindata
→ connected lamindb: laminlabs/lamindata
import lamindb as ln
import bionty as bt
import wetlab as wl
→ connected lamindb: laminlabs/lamindata
Search & look up metadata
We’ll find all genetic treatments in the GeneticPerturbation registry:
df = wl.GeneticPerturbation.df()
df.shape
(100, 13)
Let us create a lookup object for siRNAs so that we can easily auto-complete queries involving them:
sirnas = wl.GeneticPerturbation.filter(system="siRNA").lookup(return_field="name")
We’re also interested in cell lines & wells:
cell_lines = bt.CellLine.lookup(return_field="abbr")
wells = wl.Well.lookup(return_field="name")
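To illustrate what such a lookup object offers, here is a minimal sketch in plain Python. It is not LaminDB's implementation: `make_lookup` is a hypothetical helper, and the registry names and returned values are assumed toy data. Each record name becomes an identifier-safe attribute, and accessing the attribute returns the value of the chosen field.

```python
import re
from types import SimpleNamespace


def make_lookup(name_to_value):
    """Expose each record name as a lowercase, identifier-safe attribute.

    Hypothetical helper for illustration only, not part of LaminDB.
    """
    attrs = {}
    for name, value in name_to_value.items():
        # Replace runs of non-word characters with "_" and lowercase the result
        key = re.sub(r"\W+", "_", name).strip("_").lower()
        attrs[key] = value
    return SimpleNamespace(**attrs)


# Assumed toy records: name -> value of the returned field
cell_lines_demo = make_lookup({"Hep G2 cell": "HEPG2"})
sirnas_demo = make_lookup({"s15652": "s15652"})
wells_demo = make_lookup({"M15": "M15"})

print(cell_lines_demo.hep_g2_cell, sirnas_demo.s15652, wells_demo.m15)
```

Because the attributes are ordinary Python names, an editor or notebook can auto-complete them, which is the convenience the lookup objects above provide.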
Load the collection
This is RxRx1: 125k images for 1138 siRNA perturbations across 4 cell lines, reading out 5 stains, with image dimensions of 512x512x6.
Let us get the corresponding object and some information about it:
collection = ln.Collection.get("Br2Z1lVSQBAkkbbt7ILu")
collection.view_lineage()
collection.describe()
Collection
└── General
    ├── .uid = 'Br2Z1lVSQBAkkbbt7ILu'
    ├── .key = 'Annotated RxRx1 images'
    ├── .hash = 'dycM8ypgnRRF9zXLSeD_'
    ├── .version = '1'
    ├── .created_by = sunnyosun (Sunny Sun)
    ├── .created_at = 2024-06-17 12:43:02
    └── .transform = 'Ingest the RxRx1 dataset'
The dataset consists of a metadata file and a folder path pointing to the image files:
collection.meta_artifact.load().head()
! run input wasn't tracked, call `ln.track()` and re-run
| | site_id | well_id | cell_line | split | experiment | plate | well | site | well_type | sirna | sirna_id | path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w1.png | 
| 1 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w2.png | 
| 2 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w3.png | 
| 3 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w4.png | 
| 4 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w5.png | 
Query image files
Because we chose not to register each image as a record in the Artifact registry, we have to query the images through the metadata file of the dataset:
df = collection.meta_artifact.load()
! run input wasn't tracked, call `ln.track()` and re-run
We can select a subset of images by combining values from the metadata registries with pandas boolean indexing:
query = df[
    (df.cell_line == cell_lines.hep_g2_cell)
    & (df.sirna == sirnas.s15652)
    & (df.well == wells.m15)
    & (df.plate == 1)
    & (df.site == 2)
]
query
| | site_id | well_id | cell_line | split | experiment | plate | well | site | well_type | sirna | sirna_id | path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3066 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w1.png | 
| 3067 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w2.png | 
| 3068 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w3.png | 
| 3069 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w4.png | 
| 3070 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w5.png | 
| 3071 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w6.png | 
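The same selection can also be written with `DataFrame.query`, which some readers find easier to scan than chained boolean masks. The sketch below uses a small toy frame with assumed values rather than the real metadata file, and plain strings in place of the registry lookups; it checks that both forms return the same rows.

```python
import pandas as pd

# Toy stand-in for the metadata table (assumed values, a subset of the columns)
toy = pd.DataFrame(
    {
        "cell_line": ["HEPG2", "HEPG2", "HEPG2"],
        "sirna": ["s15652", "s15652", "EMPTY"],
        "well": ["M15", "M15", "B02"],
        "plate": [1, 1, 1],
        "site": [2, 1, 1],
    }
)

# Plain strings standing in for the values returned by the lookup objects
cell_line, sirna, well = "HEPG2", "s15652", "M15"

# Boolean-mask form, as used in this guide
mask_result = toy[
    (toy.cell_line == cell_line)
    & (toy.sirna == sirna)
    & (toy.well == well)
    & (toy.plate == 1)
    & (toy.site == 2)
]

# Equivalent DataFrame.query form; "@name" refers to a local Python variable
query_result = toy.query(
    "cell_line == @cell_line and sirna == @sirna and well == @well"
    " and plate == 1 and site == 2"
)

print(mask_result.equals(query_result))
```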
To access the individual images based on this query result:
collection.data_artifact.storage.root
'gs://rxrx1-europe-west4'
images = [f"{collection.data_artifact.storage.root}/{key}" for key in query.path]
images
['gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w1.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w2.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w3.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w4.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w5.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w6.png']
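The image keys follow a regular layout that mirrors the `split`, `experiment`, `plate`, `well` and `site` columns of the metadata table. As a rough sketch, the fields can be recovered from a path with a regular expression. `parse_image_path` is a hypothetical helper, and reading the trailing `w1` to `w6` suffix as an imaging channel index is an assumption, not something the metadata table states.

```python
import re

# Layout inferred from the paths above: images/<split>/<experiment>/Plate<n>/<well>_s<site>_w<k>.png
PATH_PATTERN = re.compile(
    r"images/(?P<split>[^/]+)/(?P<experiment>[^/]+)/Plate(?P<plate>\d+)/"
    r"(?P<well>[A-Z]\d{2})_s(?P<site>\d+)_w(?P<channel>\d+)\.png$"
)


def parse_image_path(path):
    """Split an image key or URL into its metadata fields (hypothetical helper)."""
    match = PATH_PATTERN.search(path)
    if match is None:
        raise ValueError(f"unexpected image path layout: {path}")
    fields = match.groupdict()
    # Numeric parts are converted to integers; "channel" is an assumed meaning of "w"
    for key in ("plate", "site", "channel"):
        fields[key] = int(fields[key])
    return fields


info = parse_image_path(
    "gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w2.png"
)
print(info)
```

Parsing the path this way can serve as a sanity check that a file really belongs to the well and site selected in the query.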
Download an image to disk:
path = ln.UPath(images[1])
path.download_to(".")
from IPython.display import Image
Image(f"./{path.name}")
 
Use DuckDB to query metadata
As an alternative to pandas, we could use DuckDB to query image metadata.
import duckdb  # pip install duckdb
features = ln.Feature.lookup(return_field="name")
# Build the filter expression from registry lookups;
# the name filter_expr avoids shadowing Python's built-in filter()
filter_expr = (
    f"{features.cell_line} == '{cell_lines.hep_g2_cell}' and {features.sirna} =="
    f" '{sirnas.s15652}' and {features.well} == '{wells.m15}' and "
    f"{features.plate} == '1' and {features.site} == '2'"
)
region = ln.setup.settings.storage.region
parquet_data = duckdb.from_parquet(
    collection.meta_artifact.path.as_posix() + f"?s3_region={region}"
)
parquet_data.filter(filter_expr)
┌──────────────────┬────────────────┬───────────┬─────────┬────────────┬───────┬─────────┬───────┬──────────────────┬─────────┬──────────┬───────────────────────────────────────────┐
│     site_id      │    well_id     │ cell_line │  split  │ experiment │ plate │  well   │ site  │    well_type     │  sirna  │ sirna_id │                   path                    │
│     varchar      │    varchar     │  varchar  │ varchar │  varchar   │ int64 │ varchar │ int64 │     varchar      │ varchar │  int64   │                  varchar                  │
├──────────────────┼────────────────┼───────────┼─────────┼────────────┼───────┼─────────┼───────┼──────────────────┼─────────┼──────────┼───────────────────────────────────────────┤
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w1.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w2.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w3.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w4.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w5.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w6.png │
└──────────────────┴────────────────┴───────────┴─────────┴────────────┴───────┴─────────┴───────┴──────────────────┴─────────┴──────────┴───────────────────────────────────────────┘
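Hand-written filter strings become error-prone as the number of conditions grows. One possible approach, sketched here with hypothetical helpers `sql_literal` and `build_filter` that are not part of DuckDB or LaminDB, is to build the expression from a dictionary, quoting string values and leaving numbers bare. The resulting string uses standard SQL `=` and `AND` and is meant to be passed to a relation's `filter` method as in the cell above.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal (hypothetical helper)."""
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    # Escape embedded single quotes by doubling them
    return "'" + str(value).replace("'", "''") + "'"


def build_filter(conditions):
    """Join column/value pairs into one SQL boolean expression (hypothetical helper)."""
    clauses = []
    for column, value in conditions.items():
        # Double-quote column names so unusual characters are tolerated
        quoted_column = '"' + column.replace('"', '""') + '"'
        clauses.append(f"{quoted_column} = {sql_literal(value)}")
    return " AND ".join(clauses)


# Plain values standing in for the registry lookups used in this guide
filter_demo = build_filter(
    {"cell_line": "HEPG2", "sirna": "s15652", "well": "M15", "plate": 1, "site": 2}
)
print(filter_demo)
```

Keeping integers unquoted matches the `int64` type of the `plate` and `site` columns shown in the output above.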