Adding custom cell-types to Cell Ontology

We demonstrate here how to extend the Cell Ontology with a custom cell type for use in popV.

First, we download cl.json from the Cell Ontology.

%load_ext autoreload
%autoreload 2
# Download cl.json from the OBO page.
!mkdir -p new_ontology
!wget http://purl.obolibrary.org/obo/cl/cl.json -O new_ontology/cl.json
--2024-12-15 00:52:50--  http://purl.obolibrary.org/obo/cl/cl.json
Resolving purl.obolibrary.org (purl.obolibrary.org)... 104.18.37.59, 172.64.150.197, 2606:4700:4400::6812:253b, ...
Connecting to purl.obolibrary.org (purl.obolibrary.org)|104.18.37.59|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/obophenotype/cell-ontology/releases/latest/download/cl.json [following]
--2024-12-15 00:52:50--  https://github.com/obophenotype/cell-ontology/releases/latest/download/cl.json
Resolving github.com (github.com)... 140.82.116.4
Connecting to github.com (github.com)|140.82.116.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/obophenotype/cell-ontology/releases/download/v2024-09-26/cl.json [following]
--2024-12-15 00:52:50--  https://github.com/obophenotype/cell-ontology/releases/download/v2024-09-26/cl.json
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/36889083/3cf3f808-5aae-4f63-b0eb-0a1e4ecf1d56?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20241215%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20241215T085250Z&X-Amz-Expires=300&X-Amz-Signature=4b8f4e292ec56102df084b137b62c75049b105608348e853dc0730fab38c6239&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3Dcl.json&response-content-type=application%2Foctet-stream [following]
--2024-12-15 00:52:50--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/36889083/3cf3f808-5aae-4f63-b0eb-0a1e4ecf1d56?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20241215%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20241215T085250Z&X-Amz-Expires=300&X-Amz-Signature=4b8f4e292ec56102df084b137b62c75049b105608348e853dc0730fab38c6239&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3Dcl.json&response-content-type=application%2Foctet-stream
Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32400486 (31M) [application/octet-stream]
Saving to: ‘new_ontology/cl.json’

new_ontology/cl.jso 100%[===================>]  30.90M   109MB/s    in 0.3s    

2024-12-15 00:52:51 (109 MB/s) - ‘new_ontology/cl.json’ saved [32400486/32400486]

Edit OBO file

We first read the ontology file, build a lookup from each cell-type label to its definition, and display the entry for an existing cell type.

import json

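with open("new_ontology/cl.json") as f:
    # Keep the full document so that it can be edited and written back later.
    cell_ontology = json.load(f)
popv_dict = {}
# Only named CLASS nodes describe cell types.
popv_dict["nodes"] = [
    entry
    for entry in cell_ontology["graphs"][0]["nodes"]
    if entry.get("type") == "CLASS" and entry.get("lbl", False)
]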
popv_dict["lbl_sentence"] = {
    entry["lbl"]: f"{entry['lbl']}: {entry.get('meta', {}).get('definition', {}).get('val', '')}"
    for entry in popv_dict["nodes"]
}
popv_dict["lbl_sentence"]["T cell"]
'T cell: A type of lymphocyte whose defining characteristic is the expression of a T cell receptor complex.'

Our custom cell type does not exist yet.

popv_dict["lbl_sentence"].get("specialized T cell", "No definition found")
'No definition found'

We look at an arbitrary node and an arbitrary edge to see how cell types are described, then append a new node and an is_a edge with the same structure.

cell_ontology["graphs"][0]["nodes"][1000]
{'id': 'http://purl.obolibrary.org/obo/CL_0000871',
 'lbl': 'splenic macrophage',
 'type': 'CLASS',
 'meta': {'definition': {'val': 'A secondary lymphoid organ macrophage found in the spleen.',
   'xrefs': ['GO_REF:0000031', 'PMID:15771589', 'PMID:16322748']},
  'comments': ['Role or process: immune, clearance of apoptotic and senescent cells.'],
  'xrefs': [{'val': 'FMA:83026'}]}}
cell_ontology["graphs"][0]["nodes"].append(
    {
        "id": "CL:0200000",
        "lbl": "specialized T cell",
        "type": "CLASS",
        "meta": {"definition": {"val": "A T cell that has a specific function in the immune system."}},
    }
)  # All other fields are not used in popV.
cell_ontology["graphs"][0]["edges"][1000]
{'sub': 'http://purl.obolibrary.org/obo/CL_0000510',
 'pred': 'is_a',
 'obj': 'http://purl.obolibrary.org/obo/CL_0002563'}
cell_ontology["graphs"][0]["edges"].append(
    {
        "sub": "CL:0200000",  # new specialized T cell
        "pred": "is_a",
        "obj": "http://purl.obolibrary.org/obo/CL_0000084",  # T cell
    }
)
cell_ontology["graphs"][0]["edges"][-1]
{'sub': 'CL:0200000',
 'pred': 'is_a',
 'obj': 'http://purl.obolibrary.org/obo/CL_0000084'}
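The manual appends above can be wrapped in a small helper that refuses to add a term whose parent is missing or whose ID or label is already taken. This is a minimal sketch, not part of popV: `add_cell_type` and `toy_graph` are hypothetical names, and the toy graph only stands in for `cell_ontology["graphs"][0]`.

```python
T_CELL = "http://purl.obolibrary.org/obo/CL_0000084"


def add_cell_type(graph, term_id, label, definition, parent_id):
    """Append a CLASS node and an is_a edge to an ontology graph dict, in place."""
    known_ids = {node["id"] for node in graph["nodes"]}
    known_labels = {node.get("lbl") for node in graph["nodes"]}
    if parent_id not in known_ids:
        raise ValueError(f"Parent term {parent_id!r} is not in the ontology")
    if term_id in known_ids or label in known_labels:
        raise ValueError(f"Term {term_id!r} or label {label!r} already exists")
    graph["nodes"].append(
        {
            "id": term_id,
            "lbl": label,
            "type": "CLASS",
            "meta": {"definition": {"val": definition}},
        }
    )
    graph["edges"].append({"sub": term_id, "pred": "is_a", "obj": parent_id})


# Toy stand-in for cell_ontology["graphs"][0] (assumption: same node/edge layout).
toy_graph = {
    "nodes": [{"id": T_CELL, "lbl": "T cell", "type": "CLASS"}],
    "edges": [],
}
add_cell_type(
    toy_graph,
    "CL:0200000",
    "specialized T cell",
    "A T cell that has a specific function in the immune system.",
    T_CELL,
)
print(toy_graph["edges"][-1])
```

Checking the parent ID up front catches typos that would otherwise leave the new term disconnected from the rest of the hierarchy.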
with open("new_ontology/cl_modified.json", "w") as f:
    json.dump(cell_ontology, f)
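Before building the popV resources, it can be worth confirming that the new term is actually connected to the hierarchy. The following sketch walks `is_a` edges upward from a term and returns the labels of all its ancestors. The function name `ancestors` and the three-term toy graph are illustrative assumptions; on the real data you would pass `cell_ontology["graphs"][0]` instead.

```python
def ancestors(graph, term_id):
    """Return the labels of all terms reachable from term_id via is_a edges."""
    parents = {}
    for edge in graph["edges"]:
        if edge["pred"] == "is_a":
            parents.setdefault(edge["sub"], []).append(edge["obj"])
    labels = {node["id"]: node.get("lbl", node["id"]) for node in graph["nodes"]}

    seen, stack = set(), [term_id]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return {labels.get(parent, parent) for parent in seen}


OBO = "http://purl.obolibrary.org/obo/"
# Simplified toy hierarchy: lymphocyte <- T cell <- specialized T cell.
toy_graph = {
    "nodes": [
        {"id": OBO + "CL_0000542", "lbl": "lymphocyte", "type": "CLASS"},
        {"id": OBO + "CL_0000084", "lbl": "T cell", "type": "CLASS"},
        {"id": "CL:0200000", "lbl": "specialized T cell", "type": "CLASS"},
    ],
    "edges": [
        {"sub": OBO + "CL_0000084", "pred": "is_a", "obj": OBO + "CL_0000542"},
        {"sub": "CL:0200000", "pred": "is_a", "obj": OBO + "CL_0000084"},
    ],
}
print(sorted(ancestors(toy_graph, "CL:0200000")))
```

An empty result for the new term would indicate that its `is_a` edge is missing or points to an ID that does not match any node.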

We then create all ontology files for popV from the modified ontology, so that the new cell type is included.

from popv import create_ontology_resources

create_ontology_resources("new_ontology/cl_modified.json")

Run popV

popV needs additional files, namely a dictionary and a language-model embedding of the Cell Ontology. We call the popV helper function that creates these files in the same folder as the cl.json file.

import sys

sys.path.insert(0, "popv")
from popv import create_ontology_resources

create_ontology_resources("resources/ontology/cl.json")
import scanpy as sc
query_adata = sc.read_h5ad("resources/dataset/test/lca_subset.h5ad")
ref_adata = sc.read_h5ad("resources/dataset/test/ts_lung_subset.h5ad")
# Add our new cell-type label to the reference dataset.
# ref_adata.obs['cell_ontology_class'] = ref_adata.obs['cell_ontology_class'].replace('CD4-positive, alpha-beta T cell', 'my special tcell')
# We use a newer cl.json file in which the terms for lung epithelial cells were renamed. The old names are listed as synonyms.
ref_adata.obs["cell_ontology_class"] = ref_adata.obs["cell_ontology_class"].replace(
    "type II pneumocyte", "pulmonary alveolar type 2 cell"
)
ref_adata.obs["cell_ontology_class"] = ref_adata.obs["cell_ontology_class"].replace(
    "type I pneumocyte", "pulmonary alveolar type 1 cell"
)

ref_adata.obs["cell_ontology_class"].value_counts()
cell_ontology_class
macrophage                                  370
pulmonary alveolar type 2 cell              247
basal cell                                   60
non-classical monocyte                       34
capillary endothelial cell                   33
club cell                                    32
classical monocyte                           27
basophil                                     23
CD4-positive, alpha-beta T cell              20
respiratory goblet cell                      18
lung ciliated cell                           15
vein endothelial cell                        14
lung microvascular endothelial cell          14
CD8-positive, alpha-beta T cell              12
fibroblast                                   11
intermediate monocyte                         9
adventitial cell                              9
endothelial cell of artery                    8
pulmonary alveolar type 1 cell                8
neutrophil                                    7
dendritic cell                                6
pericyte                                      6
effector CD8-positive, alpha-beta T cell      3
effector CD4-positive, alpha-beta T cell      3
bronchial smooth muscle cell                  3
plasma cell                                   2
smooth muscle cell                            2
endothelial cell of lymphatic vessel          1
mature NK T cell                              1
pulmonary ionocyte                            1
B cell                                        1
Name: count, dtype: int64
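Renamed terms like the two pneumocyte labels above can be found systematically instead of by hand. The sketch below compares a list of reference labels against the ontology nodes and proposes a replacement whenever a label only appears as a synonym. It assumes that synonyms are stored under `meta["synonyms"]` as entries with a `val` key, as in the OBO Graphs JSON layout; `resolve_labels` and the toy nodes are hypothetical and not part of popV.

```python
def resolve_labels(labels, nodes):
    """Split labels missing from the ontology into renamed and unknown ones."""
    current = {node["lbl"] for node in nodes if node.get("lbl")}
    by_synonym = {}
    for node in nodes:
        if not node.get("lbl"):
            continue
        # Assumption: synonyms live in meta["synonyms"] as {"val": ...} entries.
        for synonym in node.get("meta", {}).get("synonyms", []):
            by_synonym.setdefault(synonym["val"], node["lbl"])

    renamed, unknown = {}, []
    for label in sorted(set(labels)):
        if label in current:
            continue
        if label in by_synonym:
            renamed[label] = by_synonym[label]
        else:
            unknown.append(label)
    return renamed, unknown


# Toy nodes standing in for cell_ontology["graphs"][0]["nodes"].
toy_nodes = [
    {
        "lbl": "pulmonary alveolar type 2 cell",
        "meta": {"synonyms": [{"pred": "hasExactSynonym", "val": "type II pneumocyte"}]},
    },
    {"lbl": "macrophage"},
]
renamed, unknown = resolve_labels(
    ["macrophage", "type II pneumocyte", "my special tcell"], toy_nodes
)
print(renamed)
print(unknown)
```

The `renamed` dictionary has the shape accepted by the pandas `replace` method used on `ref_adata.obs["cell_ontology_class"]`, while anything in `unknown` needs either a manual mapping or a new ontology entry.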
ref_adata.write_h5ad("resources/dataset/test/ts_lung_subset.h5ad")
from popv.preprocessing import Process_Query

adata = Process_Query(
    query_adata,
    ref_adata,
    query_labels_key=None,
    query_batch_key=None,
    ref_labels_key="cell_ontology_class",
    ref_batch_key=None,
    unknown_celltype_label="unknown",
    save_path_trained_models="test",
    # cl_obo_folder="resources/ontology",
    cl_obo_folder=[
        "new_ontology/cl_popv.json",
        "new_ontology/cl.ontology",
        "new_ontology/cl.ontology.nlp.emb",
    ],  # Point to new files.
    prediction_mode="retrain",
    n_samples_per_label=20,
    hvg=1000,
).adata
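Since `cl_obo_folder` above is given as a list of three file paths, a quick existence check before preprocessing can save a failed run. This is a minimal standard-library sketch; `missing_resources` is a hypothetical helper, and the file names are simply the ones listed in the call above.

```python
from pathlib import Path


def missing_resources(paths):
    """Return the entries of `paths` that do not exist as regular files."""
    return [str(path) for path in paths if not Path(path).is_file()]


resource_files = [
    "new_ontology/cl_popv.json",
    "new_ontology/cl.ontology",
    "new_ontology/cl.ontology.nlp.emb",
]
missing = missing_resources(resource_files)
if missing:
    print("Missing ontology resources:", missing)
```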
from popv.annotation import annotate_data

annotate_data(
    adata,
)
WARNING: consider updating your call to make use of `computation`
	Initialization is completed.
	Completed 1 / 10 iteration(s).
	Completed 2 / 10 iteration(s).
	Completed 3 / 10 iteration(s).
Reach convergence after 3 iteration(s).
Found 1000 genes among all datasets
[[0.    0.906]
 [0.    0.   ]]
Processing datasets (0, 1)
Epoch 20/20: 100%|██████████| 20/20 [00:13<00:00,  1.46it/s, v_num=1, train_loss_step=753, train_loss_epoch=1.03e+3]
Epoch 20/20: 100%|██████████| 20/20 [00:13<00:00,  1.45it/s, v_num=1, train_loss_step=753, train_loss_epoch=1.03e+3]
Training cost after epoch 1: loss:14.985920 acc: 0.105 auc: 0.639 auprc: 0.054
Training cost after epoch 2: loss:13.875673 acc: 0.219 auc: 0.799 auprc: 0.143
Training cost after epoch 3: loss:13.127899 acc: 0.338 auc: 0.886 auprc: 0.250
Training cost after epoch 4: loss:12.366190 acc: 0.449 auc: 0.944 auprc: 0.402
Training cost after epoch 5: loss:11.744126 acc: 0.539 auc: 0.972 auprc: 0.619
Training cost after epoch 6: loss:11.237657 acc: 0.611 auc: 0.984 auprc: 0.733
Training cost after epoch 7: loss:10.819752 acc: 0.692 auc: 0.992 auprc: 0.865
Training cost after epoch 8: loss:10.409714 acc: 0.734 auc: 0.997 auprc: 0.921
Training cost after epoch 9: loss:10.058737 acc: 0.781 auc: 0.998 auprc: 0.959
Training cost after epoch 10: loss:9.771173 acc: 0.820 auc: 0.999 auprc: 0.982
Training cost after epoch 11: loss:9.456179 acc: 0.859 auc: 1.000 auprc: 0.995
Training cost after epoch 12: loss:9.184665 acc: 0.877 auc: 1.000 auprc: 1.000
Training cost after epoch 13: loss:8.951223 acc: 0.937 auc: 1.000 auprc: 1.000
Training cost after epoch 14: loss:8.723780 acc: 0.967 auc: 1.000 auprc: 1.000
Training cost after epoch 15: loss:8.506960 acc: 0.970 auc: 1.000 auprc: 1.000
Training cost after epoch 16: loss:8.320274 acc: 0.973 auc: 1.000 auprc: 1.000
Training cost after epoch 17: loss:8.140976 acc: 0.991 auc: 1.000 auprc: 1.000
Training cost after epoch 18: loss:7.973863 acc: 0.994 auc: 1.000 auprc: 1.000
Training cost after epoch 19: loss:7.816499 acc: 0.994 auc: 1.000 auprc: 1.000
Training cost after epoch 20: loss:7.673470 acc: 1.000 auc: 1.000 auprc: 1.000
Training cost after epoch 21: loss:7.544237 acc: 1.000 auc: 1.000 auprc: 1.000
Training cost after epoch 22: loss:7.414350 acc: 1.000 auc: 1.000 auprc: 1.000
Training cost after epoch 23: loss:7.294295 acc: 1.000 auc: 1.000 auprc: 1.000
Training cost after epoch 24: loss:7.181498 acc: 1.000 auc: 1.000 auprc: 1.000
Training cost after epoch 25: loss:7.062901 acc: 1.000 auc: 1.000 auprc: 1.000
Training cost after epoch 26: loss:6.966999 acc: 1.000 auc: 1.000 auprc: 1.000
Training cost after epoch 27: loss:6.860525 acc: 1.000 auc: 1.000 auprc: 1.000
Training cost after epoch 28: loss:6.762265 acc: 1.000 auc: 1.000 auprc: 1.000
Training cost after epoch 29: loss:6.670916 acc: 1.000 auc: 1.000 auprc: 1.000
Training cost after epoch 30: loss:6.588750 acc: 1.000 auc: 1.000 auprc: 1.000
INFO     File test/scvi/model.pt already downloaded                                                                
INFO     Training for 20 epochs.
Epoch 20/20: 100%|██████████| 20/20 [00:37<00:00,  1.88s/it, v_num=1, train_loss_step=702, train_loss_epoch=968]
Epoch 20/20: 100%|██████████| 20/20 [00:37<00:00,  1.89s/it, v_num=1, train_loss_step=702, train_loss_epoch=968]