Hi, this is more of a philosophical question, but I was wondering whether MSE is the right metric for this challenge. Most of the referenced state-of-the-art literature reports correlation-based metrics such as Spearman or Pearson correlation. One reason is that gene expression in a cell can be dominated by a few genes that may not even be important (e.g., housekeeping genes). Translated to MSE, this would result in the error being driven by the genes with the highest absolute expression…
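To make the concern concrete, here is a minimal sketch with made-up numbers (the gene names, values, and the 5% error are all assumptions for illustration, not challenge data). A highly expressed gene that is predicted in the right order but slightly off in scale takes almost all of the squared error, while a low-expression gene predicted in exactly the wrong order barely registers in MSE and is only flagged by the rank correlation.

```python
import pandas as pd

# Hypothetical toy data: rows are cells, columns are genes.
# "HK" stands in for a highly expressed housekeeping gene.
cells = ["C1", "C2", "C3", "C4"]
y_true = pd.DataFrame(
    {
        "HK": [1000.0, 1100.0, 900.0, 1050.0],
        "G1": [1.0, 2.0, 3.0, 4.0],
        "G2": [3.0, 2.0, 1.0, 0.0],
    },
    index=cells,
)

# Prediction: HK is 5% too high everywhere (order across cells preserved),
# G1 is exact, G2 is predicted in reversed order across cells.
y_pred = y_true.copy()
y_pred["HK"] = y_true["HK"] * 1.05
y_pred["G2"] = y_true["G2"].to_numpy()[::-1]

# Share of the total squared error contributed by each gene
mse_per_gene = ((y_true - y_pred) ** 2).mean(axis=0)
mse_share = mse_per_gene / mse_per_gene.sum()

# Spearman correlation of each gene across cells
spearman_per_gene = y_true.corrwith(y_pred, axis=0, method="spearman")

print("Share of MSE per gene:")
print(mse_share.round(4))
print("Spearman per gene:")
print(spearman_per_gene)
```

In this sketch HK accounts for more than 99% of the squared error even though its ranking across cells is perfect, whereas the reversed G2 contributes almost nothing to MSE.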
Hi,
After discussions with the Broad team, we have added Spearman correlation as a second scoring metric alongside MSE. For the first checkpoint, the Spearman correlation will be used for ranking. The Broad team will decide later which metric will be used for the final scoring.
I’m glad to hear this! Thank you for considering the feedback.
Hi @enzo @cruncher-abde ,
Do you compute the correlation gene-wise or cell-wise per sample? It should be gene-wise (i.e., correlation per gene per sample)
Region-wise:
Region == sample, but before that is it per cell or per gene? Do you compute the correlation for each gene in each region?
A sample has multiple regions, each of which contains multiple cells.
The predictions are processed by sample, then grouped by region and then scored.
The formula is:
mean(
    mean(
        scoring(per_region)
        for _, per_region in per_sample.groupby("region")
    )
    for _, per_sample in prediction.groupby("sample")
)
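The nested mean above can be sketched as runnable code. This is only an illustration under stated assumptions: the column names `sample` and `region`, the gene columns, the toy values, and the helper `region_score` (a stand-in for `scoring`, here taken to be the mean per-cell Spearman within one region) are hypothetical and not the organizers' implementation.

```python
from statistics import mean

import pandas as pd

genes = ["g1", "g2", "g3"]

# Hypothetical ground truth: one row per cell, with sample/region labels
truth = pd.DataFrame(
    {
        "sample": ["A", "A", "A", "B"],
        "region": ["r1", "r1", "r2", "r1"],
        "g1": [1.0, 1.0, 1.0, 1.0],
        "g2": [2.0, 2.0, 2.0, 2.0],
        "g3": [3.0, 3.0, 3.0, 3.0],
    },
    index=["c1", "c2", "c3", "c4"],
)

# Hypothetical predictions for the same cells (c3 and c4 are reversed)
pred = pd.DataFrame(
    {
        "g1": [1.0, 1.0, 3.0, 3.0],
        "g2": [2.0, 2.0, 2.0, 2.0],
        "g3": [3.0, 3.0, 1.0, 1.0],
    },
    index=["c1", "c2", "c3", "c4"],
)


def region_score(truth_region: pd.DataFrame, pred_region: pd.DataFrame) -> float:
    """Assumed stand-in for scoring(): mean per-cell Spearman in one region."""
    per_cell = truth_region.corrwith(pred_region, axis=1, method="spearman")
    return float(per_cell.mean())


# Mean over samples of the mean over regions of the region score
nested_score = mean(
    mean(
        region_score(per_region[genes], pred.loc[per_region.index, genes])
        for _, per_region in per_sample.groupby("region")
    )
    for _, per_sample in truth.groupby("sample")
)

# For comparison: a flat mean over all cells, ignoring sample and region
flat_score = float(truth[genes].corrwith(pred[genes], axis=1, method="spearman").mean())

print("Nested mean (sample -> region):", nested_score)
print("Flat mean over cells:", flat_score)
```

With these toy values the nested mean and the flat mean differ, because every region and every sample gets equal weight regardless of how many cells it contains.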
Thanks for clarifying this. My question is whether you compute the correlation score per gene and then average it across regions and samples. I am trying to understand the smallest unit the correlation is computed on (gene or cell) before you start averaging.
Hi @separate-orr, it looks like the correlation score is being computed cell-wise instead of gene-wise. In the state-of-the-art literature, the correlation score is computed gene-wise (e.g., HEST1k, presented at NeurIPS 2024). Here is also their code base: HEST/src/hest/bench/trainer.py at 5a0cbba61550ed21c66bc81c36fa1780e853245d · mahmoodlab/HEST · GitHub
The gene wise correlation is then averaged across regions and samples.
It is also more meaningful because we ask how our prediction of gene A correlates with the ground truth of gene A, and so on for the remaining genes…
Cell-wise, we are looking at how well the genes are predicted ONLY within the specific cell (a less meaningful and easier task). Gene-wise, we are looking at how well each gene is predicted across ALL cells for the given region/sample.
Happy to provide more context if needed.
Could you please look into this?
Thanks!
Hi @many-kalin, thanks for the question, and it is a good one! We actually went back and forth on whether the evaluation should be for a gene across all cells (what you suggest) or for the gene expression within a cell. The two answer different questions. We went with a per-cell basis because, for this competition, we are most interested in predicting the gene programs in individual cells rather than the expression of any individual gene. Most relevantly, we would like to differentiate gene programs between normal mucosa cells and pre-malignant mucosa cells, so accuracy within the cell is relevant. For more general benchmarks like HEST1k, it makes sense that they would focus on the gene expression across all cells. I would also assume that scoring well on a per-gene metric would largely result in scoring well on a per-cell basis too, but there are exceptions. Best, Orr
Hi @separate-orr, thank you for your response. In single-cell analysis, the focus is often on the population or cluster level, not the individual cell. When performing differential gene expression analysis, individual gene expression across populations or clusters is compared. This makes it crucial to ensure accurate gene expression measurements across cells.
To illustrate this, here is a toy example demonstrating that a perfect cell-wise correlation does not always translate into a perfect gene-wise score. Such discrepancies can result in flawed downstream analyses (e.g., incorrect clustering or differential gene expression, DGE).
import pandas as pd
# True values (Y)
Y = pd.DataFrame({
    "F1": [10, 15],  # Feature 1
    "F2": [20, 25],  # Feature 2
    "F3": [30, 35],  # Feature 3
}, index=["S1", "S2"])  # Samples (S1, S2)
# Predicted values (Y_hat)
Y_hat = pd.DataFrame({
    "F1": [5, 10],
    "F2": [30, 25],
    "F3": [45, 40],
}, index=["S1", "S2"])
print("Cell-wise: ", Y.corrwith(Y_hat, axis=1, method='spearman').mean())
print("Gene-wise: ", Y.corrwith(Y_hat, axis=0, method='spearman').mean())
Cell-wise: 1.0
Gene-wise: -0.3333333333333333
@separate-orr Here is an extended example with clustering. Despite having highly correlated features at the cell-wise level, the clustering results appear to be almost random.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import numpy as np
# True values (Y)
Y = pd.DataFrame({
    "F1": [10, 15, 20, 25, 30],  # Feature 1
    "F2": [20, 25, 30, 35, 40],  # Feature 2
    "F3": [30, 35, 40, 45, 50],  # Feature 3
}, index=["S1", "S2", "S3", "S4", "S5"])  # 5 Samples
# Predicted values (Y_hat)
Y_hat = pd.DataFrame({
    "F1": [30, 25, 20, 15, 10],  # Inverse of Y
    "F2": [40, 35, 30, 25, 20],  # Inverse of Y
    "F3": [50, 45, 4.0, 35, 30],  # Inverse of Y, except S3 (4.0)
}, index=["S1", "S2", "S3", "S4", "S5"])
# Apply KMeans clustering
kmeans_Y = KMeans(n_clusters=2, random_state=42).fit(Y)
kmeans_Y_hat = KMeans(n_clusters=2, random_state=42).fit(Y_hat)
# Get the cluster labels
labels_Y = kmeans_Y.labels_
labels_Y_hat = kmeans_Y_hat.labels_
# Measure Adjusted Rand Index (ARI); clustering accuracy
ari = adjusted_rand_score(labels_Y, labels_Y_hat)
# Calculate correlations
sample_wise_corr = Y.corrwith(Y_hat, axis=1, method='spearman').mean()
feature_wise_corr = Y.corrwith(Y_hat, axis=0, method='spearman').mean()
# Display results
print("Cell-wise (sample-wise) correlation: ", sample_wise_corr)
print("Gene-wise (feature-wise) correlation: ", feature_wise_corr)
print("Adjusted Rand Index (ARI):", ari)
Cell-wise (sample-wise) correlation: 0.7
Gene-wise (feature-wise) correlation: -0.8999999999999999
Adjusted Rand Index (ARI): 0.16666666666666666
Here is the answer from the Broad team:
Before any cluster-level analysis, it is crucial to assign each individual cell to the correct cluster. This is particularly important for spatial transcriptomics data, where each cell occupies a fixed spatial location and we want correct cell type annotation first.
This is one of the reasons why we currently use a per-cell metric. For the two toy examples you provided, there are two issues:
- in both examples, gene expression needs to be normalized per cell
- the second simulated dataset should contain clear clusters in the first place.
For example:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import numpy as np
# True values (Y)
Y = pd.DataFrame({
    "F1": [10, 15, 20, 45, 50],  # Feature 1
    "F2": [20, 25, 30, 35, 40],  # Feature 2
    "F3": [30, 35, 40, 25, 30],  # Feature 3
}, index=["S1", "S2", "S3", "S4", "S5"])  # 5 Samples
# Predicted values (Y_hat)
Y_hat = pd.DataFrame({
    "F1": [30, 25, 20, 25, 30],  # Inverse of Y
    "F2": [40, 35, 30, 35, 40],  # Inverse of Y
    "F3": [50, 45, 4.0, 45, 50],  # Inverse of Y
}, index=["S1", "S2", "S3", "S4", "S5"])
Y = np.log1p(Y.T/Y.sum(axis=1)*100).T
Y_hat = np.log1p(Y_hat.T/Y_hat.sum(axis=1)*100).T
# Apply KMeans clustering
kmeans_Y = KMeans(n_clusters=2, random_state=42).fit(Y)
kmeans_Y_hat = KMeans(n_clusters=2, random_state=42).fit(Y_hat)
# Get the cluster labels
labels_Y = kmeans_Y.labels_
labels_Y_hat = kmeans_Y_hat.labels_
# Measure Adjusted Rand Index (ARI); clustering accuracy
ari = adjusted_rand_score(labels_Y, labels_Y_hat)
# Calculate correlations
sample_wise_corr = Y.corrwith(Y_hat, axis=1, method='spearman').mean()
feature_wise_corr = Y.corrwith(Y_hat, axis=0, method='spearman').mean()
# Display results
print("Cell-wise (sample-wise) correlation: ", sample_wise_corr)
print("Gene-wise (feature-wise) correlation: ", feature_wise_corr)
print("Adjusted Rand Index (ARI):", ari)
Hi, has the Broad team decided which metric will be used for the final scoring?
We do not want to announce it too soon, so we are waiting for one last confirmation before we say anything.
Your response does not sufficiently address our questions or provide meaningful insights. As we previously noted, this is simply a toy example, where you could assume that the data is normalized, scaled, and filtered to include the most variable features. Its primary purpose is to illustrate that a high Spearman cell-wise correlation does not necessarily translate to a high Spearman gene-wise correlation.
Additionally, you are again normalizing the prediction (Y_hat), a practice we have already discussed as problematic (see "Re-apply normalization - #5 by many-kalin").
Considering this was intended to be a scientific challenge, it remains unclear why performance is being evaluated cell-wise instead of gene-wise, as is commonly done in the literature [1,2,3,4,5,6], especially since the end goal is to distinguish between healthy and diseased populations (Crunch 3). Moreover, in Crunch 3 the focus is on ranking genes by their predictive value within the population (many cells) across both conditions, rather than on identifying “which cell is the single most diseased one”.
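One property that may help frame the normalization point: a per-cell transform of the kind used in the example above (scale each cell to a total of 100, then apply log1p) is strictly increasing within a cell, so it cannot change the ranking of genes inside that cell, and the cell-wise Spearman score stays the same. Across cells, however, each cell is rescaled by a different factor, so the gene-wise score can change. The sketch below uses made-up values and a hypothetical helper `normalize_per_cell`; it is not the challenge's scoring code.

```python
import numpy as np
import pandas as pd


def normalize_per_cell(df: pd.DataFrame) -> pd.DataFrame:
    """Scale each cell (row) to a total of 100, then apply log1p."""
    return np.log1p(df.div(df.sum(axis=1), axis=0) * 100)


# Hypothetical toy data: rows are cells, columns are genes
Y = pd.DataFrame(
    {"G1": [10.0, 15.0, 20.0], "G2": [22.0, 25.0, 30.0], "G3": [30.0, 35.0, 100.0]},
    index=["C1", "C2", "C3"],
)
Y_hat = pd.DataFrame(
    {"G1": [5.0, 10.0, 12.0], "G2": [30.0, 25.0, 20.0], "G3": [45.0, 40.0, 50.0]},
    index=["C1", "C2", "C3"],
)

Y_norm = normalize_per_cell(Y)
Y_hat_norm = normalize_per_cell(Y_hat)

# Cell-wise: correlation across genes within each cell
cell_raw = Y.corrwith(Y_hat, axis=1, method="spearman").mean()
cell_norm = Y_norm.corrwith(Y_hat_norm, axis=1, method="spearman").mean()

# Gene-wise: correlation across cells for each gene
gene_raw = Y.corrwith(Y_hat, axis=0, method="spearman").mean()
gene_norm = Y_norm.corrwith(Y_hat_norm, axis=0, method="spearman").mean()

print("Cell-wise, raw vs normalized:", cell_raw, cell_norm)
print("Gene-wise, raw vs normalized:", gene_raw, gene_norm)
```

In this sketch the cell-wise score is identical before and after normalization, while the gene-wise score moves, which is why normalizing the prediction matters for one metric but not the other.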
[1] Jaume, G., Doucet, P., Song, A.H., Lu, M.Y., Almagro-Pérez, C., Wagner, S.J., Vaidya, A.J., Chen, R.J., Williamson, D.F., Kim, A. and Mahmood, F., 2024. Hest-1k: A dataset for spatial transcriptomics and histology image analysis. arXiv preprint arXiv:2406.16192.
[2] Xie, R., Pang, K., Chung, S., Perciani, C., MacParland, S., Wang, B. and Bader, G., 2024. Spatially Resolved Gene Expression Prediction from Histology Images via Bi-modal Contrastive Learning. Advances in Neural Information Processing Systems, 36.
[3] He, B., Bergenstråhle, L., Stenbeck, L., Abid, A., Andersson, A., Borg, Å., Maaskola, J., Lundeberg, J. and Zou, J., 2020. Integrating spatial gene expression and breast tumour morphology via deep learning. Nature biomedical engineering, 4(8), pp.827-834.
[4] Jia, Y., Liu, J., Chen, L., Zhao, T. and Wang, Y., 2024. THItoGene: a deep learning method for predicting spatial transcriptomics from histological images. Briefings in Bioinformatics, 25(1), p.bbad464.
[5] Pang, M., Su, K. and Li, M., 2021. Leveraging information in spatial transcriptomics to predict super-resolution gene expression from histology images in tumors. BioRxiv, pp.2021-11.
[6] Schmauch, B., Herpin, L., Olivier, A., Duboudin, T., Dubois, R., Gillet, L., Schiratti, J.B., Di Proietto, V., Le Corre, D., Bourgoin, A. and Taïeb, J., 2024. A deep learning-based multiscale integration of spatial omics with tumor morphology. bioRxiv, pp.2024-07.
and many more