Is the MSE the right metric for benchmark?

many-kalin · December 3, 2024, 11:21am

Hi, this is more of a philosophical question, but I was wondering whether MSE is the right metric for this challenge. Most of the referenced state-of-the-art literature reports correlation-based metrics such as Spearman or Pearson correlation. One reason is that gene expression in a cell can be dominated by a few genes that may not even be important (e.g., housekeeping genes). Under MSE, the error would then be driven by these genes with higher absolute expression…
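To make the concern concrete, here is a minimal sketch with made-up numbers (the expression values and both predictions are hypothetical, not taken from the challenge data). Prediction A ranks every gene correctly but is 10% off on one dominant gene; prediction B nails the dominant gene but reverses the order of the low-expression genes.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical expression of five genes in one cell; gene 0 dominates.
y_true = np.array([1000.0, 1.0, 2.0, 3.0, 4.0])

# Prediction A: correct ordering, 10% error on the dominant gene only.
pred_a = np.array([900.0, 1.0, 2.0, 3.0, 4.0])
# Prediction B: dominant gene exact, low-expression genes in reversed order.
pred_b = np.array([1000.0, 4.0, 3.0, 2.0, 1.0])

mse_a = np.mean((y_true - pred_a) ** 2)
mse_b = np.mean((y_true - pred_b) ** 2)
rho_a, _ = spearmanr(y_true, pred_a)
rho_b, _ = spearmanr(y_true, pred_b)

print("MSE      A:", mse_a, " B:", mse_b)
print("Spearman A:", rho_a, " B:", rho_b)
```

In this toy case MSE strongly prefers B (4 versus 2000), while Spearman prefers A (about 1 versus about 0), because all of A's squared error comes from the single dominant gene.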

cruncher-abde · December 6, 2024, 4:10pm

Hi,

After discussions with the Broad team, we have added Spearman correlation as a second scoring metric alongside MSE. For the first checkpoint, the Spearman correlation will be used for ranking. The Broad team will decide later which metric will be used for the final scoring.

many-kalin · December 6, 2024, 7:10pm

I’m glad to hear this! Thank you for considering the feedback.

many-kalin · December 17, 2024, 3:27pm

Hi @enzo @cruncher-abde ,

Do you compute the correlation gene-wise or cell-wise per sample? It should be gene-wise (i.e., one correlation per gene per sample).

github.com

crunchdao/competitions/blob/1bc16bfaa98fd351e7c8dbf2fcc002aef8273379/competitions/broad-1/scoring/scoring.py#L256


      
          ):
              cell_count = len(y_test.index)
              weight_on_cells = numpy.ones(cell_count) / cell_count
          
              A = y_test.to_numpy()
              B = prediction.to_numpy()
          
              return numpy.sum(weight_on_cells * (numpy.square(A - B)).mean(axis=1))
          
          
          def _spearman(
              prediction: pandas.DataFrame,
              y_test: pandas.DataFrame,
          ):
              cell_count = len(y_test.index)
              weight_on_cells = numpy.ones(cell_count) / cell_count
          
              A = y_test.to_numpy()
              B = prediction.to_numpy()
          
              rank_A = scipy.stats.rankdata(A, axis=1)

enzo · December 17, 2024, 3:43pm

Region-wise:

github.com

crunchdao/competitions/blob/1bc16bfaa98fd351e7c8dbf2fcc002aef8273379/competitions/broad-1/scoring/scoring.py#L167


      
              filtered_predictions = target_predictions[target_predictions["cell_id"].isin(y_test.index)]
              target_predictions = None
          
          with tracer.log("Pivoting the dataframe"):
              filtered_predictions = filtered_predictions.pivot(index='cell_id', columns='gene', values='prediction')
          
          with tracer.log("Score prediction"):
              mse_score = crunch.scoring.ScoredMetric(None, [])
              spearman_score = crunch.scoring.ScoredMetric(None, [])
          
              for region_id, cell_ids in tracer.loop(region_cell_mapping.groupby("region_id", observed=True), lambda x: f"Score region -> {x[0]}"):
                  cell_ids = set(cell_ids)
          
                  with tracer.log(f"Filter y_test"):
                      region_prediction = filtered_predictions[filtered_predictions.index.isin(cell_ids)]
          
                  with tracer.log(f"Filter y_test"):
                      region_y_test = y_test[y_test.index.isin(cell_ids)]
          
                  with tracer.log("Ensure the same index and column order"):
                      region_prediction = region_prediction.reindex(index=region_y_test.index, columns=region_y_test.columns)

many-kalin · December 17, 2024, 3:46pm

Region == sample, but before that is it per cell or per gene? Do you compute the correlation for each gene in each region?

enzo · December 17, 2024, 4:04pm

A sample has multiple regions, each of which has multiple cells.
The predictions are processed by sample, then grouped by region, and then scored.

The formula is:

mean(
    mean(
        scoring(per_region)
        for per_region in per_sample.groupby("region")
    )
    for per_sample in prediction.groupby("sample")
)
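As a rough illustration of that mean-of-means aggregation, here is a small pandas sketch. The `region_scores` table and its values are hypothetical; in the real pipeline each score would come from scoring one region, and the actual implementation may differ in details.

```python
import pandas as pd

# Hypothetical per-region scores: sample A has three regions, sample B has one.
region_scores = pd.DataFrame({
    "sample": ["A", "A", "A", "B"],
    "region": ["r1", "r2", "r3", "r1"],
    "score": [0.2, 0.4, 0.6, 0.8],
})

# Inner mean: average the region scores within each sample.
per_sample = region_scores.groupby("sample")["score"].mean()
# Outer mean: average the per-sample means.
final_score = per_sample.mean()
# For comparison: a flat mean over all regions, ignoring samples.
flat_mean = region_scores["score"].mean()

print(per_sample)
print("mean of means:", final_score)
print("flat mean:", flat_mean)
```

With these numbers the mean of means is about 0.6 while the flat mean is about 0.5: each sample carries equal weight regardless of how many regions it contains.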

many-kalin · December 17, 2024, 4:13pm

Thanks for clarifying this. My question is whether you compute the correlation score per gene and then average it across regions and samples. I am trying to understand the smallest unit over which the correlation is computed (gene or cell) before you start averaging.

many-kalin · December 17, 2024, 4:51pm

Hi @separate-orr, it looks like the correlation score is being computed cell-wise instead of gene-wise. In the state-of-the-art literature, the correlation score is computed gene-wise (e.g., HEST-1k, presented at NeurIPS 2024). Here is also their code base: HEST/src/hest/bench/trainer.py at 5a0cbba61550ed21c66bc81c36fa1780e853245d · mahmoodlab/HEST · GitHub

The gene-wise correlation is then averaged across regions and samples.

It is also more meaningful, because we ask how our prediction of gene A correlates with the ground truth of gene A, and so on for the remaining genes…

Cell-wise, we look at how well the genes are predicted ONLY within a specific cell (a less meaningful and easier task). Gene-wise, we look at how well a gene is predicted across ALL cells of the given region/sample.

Happy to provide more context if needed.

Could you please look into this?
Thanks!

separate-orr · December 28, 2024, 10:28pm

Hi @many-kalin, thanks for the question, it is a good one! We actually went back and forth on whether the evaluation should be for a gene across all cells (what you suggest) or for the gene expression within a cell. The two answer different questions. We went with a per-cell basis because, for this competition, we are most interested in predicting the gene programs in individual cells rather than the expression of any individual gene. Most relevantly, we would like to differentiate gene programs between normal mucosa cells and pre-malignant mucosa cells, so accuracy within the cell is relevant. For more general benchmarks such as HEST-1k, it makes sense that they focus on gene expression across all cells. I would also assume that scoring well on a per-gene metric largely results in scoring well on a per-cell basis too, but there are exceptions. Best, Orr

many-kalin · December 29, 2024, 9:39am

Hi @separate-orr, thank you for your response. In single-cell analysis, the focus is often on the population or cluster level, not the individual cell. When performing differential gene expression analysis, individual gene expression across populations or clusters is compared. This makes it crucial to ensure accurate gene expression measurements across cells.

To illustrate this, here is a toy example demonstrating that a perfect cell-wise correlation does not necessarily translate into a perfect gene-wise score. Such discrepancies can result in flawed downstream analyses (e.g., incorrect clustering or differential gene expression results).

import pandas as pd

# True values (Y)
Y = pd.DataFrame({
    "F1": [10, 15],  # Feature 1
    "F2": [20, 25],  # Feature 2
    "F3": [30, 35],  # Feature 3
}, index=["S1", "S2"])  # Samples (S1, S2)

# Predicted values (Y_hat)
Y_hat = pd.DataFrame({
    "F1": [5, 10],
    "F2": [30, 25],
    "F3": [45, 40],
}, index=["S1", "S2"])

print("Cell-wise: ", Y.corrwith(Y_hat, axis=1, method='spearman').mean())
print("Gene-wise: ", Y.corrwith(Y_hat, axis=0, method='spearman').mean())

Cell-wise: 1.0
Gene-wise: -0.3333333333333333
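For anyone who wants to see where the -0.33 comes from, the same toy data can be broken down per cell and per gene before averaging. This is just a sketch that reuses the example above without the final `.mean()`.

```python
import pandas as pd

# Same toy data as above.
Y = pd.DataFrame({"F1": [10, 15], "F2": [20, 25], "F3": [30, 35]}, index=["S1", "S2"])
Y_hat = pd.DataFrame({"F1": [5, 10], "F2": [30, 25], "F3": [45, 40]}, index=["S1", "S2"])

# One Spearman correlation per cell (row): gene ordering within each cell.
per_cell = Y.corrwith(Y_hat, axis=1, method="spearman")
# One Spearman correlation per gene (column): cell ordering for each gene.
per_gene = Y.corrwith(Y_hat, axis=0, method="spearman")

print(per_cell)
print(per_gene)
```

Both cells get a correlation of 1 because the gene order F1 < F2 < F3 is preserved within each cell. Per gene, only F1 moves in the right direction from S1 to S2 (+1), while F2 and F3 are predicted to decrease when they actually increase (-1 each), giving (1 - 1 - 1) / 3.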

many-kalin · December 29, 2024, 10:37am

@separate-orr Here is an extended example with clustering. Despite having highly correlated features at the cell-wise level, the clustering results appear to be almost random.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import numpy as np

# True values (Y)
Y = pd.DataFrame({
    "F1": [10, 15, 20, 25, 30],  # Feature 1
    "F2": [20, 25, 30, 35, 40],  # Feature 2
    "F3": [30, 35, 40, 45, 50],  # Feature 3
}, index=["S1", "S2", "S3", "S4", "S5"])  # 5 Samples

# Predicted values (Y_hat)
Y_hat = pd.DataFrame({
    "F1": [30, 25, 20, 15, 10],  # Inverse of Y
    "F2": [40, 35, 30, 25, 20],  # Inverse of Y
    "F3": [50, 45, 4.0, 35, 30],  # Inverse of Y
}, index=["S1", "S2", "S3", "S4", "S5"])

# Apply KMeans clustering
kmeans_Y = KMeans(n_clusters=2, random_state=42).fit(Y)
kmeans_Y_hat = KMeans(n_clusters=2, random_state=42).fit(Y_hat)

# Get the cluster labels
labels_Y = kmeans_Y.labels_
labels_Y_hat = kmeans_Y_hat.labels_

# Measure Adjusted Rand Index (ARI); clustering accuracy
ari = adjusted_rand_score(labels_Y, labels_Y_hat)

# Calculate correlations
sample_wise_corr = Y.corrwith(Y_hat, axis=1, method='spearman').mean()
feature_wise_corr = Y.corrwith(Y_hat, axis=0, method='spearman').mean()

# Display results
print("Cell-wise (sample-wise) correlation: ", sample_wise_corr)
print("Gene-wise (feature-wise) correlation: ", feature_wise_corr)
print("Adjusted Rand Index (ARI):", ari)

Cell-wise (sample-wise) correlation: 0.7
Gene-wise (feature-wise) correlation: -0.8999999999999999
Adjusted Rand Index (ARI): 0.16666666666666666

relieved-jingzhe · January 14, 2025, 3:01pm

Hi, has the Broad team decided which metric will be used for the final scoring?

enzo · January 14, 2025, 3:05pm

Here is the answer from the Broad team:

Before any cluster-level analysis, it’s crucial to assign each individual cell correctly to a cluster. This is particularly important for spatial transcriptomics data, where each cell occupies a fixed spatial location and we want correct cell type annotation first.

This is one of the reasons why we currently chose a per-cell metric. For the two toy examples you provided, there are two issues:

In both examples, gene expression needs to be normalized per cell.
The second simulated dataset should have clear clusters to begin with.

For example:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import numpy as np
# True values (Y)
Y = pd.DataFrame({
    "F1": [10, 15, 20, 45, 50],  # Feature 1
    "F2": [20, 25, 30, 35, 40],  # Feature 2
    "F3": [30, 35, 40, 25, 30],  # Feature 3
}, index=["S1", "S2", "S3", "S4", "S5"])  # 5 Samples
# Predicted values (Y_hat)
Y_hat = pd.DataFrame({
    "F1": [30, 25, 20, 25, 30],  # Inverse of Y
    "F2": [40, 35, 30, 35, 40],  # Inverse of Y
    "F3": [50, 45, 4.0, 45, 50],  # Inverse of Y
}, index=["S1", "S2", "S3", "S4", "S5"])
Y = np.log1p(Y.T/Y.sum(axis=1)*100).T
Y_hat = np.log1p(Y_hat.T/Y_hat.sum(axis=1)*100).T
# Apply KMeans clustering
kmeans_Y = KMeans(n_clusters=2, random_state=42).fit(Y)
kmeans_Y_hat = KMeans(n_clusters=2, random_state=42).fit(Y_hat)
# Get the cluster labels
labels_Y = kmeans_Y.labels_
labels_Y_hat = kmeans_Y_hat.labels_
# Measure Adjusted Rand Index (ARI); clustering accuracy
ari = adjusted_rand_score(labels_Y, labels_Y_hat)
# Calculate correlations
sample_wise_corr = Y.corrwith(Y_hat, axis=1, method='spearman').mean()
feature_wise_corr = Y.corrwith(Y_hat, axis=0, method='spearman').mean()
# Display results
print("Cell-wise (sample-wise) correlation: ", sample_wise_corr)
print("Gene-wise (feature-wise) correlation: ", feature_wise_corr)
print("Adjusted Rand Index (ARI):", ari)
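One property of this per-cell normalization may be worth keeping in mind when reading the numbers: dividing a row by its (positive) total and applying log1p is monotonic within that row, so it cannot change the within-cell gene ranks that cell-wise Spearman depends on, but it can change how cells rank against each other for a given gene. The sketch below checks this on the Y_hat values above; `normalize_per_cell` is a hypothetical helper name, written to be equivalent to the transform used in the example.

```python
import numpy as np
import pandas as pd

def normalize_per_cell(df):
    # Scale each row (cell) to sum to 100, then apply log1p.
    return np.log1p(df.div(df.sum(axis=1), axis=0) * 100)

# Same predicted values as in the example above.
Y_hat = pd.DataFrame({
    "F1": [30, 25, 20, 25, 30],
    "F2": [40, 35, 30, 35, 40],
    "F3": [50, 45, 4.0, 45, 50],
}, index=["S1", "S2", "S3", "S4", "S5"])

Y_hat_norm = normalize_per_cell(Y_hat)

# Ranks of genes within each cell (what cell-wise Spearman uses).
same_row_ranks = Y_hat.rank(axis=1).equals(Y_hat_norm.rank(axis=1))
# Ranks of cells within each gene (what gene-wise Spearman uses).
same_col_ranks = Y_hat.rank(axis=0).equals(Y_hat_norm.rank(axis=0))

print("Within-cell ranks unchanged:", same_row_ranks)
print("Within-gene ranks unchanged:", same_col_ranks)
```

On this data the within-cell ranks are identical before and after normalization, while the within-gene ranks are not, so the normalization step can only affect the gene-wise score and the clustering, not the cell-wise score.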

enzo · January 14, 2025, 3:18pm

relieved-jingzhe: "Hi, has the Broad team decided which metric will be used for the final scoring?"

We do not want to announce it too soon, so we are waiting for one last confirmation before we say anything.

many-kalin · January 14, 2025, 5:11pm

Your response does not sufficiently address our questions or provide meaningful insights. As we previously noted, this is simply a toy example, where you could assume that the data is normalized, scaled, and filtered to include the most variable features. Its primary purpose is to illustrate that a high Spearman cell-wise correlation does not necessarily translate to a high Spearman gene-wise correlation.
Additionally, you are again normalizing the prediction (Y_hat), a practice we have already discussed as problematic (see "Re-apply normalization", #5 by many-kalin).

Considering that this was intended to be a scientific challenge, it remains unclear why performance is being evaluated cell-wise instead of gene-wise, as is commonly done in the literature [1,2,3,4,5,6], especially since the end goal is to distinguish between healthy and diseased populations (Crunch 3). Moreover, in Crunch 3 the focus is on ranking genes by their predictive value within the population (many cells) across both conditions, rather than on identifying “which cell is the single most diseased one”.

[1] Jaume, G., Doucet, P., Song, A.H., Lu, M.Y., Almagro-Pérez, C., Wagner, S.J., Vaidya, A.J., Chen, R.J., Williamson, D.F., Kim, A. and Mahmood, F., 2024. HEST-1k: A dataset for spatial transcriptomics and histology image analysis. arXiv preprint arXiv:2406.16192.
[2] Xie, R., Pang, K., Chung, S., Perciani, C., MacParland, S., Wang, B. and Bader, G., 2024. Spatially Resolved Gene Expression Prediction from Histology Images via Bi-modal Contrastive Learning. Advances in Neural Information Processing Systems, 36.
[3] He, B., Bergenstråhle, L., Stenbeck, L., Abid, A., Andersson, A., Borg, Å., Maaskola, J., Lundeberg, J. and Zou, J., 2020. Integrating spatial gene expression and breast tumour morphology via deep learning. Nature biomedical engineering, 4(8), pp.827-834.
[4] Jia, Y., Liu, J., Chen, L., Zhao, T. and Wang, Y., 2024. THItoGene: a deep learning method for predicting spatial transcriptomics from histological images. Briefings in Bioinformatics, 25(1), p.bbad464.
[5] Pang, M., Su, K. and Li, M., 2021. Leveraging information in spatial transcriptomics to predict super-resolution gene expression from histology images in tumors. BioRxiv, pp.2021-11.
[6] Schmauch, B., Herpin, L., Olivier, A., Duboudin, T., Dubois, R., Gillet, L., Schiratti, J.B., Di Proietto, V., Le Corre, D., Bourgoin, A. and Taïeb, J., 2024. A deep learning-based multiscale integration of spatial omics with tumor morphology. bioRxiv, pp.2024-07.
and many more
