0 capping in the case of negative values and rounding values to 2 decimal points

many-kalin · December 14, 2024, 9:39am

Hi,

While I understand the rationale for enforcing 0 as the lowest score in the case of MSE, I wonder whether this constraint should also apply to the correlation-based metrics. It is true that the model most likely never encountered negative log1p expression values; however, interpolation in the negative domain could still provide meaningful model signal. Correlation-based metrics could capture this signal effectively, which is another advantage of using them in such cases.
I think rounding to two decimals also smooths out the model’s signal. After all, the normalized expression is multiplied by 100 and then log1p-transformed, so keeping more decimals shouldn’t lead to numerical instability.

This model signal smoothing could also make the differences between models smaller on the leaderboard.
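
To make the concern concrete, here is a minimal sketch on synthetic data. It is not the actual scoring.py code: `postprocess` and `pearson` are hypothetical helpers, and the data distribution is an assumption chosen only so that some raw predictions fall below zero. It compares a Pearson correlation computed on raw predictions with one computed after 0 capping and rounding to 2 decimals.

```python
import numpy as np

def postprocess(pred, decimals=2):
    """Hypothetical stand-in for the scoring-side transformation:
    cap negative values at 0, then round to `decimals` places."""
    return np.round(np.clip(pred, 0.0, None), decimals)

def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)

# Synthetic, non-negative "log1p expression" targets (assumed distribution).
truth = np.log1p(rng.gamma(shape=0.5, scale=2.0, size=1000))

# Synthetic predictions: noisy and shifted down, so some go negative.
pred = truth + rng.normal(0.0, 0.3, size=1000) - 0.2

processed = postprocess(pred)

raw_r = pearson(truth, pred)
post_r = pearson(truth, processed)

print(f"negative predictions: {(pred < 0).sum()} of {pred.size}")
print(f"Pearson on raw predictions:       {raw_r:.4f}")
print(f"Pearson after capping + rounding: {post_r:.4f}")
```

Whether the transformation raises or lowers the correlation depends on the data, so running the same comparison on real predictions would be the informative test.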

many-kalin · December 14, 2024, 9:49am

This is the code base I am referring to: competitions/competitions/broad-1/scoring/scoring.py at master · crunchdao/competitions · GitHub

In general, I think external output transformations should be avoided as much as possible. Rounding errors add up quickly.

cruncher-abde · December 16, 2024, 2:20pm

Which transformation are you referring to in the file scoring.py?

many-kalin · December 16, 2024, 2:30pm

The 0 capping: competitions/competitions/broad-1/scoring/scoring.py at 1bc16bfaa98fd351e7c8dbf2fcc002aef8273379 · crunchdao/competitions · GitHub
There is also the requirement to round to 2 decimals.

cruncher-abde · December 16, 2024, 2:49pm

You are predicting the expression of genes for a given cell, so it makes sense for the value to be non-negative, as it represents a count of occurrences.

Regarding the rounding, this is a specific requirement outlined in the full specification:

Make sure your predictions are log1p-normalized as in anucleus.X […] rounded to 2 decimal points…

We will wait for feedback from the Broad team. @separate-orr

many-kalin · December 16, 2024, 2:57pm

Negative values represent model interpolation (i.e., model signal) in the negative domain (e.g., underexpression of a given gene).

cruncher-abde · December 18, 2024, 11:18am

Broad confirmed what I said:

gene expression values should not be negative, either detected (positive) or not detected (zero)

cruncher-abde · December 18, 2024, 11:20am

About precision:

I don’t think having more decimal points could boost model performance in a meaningful way, given that the spread (e.g. standard deviation) of non-zero values in the cell-by-gene matrix is about two orders of magnitude larger than the two-decimal rounding step (0.01).
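
As a rough illustration of that scale argument, the sketch below rounds synthetic values to 2 decimals and measures the error this introduces. The data is an assumption (values with a spread of roughly 1, not the real cell-by-gene matrix), and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic non-negative values with a spread of roughly 1 (assumption,
# standing in for non-zero entries of the cell-by-gene matrix).
values = np.abs(rng.normal(1.0, 1.0, size=10_000))

rounded = np.round(values, 2)
err = rounded - values

# Rounding to 2 decimals moves each value by at most half a step (0.005).
max_abs_err = float(np.max(np.abs(err)))

# Contribution of rounding alone to a mean squared error.
mse_from_rounding = float(np.mean(err ** 2))

# How many rounding steps fit into one standard deviation of the data.
ratio = float(np.std(values)) / 0.01

print(f"max |rounding error|: {max_abs_err:.6f}")
print(f"MSE from rounding:    {mse_from_rounding:.2e}")
print(f"std / rounding step:  {ratio:.1f}")
```

For rank-based metrics the picture can differ, since rounding creates ties among values that share the same two decimals.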
