Capping negative values at 0 and rounding values to 2 decimal places

Hi,

  1. While I understand the rationale for enforcing 0 as the lowest score in the case of MSE, I wonder if this constraint should also apply to the correlation-based metrics. It is true that the model most likely never encountered negative log1p expression; however, interpolation in the negative domain could still provide meaningful model signal. This signal could be effectively captured by correlation-based metrics, which is another advantage of using them in such cases.

  2. I think rounding to two decimals also smooths out the model’s signal. After all, the normalized expression is multiplied by 100 and then log1p-transformed, so rounding to more decimals shouldn’t lead to numerical instability.

Smoothing the model signal this way could also shrink the differences between models on the leaderboard.
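
To make the two points above concrete, here is a minimal sketch that compares MSE and Pearson correlation before and after capping at 0 and rounding to 2 decimals. The toy values and the helper names (`pearson`, `mse`, `postprocess`) are invented for illustration and are not taken from scoring.py. Whether the correlation goes up or down after capping depends on the data, so this is only a way to measure the effect, not evidence for either side.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def mse(x, y):
    """Mean squared error of two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def postprocess(pred):
    """Assumed postprocessing: cap at 0, then round to 2 decimals."""
    return [round(max(0.0, p), 2) for p in pred]

# Hypothetical toy values, not competition data.
truth = [0.0, 0.0, 0.0, 0.5, 1.2, 2.0]
raw_pred = [-0.613, -0.274, -0.052, 0.437, 1.006, 1.894]  # goes below zero

post_pred = postprocess(raw_pred)

mse_raw, mse_post = mse(truth, raw_pred), mse(truth, post_pred)
r_raw, r_post = pearson(truth, raw_pred), pearson(truth, post_pred)

print(f"MSE      raw={mse_raw:.4f}  postprocessed={mse_post:.4f}")
print(f"Pearson  raw={r_raw:.4f}  postprocessed={r_post:.4f}")
```

Running the same comparison on real predictions would show how much capping and rounding actually move each metric for a given model.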

This is the code base I am referring to: competitions/competitions/broad-1/scoring/scoring.py at master · crunchdao/competitions · GitHub

In general, I think external output transformations should be avoided as much as possible. Rounding errors add up quickly.

Which transformation are you referring to in the file scoring.py?

The 0 capping: competitions/competitions/broad-1/scoring/scoring.py at 1bc16bfaa98fd351e7c8dbf2fcc002aef8273379 · crunchdao/competitions · GitHub
There is also the requirement to round to 2 decimals.

You are predicting the expression of genes for a given cell, so it makes sense for the value to be non-negative, as it represents a count of occurrences.

Regarding the rounding, this is a specific requirement outlined in the full specification:

Make sure your predictions are log1p-normalized as in anucleus.X […] rounded to 2 decimal points…

We will wait for feedback from the Broad team. @separate-orr

Negative values represent model interpolation (i.e., model signal) in the negative domain (e.g., underexpression of a given gene).

Broad confirmed what I said:

gene expression values should not be negative, either detected (positive) or not detected (zero)


About precision:

I don’t think having more decimal places could boost model performance in a meaningful way, given that the spread (e.g. standard deviation) of non-zero values in the cell-by-gene matrix is about two orders of magnitude larger than the rounding step of two decimal places (0.01).
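
As a rough check of this argument, the sketch below measures the error that rounding to 2 decimals introduces. The input values are random stand-ins drawn from an assumed uniform range, not real expression data. Rounding to a step of 0.01 changes each value by at most 0.005, and if the error is roughly uniform its root-mean-square size is about 0.01 / sqrt(12) ≈ 0.0029.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

STEP = 0.01  # rounding step implied by two decimal places

# Hypothetical stand-in for non-zero log1p expression values.
values = [random.uniform(0.0, 5.0) for _ in range(100_000)]

# Error introduced by rounding each value to 2 decimals.
errors = [round(v, 2) - v for v in values]

max_abs_error = max(abs(e) for e in errors)
rms_error = math.sqrt(sum(e * e for e in errors) / len(errors))
theoretical_rms = STEP / math.sqrt(12)  # RMS of a uniform error on [-STEP/2, STEP/2]

print(f"max |error| = {max_abs_error:.5f}")
print(f"RMS error   = {rms_error:.5f} (uniform-error estimate {theoretical_rms:.5f})")
```

If the standard deviation of the non-zero values is indeed about two orders of magnitude above 0.01, the rounding error is a few thousandths of that spread.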
