Hi,
-
While I understand the rationale for enforcing 0 as the lowest score in the case of MSE, I wonder whether this constraint should also apply to the correlation-based metrics. It is true that the model most likely never encountered negative log1p expression values; however, interpolation into the negative domain could still carry meaningful model signal. Correlation-based metrics could capture that signal effectively, which is another advantage of using them in such cases.
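To make this concrete, here is a toy sketch with made-up numbers (it assumes the scorer clips predictions at zero before computing metrics, which is my reading of the code): two low-expression cells that the raw outputs rank correctly become tied once clipped, so a rank correlation like Spearman loses signal it would have captured on the unclipped outputs.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical ground truth (log1p expression, always >= 0)
y_true = np.array([0.0, 0.3, 1.2, 2.5])

# Raw model output: the negative values still rank the two
# low-expression cells correctly (interpolation below zero)
y_pred = np.array([-0.6, -0.1, 1.1, 2.4])

rho_raw = spearmanr(y_true, y_pred).correlation  # perfect ranking: 1.0
# Clipping collapses the negatives into a tie at 0 and degrades the rank signal
rho_clipped = spearmanr(y_true, np.clip(y_pred, 0, None)).correlation
```

The MSE on the clipped values is identical for any prediction that maps both low cells below zero, so only the correlation metric can see this difference.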
-
I think rounding to two decimals also smooths out the model’s signal. Eventually, the normalized expression is multiplied by 100 and then log1p-transformed; keeping more decimal places at that stage should not cause numerical instability.
This smoothing of the model signal could also compress the differences between models on the leaderboard.
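A small illustration of the smoothing, with hypothetical numbers: predictions that differ only past the second decimal after log1p collapse to the same scored value, and because the transform is logarithmic, one 0.01 bin covers an increasingly wide range of counts at higher expression.

```python
import numpy as np

# Two predictions that differ only past the 2nd decimal after log1p
a = np.log1p(1.22)   # ~0.7975
b = np.log1p(1.23)   # ~0.8020
assert round(a, 2) == round(b, 2)  # indistinguishable after rounding

# A 0.01 step in log1p space corresponds to an increasingly coarse
# step in (normalized) count space as expression grows:
step_low = np.expm1(0.81) - np.expm1(0.80)   # ~0.022 counts
step_high = np.expm1(5.01) - np.expm1(5.00)  # ~1.49 counts
```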
This is the code base I am referring to: competitions/broad-1/scoring/scoring.py (master branch of crunchdao/competitions on GitHub).
In general, I think external output transformations should be avoided as much as possible. Rounding errors add up quickly.
Which transformation are you referring to in the file scoring.py?
You are predicting the expression of genes for a given cell, so it makes sense for the values to be non-negative, as they represent occurrence counts.
Regarding the rounding, this is a specific requirement outlined in the full specification:
Make sure your predictions are log1p-normalized as in anucleus.X […] rounded to 2 decimal points…
We will wait for feedback from the Broad team. @separate-orr
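For what it's worth, a minimal sketch of that requirement as I understand it (the per-cell normalization to a target sum of 100 follows the discussion above; `format_predictions` is a hypothetical helper, not part of scoring.py):

```python
import numpy as np

def format_predictions(counts: np.ndarray) -> np.ndarray:
    """Hypothetical helper: normalize each cell's counts, scale by 100,
    log1p-transform, and round to 2 decimals as the spec requires."""
    norm = counts / counts.sum(axis=1, keepdims=True)  # per-cell normalization (assumed)
    return np.round(np.log1p(norm * 100), 2)

# One cell with raw counts for three genes
preds = format_predictions(np.array([[5.0, 3.0, 2.0]]))
```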
Negative values represent model interpolation (i.e., model signal) in the negative domain (e.g., underexpression of a given gene).
Broad confirmed what I said:
gene expression values should not be negative, either detected (positive) or not detected (zero)