The current scoring is unstable

Hi, thanks for incorporating my feedback regarding the gene-wise Spearman score into Crunch 2.

However, there are several issues with the current scoring approach that make the rankings highly unstable:

  1. Averaging the leaderboard ranks of the cell-wise and gene-wise Spearman scores discards the magnitude of the differences between models; only their ordering survives.
  2. In cases of tied ranks (e.g., Player A ranks 1st in gene-wise and 2nd in cell-wise, while Player B has the opposite ranking), the player who submitted their model first is ranked higher. This creates an arbitrary advantage and becomes even more problematic with multiple players and ties.
  3. If someone submits two models optimized individually for each metric, they could reverse-engineer the leaderboard rankings to submit the model that maximizes their final rank.
  4. Gene-wise and cell-wise rankings are likely to be anti-correlated (a high cell-wise rank tends to come with a low gene-wise rank and vice versa), meaning the current method does not reflect meaningful performance differences.

These issues lead to an unstable and potentially unfair ranking system. A simple improvement would be to compute the average of the two metrics before ranking, which would add some stability. However, the fundamental issues would still remain.
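To make the difference concrete, here is a small sketch (hypothetical scores, plain numpy/scipy, not the actual platform code) comparing the current average-of-ranks scheme with averaging the metric values before ranking:

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical leaderboard: per-player gene-wise and cell-wise Spearman scores.
gene_wise = np.array([0.61, 0.60, 0.35])   # players A, B, C
cell_wise = np.array([0.30, 0.31, 0.59])

# Current scheme: rank each metric separately, then average the ranks.
# Rank 1 = best, so rank the negated scores.
rank_gene = rankdata(-gene_wise)           # A=1, B=2, C=3
rank_cell = rankdata(-cell_wise)           # A=3, B=2, C=1
avg_rank = (rank_gene + rank_cell) / 2     # A=2.0, B=2.0, C=2.0 -> three-way tie

# Proposed alternative: average the metric values first, then rank once.
avg_score = (gene_wise + cell_wise) / 2    # A=0.455, B=0.455, C=0.47
final_rank = rankdata(-avg_score)          # C wins; A and B remain tied

print(avg_rank, final_rank)
```

With these made-up numbers, rank averaging collapses three quite different models into a three-way tie, while averaging the scores at least preserves who is actually ahead.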

I would like to hear your thoughts.

Also, one should keep in mind that introducing a new scoring scheme would make the models developed in Crunch 1 no longer optimal for Crunch 2 (it breaks the idea that Crunch 2 builds upon Crunch 1).

Here is a rephrased response from the Broad team:

We agree that the two scores model different things.
A simple average of the two scores did not seem appropriate, so we currently use the average rank as a simple ranking mechanism.

But we agree that multiple rankings would have been better; the platform just does not currently support them.

Regarding the ties, we are still in discussion with the Broad team.

I don’t think this addresses the fundamental issues described above; it only acknowledges them.

Right now, if most participants optimize for metric A while only a few focus on metric B, those optimizing for metric B will be favored due to lower competition and higher ranks—and vice versa. This introduces an element of luck, as a model’s ranking depends on the choices of the majority rather than its actual performance. Moreover, the rankings will fluctuate constantly based on which metric happens to be more favorable for the larger group of players at any given time.
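A toy example of that effect (again with made-up numbers, not actual leaderboard data): a field of four gene-wise specialists plus one cell-wise specialist, scored with the current average-of-ranks scheme:

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical field: four players tuned for the gene-wise metric
# and one player tuned for the cell-wise metric.
#                  gene-wise  cell-wise
scores = np.array([[0.62, 0.33],   # G1
                   [0.60, 0.32],   # G2
                   [0.58, 0.31],   # G3
                   [0.56, 0.30],   # G4
                   [0.30, 0.50]])  # C (cell-wise specialist)

ranks = np.column_stack([rankdata(-scores[:, j]) for j in range(2)])
avg_rank  = ranks.mean(axis=1)      # current scheme: average of the two ranks
avg_score = scores.mean(axis=1)     # for comparison: average of the two scores

print(avg_rank)    # [1.5 2.5 3.5 4.5 3. ] -> C places 3rd of 5
print(avg_score)   # [0.475 0.46 0.445 0.43 0.4] -> C has the lowest mean score
```

The cell-wise specialist finishes 3rd of 5 despite having the lowest average score, purely because almost nobody else competes on that metric; mirror the field composition and the advantage flips the other way.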

A model’s performance should be evaluated on its own metric values, rather than being judged relative to other models (i.e., by looking at the ranks of other players).

New recommendation: it would be better to commit to one of the two metrics. Gene-wise Spearman is the more reasonable choice, both because it is the one used in the literature and because it is more meaningful than the cell-wise score, given that there are only 20 genes per cell to rank. Disadvantage: it differs from Crunch 1.
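For reference, a minimal sketch of how the two orientations could be computed on a cells × genes prediction matrix (synthetic data, illustrative names and shapes, not the official metric implementation); it mainly highlights that each cell-wise correlation is estimated from only 20 values:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_cells, n_genes = 1000, 20          # cell-wise correlations rank only 20 genes per cell
y_true = rng.random((n_cells, n_genes))
y_pred = y_true + 0.3 * rng.standard_normal((n_cells, n_genes))

# Gene-wise: one Spearman per gene, computed across all cells (1000 values each).
gene_wise = np.mean([spearmanr(y_true[:, g], y_pred[:, g])[0]
                     for g in range(n_genes)])

# Cell-wise: one Spearman per cell, computed across the 20 genes of that cell.
cell_wise = np.mean([spearmanr(y_true[c, :], y_pred[c, :])[0]
                     for c in range(n_cells)])

print(gene_wise, cell_wise)
```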

It is the participant’s job to optimize for both.