The current scoring is unstable

Hi, thanks for incorporating my feedback regarding the gene-wise Spearman score into Crunch 2.

However, there are several issues with the current scoring approach that make the rankings highly unstable:

  1. Averaging the leaderboard ranks of the cell-wise and gene-wise Spearman scores discards the magnitude of the differences between models; only their ordering survives.
  2. In cases of tied ranks (e.g., Player A ranks 1st in gene-wise and 2nd in cell-wise, while Player B has the opposite ranking), the player who submitted their model first is ranked higher. This creates an arbitrary advantage and becomes even more problematic with multiple players and ties.
  3. If someone submits two models optimized individually for each metric, they could reverse-engineer the leaderboard rankings to submit the model that maximizes their final rank.
  4. Gene-wise and cell-wise rankings are likely to be anti-correlated (a high cell-wise rank tends to come with a low gene-wise rank and vice versa), meaning the current method does not reflect meaningful performance differences.

These issues lead to an unstable and potentially unfair ranking system. A simple improvement would be to compute the average of the two metrics before ranking, which would add some stability. However, the fundamental issues would still remain.
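To make the difference concrete, here is a small sketch (hypothetical scores, plain numpy/scipy, not the actual platform code) comparing the current average-of-ranks scheme with averaging the metric values before ranking:

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical leaderboard: per-player gene-wise and cell-wise Spearman scores.
gene_wise = np.array([0.61, 0.60, 0.35])   # players A, B, C
cell_wise = np.array([0.30, 0.31, 0.59])

# Current scheme: rank each metric separately, then average the ranks.
# Rank 1 = best, so rank the negated scores.
rank_gene = rankdata(-gene_wise)           # A=1, B=2, C=3
rank_cell = rankdata(-cell_wise)           # A=3, B=2, C=1
avg_rank = (rank_gene + rank_cell) / 2     # A=2.0, B=2.0, C=2.0 -> three-way tie

# Proposed alternative: average the metric values first, then rank once.
avg_score = (gene_wise + cell_wise) / 2    # A=0.455, B=0.455, C=0.47
final_rank = rankdata(-avg_score)          # C wins; A and B remain tied

print(avg_rank, final_rank)
```

With these made-up numbers, rank averaging collapses three quite different models into a three-way tie, while averaging the scores at least preserves who is actually ahead.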

I would like to hear your thoughts.

Also, one should keep in mind that introducing a new scoring scheme would make the models developed in Crunch 1 no longer optimal for Crunch 2 (it breaks the idea that Crunch 2 builds upon Crunch 1).

Here is a rephrased response from the Broad team:

We agree that the two scores model different things.
A simple average of the two scores did not seem appropriate, so we currently use the average rank as a simple ranking mechanism.

But we agree that multiple rankings would have been better; the platform just does not currently support them.

Regarding the ties, we are still in discussion with the Broad team.

I don’t think this addresses the fundamental issues described above; it only acknowledges them.

Right now, if most participants optimize for metric A while only a few focus on metric B, those optimizing for metric B will be favored due to lower competition and higher ranks—and vice versa. This introduces an element of luck, as a model’s ranking depends on the choices of the majority rather than its actual performance. Moreover, the rankings will fluctuate constantly based on which metric happens to be more favorable for the larger group of players at any given time.
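A toy example of that effect (again with made-up numbers, not actual leaderboard data): a field of four gene-wise specialists plus one cell-wise specialist, scored with the current average-of-ranks scheme:

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical field: four players tuned for the gene-wise metric
# and one player tuned for the cell-wise metric.
#                  gene-wise  cell-wise
scores = np.array([[0.62, 0.33],   # G1
                   [0.60, 0.32],   # G2
                   [0.58, 0.31],   # G3
                   [0.56, 0.30],   # G4
                   [0.30, 0.50]])  # C (cell-wise specialist)

ranks = np.column_stack([rankdata(-scores[:, j]) for j in range(2)])
avg_rank  = ranks.mean(axis=1)      # current scheme: average of the two ranks
avg_score = scores.mean(axis=1)     # for comparison: average of the two scores

print(avg_rank)    # [1.5 2.5 3.5 4.5 3. ] -> C places 3rd of 5
print(avg_score)   # [0.475 0.46 0.445 0.43 0.4] -> C has the lowest mean score
```

The cell-wise specialist finishes 3rd of 5 despite having the lowest average score, purely because almost nobody else competes on that metric; mirror the field composition and the advantage flips the other way.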

A model’s performance should be evaluated on its own metric values, rather than being judged relative to other models (i.e., by looking at the ranks of other players).

New recommendation: it would be better to commit to one of the two metrics. Gene-wise Spearman is the more reasonable choice, both because it is the one used in the literature and because it is more meaningful than the cell-wise score, given that there are only 20 genes per cell to rank. Disadvantage: it differs from Crunch 1.
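For reference, a minimal sketch of how the two orientations could be computed on a cells × genes prediction matrix (synthetic data, illustrative names and shapes, not the official metric implementation); it mainly highlights that each cell-wise correlation is estimated from only 20 values:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_cells, n_genes = 1000, 20          # cell-wise correlations rank only 20 genes per cell
y_true = rng.random((n_cells, n_genes))
y_pred = y_true + 0.3 * rng.standard_normal((n_cells, n_genes))

# Gene-wise: one Spearman per gene, computed across all cells (1000 values each).
gene_wise = np.mean([spearmanr(y_true[:, g], y_pred[:, g])[0]
                     for g in range(n_genes)])

# Cell-wise: one Spearman per cell, computed across the 20 genes of that cell.
cell_wise = np.mean([spearmanr(y_true[c, :], y_pred[c, :])[0]
                     for c in range(n_cells)])

print(gene_wise, cell_wise)
```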

It is the participant’s job to optimize for both.