Is Spearman's rank correlation the right metric for benchmarking?

Looking at the current leaderboard (09.12.24), ranking the submissions by MSE or by Spearman's correlation would give nearly the same order. There is one outlier, though: the submission “many-kalin / deepspot” has the worst MSE (0.513) of all submissions, yet it is ranked in 3rd place. Perhaps Pearson correlation would be a better choice.
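To illustrate how a submission can have a poor MSE and still rank well under Spearman's correlation, here is a minimal sketch in plain Python. The numbers are made up for illustration and are not taken from the leaderboard; the helper names (`mse`, `ranks`, `pearson`, `spearman`) are hypothetical and this is not the competition's scoring code.

```python
from statistics import mean

def mse(y_true, y_pred):
    """Mean squared error: sensitive to the scale and offset of predictions."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation: measures linear association."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Illustrative values only (assumed, not leaderboard data).
y_true = [0.1, 0.4, 0.5, 0.9, 1.3]
pred_a = [0.2, 0.3, 0.6, 0.8, 1.2]  # close in value, same ordering
pred_b = [1.0, 2.0, 2.5, 4.0, 6.0]  # same ordering, but wrong scale

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name, "MSE:", mse(y_true, pred), "Spearman:", spearman(y_true, pred))
```

Both predictions order the values correctly, so their Spearman correlation is essentially 1, while prediction B has a far larger MSE because its values are on the wrong scale. Note that Pearson correlation is also unaffected by a constant offset or positive rescaling of the predictions, so it would not necessarily penalize a submission like this either.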

The reason both metrics are reported was shared in the announcement:

The EWSC at the Broad Institute has decided to experiment with two scoring metrics:

→ MSE (Mean Squared Error)
→ Spearman Correlation

This approach is being tested to determine which metric provides the best evaluation for participants. Broad will decide which score will be used for the next checkpoint and for the final scoring of broad-1.

Thank you for the immediate answer.