Hi, I was reviewing the code base for the challenge and noticed a function that enforces log1p normalization on the predictions in the scoring scheme: crunchdao/competitions@3d72103 ("feat(broad-1/scoring): re-apply normalization if necessary").
I would argue that this step may be neither necessary nor desirable. Normalizing by the sum of all predictions introduces a dependence between the predicted variables, which is problematic if the gene expressions for different genes are meant to be treated as independent predictions. For example, if I predict some genes (X) well and others (Y) poorly, normalizing everything by the combined sum of X and Y can drag down performance on X: the mispredictions in Y change the shared normalization factor and therefore distort the X values, even though the X predictions themselves were strong.
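Here is a minimal numpy sketch of the coupling effect I mean, assuming the scoring step rescales each prediction vector to a fixed total count before applying log1p. The helper name `sum_normalize_log1p`, the toy values, and the 1e4 target sum are my own placeholders for illustration, not the actual scoring code:

```python
import numpy as np

# Toy example: one cell, two well-predicted genes (X) and two badly
# over-predicted genes (Y).
true_expr = np.array([10.0, 20.0,  5.0,  5.0])   # X1, X2, Y1, Y2
pred_expr = np.array([10.0, 20.0, 50.0, 50.0])   # X exact, Y over-predicted

def sum_normalize_log1p(x, target_sum=1e4):
    """Rescale values to a fixed total, then apply log1p
    (my understanding of the re-applied normalization; details assumed)."""
    return np.log1p(x / x.sum() * target_sum)

# Before normalization, the X genes carry zero error.
print(np.abs(pred_expr - true_expr))
# [ 0.  0. 45. 45.]

# After sum normalization, the inflated Y predictions shrink the shared
# normalization factor, so the X genes now carry error as well.
print(np.abs(sum_normalize_log1p(pred_expr) - sum_normalize_log1p(true_expr)))
# X entries are no longer ~0
```

In other words, the error on Y leaks into X purely through the shared denominator.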
Therefore, it may be better to leave the predictions untransformed. What are your thoughts on this?