Hello Crunch 1 organizers,
I’m participating in the Broad Obesity ML Competition (Crunch 1) and I’m running into a few points of confusion around the datasets and expected outputs. I’d really appreciate any clarification you can share.
program_proportion.csv vs obesity_challenge_1.h5ad proportions
- When I compute program proportions from obesity_challenge_1.h5ad (e.g., grouping by obs["gene"] and taking the mean of the pre_adipo / adipo / lipo / other indicator columns), the values I get don't match program_proportion.csv.
- Could you confirm the exact aggregation and normalization used to generate program_proportion.csv?
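To make the question concrete, here is a minimal sketch of the aggregation I am using, with a toy table standing in for adata.obs (the column names are as I read them from the file; in practice I load the real obs via anndata):

```python
import pandas as pd

# Toy stand-in for adata.obs of obesity_challenge_1.h5ad:
# one row per cell, with binary program-indicator columns.
obs = pd.DataFrame({
    "gene":      ["KLF15", "KLF15", "KLF15", "PPARG", "PPARG"],
    "pre_adipo": [1, 0, 0, 0, 0],
    "adipo":     [0, 1, 1, 1, 0],
    "lipo":      [0, 1, 0, 1, 0],
    "other":     [0, 0, 0, 0, 1],
})

# The aggregation I am using: per-perturbation mean of each indicator.
props = obs.groupby("gene")[["pre_adipo", "adipo", "lipo", "other"]].mean()
print(props)
```

Note that with overlapping indicators the four columns need not sum to 1 per perturbation, which is why I'd like to know what normalization (if any) was applied when generating program_proportion.csv.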
lipo without adipo never occurs in the .h5ad
- In obesity_challenge_1.h5ad, I don’t see any cells with lipo=1 and adipo=0; lipogenic cells always appear as adipo+lipo (i.e., “lipo is a subset of adipo”).
- Is this by design? And should we interpret the four columns as overlapping “enrichment” flags rather than mutually exclusive labels?
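For reference, this is the check I am running (toy data shown here; on the real obs the count comes out to 0):

```python
import pandas as pd

# Toy obs illustrating the check; in practice obs = adata.obs.
obs = pd.DataFrame({
    "adipo": [1, 1, 0, 1],
    "lipo":  [1, 0, 0, 1],
})

# Count cells that are lipogenic but not adipogenic.
lipo_only = int(((obs["lipo"] == 1) & (obs["adipo"] == 0)).sum())
print(lipo_only)
```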
Clarification of each file’s role (local ground truth artifacts)
Could you provide a clear description of the purpose and intended usage of the following files, and how they relate to the main dataset?
- obesity_challenge_1_local_gtruth.h5ad.small.zip
- program_proportion_local_gtruth.csv
In particular: are these meant only for local debugging/quickstarter scoring, and do they differ from the main obesity_challenge_1.h5ad/program_proportion.csv in preprocessing, gene filtering, or labeling rules?
Expected Output: prediction.h5ad and genes_to_predict / gene list confusion
From the docs, predict_perturbations.txt lists 2,863 candidate gene perturbations, and we need to generate 100 predicted cells per perturbation (286,300 cells in total).
- In the released obesity_challenge_1.h5ad, I see ~11,046 genes. However, the documentation mentions N=10,237 for validation and a maximum of 21,592. How and when do participants obtain the exact genes_to_predict list (for validation vs. test)?
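To confirm my understanding of the expected output size, here is the sanity check I am using (the perturbation names below are placeholders for the real entries of predict_perturbations.txt, and the obs layout is my assumption about prediction.h5ad):

```python
import pandas as pd

# Assumed numbers from the docs: 2,863 perturbations x 100 cells each.
n_perturbations, cells_per_pert = 2863, 100
perts = [f"gene_{i}" for i in range(n_perturbations)]

# Skeleton obs for prediction.h5ad: one row per predicted cell,
# labeled with the perturbation it belongs to.
obs = pd.DataFrame({"gene": [g for g in perts for _ in range(cells_per_pert)]})
print(len(obs))
```

If this reading is wrong (e.g., the cell count per perturbation differs between validation and test), a correction would be very helpful.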
Thank you very much for your help!