Clarifications about Crunch 1 datasets, program proportions, and genes_to_predict

Hello Crunch 1 organizers,

I’m participating in the Broad Obesity ML Competition (Crunch 1) and I’m running into a few points of confusion around the datasets and expected outputs. I’d really appreciate any clarification you can share.

  1. program_proportion.csv vs obesity_challenge_1.h5ad proportions

    • When I compute program proportions from obesity_challenge_1.h5ad (e.g., grouping by obs["gene"] and averaging the pre_adipo / adipo / lipo / other indicators), I don’t get values that match program_proportion.csv.
    • Could you confirm the exact aggregation and normalization used to generate program_proportion.csv?
  2. lipo without adipo never occurs in the .h5ad

    • In obesity_challenge_1.h5ad, I don’t see any cells with lipo=1 and adipo=0. Instead, lipogenic cells appear to be always adipo+lipo (i.e., “lipo is a subset of adipo”).
    • Is this by design, and should we interpret the four columns as overlapping “enrichment” flags rather than mutually exclusive labels?
  3. Clarification of each file’s role (local ground truth artifacts)
    Could you provide a clear description of the purpose and intended usage of the following files and how they relate to the main dataset?

    • obesity_challenge_1_local_gtruth.h5ad.small.zip
    • program_proportion_local_gtruth.csv
      In particular: are these meant only for local debugging/quickstarter scoring, and do they differ from the main obesity_challenge_1.h5ad/program_proportion.csv in preprocessing, gene filtering, or labeling rules?
  4. Expected Output: prediction.h5ad and genes_to_predict / gene list confusion
    From the docs, predict_perturbations.txt lists 2,863 candidate gene perturbations, and we need to generate 100 predicted cells per perturbation (total 286,300 cells).

    • In the released obesity_challenge_1.h5ad, I see ~11,046 genes. However, the documentation mentions N=10,237 for validation and a maximum of 21,592. How and when do participants obtain the exact genes_to_predict list (for validation vs. test)?
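To make question 1 concrete, here is a sketch of the aggregation I’m attempting. The obs column names and the mean-of-indicator-columns rule are assumptions on my part (and the toy data below is made up), not a confirmed spec:

```python
import pandas as pd

# Hypothetical stand-in for adata.obs from obesity_challenge_1.h5ad:
# per-cell 0/1 program indicators plus the perturbed-gene label.
obs = pd.DataFrame({
    "gene":      ["PPARG", "PPARG", "PPARG", "CEBPB", "CEBPB"],
    "pre_adipo": [1, 0, 0, 1, 1],
    "adipo":     [0, 1, 1, 0, 0],
    "lipo":      [0, 0, 1, 0, 0],
    "other":     [0, 0, 0, 0, 0],
})

# Mean of the 0/1 indicators per perturbation = fraction of cells
# flagged for each program (one plausible aggregation).
props = obs.groupby("gene")[["pre_adipo", "adipo", "lipo", "other"]].mean()

# Sanity check for question 2: every lipo cell is also adipo,
# i.e., lipo behaves as a subset of adipo.
assert (obs["lipo"] <= obs["adipo"]).all()
```

Note that because the flags can overlap, the four proportions per gene need not sum to 1, which is part of why I’m unsure my aggregation matches program_proportion.csv.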

Thank you very much for your help!

Reply from the organizers:

  1. Use the “default” version of the dataset (not the small one); with it, your computed proportions will exactly match program_proportion.csv.

  2. This is expected by design. The organizers have shared a notebook explaining how the data was processed.

  3. Both files contain ground truth data for the five genes originally intended for the training data. They let you compute a score locally to validate your results.

  4. Use the default dataset, not the small one. When you run in the cloud, the genes_to_predict list will also be provided, and your prediction must follow it.
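To illustrate answer 4, a hedged sketch of assembling an output with the documented shape (100 predicted cells per perturbation). The gene names and the zero placeholder matrix are illustrative assumptions; the real genes_to_predict list comes from the cloud environment and the expression values from your model:

```python
import numpy as np
import pandas as pd

# Hypothetical gene list; in the real run, use the genes_to_predict
# list provided by the platform in the cloud environment.
genes_to_predict = ["GENE_A", "GENE_B", "GENE_C"]
n_cells_per_perturbation = 100   # per the Crunch 1 docs
n_var = 11046                    # genes observed in obesity_challenge_1.h5ad

# One block of 100 predicted cells per perturbed gene.
obs = pd.DataFrame({
    "gene": np.repeat(genes_to_predict, n_cells_per_perturbation)
})
X = np.zeros((len(obs), n_var), dtype=np.float32)  # placeholder expression

# Wrap with anndata before writing, e.g.:
#   adata = anndata.AnnData(X=X, obs=obs)
#   adata.write_h5ad("prediction.h5ad")
assert X.shape[0] == len(genes_to_predict) * n_cells_per_perturbation
```

With the full list of 2,863 perturbations, the same construction yields the 286,300 rows the docs describe.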