Clarifications about program proportions and the R notebook pipeline

Hi,

I have some questions regarding the program proportions:

  1. From what I understood from the explanations on the overview page and in the full specifications PDF, the ‘program_proportion.csv’ of the training data is calculated from the Perturb-seq data using the pipeline in the GitHub repository julielaffy/obesity-broad-ml-competition-2025. Is that correct?
  2. Assuming it is correct, are there instructions on how we can run this pipeline ourselves? Its inputs aren’t clear and it seems that the input data isn’t exactly h5ad files but some files in other formats.
  3. From what I understand, this pipeline is based on detecting signature genes (or ‘signature programs’) and assigning the cell state from them. Are they similar to the ‘signature_genes.csv’ file we got as part of the dataset? Is ‘signature_genes.csv’ useful for us in any way? There wasn’t enough explanation of what this file means. Does each row represent a gene whose high expression indicates that the cell is in the specified state? Or are some genes negatively correlated with the state?
  4. If there is a non-ML pipeline (as in ‘program_analysis’ in the GitHub repository julielaffy/obesity-broad-ml-competition-2025) that calculates the program proportions from Perturb-seq data, then why are we supposed to submit our own predictions of the program proportions for the test data? Isn’t it enough to predict the RNA-seq expression, so that the program proportions can be computed directly from it using the pipeline? Why is machine learning even needed in this step?

Thanks!

  1. Yes.
  2. The pipeline is not operational yet because it depends on data that has not yet been published. This data will likely be made public at the end of the competition.
  3. The file is provided so that participants know which genes these are, because:

    Genes detected in fewer than 10 cells were removed,
    and known signature genes from signature_genes.csv were subsequently
    re-introduced.

  4. The point is to discover how they might react with other genes. A matrix of 21,000 genes is too large for humans to test exhaustively, which is why we need machines to help us focus on the most important ones.
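
For reference, the filtering step quoted in answer 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the organizers' actual pipeline: it assumes a cells-by-genes count matrix, treats "detected" as a count greater than zero, and uses a hypothetical function name (filter_genes) and made-up gene names.

```python
import numpy as np

def filter_genes(counts, gene_names, signature_genes, min_cells=10):
    """Drop genes detected in fewer than `min_cells` cells, but always keep
    the genes listed in `signature_genes`.

    Assumptions (not from the competition code): `counts` has shape
    (n_cells, n_genes) and a gene is "detected" in a cell if its count > 0.
    """
    counts = np.asarray(counts)
    gene_names = np.asarray(gene_names)
    # Number of cells in which each gene has a non-zero count.
    cells_detected = (counts > 0).sum(axis=0)
    keep = cells_detected >= min_cells
    # Re-introduce signature genes even if they failed the threshold.
    keep |= np.isin(gene_names, list(signature_genes))
    return counts[:, keep], gene_names[keep]

# Toy data: 12 cells, 4 hypothetical genes.
gene_names = ["GeneA", "GeneB", "GeneC", "GeneD"]
counts = np.zeros((12, 4), dtype=int)
counts[:, 0] = 1    # GeneA: detected in 12 cells, passes the threshold
counts[:3, 1] = 2   # GeneB: detected in 3 cells, removed
counts[:2, 2] = 5   # GeneC: detected in 2 cells, kept only as a signature gene
counts[:10, 3] = 1  # GeneD: detected in exactly 10 cells, passes the threshold

filtered, kept = filter_genes(counts, gene_names, signature_genes={"GeneC"})
print(list(kept))
```

In this toy example GeneB is dropped, while GeneC survives only because it is listed as a signature gene, which is the practical reason for knowing which genes are in signature_genes.csv.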