Hi,
I have some questions regarding the program proportions:
- From what I understood from the overview page and the full specifications PDF, the ‘program_proportion.csv’ file in the training data is calculated from the Perturb-seq data using the pipeline in the GitHub repository julielaffy/obesity-broad-ml-competition-2025. Is that correct?
- Assuming it is correct, are there instructions for running this pipeline ourselves? Its inputs aren’t clear, and it seems to expect files in formats other than h5ad.
- From what I understand, this pipeline detects signature genes (or ‘signature programs’) and uses them to decide each cell’s state. Are these the same as, or similar to, the genes in the ‘signature_genes.csv’ file we got as part of the dataset? Is ‘signature_genes.csv’ useful for us in any way? The file’s meaning isn’t explained in much detail. Does each row represent a gene whose high expression indicates that the cell is in the specified state, or can some genes be negatively correlated with the state?
- If there is a non-ML pipeline (the ‘program_analysis’ part of the GitHub repository julielaffy/obesity-broad-ml-competition-2025) that calculates the program proportions from Perturb-seq data, why are we supposed to submit our own predictions of the program proportions for the test data? Isn’t it enough to predict the RNA-seq expression and then compute the program proportions directly from it using the pipeline? Why is machine learning even needed for this step?
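To make the last question concrete, here is a rough sketch of what I imagine such a non-ML step could look like. It is not the code from the repository and I don't know how the actual pipeline works: the function name, the gene and program names, the z-scoring, and the "assign each cell to its highest-scoring program" rule are all my own assumptions, and it treats every signature gene as positively associated with its program.

```python
import numpy as np

def program_proportions(expr, genes, signatures):
    """Hypothetical signature-based scoring (an assumption, not the official pipeline).

    expr:       cells x genes expression matrix
    genes:      gene names, one per column of expr
    signatures: dict mapping program name -> list of signature genes
    """
    # z-score each gene across cells so highly expressed genes do not dominate
    mu = expr.mean(axis=0)
    sd = expr.std(axis=0)
    sd[sd == 0] = 1.0  # avoid division by zero for constant genes
    z = (expr - mu) / sd

    col = {g: i for i, g in enumerate(genes)}
    programs = list(signatures)

    # score of a cell for a program = mean z-score of that program's signature genes
    # (assumes each program has at least one signature gene present in `genes`)
    scores = np.column_stack([
        z[:, [col[g] for g in signatures[p] if g in col]].mean(axis=1)
        for p in programs
    ])

    # assign every cell to its highest-scoring program, then count
    labels = scores.argmax(axis=1)
    counts = np.bincount(labels, minlength=len(programs))
    return dict(zip(programs, counts / counts.sum()))

# Toy data: 4 cells x 4 genes (all names are made up)
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
signatures = {"prog1": ["GENE_A", "GENE_B"], "prog2": ["GENE_C", "GENE_D"]}
expr = np.array([
    [10.0, 9.0, 0.0, 1.0],
    [8.0, 10.0, 1.0, 0.0],
    [9.0, 8.0, 0.0, 0.0],
    [0.0, 1.0, 10.0, 9.0],
])
props = program_proportions(expr, genes, signatures)
print(props)
```

If the real pipeline is roughly of this kind, then predicted expression for the test perturbations could be fed through it directly, which is why I am asking whether a separate prediction of the proportions is needed.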
Thanks!