Crunch.test() returns incorrect X_train, y_train data

In the DataCrunch competition, when I use crunch.load_data() to load X_train and y_train, the moon column has values from 0 to 468. But when I run crunch.test(), X_train only has moons 0 to 455. Is this a bug?
The overview page for the competition states that the last 13 moons are used as test data in X_test and y_test. That would mean moons 456 to 468 are the test data, but this is not the case, since the X_test moons are 469 to 480.

It doesn’t make sense that moons 456 to 468 are in neither the train data nor the test data. Some clarification and help would be appreciated, thanks.
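
For reference, a minimal sketch of the check I ran (assuming the standard notebook unpacking of crunch.load_data(), which may differ by competition):

# `crunch` is assumed to be set up as in the competition notebook template;
# the exact unpacking of load_data() may vary.
X_train, y_train, X_test = crunch.load_data()

print("train moons:", X_train["moon"].min(), X_train["moon"].max())  # 0 to 468
print("test moons:", X_test["moon"].min(), X_test["moon"].max())     # 469 to 480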

This is the intended behavior.

When loading the data, you are given a fixed number of moons to train.

However, when running the test, your code can retrain as the moons advance (e.g. every 2 moons), with the data from previously predicted moons added to the training set. Otherwise, retraining on the same moons would be useless.

To disable this behavior, do:

crunch.test(
    train_frequency=0
)
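
Conversely, a positive value retrains every N moons. For example, to retrain every other moon:

crunch.test(
    train_frequency=2
)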

To clarify the question a bit…
When the notebook runs and my model makes a prediction for moon 469, it is given training data (through X_train and y_train in the train function) for moons 0 to 455 only, which does not include the latest 13 moons. My assumption was that the model would be trained on moons 0 to 468 and then used to predict moon 469, but the most recent data (moons 456 to 468) is missing from the training data.

Next, when it predicts moon 470, it is given training data for moons 0 to 456; the latest 13 moons are still missing.

Is this intended? Is the model not supposed to have access to the latest 13 moons?

Just to be sure, I re-tested the notebook just now with the following:

import pandas as pd


def train(
    X_train: pd.DataFrame,
) -> None:
    # Log the range of moons available for training
    print("X_train", X_train["moon"].min(), X_train["moon"].max())


def infer(
    X_test: pd.DataFrame,
    id_column_name: str,
    moon_column_name: str,
    prediction_column_names: list[str],
) -> pd.DataFrame:
    prediction = X_test[[moon_column_name, id_column_name]].copy()

    # Fill every prediction column with an arbitrary feature value,
    # just to produce a valid prediction frame
    for prediction_column_name in prediction_column_names:
        prediction[prediction_column_name] = X_test["vratios_Feature_6"]  # whatever

    # Log the range of moons being predicted
    print("X_test", X_test["moon"].min(), X_test["moon"].max())

    return prediction

Here are the results with train_frequency=2 (retraining every other moon):

looping moon=469 train=True (1/12)
train X_train 0 455
infer X_test 469 469

looping moon=470 train=True (2/12)
train X_train 0 456
infer X_test 470 470

looping moon=471 train=False (3/12)
infer X_test 471 471

looping moon=472 train=True (4/12)
train X_train 0 458
infer X_test 472 472

looping moon=473 train=False (5/12)
infer X_test 473 473

looping moon=474 train=True (6/12)
train X_train 0 460
infer X_test 474 474

looping moon=475 train=False (7/12)
infer X_test 475 475

looping moon=476 train=True (8/12)
train X_train 0 462
infer X_test 476 476

looping moon=477 train=False (9/12)
infer X_test 477 477

looping moon=478 train=True (10/12)
train X_train 0 464
infer X_test 478 478

looping moon=479 train=False (11/12)
infer X_test 479 479

looping moon=480 train=True (12/12)
train X_train 0 466
infer X_test 480 480

It’s clear that X_train is growing after each loop, by one moon per simulated moon.

I’m not sure where the problem could lie.
Are you sure that you are using the X_train parameter provided to the train function?

The part where I wanted clarification is:
When moon=469, X_train has data up to moon 455. Shouldn’t this be 468, so the model can use the latest data to make its prediction? Or is this gap intentional?

Loading X_train through crunch.load_data(), I see moons from 0 to 468, but X_train in the train function has moons 0 to 455. Why are the last 13 moons removed?

Yes, the gap is intentional. There is an embargo of 13 moons to avoid any potential leakage.
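
You can check this against the logs above: when retraining for moon m, the newest training moon is m - 14 (the 13 embargoed moons plus the test moon itself). A small sketch of that arithmetic (an illustration, not the platform’s actual implementation):

EMBARGO = 13  # moons withheld between training and prediction

def max_train_moon(test_moon: int) -> int:
    # newest training moon = test moon - embargo - the test moon itself
    return test_moon - EMBARGO - 1

assert max_train_moon(469) == 455  # matches "train X_train 0 455"
assert max_train_moon(472) == 458  # matches "train X_train 0 458"
assert max_train_moon(480) == 466  # matches "train X_train 0 466"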

Remember that the test data provided locally is only for testing purposes. You will only receive real data during the out-of-sample phase.

Okay, I get it. Thanks for helping me out!
