Leakage across Dates?

Within the training dataset there are 250+ dates. If we split the data into several train/test sets by date, should there be a gap of N dates between each train set and its test set to avoid leakage? If so, what is the recommended N?

Hi casual-andy!

The embargo between the train/test set is of 1 date.
So if you want to build multiple train/test sets by date in a walk-forward fashion, I would consider dropping one date between each train set and its test set.
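A minimal sketch of what such a walk-forward split could look like, assuming the dates are sortable identifiers (integers here). The function name `walk_forward_splits` and its parameters `n_splits`, `test_size` and `embargo` are hypothetical, chosen for illustration; this is not the competition's own splitting code.

```python
def walk_forward_splits(dates, n_splits=3, test_size=1, embargo=1):
    """Yield (train_dates, test_dates) lists, dropping `embargo` dates between them.

    Hypothetical helper for local validation only.
    """
    unique_dates = sorted(set(dates))
    n = len(unique_dates)
    for k in range(n_splits, 0, -1):
        test_start = n - k * test_size      # index of the first test date
        train_end = test_start - embargo    # exclusive end index of the train dates
        if train_end <= 0:
            continue                        # not enough history for this fold
        train = unique_dates[:train_end]
        test = unique_dates[test_start:test_start + test_size]
        yield train, test


# Hypothetical example: 10 integer date ids, 3 folds, 2 test dates per fold.
splits = list(walk_forward_splits(range(10), n_splits=3, test_size=2, embargo=1))
for train, test in splits:
    print("train up to", train[-1], "| dropped", train[-1] + 1, "| test", test)
```

Each fold trains on everything up to some date, skips the next date, and tests on the dates after that, so the training window grows from fold to fold.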

Mr xgilbert, I have a question.

The embargo between the train/test set is of 1 date.

Does this mean that, when the code is executed on the competition server, there is a one-day gap between the last date of the X_train passed to the train method and the date of the X_test passed to the infer method?

A date is a time period. It can be one week, one month or another time period. CrunchDAO doesn’t have this information.
Hope this helped :slight_smile:

I apologize for asking in a way that could be misunderstood.
I would like to know how many time steps separate the last training date from the inference date.
Specifically, if the last “date” of train is t, then the “date” of infer is t+1, not t+2, is that correct?
I am a little confused by the word “embargo”.

If the last date of train is t and there is a one-date embargo, then the first date of infer is t+2.
The embargoed date is t+1: it contains information about the t+2 target, so it can't be used.

Embargo is just the gap between the train and test sets to avoid leakage.
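To make the indexing concrete, here is a small sketch assuming dates are consecutive integers. The helper names `embargoed_dates` and `first_infer_date` are hypothetical and exist only to illustrate the arithmetic.

```python
def embargoed_dates(last_train_date, embargo=1):
    """Dates dropped between train and infer (hypothetical helper)."""
    return list(range(last_train_date + 1, last_train_date + 1 + embargo))


def first_infer_date(last_train_date, embargo=1):
    """First date available for inference after the embargo (hypothetical helper)."""
    return last_train_date + embargo + 1


t = 100  # hypothetical last training date
print(embargoed_dates(t))   # the dropped date(s) right after t
print(first_infer_date(t))  # the first date passed to infer
```

With an embargo of one date, the only dropped date is t+1 and inference starts at t+2.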

Thank you for the clear explanation. I understand now!
