Submission fails

Hi @enzo

I’m struggling to find out why all my submissions have been failing since this morning. crunch test worked fine. Here are 3 submissions that returned an error right at the start (after 42 seconds):

Run #78581

Run #78594

Run #78578

Yesterday everything was fine.

The error is: Task state is updated from RUNNING to FAILED on zones/europe-west1-b/instances/6766334117289933714 with exit code 1. with code 1

Thanks.

The same is happening to me.

I think the underlying infrastructure is what is failing.

Hi @mpware and @salty-francisco,

We apologize for the downtime for the runs using GCP as a compute provider.

We identified the issue and deployed a fix.

Don’t hesitate to reach out to us if you are still experiencing issues.


Hello, staff!
Because of these constant failures (just related to AWS), I have exhausted all my compute time. It’s stuck. I could not even pause or stop these submissions. The same code works perfectly the next time.
How could we eliminate such problems?

What can I do about the limit (15 hours) that was exhausted because of that?

Hello Efim,

To avoid people getting stuck without any quota after accidentally consuming it all, failed runs do not count toward the weekly quota.

Your run, #78934, took over 12 hours by itself. After that, you ran multiple runs at ~30 minutes each.

Technically, you used your quota as expected.
I just checked the provider’s website (it’s GCP this year, not AWS) and the runtime is really 12 hours.

Are you saying that the exact same code that took 12 hours ran in only 30 minutes when you submitted it again?

Thank you for the quick answer!

Yes, you are right. I missed this super long run. I was thinking that Terminated runs were counted in the overall total.

I did some investigation into this super long run:

As far as I can see, it was because of a super long determinism check:

Could we consider a super long determinism check (required by the platform) as a failed submission?

Training, inference, and the whole setup were pretty fast. Also, I always do local runs before submitting. So it’s just because of the stuck determinism check.

Here is the full log:

Created Task: projects/311731030305/locations/europe-west1/jobs/tournament--run-78934--1778706294783


started


downloading runner...


downloading code...


/context/code/catboost_info/time_left.tsv: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/56483/catboost_info/time_left.tsv (16143 bytes)


/context/code/requirements.txt: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/56483/requirements.txt (170 bytes)


/context/code/catboost_info/catboost_training.json: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/56483/catboost_info/catboost_training.json (96759 bytes)


/context/code/catboost_info/learn/events.out.tfevents: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/56483/catboost_info/learn/events.out.tfevents (54870 bytes)


/context/code/notebook.ipynb: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/56483/notebook.ipynb (5878 bytes)


/context/code/main.ipynb: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/56483/main.ipynb (24751 bytes)


/context/code/main.py: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/56483/main.py (14040 bytes)


/context/code/catboost_info/learn_error.tsv: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/submissions/56483/catboost_info/learn_error.tsv (16780 bytes)


installing python requirements...


Running pip... Toggle 'Show advanced logs' in order to see more details


installing crunch-cli...


Running pip... Toggle 'Show advanced logs' in order to see more details


Changed status: RUNNING


Running pip... Toggle 'Show advanced logs' in order to see more details


downloading data...


/context/data/X_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/234/X_train.parquet (218514418 bytes)


/context/data/y_train.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/234/y_train.parquet (8356193 bytes)


/context/data/X_test.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/234/X_test.parquet (216845130 bytes)


/context/data/y_train_index.parquet: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/data-releases/234/y_train_index.parquet (100089 bytes)


downloading model...


/context/code/resources/gru_weights.npz: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/models/49450/gru_weights.npz (71992 bytes)


/context/code/resources/model.joblib: download from https:crunchdao--competition--production.s3-accelerate.amazonaws.com/models/49450/model.joblib (539414 bytes)


prepare prediction directory...


executing - command=train


trained LGBM on 5036517 rows x 25 feats; positives=1283914 (25.49%)


GRU weights ready at /context/code/resources/gru_weights.npz


executing - command=infer


/context/code/main.py:355: RuntimeWarning: overflow encountered in exp


  z = 1.0 / (1.0 + np.exp(-(iz + hz)))


/context/code/main.py:354: RuntimeWarning: overflow encountered in exp


  r = 1.0 / (1.0 + np.exp(-(ir + hr)))


checking determinism by executing the inference again with 30% of the data (tolerance: 1e-08)


executing - command=infer


/context/code/main.py:355: RuntimeWarning: overflow encountered in exp


  z = 1.0 / (1.0 + np.exp(-(iz + hz)))


/context/code/main.py:354: RuntimeWarning: overflow encountered in exp


  r = 1.0 / (1.0 + np.exp(-(ir + hr)))


determinism check: passed


uploading result...


prediction: found file name=`prediction.parquet` size=22463612


prediction: done walking files.len=1 total_size=22463612


prediction: uploading name=`prediction.parquet`


model: found file name=`model.joblib` size=672822


model: found file name=`gru_weights.npz` size=71992


model: done walking files.len=2 total_size=744814 has_changed=True


model: uploading name=`model.joblib`


model: uploading name=`gru_weights.npz`


result submitted


ended
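A side note on the `RuntimeWarning: overflow encountered in exp` lines in the log above: in float64, the overflowed `np.exp` becomes `inf` and the gate value rounds to 0.0, so the warning is probably harmless here. If it is worth silencing, one common option is a sigmoid that never exponentiates a positive number. This is a minimal sketch, not the code from `main.py`; `stable_sigmoid` is a hypothetical name, and it assumes the gate inputs (`iz + hz`, `ir + hr`) are NumPy arrays or floats.

```python
import numpy as np


def stable_sigmoid(x):
    """Sigmoid that avoids overflow by only exponentiating non-positive values."""
    x = np.asarray(x, dtype=np.float64)
    # exp(-|x|) always lies in [0, 1], so it cannot overflow.
    e = np.exp(-np.abs(x))
    # x >= 0: 1 / (1 + exp(-x));  x < 0: exp(x) / (1 + exp(x)).
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))
```

With this helper, a line such as `z = 1.0 / (1.0 + np.exp(-(iz + hz)))` could be written as `z = stable_sigmoid(iz + hz)`.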

Hi stormy-efim,

I just invalidated it as an exception. Your account now shows only two hours of consumed quota.

The determinism check is only run on 30% of the data and should not normally take longer than your usual inference. It is a mandatory step to ensure your code is deterministic; its output is discarded afterwards.
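To catch a slow or failing determinism check before spending quota, something similar can be tried locally. This is only a rough sketch and not the platform's implementation: `check_determinism` and `infer` are hypothetical names, taking the first 30% of rows is an assumption (the thread does not say how the subset is chosen), and only the 1e-08 tolerance comes from the log.

```python
import time

import numpy as np


def check_determinism(infer, data, fraction=0.3, tolerance=1e-8):
    """Run `infer` twice on the same subset and compare the two outputs.

    Returns (passed, seconds_for_second_run).
    """
    n = max(1, int(len(data) * fraction))
    subset = data[:n]  # assumption: first rows; the real sampling is unknown

    first = np.asarray(infer(subset), dtype=np.float64)

    start = time.perf_counter()
    second = np.asarray(infer(subset), dtype=np.float64)
    elapsed = time.perf_counter() - start

    # Absolute comparison only, mirroring the "tolerance: 1e-08" in the log.
    passed = first.shape == second.shape and bool(
        np.allclose(first, second, rtol=0.0, atol=tolerance)
    )
    return passed, elapsed
```

If the second run takes far longer than the first on the same subset, that points at the inference code or the machine rather than at the comparison itself.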