Issues with AWS

My code runs fine on Google Colab. I submit it to the platform and it starts running on AWS, but it halts after 15 minutes, in the middle of one of the test datasets! It freezes, and then eventually times out after 14 hours, but it literally freezes in 25 minutes or less. The code is fine, no error, just a freeze!
Also, a couple of times I got a strange label on my submission, “BAD PREDICTION”, which says “Different ID(S)”, despite the fact that the code ran perfectly fine. Can you please explain what this error means?

Could you tell me the Run ID?

Run 18985 is the “BAD PREDICTION” / “Different ID(S)” one.

Run 19248 was one stuck one. I had to dumb down a model to one-batch training only, so that it can run in 6 seconds per dataset. Then I submitted it for a test run, but it stopped shortly after, maybe after 35 datasets and 10 minutes (excluding 4-5 minutes to run the code). Not sure. The model is simple and runs fine on Google Colab. I am getting very frustrated. In fact, I just tried to access my submissions, and they are not really available either.

You have 9399 IDs that should not be included.

Here is the list:
run_18985_bad_ids.txt (101.0 KB)

It just looks like you didn’t have enough quota.
The last tqdm printout showed that the run had already been going for 1h07m09s; if you include the run setup and data loading, the 1h20m timeout is fair.

The quota refresh is today. If you wait a few hours, you will have access to 15 hours of compute.

You are right, the output of the run is not accessible, because a participant could use it to leak the dataset.


You seem to be printing a lot.
After 1500 lines, the logs are no longer recorded. This is to prevent spam, as some runs would be unreadable because of all the logs.

I am not sure why I am getting unwanted IDs. I have not changed the mechanical part of the code where the IDs are read, the graph is created, and then from the graph a series of 1s and 0s is written into `predictions[f'{dataset_id}_{i}_{j}'] = int(G.loc[i, j])`. Given that I haven’t manipulated dataset_id, why do I have 9399 IDs that shouldn’t be there?
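For reference, the writing loop is essentially this (a minimal sketch; the function wrapper and the toy adjacency matrix are illustrations, only the key format and the `G.loc[i, j]` lookup are from my actual code):

```python
import pandas as pd

def write_predictions(dataset_id, G):
    """Flatten a pandas adjacency matrix G (index/columns = node labels)
    into prediction entries keyed as '{dataset_id}_{i}_{j}'."""
    predictions = {}
    for i in G.index:
        for j in G.columns:
            predictions[f"{dataset_id}_{i}_{j}"] = int(G.loc[i, j])
    return predictions

# Toy example: a 2-node graph with a single edge a -> b.
G = pd.DataFrame([[0, 1], [0, 0]], index=["a", "b"], columns=["a", "b"])
print(write_predictions("00003", G))
```

The keys come straight from whatever labels `G.index` and `G.columns` carry, which is why the dataset_id itself is never touched.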

I was printing to debug my code; the submitted code doesn’t have to print. Regarding leaking the data: isn’t the test data already provided to us? We can read the test data, and I thought it is the data that the scores are calculated on?

Regarding the 15-hour computation quota: do we get 15 hours per day, or is 15 hours for 3 days, or one week, or...? I need more, as I want to test all the existing algorithms in the gcastle package, some algorithms from other packages, and also some from Papers with Code, and I will also need a lot of computational power to test multiple mixing and blending algorithms. 15 hours per week is absolutely not enough. Is it 15 hours per day?

Just by looking at 00003_1_0 (the very first one in the list of bad IDs), there is no node #0 or #1 in dataset 00003.

The same is true for 00004_1_7.
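One way to catch this before submitting is to compare the node pair in each key against the nodes that actually exist in the dataset. A minimal sketch (the helper name and the `dataset_nodes` mapping are assumptions; only the `{dataset_id}_{i}_{j}` key format is from the thread):

```python
def find_bad_ids(predictions, dataset_nodes):
    """Return prediction keys whose node pair does not exist in the
    corresponding dataset. Keys look like '{dataset_id}_{i}_{j}';
    dataset_nodes maps dataset_id -> set of valid node labels.
    Assumes labels themselves contain no underscores."""
    bad = []
    for key in predictions:
        dataset_id, i, j = key.split("_")
        nodes = dataset_nodes.get(dataset_id, set())
        if i not in nodes or j not in nodes:
            bad.append(key)
    return bad

# If dataset 00003 only has nodes 2 and 3, then 00003_1_0 is flagged.
dataset_nodes = {"00003": {"2", "3"}}
predictions = {"00003_1_0": 1, "00003_2_3": 0}
print(find_bad_ids(predictions, dataset_nodes))
```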

Of the 47,000 datasets in the competition, only half are provided for the submission phase.
Your code will discover the other half during the out-of-sample phase.
To ensure that your code is ready, the current data is again split 80-20, with the 20% not directly downloadable, but only available in the runner.

No, it is 15 hours per week.

I see why I got those extra IDs. My mistake. So was the reason my code sometimes froze shortly after starting that I had used up the 15-hour weekly quota? When does this 15-hour window start and end?

Yes, you likely didn’t see that the run got killed because of the timeout and just continued waiting for it.

Once a run is stopped, the logs also stop being refreshed, so lines that arrive a bit late only appear after a page refresh.

Sorry for the confusion.

No worries. I think you have changed something on the platform now. In the past I used to make a submission and watch it being executed, but now, as soon as I submit, it kicks me out and says “Access Denied”. Is it a new feature or something customized for my account?

By the way, I got frozen again, this time after 12 minutes, and after running only 15 or so of the test datasets.

It is Run #19354.

How do you want to submit?
Through the website? Or via the Crunch CLI?

I cannot find the root cause of the problem. It is probably related to your code, since other participants’ runs are fine.
Without the ability to see and debug it, there is not much I can do.

I will submit a notebook.