Error: pyarrow.lib.ArrowInvalid: Could not open Parquet input source '<Buffer>': Parquet file size is 0 bytes

Hi,

I am experiencing the following error when submitting through the CLI:

return self.sandbox(
       ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/crunch/runner/cloud.py", line 592, in sandbox
    return utils.read(self.prediction_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/crunch/utils.py", line 135, in read
    return pandas.read_parquet(path, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parquet.py", line 667, in read_parquet
    return impl.read(
           ^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pandas/io/parquet.py", line 274, in read
    pa_table = self.api.parquet.read_table(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyarrow/parquet/core.py", line 1793, in read_table
    dataset = ParquetDataset(
              ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyarrow/parquet/core.py", line 1360, in __init__
    [fragment], schema=schema or fragment.physical_schema,
                                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 1431, in pyarrow._dataset.Fragment.physical_schema.__get__
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not open Parquet input source '<Buffer>': Parquet file size is 0 bytes
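
For reference, the same error is reproducible locally whenever the file on disk is empty; a minimal sketch of a guard (the function and path names are illustrative, not crunch's actual code):

import os
import pandas

def read_prediction(path):
    # pyarrow raises exactly this ArrowInvalid on a zero-byte file,
    # so check the size before handing the path to pandas.
    if os.path.getsize(path) == 0:
        raise RuntimeError(f"{path} is empty; the prediction was never written")
    return pandas.read_parquet(path)

# usage (path is illustrative): read_prediction("prediction.parquet")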

I saw a similar issue described here: pyarrow.lib.ArrowInvalid: Could not open Parquet input source '<Buffer>': Parquet file size is 0 bytes

Do you know what the issue might be and how it can be fixed?

Thank you!

Could you give me the Run ID, please?

25770

Hi @enzo, do you have any insights into whether the issue is on my side or the AWS side? I found some similar issues described: python - pyarrow.lib.ArrowIOError: Invalid Parquet file size is 0 bytes - Stack Overflow and [Python] [AWS] Fail to open partitioned parquet with s3fs + pyarrow due to s3 prefix · Issue #38794 · apache/arrow · GitHub

I was hoping to resolve it soon to benefit from the first checkpoint…

I am still on it

thanks for the feedback. Let me know if you need more information.

Sorry, but it has been a few hours already and I still don’t understand what the root cause is.

It happens literally in the middle of loading your code.
I can confirm this from your dask import warning and a print statement placed after the user code is loaded.

I will continue tomorrow.

Did you try with CPU instead of GPU?

The scoring will be computed Monday evening, so you still have time.

Hi @enzo, thank you for your time… I switched to CPU and it started working… However, the model is very slow on CPU and inference will take days (roughly 10 h per sample × 8 samples ≈ 80 h).

I wasn't expecting the GPU to be the issue… On a side note: when I submit the job, the program says it does not detect any GPU needs even though the code requires one, so I am enforcing the use of the GPU… could this be related?

The CPU run id is 25780

I am not sure. Something is causing the program to exit silently, without an error…
I tried to catch exit calls, but that only works if the exit is made from Python.
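
For context, this is roughly what I mean by catching exit calls; a sketch, not the runner's actual code:

import atexit
import sys

_original_exit = sys.exit

def _traced_exit(code=None):
    # Log any Python-level sys.exit() before delegating to the real one.
    print(f"sys.exit({code!r}) was called", file=sys.stderr)
    _original_exit(code)

sys.exit = _traced_exit
atexit.register(lambda: print("interpreter is shutting down", file=sys.stderr))

Neither hook fires when a native library calls exit() or abort() at the C level (or on os._exit), which is why a crash inside a GPU library can terminate the process without leaving a trace.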

Hi, I found that when I submit through the website, the program does not crash on GPU, but through the crunch CLI it does. One of the differences I noticed is the requirements.txt file.
When I submit through the crunch CLI, it pins the package versions even though I do not specify them:

original requirements.txt

# frozen using pypi

anndata
numpy
pandas
scanpy
scikit-image
spatialdata
timm
torch==2.5.1
torchvision
tqdm

will be uploaded as:

# frozen using pypi

anndata==0.10.8
numpy==1.26.4
pandas==2.2.2
scanpy==1.10.2
scikit-image==0.24.0
spatialdata==0.2.1
timm==0.9.12
torch==2.5.1
torchvision==0.19.0
tqdm==4.66.5

after adding

--no-pip-freeze    Do not do a pip freeze to know preferred packages version.

the requirements.txt looks the same

this solved the issue…

The crunch-cli collects your locally installed versions to make sure the cloud environment always matches your local environment.

This should not be an issue: if it runs locally on your machine, it should run everywhere else, as long as the versions match.
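
Roughly, the freeze works like this; a sketch under the assumption that the CLI pins from the active environment (the function is made up, not the CLI's actual code):

from importlib.metadata import PackageNotFoundError, version

def freeze(requirements):
    # Pin every unpinned requirement to the version installed in the
    # environment the command runs from, like a targeted pip freeze.
    pinned = []
    for requirement in requirements:
        if "==" in requirement:
            pinned.append(requirement)  # explicit pins are kept as-is
        else:
            try:
                pinned.append(f"{requirement}=={version(requirement)}")
            except PackageNotFoundError:
                pinned.append(requirement)  # not installed locally
    return pinned

print(freeze(["numpy", "torch==2.5.1"]))  # e.g. ['numpy==1.26.4', 'torch==2.5.1']

This also shows why submitting from a machine other than the one you developed on produces different pins.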

I was submitting it from a different machine than the one I was executing the code on.

By the way @enzo, I think the --no-pip-freeze option should be the default: 1) rewriting submission files without the user's explicit knowledge is bad practice and makes debugging harder for users who are not aware of it (it also happens on the fly, so the original file remains as it is); 2) this behavior is different from the website submission process, where it is not enforced.

But I also understand why you might prefer this. However, people also use conda environments, and during submission they may not be inside the one they used during development… But maybe you have better arguments to support it…

We had problems in the past where libraries would update in the middle of a competition and some users' code would break. To improve the situation, we froze some of the requirements (like pandas and numpy), but that was never a stable solution.

Although freezing helps, knowing the exact package versions a user was running locally ensures that their environment can be reproduced.

We're not going to make the --no-pip-freeze option the default, but I agree it should be better communicated.
Since all the processing is done on the server (and not locally), we can only show a message after a submission has been accepted on the server, which would "waste" a submission if you do not like the processed requirements.txt.

This is also a big reason for the whitelist: since Jupyter users could not submit a requirements.txt, we needed a way to know which package name corresponds to which pypi project (sklearn → scikit-learn).

Internally, we keep a boolean that tells the system whether a library should always be frozen to the latest version (up to the submission creation date).
We cannot do this for all packages, as not all latest versions are compatible with each other.
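
For the curious, Python's standard library can produce a similar import-name-to-project mapping; a sketch (the whitelist shape below is hypothetical, not our actual schema):

from importlib.metadata import packages_distributions

# Maps importable module names to the PyPI project(s) that provide them
# (Python 3.10+; the package must be installed locally).
print(packages_distributions().get("sklearn"))  # ['scikit-learn']

# Hypothetical shape of a whitelist entry carrying the freeze flag:
WHITELIST = {
    "sklearn": {"pypi_project": "scikit-learn", "always_freeze_latest": False},
    "pandas": {"pypi_project": "pandas", "always_freeze_latest": True},
}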

Because we understand that debugging is hard, we always offer more submissions to help. We do not want users to feel stuck if the problem is beyond their control.
