
Bump memory to enable large artifacts sync on HDFS
Open, Needs Triage, Public

Description

The SEAL project ships with two large artifacts that get synced to the Hadoop cluster at deployment time:

Note that we've already tried to decrease their size.

This currently prevents Airflow DAGs from being deployed; see T325316#9250697 for details.

Event Timeline

Hm, actually, as far as I can tell, reading from HTTP (and many other sources) uses https://filesystem-spec.readthedocs.io/en/stable/api.html#fsspec.spec.AbstractBufferedFile, which has a default read blocksize of 5MB.
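For context, a minimal sketch of the read path being discussed (using the artifact URL from the curl output below; the 5 MiB figure is fsspec.spec.AbstractBufferedFile.DEFAULT_BLOCK_SIZE, and block-wise read-ahead only applies when the server reports a file size):

import fsspec
from fsspec.spec import AbstractBufferedFile

# Default read-ahead block size for AbstractBufferedFile subclasses: 5 MiB.
print(AbstractBufferedFile.DEFAULT_BLOCK_SIZE)  # 5242880

url = "https://gitlab.wikimedia.org/repos/structured-data/seal/-/package_files/1465/download"

fs = fsspec.filesystem("https")
# With a known file size, reads fetch roughly one block at a time, so
# block_size (not the artifact size) should bound memory on the read side.
with fs.open(url, "rb", block_size=5 * 2**20) as f:
    chunk = f.read(1024)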

Perhaps this is on the writing side?

No, I guess not:

Buffer only sent on flush() or if buffer is greater than or equal to blocksize.

Oh, or:

https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/http.py#L527-L528

Supports only reading, with read-ahead of a predetermined block-size.

In the case that the server does not supply the filesize, only reading of the complete file in one go is supported.
$ curl -I https://gitlab.wikimedia.org/repos/structured-data/seal/-/package_files/1465/download
...
content-length: 0

So it should work; it's just that the GitLab package registry doesn't indicate the size of the file?
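For what it's worth, fsspec itself sees no usable size for that URL. A quick check (a sketch, assuming the HTTPS filesystem populates info()["size"] from the Content-Length / Content-Range headers, as the implementation linked above does):

import fsspec

url = "https://gitlab.wikimedia.org/repos/structured-data/seal/-/package_files/1465/download"

fs = fsspec.filesystem("https")
info = fs.info(url)

# Prints 0 or None here: with no usable size, fsspec cannot do range-based
# block reads and falls back to pulling the whole artifact in one go, which
# is what drives the memory requirement at deploy time.
print(info.get("size"))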


Ah, good find. That is unfortunate; perhaps we can log it as a GitLab bug?
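In the meantime, a possible mitigation to experiment with, rather than only bumping memory: pass the artifact size explicitly. This is only a sketch; it assumes a recent enough fsspec whose HTTP filesystem accepts a size hint at open time, that the registry honours Range requests, and that we can get the real size from the package metadata (ARTIFACT_SIZE below is a placeholder):

import fsspec

url = "https://gitlab.wikimedia.org/repos/structured-data/seal/-/package_files/1465/download"

# Placeholder: the real value would have to come from the package metadata,
# since the download endpoint itself does not report it.
ARTIFACT_SIZE = 500 * 2**20

fs = fsspec.filesystem("https")
# Assumption: newer fsspec versions accept an explicit size, so the open call
# skips the Content-Length lookup and keeps block-wise reads.
with fs.open(url, "rb", block_size=5 * 2**20, size=ARTIFACT_SIZE) as f:
    chunk = f.read(1024)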