
Uncompressed TIFFs on commons
Open, Needs Triage, Public

Description

Typically uploaders can choose to upload in any of our supported image formats. Some users, though, choose to upload uncompressed images when functionally-identical lossless formats are also available. This wastes a tremendous amount[0] of storage and backup space.

Suggestions, in order (I think) from easiest to hardest:

  • document guidelines about preferred vs. hated commons image formats
  • reject uncompressed files at upload time
  • compress new files at upload time
  • actively detect and replace uncompressed files with lossless-compressed equivalents

Event Timeline

Restricted Application added a subscriber: Aklapper.
Andrew updated the task description.

TIFF compression is fairly easy via tiffcp (I'm not a compression specialist, but both -c lzma:p9 and -c zip:p12 were pretty good, the former slower but better). The difficulty is that we can't just change objects in Swift "under the feet of" MediaWiki.
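As a rough sketch of the tiffcp invocation mentioned above, the following Python wrapper builds the command line and runs it. The function names and file paths are illustrative assumptions; only the `-c lzma:p9` and `-c zip:p12` options come from the comment, and whether a given scheme is accepted depends on how the local libtiff was built.

```python
import shutil
import subprocess


def tiffcp_command(src, dst, scheme="lzma:p9"):
    """Build a tiffcp command line that copies src to dst with the given compression."""
    return ["tiffcp", "-c", scheme, src, dst]


def compress_tiff(src, dst, scheme="lzma:p9"):
    """Run tiffcp; raises CalledProcessError if tiffcp exits with a non-zero status."""
    if shutil.which("tiffcp") is None:
        raise FileNotFoundError("tiffcp not found on PATH")
    subprocess.run(tiffcp_command(src, dst, scheme), check=True)
```

For example, `compress_tiff("in.tif", "out.tif", scheme="zip:p12")` would try the faster of the two settings.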

And if you did the naive "download uncompressed TIFF, reupload compressed TIFF as new version", you'd end up storing the old version forever in a deleted container. There is a server-side tool (used by T&S, I think!) that can do "Permanently discard version X of object Y", so you could have a bot that did the compress-and-upload and recorded the old revision(s) of the object to permanently discard later.
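The "record the old revision(s) to discard later" step could be as simple as an append-only ledger. The sketch below is purely hypothetical: the file format (JSON lines) and the field names `title`, `sha1` and `timestamp` are assumptions, and it does not talk to the server-side discard tool, whose interface is not described here.

```python
import json


def record_superseded(ledger_path, title, old_sha1, old_timestamp):
    """Append one JSON line describing an old revision that can be discarded later."""
    entry = {"title": title, "sha1": old_sha1, "timestamp": old_timestamp}
    with open(ledger_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry


def load_ledger(ledger_path):
    """Read all recorded entries back, skipping blank lines."""
    with open(ledger_path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Whoever runs the discard tool could then iterate over `load_ledger(...)` once the compressed versions have been verified.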

TIFF compression can be done losslessly, so I see no reason to accept uncompressed TIFFs.

As some data:

mysql:research@dbstore1007.eqiad.wmnet [commonswiki]> select actor_name, sum(fr_size) from filerevision join actor on fr_actor = actor_id where fr_timestamp like '2026%' group by fr_actor order by sum(fr_size) desc limit 10;
+--------------------+-----------------+
| actor_name         | sum(fr_size)    |
+--------------------+-----------------+
| PantheraLeo1359531 | 172075525814959 |
| DPLA bot           |  10693380586322 |
| OptimusPrimeBot    |   5816079552520 |
| Smaxims            |   3482921796474 |
| Vitaly Zdanevich   |   1064604825977 |
| Gzen92Bot          |   1038377632666 |
| Mr.Nostalgic       |    869023925250 |
| NoonIcarus         |    604110964274 |
| TaronjaSatsuma     |    448316991444 |
| MGeog2022          |    440454372198 |
+--------------------+-----------------+
10 rows in set (1 min 53.262 sec)

mysql:research@dbstore1007.eqiad.wmnet [commonswiki]> select sum(fr_size) from filerevision where fr_timestamp like '2026%' limit 5;
+-----------------+
| sum(fr_size)    |
+-----------------+
| 214287934281570 |
+-----------------+
1 row in set (43.571 sec)

The TIFF uploads are 157TB, meaning 80.3% of all data being uploaded to Commons now. That is 16 times more than the second-highest mass uploader, and that's just the 2026 uploads.
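The figures above can be rechecked from the two query results. This short calculation assumes, as the comment does, that the first row of the per-actor query corresponds to the TIFF uploads. It also suggests the "157TB" figure is in binary units (TiB); in decimal units the same byte count is about 172 TB.

```python
top_uploader = 172_075_525_814_959  # bytes, first row of the per-actor query
second = 10_693_380_586_322         # bytes, second row (DPLA bot)
total_2026 = 214_287_934_281_570    # bytes, sum over all 2026 file revisions

share = top_uploader / total_2026   # fraction of all 2026 upload volume
ratio = top_uploader / second       # multiple of the second-largest uploader
tib = top_uploader / 2**40          # binary terabytes (TiB)
tb = top_uploader / 10**12          # decimal terabytes (TB)

print(f"share of 2026 uploads: {share:.1%}")
print(f"ratio to second uploader: {ratio:.1f}x")
print(f"{tib:.1f} TiB = {tb:.1f} TB")
```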

I wrote a script to proactively compress TIFF files, and it works pretty nicely so far:

Processing: File:LVGL-SL - DOP20IR - 345000 5489000 (2025).tif
Success! Reduced by 62.62% (75030424 -> 28043302 bytes).
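The percentage in that log line follows directly from the two byte counts. A minimal sketch of the calculation (the function name is an assumption, not taken from the actual script):

```python
def reduction_percent(before, after):
    """Percentage by which a file shrank, given its sizes in bytes."""
    if before <= 0:
        raise ValueError("original size must be positive")
    return (before - after) / before * 100


print(f"Reduced by {reduction_percent(75030424, 28043302):.2f}%")
```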

I just can't download anything without getting 429s, and on my own laptop it runs out of memory (OOMs) a lot given the size of these files. Running it on WMCS is going to be complicated too, since there is no tiffcp in Toolforge and I'm sure the memory limits are even lower. I could take a VM somewhere in Cloud VPS and run it there; memory is still a problem but more doable, I guess (I will still get 429s, but carving out an exception shouldn't be too hard). Running it on the stat machines would be nice, but that is much, much harder to set up.
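Until an exception is carved out, a client hitting 429s would normally back off between attempts. The following is a minimal sketch of one common approach (exponential backoff with jitter, honouring a numeric `Retry-After` header when one is present). Whether the servers involved send `Retry-After`, and the `base` and `cap` values, are assumptions rather than anything stated in this task.

```python
import random


def retry_delay(attempt, retry_after=None, base=5.0, cap=300.0):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Retry-After given as a number of seconds
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # it may also be an HTTP date; fall back to backoff below
    backoff = min(cap, base * 2 ** attempt)
    # jitter: pick a delay between half and all of the backoff window
    return random.uniform(backoff / 2, backoff)
```

A download loop would call `retry_delay(n, response.headers.get("Retry-After"))` after the n-th failed attempt and sleep for that long before trying again.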

And it can't even upload the new files:

ERROR: An error occurred for uri https://commons.wikimedia.org/w/api.php
ERROR: Traceback (most recent call last):
  File "/home/amir/p3/lib/python3.13/site-packages/pywikibot/data/api/_requests.py", line 734, in _http_request
    response = http.request(self.site, uri=uri,
                            method='GET' if use_get else 'POST',
                            data=data, headers=headers)
  File "/home/amir/p3/lib/python3.13/site-packages/pywikibot/comms/http.py", line 329, in request
    r = fetch(baseuri, headers=headers, **kwargs)
  File "/home/amir/p3/lib/python3.13/site-packages/pywikibot/comms/http.py", line 506, in fetch
    callback(response)
    ~~~~~~~~^^^^^^^^^^
  File "/home/amir/p3/lib/python3.13/site-packages/pywikibot/comms/http.py", line 389, in error_handling_callback
    raise response from None
  File "/home/amir/p3/lib/python3.13/site-packages/pywikibot/comms/http.py", line 496, in fetch
    response = session.request(method, uri,
                               headers=headers, auth=auth, timeout=timeout,
                               **kwargs)
  File "/home/amir/p3/lib/python3.13/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/amir/p3/lib/python3.13/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/amir/p3/lib/python3.13/site-packages/requests/adapters.py", line 659, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', TimeoutError('The write operation timed out'))

WARNING: Waiting 80.0 seconds before retrying.

Hi! I think LZW compression would be very useful. Amir Sarabadani reached out to me about the issue. I would be glad to implement an LZW workflow, but I see some technical limitations right now. I transfer open geodata from different sources, which may include files without compression. I use OpenRefine for this workflow, because the amount of data is too large to download, compress and reupload. My question is: would it be possible to implement a function that (via OpenRefine) fetches the files, compresses them and uploads them as smaller files? This would be very helpful. Tools like GDAL support conversion without losing the embedded coordinates, so files can still be used in GIS software like QGIS. In cases like Thuringia, where only ZIPs are available, I wrote a Python script that applies automated compression. Thank you!
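For the GDAL route mentioned above, one possible shape of such a step is a call to `gdal_translate` with the GTiff `COMPRESS` creation option. This is only a sketch: the function names are hypothetical, it is not the Thuringia script, and ZSTD support in particular depends on how GDAL was built.

```python
import subprocess

SUPPORTED = {"LZW", "DEFLATE", "ZSTD"}


def gdal_compress_command(src, dst, method="LZW"):
    """Build a gdal_translate command that rewrites src as a compressed GeoTIFF."""
    if method not in SUPPORTED:
        raise ValueError(f"unsupported compression method: {method}")
    return ["gdal_translate", "-of", "GTiff", "-co", f"COMPRESS={method}", src, dst]


def gdal_compress(src, dst, method="LZW"):
    """Run the command; raises CalledProcessError if gdal_translate fails."""
    subprocess.run(gdal_compress_command(src, dst, method), check=True)
```

Because the conversion goes through GDAL, the georeferencing should be carried over to the output file, which is the property the comment relies on.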

@PantheraLeo1359531 It sounds like you have a very specific and complex workflow. You are probably more aware of what is possible in this regard than others. The general point, however, is that whenever you upload files, they should have been saved in a way that applies compression, especially when uploading in the volumes that you do. Please find the tools that allow you to do this. I'm sure scripts can be hooked up to OpenRefine etc. to facilitate this.

> because the amount of data is too large to download, compress and reupload

I'm sure that OpenRefine does all of those things as well; it simply does them for you, one by one or so. Unless you use uploadbyurl, of course.

> TIFF compression can be done losslessly, so I see no reason to accept uncompressed TIFFs.

We should probably have a ticket to detect this in the metadata handler and reject or at least warn when attempting to upload these types of files.
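To illustrate what such a check involves, here is a minimal Python sketch that reads the Compression tag (259) from the first image directory of a classic TIFF; a value of 1 means uncompressed. It is not MediaWiki's metadata handler (which is PHP), it ignores BigTIFF and any further directories in multi-page files, and the function names are assumptions.

```python
import struct

COMPRESSION_TAG = 259
UNCOMPRESSED = 1  # TIFF default when the tag is absent


def tiff_compression(data):
    """Return the Compression tag value from the first IFD of a classic TIFF."""
    if len(data) < 8:
        raise ValueError("too short to be a TIFF")
    if data[:2] == b"II":
        endian = "<"  # little-endian
    elif data[:2] == b"MM":
        endian = ">"  # big-endian
    else:
        raise ValueError("not a TIFF")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF uses magic number 43)")
    (count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        (tag,) = struct.unpack_from(endian + "H", data, entry)
        if tag == COMPRESSION_TAG:
            # a SHORT value sits in the first two bytes of the value field
            (value,) = struct.unpack_from(endian + "H", data, entry + 8)
            return value
    return UNCOMPRESSED


def is_uncompressed(data):
    """True if the first image in the TIFF is stored without compression."""
    return tiff_compression(data) == UNCOMPRESSED
```

An upload check could call `is_uncompressed()` on the file's bytes (the header plus first directory are enough when the directory sits near the start) and then reject or warn.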

In general, I think this is yet another case of why we need to invest in media handling and clean up core APIs to make problems like this easier to handle.

I think one of the questions that needs to be asked here is, are many of these files really needed? Some may have educational value, but there are others that provide no information at all. I often use WikiShootMe to review and categorize images with coordinates in my country, and sometimes I find images taken from the ISS that contain nothing but darkness.

> I think one of the questions that needs to be asked here is, are many of these files really needed? Some may have educational value, but there are others that provide no information at all. I often use WikiShootMe to review and categorize images with coordinates in my country, and sometimes I find images taken from the ISS that contain nothing but darkness.

As a long-term volunteer, I have opinions on that, but as staff/engineer, that's not my call to make. I suggest bringing this to the Commons VP to see how people feel about it. Of course, the burden of such files on the infrastructure could be one reason against having these files, but it should be weighed holistically against other considerations, and I don't think Phabricator is the right place for that. In this specific case, my task here is to mitigate, not to block.

> I just can't download anything without getting 429s, and on my own laptop it runs out of memory (OOMs) a lot given the size of these files. Running it on WMCS is going to be complicated too, since there is no tiffcp in Toolforge and I'm sure the memory limits are even lower. I could take a VM somewhere in Cloud VPS and run it there; memory is still a problem but more doable, I guess (I will still get 429s, but carving out an exception shouldn't be too hard). Running it on the stat machines would be nice, but that is much, much harder to set up.

Repeating from IRC: how much RAM and CPU do you need? We can potentially just give you an extra-large Toolforge quota if you'd prefer to work in Toolforge, but I'm also happy to make you a Cloud VPS project.

My only reservation would be if we think this compressor is going to remain part of our infra forever, in which case it should live in prod k8s. But as I understand it we're also disallowing future uncompressed uploads so the compression tool can be shut down after it does its job; is that right?

Creating a ticket to request a temporary Cloud VPS project for it is on my to-do list for today. I hope I can get to it ASAP. Sorry! But we have been thinking exactly the same ideas independently.

> I think one of the questions that needs to be asked here is, are many of these files really needed? Some may have educational value, but there are others that provide no information at all. I often use WikiShootMe to review and categorize images with coordinates in my country, and sometimes I find images taken from the ISS that contain nothing but darkness.

While it is debatable how useful such images are, if we actually enforce compression, such files should in theory compress really well.

Orthophotos are among the few media types that simultaneously serve as illustrations, historical records, geospatial datasets, verification tools, and long-term archival resources, and can be combined with many other sources of geodata or regular data.

  • Why Commons?

The question is not only whether orthophotos are useful, but whether there is another platform within the Wikimedia ecosystem that can fulfil the same role. Commons already provides long-term storage, categorisation, geodata support, multilingual descriptions, Wikidata integration, and direct reuse across Wikimedia projects. Orthophotos therefore benefit from an existing ecosystem that makes them discoverable, reusable, and connected to other forms of knowledge. We could add aerial images to perhaps a million Wikidata items, or to Wikipedia articles with missing imagery. I regularly receive notifications that users have used orthophoto imagery to illustrate articles.

In addition, Commons is one of the few large-scale repositories dedicated to freely licensed media. While commercial providers offer aerial imagery, their archives are not freely reusable and access conditions may change over time. Commons provides a stable and openly accessible alternative that aligns with Wikimedia's mission of free knowledge. Commercial providers in particular do not care much about long-term preservation.

  • Why orthophotos are different from ordinary photographs

Unlike ordinary photographs, orthophotos are not limited to a single subject. A single orthophoto may simultaneously document settlements, transportation networks, waterways, vegetation, land use, industrial facilities, and cultural heritage sites. Their educational value therefore extends far beyond the reason why a particular image was originally consulted.

Orthophotos are also among the few media types that simultaneously function as illustrations, historical records, geospatial datasets, verification tools, and archival resources. This combination makes them uniquely valuable for educational and research purposes. Their value may increase in the near future, especially when structures are lost to wars or natural catastrophes. Orthophoto imagery expands Commons into the domain of large-scale geospatial documentation, connecting visual media and structured spatial datasets. Emerging formats such as textured 3D meshes similarly extend this scope, highlighting that Commons is increasingly becoming a repository for complex, data-rich representations of the physical world. As cities and communities change, the images gain historical value.

  • Why volunteer replacement is unrealistic

Many types of media stored on Commons can theoretically be recreated by volunteers. State-wide or nation-wide orthophoto datasets are fundamentally different. Their creation requires specialised aircraft, sensors, processing pipelines, and substantial financial resources.

If a freely licensed orthophoto dataset disappears, it is often impossible for volunteers to reproduce the same coverage, quality, and historical time frame. Preserving such datasets therefore protects information that may otherwise be lost permanently. The problem is that archives like the Internet Archive have unstable infrastructure and have faced several lawsuits over time.

  • Verifiability

Orthophotos provide an independent and verifiable record of the physical world at a specific point in time. They can be used to confirm the existence, condition, or development of buildings, infrastructure, landscapes, and other geographic features. This directly supports Wikimedia's commitment to verifiability and reliable sourcing, which is even more important in a time of growing fake news.

  • Archive function

Public access to geospatial datasets cannot be taken for granted. Governments change policies, budgets are reduced, and institutions may replace or remove older imagery. Commons can serve as a safeguard against the loss of historically significant datasets by preserving freely licensed imagery in a stable and openly accessible environment. This is also a topic for the ArchiveTeam and part of a growing awareness among archivists. The archived files come with geocoordinates (thanks to the respective user), so they can be found on wiki maps by users.

  • Addressing infrastructure concerns

The infrastructure concerns are very important. Measures like mandatory lossless compression of TIFFs would be a great idea. GeoTIFF is a widely used standard in the geospatial community because it preserves georeferencing information and other important metadata. The issue is that many files are uploaded uncompressed, so we could apply mature lossless compression methods such as LZW, Deflate, or ZSTD. If uploading uncompressed TIFFs remains possible, mandatory compression at upload would be good.

Such an approach would address infrastructure concerns without reducing Commons' ability to preserve and distribute geospatial knowledge. It would represent a technical solution to a technical problem rather than restricting a class of educationally valuable files, which would be a very great loss IMHO. Many users are not experts in geospatial data formats. Therefore, enforcing efficient lossless compression automatically during upload may be preferable to relying on every contributor to optimise files manually before submission. Some new users may not know about this discussion, so we can avoid upcoming problems :).

PS: I already asked in respective village pump section for help or additional functions in OpenRefine, but didn't get an answer :(.

I don't think anyone is disputing that orthophotos can be educationally useful. The question is whether storing massive numbers of huge TIFFs on Commons is the best use of Commons infrastructure and budget.

Several claims in your argument appear to be assumptions rather than evidence-backed conclusions:

  1. Can you provide sources showing that these orthophotos are actually being used at a scale that justifies their storage costs?

    Commons is not an archive where educational value alone automatically justifies unlimited storage. If orthophotos consume a substantial share of infrastructure resources, we should be looking at measurable outcomes:
    • How many Wikipedia articles use them?
    • How many Wikidata items use them?
    • How many page views do those usages generate?
    • How often are the original TIFFs, rather than derived previews, actually accessed?

    Without such data, statements about educational value remain largely anecdotal.
  2. Can you provide sources demonstrating that the very large, original TIFFs are the component generating the educational value?

    Most Wikimedia users consume rendered thumbnails or derived JPEG/PNG outputs, not multi-gigabyte GeoTIFFs. If the educational use comes primarily from derived products, then it is not obvious that Commons must store every original file indefinitely, especially in uncompressed form.
  3. Can you provide evidence supporting the claim that Commons is the appropriate platform for long-term preservation of national-scale geospatial datasets?

    Commons was designed primarily as a media repository supporting Wikimedia projects. It was not designed as a dedicated geospatial archive. Many of your arguments describe requirements closer to those of a specialist geospatial preservation platform than a media repository.
  4. Can you provide sources showing that the cost is proportionate to the benefit?

    If orthophotos account for a significant fraction of Commons storage growth or operational expenses, then the burden of proof is not merely demonstrating usefulness but demonstrating sufficient usefulness relative to cost.

    For example, if a particular category of files represents roughly 38% of storage expenditure, it would be reasonable to ask for evidence that it delivers a comparable share of Wikimedia's educational impact.
  5. Why should Commons bear these costs instead of a dedicated project with infrastructure optimized for geospatial data?

    GeoTIFFs, orthophotos, DEMs, point clouds, and similar datasets have very different access patterns and technical requirements from photographs, diagrams, audio, and video. A specialized repository could:
    • optimize storage for large geospatial assets,
    • apply domain-specific compression and deduplication,
    • provide geospatial APIs and tooling,
    • and avoid imposing disproportionate costs on Commons infrastructure.

    Before expanding Commons into a general-purpose geospatial archive, there should be evidence that such a role is both technically and financially appropriate.

In short, the question is not whether orthophotos have value. The question is whether there is evidence that storing huge original TIFF datasets on Commons provides enough Wikimedia-wide benefit to justify their infrastructure cost, and whether Commons is the most suitable place to host them instead of a dedicated geospatial platform. Until data is presented, those remain assumptions rather than demonstrated conclusions.

Wearing my WMF staff hat, I'd like to note that "we should not store uncompressed TIFFs in commons" is definitely our call to make, and "upload tooling should not upload uncompressed TIFFs to commons" likely also.

"Should these large TIFFs be in commons at all?", is a separate question, and one I think for the commons community, not SRE Data Persistence to make. Which inclines me to think it should become a separate ticket if it's going to be a lengthy discussion :)

> Wearing my WMF staff hat, I'd like to note that "we should not store uncompressed TIFFs in commons" is definitely our call to make, and "upload tooling should not upload uncompressed TIFFs to commons" likely also.

Agreed. Commons is a platform to store and serve "media". While the distinction from storing and serving "files" looks very subtle, the two are completely different worlds in terms of infrastructure and software. There are other places on the internet that do the latter; the Internet Archive is one example. It's not impossible to build a version of it in the Wikimedia ecosystem, but it's a different beast altogether.

> "Should these large TIFFs be in commons at all?", is a separate question, and one I think for the commons community, not SRE Data Persistence to make. Which inclines me to think it should become a separate ticket if it's going to be a lengthy discussion :)

Also agreed. I would go even further and say it shouldn't happen in Phabricator at all. There should be a discussion on https://commons.wikimedia.org/wiki/Commons:Village_pump

I thought I had given an update here. The bot is now running and compressing TIFFs: https://commons.wikimedia.org/w/index.php?title=Special:Contributions/Dexbot&target=Dexbot&dir=prev&offset=20260605152652 I'll bother Matthew about dropping the old files from Swift.