Server-side upload request for OptimusPrimeBot (INPE DPI)
Closed, Resolved (Public, Request)

Description

Please upload the following files to Wikimedia Commons:

The username of my bot is OptimusPrimeBot.

Description:

== {{int:filedesc}} ==
{{Information
| source = https://www.dpi.inpe.br/galeria/
| author = INPE/OBT/DPI: Divisão de Processamento de Imagens, Coordenação Geral de Observação da Terra, Instituto Nacional de Pesquisas Espaciais
}}
=={{int:license-header}}==
{{Cc-by-sa-4.0}}
[[Category:Satellite pictures]]
[[Category:Spacemedia INPE files uploaded by OptimusPrimeBot]]
[[Category:Spacemedia files (review needed)]]

Thank you.

Event Timeline

Urbanecm_WMF changed the task status from Open to Stalled. Jul 4 2024, 5:38 PM
Urbanecm_WMF subscribed.

Hello @Don-vip, I can take a look at this, but it is currently not possible to download the images from Wikimedia production servers:

[urbanecm@mwmaint1002 ~/uploads]$ wget 'https://www.dpi.inpe.br/galeria/Ontario_NY_PA_OH_MI_AMAZONIA_1_WFI_20220509_044_008_L2_BAND4321.tif'
--2024-07-04 17:35:16--  https://www.dpi.inpe.br/galeria/Ontario_NY_PA_OH_MI_AMAZONIA_1_WFI_20220509_044_008_L2_BAND4321.tif
Resolving webproxy.eqiad.wmnet (webproxy.eqiad.wmnet)... 2620:0:861:3:208:80:154:74, 208.80.154.74
Connecting to webproxy.eqiad.wmnet (webproxy.eqiad.wmnet)|2620:0:861:3:208:80:154:74|:8080... connected.
Proxy request sent, awaiting response... 403 Forbidden
2024-07-04 17:35:18 ERROR 403: Forbidden.

[urbanecm@mwmaint1002 ~/uploads]$ wget https://www.dpi.inpe.br/galeria/Remanso_PAN10M_20170425_153_111_B432_5mRecReg5M.tif
--2024-07-04 17:35:30--  https://www.dpi.inpe.br/galeria/Remanso_PAN10M_20170425_153_111_B432_5mRecReg5M.tif
Resolving webproxy.eqiad.wmnet (webproxy.eqiad.wmnet)... 2620:0:861:3:208:80:154:74, 208.80.154.74
Connecting to webproxy.eqiad.wmnet (webproxy.eqiad.wmnet)|2620:0:861:3:208:80:154:74|:8080... connected.
Proxy request sent, awaiting response... 403 Forbidden
2024-07-04 17:35:31 ERROR 403: Forbidden.

[urbanecm@mwmaint1002 ~/uploads]$

Do you have any idea why that might be? I also tried setting a custom user agent (Wikimedia Foundation (murbanec@wikimedia.org) is what I used), but I received the same response.

Setting as stalled until the unavailability can be resolved.

Hi @Urbanecm_WMF!
It's weird, I can download the files without any problem from Cloud VPS:

don-vip@worker-1:~/spacemedia$ hostname -A
worker-1.spacemedia.eqiad1.wikimedia.cloud

don-vip@worker-1:~/spacemedia$ wget https://www.dpi.inpe.br/galeria/Ontario_NY_PA_OH_MI_AMAZONIA_1_WFI_20220509_044_008_L2_BAND4321.tif
--2024-07-04 19:26:35--  https://www.dpi.inpe.br/galeria/Ontario_NY_PA_OH_MI_AMAZONIA_1_WFI_20220509_044_008_L2_BAND4321.tif
Resolving www.dpi.inpe.br (www.dpi.inpe.br)... 150.163.2.5
Connecting to www.dpi.inpe.br (www.dpi.inpe.br)|150.163.2.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2021381513 (1.9G) [image/tiff]
Saving to: ‘Ontario_NY_PA_OH_MI_AMAZONIA_1_WFI_20220509_044_008_L2_BAND4321.tif’

Ontario_NY_PA_OH_MI_AMAZONIA_1_WFI_20220509_044_008_L2_BAND4321.tif             7%[==============>

Do you know someone from the WMF network team who could help?

Urbanecm_WMF changed the task status from Stalled to Open. Edited Jul 8 2024, 10:21 AM
Urbanecm_WMF claimed this task.

Hi @Don-vip!

The issue doesn't appear to be within the WMF network; a 403 Forbidden error means that mwmaint1002 managed to connect to www.dpi.inpe.br, but that server refused to provide the file. Logs from www.dpi.inpe.br might help determine the cause.

From my end, it seems that only downloading via wget triggers the problem: if I use curl -O ... instead, the file downloads successfully. Since I have found a way to download the files, I'll process this request now, but I recommend looking into the 403 error, especially if you plan to make future server-side upload requests from this source.
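A minimal sketch of how the wget/curl difference could be narrowed down, assuming the remote side reacts to the client's identity. The cause of the 403 is not established in this thread, and the User-Agent strings below are hypothetical examples, not the exact strings sent from mwmaint1002. It issues a HEAD request per User-Agent and records the outcome.

```python
import urllib.error
import urllib.request

# Hypothetical User-Agent strings; real wget/curl strings depend on the installed versions.
USER_AGENTS = {
    "wget-like": "Wget/1.21",
    "curl-like": "curl/7.88.1",
    "custom": "Wikimedia Foundation (murbanec@wikimedia.org)",
}


def build_probe(url, user_agent):
    """Build a HEAD request so that no file body is transferred."""
    return urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method="HEAD"
    )


def probe(url, timeout=30):
    """Return {label: HTTP status or error text} for each User-Agent."""
    results = {}
    for label, user_agent in USER_AGENTS.items():
        try:
            request = build_probe(url, user_agent)
            with urllib.request.urlopen(request, timeout=timeout) as response:
                results[label] = response.status
        except urllib.error.HTTPError as err:
            # The server answered, but with an error status such as 403.
            results[label] = err.code
        except urllib.error.URLError as err:
            # No usable answer at all (DNS, TLS, proxy tunnel failure, ...).
            results[label] = f"error: {err.reason}"
    return results
```

Calling `probe(...)` with one of the file URLs from the host that shows the problem would indicate whether the refusal depends on the User-Agent at all; identical results for every entry would point elsewhere.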

FWIW, it is also possible to upload via URL using https://commons.wikimedia.org/wiki/Commons:Upload_tools/wgCopyUploadsDomains (see the URL field at https://commons.wikimedia.org/wiki/Special:Upload), which can help with the uploading as well if the images are already available somewhere. EDIT: This might be what you're already using, as dpi.inpe.br is allowlisted. In that case, it might be running into the same 403 problem as I experienced when downloading manually.

In any case, this is now done:

[urbanecm@mwmaint1002 ~/uploads]$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=OptimusPrimeBot .
Importing Files

Importing GolfoMexico_AWFI_20200505_210_069_L2_BAND161514.tif...done.
Importing Sobradinho_MUX_20150117_152_110e111_B876.tif...done.
Importing Rio_PAN10M_20170212e310_151_125e6_B342.tif...done.
Importing SP_PAN10M_20160710e0923_155_126_e_164_126e7_B432.tif...done.
Importing Ontario_NY_PA_OH_MI_AMAZONIA_1_WFI_20220509_044_008_L2_BAND4321.tif...done.
Importing Remanso_PAN10M_20170425_153_111_B432_5mRecReg5M.tif...done.
Importing Brasilia_20170312a0416_Mediana_B151413.tif...done.
Importing Rio_PAN10M_20170212e310_151_125e6_B432.tif...done.
Importing Remanso_PAN5M_20170425_153_111_2_5m_Rec.tif...done.
Importing Remanso_PAN10MPAN5M_20170425_153_111_B432_5mRec.tif...done.
Importing Titicaca_AWFI_20150518_181_117_B151413_8bits.tif...done.

Found: 11
Added: 11
[urbanecm@mwmaint1002 ~/uploads]$

Thank you @Urbanecm_WMF!
I see the files have been imported, but the thumbnails have not been generated:

image.png (352×1 px, 47 KB)

Is there something else to do, or is this the root cause that prevented me from uploading these files in the first place?

The lack of thumbnails is probably because MediaWiki thinks the dimensions of the image are 0x0. MediaWiki reports an error of "no page data found in tiff directory!".

Try running the TIFF files through the command-line tiffinfo tool to see if it recognizes them. Possibly these files don't conform to the TIFF standard.

Tried Brasilia 20170312a0416 Mediana B151413.tif

tiffinfo is able to read it:

=== TIFF directory 0 ===
TIFF Directory at offset 0xcd8be34 (215531060)
  Subfile Type: (0 = 0x0)
  Image Width: 6253 Image Length: 6878
  Resolution: 72, 72 pixels/inch
  Bits/Sample: 8
  Compression Scheme: LZW
  Photometric Interpretation: RGB color
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 3
  Rows/Strip: 13
  Planar Configuration: single image plane
  Make: SPRING
  Software: Adobe Photoshop CC 2015 (Windows)
  DateTime: 2017:10:30 16:32:08
  XMLPacket (XMP Metadata):
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.6-c111 79.158325, 2015/09/10-01:10:20        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xmp="http://ns.adobe.com/xap/1.0/"
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/"
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
            xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#">
         <xmp:CreatorTool>Spring 5.5.0 - Windows 64 bits</xmp:CreatorTool>
         <xmp:ModifyDate>2017-10-30T16:32:08-02:00</xmp:ModifyDate>
         <xmp:CreateDate>2017-10-30T15:18:19-02:00</xmp:CreateDate>
         <xmp:MetadataDate>2017-10-30T16:32:08-02:00</xmp:MetadataDate>
         <dc:format>image/tiff</dc:format>
         <photoshop:ColorMode>3</photoshop:ColorMode>
         <xmpMM:InstanceID>xmp.iid:b61066bf-2362-5745-b447-be580faf09d7</xmpMM:InstanceID>
         <xmpMM:DocumentID>adobe:docid:photoshop:9d18cfe1-bda0-11e7-b6dd-fe224ca38e9f</xmpMM:DocumentID>
         <xmpMM:OriginalDocumentID>xmp.did:c6ba0e65-7a48-2040-a27f-c5cbab6d0836</xmpMM:OriginalDocumentID>
         <xmpMM:History>
            <rdf:Seq>
               <rdf:li rdf:parseType="Resource">
                  <stEvt:action>created</stEvt:action>
                  <stEvt:instanceID>xmp.iid:c6ba0e65-7a48-2040-a27f-c5cbab6d0836</stEvt:instanceID>
                  <stEvt:when>2017-10-30T15:18:19-02:00</stEvt:when>
                  <stEvt:softwareAgent>Adobe Photoshop CC 2015 (Windows)</stEvt:softwareAgent>
               </rdf:li>
               <rdf:li rdf:parseType="Resource">
                  <stEvt:action>saved</stEvt:action>
                  <stEvt:instanceID>xmp.iid:b61066bf-2362-5745-b447-be580faf09d7</stEvt:instanceID>
                  <stEvt:when>2017-10-30T16:32:08-02:00</stEvt:when>
                  <stEvt:softwareAgent>Adobe Photoshop CC 2015 (Windows)</stEvt:softwareAgent>
                  <stEvt:changed>/</stEvt:changed>
               </rdf:li>
            </rdf:Seq>
         </xmpMM:History>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
...
<?xpacket end="w"?>
  Tag 33550: 64.000000,64.000000,0.000000
  Tag 33922: 0.000000,0.000000,0.000000,-96.000000,8500096.000000,0.000000
  Photoshop Data: <present>, 11528 bytes
  EXIFIFDOffset: 0x53b2a14
  Tag 34735: 1,1,0,7,1024,0,1,1,1025,0,1,1,1026,34737,28,0,2049,34737,7,28,2054,0,1,9102,3072,0,1,32723,3076,0,1,9001
  Tag 34737: UTM Zone 23S (48 W to 42 W)|WGS 84|
  Adobe Photoshop Document Data Block: 0x41,0x64,0x6f,0x62,0x65,0x20,0x50,0x68,0x6f,0x74,0x6f,0x73,0x68,0x6f,0x70,0x20,0x44,0x6f,0x63,0x75,0x6d,0x65,0x6e,0x74,0x20,0x44,0x61,0x74,0x61,0x20,0x42,0x6c,0x6f,0x63,0x6b,0x0,0x4d,0x49,0x42
,0x38,0x72,0x79,0x61,0x4c,0x7e,0x93,0x9d,0x7,0x3,0x
...
,0xfc,0xfe,0x0,0x0,0x4d,0x49,0x42,0x38,0x6b,0x73,0x4d,0x4c,0xe,0x0,0x0,0x0,0x0,0x0,0xff,0xff,0x0,0x0,0x0,0x0,0x0,0x0,0x32,0x0,0x80,0x0,0x0,0x0,0x4d,0x49,0x42,0x38,0x74,0x74,0x61,0x50,0x0,0x0,0x0,0x0,0x4d,0x49,0x42,0x38,0x6b,0x73,0x4d,0x46,0xc,0x0,0x0,0x0,0x0,0x0,0xff,0xff,0x0,0x0,0x0,0x0,0x0,0x0,0x32,0x0
  Predictor: horizontal differencing 2 (0x2)
--- EXIF directory within directory 0
TIFF Directory at offset 0x53b2a14 (87763476)
  ColorSpace: 65535
  PixelXDimension: 6253
  PixelYDimension: 6878

I even set up the Wikimedia thumbor-plugins on one of my Cloud VPS instances, and thumbnail generation works:

https://thumbor.wmcloud.org/thumbor/unsafe/800x/Brasilia_20170312a0416_Mediana_B151413.tif

So the WMF Thumbor instance should be able to deal with it too. But I can't make it do so; I get a 404 error:

https://commons.wikimedia.org/w/thumb.php?f=Brasilia_20170312a0416_Mediana_B151413.tif&w=800

If MediaWiki can't determine the dimensions of the file, it will fail before Thumbor gets involved.

I've set up a local MediaWiki instance (1.39, using the Ubuntu package) with the default config:

// Use exiv2? if false, MediaWiki's internal EXIF parser will be used
$wgTiffUseExiv = false;
// Use tiffinfo? if false, ImageMagick's identify command will be used
$wgTiffUseTiffinfo = false;

And it is able to detect the file dimensions and size (6,253 × 6,878 pixels, file size: 327.42 MB, MIME type: image/tiff, 3 pages):

image.png (44×603 px, 9 KB)

I tried to replicate the Commons config (tell me if I'm wrong):

$wgUseImageMagick = true;
$wgTiffUseTiffinfo = true;
$wgTiffMaxMetaSize = 1048576;

And it works too: dimensions, file size and MIME type are correctly determined. What's wrong with Commons?

I've set up a local MediaWiki instance (1.39, using the Ubuntu package) with the default config:

Note Commons runs on a significantly newer version, not the LTS. Installing MediaWiki from Git master branches would be more representative (but the exact version information can be found at https://commons.wikimedia.org/wiki/Special:Version and the current WMF deployment branch in use can be found at https://versions.toolforge.org/). Also note the error @Bawolff mentioned above does not come from MediaWiki Core, it comes from the MediaWiki-extensions-PagedTiffHandler extension (considering that is the only codebase that includes the string "no page data found in tiff directory", see CodeSearch).

Tests that do not use the same software version as Commons, or that are missing TIFF-related extensions, might not be representative of what is actually happening on Commons' side. For what it's worth, the img_width and img_height fields in the database are set correctly:

MariaDB [commonswiki_p]> select * from image where img_name='Brasilia_20170312a0416_Mediana_B151413.tif'\G
*************************** 1. row ***************************
          img_name: Brasilia_20170312a0416_Mediana_B151413.tif
          img_size: 343329260
         img_width: 6253
        img_height: 6878
      img_metadata: {"data":{"page_data":[],"errors":["no page data found in tiff directory!"],"exif":[],"TIFF_METADATA_VERSION":"1.4"}}
          img_bits: 0
    img_media_type: BITMAP
    img_major_mime: image
    img_minor_mime: tiff
img_description_id: 338103315
         img_actor: 25326292
     img_timestamp: 20240708102545
          img_sha1: lu5l0mb394yji32xgiw1vkjcf2ijbfh
1 row in set (0.001 sec)

MariaDB [commonswiki_p]>

However, MediaWiki Core does not use those two fields in case of formats that support multiple pages (which includes the TIFF format). For those formats, it calls an appropriate handler to provide the information independently, such as the MediaWiki-extensions-PagedTiffHandler extension.

That means that whatever is happening is likely contained within the extension, not within MediaWiki Core. Hence, any tests should focus on that extension rather than on MediaWiki itself; otherwise, they are unlikely to represent well what happens on Commons' end.
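To make the database row above easier to read, here is a small sketch that inspects an img_metadata value shaped like the one shown for this file. It assumes the field is JSON with a top-level "data" object containing "errors" and "page_data", as in that row; other files or MediaWiki versions may store it differently. The function name is hypothetical.

```python
import json


def tiff_metadata_problems(img_metadata):
    """List the problems visible in a PagedTiffHandler-style img_metadata value."""
    meta = json.loads(img_metadata)["data"]
    # Errors recorded by the handler when it extracted the metadata.
    problems = list(meta.get("errors", []))
    # An empty page list means the handler has no per-page dimensions,
    # even if img_width and img_height are filled in.
    if not meta.get("page_data"):
        problems.append("page_data is empty")
    return problems


row = (
    '{"data":{"page_data":[],"errors":["no page data found in tiff directory!"],'
    '"exif":[],"TIFF_METADATA_VERSION":"1.4"}}'
)
print(tiff_metadata_problems(row))
```

For the row quoted above, this reports both the recorded error and the empty page list, which matches the explanation that the per-page handler data, not img_width and img_height, is what is missing.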

What's wrong with Commons?

Let's avoid prematurely jumping to conclusions :). As shown above, your comment can be fully explained by the test not taking the TIFF extension into account (which is what reported the error here). As of now, I do not see any evidence on this task that would suggest it's specifically Commons (its config and/or infrastructure) that would be at fault here. I'm happy to take a look at what might be happening, but I don't believe engaging in a discussion in which the infrastructure is repeatedly blamed without evidence would be productive. Thank you for your understanding.


That being said... I ran tiffinfo on the file you mentioned and on a random TIFF from Commons:

[urbanecm@mwmaint1002 ~/uploads]$ wget https://upload.wikimedia.org/wikipedia/commons/5/5d/Brasilia_20170312a0416_Mediana_B151413.tif
[urbanecm@mwmaint1002 ~/uploads]$ wget https://upload.wikimedia.org/wikipedia/commons/5/56/Mokary_nature_de_chez_Mokariz.tif
[urbanecm@mwmaint1002 ~/uploads]$ tiffinfo Brasilia_20170312a0416_Mediana_B151413.tif > Brasilia_20170312a0416_Mediana_B151413.tif.tiffinfo
[urbanecm@mwmaint1002 ~/uploads]$ tiffinfo Mokary_nature_de_chez_Mokariz.tif > Mokary_nature_de_chez_Mokariz.tif.tiffinfo
[urbanecm@mwmaint1002 ~/uploads]$ ls -lh
total 942M
-rw-rw-r-- 1 urbanecm wikidev 328M Jul  8 10:25 Brasilia_20170312a0416_Mediana_B151413.tif
-rw-rw-r-- 1 urbanecm wikidev 603M Jul  9 11:09 Brasilia_20170312a0416_Mediana_B151413.tif.tiffinfo
-rw-rw-r-- 1 urbanecm wikidev  12M Jul  9 09:34 Mokary_nature_de_chez_Mokariz.tif
-rw-rw-r-- 1 urbanecm wikidev 1.6K Jul  9 11:09 Mokary_nature_de_chez_Mokariz.tif.tiffinfo
[urbanecm@mwmaint1002 ~/uploads]$

TIFF info for Brasilia_20170312a0416_Mediana_B151413.tif is almost twice as large as the actual image. That doesn't seem to be correct – metadata is supposed to be smaller than the actual data it describes. Are we sure the TIFF info for the tiffs provided is a correct one?

600 MB! I imagine MediaWiki is cutting off the output way before that, probably losing some important data.
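One way to check where the bulk of such a file comes from is to list the tags of the first TIFF directory by payload size. The sketch below is a simplified reader written for illustration: it handles classic TIFF only (not BigTIFF), looks at the first directory only, and the function name is hypothetical.

```python
import struct

# Bytes per value for the TIFF field types 1..12 (BYTE, ASCII, SHORT, LONG, ...).
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1, 7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}


def tag_sizes(fh):
    """Return [(tag, payload_bytes)] for the first IFD, largest payload first.

    `fh` is a binary file object; only the header and the directory are read,
    so this also works on very large files.
    """
    header = fh.read(8)
    order = {b"II": "<", b"MM": ">"}[header[:2]]  # little or big endian
    magic, ifd_offset = struct.unpack(order + "HI", header[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled)")
    fh.seek(ifd_offset)
    (entry_count,) = struct.unpack(order + "H", fh.read(2))
    entries = fh.read(12 * entry_count)
    sizes = []
    for index in range(entry_count):
        # Each entry: tag, field type, value count, value-or-offset.
        tag, field_type, count, _ = struct.unpack_from(order + "HHII", entries, 12 * index)
        sizes.append((tag, TYPE_SIZES.get(field_type, 1) * count))
    return sorted(sizes, key=lambda item: item[1], reverse=True)
```

Used as `with open(path, "rb") as fh: print(tag_sizes(fh)[:5])`, it shows the five largest tags. A single metadata tag that is comparable in size to the image data itself would be a natural suspect for the oversized tiffinfo output.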

Let's avoid prematurely jumping to conclusions :). As shown above, your comment can be fully explained by the test not taking the TIFF extension into account (which is what reported the error here). As of now, I do not see any evidence on this task that would suggest it's specifically Commons (its config and/or infrastructure) that would be at fault here. I'm happy to take a look at what might be happening, but I don't believe engaging in a discussion in which the infrastructure is repeatedly blamed without evidence would be productive. Thank you for your understanding.

I am sorry. I didn't want to blame the infra. I realize it's very complex, that a lot of effort has been (and is being) put into keeping it running, and that this is not an easy task. You clearly have my full respect and gratitude for that. English is not my native language; I just wanted to say that I observed a problem on Commons and that, in my quick test, the MediaWiki core software seemed able to handle the problematic file. But I am a novice when it comes to the inner mechanisms of MediaWiki, so my tests are surely flawed; I am trying to learn. I really appreciate you explaining to me how things work.

For my test I had already installed the PagedTiffHandler extension (of course, not the latest version but the one compatible with 1.39) and felt it was working, as I got the number of pages in the UI. I'll check again with the correct versions from the Git repository.

Are we sure the TIFF info for the tiffs provided is a correct one?

I don't know, I'll dig into it. I noticed I had to significantly increase PHP's memory settings (memory_limit = 1G) to get my local instance to accept the file (otherwise the import failed with an out-of-memory error).

I am sorry. I didn't want to blame the infra

In fairness, the infra is often broken when it comes to multimedia :)

Keep in mind that Wikimedia uses the following config settings (and also Shellbox), which I strongly suspect affect things. (Edit: reading the older comments, you already knew this, so never mind :) However, I'm not exactly sure what steps you took, so keep in mind that changing these settings affects newly uploaded files but would (mostly) not affect any files that are already on your test wiki.)

$wgTiffUseTiffinfo = true; 
$wgTiffMaxMetaSize = 1048576;

Of course, it's also always possible that different versions of tiffinfo give different results.
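To compare a file against the $wgTiffMaxMetaSize value quoted above, the size of the tiffinfo output can be measured without writing it to disk. This is a sketch under assumptions: it runs plain `tiffinfo <file>` as done earlier in this thread, the flags the extension itself passes may differ, and how the extension reacts when the limit is exceeded is not established here.

```python
import subprocess

# Value of $wgTiffMaxMetaSize quoted in this thread, in bytes (1 MiB).
TIFF_MAX_META_SIZE = 1048576


def output_size(command, chunk_size=1 << 20):
    """Run a command and count the bytes it writes to stdout, chunk by chunk."""
    total = 0
    with subprocess.Popen(command, stdout=subprocess.PIPE) as process:
        while chunk := process.stdout.read(chunk_size):
            total += len(chunk)
    return total


def exceeds_meta_limit(path, limit=TIFF_MAX_META_SIZE):
    """True if plain `tiffinfo <path>` prints more than `limit` bytes."""
    return output_size(["tiffinfo", path]) > limit
```

For the Brasilia file, the redirected tiffinfo output shown earlier was 603M, several hundred times the configured limit, so `exceeds_meta_limit` would be expected to return True on a host with the same tiffinfo version.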

I found a way to solve the issue :)

I removed the problematic Photoshop-specific TIFF tag 37724, called ImageSourceData, with exiftool:

$ exiftool -ImageSourceData= Brasilia_20170312a0416_Mediana_B151413_modified.tif
    1 image files updated

$ ls -l *.tif
-rw-rw-r-- 1 xxx xxx  84M jul. 10 18:43 Brasilia_20170312a0416_Mediana_B151413_modified.tif
-rw-rw-r-- 1 xxx xxx 328M jul.  8 14:09 Brasilia_20170312a0416_Mediana_B151413.tif

As you can see, removing this tag shrinks the file from 328M to 84M. I uploaded it as a new version of https://commons.wikimedia.org/wiki/File:Brasilia_20170312a0416_Mediana_B151413.tif and Commons is much happier now: the thumbnail was generated instantly.

I'm going to do the same with all other files. Thanks a lot for your help!
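Applying the same fix to a whole directory could look like the sketch below. It assumes exiftool is installed, reuses the `-ImageSourceData=` argument from the command above, and works on `_modified` copies so the originals stay untouched; the helper names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def modified_name(path):
    """Brasilia.tif -> Brasilia_modified.tif, in the same directory."""
    return path.with_name(f"{path.stem}_modified{path.suffix}")


def strip_command(path):
    """exiftool call that clears the ImageSourceData tag in place.

    -overwrite_original stops exiftool from keeping a *_original backup,
    which is acceptable here only because we operate on a copy.
    """
    return ["exiftool", "-overwrite_original", "-ImageSourceData=", str(path)]


def strip_all(directory):
    """Copy every .tif in `directory` and strip the tag from the copy."""
    for source in sorted(Path(directory).glob("*.tif")):
        if source.stem.endswith("_modified"):
            continue  # already processed in a previous run
        target = modified_name(source)
        shutil.copy2(source, target)
        subprocess.run(strip_command(target), check=True)
```

Running `strip_all(".")` in the download directory leaves each original next to its stripped copy, so the sizes can be compared with `ls -l` before anything is re-uploaded.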

I managed to fix all files except these two:

Other than the fact that they are the two biggest files, I have no clue how to fix them so that they are recognized by Commons :(