Include sha1 in mediainfo rdf
Open, Needs Triage, Public

Description

Currently the MediaInfo RDF for a file includes only some basic information. For example, for https://commons.wikimedia.org/wiki/File:Marcos_Correa_-_Trompe_l%27Oeil_-_A1845_-_Hispanic_Society_of_America.jpg (via https://commons.wikimedia.org/wiki/Special:EntityData/M100177437.rdf):

<rdf:Description rdf:about="https://commons.wikimedia.org/entity/M100177437">
<rdf:type rdf:resource="http://wikiba.se/ontology#Mediainfo"/>
<rdf:type rdf:resource="http://schema.org/MediaObject"/>
<rdf:type rdf:resource="http://schema.org/ImageObject"/>
<schema:encodingFormat>image/jpeg</schema:encodingFormat>
<schema:contentUrl rdf:resource="https://upload.wikimedia.org/wikipedia/commons/9/93/Marcos_Correa_-_Trompe_l%27Oeil_-_A1845_-_Hispanic_Society_of_America.jpg"/>
<schema:contentSize rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">130188</schema:contentSize>
<schema:height rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1000</schema:height>
<schema:width rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">550</schema:width>

I believe this all comes from the image table (https://www.mediawiki.org/wiki/Manual:Image_table):

MariaDB [commonswiki_p]> SELECT * FROM image WHERE img_name="Marcos_Correa_-_Trompe_l'Oeil_-_A1845_-_Hispanic_Society_of_America.jpg" LIMIT 1\G
*************************** 1. row ***************************
          img_name: Marcos_Correa_-_Trompe_l'Oeil_-_A1845_-_Hispanic_Society_of_America.jpg
          img_size: 130188
         img_width: 550
        img_height: 1000
      img_metadata: a:1:{s:22:"MEDIAWIKI_EXIF_VERSION";i:2;}
          img_bits: 8
    img_media_type: BITMAP
    img_major_mime: image
    img_minor_mime: jpeg
img_description_id: 190202060
         img_actor: 2518
     img_timestamp: 20210221165644
          img_sha1: 9p8izfst2xbsbhm2qb3vc1iqpxdrbl0
1 row in set (0.01 sec)

The SHA-1 is currently missing from the RDF output. It would be very useful to include it, because it is often used to detect duplicates and to do reconciliation.
Note that the database stores it in base 36. The RDF should contain the usual hexadecimal (base 16) form, as shown on https://commons.wikimedia.org/w/index.php?title=File:Marcos_Correa_-_Trompe_l%27Oeil_-_A1845_-_Hispanic_Society_of_America.jpg&action=info
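
To illustrate the base 36 to base 16 conversion, here is a minimal Python sketch. It is not the MediaWiki or Wikibase implementation. The function names `base36_to_hex` and `hex_to_base36` are hypothetical, and the zero-padding widths (40 hex digits, 31 base 36 digits, matching the length of the `img_sha1` value above) are assumptions.

```python
import string

# Digit alphabet for base 36: "0"-"9" followed by "a"-"z".
_DIGITS = string.digits + string.ascii_lowercase


def base36_to_hex(sha1_base36: str) -> str:
    """Convert a base 36 SHA-1 (as in img_sha1) to 40 hex digits."""
    value = int(sha1_base36, 36)
    # A SHA-1 is 160 bits, i.e. 40 hex digits; keep leading zeros.
    return format(value, "040x")


def hex_to_base36(sha1_hex: str) -> str:
    """Convert a hex SHA-1 to base 36, zero-padded to 31 characters.

    The width of 31 is an assumption based on the stored value above.
    """
    value = int(sha1_hex, 16)
    out = ""
    while value:
        value, remainder = divmod(value, 36)
        out = _DIGITS[remainder] + out
    return out.rjust(31, "0")


if __name__ == "__main__":
    # Value taken from the img_sha1 column in the query result above.
    print(base36_to_hex("9p8izfst2xbsbhm2qb3vc1iqpxdrbl0"))
```

The reverse direction is handy for reconciliation: a hex SHA-1 computed locally (for example with `hashlib.sha1`) can be converted to base 36 and compared against `img_sha1`.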