General SDC data modelling session
Closed, ResolvedPublic

Description

Group session/exercise on basic data modelling challenges for StructuredDataOnCommons at Wikimania-Hackathon-2019 - Thu 15 August 2019 in the afternoon.

Documented in Etherpad: https://etherpad.wikimedia.org/p/SDC_modelling_Wikimania2019

Event Timeline

Session was attended by 10+ participants - many thanks!

Next question: how to process and summarize the input in the Etherpad, and how to translate it into on-wiki input?

https://etherpad.wikimedia.org/p/SDC_modelling_Wikimania2019

Multichill closed this task as Resolved. Sep 29 2019, 4:39 PM

Backing up the etherpad here, follow up at https://commons.wikimedia.org/wiki/Commons:Structured_data/Modeling

https://commons.wikimedia.org/wiki/Commons:Structured_data/Properties_table
https://commons.wikimedia.org/wiki/Special:MostTranscludedPages
https://commons.wikimedia.org/wiki/Commons:Infobox_templates

Wikidata projects

Test images:
https://commons.wikimedia.org/wiki/Wiki_Loves_Monuments_2018_winners#Winners First ten

Date

  • Wikidata (P577 = publication date; P2754 = production date; P571 = inception)
  • Date of creation (P571)
    • Creation of what? The file? What type of creation? When a photo was taken?
  • Date of publication (P577)
  • Date of upload (publication?)
  • Date of modification
  • Date of destruction (of digital copy if file is destroyed)
  • Start and end date if applicable
  • Date properties should be direct (top-level) statements so that qualifiers can be added
    • Qualifiers: earliest date (P1319), latest date (P1326), refine date (P4241)
  • How to tell user (like another site) which date they should show?
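
A minimal sketch of the "direct date statement plus qualifiers" idea above, written as a Python dict that loosely follows the Wikibase JSON layout. The helper names (`year_snak`, `dated_statement`) and the example years are assumptions for illustration only; the notes do not settle which property should carry which date.

```python
# Wikibase identifies the proleptic Gregorian calendar by this item URI.
GREGORIAN = "http://www.wikidata.org/entity/Q1985727"
YEAR_PRECISION = 9  # Wikibase precision code meaning "year"


def year_snak(prop: str, year: int) -> dict:
    """Build a value snak holding a year-precision time for property `prop`."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {
            "type": "time",
            "value": {
                "time": f"{year:+05d}-00-00T00:00:00Z",
                "timezone": 0,
                "before": 0,
                "after": 0,
                "precision": YEAR_PRECISION,
                "calendarmodel": GREGORIAN,
            },
        },
    }


def dated_statement(prop: str, year: int, earliest: int, latest: int) -> dict:
    """Top-level date statement with earliest (P1319) / latest (P1326) qualifiers."""
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": year_snak(prop, year),
        "qualifiers": {
            "P1319": [year_snak("P1319", earliest)],
            "P1326": [year_snak("P1326", latest)],
        },
    }


# Hypothetical example: inception (P571) around 1890, bounded by 1885 and 1895.
inception = dated_statement("P571", 1890, earliest=1885, latest=1895)
print(inception["mainsnak"]["datavalue"]["value"]["time"])
```

Because the date is the main value rather than a qualifier on something else, further qualifiers can be attached to it without needing a second qualifier level.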

Authorship

Author (P50) can be used if the author has a Wikidata item
If the author does not have a Wikidata item, there are a couple other possibilities to identify the author:

  • Author name string (P2093) can be used if the author doesn't have a Wikidata item
  • Wikimedia username (P4174) can be used if the author has a Wikimedia username --> COMMENT: May be best indicated using "author" = "somevalue" with qualifier "Wikimedia username" = Username .
  • Conventions need to be established on how to designate anonymous and unknown authors
  • Should we use the Anonymous (Q4233718) and/or Unknown (Q24238356) items? --> COMMENT: There is quite developed practise on Wikidata now for paintings of unknown/anonymous/pseudonymous authorship
  • Should there be a way to set unknown value from the Add statement interface? --> COMMENT: "Unknown" in the Wikidata / SDC UI should be renamed "somevalue", as per the underlying software, because this is what the value actually means.
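
One possible reading of the options above, as a rough sketch: use Author (P50) with a Q-id when an item exists, and otherwise follow the COMMENT suggestion of author = "somevalue" with qualifiers for the name string (P2093) and/or Wikimedia username (P4174). The function name and the example username are hypothetical, and the dict only loosely follows the Wikibase JSON layout.

```python
from typing import Optional


def string_snak(prop: str, text: str) -> dict:
    """Value snak carrying a plain string (also used for external IDs)."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {"type": "string", "value": text},
    }


def author_statement(qid: Optional[str] = None,
                     name: Optional[str] = None,
                     username: Optional[str] = None) -> dict:
    """Build an author (P50) statement from whatever identification is available."""
    if qid:
        # The author has a Wikidata item: point at it directly.
        mainsnak = {
            "snaktype": "value",
            "property": "P50",
            "datavalue": {
                "type": "wikibase-entityid",
                "value": {"entity-type": "item", "id": qid},
            },
        }
        return {"type": "statement", "rank": "normal", "mainsnak": mainsnak}

    # No item: author = somevalue, identified through qualifiers.
    qualifiers = {}
    if name:
        qualifiers["P2093"] = [string_snak("P2093", name)]
    if username:
        qualifiers["P4174"] = [string_snak("P4174", username)]
    if not qualifiers:
        raise ValueError("need a Q-id, a name string, or a username")
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": {"snaktype": "somevalue", "property": "P50"},
        "qualifiers": qualifiers,
    }


# "ExampleUser" is a made-up username for illustration.
print(author_statement(username="ExampleUser")["mainsnak"]["snaktype"])
```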

Author properties

  • If the author has an item, most relevant data about the author should be pulled automatically from the author item
  • If the author doesn't have an item, we should create qualifiers under the Author name string or Wikimedia username value, e.g. Date of death (P570), Official website (P856), Flickr user ID (P3267), etc. --> COMMENT: For consistency, use "somevalue" with qualifier "stated as" rather than "author name string" ?

Numerous Wikidata properties are available for author IDs in various databases, but we should probably pull this information automatically from the Author item in most cases.
The role of each author could be specified as a qualifier to the Author or Author name string value using the Subject has role (P2868) property. For example, photographer, painter, architect, sculptor, etc.
A new property is probably needed for Author attribution (which accepts a string). This will likely go under the licensing data, however, rather than the authorship data. There are only 13 attribution templates per https://commons.wikimedia.org/wiki/Category:Attribution_templates

  • Creator templates : Probably should all be managed through Wikidata Author items, but we need a way to check if all the data from the templates is on Wikidata. Eventually, these will be replaced entirely --> COMMENT: Conversely, display extended Creator style information in a field on the file description page, if we have an author statement with a Q-item. Ultimately this approach could provide all the visible functionality of a creator template, without needing any creator templates

Uploader to be treated separately from authorship.

Source

Get the data!   If we look at 100,000 random images, what is in the source field ?
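
A rough sketch of how such a survey of the source field could be tallied, assuming the raw source-field strings have already been extracted from the file pages (the extraction itself is not shown). The category names, the regular expressions, and the sample strings are all assumptions for illustration, not an agreed classification.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
# Matches {{own}} / {{own work}} or a bare "own work" string.
OWN_RE = re.compile(r"\{\{\s*own(\s+work)?\s*\}\}|^\s*own work\s*$", re.IGNORECASE)


def classify_source(text: str) -> str:
    """Assign a source-field string to one coarse bucket."""
    if not text.strip():
        return "empty"
    if OWN_RE.search(text):
        return "own work"
    if URL_RE.search(text):
        return "url"
    return "free text"


def tally(sources) -> Counter:
    """Count how many source fields fall into each bucket."""
    return Counter(classify_source(s) for s in sources)


# Made-up sample values standing in for a random sample of files.
sample = [
    "{{own}}",
    "Own work",
    "https://www.flickr.com/photos/example/1",
    "Scan from a family album",
    "",
]
for bucket, count in tally(sample).most_common():
    print(bucket, count)
```

The "free text" bucket is the part that would need manual review, which relates to the later question of how to record completeness of extraction from the source field.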

Immediate source of image

  • Own work --> only for photos of people & places ?
    • distinguish photos/scans of their own artwork
    • created digital original drawings/artwork
    • created diagrams -- software used ?

      Easy enough to create a Q-item for "original creation by uploader" (now created as Q66458942) as value for a master "Source of file" property. NB: Temp property P828 used for this role. Will need to be replaced. A bot might identify cases that look dubious, and mark them with a qualifier. BUT: if we have this model with a top-level statement we can't have any second level of qualifiers to clarify the nature of statements being made in first-level qualifiers, eg one might want "applies to part", or "sourcing circumstances", or to distinguish immediate vs ultimate source URL
  • From the internet

    Q-item for "file available on the internet", with qualifiers specifying detailed provenance Q-item for "user modification of file available on the internet"
    • Which property will point to this Q item? A new property: Source, taking a value indicating the nature of the source, with qualifiers adding further info. A "source" statement of this kind should become mandatory, with a limited closed vocabulary of possible values. Make upload wizard enforce the making of a choice.

  • Commons best practice: URL for image + URL for description by source --> two qualifiers for this ? ADDED: The "description by source" URL might be well handled by P973 "described at url" as a separate main statement. ISSUE: We might *only* have the description page URL -- and it might no longer exist. So we might need to specify that an image used to exist at a particular institution (or website), but not be able to say what the URL used to be.
  • In practice might have:
    • Some url
    • Some url with a description
    • Some url with a source site (url + Flickr, or Europeana, Internet Archive Books)
      • Identifier properties are subclass of source url
        • --> Q. Do we want to start minting new properties for identifiers from such sites, or just use URLs as per others. What are pros and cons ? Is this a workaround for not being able to find URLs that start with .... in SPARQL (because the indexing isn't there?)
  • What sorts of free text do we find in the source fields ?
  • maybe this is the last 20% we should try to capture, after we've got the easiest 80%. But how to assess/record completeness of extraction from source field ?
  • Sources which are offline, but eg which have been scanned
    • eg images from art books --> full bibliographic
  • See also other version section for derived works Q-item as top-level value to indicate "derived from file or files on Commons" ?
  • Comment: "Other version" is only relevant if we host the work(s) that the file was derived from. But we may not. eg scan of a page from a book, diagrams based on a diagram in a book (simple enough so no copyright), photograph of a copyright-expired painting, a photo of a dress based on a Mondrian painting "based on" property

Also: some operations -- rotation, colour modification, cropping, etc may have been undertaken by user prior to upload.

  • so distinguish "scan of image" from "user-modified scan of image" in top-level source statement ?
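
The idea of a mandatory "source" statement with a limited closed vocabulary could look roughly like the sketch below. Only Q66458942 comes from the notes; the other two values are named in the notes but have no Q-id there, so they are left unmapped. `SourceKind`, `KNOWN_ITEMS` and `require_source` are hypothetical names, and this is not how the upload wizard actually works.

```python
from enum import Enum


class SourceKind(Enum):
    """Closed vocabulary of top-level source values discussed in the notes."""
    OWN_WORK = "original creation by uploader"
    FROM_INTERNET = "file available on the internet"
    MODIFIED_FROM_INTERNET = "user modification of file available on the internet"


# Only the first value has a Q-item mentioned in the notes.
KNOWN_ITEMS = {SourceKind.OWN_WORK: "Q66458942"}


def require_source(choice: str) -> SourceKind:
    """Reject uploads that do not pick one of the allowed source kinds."""
    try:
        return SourceKind[choice]
    except KeyError:
        allowed = ", ".join(kind.name for kind in SourceKind)
        raise ValueError(f"source must be one of: {allowed}") from None


kind = require_source("OWN_WORK")
print(kind.value, KNOWN_ITEMS.get(kind))
```

Distinguishing "scan of image" from "user-modified scan of image" would then mean adding further members to the vocabulary rather than adding free text.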

Source of things shown within the image

(eg : a photo of a 2D collage of objects)

Esp. important because these things may have different copyright status
  -- qualifiers below "depicts" statement ?
      -- how to indicate things if there is no obvious Q-item for something in the image, but nevertheless one wants to identify it & record information relating to it?   Should a "depicts" = "somevalue" statement be created to record information about particular parts of the image ?

Will often be handled by the Q-items for the value(s) of the depicts statements

Other

  • copyright checkers may be closely tied to source: should the statements be similarly related -- or is it enough to put verification info as a qualifier or reference on the copyright status. Will SDC even have/display references ?

Metadata has provenance too

eg {{tl|BL cat credit}}

  • on Wikidata we would indicate this in references, statement by statement. But will Commons have references?

License/copyright status

Need to both show complexity of copyright situation as well as straightforward information to end-user on usability of image

  • Public domain is a status, cc-by-... is a license, we need both + what about rightsstatements.org declarations?
  • Attribution is important (even if it's legally not required)
  • Strong connection with authorship

we need publication date besides creation dates (copyright relies on publication date, and in special situations creation date)
we need copyrightholder besides author
'attributed as' how to deal with that? 'author name string' P2093? or new property? P2093 will also be used for the normal names as mentioned, attribution names will differ. Better to have a specific attribution property

Restrictions
Portrait rights, that is a right for the depicted persons, how could we model that: 'depicts': qid/unknown and qualifier for rights?
usage restriction-property / reproduction restriction property
We already have 'copyright exemption' property: https://www.wikidata.org/wiki/Property:P7152
https://commons.wikimedia.org/wiki/Category:Non-copyright_restriction_templates
Traditional knowledge restrictions: https://www.loc.gov/collections/ancestral-voices/about-this-collection/rights-and-access/
Could we link a law or UN treaty to a certain subject and in that way notify a user if that subject is depicted?

We could use 'depicts' that links to wikidata. In wikidata there could be a property that links to certain restrictions
Swastika symbol, trademark etc: a 'usage restriction' property in Wikidata, so we move the information out of Commons

  • License review

Start with Creative Commons licenses, PD-licenses are difficult because template information is complex

  • Permission (OTRS)

Freedom of panorama

Other versions

  • A way to link to other files, used in quite a broad way
  • Let's cover all linked cases here
  • Extracted from (for crops etc.)
  • Superseded
  • Derived works
  • (..... add more .....)
  • Need some catch all property, qualifier for media legend
  • jpg/tiff – cropped image only (jpg) vs full object with a color target (tiff) – IIIF would solve this issue
  • cropped image vs full images with passe-partout

Related images

  • reverse side
  • rephotograph
  • other exemplars of the same image (issue of object vs image) – photographic negative, paper photo, postcard of the same photo
  • similar images

Location data

Exif duplication

  • ... (link)

Partnership templates

  • Things like Wiki Loves Monuments
  • Uploads by GLAMs
  • Supported by ....

Quality things

  • Commons quality assessment (Valued Image, Quality Image, Featured Picture). The assessment - https://www.wikidata.org/wiki/Property:P6731
    • Commons quality assessment (P6731)
      • point in time
      • wikimedia project (for FP on enwiki)
      • nomination page
  • Also use for (winners of) competitions like WLM?
    • award received (P166) →Item for Winner of Wiki Loves Monuments
      • point in time (P585) → year (e.g. 2016)
      • ranking (P1352)
      • edition could probably be a qualifier for each national using P17

****Actually, some contests can have their own item (WLM Italy = Q19960422). / Yes, then it's probably easier to have an item "Winner of Wiki Loves Monuments in Italy" and then just ranking and year (no, it's better not)

  • Problem to solve group of country editions ?
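
To make the award modelling above concrete, here is a simplified sketch (flat dicts, not full Wikibase JSON) of an "award received" (P166) statement with point in time (P585), ranking (P1352) and an optional country (P17) qualifier. The notes do not give a Q-id for "Winner of Wiki Loves Monuments", so `WLM_WINNER` below is a placeholder string, not a real item; Q38 is the Wikidata item for Italy.

```python
WLM_WINNER = "Q_WLM_WINNER"  # placeholder, no such Q-id is given in the notes


def award_statement(award, year, ranking, country=None):
    """Award received (P166) with year, ranking and optional country qualifiers."""
    qualifiers = {"P585": year, "P1352": ranking}
    if country is not None:
        qualifiers["P17"] = country
    return {"property": "P166", "value": award, "qualifiers": qualifiers}


def top_ranked(statements, n=3):
    """Return the n best-ranked award statements (rank 1 first)."""
    return sorted(statements, key=lambda s: s["qualifiers"]["P1352"])[:n]


awards = [
    award_statement(WLM_WINNER, 2018, ranking=2, country="Q38"),
    award_statement(WLM_WINNER, 2018, ranking=1, country="Q38"),
]
print([a["qualifiers"]["P1352"] for a in top_ranked(awards)])
```

With the country as a qualifier, one generic winner item serves all national editions, which is one possible answer to the "group of country editions" question; the alternative of one item per national contest is also raised in the notes.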

Templates by usage / importance

Information template

Artwork template

  • https://commons.wikimedia.org/wiki/Template:Artwork
  • Already heavy Wikidata integration
  • artist -
  • author -
  • title -
  • description -
  • depicted people -
  • date -
  • medium -
  • dimensions -
  • institution -
  • department -
  • place of discovery -
  • object history -
  • exhibition history -
  • credit line -
  • inscriptions -
  • notes -
  • accession number -
  • place of creation -
  • source -
  • permission -
  • other_versions -
  • references -
  • depicted place -
  • wikidata -

Photograph template

  • https://commons.wikimedia.org/wiki/Template:Photograph
  • photographer -
  • title -
  • description -
  • depicted people -
  • depicted place -
  • date -
  • medium -
  • dimensions -
  • institution -
  • department -
  • references -
  • object history -
  • exhibition history -
  • credit line -
  • inscriptions -
  • notes -
  • accession number -
  • source -
  • permission -
  • other_versions -
  • wikidata -
  • camera coord -
  • original description -
  • biased

Art photo template

  • wikidata -
  • artwork license -
  • photo description -
  • photo date -
  • photographer -
  • source -
  • photo license -
  • other_versions -
  • artist -
  • title -
  • description -
  • date -
  • medium -
  • dimensions -
  • institution -
  • location -
  • references -
  • object history -
  • exhibition history -
  • credit line -
  • inscriptions -
  • notes -
  • accession number -
  • artwork license -
  • place of creation -
  • photo description -
  • photo date -
  • photographer -
  • source -
  • photo license -
  • other_versions -
  • wikidata -

Music work template

  • composer --> P86
  • lyrics_writer --> P676
  • performer --> P175
  • title --> P1476
  • description --> maybe "recording or performance of (P2550)"
  • composition_date --> P577 (but it regards the *first* composition)
  • performance_date --> P577 (but it regards the *first* composition)
  • notes -->
  • record_ID --> it depends on the file (but it's an identifier)
  • image -->
  • references -->
  • source -->
  • permission -->
  • other_versions -->
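
The field-to-property list above can be captured as a small lookup table, which also makes it easy to see which template fields still lack a proposed property. This is only a transcription of the list in these notes (including the tentative P2550 for description); the variable and function names are made up.

```python
# Template field -> proposed property, copied from the list above.
# None means no property was proposed in the notes.
MUSIC_WORK_FIELDS = {
    "composer": "P86",
    "lyrics_writer": "P676",
    "performer": "P175",
    "title": "P1476",
    "description": "P2550",  # tentative: "maybe" in the notes
    "composition_date": "P577",
    "performance_date": "P577",
    "notes": None,
    "record_ID": None,  # depends on the file
    "image": None,
    "references": None,
    "source": None,
    "permission": None,
    "other_versions": None,
}


def unmapped_fields(mapping):
    """List template fields that have no proposed property yet."""
    return [field for field, prop in mapping.items() if prop is None]


print(unmapped_fields(MUSIC_WORK_FIELDS))
```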

Book template

https://commons.wikimedia.org/wiki/Template:Book
used on 785 273 pages (source: https://commons.wikimedia.org/w/index.php?title=Special:MostTranscludedPages&limit=300&offset=0 )
Does anyone know how many files are PDFs?
How often is each field used in this template?

  • Template used on files that are not books (like single page from a book)
    • Can we distinguish the number files by types? jpg/pdf/djvu ; it seems to be mainly used on single image

Use case of https://commons.wikimedia.org/wiki/File:Gray356.png: there are 2 templates, one specific to this image and one for the book it comes from. And instead of filling in the template every time, there is a pre-filled template: {{Gray's Anatomy}}.

FRBR

  • Author - P50 item, often uses a creator template ! There can be multiple authors, or authors of parts of the book (to store with a qualifier, P518)
  • Translator - P655 item
  • Editor - P98 item ???
  • Illustrator - P110 item
  • Title - P1476 monolingual text (how to know the language ? assume it's the same as P407 below)
  • Subtitle - P1680 monolingual text
  • Series title - monolingual text ??
  • Volume - ??
  • Edition - P393 string
  • Publisher - P123 ? P872 ?? item
  • Printer - P872 ?? item
  • Publication date - P577 time (in which calendar ?)
  • City - P291 (place of publication)
  • Language - P407
  • Description - ????
  • Source - ???
  • Permission - ???
  • Image - P18 (is it actually used like that on Wikidata)
  • Image page - ??? number of the page where the cover is
  • Pageoverview -
  • Wikisource - P1957 (Wikisource Index page) ?
  • Homecat -
  • Other_versions -
  • ISBN - P957 (ISBN10) - P212 (ISBN 13)
  • LCCN -
  • OCLC - P243
  • References -
  • Linkback -
  • Wikidata - NA ??
  • Better duplication rather than the bad duplication of data in the Templates.
  • VERY IMPORTANT: this template is used to autofill the Index pages on Wikisources, don't break that please
  • Is this template for an edition, an exemplar, or a specific digitized copy? Unclear for now; the template does not specify and is used in various ways.
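
As a small illustration of the ISBN line in the mapping above, the sketch below picks between P957 (ISBN-10) and P212 (ISBN-13) by counting the characters left after removing hyphens and spaces. It deliberately does not verify check digits, and the function name is hypothetical.

```python
def isbn_property(isbn: str) -> str:
    """Return the property for an ISBN string: P957 (ISBN-10) or P212 (ISBN-13)."""
    compact = isbn.replace("-", "").replace(" ", "")
    if len(compact) == 10:
        return "P957"
    if len(compact) == 13:
        return "P212"
    raise ValueError(f"not a 10- or 13-character ISBN: {isbn!r}")


print(isbn_property("978-3-16-148410-0"))
print(isbn_property("0-306-40615-2"))
```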

Book - Data Modeling for SDC

  • Institution/Department
  • Accession number
    • Accessed on
  • Digitized by = property that may need to be created.

Followup actions:

  • check our assumptions
  • Template on Wikisource should not break

Map template

title
wikidata title
description
legend
author
imgen
date
source
permission
license
map date
location
wikidata location
type
projection
scale
zoom
heading
latitude
longitude
warp status
warp url
set
wikidata set
sheet
book author
wd book author
book title
wikidata book
volume
page
language
publication place
publisher
print date
ISBN
LCCN
OCLC
institution
accession number
id
uri
dimensions
size
scan resolution
medium
technique
credit line
inscriptions
notes
other versions
references
demo
other fields

Map

NOTE: Based on https://www.wikidata.org/wiki/Wikidata:WikiProject_Maps/Historical_map_properties NOTE: See also https://www.wikidata.org/wiki/Wikidata:WikiProject_Maps/stats for properties most often used at present on Wikidata items for maps
  • instance of P31 Any subclass of (P279) map (Q4006) or manuscript (Q87167) COMMENT: Is P31 the right property for this ?
  • title P1476
  • subtitle P1680
  • inception P571
  • language of work or name P407

<s>* image P18</s> COMMENT: Not relevant for SDC

  • Creator P170
  • possible creator P1779
    • object has role P3831, examples: engraver (Q329439), illustrator (Q644687), land surveyor (Q294126)
  • depicts P180 -- for what the image is a map of, eg Paris ; see also P921: main subject. Both are currently in use on Wikidata items for maps
  • Geographic coordinates P625
  • spatial reference system P3037
  • scale P1752 Proportional ratio of a linear dimension of a model, map, etc, to the same feature of the original - 1:n. Use 1 for lifesize, positive numbers for smaller than lifesize, negative numbers for larger. Map scale with value-unit pairs (such as 1 verst to an inch) has to be converted into ratio to 1.

    Map bounding box requires new properties, since the properties coordinates of northernmost point (P1332), coordinates of southernmost point (P1333), coordinates of easternmost point (P1334) and coordinates of westernmost point (P1335) define coordinate pairs. Alternatives could be upperleft and lowerright corners using coordinate pairs or Northern, Eastern, Southern and Western using single coordinate values. See proposal at Wikidata:Property_proposal/bounding_box (which was not done).
    • In practice P1332 to P1335 are quite widely used
  • part of the series P179
  • publisher P123
  • place of publication P291
  • language of work or name P407
  • printed by P872
  • edition number P393
  • edition or translation of P629
  • publication date P577
  • license P275
  • published in P1433
  • volume P478
  • location P276
  • collection P195
  • inventory number P217
  • exemplar of P1574
  • inscription P1684
  • material used P186
  • color P462
  • height P2048
  • width P2049
    • applies to part P518
  • commissioned by P88
  • owned by P127
  • location of final assembly P1071
  • exhibition history P608
  • catalog code P528
  • described at URL P973

TO DO: additions to make, based on list of template fields above. Also note template fields that might need free-form text, or otherwise be unsuitable for putting into SDC statements (cf description pages vs Wikidata items for some existing maps)
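
Two of the map notes above lend themselves to a small worked example: converting a value-unit scale such as "1 verst to an inch" into the 1:n ratio expected by scale (P1752), and deriving a bounding box from the four extreme-point coordinate pairs (P1332 to P1335). The sketch assumes 1 verst = 1066.8 m and 1 inch = 0.0254 m; the function names and the sample coordinates are illustrative, and the bounding box ignores maps that cross the 180° meridian.

```python
from fractions import Fraction

# Unit lengths in metres, kept exact by parsing decimal strings.
UNIT_METRES = {"verst": Fraction("1066.8"), "inch": Fraction("0.0254")}


def scale_denominator(ground_amount, ground_unit, map_amount, map_unit):
    """Return n for a 1:n scale from e.g. '1 verst to 1 inch'."""
    ground = Fraction(ground_amount) * UNIT_METRES[ground_unit]
    on_map = Fraction(map_amount) * UNIT_METRES[map_unit]
    return ground / on_map


def bounding_box(north, south, east, west):
    """Derive (min_lat, min_lon, max_lat, max_lon) from four (lat, lon) pairs.

    north/south/east/west correspond to P1332/P1333/P1334/P1335.
    """
    return (south[0], west[1], north[0], east[1])


print(scale_denominator(1, "verst", 1, "inch"))

# Made-up coordinates, for illustration only.
print(bounding_box(north=(48.90, 2.35), south=(48.82, 2.34),
                   east=(48.85, 2.47), west=(48.86, 2.22)))
```

Only one coordinate of each extreme point is used for the box, which is why the notes observe that the existing properties define coordinate pairs rather than single edge values.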

ISSUE: How to deal with images that contain multiple maps, eg main map + one or more inset maps

  • The inset map is likely to have its own depicts / subject, centroid + bounding box; perhaps also georeferencing, scale, projection, based on, date depicted. It will also have its own position or boundary within image.
    • if we keep all of these as first-level properties (to allow them to be qualified if desired), how to identify that various of these statements all refer to the same part of the image?

Possibly we need to define some standard Q-items for sub-parts of an image, eg "Inset map", "Inset map #1", "Inset map #2", "sub map A", "sub map B" etc (the latter where there is no obvious main map - subsidiary map divide),
and then make statements "File:XYZ map" has parts "sub map 1", "sub map 2" etc, qualified with position and/or bounding box within the image, followed by statements such as
"File:XYZ" depicts "Paris", applies to part: "sub map 1".
It's not a beautiful data model, but it might be workable.
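
A small sketch of how a consumer could reassemble such statements: every top-level statement carries an "applies to part" (P518) qualifier naming the sub-part, and grouping by that qualifier recovers which statements belong together. The statements are simplified flat dicts, and the part labels and values are placeholders rather than real Q-items.

```python
from collections import defaultdict


def group_by_part(statements):
    """Group (property, value) pairs by their 'applies to part' (P518) qualifier."""
    grouped = defaultdict(list)
    for st in statements:
        part = st.get("qualifiers", {}).get("P518", "whole image")
        grouped[part].append((st["property"], st["value"]))
    return dict(grouped)


# Hypothetical statements on "File:XYZ map".
statements = [
    {"property": "P180", "value": "France"},
    {"property": "P180", "value": "Paris", "qualifiers": {"P518": "sub map 1"}},
    {"property": "P1752", "value": 25000, "qualifiers": {"P518": "sub map 1"}},
    {"property": "P180", "value": "Lyon", "qualifiers": {"P518": "sub map 2"}},
]

for part, facts in group_by_part(statements).items():
    print(part, facts)
```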

QUESTION: Should every map have its Wikidata item? If so, which metadata will be stored on Wikidata (related to the object), and which metadata is specific to the file?

  • *Many* maps will have a Wikidata item, eg if they are objects catalogued by libraries in their own right. File-specific metadata might include eg georeferencing data.
  • But not all maps will have a Wikidata item, eg for maps extracted from scanned books, possibly only the book would have a wikidata item

Wikidocumentaries

depicts (P180)
located at street address (P6375)
collection (P195)
creator (P170)
subject has role (P2868)
inception (P571)
start time (P580), end time (P582) / earliest date (P1319), latest date (P1326)
date depicted (P2913)
start time (P580), end time (P582) / earliest date (P1319), latest date (P1326)
set in period (P2408)
P31
Commons compatible image available at URL (P4765)
described at URL (P973)
inscription (P1684)
collection (P195)
height (P2048)
width (P2049)
depicts (P180) + preferred rank.
Relative position within image (P2677)
aspect ratio (P2061)
relative position within image (P2677)
checksum (P4092)
determination method (P459)
field of view (P4036)
focal length (P2151) meters and millimeters
digital representation of (P6243)
coordinates of the point of view (P1259)

General issues and concerns

  • We need a general policy around which creative works should have a Wikidata item, which not
  • We need a migration strategy for those files that represent creative works that don't have Wikidata items yet, but should have one
  • Data duplication and asynchronicity between Wikidata and Wikimedia Commons
  • Exif duplication (see above)
  • Input from Albin Larsson: property constraints on Wikidata may not apply on Wikimedia Commons. A Wikidata item may have one specific value for an identifier, for instance, while a file on Commons may have several values for one identifier statement. In the second case that will be entirely correct and should not throw a property constraint. (use case RAÄ)
  • Do we want references on Commons statements?
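
The point about property constraints can be illustrated with a toy check: a single-value rule that is reported for Wikidata items but skipped for Commons files. This is a sketch of the concern only, not how constraint checking is implemented; the property ID `P_ID` is a placeholder, and the entity type names "item" and "mediainfo" are used here simply to tell the two kinds of entity apart.

```python
from collections import Counter


def single_value_violations(statements, single_value_props, entity_type):
    """Return properties that break a single-value rule.

    statements: list of (property, value) pairs on one entity.
    The rule is skipped for Commons files ("mediainfo"), where several
    values for one identifier can be entirely correct.
    """
    if entity_type == "mediainfo":
        return []
    counts = Counter(prop for prop, _ in statements)
    return sorted(p for p in single_value_props if counts[p] > 1)


# "P_ID" stands in for some identifier property.
stmts = [("P_ID", "abc-1"), ("P_ID", "abc-2")]
print(single_value_violations(stmts, {"P_ID"}, "item"))
print(single_value_violations(stmts, {"P_ID"}, "mediainfo"))
```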

Instance of

  • Question: Is P31 the right property here? Would a new property specific to this use make more sense, eg "nature of image" ?
  • Question: If one wants to indicate eg that it is a map, or an engraving, that the image is a representation of, how to do this ?

Files that represent creative works

Takeaways

Easy cases to start with:

  • Own work
  • with CC licenses
  • Wiki Loves Monuments files are good examples to work with

Albin and Susanna: we need to be able to indicate provenance of specific statements, eg that they come from tools, AI... - would be good to do this in a uniform way and to have community consensus about it; human translation / various points of transformation and interpretation of the metadata / create a Phabricator ticket about this