Standardize returned short-name titles for Creative Commons licenses
Open, Medium, Public, 5 Estimated Story Points

Description

Users of the Attribution API may want or need to adapt their experience to the type of license used by the reused content. For example, the Future Audiences team is interested in displaying a specific license icon depending on the license in use. To support use cases like this, we should normalize license values to the best of our ability so that reusers can handle them confidently.

Although the license type is a value set by hand by the community, the Creative Commons organization defines style standards that we can adopt to transform the data before it is returned.

Conditions of acceptance

  • Standardize how license titles are returned for the most common license types
    • Follow rules set out in https://creativecommons.org/licenses/list.en to support jurisdiction, igo, and other permutations.
    • Always return appropriate capitalization; abbreviations should be all caps (for example, CC BY-SA 3.0 IGO, CC BY-SA 2.0 DE)
      • If 'Generic' is used, do not include it
      • If 'Unported' is used, return the full word
  • For public domain licenses, always return "PDM" for public domain mark, per CC standards
    • Includes both "Public domain" and "PD-XX" classifications
  • If an unknown license type is present, return it in the raw format
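
The conditions above can be read as a small set of rewrite rules. The following is a minimal sketch only, not existing MediaWiki code: the function name normalizeCcShortName is hypothetical, and the accepted qualifiers (Generic, Unported, IGO, or a two-letter jurisdiction code) are an assumption about what "the most common license types" covers. Anything that does not fit is returned raw.

```php
<?php
// Hypothetical helper illustrating the acceptance criteria; not part of MediaWiki.
function normalizeCcShortName( string $raw ): string {
	// Collapse runs of whitespace so stray spaces do not affect matching.
	$text = trim( preg_replace( '/\s+/', ' ', $raw ) ?? $raw );

	// "Public domain" and "PD-XX" style values all become the Public Domain Mark.
	if ( preg_match( '/^(?:public domain|pd(?:-[a-z0-9-]+)?)$/i', $text ) === 1 ) {
		return 'PDM';
	}

	// Licence code, version, and an optional qualifier (assumed set, see above).
	$pattern = '/^CC (BY(?:-(?:NC|ND|SA))*) (\d\.\d)(?: (Generic|Unported|IGO|[A-Z]{2}))?$/i';
	if ( preg_match( $pattern, $text, $m ) !== 1 ) {
		// Unknown licence type: hand back the raw value untouched.
		return $raw;
	}

	// Abbreviations are always upper-cased.
	$title = 'CC ' . strtoupper( $m[1] ) . ' ' . $m[2];
	$qualifier = strtoupper( $m[3] ?? '' );
	if ( $qualifier === '' || $qualifier === 'GENERIC' ) {
		// "Generic" is dropped.
		return $title;
	}
	if ( $qualifier === 'UNPORTED' ) {
		// "Unported" is kept as the full word.
		return $title . ' Unported';
	}
	// IGO and jurisdiction codes stay as upper-case abbreviations.
	return $title . ' ' . $qualifier;
}
```

With this sketch, 'cc by-sa 3.0 igo' becomes 'CC BY-SA 3.0 IGO', 'CC BY 2.0 Generic' becomes 'CC BY 2.0', 'PD-US' becomes 'PDM', and 'GFDL 1.3' is returned unchanged.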

Implementation details

The attached image (image.png) shows some of the license values currently being returned.

Claude provided a regex to use as a starting point for all CC and PDM licenses:

'\bCC0(?:\s+1\.0)?|PDM\s+1\.0|\bCC BY(?:-(?:NC(?:-(?:ND|SA))?|ND|SA))? (?:1\.0|2\.0|2\.1|2\.5|3\.0|4\.0)(?:\s+(?:IGO|Unported|Generic|[A-Z]{2}(?:\s+[A-Z]+)*))?\b'
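
As a rough illustration of how this pattern behaves in PHP, the sketch below wraps it in "/" delimiters without changing it and returns the first match. The names CC_TITLE_PATTERN and findCcTitle are hypothetical. As written, the pattern is case-sensitive and expects exactly one space inside "CC BY" and before the version number.

```php
<?php
// The pattern from the task, wrapped in "/" delimiters for PCRE and otherwise unchanged.
const CC_TITLE_PATTERN = '/\bCC0(?:\s+1\.0)?|PDM\s+1\.0|\bCC BY(?:-(?:NC(?:-(?:ND|SA))?|ND|SA))? (?:1\.0|2\.0|2\.1|2\.5|3\.0|4\.0)(?:\s+(?:IGO|Unported|Generic|[A-Z]{2}(?:\s+[A-Z]+)*))?\b/';

// Hypothetical helper: returns the first licence title found in $text, or null.
function findCcTitle( string $text ): ?string {
	return preg_match( CC_TITLE_PATTERN, $text, $m ) === 1 ? $m[0] : null;
}
```

For example, findCcTitle( 'CC BY-SA 3.0 IGO' ) returns the full string and findCcTitle( 'CC BY-SA 4.0 ' ) returns 'CC BY-SA 4.0' without the trailing space, while 'CC  BY-SA 4.0' (two spaces) and 'cc by-sa 4.0' both return null.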

Event Timeline

Licence information comes from two places:

When fetching articles, pages, talk pages, etc. on wikis, it comes from configuration:
	'url' => $this->options->get( MainConfigNames::RightsUrl ),
	'title' => $this->options->get( MainConfigNames::RightsText )

source: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/2efa5409d70bb2437cbd2c3276f1f96264961f2a/includes/Rest/Handler/Helper/PageContentHelper.php#258

This one is relatively easy: we would need to go through the configs (https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/3ff486473879abe7eafcf61e60f09784ee5071eb/wmf-config/InitialiseSettings.php#11368) and most likely provide a new config, RightsTextShort.
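
A minimal sketch of what the fallback could look like, assuming the proposed RightsTextShort setting is optional. The setting does not exist in MediaWiki today, the function name buildLicenceInfo is hypothetical, and a plain array stands in for the real config object so the example is self-contained.

```php
<?php
// Hypothetical: 'RightsTextShort' is the proposed setting and does not exist yet.
// A plain array stands in for MediaWiki's config object.
function buildLicenceInfo( array $config ): array {
	return [
		'url' => $config['RightsUrl'],
		// Prefer the short title when configured, else fall back to the long text.
		'title' => $config['RightsTextShort'] ?? $config['RightsText'],
	];
}
```

Wikis that have not set the short value would keep returning the existing RightsText, so the new setting could be rolled out per wiki.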

When fetching media, it comes from templates

This licence info comes from templates; the texts are stored on all projects as freehand text.

For example, the template for CC-BY-SA-4.0 on Commons renders the following (source: https://commons.wikimedia.org/wiki/Template:Cc-by-sa-4.0):

<div class="rlicense-text">
  <span class="licensetpl_link" style="display:none;">https://creativecommons.org/licenses/by-sa/4.0</span>
  <span class="licensetpl_short" style="display:none;">CC BY-SA 4.0 </span>
  <span class="licensetpl_long" style="display:none;">Creative Commons Attribution-Share Alike 4.0 </span>
  <span class="licensetpl_link_req" style="display:none;">true</span>
  <span class="licensetpl_attr_req" style="display:none;">true</span>
</div>

CommonsMetadata then extracts the values from the rendered template elements with the classes .licensetpl_short, .licensetpl_link, .licensetpl_long, etc.
The template may differ between projects, for example on Polish Wikipedia (source: https://pl.wikipedia.org/wiki/Szablon:CC_BY-SA_4.0):

<span class="licensetpl" style="display:none;">
  <span class="licensetpl_link">https://creativecommons.org/licenses/by-sa/4.0</span>
  <span class="licensetpl_short">CC BY-SA 4.0</span>
  <span class="licensetpl_long">Creative Commons Attribution-Share Alike 4.0</span>
  <span class="licensetpl_link_req">true</span><span class="licensetpl_attr_req">true</span>
</span>

It's almost the same, but you can see that they're two different templates, and people can enter whatever they want. We can try to fix some things in flight, but we won't be able to make it 100% correct for every case.
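
To make the extraction step concrete, here is a minimal sketch that pulls the licensetpl_* values out of rendered HTML using PHP's DOM extension. It is not CommonsMetadata's actual code, and the function name extractLicenceFields is hypothetical.

```php
<?php
// Hypothetical helper, not CommonsMetadata's implementation.
// Returns the text of the first element carrying each licensetpl_* class.
function extractLicenceFields( string $html ): array {
	$doc = new DOMDocument();
	// Collect parser warnings internally instead of emitting them.
	$previous = libxml_use_internal_errors( true );
	$doc->loadHTML( $html );
	libxml_clear_errors();
	libxml_use_internal_errors( $previous );

	$xpath = new DOMXPath( $doc );
	$fields = [];
	foreach ( [ 'short', 'long', 'link' ] as $name ) {
		// Match the class as a whole token, so "licensetpl_link" does not
		// also match "licensetpl_link_req".
		$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' licensetpl_{$name} ')]";
		$nodes = $xpath->query( $query );
		if ( $nodes !== false && $nodes->length > 0 ) {
			// Trim, because templates may leave stray whitespace in the value.
			$fields[$name] = trim( $nodes->item( 0 )->textContent );
		}
	}
	return $fields;
}
```

Run against the two templates shown above, both yield 'CC BY-SA 4.0' for the short field once the trailing space in the Commons markup is trimmed.
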
I wouldn't use any regex magic here; this info comes from templates, and those aren't edited often. IMHO it's better to create a map that translates some of those values into others:

$standardizedLicenceShortTextMap = [
  'PD' => 'PDM',
  'PUBLIC DOMAIN' => 'PDM',
  // ...
];

// Upper-case the template value so the lookup is case-insensitive.
$licenceShort = mb_strtoupper( $licenceShort );
return $standardizedLicenceShortTextMap[$licenceShort] ?? $licenceShort;

The regex may return odd output on our texts; the one attached will certainly fail when whitespace varies (extra spaces added or spaces removed).
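
A self-contained sketch of the map approach, under two assumptions that go beyond the snippet above: whitespace is collapsed before the lookup, and values with no map entry are returned cleaned but otherwise unchanged, to stay close to the "return it in the raw format" condition. The names LICENCE_SHORT_TEXT_MAP and normalizeLicenceShortText are hypothetical.

```php
<?php
// Hypothetical map; only the two entries given in the discussion are listed.
const LICENCE_SHORT_TEXT_MAP = [
	'PD' => 'PDM',
	'PUBLIC DOMAIN' => 'PDM',
];

function normalizeLicenceShortText( string $licenceShort ): string {
	// Collapse repeated whitespace and trim the ends before looking anything up.
	$cleaned = trim( preg_replace( '/\s+/', ' ', $licenceShort ) ?? $licenceShort );
	// Upper-case only the lookup key, so unmapped values keep their own casing.
	$key = mb_strtoupper( $cleaned );
	return LICENCE_SHORT_TEXT_MAP[$key] ?? $cleaned;
}
```

With this sketch, ' public  domain ' maps to 'PDM', and the Commons value 'CC BY-SA 4.0 ' comes back as 'CC BY-SA 4.0'.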

I submitted T421051 (Attribution - Align Media licence information with Article licence information) to standardize licence information between articles and media; for now, one returns the long licence text and the other returns the short text.

Yeah, I agree, licenses are fairly standardized in the templates. What you could do is a "soft" combination of regex and a map, where the regex is used only to normalize spaces or dashes and capitalization.

Something like:

$standardizedLicenceShortTextMap = [
  'PD' => 'PDM',
  'PUBLIC[ -]?DOMAIN' => 'PDM',
  'CC[ -]+?by[ -]+?SA[ -]+?3\.0' => 'CC BY-SA 3.0',
  // ...
];

Then run each key as a regex with the /i flag so that matching is case-insensitive. Our own map can then normalize variants such as CC-by-SA 3.0, CC by SA-3.0, and CC by SA - 3.0, which differ slightly in odd places but all mean the same thing.
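
A runnable sketch of this combined approach, under stated assumptions: the names LICENCE_PATTERNS and normalizeWithPatterns are hypothetical, each key is treated as a regex fragment anchored to the whole value, and the canonical form 'CC BY-SA 3.0' is chosen here to follow the capitalization rule in the conditions of acceptance.

```php
<?php
// Hypothetical pattern map: regex fragment => canonical short name.
const LICENCE_PATTERNS = [
	'PD' => 'PDM',
	'PUBLIC[ -]?DOMAIN' => 'PDM',
	'CC[ -]+BY[ -]+SA[ -]+3\.0' => 'CC BY-SA 3.0',
];

function normalizeWithPatterns( string $raw ): string {
	$candidate = trim( $raw );
	foreach ( LICENCE_PATTERNS as $pattern => $canonical ) {
		// Anchor the fragment to the whole value and ignore case (/i).
		if ( preg_match( '/^(?:' . $pattern . ')$/i', $candidate ) === 1 ) {
			return $canonical;
		}
	}
	// No pattern matched: return the value as received.
	return $raw;
}
```

Anchoring with ^ and $ keeps a short key such as PD from matching inside a longer, unrelated value.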

Atieno triaged this task as Medium priority. Mar 27 2026, 9:43 PM
AGhirelli-WMF set the point value for this task to 0. Tue, Apr 28, 3:04 PM
BPirkle changed the point value for this task from 0 to 5. Thu, Apr 30, 3:12 PM