
Revise "language" field validation
Closed, Resolved, Public

Description

Supposedly, the language field should only accept language codes matching either xx or xx-xx* formats.

However, the pattern seems to be matching other strings as well (e.g., "spanish").

Also, why is Citoid returning "spanish" for some sources? See, for example, https://www.elpais.com.uy/mundo/relato-esposa-fiscal-paraguayo-marcelo-pecci-asesinado-colombia.html

Event Timeline

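The pattern used in our code comes from Citoid's fixLang: /^[a-z]{2}(?:-?[a-z]{2,})*$/i. I mistakenly thought it only matched codes in the xx or xx-xx* formats, but I was wrong: the hyphen is optional and each repeated group accepts two or more letters, so any purely alphabetic string of two letters, or of four or more, also matches (e.g., "spanish").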
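The pattern's behaviour can be checked directly. This is a minimal TypeScript sketch; the helper name `matchesFixLang` is made up for illustration and is not part of Citoid or Web2Cit.

```typescript
// Pattern quoted above from Citoid's fixLang.
const fixLangPattern = /^[a-z]{2}(?:-?[a-z]{2,})*$/i;

// Hypothetical helper, for illustration only.
function matchesFixLang(value: string): boolean {
  return fixLangPattern.test(value);
}

console.log(matchesFixLang("es"));      // true
console.log(matchesFixLang("es-ES"));   // true
console.log(matchesFixLang("spanish")); // true: "sp" followed by the letter run "anish"
console.log(matchesFixLang("e"));       // false: fewer than two letters
```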

On the other hand, the language parameter of the Cite web citation template (for example) accepts "either the ISO 639 language code (preferred) or the full language name".

In fact, there is a list here of the values it and other CS1/2 citation templates accept.

Note that there are some values that, although valid for CS1/2 templates' language parameter, would not pass Citoid validation. For example, abq-latn, or es-419. We may have to report this to Citoid.
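A quick check of those two values against the fixLang pattern, as a sketch ("pt-br" and "spanish" are included only for contrast):

```typescript
// Pattern quoted earlier in this task from Citoid's fixLang.
const citoidPattern = /^[a-z]{2}(?:-?[a-z]{2,})*$/i;

const samples = ["abq-latn", "es-419", "pt-br", "spanish"];

// Record, for each sample, whether the Citoid pattern accepts it.
const results = new Map<string, boolean>();
for (const value of samples) {
  results.set(value, citoidPattern.test(value));
}

console.log(results.get("abq-latn")); // false: three-letter primary subtag
console.log(results.get("es-419"));   // false: digits are not allowed
console.log(results.get("pt-br"));    // true
console.log(results.get("spanish"));  // true
```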

What should we do in Web2Cit?

  • Should we continue using the same pattern as Citoid's fixLang but change the documentation accordingly?
  • Should we accept any non-empty string, but warn users that only values in the list above will be recognized by Wikipedia?
  • Should we validate against the full list of languages supported by CS1/2 templates?
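For comparison, the three options could look roughly like this. It is only a sketch: the function names, the `Validation` type, and the tiny `knownCs1Languages` set are placeholders (the real CS1/2 list is much longer), and treating whitespace-only input as empty is an assumption. None of this is taken from w2c-core.

```typescript
type Validation = { valid: boolean; warning?: string };

// Option 1: reuse the pattern from Citoid's fixLang.
const fixLangPattern = /^[a-z]{2}(?:-?[a-z]{2,})*$/i;
function validateLikeCitoid(value: string): Validation {
  return { valid: fixLangPattern.test(value) };
}

// Placeholder subset standing in for the full CS1/2 language list.
const knownCs1Languages = new Set(["es", "spanish", "es-419", "abq-latn"]);

// Option 2: accept any non-empty string, but warn about unlisted values.
function validateNonEmptyWithWarning(value: string): Validation {
  if (value.trim().length === 0) {
    return { valid: false };
  }
  if (!knownCs1Languages.has(value.toLowerCase())) {
    return { valid: true, warning: "Value may not be recognized by CS1/2 templates" };
  }
  return { valid: true };
}

// Option 3: accept only values present in the CS1/2 list.
function validateAgainstList(value: string): Validation {
  return { valid: knownCs1Languages.has(value.toLowerCase()) };
}
```

With these definitions, "es-419" fails option 1 but passes options 2 and 3, while an unlisted value such as "klingon" passes option 1, passes option 2 with a warning, and fails option 3.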

> Note that there are some values that, although valid for CS1/2 templates' language parameter, would not pass Citoid validation. For example, abq-latn, or es-419. We may have to report this to Citoid.

Added comment here: T93561

> Should we continue using the same pattern as Citoid's fixLang but change the documentation accordingly?

Because using language codes is recommended, we could make our validation stricter than Citoid's and accept language codes only. However, we probably don't want to do this until we find a way to convert Citoid-valid languages into Web2Cit-valid ones (T312110), to prevent the fallback template from failing.

> Should we accept any non-empty string, but warn users that only values in the list above will be recognized by Wikipedia?

Being less strict than Citoid may work, and would support Wikipedia-valid languages not supported by Citoid (see above).

> ...the language parameter of the Cite web citation template (for example) accepts "either the ISO 639 language code (preferred) or the full language name" ... there is a list here of the values it and other CS1/2 citation templates accept.

BTW, I tried this today and couldn't get the Cite web or Cite news templates to throw an error when supposedly unsupported values were used in the corresponding language fields.

diegodlh claimed this task.

Relaxed the language validation pattern to "any non-empty string" in w2c-core's 760b8159.

Updated Fields documentation on-wiki as well.