
Increase length limit for external identifier, string and URL datatype
Closed, Resolved · Public · 3 Estimated Story Points

Description

As an editor I want to be able to enter longer strings in order to be able to enter more data.

Task:

Make sure that the string-limits config var covers all string types that we want to increase the length limit for.
Increase the configured length limit of strings on wikidata* to 1500 (T154660#4735134) using the string-limits repo option (recently introduced).

Acceptance criteria:

  • string lengths (for monolingual text, URL, string, external id (NOT TERMS :))) should be configured to be 1500 on wikidata*
  • wikidatawiki, testwikidatawiki and beta wikidata all have the same length limits for the values changed
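For reference, a sketch of what such a configuration change might look like in wmf-config. The setting keys and structure here are assumptions based on the Wikibase repo settings, not copied from the actual deployed patch:

```php
// Hypothetical sketch: raise Wikibase string limits to 1500 for the
// wikidata dblist. Key names are assumed, not the deployed config.
$wgWBRepoSettings['string-limits'] = [
	'VT:monolingualtext' => [ 'length' => 1500 ],
	'VT:string' => [ 'length' => 1500 ], // covers string and external-id property types
	'PT:url' => [ 'length' => 1500 ],
];
```

The same values would need to be applied to wikidatawiki, testwikidatawiki and beta wikidata to satisfy the acceptance criteria.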

Original task description

We have had several requests that the length limit of the external identifier, string and URL datatypes is too low to cover some use cases. https://www.wikidata.org/wiki/Property:P234 is one example. We should increase the limit.

Open question: What should the limit be? What's the technical limit we run into?

Strawman proposal by Lydia: set all to 4000

Some numbers to help with the decision for external identifier:
InChI string lengths on the 92 million PubChem compounds:

99% 99.9%  100%
311   676  4502

Event Timeline


Thanks Lydia!

For the InChI example above, 4000 should be fine; the question is what other properties outside my expertise would need more than 4000 chars.

Best,

Sebastian

The InChI is not the only use case for chemistry, btw. SMILES also runs into the char limit right now for a number of compounds.

thiemowmde added subscribers: thiemowmde, daniel.

We already discussed this somewhere else. (But where?) On the web you will find a lot of Stackoverflow-type questions and answers. The most relevant are:

  • Some extremely old browsers had a URL limit of 256 characters.
  • Some old versions of Internet Explorer had a limit of 1024 characters. These versions are not relevant any more.
  • Modern browsers basically don't have a limit any more (multiple megabytes, if you want).
  • We still want some limit to block users from overflowing our database with stuff that does not belong there. Typically a "round" binary number like 2048 or 16384 is used.
  • A relevant number is 768 (this is 3 * 256). This is the maximum length of a field in a MySQL index. This was very relevant when we tried to build a query service based on MySQL indexing, and might still be relevant for the planned "external identifier lookup" as well as third party users. If we still want to give the guarantee that the lookup will work without Cirrus, we should stick to that maximum of 768.

The current limit for the "string" and "external-id" property types is 400, but "url" is 500. A comment quotes http://stackoverflow.com/a/417184 and an absolute upper limit of 2000, which sounds reasonable to me.

Can somebody calculate the percentage for 768, please?

I calculated the numbers above; they are valid only for the chemical structure property InChI (P234), based on ~68 million InChI values in PubChem, the largest public chemistry database (the same applies to other chemical structure properties like canonical and isomeric SMILES). For any other Wikidata data of datatype string/text, I cannot provide numbers, as I lack the distribution of string lengths for the other data that should be represented as strings in Wikidata. And as you can see from the distribution above, increasing the limit would only affect the representation of the top ~1% of total chemical structure data.

That said, for me, increasing from 400 to 768 is not worth the effort. The reason is that larger biomolecules will not fit anyway (and these are currently increasing in relevance), no matter whether the limit is 400 or 768; if we cannot cover these in a comprehensive fashion, trying to tackle it at all does not seem worthwhile.

-sebastian

I don't get that. o_O Increasing the limit to 768 will cover almost 1% more in addition to the 99% we have now. How are these almost 100% "not worth the effort"?

@thiemowmde based on your numbers, we are talking about a 1% improvement. If we could get to the full 100%, we would no longer have to account for the failure case. But if we still have to account for that case anyway, going from 99% to 99.9% doesn't really gain us much.

Again, how are almost 100% "not much", compared to the 99% we have now? This is a huge difference. It will decrease the error rate by a factor of 10.

I don't believe that increasing the limit to 2000 would make any difference. The next day somebody will come and ask for 2500. And the week after somebody will ask for 4000. And a month later somebody will still hit this limit.

There is no 100%. Every limit is arbitrary. Every limit will block certain use cases – as it should.

All I want is us to be crystal clear about:

  • Why we have such limits in the first place. I believe we should document the positive effects such limits have. Otherwise it's not worth much and could as well be dropped.
  • That an increase will make certain use cases much harder. We should document which known use cases are affected by the decision we make here, because we are basically dropping the support for them.

That's what I think we as a team should do. Personally, I still believe that strings longer than 768 characters are not identifiers.

I am not sure how much we should worry about the exact percentages for PubChem; to me, what matters more are the percentages for the chemistry we have in Wikidata. These are likely correlated, and since PubChem is a lot bigger, it puts things in perspective. InChIs are identifiers, but not of the kind we are used to, and I understand the point about indexing and ID length.

Quick question... is the semantic meaning of an 'external identifier' that it must be indexed, all of them?

@thiemowmde agreed, it would bring down the error rate for this specific identifier/string. But currently, that's the only one we have a size distribution for. I think Lydia intended to solve this issue here for every property of data type string. So if, for technical reasons (MySQL index field length), it should be limited to 768 for now, this would also be fine for chemistry for now, but what about other properties?

The general question is whether it would be useful/doable to allow e.g. a medium blob for a text field in the first place, or to create a new data type which stores text in blobs. Currently, medium blobs are used to store the whole JSON, if I am correct. Certainly, this creates indexing, access and potential misuse problems. It also prohibits data access via WDQS.

Another point to keep in mind is that SPARQL queries will get problematic if the field length gets too long. I think I hit a limit of ~4K chars for the maximum query length on query.wikidata.org. Is there a maximum length that the current Wikidata design and the SPARQL endpoint agree on, beyond which lifting it would require a major redesign of the Wikidata technical infrastructure?

The length limit does also affect the migration of constraints from property talk pages to statements on properties. ISBN-13 has a format expression which is 481 characters long and therefore can't be added to the property page.

The 400 character limit can also be a problem for the titles of old books, maps, etc. -- the original titles of these, as given e.g. in library catalogues, can get pretty long.

We recently made these options more easily configurable in T198108.
So if we come up with some new values for wikidata.org we can make this change pretty easily.
I'll leave that to @Lydia_Pintscher though.

Ok then let's put it to 1500 as a next step and then reevaluate.

@Tarrow @Lucas_Werkmeister_WMDE this could be a good candidate to try out your deployment access? :)

Addshore renamed this task from "increase length limit for external identifier, string and URL datatype" to "Increase length limit for external identifier, string and URL datatype". Nov 13 2018, 3:08 PM

Change 473694 had a related patch set uploaded (by Tarrow; owner: Tarrow):
[operations/mediawiki-config@master] Set Wikibase string-limits

https://gerrit.wikimedia.org/r/473694

Change 473716 had a related patch set uploaded (by Tarrow; owner: Tarrow):
[operations/mediawiki-config@master] Read WikibaseStringLimit in Wikibase.php

https://gerrit.wikimedia.org/r/473716

Change 473694 merged by jenkins-bot:
[operations/mediawiki-config@master] Set Wikibase string-limits for wikidata dblist

https://gerrit.wikimedia.org/r/473694

Mentioned in SAL (#wikimedia-operations) [2018-11-15T12:21:29Z] <tarrow@deploy1001> Synchronized wmf-config/InitialiseSettings.php: [[gerrit:473694]] Set Wikibase string-limits for wikidata dblist T154660 (duration: 00m 54s)

Change 473716 merged by jenkins-bot:
[operations/mediawiki-config@master] Read WikibaseStringLimit in Wikibase.php

https://gerrit.wikimedia.org/r/473716

Mentioned in SAL (#wikimedia-operations) [2018-11-15T12:30:45Z] <tarrow@deploy1001> Synchronized wmf-config/Wikibase.php: [[gerrit:473716]] Read WikibaseStringLimit in Wikibase.php T154660 (duration: 00m 53s)

Change 473815 had a related patch set uploaded (by Tarrow; owner: Tarrow):
[operations/mediawiki-config@master] Don't set Wikibase StringLengths on Commons

https://gerrit.wikimedia.org/r/473815

Change 473815 abandoned by Tarrow:
Don't set Wikibase StringLengths on Commons

Reason:
addshore beat me

https://gerrit.wikimedia.org/r/473815

@Tarrow Could you give me an update on this? Is it already deployed, what's the final max length? Thanks :)

I believe it’s 1500 characters (example edit), for monolingual text, strings, and URLs.

Apparently, the string length validator now looks like this in production:

Wikibase\Repo\Validators\StringLengthValidator::__set_state(array(
  'minLength' => 1,
  'maxLength' => 
  array (
    'length' => 1500,
  ),
  'measure' => 'mb_strlen',
  'errorCodePrefix' => '',
)),

Because maxLength is an array with a length member instead of a plain integer, the comparison $length > $this->maxLength is now never true, which is why strings of any length are accepted. I’ll try to find out where exactly this is happening.
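In other words, the per-type settings array appears to have been passed through to the validator as-is, where a plain integer was expected. Roughly (a sketch based on the dump above; the constructor signature is assumed, not the actual production call site):

```php
// Apparently happening: the whole [ 'length' => 1500 ] array becomes $maxLength.
new StringLengthValidator( 1, [ 'length' => 1500 ] );
// In PHP, comparing an integer to an array with > is never true
// (an array always compares greater than a non-array), so the
// $length > $this->maxLength check silently passes every string.

// Intended: pass the integer itself.
new StringLengthValidator( 1, 1500 );
```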

(Aside: Do we have any place where we collect bugs like this one, which would have been prevented if we had full PHP7 support, including primitive type hints? Just to further motivate the PHP7 transition?)

Change 478678 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] Fix configuration of max string length

https://gerrit.wikimedia.org/r/478678

Change 478678 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Fix configuration of max string length

https://gerrit.wikimedia.org/r/478678

Change 478928 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] Add parameter type assertions to StringLengthValidator

https://gerrit.wikimedia.org/r/478928

Change 478928 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add parameter type assertions to StringLengthValidator

https://gerrit.wikimedia.org/r/478928

Tested in the sandbox on wikidata.org. The expected behavior (an error message) happens correctly for properties of datatype string. However, for external IDs and monolingual text, no error message shows up; the publish link is simply disabled.

Apparently, for external identifiers the “publish” link becomes disabled at a length of 4592 characters, and for monolingual text, at 4566 characters. (Tested with strings of x characters, generated via printf '%*s' 4566 | tr ' ' 'x' | xclip, and using the en language code for monolingual text.) With values between 1501 and ~4500 characters, you still get the server-side error on submission.

I notice that the difference between 4566 and 4592 characters is exactly the same (26 characters) as the length of {"language":"en","value":}, the overhead of monolingual text values compared to plain string values. But that doesn’t explain why this apparent limit on data value length doesn’t seem to affect string-type properties.

I just looked at this and as the limits work and the acceptance criteria have been met I am going to close the ticket.
I'll open another one about the consistency in the UI for these different cases.