The reason for this behavior can be seen here: https://www.wikidata.org/w/api.php?action=wbsgetsuggestions&language=en-US. There is no language "en-US" in MediaWiki. Most code falls back to the default "en". But the API module that is responsible for suggesting properties does not.
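For illustration, a minimal Python sketch of the fallback most code applies (the language set and function are made up for this example; the real MediaWiki logic is more involved):

```python
# Illustrative sketch only: not the actual MediaWiki fallback code.
KNOWN_LANGUAGES = {"en", "de", "fr"}  # tiny stand-in for MediaWiki's language list

def fallback_language(code, default="en"):
    """Map an unknown code like "en-US" down to a known base language."""
    if code in KNOWN_LANGUAGES:
        return code
    base = code.split("-")[0].lower()
    if base in KNOWN_LANGUAGES:
        return base
    return default
```

The suggester module apparently skips this kind of fallback, which is why "en-US" makes it fail.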
Estimated table sizes:
- The latest Item ID is currently Q49977198. That's 9 bytes.
- 9 * 3 = 27 bytes per row.
- 27 * 1 million Lexemes = 26 megabytes.
- Lexeme IDs will be similar to Item IDs, so 9 bytes again.
- Let's say language codes are 5 bytes on average (e.g. stuff like "en-gb").
- Let's say lemmas are 15 characters on average (see http://www.ravi.io/language-word-lengths).
- Lemmas will use multi-byte UTF-8 characters in many cases. I suggest assuming a factor of 4 bytes per character, just to be sure.
- Let's say a Lexeme has 2 lemmas on average.
- ( 9 + 5 + ( 15 * 4 ) ) * 2 * 1 million Lexemes = 141 megabytes.
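A quick Python back-of-the-envelope check of the numbers above (all sizes are the assumptions stated in the list):

```python
# Back-of-the-envelope check of the lemma table estimate (sizes in bytes).
ID_BYTES = 9             # e.g. "Q49977198"
LANG_CODE_BYTES = 5      # e.g. "en-gb"
LEMMA_CHARS = 15         # assumed average lemma length
UTF8_BYTES_PER_CHAR = 4  # worst-case multi-byte factor
LEMMAS_PER_LEXEME = 2
LEXEMES = 1_000_000

row_bytes = ID_BYTES + LANG_CODE_BYTES + LEMMA_CHARS * UTF8_BYTES_PER_CHAR
total_bytes = row_bytes * LEMMAS_PER_LEXEME * LEXEMES
print(round(total_bytes / 1024 ** 2))  # ≈ 141 MiB, matching the estimate above
```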
- wb_terms is plural. Most MediaWiki core tables are plural. I also prefer plural table names. But in the end it really does not matter.
- I used VARBINARY and VARCHAR BINARY as they currently are on other Wikibase tables. From https://dev.mysql.com/doc/refman/5.7/en/binary-varbinary.html: "The […] VARBINARY data types [is] distinct from the […] VARCHAR BINARY data type. […] the BINARY attribute does not cause the column to be treated as a binary string column. Instead, it causes the binary (_bin) collation for the column character set to be used, and the column itself contains nonbinary character strings rather than binary byte strings." https://dev.mysql.com/doc/refman/5.7/en/charset-binary-collations.html explains this in much more detail. Based on this I believe what's suggested above is correct: Use VARBINARY for Item IDs that are known to not contain multi-byte characters, but VARCHAR BINARY for values that will contain multi-byte characters.
Thu, Feb 22
Wed, Feb 21
The message is from the DecimalValue constructor, which is used in the QuantityValue constructor. This situation can happen when an edit is made via the API and a quantity is submitted as a floating point number instead of a string. The code in DecimalValue converts floats to strings, but does so in a way that can violate its own limitations. Basically: the float is converted to a string with 100 decimal places. If the number before the decimal point is longer than 27 characters, the conversion fails with said error message.
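A rough Python sketch of the failure mode (the actual code is PHP; the function name and limit handling here are illustrative):

```python
# Illustrative sketch only: mimics how a float-to-string conversion can
# produce a value that violates the class's own length limit.
def float_to_decimal_string(value, max_integer_digits=27):
    formatted = f"{value:.100f}"  # 100 decimal places, like DecimalValue
    integer_part = formatted.lstrip("-").split(".")[0]
    # The conversion itself can yield an integer part longer than the
    # limit the constructor enforces, which then triggers the error.
    if len(integer_part) > max_integer_digits:
        raise ValueError("integer part exceeds 27 characters")
    return formatted
```

For example, a quantity like 1e30 survives the conversion itself but then fails the length check.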
Seems to be closely related to T187755: Double quotes in title are displayed as HTML entity in comments. Could potentially be the same cause, but should be tracked separately.
@WMDE-leszek, something like this would be my draft:
CREATE TABLE IF NOT EXISTS wbl_lexemes (
  lex_lexeme_id VARBINARY(20) NOT NULL PRIMARY KEY,
  lex_lexical_category_id VARBINARY(20) NOT NULL,
  lex_language_item_id VARBINARY(20) NOT NULL
);
Tue, Feb 20
Please consider the learnings from ArticleFeedbackv5 (https://www.mediawiki.org/wiki/Article_feedback/Version_5), which provided a somewhat similar "inline" comment feature. I was working closely with @Fabrice_Florin and his team back then. My personal key learnings are:
- The feature made readers super happy.
- The vast majority of the incoming comments were not actionable. This was actually intentional in the product's design, and therefore never fully considered in any of the moderation process designs that were added later.
- The moderation process that was added later was designed as if all comments were potentially relevant. This created a lot of tedious, unproductive workload that made active editors unhappy very fast.
- The most significant frustration was experienced by editors that cared the most about the articles they maintained so carefully. These editors wanted feedback, but what they got was nothing they could work with. For example, on an article about a mammal the editor wanted feedback like "here is a paper with new information you can add to the article". Instead, they got requests from children asking how old the mammal gets, which is a non-scientific question in the first place, and something science actually does not know about most animals.
Personally, I'm totally fine with using any kind of cache, might it be an in-memory one or something else. My worst-case scenario is as follows: Let's say we have 10 million Lexemes, 2 lemmas per Lexeme, 20 bytes per lemma. The cache would need to hold about 0.4 gigabytes. What kind of cache would be ok with such a size?
We should fix https://commons.wikimedia.org/wiki/File:Lexeme_data_model.png then, because it very prominently says there is only "one" lemma. It could be that this is meant to be interpreted as "one" value that can somehow contain multiple values. I wonder what the benefit of phrasing it like this is.
As for the code, it is currently not possible and not planned to support additional languages for labels and descriptions that are not supported by MediaWiki core. Adding code for this is certainly possible, but I cannot predict how much work this is going to be.
Mon, Feb 19
I had a brief look and found two details that might help solve this issue:
- The API request might need a uselang=de attached to make sure the error messages the API call possibly returns are all localized.
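For example (a hypothetical request, just to show where the parameter goes; the action is a placeholder):

```python
# Hypothetical example: attaching uselang=de so any error messages the
# API returns come back localized in German.
from urllib.parse import urlencode

params = {
    "action": "wbeditentity",  # placeholder action, for illustration only
    "format": "json",
    "uselang": "de",
}
url = "https://www.wikidata.org/w/api.php?" + urlencode(params)
```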
I would close this as invalid for a series of reasons:
- The screencast is flawed. There are no suggestions popping up. Only the last "containing…" entry pops up, which is supposed to go to Special:Search. So the screencast does not show an error but actually what this ticket asks for.
- Currently the suggester does find many items with the label "game theory". Which one should be opened by default? Always the first one?
- This ticket's description sounds like it's based on wrong assumptions: There is no "page" with the title "game theory". The critical difference is that titles are unique, while labels are not. This is the main reason why pressing enter does not go to an item page.
- There is stemming and such involved. What is an "exact" match under these circumstances?
- It would be critical to make the return key behave in a way that is consistent, and can be relied on by the user. Personally I would find it very confusing if pressing enter sometimes opens the item I expected, and sometimes a "random" one with the same title as the one I was hoping for.
- This is from 2014. Has anybody else had the same issue since then?
A series of super-trivial improvements that might not even need any design is:
- Display both "Latitude 14.113116805556°" and "Longitude 122.95454861111°" with a degree sign.
- Convert known precisions to values similar to the ones shown in the dropdown. Namely:
- 1 arcminute
- 1 arcsecond
- 1/10 arcsecond
- 1/100 arcsecond
- 1/1000 arcsecond
- 1/10000 arcsecond
- Color the individual lines in coordinate and quantity diffs. As noted above this might need a new ticket.
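The precision conversion above could be sketched roughly like this (the mapping and helper function are my assumptions, not existing code):

```python
import math

# Assumed mapping from precision (in degrees) to dropdown-style labels.
PRECISION_LABELS = {
    1 / 60: "1 arcminute",
    1 / 3600: "1 arcsecond",
    1 / 36000: "1/10 arcsecond",
    1 / 360000: "1/100 arcsecond",
    1 / 3600000: "1/1000 arcsecond",
    1 / 36000000: "1/10000 arcsecond",
}

def precision_label(precision):
    """Return a known label for a precision value, or fall back to degrees."""
    for value, label in PRECISION_LABELS.items():
        if math.isclose(precision, value):
            return label
    return f"±{precision}°"
```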
Thu, Feb 15
Wed, Feb 14
Rough answer without knowing all details @hoo might have been referring to: We see mostly linear growth on https://grafana.wikimedia.org/dashboard/db/wikidata-datamodel. So I believe the numbers given for 1 year can be extrapolated linearly. @Lydia_Pintscher might be able to provide a more substantial answer.
Tue, Feb 13
I was able to reproduce the error quite easily by temporarily adding return Status::newFatal( 'dummy' ); to SpecialWikibaseRepoPage::saveEntity, and simulating an error that way. The same issue with the duplicate form then appears in all special pages that are based on the same abstract class: Special:SetLabel, description, aliases, and sitelinks.
Other questions not relevant for PM:
- Q: Store derived labels for all languages we support in advance?
- Q: We are going to have stuff like “English, Noun” repeated a lot. Is it worth optimizing the storage layer for duplications?
- Q: Can the same solution we investigate here work for MediaInfo?
- Q: Can the solution we investigate here replace Label/DescriptionLookups in Wikibase? See T163538.
Where is this used, and how?
This code example is quite unfortunate. It works around all performance optimizations the Wikidata team implemented in the past weeks and months. Is this from a Lua module actually used somewhere, and if so, can you please provide a link?
Guys, can you please share some links? I'm really not able to go and find "some" client wiki (Which one? We do have about a thousand of them!) that might or might not have an error somewhere hidden in local logs or recent changes.
Mon, Feb 12
I tried to reproduce this, but was not able to. The diff in your example does look bad, indeed. The edit summary states "Changed claim: software version (P348): 8.0.1", but at the same time another statement disappears without being mentioned in the edit summary. This should not be possible. This was obviously an undetected edit conflict where the second edit was based on an old revision that did not contain the previous edit that added the "software version: 8.1" statement. I can imagine some replication lag, or the edit being based on a slightly lagged secondary database.
Can you please help us understand the issue better? At the moment, when I look at the given example page https://www.wikidata.org/wiki/Special:Diff/627491344, I cannot find a "wrong" URL anywhere on that page.
The class index at https://doc.wikimedia.org/Wikibase/master/php/classes.html looks complete. This means the doc generator takes all subdirectories into account. It just does not create entries in the "modules" section, as these are bound to manually maintained @defgroup tags. Honestly, I would remove all these tags and let the directory structure dictate instead.
I would love to include this into what we consider for T182147: more convenience functions for Lua. But I'm afraid the current description is more confusing than helpful.
I was able to reproduce this locally quite easily. It's an actual bug in the code. The code assumed all Wikibase entity labels are valid MediaWiki page names, but that is not always the case. I uploaded a quick fix.
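A minimal sketch of the kind of check that was missing (the character set here is a simplified stand-in for MediaWiki's actual title rules, which are more complex):

```python
# Simplified illustration: MediaWiki page names cannot contain certain
# characters, so not every entity label is a valid page name.
INVALID_TITLE_CHARS = set("#<>[]|{}")  # subset, for illustration only

def is_valid_page_name(label):
    return bool(label) and not (set(label) & INVALID_TITLE_CHARS)
```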
Fri, Feb 9
Sorry to ask again, but the Wikidata team would really love to know what the state here is, as this is blocking Wikidata-Query-Service development more and more (see T112715). I hear there are (possibly more than one) security issues blocking the deployment of this otherwise finished service. Can someone briefly explain what these issues are, and what is probably needed to resolve them? That would be super helpful.
Moving this component alone is not really worth much. We are barely touching it any more. It's just some git repository, and it really should not matter that much where a git repository lives. It matters for other reasons than the build, but as said, as far as I'm aware there was never that much pressure or even demand to change anything here.