> percentage / number of WDQS queries per month that involve Lexemes
> percentage / number of the above queries that only involve Lexemes (i.e. that don't require anything from the larger Wikidata graph)
Using very naive heuristics over one day of traffic, I extracted 529,097 queries involving Lexemes.
357,917 of them seemed to require data from Wikidata, but I would not trust this number too much: since the language of a Lexeme is itself a Wikidata item, a query requesting labels in a language using its language code rather than its QID still falls into the category of queries requiring the Wikidata graph.
I did not run the analysis on the full month because it is rather slow, and given the precision of the heuristics I chose I would not trust those numbers anyway. A sketch of the kind of heuristic I mean is below.
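For illustration only, a minimal Python sketch of this kind of naive classification; the regexes here are made up for the example and are not the actual patterns I used:

```python
import re

# Simplified sketch of the naive heuristics mentioned above (not the actual
# patterns). A query "involves Lexemes" if it mentions Lexeme entities or the
# lexicographic vocabulary; it "requires the Wikidata graph" if it also
# touches Q-items, P-properties or labels.
LEXEME_HINTS = re.compile(r"ontolex:|wikibase:lemma|wikibase:lexicalCategory|wd:L\d+")
WIKIDATA_HINTS = re.compile(r"wd:Q\d+|wdt:P\d+|rdfs:label|wikibase:label")

def classify(query: str) -> tuple[bool, bool]:
    """Return (involves_lexemes, requires_wikidata_graph) for a SPARQL query."""
    involves = bool(LEXEME_HINTS.search(query))
    requires_wikidata = involves and bool(WIKIDATA_HINTS.search(query))
    return involves, requires_wikidata
```

This also shows why the second number is fuzzy: anything matching rdfs:label gets counted as needing the Wikidata graph, even when the user only supplied a language code.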
Mon, Apr 19
bd:sample is a Blazegraph feature and should be documented on the Blazegraph wiki, which is referenced from https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Blazegraph_extensions.
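For illustration, a sketch of how I understand the service is invoked, going by the Blazegraph docs; the service syntax and the bd:sample.limit parameter name should be double-checked there:

```python
import requests

# Sketch based on my reading of the Blazegraph docs; verify the SERVICE
# syntax and bd:sample.limit parameter against the Blazegraph wiki.
query = """
SELECT ?item WHERE {
  SERVICE bd:sample {
    ?item wdt:P31 wd:Q5 .
    bd:serviceParam bd:sample.limit 100 .
  }
}
"""
resp = requests.get("https://query.wikidata.org/sparql",
                    params={"query": query, "format": "json"})
resp.raise_for_status()
print(len(resp.json()["results"]["bindings"]))
```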
We might want to exclude wdqs1009 from this for now, since we do not have anywhere else to get its journal from.
Indeed, the RDF data is available in the Hive table discovery.wikibase_rdf, but it is generated by reading the TTL dumps, so it might not help for this particular task.
Using Hadoop would indeed allow us to process the JSON efficiently, but it has drawbacks, as already pointed out (see the sketch after this list):
- it requires maintaining the Wikibase -> RDF projection in multiple codebases (PHP Wikibase and Spark)
- once created on the Hadoop cluster, the output will have to be pushed back to the labstore machines for public consumption, which might add extra delay
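For context, a rough sketch of what the Spark side could look like; the column names (subject, predicate, object) and the snapshot partition are assumptions about the discovery.wikibase_rdf schema, not verified against the real table:

```python
from pyspark.sql import SparkSession

# Rough sketch only: the column names and the snapshot partition value are
# assumptions about the discovery.wikibase_rdf schema.
spark = SparkSession.builder.appName("lexeme-rdf-extract").getOrCreate()

triples = spark.sql("""
    SELECT subject, predicate, object
    FROM discovery.wikibase_rdf
    WHERE snapshot = '2021-04-05'
""")

# Keep only triples whose subject is a Lexeme entity.
lexemes = triples.filter(triples.subject.rlike("/entity/L[0-9]+$"))

# This output would still need to be pushed back to labstore for publication,
# which is exactly the extra-delay drawback mentioned above.
lexemes.write.mode("overwrite").parquet("/tmp/lexeme_triples")
```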
OpenSearch was forked from Elasticsearch 7.10, but CirrusSearch only supports Elasticsearch 6.x, so migrating to 7 (T280482) might be necessary before doing this.
Fri, Apr 16
Thu, Apr 15
Since we are going to use Envoy to contact the MW application servers, I wonder if this kind of limit could be enforced there?
Wed, Apr 14
Tue, Apr 13
Moving to Needs Triage to raise visibility.
Most probably due to the recent reindex (T274200). It looks like cloudelastic does not have the capacity to support a reindex of our large indices (commons and wikidata). Worth noting that we are investigating creating dedicated production clusters for these two indices (T265621); should we reconsider the size of the cloudelastic cluster (add even more machines), or perhaps have a dedicated cloudelastic cluster for wikidata & commons?
Thu, Apr 8
Another weird behavior is that you can expand the 7 results without asking for more:
@Moebeus thanks for the report! Do you know if the duplicates appear directly, or only after clicking "more" to display the remaining results?
If they appear directly, could you check, by scrolling down, whether all of the first 7 results are duplicated?
By default only 7 items are searched and shown; more can be displayed only if you hit the "more" button.
Wed, Apr 7
When File is part of the searched namespaces on WMF wikis, Commons is searched too. The shape of the search query might inhibit this behavior, but incategory is explicitly declared as a keyword that is allowed to work on Commons. I don't have strong opinions on this, but this feature has been here for a while, so I guess it's OK.
To answer your last statement: it's a feature, not an issue nor a side-effect. As a workaround, the local: prefix can be used to forcibly ignore results from Commons.
I'm declining this as working as expected, but feel free to re-open if you believe this feature should be changed.
Tue, Apr 6
Thu, Apr 1
Wed, Mar 31
Tue, Mar 30
@MisterSynergy thanks for the report!
Mon, Mar 29
Thanks for bringing this here. This link is generated from https://www.wikidata.org/wiki/Module:Constraints/SPARQL and seems to be added to all properties except the few that define no constraints.
Digging more into the 370 queries that used wikibase:hasViolationForConstraint in March (1st -> 28th; a sketch of such a query follows the breakdown):
- 200 come from one of these links
- 112 are from https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/Pi_bot_13
- 28 are from https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/Pi_bot_11
- 8 are from example queries taken from the announcement: https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_property_constraints/Archive_2#You_can_now_query_the_constraint_violations_with_the_Query_Service
- no obvious categories for the remaining 22 queries
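For reference, the kind of query being counted looks roughly like the sketch below; the exact subject/object shape of the wikibase:hasViolationForConstraint triples should be checked against the announcement linked above:

```python
# Rough sketch of the kind of query counted above; verify the shape of the
# hasViolationForConstraint triples against the announcement.
query = """
SELECT ?statement ?constraint WHERE {
  ?statement wikibase:hasViolationForConstraint ?constraint .
}
LIMIT 10
"""
```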
Fri, Mar 26
I think one of the reasons bucketing was done in the frontend was to better detect search session boundaries; doing this on the backend, without per-identity state, you would have to set arbitrary boundaries, I think (see the sketch below).
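To make the "arbitrary boundaries" point concrete, a hypothetical sketch of backend sessionization with a fixed inactivity gap; the 30-minute cutoff is an arbitrary choice, not what the frontend does:

```python
from datetime import timedelta

# Hypothetical backend sessionization: without per-identity state, the
# session boundary has to be an arbitrary cutoff such as 30 minutes of
# inactivity between consecutive events of the same identity.
GAP = timedelta(minutes=30)

def sessionize(events):
    """events: (identity, timestamp) pairs, sorted by identity then timestamp."""
    sessions = []
    prev_identity, prev_ts = None, None
    for identity, ts in events:
        if identity != prev_identity or ts - prev_ts > GAP:
            sessions.append([])  # arbitrary boundary: start a new session
        sessions[-1].append((identity, ts))
        prev_identity, prev_ts = identity, ts
    return sessions
```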
I took a quick look when I saw that frwiki has the new search widget enabled but dewiki/enwiki do not. Looking at the data, it seems frwiki is heavily affected (~20% of sessions have an event in mismatch or invalid, as opposed to 1-2% for other wikis):
Thu, Mar 25
Tue, Mar 23
Suggestion for a better long term solution here: T278246
Mon, Mar 22
Mar 19 2021
@despens could you provide a reproducible test case? (A small RDF file that triggers the problem would be great.) I don't see how sitelinks could be involved in the problem you raise, and a test case would definitely help. Thanks!
I see two ways to fix this:
Mar 17 2021
Mar 16 2021
Sounds good to me. Given the unpredictable growth of Commons (current coverage of captions & depicts statements is below 10%), I think it's sane to have at least 12 machines.
Tentatively closing as a duplicate of T263427 as this sounds very similar; please re-open if you think it's completely different or if the workaround mentioned there does not work for you.
Mar 15 2021
For reference, the backwards-compatibility patch by Erik: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/671214
Mar 12 2021
I think I'll go with ?useskinversion=1 for now, and wait for the new widget to become the default before switching to it in the test code.
Pinging SD folks as they've worked on this dump IIRC.
Mar 9 2021
Reprocessed all updates related to lexemes on wdqs1009 using a custom build with https://gerrit.wikimedia.org/r/670090
This makes perfect sense, and I think it can be considered a bug: the code simply ignores that the keyword node (in the AST) can be wrapped inside a NegatedNode, and will incorrectly skip negated keywords.
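A toy illustration of the suspected pattern (in Python for brevity; the real code is PHP in CirrusSearch and these class names are made up):

```python
# Toy model of the suspected bug: the walker tests for KeywordNode directly
# and never unwraps NegatedNode, so negated keywords are silently skipped.
class KeywordNode:
    def __init__(self, keyword):
        self.keyword = keyword

class NegatedNode:
    def __init__(self, child):
        self.child = child

def collect_keywords(node, keywords):
    if isinstance(node, KeywordNode):
        keywords.append(node.keyword)
    # BUG: a NegatedNode wrapping a KeywordNode falls through here.
    # The fix would be to recurse into the wrapped child:
    # elif isinstance(node, NegatedNode):
    #     collect_keywords(node.child, keywords)
```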
Mar 8 2021
Mar 5 2021
For info, the test started to fail around Feb 12 (https://integration.wikimedia.org/ci/view/Selenium/job/selenium-daily-beta-CirrusSearch/).
I think the estimate of 20TB is not enough to support the current shape of the indices, mainly because we have a giant index at 11TB. Assuming the worst-case scenario (reindexing commonswiki_file will require another 11TB), 33TB is what is needed to reindex commonswiki_file without reaching the 75% watermark.
So adding 4 machines would increase the overall usable disk space to 33.6TB; I would therefore suggest at least 14 machines to comfortably support the current sizes (the arithmetic is spelled out below). Now, should we assume that the expected shrinking of commonswiki_file due to file_text truncation, together with my over-pessimistic reindex scenario, will compensate for future growth? Hard to say.
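Spelling out my back-of-envelope arithmetic; the "room left for everything else" figure is my own inference, not a measured number:

```python
# Back-of-envelope version of the numbers above.
commons_index = 11.0   # TB, current commonswiki_file size
reindex_copy = 11.0    # TB, worst case: a full second copy while reindexing
watermark = 0.75       # stay below 75% disk usage

# Capacity needed for the two commons copies alone to stay under the watermark:
print((commons_index + reindex_copy) / watermark)         # ~29.3 TB
# With 33.6TB total (14 machines), room left for all other indices:
print(watermark * 33.6 - (commons_index + reindex_copy))  # ~3.2 TB
```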
Mar 4 2021
Great write-up, thanks!
Mar 3 2021
Triaging as High, as this can cause serious problems.
The cause seems to be in Elasticsearch itself, but I could not spot the exact problem looking at the Elasticsearch code. We might want to work around the issue by always running systemd-tmpfiles --create from the elasticsearch systemd unit, to make sure the folder exists when it's needed.
Search is functional again.
Mar 2 2021
It's related to T274204