I believe that because the file name has many words, the score on the tokenized text fields is very high (since we sum all token scores). The exact match has only one word, so despite its high weight its single score is not enough to make up for the text matches discarded because of the negation.
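The imbalance can be sketched with toy numbers (hypothetical, not actual Elasticsearch scores):

```python
# Toy illustration of why a many-word file name can outscore a
# single-word exact match, even when the exact match is heavily boosted.
# All numbers are made up for the example.
token_scores = [1.2, 0.9, 1.1, 0.8, 1.0]   # one score per matching token
text_field_score = sum(token_scores)        # token scores are summed

exact_match_score = 1.3                     # one word -> one token score
exact_match_weight = 3.0
weighted_exact = exact_match_score * exact_match_weight

# With the negation discarding the exact-match document's text matches,
# the boosted single-token score must beat the whole sum on its own:
assert text_field_score > weighted_exact
```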
@Lea_Lacroix_WMDE no, we just need to deploy it, sorry for the delay.
I'd consider this a bug indeed; I suspect the tokenization algorithm of the default search backend is quite limited, as it is not able to properly discard punctuation.
Thanks for all the feedback.
I'll discard the "constant" option.
Fri, Feb 7
Thu, Feb 6
Yes, the issue with blank nodes is that they are not referenceable, and thus the point delete queries we want to achieve with the next-gen updater are impossible.
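A toy sketch of why this defeats point deletes (the triples and the _:b0/_:b7 labels are made up; blank node labels are only meaningful within a single serialization, so the "same" node gets a different label on each dump):

```python
# Two serializations of the same hypothetical data: the blank node's
# label is not stable, so a delete written against one serialization's
# label cannot name the node in a store loaded from another.
dump_v1 = {("wd:Q42", "p:P569", "_:b0"), ("_:b0", "psv:P569", '"1952"')}
dump_v2 = {("wd:Q42", "p:P569", "_:b7"), ("_:b7", "psv:P569", '"1952"')}

# A point delete targeting _:b0 matches nothing in the v2 store:
to_delete = {t for t in dump_v1 if "_:b0" in t}
assert not (to_delete & dump_v2)
```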
Wed, Feb 5
Thu, Jan 23
Wed, Jan 22
Tue, Jan 21
Jan 17 2020
The patch was just merged. I wonder if it's not because of the submodule and Gerrit trying to detect conflicts with another patch that touches this deleted module (https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/564063).
There might be some logs server-side?
Jan 16 2020
The icinga check showed CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds. for Query Service HTTP Port, and NaN for WDQS high update lag.
I suppose that the last remark refers to the $wgCapitalLinks and
$wgCapitalLinkOverrides configuration variables.
When querying, cirrus properly honors these parameters: searching for hastemplate:foo will actually search for Template:Foo on English Wikipedia but Template:foo on English Wiktionary.
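A minimal sketch of that per-wiki normalization (not the actual CirrusSearch code; the function name is made up):

```python
# Hypothetical sketch: normalize a hastemplate: argument according to
# the wiki's $wgCapitalLinks setting.
def normalize_template_title(title: str, capital_links: bool) -> str:
    """Uppercase the first letter when the wiki capitalizes page titles
    ($wgCapitalLinks = true); leave the title untouched otherwise."""
    name = title[:1].upper() + title[1:] if capital_links else title
    return f"Template:{name}"

# English Wikipedia capitalizes links; English Wiktionary does not.
print(normalize_template_title("foo", capital_links=True))   # Template:Foo
print(normalize_template_title("foo", capital_links=False))  # Template:foo
```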
Indeed, the only keyword that does some filtering but also affects ranking is morelike, but I'm not sure we can base any naming pattern on it. about-topic: sounds fine to me (@TJones might have some suggestions, perhaps?).
Jan 13 2020
very similar to T242587
Jan 11 2020
Perhaps prefer-topic:something then?
My concern here is mostly to avoid existing words in the special syntax, so that we don't swallow queries that are valid sentences. For instance, when I copy/paste a text and search for it, e.g. searching for Special topic: Electric aircraft, I probably don't mean the keyword.
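One way to sketch that concern (hypothetical keyword list and grammar, not the actual CirrusSearch parser): only treat word: as a keyword when the word is known and the colon follows it with no space, so pasted sentences fall through to a plain search.

```python
import re

# Made-up keyword set for the sketch.
KEYWORDS = {"hastemplate", "morelike", "about-topic"}

def split_keyword(query: str):
    """Return (keyword, argument) when the query starts with a known
    keyword immediately followed by ':', else (None, whole query)."""
    m = re.match(r"^([a-z-]+):(\S.*)$", query)
    if m and m.group(1) in KEYWORDS:
        return m.group(1), m.group(2)
    return None, query

# A pasted sentence is not swallowed by the keyword grammar:
print(split_keyword("Special topic: Electric aircraft"))
# A real keyword still parses:
print(split_keyword("hastemplate:foo"))
```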
Jan 10 2020
Jan 8 2020
Jan 7 2020
Search was broken, the config change fixed it.
logstash-beta seems to have stopped receiving events since Jan 1st 16:40, so I can't really be sure that the logspam stopped. Please reopen if you still see errors of this kind.
Jan 6 2020
Most likely already fixed in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/554603 but not yet deployed (it should go out with this week's train).
For 2020 primaries vs 2016 primaries there seems to be an additional problem (the user query should not be corrected in this case); I'm filing a new task for it (please see T241969).
Jan 3 2020
Jan 2 2020
One problem is that the French stemmer does not conflate venir with its conjugated forms venu or venue.
The page Je suis venu te dire que je m'en vais does not contain the stem venir, meaning that it cannot match the query Je suis venir te dire que je m'en vais.
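The mismatch can be illustrated with a toy analyzer (the stem table below is made up; the real French analyzer is far more elaborate, but as noted above it does not conflate these forms either):

```python
# Toy stem table: venu/venue share a stem, but venir stays distinct.
toy_stems = {"venir": "venir", "venu": "venu", "venue": "venu"}

def analyze(text: str) -> set:
    """Lowercase, split on whitespace, and map each word to its stem."""
    return {toy_stems.get(w, w) for w in text.lower().split()}

page_terms = analyze("Je suis venu te dire que je m'en vais")
query_terms = analyze("Je suis venir te dire que je m'en vais")

# The query term "venir" is absent from the indexed page, so the
# full-phrase match fails on that position:
assert "venir" not in page_terms
assert "venir" in query_terms
```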
I suggest a keyword slightly less ambiguous such as hastopic or hasdrafttopic.
I agree that there should be a mapping; if this keyword is going to be used directly by users, it might be helpful to allow them to search for a topic translated into the wiki language instead of using English.
Dec 31 2019
I believe this is caused by a bot sending a large number of requests of this type:
using the UA: wikipedia (https://github.com/goldsmith/Wikipedia/)
Dec 20 2019
The most annoying (and probably slowest) integration test is org.wikidata.query.rdf.tool.wikibase.WikibaseRepositoryIntegrationTest:
- it generates anonymous edits on test.wikidata.org in order to test the RecentChanges API
- concurrent runs of this test will cause failures: the test expects to see the timestamps of the edits it makes, so if it is run concurrently (two patches in CI) there is a race and it can fail
- it adds a lot of complexity for testing robustness (retries), by launching a custom proxy before running the integration tests (start-proxy and org.wikidata.query.rdf.tool.Proxy)
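The race in the second point can be sketched like this (a toy model, not the actual test code; run ids and timestamps are made up):

```python
# test.wikidata.org's RecentChanges feed is shared between CI runs.
recent_changes = []

def make_edit(run_id: str, ts: int) -> None:
    recent_changes.append((ts, run_id))

def changes_since(ts: int):
    return [c for c in recent_changes if c[0] >= ts]

make_edit("patch-A", 100)   # run A edits...
make_edit("patch-B", 101)   # ...run B edits concurrently

# Run A expected to see only its own edit's timestamp, but the shared
# feed also contains run B's edit, so the expectation fails:
seen_by_a = changes_since(100)
assert seen_by_a != [(100, "patch-A")]
```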
Dec 19 2019
Dec 18 2019
Dec 17 2019
Dec 16 2019
@Mholloway yes, it is expected; previously this topic was only used to replay failed updates to elasticsearch.
As Erik mentioned in a previous comment:
There will now be, approximately, 3x as many ElasticaWrite jobs as there were CirrusSearchLinksUpdate jobs. Ballpark estimate on latency is 300ms, basically dividing the current 700ms by three and rounding up a bit. We almost certainly need to increase concurrency here, using the current level of links update (300) is almost certainly safe, and we can adjust from there.
Looking at the graph "Rate of committed offset increment" on https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1 it seems that only "low_traffic_jobs" is affected:
dropping from ~40 to 3.
With one topic (fetchGoogleCloudVisionAnnotations) constantly failing out of many that should run properly (all the ones consumed by low_traffic_jobs), if ChangeProp does not properly handle such a scenario I suppose it could lead to this behavior.
Dec 12 2019