Wed, Oct 28
Yes, it definitely can support such queries, e.g. extract all API requests from mediawiki.apiaction, grouped by their action param and database, where the avg backend time is > 100ms over a 1-minute window.
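For illustration, a rough sketch of what such a query could look like with Flink SQL through the Java Table API; the table name and field names (action, database, backendTimeMs, ts) are assumptions, not the actual mediawiki.apiaction schema, and the connector setup is elided:

```
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SlowApiActions {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
            EnvironmentSettings.newInstance().inStreamingMode().build());

        // Registering the `apiaction` table (e.g. a Kafka connector reading
        // mediawiki.apiaction with a watermark on `ts`) is elided here.
        tEnv.executeSql(
            "SELECT action, `database`, " +
            "       AVG(backendTimeMs) AS avg_backend_ms, " +
            "       TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start " +
            "FROM apiaction " +
            "GROUP BY action, `database`, TUMBLE(ts, INTERVAL '1' MINUTE) " +
            "HAVING AVG(backendTimeMs) > 100"
        ).print();
    }
}
```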
Tue, Oct 27
Mon, Oct 26
There's nothing to fix in the updater related to this ticket; the cause was a bad response from one MW machine.
@Aschroet thanks for the reply, closing as it seems you found a workaround.
Please feel free to re-open if you think there's still a fix to be made to Cirrus.
It would be great to make this process more robust to connection issues, but for that I think we should move away from the scroll API for fetching documents.
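For example, paging with search_after keeps no server-side iteration state, so a failed page fetch is easier to retry or resume. A minimal sketch with the low-level Java REST client; the host, index name and the page_id sort field are placeholders, not the actual CirrusSearch setup:

```
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class SearchAfterDump {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200)).build()) {
            // Fetch the next page after the last page_id seen on the previous
            // page; the first page would simply omit "search_after".
            Request req = new Request("GET", "/my_index/_search");
            req.setJsonEntity(
                "{\n" +
                "  \"size\": 1000,\n" +
                "  \"sort\": [ { \"page_id\": \"asc\" } ],\n" +
                "  \"search_after\": [ 12345 ],\n" +
                "  \"query\": { \"match_all\": {} }\n" +
                "}");
            Response resp = client.performRequest(req);
            System.out.println(EntityUtils.toString(resp.getEntity()));
        }
    }
}
```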
Fri, Oct 23
Thu, Oct 22
All the revisions I manually checked were created on that same day, 2020-06-12, before mw1384 was depooled. I'm trying to extract a full list from one server, but I'm having a hard time getting blazegraph not to fail:
The revision reported in T266211 was created on 2020-06-12T06:36:58Z which also coincides with the date of problems identified in T264042.
Looking at logs, we seem to have had trouble with an MW machine at that time: T255282, which relates to the opcache issue and the RDF code in Wikibase.
Wed, Oct 21
resuming investigation, additional logs seem to suggest that the jetty http client (or the way we use it) is to blame.
- timestamp: 2020-10-20T18:10:00 to 2020-10-20T21:15:00
- host: mw2252
[2b171d8b-48ec-480d-b7a4-187dd3af259c] /w/api.php?titles=Image%3ANorrlands_nation_Nya_entr%C3%A9n2.jpg&iiprop=url&iiurlwidth=120&iiurlheight=120&prop=imageinfo&format=json&action=query TypeError from line 395 of /srv/mediawiki/php-1.36.0-wmf.13/extensions/WikibaseCirrusSearch/src/Hooks.php: Return value of Wikibase\Search\Elastic\Hooks::getWBCSConfig() must be an instance of Wikibase\Search\Elastic\WikibaseSearchConfig, instance of Wikibase\Search\Elastic\WikibaseSearchConfig returned
The < at line 1 looks suspiciously like an HTML blob being returned while calling the recent changes API; could it be that the host hit by this request was not fully functional at the time?
Tue, Oct 20
The 246979 non-matching events are likely due to T265374
For the 7204 I could only find these two explanations:
- User clicks a search link that has a cirrusUserTesting=bucket attached to it
- User reopens their browser with several tabs open, one of which has a link with a cirrusUserTesting=bucket param attached to it
Mon, Oct 19
Wed, Oct 14
Tue, Oct 13
happened again today:
Mon, Oct 12
capturing some logs before they vanish:
@Aschroet could you append &cirrusDumpQuery to the search URL you obtain when the error occurs and paste the output on the ticket? Thanks!
Fri, Oct 9
Thu, Oct 8
For 691588 backend events matching a test bucket:
- 437764 match a SearchSatisfaction searchResultPage event
- 7204 are inconsistent with their corresponding SearchSatisfaction searchResultPage event (joining on the search token)
- 246979 have no matching SearchSatisfaction searchResultPage event; only 10 match a go action, the rest is unclear
Tue, Oct 6
I don't think there exists a formal process to change these values on wiki. My experience around these values has been:
- I disabled them on enwiki through wgCirrusSearchIgnoreOnWikiBoostTemplates because they were incompatible with the switch to BM25; at the time only the original author of CirrusSearch had set them there, so as a CirrusSearch maintainer I took the liberty of disabling them
- on wikitech these values are actively maintained by wiki admins
Mon, Oct 5
What's inside the elasticsearch index could allow some level of filtering/re-ranking based on the context provided.
Currently, statements that resolve to time values are not indexed, and they would be required here.
On the other hand, selecting (or ranking higher) entities with a proper P31 could be done if the list of P31 items can be inferred easily from the property itself (using P1629 perhaps?).
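To illustrate the P31-based selection, assuming statements are indexed as "P31=Qxx" strings in a statement_keywords-like field (the field name and value format are assumptions here), the filter itself is a simple term query that could be combined with the existing ranking:

```
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class P31Filter {
    public static void main(String[] args) {
        // Hypothetical filter keeping only entities declared as instances of Q5;
        // in practice this clause would be added to (or used to rescore) the
        // query already built for the request.
        BoolQueryBuilder filter = QueryBuilders.boolQuery()
            .filter(QueryBuilders.termQuery("statement_keywords", "P31=Q5"));
        System.out.println(filter);
    }
}
```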
If the problem to solve is pages being tagged with more than one of these templates, I'd suggest the simple approach you suggested (dismax), i.e. setting score_mode = max in includes/Search/Rescore/BoostTemplatesFunctionScoreBuilder.php. Template boosting is rarely used, and most of the time the boosts have probably been tuned with only one matching template in mind, so I'm sure this change would also benefit the rare other wikis using this feature.
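To illustrate the difference, a sketch assuming the elasticsearch 7.x Java client (template names and weights are made up): with score_mode = max the resulting function_score takes the best matching template's weight instead of summing them.

```
import org.elasticsearch.common.lucene.search.function.FunctionScoreQuery;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.functionscore.FunctionScoreQueryBuilder;
import org.elasticsearch.index.query.functionscore.ScoreFunctionBuilders;

public class TemplateBoostExample {
    public static void main(String[] args) {
        FunctionScoreQueryBuilder boost = QueryBuilders.functionScoreQuery(
            new FunctionScoreQueryBuilder.FilterFunctionBuilder[] {
                new FunctionScoreQueryBuilder.FilterFunctionBuilder(
                    QueryBuilders.termQuery("template", "Template:Featured article"),
                    ScoreFunctionBuilders.weightFactorFunction(2.0f)),
                new FunctionScoreQueryBuilder.FilterFunctionBuilder(
                    QueryBuilders.termQuery("template", "Template:Good article"),
                    ScoreFunctionBuilders.weightFactorFunction(1.5f))
            })
            // score_mode = max: a page tagged with both templates gets 2.0, not 3.5
            .scoreMode(FunctionScoreQuery.ScoreMode.MAX);
        System.out.println(boost);
    }
}
```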
If the problem is more about regaining control over template boosting, because the way the boosts are applied is not compatible with the ranking formula being implemented, I'd suggest setting up a dedicated rescore profile; this will give more flexibility to tune these settings. The issue is that wgCirrusSearchBoostTemplates and wgCirrusSearchIgnoreOnWikiBoostTemplates are global to all query builders.
Fri, Oct 2
Thu, Oct 1
The root cause of the problem is still unclear.
Added some more debug logs to continue investigating.
What I know so far is that only codfw was affected and that restarting blazegraph on an affected node fixed the issue. Some state is probably leaked, but it's unclear where yet; it could be in blazegraph itself or in the jetty http client (the additional logging should hopefully help to rule out one option or the other).
Sep 30 2020
For the record, here are some graphs taken over the same period (Jun 2020 to Sept 2020):
Sep 29 2020
Looking at existing solutions based on flink in this area, I don't think this is a good fit for the Table API and/or SQL unless the use case is relatively simple (i.e. does not require fine control over the state nor specific timers); see the sketch after the list below.
Most solutions I've seen describe a similar architecture:
- event ingestion (exactly what eventgate does)
- flink pipeline:
- read from existing event sources and possibly join multiple ones
- key (partitioning)
- feature extraction (time operation/aggregation/...)
- anomaly detection (applying rules/models)
- front-end (alerts/UI)
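As a point of comparison, here is a minimal, hypothetical sketch of the kind of per-key logic (keyed state plus timers) that the DataStream API allows and that is hard to express in the Table API/SQL; the event type, threshold and timeout are all made up, and a real implementation would also manage/clear its timers:

```
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical per-key anomaly detector: keeps the last observed value in
// keyed state and registers a timer to flag keys that stop emitting events.
// Assumes event-time timestamps/watermarks are assigned upstream.
public class SimpleAnomalyDetector extends KeyedProcessFunction<String, Double, String> {
    private transient ValueState<Double> lastValue;

    @Override
    public void open(Configuration parameters) {
        lastValue = getRuntimeContext().getState(
            new ValueStateDescriptor<>("lastValue", Double.class));
    }

    @Override
    public void processElement(Double value, Context ctx, Collector<String> out) throws Exception {
        Double previous = lastValue.value();
        if (previous != null && Math.abs(value - previous) > 100.0) {
            out.collect(ctx.getCurrentKey() + ": sudden jump from " + previous + " to " + value);
        }
        lastValue.update(value);
        // fire if no new event arrives for this key within 5 minutes
        ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 5 * 60 * 1000);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        out.collect(ctx.getCurrentKey() + ": no events seen for 5 minutes");
    }
}
```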
no objections to prefixing a letter or a couple of chars here; the query service munging process can easily be adapted to remove such prefixes when skolemizing the blank nodes.
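To make that concrete, a hypothetical sketch of the munging side (the "bn" prefix and the genid IRI base are assumptions, not the actual updater code):

```
public class BlankNodeSkolemizer {
    // Strips a hypothetical "bn" prefix from the blank node label before
    // building the skolem IRI.
    static String skolemize(String blankNodeLabel) {
        String label = blankNodeLabel.startsWith("bn")
            ? blankNodeLabel.substring(2)
            : blankNodeLabel;
        return "http://www.wikidata.org/.well-known/genid/" + label;
    }

    public static void main(String[] args) {
        System.out.println(skolemize("bn4f2a8c"));
    }
}
```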
Something seems to have happened around Jul 14th; it's particularly visible on https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=elasticsearch&var-instance=All&var-datasource=thanos&from=now-90d&to=now (esp. the temperature & network graphs).
The search thread pool sizes started to rise more regularly after this date as well.
Sep 28 2020
Sep 25 2020
Sep 23 2020
I did not find anything obvious, but looking at the various classes involved in managing the writes I see excessive locking protection and object reuse, esp.:
- WriteCacheService, which keeps and reuses WriteCache instances.
- WriteCache, which wraps (protects?) access to a ByteBuffer.
- DirectBufferPool, which according to comments seems to have issues managing its references: "When DEBUG is true we do not permit a buffer which was not correctly released to be reused", which in other words means that when DEBUG is false we do permit a buffer which was not correctly released to be reused.
Sep 22 2020
Sep 21 2020
Sep 18 2020
I like the idea of using the wikidata graph (via SPARQL) to explore possibilities of pulling interesting data to feed a query expansion engine.
Using WDQS for serving real-time search traffic, on the other hand, is not an option I think (for perf reasons), but I believe it could make sense to create a dedicated dataset using the findings you've made here. This dataset could be used for two purposes:
- the initial concept lookup (replacing the need to use wikidata fulltext search)
- the expansion of the concepts by following certain paths of the graph, as you experimented with (a query sketch follows below)
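As a rough sketch of how such a dataset could be built offline against the SPARQL endpoint (the property path, depth and starting item below are just examples of "certain paths of the graph", not a recommendation):

```
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class ConceptExpansion {
    public static void main(String[] args) throws Exception {
        // Hypothetical expansion query: fetch superclasses of a concept up to
        // two P279 hops away, with their English labels.
        String sparql =
            "SELECT ?expanded ?label WHERE { " +
            "  wd:Q146 wdt:P279/wdt:P279? ?expanded . " +
            "  ?expanded rdfs:label ?label . FILTER(LANG(?label) = \"en\") " +
            "}";
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://query.wikidata.org/sparql?format=json&query="
                + URLEncoder.encode(sparql, StandardCharsets.UTF_8)))
            .header("Accept", "application/sparql-results+json")
            .GET()
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```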
Sep 17 2020
https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&from=now-3h&to=now shows a restart/deploy during this spike so I guess it's not related to the train.