User Details
- User Since
- Jun 9 2015, 9:03 AM (362 w, 1 d)
- Availability
- Available
- IRC Nick
- dcausse
- LDAP User
- DCausse
- MediaWiki User
- DCausse (WMF)
Today
Marking as Open since this should be resolved "soon": we plan to ship ruflin/Elastica 7.1.5 & elasticsearch/elasticsearch 7.11.0 as part of https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/791634 in the coming weeks.
@Jdforrester-WMF indeed! I wish I had noticed your comment sooner! Reedy's patch on vendor was merged, so I think we're good now :)
The cause is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/790734 removing this function; that patch should have had https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GeoData/+/790732 tagged as Depends-On.
We might have avoided this mistake by including GeoData in CirrusSearch phan analysis. I'll file a task to change this.
Mon, May 16
Thu, May 12
https://commons.wikimedia.beta.wmflabs.org/wiki/File:Jason_Shaw_-_Big_Car_Theft.ogg?action=cirrusDump shows it is indexed, and it seems to appear in search results now; it might just be that beta is a bit slow to index pages.
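For reference, a minimal sketch (Python) of how one could poll the cirrusDump output to confirm the page made it into the index; the assumption that the payload is a JSON list of indexed documents is mine:

```
import requests

# Fetch the cirrusDump output for the page on the beta cluster and report
# whether CirrusSearch has an indexed document for it.
URL = ("https://commons.wikimedia.beta.wmflabs.org/wiki/"
       "File:Jason_Shaw_-_Big_Car_Theft.ogg")

resp = requests.get(URL, params={"action": "cirrusDump"}, timeout=30)
resp.raise_for_status()
docs = resp.json()

# Assumption: cirrusDump returns a JSON list of the documents stored in the
# search index; an empty list would mean the page is not indexed yet.
print("indexed" if docs else "not indexed yet")
```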
Tue, May 10
Mon, May 9
This is extremely weird and I suspect a serious blazegraph bug is causing this. I could not reproduce the problem at the moment with the python script provided, but it may well happen again in the future.
I'm not sure how to proceed here, but perhaps capturing the full blazegraph response when it occurs would help?
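To make that concrete, a minimal sketch (Python) of capturing the raw response when the anomaly shows up; the endpoint, the query, and the detection check are placeholders, not what the provided script actually does:

```
import time
import requests

# Placeholder for whatever check the provided python script already performs;
# here we simply flag an empty result set.
def looks_wrong(results: dict) -> bool:
    return not results.get("results", {}).get("bindings")

# Assumptions: a local blazegraph endpoint and a placeholder query; the real
# script would use whichever query triggers the issue.
ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

resp = requests.post(
    ENDPOINT,
    data={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=60,
)

if looks_wrong(resp.json()):
    fname = f"blazegraph-response-{int(time.time())}.json"
    with open(fname, "w") as f:
        f.write(resp.text)  # keep the raw body, not just the parsed bindings
    print(f"anomalous response captured in {fname}")
```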
Tue, May 3
Mon, May 2
I could not reproduce such duplicated settings even when following the upgrade path 5.5 -> 6.3 -> 6.3.1 -> 6.4.2 -> 6.5.4.
Testing with the node state taken from the master node (e.g. elastic1054.eqiad.wmnet:/srv/elasticsearch/production-search-eqiad/nodes/0/_state/), I was able to boot elasticsearch v7.10.2 locally without errors; the duplicated settings disappeared and I could update them properly.
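For the record, a rough sketch (Python, against a hypothetical local test instance) of the kind of check and update I did through the cluster settings API; the setting name below is a placeholder, not the one that was duplicated:

```
import requests

# Hypothetical local test instance booted on the copied node state.
ES = "http://localhost:9200"

# flat_settings makes duplicated keys easy to spot in the output.
current = requests.get(
    f"{ES}/_cluster/settings", params={"flat_settings": "true"}, timeout=10
)
print(current.json())

# Clearing a persistent setting by resetting it to null; the setting name is a
# placeholder, not the one that was actually duplicated.
update = requests.put(
    f"{ES}/_cluster/settings",
    json={"persistent": {"cluster.routing.allocation.enable": None}},
    timeout=10,
)
update.raise_for_status()
print(update.json())
```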
I'm going to assume that this won't cause any problems for us and will resolve itself with the upgrade to 7.10.
Fri, Apr 29
Here are some thoughts we compiled while figuring out the possible deployment options on the wikikube k8s cluster: https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Flink_On_Kubernetes .
Thu, Apr 28
Mon, Apr 25
- Change: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/785199/
- Summary:
- This merges a branch with 8 pending patches that relate to T301959
- Test plan:
- Error checking (this patch is supposed to be a noop but might introduce new errors)
- Places to monitor:
- Revert plan: Revert patch
- Affected wikis: all
- IRC contact: ebernhardson, dcausse
- UBN Task Projects/tags: CirrusSearch
- Would you like to backport this change rather than ride the train?: No
Thu, Apr 21
@matthiasmullie I don't think this would be problematic, but it will be limited by blazegraph's ability to return the full category hierarchy and by the way deepcat currently constructs its elasticsearch query (rough sketch of the latter below).
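To illustrate that second limit, a rough sketch (Python) of how the expanded category list ends up in the elasticsearch query; the "category" field name and the cap are assumptions, not the exact query CirrusSearch generates:

```
# Rough illustration only: deepcat first asks blazegraph for the category tree
# (bounded in depth and size), then matches pages against the resulting list.
MAX_CATEGORIES = 256  # hypothetical cap on the expansion

def build_deepcat_query(categories):
    if len(categories) > MAX_CATEGORIES:
        # Past this point the hierarchy is truncated, which is one of the
        # limits mentioned above.
        categories = categories[:MAX_CATEGORIES]
    return {
        "bool": {
            "should": [{"match": {"category": c}} for c in categories],
            "minimum_should_match": 1,
        }
    }

print(build_deepcat_query(["Music", "Music by genre", "Rock music"]))
```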
Wed, Apr 20
Tue, Apr 19
@Nikerabbit we would like to have all MW extensions depending on elasticsearch migrated by May 13th, so addressing this task in the first half of May would be ideal for us.
Apr 14 2022
I can confirm that this host is not used.
Apr 12 2022
Apr 11 2022
Apr 8 2022
Apr 5 2022
The reason is that this data may be referenced by other items and thus cannot be deleted blindly without asking blazegraph "is this data used by another item?", which would be too costly to do for every edit.
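Concretely, for every deleted statement we would have to run something like the following sketch (Python; the value-node hash and statement IRI are made up, and the public query endpoint is used here only for illustration):

```
import requests

# Made-up IRIs: a shared value node and the statement being deleted. The ASK
# checks whether any *other* statement still references the value node.
ASK_QUERY = """
PREFIX wds: <http://www.wikidata.org/entity/statement/>
PREFIX wdv: <http://www.wikidata.org/value/>

ASK {
  ?otherStatement ?p wdv:0123456789abcdef0123456789abcdef .
  FILTER(?otherStatement != wds:Q42-00000000-0000-0000-0000-000000000000)
}
"""

resp = requests.post(
    "https://query.wikidata.org/sparql",  # for illustration only
    data={"query": ASK_QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=60,
)
print(resp.json()["boolean"])  # True: still referenced, so it cannot be deleted
```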
Another approach is to reload blazegraph from the dumps at regular intervals (TBD: once, twice or four times a year).
Apr 4 2022
Apr 1 2022
Actually wdqs2007, wdqs2004 and wdqs2003 also triggered jvmquake; GC activity increased and wdqs2007 & wdqs2003 were unresponsive for a couple of minutes. For wdqs2004 there are no visible blips in the various graphs. I guess we should relax the settings a bit more.
With these settings we properly detected wdqs1006 going down for 30 minutes at 2022-04-01T12:30:00 (this is 2 minutes after the first blip in the graph).
Unfortunately there was a false positive for wdqs1012 at 2022-04-01T10:00:00, as this machine was unavailable for only 2 minutes.
Unsure if it's still too sensitive or if we can accept having a couple of false positives.
I agree that federation adds a lot of boilerplate and that inspecting the shape of the IRIs is very fragile. But merging multiple graphs into the same store for ease of use goes against the recent discussions we had around the future of the WDQS architecture, and it is also a bit more complex than it seems within the current data flows. Nevertheless I think the concern you raise is very valid and should be taken into account while we figure out whether splitting the graph and building on top of SPARQL federation is something we have to pursue, and/or whether some sub-graphs are so central that they would be better replicated to all sub-graphs.
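For context, this is the kind of boilerplate a query needs once part of the data lives in another sub-graph reached over SPARQL federation (sketch only; the second endpoint and the choice of predicates are placeholders):

```
# Sketch of a federated query: the main graph is queried normally and the data
# living in another sub-graph is pulled in via SERVICE.
FEDERATED_QUERY = """
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?item ?value WHERE {
  ?item wdt:P31 wd:Q5 .
  SERVICE <https://other-subgraph.example.org/sparql> {
    ?item wdt:P569 ?value .
  }
}
LIMIT 10
"""
print(FEDERATED_QUERY)
```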
Mar 31 2022
Thanks for the quick answer! (response inline)
Tentatively moving this ticket to needs review as I'm not sure we can do much more from the search team's perspective.
I think the last point to discuss was investigating why a single misbehaving k8s node could make a deployment unstable.
@JMeybohm do you see any additional action items that would improve the resilience of k8s in such a scenario?
Mar 30 2022
Mar 29 2022
The reconciliation process is running and should auto-correct missed updates a couple of hours after they happen.
I also fixed the inconsistencies listed here and other related tickets. Please let me know if you still find errors.
Moved remaining work to T304914.
Mar 28 2022
Re-opening, this seems to happen more frequently.
Mar 25 2022
Pinging releng for help on how to proceed with the gitlab MR and the deployment of the images to the docker repo.