First attempt didn't work:
In https://horizon.wikimedia.org/project/instances/beed7e0d-4e7f-446f-a73c-60dce7ecff4f/ I see the config for stream-beta.wmflabs.org. The Docker image currently used is version 2022-01-20-101239-production.
Applied the permanent fix, all use cases work afaics! Thanks for reporting :)
Applied a permanent fix, thanks for reporting!
Wed, Sep 27
[2023-09-27 13:01:02,991] ERROR [GroupMetadataManager brokerId=1003] Appending metadata message for group kafka-mirror-main-eqiad_to_jumbo-eqiad generation 19961 failed due to org.apache.kafka.common.errors.RecordTooLargeException, returning UNKNOWN error code to the client (kafka.coordinator.group.GroupMetadataManager)
It seems that the 15th MirrorMaker instance triggers the issue (it doesn't matter which one; the trouble starts after the 14th).
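This would be consistent with how group metadata works: the record appended to __consumer_offsets contains every member's subscription and assignment, so it grows with each extra MirrorMaker instance until it exceeds the topic's max message size. A hedged sketch of one possible mitigation (broker address and the 2 MiB value are illustrative assumptions, not what we applied):

```python
# Hedged sketch (not what was applied here): inspect and raise
# max.message.bytes on __consumer_offsets, whose default (~1 MiB) caps the
# size of the group metadata record. Assumes confluent-kafka is installed;
# broker address and the 2 MiB value are illustrative.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "kafka-main1003.eqiad.wmnet:9092"})

resource = ConfigResource(ConfigResource.Type.TOPIC, "__consumer_offsets")
config = admin.describe_configs([resource])[resource].result()
print("current max.message.bytes:", config["max.message.bytes"].value)

updated = ConfigResource(
    ConfigResource.Type.TOPIC,
    "__consumer_offsets",
    set_config={"max.message.bytes": str(2 * 1024 * 1024)},
)
admin.alter_configs([updated])[updated].result()  # returns None on success
```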
Two theories for the moment:
Tue, Sep 26
I applied the following config to both ml-serve-eqiad and codfw:
Applied a manual fix on ml-serve-codfw, it should now work. I need to add the proper config to deployment-charts to make it permanent :)
Mon, Sep 25
@Seddon my understanding is that this version of the recommendation API is the one that we want to move forward with, deprecating the one that the apps are using. We need to consolidate the work into one single API, and the recommendation-api that the apps currently use is already exposed via RESTBase.
Fri, Sep 22
The following works nicely and it seems more precise:
We currently don't support feature injection; what is the use case for it? From our traffic analysis, it is not a feature that gets much real use. ORES is being deprecated, so we'd like to keep the set of features to maintain as small as possible.
For the first use case, I tried to check logs in Lift Wing for ruwiki:133170407, and I see this:
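For reference, this is the shape of the request used to reproduce it (a hedged sketch via the public Lift Wing route on api.wikimedia.org; the model name and User-Agent here are assumptions, and internally one would hit the discovery endpoint instead):

```python
# Hedged sketch: reproduce the ruwiki:133170407 score request on Lift Wing.
# Uses the public api.wikimedia.org route; model name and UA are illustrative.
import requests

resp = requests.post(
    "https://api.wikimedia.org/service/lw/inference/v1/models/ruwiki-damaging:predict",
    json={"rev_id": 133170407},
    headers={"User-Agent": "ml-debugging-sketch/0.1"},
    timeout=30,
)
print(resp.status_code, resp.json())
```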
The K8s SIG group added a new policy for upstream charts imported in our repo: https://wikitech.wikimedia.org/wiki/Kubernetes/Upstream_Helm_charts_policy
Change merged! Thanks to all for the feedback :)
Final status for all the dashboards:
Thu, Sep 21
Wed, Sep 20
Fix deployed to goodfaith/damaging pod environments. Let's double check tomorrow that the memory metrics are stable, and then we can roll the changes out further.
We set it in our requirements.txt :(
Something really interesting: the following rev-id (https://es.wikipedia.org/w/index.php?diff=153880256) causes a big jump in the memory retained by mwparserfromhell:
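A hedged sketch of how to measure the jump locally (the rev id is taken from the diff URL above; the User-Agent is illustrative):

```python
# Hedged sketch: fetch the revision behind the diff above and measure how
# much memory mwparserfromhell allocates/retains while parsing it.
import tracemalloc

import mwparserfromhell
import requests

resp = requests.get(
    "https://es.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "revisions",
        "revids": 153880256,  # rev id from the diff URL
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
        "formatversion": 2,
    },
    headers={"User-Agent": "ml-debugging-sketch/0.1"},
    timeout=30,
)
text = resp.json()["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"]

tracemalloc.start()
wikicode = mwparserfromhell.parse(text)  # keep the reference so memory stays live
current, peak = tracemalloc.get_traced_memory()
print(f"current={current / 2**20:.1f} MiB, peak={peak / 2**20:.1f} MiB")
```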
I left the code running for a bit to test disabling caching in revscoring, and I ended up with the following (more precise) trace:
One trace sample from tracemalloc:
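Samples like this can be captured with a few lines of instrumentation (a minimal sketch; score_revision() is a hypothetical stand-in for the scoring loop being profiled):

```python
# Minimal tracemalloc sketch; score_revision() is a hypothetical stand-in
# for the revscoring call being profiled.
import tracemalloc

def score_revision(rev_id):
    # Hypothetical workload that allocates and retains memory.
    return [str(rev_id) * 100 for _ in range(1000)]

tracemalloc.start(25)  # keep 25 frames so tracebacks reach into library code

retained = [score_revision(rev_id) for rev_id in range(50)]

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("traceback")[:3]:
    print(f"{stat.size / 1024:.1f} KiB in {stat.count} blocks")
    for line in stat.traceback.format():
        print(line)
```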
Last changes applied, we should be good to close!
I created the following test environment (locally):
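Roughly this shape (a hedged sketch, not the exact setup: it assumes revscoring and mwapi are installed and the eswiki damaging model file has been downloaded locally; the file name is illustrative):

```python
# Hedged local-reproduction sketch: repeatedly extract features and score
# revisions against eswiki while watching the process memory grow.
# The model file name is illustrative.
import mwapi
from revscoring import Model
from revscoring.extractors import api

session = mwapi.Session("https://es.wikipedia.org", user_agent="ml-debugging-sketch/0.1")
extractor = api.Extractor(session)

with open("eswiki.damaging.gradient_boosting.model") as f:
    model = Model.load(f)

for rev_id in [153880256]:  # extend the list to observe memory growth
    feature_values = list(extractor.extract(rev_id, model.features))
    print(rev_id, model.score(feature_values))
```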
Tue, Sep 19
Using this graph to find the busiest pods to kill before leaving for EOD (to avoid issues during the EU night).
From the SAL:
WME is going to perform some extra tests on Lift Wing this week, and they will enable the full request flow after that.
Tried to hit eswiki-damaging in staging:
Mon, Sep 18
Ilias rolled out https://gerrit.wikimedia.org/r/958393 to damaging/goodfaith pods in ml-serve-eqiad, so far we haven't seen any occurrence of the memory leak. Let's keep it monitored.
@RLazarus thanks a lot! We can wait and be the first beta-testers of the new alerts, if that's ok with you!
@RLazarus What do you think? :)
Email sent to Wikitech-l; the task is complete. Let's leave it open for a couple of days to see if everything works as expected.
Killed eswiki damaging/goodfaith, same pattern.
Error while uploading the new revscoring to PyPI:
Sun, Sep 17
Killed eswiki-damaging again.
The last pod that I deleted on ml-serve-eqiad was eswiki-damaging-predictor-default-00012-deployment-754bf46tdg6p. Interestingly, an OOM event had occurred some hours earlier:
There seems to be a metric to look for, namely response-flags:
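A hedged sketch of how to pull it from Prometheus (istio_requests_total carries a response_flags label; the Prometheus URL here is an assumption):

```python
# Hedged sketch: break down request rates by Envoy response_flags; anything
# other than "-" (e.g. UH, UF, DC) marks an abnormal response. The
# Prometheus URL is an illustrative assumption.
import requests

promql = 'sum(rate(istio_requests_total{response_flags!="-"}[5m])) by (response_flags)'
resp = requests.get(
    "http://prometheus.example.wmnet/api/v1/query",
    params={"query": promql},
    timeout=10,
)
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("response_flags"), series["value"][1])
```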
Sat, Sep 16
Killed eswiki goodfaith and damaging today.
Fri, Sep 15
As a desperate attempt, we restarted all the pods in goodfaith/damaging (eqiad and codfw). It likely won't help, but it's worth trying.
Tried to log in to an ml-serve node running a hanging pod, and tried to run gdb (without nsenter):
We found a serious bug though: sometimes the kserve container inside an isvc pod stops working for some reason, blackholing traffic. We noticed this because the retry queue in changeprop for ORESFetchScoreJob has been growing for the past few days, and the related Kafka topic was constantly getting new events. The related Kafka consumer lag is decreasing, but we need to file a new task to investigate this problem (since it can happen again at any time).
The metrics now look better! One thing that I noticed is that we have a lot of events in the ORESFetchScoreJob retry topic.
The Kafka consumer lag dashboard shows what Ilias pointed out, namely that changeprop is lagging in consuming (and processing) ORESFetchScore jobs.
Latency for enwiki damaging
Really nice finding! It seems to match https://sal.toolforge.org/log/HIAEhIoBGiVuUzOdDi6t exactly, which is when we moved wikidata and enwiki to Lift Wing. Maybe it is only a matter of adding more pods?
@klausman Assigned the task to you since there are a couple of steps that are more related to SRE (lemme know if you don't have time, I'll take care of it).
We applied the rps strategy to all our isvcs and re-calibrated the autoscaling settings. The autoscaling graphs look much better now, so I am inclined to close. Thanks Aiko for the work!
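For reference, the rps strategy amounts to Knative autoscaling annotations on the isvc (a hedged sketch; the target value is illustrative, not our calibrated setting):

```yaml
# Hedged sketch: rps-based Knative autoscaling on an InferenceService.
# The target value is illustrative.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: eswiki-damaging
  annotations:
    autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
    autoscaling.knative.dev/metric: rps
    autoscaling.knative.dev/target: "75"
```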
I think that we should coordinate with SRE (@RLazarus for example) before proceeding further with SLO alarming, we don't want to derail from the SRE recommendations :)
Thu, Sep 14
We have disabled the revision-score stream from Eventstreams, so it is not published anymore in https://stream.wikimedia.org/.
The service is currently deployed to production!
Wed, Sep 13
@bking I had a chat with @dcausse and from https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/ha/zookeeper_ha/#example-configuration (the 1.16 docs) the example config is the following (quoted below; hosts and paths are placeholders to adapt):
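```yaml
# Example from the linked 1.16 docs; hosts and paths are placeholders.
high-availability: zookeeper
high-availability.zookeeper.quorum: localhost:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /default_ns # important: customize per cluster
high-availability.storageDir: hdfs:///flink/recovery
```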
Thanks for the info @MGerlach!
@bking another thing to verify:
The flink cluster in eqiad looks healthy:
Tue, Sep 12
Great results @achou!
Opened T346144 as related change for the dashboards :)
From this query it seems that only one IP address has been active in the past few weeks. It is difficult to follow up with them: it looks like a stream client running in a cloud provider, and it doesn't seem possible to get the UA or similar from our metrics and logs.
No no, for the moment it is fine. What I wanted to avoid is having a single person on call for events like Cassandra being in trouble (namely, you :). We can slowly build knowledge over time; I am in if you want to evangelize more about how to deal with Cassandra events!
@prabhat It shouldn't be a problem, so let's keep both dev and prod accounts for the moment. The total traffic should go up to 100 rps with both clients active, not a ton but still sizeable :D We can review the status down the line; maybe let's start with just one client hitting at full power for the moment, would that be ok?
Mon, Sep 11
elukey@mwmaint1002:~$ mwscript extensions/OAuthRateLimiter/maintenance/setClientTierName.php --wiki metawiki --client ee4742635eb8098fbbc5a0d3ee037251 --tier wme
Successfully added tier wme for ee4742635eb8098fbbc5a0d3ee037251.
elukey@mwmaint1002:~$ mwscript extensions/OAuthRateLimiter/maintenance/setClientTierName.php --wiki metawiki --client 686b05ed258704c7d71bb584cfe60865 --tier wme
Successfully added tier wme for 686b05ed258704c7d71bb584cfe60865.
@Eevans I totally understand your point of view, but at the same time it's not clear to me what procedure we should follow when an issue like this one happens (while on call, etc.). Is your recommendation to just leave the instance depooled, stop puppet, etc., and then ping Data Persistence for a permanent fix?