User Details
- User Since
- Oct 9 2014, 4:50 PM (601 w, 6 h)
- Availability
- Available
- IRC Nick
- ottomata
- LDAP User
- Ottomata
- MediaWiki User
- Ottomata
Today
I fixed the process function metrics; I had accidentally renamed them.
Errr, I thought I fixed them, but I don't see them. They were fixed locally. Hm. Will follow up tomorrow.
Status report!
Since OOMs don't seem to be related to async_enabled=false or the v1.50.0 release, I'm going to deploy staging back at v1.50.0. I want to find out if enrich function metrics are broken just for sync mode, or if they are also broken in async. I also don't think we need so much parallelism (if we are not backfilling, and maybe even if we are). Deploying v1.50.0 with 'fake sync' mode, with a slight reduction in parallelism and TM memory.
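For reference, roughly what the parallelism/memory knobs look like (in our deploys these actually live in the chart values; numbers here are placeholders, and 'fake sync' mode is our own app setting, not shown):
```python
from pyflink.common import Configuration
from pyflink.datastream import StreamExecutionEnvironment

# Placeholder values, just to show which knobs are being turned down.
conf = Configuration()
conf.set_string("taskmanager.memory.process.size", "4g")  # reduced TM memory
env = StreamExecutionEnvironment.get_execution_environment(conf)
env.set_parallelism(4)  # reduced parallelism
```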
Ah! It also slowly container OOMs and restarts in the same way on v1.50.0.dev40 (without the new async_enabled param). So it seems more likely due to our latest Flink settings? Or was it OOMing before (in the non-backfill case) too?
Yesterday
Avoid envoy retries (there's a header)
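If it's the header I'm thinking of, it's one of the x-envoy-* request headers. A sketch (header name/value should be double-checked against our service mesh config before relying on it):
```python
import requests

# Hedged sketch: per Envoy's router docs, retry behavior can be influenced
# per-request via x-envoy-* headers. Whether x-envoy-max-retries: 0 disables
# route-configured retries in our mesh is an assumption worth verifying.
resp = requests.post(
    "https://mw-api-int-ro.discovery.wmnet:4446/w/api.php",  # endpoint from our logs
    headers={"x-envoy-max-retries": "0"},  # ask envoy not to retry this request
    timeout=10,
)
```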
Hm, it looks like my Monday deployment now container OOMs regularly after about 11 hours. It restarts and recovers fine.
Possibly related to T416669: Upgrade Kafka to version 3.x? But I'm not sure if they upgraded beta kafka?
Mon, Apr 13
Hm, async_enabled=False EventProcessFunction metrics look busted though: https://grafana.wikimedia.org/goto/bfizxkx7mumm8d?orgId=1
Alright, I merged the patch for T421965: eventutilities-python - support synchronous Flink process function mode, and bumped MWEE to use it. I merged your patch to use a single process function, and released MWEE.
Ah yes, that would be best! TY
Thu, Apr 9
BTW, I tried pipeline.object-reuse and got an exception. See slack thread.
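For context, this is the toggle I was flipping (the pyflink spelling of pipeline.object-reuse):
```python
from pyflink.datastream import StreamExecutionEnvironment

# Equivalent to setting pipeline.object-reuse: true in flink config.
env = StreamExecutionEnvironment.get_execution_environment()
env.get_config().enable_object_reuse()
```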
I am trying process_async_enabled_default=False in staging (T421965).
In case helpful:
Based on a meeting today (notes here) it sounded like adding more partitions may actually hinder our current efforts in T421216: HTML Enrichment - Tuning & Backfilling configuration. @JMonton-WMF should we hold on actually doing this?
If the canary events only flow to a single partition, this might create an imbalance on Flink applications.
Deploying content length fix in prod, and also avoiding retries on 504 server timeouts.
Wow nice!
Wed, Apr 8
I published Docker image v1.49.0.dev30 today with sync mode. Didn't have time to test it live.
Hm! Prod failed with a 'message too large' error, but in the error sink!
As a rule of thumb, we need to do 40 HTTP calls per second. If many of them take 10 seconds, we need to have 400 HTTP calls in parallel or we won't be able to reach 20 msg/second. If we need to backfill, we need to increase that x2, x4, or ideally x10 or more.
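That rule of thumb is just Little's law (in-flight requests ≈ arrival rate × latency):
```python
# Back-of-envelope for the numbers above.
calls_per_second = 40
latency_seconds = 10
in_flight = calls_per_second * latency_seconds  # 400 concurrent HTTP calls

backfill_factor = 10  # x2, x4, or ideally x10+
print(in_flight, in_flight * backfill_factor)  # 400 4000
```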
The checkpoint was already exactly_once and the Sink delivery guarantee was at_least_once. We can make checkpointing "unaligned", which seems to improve things in scenarios with big latencies, and we can also reduce the number of checkpoints being created.
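For concreteness, roughly how those two tweaks look in pyflink (the interval is a placeholder):
```python
from pyflink.datastream import CheckpointingMode, StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# Longer interval => fewer checkpoints; mode stays exactly_once.
env.enable_checkpointing(5 * 60 * 1000, CheckpointingMode.EXACTLY_ONCE)
# Unaligned checkpoints tend to help when in-flight latencies are big.
env.get_checkpoint_config().enable_unaligned_checkpoints()
```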
Oh! I did not know this! Very interesting.
Linking some really good thoughts from Javier from the parent task: T421216#11792886
...and responding here.
Mon, Apr 6
The job restarted. It is now slowly catching back up.
Well it isn't catching back up now.
Sun, Apr 5
Looks like something similar to T421216#11787170 happened again last night.
https://grafana.wikimedia.org/goto/efi675t7jq96ob?orgId=1
Sat, Apr 4
Can we do better than 400ms in the normal case?
Ah, I figured out why the Flink UI wasn't showing metrics. metrics.internal.query-service.port was not configured. It is a random port by default, but in k8s we need a networkpolicy for it. This was already supported by the flink chart, but it wasn't set by default.
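For posterity, the fix amounts to pinning that port so the networkpolicy can target it (in our case it's set via the flink chart values; the port number here is made up):
```python
from pyflink.common import Configuration

# metrics.internal.query-service.port defaults to a random port; pin it so a
# k8s NetworkPolicy can allow traffic to it.
conf = Configuration()
conf.set_string("metrics.internal.query-service.port", "50100")
```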
FYI, we see log messages like
Name collision: Group already contains a Metric with the name 'pendingCommittables'. Metric will not be reported.
Increase mediawiki.page_change.v1 kafka topic partitions and increase kafka source parallelism
@JMonton-WMF this is still worth a try, but I am less and less thinking that source throughput is the problem. Flink should shuffle the source messages to downstream tasks, and the source messages are not the big ones. We have successfully backfilled page_content_change with a lot of TMs before, with the same number of kafka topic partitions.
Huh! And production caught back up! It did not crash or restart.
The production job has started getting stuck.
https://grafana.wikimedia.org/goto/bfi0ytdvi6y2od?orgId=1
Fri, Apr 3
Well, whatever I changed didn't work. Staging is still dying due to the same 'message too large' Kafka sink error.
For now, in staging, I'm going to reduce enrich.max_content_size to 15MB, giving us a 5MB margin.
Well! Staging is failing with message too large in kafka sink again:
Looks like prod died, and is now backfilling! Ah, but if there are no checkpointed offsets, the Flink source will fall back to the earliest offset by default!
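A sketch of the guard I have in mind for the source: prefer committed offsets and fall back to latest rather than earliest, so a lost checkpoint doesn't mean an accidental full backfill (whether we'd actually want latest is a separate question; broker/topic/group names below are illustrative):
```python
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream.connectors.kafka import (
    KafkaOffsetResetStrategy,
    KafkaOffsetsInitializer,
    KafkaSource,
)

source = (
    KafkaSource.builder()
    .set_bootstrap_servers("kafka-jumbo1001.eqiad.wmnet:9092")  # illustrative
    .set_topics("eqiad.mediawiki.page_change.v1")  # illustrative
    .set_group_id("mw-page-content-change-enrich")  # illustrative
    # Default starting offsets are earliest(); this falls back to latest instead.
    .set_starting_offsets(
        KafkaOffsetsInitializer.committed_offsets(KafkaOffsetResetStrategy.LATEST)
    )
    .set_value_only_deserializer(SimpleStringSchema())
    .build()
)
```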
Cool! Seems easy enough, except the last commit is 7 years ago? :D
@JMonton-WMF something we should keep an eye on: kafka topic size. I think the html topics will end up being the largest on kafka jumbo. It looks like the largest topic is about 420GB right now.
Thu, Apr 2
@JMonton-WMF good luck to you! Things we need to try:
Okay, in staging (-next) I just applied
@JMonton-WMF and @AKhatun_WMF while we have backfill tuning issues, I want the production job to run, not restart, and not have to backfill if it does restart. To achieve this, I'm doing the following:
Another main error that appears as INFO, right after a real ERROR:
changeprop errors
Wed, Apr 1
Should I host the code anywhere, get it reviewed? Hosting is probably not required since it is a one-time Spark job.
Nah, but it would be good to post it somewhere. Here in phab is fine, or in a linked GitLab snippet, or whatever you prefer.
Uh, that did not work.
BTW I think we can see the container OOMs here: https://grafana.wikimedia.org/goto/bfht68nna58g0f?orgId=1
There are a couple of other (minor?) pieces of the puzzle.
...And also, maybe we should just try upgrading to Flink 2.2.0 and doing T347282: [Event Platform] eventutilities-python: improve consistency guarantees of async process functions, using Flink's built-in AsyncProcessFunction instead of our custom microbatcher.
access to internal data sets such as wmf.pageview_actor, sessionlength
These are unlikely to be made available publicly, but...
Re checkpointing, this could be what we need when backfilling: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/fault-tolerance/checkpointing/#execution-checkpointing-interval-during-backlog
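i.e. something like this, per the linked docs (intervals made up): checkpoint less often while the source reports backlog.
```python
from pyflink.common import Configuration

conf = Configuration()
conf.set_string("execution.checkpointing.interval", "1min")
# Used instead of the normal interval while sources report backlog (backfill).
conf.set_string("execution.checkpointing.interval-during-backlog", "10min")
```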
More off-heap memory
Right! Makes sense. IIRC, @AKhatun_WMF and I had to increase this for the edit types stuff too, to deal with large messages in pyflink.
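For reference, I think the knob we bumped back then was the task off-heap size (size here is a placeholder):
```python
from pyflink.common import Configuration

# "More off-heap memory" in Flink config terms; value is illustrative.
conf = Configuration()
conf.set_string("taskmanager.memory.task.off-heap.size", "1g")
```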
Very nice writeup, thank you.
No objections! I haven't followed this closely, but IIUC the EventBus-related work has been done by Aaron. Thanks Aaron!
Tue, Mar 31
It works!
OTOH, perhaps all we need is something for a one-off now? If so, this would be a quick one-off and easy enough to do.
It would be nice to backfill the final production table, rather than our current development tables.
Slack thread with backfill approach.
Config uses deprecated configuration key 'state.backend' instead of proper key 'state.backend.type'
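The warning just wants the new key spelling, e.g. (the backend value here is illustrative):
```python
from pyflink.common import Configuration

# 'state.backend' is deprecated in favor of 'state.backend.type'.
conf = Configuration()
conf.set_string("state.backend.type", "rocksdb")
```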
Mon, Mar 30
FailureRequest to uri https://meta.wikimedia.org/w/api.php?format=json&action=streamconfigs&all_settings=true failed. BasicHttpResult(failure) encountered local exception: Connect to mw-api-int-ro.discovery.wmnet:4446 [mw-api-int-ro.discovery.wmnet/10.2.2.81] failed: Connection timed out (Connection timed out)
I think it works!
Thu, Mar 26
the operator doesn't know about ZK; it only checks the K8s-native HA
Hm, perhaps! I'm not sure though. IIUC (I might not!) HA is handled by Flink JM, not by the operator? I think you can use k8s ConfigMap HA if you were not using the k8s operator, and I'd expect you to be able to use ZK HA even if you are using the operator.
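Sketching my understanding as config (worth verifying): HA is selected on the Flink side, independent of the operator.
```python
from pyflink.common import Configuration

conf = Configuration()
conf.set_string("high-availability.type", "kubernetes")  # k8s ConfigMap-based HA
# ...or, if running with ZooKeeper instead:
# conf.set_string("high-availability.type", "zookeeper")
# conf.set_string("high-availability.zookeeper.quorum", "zk1:2181")  # illustrative
```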
@JAllemandou I think this is semi-relevant to the discussion we were just having about MWH incremental and namespace_is_content_historical. We probably want this info in the event
Wed, Mar 25
cc @GGoncalves-WMF (we are grooming and putting this in backlog for now).
Declining; please use the mediawiki_content_history (AKA dumps 2) files. Please reopen if declining was incorrect.