Tue, Jun 6
Project documentation is available at
https://doc.wikimedia.org/data-engineering/eventutilities-python/eventutilities_python/
@tchin an alternative path for coverage reporting could be integrating with https://gitlab.wikimedia.org/repos/releng/docpub/-/blob/main/README.md and linking the published coverage report from GitLab (we'd lose the metric-reporting badge, but so be it).
A couple of questions about integrating this workflow into our pipeline. See the attached patch and this pipeline.
- As a test, I wanted to trigger .docpub:publish-docs (derived) manually from a working (unprotected) branch, but I'm not authorized to do so. Will this job execute automatically on a protected branch?
- We do releases by manually triggering publish_gitlab_release on main. If I wanted to trigger .docpub:publish-docs automatically on release, can I just declare a needs: publish_gitlab_release condition?
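For the second question, a minimal sketch of what the release-triggered publish could look like. This is untested and unverified against docpub; the two job names come from the questions above, everything else is an assumption:

```yaml
# Hypothetical .gitlab-ci.yml fragment (not validated against docpub):
publish-docs:
  extends: .docpub:publish-docs
  rules:
    # Assumption: restrict to the default (protected) branch.
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  needs:
    # Assumption: run only after the manual release job has succeeded.
    - publish_gitlab_release
```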
Wed, May 31
Think we can publish pydocs as part of this task!?
Tue, May 30
I marked the Google Doc as read-only and moved the draft to https://wikitech.wikimedia.org/wiki/MediaWiki_Event_Enrichment/SLO/Mediawiki_Page_Content_Change_Enrichment.
Wed, May 24
mw_page_content_change_enrich__dse-k8s-eqiad is not a valid S3 bucket name, because the protocol does not allow _ in names: https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucketnamingrules.html.
As a test, I replaced _ with - and checkpoints are now stored.
This log entry is also relevant to the error above:
com.amazonaws.services.s3.model.AmazonS3Exception: The specified bucket is not valid. (Service: Amazon S3; Status Code: 400; Error Code: InvalidBucketName; Request ID: tx07b821592df14024bd0fb-00646dbe61; S3 Extended Request ID: null; Proxy: null), S3 Extended Request ID: null
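The underscore-to-hyphen fix above can be captured in a small helper. This is an illustrative sketch, not part of eventutilities-python; the regex covers the core S3 character and length rules (lowercase alphanumerics, dots, hyphens, 3-63 chars, alphanumeric at both ends) but not every AWS restriction (e.g. names that look like IP addresses):

```python
import re

# Core S3 bucket-name rules: 3-63 chars, lowercase letters/digits/dots/hyphens,
# must start and end with a letter or digit.
BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")

def is_valid_bucket_name(name: str) -> bool:
    """True if the name passes the core S3 naming rules."""
    return bool(BUCKET_RE.match(name))

def sanitize_bucket_name(name: str) -> str:
    """Lowercase the name and replace underscores (disallowed) with hyphens."""
    return name.lower().replace("_", "-")

name = "mw_page_content_change_enrich__dse-k8s-eqiad"
assert not is_valid_bucket_name(name)
assert is_valid_bucket_name(sanitize_bucket_name(name))
```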
There's a draft at https://docs.google.com/document/d/1U2bYVqmEsn7ryP0dtFUr-S5xPqF9_plLIFdzk883HBc/edit#. After a first round of feedback, I'll move it to wikitech.
Hopping on this thread to confirm that we are now able to store snapshots / savepoints in swift using the provided containers.
Mon, May 22
As for other docs to move: probably all links at
https://www.mediawiki.org/wiki/Platform_Engineering_Team/Event_Platform_Value_Stream (except maybe for the RFCs on Use Cases?).
There is some deployment documentation (dse-k8s) at https://www.mediawiki.org/wiki/Platform_Engineering_Team/Event_Platform_Value_Stream/Pyflink_Enrichment_Service_Deployment.
Fri, May 19
@Ottomata should we adopt this naming convention for DSE as well?
Tue, May 16
Checkpointing to Swift (S3 protocol) has been enabled.
Flink docs recommend setting a ZooKeeper root path: https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#high-availability-zookeeper-path-root. I need some help to brainstorm / bikeshed a naming convention.
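For concreteness, a sketch of the relevant flink-conf.yaml keys. The option names are Flink's documented HA settings; the quorum hosts and path values are placeholders, not a proposed convention:

```yaml
# Illustrative flink-conf.yaml fragment (hosts and paths are placeholders):
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
# Root under which Flink stores all of its HA data in ZooKeeper:
high-availability.zookeeper.path.root: /flink
# Per-deployment namespace under the root; this is where a convention is needed:
high-availability.cluster-id: /mw-page-content-change-enrich
```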
May 9 2023
Discussed with @Ottomata, and we opted for option 2: We log and raise a more functional error message.
Apr 19 2023
We have a seemingly stable job running on YARN. Using an ad hoc process function that combines count and time triggers on a keyed data stream seems to have reduced memory pressure on the Beam workers.
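The count + time trigger logic can be sketched outside Flink. This is a plain-Python illustration of the idea only, not the actual process function (which in the real job would be a keyed process function with state and registered timers); all names are illustrative:

```python
import time

class Buffer:
    """Flush a batch when either max_count elements or max_age seconds accumulate."""

    def __init__(self, max_count=100, max_age=5.0, clock=time.monotonic):
        self.max_count = max_count
        self.max_age = max_age
        self.clock = clock  # injectable for testing
        self.items = []
        self.first_seen = None

    def add(self, element):
        """Buffer an element; return the flushed batch if a trigger fired, else None."""
        if not self.items:
            self.first_seen = self.clock()
        self.items.append(element)
        count_trigger = len(self.items) >= self.max_count
        time_trigger = self.clock() - self.first_seen >= self.max_age
        if count_trigger or time_trigger:
            return self.flush()
        return None

    def flush(self):
        batch, self.items, self.first_seen = self.items, [], None
        return batch
```

Batching like this trades per-element latency for fewer, larger downstream calls, which is what appears to relieve memory pressure on the workers.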
Apr 4 2023
[...]
@MatthewVernon brought up persistent volume claim. I have no experience with it, but it sounds like a good fit; I'd like to explore that a bit.
I will investigate.
Apr 3 2023
Next step will be measuring latency/throughput on YARN and possibly tuning settings (batch size, thread pool size). If this works, we can look at integrating it into eventutilities-python.
Caveat: on DSE we run the application with the same memory settings, and noticed a slightly higher memory footprint (in the order of 20%). Might be due to different Java versions (java8 on YARN, java11 on k8s). We have tasks for load testing and tuning planned for upcoming sprints.
Mar 27 2023
I have been able to validate that the idea can work. I was able to init a thread pool local to an operator and execute parallel requests on a batch of elements (PoC implementation here). No pickling issues, because what is put on the wire is the function output (instead of the closure). I think we could fall back to using a ProcessWindowFunction for compute + managing side output, without too much impact on our API. I have a rough implementation outside of eventutilities-python, that I got to work locally (docker/minikube). Next step will be measuring latency/throughput on YARN and possibly tuning settings (batch size, thread pool size). If this works, we can look at integrating it into eventutilities-python.
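The operator-local thread pool idea can be sketched as follows. This is a minimal illustration, not the PoC itself; BatchEnricher and fetch are hypothetical names, with fetch standing in for the blocking enrichment request:

```python
from concurrent.futures import ThreadPoolExecutor

class BatchEnricher:
    """Run a blocking per-element call in parallel over a batch.

    The pool is created lazily inside the worker, so only the function
    outputs ever cross the wire; the (unpicklable) executor is never
    part of the serialized closure.
    """

    def __init__(self, fetch, pool_size=8):
        self.fetch = fetch
        self.pool_size = pool_size
        self.pool = None  # created on first call, per worker

    def __call__(self, batch):
        if self.pool is None:
            self.pool = ThreadPoolExecutor(max_workers=self.pool_size)
        # map preserves input order, so outputs line up with the batch.
        return list(self.pool.map(self.fetch, batch))
```

Keeping the executor out of the pickled state is what avoids the serialization issues mentioned above.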
Mar 24 2023
After a few long-lasting runs on YARN and k8s, all I can say is that I see correlation (not necessarily causation!) between OOMs and:
- calls to dict_to_row conversion
- python fn-bundle size (Beam micro-batching)
- message throughput and Kafka topic lag
To some degree I think that blocking I/O is contributing to memory pressure (I say this because of the effect of tuning micro-batching via fn-bundle size).
Mar 23 2023
I created https://phabricator.wikimedia.org/T332948 as follow up work for this spike.
Mar 22 2023
I experimented with tunables recommended on Flink's Slack, and was able to run the application for over 6 hours. Memory kept growing until the container OOMed.
Some additional info that I was able to observe:
- memory consumption grows over time; the number of objects tracked by Python's GC seems to stabilise after a few minutes of runtime, and does not grow linearly with memory.
- How quickly memory grows seems to be proportional to python.fn-execution.bundle.size
I enabled the Python profiler (python.profile.enabled) and was able to identify some hot paths that correlate with memory allocation (SerDes and dict <-> Row conversion).
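The knobs mentioned above are PyFlink configuration options; for reference, a flink-conf.yaml fragment showing them together. The values are examples for illustration, not the ones used in these runs:

```yaml
# Illustrative flink-conf.yaml fragment (values are examples, not tuned):
# Max number of elements per Python function bundle; smaller bundles mean
# less data buffered between the JVM and the Python workers.
python.fn-execution.bundle.size: 1000
# Max time (ms) before a partially filled bundle is flushed anyway.
python.fn-execution.bundle.time: 1000
# Enable cProfile-based profiling of Python UDFs.
python.profile.enabled: "true"
```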
Mar 16 2023
Hi Eric,
Mar 9 2023
So: since we decided to use a plain datastream.map when error_destination = False (instead of our datastream.process with an error handler), the job was dying when it encountered these errors. This makes sense, I think: if stream manager error handling is disabled, then it is up to the user to catch exceptions in their own map function, right?
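The behavioural difference can be sketched in plain Python. This is an illustration of the idea only, with hypothetical names (safe_map, error_sink); it is not the stream manager's actual implementation:

```python
def safe_map(fn, error_sink):
    """Wrap a user map function so failures land in a side output
    instead of propagating and killing the job (the process-with-error-handler
    behaviour). Without this wrapper, an exception in fn is fatal (the plain
    map behaviour)."""
    def wrapped(event):
        try:
            return fn(event)
        except Exception as exc:
            error_sink.append((event, exc))
            return None
    return wrapped

# Usage: divide-by-zero is routed to the sink instead of failing the job.
errors = []
fn = safe_map(lambda e: 10 // e, errors)
assert fn(5) == 2
assert fn(0) is None and len(errors) == 1
```

With error_destination = False there is no such wrapper, so the try/except has to live in the user's own map function.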