User Details
- User Since: Nov 2 2020, 1:15 PM
- Availability: Available
- IRC Nick: gmodena
- LDAP User: Gmodena
- MediaWiki User: GModena (WMF)
Fri, May 24
Will adopting SDKMan be a prescriptive change or just a default option? Would this change affect only a user's development environment, or also impact CI?
Wed, May 15
Tue, May 14
FWIW, given this outcome, dynamic ESC as implemented is fine with me :)
Mon, May 13
This comment is the result of some time spent collecting info from various stakeholders, and reviewing documentation and decision records.
It was initially shared as a Google doc (now set to read-only to preserve the comment history).
Apr 30 2024
@Fabfur f/up from our chat earlier; these would be the pending config bits that we'll need to finalize when moving to prod topics:
Apr 29 2024
We can easily express a stream config as jsonschema and expose it via datasets-config-service.
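As a rough sketch (the field names below are illustrative assumptions, not the actual datasets-config-service contract):

```python
# Illustrative only: field names are assumptions, not the actual
# datasets-config-service schema.
import jsonschema

STREAM_CONFIG_SCHEMA = {
    "type": "object",
    "properties": {
        "stream": {"type": "string"},
        "schema_title": {"type": "string"},
        "topics": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["stream", "schema_title"],
}

# Validate a candidate stream config entry before exposing it.
jsonschema.validate(
    instance={
        "stream": "webrequest.frontend",
        "schema_title": "webrequest",
        "topics": ["webrequest_text", "webrequest_upload"],
    },
    schema=STREAM_CONFIG_SCHEMA,
)
```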
I am opposed to a monorepo for all configurations and suggest focusing current efforts on Airflow and Airflow-produced datasets. For integration with Metrics, Platform, and MediaWiki, I lean towards a service mesh approach. The service developed by @tchin could serve as a template.
I like the overall idea, but I'd prefer to proceed DC-by-DC, switching topics and shutting down VarnishKafka only once we're sure the data is correct. I'm afraid that having two pieces of software producing (and sending, and storing) the "same" data on 96 hosts (and soon also MAGRU) could be a little expensive for us in terms of bandwidth...
Makes sense. This would require some work on our end to generate webrequest data from two "raw" sources at once, but as long as we can filter on dc / hostnames we should manage. Let me take a better look at how this ETL is set up.
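Roughly something like this, as a sketch (paths, column names, and the switched-host predicate are all assumptions):

```python
# Sketch: read webrequest data from two "raw" sources at once, filtering on
# hostname so each host is consumed from exactly one producer during the
# migration. Paths and the predicate below are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("webrequest-dual-source").getOrCreate()

varnishkafka = spark.read.parquet("/wmf/data/raw/webrequest")        # hypothetical path
benthos = spark.read.parquet("/wmf/data/raw/webrequest_frontend")    # hypothetical path

# Hosts (here, one DC) already switched over to the Benthos producer.
switched = F.col("hostname").rlike(r"\.esams\.")  # illustrative predicate

combined = (
    varnishkafka.filter(~switched)
    .unionByName(benthos.filter(switched))
)
```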
- there are a couple of CRs pending (linked to this phab), and I'd like to have a second pass on the event schema naming conventions (cc / @Fabfur). We might want to drop webrequest_source, since we don't currently use it in the ETL (it's inferred from the HDFS path, not the schema).
No problem here; for us it's just a matter of removing a line from the Benthos configuration. Let me know if I can proceed!
Apr 25 2024
The haproxy_id field has been added to messages.
Apr 19 2024
Apr 18 2024
About the sequence issue: that's the most plausible hypothesis. We could append (or prepend) other pieces of information to the sequence number (like the haproxy process id) to avoid duplicates, but in that case we couldn't guarantee a monotonic increase (or any increase at all). I suggest keeping the current approach for the moment and eventually reworking it later.
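To illustrate the trade-off in miniature (plain Python, not the actual Benthos mapping):

```python
# Prepending the haproxy process id makes the (pid, sequence) pair unique
# across processes, but the composite value is no longer guaranteed to
# increase monotonically across the host: each process counts independently.
def composite_sequence(haproxy_pid: int, sequence: int) -> str:
    return f"{haproxy_pid}-{sequence}"

assert composite_sequence(101, 7) != composite_sequence(102, 7)  # no duplicates
# But comparing "101-8" with "102-3" is meaningless: neither numeric nor
# lexicographic order reflects event order.
```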
Apr 17 2024
Apr 5 2024
Apr 4 2024
Mar 27 2024
Mar 26 2024
Mar 25 2024
I need to add a wrapper to the Alert generation SerDe
Mar 22 2024
These are the fields sent by Benthos that aren't present in the current webrequest stream:
Mar 21 2024
Mar 19 2024
For context: this is the approach we follow with other producers, e.g. Java.
Mar 8 2024
Mar 7 2024
We can integrate our DQ framework with Python by piggybacking on pyspark's py4j gateway. Below is a rudimentary example that produces metrics in the data_quality_metrics table format:
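(A sketch only: the JVM class and method names below are placeholders, not the actual framework API.)

```python
# pyspark already runs a py4j gateway; spark._jvm exposes JVM classes to
# Python, so the JVM-based DQ framework can be driven without a new wrapper.
# DataQualityMetricsWriter and writeMetric are hypothetical names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

jvm = spark._jvm
writer = jvm.org.wikimedia.analytics.DataQualityMetricsWriter(spark._jsparkSession)

# Emit one metric row in the data_quality_metrics table format.
writer.writeMetric("webrequest", "row_count", 12345)
```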
Mar 6 2024
Mar 5 2024
Mar 4 2024
@Ottomata @Jdforrester-WMF there's a caveat wrt using collectDefaultMetrics: the method call does not allow setting custom labels. If I understand the docs correctly, we can still define them at the registry level, but that would clash with the current implementation. I'm not super keen on refactoring the current behaviour given the state of the codebase, so I'd lean towards avoiding custom labels if possible. Are they used at all?
I spent some time learning this codebase and am touching base to validate the direction. If this makes sense, I'll open a PR. My proposal would be to add a new collect_default option to the prometheus metrics option block.
Feb 28 2024
FYI: this is a list of metrics reported with a local run:
I was on PTO last week and am trying to piece together what happened and how the UBN was mitigated.
Update: We've got access back, and v4.0.0 is finally released. Still happy to break whatever we need to, and help people migrate.
Feb 27 2024
@Jdforrester-WMF FWIW I saw you started deprecation work in https://github.com/wikimedia/service-runner/pull/249/files.
I am taking a stab at this task, because we need GC and memory info to help track T357005: eventstreams regularly uses more than 95% of its memory limit.
Feb 26 2024
Feb 15 2024
Spoke a bit about this with @xcollazo.
Open question: do we want webrequest.frontend (or whatever we settle on) to be a versioned stream? https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Configuration#Stream_versioning
The currently suggested one is webrequest.frontend. @gmodena, the idea there is to group all webrequest topics into the same stream by setting topics manually in stream config; Gobblin will then ingest the topics configured there.
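For example, the stream config entry could look roughly like this (setting names are assumptions):

```python
# Illustrative stream config entry; setting names are assumptions. The point
# is that topics are listed explicitly, so Gobblin ingests exactly these
# topics for the webrequest.frontend stream.
stream_config = {
    "webrequest.frontend": {
        "schema_title": "webrequest",
        "topics": ["webrequest_text", "webrequest_upload"],
    }
}
```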
Data has been deleted from HDFS. It will be quarantined in hdfs://analytics-hadoop/user/hdfs/.Trash/Current/wmf/data/event_sanitized for a period longer than the one-week grace period required by this task.
Feb 14 2024
May I proceed with deleting the tables from the Hive metastore for the impacted datasets?
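For reference, the cleanup would be roughly the following (the table name is a placeholder):

```python
# Hypothetical cleanup: drop the metastore entries now that the HDFS data is
# quarantined in .Trash. "impacted_table" is a placeholder, not a real name.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
for table in ["event_sanitized.impacted_table"]:
    spark.sql(f"DROP TABLE IF EXISTS {table}")
```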
Feb 13 2024
@Fabfur and I would like to start some integration tests in the short term. I moved the webrequest schema from GA to development in the primary repo. This follows the same process we adopted with page_change, and should allow for faster iteration without messing around with schema versions.
Feb 12 2024
TBD on final stream name in T314956: [Event Platform] Declare webrequest as an Event Platform stream, but the currently suggested one is webrequest.frontend
Feb 9 2024
Both approaches are feasible (even at the same time, if we accept increasing the payload a little)...
@Fabfur here is an example payload with the added meta, as we'd expect to receive it according to the WIP webrequest event schema.
{ "meta": { dt: "2023-11-23T16:04:17Z", # value set by Benthos stream: "webrequest_text", # value set by Benthos domain: "en.wikipedia.org", # can we get this from HAProxy? request_id: request-uuid # can we get this from HAProxy? id: "event-uuid" # value set by Benthos? }, "accept": "application/json; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/Summary/1.2.0\"", "accept_language": "en", "backend": "ATS/9.1.4", "cache_status": "hit-front", "content_type": "application/json; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/Summary/1.5.0\"", "dt": "2023-11-23T16:04:17Z", # value recorded by HAProxy "hostname": "cp3067.esams.wmnet", "http_method": "GET", "http_status": "200", "ip": "<REDACTED>", "range": "-", "referer": "https://en.wikipedia.org/w/index.php?title=Category:Films_based_on_non-fiction_books&pagefrom=Power+Play+%281978+film%29%0APower+Play+%281978+film%29", "response_size": 987, "sequence": 10558502962, "time_firstbyte": 0.000201, "tls": "vers=TLSv1.3;keyx=UNKNOWN;auth=ECDSA;ciph=AES-256-GCM-SHA384;prot=h2;sess=new", "uri_host": "en.wikipedia.org", "uri_path": "/api/rest_v1/page/summary/Secretariat_(film)", "uri_query": "", "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0", "x_analytics": "WMF-Last-Access=23-Nov-2023;WMF-Last-Access-Global=23-Nov-2023;include_pv=0;https=1;client_port=33126", "x_cache": "cp3067 miss, cp3067 hit/5" }
Feb 8 2024
@Fabfur nice!
tl;dr: our approach to address this spike is currently documented at https://wikitech.wikimedia.org/wiki/Data_Engineering/Data_Quality.
The db and tables have been created:
spark-sql (default)> use wmf_data_ops; Response code Time taken: 2.698 seconds spark-sql (default)> show tables; database tableName isTemporary wmf_data_ops data_quality_alerts false wmf_data_ops data_quality_metrics false Time taken: 0.569 seconds, Fetched 2 row(s)