User Details
- User Since
- Jun 21 2021, 2:34 PM (199 w, 6 d)
- Availability
- Available
- LDAP User
- TChin
- MediaWiki User
- TChin (WMF)
Tue, Apr 15
Throwing out a guess: since where the lineage emitter runs depends on the driver, it needs to somehow be aware of whether or not it's running on k8s and choose the correct Kafka bootstrap URL accordingly. I wonder if there's an easy way to figure this out? It will probably require some refactoring.
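Something like this is the kind of check I have in mind; it's just a sketch, and the bootstrap URLs and the Spark conf key at the end are placeholders rather than the plugin's actual config names:

```python
import os

# Placeholder bootstrap URLs; the real values would come from deployment config.
K8S_KAFKA_BOOTSTRAP = "kafka-jumbo.discovery.wmnet:9093"        # hypothetical
BARE_METAL_KAFKA_BOOTSTRAP = "kafka-jumbo1009.eqiad.wmnet:9092"  # hypothetical


def kafka_bootstrap_servers() -> str:
    """Pick the Kafka bootstrap URL based on where the Spark driver runs.

    Kubernetes injects KUBERNETES_SERVICE_HOST into every pod, so its presence
    is a cheap way to tell whether the driver (and therefore the lineage
    listener) is on k8s or on a bare-metal launcher host.
    """
    if os.environ.get("KUBERNETES_SERVICE_HOST"):
        return K8S_KAFKA_BOOTSTRAP
    return BARE_METAL_KAFKA_BOOTSTRAP


# Wherever we build the SparkSession that carries the listener config
# (the conf key below is an assumption, to be checked against the plugin docs):
# builder.config("spark.datahub.kafka.bootstrap", kafka_bootstrap_servers())
```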
Tue, Apr 8
Ok so that wasn't it, now we get this error:
25/04/08 19:42:35 ERROR AsyncEventQueue: Listener DatahubSparkListener threw an exception
datahub.shaded.org.apache.kafka.common.KafkaException: Failed to construct kafka producer
    at datahub.shaded.org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:430)
    at datahub.shaded.org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:298)
    at datahub.client.kafka.KafkaEmitter.<init>(KafkaEmitter.java:55)
    at datahub.spark.DatahubEventEmitter.getEmitter(DatahubEventEmitter.java:85)
    at datahub.spark.DatahubEventEmitter.emitMcps(DatahubEventEmitter.java:401)
    at datahub.spark.DatahubEventEmitter.emitCoalesced(DatahubEventEmitter.java:190)
    at datahub.spark.DatahubSparkListener.onApplicationEnd(DatahubSparkListener.java:279)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:57)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
    at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
    at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
    at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
    at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1381)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
Caused by: datahub.shaded.org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers
    at datahub.shaded.org.apache.kafka.clients.ClientUtils.parseAndValidateAddresses(ClientUtils.java:84)
    at datahub.shaded.org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:408)
    ... 20 more
Fri, Apr 4
Seems to be working: the webrequest_analyzer DAG runs normally and I can see the columns being filled in the table:
Altered table:
ALTER TABLE wmf_data_ops.data_quality_alerts ADD COLUMNS (
  dataset_date BIGINT COMMENT 'AWS Deequ resultKey: key insertion time.',
  tags MAP<STRING,STRING> COMMENT 'AWS Deequ resultKey: key tags.'
);
Thu, Apr 3
I think it actually is broken on main as well and has just been silently failing. I opened a patch to add the port back in.
Wed, Apr 2
@brouberol would you happen to have some insight into this issue?
Tue, Apr 1
Actually, never mind, the error we had last time was:
Caused by: datahub.shaded.org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers
    at datahub.shaded.org.apache.kafka.clients.ClientUtils.parseAndValidateAddresses(ClientUtils.java:84)
    at datahub.shaded.org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:408)
    ... 20 more
Currently encountering this error with the search instance; I think it's the same issue we encountered when migrating the analytics instance to k8s:
Thu, Mar 27
Or a real quick fix is to not pass the new logger from eventstreams into KafkaSSE; that way it creates its own bunyan logger and the logs won't show up as errors in Logstash anymore.
Ah, the issue is in KafkaSSE, which still assumes bunyan. I'll have to standardize the logging inside that lib.
Mar 14 2025
Here's an issue I currently see: the data_quality_ops.data_quality_alerts table doesn't have a column for metadata such as tags, the way the metrics table does. This doesn't affect the actual alerting part, but it would affect any future analyses and dashboarding someone might want to do on the verification checks. For instance, if we want to alert on T388439, there is currently no way to differentiate records in the table that check monthly reconciles from those that check daily ones. Even now, there's an open question whether the source_table column in the alerts table should refer to data_quality_ops.data_quality_metrics or to the underlying table the metrics were computed against.
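For reference, the tags would come from PyDeequ's ResultKey when results are written through a metrics repository. A rough sketch (assuming spark and df are already in scope; the repository path, tag values, and column name are made up) of how daily vs monthly reconcile runs could be told apart:

```python
from pydeequ.checks import Check, CheckLevel
from pydeequ.repository import FileSystemMetricsRepository, ResultKey
from pydeequ.verification import VerificationSuite

# Made-up repository location, purely for illustration.
repository = FileSystemMetricsRepository(spark, "/tmp/dq_metrics.json")

# The tags dict is what would end up in a tags MAP<STRING,STRING> column.
result_key = ResultKey(
    spark,
    ResultKey.current_milli_time(),                      # dataset_date-style key
    {"granularity": "daily", "check_type": "reconcile"}  # vs {"granularity": "monthly", ...}
)

result = (
    VerificationSuite(spark)
    .onData(df)
    .addCheck(
        Check(spark, CheckLevel.Error, "reconcile check")
        .isComplete("rev_id")  # made-up column name
    )
    .useRepository(repository)
    .saveOrAppendResult(result_key)
    .run()
)
```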
Mar 12 2025
Mar 5 2025
Reconstructing the path using only the req object does work, but only if the params belong to the local router. So basically it doesn't work, which means I need to go the other direction and make the middleware router-aware. This may require a major version bump.
Mar 4 2025
Yeah, I'm going with option 3 as well. The main issue is that the middleware is not aware of the router or the path it's mounted on, so I need to figure out a way either to make it aware of the router or to reconstruct the path using only the req object. Some more investigation is needed.
Could it have something to do with T382173?
@DSantamaria Sorry, the ticket took longer to write up since the investigation behind it took a while. Here it is: T387824
Feb 26 2025
Updated the dashboard.
NodeJS
Here are the current metrics I'm seeing for function-orchestrator when curling a pod:
# HELP process_cpu_user_seconds_total Total user CPU time spent in seconds.
# TYPE process_cpu_user_seconds_total counter
process_cpu_user_seconds_total 68.77650600000004
Feb 24 2025
There doesn't seem to be a service label on these GC metrics so I used kubernetes_namespace instead.
Feb 20 2025
This was a wild combination of issues which resulted in me submitting an issue upstream.
So, a quick explanation working backwards:
- The library I'm using to deep merge when reformatting log messages doesn't merge Symbols (seems like a bug)
- Winston uses symbols for internal state
- If Symbol('level') is not in the logged message object, apparently Winston doesn't output anything at all
- Because the symbol was only erased when the logged message was reformatted, logging still looked like it worked
- The more we fixed T383448, the more logging looked like it didn't work
- Because the test was a unit test on the formatter and not an integration test on the entirety of Winston, it didn't catch it
Feb 15 2025
Looks like it got locked again. I wonder if we should just split the metrics table into one table per pipeline to avoid this whole thing entirely.
Feb 14 2025
Garbage collection metrics are enabled by default, but it seems the default metrics now expose them as a histogram rather than a gauge, which is what the dashboard assumes.
Feb 13 2025
Stream Connection Duration
- min(express_router_request_duration_seconds{service="$service", path=~"v2/stream/.*"}) + min(rate(express_router_request_duration_seconds_sum{service="$service", path=~"stream/.*"}[5m])/rate(express_router_request_duration_seconds_count{service="$service", path=~"stream/.*"}[5m])>0)
Do we know what libraries are included in the node20 image vs the node18 one? I recently had to install libssl-dev to get a service working on node20.
I don't know why, but it seems like the Kafka sink doesn't recognize schema_registry_config, and it's currently failing:
Exception: 1 validation error for KafkaSinkConfig schema_registry_config extra fields not permitted (type=value_error.extra)
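For what it's worth, that's pydantic's unknown-field failure: judging by the message, the sink's config model forbids extra keys, so schema_registry_config simply isn't a declared field at the level (or in the version) we're passing it. A minimal repro of the error shape with a stand-in model, not the real DataHub class:

```python
from pydantic import BaseModel, Extra


class KafkaSinkConfigSketch(BaseModel):
    """Stand-in for the sink's config model, only to show the error shape."""
    bootstrap: str

    class Config:
        extra = Extra.forbid  # unknown keys raise "extra fields not permitted"


# Same failure mode as the recipe: a key the model doesn't declare.
KafkaSinkConfigSketch(bootstrap="localhost:9092", schema_registry_config={})
# pydantic.error_wrappers.ValidationError: 1 validation error for KafkaSinkConfigSketch
# schema_registry_config
#   extra fields not permitted (type=value_error.extra)
```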
Feb 12 2025
There's another DAG that fails, but I turned it off since it runs hourly and is super noisy: webrequest_analyzer
Feb 1 2025
Summary
- Created a Python script and Airflow DAG for computing metrics
- Dogfooded refinery-python and therefore PyDeequ
  - refinery-python doesn't work with the latest version of PyDeequ. We're currently pinning it, but it should be upgraded.
- Discovered Deequ has some major quirks, or rather we're not using it for its intended purpose (see the sketch after this list)
  - Can't directly insert metrics. Metrics are always computed and therefore associated with an Analyzer.
  - Can't implement custom Analyzers in PyDeequ (GitHub Issue)
  - Can't compute metrics across tables. A workaround had to be used.
  - Doesn't output metrics on empty data (except for Size), e.g. asking for the Completeness of a column on a DataFrame of 0 records results in no metrics.
- Created a Superset dashboard with the computed metrics; it turns out Superset also has some quirks
  - Superset expects a table to compute metrics over, not a table of already computed metrics. Some workarounds had to be used.
  - A dashboard with 500+ wikis is not that helpful; perhaps split it into multiple smaller dashboards
  - Superset does not handle time-series data that well. No metrics on a specific day results in no data points, and resampling is only available for some graph types.
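To make the Analyzer coupling concrete, this is roughly the only way to get a metric out of PyDeequ (a sketch assuming spark and df are in scope; the column name is made up):

```python
from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Completeness, Size

# Every metric is the output of an Analyzer; there is no API for inserting a
# precomputed value into the results directly.
result = (
    AnalysisRunner(spark)
    .onData(df)
    .addAnalyzer(Size())
    .addAnalyzer(Completeness("actor_user"))  # made-up column name
    .run()
)

metrics_df = AnalyzerContext.successMetricsAsDataFrame(spark, result)
metrics_df.show(truncate=False)
# On a 0-record DataFrame only the Size row (0.0) comes back;
# Completeness yields no metric at all, per the quirk above.
```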
Jan 24 2025
Nice catch! Putting up a patch
Jan 23 2025
If you have any questions, I could try asking them on the Sudachi Slack?
Jan 14 2025
@brouberol I guess I need to set egress? What's the CIDR of Ceph?
Deployed v0.11.0 on beta and confirmed it's fixed
Jan 9 2025
Tried deploying to staging with @gmodena; got this error, but it doesn't show up in Logstash:
java.util.concurrent.CompletionException: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to rgw.eqiad.dpe.anycast.wmnet:443 [rgw.eqiad.dpe.anycast.wmnet/10.3.0.8, rgw.eqiad.dpe.anycast.wmnet/2a02:ec80:ff00:101:0:0:0:8] failed: connect timed out
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Deployed Eventstreams v0.10.0 on beta and it throws this error when listening to a stream:
{"message":"No topics available for consumption. This likely means that the configured allowedTopics do not currently exist.","origin":"KafkaSSE","name":"ConfigurationError","allowedTopics":[null],"statusCode":500}
Jan 6 2025
Looking at it, it seems like the easiest thing to do would be to use Deequ's AnalysisResultSerde and add a new column to our metrics table to store the serialized result. Then we can implement our own metrics repository, or, if we upgrade to Deequ 2.0.7, there's a SparkTableMetricsRepository that we could possibly extend.
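A rough sketch of the "serialize into a column" half of this from the Python side; as far as I can tell PyDeequ doesn't expose AnalysisResultSerde directly, so this just flattens each metric row to JSON with Spark as an approximation (the column name is made up, and spark/result are assumed to be in scope):

```python
from pydeequ.analyzers import AnalyzerContext
from pyspark.sql import functions as F

# `result` is the AnalyzerContext returned by an AnalysisRunner run.
metrics_df = AnalyzerContext.successMetricsAsDataFrame(spark, result)

# Approximation of the AnalysisResultSerde idea: keep the normal columns and
# add one column holding the whole record serialized as JSON.
with_serialized = metrics_df.withColumn(
    "serialized_result",  # made-up column name
    F.to_json(F.struct(*[F.col(c) for c in metrics_df.columns])),
)
```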