User Details
- User Since
- Feb 11 2015, 6:02 PM (582 w, 2 d)
- Availability
- Available
- IRC Nick
- joal
- LDAP User
- Unknown
- MediaWiki User
- JAllemandou (WMF) [ Global Accounts ]
Yesterday
I've written a plan for Incremental-Mediawiki-History here: https://docs.google.com/document/d/1QZNCZhsBCxEKwogI8S1GFtELTPa0t9DYUBFoc3jI-oo/edit?tab=t.0
Calling this done.
Removing myself as the task assignee so that someone else can take it while I'm on holiday.
Having discussed this with Traffic, this was related to an SSL handshake problem (regular traffic) that was incorrectly logged by HAProxy in v3.0. The logging is now fixed in v3.2. Calling this done.
Here's my proposed plan for an Incremental-Mediawiki-History: https://docs.google.com/document/d/1QZNCZhsBCxEKwogI8S1GFtELTPa0t9DYUBFoc3jI-oo/edit?tab=t.0
I know the team will discuss this next week in Dublin :)
Thu, Apr 9
Wed, Apr 8
Summarizing here a conversation we had on Slack with @Vgutierrez and @Fabfur :
- In v3.0 we were experiencing unexpected sequence-id increments. This is fixed with v3.2 as of today.
- We were not logging a lot of lines flagged as invalid messages in HAProxyKafka; this is also fixed with v3.2.
Thanks for confirming the invalid-events change @Vgutierrez.
There is still something I don't understand:
- The pattern we see in v3.0 suggests that haproxykafka generates a sequence-id even when it discards a log as invalid, as we see twice the number of invalid sequence-ids per BADREQ.
- With v3.2 we see a big increase in BADREQ that doesn't correlate with the number of invalid sequence-ids we were reporting previously: there are way more.
Tue, Apr 7
An interesting change in behavior from 3.0 to 3.2 that could be related: after the upgrade, haproxykafka's number of invalid messages dropped to 0: https://grafana.wikimedia.org/goto/afi95ztecv20wd?orgId=1
Ping @Ahoelzl on this. There are patches to review that the team doesn't know about.
Plan looks good to me :)
Thu, Apr 2
Wed, Apr 1
That's interesting!
@Ottomata could you have a look at the event side of things? This could mean a bug, right?
I have experienced the same issue again today:
Exception in thread "main" java.io.FileNotFoundException: /tmp/table_maintenance_iceberg_monthly/ivy_spark3/cache/resolved-org.apache.spark-spark-submit-parent-73ae20fa-2b58-4c79-9568-c95b98695cd1-1.0.xml (Permission denied)
I'll ask Ben to do the cleanup.
Tue, Mar 31
I wish to revive this task, maybe not sending data to push-gateway at first, but at least storing metrics in ways that allow the DE team to access them. Inspecting how Spark behaves internally will be key for our migration from Hadoop to k8s.
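To make this concrete, here is a minimal sketch of the first step I have in mind, using Spark's built-in CsvSink instead of the push-gateway; the app name and output directory are hypothetical, not an agreed setup:

from pyspark.sql import SparkSession

# Sketch only: route Spark's internal metrics to CSV files in a directory
# the DE team can read, before any push-gateway integration.
spark = (
    SparkSession.builder
    .appName("metrics-sink-sketch")  # hypothetical app name
    .config("spark.metrics.conf.*.sink.csv.class",
            "org.apache.spark.metrics.sink.CsvSink")
    .config("spark.metrics.conf.*.sink.csv.period", "10")
    .config("spark.metrics.conf.*.sink.csv.unit", "seconds")
    .config("spark.metrics.conf.*.sink.csv.directory",
            "/tmp/spark-metrics")  # hypothetical path
    .getOrCreate()
)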
2 failed instances this morning:
- 2x connection problems to the metawiki API.
A PR has been created and merged for this: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2140
It should have belonged to this task.
I think I found the culprit for this.
In the stack trace, the error happens at this line:
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/hooks/skein.py?ref_type=heads#L195
and if you follow the stack trace, this line shows up:
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/hooks/skein.py?ref_type=heads#L272
Mon, Mar 30
This is a client side timeout, yes? I wonder what our client timeout is...
@JAllemandou We will postpone the upgrade (i.e. the table hot swap). However, the new tables are already created, though they are not in use. Is that a problem for you?
I don't see how just having the new tables created (and even loaded for the sake of it) could be an issue. Thanks for warning us :)
More failures from March 27th to March 30th:
@brouberol : Should we consider this done?
Fri, Mar 27
We have experienced failures in the past few days (March 24, 25, 26, 27).
Here's a summary of the detailed failures:
False errors:
- Airflow log retrieval failure. The underlying task was successful.
Real errors:
- 8x connection problems to the metawiki API:
FailureRequest to uri https://meta.wikimedia.org/w/api.php?format=json&action=streamconfigs&all_settings=true failed. BasicHttpResult(failure) encountered local exception: Connect to mw-api-int-ro.discovery.wmnet:4446 [mw-api-int-ro.discovery.wmnet/10.2.2.81] failed: Connection timed out (Connection timed out)
- Error with file availability in the task pod. Possibly related to git-sync timing.
skein.exceptions.DriverError: Failed to submit application, exception: File file:/opt/airflow/dags/.worktrees/9a4d46e7e7fd1cf2774e8a92eec5724f00e250ab/main/dags/gobblin/config/analytics-common.properties does not exist
- Error with Yarn launching the skein app.
org.apache.hadoop.yarn.exceptions.InvalidApplicationMasterRequestException: Application doesn't exist in cache appattempt_1773779850057_1912_000001
This is awesome work, it will really help build trust in the dataset. Kudos @xcollazo and @APizzata-WMF :)
Thu, Mar 26
This is indeed very relevant @Ottomata . If we could have this info in the event it'd be very useful.
Wed, Mar 25
Current status: the 5 hosts are ~75% full, with almost 2TB used out of 2.75TB each. This represents ~10TB used. Of those 10TB, webrequest_sampled_live accounts for ~4TB (2TB of useful data, replicated twice), and wmf_netflow for 3.4TB (1.7TB of useful data, replicated twice).
For the moment the cluster holds, but we need to be careful if we wish to continue to grow the datasets.
Tue, Mar 24
Thank you folks for considering my idea :)
Mon, Mar 23
What I like about the valid_until field is the possibility of keeping old inactive records, in case we have to re-use them for instance, or to remember the state of the filtering in the past (illustrated below). If you still don't want it, please do as you see fit :)
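To illustrate what I mean, a sketch only: bot_list is a hypothetical table name and valid_from a hypothetical companion column; only valid_until comes from the discussion above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Current filtering state: only active records.
active = spark.sql("SELECT * FROM bot_list WHERE valid_until IS NULL")

# Filtering state as it was on a past date (assumes a valid_from column).
past = spark.sql("""
    SELECT * FROM bot_list
    WHERE valid_from <= '2026-01-01'
      AND (valid_until IS NULL OR valid_until > '2026-01-01')
""")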
On Monday I will delete file 17 from each snapshot and run msck repair table.
The big database import (sqoop) into the Data Lake starts on the first of each month at 05:00.
The sqoop jobs fleet starts at midnight UTC on the first of the month, and usually lasts two and a half days if everything goes well.
Thank you for considering scheduling your operation at a different time :)
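For reference, a sketch of what that schedule looks like in an Airflow DAG; the dag_id and start date are hypothetical, not our actual DAG definition:

import pendulum
from airflow import DAG

# Sketch only: "midnight UTC on the first of each month" as a cron schedule
# (minute hour day-of-month month day-of-week).
with DAG(
    dag_id="sqoop_mediawiki_monthly",  # hypothetical dag_id
    schedule="0 0 1 * *",
    start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    ...  # sqoop tasks would be defined here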
Fri, Mar 20
Hi @JAllemandou , the metrics have been developed in accordance with the new Contributor measurement strategy. You can see the definitions at the following links
Thu, Mar 19
If we went with a Hive table for the bot JA3N-JA4H list, would you prefer it being located in another (non-Iceberg) database, like wmf?
I'd prefer that, but it's not very important.
I'm more unsure about making it Iceberg versus not. I see this table as possibly being updated with some regularity, if we automate finding bad actors for instance, or something similar. Also, @GGoncalves-WMF was mentioning the wish to combine the various actor-related tables into a single one at some point.
I don't know how much we wish to make the list for the backfill "future-proof" versus one-off.
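To make the Iceberg option concrete, a minimal sketch of what the table could look like; database, table, and column names are hypothetical, not an agreed schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sketch only: an Iceberg table holding the bot JA3N/JA4H list, so it can be
# updated in place if we later automate finding bad actors.
spark.sql("""
    CREATE TABLE IF NOT EXISTS some_db.bot_ja3n_ja4h_list (
        ja3n   STRING,
        ja4h   STRING,
        reason STRING
    )
    USING iceberg
""")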
The file I am talking about is only made up of duplicates, therefore deleting it would just remove duplication.
After some time reading and processing the queries used to generate the weekly requested metrics, here are some findings:
Regarding the old data: since we know the duplicates are always in the same file, we could consider dropping these problematic files. What do you think?
Hm, this would mean partially incomplete data. I'd rather have duplicates in my data than incomplete data.
We should nonetheless communicate about this!
I reviewed the mediawiki_history code, and my analysis says that we introduced some duplication :(
Given that the metrics didn't crash, the numbers are not huge, but it's not great anyhow.
I think it's easier to make it happen this way (reducing the mapper weight in the sqoop script) than changing puppet. If everyone is OK with it, let's make it happen (with a comment in the code :) )
Wed, Mar 18
Yet another idea is to use Spark to get this data. But this is quite a change.
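A rough sketch of what that could look like; the JDBC URL, credentials, and bounds are hypothetical, and this is not a drop-in sqoop replacement:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sketch only: read the table with Spark's JDBC source, partitioning on a
# numeric column ourselves instead of relying on sqoop's splitter.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/somewiki")  # hypothetical
    .option("dbtable", "localuser")
    .option("user", "research")  # hypothetical credentials
    .option("password", "***")
    .option("partitionColumn", "lu_local_id")
    .option("lowerBound", "1")  # hypothetical bounds
    .option("upperBound", "100000000")
    .option("numPartitions", "32")
    .load()
)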
What if we split by another column like lu_local_id?
Now that I've made a fool of myself by not being precise enough, let's get back to solutions :)
I can see two ways:
- the one I suggested above
- reducing the mapper weight for that table. If we go from 0.5 to 0.25, the effect will be to divide the number of mappers by two, exactly the same as changing it in puppet (see the sketch after this list).
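Here is the sketch: my understanding of the mapper-weight computation in our sqoop wrapper, so treat the exact formula as an assumption rather than the real refinery code:

# Assumption: weight scales the per-table mapper count down from the fleet
# maximum; the real computation in refinery's sqoop.py may differ.
def num_mappers(max_mappers: int, weight: float) -> int:
    return max(1, int(max_mappers * weight))

# With a fleet maximum of 64 mappers, halving the weight halves the count:
assert num_mappers(64, 0.5) == 32
assert num_mappers(64, 0.25) == 16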
I've read our code and the bug report again, and there is something I don't understand: the bug is supposed to happen when splitting a table on a String-type field, but we split on a Long-type field:
https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L1320
I'd really like for us to investigate more. Let's sync, @APizzata-WMF .
My way of dealing with that would be to change the puppet code to use 32 mappers.
This will involve creating a new variable and updating the template. Not great, but at least we'll have a solution.
And obviously, in addition to the code change, add a comment referencing your previous comment to explain why we do that.
I think 2 (expiration) is a good place to start and seems less likely to introduce issues in the long term (i.e. I'd rather have a few more incidents than realize we've had bad data for 2 years because of an old rule).
@Mayakp.wiki , can you please confirm that the SQL code for the metrics defined above is this one? Thank you :)
I have opinions on this indeed :)
Tue, Mar 17
Fri, Mar 13
I confirm I have data in the datalake for auth_type. However, the numbers for api.wikimedia.org are very low for March 12:
SELECT
  x_analytics_map['auth_type'],
  COUNT(1) AS c
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2026 AND month = 3 AND day = 12
  AND uri_host = 'api.wikimedia.org'
GROUP BY x_analytics_map['auth_type']
ORDER BY c DESC
LIMIT 50
Thu, Mar 12
Mar 11 2026
My assumption was that we would rerun failed airflow tasks when failures happen, generating more canary events than needed.
Mar 10 2026
Mar 9 2026
@KCVelaga_WMF let's talk about how to categorize IPs. I think creating a table with ranges is really sub-optimal and we could do better, even for a temporary solution.
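As a sketch of the kind of "better" I have in mind, using only the Python standard library; the category map is made up:

import ipaddress

# Sketch only: match IPs against CIDR networks directly rather than
# maintaining a table of raw ranges. Example networks are hypothetical.
CATEGORIES = {
    "internal": ipaddress.ip_network("10.0.0.0/8"),
    "cloud-example": ipaddress.ip_network("203.0.113.0/24"),
}

def categorize_ip(ip: str) -> str:
    addr = ipaddress.ip_address(ip)
    for name, net in CATEGORIES.items():
        if addr in net:
            return name
    return "unknown"

assert categorize_ip("10.1.2.3") == "internal"
assert categorize_ip("8.8.8.8") == "unknown"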
I validated the data this morning.
This is done.
Mar 5 2026
I see how the change described above has an impact on SLAs: for an SLA defined as 5h, if we wait one hour longer than before for the source data, we alert sooner than before relative to the source data.
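A hedged worked example of what I mean: say the source data used to land at 01:00 and the SLA fires at 06:00 (5h after the scheduled start). If the data now lands at 02:00 while the SLA still fires at 06:00, we effectively alert only 4h after the data is available instead of 5h.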
I don't see how this would affect sensor timeouts, as our default values are very long (some might be infinite?). It might not be the same in different airflow instances.