User Details
- User Since
- May 5 2020, 11:24 AM (254 w, 5 d)
- Availability
- Available
- IRC Nick
- nemo-yiannis
- LDAP User
- Jgiannelos
- MediaWiki User
- JGiannelos (WMF)
Fri, Mar 21
Upstream patch merged: https://github.com/omniscale/imposm3/pull/300
FWIW, we should also investigate how far the other sequences in our tables are from their max value.
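One way to check the remaining headroom is to read `last_value` and `max_value` from the `pg_sequences` view (available in PostgreSQL 10 and later) and flag anything close to its limit. A minimal sketch, assuming the rows have already been fetched; the helper name, the 80% threshold, and the sample rows are illustrative and are not values taken from our databases:

```python
# Query to run against the database (pg_sequences exists in PostgreSQL 10+).
SEQUENCE_QUERY = """
SELECT schemaname, sequencename, last_value, max_value
FROM pg_sequences
"""


def sequences_near_limit(rows, threshold=0.8):
    """Return (qualified_name, used_fraction) for sequences above the threshold.

    rows: iterable of (schemaname, sequencename, last_value, max_value).
    last_value is NULL (None) for sequences that have never been used.
    """
    flagged = []
    for schema, name, last_value, max_value in rows:
        if last_value is None:
            continue
        used = last_value / max_value
        if used >= threshold:
            flagged.append((f"{schema}.{name}", used))
    # Most exhausted sequences first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical sample rows, for illustration only.
sample_rows = [
    ("public", "wikidata_relation_members_id_seq", 2147483647, 2147483647),
    ("public", "example_small_seq", 1000, 2147483647),
    ("public", "example_unused_seq", None, 2147483647),
]
```

Calling `sequences_near_limit(sample_rows)` flags only the exhausted sequence; the threshold can be lowered to get earlier warnings.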
For future reference (on bookworm upgrade). We might want to consider:
Looks like ALTERing the sequence after ALTERing the table did the trick.
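For the record, the order described above could look like the following. This sketch only builds the SQL strings; the table name `wikidata_relation_members` and the column name `id` are inferred from the sequence name in the error and are assumptions, and `ALTER SEQUENCE ... AS bigint` requires PostgreSQL 10 or later:

```python
def widen_id_statements(table, column="id"):
    """Build the statements that widen a SERIAL column and its sequence to bigint.

    Assumes the default sequence naming used by SERIAL: <table>_<column>_seq.
    """
    sequence = f"{table}_{column}_seq"
    return [
        # 1. Widen the column first.
        f'ALTER TABLE "{table}" ALTER COLUMN "{column}" TYPE bigint;',
        # 2. Then widen the sequence so nextval() can go past 2147483647.
        f'ALTER SEQUENCE "{sequence}" AS bigint;',
    ]


for statement in widen_id_statements("wikidata_relation_members"):
    print(statement)
```

After running the statements, it is worth checking the sequence's `max_value` to confirm that the limit was actually raised.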
Thu, Mar 20
We don't have an explicit id column in the imposm mapping. This means that it defaults to:
https://github.com/omniscale/imposm3/blob/5d32daabd0e75800a20261a486f33b20b948ad5b/database/postgis/spec.go#L57
And in Postgres, SERIAL is integer, not bigint.
Going through the logs, the last error before the aborted transactions is:
nextval: reached maximum value of sequence "wikidata_relation_members_id_seq" (2147483647)
Turns out this is a side effect of the main issue: the node is just flooded with logs and/or ran out of space.
This error correlates with the start of the replication lag:
https://logstash.wikimedia.org/goto/070b5c8c4c7fdcaf42824de690ae09c3
Yeah, I just verified it on staging without caching:
I think the problem here might be the handling of language variants after we switched over srwiki PCS from RESTBase to REST-gateway.
I tried GETing all the scap targets for the healthcheck URL and some failed consistently:
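A check like this can be scripted roughly as follows. This is a sketch under assumptions: the host names and the healthcheck path in the usage note are placeholders, plain HTTP is assumed, and the `fetch` parameter exists only so the loop can be exercised without network access:

```python
import urllib.error
import urllib.request


def check_targets(hosts, path, fetch=None, timeout=5):
    """GET the same path on every host; return {host: status or error string}."""

    def default_fetch(url):
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status

    fetch = fetch or default_fetch
    results = {}
    for host in hosts:
        url = f"http://{host}{path}"
        try:
            results[host] = fetch(url)
        except (urllib.error.URLError, OSError) as error:
            # Covers HTTP error statuses, refused connections and timeouts.
            results[host] = f"error: {error}"
    return results


def failing(results):
    """Hosts that did not answer with HTTP 200."""
    return sorted(host for host, status in results.items() if status != 200)
```

For example, `failing(check_targets(["host1.example", "host2.example"], "/healthcheck"))` would list the targets that failed; running it a few times shows whether the same hosts fail consistently.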
From maps 1006 i saw logs flooded with:
2025-03-20 13:45:21 GMT LOG: incomplete startup packet
Tue, Mar 18
Mon, Mar 17
We should also remove or deprecate the scap config and update the deployment docs.
Thu, Mar 13
So far I've tested:
Wed, Mar 12
I updated the patch to enable all changeprop traffic except for the top 9 by PCS traffic:
en|de|ru|ja|fr|es|zh|it|pt
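To illustrate the effect of the exclusion list, here is a small sketch. It assumes the codes refer to the `<lang>.wikipedia.org` domains and it is not the actual changeprop configuration; the function name is hypothetical:

```python
import re

# Language codes excluded from this rollout step (the list above).
EXCLUDED = "en|de|ru|ja|fr|es|zh|it|pt"
# Assumption: the codes map to <lang>.wikipedia.org domains.
EXCLUDED_WIKIPEDIA = re.compile(rf"^(?:{EXCLUDED})\.wikipedia\.org$")


def changeprop_enabled(domain):
    """True if the domain is outside the excluded top-9 Wikipedias."""
    return EXCLUDED_WIKIPEDIA.match(domain) is None
```

With this sketch, `changeprop_enabled("sr.wikipedia.org")` is true and `changeprop_enabled("en.wikipedia.org")` is false.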
Here is an example query of what the android app is asking from the action api:
https://en.wikipedia.org/w/api.php?action=query&generator=search&gsrsearch=hasrecommendation%3Aimage&gsrnamespace=0&gsrsort=random&prop=growthimagesuggestiondata|revisions|pageimages&pilicense=any&rvprop=ids|timestamp|flags|comment|user|content&rvslots=main&rvsection=0
I manually purged the page summary and it looks like the output of most-read is fixed.
The vandalized title comes from:
Even after purging the local cache in the app this still shows up.
The output of page/mobile-html looks like it's fixed.
Tue, Mar 11
For reference, from a quick look:
We can do that, but that means we should keep track of what's supported and what's not at the PCS level.
I think this round of rollout needs a bit more thinking.
Currently RESTBase maintains an explicit per-project list (wikipedias, wiktionaries, wikiquotes, wikivoyages):
https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/restbase/deploy/+/refs/heads/master/scap/vars.yaml#49
Mon, Mar 10
Fri, Mar 7
As a workaround there is some orchestration in place that reads the resource change topics and invalidates caches outside of change-prop until we have the proper change prop solution deployed.
Thu, Mar 6
Tue, Mar 4
Deployed to prod.
I just deployed the restbase change to enable the wiki in prod.
After talking with @Seddon, it's probably better if we swap ptwiki with srwiki+kowiki+idwiki.
@MSantos we need to double check that we set the same cache headers everywhere. For the rest of the stored endpoints we needed to add them because they were missing.
Mon, Mar 3
Thu, Feb 27
Wed, Feb 26
@Seddon Any objections on the list? I wasn't sure if there is a language specific experiment happening that could be an issue.
After discussing with @Ottomata and @Joe it sounds like a good idea to add more fields if we need them. Using meta though is a bad idea: https://wikitech.wikimedia.org/wiki/Event_Platform/Flaws#meta_field
Metrics overall look OK.
Using the client-side session data, I don't see any bump in session.page_load_latency, but I would like to see how a week's worth of traffic works.
More specifically:
Tue, Feb 25
@Joe I spent some time figuring out how EventBus works in order to create and emit events, but I don't think the current schema gives us the flexibility to add more information about the event namespace.
Do you think the issue described in this ticket justifies a schema change? I was thinking that the least invasive way is to just add a tag like:
{
  "meta": {
    "uri": "http://de.wikipedia.org/api/rest_v1/page/html/Heinrich_von_Othegraven",
    "stream": "resource_change",
    "request_id": "da40ebc5-556f-4609-be09-b6c82b0661e2",
    "id": "9c996201-f391-11ef-9615-87603b9b8d19",
    "dt": "2025-02-25T16:00:19.488Z",
    "domain": "de.wikipedia.org"
  },
  "$schema": "/resource_change/1.0.0",
  "tags": [ "restbase", "main_namespace" ],
  "triggered_by": "req:da40ebc5-556f-4609-be09-b6c82b0661e2,mediawiki.revision-create:https://de.wikipedia.org/wiki/Heinrich_von_Othegraven"
}
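The tag-based idea can be sketched as below. The function name is hypothetical and this is not how EventBus or change-prop builds events; only the `tags` field and the `main_namespace` value come from the example event, and namespace 0 is MediaWiki's main namespace:

```python
def with_namespace_tag(event, namespace_id):
    """Return a copy of the event with a namespace tag added to `tags`.

    `meta` is left untouched; only the top-level `tags` list changes.
    """
    tagged = dict(event)
    tags = list(event.get("tags", []))
    if namespace_id == 0 and "main_namespace" not in tags:
        tags.append("main_namespace")
    tagged["tags"] = tags
    return tagged


# Trimmed-down event for illustration; real events carry more fields.
event = {
    "$schema": "/resource_change/1.0.0",
    "meta": {"stream": "resource_change", "domain": "de.wikipedia.org"},
    "tags": ["restbase"],
}
```

Consumers could then filter on the tag without any change to the schema itself.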
I think so, yeah both master nodes and postgres read replicas.
Just a comment about the usage on the bare metal nodes: keep in mind that each node runs Postgres/PostGIS in addition to the Node service, and master nodes also run the OSM import pipeline.
Feb 20 2025
Using the following queries:
SELECT COUNT(*)
FROM (
  SELECT
    regexp_like(uri_path, '/api/rest_v1/page/mobile-html/(User_talk|User|Wikipedia|File_talk|File|Category_talk|Category|Draft|Template|Template_talk|Wikipedia_talk|Draft_talk|Portal|Module|Module_talk)\%3A.*') AS is_non_main_ns_mobile_html,
    uri_path
  FROM webrequest_sampled_128
  WHERE __time >= TIMESTAMP '2025-02-05 00:00:00'
    AND __time < TIMESTAMP '2025-02-06 00:00:00'
    AND uri_host = 'en.wikipedia.org'
)
WHERE is_non_main_ns_mobile_html = true
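For checking individual paths against the same predicate, here is a Python equivalent of the regular expression. It is a sketch that assumes `regexp_like` matches anywhere in the string and that the table name implies 1-in-128 sampling; the function names are illustrative:

```python
import re

NON_MAIN_NAMESPACES = (
    "User_talk|User|Wikipedia|File_talk|File|Category_talk|Category|Draft|"
    "Template|Template_talk|Wikipedia_talk|Draft_talk|Portal|Module|Module_talk"
)
NON_MAIN_MOBILE_HTML = re.compile(
    rf"/api/rest_v1/page/mobile-html/(?:{NON_MAIN_NAMESPACES})%3A.*"
)


def is_non_main_ns_mobile_html(uri_path):
    """Mirror of the regexp_like() predicate: match anywhere in the path."""
    return NON_MAIN_MOBILE_HTML.search(uri_path) is not None


def estimate_total(sampled_count, sampling_rate=128):
    """Scale a count from a 1-in-N sample back to an estimated total."""
    return sampled_count * sampling_rate
```

Note that the pattern only matches the percent-encoded separator (`%3A`); a path with a literal `:` after the namespace is not counted.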
Feb 19 2025
Verified on staging
When I try this request from my local env, this shows up in the trace logs, so it should indeed eventually request en.wikipedia.org:
[2025-02-19T05:00:20.053Z] TRACE: kartotherian/20 on 48bfdbb6c162: Outgoing request (request_id=6b03dd00-ee7e-11ef-b32e-cf29ca820246, levelPath=trace/req) out_request: { "method": "post", "uri": "https://en.wikipedia.org/w/api.php", "headers": { "user-agent": "kartotherian", "x-request-id": "6b03dd00-ee7e-11ef-b32e-cf29ca820246" }, "body": { "format": "json", "formatversion": "2", "action": "query", "revids": "1268325471", "prop": "mapdata", "mpdlimit": "max", "mpdgroups": [ "_1b46af921bb4e1f090a1b07748a50bd2e1f322fc" ] } } -- request: { "url": "/img/osm-intl,a,a,a,300x200.png?lang=en&domain=en.wikipedia.org&title=Alabama&revid=1268325471&groups=_1b46af921bb4e1f090a1b07748a50bd2e1f322fc", "headers": { "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:135.0) Gecko/20100101 Firefox/135.0", "x-request-id": "6b03dd00-ee7e-11ef-b32e-cf29ca820246" }, "method": "GET", "params": { "0": "/img/osm-intl,a,a,a,300x200.png" }, "query": { "lang": "en", "domain": "en.wikipedia.org", "title": "Alabama", "revid": "1268325471", "groups": [ "_1b46af921bb4e1f090a1b07748a50bd2e1f322fc" ] }, "remoteAddress": "192.168.65.1", "remotePort": 48165 }
For debugging purposes, this is a URL that requests a snapshot with an overlay map from en.wikipedia.org:
https://maps.wikimedia.org/img/osm-intl,a,a,a,300x200.png?lang=en&domain=en.wikipedia.org&title=Alabama&revid=1268325471&groups=_1b46af921bb4e1f090a1b07748a50bd2e1f322fc
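To make the relationship between the snapshot URL and the outgoing request explicit, the sketch below derives the `api.php` request from the URL's query string. The field values mirror the trace log; the function itself is illustrative, is not the kartotherian implementation, and splitting `groups` on commas is an assumption:

```python
from urllib.parse import parse_qs, urlsplit


def mapdata_request(snapshot_url):
    """Derive the outgoing api.php request from a snapshot URL."""
    query = parse_qs(urlsplit(snapshot_url).query)
    return {
        "uri": f"https://{query['domain'][0]}/w/api.php",
        "body": {
            "format": "json",
            "formatversion": "2",
            "action": "query",
            "revids": query["revid"][0],
            "prop": "mapdata",
            "mpdlimit": "max",
            # Assumption: multiple groups would be comma-separated.
            "mpdgroups": query["groups"][0].split(","),
        },
    }


URL = (
    "https://maps.wikimedia.org/img/osm-intl,a,a,a,300x200.png"
    "?lang=en&domain=en.wikipedia.org&title=Alabama&revid=1268325471"
    "&groups=_1b46af921bb4e1f090a1b07748a50bd2e1f322fc"
)
```

`mapdata_request(URL)` targets `https://en.wikipedia.org/w/api.php`, which is the endpoint that has to be reachable from the service for the overlay to render.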
Feb 18 2025
I agree, but the only env that could hang on en.wikipedia.org is k8s; maps nodes can talk to that endpoint directly.
The errors not showing up on the k8s side could be because there was actually no error: the ETIMEDOUT is raised on the client side, while the server side just hangs.
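The asymmetry can be reproduced locally. The sketch below is a Python analogue, not the Node code involved here: the server accepts the TCP connection but never replies, so the only error is the timeout raised on the client, and the server side has nothing to log:

```python
import socket


def client_times_out(timeout=0.2):
    """Connect to a server that never replies; return True if the client timed out."""
    # Server: listening, so the TCP handshake completes, but it never sends a byte.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(timeout)
    try:
        client.connect(server.getsockname())
        client.sendall(b"GET / HTTP/1.0\r\n\r\n")
        client.recv(1024)  # blocks until the client-side timeout fires
        return False
    except socket.timeout:
        # The error exists only here; the server never raised anything.
        return True
    finally:
        client.close()
        server.close()
```

This is why server-side logs alone cannot rule out a hang: the timeout has to be looked for in the caller's logs.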