
Decide on the future of DPL
Open, High, Public


Event Timeline


I'd argue we have to disable DPL everywhere; it has the potential to cause a full outage of our system from any wiki that has it turned on, whether intentionally or unintentionally.

I'd argue that all systems have scalability limits, and can potentially cause a full system outage if those scalability limits are reached. However, given current events, I can definitely understand where you're coming from.

Very roughly speaking, DPL scales with the size of the largest category (lots of caveats on that, but to a first approximation).

I ran some stats on the biggest category size on wikis with DPL enabled (caveat: this uses the category table, which isn't always accurate; also, some of these are maintenance categories, which people are less likely to query on).

Full list at https://bawolff.toolforge.org/max-cat-size-dpl-wikis.txt, but here are the top 10 plus the top 6 Wikinews projects:

wiki              SELECT max(cat_pages) FROM category;
ruwikinews_p      6627744
frwikisource_p    1479318
enwikisource_p    1012131
dewiktionary_p    994271
eswiktionary_p    818846
enwiktionary_p    776163
bnwikisource_p    637196
zhwikisource_p    569024
metawiki_p        484999
ptwiki_p          401754
[...]
srwikinews_p      52884
[...]
ptwikinews_p      22634
[...]
enwikinews_p      21587
[...]
zhwikinews_p      15164

[Note: srwikinews is high on this list because it too is known for its bot imports of free news sources.] Wikinews projects depend on DPL much more than other projects, and most of them operate at a scale several orders of magnitude smaller than ruwikinews.
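The per-wiki figures above can be reproduced with a query of this shape. This is a self-contained sketch using SQLite and made-up data; the real numbers were gathered from each wiki's category table on the replicas.

```python
import sqlite3

# Hypothetical miniature of one wiki's `category` table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category (cat_title TEXT, cat_pages INTEGER)")
conn.executemany(
    "INSERT INTO category VALUES (?, ?)",
    [("Published", 180000), ("Politics", 52000), ("Maintenance", 900)],
)

# The per-wiki figure in the table above is simply the largest cat_pages
# value, i.e. the size of the biggest category on that wiki.
(max_cat,) = conn.execute("SELECT max(cat_pages) FROM category").fetchone()
print(max_cat)  # 180000
```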

I personally think it would be reasonable to do a hard cut-off at wikis with > 100,000 articles in the largest category. This is chosen somewhat arbitrarily, and I could understand selecting a much lower number, but I think it would mitigate (though not totally remove) the pressing concerns, and make the feature similar in performance to a large watchlist. This would remove DPL from the following wikis: frwikisource_p, enwikisource_p, dewiktionary_p, eswiktionary_p, enwiktionary_p, bnwikisource_p, zhwikisource_p, metawiki_p, ptwiki_p, srwiki_p, ruwikisource_p, itwikisource_p, dewikisource_p, tawikisource_p, plwikisource_p, viwiktionary_p, hiwikisource_p. I also think we should generally only enable it on wikis that actually want and use it; e.g. French Wikisource is the highest on this list (excluding ru.wn). Although it's interesting to note that many of the other non-Wikinews projects use DPL in a very different way, i.e. with no intersection. If DPL is used with only a single category and the sort method set to either categorysortkey or categoryadd (the default), it's actually pretty efficient, which seems to be what some of the Wikisource projects are doing.

Potentially this sort of measure could be built into DPL itself very easily: e.g. before issuing a query, check the size of the smallest category involved, and ensure that it is not greater than X.
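A minimal sketch of that guard, again against SQLite with made-up data (the function name, the 100,000 cut-off from above, and the table contents are all illustrative, not the extension's actual code):

```python
import sqlite3

MAX_CATEGORY_SIZE = 100_000  # the arbitrary cut-off suggested above

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category (cat_title TEXT PRIMARY KEY, cat_pages INTEGER)")
conn.executemany(
    "INSERT INTO category VALUES (?, ?)",
    [("Published", 180000), ("Politics", 52000)],
)

def dpl_query_allowed(categories):
    """Allow a DPL query only if the *smallest* category involved is under
    the cut-off (cost roughly scales with the smallest category of the
    intersection)."""
    sizes = []
    for title in categories:
        row = conn.execute(
            "SELECT cat_pages FROM category WHERE cat_title = ?", (title,)
        ).fetchone()
        sizes.append(row[0] if row else 0)
    return min(sizes) <= MAX_CATEGORY_SIZE

print(dpl_query_allowed(["Published"]))              # False: 180k alone is too big
print(dpl_query_allowed(["Published", "Politics"]))  # True: smallest is 52k
```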

Although I also totally understand why SREs might want to take a more conservative approach than this, given recent events.

Thanks for the info and the numbers, I will definitely take a deep look at it and see what we can do. One thing I want to add, to show where I'm coming from, is the DDoS vector. The most recent outage was done in good faith, but the same thing could easily be abused on another wiki.

I like the categorylinks approach; it's just that the table doesn't have a counter, so it would fall back to grouping and scanning lots of rows. If there are proper indexes in place it should still be fast, but I need to double-check. categorylinks itself is in terrible shape. I have some plans to improve it (T222224), but it'll take years to finish.

Oh, cat_pages exists in category, so we don't need to rely on categorylinks. That's pretty good.

Besides the aforementioned issue (which can be mitigated to some degree), DPL has other scalability issues that need examining before taking any action.

Which of the four options described at https://www.mediawiki.org/wiki/Extension:DynamicPageList is this task about, please?

Thanks. The page I linked to seems to imply that a couple of the options there are perhaps more recent and/or actively maintained? Since I don't know or understand the details, I am left wondering whether those are also affected by the issues being debated in this task.

If you mean that we replace it with another DPL extension, this has been looked at, and those extensions seem to be even worse than the one currently deployed: T262391#6449305

Change 708374 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/mediawiki-config@master] Stop enabling DPL on new wikis

https://gerrit.wikimedia.org/r/708374

Change 708376 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/extensions/intersection@master] Add a tracking category to pages using the <DynamicPageList> tag

https://gerrit.wikimedia.org/r/708376

Notwithstanding potential improvements to DPL, I think we can take the step of disabling it on projects where it's not being used. I submitted two patches. The first, "Stop enabling DPL on new wikis", will stop enabling DPL based on membership in a specific wiki family and instead require explicit enabling for each wiki. The second, "Add a tracking category to pages using the <DynamicPageList> tag", will allow us to track and identify which wikis are using DPL. Once it's deployed, I can run refreshLinks to populate it pretty quickly.

I personally think it would be reasonable to do a hard cut-off at wikis with > 100,000 articles in the largest category. This is chosen somewhat arbitrarily, and I could understand selecting a much lower number, but I think it would mitigate (though not totally remove) the pressing concerns, and make the feature similar in performance to a large watchlist. This would remove DPL from the following wikis: frwikisource_p, enwikisource_p, dewiktionary_p, eswiktionary_p, enwiktionary_p, bnwikisource_p, zhwikisource_p, metawiki_p, ptwiki_p, srwiki_p, ruwikisource_p, itwikisource_p, dewikisource_p, tawikisource_p, plwikisource_p, viwiktionary_p, hiwikisource_p. I also think we should generally only enable it on wikis that actually want and use it; e.g. French Wikisource is the highest on this list (excluding ru.wn). Although it's interesting to note that many of the other non-Wikinews projects use DPL in a very different way, i.e. with no intersection. If DPL is used with only a single category and the sort method set to either categorysortkey or categoryadd (the default), it's actually pretty efficient, which seems to be what some of the Wikisource projects are doing.

Potentially this sort of measure could be built into DPL itself very easily: e.g. before issuing a query, check the size of the smallest category involved, and ensure that it is not greater than X.

What you've said sounds plenty reasonable to me, +1 from me, but...

Although I also totally understand why SREs might want to take a more conservative approach than this, given recent events.

While DPL's flexibility is its greatest feature, it's also what makes it super scary from an SRE perspective. If you're not already familiar with the extension, it's hard to discern from a quick look at wikitext whether it's doing some slow query or using the fast mode. You mentioned that the non-Wikinews projects use DPL very differently; can we have a dedicated parser tag that enables *just* that safer usage?

...and make the feature similar in performance to a large watchlist.

This gave me the idea of just sending DPL queries to vslow (originally I thought of sending them to watchlist group, but @Ladsgroup pointed me to T263127: Remove groups from db configs, so vslow is a better option), which I think would prevent queries from taking down general traffic, I'll put up a patch for that too. It does feel wrong to "punish" fast queries by sending them to vslow, so if we had some heuristics on whether DPL queries were fast or slow that would be nice.

Change 708390 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/extensions/intersection@master] Send queries to "vslow" database group

https://gerrit.wikimedia.org/r/708390

Change 708376 merged by jenkins-bot:

[mediawiki/extensions/intersection@master] Add a tracking category to pages using the <DynamicPageList> tag

https://gerrit.wikimedia.org/r/708376

Change 708224 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/extensions/intersection@wmf/1.37.0-wmf.15] Add a tracking category to pages using the <DynamicPageList> tag

https://gerrit.wikimedia.org/r/708224

Change 708225 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/extensions/intersection@wmf/1.37.0-wmf.16] Add a tracking category to pages using the <DynamicPageList> tag

https://gerrit.wikimedia.org/r/708225

Change 708390 merged by jenkins-bot:

[mediawiki/extensions/intersection@master] Send queries to "vslow" database group

https://gerrit.wikimedia.org/r/708390

For reference, DPL queries can be grouped into the following four performance categories (from best to worst):

  1. Efficient (Similar to rendering a category page. Read a small-ish number of rows in sequential order of some index)
    • Single category specified. Order method one of categoryadd (Default) or categorysortkey
  2. Look at a large number of rows of categorylinks, no filesort. If the intersection is dense, this could potentially return quickly without looking at all that many rows
    • Multiple category clauses, with the smallest category mentioned first. Order method is one of categoryadd (default) or categorysortkey
  3. Look at all categorylinks rows for the smallest category of the intersection, filesort the results
    • Multiple category clauses if the smallest category is not first
    • All cases where the order method is created, length or lastedit (lastedit really means last touched)
  4. filesort the page table
    • Basically any case where no categories are specified, but a namespace is. Depending on MySQL size metrics, sometimes this might scan a large number of rows and filter instead.

...and make the feature similar in performance to a large watchlist.

This gave me the idea of just sending DPL queries to vslow (originally I thought of sending them to watchlist group, but @Ladsgroup pointed me to T263127: Remove groups from db configs, so vslow is a better option), which I think would prevent queries from taking down general traffic, I'll put up a patch for that too. It does feel wrong to "punish" fast queries by sending them to vslow, so if we had some heuristics on whether DPL queries were fast or slow that would be nice.

We need to fix at least two things before we do this:

  1. The queries are definitely slow: 3 minutes on an idle server (eqiad) for the query to run: https://phabricator.wikimedia.org/P16896 (around 30 seconds if the same query is run again, which is still a lot)
  2. A burst of queries will overload the host (keep in mind that most of the vslow groups have only 1 host), and my understanding is that if that host is unavailable, the traffic will shift to any other host within that section. Is that the case?

We definitely need to limit the number of queries that are allowed to be sent.

Note: Extension:GoogleNewsSitemap, which powers the RSS feeds on Wikinews ( https://en.wikinews.org/w/index.php?title=Special:NewsFeed&format=atom ), does similar queries.

This gave me the idea of just sending DPL queries to vslow (originally I thought of sending them to watchlist group, but @Ladsgroup pointed me to T263127: Remove groups from db configs, so vslow is a better option), which I think would prevent queries from taking down general traffic, I'll put up a patch for that too. It does feel wrong to "punish" fast queries by sending them to vslow, so if we had some heuristics on whether DPL queries were fast or slow that would be nice.

We need to fix at least two things before we do this:

  1. The queries are definitely slow: 3 minutes on an idle server (eqiad) for the query to run: https://phabricator.wikimedia.org/P16896 (around 30 seconds if the same query is run again, which is still a lot)
  2. A burst of queries will overload the host (keep in mind that most of the vslow groups have only 1 host), and my understanding is that if that host is unavailable, the traffic will shift to any other host within that section. Is that the case?

We definitely need to limit the number of queries that are allowed to be sent.

To clarify after a brief IRC discussion: moving queries to vslow is by no means a solution. It just makes the failure mode a bit better, as it would first take down a vslow replica before taking down a normal replica, giving us a few seconds or minutes before an overload affects general traffic. The two things you mentioned are still problems that need addressing.

I think moving their workflows over to use a bot is the best idea for now, I don't think there are any magical fixes that'll make DPL work.

Putting my money where my mouth is... https://gitlab.com/legoktm/dplbot/ is the start of a port of DPL to mwparserfromhell + Pywikibot. It'll probably take another day or two to finish, happy to add more collaborators if people are interested in working on it. Note that I have no intention of running or maintaining it long-term, I'm just writing it to prove that a bot replacement is feasible and probably superior to the extension.

@Legoktm How do you plan to call this bot? Are you planning to continuously loop through millions of categories and create a list of the latest news in each of them?

@Bawolff Maybe you can try replacing database queries with CirrusSearch queries. It won't take long but we can at least test this hypothesis.

@Legoktm How do you plan to call this bot? Are you planning to continuously loop through millions of categories and create a list of the latest news in each of them?

I don't see why that would be necessary. Just looping through every page that uses DPL is a naive option that'll probably work for smaller wikis. Another might be to watch for newly created or published articles (e.g. enwikinews looks like it gets ~2-3 new articles per day), and then work backwards and update just those.
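The core of such a bot is just "fetch category members, sort by date, emit a wikitext list". A minimal sketch, with the data injected directly so it's self-contained (in a real bot, `members` would come from something like Pywikibot's Category.articles() or the MediaWiki API; render_news_list and the sample titles are hypothetical):

```python
def render_news_list(members, count=5):
    """members: (title, ISO timestamp) pairs. Emit a wikitext bullet list of
    the newest `count` pages, roughly what a <DynamicPageList> tag with the
    default ordering would have produced."""
    newest = sorted(members, key=lambda m: m[1], reverse=True)[:count]
    return "\n".join(f"* [[{title}]]" for title, _ in newest)

members = [
    ("Older story", "2021-07-30T08:00:00Z"),
    ("Newest story", "2021-08-02T12:00:00Z"),
    ("Middle story", "2021-08-01T09:30:00Z"),
]
print(render_news_list(members, count=2))
# * [[Newest story]]
# * [[Middle story]]
```

The bot would then save this text to the portal or main page, replacing the DPL tag's output.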

@Legoktm

2-3 articles can be added by hand. We don't need a robot for this.

DPL does an excellent job with 2-3 articles.

And it's not just about creating articles, but about any change where a category can be added or removed. There can be tens of thousands of changes per day.

There is no problem writing such a bot. The problem is that it won't do the real thing.

Moreover, I have a bot that analyzes all edits in Russian Wikinews and makes the required technical changes.

Even with the existing load, I have to automatically kill processes several times a day, otherwise they grow like an avalanche. That is, the bot ignores some of the changes during peak loads.

It definitely won't manage if one has to rewrite hundreds of categories. This would require serious dedicated computing power.

We need list generation on the fly, on demand. This would reduce the useless load by tens of thousands of times, as I wrote above. Moreover, such a list can also be cached.

@Bawolff Maybe you can try replacing database queries with CirrusSearch queries. It won't take long but we can at least test this hypothesis.

That's beyond the level of effort I'm willing to commit to at this time.

In any case, that approach is at the rough-idea stage; it's not just a simple matter of programming, and it doesn't yet have buy-in from the relevant stakeholders.

Change 708225 merged by jenkins-bot:

[mediawiki/extensions/intersection@wmf/1.37.0-wmf.16] Add a tracking category to pages using the <DynamicPageList> tag

https://gerrit.wikimedia.org/r/708225

Change 708224 merged by jenkins-bot:

[mediawiki/extensions/intersection@wmf/1.37.0-wmf.15] Add a tracking category to pages using the <DynamicPageList> tag

https://gerrit.wikimedia.org/r/708224

As a German Wikinews editor (and admin), I consider removing DPL a bad solution. In most, perhaps all, of the language versions I know, we use it to generate different newsfeeds in what we call news portals, which live in the portal namespace. Some languages use or used DPL on the homepage, but those are all wikis with only a couple of articles a day.

The Russian Wikinews does two things differently: they use DPL in the category namespace, and they use it on the main page. Considering the performance issue on s3: could the main-page usage of DPL be the cause? This strongly reminds me of the incident caused by the "Dackelvandale" in the German Wikipedia about a dozen years ago, when the so-called troll transcluded the newest-articles special page 50 times or so on the main page, which brought the servers to a standstill. (IIRC, since then transclusion of certain special pages isn't possible anymore, and many wikis have protected or semi-protected main pages.)

Also, a flaw in the category architecture could cause a timeout if some part of the tree is included elsewhere in itself; I don't know the word for it. For example, in the English WP, Category:World is put in Category:Universum, and some levels above, this again is in Category:World, like a snake eating its own tail. Since I don't speak Russian, I did not analyze the category tree. But moving the DPLs from category pages to portal pages might help. Or not.

Please enable DPL at least on the Main Page of RWN. This one page seems safe for the servers, and it's the most crucial one for RWN.

I think that it would be unwise to disable DPL; perhaps it would be better to slow down the queries from imports to prevent another failure like this, so that the imports still work (but more slowly) while other Wikimedia websites aren't affected.

Please pay attention to RWN's call for Board of Trustees candidates to share their opinion on current issue as candidates: https://ru.wikinews.org/wiki/Викиновости:Форум/Общий#Appeal_to_candidates_of_Wikimedia_Foundation_Board_of_Trustees_elections_2021

(it's in English so anyone can read)

Change 708374 merged by jenkins-bot:

[operations/mediawiki-config@master] Stop enabling DPL on new wikis

https://gerrit.wikimedia.org/r/708374

Mentioned in SAL (#wikimedia-operations) [2021-08-02T23:21:12Z] <legoktm> Previous sync also deployed c38998f03f "Stop enabling DPL on new wikis" (T287380)

Please enable DPL at least on the Main Page of RWN. This one page seems safe for the servers, and it's the most crucial one for RWN.

I don't believe it is.

I think that it would be unwise to disable DPL; perhaps it would be better to slow down the queries from imports to prevent another failure like this, so that the imports still work (but more slowly) while other Wikimedia websites aren't affected.

At this point, it's less about the imports than about the total size of the wiki. Even if nobody ever created a new article on ruwn again, it's still too big at this point.

Just to summarize some additional investigation that was done:

  • The triggering event seems to be this edit https://ru.wikinews.org/w/index.php?title=%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%96%D1%83%D1%80%D0%BD%D0%B0%D0%BB%D1%8B&logid=19137921
    • Creating that category page triggered a job which parsed a large number of pages that had a DPL query on them, causing the DB to overload
  • It's possible (although I don't have conclusive evidence) that MariaDB was using a suboptimal query plan for this particular query, which may have exacerbated the situation. This particular DPL query may have been scaling with the size of the categorylinks table (44M rows) instead of the size of the category (180k). I didn't originally notice that ordermethod in the query was set to created instead of categoryadd; categoryadd has more efficient execution
  • ruwikinews had log entries showing that this particular DPL query (the one for the infobox on "В мире‏‎") had been failing due to timeouts for several days in the lead-up to the incident. This meant that the mitigation introduced after the previous incident (wgDLPQueryCacheTime, which was really a band-aid) was at least sometimes not applying, as it doesn't work for queries that take so long they time out.
  • Possibly the mitigation from last time (wgDLPQueryCacheTime) helped us scale further, but when it did fail, it resulted in a much harder failure

So I originally missed that the ordermethod was set to "created" (aka page_id) in the DPL query. This made me confused about the query plan chosen, and I said some incorrect things about it being suboptimal.

Ordering by "created" is often less efficient than ordering by categoryadd in this extension. We should maybe remove the less efficient order methods from the extension.

Edit: The query plan was still a bad choice; it's just a little more reasonable that MariaDB chose it. I clarified on the incident doc page.

@Bawolff, As you may know, our colleague @Krassotkin was recently banned by the WMF through an "office action", and we believe that this latest incident was also blamed on him - according to their wording "misuse of facilities" and "severe damage to the technical infrastructure of the projects".

See:

As far as I can see from this discussion, he was not involved in the last incident, and after the previous incident, he cannot be blamed for any action or inaction. He had no way of preventing or predicting this incident.

I ask you to explicitly confirm or refute this conclusion.

I would also like to ask you to explicitly list which of the Foundation employees should be considered responsible for the current situation. Not for accusations, but only for clarification to those who do not fully understand the distribution of responsibility in technological matters.

I ask you to explicitly confirm or refute this conclusion.

Your conclusion is 100% incorrect.

Bawolff doesn't even work for the WMF, and hasn't for two years now, please do not harass him…

This is not a place to discuss office actions. Please only leave comments related to the task, i.e. related to the future of DPL.

This is not a place to discuss office actions. Please only leave comments related to the task, i.e. related to the future of DPL.

I'm sorry, but "future of DPL" is not my interest at this time. I am interested in the current state of the DPL and who exactly is responsible for it.
I am not inviting anyone to discuss office actions here. I asked a not very difficult and quite specific question in response to a comment T287380#7258587

If you know of another place and another addressee to whom I could ask these questions, just suggest them.

I ask you to explicitly confirm or refute this conclusion.

Your conclusion is 100% incorrect.

Dear Giuseppe, if you have knowledge related to these questions, please explain, maybe in another place. My knowledge of this whole story is not small, but in the past I was not directly involved in the analysis and heard mainly from one party.

Bawolff doesn't even work for the WMF, and hasn't for two years now, please do not harass him…

I'm sorry, but I'm not very interested in who works for the WMF and who does not. I saw a comment with analysis and asked for clarification. If you don't have an answer, you can safely ignore my comment.

@Kaganer: Etiquette asks us to criticize ideas, not people. In case you are looking for lists of people for whatever reason, that sounds both very unhelpful and like the wrong place. If you are interested in current code stewardship, then https://www.mediawiki.org/wiki/Developers/Maintainers might provide some insights.

Bawolff doesn't even work for the WMF, and hasn't for two years now, please do not harass him…

I'm sorry, but I'm not very interested in who works for the WMF and who does not. I saw a comment with analysis and asked for clarification. If you don't have an answer, you can safely ignore my comment.

You may rest assured that I will ignore your comments in the future. Good luck finding someone who does not.

@Aklapper, this distracts from the topic. By the way, I'm not a new user in the movement, nor in IT. I know perfectly well where the code is, where the management of the code is, where the management of the software infrastructure is (and where executive management is). And these are four different countries. And I'm not looking for guilty people. But I'm not trying to justify anyone either.

OK, I've read T287362 now. As I understand it, after DPL was disabled for ru-Wikinews on 2021-08-26, the project participants figured out within three weeks how to do without this extension at all. It is no longer used or called.

Do I understand correctly that this was the latest incident? And the previous one was a year earlier, on 2020-09-09?

And were there any instructions or restrictions for the content project between the previous incident and the last one? For the user? A maximum number of page creations per day? Anything else? Personal restrictions on a user or their bot?

If these questions have already been answered somewhere, I would be grateful for a link to that place.

Hello everyone,

We would like to remind all discussion participants here that a Code of Conduct applies in all Wikimedia Technical spaces, including Wikimedia Phabricator.

To avoid unnecessarily heated discussions and/or CoC violations, I would like to ask everyone to re-review the Code of Conduct, and to think twice before saving a comment touching on the area of DPL.

If you encounter a conduct issue that you'd appreciate help with, please do not hesitate to contact the committee. Details about how issues can be reported are available in the Code of Conduct itself, under "Report a problem".

Sincerely,
Martin Urbanec (@Urbanecm), Code of Conduct committee.

And when was the previous one?

See for example the links in the "Mentions" box in T287362, which themselves and in further outgoing links should answer your questions.

And when was the previous one?

See for example the links in the "Mentions" box in T287362, which themselves and in further outgoing links should answer your questions.

Thank you, I'll read it all. Who could I ask questions about the responsibility for all this? About managerial responsibility, not development responsibility?

And when was the previous one?

See for example the links in the "Mentions" box in T287362, which themselves and in further outgoing links should answer your questions.

Thank you, I'll read it all. Who could I ask questions about the responsibility for all this? About managerial responsibility, not development responsibility?

As you can see at https://www.mediawiki.org/wiki/Developers/Maintainers, no team is currently assigned to maintain the DPL extension. Since there's no WMF team that is supposed to maintain DPL, that also means there are no managers :).

A lot of the code that runs in Wikimedia production is maintained by individual volunteers from the technical community, rather than by WMF/WMDE teams. Just as Russian Wikipedia's community doesn't have managers, neither does the technical community :). If you're interested in participating and improving Wikimedia's code, https://www.mediawiki.org/wiki/New_Developers is likely a good place to start.

If these questions have already been answered somewhere, I would be grateful for a link to that place.

T287362#7241776, T287362#7242156

As you can see at https://www.mediawiki.org/wiki/Developers/Maintainers, no team is currently assigned to maintain the DPL extension. Since there's no WMF team that are supposed to maintain DPL, that also means there are no managers :).

I think the real question would be: who would be the manager that could be petitioned to sponsor a project to rewrite DPL sufficiently to be usable on ruwikinews again (which seems like a significant effort, not some small fix anyone would be willing to do on the side)? TBH, I find it exceedingly unlikely that this would be considered a good place to allocate resources (sorry, but there are just too many things that are more important and also in need of resources), but my personal opinion aside, these are the venues I can think of:

Update: From now on, any DB queries made by DPL have a timeout of ten seconds. This was done as part of T297708: Set max execution time for several expensive mediawiki actions. This won't remove DPL's ability to bring down all of Wikipedia, but it will make it drastically harder. It also means that if a wiki grows really big and uses DPL heavily in complex ways, its DPL extension might effectively get disabled due to this timeout.
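The effect of such a cap can be demonstrated in miniature. Production enforces this via MariaDB's statement-time limit; the sketch below uses SQLite's progress handler instead, purely so it is self-contained, and the deadline is shortened so the demo runs quickly.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")

DEADLINE = 0.05  # seconds; the real limit described above is ten seconds
start = time.monotonic()

def abort_if_over_deadline():
    # A nonzero return value makes SQLite abort the running statement,
    # raising sqlite3.OperationalError in the caller.
    return 1 if time.monotonic() - start > DEADLINE else 0

# Check the deadline every 1000 SQLite VM instructions.
conn.set_progress_handler(abort_if_over_deadline, 1000)

try:
    # A deliberately expensive self-join standing in for a slow DPL query.
    conn.execute(
        "WITH RECURSIVE n(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM n "
        "LIMIT 100000) SELECT count(*) FROM n a, n b"
    ).fetchone()
    outcome = "finished"
except sqlite3.OperationalError:
    outcome = "aborted"

print(outcome)  # aborted: the statement ran past the deadline
```

The trade-off noted above shows up here too: the cap is applied per statement, so it cannot distinguish a legitimately heavy query on a big wiki from an abusive one.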

I urge investigating whether the Russian issue can be minimized if DPL is used not on category pages but only on index pages (most likely portal pages, like [[:n:de:Portal:Berlin]]), which (as I understand it) would render the DPL only when a specific portal page is accessed (or purged), and not every time an article is added to a category.