Phabricator

Decide on the future of DPL
Open, High, Public

Event Timeline


IIRC, Wikimedia's DPL fork was created as part of the Wikivoyage migration rush because a few of the incoming communities insisted they needed it; some of those extensions have been subsequently worked around and dropped from Wikimedia production, and that's probably the best outcome for DPL too. I appreciate that a couple of wikis have built processes that rely upon it, but I can't see it being remotely justified given the complexity and risk to the wider Wikimedia movement.

Can DPL be somehow optimized so as not to overload the DB?

IIRC, Wikimedia's DPL fork was created as part of the Wikivoyage migration rush because a few of the incoming communities insisted they needed it; some of those extensions have been subsequently worked around and dropped from Wikimedia production, and that's probably the best outcome for DPL too. I appreciate that a couple of wikis have built processes that rely upon it, but I can't see it being remotely justified given the complexity and risk to the wider Wikimedia movement.

Brion says wikinews - https://www.mediawiki.org/wiki/Special:Code/MediaWiki/7981

It turns out I didn't remember correctly. ;-)

I haven't fully reviewed the incident yet, but my understanding is that our DPL fork isn't as bad as the others, though it still has some issues. Most Wikinewses are pretty small, so it doesn't become a big performance issue, but ruwikinews has rapidly grown in size (congrats!) and is no longer small. I think moving their workflows over to use a bot is the best idea for now; I don't think there are any magical fixes that'll make DPL work.

IIRC, Wikimedia's DPL fork was created as part of the Wikivoyage migration rush because a few of the incoming communities insisted they needed it; some of those extensions have been subsequently worked around and dropped from Wikimedia production, and that's probably the best outcome for DPL too. I appreciate that a couple of wikis have built processes that rely upon it, but I can't see it being remotely justified given the complexity and risk to the wider Wikimedia movement.

Fyi, this is incorrect. Wikimedia's DPL is the original version of the extension. It was made specifically for Wikinews by contributors to enwikinews, long before Wikivoyage was a thing.

I agree though, it's untenable for projects the size of ruwikinews.

I haven't fully reviewed the incident yet, but my understanding is that our DPL fork isn't as bad as the others, though it still has some issues. Most Wikinewses are pretty small, so it doesn't become a big performance issue, but ruwikinews has rapidly grown in size (congrats!) and is no longer small. I think moving their workflows over to use a bot is the best idea for now; I don't think there are any magical fixes that'll make DPL work.

I'd argue we have to disable DPL everywhere; it has the potential to cause a full outage of our system, triggered from any wiki that has it turned on, intentionally or unintentionally.

We cannot implement the functionality of news feeds through bots.

Wikinews contains news feeds both on the home page and in all categories. Any news projects look the same.

Categories on Wikinews are like tags in other news projects. In particular, they are equivalent to articles on Wikipedia but much broader.

In the limit, Wikinews can contain more categories than there are Wikidata Q-entities; that is millions of categories, as we can see from the example of social media tags. And note that this problem has been successfully solved there.

If we implement the news feed for a category through requests, it will be generated only when requested. If we implement this function through bots, we will have to prepare these pages in advance anyway, whether someone needs them or not.

In the latter case, this means thousands of times more constant API requests (page reads and rewrites).

On the other hand, in the last discussion it was decided to rewrite DPL on top of CirrusSearch.

I implemented this idea in a bot that updates news on Wikipedia portals. It works instantly and flawlessly. Moreover, the implementation of this functionality took me less than an hour.

For example, see the "Викиновости" (Wikinews) block on https://ru.wikipedia.org/wiki/Портал:Музыка.

That is why I suggest trying to implement that proposal. I can see that it works. I just don't write PHP.

On the other hand, in the last discussion it was decided to rewrite DPL on top of CirrusSearch.

"Decided" is a strong word. It was suggested as a possible way forward, but nobody was interested in volunteering to do the work necessary. As of this writing, there is still nobody interested, afaik.

And to be clear, while it isn't rocket science, it's also a non-trivial effort (in addition to rewriting the DPL query-engine part, the ES indices would also have to be augmented with additional data, I think. Performance would have to be evaluated carefully. While ES scales to this type of workload much better than naive self-joins in MariaDB, it's still something that would need to be measured to ensure the existing cluster could handle it).

In the latter case, this means thousands of times more constant API requests (page reads and rewrites).

Requesting/editing a page is a fast operation for the server. Thousands of times more of those is probably easier on the servers, at the scale of ruwikinews, than the current solution.

I'd argue we have to disable DPL everywhere; it has the potential to cause a full outage of our system, triggered from any wiki that has it turned on, intentionally or unintentionally.

I'd argue that all systems have scalability limits, and can potentially cause a full system outage if those scalability limits are reached. However, given current events, I can definitely understand where you're coming from.

Very roughly speaking, DPL scales with the size of the largest category (lots of caveats on that, but to a first approximation).

I ran some stats on what the biggest category size is on wikis with DPL enabled (caveat: this uses the category table, which isn't always accurate; also, sometimes these are maintenance categories, which people are less likely to query on).

Full list at https://bawolff.toolforge.org/max-cat-size-dpl-wikis.txt; here are the top 10, plus the top 6 Wikinews projects:

wiki            SELECT MAX(cat_pages) FROM category;
ruwikinews_p    6627744
frwikisource_p  1479318
enwikisource_p  1012131
dewiktionary_p   994271
eswiktionary_p   818846
enwiktionary_p   776163
bnwikisource_p   637196
zhwikisource_p   569024
metawiki_p       484999
ptwiki_p         401754
[...]
srwikinews_p      52884
[...]
ptwikinews_p      22634
[...]
enwikinews_p      21587
[...]
zhwikinews_p      15164

[Note: srwikinews is high on this list because it too is known for its bot imports of free news sources.] Wikinews projects depend on DPL much more than other projects, and most of them operate at a scale several orders of magnitude smaller than ruwikinews.

I personally think it would be reasonable to do a hard cut-off at wikis with > 100,000 articles in the largest category. This is chosen kind of arbitrarily, and I could understand selecting a much lower number, but I think that would mitigate (though not totally remove) the pressing concerns, and make the feature similar in performance to a large watchlist. This would remove DPL from the following wikis: frwikisource_p, enwikisource_p, dewiktionary_p, eswiktionary_p, enwiktionary_p, bnwikisource_p, zhwikisource_p, metawiki_p, ptwiki_p, srwiki_p, ruwikisource_p, itwikisource_p, dewikisource_p, tawikisource_p, plwikisource_p, viwiktionary_p, hiwikisource_p. I also think we should generally only enable it on wikis that actually want it/use it, e.g. French Wikisource is the highest on this list (excluding ru.wn). Although it's interesting to note that many of the other non-Wikinews projects use DPL in a very different way, i.e. no intersection. If DPL is used with only a single category and the sort method set to either categorysortkey or categoryadd (the default), it's actually pretty efficient, which seems to be what some of the Wikisource projects are doing.

Potentially this sort of measure could be built in to DPL itself very easily - e.g. before issuing a query, check the size of the smallest category involved, and ensure that it is not greater than X.
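A minimal sketch of that guard (Python illustration; the real extension is PHP, and the constant and function names here are made up):

```python
MAX_CATEGORY_SIZE = 100_000  # hypothetical cut-off, per the discussion above

def allow_dpl_query(category_sizes, limit=MAX_CATEGORY_SIZE):
    """category_sizes: cat_pages counts for each category in the
    intersection. The query only ever needs to scan the smallest
    category, so that is the size to bound."""
    if not category_sizes:
        return False  # no category clause at all means a page-table filesort
    return min(category_sizes) <= limit

print(allow_dpl_query([6_627_744]))         # ruwikinews-scale category alone
print(allow_dpl_query([6_627_744, 1_200]))  # intersecting with a small category is fine
```

Note the key property: one small category in the intersection is enough to make the query acceptable, because the scan is bounded by the smallest category, not the largest.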

Although I also totally understand why SREs might want to take a more conservative approach than this given recent events.

Thanks for the info and the numbers, I will definitely take a deep look at it and see what we can do. One thing I want to add, to show where I'm coming from, is the DDoS vector: the most recent outage was triggered in good faith, but it can easily be abused from another wiki.

I like the categorylinks approach; it's just that the table doesn't have a counter, so it would fall back to grouping and scanning lots of rows. If there are proper indexes in place it should still be fast, but I need to double-check. categorylinks itself is in terrible shape. I have some plans to improve it (T222224) but it'll take years to finish.

Oh, cat_pages exists in category, so we don't need to rely on categorylinks. That's pretty good.
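To illustrate the difference (SQLite sketch with a simplified schema; the real tables have more columns, and cat_pages can drift out of sync, as noted above): the category table keeps a maintained per-category counter, so checking a category's size is a single-row lookup rather than a count over categorylinks.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE category (cat_title TEXT PRIMARY KEY, cat_pages INTEGER);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
CREATE INDEX cl_to_idx ON categorylinks (cl_to, cl_from);
""")
db.execute("INSERT INTO category VALUES ('Published', 3)")
db.executemany("INSERT INTO categorylinks VALUES (?, 'Published')",
               [(1,), (2,), (3,)])

# One-row lookup on the maintained counter.
fast = db.execute(
    "SELECT cat_pages FROM category WHERE cat_title = 'Published'").fetchone()[0]
# Counting categorylinks touches one index entry per member page
# (trivial here; ~6.6M rows for ruwikinews's biggest category).
slow = db.execute(
    "SELECT COUNT(*) FROM categorylinks WHERE cl_to = 'Published'").fetchone()[0]
print(fast, slow)
```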

Besides the aforementioned issue (which can be mitigated to some degree), DPL has other scalability issues that need examining before taking any action.

Which of the four options described at https://www.mediawiki.org/wiki/Extension:DynamicPageList is this task about, please?

Thanks. The page I linked to seems to imply that a couple of options there are perhaps more recent and/or actively maintained? Since I don't know or understand the details, I am left wondering whether those are also affected by the issues being debated on this task.

If you mean that we replace it with another DPL extension: this has been looked at, and those extensions seem to be even worse than the one currently deployed: T262391#6449305

Change 708374 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/mediawiki-config@master] Stop enabling DPL on new wikis

https://gerrit.wikimedia.org/r/708374

Change 708376 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/extensions/intersection@master] Add a tracking category to pages using the <DynamicPageList> tag

https://gerrit.wikimedia.org/r/708376

Notwithstanding potential improvements to DPL, I think we can take the step of disabling it on projects where it's not being used. I submitted two patches. The first, "Stop enabling DPL on new wikis", will stop enabling DPL based on membership in a specific wiki family and instead require explicit enabling for each wiki. The second, "Add a tracking category to pages using the <DynamicPageList> tag", will allow us to track and identify which wikis are using DPL. Once deployed, I can run refreshLinks to populate it pretty quickly.

I personally think it would be reasonable to do a hard cut-off at wikis with > 100,000 articles in the largest category. This is chosen kind of arbitrarily, and I could understand selecting a much lower number, but I think that would mitigate (though not totally remove) the pressing concerns, and make the feature similar in performance to a large watchlist. This would remove DPL from the following wikis: frwikisource_p, enwikisource_p, dewiktionary_p, eswiktionary_p, enwiktionary_p, bnwikisource_p, zhwikisource_p, metawiki_p, ptwiki_p, srwiki_p, ruwikisource_p, itwikisource_p, dewikisource_p, tawikisource_p, plwikisource_p, viwiktionary_p, hiwikisource_p. I also think we should generally only enable it on wikis that actually want it/use it, e.g. French Wikisource is the highest on this list (excluding ru.wn). Although it's interesting to note that many of the other non-Wikinews projects use DPL in a very different way, i.e. no intersection. If DPL is used with only a single category and the sort method set to either categorysortkey or categoryadd (the default), it's actually pretty efficient, which seems to be what some of the Wikisource projects are doing.

Potentially this sort of measure could be built in to DPL itself very easily - e.g. before issuing a query, check the size of the smallest category involved, and ensure that it is not greater than X.

What you've said sounds plenty reasonable to me, +1 from me, but...

Although I also totally understand why SREs might want to take a more conservative approach then this given recent events.

While DPL's flexibility is its greatest feature, it's also what makes it super scary from an SRE perspective. If you're not already familiar with the extension, it's hard to discern from a quick look at wikitext whether it's doing some slow query or using the fast mode. You mentioned that the non-Wikinews projects use DPL very differently; can we have a dedicated parser tag that enables *just* that safer usage?

...and make the feature similar in performance to a large watchlist.

This gave me the idea of just sending DPL queries to vslow (originally I thought of sending them to watchlist group, but @Ladsgroup pointed me to T263127: Remove groups from db configs, so vslow is a better option), which I think would prevent queries from taking down general traffic, I'll put up a patch for that too. It does feel wrong to "punish" fast queries by sending them to vslow, so if we had some heuristics on whether DPL queries were fast or slow that would be nice.

Change 708390 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/extensions/intersection@master] Send queries to "vslow" database group

https://gerrit.wikimedia.org/r/708390

Change 708376 merged by jenkins-bot:

[mediawiki/extensions/intersection@master] Add a tracking category to pages using the <DynamicPageList> tag

https://gerrit.wikimedia.org/r/708376

Change 708224 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/extensions/intersection@wmf/1.37.0-wmf.15] Add a tracking category to pages using the <DynamicPageList> tag

https://gerrit.wikimedia.org/r/708224

Change 708225 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/extensions/intersection@wmf/1.37.0-wmf.16] Add a tracking category to pages using the <DynamicPageList> tag

https://gerrit.wikimedia.org/r/708225

Change 708390 merged by jenkins-bot:

[mediawiki/extensions/intersection@master] Send queries to "vslow" database group

https://gerrit.wikimedia.org/r/708390

For reference, DPL queries can be grouped into the following four performance categories (from best to worst):

  1. Efficient (similar to rendering a category page: read a small-ish number of rows in sequential order of some index)
    • Single category specified; order method one of categoryadd (default) or categorysortkey
  2. Looks at a large number of rows of categorylinks, but no filesort. If the intersection is dense, this could potentially return quickly without looking at all that many rows
    • Multiple category clauses, with the smallest category mentioned first; order method one of categoryadd (default) or categorysortkey
  3. Looks at all categorylinks rows for the smallest category of the intersection, then filesorts the results
    • Multiple category clauses where the smallest category is not first
    • All cases where the order method is created, length or lastedit (lastedit really means last touched)
  4. Filesorts the page table
    • Basically any case where no categories are specified but a namespace is. Depending on MySQL size metrics, sometimes this might instead scan a large number of rows and filter.
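The taxonomy above can be condensed into a small decision function (Python sketch; the parameter names mirror the extension's wikitext options, but this classifier itself is hypothetical):

```python
def dpl_cost_tier(categories, order_method="categoryadd", smallest_first=True):
    """Return 1 (cheapest) through 4 (worst), per the four tiers above.
    categories: number of category clauses in the query."""
    fast_order = order_method in ("categoryadd", "categorysortkey")
    if categories == 0:
        return 4                       # namespace-only: filesort of the page table
    if not fast_order:
        return 3                       # created/length/lastedit force a filesort
    if categories == 1:
        return 1                       # like rendering a category page
    return 2 if smallest_first else 3  # intersections: clause order matters

# A single category with ordermethod=created (the ruwikinews query shape)
# already lands in the filesort tier, despite looking innocuous.
print(dpl_cost_tier(1, order_method="created"))
```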

...and make the feature similar in performance to a large watchlist.

This gave me the idea of just sending DPL queries to vslow (originally I thought of sending them to watchlist group, but @Ladsgroup pointed me to T263127: Remove groups from db configs, so vslow is a better option), which I think would prevent queries from taking down general traffic, I'll put up a patch for that too. It does feel wrong to "punish" fast queries by sending them to vslow, so if we had some heuristics on whether DPL queries were fast or slow that would be nice.

We need to fix at least two things before we do this:

  1. The queries are definitely slow: 3 minutes on an idle server (eqiad) for the query to run: https://phabricator.wikimedia.org/P16896 (around 30 seconds if the same query is run again, which is still a lot)
  2. A burst of queries will overload the host (keep in mind that most of the vslow groups only have one host), and my understanding is that if that host is unavailable the traffic will shift to any other host within that section; is that the case?

We definitely need to limit the number of queries that are allowed to be sent.

Note: Extension:GoogleNewsSitemap, which powers the RSS feeds on Wikinews (https://en.wikinews.org/w/index.php?title=Special:NewsFeed&format=atom), does similar queries.

This gave me the idea of just sending DPL queries to vslow (originally I thought of sending them to watchlist group, but @Ladsgroup pointed me to T263127: Remove groups from db configs, so vslow is a better option), which I think would prevent queries from taking down general traffic, I'll put up a patch for that too. It does feel wrong to "punish" fast queries by sending them to vslow, so if we had some heuristics on whether DPL queries were fast or slow that would be nice.

We need to fix at least two things before we do this:

  1. The queries are definitely slow: 3 minutes on an idle server (eqiad) for the query to run: https://phabricator.wikimedia.org/P16896 (around 30 seconds if the same query is run again, which is still a lot)
  2. A burst of queries will overload the host (keep in mind that most of the vslow groups only have one host), and my understanding is that if that host is unavailable the traffic will shift to any other host within that section; is that the case?

We definitely need to limit the number of queries that are allowed to be sent.

To clarify after brief IRC discussion, moving queries to vslow is by no means a solution, it just makes the failure mode a bit better as it would first take down a vslow replica before taking down a normal replica, giving us a few seconds or minutes before an overload affects general traffic. The 2 things you mentioned are still problems that need addressing.

I think moving their workflows over to use a bot is the best idea for now, I don't think there are any magical fixes that'll make DPL work.

Putting my money where my mouth is... https://gitlab.com/legoktm/dplbot/ is the start of a port of DPL to mwparserfromhell + Pywikibot. It'll probably take another day or two to finish; happy to add more collaborators if people are interested in working on it. Note that I have no intention of running or maintaining it long-term; I'm just writing it to prove that a bot replacement is feasible and probably superior to the extension.

@Legoktm How do you plan to call this bot? Are you planning to continuously loop through millions of categories and create a list of the latest news in each of them?

@Bawolff Maybe you can try replacing database queries with CirrusSearch queries. It won't take long but we can at least test this hypothesis.
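For reference, the CirrusSearch idea amounts to expressing the category intersection as a search query instead of a categorylinks self-join. A sketch of what such a request could look like (`incategory:` and `srsort=create_timestamp_desc` are real CirrusSearch/search-API features; the wiki, categories, and function name here are just examples):

```python
from urllib.parse import urlencode

def cirrus_feed_url(wiki_api, categories, limit=20):
    # One incategory: clause per category; quoting handles spaces in names.
    srsearch = " ".join(f'incategory:"{c}"' for c in categories)
    params = {
        "action": "query",
        "list": "search",
        "srsearch": srsearch,
        "srsort": "create_timestamp_desc",  # newest first, like a news feed
        "srlimit": limit,
        "format": "json",
    }
    return f"{wiki_api}?{urlencode(params)}"

url = cirrus_feed_url("https://ru.wikinews.org/w/api.php",
                      ["Опубликовано", "Музыка"])
print(url)
```

This only shows the query shape; as noted below, the actual work of making DPL's full feature set map onto the ES indices is the non-trivial part.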

@Legoktm How do you plan to call this bot? Are you planning to continuously loop through millions of categories and create a list of the latest news in each of them?

I don't see why that would be necessary. Just looping through every page that uses DPL is a naive option that'll probably work for smaller wikis. Another might be to watch for newly created or published articles (e.g. enwikinews looks like it has ~2-3 new articles per day), and then work backwards and update just those pages.
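The core of that incremental approach could look something like this (hypothetical sketch, not the actual dplbot code: it keeps an index of which pages' DPL queries mention which categories, and refreshes only the pages affected by a batch of recent changes; a real bot would use Pywikibot for the actual reads and writes):

```python
def pages_to_refresh(recent_changes, dpl_index):
    """recent_changes: iterable of (page, categories_touched) pairs.
    dpl_index: {category: set of pages whose DPL query uses it}."""
    todo = set()
    for _page, cats in recent_changes:
        for cat in cats:
            todo |= dpl_index.get(cat, set())
    return todo

# Toy data: two pages carry DPL lists over these categories.
index = {"Published": {"Main Page", "Portal:Music"},
         "Music": {"Portal:Music"}}
changes = [("Some new article", ["Published", "Music"])]
print(sorted(pages_to_refresh(changes, index)))
```

The point is that the work scales with the number of changes and affected list pages, not with the total number of categories on the wiki.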

@Legoktm

2-3 articles can be added by hand. We don't need a robot for this.

DPL does an excellent job with 2-3 articles.

And it's not just about creating articles, but about any change where a category can be added or removed. There can be tens of thousands of changes per day.

There is no problem writing such a bot. The problem is that it won't do the real thing.

Moreover, I have a bot that analyzes all edits in Russian Wikinews and makes the required technical changes.

Even with the existing loads, I have to automatically kill processes several times a day, otherwise they grow like an avalanche. That is, it ignores some of the changes during peak loads.

It definitely won't manage if it has to rewrite hundreds of categories. That would require some serious dedicated computing power.

We need a solution that generates lists on the fly, but on demand. This will reduce the useless load by tens of thousands of times, as I wrote above. Moreover, such a list can also be cached.
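The "on the fly, on demand, plus caching" idea can be sketched as follows (Python illustration; `fetch_feed` stands in for the expensive part, a DB or CirrusSearch query, and the TTL is made up):

```python
import time

def make_cached_feed(fetch_feed, ttl=300):
    cache = {}  # category -> (expiry_timestamp, feed)
    def get_feed(category, now=None):
        now = time.time() if now is None else now
        hit = cache.get(category)
        if hit and hit[0] > now:
            return hit[1]          # cache hit: no query at all
        feed = fetch_feed(category)
        cache[category] = (now + ttl, feed)
        return feed
    return get_feed

calls = []
get_feed = make_cached_feed(lambda c: calls.append(c) or f"feed:{c}", ttl=300)
get_feed("Музыка", now=0)
get_feed("Музыка", now=10)  # within the TTL: served from cache
print(len(calls))           # the expensive fetch ran only once
```

Categories nobody ever requests cost nothing, which is the asymmetry being argued for above.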

@Bawolff Maybe you can try replacing database queries with CirrusSearch queries. It won't take long but we can at least test this hypothesis.

That's beyond the level of effort I'm willing to commit to at this time.

In any case, that approach is at the rough-idea stage; it's not just a simple matter of programming, and it doesn't exactly have buy-in from the relevant stakeholders as of yet.

Change 708225 merged by jenkins-bot:

[mediawiki/extensions/intersection@wmf/1.37.0-wmf.16] Add a tracking category to pages using the <DynamicPageList> tag

https://gerrit.wikimedia.org/r/708225

Change 708224 merged by jenkins-bot:

[mediawiki/extensions/intersection@wmf/1.37.0-wmf.15] Add a tracking category to pages using the <DynamicPageList> tag

https://gerrit.wikimedia.org/r/708224

As a German Wikinews editor (+admin), I consider removing DPL a bad solution. In most, perhaps all, of the language versions I know, we use it to generate different news feeds used in what we call news portals, which use the portal namespace. Some languages use or used DPL on the homepage, but those are all wikis with a couple of articles a day.

The Russian Wikinews does two things differently: they use DPL in the category namespace, and they use it on the main page. Considering the performance issue on s3: could the main-page usage of DPL result in this? This reminds me strongly of the incident caused by the "Dackelvandale" in the German Wikipedia about a dozen years ago, when the so-called troll transcluded the newest-articles special page 50 times or so on the main page, which brought the servers to a standstill. (IIRC, since then transclusion of certain special pages isn't possible anymore, and many wikis have protected or semi-protected main pages.)

Also, a flaw in the category architecture could cause a timeout if some part of the tree is included elsewhere (I don't know the word for it). For example, in the English WP the Category:World is put in the Category:Universum, and some levels above this it is again in Category:World, like a snake eating its own tail. Since I don't speak Russian, I did not do an analysis of the category tree. But migrating the DPLs from category pages to portal pages might help. Or not.

Please enable DPL at least on the Main Page of RWN. This one page seems safe for the servers, and it's the most crucial one for RWN.

I think that it would be unwise to disable DPL; perhaps it would be better to slow down the import queries to prevent another failure like this, so the imports still work (but slower) while other Wikimedia websites aren't affected.

Please pay attention to RWN's call for Board of Trustees candidates to share their opinion on current issue as candidates: https://ru.wikinews.org/wiki/Викиновости:Форум/Общий#Appeal_to_candidates_of_Wikimedia_Foundation_Board_of_Trustees_elections_2021

(it's in English so anyone can read)

Change 708374 merged by jenkins-bot:

[operations/mediawiki-config@master] Stop enabling DPL on new wikis

https://gerrit.wikimedia.org/r/708374

Mentioned in SAL (#wikimedia-operations) [2021-08-02T23:21:12Z] <legoktm> Previous sync also deployed c38998f03f "Stop enabling DPL on new wikis" (T287380)

Please enable DPL at least on the Main Page of RWN. This one page seems safe for the servers, and it's the most crucial one for RWN.

I don't believe it is.

I think that it would be unwise to disable DPL; perhaps it would be better to slow down the import queries to prevent another failure like this, so the imports still work (but slower) while other Wikimedia websites aren't affected.

At this point, it's less about the imports than the total size of the wiki. Even if nobody ever created a new article on ruwn again, it's still too big at this point.

Just to summarize some additional investigation that was done:

  • The triggering event seems to be this edit: https://ru.wikinews.org/w/index.php?title=%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%96%D1%83%D1%80%D0%BD%D0%B0%D0%BB%D1%8B&logid=19137921
    • Creating that category page triggered a job which parsed a large number of pages that had a DPL query on them, causing the DB to overload
  • It's possible (although I don't have conclusive evidence) that MariaDB was using an unideal query plan for this particular query, which may have exacerbated the situation. This particular DPL query may have been scaling proportionally to the size of the categorylinks table (44M rows) instead of the size of the category (180k). I didn't originally notice that ordermethod in the query was set to created instead of categoryadd; categoryadd has more efficient execution
  • ruwikinews had log entries showing that this particular DPL query (the one for the infobox on "В мире‏‎") had been failing due to timeouts for several days in the lead-up to the incident. This meant that the mitigation introduced in the previous incident (wgDLPQueryCacheTime, which was really a band-aid) was at least sometimes not applying, as it doesn't work for queries that take so long that they time out
  • Possibly the mitigation from last time (wgDLPQueryCacheTime) helped us scale further, but when it did fail, it resulted in a much harder failure
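The ordermethod difference can be reproduced in miniature with SQLite (an approximation of MariaDB's behavior, with a simplified categorylinks schema): ordering by a column that the category index already provides needs no sort step, while ordering by page id (what created effectively does) forces an explicit sort.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE categorylinks
              (cl_from INTEGER, cl_to TEXT, cl_timestamp TEXT)""")
db.execute("CREATE INDEX cl_ts ON categorylinks (cl_to, cl_timestamp)")
db.executemany("INSERT INTO categorylinks VALUES (?, 'Published', ?)",
               [(3, "2021-07-01"), (1, "2021-07-02"), (2, "2021-07-03")])

def plan(order_by):
    rows = db.execute(
        "EXPLAIN QUERY PLAN SELECT cl_from FROM categorylinks "
        f"WHERE cl_to = 'Published' ORDER BY {order_by}").fetchall()
    return " | ".join(r[-1] for r in rows)

# categoryadd-style: the (cl_to, cl_timestamp) index yields rows in order.
print("categoryadd-style:", plan("cl_timestamp"))
# created-style: same index lookup, but a temp B-tree sort is added on top.
print("created-style:    ", plan("cl_from"))
```

At three rows this is invisible; at 180k category members (or 44M categorylinks rows), the extra sort is the difference the incident notes describe.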

So I missed originally that the ordermethod was set to "created" (aka page_id) on the DPL query. This made me confused about the query plan chosen, and I said some incorrect things about it being unideal.

Ordering by "created" is often less efficient than ordering by categoryadd in this extension. We should maybe remove the less efficient order methods from the extension.

Edit: The query plan was still a bad choice; it's just a little more reasonable that MariaDB chose it. I clarified on the incident doc page.