
Increase Retry-After header for Wikidata
Open, Needs Triage, Public

Description

Hello,

In T243701, we discussed an issue of maxlag rising above 5 s for Wikidata. The cycle is: bots edit too frequently, WDQS lags, bots stop because the maxlag parameter forces them to, WDQS recovers, lag decreases, and the cycle repeats.

Pywikibot respects the value set in the Retry-After header; see Pywikibot's code (1, 2).

Increasing this value, at least for Wikidata (we would probably need a new hook for that), could make bots delay their edits for a longer time, giving WDQS more time to recover.

Opinions?

Event Timeline

Restricted Application added a project: Wikidata. Feb 13 2020, 12:33 PM
Restricted Application added a subscriber: Aklapper.

Currently multiple tools are broken, because the time it takes for maxlag to return to normal is much longer than the total time the tools keep retrying edits (see this and this).

Arguably this change works as intended? Tools should be updated to handle maxlag gracefully. The retry time should not be fixed, but should increase with each consecutive failure: 5, 10, 20, 40, 80 s, and so on. That also ensures that not all tools restart at the same time.
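The back-off scheme above can be sketched as follows. This is an illustrative snippet, not actual Pywikibot code; `backoff_delays` and `retry_edit` are hypothetical helper names:

```python
import time

def backoff_delays(base=5, factor=2, cap=300):
    """Yield an exponentially increasing sequence of retry delays:
    5, 10, 20, 40, 80, ... seconds, capped at `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor

def retry_edit(do_edit, max_attempts=6):
    """Retry an edit, sleeping longer after each maxlag rejection.
    `do_edit` returns True on success, False on a maxlag error."""
    delays = backoff_delays()
    for _ in range(max_attempts):
        if do_edit():
            return True
        time.sleep(next(delays))
    return False
```

The cap keeps a long outage from producing absurd sleep times while still spreading out retries.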

Though I don't think this will be really effective. Suppose, for example, that the lag stays above maxlag for 5 minutes, i.e. 300 seconds (remember that query service lag is updated every minute). Then all bots will sleep 5+10+20+40+80+160 = 315 s and restart at roughly the same time (if they make one edit every 10 seconds, they will all restart within a 10-second window).
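The synchronized-restart problem described above is usually countered by adding random jitter to the back-off. A minimal sketch (this is a standard "full jitter" technique, not something Pywikibot currently does):

```python
import random

def jittered_delay(base_delay):
    """Sleep a uniformly random time in [0, base_delay] ("full jitter"),
    so bots that backed off together do not all resume together."""
    return random.uniform(0, base_delay)
```

With jitter, bots that all computed a 160 s back-off resume spread across the whole 160-second window instead of within a few seconds of each other.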

I have been thinking about this. Originally I thought it would help, but the more I think about it, the more it feels very similar to increasing the factor. The oscillation will be longer, but the situation stays the same, because bots only back off after WDQS has already lagged too far behind due to the edit rate. We need something that slows the bots down before they reach the maxlag threshold and have to back off.
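One way to slow bots down before the threshold is to scale the per-edit delay with the current lag instead of cutting edits off abruptly at maxlag. This is an illustrative policy sketch, not an existing MediaWiki or Pywikibot feature:

```python
def adaptive_throttle(current_lag, maxlag=5.0, base_throttle=10.0):
    """Return a per-edit delay (seconds) that grows smoothly as the
    reported lag approaches the maxlag threshold, rather than a hard
    stop at the threshold. Purely illustrative."""
    if current_lag <= 0:
        return base_throttle
    # Quadratic growth: mild slowdown far from the threshold,
    # strong slowdown at and beyond it.
    pressure = current_lag / maxlag
    return base_throttle * (1.0 + pressure ** 2)
```

At zero lag a bot edits at its normal throttle; at the threshold it has already halved its edit rate, so the lag should level off instead of oscillating.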

This is one of those bugs where you should just look up the relevant chapter in a book like http://barbie.uta.edu/~jli/Resources/MapReduce&Hadoop/Distributed%20Systems%20Principles%20and%20Paradigms.pdf and look at the possible solutions.

Urbanecm renamed this task from Increase Retry-Time header for Wikidata to Increase Retry-After header for Wikidata. Feb 14 2020, 9:33 PM

I have been thinking about this. Originally I thought it would help, but the more I think about it, the more it feels very similar to increasing the factor. The oscillation will be longer, but the situation stays the same, because bots only back off after WDQS has already lagged too far behind due to the edit rate. We need something that slows the bots down before they reach the maxlag threshold and have to back off.

Maybe I'm making a mistake, but I believe that increasing the factor behaves differently than a Retry-After change. Feel free to correct me if I'm wrong. When you increase the factor, the reported lag decreases although the situation is the same (the real lag is unchanged); as a result, bots edit for a longer period of time before the lag becomes too high for them. It should also increase the backoff period, but perhaps because bots can make more edits, WDQS has more work, and as a result the issue is bigger than before (= a longer period of too-high lag).

On the other hand, with a Retry-After change, the bots complying with https://www.mediawiki.org/wiki/Manual:Maxlag_parameter (which includes PWB) sleep for a longer time, giving WDQS more time to recover. The number of new edits saved should be the same, given that lag re-increases at the same speed. I'm not sure about other bots, but PWB seems to sleep for (at least) the recommended number of seconds and then tries again. If the recommended number of seconds were higher, the bots should simply edit more slowly, IMO.
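Reading the recommended sleep time from the response can be sketched like this. Note this is a simplified illustration (Retry-After may also carry an HTTP-date instead of a number of seconds; this sketch falls back to the default in that case, and `retry_after_seconds` is a hypothetical helper name):

```python
def retry_after_seconds(response_headers, default=5):
    """Read the Retry-After header (delay-seconds form) from an API
    response, never sleeping less than `default` seconds. Malformed
    or missing values (including the HTTP-date form) use the default."""
    value = response_headers.get("Retry-After")
    try:
        return max(default, int(value))
    except (TypeError, ValueError):
        return default
```

If the server raised this value from 5 to, say, 30 during WDQS trouble, complying clients would automatically space their retries further apart.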

This is one of those bugs where you should just look up the relevant chapter in a book like http://barbie.uta.edu/~jli/Resources/MapReduce&Hadoop/Distributed%20Systems%20Principles%20and%20Paradigms.pdf and look at the possible solutions.

If you have something specific in your mind, please do feel free to share it!

We need something that slows the bots down before they reach the maxlag threshold and have to back off.

Totally agree!

This is one of those bugs where you should just look up the relevant chapter in a book like http://barbie.uta.edu/~jli/Resources/MapReduce&Hadoop/Distributed%20Systems%20Principles%20and%20Paradigms.pdf and look at the possible solutions.

I looked at the TOC of that book and couldn't find a topic related to this. I also have another book physically and couldn't find anything in it either. There is some good information in that book, which I shared in T240442: Design a continuous throttling policy for Wikidata bots.

Xqt added a subscriber: Xqt. Edited Mar 10 2020, 10:22 AM

On the other hand, with a Retry-After change, the bots complying with https://www.mediawiki.org/wiki/Manual:Maxlag_parameter (which includes PWB) sleep for a longer time, giving WDQS more time to recover. The number of new edits saved should be the same, given that lag re-increases at the same speed. I'm not sure about other bots, but PWB seems to sleep for (at least) the recommended number of seconds and then tries again. If the recommended number of seconds were higher, the bots should simply edit more slowly, IMO.

How PWB handles throttling:

  • The HTTP response header value retry_after determines the delay after a maxlag error has been triggered.
  • The retry_after value has always been 5 s in recent years, and that value does not seem sufficient. Therefore the current maxlag value is also taken into account for the wait cycle: 1/5 of it for the first try, 2/5 for the second, 4/5 for the third, 8/5 for the fourth, 16/5 for the fifth, and so on, but never less than the retry_after value.
  • There is a put_throttle for every API write access, which is 10 s by default and should never be below 5 s under most local bot policies.
  • There is a minthrottle for every API read access, which is 0 by default, i.e. there is no read throttling at all.
  • If more than one bot is working simultaneously, the times are lengthened.
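The wait-cycle arithmetic in the second bullet can be written out as follows (an illustrative one-liner matching the description above, not the actual Pywikibot implementation):

```python
def wait_time(retry_after, server_lag, attempt):
    """Wait time (seconds) for a 1-based retry attempt, per the scheme
    above: server_lag * 2**(attempt - 1) / 5, i.e. 1/5, 2/5, 4/5, 8/5,
    16/5, ... of the reported lag, but never below retry_after."""
    return max(retry_after, server_lag * 2 ** (attempt - 1) / 5)
```

So with a reported lag of 10 s and retry_after = 5 s, the first two attempts are floored at 5 s, and only from the third attempt (10 * 4/5 = 8 s) does the lag term take over.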

I guess minthrottle should be activated for read access on Wikidata too, to avoid server overload.

Note: as I said in T243701#5884926, some bots run with put_throttle=1 or 0.

Dvorapa added a comment.EditedMar 10 2020, 12:00 PM

Note: as I said in T243701#5884926, some bots run with put_throttle=1 or 0.

They should not do so continuously; this makes Pywikibot ignore any maxlag or throttle values and just rush through edits. But of course, sometimes bots have to use put_throttle=0 (or close to 0) to fix some breakage in Wikipedia articles/Wikidata items quickly. Therefore restricting all bots that have ever used put_throttle=0 (or close to 0) is unreasonable, but monitoring bot activity and restricting those who do it continuously or regularly is necessary and should somehow be carried out.

Even if bots are using pt:0, they will still follow maxlag (unless they also set maxlag=0). This may cause some problems (they start and stop brutally and make the lag oscillate), though it will not break the server. Even with the default put_throttle (10 s, or even longer), the issue may still occur if there are so many bots running that the resulting edit rate is more than the Query Service Updater can handle.
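The distinction being made above is between the per-edit throttle and the maxlag check: disabling the former does not bypass the latter. A hypothetical client loop illustrating this (not actual Pywikibot code):

```python
import time

def edit_loop(edits, submit, current_lag, put_throttle=0, maxlag=5):
    """Submit edits one by one. Even with put_throttle=0, a compliant
    client still backs off whenever the server-reported lag exceeds
    maxlag; put_throttle only adds a fixed pause between writes."""
    for edit in edits:
        while current_lag() > maxlag:   # maxlag check, independent of throttle
            time.sleep(maxlag)          # back off until the lag recovers
        submit(edit)
        if put_throttle:
            time.sleep(put_throttle)    # optional per-edit pacing
```

With many such clients, the maxlag check alone produces the abrupt start/stop pattern described above, since every client resumes as soon as the lag dips below the threshold.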

start and stop brutally and make the lag oscillate

Yes, but in my opinion this behavior is not very polite or ethical toward the servers either. And to me it also seems indirectly unfair to other bots: bots rush-saving edits at a rate of 1 per second or faster make the Query Service Updater lag much sooner and worse than bots respecting a put_throttle of 5 s or more.