
increase factor for query service that is taken into account for maxlag
Open, Stalled, Needs Triage · Public

Description

Problem:
The lag of the query service is taken into account when throttling bots that edit Wikidata using the maxlag API parameter (i.e. the bot request says "I want to make this edit; apply it only if the current DB replication lag is not higher than X").

The current factor is 60. We want to increase it to 180 to be more realistic and not overly strict. With the conventional maxlag threshold of 5 seconds, this corresponds to an acceptable average lag of 15 minutes between an edit being made and it showing up in the query service (currently 5 minutes).

The expected result is an increase in the number of edits that are successfully made (i.e. the reported lag is low enough for them to be saved), while the lag between database edits and query service updates remains acceptable for query service users.

There is a configuration setting for this factor that would need to be changed.
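The arithmetic behind the factor change can be sketched as follows. The function name is a placeholder; the real scaling lives in Wikibase and the wmf-config repository, not in this snippet.

```python
# Sketch of how the query-service lag feeds into the reported maxlag.
# The function name is hypothetical; the actual logic is in Wikibase
# and configured via wmf-config/InitialiseSettings.php.
def effective_maxlag_seconds(wdqs_lag_seconds: float, factor: int) -> float:
    """Scale raw WDQS update lag down by the configured factor."""
    return wdqs_lag_seconds / factor

# With the old factor of 60, a 5-minute (300 s) WDQS lag already reaches
# the conventional maxlag=5 threshold that bots pass:
assert effective_maxlag_seconds(300, 60) == 5.0

# Tripling the factor to 180 tolerates up to 15 minutes (900 s) of lag
# before the same threshold is reached:
assert effective_maxlag_seconds(900, 180) == 5.0
```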

Acceptance criteria:

  • acceptable maxlag is increased for the lag between an edit being made and it being reflected in the query service

Notes:

  • We will try this and see if further adjustments are needed.

Details

Related Gerrit Patches:
operations/mediawiki-config : master | Triple the factor of WDQS lag to maxlag for Wikidata

Event Timeline

Restricted Application added a project: Wikidata. Feb 10 2020, 11:34 AM
Restricted Application added a subscriber: Aklapper.
WMDE-leszek updated the task description. Feb 11 2020, 1:24 PM

Change 571705 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/mediawiki-config@master] Triple the factor of WDQS lag to maxlag for Wikidata

https://gerrit.wikimedia.org/r/571705

Change 571705 merged by jenkins-bot:
[operations/mediawiki-config@master] Triple the factor of WDQS lag to maxlag for Wikidata

https://gerrit.wikimedia.org/r/571705

Mentioned in SAL (#wikimedia-operations) [2020-02-12T12:19:22Z] <ladsgroup@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:571705|Triple the factor of WDQS lag to maxlag for Wikidata (T244722)]] (duration: 01m 04s)

Mentioned in SAL (#wikimedia-operations) [2020-02-12T12:21:36Z] <ladsgroup@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:571705|Triple the factor of WDQS lag to maxlag for Wikidata (T244722)]], take II, the cache issue (duration: 01m 03s)

Restricted Application added a project: User-Ladsgroup. Feb 12 2020, 12:22 PM
Maintenance_bot moved this task from Incoming to In progress on the User-Ladsgroup board.
Tarrow added a subscriber: Tarrow. Feb 13 2020, 9:08 AM

Is this now doing what we want?

I think it's doing what we discussed in the storytime and making the frequency of the oscillations higher. However, in addition to that, it is increasing the amplitude and thus now actually triggering low edit volume alerts. See: https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts?orgId=1&from=1581498145748&to=1581584545749

Is this a problem? I'm still a little unsure what metric we might want to try and optimise for, but once we know what it is we might be able to make smarter tweaks to this maxlag parameter.

Possible metrics for us to aim for could be something like:

  • Keeping WDQS lag consistently below some value
  • Keeping the maximum "maxlag > 5" interval shorter than n mins (e.g. where people are now annoyed they can't edit)
  • Keeping a minimum edit rate, averaged over "some time period", greater than "some number"

I guess to meet the conflicting needs of our users we should probably try and "pick 2" or something?
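The second metric above can be made concrete with a small helper that scans a lag time series for the longest stretch during which bots would have been blocked. The function and the sample data are illustrative assumptions, not part of any existing tooling.

```python
# Hypothetical helper for the "maxlag > 5 interval shorter than n mins"
# metric: find the longest contiguous stretch above the bot threshold.
def longest_blocked_interval(samples, threshold=5.0):
    """samples: list of (timestamp_seconds, maxlag_seconds), time-sorted.
    Returns the longest contiguous duration (s) with maxlag > threshold."""
    longest = 0.0
    start = None  # timestamp where the current blocked stretch began
    for t, lag in samples:
        if lag > threshold:
            if start is None:
                start = t
            longest = max(longest, t - start)
        else:
            start = None  # stretch broken; reset
    return longest

# One sample per minute: blocked from t=60 to t=120, then again at t=240.
samples = [(0, 3), (60, 6), (120, 7), (180, 4), (240, 8)]
assert longest_blocked_interval(samples) == 60.0
```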

Since the change we performed didn't have the expected results, we're going to revert it today and keep looking for sustainable solutions.

Mentioned in SAL (#wikimedia-operations) [2020-02-13T12:20:51Z] <ladsgroup@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:571956|Revert: Triple the factor of WDQS lag to maxlag for Wikidata (T244722)]] (duration: 01m 04s)

The issue seems to be somewhere in the update process and query service itself rather than the wikidata edit rate (although of course the edit rate has some effect).
Over the past months even when the wikidata edit rate drops the query service lag can continue to increase.

Also before changing the factor to be more strict we probably want to complete T238751: Only generate maxlag from pooled query service servers.
The discussions on T240442: Design a continuous throttling policy for Wikidata bots are also very relevant to this as having a hard cut off at 5 is no good for anyone, a gradual increase and decrease in edit rate would be much better for both the people using wikidata and our systems.
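The "gradual increase and decrease in edit rate" idea from T240442 could be sketched as a wait time that grows smoothly with lag instead of a hard cut-off at 5. The function name, the quadratic curve, and the cap are illustrative assumptions only; T240442 is where the actual policy is being designed.

```python
# Sketch of a continuous throttling curve (assumed shape, not the T240442
# design): no delay below a soft limit, then a quadratic ramp, capped.
def suggested_wait_seconds(lag: float, soft_limit: float = 5.0,
                           max_wait: float = 60.0) -> float:
    """Return a suggested inter-edit wait, growing smoothly with lag."""
    if lag <= soft_limit:
        return 0.0
    excess = lag - soft_limit
    return min(max_wait, excess ** 2)

assert suggested_wait_seconds(4.0) == 0.0    # under the limit: no delay
assert suggested_wait_seconds(7.0) == 4.0    # 2 s over -> 4 s wait
assert suggested_wait_seconds(20.0) == 60.0  # capped at max_wait
```

The point of the smooth ramp is that bots slow down progressively as lag builds, rather than all stopping at once the moment maxlag crosses 5.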

This comment was removed by Addshore.

> The issue seems to be somewhere in the update process and query service itself rather than the wikidata edit rate (although of course the edit rate has some effect).
> Over the past months even when the wikidata edit rate drops the query service lag can continue to increase.

I agree with this. From my non-expert observation, I would say that the lag is mostly a function of the read queries from clients and not really related to edits.

> Also before changing the factor to be more strict we probably want to complete T238751: Only generate maxlag from pooled query service servers.
> The discussions on T240442: Design a continuous throttling policy for Wikidata bots are also very relevant to this as having a hard cut off at 5 is no good for anyone, a gradual increase and decrease in edit rate would be much better for both the people using wikidata and our systems.

Sadly, as we already have the "magic number 5" used by loads of people (e.g. pywikibot), we're a little trapped with this. An alternative could be returning maxlag with some randomly calculated weighting to avoid "turning the taps totally off", e.g. 50% of the time subtract 1 s from the query-service maxlag and 50% of the time add 1 s. The timing and the numbers would need to be chosen to seek some balance between write rate and lag.
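The randomized weighting suggested above can be sketched in a few lines. The function name, the 50/50 split, and the ±1 s offsets are placeholders to be tuned, not a proposed final design.

```python
import random

# Sketch of the randomized ±1 s jitter idea: with a raw lag sitting
# exactly at the magic number 5, roughly half of bot requests would
# still see a value under the cut-off instead of all being blocked.
def jittered_lag(raw_lag_seconds: float) -> float:
    """Randomly shift the reported lag by -1 s or +1 s (never below 0)."""
    offset = -1.0 if random.random() < 0.5 else 1.0
    return max(0.0, raw_lag_seconds + offset)

random.seed(0)  # deterministic for the demonstration below
values = [jittered_lag(5.0) for _ in range(1000)]
# Every sample lands at 4 s (passes maxlag=5) or 6 s (blocked).
assert all(v in (4.0, 6.0) for v in values)
```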

Addshore changed the task status from Open to Stalled. Feb 18 2020, 10:28 AM
Addshore removed Ladsgroup as the assignee of this task.
Xqt added a subscriber: Xqt. Mar 5 2020, 10:29 AM