Clear watchlist on enwiki only removes 50 items at a time
Closed, Resolved · Public

Description

While investigating T171898: Deleting large watchlist takes > 4 seconds causing rollback due to write time limit, I've noticed that the ClearWatchlist functionality at /wiki/Special:EditWatchlist/clear only removes 50 items, then stops.

The first job successfully removes the first batch of watchlist items, but then something fails for the second job (see the sketch after this list). The possibilities are:

  1. Pushing the new job fails (JobQueueGroup::singleton()->push( new self( $this->getTitle(), $this->getParams() ) )).
  2. The pushed job is deduplicated and therefore never run.
  3. $loadbalancer->safeWaitForMasterPos() returns false in one of two locations.
  4. $dbw->getScopedLockAndFlush( $lockKey, __METHOD__, 10 ) returns false.
  5. Something else entirely.
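
For illustration, here is a minimal Python sketch of the batch-and-requeue pattern involved (the real job is PHP's ClearUserWatchlistJob; the queue, watchlist and helper names below are assumptions, not MediaWiki APIs):

# Illustrative sketch only: delete one batch, then push an identical job for the rest.
BATCH_SIZE = 50  # matches the 50-item batches observed on enwiki

def run_clear_watchlist_job(queue, watchlist, params):
    del watchlist[:BATCH_SIZE]  # first pass: remove up to 50 watchlist rows
    if watchlist:
        # Re-enqueue the same job with the same title/params (cf. possibilities 1 and 2).
        # If the queue treats this push as a duplicate of the job that just ran,
        # clearing silently stops after the first batch.
        queue.append(("clearUserWatchlist", dict(params)))

jobs = []
watchlist = list(range(120))  # pretend watchlist with 120 items
run_clear_watchlist_job(jobs, watchlist, {"userId": 10774019, "maxWatchlistId": 214224212})
print(len(watchlist), jobs)   # 70 items left, one follow-up job queued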

I can't reproduce this in vagrant, but I can reliably reproduce the issue on enwiki.

Event Timeline

Ping @Pchelolo: @Catrope suggested you might be able to help troubleshoot what is happening on the job queue side, as I haven't been able to find anything in Logstash.

Pchelolo added a subscriber: Ottomata.

I know what's happening. If there are more items in the list than the batch size, the code re-enqueues exactly the same job again here. That means the deduplication info is exactly the same as for the previous job, but the Kafka queue is so quick that the jobs are executed less than 1 second apart, so the dt for the second job ends up being exactly the same as the dt for the previous job and they get deduplicated.

We deduplicate based on the sha1 of the deduplication info: a new job is skipped if a previously seen job with the same sha1 was executed after the new job was enqueued. Since the job dt has only 1-second resolution while the execution time we store has millisecond resolution, the re-enqueued job appears to predate the first job's execution and gets deduplicated. You see nothing in the logs because deduplication is a normal operation, so it's logged at trace level.
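
To make the race concrete, here is a small Python sketch of that check (a simplification for illustration, not the queue's real code):

from datetime import datetime

def is_duplicate(new_job_dt, last_exec_of_same_sha1):
    # Simplified old behaviour: skip the new job if an identical job (same sha1)
    # already finished at or after the moment the new job was enqueued.
    return last_exec_of_same_sha1 >= new_job_dt

# The job dt is serialized with 1-second resolution, the stored execution time
# with millisecond resolution, so a job re-enqueued within the same second
# looks like a duplicate of the job that just ran:
first_exec = datetime(2018, 10, 9, 23, 41, 8, 450000)  # first job finished at 23:41:08.450
second_dt = datetime(2018, 10, 9, 23, 41, 8)           # second job's dt, truncated to 23:41:08
print(is_duplicate(second_dt, first_exec))              # True -> second job is skipped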

As a quick fix, you could set removeDuplicates to false for this job, but I think we need to use millisecond resolution for the job creation timestamp. This might require some changes in how we represent the dt field. We use ISO 8601 format for it, which I believe supports millisecond resolution, but we need to double-check whether the JSON validator we use supports it. cc @Ottomata

Just checked: python jsonschema validates millisecond-resolution ISO 8601 timestamps with the date-time format just fine. :)
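
For reference, a check along these lines (the schema and timestamp below are my own minimal example, not the production job schema):

import jsonschema

# Note: the "date-time" format is only enforced when an RFC 3339 validator
# package is installed; otherwise jsonschema silently skips the format check.
jsonschema.validate(
    "2018-10-09T23:41:08.397+00:00",            # millisecond-resolution ISO 8601 / RFC 3339
    {"type": "string", "format": "date-time"},
    format_checker=jsonschema.FormatChecker(),  # raises ValidationError on failure
)
print("valid")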

Change 468482 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] Set event datetime with millisecond resolution.

https://gerrit.wikimedia.org/r/468482

The above patch should mitigate the problem; however, we also need to account for possible clock drift between our servers. The more drift we tolerate, the less efficient our deduplication becomes, so do we have any data on the possible clock drift to help decide on the exact tolerance?

Apparently HHVM doesn't support 'v' (milliseconds) in the DateTime format string; it only supports microseconds ('u'). Could you check whether 2018-10-09T23:41:08.397700+00:00 is supported by the validator? JS is fine with it.
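
(The EventBus change itself is PHP; purely to illustrate the microsecond-resolution format in question, here it is produced with Python's datetime:)

from datetime import datetime, timezone

# isoformat() keeps full microsecond precision with a numeric UTC offset.
dt = datetime(2018, 10, 9, 23, 41, 8, 397700, tzinfo=timezone.utc)
print(dt.isoformat())  # -> 2018-10-09T23:41:08.397700+00:00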

mobrovac added subscribers: Joe, mobrovac.

This is a question for @Joe :)

When @Ottomata confirms Python is OK with microsecond precision, we can merge the above patch.

Verified that microseconds are fine with python jsonschema. I also checked Camus, which uses [[ http://joda-time.sourceforge.net/apidocs/org/joda/time/format/ISODateTimeFormat.html | ISODateTimeFormat ]].parseDateTime to get the timestamps for eventual partition bucketing in Hive. It works fine, but it does drop the microseconds portion, e.g. 2018-10-09T23:41:08.397712+00:00 -> 2018-10-09T23:41:08.397Z. Should be fine, though!
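
A small illustration of that truncation (not Camus/Joda's actual code path, just Python reproducing the same rounding down to milliseconds):

from datetime import datetime, timezone

ts = "2018-10-09T23:41:08.397712+00:00"
parsed = datetime.fromisoformat(ts).astimezone(timezone.utc)  # Python 3.7+
# Keep only millisecond precision, as Joda's parser does:
print(parsed.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z")  # -> 2018-10-09T23:41:08.397Z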

Change 468482 merged by jenkins-bot:
[mediawiki/extensions/EventBus@master] Set event datetime with microsecond resolution.

https://gerrit.wikimedia.org/r/468482

@Pchelolo @Ottomata and @mobrovac thank you for tracking this down and working on it. Should our team plan to verify the clear watchlist job is functioning properly on Thursday?

@kostajh If you have time for that it would be perfect. I admit, I don't have any idea how to test this.

Thank you for bringing this to our attention; it's a generic problem we hadn't really thought about. It might have done a lot of harm that went unnoticed.

It's easy enough for me to see if running "clear watchlist" on my enwiki account works :) @Etonkovidova may want to test this too. I'll check it on Thursday when it's in production.

kostajh edited projects, added Growth-Team (Current Sprint); removed Growth-Team.
kostajh moved this task from Incoming to QA on the Growth-Team (Current Sprint) board.
Krinkle added a subscriber: Krinkle.

Does not appear to be about a warning, error or fatal emitted from a WMF production service.

I checked testwiki, cawiki (wmf.1), enwiki (wmf.26) and betalabs.

testwiki, cawiki (wmf.1): deletes 50 items from the Watchlist at a time.

enwiki (wmf.26): the same as for testwiki.

betalabs: clicking on 'Clear the watchlist' brings up the warning message with its confirmation button; clicking that button does not delete the Watchlist items.

Note: in my opinion, if we inform users properly that only 50 items can be deleted at a time, the issue could be considered resolved.
@Pchelolo - does any additional work need to be done?

@Etonkovidova sorry I didn't update this ticket. We had an outage caused by the fix for it, so it was reverted this morning. We will continue working on this - we know what the problem is, we just need more time.

Thx! Ping me when there's something to check.

Hi @Pchelolo, just checking in on the status of this.

Since T207817 got closed I guess we can make one more attempt to fix this.

@Pchelolo any update on this? The "Clear watchlist" link on enwiki is still not working for me. Only one job executes.

Came across this task while looking over my team's board. Special:EditWatchlist/clear is still broken.

Mentioned in SAL (#wikimedia-operations) [2019-02-11T20:07:08Z] <ppchelko@deploy1001> Started deploy [changeprop/deploy@bdb4740]: Update dependencies, minor refactor, safer deduplication, T207329

Sorry, this fell off my radar somehow. After the latest deploy ^^ this should work properly.

Mentioned in SAL (#wikimedia-operations) [2019-02-11T20:08:44Z] <ppchelko@deploy1001> Finished deploy [changeprop/deploy@bdb4740]: Update dependencies, minor refactor, safer deduplication, T207329 (duration: 01m 37s)

@Pchelolo unfortunately this is still not working. I can see from the profile trace that the first job is enqueued https://performance.wikimedia.org/xhgui/run/symbol?id=5c6344dc3f3dfab50a29b7cf&symbol=ClearUserWatchlistJob%3A%3AnewForUser but after that, I'm not sure where this is falling apart. I could try adding some logging statements to ClearUserWatchlistJob#run if you'd like.

Hm...

So, in Kafka we have:

{"database": "enwiki", "mediawiki_signature": "97c895ef8e934b335f74b4d8d57d5b7bebd1de16dd68d88f76479c826f4a9bf9", "meta": {"domain": "en.wikipedia.org", "dt": "2019-02-12T22:12:43+00:00", "id": "51840dae-2f13-11e9-881a-aa00006e80dc", "request_id": "XGNE2gpAIHsAABGqlmYAAAAV", "schema_uri": "mediawiki/job/2", "topic": "mediawiki.job.clearUserWatchlist", "uri": "https://en.wikipedia.org/wiki/Special:EditWatchlist/clear"}, "page_namespace": -1, "page_title": "Special:EditWatchlist/clear", "params": {"maxWatchlistId": 214224212, "requestId": "XGNE2gpAIHsAABGqlmYAAAAV", "userId": 10774019}, "sha1": "32624f411d5070eacbfe3d0aab95b55a60748a6f", "type": "clearUserWatchlist"}

{"database": "enwiki", "mediawiki_signature": "6cd3bec69ec51ee6837f0159446dc1da366510ec45c6292c075bb19f2be3d3e6", "meta": {"domain": "en.wikipedia.org", "dt": "2019-02-12T22:12:44+00:00", "id": "521c9c71-2f13-11e9-9632-1418776149df", "request_id": "XGNE2gpAIHsAABGqlmYAAAAV", "schema_uri": "mediawiki/job/2", "topic": "mediawiki.job.clearUserWatchlist", "uri": "https://en.wikipedia.org/wiki/Special:EditWatchlist/clear"}, "page_namespace": -1, "page_title": "Special:EditWatchlist/clear", "params": {"maxWatchlistId": 214224212, "requestId": "XGNE2gpAIHsAABGqlmYAAAAV", "userId": 10774019}, "sha1": "32624f411d5070eacbfe3d0aab95b55a60748a6f", "type": "clearUserWatchlist"}

And indeed the second job doesn't get executed:

2019-02-12 22:12:44 [6744d71d9bf3af72fe625887] mw1302 enwiki 1.33.0-wmf.16 JobExecutor DEBUG: Beginning job execution {"job":"clearUserWatchlist Special:EditWatchlist/clear maxWatchlistId=214224212 requestId=XGNE2gpAIHsAABGqlmYAAAAV userId=10774019","job_type":"clearUserWatchlist"} 

2019-02-12 22:12:44 [XGNE2gpAIHsAABGqlmYAAAAV] mw1302 enwiki 1.33.0-wmf.16 JobExecutor INFO: Finished job execution {"job":"clearUserWatchlist Special:EditWatchlist/clear maxWatchlistId=214224212 requestId=XGNE2gpAIHsAABGqlmYAAAAV userId=10774019","job_type":"clearUserWatchlist","job_status":true,"job_duration":0.078950166702271}

With the new approach that gives the sha1-based deduplication 1 second of wiggle room, the second job should not be deduped, but for some reason it's still not being executed. Looking more..
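
For reference, the intended effect of that wiggle room, again as an illustrative sketch rather than change-prop's real code:

from datetime import datetime, timedelta

TOLERANCE = timedelta(seconds=1)  # wiggle room for the 1-second dt resolution / clock drift

def is_duplicate(new_job_dt, last_exec_of_same_sha1):
    # Simplified new behaviour: only treat the job as a duplicate if an identical
    # job finished more than TOLERANCE after this job's enqueue timestamp.
    return last_exec_of_same_sha1 > new_job_dt + TOLERANCE

first_exec = datetime(2019, 2, 12, 22, 12, 44, 100000)  # first job finished at 22:12:44.100
second_dt = datetime(2019, 2, 12, 22, 12, 44)           # second job's dt, same second
print(is_duplicate(second_dt, first_exec))               # False -> second job should run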

Yay, this works! Not sure what changed with the infrastructure, but today I was able to clear my 1,828 item watchlist on enwiki. @Pchelolo assigning to you to review and close.