
[Commons] A new image added to a category is not shown in Watchlist
Closed, Resolved · Public

Description

Hi,
The information about adding an image to a category is not always shown in the Watchlist.
An example: the uploader added this file to this category. But although I watch the category, this addition to the category is not displayed in Watchlist changes.
However, adding this file to this category is displayed in my Watchlist:
(diff | hist) . . Category:Novoslobodskaya‎; 15:45 . . ‎A.Savin (talk | contribs)‎ (File:Метро "Новослободская" - panoramio.jpg added to category)

I suspect that when a category is specified at the moment of a file upload, addition to this category is not shown in the Watchlist. If a category is added later, this information is displayed in the Watchlist.


All category membership and Wikidata changes in recentchanges / watchlist were missing on all wikis in December 2018 (probably completely unrelated to the original report from April; we'd have noticed if this had been going on for half a year).

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Apr 20 2018, 10:37 PM
Restricted Application added a project: Growth-Team. · View Herald Transcript · Dec 19 2018, 8:38 PM

Hi!

I think I just experienced the same issue as this.

Steps to reproduce:

  • Create a new account on Commons (not technically necessary, but shows that it isn't a problem with my or the reporter's account)
  • Go to Special:NewFiles and pick one of the files listed.
  • Put a category that that file belongs to on your watchlist.
  • Go to your watchlist.
  • Make sure that "Category changes" is one of the filters activated.

Expected results:

  • There is an item on the watchlist that mentions that the file was added to the category.

Actual results:

  • No such item is displayed.

I noticed this because on my main account (User:Tokfo), this upload was not shown on the watchlist even though the category "Photos by User:Linie29" is on my watchlist.

I reproduced it with a fresh account "User:Justtestingsomethings" with this upload into this category. Adding that category to that account's watchlist only made the creation of the category appear, but not the addition of that file.

I'm relatively sure that this used to work fine some months ago, at least for some files. (Knowing my editing patterns, I don't think I'd have made this edit if not for this category being on my watchlist, for example.)

This bug affects Commons more than other wikis, since there are likely many users relying on this to find newly uploaded files that might need curation.

I can report that, for me, this seems to now at least sometimes work again. (I have no way of knowing whether my watchlist is missing anything.) This upload was shown on my watchlist as an addition to "Category:Photos by User:Linie29", as it should. I don't know if this is a bug that happens sometimes but not always, or whether the code has changed, but I thought this would be useful information.

Restricted Application added a project: TCB-Team. · View Herald Transcript · Dec 27 2018, 11:08 AM

Could be the same problem as T211849?

Tgr added a comment (edited). · Jan 2 2019, 7:23 PM

This affected all category membership changes and Wikidata changes (i.e. all the "artificial" recentchanges/watchlist items) at some point (T212432: Category membership and wikidata not showing up in recent changes).

T211849: A particular edit not showing on watchlist is probably a different issue, but reports of this issue are mixed into its comments.

Tgr updated the task description. (Show Details) · Jan 2 2019, 7:50 PM
Tgr updated the task description. (Show Details) · Jan 2 2019, 7:54 PM

This problem was originally reported on December 17 (it allegedly started a few days before that) and apparently fixed itself sometime between Dec 22 and 26, which is very weird, as that was during a deploy freeze... maybe some kind of job queue performance issue, as these types of recentchanges records are populated by a job. There is a spike of job queue Logstash events roughly matching that period, but that dashboard does not make it easy to tell what was going on or whether it was normal. (FWIW, another spike started yesterday.) Tagging Services; maybe they have an idea which logs to check. Otherwise, if recent changes are working correctly now, let's just close this and blame sunspots.

mobrovac added subscribers: Pchelolo, mobrovac.

I took a quick look, and it seems like quite a few categoryMembershipChange jobs failed with the message Could not find page #<page_id>, which would suggest a race condition. However, the real surge in the number of log entries is for Old event processed for the same type of job. This could suggest a subtle bug in CP-JobQueue. Sample log entry:

{
  "_index": "logstash-2018.12.20",
  "_type": "cpjobqueue",
  "_id": "AWfOD6ibxYA-x_Uc4G4Y",
  "_version": 1,
  "_score": null,
  "_source": {
    "source_host": "10.64.32.153",
    "level": "WARNING",
    "rule_name": "categoryMembershipChange",
    "event_str": "{\"database\":\"dewiki\",\"mediawiki_signature\":\"358afaf6726126715170d77fad21b1eec41ae76d15be83564136453c8991d1f3\",\"meta\":{\"domain\":\"de.wikipedia.org\",\"dt\":\"2018-12-16T15:04:13+00:00\",\"id\":\"d933e40b-0143-11e9-820d-1866da99474d\",\"request_id\":\"XBZpbApAICMAAEQe0fYAAADK\",\"schema_uri\":\"mediawiki/job/2\",\"topic\":\"mediawiki.job.categoryMembershipChange\",\"uri\":\"https://de.wikipedia.org/wiki/Michael_Tronick\"},\"page_namespace\":0,\"page_title\":\"Michael_Tronick\",\"params\":{\"pageId\":6114462,\"revTimestamp\":\"20181216150412\"},\"type\":\"categoryMembershipChange\"}",
    "pid": 6019,
    "type": "cpjobqueue",
    "message": "Old event processed",
    "version": "1.0",
    "normalized_message": "Old event processed",
    "levelPath": "warn/oldevent",
    "tags": [
      "es",
      "gelf",
      "normalized_message_untrimmed"
    ],
    "@timestamp": "2018-12-20T23:59:51.938Z",
    "executor": "RuleExecutor",
    "host": "scb1003",
    "@version": "1",
    "topic": "mediawiki.job.categoryMembershipChange",
    "gelf_level": "4"
  },
  "fields": {
    "@timestamp": [
      1545350391938
    ]
  },
  "sort": [
    1545350391938
  ]
}
Tgr added a comment (edited). · Jan 3 2019, 10:02 PM

It's not a race condition in the obvious sense: this affected all category membership changes, not just new or deleted pages. (For example, the entry you pasted is for this edit; both the article and the removed categories have been around for a long time. Never mind, those are different log entries.) So the missing page IDs could be a different bug, but they are not related to (most of) the user complaints.

This problem was originally reported on December 17 (allegedly started a few days before that) and apparently it fixed itself sometime between Dec 22 and 26.

This exactly matches the drastic increase in job processing times, which, given the limited processing concurrency, increased the backlog and thus led to the Old event processed log entries.

The Couldn't find pageId #XXXXX errors are easily explainable: we had an almost week-long backlog in these jobs, so the pages were deleted between job submission and execution. This is a consequence, not the cause.

The issue suddenly fixing itself after one week is easily explained by the fact that we discard jobs posted more than one week ago, so the whole backlog was cleared up fairly quickly. (A questionable thing to do; I think we should reconsider it.)
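The age cutoff described here can be sketched roughly as follows. (The real change-propagation service is written in Node.js; the constant and function name below are illustrative, not the actual implementation.)

```python
import time

# Hypothetical retention window matching the behavior described above.
MAX_EVENT_AGE = 7 * 24 * 3600  # one week, in seconds


def should_process(event_timestamp, now=None):
    """Return False for events older than the retention window.

    Events failing this check are what produce the "Old event processed"
    warnings: they are logged and dropped instead of being executed.
    """
    if now is None:
        now = time.time()
    return (now - event_timestamp) <= MAX_EVENT_AGE
```

Under a week-long backlog, almost every dequeued job eventually trips this cutoff, which is why the queue drained so abruptly.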

As for the reason for the job latency increase, I have a theory. This job uses the parser cache, so once we get a small backlog or a surge in parser cache misses for some reason, the probability that a job misses the parser cache and has to actually render the revision increases, so latency increases. That causes a cascade: the larger the backlog, the higher the mean latency (since the parser cache miss rate rises), which in turn increases the backlog further, and we spiral down (actually up) to hell.
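A toy model of that feedback loop (all parameter names and values here are hypothetical, chosen only to illustrate the cascade, not measured from the real system):

```python
def simulate_backlog(steps, arrival_rate, base_latency, miss_penalty,
                     cache_window, backlog=0.0):
    """Toy model of the cascade: per-job latency grows with the parser
    cache miss rate, and the miss rate grows as the backlog exceeds the
    cache retention window. Returns the backlog size after each step.
    """
    history = []
    for _ in range(steps):
        # The further behind we are, the more jobs miss the parser cache.
        miss_rate = min(1.0, backlog / cache_window)
        latency = base_latency + miss_rate * miss_penalty
        service_rate = 1.0 / latency  # jobs processed per time unit
        backlog = max(0.0, backlog + arrival_rate - service_rate)
        history.append(backlog)
    return history
```

With no initial backlog the queue stays empty, but a one-time perturbation large enough to push jobs past the cache window makes the backlog grow without bound: exactly the runaway behavior described above.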

As a quick mitigation, I will significantly increase the job concurrency so that we clear the current backlog and get some headroom for the future. However, it would be interesting to look at how to handle jobs with drastically varying execution times going forward.

Change 482741 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Increase concurrency for categoryMembershipJob

https://gerrit.wikimedia.org/r/482741

Tgr added a comment. · Jan 8 2019, 1:58 AM

Should we manually recreate the jobs for these changes? Two weeks on, no one looking at recent changes will care, but maybe people will for their watchlists?

@Tgr That is still doable; the retention of Kafka topics is one month. I can store the jobs in a file for now; however, I think this would create more confusion than benefit. If we reinsert the jobs, a lot of recent changes will suddenly appear, and I am not entirely sure what the consequences would be.

If we insert ~5 million recent changes from December, would they appear in the recent changes feeds? Would we trigger more updates/bots/etc. that we do not anticipate? I think the risk outweighs the benefit in this case. This action would probably exercise a lot of code paths that are not normally executed, so it seems like a fairly risky operation. We could, of course, try inserting a couple thousand after the backlog is cleared, see what happens, and make a final decision after the experiment.

I'll save the jobs into a file while we decide.

Tgr added a comment. · Jan 8 2019, 2:15 AM

Recent changes are sorted by change date (they don't have an insert date), so they should just appear somewhere far down in recent changes, I think. That's why I said nobody will care; but for watchlists, two weeks might not be a lot of rows (it really depends on the user).

It would probably trigger feed-based bots (e.g. those listening to the IRC feed), although most of those are probably anti-vandalism bots and would ignore these kinds of "virtual" edits (they would listen for the real edit, which added/removed the category in the article or edited the Wikidata item; those did not go through these job queues and were recorded in recentchanges normally back when they happened).

Not sure whether the recentchanges table having a very different physical order than logical order has a performance impact. Probably not; it's indexed and not big.

OK, I've stored the events in a file on kafka1001. The problem now is filtering out the events that were never executed: the SHA1-based deduplication has only a 1-day TTL in Redis, and the abandonment logs are at trace level, so they're not stored anywhere. I will look into what happens if we replay duplicate events, and into options for figuring out which events were processed and which were abandoned.
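The deduplication window being discussed might look roughly like this in-memory sketch (the production setup uses Redis; the class and method names below are invented for illustration). The 1-day TTL is why it can't tell a two-week-old replay from a genuinely new event:

```python
import hashlib


class TtlDeduplicator:
    """In-memory stand-in for SHA1-based deduplication with a TTL:
    remember event hashes for `ttl` seconds and report duplicates
    seen within that window. Expired hashes are forgotten, so replays
    older than the TTL look like new events.
    """

    def __init__(self, ttl=24 * 3600):
        self.ttl = ttl
        self._seen = {}  # sha1 hex digest -> expiry timestamp

    def is_duplicate(self, event_body, now):
        digest = hashlib.sha1(event_body.encode("utf-8")).hexdigest()
        expires = self._seen.get(digest)
        if expires is not None and expires > now:
            return True
        # First sighting (or expired entry): record and let it through.
        self._seen[digest] = now + self.ttl
        return False
```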

I think abandoning super-old events is a feature that just backfired. I will disable it.

Tgr added a comment. · Jan 8 2019, 2:30 AM

The events probably have a reference to the revision ID of the change and a source field (mw.categorize for categories; not sure about Wikidata), so they can be deduplicated against the current RC entries with the same type and revision ID.
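A sketch of that deduplication step, assuming events carry a source and a revision ID (the field names and function below are illustrative, not the actual schema):

```python
def filter_replayable(events, existing_rc):
    """Keep only events whose (source, revision ID) pair is not already
    present among the recentchanges entries.

    `events` are dicts with "source" (e.g. "mw.categorize") and "rev_id";
    `existing_rc` is an iterable of (source, rev_id) pairs already
    recorded in the recentchanges table.
    """
    seen = set(existing_rc)
    return [e for e in events if (e["source"], e["rev_id"]) not in seen]
```

This sidesteps the expired Redis TTL entirely: the recentchanges table itself serves as the record of which events were actually processed.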

OK, as soon as my patch is merged and the backlog gets more reasonable, I will try replaying several events, see what happens, and if all is good, replay all of them.

Change 482741 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Increase concurrency for categoryMembershipJob

https://gerrit.wikimedia.org/r/482741

Mentioned in SAL (#wikimedia-operations) [2019-01-08T09:11:59Z] <mobrovac@deploy1001> Started deploy [cpjobqueue/deploy@f91cf04]: Increase the concurrency of categoryMembershipJob - T192691

Mentioned in SAL (#wikimedia-operations) [2019-01-08T09:12:58Z] <mobrovac@deploy1001> Finished deploy [cpjobqueue/deploy@f91cf04]: Increase the concurrency of categoryMembershipJob - T192691 (duration: 00m 59s)

Change 482903 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] [TEMP] Increase concurrency for categoryMembershipJob

https://gerrit.wikimedia.org/r/482903

Change 482903 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] [TEMP] Increase concurrency for categoryMembershipJob

https://gerrit.wikimedia.org/r/482903

Mentioned in SAL (#wikimedia-operations) [2019-01-09T21:03:53Z] <ppchelko@deploy1001> Started deploy [cpjobqueue/deploy@bfa9241]: Increase concurrency for categoryMembershipJob T192691

Mentioned in SAL (#wikimedia-operations) [2019-01-09T21:04:38Z] <ppchelko@deploy1001> Finished deploy [cpjobqueue/deploy@bfa9241]: Increase concurrency for categoryMembershipJob T192691 (duration: 00m 45s)

I can report two further occurrences of this bug. The following two uploads should have been shown on my (User:Tokfo) watchlist on the basis that Category:Photos by User:Linie29 is on my watchlist, but they weren't:

I hope this is useful to help developers track down the cause.

The issue created some backlog in the job queue. It's being cleared out now; I expect it to go away within the next 12 hours.

Pchelolo closed this task as Resolved. · Jan 10 2019, 4:26 PM
Pchelolo claimed this task.

OK, the backlog has been cleared and everything should be operating normally.

Regarding replaying: it will, unfortunately, be impossible. The Kafka topic retention is one week, not one month, so the events from the December incident have been lost.

I will eventually revert the patch that made the concurrency for categoryMembershipJob super high.

daniel removed a subscriber: daniel. · Jan 10 2019, 4:31 PM

I can confirm that the problem reported in my previous message has gone away and those two uploads have now appeared on my watchlist. Thanks to the developers for fixing it.

Tgr added a comment. · Jan 10 2019, 5:40 PM

In theory we could go through the edits in that period and generate changes for them. But it would be a fair amount of effort and no one seems too concerned about the missing entries so probably not worth it.