[Regression] Top read on enwiki for 6-14-17 has duplicate entries
Open, Needs Triage, Public

Description

Tested on 5.5.0 (1153), on both iPhone and iPad.

Steps:

  1. Open the explore feed and view the Top Read for Wednesday, June 14th.

Expected:
There are no duplicate entries

Actual Result:
Grenfell Tower Fire appears as item #2 and item #4.

Frequency - 5/5


Note - I have not seen duplicate entries in this feed on previous days.


When I go back to June 14th, 2017 now, I see "Grenfell Tower fire" in 2nd place and "Grenfell Tower" in 4th, so it appears to be fixed. @bearND, any idea whether something could have caused the feed endpoint to temporarily return "Grenfell Tower fire" in both top read slots?

ABorbaWMF updated the task description. Jun 21 2017, 5:15 PM
bearND added a project: Services. Edited Jun 27 2017, 8:27 PM
bearND added subscribers: Pchelolo, mobrovac.

Yes, https://en.wikipedia.org/api/rest_v1/feed/featured/2017/06/15 shows two different results for the same topic: "Grenfell_Tower_fire". (There's also a related "Grenfell_Tower" result there but that's not the issue here.)

curl -s "https://en.wikipedia.org/api/rest_v1/feed/featured/2017/06/15" | jq '.mostread.articles[] | select(.pageid == 54297895) | {title, rank, views, view_history}'
{
  "title": "Grenfell_Tower_fire",
  "rank": 4,
  "views": 347223,
  "view_history": [
    {
      "date": "2017-06-14Z",
      "views": 347223
    }
  ]
}
{
  "title": "Grenfell_Tower_fire",
  "rank": 41,
  "views": 61184,
  "view_history": [
    {
      "date": "2017-06-14Z",
      "views": 61184
    }
  ]
}
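A quick way to check a whole most-read list for this kind of duplication is to group the entries by pageid. The following is only a minimal sketch: it assumes each article object carries `pageid` and `rank` fields, as the jq filter and output above suggest, and the helper name `find_duplicate_pageids` and the second sample entry are made up for illustration.

```python
from collections import defaultdict


def find_duplicate_pageids(articles):
    """Return {pageid: [ranks, ...]} for every pageid listed more than once."""
    ranks_by_pageid = defaultdict(list)
    for article in articles:
        ranks_by_pageid[article["pageid"]].append(article["rank"])
    return {pid: ranks for pid, ranks in ranks_by_pageid.items() if len(ranks) > 1}


# Sample shaped like the two entries shown above, plus one hypothetical
# entry (pageid 1) that is not from the real feed.
sample = [
    {"pageid": 54297895, "title": "Grenfell_Tower_fire", "rank": 4, "views": 347223},
    {"pageid": 1, "title": "Hypothetical_article", "rank": 5, "views": 300000},
    {"pageid": 54297895, "title": "Grenfell_Tower_fire", "rank": 41, "views": 61184},
]

print(find_duplicate_pageids(sample))
```

In practice the `articles` list would come from the `mostread.articles` array of the feed response rather than from a hard-coded sample.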

If I run the most-read endpoint locally now I get only one result:

curl -s "http://localhost:6927/en.wikipedia.org/v1/page/most-read/2017/06/15" | jq '.articles[] | select(.pageid == 54297895)'
{
  "views": 194582,
  "rank": 7,
  "pageid": 54297895,
  "$merge": [
    "https://en.wikipedia.org/api/rest_v1/page/summary/Grenfell_Tower_fire"
  ],
  "view_history": [
    {
      "date": "2017-06-14Z",
      "views": 347223
    },
    {
      "date": "2017-06-15Z",
      "views": 194582
    }
  ]
}

Now there is only one entry. It has a different rank and an additional day in view_history. The additional day in view_history might be a bug (or a feature?) in MCS, but the main issue is the duplicate entries and the differing ranks.
I think the stored results should be updated. @mobrovac or @Pchelolo, could the storage logic in RB be changed to allow updating the most-read results a bit later? For example, update the most-read results again after a day has passed.


What would be the exact purpose of updating it a day later? Temporal async updates to only one component are rather tricky and might be imprecise (and we don't have that implemented for any other endpoint, so it would require some work). If there is a genuine and valid reason to do so, we can discuss it, but if it's just about circumventing a situation we know is happening but don't really want (or know how) to deal with, then I wouldn't be up for it, honestly.

@mobrovac The reason is that the PageView API apparently still receives updates after the aggregated feed entry for a day is first stored. I assume that first store happens shortly after 0:00 UTC. If I run the same request in MCS a day later, I shouldn't see significant differences (changed ranks) or duplicated entries compared with the last stored version.
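To make the "stored version vs. a later run" comparison concrete, here is a rough sketch that diffs two most-read snapshots by pageid. It is not how RB or MCS works internally; the function name `compare_snapshots` is hypothetical, and the sample values are the ranks reported earlier in this task (4 and 41 in the stored feed, 7 in the local run).

```python
def compare_snapshots(stored, fresh):
    """Compare two most-read article lists keyed by pageid.

    Returns (duplicates, rank_changes):
      duplicates   -- sorted pageids that occur more than once in `stored`
      rank_changes -- {pageid: (stored_rank, fresh_rank)} where ranks differ
    """
    stored_rank = {}
    duplicates = set()
    for article in stored:
        pid, rank = article["pageid"], article["rank"]
        if pid in stored_rank:
            duplicates.add(pid)
            # Keep the best (lowest) rank among duplicated entries.
            stored_rank[pid] = min(stored_rank[pid], rank)
        else:
            stored_rank[pid] = rank

    rank_changes = {}
    for article in fresh:
        pid, rank = article["pageid"], article["rank"]
        if pid in stored_rank and stored_rank[pid] != rank:
            rank_changes[pid] = (stored_rank[pid], rank)
    return sorted(duplicates), rank_changes


# Stored feed entry (two rows for the same page) vs. the later local run.
stored = [
    {"pageid": 54297895, "rank": 4, "views": 347223},
    {"pageid": 54297895, "rank": 41, "views": 61184},
]
fresh = [{"pageid": 54297895, "rank": 7, "views": 194582}]

duplicates, rank_changes = compare_snapshots(stored, fresh)
print(duplicates)
print(rank_changes)
```

An empty result for both values a day later would indicate that the stored entry had settled; anything else points to the kind of late PageView updates described above.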