
EPIC: Add Top read articles to the app
Closed, Resolved (Public)

Description

Let's productize the spike we have for Top Read and create a daily feed card (inserted with daily content such as Featured and POTD).

Suggest we show 5 top articles of the day in a condensed card format, with:

  • header should say "Top Articles for [DATE]"
  • five articles with image, title and description
  • footer for "Top 10 Most Read" clicks to modal display of top 10 articles

When querying top articles for $site filter results so that:

  • They contain only articles from main namespace
  • Exclude the wiki's main page
  • Exclude articles from global blacklist (i.e. "-")

For other filters, see comments in T124082

Mocks

trending.png (1×750 px, 190 KB)

  • List uses search results table cell
  • the header states the day on which these articles were top read.
  • humanize the dates to "yesterday"

Tapping on the more top read will show the full list in a new VC

trending copy.png (1×750 px, 275 KB)

(Ignore the list items; I used a search results screenshot :)

Event Timeline

JMinor raised the priority of this task from to Medium.
JMinor updated the task description. (Show Details)

@Nirzar is it safe to assume that you're the default assignee whenever a task is in Design Doing? Just asking because some tasks are assigned to you but some aren't.

@JMinor due to limitations in pageview data accumulation (and the fact that not everyone uses UTC/GMT), we need to come up with explicit rules for handling cases where "today" or "yesterday" might not contain valid data for users in time zones ahead of GMT/UTC (not sure which time zone the pageview data jobs are run on). @Mhurd developed a heuristic which (IIRC):

  • Tries to fetch top articles for "today" in the user's time zone
    • If successful: display today's top articles
    • If not enough data was available (determined by API response), fetch the previous day's data
      • If successful: display yesterday's top articles
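The heuristic above can be sketched roughly as follows. This is only an illustration in Python, not the app's actual code: `fetch_top_with_fallback` and the `Fetcher` type are hypothetical names, and the fetcher is assumed to return `None` when the API reports that not enough data is available.

```python
from datetime import date, timedelta
from typing import Callable, List, Optional, Tuple

# Hypothetical fetcher type: returns the top titles for a day, or None when
# the API reports that not enough data is available yet.
Fetcher = Callable[[date], Optional[List[str]]]


def fetch_top_with_fallback(
    fetch: Fetcher, today: date, max_days_back: int = 1
) -> Optional[Tuple[date, List[str]]]:
    """Try `today` first, then step back one day at a time."""
    for offset in range(max_days_back + 1):
        day = today - timedelta(days=offset)
        articles = fetch(day)
        if articles:  # None or an empty list both count as "no data"
            return day, articles
    return None  # nothing usable within the allowed window
```

Returning the day alongside the articles lets the caller label the card with the date that was actually fetched.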

@Nuria what time zone are the pageview processing jobs run on? For example, if I always query "yesterday" (by GMT standards), am I very likely to get data back? Are there other approaches you would suggest?

@Slaporte do you have any ideas for how the apps should handle this? How did top.hatnote.com determine which date to fetch top 100 for? For example, going to http://top.hatnote.com/ today (Jan 29) shows yesterday's (Jan 28) top 100.

Let's assume pageview is designed to have data ready by end-of-day GMT. If the clients always fetch "GMT yesterday" we should be able to reliably show the top articles from "GMT yesterday." All that remains to be done is convert the date to the user's timezone. The caveat of course is that "GMT yesterday" might mean "today" for some, or "2 days ago" for others. Ideally, each site/region would have its own pageview data crunched in its local timezone to improve UX for local users.
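To make the caveat concrete, here is a small Python sketch of how "GMT yesterday" might read in different time zones. The function names are made up for illustration, and `now_utc` is assumed to be a timezone-aware datetime.

```python
from datetime import date, datetime, timedelta, timezone, tzinfo


def utc_yesterday(now_utc: datetime) -> date:
    """The most recent UTC day that is expected to have complete data."""
    return (now_utc - timedelta(days=1)).date()


def humanize(data_day: date, now_utc: datetime, user_tz: tzinfo) -> str:
    """Describe `data_day` relative to the user's local calendar date."""
    local_today = now_utc.astimezone(user_tz).date()
    days_ago = (local_today - data_day).days
    if days_ago == 0:
        return "today"
    if days_ago == 1:
        return "yesterday"
    return f"{days_ago} days ago"
```

For example, late in the UTC day a user at UTC+13 is already on the next calendar date, so the same data reads as "2 days ago" for them, while early in the UTC day a user at UTC-10 is still on the previous date and sees it as "today".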

Also, we should add some filters to the acceptance criteria:

  • For the current targeted site:
    • filter out articles whose titles start with a known namespace prefix
    • filter out its main page
  • Exclude articles in the universal blacklist
    • For now, just use "-", but we'll likely come up with others once we integrate the feature

@Slaporte do you have any ideas for how the apps should handle this? How did top.hatnote.com determine which date to fetch top 100 for? For example, going to http://top.hatnote.com/ today (Jan 29) shows yesterday's (Jan 28) top 100.

top.hatnote.com displays the data under the UTC date, with no timezone conversion, and shows the most recently available chart. I'd recommend making it simple and consistent, with a universal "Top article of Jan 29" regardless of the user's timezone.

The pageview API does not guarantee when the data will be available. We try fetching shortly after UTC midnight and then try periodically until we are successful. If I'm remembering right, I think the data is usually ready 02:00 - 06:00 UTC the following day (but a few times it's been published much later).

I'd recommend making it simple and consistent, with a universal "Top article of Jan 29" regardless of the user's timezone.

I agree, KISS is probably best for now.

@Nuria @Milimetric is it possible for pageviews to have a "latest" endpoint which exposes the most recent data for a given day? i.e. the API response would contain the date the data is associated with, and cache control information would allow clients to make conditional requests for the latest data (e.g. "If-None-Match" with an ETag, or "If-Modified-Since" with a date).

In the meantime, @JMinor @Fjalapeno @Mhurd here's what I propose we do in the iOS app when the feed is refreshed:

  • Insert feed section for current UTC day if not already present
  • When the section fetches its data:
    • And it succeeds:
      • display it to the user as "top read articles for $date" (where date is the fetched UTC date, converted to the user's time zone)
    • And we fail:
      • and the error code is "data unavailable": display an error view asking user to try again later (via pull-to-refresh)
      • otherwise, just show a generic error view in the section*
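The decision logic in the list above might look something like this sketch (Python for brevity; the state names and the `"data unavailable"` error-code string are placeholders, not the real API's values):

```python
from enum import Enum
from typing import List, Optional


class SectionState(Enum):
    ARTICLES = "show top read articles"
    TRY_AGAIN_LATER = "ask user to pull-to-refresh later"
    GENERIC_ERROR = "show generic error view"


# Placeholder for whatever code the API uses to signal missing data.
DATA_UNAVAILABLE = "data unavailable"


def section_state(
    articles: Optional[List[str]], error_code: Optional[str]
) -> SectionState:
    """Map the outcome of one fetch to what the feed section displays."""
    if error_code is None and articles:
        return SectionState.ARTICLES
    if error_code == DATA_UNAVAILABLE:
        return SectionState.TRY_AGAIN_LATER
    return SectionState.GENERIC_ERROR
```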

There's a case to be made for trying to fall back to get the previous day's data (and the day before that, etc.) but IMO this is simplest.

* IMO we should do this for all sections instead of relying on banners. It would be much easier now that @Fjalapeno has implemented a base class w/ shared functionality for all sections.

FYI that top articles display a lot of automated traffic (we have an item to work on that) so you might need to do some client side filtering for "nonsensical" results.

@BGerstle-WMF

@Nirzar is it safe to assume that you're the default assignee whenever a task is in Design Doing? Just asking because some tasks are assigned to you but some aren't.

Yes

Also, in the initial mock I was working on, I placed the specific date for the trending card on the feed.
WIP -->

trending.png (1×750 px, 190 KB)

with the new headers from T124484 it becomes easier to show the date.

@Nuria thanks, we're aware of this, but I'm not sure what we can do on the client to alleviate the "spam." I know it's something you're looking into, let us know if we can help in any way. @Jdlrobson has also (frequently 😜) cited this as a potential achilles' heel for this feature.

@Nuria surfacing the top referrers for an article may help with this:

  • No referrers suggests a bot
  • An internal link or search (via Wikipedia or another engine, e.g. Google) suggests trending, in my opinion

@BGerstle-WMF The already implemented logic works better than what you're proposing... i.e. it already has perfectly functional fallback logic for the "today's update not yet available" problem.

In case you are interested you can review the output of trending as defined by editors - https://pushipedia.wmflabs.org/trending and compare with the top 100 viewed (note a few in there are edit wars and I have since refined the algorithm to exclude them in future)

Tabled this for discussion at "code review" meeting later today. Will report our decision back here

FYI that top articles display a lot of automated traffic (we have an item to work on that) so you might need to do some client side filtering for "nonsensical" results.

My approach has been to simply display the results, even if they look unusual. I think that differentiating between human/non-human views is probably best and most consistently done in the service side. For example, it's safe to guess that some (most?) of Web scraping's 406.3K views yesterday are automated traffic, but how many of Zika virus's 350.3K? Client side filtering makes the list look more realistic, but doesn't really fix the (potentially) distorted ranking from automated traffic.

In case you are interested you can review the output of trending as defined by editors - https://pushipedia.wmflabs.org/trending and compare with the top 100 viewed (note a few in there are edit wars and I have since refined the algorithm to exclude them in future)

This is so cool!

@Nirzar - here's that mock:

trending.png (1×750 px, 190 KB)

@JMinor can we have the "More top articles" footer load, like, the top 30 at least?
It's pretty cool to scroll through: (tap to see scrolling)

top.gif (686×374 px, 7 MB)

WIP:
https://github.com/wikimedia/wikipedia-ios/pull/388

MBinder_WMF renamed this task from As a user I'd like to see the top read articles in my Explore feed to EPIC: As a user I'd like to see the top read articles in my Explore feed. Feb 3 2016, 9:41 PM
JMinor renamed this task from EPIC: As a user I'd like to see the top read articles in my Explore feed to EPIC: Add Most Read articles to the app. Feb 10 2016, 6:55 PM

Note that this has been broken down into two subtasks to add the card to the feed, and to add the "more most read" modal from the footer touch.

Additional subtasks may be needed as we finalize implementation.

@JMinor @Mhurd recapping my thoughts on whether the app should (now or in the future) filter specific titles from the "top read articles":

I think we all agree to remove the following across all languages:

  • "-"
  • The targeted site's main page
  • Any article not in the main namespace (e.g. Special:Search)

What's up for debate now is whether to also filter things like:

  • "Web scraping"
  • "Java"
  • "xHamster"

i.e. things we manually identify as anomalies in Analytics' pageview ranking or as not useful to the user because they're persistently in the top articles. Note, this is not censorship.

My $0.02: do not manually filter out specific "blacklisted" titles for now. This will allow us and our beta users to experience "top articles" the same way those viewing non-EN sites would. Based on these observations, we can then determine whether:

  • These sorts of titles should be filtered
  • The feed is even useful w/o the filters, and thus whether it's worth showing when the user's site isn't en.wikipedia.org

I'd recommend the following exclusions:

title exclusions

  • "Main_Page" (exclude name of main page for site language)
  • "-"
  • "Test_card"
  • "Web_scraping"

title prefix exclusions (don't exclude just because a ":" is present as these can be in article namespace titles)

  • "User:"
  • "Special:"
  • "File:"
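A minimal sketch of these exclusions, in Python for illustration. The main page name is passed in as a parameter since it differs per site; the function and constant names are hypothetical.

```python
# Exact-title exclusions from the list above (main page handled separately).
EXCLUDED_TITLES = {"-", "Test_card", "Web_scraping"}

# Only known namespace prefixes are excluded, not every title containing ":".
EXCLUDED_PREFIXES = ("User:", "Special:", "File:")


def is_excluded(title: str, main_page: str = "Main_Page") -> bool:
    """Return True if `title` should be dropped from the top read list."""
    if title == main_page or title in EXCLUDED_TITLES:
        return True
    return title.startswith(EXCLUDED_PREFIXES)
```

Matching on explicit prefixes rather than on any ":" keeps main-namespace titles such as "Superintelligence:_Paths,_Dangers,_Strategies" in the list.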

I was out during the IRC discussion.

I think @Mhurd is right about the long term product direction (we should move towards "trending"), but what we have now is "Most Read". Given the timelines and means within reach, I would suggest we go with the more limited exclusions he outlines above, and see what our testers (and ourselves) have to say.

I'd particularly note that this goal:
"Any article not in the main namespace (e.g. Special:Search)"

Seems similar to our issues with recommendations, in that we have no "namespace" field to cleanly draw from. Therefore it seems like at least excluding the most popular namespaces outside ø (i.e. the title prefix exclusions listed) is the only possibility for this version.

My vote:

  • "-"
  • The targeted site's main page
  • Any non mainspace pages

I don't think we should filter suspicious traffic based on page titles unless we can also exclude that type of traffic across the board.

Seems similar to our issues with recommendations, in that we have no "namespace" field to cleanly draw from.

We could, but it's tricky. Ultimately a feed architecture that's designed to accommodate this is what's needed.

I don't think we should filter suspicious traffic based on page titles unless we can also exclude that type of traffic across the board.

My point exactly. We're putting ourselves in a position to ship a feature w/ a false confidence that it's valuable in other languages. If the feed needs filters, then we should only deploy it in languages which we know how to filter.

@JMinor also, I plan to filter out non-article namespaces in my current implementation. We can get the namespace when fetching the data for each title, and exclude those pages before we display them.
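As a rough illustration of that approach (in Python, not the app's code): assuming each fetched page record carries an "ns" field the way MediaWiki action API page results do, where 0 is the main namespace, the filter reduces to a single comparison. The record shape and function name here are assumptions.

```python
from typing import Dict, List

MAIN_NAMESPACE = 0  # MediaWiki's namespace id for articles


def keep_main_namespace(pages: List[Dict]) -> List[Dict]:
    """Keep only page records whose namespace id is the main namespace."""
    # Records without an "ns" field are dropped rather than guessed at.
    return [page for page in pages if page.get("ns") == MAIN_NAMESPACE]
```

Because this relies on the namespace id rather than the title text, it does not depend on knowing each wiki's localized namespace names.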

MBinder_WMF raised the priority of this task from Medium to High. Feb 16 2016, 10:20 PM

@JMinor @Slaporte @Mhurd This one is slipping past our "filters." Only seems to be happening on FR wiki 😒

pasted_file (1×750 px, 168 KB)

@BGerstle-WMF maybe do HAX filter for strings ending in ":Search"?

@JMinor @Slaporte @Mhurd This one is slipping past our "filters." Only seems to be happening on FR wiki 😒

Good catch. My prefix filter rule for the hatnote top page is ['Special', 'Template', 'Spécial', 'Project'] + all the local namespace names. I haven't seen any issues outside of this list in ~18 languages.

Just a note to subscribers here that I posted a question on the talk page of the special page I found here:

https://en.wikipedia.org/wiki/Wikipedia:Top_25_Report

about the Exclusions he refers to at the bottom.

I like his inclusion of article quality and view count. Maybe instead of using the standard compact cell view or the standard "big card", we should add this kind of info to the cells of the Top 30 modal.

Not an issue for 5.0.0, though.

JMinor renamed this task from EPIC: Add Most Read articles to the app to EPIC: Add Top read articles to the app. Feb 26 2016, 6:36 PM

Re exclusions, I responded at the Top 25 report talk page: https://en.wikipedia.org/wiki/Wikipedia_talk:Top_25_Report#iOS_app_and_filters

For your purposes, I think excluding articles with almost no or almost all mobile views will catch most false entries. All articles popular due to human views fall well within a range of 10-90% mobile views. Of course a spammer could game the system to overcome this, but it hasn't happened so far.
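A sketch of this heuristic in Python, assuming per-article mobile and total view counts are available (the function names and the way the counts are obtained are assumptions; the 10% and 90% cutoffs come from the range quoted above):

```python
def mobile_share(mobile_views: int, total_views: int) -> float:
    """Fraction of an article's views that came from mobile."""
    if total_views <= 0:
        return 0.0  # no traffic at all is treated as 0% mobile
    return mobile_views / total_views


def looks_human(
    mobile_views: int, total_views: int, low: float = 0.10, high: float = 0.90
) -> bool:
    """True when the mobile share falls inside the expected human range."""
    return low <= mobile_share(mobile_views, total_views) <= high
```

An article with roughly 0.01% or 94% mobile views would be flagged as suspicious, while one with an even split would pass.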

Thanks @Milowent

At the talk page he provides the following list of "usual suspects" based on this mobile/web split heuristic.

My suggestion is that we exclude these specific articles, which are subject to this rule and regularly appear:

  • Web scraping (0.01% mobile)
  • XHamster (94.49% mobile)
  • Superintelligence: Paths, Dangers, Strategies (99.9% mobile)
  • Java (programming language) (1.46% mobile, a long-term removal)
  • Images/upload/bel.jpg (0% mobile, not even a page). (think we are covered by removing non-Article pages)

We should then work with the Analytics team on incorporating this as an optional filter on the API based on actual % cutoffs (rather than a static list).

Superintelligence: Paths, Dangers, Strategies

Aww I actually clicked on that one :-(

Also, strangely fitting how that title stays at the top of the list... Damn robots

Yes. I'm curious if that is used for some very specific testing purpose for AIs or some machine learning task (the way "web scraping" is used to test web scrapers).

  • Superintelligence: Paths, Dangers, Strategies (99.9% mobile)

I don't recall this being a regular removal, but it was removed on the last report.

The funniest gaming of views we ever saw was when the Scottish independence referendum failed in 2014. "Scotland", "England", and "Marriage" all had big artificial spikes in views over a short period. But we caught it and excluded them.

https://en.wikipedia.org/wiki/Wikipedia:Top_25_Report/September_21_to_27,_2014#Exclusions