[Spike 6 hr] Investigate extraction of data from today pages
Closed, Resolved (Public)

Description

We need to style the today page as a card.

Ideally we can give it the same/similar aesthetic love we are giving to other articles (big images / card UI)

However, some today pages have odd layouts (like Hebrew, with its crazy table layout).

Here is some info:
https://www.mediawiki.org/wiki/Mobile_Gateway/Mobile_homepage_formatting
https://www.mediawiki.org/wiki/Extension:MobileFrontend#Configuring_the_main_page

Some questions to answer:

  • Can we detect “Big Table” pages like Hebrew?
  • Featured Article
  • Can we extract the “in the news” links?
  • Can we extract photo of the day from the today view?
  • Can we extract photo of the day from Commons as well? (it has different photos)
  • Do many languages use the English today template (with in the news links/photos)?
  • Can we get a “lead image” on most today pages? Does it look good (i.e. is the lead image big enough)?
Fjalapeno updated the task description.
Fjalapeno raised the priority of this task to Needs Triage.
Fjalapeno moved this task to Needs Estimation on the Wikipedia-iOS-App-Backlog board.
Fjalapeno added a subscriber: Fjalapeno.
Restricted Application added a subscriber: Aklapper. Aug 12 2015, 10:49 PM

My idea: run snapshot tests that capture the output of extraction into an attributed text view. Try to figure out the parse-ability of pages, otherwise fall back to a webview.

Or, use a playground

Fjalapeno updated the task description. Aug 14 2015, 7:17 PM
Fjalapeno set Security to None.
Fjalapeno updated the task description. Aug 17 2015, 2:06 PM
Fjalapeno renamed this task from [Spike 3 hr] Investigate extraction of data from today pages to [Spike 6 hr] Investigate extraction of data from today pages. Aug 24 2015, 1:02 AM
BGerstle-WMF added a comment. Edited Sep 2 2015, 4:55 PM

Summary

I looked into two approaches. First, getting content through the designated categories for featured content. Second, looking at how Main Pages are structured (using the doc links in the task description).

Can we detect “Big Table” pages like Hebrew?

To be clear, all main pages I've encountered appear to be "big tables."

If by "detect" you mean detect elements meant for the mobile site... sort of. See the section on Main Page.

Featured Article

While we can get to "translations" of EN Wiki's TFA page, there doesn't appear to be a convention (or standard) for getting "today's" featured article reliably across languages. Those translated pages are the closest I think we can get, and they are not to be confused w/ the actual featured article for today.

Can we extract the “in the news” links?

Again, not standardized, nor does there appear to be a convention for this, or any other, main page section.

Can we extract photo of the day from the today view?

See above. On EN Wiki the POTD is set as the page's image, but even that's not a widely-adopted convention AFAICT.

Can we extract photo of the day from Commons as well? (it has different photos)

See @Mhurd's query. I guess Commons is language-agnostic, so that should work.

Do many languages use the English today template (with in the news links/photos)?

Seems like French (fr), German (de), Hebrew (he), Russian (ru), Japanese (ja), Arabic (ar), and Ukrainian (uk) all /look/ similar, but there's no underlying shared template AFAICT.

Can we get a “lead image” on most today pages? Does it look good (i.e. is the lead image big enough)?

The image element within the TFA section is only really suitable for (and appears to be designed as) a thumbnail, floating left of the text.

Categories

While promising, categories don't appear ready for consumption by a dynamic, language-agnostic client, instead relying heavily on manual connections across languages which require manual link traversal.

The good news:

Featured content (on some wikis) is tidily organized into various categories (e.g. TFA, POTD, DYK, etc.) which are already accessible via the API. You can even get an RSS/Atom feed for some of them (featuredfeed, potd, onthisday) via the API! There are also pseudo-RESTful interfaces, like getting today's featured article (or even a specific date's) by going to /wiki/Wikipedia:Today%27s_featured_article/Today
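To make the two access paths above concrete, here is a minimal sketch that only builds the URLs (no network calls). It assumes the standard /w/api.php and /wiki/ paths and the featuredfeed module's feed and feedformat parameters. The function names are made up for illustration, and as noted elsewhere in this comment, none of this is guaranteed to exist on every wiki.

```python
from urllib.parse import quote, urlencode


def featured_feed_url(host, feed, feed_format="atom"):
    """Build an api.php URL for a featured-content feed (e.g. feed="potd")."""
    query = urlencode(
        {"action": "featuredfeed", "feed": feed, "feedformat": feed_format}
    )
    return f"https://{host}/w/api.php?{query}"


def tfa_today_url(host):
    """Build the pseudo-RESTful URL for today's featured article (EN Wiki naming)."""
    title = "Wikipedia:Today's_featured_article/Today"
    # Keep ':' and '/' readable; the apostrophe is percent-encoded as %27.
    return f"https://{host}/wiki/" + quote(title, safe=":/")


if __name__ == "__main__":
    print(featured_feed_url("en.wikipedia.org", "potd"))
    print(tfa_today_url("en.wikipedia.org"))
```

The page title in tfa_today_url is EN Wiki's; other wikis would need their own localized title, which is exactly the standardization gap described under the bad news.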

The bad news:

None of this is standardized, or even manually linked together, to a large extent across languages. For example, although you can get to various languages' versions of TFA by traversing langlinks for EN Wiki's TFA, you end up on something similar to TFA, not an identical-but-localized version of TFA. Further, there are no apparent ubiquitous, let alone standard, interfaces for getting specific entries (today's, yesterday's, etc.) or a feed of entries (featuredfeed is not available on all sites, nor are some of the feeds, nor does it apparently respect the lang parameter).
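For reference, the langlinks traversal mentioned above could look roughly like this. It is a sketch, assuming the default JSON response shape of action=query&prop=langlinks (pages keyed by page id, each link carrying "lang" and "*"). The sample response is hand-written with placeholder titles, not real API output.

```python
import json
from urllib.parse import urlencode


def langlinks_url(host, title):
    """Build an api.php query for the language links of one page."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
        "format": "json",
    }
    return f"https://{host}/w/api.php?{urlencode(params)}"


def parse_langlinks(response_text):
    """Map language code -> linked page title from a langlinks response."""
    data = json.loads(response_text)
    links = {}
    for page in data.get("query", {}).get("pages", {}).values():
        for link in page.get("langlinks", []):
            links[link["lang"]] = link["*"]
    return links


# Hand-written sample with placeholder titles (NOT real API output).
SAMPLE_RESPONSE = json.dumps(
    {
        "query": {
            "pages": {
                "1": {
                    "title": "Placeholder source page",
                    "langlinks": [
                        {"lang": "fr", "*": "Placeholder FR page"},
                        {"lang": "de", "*": "Placeholder DE page"},
                    ],
                }
            }
        }
    }
)

if __name__ == "__main__":
    print(langlinks_url("en.wikipedia.org", "Wikipedia:Today's featured article"))
    print(parse_langlinks(SAMPLE_RESPONSE))
```

Even when this returns a link for a given language, the target is whatever the community linked by hand, so the client still cannot assume it is a localized equivalent of TFA.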

What can we do with it right now

Not much. We would have to galvanize the communities in each language to set up links between a few common endpoints (today's TFA, POTD, etc.) so that we can get to them via langlinks. This assumes that the content for these destinations is roughly the same, and suitable for our purposes. IMO, this would also be a wasted effort when dedicated tools tied to automation & APIs would serve this purpose much better.

Everything related to main pages & categories appears to be by-convention, as opposed to standard formats or technologies across wikis.

Main Page

While categories aren't standardized across wikis, we can at least figure out the main page for any MW install by querying its siteinfo. Once there, however, there's not a whole lot we can do.
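A rough sketch of the siteinfo lookup, assuming the general block of action=query&meta=siteinfo exposes "mainpage" (the title) and "base" (the main page URL). The sample response below is hand-written for illustration and trimmed to those two fields.

```python
import json
from urllib.parse import urlencode


def siteinfo_url(host):
    """Build an api.php query for a wiki's general site info."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "general",
        "format": "json",
    }
    return f"https://{host}/w/api.php?{urlencode(params)}"


def parse_main_page(response_text):
    """Return (main page title, main page URL) from a siteinfo response."""
    general = json.loads(response_text).get("query", {}).get("general", {})
    return general.get("mainpage"), general.get("base")


# Hand-written sample trimmed to the two fields used here (illustrative only).
SAMPLE_RESPONSE = json.dumps(
    {
        "query": {
            "general": {
                "mainpage": "Main Page",
                "base": "https://en.wikipedia.org/wiki/Main_Page",
            }
        }
    }
)

if __name__ == "__main__":
    print(siteinfo_url("he.wikipedia.org"))
    print(parse_main_page(SAMPLE_RESPONSE))
```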

The good news:

None, really.

The bad news:

MFE docs state that main pages can use the mf- selector prefix for specific elements to appear on the mobile version of the main page. However, EN Wiki does not even follow this convention, using an mp- prefix instead.

Continuing with the theme, none of this appears to be standardized, or even strongly following a shared convention. Not only is there no standard selector for the "TFA" section of the main page, but we also lack selectors for specific elements of a section on the main page (e.g. the image of Today's Featured Article). HE Wiki's main page doesn't even use the (assumedly-ubiquitous) convention of links w/ class = image for the image in its TFA section.
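If we wanted to measure how widely these conventions are followed, a quick scan of main page HTML could look like the sketch below. It assumes the mf-/mp- prefixes show up on id attributes and that image links are a tags whose class list contains "image". The sample markup is invented for the example and is not copied from any wiki.

```python
from html.parser import HTMLParser


class MainPageScanner(HTMLParser):
    """Collect ids with a known prefix and count <a class="image"> links."""

    def __init__(self, prefixes=("mf-", "mp-")):
        super().__init__()
        self.prefixes = tuple(prefixes)
        self.prefixed_ids = []
        self.image_links = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        element_id = attributes.get("id") or ""
        if element_id.startswith(self.prefixes):
            self.prefixed_ids.append(element_id)
        classes = (attributes.get("class") or "").split()
        if tag == "a" and "image" in classes:
            self.image_links += 1


def scan_main_page(html):
    """Return (list of prefixed ids, number of image links) for some HTML."""
    scanner = MainPageScanner()
    scanner.feed(html)
    return scanner.prefixed_ids, scanner.image_links


# Invented markup for illustration only.
SAMPLE_HTML = (
    '<div id="mp-tfa"><a class="image" href="#"><img src="tfa.jpg"></a></div>'
    '<div id="mf-news"><a href="#">story</a></div>'
    '<div id="sidebar"></div>'
)

if __name__ == "__main__":
    print(scan_main_page(SAMPLE_HTML))
```

A page that returns no prefixed ids and no image links (as HE Wiki's TFA section apparently would for the latter) is a candidate for the webview fallback.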

We might be able to pull out specific page elements from the mobile representation in order to render them natively, but I would caution against this approach, instead showing the main page as we did before while we focus our efforts on standardization & APIs.

What can we do with it right now

I didn't attempt this, but we might be able to get the mobile representation of the main page, and render it natively by:

  • stripping images and
    • showing the largest one as the "lead image"
    • showing all remaining content similar to how we show an article's lead section
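The "largest image" step in the list above could be sketched as follows. This is untested against real main pages; it assumes img tags carry numeric width and height attributes and simply skips images that don't. The class and function names are made up for the example.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect src/width/height for every <img> tag in the document."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        try:
            width = int(attributes.get("width") or 0)
            height = int(attributes.get("height") or 0)
        except ValueError:
            # Non-numeric sizes (e.g. "50%") are treated as unknown.
            width = height = 0
        self.images.append(
            {"src": attributes.get("src"), "width": width, "height": height}
        )


def pick_lead_image(html):
    """Return the image with the largest pixel area, or None if none is sized."""
    collector = ImageCollector()
    collector.feed(html)
    sized = [img for img in collector.images if img["width"] * img["height"] > 0]
    return max(sized, key=lambda img: img["width"] * img["height"], default=None)


if __name__ == "__main__":
    sample = (
        '<img src="thumb.jpg" width="100" height="80">'
        '<img src="potd.jpg" width="300" height="200">'
        '<img src="icon.png">'
    )
    print(pick_lead_image(sample))
```

Given the earlier finding that TFA images are thumbnail-sized, a real implementation would probably also need a minimum-size threshold before promoting anything to a lead image.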

Using the mobile version of the main page means we get less unnecessary data (since MFE will select mp-/mf--prefixed elements and remove nomobile ones), but it also means we lose out on some other cool data. For example, "Did You Know..." and "On This Day..." don't seem to appear on https://en.m.wikipedia.org.

In an ideal world...

Ideally, we'd standardize how content is featured across wikis (or how content is organized in general) and build an API around it. Until then, the only language-agnostic option seems to be to use the mobile front-end HTML (which already selects specific main page elements for mobile presentation) and extract stuff from that based on some inferred schema.

Develop en wiki today sections
Design fallback
Target specific large markets

Restricted Application added a subscriber: StudiesWorld. Jan 27 2016, 7:09 PM
MBinder_WMF closed this task as Resolved. Feb 8 2016, 7:32 PM