Mobile-sections replacement
Open, High, Public

Description

Intro/History/Motivation

The web team would like to use something similar to the mobile-sections* endpoints in the future. The current mobile-sections* implementation lacks some features and flexibility the web team needs before it can adopt the endpoints. We would also like to incorporate lessons from the performance-related investigations (the loot project). @Jdlrobson has created a set of patches that add a new pair of endpoints better suited to the web team's use.

Proposed changes (see also subtasks)

  • Because the loading model changes from a two-step model to a single step with on-demand references we're creating a new set of endpoints:
  • /page/formatted/{title} (everything except references)
  • /page/references/{title}/{revision?}
  • Instead of a two-step load of all content all the time, the new endpoints separate the reference sections so they can be loaded on demand.
  • Intra-wiki links use relative paths (./) instead of the /wiki/ prefix we have used since the action=mobileview days.
  • Expose page issues, hatnotes, and an infobox as separate JSON properties
  • Instead of the sections property on the lead response there will be a toc, intro, and a text property. The latter is the remainder of the lead section text (minus the intro). The intro is somewhat akin to the first (reading) paragraph but may contain a few more tags after the first <p>.
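
The single-step load with on-demand references described above could look roughly like this on the client side. This is a minimal sketch, not the service's own code: the base URL is the experimental one from the example in this task, and `LazyReferences` and the injected `fetch` helper are hypothetical names introduced for illustration.

```python
from typing import Callable, Optional
from urllib.parse import quote

# Assumption: the experimental base URL used in the example in this task.
BASE = "http://localhost:6927/en.wikipedia.org/v1"


def formatted_url(title: str, base: str = BASE) -> str:
    """URL for everything except references."""
    return f"{base}/page/formatted/{quote(title, safe='')}"


def references_url(title: str, revision: Optional[str] = None, base: str = BASE) -> str:
    """URL for the references; the revision segment is optional."""
    url = f"{base}/page/references/{quote(title, safe='')}"
    return f"{url}/{revision}" if revision else url


class LazyReferences:
    """Fetches references at most once, and only when first asked for."""

    def __init__(self, title: str, revision: str, fetch: Callable[[str], dict]):
        self._url = references_url(title, revision)
        self._fetch = fetch  # hypothetical HTTP helper: URL -> parsed JSON
        self._loaded = False
        self._data: dict = {}

    def get(self) -> dict:
        if not self._loaded:
            self._data = self._fetch(self._url)
            self._loaded = True
        return self._data
```

Passing the revision from the page response into the references request keeps the reference links consistent with the content that was already loaded.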

Goals of this discussion

  • In this task we would like to decide what a replacement for the mobile-sections endpoints would look like, and how it would differ from the existing mobile-sections endpoints.
  • Let's see if we can even come up with better names for the first new endpoint, currently called formatted.
  • In addition to the web team, I would like to also open this up to Android, iOS app devs, and any other potential users of the API, so we can come up with an API that can be used by many.

Examples

Experimental new API (to be changed)

http://localhost:6927/en.wikipedia.org/v1/page/formatted/Barack_Obama

{
  "lead": {
    "ns": 0,
    "id": 534366,
    "revision": "765720215",
    "lastmodified": "2017-02-16T01:34:41Z",
    "lastmodifier": {
      "name": "Bbb23",
      "gender": "male"
    },
    "displaytitle": "Barack Obama",
    "normalizedtitle": "Barack Obama",
    "wikibase_item": "Q76",
    "description": "44th President of the United States of America",
    "protection": {
      "edit": [
        "autoconfirmed"
      ],
      "move": [
        "sysop"
      ]
    },
    "editable": false,
    "languagecount": 219,
    "image": {
      "file": "President Barack Obama.jpg",
      "urls": {
        "320": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/320px-President_Barack_Obama.jpg",
        "640": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/640px-President_Barack_Obama.jpg",
        "800": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/800px-President_Barack_Obama.jpg",
        "1024": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/1024px-President_Barack_Obama.jpg"
      }
    },
    "pronunciation": {
      "url": "//upload.wikimedia.org/wikipedia/commons/8/82/En-us-Barack-Hussein-Obama.ogg"
    },
    "spoken": {
      "files": [
        "File:Barack Obama.ogg"
      ]
    },
    "hatnotes": [
      "\"Barack\" and \"Obama\" redirect here. For other uses, see <a href=\"/wiki/Barack_(disambiguation)\" title=\"Barack (disambiguation)\">Barack (disambiguation)</a> and <a href=\"/wiki/Obama_(disambiguation)\" title=\"Obama (disambiguation)\">Obama (disambiguation)</a>."
    ],
    "infobox": "...(blob of HTML)...",
    "intro": "...(blob of HTML)...",
    "text": "...(very large blob of HTML)..."
  },
  "remaining": {
    "sections": [
      {
        "id": 1,
        "text": "...(very large blob of HTML)...",
        "toclevel": 1,
        "line": "Early life and career",
        "anchor": "Early_life_and_career"
      },
      ...(~50 more)...,
      {
        "id": 55,
        "text": "...(very large blob of HTML)...",
        "toclevel": 2,
        "line": "Other",
        "anchor": "Other"
      }
    ]
  }
}
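
As a rough illustration of how a client might consume the separated properties in the response above, the sketch below reassembles the lead HTML and builds a heading outline from `remaining.sections`. The function names, the `hatnote` wrapper class, and the ordering of the pieces are assumptions made for the example; the response shape is still experimental.

```python
from typing import List, Tuple


def render_lead_html(response: dict) -> str:
    """Join hatnotes, intro, infobox and remaining lead text into one HTML string."""
    lead = response["lead"]
    # The wrapper element and class name are illustrative, not part of the API.
    parts = [f'<div class="hatnote">{note}</div>' for note in lead.get("hatnotes", [])]
    # Ordering is a client decision; here the intro comes before the infobox.
    for key in ("intro", "infobox", "text"):
        if lead.get(key):
            parts.append(lead[key])
    return "\n".join(parts)


def section_outline(response: dict) -> List[Tuple[int, str, str]]:
    """Return (toclevel, heading, anchor) for each non-lead section."""
    sections = response.get("remaining", {}).get("sections", [])
    return [(s["toclevel"], s["line"], s["anchor"]) for s in sections]
```

Because hatnotes, infobox and intro arrive as separate properties, the client can reorder or omit them without DOM manipulation.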

Old API

https://en.wikipedia.org/api/rest_v1/page/mobile-sections-lead/Barack_Obama

{
  "ns": 0,
  "id": 534366,
  "revision": "765720215",
  "lastmodified": "2017-02-16T01:34:41Z",
  "lastmodifier": {
    "name": "Bbb23",
    "gender": "male"
  },
  "displaytitle": "Barack Obama",
  "normalizedtitle": "Barack Obama",
  "wikibase_item": "Q76",
  "description": "44th President of the United States of America",
  "protection": {
    "edit": [
      "autoconfirmed"
    ],
    "move": [
      "sysop"
    ]
  },
  "editable": false,
  "languagecount": 219,
  "image": {
    "file": "President Barack Obama.jpg",
    "urls": {
      "320": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/320px-President_Barack_Obama.jpg",
      "640": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/640px-President_Barack_Obama.jpg",
      "800": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/800px-President_Barack_Obama.jpg",
      "1024": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/1024px-President_Barack_Obama.jpg"
    }
  },
  "pronunciation": {
    "url": "//upload.wikimedia.org/wikipedia/commons/8/82/En-us-Barack-Hussein-Obama.ogg"
  },
  "spoken": {
    "files": [
      "File:Barack Obama.ogg"
    ]
  },
  "hatnotes": [
    "\"Barack\" and \"Obama\" redirect here. For other uses, see <a href=\"/wiki/Barack_(disambiguation)\" title=\"Barack (disambiguation)\">Barack (disambiguation)</a> and <a href=\"/wiki/Obama_(disambiguation)\" title=\"Obama (disambiguation)\">Obama (disambiguation)</a>."
  ],
  "sections": [
    {
      "id": 0,
      "text": "...(very large blob of HTML)..."
    },
    {
      "id": 1,
      "toclevel": 1,
      "anchor": "Early_life_and_career",
      "line": "Early life and career"
    },
    ...(~50 more)...,
    {
      "id": 55,
      "toclevel": 2,
      "anchor": "Other",
      "line": "Other"
    }
  ]
}

https://en.wikipedia.org/api/rest_v1/page/mobile-sections-remaining/Barack_Obama

{
  "sections": [
    {
      "id": 1,
      "text": "...(very large blob of HTML)...",
      "toclevel": 1,
      "line": "Early life and career",
      "anchor": "Early_life_and_career"
    },
    ...(~50 more)...,
    {
      "id": 55,
      "text": "...(very large blob of HTML)...",
      "toclevel": 2,
      "line": "Other",
      "anchor": "Other"
    }
  ]
}

https://en.wikipedia.org/api/rest_v1/page/mobile-sections/Barack_Obama

{
  "lead": {
    "ns": 0,
    "id": 534366,
    "revision": "765720215",
    "lastmodified": "2017-02-16T01:34:41Z",
    "lastmodifier": {
      "name": "Bbb23",
      "gender": "male"
    },
    "displaytitle": "Barack Obama",
    "normalizedtitle": "Barack Obama",
    "wikibase_item": "Q76",
    "description": "44th President of the United States of America",
    "protection": {
      "edit": [
        "autoconfirmed"
      ],
      "move": [
        "sysop"
      ]
    },
    "editable": false,
    "languagecount": 219,
    "image": {
      "file": "President Barack Obama.jpg",
      "urls": {
        "320": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/320px-President_Barack_Obama.jpg",
        "640": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/640px-President_Barack_Obama.jpg",
        "800": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/800px-President_Barack_Obama.jpg",
        "1024": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/1024px-President_Barack_Obama.jpg"
      }
    },
    "pronunciation": {
      "url": "//upload.wikimedia.org/wikipedia/commons/8/82/En-us-Barack-Hussein-Obama.ogg"
    },
    "spoken": {
      "files": [
        "File:Barack Obama.ogg"
      ]
    },
    "hatnotes": [
      "\"Barack\" and \"Obama\" redirect here. For other uses, see <a href=\"/wiki/Barack_(disambiguation)\" title=\"Barack (disambiguation)\">Barack (disambiguation)</a> and <a href=\"/wiki/Obama_(disambiguation)\" title=\"Obama (disambiguation)\">Obama (disambiguation)</a>."
    ],
    "sections": [
      {
        "id": 0,
        "text": "...(very large blob of HTML)..."
      },
      {
        "id": 1,
        "toclevel": 1,
        "anchor": "Early_life_and_career",
        "line": "Early life and career"
      },
      ...(~50 more)...,
      {
        "id": 55,
        "toclevel": 2,
        "anchor": "Other",
        "line": "Other"
      }
    ]
  },
  "remaining": {
    "sections": [
      {
        "id": 1,
        "text": "...(very large blob of HTML)..."
      },
      {
        "id": 2,
        "text": "...(very large blob of HTML)...",
        "toclevel": 2,
        "line": "Education",
        "anchor": "Education"
      },
      ...(~50 more)...,
      {
        "id": 55,
        "text": "...(very large blob of HTML)...",
        "toclevel": 2,
        "line": "Other",
        "anchor": "Other"
      }
    ]
  }
}

Notes
Restricted Application added a subscriber: Aklapper. Sep 28 2016, 9:26 PM

Is this the list of proposed changes? (from subtasks)

  • Expose page issues in MCS endpoints
  • Infobox, hatnote and Lead paragraph should be first class citizens on "lead"…
  • Do not duplicate sections property on lead and remaining

Let's see if we can even come up with better names for the mobile-sections-* endpoints.

The names seem fine to me: mobile-sections for all, mobile-sections-lead and -remaining for two-step loading. Is there any proposal around changing the name?

Decide what version number we want (current version is 0.8.0). Should we jump to 2.0?

Or start properly semver-ing and do 1.0. Or keep the 0.X and just bump minor (0.9.0). Under 1.0 every change is considered breaking.

Joaquin - correct all subtasks are currently proposed changes (but feel free to add more)

The new endpoints can be enabled by the environment variable MOBILE_CONTENT_SERVICE_EDGE_VERSION and exposed under /formatted/:title/:revision (but we can also change this again)

https://github.com/wikimedia/mediawiki-services-mobileapps/blob/master/routes/mobile-sections.js#L265

The versioning still needs to be worked out for this new endpoint. The apps are using this service in production. It's my opinion the versioning and the status of the API do not reflect this and should.

Jhernandez edited the task description. Oct 20 2016, 3:35 PM
bearND edited the task description. Nov 7 2016, 10:09 PM

Is this the list of proposed changes? (from subtasks)

  • Expose page issues in MCS endpoints
  • Infobox, hatnote and Lead paragraph should be first class citizens on "lead"…
  • Do not duplicate sections property on lead and remaining

Roughly. I've reorganized and fleshed out this list in the task description and added checkboxes to reflect the current state.

The names seem fine to me: mobile-sections for all, mobile-sections-lead and -remaining for two-step loading. Is there any proposal around changing the name?

We're going to have new endpoints since the whole premise of how things are loaded is going to change. From a two-step load it goes to a single step + references on demand. I still hope we can come up with a better name for the first one, instead of formatted.

Decide what version number we want (current version is 0.8.0). Should we jump to 2.0?

Or start properly semver-ing and do 1.0. Or keep the 0.X and just bump minor (0.9.0). Under 1.0 every change is considered breaking.

If bumping mobile-sections to 1.0 means that more people will use it then we're doing them a disservice. It may take a while but I think eventually, after the web team has adopted the new endpoints and it's clear that they are suitable for the web team, the apps should migrate to the new endpoints as well. I'm going to keep an eye on the development to ensure that the new endpoint would be suitable for at least the Android app as well. Then mobile-sections would become deprecated.

bearND changed the title from "Mobile-sections 2.0" to "Mobile-sections replacement". Nov 7 2016, 11:23 PM
bearND edited the task description.
GWicke added a subscriber: GWicke. Nov 11 2016, 5:01 PM
Fjalapeno added a subscriber: Fjalapeno.

Adding Android and Web projects so they can track this task

Not to bikeshed, but it's been mentioned a few times that this end point seems like it requires a rename.

page/formatted

As "formatted" is a bit ambiguous. (formatted for what? why?)

One idea is:

page/json

Anyone have other thoughts here?

How about page/section? I'm not a big fan of calling things "mobile" this or that because I think anything mobile should be applicable for desktop too. I dropped the plural since it's not pages or feeds

Well https://en.wikipedia.org/api/rest_v1/page/html/San_Francisco does provide HTML so
https://en.wikipedia.org/api/rest_v1/page/json/San_Francisco would be a better name than /formatted

"/structured" or "/model" (MVC) could also be good

Pchelolo added a subscriber: Pchelolo. Edited Wed, Feb 1, 7:49 PM

And what exactly is the difference between mobile-sections and formatted outputs? How hard could it be to not separate them from each other? Adding one more endpoint means more re-renders, more purges, more load on ChangeProp, RESTBase and MCS, more storage needs (currently mobile-sections for wikipedias only is about 150 gigs of data) - these costs add up.

If the difference in format is drastic and we absolutely must have a separate endpoint - could you explain what exactly is the difference and what exactly that formatted thing is - that might help choosing a good name for it. Right now I don't really understand what it's doing, so can't comment on the naming question.

mobrovac added a subscriber: mobrovac.

And what exactly is the difference between mobile-sections and formatted outputs? How hard could it be to not separate them from each other? Adding one more endpoint means more re-renders, more purges, more load on ChangeProp, RESTBase and MCS, more storage needs - these costs add up.

+100. We have to think about maintenance costs here, since we would not be able to turn mobile-sections* off.

If the difference in format is drastic and we absolutely must have a separate endpoint - could you explain what exactly is the difference and what exactly that formatted thing is - that might help choosing a good name for it. Right now I don't really understand what it's doing, so can't comment on the naming question.

Exactly. Also, the relevant question regarding naming is: will these endpoints cater only to the needs of apps and mobile web? I.e. will there be data mangling as in the current mobile-sections* endpoints? If so, any name not indicating such an intent is not appropriate, IMO.

Because the loading model changes from a two-step model to a single step with on-demand references we're creating a new set of endpoints

There's a mobile-sections endpoint that gives both pieces of content together. In RESTBase it makes two reads from Cassandra and merges the content, but that shouldn't be a problem.

We should only have one endpoint. This is the plan as I understand it. Only one thing blocks that:
Working out what to do with backwards compatibility with the old endpoint

Right now we do not surface the formatted endpoint because we lack a plan for maintaining backwards compatibility with older apps. I'm not sure what would happen to older apps if they made use of the new endpoint - @Dbrant @bearND could you estimate the damage to Android if the old code were to consume the new endpoint's result? It's my understanding that infoboxes and disambiguation text are no longer in the body of the lead section, so they would no longer render inside old clients. There may also be some impact on listening to pronunciations.
(Thus I think we could reduce storage down to just storing the old lead section)

In terms of renaming the endpoint I think there was a disappointment with our original choice of name and a hope to revisit that. Of course if we can work out a way to replace the existing endpoint, this becomes tangential.

Could you elaborate on why we cannot turn an unstable endpoint off? Would a redirect to a new endpoint not be possible?

Pchelolo added a comment. Edited Wed, Feb 1, 8:33 PM

Could you elaborate on why we cannot turn an unstable endpoint off?

The apps team have no control over the version of the app people install and we really don't want to break the client.

@Dbrant @bearND do you have any statistics on the Android app versions that are out there in the wild? Some estimate of the update rate? Is it even possible that at some point all (99.999%) of the clients will update?

Would a redirect to a new endpoint not be possible?

As I understand the formats are incompatible, so a redirect wouldn't do us any good.

Perhaps we could do some kind of content negotiation using the Accept header and generate older content version on the fly, or transforming newer content to older content, but I still lack understanding of what the new content is and what's the difference.
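
To make the content-negotiation idea concrete, here is a minimal sketch of how a server-side layer might pick a response format from the Accept header. It assumes the version is carried in a `profile="…/<name>/<semver>"` parameter; that URL shape, the `1.0.0` cut-over version, and all function names are assumptions for illustration (only the current version 0.8.0 comes from this task).

```python
import re
from typing import Optional, Tuple

# Assumption: the profile parameter ends in "/<name>/<major.minor.patch>".
PROFILE_RE = re.compile(r'profile="[^"]*/(?P<name>[^/"]+)/(?P<version>\d+\.\d+\.\d+)"')

# Clients that send no profile are treated as legacy (0.8.0 per this task).
LEGACY = ("mobile-sections", (0, 8, 0))


def requested_profile(accept: Optional[str]) -> Tuple[str, Tuple[int, ...]]:
    """Extract (name, version tuple) from an Accept header, defaulting to legacy."""
    match = PROFILE_RE.search(accept or "")
    if match is None:
        return LEGACY
    version = tuple(int(part) for part in match.group("version").split("."))
    return match.group("name"), version


def wants_legacy_format(accept: Optional[str],
                        first_new: Tuple[int, ...] = (1, 0, 0)) -> bool:
    """True if the client asked for a version older than the hypothetical cut-over."""
    _, version = requested_profile(accept)
    return version < first_new
```

Whether old output can actually be derived from the new content is a separate question; this only shows the selection step.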

Naming

@Pchelolo This is the "next version" of mobile-sections. The primary difference is the split of references from the rest of the page. But there are other minor differences as well.

"formatted" is ambiguous, which is why I suggested "json".

"mobile-sections" is a strange name and is a bit misleading. Naming the API for the component rather than the whole is odd: it's like naming an API that returns a building "floors", or an API that returns a cabin "logs".

So it would be nice to correct this in the second version.

Android adoption

@Jdlrobson Turning the endpoint off with reasonable assurance of not affecting older versions on the Android app will be difficult. Apps are just updated at a much slower pace. And OS fragmentation on Android makes this even more difficult to predict.

Plan

For adoption, the current plans are:

  1. Mobile Web to adopt this endpoint for any PWA work
  2. Migrate the Android app from mobile-sections to the new endpoint
  3. Migrate the iOS app from mobileview to the new endpoint

There's the most important and controversial part missing from the plan: "Figure out how to shut down the old endpoint". I'm really not comfortable with introducing the new one without having a clear plan for shutting off the old one... Without this we will end up with 2 versions forever and that would have a very significant cost in terms of resources (both computer power and human power)

"mobile-sections" is a strange name and is a bit misleading. Naming the API for the component rather than the whole is odd: it's like naming an API that returns a building "floors", or an API that returns a cabin "logs".

As I stated in my previous post, if the output of the next-gen routes is not manipulated to such an extent that it is effectively usable only in the mobile context, then I would tend to agree. But if that's not the case then we cannot have a generic name such as formatted or json; we must indicate somehow that it is intended for mobile usage.

For adoption, the current plans are:

  1. Mobile Web to adopt this endpoint for any PWA work

Is the point of Mobile Web using PWA still under consideration or has it been settled?

There's the most important and controversial part missing from the plan: "Figure out how to shut down the old endpoint". I'm really not comfortable with introducing the new one without having a clear plan for shutting off the old one... Without this we will end up with 2 versions forever and that would have a very significant cost in terms of resources (both computer power and human power)

I agree. However, a possible solution could also be for old clients to pay the penalty of not updating. For example, once we establish that a certain percentage of clients has been updated (>9[0-9]%), we could turn off updates and caching for mobile-sections*. The percentage of clients would need to be such that (i) the extra load incurred by MCS does not have an impact on normal background updates; and (ii) the absolute number of affected clients is small enough so that we can justify it. Can we get to that point? Do we have data on updates and exact versions of the Android app users use?

As I stated in my previous post, if the output of the next-gen routes is not manipulated to such an extent that it is effectively usable only in the mobile context, then I would tend to agree. But if that's not the case then we cannot have a generic name such as formatted or json; we must indicate somehow that it is intended for mobile usage.

@mobrovac I guess we need to define "effectively usable only in the mobile context".

So far, the output here is for Android, iOS, and Mobile Web… but is there anything here that precludes the output from being used in other contexts?

It is definitely simplified. It is definitely JSON (well, it's HTML wrapped in JSON).

If we feel like this should not be used by any clients outside of mobile, I would be all for naming it "page/mobile". I'm just not sure how to answer this question. Anyone?

Is the point of Mobile Web using PWA still under consideration or has it been settled?

It is not settled… so I can't say it is 100% for sure that the web team will build a PWA, but I can say that a PWA product and technical strategy is being actively developed over this quarter.

I can also say that the improvements that @Jdlrobson has made (especially in separating out the references) have benefits for the other platforms as well. So it makes sense for us to get requirements from Web, Android, and iOS when developing this endpoint with the eventual goal of us all using the same API. (And make sure that we can mark this endpoint as stable and use it for a long time)

There's the most important and controversial part missing from the plan: "Figure out how to shut down the old endpoint". I'm really not comfortable with introducing the new one without having a clear plan for shutting off the old one... Without this we will end up with 2 versions forever and that would have a very significant cost in terms of resources (both computer power and human power)

I agree. However, a possible solution could also be for old clients to pay the penalty of not updating. For example, once we establish that a certain percentage of clients has been updated (>9[0-9]%), we could turn off updates and caching for mobile-sections*. The percentage of clients would need to be such that (i) the extra load incurred by MCS does not have an impact on normal background updates; and (ii) the absolute number of affected clients is small enough so that we can justify it. Can we get to that point? Do we have data on updates and exact versions of the Android app users use?

@Pchelolo @mobrovac Thanks for pointing this out - now I know that this is the biggest issue here. I'll talk with the Android team and get more answers and figure out a plan.

Just a few questions to guide me:

  1. Am I correct in saying: "The main issue is not that we are creating a new endpoint, but that the new output is fundamentally different from the old output, which means we would need to maintain 2 versions of this forever"?
  2. Another way of stating the above; is this also true: "Even if we used the same route and versioned the API, this would still be intolerable because we would still have cache fragmentation and different code to maintain"?
  3. The API is experimental, and only used officially by the Android app at WMF. Does this mean that if we can solve for Android we have no other issues in deprecating it and turning it off?

Mholloway added a comment. Edited Thu, Feb 2, 4:43 PM

Could you elaborate on why we cannot turn an unstable endpoint off?

The apps team have no control over the version of the app people install and we really don't want to break the client.

@Dbrant @bearND do you have any statistics on the Android app versions that are out there in the wild? Some estimate of the update rate?

No neat solutions here, but some statistics from Google Play: as of Jan. 31, just over 2/3 (68.17%) of installs of the stable version of the app are on the current version. About another 10% are on an approximately year-old release no longer available on Google Play, and from there a variety of releases of various ages mostly at ~2% or less.
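
For reference, these shares can be checked against the kind of adoption threshold floated earlier in the thread (somewhere above 90%). The sketch below is only arithmetic on install shares: the 68.17% and roughly 10% figures come from this comment, the remainder is lumped into one bucket, and the bucket labels and function names are made up for the example.

```python
from typing import Iterable, Mapping


def updated_share(install_shares: Mapping[str, float], updated: Iterable[str]) -> float:
    """Fraction of installs that are on one of the `updated` versions."""
    total = sum(install_shares.values())
    if total <= 0:
        raise ValueError("install shares must sum to a positive number")
    updated_set = set(updated)
    return sum(share for version, share in install_shares.items()
               if version in updated_set) / total


def threshold_met(install_shares: Mapping[str, float],
                  updated: Iterable[str],
                  threshold: float = 0.90) -> bool:
    """True once the updated share reaches the chosen threshold."""
    return updated_share(install_shares, updated) >= threshold


# Percentages as of Jan. 31; "long_tail" is simply everything else.
shares = {"current": 68.17, "year_old": 10.0, "long_tail": 21.83}
print(threshold_met(shares, ["current"]))
```

With only the current release counted as updated, the share is well below a 90% threshold, which is consistent with the concern that the old endpoint cannot be switched off soon.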

Is it even possible that at some point all (99.999%) of the clients will update?

That would actually be impossible, since in recent versions of the app we've dropped support for devices running older versions of Android that can still be found on lower-end devices being sold today. People running those Android versions can find a legacy version of the app to run on Google Play, but it won't be updated, except to address security issues.

That would actually be impossible, since in recent versions of the app we've dropped support for devices running older versions of Android that can still be found on lower-end devices being sold today. People running those Android versions can find a legacy version of the app to run on Google Play, but it won't be updated, except to address security issues.

@Mholloway Thought here… don't all previous versions of the app have a fallback to api.php for getting the page content?

ovasileva added a subscriber: ovasileva.

@Jdlrobson - moving to needs analysis for now - any suggestions on how to proceed with this from our side?

@ovasileva yeh, you can leave it in needs analysis. Good conversations are happening above and I will continue to contribute to them with @Fjalapeno's guidance until we have a clear answer.

Mholloway added a comment. Edited Thu, Feb 2, 8:35 PM

@Mholloway Thought here… don't all previous versions of the app have a fallback to api.php for getting the page content?

Good point. The Android app does still have a MediaWiki API fallback mechanism in place, so absent any breaking changes to the relevant output (primarily from action=mobileview), older clients should continue working via the fallback even if we shut down /mobile-sections*.

Jdlrobson added a comment. Edited Thu, Feb 2, 9:51 PM

This seems like the best migration strategy. Thoughts?

  1. It sounds like we should build support for both endpoints into the next(?) Android release (let's say the current version is "version Andrew" and the new version is "version Bob"). We could make Bob able to handle both the current endpoint (mobile-sections) and the new one (tbc). If mobile-sections fails, it tries the newly named endpoint. It should be intelligent enough to stop querying the old endpoint once it realises it no longer exists, and should also be able to fall back to the mobileview API when mobile-sections is down.
  2. When usage of "version Bob" gets significantly low: we then release "version Chris", which has no knowledge of the old endpoint, and make the RESTBase change.
  3. Version Andrew will now fall back to the mobileview API, whereas Bob and Chris will be using the new endpoint (as Bob is trained in both).
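
The behaviour proposed for "version Bob" can be sketched as an ordered fallback chain that remembers when an endpoint has gone away for good. This is an illustration in Python rather than Android code; `EndpointGone`, `PageLoader`, and the endpoint labels are hypothetical, and using `OSError` for transient failures is an assumption.

```python
from typing import Callable, List, Set, Tuple


class EndpointGone(Exception):
    """Raised by a fetcher when the endpoint no longer exists (e.g. HTTP 404/410)."""


class PageLoader:
    """Tries endpoints in order and stops querying ones that have been removed."""

    def __init__(self, fetchers: List[Tuple[str, Callable[[str], dict]]]):
        self._fetchers = fetchers
        self._gone: Set[str] = set()

    def load(self, title: str) -> Tuple[str, dict]:
        for name, fetch in self._fetchers:
            if name in self._gone:
                continue  # learned earlier that this endpoint is gone
            try:
                return name, fetch(title)
            except EndpointGone:
                self._gone.add(name)  # permanent: never ask again
            except OSError:
                continue  # transient failure: try the next endpoint this time
        raise RuntimeError(f"no endpoint could serve {title!r}")
```

A client would register the fetchers in the order mobile-sections, new endpoint, mobileview, so that mobileview stays the last resort.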

bearND added a subscriber: TheDJ. Fri, Feb 3, 11:31 PM

Apologies for not chiming in earlier.

And what exactly is the difference between mobile-sections and formatted outputs?

As mentioned earlier, the biggest difference is that the formatted endpoint doesn't contain the reference sections.[1]
Other changes: The lead section text is not inside the sections property anymore. It's split up between intro and text properties, obviating the need to relocate the first paragraph through DOM manipulation. Hatnotes, page issues, and infobox are extracted from the lead (section) text.

[1] My hope is that loading everything except references in a single step would be performant enough. That's what the web team plans to do anyways AFAIU. If that's the case I think we could get rid of the lead + remaining properties level. When references are to be loaded the request should include the revision number to make sure that the reference links match.

@bearND could you estimate the damage there to Android if the old code was to use the new endpoint result? It's my understanding that infoboxes, disambiguation text are no longer in the body of the lead section so would not render any more inside old clients.

That's correct. Old clients would also be missing the infobox, page issues and references. I think the missing infobox and references are the biggest concerns at this time. The disambiguation entries are not very prominently surfaced in the Android app, at least in the latest releases.

since we would not be able to turn mobile-sections* off. [...]
The apps team have no control over the version of the app people install and we really don't want to break the client.

Actually, there are a couple of remote config values we could use to turn off RESTBase usage in the Android app: restbaseProdPercent + restbaseBetaPercent (see T126934). If we introduced a new config key for the new endpoint and newer app versions, turning the existing values down would make older app versions use the plain MediaWiki action=mobileview API for page content and TextExtracts for link previews. I consider this the last step, though. By going back to the action API some newer Android app features will be disabled (mainly the Wiktionary definition lookup IIRC).
Before turning RB usage off I envision an extended period of not pre-generating mobile-sections and just having MCS process it on demand. For the transition period I guess we want to adjust the cache-policy.
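
A percentage-based remote config value such as restbaseProdPercent could gate RESTBase usage roughly as sketched below. How the Android app actually buckets installs is not described in this task, so the hashing scheme, the `install_id` parameter, and the function names are assumptions; only the config key name comes from the comment above.

```python
import hashlib
from typing import Mapping


def bucket(install_id: str) -> int:
    """Map an install ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(install_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def use_restbase(install_id: str,
                 config: Mapping[str, int],
                 key: str = "restbaseProdPercent") -> bool:
    """True if this install falls inside the configured percentage.

    A missing key is treated as 0, i.e. fall back to the MediaWiki action API.
    """
    return bucket(install_id) < int(config.get(key, 0))
```

Setting the value to 0 sends every install that honours the key to the fallback, and 100 enables RESTBase for all of them.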

In terms of renaming the endpoint I think there was a disappointment with our original choice of name and a hope to revisit that. Of course if we can work out a way to replace the existing endpoint, this becomes tangential.

I'm not thrilled about the name mobile-sections. Making the backwards-incompatible changes seemed like a good opportunity to change the old name to something better. Unfortunately, it's unclear what a good name for this is. One idea I had was naming it read or reading (explanation at the end of the next paragraph).

Re: mobile: The only thing that is mobile-specific is the handling of main pages. For these MCS falls back to action=mobileview. That's mainly because the Parsoid HTML of main pages is not responsive. It looks horrible on smaller form factors. If we had the right CSS rules to make Parsoid output look good on small devices then this fallback to mobileview would not be needed. After watching this week's CREDIT presentation by @TheDJ, I'm hopeful that a solution for this is achievable.
The bulk of the DOM transformations done by MCS is more for reducing the payload by trying to get rid of DOM elements and attributes that don't seem to affect the reading experience. In contrast to that Parsoid HTML is geared towards editing pages.

Ok using inspiration from @Jdlrobson @bearND and @Mholloway above:

  1. We should release an update to the Android app that consults a new server side config option
  2. Eventually a new version that works with the new API will be released (this will NOT be affected by the new config option)
  3. For the transition period we need to support both APIs for a time

Open questions (at least for me, maybe someone here knows the answers):

  1. For the approximately 12% of Android users who are "stuck" on an older version (10% about a year old and 2% long tail), is the fallback on MW API sufficient? Or can they be updated some other way?
  2. How long is the "transition period" expected to be? Or what % adoption should be our threshold of shutting down the old API?
  3. Will we support both APIs during the transition as totally separate services OR will we refactor the old API to call through to the new one?
  4. Naming discussion is still TBD but we are getting closer to an answer

bearND added a comment. Tue, Feb 7, 1:25 AM

  1. For the approximately 12% of Android users who are "stuck" on an older version (10% about a year old and 2% long tail), is the fallback on MW API sufficient? Or can they be updated some other way?

I think the fallback should be OK. The patch could theoretically be backported to older app versions. I believe that's a painful process, though, and leave that to the Android app devs to decide if that would be really worth it. Maybe not by itself but if there was something else that needed backporting?

  2. How long is the "transition period" expected to be? Or what % adoption should be our threshold of shutting down the old API?

I was thinking about giving it at least a year or so. I hope adoption of v2 will be quicker and reach a higher level than it did for v1. I think it would be good to also have a month or two where we have turned the restbaseProdPercent + restbaseBetaPercent down to 0 before turning off the actual endpoint. Once we disable storage for v1 we should keep an eye on the external request rate for it.[1] There could be users besides the Android app, too. For example, we still haven't turned off the mobile-text endpoint.

[1] https://www.mediawiki.org/wiki/Wikimedia_Apps/Team/RESTBase_services_for_apps#Route_usage

  3. Will we support both APIs during the transition as totally separate services OR will we refactor the old API to call through to the new one?

That one is probably more a question for the RESTBase guys.
I believe in MCS it should be separate endpoints but in RB it could be the same just with a different content-type profile version. We haven't done a split for MCS endpoints yet. So, I'm not sure what the best practice for this is.
I think there might be more changes for v2 down the road, which would make the structure less compatible with v1 than it already is right now. It would be cleaner to have a separate endpoint IMO.

Working with @Tbayer to estimate the number of users that will be affected by deprecating this API and removing it

To be clear - I mean users that are NOT coming from the Android app.

Also, @bearND came up with some reasonable strategies for migrating the Android users to the new API using the existing config flags. So it looks like we will be good on that front.

The Android app sent 99.6% of requests to this API during one recent one-week timespan. In absolute terms, there were 218805 requests (or about 30k/day) not from the Android app.

I also ran a query for the most frequent non-Android-app user agents, see below.

SELECT
SUM( IF(access_method = 'mobile app' AND user_agent_map['os_family'] = 'Android', 1, 0) )
/SUM(1) AS Android_app_percentage,
SUM(1) AS total_mobilesectionsAPI_requests
FROM wmf.webrequest
WHERE year = 2017 AND month = 2 AND day >= 3 AND day <=9
AND uri_path LIKE '/api/rest_v1/page/mobile-sections%';

android_app_percentage  total_mobilesectionsapi_requests
0.996305444171708       59223628
1 row selected (5271.287 seconds)
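
The absolute figures quoted in this comment can be re-derived from the two numbers in the query output above. A minimal sketch of that arithmetic, assuming the one-week window covers the seven days from 3 to 9 February inclusive:

```python
# Figures copied from the query output above.
total_requests = 59_223_628
android_share = 0.996305444171708
days = 7  # 3 to 9 February 2017, inclusive

# Requests that did not come from the Android app.
non_android = round(total_requests * (1 - android_share))
per_day = non_android / days

print(non_android)     # 218805
print(round(per_day))  # 31258, i.e. roughly 30k/day
```
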
SELECT user_agent, SUM(1) AS mobilesectionsAPI_requests
FROM wmf.webrequest
WHERE year = 2017 AND month = 2 AND day >= 3 AND day <=9
AND uri_path LIKE '/api/rest_v1/page/mobile-sections%'
AND NOT (access_method = 'mobile app' AND user_agent_map['os_family'] = 'Android')
GROUP BY user_agent
ORDER BY mobilesectionsAPI_requests DESC
LIMIT 20;

user_agent      mobilesectionsapi_requests
okhttp/2.6.0    150837
MWOffliner/HEAD (dbrant@wikimedia.org)  52755
node-fetch/1.0 (+https://github.com/bitinn/node-fetch)  13727
[...]
20 rows selected (10745.389 seconds)

[rest of the list redacted in case there are potential privacy concerns, #4 is a Safari user agent with only 170 views, and most of the rest are Safari versions too.]

Note: The "MWOffliner" user agent was from me, creating an offline ZIM collection from 50k articles. This tool will be able to adapt to any new endpoint format, and shouldn't be a blocker.

Given that you are considering making this a major backwards-incompatible change, would it make sense to also enable streaming HTML rendering of the response? This would avoid major performance regressions on slow connections, and might even improve performance over the current lead section loading. Metadata could either be inlined in the head (if absolutely needed for rendering), or served in a separate JSON end point (if the first content can be rendered without this metadata).

@GWicke I was wondering this myself. @Niedzielski is doing some research auditing the existing code for the apps, and evaluating a move to streaming HTML will be part of that audit (T152991: Audit iOS, Android and Mobile Web HTML, CSS, and JS).

Looking at mobile sections output, there is some overlap with the summary endpoint (display title, description, thumbnail):

"ns": 0,
  "id": 4269567,
  "revision": "764668785",
  "lastmodified": "2017-02-10T06:40:26Z",
  "lastmodifier": {
    "name": "Futurulus",
    "gender": "unknown"
  },
  "displaytitle": "Dog",
  "wikibase_item": "Q144",
  "description": "domestic animal",
  "protection": {
    "edit": [
      "autoconfirmed"
    ],
    "move": [
      "sysop"
    ]
  },
  "editable": false,
  "languagecount": 220,
  "image": {
    "file": "Collage of Nine Dogs.jpg",
    "urls": {
      "320": "//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Collage_of_Nine_Dogs.jpg/320px-Collage_of_Nine_Dogs.jpg",
      "640": "//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Collage_of_Nine_Dogs.jpg/640px-Collage_of_Nine_Dogs.jpg",
      "800": "//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Collage_of_Nine_Dogs.jpg/800px-Collage_of_Nine_Dogs.jpg",
      "1024": "//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Collage_of_Nine_Dogs.jpg/1024px-Collage_of_Nine_Dogs.jpg"
    }
  },
  "hatnotes": [
    "This article is about the domestic dog. For related species known as \"dogs\", see <a href=\"/wiki/Canidae\" title=\"Canidae\">Canidae</a>. For other uses, see <a href=\"/wiki/Dog_(disambiguation)\" title=\"Dog (disambiguation)\">Dog (disambiguation)</a>.",
    "\"Doggie\" redirects here. For the Danish artist, see <a href=\"/wiki/Doggie_(artist)\" title=\"Doggie (artist)\">Doggie (artist)</a>."
  ],

The other information included here (omitted above) is the section titles. I am working off the cuff, but it seems like a lot of this could be added to the summary, and I think it would make sense. Basically, everything that is used to build the native components (and editing business logic) could be downloaded from the summary endpoint, and the content, which is purely HTML, could then be downloaded from the new endpoint.

Thoughts?

Adding a bit more information to the summary end point sounds like an interesting idea to me. I definitely like the consolidation aspect. However, I don't know enough about what the most common use cases are to really evaluate pros / cons, especially wrt performance.

One potential advantage might be that the summary information could already be in cache from using the page preview (web) or equivalent in the app. This would make the summary request on full page load basically free.

bearND added a comment. Edited Tue, Feb 14, 6:28 AM

The only common properties between summary and mobile-sections-lead right now are description and, with a bit of variation, some form of titles. In the future, once we get a thumbnail API, then the thumbnail for the lead image should be common as well.

In the early beginnings of MCS I was experimenting with an HTML content type to be used by the Android app. I abandoned that approach mainly for two reasons: 1) this seemed like a bigger step which would have required a lot more refactoring in the Android app, plus at that time the RB environment was very new for the app, and 2) our desire then to replace the WebView with native components. Now I'd say #1 is probably not so much of an issue anymore. I haven't heard much about #2 lately.
The main reason I'm a bit hesitant to switch the output to an HTML content type nowadays is that I believe the web team might prefer JSON. My main motivation for this new endpoint is adoption by the web team. @Jdlrobson what do you think about HTML output?

This seems to go a bit counter to the tendency in the new endpoint of extracting properties from HTML into their own JSON properties (e.g. intro, (lead) text, infobox, issues, hatnotes). It wouldn't make sense to me to have encoded HTML portions inside JSON inside HTML <head>. If we go the HTML route, then the MCS output should simply mark the respective elements using HTML attributes (class or something else) instead, and then rely on CSS rules to adjust the layout.

Looking over the data we currently extract from the HTML, it seems that it is convenient to deliver it as structured JSON for 3 different purposes:

  1. To build native components (i.e. the TOC)
  2. To power business logic (i.e. protection status, editable)
  3. To re-assemble into better formatted HTML for the device (i.e. lead section, info box)

@bearND does this seem like a fair assessment? If so…

In regards to @GWicke's suggestion, I think streaming HTML could solve for purpose 3 if we assembled everything on the server.

For purpose 1 & 2, that seems like the part where we really want JSON. My suggestion here is that we should put everything that falls in these categories in a separate endpoint that delivers that JSON. The summary seems like a good candidate here, because that's what much of this is: protection status, editable, wikidata item, thumbnail, last modified, etc… I would say that even the ToC is a good fit here: a list of section titles seems like a good summary of the article.

Also technically there is some nice separation:

  • Summary endpoint: delivers JSON used for business rules and to build native (or web app) components that are separate from the main content
  • Content endpoint: delivers HTML used for the main content of the article

Any thoughts here?

Your points are a fair assessment of the current state.
For the future:
#1 I definitely agree.
#2 probably agree. I think a point could be made to share business logic between apps and web in JS eventually.
#3 As I said before, I think if we go the HTML route for the content then we should consider not extracting the lead intro and text anymore and just have CSS rules to affect the position depending on screen dimensions.

Adding a lot more fields to the summary endpoint would bloat it up. I'm not sure how the web team would feel about having a lot of unused payload in there when they want to show a hovercard.

@bearND So maybe a better future state would be:

  • Summary endpoint: delivers JSON used to create previews
  • Metadata endpoint: delivers JSON used to power business rules and to build native (or web app) components that are separate from the main content
  • Content endpoint: delivers HTML used for the main content of the article
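
To make the three-way split concrete, here is a minimal sketch that partitions the properties of the mobile-sections-lead example quoted earlier in the thread. The property names come from that example; which bucket each property lands in is purely an assumption for illustration, since the thread has not settled it, and `split_lead` is a hypothetical helper.

```python
# Hypothetical assignment of properties to endpoints (not decided in this task).
PREVIEW_FIELDS = {"displaytitle", "description", "image"}
METADATA_FIELDS = {
    "ns", "id", "revision", "lastmodified", "lastmodifier",
    "wikibase_item", "protection", "editable", "languagecount",
}


def split_lead(lead: dict) -> dict:
    """Partition a mobile-sections-lead style dict into per-endpoint payloads."""
    result = {"summary": {}, "metadata": {}, "content": {}}
    for key, value in lead.items():
        if key in PREVIEW_FIELDS:
            result["summary"][key] = value
        elif key in METADATA_FIELDS:
            result["metadata"][key] = value
        else:
            # HTML-bearing properties (hatnotes, sections, ...) stay with the content.
            result["content"][key] = value
    return result


lead = {
    "displaytitle": "Dog",
    "description": "domestic animal",
    "editable": False,
    "languagecount": 220,
    "hatnotes": ["This article is about the domestic dog."],
}
parts = split_lead(lead)
print(sorted(parts["summary"]))   # ['description', 'displaytitle']
print(sorted(parts["metadata"]))  # ['editable', 'languagecount']
print(sorted(parts["content"]))   # ['hatnotes']
```

Keeping the preview set small is what addresses the hovercard concern: the summary payload stays lean while the business-logic fields move to their own response.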

We also have the option of doing #3 (re-assemble into better formatted HTML, or extract data) in a streaming fashion on the client. For JS, we have web-html-stream to do this. Combined with decent markup (ex: <section> tags (T114072), namespaced attribute markers, <figure> and <video> for media, optional elements like references omitted), this can provide a lot of flexibility.

Depending on the size of the "business logic" metadata, it could also be possible to serve it in a HTTP header. This would avoid the need for an extra request, and control prioritization on slow connections. You probably wouldn't want to serve larger content like section headings this way, though. If you don't need those upfront, then it might be possible to extract those from the HTML stream.
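
As a rough illustration of extracting section headings from an HTML stream, the sketch below uses the incremental parser from Python's standard library rather than web-html-stream. It assumes, hypothetically, that every section title is an <h2> element; the real markup may differ, and `HeadingCollector` is an illustrative name.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of <h2> elements while HTML arrives in chunks."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._buffer = None  # None while outside an <h2>

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._buffer = []

    def handle_data(self, data):
        # Text may arrive in several pieces when a chunk boundary splits it.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._buffer is not None:
            self.headings.append("".join(self._buffer).strip())
            self._buffer = None


collector = HeadingCollector()
chunks = [
    "<section><h2>His",
    "tory</h2><p>Text</p></section>",
    "<section><h2>Biology</h2></section>",
]
for chunk in chunks:
    collector.feed(chunk)  # feed() accepts partial documents
collector.close()
print(collector.headings)  # ['History', 'Biology']
```

The client never holds the full document or builds a DOM tree; it only reacts to the few tags it cares about as they pass by.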

The only common properties between summary and mobile-sections-lead right now are description and, with a bit of variation, some form of titles. In the future, once we get a thumbnail API, then the thumbnail for the lead image should be common as well.

In the early beginnings of MCS I was experimenting with an HTML content type to be used by the Android app. I abandoned that approach mainly for two reasons: 1) this seemed like a bigger step which would have required a lot more refactoring in the Android app, plus at that time the RB environment was very new for the app, and 2) our desire then to replace the WebView with native components. Now I'd say #1 is probably not so much of an issue anymore. I haven't heard much about #2 lately.
The main reason I'm a bit hesitant to switch the output to an HTML content type nowadays is that I believe the web team might prefer JSON. My main motivation for this new endpoint is adoption by the web team. @Jdlrobson what do you think about HTML output?

This seems to go a bit counter to the tendency in the new endpoint of extracting properties from HTML into their own JSON properties (e.g. intro, (lead) text, infobox, issues, hatnotes). It wouldn't make sense to me to have encoded HTML portions inside JSON inside HTML <head>. If we go the HTML route, then the MCS output should simply mark the respective elements using HTML attributes (class or something else) instead, and then rely on CSS rules to adjust the layout.

The reason I prefer JSON is that it's more malleable. The app and web experiences are different and should be allowed to be different. As soon as you send HTML, the client (and server if you are doing server-side rendering) now has to worry about parsing a DOM tree and working out what goes where. I want clients to be as dumb as possible. To take what they need and render it.

If streaming is the goal, a hybrid approach (e.g. sending a JSON blob of the lead, followed by a marker, followed by the HTML of the remaining sections) may be better, but this then restricts what we can do with the remaining sections. We are currently experimenting with not rendering the references section. In addition, the JSON format of the remaining sections allows us to easily construct the table of contents on mobile, for instance, and to easily provide section collapsing in the React.js web app (without it that would be near impossible). Could we use multiple URLs to stitch JSONs together in the service worker, for example?
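
A minimal sketch of what consuming such a hybrid response could look like, assuming a hypothetical marker line between the JSON lead and the HTML of the remaining sections. The marker string and the `parse_hybrid` helper are invented for illustration; no such format exists in MCS today.

```python
import json

# Hypothetical delimiter. It contains raw newlines, which cannot occur inside
# a JSON string, so it cannot be confused with a value in the lead.
MARKER = "\n--remaining-sections--\n"


def parse_hybrid(payload: str):
    """Split a hybrid payload into (lead dict, remaining HTML)."""
    head, sep, rest = payload.partition(MARKER)
    if not sep:
        raise ValueError("marker not found in payload")
    return json.loads(head), rest


payload = (
    json.dumps({"displaytitle": "Dog", "editable": False})
    + MARKER
    + "<section><h2>Biology</h2></section>"
)
lead, remaining_html = parse_hybrid(payload)
print(lead["displaytitle"])  # Dog
print(remaining_html)        # <section><h2>Biology</h2></section>
```

The lead can be parsed and rendered as soon as the marker arrives, but the remaining sections are then opaque HTML, which is the restriction on section collapsing and the table of contents mentioned above.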

So in summary, I'm pretty attached to the lead being JSON, but I think there's a sweet spot for a hybrid approach somewhere...

It's actually possible to stream JSON, there's a bunch of libraries for doing that, for example this one: http://oboejs.com

We can add support for that server-side if needed.

Sounds like we have some things to work out between web and apps.

@Jdlrobson @Niedzielski I am going to grab you two as representatives of web and apps and setup a meeting to try and figure out what the requirements are and how we want to use the data from the end points.

On using a hybrid approach vs using headers for structured data vs using separate APIs:
I think the goal of each of these is the same: extract what is needed from the HTML server side and deliver it to clients as structured JSON, so that the HTML does not need to be parsed on the client.

If the concern about HTML parsing is about performance, then I don't think it is necessarily warranted. Shallow, streaming HTML parsing / processing can actually be very cheap, on the order of 1 ms per MB of HTML: https://github.com/wikimedia/web-html-stream#performance. This might be faster than parsing the same HTML wrapped in JSON, even when using a fully blocking parser.

JSON and HTML each have their sweet spots: JSON for representing structured data, HTML for representing documents. JSON is easy to work with in a synchronous context. HTML gives you a lot of flexibility for matching & replacing bits of content, for example to adjust how media is rendered on a specific device, or how / whether an infobox is shown.

@GWicke (at least for me) I think the concern about HTML parsing is more about reimplementing brittle HTML parsing multiple times across several clients as we do now.

Your comments about HTML vs json also ring true to me... this is more about using the right tool for each context. And about separation of concerns. It's up to us to nail down what needs to be structured and what needs to be a document.

I'm going to start editing the description on this ticket to collect the requirements on this. I think we need to back up a few steps so that we have a good understanding of what the needs are before we dig into the solutions a bit more.

@GWicke (at least for me) I think the concern about HTML parsing is more about reimplementing brittle HTML parsing multiple times across several clients as we do now.

Is it the HTML parsing itself that is brittle, or is it the unpredictable markup you need to work with? The latter is fairly straightforward to address with server side pre-processing, so that you have a stable markup structure to work against. That part is really no different between JSON & HTML.

Niedzielski edited the task description. Fri, Feb 17, 3:09 AM

Some Android concerns below :]

I think the fallback should be OK.

Unfortunately, the Android MediaWiki fallback support only kicks in for protocol errors (except 404) AFAIK. I tried changing a field type in the mobile-sections-lead response and saw an error page in the app, but the error is not considered significant, so the MediaWiki fallback never kicks in. The scenario of missing fields in a new response type would likely behave similarly, and in some scenarios errors are intentionally stifled (see T145075 and the thread "[draft] incident report -- crash on startup for non-English users" for additional discussion). For fallback to kick in, the endpoint must return a status other than 404 (e.g., https://en.wikipedia.org/api/rest_v1/foobar returns 404, which would be considered insignificant).

The patch could theoretically be backported to older app versions. I believe that's a painful process, though, and leave that to the Android app devs to decide if that would be really worth it. Maybe not by itself but if there was something else that needed backporting?

This would be a little lame so I hope we don't have to go this route. If we do, short-circuiting an "isRestBaseEnabled()" method might be the most practical way to ensure old clients still work. Maybe we should set a "timebomb" MediaWiki fallback in future releases of the app to avoid this and similar problems going forward, but this might also discourage good API versioning.

I was thinking about giving it at least a year or so. I hope adoption of v2 will be quicker and reach a higher level than it did for v1. I think it would be good to also have a month or two where we have turned the restbaseProdPercent + restbaseBetaPercent down to 0 before turning off the actual endpoint. Once we disable storage for v1 we should keep an eye on the external request rate for it.[1] There could be users besides the Android app, too. For example, we still haven't turned off the mobile-text endpoint.

This remote kill switch regrettably is not version sensitive at the moment, so it would affect all versions of the app. There's some good discussion around versioning the configuration that we might consider for future releases (or we could go with the timebomb approach).

Is it the HTML parsing itself that is brittle, or is it the unpredictable markup you need to work with? The latter is fairly straightforward to address with server side pre-processing, so that you have a stable markup structure to work against. That part is really no different between JSON & HTML.

I really appreciated the considerations mentioned between JSON and HTML. I understand that the content is super HTML heavy, but I want to give a quick mention that, with the exception of the WebView and an extremely small subset handled by TextView, most native Android components are not built for handling sophisticated HTML parsing or rendering. We'd have to rethink how the app consumes responses if we go back to parsing HTML. The old app-side MediaWiki parsing implementations weren't particularly pretty, so I'd prefer we somehow leverage the WebView or Duktape JavaScript to do HTML to JSON to Java unmarshalling app side. In general, I'll add that I find JSON responses much more human readable, of clearer intent, friendlier to newcomers, almost self-documenting, simpler to version, and easier to handle, but I admit it has its shortcomings.

For fallback to kick in, the endpoint must return a status other than 404 (e.g., https://en.wikipedia.org/api/rest_v1/foobar returns 404 which would be considered insignificant)

Right. Thank you. 404 wouldn't work since we need that for titles that don't exist. We could try other codes, like 410, 400, or 501.
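
The rule being discussed can be sketched as a small predicate. This is only a paraphrase of the behaviour described in this thread, not the Android app's actual code; `should_fall_back` is a hypothetical name, and treating every 4xx/5xx status other than 404 as a protocol error is an assumption.

```python
def should_fall_back(status: int) -> bool:
    """Return True when a client would switch to the MediaWiki API (assumed rule)."""
    if status == 404:
        # 404 is needed for titles that do not exist, so it must not trigger fallback.
        return False
    # Assumption: any other HTTP error status counts as a protocol error.
    return status >= 400


for code in (200, 404, 400, 410, 501):
    print(code, should_fall_back(code))
# 200 False
# 404 False
# 400 True
# 410 True
# 501 True
```

Under this reading, any of the candidate codes (410, 400, 501) would push old clients onto the fallback path, while a 200 response with an unexpected body would not.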

The patch could theoretically be backported to older app versions. I believe that's a painful process, though, and leave that to the Android app devs to decide if that would be really worth it. Maybe not by itself but if there was something else that needed backporting?

This would be a little lame so I hope we don't have to go this route. If we do, short-circuiting an "isRestBaseEnabled()" method might be the most practical way to ensure old clients still work. Maybe we should set a "timebomb" MediaWiki fallback in future releases of the app to avoid this and similar problems going forward, but this might also discourage good API versioning.

Yes, I agree that this is a bit too much, and using something like you suggested to short-circuit an "isRestBaseEnabled()" method is preferable over porting the page loading logic (too much goes on in there and too much code there has changed). I like the idea of a timebomb, esp. for the announcements endpoint.

This remote kill switch regrettably is not version sensitive at the moment so it would affect all versions of the app.

Yes, this would require a new remote config key to be used by newer app versions which can handle the new endpoint or new version of the endpoint.
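
A minimal sketch of how a second remote config key could keep the existing kill switch working for old clients while newer clients read their own value. The key name `restbaseV2ProdPercent`, the per-install bucket in the range 0 to 99, and the `uses_restbase` helper are all assumptions for illustration; only `restbaseProdPercent` is mentioned in this thread, and the beta percentage is ignored here.

```python
def uses_restbase(config: dict, supports_v2: bool, bucket: int) -> bool:
    """Decide whether this install should use RESTBase for page content.

    bucket is assumed to be a stable per-install number from 0 to 99.
    """
    # Newer app versions read the hypothetical v2 key; older ones only know the v1 key.
    key = "restbaseV2ProdPercent" if supports_v2 else "restbaseProdPercent"
    return bucket < config.get(key, 0)


# Old endpoint switched off for legacy clients, new endpoint fully rolled out.
config = {"restbaseProdPercent": 0, "restbaseV2ProdPercent": 100}
print(uses_restbase(config, supports_v2=False, bucket=42))  # False
print(uses_restbase(config, supports_v2=True, bucket=42))   # True
```

With separate keys, turning the old percentage down to 0 during the transition no longer affects app versions that can handle the new endpoint.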

In T146944#3035658, @Niedzielski wrote:
For fallback to kick in, the endpoint must return a status other than 404 (e.g., https://en.wikipedia.org/api/rest_v1/foobar returns 404 which would be considered insignificant)
Right. Thank you. 404 wouldn't work since we need that for titles that don't exist. We could try other codes, like 410, 400, or 501.

That's doable. We're already discussing hiding some endpoints from the docs, so we might as well hide these and configure them to always return whatever response code you want.