
Mobile-sections replacement
Closed, Duplicate (Public)

Description

Intro/History/Motivation

The web team would like to use something similar to the mobile-sections* endpoints in the future. The current mobile-sections* implementation lacks some features and flexibility the web team would need in order to adopt the endpoints. We would also like to incorporate what we learned from the performance-related investigations (the loot project). @Jdlrobson has created a set of patches that add a new pair of endpoints better suited to the web team's use.

Proposed changes (see also subtasks)

  • Because the loading model changes from a two-step model to a single step with on-demand references we're creating a new set of endpoints:
  • /page/formatted/{title} (everything except references)
  • /page/references/{title}/{revision?}
  • Instead of a two-step load of all content all the time the new endpoints separate the reference sections so they can be loaded on demand
  • Intra-wiki links use relative paths (./) instead of the /wiki/ prefix we have used since the action=mobileview days.
  • Expose page issues, hatnotes, and an infobox as separate JSON properties
  • Instead of the sections property on the lead response there will be a toc, intro, and a text property. The latter is the remainder of the lead section text (minus the intro). The intro is somewhat akin to the first (reading) paragraph but may contain a few more tags after the first <p>.
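
To make the two proposed routes concrete, here is a minimal sketch of how a client might build their URLs. The base URL is the local experimental service from the example in this task; the helper names are hypothetical and not part of any existing client.

```python
from typing import Optional
from urllib.parse import quote

# Assumption: base URL of the experimental local service shown in this task.
BASE_URL = "http://localhost:6927/en.wikipedia.org/v1"


def formatted_url(title: str) -> str:
    """URL for the page content without references (hypothetical helper)."""
    return f"{BASE_URL}/page/formatted/{quote(title, safe='')}"


def references_url(title: str, revision: Optional[str] = None) -> str:
    """URL for the reference sections; the revision path segment is optional."""
    url = f"{BASE_URL}/page/references/{quote(title, safe='')}"
    if revision is not None:
        # Pinning the revision keeps reference links in sync with the content.
        url += f"/{quote(str(revision), safe='')}"
    return url
```

A client would request `formatted_url(title)` first and only call `references_url(title, revision)` once the reader actually opens a reference.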

Goals of this discussion

  • In this task we would like to decide what a replacement for the mobile-sections endpoints would look like, and how it would differ from the existing mobile-sections endpoints.
  • Let's see if we can even come up with better names for the first new endpoint, currently called formatted.
  • In addition to the web team, I would like to also open this up to Android, iOS app devs, and any other potential users of the API, so we can come up with an API that can be used by many.

Examples

Experimental new API (to be changed)

http://localhost:6927/en.wikipedia.org/v1/page/formatted/Barack_Obama

{
  "lead": {
    "ns": 0,
    "id": 534366,
    "revision": "765720215",
    "lastmodified": "2017-02-16T01:34:41Z",
    "lastmodifier": {
      "name": "Bbb23",
      "gender": "male"
    },
    "displaytitle": "Barack Obama",
    "normalizedtitle": "Barack Obama",
    "wikibase_item": "Q76",
    "description": "44th President of the United States of America",
    "protection": {
      "edit": [
        "autoconfirmed"
      ],
      "move": [
        "sysop"
      ]
    },
    "editable": false,
    "languagecount": 219,
    "image": {
      "file": "President Barack Obama.jpg",
      "urls": {
        "320": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/320px-President_Barack_Obama.jpg",
        "640": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/640px-President_Barack_Obama.jpg",
        "800": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/800px-President_Barack_Obama.jpg",
        "1024": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/1024px-President_Barack_Obama.jpg"
      }
    },
    "pronunciation": {
      "url": "//upload.wikimedia.org/wikipedia/commons/8/82/En-us-Barack-Hussein-Obama.ogg"
    },
    "spoken": {
      "files": [
        "File:Barack Obama.ogg"
      ]
    },
    "hatnotes": [
      "\"Barack\" and \"Obama\" redirect here. For other uses, see <a href=\"/wiki/Barack_(disambiguation)\" title=\"Barack (disambiguation)\">Barack (disambiguation)</a> and <a href=\"/wiki/Obama_(disambiguation)\" title=\"Obama (disambiguation)\">Obama (disambiguation)</a>."
    ],
    "infobox": "...(blob of HTML)...",
    "intro": "...(blob of HTML)...",
    "text": "...(very large blob of HTML)..."
  },
  "remaining": {
    "sections": [
      {
        "id": 1,
        "text": "...(very large blob of HTML)...",
        "toclevel": 1,
        "line": "Early life and career",
        "anchor": "Early_life_and_career"
      },
      ...(~50 more)...,
      {
        "id": 55,
        "text": "...(very large blob of HTML)...",
        "toclevel": 2,
        "line": "Other",
        "anchor": "Other"
      }
    ]
  }
}
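
Since hatnotes, infobox, intro and text now arrive as separate properties of lead, a client has to decide where each part goes. A rough sketch of one way to join them; the ordering and the "hatnote" wrapper class are assumptions for illustration, not something the experimental API prescribes.

```python
from typing import Any, Dict, List


def render_lead_html(lead: Dict[str, Any]) -> str:
    """Join the separately exposed lead parts into one HTML string.

    Assumption: hatnotes first, then intro, infobox and the remaining
    lead text. A real client may place each part differently.
    """
    parts: List[str] = []
    for note in lead.get("hatnotes", []):
        # Hypothetical wrapper element so hatnotes can be styled as a group.
        parts.append(f'<div class="hatnote">{note}</div>')
    for key in ("intro", "infobox", "text"):
        value = lead.get(key)
        if value:  # skip parts that are missing or empty
            parts.append(value)
    return "\n".join(parts)
```

With the old format the same client would instead have to pull these elements out of sections[0].text through DOM manipulation.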

Old API

https://en.wikipedia.org/api/rest_v1/page/mobile-sections-lead/Barack_Obama

{
  "ns": 0,
  "id": 534366,
  "revision": "765720215",
  "lastmodified": "2017-02-16T01:34:41Z",
  "lastmodifier": {
    "name": "Bbb23",
    "gender": "male"
  },
  "displaytitle": "Barack Obama",
  "normalizedtitle": "Barack Obama",
  "wikibase_item": "Q76",
  "description": "44th President of the United States of America",
  "protection": {
    "edit": [
      "autoconfirmed"
    ],
    "move": [
      "sysop"
    ]
  },
  "editable": false,
  "languagecount": 219,
  "image": {
    "file": "President Barack Obama.jpg",
    "urls": {
      "320": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/320px-President_Barack_Obama.jpg",
      "640": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/640px-President_Barack_Obama.jpg",
      "800": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/800px-President_Barack_Obama.jpg",
      "1024": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/1024px-President_Barack_Obama.jpg"
    }
  },
  "pronunciation": {
    "url": "//upload.wikimedia.org/wikipedia/commons/8/82/En-us-Barack-Hussein-Obama.ogg"
  },
  "spoken": {
    "files": [
      "File:Barack Obama.ogg"
    ]
  },
  "hatnotes": [
    "\"Barack\" and \"Obama\" redirect here. For other uses, see <a href=\"/wiki/Barack_(disambiguation)\" title=\"Barack (disambiguation)\">Barack (disambiguation)</a> and <a href=\"/wiki/Obama_(disambiguation)\" title=\"Obama (disambiguation)\">Obama (disambiguation)</a>."
  ],
  "sections": [
    {
      "id": 0,
      "text": "...(very large blob of HTML)..."
    },
    {
      "id": 1,
      "toclevel": 1,
      "anchor": "Early_life_and_career",
      "line": "Early life and career"
    },
    ...(~50 more)...,
    {
      "id": 55,
      "toclevel": 2,
      "anchor": "Other",
      "line": "Other"
    }
  ]
}

https://en.wikipedia.org/api/rest_v1/page/mobile-sections-remaining/Barack_Obama

{
  "sections": [
    {
      "id": 1,
      "text": "...(very large blob of HTML)...",
      "toclevel": 1,
      "line": "Early life and career",
      "anchor": "Early_life_and_career"
    },
    ...(~50 more)...,
    {
      "id": 55,
      "text": "...(very large blob of HTML)...",
      "toclevel": 2,
      "line": "Other",
      "anchor": "Other"
    }
  ]
}

https://en.wikipedia.org/api/rest_v1/page/mobile-sections/Barack_Obama

{
  "lead": {
    "ns": 0,
    "id": 534366,
    "revision": "765720215",
    "lastmodified": "2017-02-16T01:34:41Z",
    "lastmodifier": {
      "name": "Bbb23",
      "gender": "male"
    },
    "displaytitle": "Barack Obama",
    "normalizedtitle": "Barack Obama",
    "wikibase_item": "Q76",
    "description": "44th President of the United States of America",
    "protection": {
      "edit": [
        "autoconfirmed"
      ],
      "move": [
        "sysop"
      ]
    },
    "editable": false,
    "languagecount": 219,
    "image": {
      "file": "President Barack Obama.jpg",
      "urls": {
        "320": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/320px-President_Barack_Obama.jpg",
        "640": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/640px-President_Barack_Obama.jpg",
        "800": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/800px-President_Barack_Obama.jpg",
        "1024": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/1024px-President_Barack_Obama.jpg"
      }
    },
    "pronunciation": {
      "url": "//upload.wikimedia.org/wikipedia/commons/8/82/En-us-Barack-Hussein-Obama.ogg"
    },
    "spoken": {
      "files": [
        "File:Barack Obama.ogg"
      ]
    },
    "hatnotes": [
      "\"Barack\" and \"Obama\" redirect here. For other uses, see <a href=\"/wiki/Barack_(disambiguation)\" title=\"Barack (disambiguation)\">Barack (disambiguation)</a> and <a href=\"/wiki/Obama_(disambiguation)\" title=\"Obama (disambiguation)\">Obama (disambiguation)</a>."
    ],
    "sections": [
      {
        "id": 0,
        "text": "...(very large blob of HTML)..."
      },
      {
        "id": 1,
        "toclevel": 1,
        "anchor": "Early_life_and_career",
        "line": "Early life and career"
      },
      ...(~50 more)...,
      {
        "id": 55,
        "toclevel": 2,
        "anchor": "Other",
        "line": "Other"
      }
    ]
  },
  "remaining": {
    "sections": [
      {
        "id": 1,
        "text": "...(very large blob of HTML)..."
      },
      {
        "id": 2,
        "text": "...(very large blob of HTML)...",
        "toclevel": 2,
        "line": "Education",
        "anchor": "Education"
      },
      ...(~50 more)...,
      {
        "id": 55,
        "text": "...(very large blob of HTML)...",
        "toclevel": 2,
        "line": "Other",
        "anchor": "Other"
      }
    ]
  }
}

Notes
  • mobile-sections-lead / remaining are used for all Android pageviews
  • Only the Android background page downloader service uses mobile-sections; newer versions of the app will no longer use this endpoint as part of T156917 or a related task
  • Android remote RESTBase / MediaWiki config options

Event Timeline


Not to bikeshed, but it's been mentioned a few times that this end point seems like it requires a rename.

page/formatted

As "formatted" is a bit ambiguous. (formatted for what? why?)

One idea is:

page/json

Anyone have other thoughts here?

How about page/section? I'm not a big fan of calling things "mobile" this or that because I think anything mobile should be applicable for desktop too. I dropped the plural since it's not pages or feeds

Well https://en.wikipedia.org/api/rest_v1/page/html/San_Francisco does provide HTML so
https://en.wikipedia.org/api/rest_v1/page/json/San_Francisco would be a better name than /formatted

"/structured" or "/model (MVC) could also be good

And what exactly is the difference between mobile-sections and formatted outputs? How hard could it be to not separate them from each other? Adding one more endpoint means more re-renders, more purges, more load on ChangeProp, RESTBase and MCS, more storage needs (currently mobile-sections for wikipedias only is about 150 gigs of data) - these costs add up.

If the difference in format is drastic and we absolutely must have a separate endpoint - could you explain what exactly is the difference and what exactly that formatted thing is - that might help choosing a good name for it. Right now I don't really understand what it's doing, so can't comment on the naming question.

mobrovac subscribed.

And what exactly is the difference between mobile-sections and formatted outputs? How hard could it be to not separate them from each other? Adding one more endpoint means more re-renders, more purges, more load on ChangeProp, RESTBase and MCS, more storage needs - these costs add up.

+100. We have to think about maintenance costs here, since we would not be able to turn mobile-sections* off.

If the difference in format is drastic and we absolutely must have a separate endpoint - could you explain what exactly is the difference and what exactly that formatted thing is - that might help choosing a good name for it. Right now I don't really understand what it's doing, so can't comment on the naming question.

Exactly. Also, the relevant question regarding naming is: will these endpoints cater only to the needs of apps and mobile web? I.e. will there be data mangling as in the current mobile-sections* endpoints? If so, any name not indicating such an intent is not appropriate, IMO.

Because the loading model changes from a two-step model to a single step with on-demand references we're creating a new set of endpoints

There's a mobile-sections endpoint that gives both pieces of content together. In RESTBase it makes two reads from Cassandra and merges the content, but that shouldn't be a problem.

We should only have one endpoint. This is the plan as I understand it. Only one thing blocks that:
Working out what to do with backwards compatibility with the old endpoint

Right now we do not surface the formatted endpoint due to a lack of a plan for maintaining backwards compatibility with older apps. I'm not sure what would happen to older apps if they make use of the new endpoint - @Dbrant @bearND could you estimate the damage there to Android if the old code was to use the new endpoint result? It's my understanding that infoboxes, disambiguation text are no longer in the body of the lead section so would not render any more inside old clients. There may also be some impact on listening to pronunciations.
(Thus I think we could reduce storage down to just storing the old lead section)

In terms of renaming the endpoint I think there was a disappointment with our original choice of name and a hope to revisit that. Of course if we can work out a way to replace the existing endpoint, this becomes tangential.

Could you elaborate on why we cannot turn an unstable endpoint off? Would a redirect to a new endpoint not be possible?

Could you elaborate on why we cannot turn an unstable endpoint off?

The apps team have no control over the version of the app people install and we really don't want to break the client.

@Dbrant @bearND do you have any statistics on the android app versions that are there in the wild? Some estimate on the update rate? Is it even possible that at some point all (99.999%) of the clients will update?

Would a redirect to a new endpoint not be possible?

As I understand the formats are incompatible, so a redirect wouldn't do us any good.

Perhaps we could do some kind of content negotiation using the Accept header and generate older content version on the fly, or transforming newer content to older content, but I still lack understanding of what the new content is and what's the difference.

Naming

@Pchelolo This is the "next version" of mobile-sections. The primary difference is the split of references from the rest of the page. But there are other minor differences as well.

"formatted" is ambiguous, which is why I suggested "json".

"mobile-sections" is a strange name and is a bit misleading. Naming the API for the component rather than the whole is odd: It's like naming an API that returns a building, "floors" or an API that returns a cabin, "logs".

So it would be nice to correct this in the second version.

Android adoption

@Jdlrobson Turning the endpoint off with reasonable assurance of not affecting older versions on the Android app will be difficult. Apps are just updated at a much slower pace. And OS fragmentation on Android makes this even more difficult to predict.

Plan

For adoption, the current plans are:

  1. Mobile Web to adopt this endpoint for any PWA work
  2. Migrate the android app from mobile-sections to the new endpoint
  3. Migrate the iOS app from mobileview to the new endpoint

There's the most important and controversial part missing from the plan: "Figure out how to shut down the old endpoint". I'm really not comfortable with introducing the new one without having a clear plan for shutting off the old one... Without this we will end up with 2 versions forever and that would have a very significant cost in terms of resources (both computer power and human power)

"mobile-sections" is a strange name and is a bit misleading. Naming the API for the component rather than the whole is odd: It's like naming an API that returns a building, "floors" or an API that returns a cabin, "logs".

As I stated in my previous post, if the output of the next-gen routes do not manipulate data to such an extent that is effectively usable only in the mobile context, then I would tend to agree. But if that's not the case then we cannot have a generic name such as formatted or json, we must indicate that it is intended for mobile usage somehow.

For adoption, the current plans are:

  1. Mobile Web to adopt this endpoint for any PWA work

Is the point of Mobile Web using PWA still under consideration or has it been settled?

There's the most important and controversial part missing from the plan: "Figure out how to shut down the old endpoint". I'm really not comfortable with introducing the new one without having a clear plan for shutting off the old one... Without this we will end up with 2 versions forever and that would have a very significant cost in terms of resources (both computer power and human power)

I agree. However, a possible solution could also be for old clients to pay the penalty of not updating. For example, once we establish that a certain percentage of clients has been updated (>9[0-9]%), we could turn off updates and caching for mobile-sections*. The percentage of clients would need to be such that (i) the extra load incurred by MCS does not have an impact on normal background updates; and (ii) the absolute number of affected clients is small enough so that we can justify it. Can we get to that point? Do we have data on updates and exact versions of the Android app users use?

As I stated in my previous post, if the output of the next-gen routes do not manipulate data to such an extent that is effectively usable only in the mobile context, then I would tend to agree. But if that's not the case then we cannot have a generic name such as formatted or json, we must indicate that it is intended for mobile usage somehow.

@mobrovac I guess we need to define "effectively usable only in the mobile context".

So far, the output here is for Android, iOS, and Mobile Web… but is there anything here that precludes the output from being used in other contexts?

It is definitely simplified. It is definitely JSON (well, it's HTML wrapped in JSON).

If we feel like this should not be used by any clients outside of mobile, I would be all for naming it "page/mobile". I'm just not sure how to answer this question. Anyone?

Is the point of Mobile Web using PWA still under consideration or has it been settled?

It is not settled… so I can't say it is 100% for sure that the web team will build a PWA, but I can say that a PWA product and technical strategy is being actively developed over this quarter.

I can also say that the improvements that @Jdlrobson has made (especially in separating out the references) have benefits for the other platforms as well. So it makes sense for us to get requirements from Web, Android, and iOS when developing this endpoint with the eventual goal of us all using the same API. (And make sure that we can mark this endpoint as stable and use it for a long time)

There's the most important and controversial part missing from the plan: "Figure out how to shut down the old endpoint". I'm really not comfortable with introducing the new one without having a clear plan for shutting off the old one... Without this we will end up with 2 versions forever and that would have a very significant cost in terms of resources (both computer power and human power)

I agree. However, a possible solution could also be for old clients to pay the penalty of not updating. For example, once we establish that a certain percentage of clients has been updated (>9[0-9]%), we could turn off updates and caching for mobile-sections*. The percentage of clients would need to be such that (i) the extra load incurred by MCS does not have an impact on normal background updates; and (ii) the absolute number of affected clients is small enough so that we can justify it. Can we get to that point? Do we have data on updates and exact versions of the Android app users use?

@Pchelolo @mobrovac Thanks for pointing this out - now I know that this is the biggest issue here. I'll talk with the Android team and get more answers and figure out a plan.

Just a few questions to guide me:

  1. Am I correct in saying: "The main issue is not that we are creating a new endpoint, but that the new output is fundamentally different from the old output, which means we would need to maintain 2 versions of this forever"?
  2. Another way of stating the above; is this also true: "Even if we used the same route and versioned the API, this would still be intolerable because we would still have cache fragmentation and different code to maintain"?
  3. The API is experimental, and only used officially by the Android app at WMF. Does this mean if we can solve for Android we have no other issues in deprecating it and turning it off?

Could you elaborate on why we cannot turn an unstable endpoint off?

The apps team have no control over the version of the app people install and we really don't want to break the client.

@Dbrant @bearND do you have any statistics on the android app versions that are there in the wild? Some estimate on the update rate?

No neat solutions here, but some statistics from Google Play: as of Jan. 31, just over 2/3 (68.17%) of installs of the stable version of the app are on the current version. About another 10% are on an approximately year-old release no longer available on Google Play, and from there a variety of releases of various ages mostly at ~2% or less.

Is it even possible that at some point all (99.999%) of the clients will update?

That would actually be impossible, since in recent versions of the app we've dropped support for devices running older versions of Android that can still be found on lower-end devices being sold today. People running those Android versions can find a legacy version of the app to run on Google Play, but it won't be updated, except to address security issues.

That would actually be impossible, since in recent versions of the app we've dropped support for devices running older versions of Android that can still be found on lower-end devices being sold today. People running those Android versions can find a legacy version of the app to run on Google Play, but it won't be updated, except to address security issues.

@Mholloway Thought here… don't all previous versions of the app have a fallback to api.php for getting the page content?

ovasileva subscribed.

@Jdlrobson - moving to needs analysis for now - any suggestions on how to proceed with this from our side?

@ovasileva yeah, you can leave it in needs analysis. Good conversations are happening above and I will continue to contribute to them with @Fjalapeno's guidance until we have a clear answer.

@Mholloway Thought here… don't all previous versions of the app have a fallback to api.php for getting the page content?

Good point. The Android app does still have a MediaWiki API fallback mechanism in place, so absent any breaking changes to the relevant output (primarily from action=mobileview), older clients should continue working via the fallback even if we shut down /mobile-sections*.

This seems like the best migration strategy. Thoughts?

  1. It sounds like we should build support for both endpoints into the next(?) Android release (let's say the current version is version Andrew and the new version is called Bob). Bob would be able to handle both the current endpoint (mobile-sections) and the new one (tbc). If mobile-sections fails, it tries the newly named endpoint. It should be intelligent enough to stop querying the old endpoint once it realises it no longer exists, and should also be able to fall back to the mobileview API when mobile-sections is down.
  2. When usage of "version Bob" gets significantly low: We then release "version Chris", which has no knowledge of the old endpoint and do the RESTBase change.
  3. Version Andrew will now fallback to the mobileview API, whereas Bob and Chris will be using the new one (as Bob is trained in both).
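
The fallback order described for "version Bob" could look roughly like the sketch below. The three fetchers are injected stand-ins for the old endpoint, the new endpoint and the mobileview API; the class and its names are hypothetical. For brevity it treats any failure of the old endpoint as "gone", whereas a real client would want to distinguish a permanent removal from a transient error.

```python
from typing import Callable, Dict, Optional

# A fetcher returns the page payload, or None when its source is unavailable.
Fetcher = Callable[[str], Optional[Dict]]


class PageLoader:
    """Hypothetical "version Bob" client that knows both endpoints."""

    def __init__(self, old_endpoint: Fetcher, new_endpoint: Fetcher,
                 mobileview: Fetcher) -> None:
        self._old = old_endpoint
        self._new = new_endpoint
        self._mobileview = mobileview
        self._old_gone = False  # set once the old endpoint stops answering

    def load(self, title: str) -> Optional[Dict]:
        if not self._old_gone:
            page = self._old(title)
            if page is not None:
                return page
            # Simplification: any failure marks the old endpoint as gone.
            self._old_gone = True
        page = self._new(title)
        if page is not None:
            return page
        # Last resort: the MediaWiki action API (action=mobileview).
        return self._mobileview(title)
```

After the first failed call the loader stops querying the old endpoint, so later page loads cost a single request.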

Apologies for not chiming in earlier.

And what exactly is the difference between mobile-sections and formatted outputs?

As mentioned earlier, the biggest difference is that the formatted endpoint doesn't contain the reference sections.[1]
Other changes: The lead section text is not inside the sections property anymore. It's split up between intro and text properties, obviating the need to relocate the first paragraph through DOM manipulation. Hatnotes, page issues, and infobox are extracted from the lead (section) text.

[1] My hope is that loading everything except references in a single step would be performant enough. That's what the web team plans to do anyways AFAIU. If that's the case I think we could get rid of the lead + remaining properties level. When references are to be loaded the request should include the revision number to make sure that the reference links match.

@bearND could you estimate the damage there to Android if the old code was to use the new endpoint result? It's my understanding that infoboxes, disambiguation text are no longer in the body of the lead section so would not render any more inside old clients.

That's correct. Plus the missing infobox, page issues and references. I think the missing infobox and the references are the biggest concern IMO at this time. The disambig entries are not very prominently surfaced in the Android app; in latest releases anyways.

since we would not be able to turn mobile-sections* off. [...]
The apps team have no control over the version of the app people install and we really don't want to break the client.

Actually, there are a couple of remote config values we could use to turn off RESTBase usage in the Android app: restbaseProdPercent + restbaseBetaPercent (see T126934). If we introduced a new config key for the new endpoint and newer app versions, this would then make older app versions use the plain MediaWiki action=mobileview API for page content and TextExtracts for link previews. I consider this the last step though. By going back to the action API some newer Android app features will be disabled (mainly the Wiktionary definition lookup IIRC).
Before turning RB usage off I envision an extended period of not pre-generating mobile-sections and just having MCS process it on demand. For the transition period I guess we want to adjust the cache-policy.
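
For readers unfamiliar with percentage-style remote config, here is a rough sketch of how a value like restbaseProdPercent could gate RESTBase usage. How the Android app actually interprets the value is not described in this task; the bucketing scheme, the function name and the install_id parameter are all assumptions.

```python
import zlib


def use_restbase(install_id: str, restbase_percent: int) -> bool:
    """Decide whether this install should use RESTBase for page content.

    Assumption: each install hashes to a stable bucket in [0, 100) and
    uses RESTBase only if its bucket is below the configured percentage.
    """
    bucket = zlib.crc32(install_id.encode("utf-8")) % 100
    return bucket < restbase_percent
```

Under this scheme, setting the percentage to 0 sends every affected install to the action=mobileview fallback, and 100 sends every install to RESTBase.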

In terms of renaming the endpoint I think there was a disappointment with our original choice of name and a hope to revisit that. Of course if we can work out a way to replace the existing endpoint, this becomes tangential.

I'm not thrilled about the name mobile-sections. Making the backwards-incompatible changes seemed like a good opportunity to change the old name to something better. Unfortunately, it's unclear what a good name for this is. One idea I had was naming it read or reading (explanation at the end of the next paragraph).

Re: mobile: The only thing that is mobile-specific is the handling of main pages. For these MCS falls back to action=mobileview. That's mainly because the Parsoid HTML of main pages is not responsive. It looks horrible on smaller form factors. If we had the right CSS rules to make Parsoid output look good on small devices then this fallback to mobileview would not be needed. After watching this week's CREDIT presentation by @TheDJ, I'm hopeful that a solution for this is achievable.
The bulk of the DOM transformations done by MCS is more for reducing the payload by trying to get rid of DOM elements and attributes that don't seem to affect the reading experience. In contrast to that Parsoid HTML is geared towards editing pages.

Ok using inspiration from @Jdlrobson @bearND and @Mholloway above:

  1. We should release an update to the Android app that consults a new server side config option
  2. Eventually a new version that works with the new API will be released (this will NOT be affected by the new config option)
  3. For the transition period we need to support both APIs for a time

Open questions (at least for me, maybe someone here knows the answers):

  1. For the approximately 12% of Android users who are "stuck" on an older version (10% about a year old and 2% long tail), is the fallback on MW API sufficient? Or can they be updated some other way?
  2. How long is the "transition period" expected to be? Or what % adoption should be our threshold of shutting down the old API?
  3. Will we support both APIs during the transition as totally separate services OR will we refactor the old API to call through to the new one?
  4. Naming discussion is still TBD but we are getting closer to an answer

  1. For the approximately 12% of Android users who are "stuck" on an older version (10% about a year old and 2% long tail), is the fallback on MW API sufficient? Or can they be updated some other way?

I think the fallback should be OK. The patch could theoretically be backported to older app versions. I believe that's a painful process, though, and leave that to the Android app devs to decide if that would be really worth it. Maybe not by itself but if there was something else that needed backporting?

  2. How long is the "transition period" expected to be? Or what % adoption should be our threshold of shutting down the old API?

I was thinking about giving it at least a year or so. I hope the adoption rate of v2 would adjust quicker and higher than it did for v1. I think it would be good to also have a month or two where we have turned the restbaseProdPercent + restbaseBetaPercent down to 0 before turning off the actual endpoint. Once we disable storage for v1 we should keep an eye on the external request rate for it.[1] There could be users besides the Android app, too. I.e. we still haven't turned off the mobile-text endpoint.

[1] https://www.mediawiki.org/wiki/Wikimedia_Apps/Team/RESTBase_services_for_apps#Route_usage

  3. Will we support both APIs during the transition as totally separate services OR will we refactor the old API to call through to the new one?

That one is probably more a question for the RESTBase guys.
I believe in MCS it should be separate endpoints but in RB it could be the same just with a different content-type profile version. We haven't done a split for MCS endpoints yet. So, I'm not sure what the best practice for this is.
I think there might be more changes for v2 down the road, which would make the structure less compatible with v1 than it already is right now. It would be cleaner to have a separate endpoint IMO.

Working with @Tbayer to estimate the number of users that will be affected by deprecating this API and removing it

To be clear - I mean users that are NOT coming from the Android app.

Also, @bearND came up with some reasonable strategies for migrating the Android users to the new API using the existing config flags. So it looks like we will be good on that front.

The Android app sent 99.6% of requests to this API during one recent one-week timespan. In absolute terms, there were 218805 requests (or about 30k/day) not from the Android app.

I also ran a query for the most frequent non-Android-app user agents, see below.

SELECT
SUM( IF(access_method = 'mobile app' AND user_agent_map['os_family'] = 'Android', 1, 0) )
/SUM(1) AS Android_app_percentage,
SUM(1) AS total_mobilesectionsAPI_requests
FROM wmf.webrequest
WHERE year = 2017 AND month = 2 AND day >= 3 AND day <=9
AND uri_path LIKE '/api/rest_v1/page/mobile-sections%';

android_app_percentage  total_mobilesectionsapi_requests
0.996305444171708       59223628
1 row selected (5271.287 seconds)

SELECT user_agent, SUM(1) AS mobilesectionsAPI_requests
FROM wmf.webrequest
WHERE year = 2017 AND month = 2 AND day >= 3 AND day <=9
AND uri_path LIKE '/api/rest_v1/page/mobile-sections%'
AND NOT (access_method = 'mobile app' AND user_agent_map['os_family'] = 'Android')
GROUP BY user_agent
ORDER BY mobilesectionsAPI_requests DESC
LIMIT 20;

user_agent      mobilesectionsapi_requests
okhttp/2.6.0    150837
MWOffliner/HEAD (dbrant@wikimedia.org)  52755
node-fetch/1.0 (+https://github.com/bitinn/node-fetch)  13727
[...]
20 rows selected (10745.389 seconds)

[rest of the list redacted in case there are potential privacy concerns, #4 is a Safari user agent with only 170 views, and most of the rest are Safari versions too.]
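
As a cross-check, the absolute figures quoted above can be re-derived from the two result sets. This is only a sketch: the numbers are copied from the query output, the variable names are made up for illustration, and the seven-day window comes from the `day >= 3 AND day <= 9` filter.

```python
# Values copied from the query output above.
total_requests = 59223628
android_share = 0.996305444171708

# Requests that did NOT come from the Android app during the 7-day window.
non_android = round(total_requests * (1 - android_share))
per_day = non_android / 7

# The three user agents listed explicitly in the second result set.
top_three = 150837 + 52755 + 13727
top_three_share = top_three / non_android

print(non_android, round(per_day), top_three, round(top_three_share, 3))
```

This reproduces the 218805 non-Android requests (roughly 31k per day) and suggests that the three listed user agents account for about 99% of them.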

Note: The "MWOffliner" user agent was from me, creating an offline ZIM collection from 50k articles. This tool will be able to adapt to any new endpoint format, and shouldn't be a blocker.

Given that you are considering making this a major backwards-incompatible change, would it make sense to also enable streaming HTML rendering of the response? This would avoid major performance regressions on slow connections, and might even improve performance over the current lead section loading. Metadata could either be inlined in the head (if absolutely needed for rendering), or served in a separate JSON end point (if the first content can be rendered without this metadata).

@GWicke I was wondering this myself. @Niedzielski is doing some research auditing the existing code for the apps, and evaluating a move to streaming HTML will be part of that audit (T152991: Audit iOS, Android and Mobile Web HTML, CSS, and JS).

Looking at mobile sections output, there is some overlap with the summary endpoint (display title, description, thumbnail):

"ns": 0,
  "id": 4269567,
  "revision": "764668785",
  "lastmodified": "2017-02-10T06:40:26Z",
  "lastmodifier": {
    "name": "Futurulus",
    "gender": "unknown"
  },
  "displaytitle": "Dog",
  "wikibase_item": "Q144",
  "description": "domestic animal",
  "protection": {
    "edit": [
      "autoconfirmed"
    ],
    "move": [
      "sysop"
    ]
  },
  "editable": false,
  "languagecount": 220,
  "image": {
    "file": "Collage of Nine Dogs.jpg",
    "urls": {
      "320": "//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Collage_of_Nine_Dogs.jpg/320px-Collage_of_Nine_Dogs.jpg",
      "640": "//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Collage_of_Nine_Dogs.jpg/640px-Collage_of_Nine_Dogs.jpg",
      "800": "//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Collage_of_Nine_Dogs.jpg/800px-Collage_of_Nine_Dogs.jpg",
      "1024": "//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Collage_of_Nine_Dogs.jpg/1024px-Collage_of_Nine_Dogs.jpg"
    }
  },
  "hatnotes": [
    "This article is about the domestic dog. For related species known as \"dogs\", see <a href=\"/wiki/Canidae\" title=\"Canidae\">Canidae</a>. For other uses, see <a href=\"/wiki/Dog_(disambiguation)\" title=\"Dog (disambiguation)\">Dog (disambiguation)</a>.",
    "\"Doggie\" redirects here. For the Danish artist, see <a href=\"/wiki/Doggie_(artist)\" title=\"Doggie (artist)\">Doggie (artist)</a>."
  ],

The other information included here (omitted above) is the section titles. I am working off the cuff, but it seems like a lot of this could be added to the summary, and I think that would make sense. Basically everything that is used to build the native components (and editing business logic) could be downloaded from the summary endpoint, and the content, which is purely HTML, could then be downloaded from the new endpoint.

Thoughts?

Adding a bit more information to the summary end point sounds like an interesting idea to me. I definitely like the consolidation aspect. However, I don't know enough about what the most common use cases are to really evaluate pros / cons, especially wrt performance.

One potential advantage might be that the summary information could already be in cache from using the page preview (web) or equivalent in the app. This would make the summary request on full page load basically free.

The only common properties between summary and mobile-sections-lead right now are description and, with a bit of variation, some form of titles. In the future, once we get a thumbnail API, then the thumbnail for the lead image should be common as well.

In the early beginnings of MCS I was experimenting with an HTML content type to be used by the Android app. I abandoned that approach mainly for two reasons: 1) this seemed like a bigger step which would have required a lot more refactoring in the Android app, plus at that time the RB environment was very new for the app, and 2) our desire then to replace the WebView with native components. Now I'd say #1 is probably not so much an issue anymore. I haven't heard much about #2 lately.
The main reason why I'm a bit hesitant to switch the output to an HTML content type nowadays is that I believe the web team might prefer JSON. My main motivation for this new endpoint is adoption by the web team. @Jdlrobson what do you think about HTML output?

This seems to go a bit counter to the tendency in the new endpoint of extracting properties from HTML into their own JSON properties (e.g. intro, (lead) text, infobox, issues, hatnotes). It wouldn't make sense to me to have encoded HTML portions inside JSON inside HTML <head>. If we go the HTML route, then the MCS output should simply mark the respective elements using HTML attributes (class or something else) instead, and then rely on CSS rules to adjust the layout.

Looking over the data we currently extract from the HTML, it seems that it is convenient to deliver it as structured JSON for 3 different purposes:

  1. To build native components (e.g. the TOC)
  2. To power business logic (e.g. protection status, editable)
  3. To re-assemble into better formatted HTML for the device (e.g. lead section, info box)

@bearND does this seem like a fair assessment? If so…

In regards to @GWicke's suggestion, I think streaming HTML could solve for purpose 3 if we assembled everything on the server.

For purpose 1 & 2, that seems like the part where we really want JSON. My suggestion here is that we should put everything that falls in these categories in a separate endpoint that delivers that JSON. The summary seems like a good candidate here, because that's what much of this is: protection status, editable, wikidata item, thumbnail, last modified, etc… I would say that even the ToC is a good fit here: a list of section titles seems like a good summary of the article.

Also technically there is some nice separation:

  • Summary endpoint: delivers JSON used for business rules and to build native (or web app) components that are separate from the main content
  • Content endpoint: delivers HTML used for the main content of the article

Any thoughts here?

Your points are a fair assessment of the current state.
For the future:
#1 I definitely agree.
#2 probably agree. I think a point could be made to share business logic between apps and web in JS eventually.
#3 As I said before, I think if we go the HTML route for the content then we should consider not extracting the lead intro and text anymore and just have CSS rules to affect the position depending on screen dimensions.

Adding a lot more fields to the summary endpoint would bloat it up. I'm not sure how the web team would feel about having a lot of unused payload in there when they want to show a hovercard.

@bearND So maybe a better future state would be:

  • Summary endpoint: delivers JSON used to create previews
  • Metadata endpoint: delivers JSON used to power business rules and to build native (or web app) components that are separate from the main content
  • Content endpoint: delivers HTML used for the main content of the article
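
To make the three-way split a bit more concrete, here is a rough sketch that partitions a mobile-sections-lead style response into a preview part and a metadata part. The field names come from the sample output earlier in this thread, but which field belongs to which endpoint is purely an assumption for illustration (nothing here has been decided), and `split_lead` is a hypothetical helper.

```python
# Hypothetical grouping of mobile-sections-lead fields; the real split is undecided.
PREVIEW_FIELDS = {"displaytitle", "description", "image"}
METADATA_FIELDS = {
    "ns", "id", "revision", "lastmodified", "lastmodifier",
    "wikibase_item", "protection", "editable", "languagecount",
}


def split_lead(lead):
    """Partition a lead response dict into (summary, metadata, leftover)."""
    summary = {k: v for k, v in lead.items() if k in PREVIEW_FIELDS}
    metadata = {k: v for k, v in lead.items() if k in METADATA_FIELDS}
    known = PREVIEW_FIELDS | METADATA_FIELDS
    # Anything left over would stay with the HTML content in this sketch.
    leftover = {k: v for k, v in lead.items() if k not in known}
    return summary, metadata, leftover


# Abbreviated sample based on the "Dog" response shown above.
lead = {
    "id": 4269567,
    "revision": "764668785",
    "displaytitle": "Dog",
    "description": "domestic animal",
    "protection": {"edit": ["autoconfirmed"], "move": ["sysop"]},
    "editable": False,
    "image": {"file": "Collage of Nine Dogs.jpg"},
    "hatnotes": ["This article is about the domestic dog."],
}

summary, metadata, leftover = split_lead(lead)
print(sorted(summary), sorted(metadata), sorted(leftover))
```

In this sketch the hatnotes fall through to the content side, which matches the idea of marking them up in the HTML rather than extracting them.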

We also have the option of doing #3 (re-assemble into better formatted HTML, or extract data) in a streaming fashion on the client. For JS, we have web-html-stream to do this. Combined with decent markup (ex: <section> tags (T114072), namespaced attribute markers, <figure> and <video> for media, optional elements like references omitted), this can provide a lot of flexibility.

Depending on the size of the "business logic" metadata, it could also be possible to serve it in a HTTP header. This would avoid the need for an extra request, and control prioritization on slow connections. You probably wouldn't want to serve larger content like section headings this way, though. If you don't need those upfront, then it might be possible to extract those from the HTML stream.
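
As an illustration of extracting section headings from an HTML stream, here is a minimal sketch using Python's standard `html.parser` rather than web-html-stream. It assumes the content arrives in arbitrary chunks and that headings are plain `<h2>` to `<h6>` elements; the `HeadingCollector` class and `extract_toc` helper are hypothetical names.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects heading level and text while HTML is fed in chunks."""

    HEADINGS = {"h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.toc = []
        self._level = None   # level of the heading currently open, if any
        self._parts = []     # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it.
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag in self.HEADINGS:
            title = "".join(self._parts).strip()
            self.toc.append({"level": self._level, "title": title})
            self._level = None


def extract_toc(chunks):
    """Feed chunks as they arrive and return the collected headings."""
    parser = HeadingCollector()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.toc


# Chunk boundaries deliberately fall inside text and inside a tag.
chunks = [
    "<section><h2>Taxo",
    "nomy</h2><p>text</p></section><sec",
    "tion><h3>Bio<i>logy</i></h3></section>",
]
toc = extract_toc(chunks)
print(toc)
```

The parser never builds a DOM; it only reacts to heading tags, which is the "shallow" kind of processing discussed here.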

The reason I prefer JSON is that it's more malleable. The app and web experiences are different and should be allowed to be different. As soon as you send HTML, the client (and server if you are doing server-side rendering) now has to worry about parsing a DOM tree and working out what goes where. I want clients to be as dumb as possible. To take what they need and render it.

If streaming is the goal, a hybrid approach (e.g. sending a JSON blob of lead, followed by a marker followed by the HTML of remaining sections) may be better, but this then restricts us on what we can do with the remaining sections. We currently are experimenting with not rendering the references section. In addition to this, the JSON format of the remaining sections does allow us to easily construct the table of contents on mobile, for instance, as well as easily provide section collapsing in the React.js web app (without it that would be near impossible). Could we use multiple URLs to stitch JSONs together in the service worker for example?

So in summary, I'm pretty attached to the lead being JSON, but I think there's a sweet spot for a hybrid approach somewhere...

It's actually possible to stream JSON, there's a bunch of libraries for doing that, for example this one: http://oboejs.com

We can add support for that server-side if needed.
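
For illustration, one simple way to consume JSON incrementally is sketched below. Note that this is a different technique from what Oboe.js does (Oboe.js parses a single JSON document progressively): the sketch assumes a hypothetical wire format in which the server sends a sequence of complete JSON objects (lead first, then one object per section), and decodes each object as soon as enough bytes have arrived, using the standard library's `json.JSONDecoder.raw_decode`.

```python
import json


def iter_json_values(chunks):
    """Yield each complete JSON value from a stream of text chunks.

    Assumes top-level values are objects or arrays; a bare number split
    across two chunks could be decoded too early.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            buffer = buffer.lstrip()
            if not buffer:
                break
            try:
                value, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # value is incomplete; wait for the next chunk
            yield value
            buffer = buffer[end:]


# Chunk boundaries fall in the middle of keys on purpose.
chunks = [
    '{"lead": {"displa',
    'ytitle": "Dog"}}\n{"sec',
    'tion": 1, "text": "<p>...</p>"}',
]
values = list(iter_json_values(chunks))
print(values)
```

A client could render the lead as soon as the first object is yielded, without waiting for the remaining sections.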

Sounds like we have some things to work out between web and apps.

@Jdlrobson @Niedzielski I am going to grab you two as representatives of web and apps and setup a meeting to try and figure out what the requirements are and how we want to use the data from the end points.

On using a hybrid approach vs using headers for structured data vs using separate APIs:
I think the goal of each of these is the same: extract what is needed from the HTML server side and deliver it to clients as structured JSON, so that HTML does not need to be parsed on the client.

If the concern about HTML parsing is about performance, then I don't think it is necessarily warranted. Shallow, streaming HTML parsing / processing can actually be very cheap, on the order of 1 ms per MB of HTML: https://github.com/wikimedia/web-html-stream#performance. This might be faster than parsing the same HTML wrapped in JSON, even when using a fully blocking parser.

JSON and HTML each have their sweet spots: JSON for representing structured data, HTML for representing documents. JSON is easy to work with in a synchronous context. HTML gives you a lot of flexibility for matching & replacing bits of content, for example to adjust how media is rendered on a specific device, or how / whether an infobox is shown.

@GWicke (at least for me) I think the concern about HTML parsing is more about reimplementing brittle HTML parsing multiple times across several clients as we do now.

Your comments about HTML vs json also ring true to me... this is more about using the right tool for each context. And about separation of concerns. It's up to us to nail down what needs to be structured and what needs to be a document.

I'm going to start editing the description on this ticket to collect the requirements on this. I think we need to back up a few steps before we move forward so we have a good understanding of what the needs are before we dig into the solutions a bit more

@GWicke (at least for me) I think the concern about HTML parsing is more about reimplementing brittle HTML parsing multiple times across several clients as we do now.

Is it the HTML parsing itself that is brittle, or is it the unpredictable markup you need to work with? The latter is fairly straightforward to address with server side pre-processing, so that you have a stable markup structure to work against. That part is really no different between JSON & HTML.

Some Android concerns below :]

I think the fallback should be OK.

Unfortunately, the Android MediaWiki fallback support only kicks in for protocol errors (except 404) AFAIK. I tried the scenario of changing a field type in the mobile-sections-lead response and see an error page in the app but it's not considered significant so the fallback MediaWiki support will never kick in. The scenario of missing fields in a new response type would likely perform similarly and in some scenarios errors are intentionally stifled (see T145075 and the thread "[draft] incident report -- crash on startup for non-English users" for additional discussion). For fallback to kick in, the endpoint must return a status other than 404 (e.g., https://en.wikipedia.org/api/rest_v1/foobar returns 404 which would be considered insignificant)

The patch could theoretically be backported to older app versions. I believe that's a painful process, though, and leave that to the Android app devs to decide if that would be really worth it. Maybe not by itself but if there was something else that needed backporting?

This would be a little lame so I hope we don't have to go this route. If we do, short-circuiting an "isRestBaseEnabled()" method might be the most practical way to ensure old clients still work. Maybe we should set a "timebomb" MediaWiki fallback in future releases of the app to avoid this and similar problems going forward but this might also discourage good API versioning

I was thinking about giving it at least a year or so. I hope the adoption rate of v2 would adjust quicker and higher than it did for v1. I think it would be good to also have a month or two where we have turned the restbaseProdPercent + restbaseBetaPercent down to 0 before turning off the actual endpoint. Once we disable storage for v1 we should keep an eye on the external request rate for it.[1] There could be users besides the Android app, too. I.e. we still haven't turned off the mobile-text endpoint.

This remote kill switch regrettably is not version sensitive at the moment so it would affect all versions of the app. There's some good discussion around versioning the configuration we might consider for future releases (or go with the timebomb approach)

Is it the HTML parsing itself that is brittle, or is it the unpredictable markup you need to work with? The latter is fairly straightforward to address with server side pre-processing, so that you have a stable markup structure to work against. That part is really no different between JSON & HTML.

I really appreciated the considerations mentioned between JSON and HTML. I understand that the content is super HTML heavy but I want to give a quick mention that, with the exception of the WebView and an extremely small subset handled by TextView, most native Android components are not built for handling sophisticated HTML parsing or rendering. We'd have to rethink how the app consumes responses if we go back to parsing HTML. The old app-side MediaWiki parsing implementations weren't particularly pretty so I'd prefer we somehow leverage the WebView or Duktape JavaScript to do HTML to JSON to Java unmarshalling app side. In general, I'll add that I find JSON responses much more human readable, of clearer intent, friendlier to newcomers, almost self-documenting, simpler to version, and easier to handle but I admit it has its shortcomings.

For fallback to kick in, the endpoint must return a status other than 404 (e.g., https://en.wikipedia.org/api/rest_v1/foobar returns 404 which would be considered insignificant)

Right. Thank you. 404 wouldn't work since we need that for titles that don't exist. We could try other codes, like 410, 400, or 501.

The patch could theoretically be backported to older app versions. I believe that's a painful process, though, and leave that to the Android app devs to decide if that would be really worth it. Maybe not by itself but if there was something else that needed backporting?

This would be a little lame so I hope we don't have to go this route. If we do, short-circuiting an "isRestBaseEnabled()" method might be the most practical way to ensure old clients still work. Maybe we should set a "timebomb" MediaWiki fallback in future releases of the app to avoid this and similar problems going forward but this might also discourage good API versioning

Yes, I agree that this is a bit too much, and using something like you suggested to short-circuit an "isRestBaseEnabled()" method is preferable over porting the page loading logic (too much goes on in there and too much code there has changed). I like the idea of a timebomb, esp. for the announcements endpoint.

This remote kill switch regrettably is not version sensitive at the moment so it would affect all versions of the app.

Yes, this would require a new remote config key to be used by newer app versions which can handle the new endpoint or new version of the endpoint.

In T146944#3035658, @Niedzielski wrote:
For fallback to kick in, the endpoint must return a status other than 404 (e.g., https://en.wikipedia.org/api/rest_v1/foobar returns 404 which would be considered insignificant)
Right. Thank you. 404 wouldn't work since we need that for titles that don't exist. We could try other codes, like 410, 400, or 501.

That's doable. We're already discussing hiding some endpoints from the docs, so we might as well hide these and configure them to always return whatever response code you want.
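
The fallback rule described in this exchange can be summarized in a few lines. This is only a sketch based on the description in this thread (HTTP errors trigger the MediaWiki fallback, except 404, which is needed for titles that don't exist); it is not the Android app's actual code, and `should_fall_back` is a hypothetical name.

```python
NOT_FOUND = 404  # reserved for "title does not exist", so never a fallback trigger


def should_fall_back(status):
    """Return True if this HTTP status would count as a significant error."""
    return status >= 400 and status != NOT_FOUND


# Candidate codes mentioned above for a retired endpoint, plus two controls.
for status in (410, 400, 501, 404, 200):
    print(status, should_fall_back(status))
```

Under this rule, any of 410, 400, or 501 would push old clients to the fallback, while 404 and successful responses would not.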

If only for my own recollection, I want to document an example of a more concrete case where HTML parsing is a hassle. The page lead image is a common app use case and it's parsed out in the mobile-sections-lead endpoint like:

"image": {
  "file": "President Barack Obama.jpg",
  "urls": {
    "320": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/320px-President_Barack_Obama.jpg",
    "640": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/640px-President_Barack_Obama.jpg",
    "800": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/800px-President_Barack_Obama.jpg",
    "1024": "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/1024px-President_Barack_Obama.jpg"
  }
}

Here's the relevant HTML form from the same endpoint:

<a href="/wiki/File:President_Barack_Obama.jpg">
  <img
    src="//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/220px-President_Barack_Obama.jpg"
    data-file-type="bitmap"
    srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/440px-President_Barack_Obama.jpg 2x, //upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/330px-President_Barack_Obama.jpg 1.5x"
    height="275"
    width="220">
</a>

A generous pseudo code snippet to access the image would be Image image = doc.querySelector(Image.TAG). Getting the URL is then like image.getSrc()... Or is it? If the client notices that the image is grainy, do they reimplement a subset of browser functionality to calculate the ideal URL from src, srcset, and sizes, hardcode it to one of the large sizes, or experiment with requesting arbitrary sizes? Does their parser easily unpack srcset or do they have to write that functionality?
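
To make the comparison concrete, here is a rough sketch of what a client would have to write for each form. It is deliberately simplified and makes several assumptions: the srcset parser only understands density descriptors such as `2x` (no width descriptors, no `sizes`), it splits candidates on `", "`, and all function names are hypothetical.

```python
def pick_from_urls(urls, min_width):
    """JSON form: smallest listed width >= min_width, else the largest."""
    widths = sorted(int(w) for w in urls)
    for width in widths:
        if width >= min_width:
            return urls[str(width)]
    return urls[str(widths[-1])]


def parse_srcset(srcset):
    """HTML form: turn 'url 2x, url 1.5x' into sorted (density, url) pairs."""
    candidates = []
    for entry in srcset.split(", "):
        url, _, descriptor = entry.strip().rpartition(" ")
        candidates.append((float(descriptor.rstrip("x")), url))
    return sorted(candidates)


def pick_from_srcset(src, srcset, density):
    """Treat src as the 1x candidate and pick the first density that fits."""
    candidates = sorted([(1.0, src)] + parse_srcset(srcset))
    for candidate_density, url in candidates:
        if candidate_density >= density:
            return url
    return candidates[-1][1]


BASE = "//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/"
NAME = "px-President_Barack_Obama.jpg"

urls = {str(w): BASE + str(w) + NAME for w in (320, 640, 800, 1024)}
src = BASE + "220" + NAME
srcset = BASE + "440" + NAME + " 2x, " + BASE + "330" + NAME + " 1.5x"

print(pick_from_urls(urls, 500))
print(pick_from_srcset(src, srcset, 2.0))
```

The JSON lookup is a handful of lines, while the HTML path needs a small parser plus a selection policy, and even then it ignores parts of what a browser does with srcset.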

@Niedzielski, HTML can be optimized for convenient consumption the same way JSON can be. In this example, it is clear that srcset is a poorly defined HTML attribute, and that we'd probably use something better in HTML that we'd send to the client. The native use of srcset also suffers from other problems. For example, browsers don't typically take network conditions into account when selecting a resolution. Loading huge images on a 2G connection leads to a bad user experience, and potentially large bandwidth bills. Something smarter (possibly also with progressive loading) is needed really, independent of JSON or HTML. See T66214 for the discussion on making it easy for the client to select image size & quality.

The example of images is a good one as it applies to content in general, and not just lead images. We probably won't cut up a HTML page into a JSON array at each image, so we'll need to handle images in content in any case. So my point is really that HTML is there to stay for the bulk of the content, and provides the flexibility we need there. But, there is also a lot of metadata that most use cases need, and JSON is a lot better for that.

@Niedzielski totally agree… one of the main purposes of the meeting at the offsite is to call out things like this that we do not want to parse from HTML. Basically IMO if we do deliver some of the page content as HTML, it should only be content that we don't have to later parse on the client. Any parsing into structured data should be done on the server and delivered as JSON so the clients don't have to do that work later.

I think this echoes @GWicke's comment as well.

TL;DR;
As a slight technical kitten… on this particular matter of source sets / images: this is one of the primary reasons that we want to wait until the thumbnail API is done before shipping this new API. It could simplify this specific case significantly.

For the sake of a complete comparison example, I've made a sketch for an improved HTML representation using data-* attributes:

<a href="/wiki/File:President_Barack_Obama.jpg">
  <img
    src="//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/220px-President_Barack_Obama.jpg"
    data-file-type="bitmap"
    data-src-1.5x="//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/330px-President_Barack_Obama.jpg"
    data-src-2x="//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/440px-President_Barack_Obama.jpg"
    data-src-x="//upload.wikimedia.org/wikipedia/commons/thumb/8/8d/President_Barack_Obama.jpg/%d-President_Barack_Obama.jpg"
    height="275"
    width="220">
</a>

  • There are no expectations that the parser is DOM web API capable. For example, the parser probably has an Element type but no concept of an HTMLImageElement and its API
  • The src attribute means that browsers display content without JavaScript. I guess JavaScript clients can think of src as the 1x thumbnail
  • The missing srcset attribute means that a JavaScript client is required to adapt the resolution for higher resolution devices. On the plus side, this requires the client to take an active role in deciding when to use high resolution images so maybe they'll think about bandwidth. On the downside, this requires the client to re-implement some srcset-like behavior to get non-grainy images which previously were available out of the box. We could also keep the srcset attribute but the data will be redundant and make the model more confusing
  • The data-src-x attribute is an example of how we might expose a configurable format URL so the client wouldn't have to know the thumbnail API. I don't think we would expose this specifically but I thought it would be interesting to consider

@GWicke, the format probably won't look like this but is it a fair comparison? If so, I will update the ticket description with the JSON example and the HTML example. I will also update the goals section to include mention of the JSON / HTML / hybrid format.

@Niedzielski, the idea with the thumb API is to avoid the need to explicitly embed these URLs. The client should be able to make its own decisions on size, limited only by the original dimensions.

In any case, I don't think anybody is proposing to use JSON to parametrize images in the page content itself. The question of image scaling within page content is basically orthogonal to the question of whether to wrap HTML content into JSON, or delivering JSON metadata separately from HTML content.

Some discussion notes from the Reading offsite / onsite earlier this week: https://etherpad.wikimedia.org/p/web-code

It's actually possible to stream JSON, there's a bunch of libraries for doing that, for example this one: http://oboejs.com

+++++

Any updates on this task?
This endpoint came up as a possible solution to T168625. We are basically recreating the logic of the intro field in the MCS endpoint...

We can add support for that server-side if needed.

Any updates on this?
This endpoint came up as a possible solution to T168625. We are basically recreating the logic of the intro field in the MCS endpoint...

My comment was more about supporting JSON streaming and now I'm confused how that would help you on T168625

(my bad.. i meant the task in general and snippet was accidentally added.. updated original comment)