[RFC] Evolving our content platform: Content adaptability, structured data and caching
Closed, Declined | Public

Description

In T96903 and at the Lyon hackathon we identified a set of interconnected issues around structured data, storage and caching. This task aims to provide a high-level summary, and is intended as a starting point for a more focused discussion with stakeholders in this area.

Supporting a widening range of devices and use cases

The way our users interact with our projects has changed: they use devices ranging from feature phones on marginal connections to many-core high-resolution desktops on super-fast low-latency connections. Some of them want to quickly look up short summaries and factoids, while others immerse themselves in long-form articles and enjoy rich visualizations and media.

Our platform was originally designed around long-form articles displayed exclusively on desktops. As a result, it is not as easy to adapt to different devices and use cases as it could be. To become more adaptable, we need to evolve how we store and represent content and data.

Separating data from presentation

By separating data from its presentation, we gain flexibility in how we select and present data for a device or use case. For example, we can show infobox data differently depending on device, or use it to present a short summary in search results. Carefully designed presentation components can offer a better editing experience. For example, we could let users update a city's population right inside the rendered infobox component, with a widget prompting for a source of the new information.
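As a rough illustration of this separation, the sketch below keeps infobox facts as plain data and renders them in two ways: a full table for desktop, and a one-line summary for search results or small screens. All names (`InfoboxFact`, `render_desktop`, `render_summary`) and all values are hypothetical placeholders, not an existing MediaWiki or Wikidata API.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class InfoboxFact:
    """One structured statement, stored independently of any layout (hypothetical type)."""
    label: str
    value: str
    source: str  # reference backing the value


def render_desktop(title: str, facts: list[InfoboxFact]) -> str:
    """Full infobox: every fact as a table row, including its source."""
    rows = "".join(
        f"<tr><th>{escape(f.label)}</th><td>{escape(f.value)}</td>"
        f"<td>{escape(f.source)}</td></tr>"
        for f in facts
    )
    return f"<table class='infobox'><caption>{escape(title)}</caption>{rows}</table>"


def render_summary(title: str, facts: list[InfoboxFact], limit: int = 2) -> str:
    """Compact plain-text summary built from the first `limit` facts."""
    parts = ", ".join(f"{f.label}: {f.value}" for f in facts[:limit])
    return f"{title} ({parts})"


# Placeholder data for illustration only.
facts = [
    InfoboxFact("Country", "Examplestan", "placeholder-source-1"),
    InfoboxFact("Population", "12345", "placeholder-source-2"),
    InfoboxFact("Area", "48 km2", "placeholder-source-3"),
]

print(render_summary("Exampleville", facts))
print(render_desktop("Exampleville", facts))
```

Because both renderers consume the same facts, an edit to one fact (for example the population, together with its source) would be reflected in every presentation without touching layout code.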

With Wikidata we already have a great community-driven repository of semantic structured data. In Wikipedia, it is already used for language links, some infobox data, translations and article summaries. However, a more systematic integration is needed to reap the full benefits for both reading and editing.

We also have less general data that doesn't fit Wikidata's mission. This includes licensing information, image metadata, template parameters, categories, and newer types like revision scores, lead images or Parsoid round-trip information. We need extensible storage and query APIs, as well as a systematic integration with MediaWiki functionality like page histories and recent changes.

Finally, our least structured data is regular article content, made up of paragraphs, lists and tables. This content is currently stored as wikitext, and converted to cached HTML for display. For visual editing and other transformations, we are also storing this content as machine-readable HTML5 with RDFa. Additional derived formats are being created, and will also need storage support and exposure via APIs.

Change propagation

A challenge with the decomposition of content into multiple bits of data is the systematic propagation of changes through the system. Our current methods of tracking dependencies and scheduling asynchronous updates are relatively difficult to extend to new types of content, and show some signs of strain. With more dependencies to track and more types of content to update, we will need to improve the scalability, ergonomics and efficiency of change propagation.
See also: T102476
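To make the dependency-tracking problem more concrete, here is a minimal sketch of one possible approach: a reverse index from each piece of data to its direct consumers, walked breadth-first to find everything that may need an update after a change. The `DependencyTracker` class and the resource names are assumptions made for illustration; they do not describe how MediaWiki currently tracks dependencies.

```python
from collections import defaultdict, deque


class DependencyTracker:
    """Hypothetical reverse index: source resource -> resources that consume it."""

    def __init__(self) -> None:
        self._dependents: dict[str, set[str]] = defaultdict(set)

    def add_dependency(self, consumer: str, source: str) -> None:
        """Record that `consumer` must be updated when `source` changes."""
        self._dependents[source].add(consumer)

    def affected_by(self, changed: str) -> list[str]:
        """Return all direct and indirect consumers of `changed`, in breadth-first order."""
        seen = {changed}
        order: list[str] = []
        queue = deque([changed])
        while queue:
            current = queue.popleft()
            # Sort for a deterministic result; sets have no stable order.
            for consumer in sorted(self._dependents.get(current, ())):
                if consumer not in seen:
                    seen.add(consumer)
                    order.append(consumer)
                    queue.append(consumer)
        return order


tracker = DependencyTracker()
tracker.add_dependency("template:Infobox_city", "data:population")
tracker.add_dependency("page:CityA", "template:Infobox_city")
tracker.add_dependency("page:CityB", "template:Infobox_city")
tracker.add_dependency("summary:CityA", "page:CityA")

print(tracker.affected_by("data:population"))
```

This only computes what is affected. A real system would additionally need to schedule the updates asynchronously, deduplicate repeated changes, and order re-renders so that a consumer is rebuilt after all of its sources, which breadth-first order alone does not guarantee.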

Content composition and caching

After separating data from presentation, we need to re-assemble content for a given device and use case. For performance and efficiency of change propagation, it would be desirable to perform at least some of this assembly as late as possible, either at the edge, or directly in the client. However, we need to balance late assembly with the overheads of doing this at high volume; choosing the right granularity and division of labor between client and server will be important. We also need to provide a reasonable user experience for clients without JavaScript and other modern browser features. Our analytics as discussed in T58575 show that these still make up about 2.5% of our page views, partly driven by feature phones.

A general composition mechanism should support typical content use cases like media embeddings, tag extensions, transclusions or data widgets. We could also consider using the same mechanism for skins.
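One way to picture such a composition mechanism is a cached page skeleton with named placeholders that are filled with separately stored fragments at assembly time. The sketch below is only an assumption about what this could look like: the `<!--fragment:name-->` placeholder syntax and the `compose` function are invented for illustration and are not an existing MediaWiki or Varnish feature.

```python
import re

# Hypothetical placeholder syntax: <!--fragment:NAME-->
PLACEHOLDER = re.compile(r"<!--fragment:([a-z0-9_-]+)-->")


def compose(skeleton: str, fragments: dict[str, str], fallback: str = "") -> str:
    """Replace each placeholder with its fragment, or with `fallback` if it is missing."""

    def substitute(match: re.Match) -> str:
        return fragments.get(match.group(1), fallback)

    return PLACEHOLDER.sub(substitute, skeleton)


skeleton = "<article><!--fragment:lead--><!--fragment:infobox--></article>"
fragments = {
    "lead": "<p>Lead text</p>",
    "infobox": "<aside>Infobox</aside>",
}

# Full assembly, e.g. on the server for clients without JavaScript.
print(compose(skeleton, fragments))

# Partial assembly: a missing fragment falls back to an empty string here.
print(compose(skeleton, {"lead": "<p>Lead text</p>"}))
```

The same substitution step could in principle run on the application server, at the edge, or in the client; where it runs determines which parts can be cached together and which can be updated independently when a fragment changes.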

See also:

Event Timeline

GWicke updated the task description.
GWicke updated the task description.
GWicke renamed this task from [RFC] Content portability and structured data to [RFC] Content portability, structured data and caching. Jun 3 2015, 11:23 AM
GWicke updated the task description.
GWicke updated the task description.
GWicke renamed this task from [RFC] Content portability, structured data and caching to [Meta-RFC] Content portability, structured data and caching. Jun 3 2015, 12:42 PM

> it would be desirable to perform at least some of this assembly as late as possible, either at the edge,

Let's not forget non-WMF use cases though. While edge assembly in Varnish or some other frontend would help us, we have a lot of very smart people to set that up and configure it. Third-party users with a low-traffic wiki may not need it and may not have the time to spend setting it up.

In other words, out-of-the-box-MediaWiki should probably continue to work without having to do a lot of fancy configuration of the web server or installation of varnish or other frontends.

> or directly in the client.

Let's not forget less-featured or locked-down clients though. Missing fancy skin bits isn't a problem, but missing basic navigation or infoboxes and other content probably would be.

> out-of-the-box-MediaWiki should probably continue to work without having to do a lot of fancy configuration of the web server or installation of varnish or other frontends

Agreed: Installing a basic MediaWiki instance should not require a lot of manual configuration. There are different ways to achieve that, and we'll have to work out which one works best.

> Let's not forget less-featured or locked-down clients though

Also agreed. I added two sentences on that with a reference to T58575.

These are the raw notes from T96903 pertaining to this task:

  • Content model & storage / structured data
    • Challenges: Multi-device / multi-context, rich editing, search & discovery
    • Move to HTML5, with wikitext as edit UI?
    • Structured data extraction (page properties, categories, infoboxes, navboxes, data tables)
      • generic widgets for presentation / editing
      • could also work for multimedia
    • future of templating and tag extensions
    • support for associating multiple types of content with a logical 'page' or 'media' name; history and editing support for those
    • Needs a bunch of work still: Flow uses HTML as content model now, but <s>is considering switching back to wikitext until everything else moves to HTML</s> (we want to use RESTBase, but no one is advocating switching to wikitext)
    • HTML transclusion
      • templates -- changing model?
      • future of lua modules -- html production? dom? template + data fill? something else?
      • compare with the way citations work in parsoid/VE today (DOM manipulation outside of the ext tag, scary!)
    • Re-usable citations
    • page-specific RL modules
      • (things that break when navigating)
  • Frontends as API consumers / caching / content distribution
    • CDN and user customization strategy:
      • Fully cached logged-in page views?
      • Push chrome customization and content storage to the edge?
      • Limit Varnish config complexity
      • Multilingual wikis (Daniel: is this about the URL layout?)
    • Which API end points will be critical for performance?
GWicke renamed this task from [Meta-RFC] Content portability, structured data and caching to [Meta-RFC] Content adaptability, structured data and caching. Jun 9 2015, 8:39 PM
GWicke renamed this task from [Meta-RFC] Content adaptability, structured data and caching to [RFC] Content adaptability, structured data and caching. Jun 15 2015, 2:59 PM
GWicke updated the task description.
GWicke renamed this task from [RFC] Content adaptability, structured data and caching to [RFC] Evolving our content platform: Content adaptability, structured data and caching. Jun 23 2015, 7:26 PM

I started to collect some use cases with their respective requirements / challenges for storage and change propagation at T103445.

Congratulations! This is one of the 52 proposals that made it through the first deadline of the Wikimedia-Developer-Summit-2016 selection process. Please pay attention to the next one:

> By 6 Nov 2015, all Summit proposals must have active discussions and a Summit plan documented in the description. Proposals not reaching this critical mass can continue at their own path out of the Summit.

Qgil raised the priority of this task from High to Needs Triage. Oct 28 2015, 11:08 AM

It is November 6, and this proposal doesn't seem to have much traction; it is not on track. Unless there is a sudden change, I will leave the ultimate decision of pre-scheduling it for the Wikimedia-Developer-Summit-2016 to @RobLa-WMF and the Architecture Committee.

Not so much merged as discussed at the same time / during the same session.

Wikimedia Developer Summit 2016 ended two weeks ago. This task is still open. If the session in this task took place, please make sure 1) that the session Etherpad notes are linked from this task, 2) that followup tasks for any actions identified have been created and linked from this task, 3) to change the status of this task to "resolved". If this session did not take place, change the task status to "declined". If this task itself has become a well-defined action which is not finished yet, drag and drop this task into the "Work continues after Summit" column on the project workboard. Thank you for your help!

kchapman subscribed.

There has been no activity since the 2016 dev summit, and the proposal is currently not actionable.