In T96903 and at the Lyon hackathon we identified a set of interconnected issues around structured data, storage and caching. This task aims to provide a high-level summary, and is intended as a starting point for a more focused discussion with stakeholders in this area.
Supporting a widening range of devices and use cases
The way our users interact with our projects has changed: they use devices ranging from feature phones on marginal connections to many-core, high-resolution desktops on fast, low-latency connections. Some want to quickly look up short summaries and factoids, while others immerse themselves in long-form articles and enjoy rich visualizations and media.
Our platform was originally designed around long-form articles displayed exclusively on desktops. As a result, it does not adapt to different devices and use cases as easily as it could. To become more adaptable, we need to evolve how we store and represent content and data.
Separating data from presentation
By separating data from its presentation, we gain flexibility in how we select and present data for a device or use case. For example, we can show infobox data differently depending on device, or use it to present a short summary in search results. Carefully designed presentation components can offer a better editing experience. For example, we could let users update a city's population right inside the rendered infobox component, with a widget prompting for a source of the new information.
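As a rough illustration (the data shape and render functions below are hypothetical, not an existing MediaWiki interface), the same structured infobox data could feed both a full desktop rendering and a short summary for search results or feature phones:

```typescript
// A minimal sketch of data/presentation separation for an infobox.
// InfoboxData and both render functions are invented for illustration.

interface InfoboxData {
  entity: string;                      // e.g. a Wikidata item ID
  fields: { label: string; value: string; sourceUrl?: string }[];
}

// Full table rendering for desktop (values assumed pre-escaped here).
function renderFullInfobox(data: InfoboxData): string {
  const rows = data.fields
    .map(f => `<tr><th>${f.label}</th><td>${f.value}</td></tr>`)
    .join('');
  return `<table class="infobox">${rows}</table>`;
}

// A search result or feature phone only needs the first few facts.
function renderSearchSummary(data: InfoboxData): string {
  return data.fields.slice(0, 2)
    .map(f => `${f.label}: ${f.value}`)
    .join(' · ');
}
```

The same data object could also back an in-place editing widget, which would write the new value and its source back to the data layer rather than to rendered markup.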
With Wikidata we already have a great community-driven repository of semantic structured data. In Wikipedia, it is already used for language links, some infobox data, translations and article summaries. However, a more systematic integration is needed to reap the full benefits for both reading and editing.
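Wikidata's action API already exposes this data today. As a minimal sketch, reading a population claim (property P1082) for Berlin (item Q64) might look like this, assuming a browser fetch and anonymous CORS access:

```typescript
// Reads a quantity claim from Wikidata via the wbgetentities module.
// Error handling is deliberately minimal; quantity amounts come back
// as strings like "+3644826".

async function fetchPopulation(itemId: string): Promise<number | undefined> {
  const url = 'https://www.wikidata.org/w/api.php'
    + `?action=wbgetentities&ids=${itemId}&props=claims&format=json&origin=*`;
  const res = await fetch(url);
  const json = await res.json();
  const claims = json.entities?.[itemId]?.claims?.P1082;
  const amount = claims?.[0]?.mainsnak?.datavalue?.value?.amount;
  return amount !== undefined ? Number(amount) : undefined;
}

fetchPopulation('Q64').then(pop => console.log('population:', pop));
```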
We also have less general data that doesn't fit Wikidata's mission. This includes licensing information, image metadata, template parameters, categories, and newer types like revision scores, lead images or Parsoid round-trip information. We need extensible storage and query APIs, as well as systematic integration with MediaWiki functionality like page histories and recent changes.
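One possible shape for such an API, purely as a sketch (the interface and its names are invented here), is a store keyed by page, revision and content type:

```typescript
// A hypothetical revision-aware store for derived data. Keying by
// revision lets derived content (scores, metadata, round-trip info)
// line up with page histories and recent changes.

interface BucketKey {
  title: string;
  revision: number;
  contentType: string;   // e.g. 'revision-score', 'lead-image'
}

interface DerivedDataStore {
  get(key: BucketKey): Promise<unknown | null>;
  put(key: BucketKey, value: unknown): Promise<void>;
  // Listing by page supports history and recent-changes integration.
  listRevisions(title: string, contentType: string): Promise<number[]>;
}
```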
Finally, our least structured data is ordinary article content: paragraphs, lists and tables. This content is currently stored as wikitext and converted to cached HTML for display. For visual editing and other transformations, we also store this content as machine-readable HTML5 with RDFa. Additional derived formats are being created, and will also need storage support and exposure via APIs.
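As a concrete illustration, the same page can already be retrieved in both formats. The sketch below uses the action API's raw view and the REST content API's HTML endpoint; exact endpoints and availability vary by wiki:

```typescript
// Fetches one page as raw wikitext and as Parsoid's RDFa-annotated HTML.

async function fetchBothFormats(title: string): Promise<void> {
  const wikitextUrl = 'https://en.wikipedia.org/w/index.php'
    + `?title=${encodeURIComponent(title)}&action=raw`;
  const htmlUrl = 'https://en.wikipedia.org/api/rest_v1/page/html/'
    + encodeURIComponent(title);

  const [wikitext, html] = await Promise.all([
    fetch(wikitextUrl).then(r => r.text()),
    fetch(htmlUrl).then(r => r.text()),
  ]);

  console.log(wikitext.slice(0, 80));  // e.g. "{{Infobox ..."
  console.log(html.slice(0, 80));      // HTML5 with typeof/about RDFa attributes
}
```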
Change propagation
A challenge with decomposing content into multiple pieces of data is propagating changes systematically through the system. Our current methods of tracking dependencies and scheduling asynchronous updates are difficult to extend to new types of content, and are showing signs of strain. With more dependencies to track and more types of content to update, we will need to improve the scalability, ergonomics and efficiency of change propagation.
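To make the problem concrete, here is a toy sketch (not a description of the current job queue) of dependency-driven propagation: consumers declare what they depend on, and an edit event walks the graph to enqueue updates:

```typescript
// Toy dependency graph for change propagation. Real systems need
// batching, deduplication, retries and cycle guards; this only shows
// the shape of the problem.

type Resource = string;        // e.g. 'wikidata:Q64', 'template:Infobox'

const dependents = new Map<Resource, Set<Resource>>();

function declareDependency(consumer: Resource, dependency: Resource): void {
  if (!dependents.has(dependency)) dependents.set(dependency, new Set());
  dependents.get(dependency)!.add(consumer);
}

function propagateChange(changed: Resource, enqueue: (r: Resource) => void): void {
  // Breadth-first walk over everything downstream of the changed resource.
  const seen = new Set<Resource>([changed]);
  const queue = [changed];
  while (queue.length > 0) {
    const current = queue.shift()!;
    for (const consumer of dependents.get(current) ?? []) {
      if (!seen.has(consumer)) {
        seen.add(consumer);
        enqueue(consumer);
        queue.push(consumer);
      }
    }
  }
}
```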
See also: T102476
Content composition and caching
After separating data from presentation, we need to re-assemble content for a given device and use case. For performance and for efficient change propagation, it would be desirable to perform at least some of this assembly as late as possible, either at the edge or directly in the client. However, we need to balance late assembly against the overhead of doing it at high volume; choosing the right granularity and division of labor between client and server will be important. We also need to provide a reasonable user experience for clients without JavaScript and other modern browser features; as discussed in T58575, these still account for about 2.5% of our page views, partly driven by feature phones.
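One pattern that fits these constraints is progressive enhancement: the server ships a complete rendering, and capable clients swap in fresher or device-tailored fragments. The data-fragment attribute and /fragment/ endpoint below are hypothetical:

```typescript
// Client-side late assembly with a no-JS fallback: placeholders already
// contain server-rendered content, so clients without JavaScript still
// see a complete page.

async function hydrateFragments(): Promise<void> {
  const placeholders = document.querySelectorAll('[data-fragment]');
  for (const el of Array.from(placeholders)) {
    const name = el.getAttribute('data-fragment')!;
    try {
      const res = await fetch(`/fragment/${encodeURIComponent(name)}`);
      if (res.ok) el.innerHTML = await res.text();
    } catch {
      // Keep the server-rendered fallback on any failure.
    }
  }
}

document.addEventListener('DOMContentLoaded', hydrateFragments);
```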
A general composition mechanism should support typical content use cases like media embeddings, tag extensions, transclusions or data widgets. We could also consider using the same mechanism for skins.
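One way such a mechanism could be shaped (all names here are illustrative) is a registry of per-type fragment handlers sharing a single interface:

```typescript
// A hypothetical composition registry: media embeddings, transclusions,
// data widgets and even skins register handlers against one interface.

interface CompositionContext {
  device: 'phone' | 'desktop';
  title: string;
}

type FragmentHandler = (args: Record<string, string>,
                        ctx: CompositionContext) => Promise<string>;

const handlers = new Map<string, FragmentHandler>();

function registerHandler(type: string, handler: FragmentHandler): void {
  handlers.set(type, handler);
}

async function compose(type: string, args: Record<string, string>,
                       ctx: CompositionContext): Promise<string> {
  const handler = handlers.get(type);
  if (!handler) throw new Error(`no handler for fragment type: ${type}`);
  return handler(args, ctx);
}

// A transclusion and a data widget use the same mechanism.
registerHandler('transclusion', async ({ template }) =>
  `<!-- expand ${template} -->`);
registerHandler('data-widget', async ({ property }, ctx) =>
  ctx.device === 'phone' ? `<span>${property}</span>` : `<table>…</table>`);
```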
See also: