In T96903 and at the Lyon hackathon we identified a set of interconnected issues around structured data, storage and caching. This task aims to provide a high-level summary, and is intended as a starting point for a more focused discussion with stakeholders in this area.
Supporting a widening range of devices and use cases
The way our users interact with our projects has changed: they use devices ranging from feature phones on marginal connections to many-core, high-resolution desktops on fast, low-latency connections. Some want to quickly look up short summaries and factoids, while others immerse themselves in long-form articles and enjoy rich visualizations and media.
Our platform was originally designed around long-form articles displayed exclusively on desktops. As a result, it does not adapt to different devices and use cases as easily as it could. To become more adaptable, we need to evolve how we store and represent content and data.
Separating data from presentation
By separating data from its presentation, we gain flexibility in how we select and present data for a device or use case. For example, we can show infobox data differently depending on device, or use it to present a short summary in search results. Carefully designed presentation components can offer a better editing experience. For example, we could let users update a city's population right inside the rendered infobox component, with a widget prompting for a source of the new information.
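As a rough illustration (the data shape and render functions below are hypothetical, not an existing MediaWiki interface), the same structured infobox data could feed both a full desktop rendering and a short summary for search results or feature phones:

```typescript
// A minimal sketch of data/presentation separation for an infobox.
// InfoboxData and both render functions are invented for illustration.

interface InfoboxData {
  entity: string;                      // e.g. a Wikidata item ID
  fields: { label: string; value: string; sourceUrl?: string }[];
}

// Full table rendering for desktop (values assumed pre-escaped here).
function renderFullInfobox(data: InfoboxData): string {
  const rows = data.fields
    .map(f => `<tr><th>${f.label}</th><td>${f.value}</td></tr>`)
    .join('');
  return `<table class="infobox">${rows}</table>`;
}

// A search result or feature phone only needs the first few facts.
function renderSearchSummary(data: InfoboxData): string {
  return data.fields.slice(0, 2)
    .map(f => `${f.label}: ${f.value}`)
    .join(' · ');
}
```

The same data object could also back an in-place editing widget, which would write the new value and its source back to the data layer rather than to rendered markup.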
With Wikidata we already have a great community-driven repository of semantic structured data. In Wikipedia, it is already used for language links, some infobox data, translations and article summaries. However, a more systematic integration is needed to reap the full benefits for both reading and editing.
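Wikidata's action API already exposes this data today. As a minimal sketch, reading a population claim (property P1082) for Berlin (item Q64) might look like this, assuming a browser fetch and anonymous CORS access:

```typescript
// Reads a quantity claim from Wikidata via the wbgetentities module.
// Error handling is deliberately minimal; quantity amounts come back
// as strings like "+3644826".

async function fetchPopulation(itemId: string): Promise<number | undefined> {
  const url = 'https://www.wikidata.org/w/api.php'
    + `?action=wbgetentities&ids=${itemId}&props=claims&format=json&origin=*`;
  const res = await fetch(url);
  const json = await res.json();
  const claims = json.entities?.[itemId]?.claims?.P1082;
  const amount = claims?.[0]?.mainsnak?.datavalue?.value?.amount;
  return amount !== undefined ? Number(amount) : undefined;
}

fetchPopulation('Q64').then(pop => console.log('population:', pop));
```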
We also have less general data that doesn't fit Wikidata's mission. This includes licensing information, image metadata, template parameters, categories, and newer types like revision scores, lead images or Parsoid round-trip information. We need extensible storage and query APIs, as well as systematic integration with MediaWiki functionality like page histories and recent changes.
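One possible shape for such an API, purely as a sketch (the interface and its names are invented here), is a store keyed by page, revision and content type:

```typescript
// A hypothetical revision-aware store for derived data. Keying by
// revision lets derived content (scores, metadata, round-trip info)
// line up with page histories and recent changes.

interface BucketKey {
  title: string;
  revision: number;
  contentType: string;   // e.g. 'revision-score', 'lead-image'
}

interface DerivedDataStore {
  get(key: BucketKey): Promise<unknown | null>;
  put(key: BucketKey, value: unknown): Promise<void>;
  // Listing by page supports history and recent-changes integration.
  listRevisions(title: string, contentType: string): Promise<number[]>;
}
```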
Finally, our least structured data is ordinary article content: paragraphs, lists and tables. This content is currently stored as wikitext and converted to cached HTML for display. For visual editing and other transformations, we also store this content as machine-readable HTML5 with RDFa. Additional derived formats are being created, and will also need storage support and exposure via APIs.
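As a concrete illustration, the same page can already be retrieved in both formats. The sketch below uses the action API's raw view and the REST content API's HTML endpoint; exact endpoints and availability vary by wiki:

```typescript
// Fetches one page as raw wikitext and as Parsoid's RDFa-annotated HTML.

async function fetchBothFormats(title: string): Promise<void> {
  const wikitextUrl = 'https://en.wikipedia.org/w/index.php'
    + `?title=${encodeURIComponent(title)}&action=raw`;
  const htmlUrl = 'https://en.wikipedia.org/api/rest_v1/page/html/'
    + encodeURIComponent(title);

  const [wikitext, html] = await Promise.all([
    fetch(wikitextUrl).then(r => r.text()),
    fetch(htmlUrl).then(r => r.text()),
  ]);

  console.log(wikitext.slice(0, 80));  // e.g. "{{Infobox ..."
  console.log(html.slice(0, 80));      // HTML5 with typeof/about RDFa attributes
}
```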
Change propagation
A challenge with decomposing content into multiple pieces of data is propagating changes systematically through the system. Our current methods of tracking dependencies and scheduling asynchronous updates are difficult to extend to new types of content, and are showing signs of strain. With more dependencies to track and more types of content to update, we will need to improve the scalability, ergonomics and efficiency of change propagation.
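To make the problem concrete, here is a toy sketch (not a description of the current job queue) of dependency-driven propagation: consumers declare what they depend on, and an edit event walks the graph to enqueue updates:

```typescript
// Toy dependency graph for change propagation. Real systems need
// batching, deduplication, retries and cycle guards; this only shows
// the shape of the problem.

type Resource = string;        // e.g. 'wikidata:Q64', 'template:Infobox'

const dependents = new Map<Resource, Set<Resource>>();

function declareDependency(consumer: Resource, dependency: Resource): void {
  if (!dependents.has(dependency)) dependents.set(dependency, new Set());
  dependents.get(dependency)!.add(consumer);
}

function propagateChange(changed: Resource, enqueue: (r: Resource) => void): void {
  // Breadth-first walk over everything downstream of the changed resource.
  const seen = new Set<Resource>([changed]);
  const queue = [changed];
  while (queue.length > 0) {
    const current = queue.shift()!;
    for (const consumer of dependents.get(current) ?? []) {
      if (!seen.has(consumer)) {
        seen.add(consumer);
        enqueue(consumer);
        queue.push(consumer);
      }
    }
  }
}
```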
See also: T102476
Content composition and caching
After separating data from presentation, we need to re-assemble content for a given device and use case. For performance and for efficient change propagation, it would be desirable to perform at least some of this assembly as late as possible, either at the edge or directly in the client. However, we need to balance late assembly against the overhead of doing it at high volume; choosing the right granularity and division of labor between client and server will be important. We also need to provide a reasonable user experience for clients without JavaScript and other modern browser features; as discussed in T58575, these still account for about 2.5% of our page views, partly driven by feature phones.
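One pattern that fits these constraints is progressive enhancement: the server ships a complete rendering, and capable clients swap in fresher or device-tailored fragments. The data-fragment attribute and /fragment/ endpoint below are hypothetical:

```typescript
// Client-side late assembly with a no-JS fallback: placeholders already
// contain server-rendered content, so clients without JavaScript still
// see a complete page.

async function hydrateFragments(): Promise<void> {
  const placeholders = document.querySelectorAll('[data-fragment]');
  for (const el of Array.from(placeholders)) {
    const name = el.getAttribute('data-fragment')!;
    try {
      const res = await fetch(`/fragment/${encodeURIComponent(name)}`);
      if (res.ok) el.innerHTML = await res.text();
    } catch {
      // Keep the server-rendered fallback on any failure.
    }
  }
}

document.addEventListener('DOMContentLoaded', hydrateFragments);
```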
A general composition mechanism should support typical content use cases like media embeddings, tag extensions, transclusions or data widgets. We could also consider using the same mechanism for skins.
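One way such a mechanism could be shaped (all names here are illustrative) is a registry of per-type fragment handlers sharing a single interface:

```typescript
// A hypothetical composition registry: media embeddings, transclusions,
// data widgets and even skins register handlers against one interface.

interface CompositionContext {
  device: 'phone' | 'desktop';
  title: string;
}

type FragmentHandler = (args: Record<string, string>,
                        ctx: CompositionContext) => Promise<string>;

const handlers = new Map<string, FragmentHandler>();

function registerHandler(type: string, handler: FragmentHandler): void {
  handlers.set(type, handler);
}

async function compose(type: string, args: Record<string, string>,
                       ctx: CompositionContext): Promise<string> {
  const handler = handlers.get(type);
  if (!handler) throw new Error(`no handler for fragment type: ${type}`);
  return handler(args, ctx);
}

// A transclusion and a data widget use the same mechanism.
registerHandler('transclusion', async ({ template }) =>
  `<!-- expand ${template} -->`);
registerHandler('data-widget', async ({ property }, ctx) =>
  ctx.device === 'phone' ? `<span>${property}</span>` : `<table>…</table>`);
```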
See also: