Graph/Graphoid/Kartographer - data storage architecture
Open, High, Public

Description

Graph/Graphoid needs to solve the "data" problem, which appears to be identical to that of the <map> tag (Kartographer). Since both <graph> and <map> have similar data storage needs, I will only talk about <graph>.

Intro

The <graph> tag contains a JSON graph spec that describes how the graph should be drawn:

<graph>{ "data": ..., "marks": ..., ... }</graph>

During the page parse, the spec needs to be extracted and stored so that it can later be used by the Graphoid service or the browser.

Requirements

  1. parser needs to store the graph definition (spec) in some storage
  2. Graphoid service needs the spec to generate the image
  3. browser needs to get the graph image generated by the Graphoid service
  4. browser needs to get the JSON spec when the user wants to interact with the graph
  5. BonusA: on save preview (not VE), it would be good to show the graph from Graphoid instead of drawing it client-side, but allow a "click-to-interact" mode.
  6. BonusB: When user looks at an older revision of an article, they should see the graphs for that revision.
  7. BonusC: A special case of BonusB - if the graph spec uses external data that changes, or if Vega library changes the way it works, users should still see the original graph when looking at the older revisions.

Current limitations

The current approach of storing the spec in page_props SQL table has these issues:

  • 64KB limit per article - breaks for bigger/multiple graphs
  • for a new graph, replication to slave DBs takes time - which sometimes prevents Graphoid from getting the spec from the API, causing an error (which may get cached)
  • the URL with the graph ID (hash) becomes invalid if the spec changes
  • viewing older revisions does not show graphs unless they are identical to those in the latest revision
  • getting the spec via api.php is not cached by Varnish (? TBD)

Special case - Visual Editor plugin

Since the VE plugin can only support simple, non-templated graphs, it would make sense to always render the graph on the client, without using Graphoid. On edit, if the user opens the graph editor, VE should load the graph libraries and render the content of the <graph> tag.

Solution #1 - SQL storage

Introduce new graph-specific SQL tables:

table graph_specs - contains just the graph specs, indexed by their hashes
    fields: hash (string key), last_used (timestamp), spec (blob)

table graph_revs - lists which graphs are used by which page revisions
    fields: id (autonum key), page_rev_id (int), hash (string)
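
A rough DDL sketch of these two tables (column types and hash length are assumptions):

CREATE TABLE graph_specs (
    hash      VARBINARY(64) NOT NULL PRIMARY KEY, -- content hash of the spec
    last_used BINARY(14)    NOT NULL,             -- MediaWiki-style timestamp
    spec      MEDIUMBLOB    NOT NULL              -- the JSON graph spec
);

CREATE TABLE graph_revs (
    id          INT UNSIGNED  NOT NULL AUTO_INCREMENT PRIMARY KEY,
    page_rev_id INT UNSIGNED  NOT NULL,           -- revision that uses the graph
    hash        VARBINARY(64) NOT NULL            -- references graph_specs.hash
);
CREATE INDEX graph_revs_rev_hash ON graph_revs (page_rev_id, hash);
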
  • When creating a new article revision, or updating an existing one (null edit), the graph extension will ensure that each graph exists in the graph_specs table and will update the last_used timestamp on each. It will also ensure that the graph_revs table contains only the right graph hashes. The extension will never delete any rows from graph_specs, thus preventing dead links from browser caches. This assumes that at that point the extension knows the revid of the new/existing revision.
  • A cleanup batch job will delete any graph_specs rows that are older than 1 month (?) and not referenced by the graph_revs table.
  • Graph ext will implement an API to get a graph spec by hash: api.php?action=getgraph&hash=XXX
  • The browser will get an image via the Graphoid/<hash>.png URL (requirement 3). Graphoid will access the graph spec via the API (requirement 2), and so can the browser (requirement 4). For the page preview (BonusA), the graph ext will add/update a row in the graph_specs table, but will not touch graph_revs.
  • This approach solves BonusB, unless we choose to clean up older page revision rows in graph_revs.
  • BonusC is not solved, because we do not store images, but regenerate them when needed.

PROs: data is part of the wiki's MySQL, under proper DB management (monitoring, backups, etc); per-user access control checks
CONs: the slave replication problem remains, api.php caching issues, no BonusC

Solution #2 - RESTBase POST storage

In the near future, RESTBase plans to provide an alternative Cassandra-based storage with TTL support, operating via HTTP (T101093).

  • When saving / null editing, the graph ext will POST the graph spec to RESTBase, which will store it in Cassandra and return the hash ID as a header. The hash will be used to construct the <img src="...Graphoid/(hash).png" /> tag (or RESTBase can return the full URL). Additionally, RB will POST the spec to the Graphoid service (requirement 2), and store the resulting image in Cassandra. (See the sketch after this list.)
  • Graphoid service will only support POST requests - posting spec will return an image.
  • When the browser requests an image via GET (requirement 3), RESTBase will simply return the stored image, without calling Graphoid. When the browser requests the graph spec via GET (requirement 4), RESTBase will return it from Cassandra as well. For the page preview (BonusA), the graph ext would POST the spec to RB with an additional "temp" parameter/header. RB will check if this spec already exists in Cassandra, and if not, store it with a 1 month (?) TTL. On browser request, the image will not be cached.
  • For BonusB, older images are available from RB, or can be regenerated on the fly because the spec is also stored.
  • BonusC is solved only partially: the hash is generated from the spec, so if the spec stays the same but the external data or the Vega lib changes, the newer image will overwrite the older one.
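
Very roughly, the save-time flow in the graph ext might look like this sketch (the RESTBase endpoint, the header name, and the URL layout are all assumptions, not an existing API):

// Hypothetical sketch: POST the spec to RESTBase, get the hash back,
// and emit an <img> pointing at the stored rendering.
$req = MWHttpRequest::factory( "$restbaseUrl/graph/save", array(
	'method' => 'POST',
	'postData' => FormatJson::encode( $graphSpec ),
) );
$status = $req->execute();
if ( $status->isOK() ) {
	$hash = $req->getResponseHeader( 'X-Graph-Hash' ); // assumed header name
	$html = Html::element( 'img', array(
		'class' => 'mw-graph-img',
		'src' => "$restbaseUrl/graphoid/$hash.png", // assumed URL layout
	) );
}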

How to solve BonusC?

BonusC can only be solved if we store each generated image forever (except those generated for the "preview" mode). At the same time, we do not want to store multiple identical images. But we won't know that the image is identical until after we generate it.

  • A user should be able to view an old page revision with all additional resources (e.g. graph) - as of the moment when the newer page revision was added
  • For the "HEAD" revision of the page, and ONLY for it, the page should be in a flux state - changes to any templates, or external resources such as images, should be reflected in the HEAD revision automatically.
  • We want to optimize both storage and cache, so that if two revisions contain a reference to an identical resource, that resource is only stored once, and has the same URL to improve browser and server-side caching.

So it seems we need two mechanisms - one to keep the HEAD revision up to date (null edits), and another for archiving - preserving the current HEAD state, plus all referenced resources/images/graphs/etc, before adding a new revision. It is OK if there was a rendering error in an old revision - that is an accurate representation of the past. TBD: a mechanism to manually delete old images for legal reasons.

Assuming we use Cassandra (solution #2) for image storage: when the graph ext POSTs the spec to RB, RB could first generate the image, and only then return the hash. This way the HTML will contain a permanent URL to the image. This works, but at a significant performance cost - parsing will pause until image generation is done. Another problem is null edits - each null edit could generate a different image for the same article revision, and all of them would be stored forever. To delete them, we would have to track which image belongs to which revision.

Yurik changed the title from "Graph/Graphoid - architecting next steps" to "Graph/Graphoid/Kartographer - data storage architecture". Nov 22 2015, 1:09 AM
tstarling added a comment (edited). Nov 24 2015, 12:18 AM

The Math extension uses its own table, very similar to your solution #1. Both Math and Score use Swift for storage of generated PNGs.

Considering the very similar problems solved by Math, Score and your new extensions, it would be nice if there was some sort of common framework. Math and Score already have hundreds of lines of duplicated boilerplate code.

Yurik added a comment. Nov 26 2015, 2:03 AM

@tstarling - I totally agree about de-duplicating the code. At what point do Math & Score save the image - during the parse, or on request? If on request, how do they solve the slow replication problem - a slave DB might not yet have the data needed by the rendering service.

The math extension saves on parse / save, either to the db (old) or to RESTBase (new).

GWicke added a comment (edited). Dec 9 2015, 11:09 PM

I think it's worth stepping back for a moment to reflect on the problem we are trying to solve here.

Judging by the popularity of using templates to generate graph definitions, it seems clear that some kind of graph template or "preset" functionality is desired by users. Many graphs end up fitting the same preset pattern. It might also be desirable if these presets could be shared across wikis.

Things that differ between uses of a particular type of graph are:

  • The data source.
  • A limited number of display options, currently often passed as template parameters.

Examples of existing presets (implemented via wikitext templates) following this pattern:

Based on this, it seems that we should be able to simplify the graph tag to only provide:

  1. The name of the preset graph definition to use.
  2. One or more references to the datasets to use.
  3. Some config options passed to the preset graph definition, using a stable parameter interface defined by it.

This would largely separate presentation from data, and can enable sharing of graph presets across projects. Specs would likely be small enough to fit in a URL, which would let us generate graphs asynchronously & on demand, significantly simplifying the storage architecture. Specs would also be small enough to avoid a "wall of gibberish" editor experience in pages. Previews in VisualEditor can be implemented by constructing an image href from the spec & loading the image, without having to worry about long-term storage or POSTs.

We could further consider allowing the specification of small inline datasets, and moving the parameters into tag parameters. Strawman, referencing a dataset:

<graph type="GraphMap" 
  basemap="WorldMap-iso2.json"
  data="wikidata:somequery"
  scale="100"
  projection="conicEqualArea"
  colorScale="category10"
  domainMin="1000"
  domainMax="1500000"
  defaultValue="red"
  scaleType="log"
  legend="yes"/>

Same, using a small inline yaml data set:

<graph type="GraphMap" 
  basemap="WorldMap-iso2.json"
  scale="100"
  projection="conicEqualArea"
  colorScale="category10"
  domainMin="1000"
  domainMax="1500000"
  defaultValue="red"
  scaleType="log"
  legend="yes">
AE: 8442
AF: 28398
AL: 3150
</graph>
Krinkle added a subscriber: Krinkle. (Edited) Dec 9 2015, 11:23 PM

Proposal from my angle of today's IRC meeting.

Problems:

  • Don't want to generate end-user files within the parser (Score audio file, Math rendering, Graph rendering, etc.).
  • URLs to these files must be deterministic so that they can be generated by a 404-handler.
  • URLs must contain enough metadata so that the rendering service can fetch the source text from MediaWiki.
  • MediaWiki has to either permanently store each extracted snippet from all revisions indefinitely, or itself be able to re-extract it on demand.
  • MediaWiki should not clobber said store with duplicate entries every time a revision is saved and the tag wasn't modified.
  • MediaWiki should support exposing data through all layers for revisions not yet saved (e.g. preview); for example, the URL to the end-user file cannot rely on a revision ID, or needs an alternate approach for previews.

Solution:

  • MediaWiki extracts text (from extension tags <score>, <math>, <graph> etc) in the Parser and puts these in a data store (BagOStuff backend, SQL by default, Memcached/Redis for prod).
  • Data store would be volatile (LRU/TTL?) and uniquely keyed by hash to de-duplicate across revisions.
  • In order for MediaWiki to be able to re-generate, we need a link table to associate a hash with a revision ID and position of the extension tag in the source (e.g. revision-123-graph-2). The link table would be persistent and keep data of old revisions. Though being keyed by hash as well, it would not contain duplicates. But it would keep around all different versions of extension tags ever created. Which is desirable considering we want to be able to view old revisions.
  • End-user URLs to the generator service, as output by the Parser, identify the source data by their hash.
  • Generator service queries MediaWiki API (possibly through RESTBase) for the source data.

I think this addresses all mentioned problems. A few highlights:

  • For previews, we'll rely on the data store keeping around the blob long enough for the user to load the preview and the service to generate the media file – since MediaWiki won't be able to backfill this as it won't have a linktable entry yet.
  • There shouldn't be an issue with slave lag affecting link tables, since link tables would only be used for cache misses in the data store, which presumably won't happen in the first few seconds while replication takes place. The exception is multi-DC, because the data store (if BagOStuff) won't be replicated. We want to use wgMainStash instead of wgMainCacheType so that it's replicated. (ping @aaron)
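
Sketching the first two bullets of this solution (the stash backend and the key layout are assumptions):

// During the parse: stash the extracted <graph> content, keyed by its hash.
global $wgMainStash;
$stash = ObjectCache::getInstance( $wgMainStash );
$hash = hash( 'sha256', $tagText );
$stash->set( wfMemcKey( 'ext-snippet', 'graph', $hash ), $tagText, 30 * 86400 );

// Persistent link table row, so the snippet can be re-extracted on a miss:
//   ( hash, rev_id, position ) - e.g. ( $hash, 123, 2 ) for "revision-123-graph-2"
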
Krinkle added a subscriber: aaron. Dec 9 2015, 11:32 PM

I'm a bit worried about the revision-123-graph-2 -- that implies you have a counter (a state) between parsing each graph, which is a big no-no from my understanding, as it prevents parallel template parsing.

A few usage examples:
https://de.wikipedia.org/wiki/Vorlage:GraphMap
https://de.wikipedia.org/wiki/Vorlage:GraphChart
https://hu.wikipedia.org/wiki/Kr%C3%A1lovice (most huwiki cities have Lua-generated graphs)
https://en.wikipedia.org/wiki/Template:Graph:Moscow_Metro_expansion

I'm a bit worried about the revision-123-graph-2 -- that implies you have a counter (a state) between parsing each graph, which is a big no-no from my understanding, as it prevents parallel template parsing.

How else can one identify which tag in the wikitext revision is the one in question? I'm open to other ideas. A sequential index is the most reliable approach I could think of, though even it is not fully reliable, since transcluded content can bring in a varying number of graphs. And either way requires a full parse to find out.

Adding unique IDs to the wikitext in PST also isn't feasible since we can't add IDs to transcluded content.

GWicke added a comment (edited). Dec 10 2015, 12:22 AM

@Krinkle: Something I proposed in an earlier iteration of this discussion was to address the original spec by (revision, hash of the entire unexpanded <graph> tag). However, as you point out, the devil is in the details. References to graph tags in transcluded content are problematic in general, and the entire extraction process is a lot of complexity in itself. Also, another code path is needed for previews.

tstarling added a comment (edited). Dec 10 2015, 6:33 AM

I'm a bit worried about the revision-123-graph-2 -- that implies you have a counter (a state) between parsing each graph, which is a big no-no from my understanding, as it prevents parallel template parsing.

How else can one identify which tag in the wikitext revision is the one in question? I'm open to other ideas. A sequential index is the most reliable approach I could think of, though even it is not fully reliable, since transcluded content can bring in a varying number of graphs. And either way requires a full parse to find out.

It's a bit of a moot point, if they're all generated from templates and the templates can change. There's no guarantee that any of the graphs in an old revision will stay the same when you reparse them a month later. The concept of identity becomes a bit cloudy. And yes, the number of graphs can change. A Lua module can even generate {{#tag:graph}} calls with frame.callParserFunction(), depending on the phase of the moon or whatever.

Bearing in mind that this is just a storage space optimisation and is not required for solving the problem:

I think a way to do it could be by tracking image references in server-side caches. If we can be notified when an HTML cache object is evicted, preferably with the HTML given to us, then we can extract the image URLs and decrement their reference counts. Once the reference count reaches zero, a time-to-live (TTL) period is set. Once the TTL expires, the object is considered to be no longer referenced in client caches and is evicted. So on preview, the hash can be inserted with a reference count of zero, and if it is added to an HTML cache before the TTL expires, the reference count will be incremented and the data will be preserved.

This will work for all the use cases (math, score, graph).
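
For illustration only, a sketch of that reference-counting flow (every function name here is hypothetical):

// Hypothetical notification handler: a cached HTML object was evicted.
function onHtmlCacheEvict( $html ) {
	foreach ( extractImageUrls( $html ) as $url ) { // hypothetical helper
		$hash = hashFromImageUrl( $url ); // hypothetical helper
		if ( decrementRefCount( $hash ) === 0 ) {
			// No server-side cache references the image any more; start the
			// TTL that covers lingering client-cache references.
			setEvictionTtl( $hash, 30 * 86400 ); // hypothetical
		}
	}
}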

In the short term, if we don't worry about reference counting, here's a quick sketch of my implementation idea:

function graphHook(...) {
	$text = $parser->recursivePreprocess( $text );
	$params = array( 'text' => $text );
	$service = $parser->getRenderService( 'graph' );
	$result = $service->render( $params );
	...

	$html = Html::rawElement( 'img', array(
			'class' => 'mw-graph-img',
			'src' => $result->getUrl( 'g' )
	) );
	...
	return $html;
}

abstract class RenderService {
	function render( $params ) {
		$result = new Result;
		$mapper = $this->getIdMapper();
		$id = $mapper->getId( $params );
		foreach ( $this->getResourceNames( $params ) as $name ) {
			$ext = ...;
			$dir = $this->getBaseUrl() . $this->getSubdir( $id );
			$result->setUrl( $name, "$dir/$name-$id.$ext" );
		}
		return $result;
	}

	// It's nice to have a PHP implementation, at least for non-WMF users, but
	// the 404 handler could forward to an equivalent non-PHP URL if desired.
	// In that case, this function would not be called.
	function handle404( $path ) {
		if ( preg_match( ..., $path, $m ) ) {
			$id = ...;
			$resourceName = ...;
		} else {
			... error ...
		}

		$mapper = $this->getIdMapper();
		$params = $mapper->getParams( $id );
		$worker = $this->getWorker( $resourceName, $params );
		$status = $worker->execute();
		... error handling ...

		$status = $this->getStore()->storeFiles();
		... error handling ...
	}

	function getIdMapper() {...}
	function getWorker() {...}
	function getStore() {...}
}

interface IdMapper {
	function getId( $params );
	function getParams( $id );
}

// As proposed by gwicke
class SerializingIdMapper implements IdMapper {
	function getId( $params ) {
		return base64_encode( json_encode( $params ) ); // or whatever
	}

	function getParams( $id ) {
		return json_decode( base64_decode( $id ) );
	}
}

// An abstraction of the existing Math/Score scheme
class SqlStorageIdMapper implements IdMapper {
	function getId( $params ) {
		$id = $this->getFromDB( $params );
		if ( !$id ) {
			$id = $this->insert( $params );
		}
		return $id;
	}

	function getParams( $id ) {
		...
		return $db->select( ..., array( 'id' => $id ), ... );
	}

	private function getFromDB( $params ) { ... }
	private function insert( $params ) { ... }
}

abstract class RenderServiceWorker {
	function __construct(..., $store, ...) {
		...
		$this->store = $store;
		...
	}
	
	function getStore() {
		return $this->store;
	}

	abstract function execute();
}

abstract class Store {
	abstract function getPath( $resourceName );
	abstract function storeFiles();
}

class LocalStore extends Store {
	function getPath( $resourceName ) {
		...
		return "$dir/$resourceName-$id.$ext";
	}	
	function storeFiles() {
		// Already stored
	}
}

class RemoteStore extends Store {
	function getPath( $resourceName ) {
		...
		$local = "$tempDir/$resourceName-$id.$ext";
		$remote = "$remoteUrlBase/$resourceName-$id.$ext";
		$this->resourcePaths[$local] = $remote;
		return $local;
	}

	function storeFiles() {
		$batch = array();
		foreach ( $this->getResourcePaths() as $local => $remote ) {
			$batch[] = array(
				'op' => 'store',
				'src' => $local,
				'dst' => $remote
			);
		}
		$status = $this->backend->doQuickOperations( $batch );
		return $status;
	}

}

// NB: to reduce code duplication between extensions, worker subclasses have 
// few concerns. They have no knowledge of ID mapping, 404 handling or
// FileBackend storage.
class GraphServiceWorker extends RenderServiceWorker {
	function execute() {
		// Get paths on the local filesystem where destination files should be written
		$graphPath = $this->getStore()->getPath( 'g' );
		$someOtherResourcePath = $this->getStore()->getPath( 'x' );
		
		// Generate the file
		...
		file_put_contents( $graphPath, $graphImageData );
		...
	}
}

class GraphRenderService extends RenderService {
	var $idMapperClass = 'SqlStorageIdMapper';
	var $workerClass = 'GraphServiceWorker';
}

class MathRenderService extends RenderService {
	var $idMapperClass = 'SerializingIdMapper';
	var $workerClass = 'MathServiceWorker';
}
Yurik added a subscriber: Anomie. Jan 6 2016, 11:20 PM
Yurik added a comment. Jan 7 2016, 9:39 AM

I just spoke at length with @daniel & @Anomie. @daniel's proposed solution (slightly modified due to later considerations):

2 storage tables: a blob storage (hash -> JSON blob) and an index table (expiration timestamp, page ID, and hash).

On save, add the hash, page ID, and the max timestamp to the index table. On preview, also add a row to the index table with the timestamp set within 30 days, unless one already exists for longer. On both save and preview, add the JSON to the blob storage, and add the same JSON to memcached.

Have a cleanup script that removes all rows which have no timestamp beyond today.
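
The cleanup script could boil down to something like this (table and column names are assumptions):

-- Hypothetical cleanup: drop index rows whose expiration has passed,
-- then drop blobs that no index row references any more.
DELETE FROM graph_index WHERE expires < NOW();
DELETE FROM graph_blobs
    WHERE hash NOT IN ( SELECT hash FROM graph_index );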

Yurik added a comment (edited). Jan 8 2016, 7:33 PM

A much simpler proposal that will solve most issues with very little effort:

  • On parse (normal & preview), store hash->data in Memcached with indefinite TTL
  • On parse (normal only), store title->timestamp in Memcached (this is only needed to prevent DDOS)

When a client tries to get data via API by title + hash:

  • check Memcached for hash
  • if not exists, check Memcached for title, and fail if title exists and its timestamp is recent
  • if title does not exist or has an old timestamp, perform full title re-parsing (sketched below)
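
Sketched in PHP (the key names, the 60-second window, and reparseTitle() are assumptions):

function getGraphData( $title, $hash ) {
	$cache = wfGetMainCache();
	$data = $cache->get( wfMemcKey( 'graph-data', $hash ) );
	if ( $data !== false ) {
		return $data;
	}
	// Anti-DDOS check: refuse to reparse if the title was parsed very recently.
	$ts = $cache->get( wfMemcKey( 'graph-title', md5( $title ) ) );
	if ( $ts !== false && $ts > time() - 60 ) {
		return false; // recently parsed; the hash is genuinely unknown
	}
	reparseTitle( $title ); // hypothetical: a full re-parse repopulates the cache
	return $cache->get( wfMemcKey( 'graph-data', $hash ) );
}
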
  • if not exists, check Memcached for title, and fail if title exists and its timestamp is recent

I'm not sure what this is trying to achieve. Is it some sort of PoolCounter substitute?

  • if title does not exist or has old timestamp, perform full title re-parsing

When you reparse you often won't see the same hash again, so what graph will you deliver to the user?

Change 263160 had a related patch set uploaded (by Yurik):
Cache JSON objects in memcached

https://gerrit.wikimedia.org/r/263160

Change 263160 merged by jenkins-bot:
Cache JSON objects in memcached

https://gerrit.wikimedia.org/r/263160
