Document eqiad/codfw transition plan for OCG
Closed, Declined · Public

Description

OCG has a redis instance which maps "already rendered PDFs" to "machines storing a cached copy of the already-rendered PDF".

The redis instance needs to be purged when we do a datacenter switch.

This is a big hammer, but our hit rates right now are only ~25% (improving that is a task in itself), so it's not that bad.

We need to document the transition process.

Event Timeline

@cscott I think we could aim higher, and it would not be impossible to do. Here is what I would do (a rough sketch follows the list):

  1. Direct all the traffic from mediawiki through the load balancer.
  2. When you get a request, do the usual processing and set a lock (poolcounter can be used for this, I guess) so that no other machine does the same processing, and so that any subsequent request can be answered correctly while the document is not yet ready.
  3. Once you have produced the pdf, upload it to an object storage (I would say cassandra, which allows you to set TTLs too), then remove the lock and the files from disk.
  4. All subsequent requests will find the object on cassandra, if it's not expired.
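
Below is a minimal Node.js sketch of steps 3–4, assuming the cassandra-driver npm package; the keyspace, table, contact point, TTL value and the lock helper are illustrative placeholders, not the actual OCG code:

```
'use strict';
// Sketch only: store a finished render in Cassandra with a TTL and drop all
// local state, so any OCG frontend (in either DC) can serve it afterwards.
const fs = require('fs');
const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
    contactPoints: ['cassandra.svc.example'],  // placeholder contact point
    localDataCenter: 'eqiad',                  // placeholder DC name
    keyspace: 'ocg'                            // placeholder keyspace
});

// Called after a worker has rendered `collectionId` to `pdfPath` while holding a lock.
async function storeRender(collectionId, pdfPath, releaseLock) {
    const pdf = fs.readFileSync(pdfPath);
    await client.execute(
        // 604800 s = 7 days; example TTL, after which Cassandra drops the row itself
        'INSERT INTO pdf_cache (collection_id, pdf) VALUES (?, ?) USING TTL 604800',
        [collectionId, pdf],
        { prepare: true }
    );
    fs.unlinkSync(pdfPath);           // nothing left on the local disk
    await releaseLock(collectionId);  // hypothetical lock helper (poolcounter, redis, ...)
}

// Any subsequent request, from either datacenter, looks the object up here.
async function fetchRender(collectionId) {
    const res = await client.execute(
        'SELECT pdf FROM pdf_cache WHERE collection_id = ?',
        [collectionId],
        { prepare: true }
    );
    return res.rowLength ? res.first().pdf : null;
}
```

Because the TTL lives in the storage layer, entries simply expire on their own; there is no per-machine cache mapping that has to be purged on a datacenter switch.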

This would allow you not only to run OCG in multiple datacenters, but also to make it more reliable and simpler, avoiding problems with pooling/depooling of servers.

I am unsure how complicated it would be to modify OCG to do this, but we do have libraries and internal knowledge for interacting with Cassandra and Swift from Node apps, so from an external point of view this doesn't look like a moonshot.

Adding a few more people who might have an opinion on the ticket.

I agree. Having a consistent mapping between requests and cached content, and using locks to get rid of double-processing, seems like the way to go. Then the flow would become rather trivial (see the sketch after this list):

  1. Check for the existence of a cached copy. If present, return it
  2. Check if a lock for the domain/title combo exists. If it does, wait or instruct the client to retry
  3. Render the PDF, store it with an acceptable TTL and return it
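
A rough, self-contained sketch of that flow in Node; the in-memory cache and lock here are stand-ins for the real backends (Cassandra with a TTL for storage, poolcounter/redis for locking), and all helper names are made up for illustration:

```
'use strict';
// In-memory stand-ins for the cache and lock backends (illustration only).
const cache = new Map();
const locks = new Set();
async function getCached(key)    { return cache.get(key) || null; }
async function tryLock(key)      { if (locks.has(key)) { return false; } locks.add(key); return true; }
async function releaseLock(key)  { locks.delete(key); }
async function renderAndStore(key) {
    const pdf = Buffer.from(`rendered PDF for ${key}`);  // placeholder for the real renderer
    cache.set(key, pdf);                                 // real version: write with a TTL
    return pdf;
}

async function handleRenderRequest(domain, title, writer) {
    const key = `${domain}|${title}|${writer}`;

    // 1. Check for the existence of a cached copy. If present, return it.
    const cached = await getCached(key);
    if (cached) { return { status: 200, body: cached }; }

    // 2. Check if a lock for the domain/title combo exists. If it does,
    //    instruct the client to retry later.
    if (!await tryLock(key)) { return { status: 503, retryAfter: 5 }; }

    // 3. Render the PDF, store it with an acceptable TTL and return it.
    try {
        return { status: 200, body: await renderAndStore(key) };
    } finally {
        await releaseLock(key);
    }
}

handleRenderRequest('en.wikipedia.org', 'PDF', 'rdf2latex').then(r => console.log(r.status));
```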

Agree with @Joe that making OCG machines stateless would make (de)pooling, and operations in general, easier. Do we have stats on how big such a cache would be?

I actually have a question for @cscott

how does mediawiki learn which backend to contact? I don't see any reference to the OCG redis server in mediawiki-config.

Depending on the answer, a multi-DC setup might be significantly easier to create, independently of the other flaws of OCG.

Answering myself: from https://github.com/wikimedia/mediawiki-extensions-Collection/blob/master/Collection.body.php#L1118-L1123 it seems that it's OCG itself that provides the URL from which to download the article, in the form e.g. "ocg1002:8000/command=download_file&collection_id=edac2649681caf7080e37595dae2ed14c8a35342&writer=rdf2latex"

So we could easily make OCG live in the two datacenters if we keep the two clusters completely independent of each other.

It will mean that upon switching servers we'd lose all the cache, but we won't be serving errors to users (besides the need to wait until your PDF is rendered again).
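
For illustration only, here is roughly how a URL of that shape could be assembled on the OCG side in Node; the real OCG code differs, and the function name, port and exact path separator are assumptions:

```
'use strict';
const os = require('os');
const { URLSearchParams } = require('url');

// Illustrative only: build a download URL of the shape quoted above.
function downloadUrl(collectionId, writer, port = 8000) {
    const params = new URLSearchParams({
        command: 'download_file',
        collection_id: collectionId,
        writer: writer
    });
    // os.hostname() usually returns the short name (e.g. "ocg1002"), which is
    // why such a URL is only resolvable from within the same datacenter.
    // This mirrors the quoted form; the real URL may use a '?' separator.
    return `${os.hostname()}:${port}/${params.toString()}`;
}

console.log(downloadUrl('edac2649681caf7080e37595dae2ed14c8a35342', 'rdf2latex'));
```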

The obvious first step here is to start providing FQDNs in the URLs so that the target machine is reachable from wherever the extension's code is executed.

Yes, that would help greatly.

The code is just using os.hostname in node, so FQDN is probably a VM configuration issue rather than an OCG code issue.

If I understand the lock suggestion, on a datacenter transition we'd let the backend machines currently in the process of generating PDFs (i.e., holding a lock) complete their jobs and upload the results to Cassandra. Cassandra, being a distributed cache, wouldn't mind if half its servers (i.e., eqiad) suddenly went away.

@cscott let's focus on the simpler issues rather than on changing the storage system.

As for getting the FQDN, the suggestions I see are along the lines of http://unix.stackexchange.com/questions/259702/node-js-get-fqdn but @mobrovac might have better advice on this :)
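
For reference, the approach from that Unix.SE question looks roughly like the sketch below: resolve the short hostname to an address, then reverse-resolve it. This is only a sketch; the result depends on the host's resolver configuration, and the `getFqdn` helper is made up:

```
'use strict';
const os = require('os');
const dns = require('dns');

// Sketch: derive an FQDN by resolving the short hostname and then doing a
// reverse lookup on the resulting address.
function getFqdn(callback) {
    dns.lookup(os.hostname(), { hints: dns.ADDRCONFIG }, (err, address) => {
        if (err) { return callback(err); }
        dns.lookupService(address, 0, (err, hostname) => {
            if (err) { return callback(err); }
            callback(null, hostname);  // e.g. "ocg1002.eqiad.wmnet" instead of "ocg1002"
        });
    });
}

getFqdn((err, fqdn) => console.log(err || fqdn));
```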

Turns out that puppet is overriding the hostname anyway, see discussion in T133864: Use FQDNs instead of hostnames in the download urls sent to Mediawiki.

@Joe We have a tentative plan for eqiad/codfw transition in https://phabricator.wikimedia.org/T120077#2248789 -- where on-wiki should this information live?

As already announced in Tech News, OfflineContentGenerator (OCG) will no longer be used on Wikimedia sites after October 1st, 2017. OCG will be replaced by Electron. You can read more on mediawiki.org.

Declining as OCG has been dead for many years - see T177931