Storing PNG images retrieved from mathoid
Closed, Declined (Public)

Description

We need to decide how to store the PNG images retrieved from Mathoid before we can start implementing Mathoid PNG support in the Math extension code.

Physikerwelt updated the task description.
Physikerwelt raised the priority of this task from to High.
Physikerwelt added a project: Math.
Physikerwelt changed Security from none to None.
Physikerwelt added subscribers: Unknown Object (MLST), TTO, Pkra and 2 others.
GWicke added a comment. Edited Dec 9 2014, 8:52 PM

Options:

  • Varnish: Simple to hook up, but some miss rate
  • MySQL database: Simple to hook up, no misses
  • RESTBase: Simple to hook up, no misses

Other considerations: It might be worth thinking about a generic SVG-to-PNG service, which could also handle storage / caching.

Possible interface: Pass it a path to the source SVG (resolved internally), and return a PNG:

GET /v1/en.wikipedia.org/transform/svg/to/png/somepath

This could work for both math & other SVGs.
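
A minimal sketch of how a client might build a request for the interface proposed above. The route layout is the one suggested in the comment; the host name, the helper name `svg_to_png_url`, and the example paths are hypothetical, and nothing in this thread confirms that a service exists in this form.

```python
from urllib.parse import quote


def svg_to_png_url(domain: str, svg_path: str,
                   base: str = "https://rest.example.org") -> str:
    """Build the URL for the proposed SVG-to-PNG transform route.

    `base` is a placeholder host (assumption); the route layout follows
    the GET line suggested in the comment.
    """
    # Percent-encode the path but keep "/" so nested paths stay readable.
    encoded = quote(svg_path.lstrip("/"), safe="/")
    return f"{base}/v1/{domain}/transform/svg/to/png/{encoded}"


print(svg_to_png_url("en.wikipedia.org", "somepath"))
```

Keeping the source path in the URL would let the same route serve math SVGs and any other SVG, which is the point made above.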

why not store it in the database, like we do for the SVG?

Added MySQL to the options.

fbstj added a subscriber: fbstj. Dec 9 2014, 10:18 PM
Physikerwelt moved this task from Incoming to Next-up on the Math board. Dec 9 2014, 10:55 PM
Physikerwelt moved this task from Next-up to Doing on the Math board. Apr 13 2015, 5:29 PM

Change 204328 had a related patch set uploaded (by Physikerwelt):
Prepare math extension for mathoid PNG support

https://gerrit.wikimedia.org/r/204328

fbstj removed a subscriber: fbstj. Apr 16 2015, 7:21 AM

The patch linked by the gerritbot suggests storing images in the database. I guess that would be most convenient for small and medium-sized wikis. However, @akosiaris indicated in the Gerrit patch that this is not acceptable.

Storing PNG images in a database is not advisable and is considered bad practice. For example, fetching a relatively small image (say 1 MB) evicts a large part of the query cache (depending on the configuration it can be considered small as well, but it still happens). Also, an otherwise unnecessary transaction occurs, and a connection to the database has to be made, which adds latency. The InnoDB buffer pool also gets filled with image data when it could be holding other data. It is also possible that it will cause replication lag, and it will definitely enlarge the database backups. Finally, it is way slower than fetching images from the filesystem.

Just pointing out that Varnish is a cache layer, not a storage layer. It will work wonders for serving already generated, hot data to users, but it is not persistent storage. So using it for this contradicts the task's title. That being said, it can be used to cache the PNG images generated by Mathoid and greatly reduce the number of req/s the Mathoid service receives, no matter what other choice we make.

I am also a bit unclear on the Task's description. It reads:

We need to decide how to store the PNG images retrieved from Mathoid

Have we already asked ourselves the question "Do we need to store the PNG images generated by Mathoid?" and answered yes?

@akosiaris Please note that the PNG images are usually very small, even compared to the other columns selected by the database engine within the same query. So there are no additional queries. (cf. https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FMath/1667083cbd8f7ab3b922c0f485cb59bcd6587ab8/MathMathML.php#L446)
See http://demo.formulasearchengine.com/index.php?title=Special:FormulaInfo&pid=384996&eid=math1 for a typical example of the image size.
All the arguments you make hold for the MathML and SVG strings as well.
I am not sure about "Finally, it is way slower than fetching images from the filesystem." If the caching argument were true, it might read the image from memory, which is potentially faster than reading it from a hard disk. (http://en.wikipedia.org/wiki/Memory_hierarchy)

To conclude, I think we should continue with the database cache approach and switch to another cache type once a solution for that is ready. Reading PNG and SVG images from different sources does not sound reasonable to me.

@akosiaris: All information currently stored in the database about math in the tables math, mathoid, and mathlatexml is just there to cache the texvc to (MathML, SVG, PNG, HTML) conversion. It would be interesting to see how large the actual table sizes are in production. For my enwiki dump it's on the order of 5 GB for the mathoid table.

Have we already asked ourselves the question "Do we need to store the PNG images generated by Mathoid?" and answered yes?

The generation takes about a second per formula. Some pages have 500 formulae; that would be close to 10 minutes. So I guess the answer is yes.
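
As a rough check of that estimate, assuming the formulae are rendered one after another at about one second each (both figures come from the comment above; the function name is only illustrative):

```python
def cold_render_minutes(formula_count: int,
                        seconds_per_formula: float = 1.0) -> float:
    """Minutes needed to render every formula on a page sequentially, uncached."""
    return formula_count * seconds_per_formula / 60


# 500 formulae at roughly 1 s each: a bit over 8 minutes.
print(round(cold_render_minutes(500), 1))  # 8.3
```

Either way the order of magnitude is minutes per page view without a cache, which is what motivates storing the result.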

OK, this sounds to me like we need caching, assuming we need the latest version of the page/image in 95+% of the cases which sounds reasonable.

@akosiaris Please note that the PNG images are usually very small, even compared to the other columns selected by the database engine within the same query. So there are no additional queries. (cf. https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FMath/1667083cbd8f7ab3b922c0f485cb59bcd6587ab8/MathMathML.php#L446)
See http://demo.formulasearchengine.com/index.php?title=Special:FormulaInfo&pid=384996&eid=math1 for a typical example of the image size.
All the arguments you make hold for the MathML and SVG strings as well.

That's partly true. Text compresses way better than PNGs. I was unaware that we cache MathML and SVG in the database. Perhaps we should revisit that choice.

I am not sure about "Finally, it is way slower than fetching images from the filesystem." If the caching argument were true, it might read the image from memory, which is potentially faster than reading it from a hard disk. (http://en.wikipedia.org/wiki/Memory_hierarchy)

I see 2 issues with that statement:

  1. You are leaving out of the equation the time it takes to actually connect to the database, issue the query, have the database isolate you in a transaction, fetch the data from memory and then return it.
  2. You are comparing hot data (in the MySQL query cache) with cold data (in the filesystem). However, the OS also has a filesystem cache, the page cache, which should be taken into account. In that case the filesystem beats the database again.

There is this simple reproducible benchmark from 2 years ago. http://blog.lick-me.org/2013/01/repeat-after-me-mysql-is-not-a-filesystem/

After all of that, may I just point out that using the filesystem is actually out of the question in this case? So I unfortunately dragged us into a moot point here.

To conclude, I think we should continue with the database cache approach and switch to another cache type once a solution for that is ready. Reading PNG and SVG images from different sources does not sound reasonable to me.

So, caching. We have better options for caching than the database approach these days. Apart from Varnish, we also have memcached and RESTBase.

Regarding the topic discussed, I am not going to add more than what @akosiaris has already said. Databases, and in particular MySQL/MariaDB/InnoDB, make a highly inefficient caching system: they are optimized for storing small data securely and consistently (so durable, with writes that always touch disk). Due to the problems of the query cache with InnoDB multiversioning, partitioning and clustering, plus limitations on SMP systems, the query cache is disabled on all Wikimedia sites (and in any decent database setup). He is already giving you some alternatives (and maybe SVGs should follow the same pattern).

What I really wanted to add regarding the proposed patch is that we should never store base64-encoded data in the database (it should be stored in binary format), and that even if we had to, MEDIUMTEXT is not the right type, as it would introduce character encoding conversion and sorting overhead. Always use BINARY, VARBINARY and their family for binary data. Even wiki pages still use a binary format, although that is mostly due to legacy.
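
To illustrate the size point, here is a small sketch comparing base64 text with raw bytes. It uses Python's built-in sqlite3 with a BLOB column as a stand-in for a MySQL VARBINARY/BLOB column; the table name, column names, key and payload are made up for the example and are not taken from the patch.

```python
import base64
import sqlite3

# Stand-in payload: the 8-byte PNG signature plus filler, not a real image.
png_bytes = b"\x89PNG\r\n\x1a\n" + bytes(range(256)) * 4  # 1032 bytes

encoded = base64.b64encode(png_bytes)
# base64 emits 4 output bytes for every 3 input bytes (about 33% larger).
print(len(png_bytes), len(encoded))  # 1032 1376

conn = sqlite3.connect(":memory:")
# A BLOB column stores the bytes as-is: no character set, no collation.
conn.execute("CREATE TABLE math_png (input_hash TEXT PRIMARY KEY, png BLOB)")
conn.execute("INSERT INTO math_png VALUES (?, ?)", ("abc123", png_bytes))
(stored,) = conn.execute(
    "SELECT png FROM math_png WHERE input_hash = ?", ("abc123",)
).fetchone()
assert stored == png_bytes  # the binary value round-trips unchanged
```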

@akosiaris Please note that the PNG images are usually very small, even compared to the other columns selected by the database engine within the same query. So there are no additional queries. (cf. https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FMath/1667083cbd8f7ab3b922c0f485cb59bcd6587ab8/MathMathML.php#L446)
See http://demo.formulasearchengine.com/index.php?title=Special:FormulaInfo&pid=384996&eid=math1 for a typical example of the image size.
All the arguments you make hold for the MathML and SVG strings as well.

That's partly true. Text compresses way better than PNGs. I was unaware that we cache MathML and SVG in the database. Perhaps we should revisit that choice.

The PNG is in almost all cases smaller than the compressed SVG. Thus, you can regard it as a lossy compression of the SVG image. Of course, the PNG data has higher entropy, so compressing the field itself works worse; however, the information contained in the field is compressed more strongly than in the SVG.
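
The entropy argument can be checked with a quick experiment. PNG image data is already DEFLATE-compressed, so the sketch below uses zlib output as a stand-in for PNG bytes; the SVG-like string is synthetic and is not real Mathoid output.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
# Synthetic SVG-like text: one long path with pseudo-random coordinates.
coords = " ".join(
    f"L{rng.randint(0, 999)} {rng.randint(0, 999)}" for _ in range(2000)
)
svg_text = (
    f'<svg xmlns="http://www.w3.org/2000/svg"><path d="M0 0 {coords}"/></svg>'
).encode()

once = zlib.compress(svg_text, 9)  # markup text shrinks substantially
twice = zlib.compress(once, 9)     # already-compressed bytes barely change

print(len(svg_text), len(once), len(twice))
```

The second pass gains essentially nothing, which is why field-level compression helps an SVG or MathML column but not a PNG column.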

I am not sure about "Finally, it is way slower than fetching images from the filesystem." If the caching argument were true, it might read the image from memory, which is potentially faster than reading it from a hard disk. (http://en.wikipedia.org/wiki/Memory_hierarchy)

I see 2 issues with that statement:

  1. You are leaving out of the equation the time it takes to actually connect to the database, issue the query, have the database isolate you in a transaction, fetch the data from memory and then return it.

That's done once anyhow for the MathML and SVG information. Adding PNG will not add overhead.

  2. You are comparing hot data (in the MySQL query cache) with cold data (in the filesystem). However, the OS also has a filesystem cache, the page cache, which should be taken into account. In that case the filesystem beats the database again.

    There is this simple reproducible benchmark from 2 years ago. http://blog.lick-me.org/2013/01/repeat-after-me-mysql-is-not-a-filesystem/

This test uses 16 MB files. Typical image sizes are a few kB. The size is definitely smaller than the wikitext of a medium-sized article. So one could just as well discuss using a different storage engine for the wikitext.

After all of that, may I just point out that using the filesystem is actually out of the question in this case? So I unfortunately dragged us into a moot point here.

To conclude, I think we should continue with the database cache approach and switch to another cache type once a solution for that is ready. Reading PNG and SVG images from different sources does not sound reasonable to me.

So, caching. We have better options for caching than the database approach these days. Apart from Varnish, we also have memcached and RESTBase.

Because the special page sets the caching information in the page header, memcached should already be used.
Unless I'm mistaken, the database cache is an additional layer of caching that is used only if the special page that displays the image is no longer cached.
This additional layer was introduced because rendering TeX is extremely expensive, probably comparable to rendering thumbnails for images. If the cache were purged for whatever reason, all pages that contain mathematical formulae would need to be rerendered immediately. This would certainly cause a high load spike for the Mathoid servers.
I like this discussion, but I think a benchmark would be more reliable...
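
The layering described above can be sketched as follows. This only illustrates the idea and is not the Math extension's code: the class and function names are invented, a dict stands in for memcached, SQLite stands in for the database cache, and `fake_render` stands in for a call to Mathoid.

```python
import sqlite3
from typing import Callable


class LayeredRenderCache:
    """Volatile cache in front of a persistent cache in front of a renderer."""

    def __init__(self, render: Callable[[str], bytes]):
        self.render = render
        self.fast = {}                          # stand-in for memcached
        self.db = sqlite3.connect(":memory:")   # stand-in for the DB cache
        self.db.execute("CREATE TABLE png_cache (tex TEXT PRIMARY KEY, png BLOB)")

    def get(self, tex: str) -> bytes:
        if tex in self.fast:                    # 1. volatile layer hit
            return self.fast[tex]
        row = self.db.execute(
            "SELECT png FROM png_cache WHERE tex = ?", (tex,)
        ).fetchone()
        if row is None:                         # 3. miss everywhere: render
            png = self.render(tex)
            self.db.execute("INSERT INTO png_cache VALUES (?, ?)", (tex, png))
        else:                                   # 2. persistent layer hit
            png = row[0]
        self.fast[tex] = png                    # repopulate the volatile layer
        return png


render_calls = []


def fake_render(tex: str) -> bytes:
    render_calls.append(tex)                    # count the expensive renders
    return b"PNG:" + tex.encode()


cache = LayeredRenderCache(fake_render)
cache.get("E=mc^2")
cache.fast.clear()         # simulate a purge of the volatile cache
cache.get("E=mc^2")        # served from the persistent layer, no re-render
print(len(render_calls))   # 1
```

With only the volatile layer, the purge would have sent every formula back to the renderer at once, which is the load spike the comment warns about.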

scfc added a subscriber: scfc. May 29 2015, 10:00 PM

It seems that the solution for production is to use RESTBase. However, I'm not sure whether that's feasible for privately administered wikis. What do you think?
I'm not claiming that it's ideal to store the images in the database, but for small wikis with only a few formulae it currently seems the most straightforward approach to me.

@GWicke: While I see that Mathoid's RESTification solves this problem in production, I still do not see how PNG images can be stored for small, privately run wikis that rely on database and filesystem caches. If this problem were solved, everything needed to remove the texvc rendering mode would be in place. One alternative is to use the FS, in the same way as it's currently done. However, calling those hooks that force an immediate write to the FS does not seem an optimal solution to me. Without those hooks, though, PNG images are not visible to readers on the first page visit.

@Physikerwelt, as of last Friday we have a fairly solid SQLite backend for RESTBase, which can be used for small installs. It should also not be too hard to add MySQL or Postgres support based on this.

@GWicke: This makes the world a better place;-)

Physikerwelt closed this task as Declined. Oct 20 2015, 7:46 AM
Physikerwelt claimed this task.

RESTBase should take care of storing whatever needs to be stored (i.e. PNG images).

Change 204328 abandoned by Physikerwelt:
Prepare math extension for mathoid PNG support

https://gerrit.wikimedia.org/r/204328