Math hashes should include versioning to allow sensible updates
Closed, Resolved, Public

Description

Currently, if there's been a behavior change in texvc which affects rendering, there's basically no way to re-render the image / HTML for some given input that's been previously rendered.

This makes it very difficult to clean up after bugs. :(

A couple of possibilities:

  1. Embed a version number into the input and output hashes; bump the version number on any breaking change. Old entries will just not get used anymore... but with no garbage collection we'll end up doubling our disk usage for each version. :P
  2. Embed a version number into the input hash, but *not* the output hash. Update files and purge from squids when they change. May require users to do a force-reload sometimes to see the new file. [Also may have problems with our current caching system for math.]

While we're at it, it wouldn't hurt to change the hash fields from raw binary to hex, which is much easier to work with. :P
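
Both options above fold a version number into the input hash, and the hex change affects how that hash is stored. A minimal sketch of the idea, assuming MD5 as the hash function (an assumption made for illustration) and using the hypothetical names `TEXVC_VERSION` and `input_hash`:

```python
import hashlib

# Hypothetical renderer version; bump it on any breaking change in texvc output.
TEXVC_VERSION = 2


def input_hash(tex: str, version: int = TEXVC_VERSION) -> str:
    """Return a hex digest of the TeX source salted with the renderer version."""
    # Assumption: MD5 is used here purely for illustration.
    payload = f"{version}:{tex}".encode("utf-8")
    # hexdigest() gives a printable string rather than raw binary bytes.
    return hashlib.md5(payload).hexdigest()
```

With this shape, the same TeX source hashes differently under each version, so entries from an older version are simply never looked up again.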

[Also consider plotting garbage collection, though...]


Version: unspecified
Severity: enhancement

Details

Reference
bz16719

Event Timeline

bzimport raised the priority of this task to Low. (Nov 21 2014, 10:25 PM)
bzimport added projects: Math, Schema-change.
bzimport set Reference to bz16719.

conrad.irwin wrote:

I suggest a middle way:

a) add a field to the math table for texvc version (can be done with the BIN -> HEX change) but don't change the input hash at all. The version only needs to be updated when behaviour of a command changes, as error messages aren't cached.

b) change the output hash only if the output PNG may have changed (i.e. add a helper function changed_on() to texvc, like the tex_use_ams() stuff).

This avoids filling the disk with lots of duplicate images, and some easy analysis of the maths table will allow for further garbage collection when necessary.

It may be necessary to insert some retro-active changed_on()s, or to just invalidate all images once, to fix bugs currently there. (Or provide users with a method they can use to purge broken math images)
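
To make point (b) concrete, here is a rough sketch of how a per-command "changed on" table could feed the output hash. It is not the texvc implementation: `CHANGED_ON`, `output_salt`, and `output_hash` are hypothetical names, the version numbers are invented, and MD5 is assumed only for illustration.

```python
import hashlib
import re

# Hypothetical table: command -> version in which its rendering last changed.
CHANGED_ON = {r"\frac": 2, r"\sqrt": 3}


def output_salt(tex: str) -> int:
    """Highest version in which any command used in `tex` changed (0 if none)."""
    commands = re.findall(r"\\[A-Za-z]+", tex)
    return max((CHANGED_ON.get(c, 0) for c in commands), default=0)


def output_hash(tex: str) -> str:
    """Hex output hash that changes only when an affected command is present."""
    payload = f"{output_salt(tex)}:{tex}".encode("utf-8")
    return hashlib.md5(payload).hexdigest()
```

Input that uses no changed command keeps its old output hash, so its file is overwritten in place instead of being duplicated.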

(In reply to comment #1)

Hmm... so the logic on parsing <math> would go roughly:

  • calculate the input hash
  • fetch 'math' table record
    • if no record, run texvc and save the new info into record
    • if record lists old version, run texvc and save the new info into record
    • if record lists current version, do nothing
  • return the HTML/MathML/img from the record, depending on output format
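
The steps above can be sketched roughly as follows. This is only an illustration: a plain dict stands in for the 'math' table, `run_texvc` is a caller-supplied stand-in for invoking texvc, and `MathRecord`, `CURRENT_VERSION`, and `lookup_or_render` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict

CURRENT_VERSION = 2  # hypothetical current texvc version


@dataclass
class MathRecord:
    version: int
    html: str
    mathml: str
    img: str


def lookup_or_render(
    tex: str,
    table: Dict[str, MathRecord],
    run_texvc: Callable[[str], MathRecord],
    fmt: str = "img",
) -> str:
    # Input hash is computed from the source alone; the version lives in the record.
    key = hashlib.md5(tex.encode("utf-8")).hexdigest()
    record = table.get(key)
    # Missing or stale record: run texvc and save the new info.
    if record is None or record.version < CURRENT_VERSION:
        record = run_texvc(tex)
        table[key] = record
    # Current record: nothing to do except return the requested output format.
    return getattr(record, fmt)
```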

After each texvc upgrade, this would cause a re-run of texvc for each unique <math>...</math> contents as they're encountered in wiki page parsing.

If the tweak for output hash is clever enough, this would save new versions of actually affected math bits -- cache-safe due to the new filename -- while non-affected math bits would save over the old file but not look any different, so no caching issues there.

Sounds pretty good to me!

Do we know how to adjust the output hashing only when particular commands are in use?

conrad.irwin wrote:

get per-command hash changes with texvc

(In reply to comment #2)

  • calculate the input hash
  • fetch 'math' table record
    • if no record, run texvc and save the new info into record
    • if record lists old version, run texvc and save the new info into record
    • if record lists current version, do nothing
  • return the HTML/MathML/img from the record, depending on output format

I was originally planning to leave the old rows in the table, so that a maintenance script would be able to pick up old versions of re-rendered files and delete them when they are superseded. It may not be worth the cost of doubling the size of the math table - I'll leave that as Wikimedia's call.
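
A rough sketch of what such a maintenance pass might compute, assuming old rows are kept alongside new ones. The row layout and the names `Row` and `superseded_outputs` are hypothetical and not taken from the patch.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Row(NamedTuple):
    input_hash: str
    version: int
    output_hash: str


def superseded_outputs(rows: Iterable[Row]) -> Set[str]:
    """Output hashes that no newest-version row refers to any more."""
    rows = list(rows)
    newest: Dict[str, Row] = {}
    for r in rows:
        # Keep only the highest-version row for each input hash.
        if r.input_hash not in newest or r.version > newest[r.input_hash].version:
            newest[r.input_hash] = r
    live = {r.output_hash for r in newest.values()}
    # Anything referenced only by older rows is a candidate for deletion.
    return {r.output_hash for r in rows} - live
```

An output file is only reported when every row pointing at it has been replaced, so files whose rendering did not change between versions are left alone.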

Sounds pretty good to me!

Do we know how to adjust the output hashing only when particular commands are
in use?

Patch attached :).

sumanah wrote:

I'm sorry for the delay in response, Conrad. We're working on reducing our backlog of unreviewed commits and patches, since there's been such a wait. :-( Thanks for the patch. If you have time in the next couple of weeks, it would be great if you could check to make sure your patch still cleanly applies to MediaWiki as it is in our Subversion trunk. I'll try to get a reviewer soon!

Thanks.

physik wrote:

Rerendering can be forced with ?action=purge&mathpurge=true
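
For scripted use, the query string above can be appended to a page URL. A small sketch, where `math_purge_url` is a hypothetical helper and the example host is a placeholder:

```python
from urllib.parse import urlencode


def math_purge_url(page_url: str) -> str:
    """Append the purge parameters quoted above to a wiki page URL."""
    query = urlencode({"action": "purge", "mathpurge": "true"})
    # Use '&' if the URL already carries a query string, otherwise '?'.
    sep = "&" if "?" in page_url else "?"
    return f"{page_url}{sep}{query}"


# Placeholder host, for illustration only.
print(math_purge_url("https://wiki.example.org/wiki/Pi"))
```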