Page MenuHomePhabricator

Validate TeX input for all renderers, not just texvc
Closed, ResolvedPublic

Description

Author: physik

Description:
Hi,
inspired by the security workshop of the Amsterdam Hackathon (2013) I see some minor issues with the security of the math extension. I think the rendering options (Source, MathJax and LaTeXML) are potentially dangerous since they do not check input and validate/escape the output only.
(The same holds for the alt attribute of the generated texvc image.)
I had a deeper look into the texvc code. The parser seems to do a very good job and I think we should continue to use the parsed output before processing it with MathJaX, LaTeXML or another ?-technology. My concern about the texvc parser is the ocaml language. I think more people are used to antlr nowadays. It’s not too much work to convert the ocaml script to grammar files (.g) that are quite common in the university landscape where each researcher develops her/his own language;-)
The bad news about is that the php output tool used for the TeX checking task.
However, as a result I suppose to change the math rendering process in the
https://code.google.com/p/antlrphpruntime/
is currently not maintained by a lot of people and I’m not sure if it can be following way

  1. Check Input Syntax + Semantics based on grammar.
  2. Pass checked TeX to the renderer.

A further positive aspect by a global grammar is that there is a defined standard which commands are allowed in the math tags and which commands are not allowed.
Apart from that checking the output can introduce additional security. (cf. https://gerrit.wikimedia.org/r/#/c/66365/)
Best
Physikerwelt


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=54624

Details

Reference
bz49169

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 1:50 AM
bzimport set Reference to bz49169.
bzimport changed Security from none to Software security bug.
Restricted Application changed the visibility from "Public (No Login Required)" to "acl*security (Project)". · View Herald TranscriptNov 22 2014, 1:50 AM
Restricted Application changed the edit policy from "All Users" to "acl*security (Project)". · View Herald Transcript

(In reply to comment #0)

The bad news about is that the php output tool used for the TeX checking
task.

Can you explain what you mean by this?

physik wrote:

oh the lines are mixed:

The bad news about is that the php output tool used for the TeX checking task
https://code.google.com/p/antlrphpruntime/
is currently not maintained by a lot of people and I’m not sure if it can be used for production.
However, as a result I suppose to change the math rendering process in the
following way

physik wrote:

An other advantage with this approach is that all users see the same error message if invalid tex code was entered to the page.

By the way. I don't get how to trigger Jenkins... Sometimes it happens automatically sometimes not. The same for the link to the bug reports. I don't know what I'm doing wrong see:
https://gerrit.wikimedia.org/r/#q,I9e4b22b3f08317b7221c4d279e717d18ddb7966b,n,z

(In reply to comment #3)

An other advantage with this approach is that all users see the same error
message if invalid tex code was entered to the page.

By the way. I don't get how to trigger Jenkins... Sometimes it happens
automatically sometimes not. The same for the link to the bug reports. I
don't know what I'm doing wrong see:

It is always triggered on Code-Review +2.

To get triggered before that, you currently have to be added to a whitelist, which I requested (bug 49348). This is so an evil test submitted by an attacker can't take over the machine. This will no longer be necessary when bug 45499 is fixed.

As far as linking to the bug, put:

Bug: 123

at the very bottom of the commit on the *first* patch set (seems to be buggy if added later). If the Gerrit bot still doesn't work, it's a bug in the bot.

(In reply to comment #4)

To get triggered before that, you currently have to be added to a whitelist,
which I requested (bug 49348).

You're apparently already on there, so it must be a different issue.

physik wrote:

(In reply to comment #5)

(In reply to comment #4)

To get triggered before that, you currently have to be added to a whitelist,
which I requested (bug 49348).

You're apparently already on there, so it must be a different issue.

That's strange non of my recent commits was tested. Quite bad timing, since I wanted to test the debug fields for the math extension and they run locally fine, but fail at jenkins.
Is there a way to trigger the build manually. On my own server there is a button trigger events manually (I use the gerrit plugin), on the mw jenkins I could not find a similar a button.

physik wrote:

Ok. I see i have to link manually:
This change demonstrates what I mean.
https://gerrit.wikimedia.org/r/#/c/67607/

(In reply to comment #7)

Ok. I see i have to link manually:
This change demonstrates what I mean.
https://gerrit.wikimedia.org/r/#/c/67607/

You only need the bug line at the bottom. Use "Bug: 49169", not "(Bug 49169)".

physik wrote:

(In reply to comment #8)

You only need the bug line at the bottom. Use "Bug: 49169", not "(Bug
49169)".

Yes, I got it. However I can not change PS1, so I can only do it for the next commit. However, now the link is there.
My main Question is, if the ANTLR variant could be acceptable.

(In reply to comment #9)

My main Question is, if the ANTLR variant could be acceptable.

To confirm, you mean removing the OCaml implementation entirely, and using ANTLR-generated PHP for everything?

I'm not completely convinced this is a good idea, mainly because antlrphpruntime only has two committers, and there hasn't been any change since June 2010.

I do see some of the benefits, though (not needing an OCaml compiler, end-users could just use the PHP without even needing to compile).

physik wrote:

Yes. If would be convinced that this is a good idea, I would simply just do it.

I think especially for private wikis it would be great. No need to run makefiles and to install LaTeX on a web server. Furthermore manny users have no rights to do so.
Especially users that have only little money that they can spend on servers and don't get root access will not be able to use math in the most secure way.

In general I like the way to have a grammer file that is written in widely used language that is compiled to native code (in this case PHP). If someone knows an alternative to antlrphpruntime I'd be more than happy.

A reason against antlrphpruntime is that it could bring more security issues to the system in comparison to not to checking the tex code at all. But that's the reason why I want to discuss that before starting with the implementation.

My main concerns are:

  • If we find an issue with the code produced, or with how the code enforces the grammer, we would want to be able to fix it (with no active development on antlrphpruntime, that would fall to someone in the MediaWiki development community)
  • If we find an issue with the grammer, we need someone who knows how to write the grammer

I don't personally want to be the person who has to make these updates if there is a problem, but if there are 2 or 3 people in the community (so someone is available to respond as soon as an issue is reported) who are comfortable with doing these, then I don't think there should be too much of an issue going in this direction. Might be worth an email to wikitech-l to see?

Math is an extension, so I changed the component.

physik wrote:

The first step is done. I seperated the security checks and the rewrites of custom commands from the actual conversion.
See
https://gerrit.wikimedia.org/r/#/c/90748/
However, I have no idea how to find someone to review this
http://www.formulasearchengine.com/Verify%20texvc%20light

I created a guide how to verify the correctness of the code based on all formula of a database dump en-wiki
at http://www.formulasearchengine.com/Verify%20texvc%20light

However I have the feeling that this will help little to find reviewers for that task.

physik wrote:

It would be nice if someone could take the ticket... so that it get's on any task list. Now I have the impression that nobody has the review on his TODO list even though it affects and probably solves 70+ bugs reported to Bugzilla.

physik wrote:

I think it's a major problem, because beside the security concerns it causes a lot of confusion, if the renders behave different.

physik wrote:

Hi,

Hashar mentioned at

https://gerrit.wikimedia.org/r/#/c/90748/

that the proposed way won't work since nobody will review the changes in ocaml.
Now I spend 2 days on rewriting the ocaml stuff with php.

https://github.com/physikerwelt/texvcCheckPhp/blob/master/statelex.php

Now it basically does very similar things compared to the old ocaml stuff. But for about 5/1000 of the formula it behaves different. However 5/100 is a very high number if one considers that enwiki alone has about 300k formula.
Reimplementing every single stupid 'bug' of the ocaml code seems unreasonable to me. For example I had to add a lot of complexity to support additional (unneeded) brakes and even more to support that patterns like $a^b_c$ are rewritten to $a_c^b$ with is pointless as well.

So, I won't spend further work on a exact reimplementation of the ocaml code in PHP until I get further feedback.

I would strongly recommend to use

https://gerrit.wikimedia.org/r/#/c/90748/

for now and switch the texvc verification component in a further step separated from the introduction of the MathML rendering mode.

For now it's important to demonstrate that the mathml (and svg images) look similar to the former png images.
Furthermore we have to avoid that special behavior of mathjax leads to different output for different users.

All in all I need someone to discuss about this issue, since I can not discuss that with myself.

Best
physikerwelt

Physikerwelt, I know this project is important to you, but it's not quite as high as some other security issues we're dealing with.

It would help if you could outline your plans a little more. From what I can gather from your patches and comments.

  • You merged gerrit 67607, which runs texvc on all math content now, to prevent rendering if texvc fails (due to the checks, or something in the rendering).
  • You modified texvc to create texvccheck, which removes the rendering portion of texvc, in gerrit 90748.

I'm assuming you're planning to replace the call to callTexvc() with something like callTexvcCheck() that calls the texvccheck version of texvc, and use that for the checks created in 67607?

  • You were initially (Comment #1) thinking of re-implementing the texvccheck portion of texvc in antlr grammer, compiled to php.

Is that an accurate summary?

If so, the work remaining here that you're looking for is to get someone to validate that texvccheck correctly validates in the same way that texvc does, +2 gerrit 90748, and then use texvccheck instead of texvc for the validation check. Correct?

physik wrote:

Chris, (In reply to comment #19)

Physikerwelt, I know this project is important to you, but it's not quite as
high as some other security issues we're dealing with.

I see that. However, my evil plan to provoke some kind of feedback was successful;-)

It would help if you could outline your plans a little more. From what I can
gather from your patches and comments.

  • You merged Gerrit change #67607, which runs texvc on all math content now,

to prevent
rendering if texvc fails (due to the checks, or something in the rendering).

I have merged that to the LaTeXML branch, that is used by myself and some scientist that want to use LaTeXML for rendering. But it's not part of the master branch. So for all wikimedia wikis, the tex input is being passed to mathjax without being checked before.

  • You modified texvc to create texvccheck, which removes the rendering

portion
of texvc, in Gerrit change #90748.

Yes.

I'm assuming you're planning to replace the call to callTexvc() with
something
like callTexvcCheck() that calls the texvccheck version of texvc, and use
that
for the checks created in 67607?

Yes. https://gerrit.wikimedia.org/r/#/c/85801/48/MathInputCheckTexvc.php

  • You were initially (Comment #1) thinking of re-implementing the texvccheck

portion of texvc in antlr grammer, compiled to php.

antlr, has been changed a lot. For the current version antlr 4 there is no php support at all. So I see little chances for that approach now. Consequently, I have stopped think about that.

  • Separately (or maybe using antlr?), you made an initial pass at rewriting

the
texvccheck portion of texvc in php, at
https://github.com/physikerwelt/texvcCheckPhp/blob/master/statelex.php, but
it
sounds like that will take some (a lot?) more work to make it compatible with
the existing texvc and texvccheck.

It would require a lot of work. But my main point is that I think it would be wrong to re-implement the portions of texvccheck that are unnecessary or even wrong.

Is that an accurate summary?

Beside the comments yes. Thank you for that.

If so, the work remaining here that you're looking for is to get someone to
validate that texvccheck correctly validates in the same way that texvc does,
+2 Gerrit change #90748, and then use texvccheck instead of texvc for the
validation
check. Correct?

Yes. Gabriel was thinking about reviewing the follow up on #90748( Gerrit change #85801 ) in the next week.
But I could not find anyone, who was even thinking about reviewing change #90748 .

physik wrote:

The problem to me is the folling:
Setting the bug importance to high leads to intermediate feedback, whereas setting the importance to normal means the issue is never tackled.
Visiting
http://meta.wikimedia.org/w/index.php?title=Help:Displaying_a_formula&oldid=15233
proves that the plans support MathML reach more than 10 years to the past.
Since the development of the new version of the Math extension was completed more than 2 month ago (which resolves this bug and a lot of other bugs) indicates that it might take another 10 years until someone submits the change.

physik wrote:

I uploaded a new change that fixes this bug only
https://gerrit.wikimedia.org/r/#/c/105187/

When can we get this moved out of the Security component?

physik wrote:

$wgMathDisableTexFilter = true; was set in https://gerrit.wikimedia.org/r/#/c/113382/ to fix performance problems.
As far as I remember this was not the reason for the performance problem.
So once we this setting is removed, we can forget about this issue.

(In reply to Alex Monk from comment #24)

When can we get this moved out of the Security component?

It's already been fixed with $wgMathDisableTexFilter on 6a0af8f3b4ea985c8dca1a1d5fa78e9d3cf1210c , so the underlying issue is not secret.

However, Chris Steipp would prefer to get it fixed on the cluster before moving out.

(In reply to physikerwelt from comment #25)

As far as I remember this was not the reason for the performance problem.

I guess this needs to be retested and figured out. If it doesn't cause perf problems, we can set:

$wgMathDisableTexFilter = false;

on WMF, then presumably mark this as fixed and move out of Security product.

physik wrote:

it would be nice to see some progress here
"Reported: 2013-06-05"

physik wrote:

Someone needs to merge https://gerrit.wikimedia.org/r/#/c/158559/ and monitor the performance indation or decide that this is no option.
I'm fine with both decision but I neither want nor have permissions to decide that on my own.

physik wrote:

this seems to be not urgend at all... so move it to a future release

Restricted Application changed the visibility from "acl*security (Project)" to "Custom Policy". · View Herald TranscriptNov 24 2014, 9:27 PM
Restricted Application changed the edit policy from "acl*security (Project)" to "Custom Policy". · View Herald Transcript
chasemp changed the visibility from "All Users" to "acl*security (Project)".Jan 14 2015, 6:20 PM
chasemp changed the edit policy from "All Users" to "acl*security (Project)".
chasemp changed the edit policy from "acl*security (Project)" to "Custom Policy".
chasemp changed the visibility from "Custom Policy" to "Custom Policy".
chasemp subscribed.

the policy churn here is me addressing T75781. Old existing CC'd folks should now all be allowed via policy in the expectant way. They changed names of the fields on me from last test till now so I had to reset and fix again. Thanks.

Which patch is still pending related to this ticket?
Latest mentioned was https://gerrit.wikimedia.org/r/#/c/158559/ which got merged.
Which codebase is this task about?

texvc was removed a couple years ago. This ticket is even older than the other related ones :)

T172583#7525298

Mstyles claimed this task.
Mstyles added a project: SecTeam-Processed.
Mstyles subscribed.

due to the fact that texvc was removed, I'm marking this ticket as resolved. Please reopen if there are any issues still remaining.

sbassett changed the visibility from "Custom Policy" to "Public (No Login Required)".
sbassett changed the edit policy from "Custom Policy" to "All Users".
sbassett removed subscribers: chasemp, GWicke, csteipp.