Page MenuHomePhabricator

Port wikidiff2 to a memory-safe language
Open, Needs TriagePublic

Description

With active work going on in wikidiff2, I'm creeped out by it being in C++ which isn't known for being very safe against buffer overflows and other memory safety issues. It would be cool if it were in a language that doesn't have these problems.

Possible solutions:

  • Stick to PHP implementation. Last time I checked, it was roughly an order of magnitude slower than our current wikidiff2. On pathologically bad inputs, however, DifferenceEngine activates an improved algorithm that cuts a few corners and can be even faster than the straightforward algo used by wikidiff2 - at the expense of somewhat worse (at least theoretically) results.
  • Keep it a PHP extension. The only real candidate here is Rust which can easily create shared libraries callable from C/C++. This also appears to be the fastest C++ alternative available.
  • Use some other language. That would require doing diffs via a service. Note that our standard language used for services - Node.js - is roughly in PHP ballpark performance wise.

Event Timeline

Does the PHP implementation also have the improved handling for Japanese/Chinese/Thai languages?

Modern C++ (aka C++11 and later) can be memory safe and free from buffer overflows as long as good practices are followed -- avoid usage of raw pointers (use unique_ptr and shared_ptr instead), make good use of const reference and move semantics instead of passing things by pointer between functions, allocate on the stack instead of the heap where feasible, and use RAII patterns to encapsulate OS resources like file handles. The main issue is the layer in between the PHP API and the extension, as the PHP API is raw C. Both C++ and Rust can have issues here if that API is misused, so Rust doesn't bring you any advantages in that regard.

While it still might be easier to rewrite it in Rust (going from old-style C++ to modern C++ is effectively also a rewrite despite being the same base language), I just wanted to chime in that the overall premise of "C++ isn't safe" is flawed, provided modern practices and patterns are followed with it.

(Blame @Legoktm for this comment, he told me to make it 😉)

I'm not especially concerned about wikidiff2 since it's written in a defensive way and has mostly been reviewed for security. We are passing user input to much more horrifying C code, like exif which has the worst pointer arithmetic tricks I have ever seen and has been the subject of multiple security vulnerabilities.

Modern C++ (aka C++11 and later) can be memory safe and free from buffer overflows as long as good practices are followed

I wouldn't go that far. Wikidiff2 avoids pointer arithmetic, and it doesn't use malloc or the new operator, but it uses iterators heavily and so is subject to iterator invalidation, which is the C++ equivalent of a dangling pointer. The std::vector operator[] and iterator arithmetic are not bounds checked, so buffer overflow is possible. We could consider compiler bounds check flags to improve these areas. I did use std::vector<>::at() in a few places where the bounds weren't immediately obvious.

I just wouldn't put wikidiff2 high on my list of things to port to Rust.

As for performance, wikidiff2 is showing up in PHP timeout logs now that we have PHP timeouts again, showing that it commonly takes tens of seconds for certain sorts of diffs. It has good constant factors but a poor time order, probably O(N^3) with the new paragraph move detection. There's a reason we ported it from PHP to C++, and that reason holds now more than ever. I don't think it would be feasible in JavaScript without reconsidering the algorithm.

As for performance, wikidiff2 is showing up in PHP timeout logs now that we have PHP timeouts again, showing that it commonly takes tens of seconds for certain sorts of diffs. It has good constant factors but a poor time order, probably O(N^3) with the new paragraph move detection. There's a reason we ported it from PHP to C++, and that reason holds now more than ever. I don't think it would be feasible in JavaScript without reconsidering the algorithm.

I would be interested in those logs, can you point me to them?

I would be interested in those logs, can you point me to them?

I filed T204010 for this.