
wikidiff2 performance: DiffEngine similarity bailout
Open, Needs Triage, Public

Description

The fuzz test tool in wikidiff2 routinely produces inputs which take tens of seconds to diff.

A typical example has 113 lines and ~316,000 spaces. Debug logging shows that the word diff complexity limit of 40M is often reached, but even diffs that come in just under that limit take ~500ms. This would be tolerable if there were only a few such diffs, but move detection causes these word diffs to be repeated many times.

One idea I have for improving the situation is to calculate an upper bound on character similarity early. The diff is thrown away if the similarity is below 0.2 for regular line change ops or below 0.4 for move candidates, so there is no point in continuing with the diff once we know the similarity cannot reach the relevant threshold.

An easy place to do an early similarity calculation would be after filling xhash and yhash. Knowing which input elements cannot be part of the LCS means that all such elements are guaranteed to end up in delete/add/change operations rather than in copies.
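As a rough illustration (this is not existing wikidiff2 code, and the per-element bookkeeping shown here is hypothetical), the early check could amount to summing the sizes of the elements that survive the xhash/yhash filtering and testing whether the best-case copy fraction can reach the threshold:

	#include <vector>

	// Hypothetical per-element record: its character count plus whether the
	// xhash/yhash pass found a matching element on the other side.
	struct ElementInfo {
		int charCount;
		bool hasCounterpart;
	};

	// Returns false when even a best-case diff, in which every matchable
	// element ends up in a copy op, could not reach the threshold, so the
	// caller can skip the expensive word diff entirely.
	static bool couldReachSimilarity(const std::vector<ElementInfo>& from,
			const std::vector<ElementInfo>& to, double threshold)
	{
		long charsTotal = 0;
		long maxCopyable = 0;
		for (const ElementInfo& e : from) {
			charsTotal += e.charCount;
			if (e.hasCounterpart) {
				maxCopyable += e.charCount;
			}
		}
		for (const ElementInfo& e : to) {
			charsTotal += e.charCount;
			if (e.hasCounterpart) {
				maxCopyable += e.charCount;
			}
		}
		// The exact form of the bound depends on how charSimilarity is
		// defined; this version matches the simplified metric proposed below.
		return charsTotal > 0 && double(maxCopyable) / charsTotal >= threshold;
	}

If the bound shows that 0.2 (or 0.4 for move candidates) is unreachable, both the word diff and the final similarity calculation can be skipped.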

The similarity metric is currently computed as follows:

	for (int i = 0; i < diff.size(); ++i) {
		int op = diff[i].op;
		int charCount;
		switch (op) {
			case DiffOp<Word>::del:
			case DiffOp<Word>::copy:
				// Deletes and copies are measured on the "from" side.
				charCount = countOpChars(diff[i].from);
				break;
			case DiffOp<Word>::add:
				// Adds are measured on the "to" side.
				charCount = countOpChars(diff[i].to);
				break;
			case DiffOp<Word>::change:
				// Changes count the larger of the two sides.
				charCount = std::max(countOpChars(diff[i].from), countOpChars(diff[i].to));
				break;
		}
		opCharCount[op] += charCount;
		charsTotal += charCount;
	}
	// Similarity is the fraction of counted characters that were copied unchanged.
	if (opCharCount[DiffOp<Word>::copy] == 0) {
		charSimilarity = 0.0;
	} else {
		if (charsTotal) {
			charSimilarity = double(opCharCount[DiffOp<Word>::copy]) / charsTotal;
		} else {
			charSimilarity = 0.0;
		}
	}

I would replace the switch with a simpler calculation, making the metric linear in the per-op character counts:

charCount = countOpChars(diff[i].from) + countOpChars(diff[i].to);

So copies would count twice, and changes would have the sum of added and deleted characters instead of the maximum. Then the charSimilarity denominator would be simply the sum of the input sizes. The upper bound on the numerator would be the sum of the sizes of matching lines in xhash/yhash.
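Concretely, the accounting loop quoted above might then look something like this (a sketch only, reusing the variable names from the snippet rather than proposing final code):

	// Proposed linear accounting: each op contributes the characters it
	// consumes on both sides, so copies count twice and changes count
	// from + to rather than max(from, to).
	for (int i = 0; i < diff.size(); ++i) {
		int charCount = countOpChars(diff[i].from) + countOpChars(diff[i].to);
		opCharCount[diff[i].op] += charCount;
		charsTotal += charCount;
	}
	// charsTotal is now simply the combined size of the two inputs, which is
	// known before the diff runs, and the copy numerator is bounded above by
	// the combined size of the elements retained by xhash/yhash.
	charSimilarity = charsTotal
		? double(opCharCount[DiffOp<Word>::copy]) / charsTotal
		: 0.0;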

Because this changes the metric, the thresholds might need to be retuned. A feature flag might also be needed.