Page MenuHomePhabricator

Array vs array comparisons may lead to combinatorial explosion in Web2Cit
Closed, ResolvedPublic

Description

To get a translation score, we compare the translation output vs the expected output for each translation field.

In some cases we need to compare arrays vs arrays (for example, in the author fields). In these cases, we calculate two scores and average them:

  • we compare arrays item-wise in the order they were given (i.e., 1st item of one array vs 1st item of the other array, etc); and
  • we compare one array vs all possible permutations of the other array, and return the one with the highest score.

However, as noted by @Nidiah (who also uses this strategy in Web2Cit-Research), the number of possible permutations grows very rapidly with array length. For example, the items of an array only 10 items long can be ordered in over 3 million different ways!!

As a result, Web2Cit-Server crashes with an out of memory error when trying to process test cases including such "long" arrays, which would result in Web2Cit-Monitor and Web2Cit-Gadget failures as well.

Is there a more efficient array vs array comparison method we may use? In the meantime, consider running the ordered comparison of arrays only.

Event Timeline

diegodlh created this task.

Unordered array comparisons disabled in 1d643ece.

Pending merge into Web2Cit-Core's main, and updating Web2Cit-Server.

One alternative to compare arrays independently of the order of their items could be:

  1. sort both arrays alphabetically
  2. if one array is shorter, equate its length to that of the longest array by adding undefined items; create different versions of this extended array with the undefined items in different positions
  3. compare one array vs all versions of the other array, 1st item vs 1st item, etc; return the highest score

Although this method does have a combinatorial step (2), I don't expect it to return as many combinations as the permutation step in the original approach.

This strategy is not perfect, though. In the following example:

sourcetarget
EnríquezM. Enriquez
StorniA. Storni
Sosa VilladaC. Sosa Villada
OcampoS. Ocampo

The "ordered" score would be relatively high. However, the "unordered" score would be lower (opposite to its original purpose of being a higher, less-strict comparison).

diegodlh claimed this task.

Unordered array comparisons disabled in 1d643ece.
Pending merge into Web2Cit-Core's main, and updating Web2Cit-Server.

Merged in Web2Cit-Core's main and Web2Cit-Server updated accordingly.

Closing for now. Re-open if an alternative unordered comparison strategy is deemed necessary.