Array vs array comparisons may lead to combinatorial explosion in Web2Cit
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	diegodlh
	Jul 29 2022, 9:53 PM

Description

To get a translation score, we compare the translation output vs the expected output for each translation field.

In some cases we need to compare arrays vs arrays (for example, in the author fields). In these cases, we calculate two scores and average them:

we compare arrays item-wise in the order they were given (i.e., 1st item of one array vs 1st item of the other array, etc); and
we compare one array vs all possible permutations of the other array, and return the one with the highest score.

However, as noted by @Nidiah (who also uses this strategy in Web2Cit-Research), the number of possible permutations grows very rapidly with array length. For example, the items of an array only 10 items long can be ordered in over 3 million different ways!!

As a result, Web2Cit-Server crashes with an out of memory error when trying to process test cases including such "long" arrays, which would result in Web2Cit-Monitor and Web2Cit-Gadget failures as well.

Is there a more efficient array vs array comparison method we may use? In the meantime, consider running the ordered comparison of arrays only.

Event Timeline

diegodlh triaged this task as High priority.Jul 29 2022, 9:53 PM

diegodlh created this task.

Unordered array comparisons disabled in 1d643ece.

Pending merge into Web2Cit-Core's main, and updating Web2Cit-Server.

diegodlh moved this task from To do to Doing on the Web2Cit-Core board.Jul 29 2022, 10:37 PM

One alternative to compare arrays independently of the order of their items could be:

sort both arrays alphabetically
if one array is shorter, equate its length to that of the longest array by adding undefined items; create different versions of this extended array with the undefined items in different positions
compare one array vs all versions of the other array, 1st item vs 1st item, etc; return the highest score

Although this method does have a combinatorial step (2), I don't expect it to return as many combinations as the permutation step in the original approach.

This strategy is not perfect, though. In the following example:

source	target
Enríquez	M. Enriquez
Storni	A. Storni
Sosa Villada	C. Sosa Villada
Ocampo	S. Ocampo

The "ordered" score would be relatively high. However, the "unordered" score would be lower (opposite to its original purpose of being a higher, less-strict comparison).

Unordered array comparisons disabled in 1d643ece.
Pending merge into Web2Cit-Core's main, and updating Web2Cit-Server.

Merged in Web2Cit-Core's main and Web2Cit-Server updated accordingly.

Closing for now. Re-open if an alternative unordered comparison strategy is deemed necessary.

Superzerocool moved this task from Backlog to Done on the Web2Cit-Monitor board.Aug 29 2022, 1:02 PM

Array vs array comparisons may lead to combinatorial explosion in Web2CitClosed, ResolvedPublicActions

Description

Event Timeline

Array vs array comparisons may lead to combinatorial explosion in Web2Cit
Closed, ResolvedPublic
Actions