We currently assume that all implementations return the same result for every input. For testers, we can absorb irrelevant differences through the tester result function (Z8K2), but it is unclear what to do for synthetic results: how do we know whether a difference between two implementations is meaningful or not?
The question was raised here:
https://meta.wikimedia.org/wiki/Talk:Abstract_Wikipedia/Updates/2021-06-17
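
To make the problem concrete, here is a minimal Python sketch of comparing implementations on synthetic inputs with a pluggable equivalence check. It is only an illustration, not how the system works today: `find_disagreements` and its `equivalent` parameter are hypothetical names, and the two "implementations" are just Python's built-in `sum` and `math.fsum` standing in for real ones.

```python
import math
from itertools import combinations

def find_disagreements(implementations, inputs, equivalent=lambda a, b: a == b):
    """Run every implementation on every input and report differing pairs.

    `implementations` maps a label to a callable (hypothetical stand-ins).
    `equivalent` decides whether two results count as the same; the default
    is strict equality, which is the assumption described above.
    """
    disagreements = []
    for value in inputs:
        results = {name: impl(value) for name, impl in implementations.items()}
        for (name_a, res_a), (name_b, res_b) in combinations(results.items(), 2):
            if not equivalent(res_a, res_b):
                disagreements.append((value, name_a, res_a, name_b, res_b))
    return disagreements

# Two ways of summing floats that differ only by rounding error.
implementations = {"naive": sum, "fsum": math.fsum}
synthetic_inputs = [[0.1] * 10]

# Strict equality flags the rounding difference as a disagreement.
strict = find_disagreements(implementations, synthetic_inputs)

# A tolerance-based check treats the same difference as irrelevant.
tolerant = find_disagreements(implementations, synthetic_inputs, equivalent=math.isclose)
```

With strict equality the pair is reported as a disagreement, while the tolerance-based check reports none. The open question is who supplies the equivalent of `equivalent` for synthetic results, since there is no tester to carry it.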