On December 8, I noticed that Updater was getting stuck on updates. It turns out there is a performance problem in the Updater code, specifically in RdfRepository.java, in this piece:
Collection<Statement> aboutStatements = new HashSet<>(insertStatements);
aboutStatements.removeAll(entityStatements);
aboutStatements.removeAll(statementStatements);
aboutStatements.removeAll(filtered(insertStatements).withSubjectStarts(uris.value()));
aboutStatements.removeAll(filtered(insertStatements).withSubjectStarts(uris.reference()));
The problem is in the implementation of removeAll, which HashSet inherits from AbstractSet:
if (size() > c.size()) {
    for (Iterator<?> i = c.iterator(); i.hasNext(); )
        modified |= remove(i.next());
} else {
    for (Iterator<?> i = iterator(); i.hasNext(); ) {
        if (c.contains(i.next())) {
            i.remove();
            modified = true;
        }
    }
}
As we can see, when the set is not larger than c, removeAll does not go over the elements of c and remove them. Instead, it goes over the elements of the set and checks whether each one is in c. The problem is that in this case c is a filter on a 100K-size list, so each contains() check scans the whole list (or close to it), and the total cost grows with the product of the two sizes. This makes the whole procedure extremely slow.
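One possible way around this is to never let removeAll call contains() on the filtered view. The following is a minimal sketch of that idea, not the actual Updater fix. ScanningView is a hypothetical stand-in for the filtered(...) view (its contains() does a linear scan and counts its calls), and removeEach is an illustrative helper name. Neither exists in RdfRepository.java.

```java
import java.util.AbstractCollection;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class RemoveAllSketch {

    /** Hypothetical stand-in for a lazily filtered view: contains() scans the backing list. */
    static final class ScanningView extends AbstractCollection<Integer> {
        private final List<Integer> backing;
        long containsCalls = 0;

        ScanningView(List<Integer> backing) {
            this.backing = backing;
        }

        @Override
        public Iterator<Integer> iterator() {
            return backing.iterator();
        }

        @Override
        public int size() {
            return backing.size();
        }

        @Override
        public boolean contains(Object o) {
            containsCalls++;
            return backing.contains(o); // linear scan over the list
        }
    }

    /** Removes the elements of toRemove one by one; never calls toRemove.contains(). */
    static <T> boolean removeEach(Set<T> target, Collection<? extends T> toRemove) {
        boolean modified = false;
        for (T item : toRemove) {
            modified |= target.remove(item); // cheap hash lookup on a HashSet
        }
        return modified;
    }

    public static void main(String[] args) {
        int n = 2_000;
        List<Integer> backing = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            backing.add(i);
        }

        // Built-in removeAll: which branch runs depends on the relative sizes.
        ScanningView view = new ScanningView(backing);
        Set<Integer> viaRemoveAll = new HashSet<>(backing);
        viaRemoveAll.removeAll(view);
        System.out.println("removeAll:  contains() calls = " + view.containsCalls);

        // Element-by-element removal: iterates the view once, no contains() calls.
        ScanningView view2 = new ScanningView(backing);
        Set<Integer> viaRemoveEach = new HashSet<>(backing);
        removeEach(viaRemoveEach, view2);
        System.out.println("removeEach: contains() calls = " + view2.containsCalls);
    }
}
```

An alternative with the same effect is to copy the filtered view into a HashSet first, for example removeAll(new HashSet<>(view)), so that whichever branch removeAll picks, contains() is a hash lookup rather than a list scan.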