Incomplete continuation in MWAPI
Closed, Resolved · Public

Description

The following query should return all Lexeme revisions by @Nikki (uncomment the MINUS line to turn it into a more useful query):

SELECT ?lexeme ?date (GROUP_CONCAT(DISTINCT ?lemma; separator = "/") AS ?lemmas) WHERE {
  hint:Query hint:optimizer "None".
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "www.wikidata.org";
                    wikibase:api "Generator";
                    mwapi:generator "allrevisions";
                    mwapi:garvuser "Nikki"; # adjust user name here
                    mwapi:garvnamespace "146";
                    mwapi:garvlimit "max".
    ?title wikibase:apiOutput mwapi:title.
  }
  BIND(URI(CONCAT(STR(wd:), STRAFTER(?title, "Lexeme:"))) AS ?lexeme)
  # MINUS { ?lexeme ontolex:sense ?sense. }
  ?lexeme wikibase:lemma ?lemma;
          schema:dateModified ?date.
}
GROUP BY ?lexeme ?date
ORDER BY ASC(?date)

However, it only returns a few results (though the exact number varies as @Nikki continues to edit). When I try the equivalent query in the API sandbox, I notice that after the first page of results, I get (when I click the handy “continue” button) a few empty pages before more results arrive. I suspect that WDQS terminates continuation after the first empty continuation page, assuming that no more results will arrive – but this is incorrect, according to the following note on the garvnamespace parameter:

Note: Due to miser mode, using this may result in fewer than arvlimit results returned before continuing; in extreme cases, zero results may be returned.
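
For reference, such an empty continuation page carries no results at all, only a continue block telling the client to keep going. It looks roughly like this (the garvcontinue value is invented for illustration):

{
    "continue": {
        "garvcontinue": "20131116201233|83969503",
        "continue": "garvcontinue||"
    }
}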

Event Timeline

Looking a bit more closely at the code, I suspect I’ve slightly misdiagnosed the problem. Specifically, consider ContinueIterator::next, near the end of MWApiServiceCall.java:

@Override
public IBindingSet next() {
    if (recordsCount >= maxContinue) {
        close();
    }
    if (closed || lastResult == null) {
        return null;
    }
    if (lastResult.getResultIterator().hasNext()) {
        recordsCount++;
        return lastResult.getResultIterator().next();
    }
    // If we can continue, do the continue
    if (allowContinue && lastResult.getContinue() != null) {
        lastResult = doSearchRequest(recordsCount);
    }
    if (closed || lastResult == null) {
        return null;
    }
    if (lastResult.getResultIterator().hasNext()) {
        recordsCount++;
        return lastResult.getResultIterator().next();
    }
    return null;
}

This looks like it should handle a single empty page correctly, but I think it gets tripped up when there are multiple consecutive empty pages. The duplicated code before and after the continue handling smells like something that should be a loop instead.
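
To illustrate the idea, here is a minimal sketch of what a loop-based next() might look like. It reuses the names from the snippet above (recordsCount, lastResult, doSearchRequest, and so on); it is not the actual patch.

@Override
public IBindingSet next() {
    // Sketch: same logic as above, but with the duplicated blocks folded
    // into a loop, so any number of consecutive empty pages is handled.
    while (true) {
        if (recordsCount >= maxContinue) {
            close();
        }
        if (closed || lastResult == null) {
            return null;
        }
        if (lastResult.getResultIterator().hasNext()) {
            recordsCount++;
            return lastResult.getResultIterator().next();
        }
        // The current page is exhausted: follow the continuation if we may,
        // and loop back even if the next page turns out to be empty too.
        if (allowContinue && lastResult.getContinue() != null) {
            lastResult = doSearchRequest(recordsCount);
        } else {
            return null;
        }
    }
}

For this to work, doSearchRequest also has to return an empty page with its continuation intact, instead of null, which is what the patch below addresses.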

A better (i.e. more static) example might be GZWDer: they’ve created a lot of lexemes and are currently inactive. See the edits vs. the query.

I experimented with this a bit, and applying the following patch is enough to yield 458 results by GZWDer instead of 35:

diff --git a/blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/mwapi/MWApiServiceCall.java b/blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/mwapi/MWApiServiceCall.java
index 2993f8e..396b700 100644
--- a/blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/mwapi/MWApiServiceCall.java
+++ b/blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/mwapi/MWApiServiceCall.java
@@ -315,10 +315,10 @@ public class MWApiServiceCall implements MockIVReturningServiceCall, BigdataServ
         // Note though that XPathExpression is not thread-safe. Maybe use ThreadLocal?
         XPathExpression itemsXPath = path.compile(template.getItemsPath());
         NodeList nodes = (NodeList) itemsXPath.evaluate(doc, XPathConstants.NODESET);
-        if (nodes.getLength() == 0) {
-            return null;
-        }
         IBindingSet[] results = new IBindingSet[nodes.getLength()];
+        if (results.length == 0) {
+            return new ResultWithContinue(results, searchContinue);
+        }
         final Map<OutputVariable, XPathExpression> compiledVars = new HashMap<>();
         // TODO: would be nice to convert it to stream expression, but xpath
         // throws, and lambdas do not work with checked exceptions.

The problem is that going through the continuations, all the way back to 2013 (the first part of the garvcontinue parameter is clearly a timestamp), actually takes a lot of requests: 148, in this case. And while they probably won’t take quite as long in production, with WDQS and the API server being in the same datacenter, that’s still a pretty significant slowdown. We might need to put some limit on continuation across empty results after all.

For @Nikki, it’s even worse – the service needed 1664 API requests, which took slightly over five minutes, to return 690 results. (I assume the continuation was more “fine-grained” this time because Nikki has more edits than GZWDer, even though before 2018 none of them were in the Lexeme namespace.)
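
A minimal sketch of such a limit, building on the loop sketch above; the field names and the threshold here are assumptions for illustration, not the actual implementation:

// Hypothetical fields (names and default invented for this sketch):
private int emptyContinuations;
private int maxEmptyContinuations = 25;

// ... inside the loop, right after lastResult = doSearchRequest(recordsCount):
if (lastResult != null && lastResult.getResultIterator().hasNext()) {
    emptyContinuations = 0; // this page has results, reset the counter
} else if (++emptyContinuations > maxEmptyContinuations) {
    close(); // too many consecutive empty pages, give up
    return null;
}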

You can use the arvstart parameter to avoid going through all the edits. Something like mwapi:garvstart "20180501000000"; for lexemes, I guess.

Long term, it would help if we could use usercontribs instead of allrevisions for things like this.
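
For comparison, the raw API request would look something like this (just an illustration of the standard list=usercontribs module; the MWAPI service templates would need to grow support for it first):

https://www.wikidata.org/w/api.php?action=query&list=usercontribs&ucuser=Nikki&ucnamespace=146&uclimit=max

As far as I can tell, usercontribs looks up revisions by user directly, so it shouldn’t produce the empty miser-mode pages that allrevisions with arvuser does.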

Change 472704 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[wikidata/query/rdf@master] Allow continuation across empty results

https://gerrit.wikimedia.org/r/472704

Smalyshev triaged this task as Medium priority. Dec 13 2018, 6:17 PM

Change 472704 merged by jenkins-bot:
[wikidata/query/rdf@master] Allow continuation across empty results

https://gerrit.wikimedia.org/r/472704

Smalyshev claimed this task.

The code has been deployed, but I’m not sure whether it works properly, as the original query still returns only a few results. @Lucas_Werkmeister_WMDE, could you check?

Well, the original query doesn’t tweak the parameters, and by default we don’t want to follow too many empty continuations. This query returns results, though:

SELECT ?lexeme ?date (GROUP_CONCAT(DISTINCT ?lemma; separator = "/") AS ?lemmas) WHERE {
  hint:Query hint:optimizer "None".
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "www.wikidata.org";
                    wikibase:api "Generator";
                    wikibase:limitEmptyContinuations 200;
                    mwapi:generator "allrevisions";
                    mwapi:garvuser "Nikki"; # adjust user name here
                    mwapi:garvnamespace "146";
                    mwapi:garvstart "20181101000000"; # when this issue was reported (roughly)
                    mwapi:garvend "20180501000000"; # when lexemes were introduced (roughly)
                    mwapi:garvlimit "max".
    ?title wikibase:apiOutput mwapi:title.
  }
  BIND(URI(CONCAT(STR(wd:), STRAFTER(?title, "Lexeme:"))) AS ?lexeme)
  # MINUS { ?lexeme ontolex:sense ?sense. }
  ?lexeme wikibase:lemma ?lemma;
          schema:dateModified ?date.
}
GROUP BY ?lexeme ?date
ORDER BY ASC(?date)

(It’s also pretty close to the timeout, unfortunately.) If you comment out the wikibase:limitEmptyContinuations parameter, the query finishes much faster but only returns one result, because it stops following the empty continuations after 25 API responses.

@Nikki, if you uncomment the MINUS line (try it!), the query returns the 23 remaining lexemes you created that don’t have a sense yet.