Page MenuHomePhabricator

WDQS/Blazegraph data loading has timeout
Closed, DeclinedPublic

Description

Loading data inside Blazegraph can take quite a lot of time. Right now one of the ways to do it, in loadData.sh script, is to use SPARQL LOAD command. Unfortunately, it has timeout:

Processing wikidump-000000042.ttl.gz                                                           
SPARQL-UPDATE: updateStr=LOAD <file:///srv/wdqs/munge//wikidump-000000042.ttl.gz>         
java.util.concurrent.TimeoutException                             
        at java.util.concurrent.FutureTask.get(FutureTask.java:205)         
        at com.bigdata.rdf.sail.webapp.BigdataServlet.submitApiTask(BigdataServlet.java:292)
        at com.bigdata.rdf.sail.webapp.QueryServlet.doSparqlUpdate(QueryServlet.java:460)

which can break data loading, especially unattended process. We should either disable the timeout or make it much longer.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Smalyshev renamed this task from Data loading has timeout to WDQS/Blazegraph data loading has timeout.Jul 5 2019, 10:56 PM
Smalyshev triaged this task as Medium priority.

Change 520948 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] Bump default timeout

https://gerrit.wikimedia.org/r/520948

Change 520948 merged by jenkins-bot:
[wikidata/query/rdf@master] Bump default timeout

https://gerrit.wikimedia.org/r/520948

Trying via dataloader API gets this:

Jul 13 22:25:54 wdqs1009 wdqs-blazegraph[7722]: 22:25:54.928 [qtp226170135-55] ERROR c.b.r.sail.webapp.BigdataRDFServlet - cause=java.util.concurrent.
TimeoutException, query=DATALOADER-SERVLET: wdq req.requestURI=/bigdata/dataloader, req.xForwardedFor=null, req.queryString=null, req.method=POST, req
.remoteHost=localhost, req.requestURL=http://localhost:9999/bigdata/dataloader, req.userAgent=curl/7.52.1         
Jul 13 22:25:54 wdqs1009 wdqs-blazegraph[7722]: 2019-07-13 22:25:54.934:WARN:oejs.HttpChannel:qtp226170135-55: /bigdata/dataloader
Jul 13 22:25:54 wdqs1009 wdqs-blazegraph[7722]: java.lang.IllegalStateException: STREAM                                               
Jul 13 22:25:54 wdqs1009 wdqs-blazegraph[7722]:         at org.eclipse.jetty.server.Response.getWriter(Response.java:931)                             
Jul 13 22:25:54 wdqs1009 wdqs-blazegraph[7722]:         at com.bigdata.rdf.sail.webapp.BigdataServlet.buildAndCommitResponse(BigdataServlet.java:564)
Jul 13 22:25:54 wdqs1009 wdqs-blazegraph[7722]:         at com.bigdata.rdf.sail.webapp.DataLoaderServlet.doBulkLoad(DataLoaderServlet.java:323)
Jul 13 22:25:54 wdqs1009 wdqs-blazegraph[7722]:         at com.bigdata.rdf.sail.webapp.DataLoaderServlet.doPost(DataLoaderServlet.java:108)      
Jul 13 22:25:54 wdqs1009 wdqs-blazegraph[7722]:         at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)                            
Jul 13 22:25:54 wdqs1009 wdqs-blazegraph[7722]:         at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)                             
Jul 13 22:25:54 wdqs1009 wdqs-blazegraph[7722]:         at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)            
Jul 13 22:25:54 wdqs1009 wdqs-blazegraph[7722]:         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1655)    Jul 13 22:25:54 wdqs1009 wdqs-blazegraph[7722]:         at ch.qos.logback.classic.helpers.MDCInsertingServletFilter.doFilter(MDCInsertingServletFilter
.java:49)

Looks like there's no way currently to load without timing out. Probably should be.

Is there any solution to this yet? I'm hitting a similar error while following the "getting started" guide using loadRestApi.sh.

This is the stack trace:

DATALOADER-SERVLET: wdq
java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask.get(FutureTask.java:205)
        at com.bigdata.rdf.sail.webapp.BigdataServlet.submitApiTask(BigdataServlet.java:292)
        at com.bigdata.rdf.sail.webapp.DataLoaderServlet.doBulkLoad(DataLoaderServlet.java:310)
        at com.bigdata.rdf.sail.webapp.DataLoaderServlet.doPost(DataLoaderServlet.java:108)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1655)
        at ch.qos.logback.classic.helpers.MDCInsertingServletFilter.doFilter(MDCInsertingServletFilter.java:49)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
        at org.wikidata.query.rdf.blazegraph.filters.ClientIPFilter.doFilter(ClientIPFilter.java:43)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1340)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1242)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
        at org.eclipse.jetty.server.Server.handle(Server.java:503)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:364)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
        at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683)
        at java.lang.Thread.run(Thread.java:748)

There is a context param queryTimeout set to 10 minutes in web.xml, which is applied for all Blazegraph servlets. Stas prepared a patch, extending it 10x times, https://gerrit.wikimedia.org/r/#/c/wikidata/query/rdf/+/520948/ you might apply it locally (or just edit web.xml file) to resolve your issue, as the change has not been applied to the WDQS master due to this timeout is system-wide and extending it might result in unexpected consumption of resources (this timeout will be also applied to queries, including very heave ones, thus allowing them running much longer before generating timeout).

Gehel subscribed.

Current data loading seems to work fine, batch size is small enough that timeout are not an issue