Page MenuHomePhabricator

Update Lingua Libre datasets periodically
Open, Needs TriagePublic

Description

Hi, @Julien_Baley noticed that the datasets (https://lingualibre.org/datasets/) are rather old: some have been last updated in November 2019 and the others on April 2020). Would it be possible to run a cron task to update them perdiodically (one a week or less if it is too heavy)

Event Timeline

GoEThe renamed this task from Update dataset periodically to Update Lingua Libre datasets periodically.Dec 16 2020, 2:18 PM
GoEThe updated the task description. (Show Details)

Assessment and fixes are underway. One update have been done, 13 languages got updated. More to come.

Yug claimed this task.

I will clase this issue since it's core question (automatic periodicity) is resolved.

WikiLucas00 added subscribers: Poslovitch, VIGNERON, Seb35 and 2 others.

I re-open this task because most datasets are still un-updated since 2019 or 2020 (and I am not aware of existing recent discussions about this topic).

It appears that CommonsDownloadTool needs to be fixed to solved this issue. If @Poslovitch is not wrong, the problem comes from the fact there are too many files to get from Wikimedia Commons in one time and there is a ratelimit that is currently not supported by CommonsDownloadTool and which prevents the download to success.

It appears that CommonsDownloadTool needs to be fixed to solved this issue. If @Poslovitch is not wrong, the problem comes from the fact there are too many files to get from Wikimedia Commons in one time and there is a ratelimit that is currently not supported by CommonsDownloadTool and which prevents the download to success.

A push is underway with WikiapiJS, base of it have been discussed and coded, summarized, and shared. Please ping me if this issue stales again.

I tried to run create_datasets.sh. The problem occurs when the script is at the "Download and package ALL sounds..." step.

It seems that the query is too long and it ends with a timeout.

Here is the backtrace (I added a print to see "response.text" at line 215 in commons_download_tool.py

java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask.get(FutureTask.java:205)
        at com.bigdata.rdf.sail.webapp.BigdataServlet.submitApiTask(BigdataServlet.java:292)
        at com.bigdata.rdf.sail.webapp.QueryServlet.doSparqlQuery(QueryServlet.java:678)
        at com.bigdata.rdf.sail.webapp.QueryServlet.doPost(QueryServlet.java:275)
        at com.bigdata.rdf.sail.webapp.RESTServlet.doPost(RESTServlet.java:269)
        at com.bigdata.rdf.sail.webapp.MultiTenancyServlet.doPost(MultiTenancyServlet.java:195)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1655)
        at org.wikidata.query.rdf.blazegraph.throttling.ThrottlingFilter.doFilter(ThrottlingFilter.java:320)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
        at org.wikidata.query.rdf.blazegraph.throttling.SystemOverloadFilter.doFilter(SystemOverloadFilter.java:82)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
        at ch.qos.logback.classic.helpers.MDCInsertingServletFilter.doFilter(MDCInsertingServletFilter.java:49)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
        at org.wikidata.query.rdf.blazegraph.filters.QueryEventSenderFilter.doFilter(QueryEventSenderFilter.java:117)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
        at org.wikidata.query.rdf.blazegraph.filters.ClientIPFilter.doFilter(ClientIPFilter.java:43)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
        at org.wikidata.query.rdf.blazegraph.filters.RealAgentFilter.doFilter(RealAgentFilter.java:33)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1340)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1242)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
        at org.eclipse.jetty.server.Server.handle(Server.java:503)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:364)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
        at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
        at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683)
        at java.lang.Thread.run(Thread.java:748)

Traceback (most recent call last):
  File "../CommonsDownloadTool/commons_download_tool.py", line 229, in <module>
    get_params()
  File "../CommonsDownloadTool/commons_download_tool.py", line 215, in get_params
    response = json.loads(response.text)['results']['bindings']
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 682749 column 128 (char 23283281)

I am not an expert of SPARQL so I do not know how to fix that efficiently.
Maybe if one adds an option to commons_download_tool.py so that we can manage only one language at a time (and not all of them) may help.