wdqs-updater crashing when update size is too large
Open, NormalPublic

Description

Around 11:30 UTC on Nov 23 2018, the wdqs updater crashed on all wdqs nodes. The updater was restarted by systemd a few times, but entered the failed state once the start limit was reached. The logs don't provide much information (see below).

Restarting the updater manually does not solve the issue; it goes back to the failed state after a number of attempts.

11:40:15.580 [update 8] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Weird reference http://www.wikidata.org/reference/5ff5d1308dcd060222a9dcea0a29de7f5828ed2c: unknown value http://www.wikidata.org/value/7e281616976c7de150357c18e76abfd1 for http://www.wikidata.org/entity/Q22089547
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
11:40:39.662 [main] INFO  o.w.q.r.t.change.ChangeSourceContext - Checking where we left off
11:40:39.666 [main] INFO  o.w.query.rdf.tool.rdf.RdfRepository - Checking for left off time from the updater
11:40:40.070 [main] INFO  o.w.query.rdf.tool.rdf.RdfRepository - Found left off time from the updater
11:40:40.071 [main] INFO  o.w.q.r.t.change.ChangeSourceContext - Found start time in the RDF store: 2018-11-23T11:25:15Z
11:40:40.144 [main] INFO  o.w.q.rdf.tool.change.KafkaPoller - Creating consumer wdqs1005
11:40:40.729 [main] INFO  o.w.q.rdf.tool.change.KafkaPoller - Subscribed to 6 topics
11:40:40.730 [main] INFO  o.w.q.rdf.tool.change.KafkaPoller - Set topic eqiad.mediawiki.page-undelete-0 to (timestamp=1542972315000, offset=140144)
11:40:40.730 [main] INFO  o.w.q.rdf.tool.change.KafkaPoller - Set topic codfw.mediawiki.revision-create-0 to (timestamp=1542972315000, offset=56945679)
11:40:40.730 [main] INFO  o.w.q.rdf.tool.change.KafkaPoller - Set topic codfw.mediawiki.page-undelete-0 to (timestamp=1542972315000, offset=6575)
11:40:40.730 [main] INFO  o.w.q.rdf.tool.change.KafkaPoller - Set topic eqiad.mediawiki.revision-create-0 to (timestamp=1542972315000, offset=982852035)
11:40:40.730 [main] INFO  o.w.q.rdf.tool.change.KafkaPoller - Set topic codfw.mediawiki.page-delete-0 to (timestamp=1542972315000, offset=409711)
11:40:40.731 [main] INFO  o.w.q.rdf.tool.change.KafkaPoller - Set topic eqiad.mediawiki.page-delete-0 to (timestamp=1542972315000, offset=7485389)
11:40:42.883 [main] INFO  o.w.q.rdf.tool.change.KafkaPoller - Found 927 changes
11:41:04.045 [update 0] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Weird reference http://www.wikidata.org/reference/f0241d2920c31202a1c72a60f2e154b44ffb7447: unknown value http://www.wikidata.org/value/054f90e140d6e4478605edbffbcb192e for http://www.wikidata.org/entity/Q2841515
11:41:04.434 [update 6] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Weird reference http://www.wikidata.org/reference/290275f5e6979162473e25b67ddf983812d47fc5: unknown value http://www.wikidata.org/value/054f90e140d6e4478605edbffbcb192e for http://www.wikidata.org/entity/Q388110
Gehel created this task. · Fri, Nov 23, 11:43 AM
Restricted Application added a subscriber: Aklapper. · Fri, Nov 23, 11:43 AM

Out of memory error on a bind:

Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at java.util.Arrays.copyOf(Arrays.java:3332)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at java.lang.StringBuilder.append(StringBuilder.java:202)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at java.util.regex.Matcher.appendReplacement(Matcher.java:883)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at java.util.regex.Matcher.replaceAll(Matcher.java:955)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at java.lang.String.replace(String.java:2240)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at org.wikidata.query.rdf.tool.rdf.UpdateBuilder.bind(UpdateBuilder.java:32)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at org.wikidata.query.rdf.tool.rdf.UpdateBuilder.bindStatements(UpdateBuilder.java:70)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at org.wikidata.query.rdf.tool.rdf.RdfRepository.syncFromChanges(RdfRepository.java:303)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at org.wikidata.query.rdf.tool.Updater.handleChanges(Updater.java:213)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at org.wikidata.query.rdf.tool.Updater.run(Updater.java:129)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at org.wikidata.query.rdf.tool.Update.run(Update.java:133)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at org.wikidata.query.rdf.tool.Update.main(Update.java:76)

Mentioned in SAL (#wikimedia-operations) [2018-11-23T12:04:32Z] <gehel> manually increasing wdqs-updater heap to 4G - T210235

Increasing the heap stops the updater from crashing, but Blazegraph refuses updates > 200M (we probably don't want to increase this limit).

Change 475455 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: reduce batch size to 300 temporarily

https://gerrit.wikimedia.org/r/475455

Change 475455 merged by Gehel:
[operations/puppet@production] wdqs: reduce batch size to 300 temporarily

https://gerrit.wikimedia.org/r/475455

Reducing the batch size seems to work; updates are processed again.

Mentioned in SAL (#wikimedia-operations) [2018-11-23T14:02:23Z] <gehel> restor wdqs-updater heap to 2G - T210235

Gehel triaged this task as High priority. · Mon, Nov 26, 1:25 PM
Gehel added a subscriber: Smalyshev.

The immediate issue is solved. I'd like a review from @Smalyshev to see if we have a good way to prevent this from happening again without increasing complexity too much.

All I can think of is a retry mechanism with smaller batch sizes when the update queries get too large, but that seems like a lot of additional complexity for a very specific error scenario. We can probably do better.
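For the record, a minimal sketch of what such a retry could look like. All names here (Change, BatchSync, UpdateTooLargeException, syncBatch) are hypothetical and do not correspond to the actual updater code; the idea is just to halve the batch and recurse whenever Blazegraph rejects an update as too large.

import java.util.List;

// Hypothetical sketch only; these names are illustrative, not the real wdqs-updater API.
class BatchSplitter {
    static class UpdateTooLargeException extends RuntimeException {}

    interface BatchSync<C> {
        void syncBatch(List<C> batch); // may throw UpdateTooLargeException
    }

    // Recursively halve the batch until each piece fits under the repository's limit.
    static <C> void syncWithSplitRetry(List<C> batch, BatchSync<C> sync) {
        try {
            sync.syncBatch(batch);
        } catch (UpdateTooLargeException e) {
            if (batch.size() <= 1) {
                // A single oversized entity cannot be split this way; rethrow.
                throw e;
            }
            int mid = batch.size() / 2;
            syncWithSplitRetry(batch.subList(0, mid), sync);
            syncWithSplitRetry(batch.subList(mid, batch.size()), sync);
        }
    }
}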

Gehel claimed this task.

This is weird; bind shouldn't consume that much space. I'll need to look into the code in detail, but something looks wrong there. Update batches should not be that big.
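For context, both stack traces end in String.replace() inside UpdateBuilder. A simplified sketch of that kind of template bind (not the real UpdateBuilder code) shows why it needs so much headroom: String.replace() goes through Matcher.replaceAll(), which rebuilds the whole string in a new buffer, so binding a multi-megabyte statements block into the update template briefly holds several full copies of it on the heap.

// Simplified sketch of a String.replace()-based template bind, not the actual UpdateBuilder.
// Each bind() copies the entire (growing) template into a new String, which is where the
// Arrays.copyOf / StringBuilder.append allocations in the stack traces come from.
class TemplateBindSketch {
    private String template;

    TemplateBindSketch(String template) {
        this.template = template;
    }

    TemplateBindSketch bind(String name, String value) {
        template = template.replace("%" + name + "%", value);
        return this;
    }

    @Override
    public String toString() {
        return template;
    }

    public static void main(String[] args) {
        String update = new TemplateBindSketch("INSERT DATA { %statements% };")
                .bind("statements", "<http://www.wikidata.org/entity/Q1> rdfs:label \"Universe\"@en .")
                .toString();
        System.out.println(update);
    }
}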

Smalyshev renamed this task from "wdqs-updater crashing on all wdqs servers" to "wdqs-updater crashing when update size is too large". · Wed, Nov 28, 7:53 PM
Restricted Application added a project: Wikidata. · Wed, Nov 28, 7:53 PM
Addshore moved this task from incoming to monitoring on the Wikidata board. · Mon, Dec 3, 5:54 PM
Smalyshev lowered the priority of this task from High to Normal. · Tue, Dec 11, 1:04 AM

Given that this seems to be very rare, dropping the priority a bit.

Also got this one today:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:472)
        at java.lang.StringBuffer.append(StringBuffer.java:310)
        at java.lang.StringBuffer.append(StringBuffer.java:97)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:486)
        at java.lang.StringBuffer.append(StringBuffer.java:338)
        at java.util.regex.Matcher.appendReplacement(Matcher.java:890)
        at java.util.regex.Matcher.replaceAll(Matcher.java:955)
        at java.lang.String.replace(String.java:2240)
        at org.wikidata.query.rdf.tool.rdf.UpdateBuilder.bindValue(UpdateBuilder.java:40)
        at org.wikidata.query.rdf.tool.rdf.RdfRepository.syncFromChanges(RdfRepository.java:308)
        at org.wikidata.query.rdf.tool.Updater.handleChanges(Updater.java:214)
        at org.wikidata.query.rdf.tool.Updater.run(Updater.java:129)
        at org.wikidata.query.rdf.tool.Update.run(Update.java:163)
        at org.wikidata.query.rdf.tool.Update.main(Update.java:88)

We probably need some special handling for huge updates.

Smalyshev moved this task from Backlog to Next on the User-Smalyshev board. · Mon, Dec 17, 8:27 AM

We have a limit of 20M per update, and we usually update about 500 to 700 entities, so we expect the average item's data size to be no more than around 30k (20M split across ~700 entities is roughly 30k each). However, we have things like:

-rw-r--r-- 1 smalyshev wikidev 763K Dec 17 08:40 Q56939386.ttl
-rw-r--r-- 1 smalyshev wikidev 634K Dec 17 08:40 Q56901541.ttl
-rw-r--r-- 1 smalyshev wikidev 634K Dec 17 08:40 Q56881481.ttl
-rw-r--r-- 1 smalyshev wikidev 621K Dec 17 08:40 Q56900584.ttl
-rw-r--r-- 1 smalyshev wikidev 620K Dec 17 08:40 Q56895083.ttl
-rw-r--r-- 1 smalyshev wikidev 618K Dec 17 08:40 Q56895307.ttl
-rw-r--r-- 1 smalyshev wikidev 618K Dec 17 08:40 Q56900617.ttl
-rw-r--r-- 1 smalyshev wikidev 615K Dec 17 08:40 Q56895281.ttl
-rw-r--r-- 1 smalyshev wikidev 614K Dec 17 08:40 Q56897805.ttl
-rw-r--r-- 1 smalyshev wikidev 614K Dec 17 08:40 Q56902115.ttl
-rw-r--r-- 1 smalyshev wikidev 614K Dec 17 08:40 Q56897887.ttl
-rw-r--r-- 1 smalyshev wikidev 613K Dec 17 08:40 Q56901716.ttl
-rw-r--r-- 1 smalyshev wikidev 613K Dec 17 08:40 Q56899853.ttl
-rw-r--r-- 1 smalyshev wikidev 613K Dec 17 08:40 Q56898200.ttl
-rw-r--r-- 1 smalyshev wikidev 610K Dec 17 08:40 Q56900456.ttl
-rw-r--r-- 1 smalyshev wikidev 610K Dec 17 08:40 Q56897358.ttl

Looks like many scientific article entries are enormous, while most regular entries are still under 10k. We may want to track entity sizes, by the way; I wonder if it's easy to do (maybe extract it somehow from the HTTP response?).
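If we do want to track it, one cheap approach would be to look at the response when fetching the entity's Turtle. Just a sketch under the assumption that we fetch from Special:EntityData (the updater's real fetch path may differ): read Content-Length when the server sends it, otherwise count the bytes actually read.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch only: measure how many bytes of Turtle an entity export is.
class EntitySizeProbe {
    static long entityTtlSize(String entityId) throws Exception {
        URL url = new URL("https://www.wikidata.org/wiki/Special:EntityData/" + entityId + ".ttl");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            long declared = conn.getContentLengthLong();
            if (declared >= 0) {
                return declared; // server sent Content-Length
            }
            // Chunked or compressed responses: count the bytes we actually read.
            long total = 0;
            byte[] buf = new byte[8192];
            try (InputStream in = conn.getInputStream()) {
                int n;
                while ((n = in.read(buf)) != -1) {
                    total += n;
                }
            }
            return total;
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(entityTtlSize("Q56939386") + " bytes");
    }
}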

I assume that when we encounter such data we'd have to chunk the update.
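Roughly something like this (hypothetical names again, not current code): accumulate changes greedily and flush a chunk whenever adding the next entity's serialized statements would push the chunk over the configured limit.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of size-based chunking; serializedSize() and flush() stand in for
// whatever the real updater would use to measure and send a group of entity diffs.
class SizeChunker {
    interface Sizer<C> { long serializedSize(C change); }
    interface Flusher<C> { void flush(List<C> chunk); }

    static <C> void syncInChunks(List<C> changes, long maxBytes, Sizer<C> sizer, Flusher<C> flusher) {
        List<C> chunk = new ArrayList<>();
        long chunkBytes = 0;
        for (C change : changes) {
            long size = sizer.serializedSize(change);
            // Flush the current chunk if adding this change would exceed the limit.
            if (!chunk.isEmpty() && chunkBytes + size > maxBytes) {
                flusher.flush(chunk);
                chunk = new ArrayList<>();
                chunkBytes = 0;
            }
            chunk.add(change);
            chunkBytes += size;
            // A single change larger than maxBytes still goes through alone and would
            // need the special handling discussed above.
        }
        if (!chunk.isEmpty()) {
            flusher.flush(chunk);
        }
    }
}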