
Production error: Queryservice-updater errors
Closed, Resolved · Public

Description

Google incident:

https://console.cloud.google.com/errors/detail/CKfwurLSuIXTUQ?project=wikibase-cloud

java.lang.IllegalStateException: RDF store reports the last update time is before the minimum safe poll time. You will have to reload from scratch or you might have missing data.
	at org.wikidata.query.rdf.tool.change.ChangeSourceContext.getStartTime(ChangeSourceContext.java:100)
	at org.wikidata.query.rdf.tool.WbStackUpdate.initialize(WbStackUpdate.java:305)
	at org.wikidata.query.rdf.tool.WbStackUpdate.runUpdaterWithArgs(WbStackUpdate.java:229)
	at org.wikidata.query.rdf.tool.WbStackUpdate.updateBatch(WbStackUpdate.java:207)
	at org.wikidata.query.rdf.tool.WbStackUpdate.getAndProcessBatchesFromApi(WbStackUpdate.java:163)
	at org.wikidata.query.rdf.tool.WbStackUpdate.main(WbStackUpdate.java:133)

AC

  • Figure out what happened and document here
  • Fix it

Event Timeline

So at first glance it seems nothing is currently broken.

All the batches are reported to have been successfully processed once on production, and the problematic items are available in the triple store from previous updates.

queryservice-updater
22:59:57.226 [main] INFO  o.w.query.rdf.tool.WbStackUpdate - Starting Updater 0.3.84 (3bcfddc0f526c9a227773e68fe29419343d33993) --wikibaseHost wikibase.inguma.eus --ids Q7071 --entityNamespaces 120,122,146 --sparqlUrl http://queryservice.default.svc.cluster.local:9999/bigdata/namespace/qsns_f02d35e8fd/sparql --wikibaseScheme http --conceptUri https://wikibase.inguma.eus
java.lang.IllegalStateException: RDF store reports the last update time is before the minimum safe poll time. You will have to reload from scratch or you might have missing data.
	at org.wikidata.query.rdf.tool.change.ChangeSourceContext.getStartTime(ChangeSourceContext.java:100)
	at org.wikidata.query.rdf.tool.WbStackUpdate.initialize(WbStackUpdate.java:305)
	at org.wikidata.query.rdf.tool.WbStackUpdate.runUpdaterWithArgs(WbStackUpdate.java:229)
	at org.wikidata.query.rdf.tool.WbStackUpdate.updateBatch(WbStackUpdate.java:207)
	at org.wikidata.query.rdf.tool.WbStackUpdate.getAndProcessBatchesFromApi(WbStackUpdate.java:163)
	at org.wikidata.query.rdf.tool.WbStackUpdate.main(WbStackUpdate.java:133)

However, looking at the timestamps, it seems the batches failed but got incorrectly marked as done.

The batch in apidb is marked as done with the suspiciously fast execution time of 0.

MariaDB [apidb]> select * from qs_batches where entityIds like 'Q7071';
+-------+-----------+---------+---------+-----------+------+---------------------+---------------------+
| id    | eventFrom | eventTo | wiki_id | entityIds | done | created_at          | updated_at          |
+-------+-----------+---------+---------+-----------+------+---------------------+---------------------+
| 22441 |     24706 |   24707 |      26 | Q7071     |    1 | 2022-05-28 22:59:57 | 2022-05-28 22:59:57 |
+-------+-----------+---------+---------+-----------+------+---------------------+---------------------+
1 row in set (0.022 sec)

The queryservice reports the old data https://wikibase.inguma.eus/query/#SELECT%20%2a%20WHERE%7B%20%3Chttps%3A%2F%2Fwikibase.inguma.eus%2Fentity%2FQ7071%3E%20%3Fp%20%3Fo%20%7D

schema:dateModified	13 April 2022

So it seems there is nothing marking these batches as failed??

Not really sure what to do now. How did @Addshore solve/identify failed batches on wbstack? Do we fix the queryservice-updater to mark failed batches as failed?

Ok, fortunately this future flow was described in the brain dump:

A future flow might look something like this:
- API makes batches
- Updater claims a batch, with a TTL for the claim time
- Updater tries to write the changes
- IF it is successful, it says so to the API, which records success.
- IF no success is reported, the API would re-serve the batch in the future once the TTL has expired.
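The claim-with-TTL flow above could be sketched like this (a minimal in-memory sketch with hypothetical names; the real flow would live in the API and apidb, which today only has a done flag):

```python
import time

class BatchQueue:
    """Hypothetical sketch of the claim-with-TTL flow described above."""

    def __init__(self, claim_ttl_seconds=300):
        self.claim_ttl = claim_ttl_seconds
        self.batches = {}  # batch id -> {"done": bool, "claimed_at": float|None}

    def add(self, batch_id):
        self.batches[batch_id] = {"done": False, "claimed_at": None}

    def claim(self):
        """Serve the next unfinished batch whose claim is absent or expired."""
        now = time.monotonic()
        for batch_id, b in self.batches.items():
            claim_expired = (b["claimed_at"] is None
                             or now - b["claimed_at"] > self.claim_ttl)
            if not b["done"] and claim_expired:
                b["claimed_at"] = now
                return batch_id
        return None

    def report_success(self, batch_id):
        """Only an explicit success report marks the batch done."""
        self.batches[batch_id]["done"] = True
```

The key property is that a batch that is claimed but never reports success simply becomes claimable again after the TTL, instead of being silently marked done.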

So it seems that indeed, currently, everything just gets marked as done.

Check on wbstack: does it have these schema:dateModified triples?
I'm guessing that it might not, that they were added when the dump was loaded (which never happened on wbstack), and you'll just want to get rid of them.

Deleting the dump timestamp makes it fall back to the 30 day check, and it now passes; however, we should probably increase that number to prevent this from happening in the future too.

08:12:50.626 [main] INFO  o.w.query.rdf.tool.WbStackUpdate - Starting Updater 0.3.84 (3bcfddc0f526c9a227773e68fe29419343d33993) --wikibaseHost wikibase.inguma.eus --ids Q7071 --entityNamespaces 120,122,146 --sparqlUrl http://queryservice.default.svc.cluster.local:9999/bigdata/namespace/qsns_f02d35e8fd/sparql --wikibaseScheme http --conceptUri https://wikibase.inguma.eus
08:12:51.584 [main] INFO  o.w.q.r.t.change.ChangeSourceContext - Checking where we left off
08:12:51.584 [main] INFO  o.w.query.rdf.tool.rdf.RdfRepository - Checking for left off time from the updater
08:12:51.684 [main] INFO  o.w.query.rdf.tool.rdf.RdfRepository - Checking for left off time from the dump
08:12:51.716 [main] INFO  o.w.q.r.t.change.ChangeSourceContext - Defaulting start time to 30 days ago: 2022-05-03T08:12:51.716Z
08:12:56.585 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 0 at (0.0, 0.0, 0.0) updates per second and (0.0, 0.0, 0.0) ids per second
Loop 30/100. TotalM: 18. FreeM: 3. UsedM: 15. MaxM: 29. Threads: 6
Loop 31/100. TotalM: 18. FreeM: 3. UsedM: 15. MaxM: 29. Threads: 5
Loop 32/100. TotalM: 18. FreeM: 3. UsedM: 15. MaxM: 29. Threads: 5
Loop 33/100. TotalM: 18. FreeM: 3. UsedM: 15. MaxM: 29. Threads: 5
Loop 34/100. TotalM: 18. FreeM: 3. UsedM: 15. MaxM: 29. Threads: 5
Loop 35/100. TotalM: 18. FreeM: 3. UsedM: 15. MaxM: 29. Threads: 5
Loop 36/100. TotalM: 18. FreeM: 3. UsedM: 15. MaxM: 29. Threads: 5

I've deleted the timestamps across all namespaces on production, and will unset done on the batches that ran since the 26th to have them run again.

image.png (206×1 px, 21 KB)
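For reference, the per-namespace timestamp deletion can be expressed as a SPARQL Update. This is a sketch built from the schema:dateModified triple shown above; the exact triple pattern the updater checks is an assumption, not verified against the Blazegraph namespaces:

```python
def build_timestamp_delete(concept_uri):
    """Build a SPARQL DELETE for the top-level schema:dateModified triple
    that the updater reads as its left-off/dump time (assumed pattern)."""
    return (
        "PREFIX schema: <http://schema.org/>\n"
        "DELETE WHERE { <" + concept_uri + "> schema:dateModified ?date . }"
    )
```

A query like this would be POSTed to each namespace's SPARQL endpoint (e.g. the .../namespace/qsns_f02d35e8fd/sparql URL seen in the updater arguments above).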

Re-running all batches since the first errors occurred:

MariaDB [apidb]> update qs_batches set done = 0 where created_at > "2022-05-26";
Query OK, 15416 rows affected (0.204 sec)
Rows matched: 15416  Changed: 15416  Warnings: 0

> Deleting the dump timestamp makes it fall back to the 30 day check, and it now passes; however, we should probably increase that number to prevent this from happening in the future too.

So when none of the top-level dates are there (the dump triple or the main date-modified triple), the length of 30 days is irrelevant (I believe), as the triples that get checked will never be added (due to the way we run the updater).

Normally, when reading from recent changes, the date of the last change added to the query service will be set as this triple.
However, we don't really read from RC; instead we just pass --ids, so this date is never set.
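The resolution order visible in those log lines can be paraphrased as follows (a sketch only; the constants and function name are illustrative, not the actual ChangeSourceContext code):

```python
from datetime import datetime, timedelta, timezone

DEFAULT_START_DAYS = 30    # the "30 day check" mentioned above
MAX_SAFE_POLL_DAYS = 30    # illustrative minimum-safe-poll window

def resolve_start_time(updater_left_off, dump_date, now=None):
    """Pick where the updater resumes from, mimicking the flow in the logs:
    updater left-off timestamp -> dump timestamp -> default to 30 days ago."""
    now = now or datetime.now(timezone.utc)
    minimum_safe = now - timedelta(days=MAX_SAFE_POLL_DAYS)
    start = updater_left_off or dump_date
    if start is None:
        # Neither triple is present: fall back to the default, which
        # by construction always passes the safety check.
        return now - timedelta(days=DEFAULT_START_DAYS)
    if start < minimum_safe:
        raise RuntimeError(
            "RDF store reports the last update time is before the minimum "
            "safe poll time. You will have to reload from scratch or you "
            "might have missing data.")
    return start
```

This also shows why deleting the dump timestamp cleared the error: with both timestamps gone, the updater never compares a stale date against the safe-poll window.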

> Deleting the dump timestamp makes it fall back to the 30 day check, and it now passes; however, we should probably increase that number to prevent this from happening in the future too.

> So when none of the top-level dates are there (the dump triple or the main date-modified triple), the length of 30 days is irrelevant (I believe), as the triples that get checked will never be added (due to the way we run the updater).

Yeah, I think that is right; then we just need to remind ourselves to delete these timestamps for each batch.

As for the errors: it seems they have stopped since we deleted the timestamps.

All the batches have run through and I can't see any other problems. Moving this to review, as there is a PR attached: https://github.com/wmde/wbaas-deploy/pull/358

toan updated the task description.
Tarrow claimed this task.