Page MenuHomePhabricator

WDQS Updater ran into issue and stopped working
Closed, ResolvedPublic

Description

At about 1:10 UTC, We started seeing SSH errors on wdqs* servers. @Dzahn reported this error and tries to SSH into the servers but couldn't. Then we started getting a lot of wdqs-updater errors like: java.time.format.DateTimeParseException

@Dzahn power-cycled wdqs2005 and we could login into wdqs2005 afterwards. We discovered on wdqs2005, blazegraph seems to be working fine but wdqs-updater errors continued. This same issue was seen in all wdqs servers. Also, I tried to revert back some updates to solve these problems via the deploy host, but didn't get anywhere with this. I logged in to the autodeployment host (wdqs1009) to see to try and revert back update using git rebase but this did not stop the errors.

Then through all this, wdqs for users seems to be working fine (but obviously they are not getting updates as updater has failed). I will let people smarter than me jump on this when they are available.

Event Timeline

Restricted Application added a project: Wikidata. · View Herald TranscriptOct 24 2018, 3:12 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Mathew.onipe triaged this task as High priority.Oct 24 2018, 3:12 AM

Change 469378 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Disable Kafka update due to breakage

https://gerrit.wikimedia.org/r/469378

Change 469378 merged by Elukey:
[operations/puppet@production] Disable Kafka update due to breakage

https://gerrit.wikimedia.org/r/469378

Smalyshev raised the priority of this task from High to Unbreak Now!.Oct 24 2018, 6:01 AM

This seems to be caused by this error:

Oct 24 00:28:30 wdqs1003 wdqs-updater[1614]: 00:28:30.876 [main] WARN  o.w.q.r.tool.change.JsonDeserializer - Data in topic eqiad.mediawiki.revision-create cannot be deserialized [{"comment": "Stabl", "database": "mediawikiwiki", "meta": {"domain": "www.mediawiki.org", "dt": "2018-10-24T00:28:24.162300+00:00", "id": "b7aa095a-d723-11e8-be6d-1418776134a1", "request_id": "23e8fa7b-8c07-4ffc-ba43-d30e025b65cb", "schema_uri": "mediawiki/revision/create/3", "topic": "mediawiki.revision-create", "uri": "https://www.mediawiki.org/wiki/Extension:EditSubpages"}, "page_id": 22458, "page_is_redirect": false, "page_namespace": 102, "page_title": "Extension:EditSubpages", "parsedcomment": "Stabl", "performer": {"user_groups": ["*"], "user_is_bot": false, "user_text": "2001:56A:76EC:E800:3C6D:9381:428E:ACE5"}, "rev_content_changed": true, "rev_content_format": "text/x-wiki", "rev_content_model": "wikitext", "rev_id": 2930951, "rev_len": 6582, "rev_minor_edit": false, "rev_parent_id": 2930950, "rev_sha1": "7sk6l1w8g56i7aiff8kuursjb4p7goj", "rev_timestamp": "2018-10-24T00:28:06Z"}].
Oct 24 00:28:30 wdqs1003 wdqs-updater[1614]: com.fasterxml.jackson.databind.exc.InvalidFormatException: Cannot deserialize value of type `java.time.Instant` from String "2018-10-24T00:28:24.162300+00:00": Text '2018-10-24T00:28:24.162300+00:00' could not be parsed at index 19
Oct 24 00:28:30 wdqs1003 wdqs-updater[1614]:  at [Source: (byte[])"{"comment": "Stabl", "database": "mediawikiwiki", "meta": {"domain": "www.mediawiki.org", "dt": "2018-10-24T00:28:24.162300+00:00", "id": "b7aa095a-d723-11e8-be6d-1418776134a1", "request_id": "23e8fa7b-8c07-4ffc-ba43-d30e025b65cb", "schema_uri": "mediawiki/revision/create/3", "topic": "mediawiki.revision-create", "uri": "https://www.mediawiki.org/wiki/Extension:EditSubpages"}, "page_id": 22458, "page_is_redirect": false, "page_namespace": 102, "page_title": "Extension:EditSubpages", "parsedcomme"[truncated 401 bytes]; line: 1, column: 97] (through reference chain: org.wikidata.query.rdf.tool.change.events.RevisionCreateEvent["meta"]->org.wikidata.query.rdf.tool.change.events.EventsMeta["dt"])

Looks like something produces messages with dt that updater can not deserialize.

Restricted Application added subscribers: Liuxinyu970226, TerraCodes. · View Herald TranscriptOct 24 2018, 6:01 AM

The first error is at Oct 24 00:28:30 which matches Mediawiki deployment. Pinging @20after4 as the deployer.

Another problem on this - for some reason, when Updater fails, systemd keeps restarting in a cycle every 3 seconds. This creates unnecessary load on the machine, while not helping anything. We should fix that too.

Smalyshev added a comment.EditedOct 24 2018, 6:33 AM

Also, I can not access most hosts by ssh. Whatever updater problem is there, it should not block ssh, but except for 1003 and 2003, all others are unresponsive over ssh.

< _joe_> as to why it's impossible to ssh into those hosts, I guess it has to do with journald and some interactions with ssh

The message that comes from mediawiki.org looks like:

{"comment": "Stabl", "database": "mediawikiwiki", "meta": {"domain": "www.mediawiki.org", "dt": "2018-10-24T00:28:24.162300+00:00", "id": "b7aa095a-d723-11e8-be6d-1418776134a1", "request_id": "23e8fa7b-8c07-4ffc-ba43-d30e025b65cb", "schema_uri": "mediawiki/revision/create/3", "topic": "mediawiki.revision-create", "uri": "https://www.mediawiki.org/wiki/Extension:EditSubpages"}, "page_id": 22458, "page_is_redirect": false, "page_namespace": 102, "page_title": "Extension:EditSubpages", "parsedcomment": "Stabl", "performer": {"user_groups": ["*"], "user_is_bot": false, "user_text": "2001:56A:76EC:E800:3C6D:9381:428E:ACE5"}, "rev_content_changed": true, "rev_content_format": "text/x-wiki", "rev_content_model": "wikitext", "rev_id": 2930951, "rev_len": 6582, "rev_minor_edit": false, "rev_parent_id": 2930950, "rev_sha1": "7sk6l1w8g56i7aiff8kuursjb4p7goj", "rev_timestamp": "2018-10-24T00:28:06Z"}

Note the different format for "dt" timestamp. Timestamps from all other sources are normal, all timestamps from mediawiki.org are in weird format.

Mentioned in SAL (#wikimedia-operations) [2018-10-24T07:33:26Z] <gehel> powercycling wdqs1010 - T207817

Do we have a patch or should we roll back group0?

Gehel added a comment.Oct 24 2018, 2:50 PM

Do we have a patch or should we roll back group0?

We have a workaround on the WDQS side (switching back to recent changes instead of kafka events). But the root cause isn't fixed, and it is unclear to me what change caused that issue. IMHO, this should be treated as a contract change and either needs to be communicated accordingly as a breaking change or rolled back.

Tarrow added a subscriber: Tarrow.Oct 24 2018, 3:04 PM

Ping @Pchelolo.

This is happening because of https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/468482/ as part of a fix for T207329: Clear watchlist on enwiki only removes 50 items at a time.

Interesting! I checked Jodatime stuff to make sure one of our Java based pipeline handled the timestamp format change, I'm surprised that Jackson can't parse this!

We could rollback https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/468482/ - we can workaround it in change-prop and the issue fixed by that is much rarer, but eventually, we'd want to include millisecond resolution into the dt of the events.

@Pchelolo I can look into how to make Jackson parse it, it's probably possible, but I'd appreciate rolling it back for now until that is done.

Maybe jackson just can't parse this microsecond stuff? Maybe milliseconds are fine?

Change 469449 had a related patch set uploaded (by 20after4; owner: 20after4):
[mediawiki/extensions/EventBus@master] Revert "Set event datetime with microsecond resolution."

https://gerrit.wikimedia.org/r/469449

Gehel added a comment.Oct 24 2018, 3:42 PM

Interesting! I checked Jodatime stuff to make sure one of our Java based pipeline handled the timestamp format change, I'm surprised that Jackson can't parse this!

We can definitely parse that format if we wanted to, but we do have a fixed format configured. We could enable some magic, but it seems very bad form to autodetect a time format. There are too many ways to get it wrong.

Side note, technically, this change has introduced a timezone, which should change the semantic from a LocalDate to an Instance in Joda time terminology :) so either it was technically wrong before or it is now :)

Change 469451 had a related patch set uploaded (by 20after4; owner: 20after4):
[mediawiki/extensions/EventBus@wmf/1.33.0-wmf.1] Revert "Set event datetime with microsecond resolution."

https://gerrit.wikimedia.org/r/469451

Maybe jackson just can't parse this microsecond stuff? Maybe milliseconds are fine?

Judging by the error message, it cripples on the . after seconds.

@Smalyshev Also please note we're only including microseconds and not milliseconds because of the HHVM incompatibility with PHP and the need to support HHVM for now. For all our purposes milliseconds should be enough, but seems like Jackson will not like them either out of the box.

ok anyone care to give the ol' +2 to revert this on the branch? https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/469451/

Change 469451 merged by jenkins-bot:
[mediawiki/extensions/EventBus@wmf/1.33.0-wmf.1] Revert "Set event datetime with microsecond resolution."

https://gerrit.wikimedia.org/r/469451

Smalyshev lowered the priority of this task from Unbreak Now! to High.Oct 24 2018, 3:52 PM
Smalyshev added a project: Wikimedia-Incident.

Maybe jackson just can't parse this microsecond stuff? Maybe milliseconds are fine?

I hope Jackson parsing format is configurable. I'll look into it today. Hopefully I can make it parse both.

Mentioned in SAL (#wikimedia-operations) [2018-10-24T16:00:33Z] <twentyafterfour> 15:59:06 Synchronized php-1.33.0-wmf.1/extensions/EventBus/: revert "Set event datetime with microsecond resolution." on 1.33.0-wmf.1 refs T207817 (duration: 00m 56s)

Judging by the quick skim of the Jackson code it supports DateTimeFormatter.ISO_INSTANT format that should support fractional seconds parsing.

However, this incident raises a question whether the change applied to EventBus was indeed backward-compatible. We've checked the known consumers (one we apparently forgot to check) that they at least don't cripple on parsing fractional seconds in ISO 8601, but maybe there exist more things introducing the fractional seconds into the dt will break. Maybe our initial approach was incorrect @Ottomata ?

@Pchelolo I think a public notification on Kafka format change would be a great idea. Since various tools may use Kafka streams (and streams derived from them), and some tools may be sensitive to exact formats.

The spec at https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/resource_change/1.yaml defines "dt" as:

format: date-time
description: the time stamp of the event, in ISO8601 format

But I'm not sure what "ISO8601 format" specifically means - ISO 8601 allows many format variations. Maybe we should specify it more strictly.

Change 469449 merged by jenkins-bot:
[mediawiki/extensions/EventBus@master] Revert "Set event datetime with microsecond resolution."

https://gerrit.wikimedia.org/r/469449

With T207873 deployed (planned next Monday) Updater should be able to parse such timestamps, but I'd appreciate announcement upfront, so I could test it. If it can wait till Nov 6 when I am back from vacation, even better.

Thanks @Smalyshev, I think you are write that changes like this should be announced a bit better. We'll try to do better next time.

I wonder...where should we announce changes like this? engineering@? wikitech-l?

engineering@? wikitech-l?

I think engineering is good, and probably wikitech too since the data can get into EventStreams as I understand and that is public.

Restricted Application added a project: Analytics. · View Herald TranscriptOct 25 2018, 9:10 PM
fdans moved this task from Incoming to Radar on the Analytics board.Oct 29 2018, 3:54 PM
Addshore moved this task from incoming to monitoring on the Wikidata board.Oct 31 2018, 5:39 PM
greg moved this task from Active Situation to Follow-up/Actionables on the Wikimedia-Incident board.
greg added a subscriber: greg.

(I'm guessing this is more of a "follow-up" task vs an "active situation" task, yes?)

I'm guessing this is more of a "follow-up" task vs an "active situation" task, yes?

Yes, by now it's definitely a follow-up tracking task.

Smalyshev closed this task as Resolved.Nov 7 2018, 4:40 PM
Smalyshev claimed this task.

I think we're mostly done with implementing the followup. For reference, incident report is here: https://wikitech.wikimedia.org/wiki/Incident_documentation/20181024-WDQS