
Event Utilities partially downloads schemas
Open, LowPublicBUG REPORT

Description

With some regularity, about once a week, Refine uses Event Utilities to download a schema via Resource Loader and ends up with an error that looks like the schema was chopped off while being retrieved.

The latest example has this error:

The following 1 of 2 dataset partitions for output table `event`.`mediawiki_structured_task_article_image_suggestion_interaction` failed refinement:
        org.wikimedia.analytics.refinery.job.refine.RefineTargetException: Failed refinement of hdfs://analytics-hadoop/wmf/data/raw/event/codfw.mediawiki.structured_task.article.image_suggestion_interaction/year=2022/month=05/day=31/hour=21 -> `event`.`mediawiki_structured_task_article_image_suggestion_interaction` /wmf/data/event/mediawiki_structured_task_article_image_suggestion_interaction/datacenter=codfw/year=2022/month=5/day=31/hour=21. Original exception: while scanning a simple key
 in 'reader', line 214, column 5:
        mini
        ^
could not find expected ':'
 in 'reader', line 214, column 9:
    mini
        ^

I'm looking through the full logs (from an-launcher1002: sudo -u analytics kerberos-run-command analytics yarn logs -applicationId application_1651744501826_136341 | less). Looking around, it seems it got the schema just fine, and there are even mentions of successful refinement *before* the failure messages, so I'm confused. Here are some of the relevant log lines, in order:

22/06/01 00:22:25 INFO DataFrameToHive: Writing DataFrame to /wmf/data/event/mediawiki_structured_task_article_image_suggestion_interaction/datacenter=eqiad/year=2022/month=5/day=31/hour=21 with schema:
root
 |-- _schema: string (nullable = true)
 |-- action: string (nullable = true)
 |-- action_data: string (nullable = true)
 |-- active_interface: string (nullable = true)
 |-- dt: string (nullable = true)
 |-- homepage_pageview_token: string (nullable = true)
 |-- http: struct (nullable = true)
 |    |-- client_ip: string (nullable = true)
 |    |-- has_cookies: boolean (nullable = true)
 |    |-- method: string (nullable = true)
 |    |-- protocol: string (nullable = true)
 |    |-- request_headers: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- response_headers: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- status_code: long (nullable = true)
 |-- is_mobile: boolean (nullable = true)
 |-- meta: struct (nullable = true)
 |    |-- domain: string (nullable = true)
 |    |-- dt: string (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- request_id: string (nullable = true)
 |    |-- stream: string (nullable = true)
 |    |-- uri: string (nullable = true)
 |-- newcomer_task_token: string (nullable = true)
 |-- page_id: long (nullable = true)
 |-- page_title: string (nullable = true)
 |-- geocoded_data: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- user_agent_map: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- is_wmf_domain: boolean (nullable = true)
 |-- normalized_host: struct (nullable = true)
 |    |-- project_class: string (nullable = true)
 |    |-- project: string (nullable = true)
 |    |-- qualifiers: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- tld: string (nullable = true)
 |    |-- project_family: string (nullable = true)
 |-- datacenter: string (nullable = true)
 |-- year: long (nullable = true)
 |-- month: long (nullable = true)
 |-- day: long (nullable = true)
 |-- hour: long (nullable = true)

22/06/01 00:22:08 ERROR Refine: Failed refinement of dataset hdfs://analytics-hadoop/wmf/data/raw/event/codfw.mediawiki.structured_task.article.image_suggestion_interaction/year=2022/month=05/day=31/hour=21 -> `event`.`mediawiki_structured_task_article_image_suggestion_interaction` /wmf/data/event/mediawiki_structured_task_article_image_suggestion_interaction/datacenter=codfw/year=2022/month=5/day=31/hour=21.
while scanning a simple key
 in 'reader', line 214, column 5:
        mini
        ^
could not find expected ':'
 in 'reader', line 214, column 9:
    mini
        ^

22/06/01 00:22:41 INFO Refine: Finished refinement of dataset hdfs://analytics-hadoop/wmf/data/raw/event/eqiad.mediawiki.structured_task.article.image_suggestion_interaction/year=2022/month=05/day=31/hour=21 -> `event`.`mediawiki_structured_task_article_image_suggestion_interaction` /wmf/data/event/mediawiki_structured_task_article_image_suggestion_interaction/datacenter=eqiad/year=2022/month=5/day=31/hour=21. (# refined records: 57)

And later on, maybe this is just a summary of the above?

22/06/01 00:36:50 ERROR Refine: The following 1 of 2 dataset partitions for output table `event`.`mediawiki_structured_task_article_image_suggestion_interaction` failed refinement:
        org.wikimedia.analytics.refinery.job.refine.RefineTargetException: Failed refinement of hdfs://analytics-hadoop/wmf/data/raw/event/codfw.mediawiki.structured_task.article.image_suggestion_interaction/year=2022/month=05/day=31/hour=21 -> `event`.`mediawiki_structured_task_article_image_suggestion_interaction` /wmf/data/event/mediawiki_structured_task_article_image_suggestion_interaction/datacenter=codfw/year=2022/month=5/day=31/hour=21. Original exception: while scanning a simple key
 in 'reader', line 214, column 5:
        mini
        ^
could not find expected ':'
 in 'reader', line 214, column 9:
    mini
        ^

22/06/01 00:36:50 INFO Refine: Successfully refined 1 of 2 dataset partitions into table `event`.`mediawiki_structured_task_article_image_suggestion_interaction` (total # refined records: 56)

Event Timeline

Ah okay, so the success you see is for the eqiad partition, which is different from the codfw one, which is the one failing.

And indeed, line 214 in this schema starts with minimum, so it is pretty clear the response is being truncated somehow.
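To make the failure mode concrete: a parser reading a document that was cut off mid-token fails with exactly this kind of "unexpected end" error. A tiny stdlib-only illustration (Python, with a made-up fragment echoing the mini/minimum clue; the real failure is in the YAML scanner, but the shape is the same):

```python
import json

# A made-up schema fragment; cutting it off mid-stream simulates a
# truncated HTTP response body.
doc = '{"mini": {"type": "integer", "minimum": 0}}'
truncated = doc[:12]  # '{"mini": {"t' -- chopped off mid-key

json.loads(doc)  # the full document parses fine
try:
    json.loads(truncated)
except json.JSONDecodeError as e:
    # The parser points at where the input unexpectedly ended,
    # just like SnakeYAML's "found unexpected end of stream".
    print(e.msg)
```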

I somehow doubt this is the served HTTP response's fault. Ultimately the loading goes through CloseableHttpClient, which IIUC uses DefaultHttpRequestRetryHandler, which retries 3 times.

Something sure is fishy!

I'm starting to regret our old decision to maintain both JSON and YAML schema formats in the schema repos. (See also T308450: [BUG] jsonschema-tools materializes fields in yaml in a different order than in json files.) The intention was to prefer YAML, but make it a little easier for folks to parse JSON without an extra dependency on a YAML lib.

In T309717: Event Utilities partially downloads schemas, I'm starting to suspect that the simple Refine failures we often see and fix by rerunning might be caused by a bug in the SnakeYAML parser used by Jackson. I'm not certain of this, but something is very fishy here.

I'd wanted to just switch our EventSchemaLoader's LATEST_FILE_NAME to be latest.json, but because of T308450: [BUG] jsonschema-tools materializes fields in yaml in a different order than in json files I'm worried that there may be some issues with evolving Hive tables when I do. I can't recall for sure, but I think that if the order of a struct's fields changes, there might be some incompatible Hive schema errors?

Change 808978 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery/source@master] Spark JsonSchemaConverter - log when schema does not contain type field

https://gerrit.wikimedia.org/r/808978

Change 808978 merged by Milimetric:

[analytics/refinery/source@master] Spark JsonSchemaConverter - log when schema does not contain type field

https://gerrit.wikimedia.org/r/808978

@Milimetric, I see that you merged it, does it mean we can close?

The change was just a logging change to help us debug this. I don't think we actually deployed that change though (we need to bump the Refine .jar version in puppet). I'll do it this ops week.

This task is going to be open for a long time, until we find the actual cause.

Thank you for the context @Ottomata. From now on, I'll change the line of questioning. Next time, instead of "can we close?", I'll ask "what are the next steps?" : )

Ah, I was going to update the version of refinery-job that Refine uses, but I realized that we merged Spark 3 changes into refinery-source. @JAllemandou, am I correct in assuming that we can't update Refine's refinery version until we finish Spark 3 in T295072: Install spark3 in analytics clusters?


spark3 code has been merged, and this prevents regular production deploys of refinery until we use spark3 for refinery (whether spark3 is everywhere or not). I think we can still backport some changes if needed using dedicated branches and manual upload of jars. Not sure if it's worth it here.

No, let's wait. This is just a small logging change.

Alright! We are finally on Spark 3, and deployed the error logging change that Dan wrote last year!

We saw one of these failures today:

The following 1 of 1 dataset partitions for output table `event`.`navigationtiming` failed refinement:
        org.wikimedia.analytics.refinery.job.refine.RefineTargetException: Failed refinement of hdfs://analytics-hadoop/wmf/data/raw/eventlogging_legacy/eventlogging_NavigationTiming/year=2023/month=06/day=27/hour=05 -> `event`.`navigationtiming` /wmf/data/event/navigationtiming/year=2023/month=6/day=27/hour=5. Original exception: java.lang.IllegalArgumentException: `mwSpecialPageName` schema must contain 'type' field. schema is:
{"description":"If the page being measured is a Special Page, this will contain the name of the page","type":null}

applicationId: application_1687442676858_24893

This is stranger than I thought. Clearly the HTTP response for the schema is not being truncated. Somehow the field type is just wrong?! mwSpecialPageName in this schema has type string!

mwSpecialPageName:
  description: >-
    If the page being measured is a Special Page, this will contain the
    name of the page
  type: string

The conditional checks if the type is Java null (meaning not present), or if it is set to JSONSchema "null". In this case it looks like it is JSONSchema "null", which is just so strange!

The conditional checks if the type is Java null (meaning not present), or if it is set to JSONSchema "null".

Wait, no, the check that throws this error is:

if (schemaTypeNode == null || !schemaTypeNode.isTextual)

And indeed null is not Java null, but it is not 'textual' either. type is set to JSON null, which is not a valid JSONSchema type at all.
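The distinction between "missing", "JSON null", and "a string" is easy to lose, so here's a minimal sketch of the check (Python, purely illustrative; the real check lives in refinery-source's JsonSchemaConverter, and note Python collapses Jackson's missing-node vs NullNode distinction into None):

```python
import json

# Three variants of a schema property, as seen in the error report:
present_null = json.loads('{"description": "x", "type": null}')  # the bad case
absent = json.loads('{"description": "x"}')
valid = json.loads('{"description": "x", "type": "string"}')

def type_is_valid(node):
    """Mirrors `schemaTypeNode == null || !schemaTypeNode.isTextual`:
    the type field must exist AND be a JSON string."""
    t = node.get("type")       # None if the key is missing...
    return isinstance(t, str)  # ...and JSON null also parses to None

print(type_is_valid(valid))         # True
print(type_is_valid(present_null))  # False: JSON null is not "textual"
print(type_is_valid(absent))        # False: missing key
```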

So this is even stranger. How could the JSONSchema be parsed as JSON correctly, yet have a specific node with a different value than in the parsed JSON document?

/me would love to see a unit test replicating this :)

Change 933456 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery/source@master] JsonSchemaConverter - log full JSONSchema when converting to Spark fails

https://gerrit.wikimedia.org/r/933456

Change 933456 merged by Ottomata:

[analytics/refinery/source@master] JsonSchemaConverter - log full JSONSchema when converting to Spark fails

https://gerrit.wikimedia.org/r/933456

Added more logging, and scheduled it for next week's train: https://etherpad.wikimedia.org/p/analytics-weekly-train

would love to see a unit test replicating this :)

I'm not sure what unit test to add. The JsonSchemaConverter code is actually doing the right thing: failing when type is null. The question is: why is type intermittently null? A rerun succeeds with a non-null type.

The code does ultimately use com.github.fge.jsonschema.core.load.SchemaLoader in JsonSchemaLoader. I think we don't strictly need the SchemaLoader stuff, as our $refs are always resolved in the concrete versioned schemas we use.

If https://gerrit.wikimedia.org/r/933456 doesn't clue us into the problem, I think the next things I'd try are:

This happened again today, the exception was:

 at [Source: (StringReader); line: 220, column: 14]
        at com.fasterxml.jackson.dataformat.yaml.snakeyaml.error.MarkedYAMLException.from(MarkedYAMLException.java:27)
        at com.fasterxml.jackson.dataformat.yaml.YAMLParser.nextToken(YAMLParser.java:359)
        at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:252)
        at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeArray(JsonNodeDeserializer.java:437)
        at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:261)
        at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:68)
        at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:15)
        at com.fasterxml.jackson.databind.ObjectMapper._readValue(ObjectMapper.java:4173)
        at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2577)
        at org.wikimedia.eventutilities.core.json.JsonLoader.parse(JsonLoader.java:108)
        at org.wikimedia.eventutilities.core.json.JsonLoader.load(JsonLoader.java:59)
        ... 41 more
Caused by: while scanning a quoted scalar
 in 'reader', line 220, column 16:
        client_dt: '2020-04-0
                   ^
found unexpected end of stream
 in 'reader', line 220, column 26:
        client_dt: '2020-04-0

This is such a strange error. The schema in question is wikipediaportal.

The line is from the schema examples. In this case the response is just truncated!

The code reading the HTTP response body is in BasicHttpClient and BasicHttpResult. BasicHttpResult ends up calling org.apache.http.util.EntityUtils.toByteArray, which uses a ByteArrayBuffer to fill and return a byte[].

Is it possible there is a bug in EntityUtils here? If the response headers don't include Content-Length, the initial buffer size is set to 4096, but it is expanded as the response is read into the buffer. I checked, and a response from schema.wikimedia.org that is not cached by the CDN does not have Content-Length, but a cached response does.

How else could the body be returned truncated without causing an exception earlier on?
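For reference, the classic way a body gets silently truncated is a read loop that mistakes a short read for end-of-stream. IIUC EntityUtils.toByteArray does loop properly, so this exact bug shouldn't apply here, but it's the shape of bug worth ruling out. A hypothetical sketch (Python, not EntityUtils' actual code):

```python
import io

class ChunkyStream:
    """Simulates a socket-like stream that returns data in small chunks,
    as happens when a response arrives across several TCP segments."""
    def __init__(self, data, chunk=10):
        self.buf = io.BytesIO(data)
        self.chunk = chunk
    def read(self, n):
        return self.buf.read(min(n, self.chunk))

body = b"x" * 100

def buggy_read(stream, bufsize=4096):
    # BUG: a single short read is mistaken for the whole body.
    return stream.read(bufsize)

def correct_read(stream, bufsize=4096):
    # Loop until read() returns b"", the real end-of-stream signal.
    parts = []
    while True:
        chunk = stream.read(bufsize)
        if not chunk:
            break
        parts.append(chunk)
    return b"".join(parts)

print(len(buggy_read(ChunkyStream(body))))    # 10 -> truncated
print(len(correct_read(ChunkyStream(body))))  # 100 -> complete
```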

Even stranger: how is it possible that sometimes we see what look like untruncated but incorrect responses?

As a next step, we could try adding debug logging in BasicHttpClient and/or BasicHttpResponse.
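The kind of debug check I have in mind: whenever the server does send Content-Length (the CDN-cached case), compare it against the bytes we actually buffered and log loudly on mismatch. A sketch with hypothetical names (check_body_length is not a real method in BasicHttpClient):

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("schema-fetch")

def check_body_length(headers, body):
    """Hypothetical sanity check: if the server sent Content-Length,
    verify the body we buffered actually matches it."""
    declared = headers.get("Content-Length")
    if declared is not None and int(declared) != len(body):
        log.warning(
            "Truncated response? Content-Length=%s but read %d bytes",
            declared, len(body),
        )
        return False
    return True

print(check_body_length({"Content-Length": "4096"}, b"x" * 4096))  # True
print(check_body_length({"Content-Length": "4096"}, b"x" * 1000))  # False
print(check_body_length({}, b"x" * 1000))  # True: nothing to compare against
```

This wouldn't catch the uncached (no Content-Length) case, but it would at least tell us whether truncation also happens on responses whose length we can verify.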

Another example happened today:

application_1719935448343_494307

24/08/04 16:26:16 ERROR Refine: Failed refinement of dataset hdfs://analytics-hadoop/wmf/data/raw/eventlogging_legacy/eventlogging_HomepageVisit/year=2024/month=08/day=04/hour=13 -> `event`.`homepagevisit` /wmf/data/event/homepagevisit/year=2024/month=8/day=4/hour=13.
org.wikimedia.eventutilities.core.json.JsonLoadingException: Failed loading JSON/YAML data from /analytics/legacy/homepagevisit/latest
	at org.wikimedia.eventutilities.core.json.JsonLoader.load(JsonLoader.java:61)
	at org.wikimedia.eventutilities.core.json.JsonSchemaLoader.load(JsonSchemaLoader.java:64)
	at org.wikimedia.eventutilities.core.event.EventSchemaLoader.load(EventSchemaLoader.java:120)
	at org.wikimedia.eventutilities.core.event.EventSchemaLoader.getSchema(EventSchemaLoader.java:174)
	at org.wikimedia.eventutilities.core.event.EventSchemaLoader.getLatestEventSchema(EventSchemaLoader.java:234)
	at org.wikimedia.eventutilities.core.event.EventSchemaLoader.getLatestEventSchema(EventSchemaLoader.java:243)
	at org.wikimedia.analytics.refinery.job.refine.EventSparkSchemaLoader.loadSchema(SchemaLoaders.scala:75)
	at org.wikimedia.analytics.refinery.job.refine.RefineTarget.schema$lzycompute(RefineTarget.scala:177)
	at org.wikimedia.analytics.refinery.job.refine.RefineTarget.schema(RefineTarget.scala:176)
	at org.wikimedia.analytics.refinery.job.refine.RefineTarget.schemaForRead$lzycompute(RefineTarget.scala:187)
	at org.wikimedia.analytics.refinery.job.refine.RefineTarget.schemaForRead(RefineTarget.scala:186)
	at org.wikimedia.analytics.refinery.job.refine.RefineTarget.inputDataFrameReader$lzycompute(RefineTarget.scala:253)
	at org.wikimedia.analytics.refinery.job.refine.RefineTarget.inputDataFrameReader(RefineTarget.scala:252)
	at org.wikimedia.analytics.refinery.job.refine.RefineTarget.inputDataFrame(RefineTarget.scala:545)
	at org.wikimedia.analytics.refinery.job.refine.RefineTarget.inputPartitionedDataFrame(RefineTarget.scala:571)
	at org.wikimedia.analytics.refinery.job.refine.Refine$.getInputPartitionedDataFrame(Refine.scala:770)
	at org.wikimedia.analytics.refinery.job.refine.Refine$.$anonfun$refineTargets$1(Refine.scala:716)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:75)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at org.wikimedia.analytics.refinery.job.refine.Refine$.refineTargets(Refine.scala:712)
	at org.wikimedia.analytics.refinery.job.refine.Refine$.$anonfun$apply$6(Refine.scala:425)
	at scala.collection.parallel.AugmentedIterableIterator.map2combiner(RemainsIterator.scala:116)
	at scala.collection.parallel.AugmentedIterableIterator.map2combiner$(RemainsIterator.scala:113)
	at scala.collection.parallel.immutable.ParHashMap$ParHashMapIterator.map2combiner(ParHashMap.scala:80)
	at scala.collection.parallel.ParIterableLike$Map.leaf(ParIterableLike.scala:1056)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at scala.util.control.Breaks$$anon$1.catchBreak(Breaks.scala:67)
	at scala.collection.parallel.ParIterableLike$Map.tryLeaf(ParIterableLike.scala:1053)
	at java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Caused by: com.fasterxml.jackson.dataformat.yaml.snakeyaml.error.MarkedYAMLException: while scanning a quoted scalar
 in 'reader', line 244, column 9:
        dt: '2020-04-02T19:11:20.942
            ^
found unexpected end of stream
 in 'reader', line 244, column 33:
        dt: '2020-04-02T19:11:20.942
                                    ^

 at [Source: (StringReader); line: 244, column: 7]
	at com.fasterxml.jackson.dataformat.yaml.snakeyaml.error.MarkedYAMLException.from(MarkedYAMLException.java:27)
	at com.fasterxml.jackson.dataformat.yaml.YAMLParser.nextToken(YAMLParser.java:359)
	at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:252)
	at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeArray(JsonNodeDeserializer.java:437)
	at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:261)
	at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:68)
	at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:15)
	at com.fasterxml.jackson.databind.ObjectMapper._readValue(ObjectMapper.java:4173)
	at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2577)
	at org.wikimedia.eventutilities.core.json.JsonLoader.parse(JsonLoader.java:108)
	at org.wikimedia.eventutilities.core.json.JsonLoader.load(JsonLoader.java:59)
	... 41 more
Caused by: while scanning a quoted scalar
 in 'reader', line 244, column 9:
        dt: '2020-04-02T19:11:20.942
            ^
found unexpected end of stream
 in 'reader', line 244, column 33:
        dt: '2020-04-02T19:11:20.942
                                    ^

	at org.yaml.snakeyaml.scanner.ScannerImpl.scanFlowScalarSpaces(ScannerImpl.java:1928)
	at org.yaml.snakeyaml.scanner.ScannerImpl.scanFlowScalar(ScannerImpl.java:1843)
	at org.yaml.snakeyaml.scanner.ScannerImpl.fetchFlowScalar(ScannerImpl.java:1027)
	at org.yaml.snakeyaml.scanner.ScannerImpl.fetchSingle(ScannerImpl.java:1002)
	at org.yaml.snakeyaml.scanner.ScannerImpl.fetchMoreTokens(ScannerImpl.java:390)
	at org.yaml.snakeyaml.scanner.ScannerImpl.checkToken(ScannerImpl.java:227)
	at org.yaml.snakeyaml.parser.ParserImpl$ParseBlockMappingValue.produce(ParserImpl.java:586)
	at org.yaml.snakeyaml.parser.ParserImpl.peekEvent(ParserImpl.java:158)
	at org.yaml.snakeyaml.parser.ParserImpl.getEvent(ParserImpl.java:168)
	at com.fasterxml.jackson.dataformat.yaml.YAMLParser.nextToken(YAMLParser.java:355)
	... 50 more

@Ottomata: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on October 11th.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!