Page MenuHomePhabricator

Gracefully handle unmarshalling errors in Feed (and remotely send error information)
Closed, ResolvedPublic2 Story Points

Description

Recently, for a five-hour period, the app was crashing on startup for all non-English users due to an unmarshalling error. T144990: [CRASH] Content Service shouldn't send empty objects

We should make the app more resilient, so that we can handle such errors without crashing.

Event Timeline

Mholloway created this task.Sep 8 2016, 1:38 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 8 2016, 1:38 PM
Mholloway triaged this task as High priority.Sep 8 2016, 1:39 PM
Dbrant renamed this task from Handle unmarshalling errors without crashing to Gracefully handle unmarshalling errors in Feed (and remotely send error information).Sep 9 2016, 6:21 PM
Dbrant set the point value for this task to 2.
Mholloway moved this task from To Do to Doing on the Mobile-App-Android-Sprint-91-Protactinium board.
Mholloway added subscribers: Niedzielski, bearND, Dbrant.EditedSep 22 2016, 8:21 PM

Digging deeper into this, it wasn't actually an unmarshalling error that was causing the crash, but a lowly NullPointerException:

java.lang.NullPointerException: Attempt to invoke virtual method 'int java.lang.String.hashCode()' on a null object reference
	at org.wikipedia.feed.featured.FeaturedArticleCard.hashCode(FeaturedArticleCard.java:81)
	at org.wikipedia.feed.model.Card.getHideKey(Card.java:27)
	at org.wikipedia.feed.FeedCoordinatorBase.isCardHidden(FeedCoordinatorBase.java:151)
	at org.wikipedia.feed.FeedCoordinatorBase.access$100(FeedCoordinatorBase.java:20)
	at org.wikipedia.feed.FeedCoordinatorBase$ExhaustionClientCallback.success(FeedCoordinatorBase.java:121)
	at org.wikipedia.feed.aggregated.AggregatedFeedContentClient$CallbackAdapter.onResponse(AggregatedFeedContentClient.java:105)
	at retrofit2.ExecutorCallAdapterFactory$ExecutorCallbackCall$1$1.run(ExecutorCallAdapterFactory.java:68)
	at android.os.Handler.handleCallback(Handler.java:739)
	at android.os.Handler.dispatchMessage(Handler.java:95)
	at android.os.Looper.loop(Looper.java:148)
	at android.app.ActivityThread.main(ActivityThread.java:5551)
	at java.lang.reflect.Method.invoke(Native Method)
	at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:730)
	at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:620)

The field in question and others in the feed content classes were marked @NonNull on the assumption that, when assembling the aggregated response, the server code would simply omit an empty section. (Ha!)

For at least the hardening part of this task, then, it seems the appropriate measure is to change all of the fields in the feed content classes to @Nullable and add the necessary null checks. @Dbrant @Niedzielski @bearND do you agree?

@Mholloway, right, the NPE is because the JSON violated the expected constraints of the model. When we wrote the service we said that if a response was sent, it would always provide certain fields. When we wrote the Android client, we clarified the contract in Javaland by marking them @NonNull or @Nullable.

AFAIK, Gson does not provide a mechanism to verify @NonNull assumptions. Gson will unmarshall @NonNull FooModel foo as null if the field does not exist. The code that dereferences the foo field under the false assumption that it can never be null causes the NPE. It's an unmarshalling error because an invalid object was created through the process of unmarshalling and it's a broken contract because the API promises a field that it never sent.

Invalid objects should never be allowed to exist.

Here are my thoughts on some of the approaches I'm aware of:

  1. You could handle incomplete JSON responses by marking every field @Nullable. This means that every object is a valid one but imagine the checks we'll have to add. You think it's a pain to have to check getActivity() != null in your callback, wait until it's that way for nearly every field in the repo! I would strongly disagree with this change.
  1. We could try to build a custom Gson unmarshaller that inspected each @NonNull annotated field and threw an unmarshalling error (which we would catch) when one was detected. This would be much less invasive but also means we agree that we can't trust our own service. APIs are contracts and a possibly broken contract really means no contract. We should treat every client of the Content Service like a fickle bomb.
  1. My opinion is that we should inspect the Content Service output in the service itself. Write it once at the source in JavaScriptland tests where the livin' is easy. The alternative is writing 1 or 2 into every client of the Content Service. Each client will have to assume a worst case scenario for every response and treat it as incomplete until proven otherwise.

In some ways, I really wish we had 2 built into the Gson library by default. We wouldn't have to crash when something akin to a network error occurred. However, it's not built in for us or for other clients. The default implementation is pretty unforgiving but such is the nature of contracts. To the extent possible, I would focus on 3. On the client, 2 seems most practical but also a little challenging if it doesn't exist already.

Mholloway added a comment.EditedSep 23 2016, 2:16 PM

@Niedzielski, thanks for the detailed thoughts. Fair enough about this being an unmarshalling error.

I started along the lines of (1) before you responded and I'll upload the WIP patch just to demonstrate where I was going with it. My strategy for limiting the need for null checks was to make only the content model class fields @Nullable, and then to have the AggregatedFeedContent class act as a kind of gatekeeper that would check for the presence or absence of expected fields in the constituent objects and then forward on either a valid object or null to the AggregatedFeedContentClient in the getter calls in onResponse (the only place AggregatedFeedContent's getters are consumed), where we already have null handling.

Still, (2) or (3) or some combination thereof seems like a more robust solution. I'll do some preliminary looking at the unmarshaller. @bearND , any thoughts on (3)?

Change 312521 had a related patch set uploaded (by Mholloway):
[DO NOT MERGE] WIP: Make all feed content model class fields @Nullable

https://gerrit.wikimedia.org/r/312521

Looking into (2), I can see a couple different possibilities so far.

(a) We could simply create null-checking TypeAdapters for each class we're concerned about and register them when creating GsonUtil.DEFAULT_GSON.

(b) We could create an annotation-checking deserializer along the lines of something like this, only looking to @NonNull and @Nullable (I'm guessing it's possible somehow to change the retention policy on these to RUNTIME) rather than defining a custom annotation.

Of the two, (b) seems more elegant but also more complicated and possibly functionally worse. In particular I'm worried about poor performance as a result of using reflection in this way for a bunch of fields each time we fetch feed content. In particular I'm worried an effect that wouldn't show up on our high-end dev phones might be much worse in the field.

Change 312521 abandoned by Mholloway:
[DO NOT MERGE] WIP: Make all feed content model class fields @Nullable

Reason:
demo

https://gerrit.wikimedia.org/r/312521

My apologies for tardiness in responding. I haven't been as active in this thread as I should have.

In 1, I complained about reduced readability by boilerplate but I should have also mentioned deferred model validation as a much worse risk. The validation should happen within the unmarshaller ideally or the data client if not. The risks of any later validation are:

  • A client forgets to validate the model (the app crashes)
  • A client validates the model directly or as a side effect and there's no graceful opportunity to handle an unmarshalling error (the app crashes)

I believe your patch did validate in the data client so this shouldn't be an issue. I wanted to mention it because I think the patch only considered a subset of service responses. What will we do for the rest? This change might have prevented the issue that sparked this task but will we keep spawning new tasks to validate each model until the entire API is patched on the Android client?

a

This is more of a 1a than a 2a to me because we're still writing code to check each model by hand. However, it's really great to move this into the actual unmarshalling layer so this is way better than "plain 1" (validation in the data client) in my opinion.

BTW, we'll want to write tests for this code. From personal experience, I'll mention that although JSON parsing tests are easily written, they're mind-numbingly boring. It's much more interesting to write tests at the source where a real transform is being performed. As an aside, I used to use Protobuf which actually generates a jar for de/serialization and it was pretty cool to not have to write custom unmarshallers or custom tests while still being functionally performant.

b

This seems a little daunting. However, after 3, it's the approach I would sooner at least try, after a brief survey of libraries available, than 1 or a. I would choose it because it's thorough, generic, and isolated.

To reiterate, in this approach, models that require a missing @NonNull annotation are bugs in the app. Models that have @NonNull but the field is not present or null are bugs in the service, but throw an unmarshalling exception which should be caught.

FWIW, Gson is built on reflection and we explicitly or implicitly chose programmer performance over code performance when we used it. If model validation has to happen on the client, it will take extra cycles but I hope it will be negligible and should happen on a background thread.

All I've said is opinion, of course, but I've tried to justify them. Other folks might feel differently and should weigh in.

Mholloway added a comment.EditedOct 5 2016, 4:41 PM

This is coming along. I got to an [1, 2](a)-like solution yesterday that worked. I've been trying to get to a generalized, 2(b)-like, TypeAdapterFactory-based solution since yesterday afternoon. I'm basing it off the Jesse Wilson joint here (https://stackoverflow.com/a/11272452) only not making it an abstract class with POJO-specific subclasses but just a single FieldCheckingTypeAdapterFactory that should work for all types. My plan was to use either an empty marker interface or (preferably) a class annotation to mark certain classes for field checking, and to use that in public final <T> TypeAdapter<T> create(Gson gson, TypeToken<T> type) to conditionally return a custom field-checking TypeAdapter.

The problem is that I haven't found any way to obtain either the interface(s) or the annotations for a class based on the Type we have access to via the TypeToken in create(). One way around this problem is just to create a custom TypeAdapter for every object we deserialize, and defer the conditional treatment until the concrete object is instantiated in TypeAdapter.read(). Creating that many custom TypeAdapters seems really inefficient, though.[1]

So I think in practical terms, unfortunately, we might have to fall back on having some mechanism of registering specific classes for field checking. Most simple would be to just register a field-checking deserializer separately for each one when creating DEFAULT_GSON in GsonUtil.

What do you think? To the best of your knowledge, is it in fact impossible to get the interfaces and/or annotations for a class from a Gson TypeToken? The generic TypeAdapterFactory solution seems tantalizingly close.

[1] Here's a related article from Foursquare on Gson custom deserialization pitfalls.

I don't know much about Protobufs but they're starting to sound pretty nice...

That's great!!

I'm still digging into the Gson implementation to try to identify similar solutions that might work here and answer your specific questions but I wanted to post this before lunch: I think the approach you're pursing is very similar to Gson's @Since. That is, you just want to mark fields with an annotation and when unmarshalling, test those fields. I think @Since uses Excluder[0] for un/marshalling. The difference in your solution would be the annotation used, the test on that annotation (is it null?), and the result of test failure (just throw an unmarshalling exception).

..Except I think Gson treats Excluder as a special case and not just a factory. This is what I'm still looking into so expect a follow up while I dig into the lib a bit more.

[0] https://github.com/google/gson/blob/master/gson/src/main/java/com/google/gson/internal/Excluder.java

Change 314331 had a related patch set uploaded (by Niedzielski):
Demo: required field Gson fields

https://gerrit.wikimedia.org/r/314331

@Mholloway, after lunch I thought I would do a quick search and see what's available online. I came cross this[0] StackOverflow which I pretty much just copy and pasted. I wrote a test around for the absolute simplest case and it seemed to work. Just wanted to know if you had stumbled upon it too and whether you had any luck with it. I've uploaded a demo patch[1] but it needs a lot of thinking, testing, and work.

[0] http://stackoverflow.com/questions/35587252/java-to-json-validation-using-gson/
[1] https://gerrit.wikimedia.org/r/314331

Well, all it took was a little debugging and a lot of patience but it looks like TypeToken.getRawType().getAnnotations() and TypeToken.getRawType().getInterfaces() are what I was looking for. Thanks for the pointer to Excluder, also very useful!

And I just saw you pushed up a demo patch, thanks!

Sweet! I think it's just that Gson Field data type that has the info you need. I'm not sure what interesting details will fall out when you go to make this production quality but I look forward to seeing it. I'll close out that demo patch.

Change 314331 abandoned by Niedzielski:
Demo: required field Gson fields

Reason:
demoed

https://gerrit.wikimedia.org/r/314331

Change 314428 had a related patch set uploaded (by Mholloway):
WIP: Add field checking for API responses

https://gerrit.wikimedia.org/r/314428

Change 314331 restored by Niedzielski:
Demo: required field Gson fields

Reason:
demoooo

https://gerrit.wikimedia.org/r/314331

Change 314331 abandoned by Niedzielski:
Demo: required field Gson fields

Reason:
demoed

https://gerrit.wikimedia.org/r/314331

  1. My opinion is that we should inspect the Content Service output in the service itself. Write it once at the source in JavaScriptland tests where the livin' is easy.

Still, (2) or (3) or some combination thereof seems like a more robust solution. I'll do some preliminary looking at the unmarshaller. @bearND , any thoughts on (3)?

Sorry I am responding so late to this. I was just made aware of this during our retro.
I think this is a reasonable approach. We try to add tests in MCS land whenever we find an issue and what issues we foresee already. This particular incident happened when we transitioned the aggregated feed endpoint from MCS to RB and it was implemented by someone else with a different understanding of the contract between the service and the app.
So, we need a more comprehensive definition or specification of this contract. I think the best place for this contract is the Swagger spec. I'm just not sure at this point if we can express everything in there. Whatever we cannot fully express in Swagger should be augmented by JavaScript tests in the service, in MCS or RB.

Re: Protobufs. I don't have on hands experience with Protobufs. I did find at least one JS implementation. Whether to allow other content-type formats to be emitted by the services is a whole other discussion though. (BTW, a year ago, I was also excited about FlatBuffers, which promise to be more memory efficient.)

Change 315304 had a related patch set uploaded (by Mholloway):
Beef up feed endpoint response format checks

https://gerrit.wikimedia.org/r/315304

Change 315995 had a related patch set uploaded (by Mholloway):
Add response definitions to spec for feed endpoints

https://gerrit.wikimedia.org/r/315995

Change 315996 had a related patch set uploaded (by Mholloway):
Add response schema validation

https://gerrit.wikimedia.org/r/315996

Change 316469 had a related patch set uploaded (by Mholloway):
WIP: Add tests for field checking with the new @Require annotation

https://gerrit.wikimedia.org/r/316469

Change 315304 abandoned by Mholloway:
Beef up feed endpoint response format checks

Reason:
I think this is superseded by the new spec check stuff.

https://gerrit.wikimedia.org/r/315304

Change 316837 had a related patch set uploaded (by Mholloway):
Mark required (@NonNull) Gson POJO fields @Required

https://gerrit.wikimedia.org/r/316837

Change 314428 merged by jenkins-bot:
Add field checking for API responses

https://gerrit.wikimedia.org/r/314428

Change 316469 merged by jenkins-bot:
Add tests for field checking with the new @Require annotation

https://gerrit.wikimedia.org/r/316469

Change 316837 merged by jenkins-bot:
Mark required (@NonNull) Gson POJO fields @Required

https://gerrit.wikimedia.org/r/316837

Change 315995 merged by jenkins-bot:
Add response definitions to spec for feed endpoints

https://gerrit.wikimedia.org/r/315995

Change 315996 merged by jenkins-bot:
Add response schema validation

https://gerrit.wikimedia.org/r/315996

Change 319899 had a related patch set uploaded (by Mholloway):
Log missing @Required fields remotely when unmarshalling

https://gerrit.wikimedia.org/r/319899

Mholloway removed Mholloway as the assignee of this task.Nov 23 2016, 5:48 PM

Going to release this at least for now since there's no point sitting on it while I'm working on other stuff.

Mholloway closed this task as Resolved.
Mholloway claimed this task.

The main piece of this task was about handling unmarshalling errors. Remote logging is really a separate follow-up task, so I've created a new Phab ticket for it so we can close this one.

Change 319899 abandoned by Mholloway:
WIP: Log missing @Required fields remotely when unmarshalling

Reason:
We'll likely go in a different direction (EventLogging) with this.

https://gerrit.wikimedia.org/r/319899