Page MenuHomePhabricator

Disabled Wikipedia Zero configs yielding errant dialogs
Closed, ResolvedPublic

Description

iOS 9.3.2, iPhone 5c, version 5.0.4 (851.1)

0. Uninstall the Wikipedia for iOS app.

  1. Add the internet facing IP address associated with your iOS host to the enabled = false configuration Zero:310-260
  2. Wait 15 minutes.
  3. Install the Wikipedia for iOS app from the App Store.
  4. Verify your internet facing IP hasn't changed, then launch the app and try to do stuff.
  5. Note that the app errantly prompts that charges are waived. Try navigating to an article from search, for good measure.
  6. Stop the app, then after verifying your internet facing IP hasn't changed, start the app again.
  7. Try to do stuff, and notice the app then says Wikipedia Zero is off.

Expected:

Client side processing of disabled configurations does not errantly yield a prompt that data charges are waived, and users are not told Wikipedia Zero is disabled without first being told it was enabled.

Root cause analysis:

The X-CS value is being emanated in API responses for disabled configurations. Client side processing sees the X-CS value and additional logic is needed to gracefully handle this condition.

Potential mitigation strategy at the server for older clients (action for @DFoy) post this change:

Update IP lists for configurations that have gone from enabled to disabled so that there is only one IP address or, if possible without breaking Zero portal, inactivate old configs such that they aren't part of any IP mapping routine.

Event Timeline

From reading the description it looks like this would affect the Android app as well. Tagging it so we can test and take action as appropriate.

hey @dr0ptp4kt in order to estimate this we need a bit more context/direction on changes.

@JMinor, would a meeting make sense? Who to invite?

might be most efficient, though I think it could just be you and whichever iOS engineer picks this up. I don't particularly need details, apart from context about severity.

It's probably a High, not an Unbreak Now! or Normal.

Whoever picks up the task should schedule an hour with me.

JMinor raised the priority of this task from Medium to High.Aug 2 2016, 9:52 PM
JMinor moved this task from Needs Triage to Bug Backlog on the Wikipedia-iOS-App-Backlog board.

I just tested this in both apps and it appears iOS-specific. I couldn't repro in the Android app.

If I'm reading your code correctly, it looks like you're looking solely to the presence or absence of an X-CS header with a network response to determine the Zero "disposition," and showing enter/exit interstitial dialogs on that basis alone. I don't believe there's any attempt to check the provider enabled/disabled state in the VCL when the X-CS header is applied (and don't know that doing so would be possible or desirable) so I wouldn't rely on that alone if I were you. The ZeroBanner API is returning an empty config as expected for disabled providers, so if possible I'd check to ensure you have a valid (nonempty) zero config before showing the enter/exit interstitial dialog.

Adam and I have 2 follow-up meetings next week to pair further on this.

Skipping TSG QA, as they don't have the setup to verify.

@dr0ptp4kt perhaps you can try this beta and help verify this is fixed?

I did an initial test and it seemed to be working better. This said, I'll need to do some further testing. My cell provider's internet facing IP can toggle between IPv6 and IPv4 and that can make testing a bit weird.

So I'll test some more and I'll look a bit more closely at https://github.com/wikimedia/wikipedia-ios/pull/869/.

I tested some more. The current solution is sufficient and should not prevent release.

I did notice symptoms suggestive of a race condition (e.g., it sometimes takes multiple new pageviews to trigger a warning when hitting enabled = false even though a zero-rated adapter such as one associated with a fixed IP wifi access point in an enabled = true config was already turned off by OS settings and the user by definition can't still be on the zero-rated enabled = true network). I suspect this is related to action=zeroconfig calls themselves bearing headers and triggering recursive calls confusing the client state.

Additionally, I suspect interference from the possibility, like in my case, of access splitting on cellular unpredictably across IPv4 and IPv6. In practice, this shouldn't be such a problem, as generally W0 users are on networks where either everything is IPv4 or IPv6 or at least the IPs are merged into one config and thus outbound headers should be identical.

One thing I fortunately didn't observe was the very bug called out originally in this task. Additionally, state transitions from a definitely enabled network to a network not in the configuration management system at all triggered warnings properly (and the reverse showed appropriate alert and fixed chrome) in a pretty consistent basis.

Note on my phone I have it set to prefer wifi and not automatically fall back to wifi. I do this for bandwidth conservation, but it's also helpful for reasoning about iOS networking adapter behavior.

I think we should probably keep this task open for now, but this piece is okay for shipping.

@dr0ptp4kt can you specify what other work is needed to complete this ticket, as of now I am putting into blocked or waiting on the next release.

I believe the action steps are as follows. I'm comfortable trying to track this down independently on a relaxed timeline if you want to take this off the release board. As it is, I think the garden variety network switching case is addressed adequately. Anyhow, the steps:

  1. Diagnose the source of the the alert view taking more than one article load to actually trigger in the case where the user's requests/responses are split between IPs that are both not zero-rated - one on a disabled partner network, one on an IP not in any config - yet previously they were on a zero-rated network.
  2. Update the code so that for the split IPs case the alert is spawned immediately on the first network response that's deemed not zero-rated.

When the requests/responses aren't split everything behaves normally. I'm only concerned here because of the interplay of multiple access points and the SIM cards that can toggle between different networks (like a multi-SIM phone).

I've been setting up the project in Xcode (yay, no need to install other stuff anymore!) to troubleshoot what I think is a race condition, and plan to try this on my physical device with a bunch of NSLog action. My read of the code suggests that a theoretical race condition between requests traversing different network IPs - at least one on a config marked disabled, and one not on a config - should still yield the same in-app broadcast notification that ultimately should trigger an alert view almost immediately. I do recall there being some peculiarities about these broadcasts in the notification subsystem failing to deliver reliably depending on the thread, but I don't recall the specifics (I do see commentary in the GitHub patch about the UI thread, as an aside).

Anyway, on a hunch I think one way to handle this more reliably is to exclude the action=zeroconfig calls from header introspection, as the actual content received in the action=zeroconfig response is itself sufficient. Additionally, I suspect that the config fetcher code should not cancel pending config fetches. But I'd like to troubleshoot a little more myself. I can even submit a patch if it turns out this works.

Okay, I figured it out why the edge case / race condition is happening. Now I need to think about whether there's a simple fix. The diagnosis follows, I'll update when I've thought more about the fix.

Whereas the js-banner JavaScript based banner cache control instructions sent to the client are over-ridden by Varnish, forcing web browser agents to re-request the js-banner every time (but typically getting an edge cached version), the action=zeroconfig cache-control headers are passed directly to the client.

The action=zeroconfig endpoint sets a Cache-Control header with a 60 second TTL, which is not overridden by Varnish per the link above. So the iOS native client is honoring that when its processing determines it should get a response from the config API due to outbound header changes, but the determination happens within 60 seconds of the last edge config fetch on that network. It is thus getting the old config implicitly from the device cache when the user goes from zero-rated to config-disabled in the scenario I've described. Contrast this with the user going from zero-rated to not-in-a-config - the client doesn't even try to obtain the config and just spawns the UI alert.

Previously, the caching instruction in Varnish for action=zeroconfig was sufficient for native paps. Back then, there might be a one minute delay at the edge before all clients would be getting a fresh configuration, but it didn't matter for the native apps, because the native apps were only using header inspection to make the determination that the user had changed from zero-rated to non-zero-rated. Previously, the X-CS header could only be sent to the client if the domain and URL was considered zero-rated; not so anymore and hence the genesis of the updates to the Android and iOS clients (including this task's primary patch).