Page MenuHomePhabricator

[Bug] The feed/featured endpoint is broken
Closed, ResolvedPublic

Description

Steps to Reproduce

https://en.wikipedia.org/api/rest_v1/feed/featured/2020/09/16
https://en.wikipedia.org/api/rest_v1/feed/featured/2020/09/01

Expected Results

  1. mostread contains list of article
  2. news contains page links
  3. onthisday contains pages
  4. tfa contains extract and page info

Actual Results

Shows empty lists.

Event Timeline

cooltey renamed this task from [Bug] The `feed/featured` endpoint is broken to [Bug] The feed/featured endpoint is broken.Sep 16 2020, 5:07 PM
cooltey triaged this task as Unbreak Now! priority.
cooltey updated the task description. (Show Details)
Mholloway added subscribers: Pchelolo, Joe.

Wikifeeds was just updated a few minutes ago to use the service proxy. Responses from the service itself looked good when I tested on staging so there is something else going on. Ping @Joe and @Pchelolo for visibility.

No, that is overridden in the production https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/8b7969653b3da43de4918d1ea89f4916a8b5ec9a/helmfile.d/services/wikifeeds/values.yaml#13 and goes via http to a local proxy.

The log messages say that it is having issues reaching the aqs-backed endpoints of restbase

For the record, the service doesn't seem to be having any trouble in staging:

mholloway-shell@deploy1001:/srv/deployment-charts/helmfile.d/services/wikifeeds$ curl -s --resolve wikifeeds.discovery.wmnet:4101:$(dig +short staging.svc.eqiad.wmnet | head -n 1) https://wikifeeds.discovery.wmnet:4101/en.wikipedia.org/v1/page/most-read/2020/09/15 | jq .
{
  "date": "2020-09-15Z",
  "articles": [
    {
      "views": 540922,
      "rank": 3,
      "$merge": [
        "http://localhost:6503/en.wikipedia.org/v1/page/summary/Dennis_Nilsen"
      ],
      "view_history": [
        {
          "date": "2020-09-11Z",
          "views": 24735
        },
        {
          "date": "2020-09-12Z",
          "views": 19751
        },
        {
          "date": "2020-09-13Z",
          "views": 26604
        },
        {
          "date": "2020-09-14Z",
          "views": 597995
        },
        {
          "date": "2020-09-15Z",
          "views": 540922
        }
      ]
    },
    {
      "views": 235390,
      "rank": 4,
      "$merge": [
        "http://localhost:6503/en.wikipedia.org/v1/page/summary/Felicitas_Mendez"
      ],
      "view_history": [
        {
          "date": "2020-09-11Z",
          "views": 20
        },
        {
          "date": "2020-09-12Z",
          "views": 11
        },
        {
          "date": "2020-09-13Z",
          "views": 2
        },
        {
          "date": "2020-09-14Z",
          "views": 25
        },
        {
          "date": "2020-09-15Z",
          "views": 235390
        }
      ]
    },
[...]
}

The configuration is exactly the same for staging and eqiad/codfw right now, so I'm not sure what's going on here. I'll have to dig into it further.

I'm reverting codfw (serving traffic) to the last known-good configuration.

MSantos lowered the priority of this task from Unbreak Now! to High.EditedSep 16 2020, 6:02 PM

Keeping this alive until we can figure out the root cause. But this should not be broken anymore.

The revert solved the issue for now. Still, we need to figure out what was going wrong, most apparently in the wikifeeds -> restbase -> aqs link when going through the proxy.

The first thing that comes to mind is that the endpoint is faulty enough on the backend that envoy kicks in its circuit breaking. The fact that the request works fine in staging is proof something more subtle is going on here.

for the record, I just confirmed: eqiad gives a correct result as well:

$ curl -s http://wikifeeds.svc.eqiad.wmnet:8889/en.wikipedia.org/v1/page/most-read/2020/09/15 | jq . | head -n 20

{
  "date": "2020-09-15Z",
  "articles": [
    {
      "views": 540922,
      "rank": 3,
      "$merge": [
        "http://localhost:6503/en.wikipedia.org/v1/page/summary/Dennis_Nilsen"
      ],
      "view_history": [
        {
          "date": "2020-09-11Z",
          "views": 24735
        },
        {
          "date": "2020-09-12Z",
          "views": 19751
        },
        {
          "date": "2020-09-13Z",

so the change was correct and we're meeting some other failure scenario. I'll dig further tomorrow.

My current theory is that when we enabled the service proxy, the pods were cpu starved. The steep increase in "throttled" cpu time when this happened seems to support this idea.

My strategy to verify this is to:

  • deploy again the service proxy in eqiad (not serving traffic) with higher limits.
  • check on logstash that no more error appear there
  • reduce the discovery dns ttl to 10 seconds
  • repool eqiad for a bit

This means that if we do encounter errors, we should be able to rollback in ~ 10 seconds by just re-depooling eqiad.

Change 628041 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] wikifeeds: enable the service proxy in eqiad, raising cpu limits.

https://gerrit.wikimedia.org/r/628041

Change 628041 merged by jenkins-bot:
[operations/deployment-charts@master] wikifeeds: enable the service proxy in eqiad

https://gerrit.wikimedia.org/r/628041

In the meantime, I changed focus given even in eqiad we were still seeing failures with just some monitoring traffic:

I am trying to understand if some specific envoy setting is causing this. I started by adding an idle_timeout to persistent connections - something that usually helps with reducing the risk of trying to reuse a closed connection.

It seems we struck gold: after adding a keepalive timeout of 4 seconds to restbase-for-services, the errors on wikifeeds in eqiad went away:

https://logstash-next.wikimedia.org/goto/eb3371acf8990189f20c0de4cefe97f7

Change 628051 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services_proxy: add a 4s keepalive timeout to restbase endpoint

https://gerrit.wikimedia.org/r/628051

Change 628051 merged by Giuseppe Lavagetto:
[operations/puppet@production] services_proxy: add a 4s keepalive timeout to restbase endpoint

https://gerrit.wikimedia.org/r/628051

Current situation: eqiad and staging use the service proxy, and after the fix was deployed show no signs of errors anymore. I am going to repool eqiad briefly to check if errors reappear when actual traffic hits the service.

After further review - while the change I made fixed the error messages, it did not fix the final restbase result.

The issue we think (thanks @akosiaris for pointing that out) is that the response from wikifeeds includes a

"$merge": [
      "http://localhost:6503/en.wikipedia.org/v1/page/summary/Dennis_Nilsen"
    ],

which restbase cannot interpret. I'll confirm with @Pchelolo if that's the case. In that hypothesis, we will probably need the code to be changed to allow to define a host:port to connect to for restbase, separated from the restbase_uri we send back in the response.

@Joe absolutely correct. Restbase just calls that url and fetches the summary (from itself, so that should be restbase url)

One of the first tasks in removing RESTBase would be to migrate this code into the feed service. This response building was done because initially we stored the feed metadata but couldn't purge it, so we assembled the result with the newer summary content on the fly. Storing feed was proven very hard and error prone, so we stopped storing the feed metadata in restbase, so now this assembly code is not needed.

Aha. Nice catch. So, @Pchelolo, it sounds like the fix here is to collect and assemble full page summaries in Wikifeeds itself?

Aha. Nice catch. So, @Pchelolo, it sounds like the fix here is to collect and assemble full page summaries in Wikifeeds itself?

Yes. Lemme create a subtask with all the details.

So for each request for a feed, we do the following:

  • we call restbase
  • restbase calls wikifeeds
  • wikifeeds calls the mw api and restbase
  • restbase calls aqs and other stuff
  • wikifeeds returns a response with $merge elements
  • restbase calls restbase to find those elements

While I understand there were historical reasons for that, I don't think they hold anymore. I think we should move these requests to happen from wikifeeds itself. For the time being and for the sake of completing the move to TLS, I'll try to just use an https url for restbase in wikifeeds - by estimation, it should work. But this behaviour needs to be rectified as soon as possible.

  • restbase calls aqs and other stuff

Slight correction: it's actually wikifeeds that's making the request to AQS. But yes, agreed on your main point.

  • restbase calls aqs and other stuff

Slight correction: it's actually wikifeeds that's making the request to AQS. But yes, agreed on your main point.

Uh, are you sure? Not what I see in the logs. Also there is no configuration I found for aqs in the chart, maybe I missed it?

My mistake. Wikifeeds calls RESTBase which calls AQS.

Change 628115 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] wikifeeds: use https to connect to restbase directly in eqiad

https://gerrit.wikimedia.org/r/628115

Change 628115 merged by jenkins-bot:
[operations/deployment-charts@master] wikifeeds: use https to connect to restbase directly in eqiad

https://gerrit.wikimedia.org/r/628115

Joe removed Joe as the assignee of this task.Sep 23 2020, 7:47 AM

De-assigning from me as the immediate bug is solved, and Petr is working on the long-term fix.