
page-links-change EventStream doesn't appear to be outputting events
Open, Needs Triage · Public · BUG REPORT

Description

Since 18th November it looks like the page-links-change EventStream (https://stream.wikimedia.org/?doc#/Streams/get_v2_stream_page_links_change) hasn't been outputting any events.

See https://stream.wikimedia.org/v2/stream/page-links-change, which should be sending events at a rate comparable to recentchanges, but appears entirely empty.
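
For reference, a minimal sketch (assuming the Python sseclient package is installed) that counts message events on both streams over a fixed window, to compare their rates:

import time
from sseclient import SSEClient as EventSource

base = 'https://stream.wikimedia.org/v2/stream/'

def count_events(stream, seconds=60):
    # Count 'message' events seen on the given stream within the window.
    # Note: if the stream is completely silent this blocks until the
    # first event of any kind arrives.
    deadline = time.time() + seconds
    count = 0
    for event in EventSource(base + stream):
        if event.event == 'message':
            count += 1
        if time.time() > deadline:
            return count

for name in ('recentchange', 'page-links-change'):
    print(name, count_events(name), 'events in 60s')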

Event Timeline

elukey added a subscriber: elukey. Tue, Nov 26, 5:48 PM

I can see events when hitting https://stream.wikimedia.org/v2/stream/page-links-change, it just takes a bit before they are visualized. https://grafana.wikimedia.org/d/000000336/eventstreams?orgId=1&refresh=1m&from=now-30d&to=now shows an increase in usage, so we might have an issue with free connection slots, but I'd need to investigate more.

@Samwalton9 do you have a metric or something that I can check showing data missing from the 18th? Or was it reported by somebody? (Trying to understand the problem, sorry.)

elukey added a comment (edited). Tue, Nov 26, 5:56 PM

Interesting, the codfw scb2* hosts seem to be the ones affected, so I guess that depending on the DNS geolocation for stream.wikimedia.org things either work or don't (hitting the codfw hosts does not work, hitting the eqiad ones does). I still need to reproduce the missing events though; everything seems to be working well from multiple locations.
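
As a rough way to check which side of that geolocation split a given client lands on, a minimal sketch (standard library only) that resolves the hostname; running it from different vantage points and comparing the returned addresses shows where GeoDNS is sending you:

import socket

# GeoDNS may return different addresses (and therefore a different
# datacenter) depending on where the lookup is made from.
infos = socket.getaddrinfo('stream.wikimedia.org', 443, proto=socket.IPPROTO_TCP)
for address in sorted({info[4][0] for info in infos}):
    print(address)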

I can see events when hitting https://stream.wikimedia.org/v2/stream/page-links-change, it just takes a bit before they are visualized. https://grafana.wikimedia.org/d/000000336/eventstreams?orgId=1&refresh=1m&from=now-30d&to=now shows an increase in usage, so we might have an issue with free connection slots, but I'd need to investigate more.
@Samwalton9 do you have a metric or something that I can check showing data missing from the 18th? Or was it reported by somebody? (Trying to understand the problem, sorry.)

It was reported to us by Internet Archive, but it also affects our tooling.

Can you tell me a bit more about your tools and if I can repro in some way? What error do you see?

Looping in @mobrovac as FYI

Sam can speak to the errors on our side, but Mark Graham from the Internet Archive just reported a significant decrease in the number of reports showing up on their end.

dvd added a subscriber: dvd (edited). Wed, Nov 27, 1:01 AM

We have a very simple Python loop around sseclient.SSEClient('https://stream.wikimedia.org/v2/stream/page-links-change'). We parse the message data JSON blob, take the external links from any change with added_links, and put those URLs onto a Kafka topic. Starting sometime after midnight UTC on November 18th, our graphs showed a significant decrease in the number of URLs put onto that Kafka topic.

We're using sseclient 0.0.24. With some manual testing, we notice that we connect to the server and see a very small trickle of messages without any URLs. The production ingest script will occasionally see a burst of new URLs, but then the event rate drops back to a trickle.

The counters seem to be moving again at the moment, though!

edit: a few minutes later the event rate is back to a very slow trickle. Any connections would be coming from somewhere in 207.241.232.0/24, currently 207.241.232.165.
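
To make the trickle-vs-burst pattern easier to see, a rough sketch (same sseclient dependency as our ingest script, not production code) that logs unusually long gaps between messages:

import time
from sseclient import SSEClient as EventSource

url = 'https://stream.wikimedia.org/v2/stream/page-links-change'
last = time.time()
for event in EventSource(url):
    if event.event != 'message':
        continue
    now = time.time()
    gap = now - last
    last = now
    # Normally messages arrive several times per second; only log
    # gaps long enough to indicate a stall.
    if gap > 30:
        print('%s: %.0fs gap before this event' % (time.strftime('%H:%M:%S'), gap))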

@dvd thanks a lot for all the info, I am leaning towards too many connections on our side, will keep this task posted!

Mentioned in SAL (#wikimedia-operations) [2019-11-27T07:49:10Z] <elukey> roll restart of eventstreams on scb2* - T239220

I roll-restarted the codfw eventstreams services, so a lot of connections have now been freed. In theory, if this is the problem, you should see the full stream working as expected now (until it saturates again, but I am working on a solution for that).

I noticed the issue for Wikilink-Tool, hadn't been keeping a close eye on it though. Have restarted the stream-monitoring service and will report back once some data has filled in.

FYI, we will be moving EventStreams to Kubernetes next quarter, which should help it scale a little better.

I noticed the issue for Wikilink-Tool, hadn't been keeping a close eye on it though. Have restarted the stream-monitoring service and will report back once some data has filled in.

Thanks! In theory it should now work at full speed if the issue was too many connections; if not, we will need to look elsewhere.

dvd added a comment. Wed, Nov 27, 9:31 PM

Within the last minute or two, I just saw another burst of several thousand messages after a long pause. Now they are coming back in again at the usual rate of a few URLs per second.

Within the last minute or two, I just saw another burst of several thousand messages after a long pause. Now they are coming back in again at the usual rate of a few URLs per second.

If you are still seeing impact, it means that it was not related to the connection slots being full, since we have been stable for a while. Is there a way that you can show us a graph related to the missing events? It would help a lot in debugging. Moreover, did you notice whether you missed events or whether they just slowed down?

dvd added a comment. Thu, Nov 28, 4:18 PM

If you are still seeing impact, it means that it was not related to the connection slots being full, since we have been stable for a while. Is there a way that you can show us a graph related to the missing events? It would help a lot in debugging. Moreover, did you notice whether you missed events or whether they just slowed down?

It is somewhat confounded by an unrelated power outage affecting the crawlers that are fetching these URLs. I'll look into what we can do to separate the metrics out and make them visible to you.

As of this moment, the overall crawl volume generated by this feed is looking closer to what it was prior to Nov 18th. That implies my observations in the last 24 hours were delayed events rather than dropped events.

elukey added a comment (edited). Thu, Nov 28, 4:25 PM

I am using the following simple script:

import json
from sseclient import SSEClient as EventSource

url = 'https://stream.wikimedia.org/v2/stream/page-links-change'
for event in EventSource(url):
    if event.event == 'message':
        try:
            change = json.loads(event.data)
        except ValueError:
            pass
        else:
            print(event)

From stat1007 everything works fine (namely, I can see a constant flow of events), whereas from my home connection/laptop the events sometimes stop flowing.

The same experiment with curl doesn't show any difference (both streams are pulled correctly and in time).

I have done some tests with curl and Python again; I am only able to reproduce the problem with the Python sseclient. It seems that the client gets stuck reading from the network via OpenSSL.
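
For reference, the rough Python equivalent of the curl test, reading the raw stream with requests instead of sseclient (a minimal sketch; if raw lines keep arriving here while the sseclient loop stalls, the stall is in the client's read path rather than on the server):

import time
import requests

url = 'https://stream.wikimedia.org/v2/stream/page-links-change'

# Stream the raw SSE response and timestamp each non-empty line,
# bypassing sseclient entirely.
with requests.get(url, headers={'Accept': 'text/event-stream'}, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line:
            print(time.strftime('%H:%M:%S'), line[:80])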

Could this be related to T179986 - another Phab task related to Python sseclient randomly hanging on the eventstreams?

dvd added a comment. Mon, Dec 2, 8:13 PM

Could this be related to T179986 - another Phab task related to Python sseclient randomly hanging on the eventstreams?

Yes, that issue has a lot in common with this one. It sounds like the problem may be specific to the Python sseclient, but still caused by something happening server side.

Since @Ottomata asked in the other issue: Restarting the client process doesn't help. A small number of system events does come through, but nothing with a URL attached. The feed will pick back up again at some point in the future, whether I restart the client or not.

It's as if the server is still sending the events just fine, in a way that a browser can decode, but sseclient can't. Then at some point the server starts sending the events in a way that sseclient can process again.

dvd added a comment. Mon, Dec 2, 8:28 PM

Since we don't have the source in a public repo currently:

import os
import re
import json
from kafka import KafkaProducer
from sseclient import SSEClient as EventSource
from prometheus_client import start_http_server, Counter

# Exposes metrics and doubles as liveness check for AutoDevOps
start_http_server(5000)
c = Counter('discovered_urls', 'URLs discovered from Wikipedia Feed')
t = Counter('total_events', 'Total events seen from Wikipedia EventStream')

# Configured by GitLab AutoDevOps as K8_SECRET_KAFKA_BROKERS
brokers = json.loads(os.getenv("KAFKA_BROKERS"))
kfk = KafkaProducer(bootstrap_servers=brokers,
                    value_serializer=lambda m: json.dumps(m).encode('utf8'))

def useful_link(url):
    """Keep http(s) URLs, but skip links that point back at archive.org."""
    if not re.match(r'https?://', url):
        return False
    if re.match(r'https?://([^:./]+\.)*archive\.org/', url):
        return False
    return True

url = 'https://stream.wikimedia.org/v2/stream/page-links-change'

for event in EventSource(url):
    t.inc()
    if event.event == 'message':
        try:
            change = json.loads(event.data)
        except ValueError:
            pass
        else:
            user_is_bot = change.get('performer',{}).get('user_is_bot', False)
            if not user_is_bot:
                for l in change.get('added_links', []):
                    if l.get('external'):
                        link = l.get('link')
                        if useful_link(link):
                            candidate_url = dict(u=link)
                            kfk.send('wikipedia_discovered', candidate_url)
                            c.inc()