Page MenuHomePhabricator

ChangesListHighlights events missing from MySQL starting 2017-07-11
Closed, ResolvedPublic5 Story Points

Description

$ kafkacat -C -b kafka1012.eqiad.wmnet:9092 -o beginning -t eventlogging_ChangesListHighlights > clhighlights
$  (for ts in $(jq '.timestamp' < clhighlights); do date +%Y%m%d -d "@$ts"; done) | sort | uniq -c
      1 20170705
     21 20170706
     10 20170707
      8 20170708
      7 20170709
     48 20170710
     33 20170711
     24 20170712
mysql:research@s3-analytics-slave [log]> select count(*), substring(timestamp, 1, 8) as day from ChangesListHighlights_16484288 where timestamp >'20170705000000' group by day;
+----------+----------+
| count(*) | day      |
+----------+----------+
|       15 | 20170705 |
|       21 | 20170706 |
|       10 | 20170707 |
|        8 | 20170708 |
|        7 | 20170709 |
|       48 | 20170710 |
|       24 | 20170711 |
+----------+----------+
7 rows in set (0.02 sec)

The discrepancy on the first day is fine, that's just because of the Kafka 7-day cutoff. The other days line up perfectly, except for the 11th (15 missing events) and the 12th (24 missing events). I took a deeper look, and up until 2017-07-11 19:53:17 UTC it appears that Kafka and MySQL contain the same events, after that MySQL started dropping almost all of them. Only three events made it through after that, at 22:04:27, 22:04:29 and 22:16:02 (all on the 11th).

Event Timeline

Catrope created this task.Jul 12 2017, 9:27 PM
Restricted Application added a project: Analytics. · View Herald TranscriptJul 12 2017, 9:27 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Could be wrong about that^

Nuria added a subscriber: Nuria.EditedJul 12 2017, 10:03 PM

The error on mysql consumer:

raise errorclass, errorvalue
sqlalchemy.exc.OperationalError: (OperationalError) (1054, "Unknown column 'rev_content_changed' in 'field list'") 'INSERT IGNORE INTO mediawiki_page_create_1 (comment, `dat
abase`, meta_domain, meta_dt, meta_id, meta_topic, meta_uri, page_id, page_namespace, page_title, rev_id, rev_len, rev_sha1, rev_timestamp, meta_request_id, meta_schema_uri,
page_is_redirect, performer_user_groups, performer_user_id, performer_user_text, rev_content_changed, rev_content_format, rev_content_model, rev_minor_edit, performer_user_
edit_count, performer_user_is_bot, performer_user_registration_dt) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s
, %s, %s)' ('Created page with "#* [[Special:UserLogin/signup | Crea una cuenta de Wikimedia]] (Si no tienes una) y a\xc3\xb1ade un nuevo tema a la [[API_talk:Main_page|p\xc
3\xa1gina de discusi\xc3\xb3n de la API]]."', 'mediawikiwiki', 'www.mediawiki.org', '2017-07-12T07:02:23+00:00', '0e406d45-66d0-11e7-b973-1418776139a6', 'mediawiki.page-crea
te', 'https://www.mediawiki.org/wiki/Translations:API:FAQ/45/es', 699206, 1198, 'Translations:API:FAQ/45/es', 2511627, 165, 'npsxc904czwqkl14fj0t9tsfkq4v03j', '2017-07-12T07
:02:23Z', '2ede89f3-182b-4edc-9c39-d0733be6d5f8', 'mediawiki/revision/create/1', 0, '["*", "user", "autoconfirmed"]', 7547201, 'Zeko3991', 1, 'wikitext', 'wikitext', 0, 6, 0
, '2016-09-27T16:52:49Z')

Translation: The schema does not match the table, this is because of this change: https://gerrit.wikimedia.org/r/#/c/362322/

Schemas need to be versioned for mysql to be able to digest the data. Otherwise schema data does not match table, insertion fails, mysql consumer stops and "some" events might not be inserted.

More errolog that looks kind of strange :

9092,kafka1020.eqiad.wmnet:9092,kafka1022.eqiad.wmnet:9092?topics=eventlogging-valid-mixed,eqiad.mediawiki.page-create,codfw.mediawiki.page-create,eqiad.mediawiki.page-move,codfw.mediawiki.page-move,eqiad.mediawiki.page-delete,codfw.mediawiki.page-delete,eqiad.mediawiki.page-undelete,codfw.mediawiki.page-undelete&auto_commit_interval_ms=1000&api_version=0.9&function=should_insert_event&identity=eventlogging_consumer_mysql_00 -> mysql://eventlog:Sixdah9AhMu@m4-master.eqiad.wmnet/log?charset=utf8&statsd_host=statsd.eqiad.wmnet&replace=True..
2017-07-12 07:02:43,852 [6646] (MainThread) old ssl module detected. ssl error handling may not operate cleanly. Consider upgrading to python 3.5 or 2.7
2017-07-12 07:02:43,919 [6646] (MainThread) Consuming topics ['eventlogging-valid-mixed', 'eqiad.mediawiki.page-create', 'codfw.mediawiki.page-create', 'eqiad.mediawiki.page-move', 'codfw.mediawiki.page-move', 'eqiad.mediawiki.page-delete', 'codfw.mediawiki.page-delete', 'eqiad.mediawiki.page-undelete', 'codfw.mediawiki.page-undelete'] from Kafka in group eventlogging_consumer_mysql_00 as eventlogging_consumer_mysql_00-eventlog1001.eqiad.wmnet.6646
2017-07-12 07:02:43,919 [6646] (MainThread) Updating subscribed topics to: ['eventlogging-valid-mixed', 'eqiad.mediawiki.page-create', 'codfw.mediawiki.page-create', 'eqiad.mediawiki.page-move', 'codfw.mediawiki.page-move', 'eqiad.mediawiki.page-delete', 'codfw.mediawiki.page-delete', 'eqiad.mediawiki.page-undelete', 'codfw.mediawiki.page-undelete']
2017-07-12 07:02:43,936 [6646] (MainThread) Error sending GroupCoordinatorRequest_v0 to node 22 [NodeNotReadyError: 22]
2017-07-12 07:02:44,023 [6646] (MainThread) Group coordinator for eventlogging_consumer_mysql_00 is BrokerMetadata(nodeId=12, host=u'kafka1012.eqiad.wmnet', port=9092, rack=None)
2017-07-12 07:02:44,023 [6646] (MainThread) Discovered coordinator 12 for group eventlogging_consumer_mysql_00
2017-07-12 07:02:44,025 [6646] (MainThread) Revoking previously assigned partitions set([]) for group eventlogging_consumer_mysql_00
2017-07-12 07:02:44,025 [6646] (MainThread) (Re-)joining group eventlogging_consumer_mysql_00
2017-07-12 07:03:04,296 [6646] (MainThread) Joined group 'eventlogging_consumer_mysql_00' (generation 23) with member_id eventlogging_consumer_mysql_00-eventlog1001.eqiad.wmnet.6646-ef1958c0-166b-4664-9d48-3f8fe4b70f4a
2017-07-12 07:03:04,296 [6646] (MainThread) Elected group leader -- performing partition assignments using range
2017-07-12 07:03:04,301 [6646] (MainThread) Successfully joined group eventlogging_consumer_mysql_00 with generation 23
2017-07-12 07:03:04,301 [6646] (MainThread) Updated partition assignment: [TopicPartition(topic=u'codfw.mediawiki.page-create', partition=0), TopicPartition(topic=u'codfw.mediawiki.page-delete', partition=0), TopicPartition(topic=u'codfw.mediawiki.page-move', partition=0), TopicPartition(topic=u'codfw.mediawiki.page-undelete', partition=0), TopicPartition(topic=u'eqiad.mediawiki.page-create', partition=0), TopicPartition(topic=u'eqiad.mediawiki.page-delete', partition=0), TopicPartition(topic=u'eqiad.mediawiki.page-move', partition=0), TopicPartition(topic=u'eqiad.mediawiki.page-undelete', partition=0), TopicPartition(topic=u'eventlogging-valid-mixed', partition=0), TopicPartition(topic=u'eventlogging-valid-mixed', partition=1), TopicPartition(topic=u'eventlogging-valid-mixed', partition=2), TopicPartition(topic=u'eventlogging-valid-mixed', partition=3), TopicPartition(topic=u'eventlogging-valid-mixed', partition=4), TopicPartition(topic=u'eventlogging-valid-mixed', partition=5), TopicPartition(topic=u'eventlogging-valid-mixed', partition=6), TopicPartition(topic=u'eventlogging-valid-mixed', partition=7), TopicPartition(topic=u'eventlogging-valid-mixed', partition=8), TopicPartition(topic=u'eventlogging-valid-mixed', partition=9), TopicPartition(topic=u'eventlogging-valid-mixed', partition=10), TopicPartition(topic=u'eventlogging-valid-mixed', partition=11)]
2017-07-12 07:03:04,302 [6646] (MainThread) Setting newly assigned partitions set([TopicPartition(topic=u'eventlogging-valid-mixed', partition=10), TopicPartition(topic=u'eventlogging-valid-mixed', partition=7), TopicPartition(topic=u'eventlogging-valid-mixed', partition=8), TopicPartition(topic=u'codfw.mediawiki.page-move', partition=0), TopicPartition(topic=u'eqiad.mediawiki.page-undelete', partition=0), TopicPartition(topic=u'eqiad.mediawiki.page-move', partition=0), TopicPartition(topic=u'codfw.mediawiki.page-create', partition=0), TopicPartition(topic=u'eventlogging-valid-mixed', partition=2), TopicPartition(topic=u'eventlogging-valid-mixed', partition=9), TopicPartition(topic=u'eventlogging-valid-mixed', partition=11), TopicPartition(topic=u'eventlogging-valid-mixed', partition=0), TopicPartition(topic=u'codfw.mediawiki.page-delete', partition=0), TopicPartition(topic=u'eventlogging-valid-mixed', partition=6), TopicPartition(topic=u'eqiad.mediawiki.page-create', partition=0), TopicPartition(topic=u'codfw.mediawiki.page-undelete', partition=0), TopicPartition(topic=u'eventlogging-valid-mixed', partition=4), TopicPartition(topic=u'eventlogging-valid-mixed', partition=5), TopicPartition(topic=u'eventlogging-valid-mixed', partition=3), TopicPartition(topic=u'eqiad.mediawiki.page-delete', partition=0), TopicPartition(topic=u'eventlogging-valid-mixed', partition=1)]) for group eventlogging_consumer_mysql_00

From conversation on IRC looks like these schemas are versioned going forward:= (cc @Pchelolo )

<nuria_> Pchelolo: I see, ok, on our end we can assume that they are versioned going forward then?
<Pchelolo> right nuria_

Nuria assigned this task to Ottomata.Jul 13 2017, 3:14 PM
Nuria edited projects, added Analytics-Kanban; removed Analytics.
Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board.

Since making the change yesterday, I see new events for this schema. Roan, s'ok now?

Yes, events are coming in again. The missing events didn't come back , though.

mysql:research@s3-analytics-slave [log]> select count(*), substring(timestamp, 1, 8) as day from ChangesListHighlights_16484288 where timestamp >'20170705000000' group by day;
+----------+----------+
| count(*) | day      |
+----------+----------+
|       15 | 20170705 |
|       21 | 20170706 |
|       10 | 20170707 |
|        8 | 20170708 |
|        7 | 20170709 |
|       48 | 20170710 |
|       24 | 20170711 |
|        2 | 20170712 |
|       34 | 20170713 |
+----------+----------+
9 rows in set (0.00 sec)
Nuria added a comment.Jul 13 2017, 3:41 PM

@Catrope : we will be backfilling past events today, once that is done you should see the events that did not get persisted to the db. Will ping ticket when done.

Ok, events have been backfilled. However, I accidentally backfilled TOO many events, in that I did not filter out bot events during the backfill process.

@Ottomata: Why are we filtering out bot events? I don't care much one way or the other, but was curious about the rationale. Also, if we're not going to have bot events in the table going forward, it should be fairly easy to DELETE FROM mediawiki_page_create_1 WHERE performer_user_is_bot = 1.

Nuria added a comment.Jul 17 2017, 8:40 PM

@kaldari : sorry this was confusing, we only filter bot events on non-eventbus events for non-mediawiki bots, mostly to not have data that comes from say, google bot crawling our apps. Please see: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging#Changes_and_Known_Problems_with_Dataset

kaldari added a comment.EditedJul 17 2017, 9:06 PM

@Nuria, @Ottomata: Now I'm more confused. How do we define "non-mediawiki bots"? Relying on the user agent doesn't seem reliable. People create new bots every day with random user agent strings, many of which are doing legitimate actions on Wikipedia that we may want to capture. Also, at EventLogging#Changes and Known Problems with Dataset, it says that the data was backfilled for all schemas but page-create. Is there a plan to backfill page-create as well, or are we leaving that how it is?

Nuria added a comment.Jul 18 2017, 2:14 AM

Is there a plan to backfill page-create as well, or are we leaving that how it is?

We are backfilling it now, the process is slightly different, that is why tickets have not been closed.

Now I'm more confused. How do we define "non-mediawiki bots"? Relying on the user agent doesn't seem reliable

To be clear these changes do not affect page-create events.

We rely on user agent in eventlogging pipeline and pageview pipeline to identify self-reported bots, that is standard methodology. The policy on this regard is here: https://meta.wikimedia.org/wiki/User-Agent_policy

In eventlogging we drop events for bots that are not initiated from mediawiki backend with user agent "MediaWiki" (this is of course imperfect)

Also note that event for EventLogging Analytics events, 'bot' events are only dropped from MySQL. The MySQL dbs are causing lots of headaches due to too much data, so we always looking for ways to reduce the data inserted there.

The Hive tables (IF I EVER GET TIME TO FINISH THEM) will have all events, including bots.

FYI, mediawiki_page_create_1 events are now backfilled.

Nuria moved this task from In Code Review to Done on the Analytics-Kanban board.Jul 24 2017, 3:09 PM
Nuria set the point value for this task to 5.Jul 24 2017, 3:13 PM
Nuria closed this task as Resolved.Jul 24 2017, 8:39 PM