/a/mw-log/archive/api on stat1002 no longer being populated
Open, HighPublic

Description

https://github.com/wikimedia/puppet/blob/production/modules/statistics/manifests/rsync/mediawiki.pp#L29

I'm guessing this has something to do with the changes from fluorine to mwlog1001.

The last completed sync appears to be on the 8th March.

New files appearing as ".api.log-20170224.gz.B8qOcC" which I am guessing is an incomplete sync.

Addshore created this task.Mon, Mar 20, 9:18 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMon, Mar 20, 9:18 AM
Addshore moved this task from Backlog to Watching on the WMDE-Analytics-Engineering board.
Addshore moved this task from Backlog to Watching on the User-Addshore board.
Addshore triaged this task as "High" priority.
elukey added a subscriber: elukey.Mon, Mar 20, 9:23 AM

Seems a perfect match with merge time of https://gerrit.wikimedia.org/r/#/c/341570

I checked on mwlog1001 and stat1001, the network connection seems fine and not blocked. For some reason files from api.log-20170224.gz have different modification time, so rsync includes them on the list of files to send to stat1002. I am running the cron manually now, it will take a bit but it should fix the problem.

Follow up: @Ottomata - should we change the MAILTO option in the stats crontab so everybody will get the notifications? Maybe analytics-alerts@?

This is probably the current error:

elukey@stat1002:/a/mw-log/archive/api$ sudo -u stats /usr/bin/rsync -vrt --perms --chmod=
g-w mwlog1001.eqiad.wmnet::udp2log/archive/api.log-*.gz /a/mw-log/archive/api/
receiving incremental file list
api.log-20170224.gz
rsync: [generator] write error: Connection timed out (110)
rsync error: error in socket IO (code 10) at io.c(837) [generator=3.1.0]
rsync error: received SIGUSR1 (code 19) at main.c(1434) [receiver=3.1.0]

Restarted with one day of contimeout and --progress:

elukey@stat1002:/a/mw-log/archive/api$ sudo -u stats /usr/bin/rsync --contimeout=86400 -v
rt --perms --progress --chmod=g-w mwlog1001.eqiad.wmnet::udp2log/archive/api.log-*.gz /a/
mw-log/archive/api/
receiving incremental file list
api.log-20170224.gz
  4,172,316,672   6%   14.57MB/s    1:04:34
should we change the MAILTO option in the stats crontab

Oo, does it email me? For sure let's change it.

The above job failed, so I tried to rsync only the March files and I can see success:

elukey@stat1002:~$ sudo -u stats /usr/bin/rsync --contimeout=86400 -vrt --perms --progress --chmod=g-w mwlog1001.eqiad.wmnet::udp2log/archive/api.log-201703*.gz /a/mw-log/archive/ap
i/
receiving incremental file list
api.log-20170301.gz
 34,405,756,549 100%   13.75MB/s    0:39:46 (xfr#1, to-chk=18/19)
api.log-20170302.gz
    141,787,136   0%   14.52MB/s    0:37:15

I believe that api.log-20170224.gz's size (~60GB) hits a timeout limit in the mwlog1001's rsync server and blocks the overall rsync.

Not sure why api.log-20170224.gz has a different modification time than the one on stat1002, but it might be due to the recent fluorine->mwlog1001 migration. The remaining thing, in my opinion would be to manually ensure that all the files are in sync between mwlog and stat1002, and then find why/where the timeout limit is.

To quicky workaround this problem we could rename api.log-20170224.gz on mwlog1001 to avoid being part of the stat1002 rsync (the file is close to expiration time, there is cron job that sweeps 30 days old files and removes them).

Mentioned in SAL (#wikimedia-operations) [2017-03-20T14:28:01Z] <elukey> (Correct one) Temporary hack for T160888 - moved /srv/mw-log/archive/api.log-20170224.gz to /srv/mw-log/archive/api_log_backup_elukey/ to avoid rsync timeouts to stat1002 (the file is big and close to being deleted for retention)

So after my tests we should see files syncing correctly during the next couple of days, and this will unblock the log rsyncs.

Things to do:

  1. Why /srv/mw-log/archive/api.log-20170224.gz size (~60GB) on mwlog1001 causes connect timeouts while rsynced to stat1002?
  2. Remove /srv/mw-log/archive/api_log_backup_elukey/*
Nuria added a subscriber: Nuria.Mon, Mar 20, 3:12 PM

We have to fix this at this time but ... have we though to publish this info via kafka instead of udp2log? mediawiki is integrated now with kafka via monolog

We have to fix this at this time but ... have we though to publish this info via kafka instead of udp2log? mediawiki is integrated now with kafka via monolog

Yes, and as far as I understand api action info has already made it into hadoop.

We (WMDE) just need to rewrite the script for our dashboard to make use of that!
Although of course it would make more sense to write a general oozie script to cover this case and also add the data for all other action api modules to graphite.

Once this is done we can likely remove this rsync.

Oh! Is all this API data from mw-log udp2log already in Kafka then? If so, we can use kafkatee now to write out files from kafka, instead of maintaining the udp2log instance on mwlog1001.

Oh! Is all this API data from mw-log udp2log already in Kafka then? If so, we can use kafkatee now to write out files from kafka, instead of maintaining the udp2log instance on mwlog1001.

Well, I mean, https://wikitech.wikimedia.org/wiki/Analytics/Data/ApiAction exists and is populated.
This contains all of the data that I / the script fixed by fixing the above needs.

Nuria edited projects, added Analytics-Kanban; removed Analytics.Mon, Mar 20, 3:59 PM

Change 344908 had a related patch set uploaded (by Elukey):
[operations/puppet@production] Move Hadoop cron MAILTO from singe person to analytics-alerts

https://gerrit.wikimedia.org/r/344908

Change 344908 merged by Elukey:
[operations/puppet@production] Move Hadoop cron MAILTO from singe person to analytics-alerts

https://gerrit.wikimedia.org/r/344908

elukey added a comment.EditedMon, Mar 27, 7:01 AM

Removed /srv/mw-log/archive/api_log_backup_elukey/* from mwlog1001 and verified that recent api logs are rsynced on stat1002 as expected. As far as I can see the next step would be to deprecate this rsync in favor of something different, so I wouldn't spend too much time in figuring out why the rsync fails with files bigger than a certain size (surely 60GB or bigger).

@Addshore is everything ok from your side? Anything left to do for this specific issue?

@Addshore is everything ok from your side? Anything left to do for this specific issue?

All looks good here! Thanks!

Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board.Mon, Mar 27, 3:17 PM