
/a/mw-log/archive/api on stat1002 no longer being populated
Closed, ResolvedPublic

Description

https://github.com/wikimedia/puppet/blob/production/modules/statistics/manifests/rsync/mediawiki.pp#L29

I'm guessing this has something to do with the recent migration from fluorine to mwlog1001.

The last completed sync appears to have been on 8 March.

New files are appearing as ".api.log-20170224.gz.B8qOcC", which I am guessing are incomplete syncs.

Event Timeline

I checked on mwlog1001 and stat1002; the network connection seems fine and not blocked. For some reason the files from api.log-20170224.gz onwards have different modification times, so rsync includes them in the list of files to send to stat1002. I am running the cron manually now; it will take a while, but it should fix the problem.

Follow up: @Ottomata - should we change the MAILTO option in the stats crontab so everybody will get the notifications? Maybe analytics-alerts@?
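For reference, a MAILTO line at the top of a crontab redirects cron's output/failure mail for every job below it. The list address here is an assumption based on the suggestion above, and the schedule is illustrative, not the actual puppet-managed entry:

```
MAILTO=analytics-alerts@wikimedia.org
# m h dom mon dow  command
0 */6 * * * /usr/bin/rsync -rt --perms --chmod=g-w mwlog1001.eqiad.wmnet::udp2log/archive/api.log-*.gz /a/mw-log/archive/api/
```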

This is probably the current error:

elukey@stat1002:/a/mw-log/archive/api$ sudo -u stats /usr/bin/rsync -vrt --perms --chmod=g-w mwlog1001.eqiad.wmnet::udp2log/archive/api.log-*.gz /a/mw-log/archive/api/
receiving incremental file list
api.log-20170224.gz
rsync: [generator] write error: Connection timed out (110)
rsync error: error in socket IO (code 10) at io.c(837) [generator=3.1.0]
rsync error: received SIGUSR1 (code 19) at main.c(1434) [receiver=3.1.0]

Restarted with a one-day --contimeout and --progress:

elukey@stat1002:/a/mw-log/archive/api$ sudo -u stats /usr/bin/rsync --contimeout=86400 -vrt --perms --progress --chmod=g-w mwlog1001.eqiad.wmnet::udp2log/archive/api.log-*.gz /a/mw-log/archive/api/
receiving incremental file list
api.log-20170224.gz
  4,172,316,672   6%   14.57MB/s    1:04:34

should we change the MAILTO option in the stats crontab

Oo, does it email me? For sure let's change it.

The above job failed, so I tried to rsync only the March files, and that succeeds:

elukey@stat1002:~$ sudo -u stats /usr/bin/rsync --contimeout=86400 -vrt --perms --progress --chmod=g-w mwlog1001.eqiad.wmnet::udp2log/archive/api.log-201703*.gz /a/mw-log/archive/api/
receiving incremental file list
api.log-20170301.gz
 34,405,756,549 100%   13.75MB/s    0:39:46 (xfr#1, to-chk=18/19)
api.log-20170302.gz
    141,787,136   0%   14.52MB/s    0:37:15

I believe that api.log-20170224.gz's size (~60GB) hits a timeout limit in mwlog1001's rsync server and blocks the overall rsync.

Not sure why api.log-20170224.gz has a different modification time than the copy on stat1002, but it might be due to the recent fluorine -> mwlog1001 migration. The remaining work, in my opinion, is to manually ensure that all the files are in sync between mwlog1001 and stat1002, and then find out why and where the timeout limit kicks in.

To quickly work around this problem we could rename api.log-20170224.gz on mwlog1001 so that it is no longer part of the stat1002 rsync (the file is close to expiration anyway; a cron job removes files older than 30 days).
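An alternative to renaming would be an `--exclude` on the transfer itself; this is not what was done here (the file was moved instead), just a sketch of the option with temporary local paths:

```shell
# --exclude drops the oversized file from the transfer list without
# touching it on the source side.
mkdir -p /tmp/ex_src /tmp/ex_dst
touch /tmp/ex_src/api.log-20170224.gz /tmp/ex_src/api.log-20170301.gz
rsync -rt --exclude='api.log-20170224.gz' /tmp/ex_src/ /tmp/ex_dst/
ls /tmp/ex_dst/
```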

Mentioned in SAL (#wikimedia-operations) [2017-03-20T14:28:01Z] <elukey> (Correct one) Temporary hack for T160888 - moved /srv/mw-log/archive/api.log-20170224.gz to /srv/mw-log/archive/api_log_backup_elukey/ to avoid rsync timeouts to stat1002 (the file is big and close to being deleted for retention)

So after my tests we should see files syncing correctly during the next couple of days, and this will unblock the log rsyncs.

Things to do:

  1. Find out why /srv/mw-log/archive/api.log-20170224.gz's size (~60GB) on mwlog1001 causes connection timeouts when it is rsynced to stat1002.
  2. Remove /srv/mw-log/archive/api_log_backup_elukey/*

We have to fix this for now, but... have we thought about publishing this info via Kafka instead of udp2log? MediaWiki is now integrated with Kafka via Monolog.


Yes, and as far as I understand, the api action info has already made it into Hadoop.

We (WMDE) just need to rewrite the script for our dashboard to make use of that!
Although of course it would make more sense to write a general Oozie script to cover this case and also add the data for all other action API modules to Graphite.

Once this is done we can likely remove this rsync.

Oh! Is all this API data from mw-log udp2log already in Kafka then? If so, we can use kafkatee now to write out files from kafka, instead of maintaining the udp2log instance on mwlog1001.


Well, I mean, https://wikitech.wikimedia.org/wiki/Analytics/Data/ApiAction exists and is populated.
This contains all of the data that I (and the script fixed by the work above) need.

Change 344908 had a related patch set uploaded (by Elukey):
[operations/puppet@production] Move Hadoop cron MAILTO from single person to analytics-alerts

https://gerrit.wikimedia.org/r/344908

Change 344908 merged by Elukey:
[operations/puppet@production] Move Hadoop cron MAILTO from single person to analytics-alerts

https://gerrit.wikimedia.org/r/344908

Removed /srv/mw-log/archive/api_log_backup_elukey/* from mwlog1001 and verified that recent api logs are rsynced to stat1002 as expected. As far as I can see, the next step is to deprecate this rsync in favor of something different, so I wouldn't spend too much time figuring out why the rsync fails with files bigger than a certain size (certainly 60GB or bigger).
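One way to double-check that two trees are in sync is an rsync dry run (`-n`): it lists what would still be transferred without copying anything, so empty output means nothing is pending. Local sketch with throwaway paths:

```shell
# First sync the trees, then dry-run: --out-format='%n' prints one name
# per file that would be transferred, so no output means the trees match.
mkdir -p /tmp/chk_src /tmp/chk_dst
echo data > /tmp/chk_src/api.log-20170301.gz
rsync -rt /tmp/chk_src/ /tmp/chk_dst/
rsync -nrt --out-format='%n' /tmp/chk_src/ /tmp/chk_dst/
```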

@Addshore is everything ok from your side? Anything left to do for this specific issue?


All looks good here! Thanks!

@Addshore: I am going to close this task, but we might want to open another one to figure out how to move away from the mwlog1001->stat1002 api-logs rsync. What do you think?


I have created T162023 :)

we might want to open another one to track down how to move away api-logs from mwlog1001->stat1002

Hm, doing your analysis in Hadoop is not the only way. I think other folks need API and error logs as regular log files. It'd probably behoove us to get these logs from MW -> Kafka -> Hadoop. If someone also needs log files, we can use kafkatee to write them to disk on stat1002 (or wherever) from Kafka directly.