api on stat1002 no longer being populated
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Addshore
	Mar 20 2017, 9:18 AM

Description

https://github.com/wikimedia/puppet/blob/production/modules/statistics/manifests/rsync/mediawiki.pp#L29

I'm guessing this has something to do with the changes from fluorine to mwlog1001.

The last completed sync appears to be on the 8th March.

New files appearing as ".api.log-20170224.gz.B8qOcC" which I am guessing is an incomplete sync.

Details

	Subject	Repo	Branch	Lines +/-
	Move Hadoop cron MAILTO from singe person to analytics-alerts	operations/puppet	production	+1 -1

Customize query in gerrit

Related Objects

Mentioned In: T162023: Migrate wmde wikidata apiLogScanner.php script to use data from hadoop
T160825: Grafana: "wikidata-api" doesn't update anymore
T64873: [Story] Statistics for Wikidata API usage
Mentioned Here: T162023: Migrate wmde wikidata apiLogScanner.php script to use data from hadoop

Event Timeline

Addshore created this task.Mar 20 2017, 9:18 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 20 2017, 9:18 AM

Addshore triaged this task as High priority.Mar 20 2017, 9:18 AM

Addshore mentioned this in T64873: [Story] Statistics for Wikidata API usage.

Addshore moved this task from Incoming to Watching / External on the WMDE-Analytics-Engineering board.

Addshore moved this task from Unsorted 💣 to Watching 👀 on the User-Addshore board.

Seems a perfect match with merge time of https://gerrit.wikimedia.org/r/#/c/341570

I checked on mwlog1001 and stat1001, the network connection seems fine and not blocked. For some reason files from api.log-20170224.gz have different modification time, so rsync includes them on the list of files to send to stat1002. I am running the cron manually now, it will take a bit but it should fix the problem.

Follow up: @Ottomata - should we change the MAILTO option in the stats crontab so everybody will get the notifications? Maybe analytics-alerts@?

This is probably the current error:

elukey@stat1002:/a/mw-log/archive/api$ sudo -u stats /usr/bin/rsync -vrt --perms --chmod=
g-w mwlog1001.eqiad.wmnet::udp2log/archive/api.log-*.gz /a/mw-log/archive/api/
receiving incremental file list
api.log-20170224.gz
rsync: [generator] write error: Connection timed out (110)
rsync error: error in socket IO (code 10) at io.c(837) [generator=3.1.0]
rsync error: received SIGUSR1 (code 19) at main.c(1434) [receiver=3.1.0]

Restarted with one day of contimeout and --progress:

elukey@stat1002:/a/mw-log/archive/api$ sudo -u stats /usr/bin/rsync --contimeout=86400 -v
rt --perms --progress --chmod=g-w mwlog1001.eqiad.wmnet::udp2log/archive/api.log-*.gz /a/
mw-log/archive/api/
receiving incremental file list
api.log-20170224.gz
  4,172,316,672   6%   14.57MB/s    1:04:34

should we change the MAILTO option in the stats crontab

Oo, does it email me? For sure let's change it.

The above job failed, so I tried to rsync only the March files and I can see success:

elukey@stat1002:~$ sudo -u stats /usr/bin/rsync --contimeout=86400 -vrt --perms --progress --chmod=g-w mwlog1001.eqiad.wmnet::udp2log/archive/api.log-201703*.gz /a/mw-log/archive/ap
i/
receiving incremental file list
api.log-20170301.gz
 34,405,756,549 100%   13.75MB/s    0:39:46 (xfr#1, to-chk=18/19)
api.log-20170302.gz
    141,787,136   0%   14.52MB/s    0:37:15

I believe that api.log-20170224.gz's size (~60GB) hits a timeout limit in the mwlog1001's rsync server and blocks the overall rsync.

Not sure why api.log-20170224.gz has a different modification time than the one on stat1002, but it might be due to the recent fluorine->mwlog1001 migration. The remaining thing, in my opinion would be to manually ensure that all the files are in sync between mwlog and stat1002, and then find why/where the timeout limit is.

To quicky workaround this problem we could rename api.log-20170224.gz on mwlog1001 to avoid being part of the stat1002 rsync (the file is close to expiration time, there is cron job that sweeps 30 days old files and removes them).

Mentioned in SAL (#wikimedia-operations) [2017-03-20T14:28:01Z] <elukey> (Correct one) Temporary hack for T160888 - moved /srv/mw-log/archive/api.log-20170224.gz to /srv/mw-log/archive/api_log_backup_elukey/ to avoid rsync timeouts to stat1002 (the file is big and close to being deleted for retention)

So after my tests we should see files syncing correctly during the next couple of days, and this will unblock the log rsyncs.

Things to do:

Why /srv/mw-log/archive/api.log-20170224.gz size (~60GB) on mwlog1001 causes connect timeouts while rsynced to stat1002?
Remove /srv/mw-log/archive/api_log_backup_elukey/*

We have to fix this at this time but ... have we though to publish this info via kafka instead of udp2log? mediawiki is integrated now with kafka via monolog

In T160888#3114531, @Nuria wrote:

We have to fix this at this time but ... have we though to publish this info via kafka instead of udp2log? mediawiki is integrated now with kafka via monolog

Yes, and as far as I understand api action info has already made it into hadoop.

We (WMDE) just need to rewrite the script for our dashboard to make use of that!
Although of course it would make more sense to write a general oozie script to cover this case and also add the data for all other action api modules to graphite.

Once this is done we can likely remove this rsync.

Oh! Is all this API data from mw-log udp2log already in Kafka then? If so, we can use kafkatee now to write out files from kafka, instead of maintaining the udp2log instance on mwlog1001.

In T160888#3114712, @Ottomata wrote:

Oh! Is all this API data from mw-log udp2log already in Kafka then? If so, we can use kafkatee now to write out files from kafka, instead of maintaining the udp2log instance on mwlog1001.

Well, I mean, https://wikitech.wikimedia.org/wiki/Analytics/Data/ApiAction exists and is populated.
This contains all of the data that I / the script fixed by fixing the above needs.

• Nuria edited projects, added Analytics-Kanban; removed Analytics.Mar 20 2017, 3:59 PM

Addshore mentioned this in T160825: Grafana: "wikidata-api" doesn't update anymore.Mar 20 2017, 4:02 PM

Change 344908 had a related patch set uploaded (by Elukey):
[operations/puppet@production] Move Hadoop cron MAILTO from singe person to analytics-alerts

https://gerrit.wikimedia.org/r/344908

gerritbot added a project: Patch-For-Review.Mar 27 2017, 6:55 AM

Change 344908 merged by Elukey:
[operations/puppet@production] Move Hadoop cron MAILTO from singe person to analytics-alerts

https://gerrit.wikimedia.org/r/344908

Removed /srv/mw-log/archive/api_log_backup_elukey/* from mwlog1001 and verified that recent api logs are rsynced on stat1002 as expected. As far as I can see the next step would be to deprecate this rsync in favor of something different, so I wouldn't spend too much time in figuring out why the rsync fails with files bigger than a certain size (surely 60GB or bigger).

@Addshore is everything ok from your side? Anything left to do for this specific issue?

In T160888#3132915, @elukey wrote:

@Addshore is everything ok from your side? Anything left to do for this specific issue?

All looks good here! Thanks!

• Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board.Mar 27 2017, 3:17 PM

@Addshore: I am going to close this task but we might want to open another one to track down how to move away api-logs from mwlog1001->stat1002. What do you think?

elukey claimed this task.Mar 29 2017, 12:40 PM

elukey moved this task from In Progress to Done on the Analytics-Kanban board.

• Nuria closed this task as Resolved.Mar 29 2017, 7:38 PM

In T160888#3140077, @elukey wrote:

@Addshore: I am going to close this task but we might want to open another one to track down how to move away api-logs from mwlog1001->stat1002. What do you think?

I have created T162023 :)

we might want to open another one to track down how to move away api-logs from mwlog1001->stat1002

Hm, doing your analysis in Hadoop is not the only way. I think other folks need API and error logs as regular log files. It'd probably behoove us to get these logs from MW -> Kafka -> Hadoop. If someone also needs log files, we can use kafkatee to write them to disk on stat1002 (or wherever) from Kafka directly.

Addshore moved this task from Watching 👀 to Closing ✔️ on the User-Addshore board.Apr 11 2017, 10:23 PM

Addshore moved this task from Watching / External to Done on the WMDE-Analytics-Engineering board.May 8 2018, 1:26 PM

/a/mw-log/archive/api on stat1002 no longer being populatedClosed, ResolvedPublicActions

Description

Details

Related Objects

Event Timeline

/a/mw-log/archive/api on stat1002 no longer being populated
Closed, ResolvedPublic
Actions