Previously an access request, now (see title)
Description
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| keep fewer dataset web server logs, add date to filename | operations/puppet | production | +25 -1 |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Duplicate | | None | T117203 [WD] External usage KPI |
| Resolved | | Addshore | T119070 Track the number of wikidata dumps that are downloaded by type |
| Resolved | | ArielGlenn | T118739 Push dumps.wm.o logs files to stat1002 |
Event Timeline
The logs are on dataset1001, but we should really be copying them off somewhere else like all apache logs. Do you have access to logs on another host?
I have access to fluorine which contains mediawiki logs and udp2log.
Also the stat / analytics cluster.
Copying to either of those locations IMO would be good.
Well, these should end up on fluorine like everything else. Let me look into how that works (or someone who knows can tell me now).
They could be added here
If we are okay to do this then I can close this ticket!
See a similar change I made recently.
I don't think any apache logs end up on fluorine
Well it looks like apache error logs end up on fluorine, but not access logs
Change 253594 had a related patch set uploaded (by ArielGlenn):
keep fewer dataset web server logs, add date to filename
need to change the file name format for these logs, otherwise it's going to be very annoying for you on the other end of that rsync. see above patchset.
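To illustrate why a date in the file name helps the receiving end of an rsync: with numeric rotation suffixes the same name points at different content after every rotation, so everything gets re-copied, whereas a dated name stays fixed once written. The following is only a minimal sketch. The `access.log-YYYYMMDD` format, the retention count, and the example date are assumptions for illustration, not what the patch set actually uses.

```python
from datetime import date, timedelta


def dated_name(base: str, day: date) -> str:
    """Build a rotated-log name that never changes once written (hypothetical format)."""
    return f"{base}-{day:%Y%m%d}"


def files_to_prune(names: list[str], keep: int) -> list[str]:
    """Return the oldest dated names beyond the `keep` most recent ones."""
    ordered = sorted(names)  # YYYYMMDD suffixes sort chronologically as plain strings
    return ordered[:-keep] if keep > 0 else ordered


# Arbitrary example date and retention count, chosen only for the demonstration.
today = date(2015, 11, 17)
names = [dated_name("access.log", today - timedelta(days=i)) for i in range(5)]
stale = files_to_prune(names, keep=3)
print(stale)
```

Because the names are stable, a later sync only has to transfer files that the destination has not seen yet.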
After looking at the other rsyncs you do (erbium, oxygen), and considering the other syncs the dataset hosts do (datasets downloadable to the public), can datasets push to stat1002 rather than the other way around? We could add that right in the dumps module; the other way, it winds up in the dataset module with the other rsyncs, which doesn't feel clean to me. Also, if you wind up doing this for logs on any other hosts, auth/rsyncd config is centralized on your end instead of spread out on the other hosts.
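The push arrangement proposed above could look roughly like the sketch below, which only builds the rsync command line a cron job on the dataset host might run. The source directory, the file name pattern, and the rsync module name `dumps-logs` are hypothetical. Only the host name stat1002 comes from this ticket.

```python
def build_push_command(src_dir: str, dest_host: str, module: str,
                       pattern: str = "access.log-*") -> list[str]:
    """Assemble an rsync invocation that pushes matching logs to an rsync daemon module."""
    return [
        "rsync",
        "-a",                          # archive mode: preserve times, permissions, etc.
        f"--include={pattern}",        # send only the dated log files
        "--exclude=*",                 # skip everything else in the directory
        src_dir.rstrip("/") + "/",     # trailing slash: copy the directory contents
        f"{dest_host}::{module}/",     # daemon syntax: host::module/path
    ]


# Hypothetical source path and module name, for illustration only.
cmd = build_push_command("/var/log/nginx", "stat1002", "dumps-logs")
print(" ".join(cmd))
```

With a push, the receiving host only needs one rsyncd module definition listing the allowed senders, which is the centralization argument made above.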
I am fine with doing it either way :)
Someone from the analytics team may also have an opinion though!
since no one from analytics noticed (silence = consent) I'll go ahead and do this the way described above.
Change 253594 merged by ArielGlenn:
keep fewer dataset web server logs, add date to filename
Really? The bot didn't add the changeset to this ticket? Well it's this: https://gerrit.wikimedia.org/r/#/c/268129/ for the class, needs some cleanup and then to be called with the right destination. Where should they land exactly?
Hey sorry, I don't think I've seen this ticket before, hence the silence! I just commented on the change about pull vs. push.
After a ridiculous amount of help from @Ottomata (thank you!) this is now live, and a manual run of the cron job from the command line worked as expected, so closing.
/a/log/webrequest/archive/dumps.wikimedia.org on stat1002 is full of them. Are you looking in the right place?