After Wikistats was migrated from stat1002 to stat1005, the data files needed repair. The folder setup for Wikistats is also different on stat1005. This seemed a good time for a more thorough overhaul of the production bash files:
- bring some uniformity
- add introductions (read.me files)
- expand in-file comments
- make logs easier to read (and store them in a uniform, non-transient way) and more complete (also tracing bash logic)
- merge some of the bash files into a more generic version
- improve backup, using hdfs
Rationale: I'm hoping and expecting that the Wikistats 1.0 bash files are now easier to get to grips with than before. Even though some of these jobs will no longer be used once the Wikistats 2.0 project completes, there will be a transition period. In the meantime someone should have a decent chance to take over if I get hit by a bus. There are also other jobs outside the scope of the Wikistats 2.0 project.
I prioritized decent logging, which helps greatly in understanding what the perl scripts do, even if it makes the bash files a bit more tedious (xtrace is switched on and off repeatedly).
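The xtrace switching mentioned above could look roughly like the sketch below. It is only an illustration under assumptions: the log path, the `log` helper, and the variable names are hypothetical, not taken from the actual Wikistats bash files.

```bash
#!/bin/bash
# Hypothetical log file; the actual Wikistats log locations are not stated here.
log_file="${LOG_FILE:-/tmp/wikistats_xtrace_example.log}"
: > "$log_file"    # start with an empty log for this run

# Hypothetical helper: write a timestamped line to the log.
log() { printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$log_file"; }

log "job start"

# Trace only the bash logic of this step. xtrace lines go to stderr,
# which is redirected to the log for the duration of the group.
{
  set -x
  month="2017-10"
  input="pagecounts-${month}.txt"
  set +x
} 2>> "$log_file"

# Stand-in for a long-running perl step whose output is also logged.
echo "processing ${input}" >> "$log_file" 2>&1

log "job done"
```

Tracing only selected blocks keeps the log readable: variable assignments and branching show up with a leading `+`, while bulky output from the perl step is not interleaved with trace lines.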
For some scripts that are not in the scope of Wikistats 2.0, but are popular and strategic, the next step would be to have them productionized, and preferably handed over to the Analytics Team. I expect productionization will again bring updates to the scripts. Yet I chose to do this first overhaul on my own. With the improved uniformity and logging, any further changes will surely be easier.
Backups: The backup situation for the stats servers should imho improve. For some large data sets there are only the public data/dump servers [1], which have no in-house backup or mirroring at all. So I kept a local copy of > 2 TB of data on stat1002. A case in point is the condensed pageview files [2], merged from 720 hourly files into one monthly file while retaining hourly granularity (sparse arrays). Future historians will love these ZeitGeist-like resources, much as they love the Library of Congress' tweet archives. Backup of these strategic data is now done to hdfs, which brings greatly improved robustness, at least at the hardware level.
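As a minimal sketch of such a backup step, assuming the standard `hdfs dfs` command line is available on the stats server: the function name `backup_to_hdfs`, the `DRY_RUN` switch, and both paths are hypothetical and only illustrate the idea, not the actual Wikistats backup script.

```bash
#!/bin/bash
# Copy one local file into an hdfs directory, creating the directory if needed.
# With DRY_RUN set to a non-empty value the commands are printed, not executed.
backup_to_hdfs() {
  local src="$1" dest_dir="$2"
  local run="${DRY_RUN:+echo}"
  $run hdfs dfs -mkdir -p "$dest_dir" &&
  $run hdfs dfs -put -f "$src" "$dest_dir/"
}

# Dry run with made-up paths: shows what would be executed.
DRY_RUN=1 backup_to_hdfs /tmp/pagecounts-2017-10.bz2 /wikistats_backup/pagecounts-ez
```

Here `-put -f` overwrites an existing copy, so the job can be rerun safely after a partial failure. A dry-run switch like this also makes it possible to log the planned copy commands before any data is moved.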
Bash files that have been migrated to stat1005 say so: '# migrated to stat1005'. All cron jobs which run every month have been migrated.
A few manually run jobs still need an update. A larger portion of the jobs not yet migrated are one-off, quick and dirty (Q&D), or obsolete. Some vetting still needs to happen.
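The '# migrated to stat1005' marker makes it easy to see which files still need vetting. A small sketch follows, assuming the bash files sit in one directory and end in `.sh` (both are assumptions; the function names are hypothetical):

```bash
#!/bin/bash
marker='# migrated to stat1005'

# List bash files in directory $1 that carry the migration marker.
list_migrated()     { grep -l -F -- "$marker" "$1"/*.sh; }

# List bash files in directory $1 that do not carry the marker yet.
list_not_migrated() { grep -L -F -- "$marker" "$1"/*.sh; }
```

Calling `list_not_migrated` with the directory that holds the bash files prints the candidates for the remaining vetting, one path per line.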
[1] https://dumps.wikimedia.org/
[2] https://dumps.wikimedia.org/other/pagecounts-ez/
Major production jobs: cont'd