
Renovation of Wikistats production jobs
Closed, ResolvedPublic

Description

After the migration of Wikistats from stat1002 to stat1005, work was needed to repair data files. Also the folder setup for Wikistats is different on stat1005. This seemed a good time to do a more thorough overhaul of the production bash files.

  • bring some uniformity
  • add introductions (read.me files)
  • expand in-file comments
  • make logs easier to read (and store them in a uniform, non-transient way) and more complete (also tracing bash logic)
  • merge some of the bash files into a more generic version
  • improve backup, using hdfs

Rationale: I'm hoping and expecting that the Wikistats 1.0 bash files are now easier to get to grips with than before. Even though some of these jobs will no longer be used after the Wikistats 2.0 project completes, there will be a transition period. In the meantime someone should have a decent chance to take over when I get hit by a bus. Also, there are other jobs outside the scope of the Wikistats 2.0 project.

I prioritized decent logging, which helps greatly to understand what the perl scripts do, even if it makes the bash files a bit more tedious (switching xtrace on and off repeatedly).
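
For illustration, a minimal sketch of that pattern (log path, script name and perl invocation are made up for this example, not the actual production job):

```
#!/bin/bash
# Hypothetical example of the logging pattern described above:
# keep a persistent log, and switch xtrace on only around the interesting calls.

log=/home/ezachte/wikistats_log/count_demo.log   # assumed log location
exec >> "$log" 2>&1                              # append stdout+stderr to the log

echo "=== $(basename "$0") started $(date -u '+%Y-%m-%d %H:%M:%S') ==="

set -x                                     # xtrace on: the exact perl command ends up in the log
perl WikiCountsDemo.pl --project wp --language en   # hypothetical script and arguments
set +x                                     # xtrace off: keep the surrounding bash quiet

echo "=== finished $(date -u '+%Y-%m-%d %H:%M:%S') ==="
```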

For some scripts that are not in the scope of Wikistats 2.0, but are popular and strategic, the next step would be to have them productionized, and preferably handed over to the Analytics Team. I expect productionization will again bring updates to the scripts. Yet I chose to do this first overhaul on my own. With the improved uniformity and logging, any further changes will surely be easier.

Backups: The backup situation for the stats servers should, imho, improve. For some large data sets there are only the public data/dump servers [1], which have no in-house backup or mirroring at all. So I kept a local copy of > 2 TB of data on stat1002. A case in point are the condensed pageview files [2], merged from 720 hourly files into one monthly file while retaining granularity (sparse arrays). Future historians will love these Zeitgeist-like resources, much as they love the Library of Congress' tweet archives. Backup of these strategic data sets is now done to hdfs, which brings greatly improved robustness, at least at the hardware level.
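
As a sketch of what such a backup step can look like (the local and hdfs paths below are assumptions for illustration, not the actual job):

```
#!/bin/bash
# Hypothetical example: back up one month of condensed pageview files to hdfs.
src=/home/ezachte/wikistats_data/dammit.lt/2017-09      # assumed local directory
dst=/user/ezachte/backup/pagecounts-ez/2017-09          # assumed hdfs directory

hdfs dfs -mkdir -p "$dst"
hdfs dfs -put -f "$src"/*.bz2 "$dst"/
hdfs dfs -du -s -h "$dst"     # sanity check: size of the copied data
```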

Bash files that have been migrated to stat1005 say so: '# migrated to stat1005'. All cron jobs which run every month have been migrated.
A few manually run jobs still need updating. A larger portion of the non-migrated jobs are one-off, Q&D (quick and dirty), or obsolete. Some vetting still needs to happen.
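
The marker also makes the remaining vetting easy to script; a minimal sketch, assuming the bash files live under /home/ezachte:

```
# Files already migrated (carry the marker):
grep -rl --include='*.sh' '# migrated to stat1005' /home/ezachte

# Files still to be vetted (no marker yet):
grep -rL --include='*.sh' '# migrated to stat1005' /home/ezachte
```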

[1] https://dumps.wikimedia.org/
[2] https://dumps.wikimedia.org/other/pagecounts-ez/

Major production jobs: cont'd

Event Timeline

Restricted Application added a subscriber: Aklapper.

Major production jobs + visualisations developed by Erik Zachte which are still in use

Note: the Wikistats 1.0 scripts are a subset

xml dump based stats on wiki content, edits and editors

10,000s of html files for ~800 wikis in ~25 languages

dir /home/ezachte/dumps/bash

  • count_report_publish.sh (runs in parallel each month: once for the Wikipedia project, once for other projects); see 'Statistics per Wikimedia project' on https://stats.wikimedia.org/; calls count.sh and report.sh
  • count.sh (collect counts for all wikis in one project)
  • report.sh (generate reports for all wikis in one project)
  • progress_wikistats.sh (follow progress of count_report_publish.sh), e.g. in https://stats.wikimedia.org/WikiCountsJobProgressCurrent.html
  • count_some_wikis.sh (ad hoc rerun)
  • report_all.sh (generate all reports for final publishing, after light manual vetting of English 'draft' reports), calls report.sh
  • sort_dblists.sh (sort wikis for one project, by expected data collection time)
  • collect_countable_namespaces.sh (daily query API for each wiki for which namespaces are deemed 'countable')

compact page views per wiki article (~50M!) from 720 hourly files per month, into 30 daily files, then into 1 monthly file, for ease of download, batch processing and posterity

immense shrinkage with preservation of hourly granularity (sparse arrays)
see https://dumps.wikimedia.org/other/pagecounts-ez/merged/
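
As a toy illustration of the sparse-array idea (this is not the actual pagecounts-ez format, just the principle of storing only the non-zero hours):

```
#!/bin/bash
# Toy example: turn per-hour counts for one article into a single compact line
# that lists only the non-zero hours, so quiet hours cost no space at all.
awk '
  $2 > 0 { sep = (line == "" ? "" : ","); line = line sep "H" $1 ":" $2; total += $2 }
  END    { printf "total=%d %s\n", total, line }
' <<'EOF'
0 12
1 0
2 3
23 7
EOF
# prints: total=22 H0:12,H2:3,H23:7
```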

dir /home/ezachte/wikistats/dammit.lt/bash

  • dammit_compact_daily.sh
  • dammit_compact_monthly.sh

monthly reports on page views per project, per wiki, with historic trends, MoM (month over month), etc. (updated daily)

see https://stats.wikimedia.org/EN/TablesPageViewsSitemap.htm

dir /home/ezachte/dumps/dammit.lt/bash/

  • dammit_sync.sh
  • dammit_projectviews_monthly.sh

monthly reports on page views per country or per language

see https://stats.wikimedia.org/wikimedia/squids/SquidReportsCountriesLanguagesVisitsEdits.htm

dir /home/ezachte/squids/bash

  • SquidReportArchive.sh (run manually)

top1000 lists for daily mediacount files (one for each of ~20 columns)

see .zip files in https://dumps.wikimedia.org/other/mediacounts/daily/
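
A minimal sketch of how such a top-1000 list per column can be produced with standard tools (file name, separator and column number are assumptions, not the actual mediacounts layout):

```
# Hypothetical example: top 1000 rows of a tab-separated daily file,
# ranked by the (assumed) numeric count in column 3.
day=mediacounts.2017-09-01.tsv     # assumed input file
sort -t$'\t' -k3,3nr "$day" | head -n 1000 > top1000_col3.tsv
```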

dir /home/ezachte/mail-lists/bash

  • report_mail_lists_counts.sh (runs each day; that frequency is for the benefit of list moderators)

viz: 'Animated growth figures per Wikimedia project'
(yet to do)

still used for keynotes, see https://stats.wikimedia.org/wikimedia/animations/growth/index.html

dir /home/ezachte/dumps/bash

  • [...] xml dump based reporting scripts run in special mode to update counts for this viz.

dir /home/ezachte/animations/growth

  • javascript/html/images

viz. 'Wikipedia Views Visualized'
(to do, partially)

see viz. at https://stats.wikimedia.org/WiViVi
update scripts to prep monthly page view data for javascript viz.

dir /home/ezachte/wikistats/traffic/bash/

  • collect_country_info_from_wikipedia.sh
  • datamaps_views.sh (done 9/10/2017)
  • traffic_geo.sh

viz. 'Wikipedia edits for a normal day in May 2011'
(maybe update some day?)

animation works, but as the title says, the 'normal day' is 6 years ago, see https://stats.wikimedia.org/wikimedia/animations/requests/

dir ...

dir /home/ezachte/animations/requests

  • javascript/html/images

Wikistats portal
(PM)

see https://stats.wikimedia.org/

resides/runs on thorium, so it is not affected by the stat1002 -> stat1005 migration

DarTar triaged this task as Medium priority. Sep 22 2017, 3:20 PM
DarTar added a project: Research.
DarTar moved this task from Backlog to In Progress on the Research board.

script datamaps_views.sh, for updating WiViVi data, has been adapted to stat1005
viz. now shows data for Sep 2017
https://stats.wikimedia.org/wikimedia/animations/wivivi/wivivi.html

Thanks for work on this, I may try to run the scripts again on translatewiki.net dumps and tell you whether I found it easier. :)

script stat1005:/home/ezachte/wikistats/dumps/bash/extract_dump.sh has been adapted to stat1005

it invokes the perl file ../perl/WikiDumpFilterArticles.pl, which extracts a specified list of articles from a huge xml dump, headers and all, so the result can be fed into the Wikistats scripts and processed in minutes instead of hours, for debugging and for explaining anomalies in known articles
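
The perl script is the authoritative implementation; purely as an illustration of the idea (pass the dump header through, keep only the <page> blocks whose title is on a list), here is a rough awk sketch with assumed file names:

```
#!/bin/bash
# Rough sketch only; WikiDumpFilterArticles.pl is the real tool.
titles=titles.txt                          # assumed: one article title per line
dump=enwiki-latest-pages-articles.xml      # assumed: uncompressed xml dump

awk -v titles="$titles" '
  BEGIN      { while ((getline t < titles) > 0) keep[t] = 1 }
  /<page>/   { inpage = 1; buf = ""; hit = 0 }
  inpage     { buf = buf $0 "\n" }
  inpage && /<title>/ {
               t = $0
               sub(/^[[:space:]]*<title>/, "", t); sub(/<\/title>.*$/, "", t)
               if (t in keep) hit = 1
             }
  /<\/page>/ { if (hit) printf "%s", buf; inpage = 0; next }
  !inpage    { print }    # siteinfo header and closing </mediawiki> pass through
' "$dump"
```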

script stat1005:/home/ezachte/wikistats/dumps/bash/collect_edits.sh has been adapted to stat1005

it collects all edits from either one xml stub dump or from all stub dumps for all wikis in all projects, and writes one line per revision,
containing: wiki, e (= edit), timestamp, the same timestamp in seconds, namespace (number), namespace (prefix), page title, user name;
output goes to stat1005:/home/ezachte/wikistats_data/dumps/csv/csv_[project_code]/EditsTimestampsTitles[language code].csv

(caveat: 58GB for wp:en)
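
A hypothetical usage sketch for such a csv (the exact file name and the comma separator are assumptions; page titles containing the separator would need care, but they only affect fields after the namespace):

```
# Hypothetical example: edits per namespace, from one of the csv files described above.
csv=/home/ezachte/wikistats_data/dumps/csv/csv_wp/EditsTimestampsTitlesEN.csv   # assumed name
awk -F',' '{ count[$5 " " $6]++ } END { for (ns in count) print count[ns], ns }' "$csv" |
  sort -rn | head
```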

@Erik_Zachte I'm going to mark this one as resolved. If you don't agree, please re-open.