
Create a reports directory under analytics.wikimedia.org
Closed, ResolvedPublic5 Estimated Story Points

Description

Recently, Product-Analytics has been using the ad-hoc datasets function to publish Jupyter notebooks to analytics.wikimedia.org/datasets.

However, that directory is meant for raw data files, and using it for reports means a worse experience for people accessing it because (1) people looking for raw data have to wade through notebooks and (2) people browsing for reports have to wade through datasets.

So we should have a separate directory meant specifically for reports. Setting up parallel /srv/published-reports/ directories that sync to analytics.wikimedia.org/reports seems like a good approach, although there are probably others.

Additionally, it would be nice if there were a way for us to move or delete things that have already been synced (as far as I know, that isn't possible with the datasets folder); for example, we may want to reorganize reports for better discoverability, or get rid of old reports by deleting them or moving them to an archive folder.
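
For illustration, such a parallel setup would amount to a second sync job along these lines (a sketch only; the destination host and paths here are assumptions, and the real job would be puppet-managed like the datasets one):

# Hypothetical second sync job, mirroring the existing published-datasets flow.
# Host name and destination path are assumptions for illustration.
rsync -a --delete /srv/published-reports/ \
    web-host.eqiad.wmnet:/srv/analytics.wikimedia.org/reports/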

Event Timeline

We discussed this in a meeting yesterday, and the consensus was that setting up a second system of syncing folders would be too much work.

However, @mforns suggested that we rename the current sync destination to something like analytics.wikimedia.org/public/ and place our reports in a /reports/ subdirectory. The current analytics.wikimedia.org/datasets/ would then redirect to analytics.wikimedia.org/public/datasets/. This seems like an elegant solution to me.

It turns out it is possible to remove a file from the public folder simply by deleting it from the source folder. However, this does require figuring out which of the 5 or so source folders it's in. That's annoying, but we can live with it.
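
One way to track a file down: on the destination host, the rsynced copies are laid out per source host, so a search there shows where a file came from (a sketch; the file name is hypothetical):

# Each source host syncs into its own subdirectory on the destination,
# so the first path component of a match names the source host.
find /srv/published-datasets-rsynced -name 'my-report.html'
# e.g. /srv/published-datasets-rsynced/stat1007/reports/my-report.html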

Nuria added a project: Analytics-Kanban.
Nuria moved this task from Incoming to Operational Excellence on the Analytics board.

Change 547041 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] statistics - rename published-datasets to just published

https://gerrit.wikimedia.org/r/547041

> It turns out it is possible to remove a file from the public folder simply by deleting it from the source folder. However, this does require figuring out which of the 5 or so source folders it's in. That's annoying, but we can live with it.

This is the price we pay for multi-source syncing :p


Ok! This change is a bit complex, but I think it's the right one! Instead of /public, I'm going to go with /published (everything on analytics.wikimedia.org is 'public').

After we are done:

Source nodes:

  • /srv/published will be the new source directory on e.g. stat1007 and notebook1003
  • The current /srv/published-datasets will be moved to /srv/published/datasets
  • A symlink from /srv/published-datasets will point at /srv/published/datasets (so any existing jobs/scripts keep working)

Dest node & analytics.wikimedia.org:

  • analytics.wikimedia.org/published will be created
  • analytics.wikimedia.org/datasets will be moved to analytics.wikimedia.org/published/datasets
  • analytics.wikimedia.org/datasets will redirect to analytics.wikimedia.org/published/datasets
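
The redirect itself could be a one-line mod_alias rule in the analytics.wikimedia.org Apache config, roughly (a sketch, not the actual puppet change):

# Permanently redirect old /datasets URLs to /published/datasets,
# preserving the rest of the path.
RedirectMatch permanent ^/datasets/(.*)$ /published/datasets/$1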

If all goes well, everything should work with 100% backwards compatibility. This will allow SWAP users to create a reports directory (or whatever else they want) under /srv/published on source nodes.
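
For example, publishing a report would then look something like this on a source node (file name hypothetical; permissions assumed to follow the parent directory's scheme):

# On e.g. stat1007: create the new directory under the synced root...
mkdir -p /srv/published/reports
chmod 775 /srv/published/reports   # keep it group-writable
# ...and drop a report in; the next sync publishes it at
# analytics.wikimedia.org/published/reports/
cp key-metrics.html /srv/published/reports/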


Procedure

Stop puppet on all relevant nodes

sudo cumin 'R:Class = statistics::published_datasets or R:Class = statistics::rsync::published_datasets' 'puppet agent --disable "otto - T235494"'

Rename source node directories and symlink for published-datasets backwards compatibility. Also remove old (now renamed) scripts and crons.

# For each source node (with R:Class = statistics::rsync::published_datasets)
  sudo mkdir -p /srv/published
  sudo chown root:wikidev /srv/published
  sudo chmod 775 /srv/published
  sudo mv -v /srv/published-datasets /srv/published/datasets
  sudo ln -sv /srv/published/datasets /srv/published-datasets

  # Remove old published-datasets related crons and scripts from source hosts.
  sudo rm -v /srv/published/datasets/README /usr/local/bin/published-datasets-sync
  sudo crontab -e # Remove Puppet Name: rsync-published-datasets
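
A quick per-node sanity check afterwards (optional):

# The compatibility symlink should resolve to the new location...
readlink -f /srv/published-datasets    # expect: /srv/published/datasets
# ...and the old cron entry should be gone.
sudo crontab -l | grep published-datasets || echo 'old cron removed'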

Prepare thorium to receive rsyncs from source nodes in /srv/published-rsynced

sudo mkdir /srv/published-rsynced
sudo chown root:www-data /srv/published-rsynced
sudo chmod 775 /srv/published-rsynced

# Move the source node data into a datasets folder (since we moved it into /srv/published/datasets on the source nodes already)
for source_host in $(ls /srv/published-datasets-rsynced); do
  sudo mkdir -p /srv/published-rsynced/$source_host
  sudo chown stats:wikidev /srv/published-rsynced/$source_host
  sudo chmod 775 /srv/published-rsynced/$source_host
  sudo mv -v /srv/published-datasets-rsynced/$source_host  /srv/published-rsynced/$source_host/datasets
done

Rename analytics.wikimedia.org /datasets to /published.

sudo mkdir /srv/analytics.wikimedia.org/published
sudo mv -v /srv/analytics.wikimedia.org/datasets /srv/analytics.wikimedia.org/published/datasets

Remove old published-datasets related crons and directories from thorium.

sudo rmdir /srv/published-datasets-rsynced
sudo crontab -e # Remove Puppet Name: hardsync-published-datasets

Merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/547041.
Re-enable and run puppet:

sudo cumin 'R:Class = statistics::published_datasets or R:Class = statistics::rsync::published_datasets' 'puppet agent --enable && run-puppet-agent'

(^ The Cumin command might need to be changed to statistics::published and statistics::rsync::published.)

Finally: update any published-datasets documentation on Wikitech.

@elukey to review procedure please! :)

@Ottomata thank you! This looks like a great plan 😁

Change 547041 merged by Ottomata:
[operations/puppet@production] statistics - rename published-datasets to just published

https://gerrit.wikimedia.org/r/547041

Woo hoo, done!

Everything is now /srv/published, with the previous /srv/published-datasets now at /srv/published/datasets.

https://analytics.wikimedia.org/published/

@Neil_P._Quinn_WMF you can create a notebook dir inside of /srv/published on any notebook box. As for naming, perhaps published/notebooks is more clear than published/reports? I guess up to yall though!

Ottomata set the point value for this task to 5.

Ahh, duh. Thank you!

> @Neil_P._Quinn_WMF you can create a notebook dir inside of /srv/published on any notebook box. As for naming, perhaps published/notebooks is more clear than published/reports? I guess up to yall though!

Good point, especially since reportupdater output goes somewhere else. I just put the first notebook at /published/notebooks/WMF-Language/key-metrics.html!

One question (and I hope I'm not missing something dumb this time 😛): I get directory pages for /published/ and /published/notebooks/WMF-Language/, but not /published/notebooks/. Any idea what's going on? Just some weird caching thing?

Must be because I get a dir listing! :)

> Must be because I get a dir listing! :)

Yeah, now I do too. I'm just going to ignore any more problems I think I find today 😅

@Ottomata I think Dashiki dashboards did not like the rename:
https://pingback.wmflabs.org

Change 547617 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] analytics.wikimedia.org - set CORS header for redirects too

https://gerrit.wikimedia.org/r/547617

Change 547617 merged by Ottomata:
[operations/puppet@production] analytics.wikimedia.org - set CORS header for redirects too

https://gerrit.wikimedia.org/r/547617

So I'm pretty sure that ^ should fix it, but the responses will have to expire out of the Varnish cache before they start returning the right headers.
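
Once the cached responses expire, the fix can be verified with curl against a redirected URL (path hypothetical):

# The redirect to /published/datasets/... should now carry the CORS header.
curl -sI https://analytics.wikimedia.org/datasets/some/report.tsv \
    | grep -iE 'location|access-control-allow-origin'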

Nuria added subscribers: CCicalese_WMF, Nuria.

Moving this to "in progress" until we can verify the fix. cc @CCicalese_WMF (re: dashboards broken due to the reshuffling of files)

Thank you, they are indeed working again. It seems that https://gerrit.wikimedia.org/r/c/analytics/reportupdater-queries/+/545917 is not yet functional there. Is there anything that needs to happen beyond merging that patch to enable it for the dashboard?

> Is there anything that needs to happen beyond merging that patch to enable it for the dashboard?

Yes, the reports need to run; they haven't yet.

Cool, thanks. I couldn't remember how often they run. The patch was merged yesterday, but I guess since the dashboard was not functional, the reports didn't update. Thanks!

@CCicalese_WMF Mmm, looking at the report times, I think the last time these were updated was October 6th. Pinging @mforns to make sure we have set up crons for the new jobs pulling data from Hive.

nuria@stat1006:/srv/published-datasets/periodic/reports/metrics/pingback
total 160
159973514 drwxr-xr-x 18 stats stats 4096 Nov 27 2018 ..
218300421 -rw-r--r-- 1 stats stats 5498 Oct 6 00:03 php.tsv
218300424 -rw-r--r-- 1 stats stats 6792 Oct 6 00:07 memoryLimit.tsv
218300420 -rw-r--r-- 1 stats stats 5232 Oct 6 00:10 serverSoftware.tsv
218300422 -rw-r--r-- 1 stats stats 4183 Oct 6 00:14 database.tsv
218300423 -rw-r--r-- 1 stats stats 5854 Oct 6 00:17 os.tsv
218300427 -rw-r--r-- 1 stats stats 5281 Oct 6 00:20 version_simple.tsv
218300426 -rw-r--r-- 1 stats stats 82660 Oct 6 00:24 version.tsv
218300419 -rw-r--r-- 1 stats stats 6300 Oct 6 00:27 machine.tsv
219742217 drwxr-xr-x 2 stats stats 4096 Oct 6 00:44 php_drilldown
218300425 -rw-r--r-- 1 stats stats 2422 Oct 6 00:47 count.tsv
218300418 -rw-r--r-- 1 stats stats 3385 Oct 6 00:51 arch.tsv
218300417 drwxr-xr-x 3 stats stats 4096 Oct 6 00:51 .

@Nuria
The reports on stat1006 are the ones created by querying MySQL.
The new, updated ones are on stat1007 with all the other Hive reports, and already have puppet-managed systemd timers running them.
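
(For reference, those timers can be inspected on stat1007 with something like the following; the unit naming is an assumption:)

# List timers and look for the reportupdater jobs.
systemctl list-timers --all | grep -i reportupdater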
I left the old report files on stat1006 for safety. But if you think it's confusing, or if @CCicalese_WMF prefers to drop them, I can delete them.

> I left the old report files on stat1006 for safety

I see. Given that all that data is in Hive in the pingback EventLogging table, I do not think we need them on stat1006.

@CCicalese_WMF The reports on stat1007 were correctly updated on the 27th; the next update will include the newly merged changes.