
Clean up datasets.wikimedia.org
Closed, Resolved · Public · 8 Estimated Story Points

Description

Nobody knows what aggregate-datasets, limn-public-data, etc. are for.

Requirements

  • Dashiki needs a structured folder for structured metrics: *something*/<<metric-name>>/<<submetric-name>>/<<wiki>>.tsv
  • Dashiki needs unstructured folders to graph arbitrary files (hopefully this doesn't get too crazy; maybe these should all go in a base directory that's specifically for unstructured metrics)
  • Researchers on stat1003 output public datasets
  • Researchers on stat1002 output public datasets

Current State

  • stat1003 rsyncs to limn-public-data
  • stat1002 rsyncs to aggregate-datasets
  • stat1002 *now* rsyncs to limn-public-data
  • ?? public-datasets (looks like ad-hoc work)

Ideal Solution

stat1001: https://datasets.wikimedia.org

README.md
/common
    README.md: this is rsynced from stat1002, stat1003, and wherever else, with no --delete
/reports
    README.md
    /per-wiki
        /sessions
            /visualeditor
                /enwiki.tsv
                /all.tsv
            /wikitext
                /enwiki.tsv
                /all.tsv
    /cross-wiki
        /request-breakdowns (now browser data; we should rename this)
            /by-os-or-browser.tsv
            /by-os.tsv

stat1003:/srv/reportupdater/output/... -> stat1001:.../reports/
stat1002:/a/reportupdater/output/... -> stat1001:.../reports/
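
In cron terms, those two syncs would be something like this (the hostname and rsync module paths are assumptions on my part, modeled on the existing reportupdater cron):

/usr/bin/rsync -rt /srv/reportupdater/output/ stat1001.wikimedia.org::srv/reports/    # on stat1003
/usr/bin/rsync -rt /a/reportupdater/output/ stat1001.wikimedia.org::srv/reports/      # on stat1002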

Steps

  1. Move unstructured stuff from limn-public-data/* to common/legacy/limn-public-data/* (steps 1–3 are sketched after this list)
  2. Symlink limn-public-data to common/legacy/limn-public-data
  3. Move structured stuff from limn-public-data to reports
  4. Announce the plan to do the same thing for aggregate-datasets and public-datasets
  5. In the distant future, delete the symlinks
  6. Make sure the intention of each directory is documented in its README
  7. Send an email to the list
  8. Wikitech documentation?
  9. Update Dashiki config & code for the datasets API root (remove /metrics)
  10. Update the output paths of reportupdater jobs
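
In shell terms, steps 1–3 would be roughly this (a sketch; the exact paths on the web host are assumptions):

# steps 1 and 2: park the unstructured data and keep old URLs working
mkdir -p /srv/common/legacy
mv /srv/limn-public-data /srv/common/legacy/limn-public-data
ln -s /srv/common/legacy/limn-public-data /srv/limn-public-data
# step 3: pull the structured reportupdater output out into /reports
mv /srv/common/legacy/limn-public-data/metrics /srv/reports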

Event Timeline

Milimetric raised the priority of this task from to Needs Triage.
Milimetric updated the task description.
Milimetric subscribed.
Milimetric added a project: Analytics.
Milimetric set Security to None.
Milimetric moved this task from Incoming to Event Platform on the Analytics board.

We have a plan!!! Updating the description.

Milimetric raised the priority of this task from Low to Medium. Mar 17 2016, 3:36 PM
Milimetric updated the task description.
Milimetric moved this task from Event Platform to Analytics Query Service on the Analytics board.
Nuria lowered the priority of this task from Medium to Low. Jul 25 2016, 4:45 PM

The following folders on datasets.wikimedia.org contain data that isn't used any more. We can re-check with their owners, but the dashboards that retrieved them don't exist any more. So, when reorganizing the datasets, these folders can be deleted!

https://datasets.wikimedia.org/limn-public-data/mobile/
https://datasets.wikimedia.org/limn-public-data/edit/
https://datasets.wikimedia.org/aggregate-datasets/refinery/
https://datasets.wikimedia.org/limn-public-data/extdist/

See the task that removed the corresponding dashboards: T147000

Nuria set the point value for this task to 8.

Change 334167 had a related patch set uploaded (by Ottomata):
/srv/datasets.wikimedia.org -> /srv/datasets

https://gerrit.wikimedia.org/r/334167

Change 334167 merged by Ottomata:
/srv/datasets.wikimedia.org -> /srv/datasets

https://gerrit.wikimedia.org/r/334167

Triaging https://datasets.wikimedia.org/limn-public-data/ :

edit/       (old data, ping @Jdforrester-WMF if he still needs any of it, otherwise DELETE)
ee/         (old version, already migrated to metrics/ee, but keep an archive just in case; maybe rename to Editor Engagement)
extdist/    (DELETE, I think; doesn't seem used any more, and the dashboard was deleted)
flow/       (structured; reportupdater is updating this, so it should be in metrics)
language/   (DELETE)
metrics/    (structured, reportupdater, can just update the config to use the new directory)
mobile/     (legacy, used to be in limn dashboard)

Just talked with Dan about the stat1002 aggregate-datasets vs stat1003 public-datasets problem. (These directory names have no real meaning, stay with me!)

aggregate-datasets/ is rsynced --delete from stat1002 to datasets.wikimedia.org (on thorium).
public-datasets/ is rsynced --delete from stat1003 to datasets.wikimedia.org (on thorium).

Both of these directories are for the same thing: allowing researchers to publish datasets. They don't have access to thorium directly, so instead they put their data into these directories on the host they do have access to, and then it will eventually be rsynced over.
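
Concretely, the two crons would look something like this (the exact flags and source paths are assumptions, modeled on the reportupdater cron):

/usr/bin/rsync -rt --delete /a/aggregate-datasets/ thorium.eqiad.wmnet::srv/aggregate-datasets/    # on stat1002
/usr/bin/rsync -rt --delete /srv/public-datasets/ thorium.eqiad.wmnet::srv/public-datasets/        # on stat1003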

It would be much simpler if the data at datasets.wikimedia.org were just in a single directory. But we can't do that if we also want people to keep control over deleting their data.

We either have to keep these rsync --delete crons with separate directories...or use NFS. :o

Question for other opsen: May we export an NFS mount from thorium to stat1002 and stat1003 that users can write to from stat1002 and stat1003? Or, if we want to totally keep NFS off of a host that hosts websites (thorium), we could also just export an NFS mount from stat1002 to stat1003 (or vice versa), and then rsync from one of them to thorium.
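
For concreteness, the export being asked about would be a line like this in /etc/exports on thorium (the path and options are purely illustrative):

/srv/datasets  stat1002.eqiad.wmnet(rw,sync,no_subtree_check)  stat1003.eqiad.wmnet(rw,sync,no_subtree_check)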

Let's see, who should I ping about this? @faidon? :) Whatcha think?

Change 334435 had a related patch set uploaded (by Ottomata):
Add hardsync shell script

https://gerrit.wikimedia.org/r/334435

I talked with @akosiaris this morning, and he suggested I try to do some fanciness with deleting the destination directory and hardlinks. I think I got something!

https://gerrit.wikimedia.org/r/#/c/334435/

I'll get a review on this before actually using it to create a directory we can expose at datasets.wikimedia.org.
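
The gist of the trick, as a minimal sketch (this is not the actual hardsync script, and the paths are illustrative): build the merged tree with hardlinks in a temp directory, then swap it into place.

#!/bin/bash
# Minimal sketch of the hardlink merge, not the real hardsync script.
set -e
dest=/srv/published-datasets
tmp="${dest}.tmp.$$"
mkdir -p "$tmp"
# cp -rlf recreates the directory tree but hardlinks the files, so no data
# is copied; later sources overwrite earlier ones on name collisions.
for src in /srv/rsynced/stat1002 /srv/rsynced/stat1003; do
    cp -rlf "$src/." "$tmp/"
done
# Swap the freshly built tree into place; files deleted on a source host
# are simply absent from the next build.
rm -rf "${dest}.old"
if [ -d "$dest" ]; then mv "$dest" "${dest}.old"; fi
mv "$tmp" "$dest"
rm -rf "${dest}.old"

The hardlinks keep the merge cheap even for big datasets, and the mv swap keeps the window where the directory is missing tiny.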

Talked with Dan again, and we decided that since we are doing T132594 anyway, we might as well make things easy on ourselves and create a brand-new structure at analytics.wikimedia.org/datasets/, and leave datasets.wikimedia.org alone. Then we can copy things into analytics.wm.org/datasets and eventually delete them from datasets.wikimedia.org.

Cronjob that fails periodically:

---------- Forwarded message ----------
From: Cron Daemon <root@stat1002.eqiad.wmnet>
Date: Wed, Jan 25, 2017 at 8:15 PM
Subject: Cron <hdfs@stat1002> /usr/bin/rsync -rt /a/reportupdater/output/* thorium.eqiad.wmnet::srv/limn-public-data/metrics/
To: hdfs@stat1002.eqiad.wmnet


rsync: mkdir "/limn-public-data/metrics" (in srv) failed: No such file or directory (2)
rsync error: error in file IO (code 11) at main.c(674) [Receiver=3.1.1]

Waiting for @Ottomata; we can untangle this together Monday.

Change 335273 had a related patch set uploaded (by Ottomata):
Revert some of the changes last week to datasets.wm.org - we will cleanup at analytics.wm.org instead

https://gerrit.wikimedia.org/r/335273

Change 335273 merged by Ottomata:
Revert some of the changes last week to datasets.wm.org - we will cleanup at analytics.wm.org instead

https://gerrit.wikimedia.org/r/335273

Change 334435 merged by Ottomata:
Add hardsync shell script

https://gerrit.wikimedia.org/r/334435

Yeehaw! Ok, so, that cron thing should be fixed. I reverted some datasets.wm.org stuff back to how it was before.

Also! {/a,/srv}/published-datasets now exists on stat1002 and stat1003. Its contents will eventually end up living at analytics.wikimedia.org/datasets/, likely with stat1003 files overwriting ones with the same names from stat1002 (since the sync happens in alphabetical order).

@Milimetric, you can now move files into published-datasets as you see fit, and link dashboards against them as they are synced to analytics.wikimedia.org/datasets. As you do so, we can add any redirects from datasets.wikimedia.org as necessary, and delete files out of public-datasets, aggregate-datasets, and limn-public-data. Once we are ready to do this, we can start announcing the change.

Ottomata added subscribers: Nuria, Ottomata.

I've merged T132594 as a duplicate, since really this whole cleanup now involves both sites.

Ok, I have a few other priorities first, but I will get back to this on Friday. First I'll move all our reportupdater reports and update their configured location on meta. That should be:

  • disable job in puppet
  • merge (need ops)
  • move output files to the right place (see the sketch after this list)
  • update job in puppet
  • merge
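
The "move output files" step is just a per-job move while its cron is disabled, e.g. (the job name and destination path here are hypothetical):

mv /a/reportupdater/output/flow /srv/published-datasets/reports/flow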

I'll make a plan of which reports to do first so as not to bother anyone else. Then, once that works, we can announce and do the rest.

Dan, assigning this to you, hope that's ok.

Yes, I will pick it up next week.

Change 337642 had a related patch set uploaded (by Milimetric):
[WIP] DO NOT MERGE

https://gerrit.wikimedia.org/r/337642

Change 337672 had a related patch set uploaded (by Milimetric):
[WIP] DO NOT MERGE

https://gerrit.wikimedia.org/r/337672

Those last two patches should hopefully be all the code changes we need. Now we need to:

  1. fix and merge the puppet one
  2. test that dashboards work with it
  3. merge the dashiki patch and re-deploy the dashboards

Change 337672 merged by Ottomata:
Symlink reportupdater output to published-datasets

https://gerrit.wikimedia.org/r/337672

Change 339536 had a related patch set uploaded (by Milimetric):
Update dataset location

https://gerrit.wikimedia.org/r/339536

Change 339536 merged by Milimetric:
Update dataset location

https://gerrit.wikimedia.org/r/339536

Change 337642 merged by Milimetric:
Move datasets to analytics.wikimedia.org

https://gerrit.wikimedia.org/r/337642

I've deployed the dashboards we control and they're all looking good. But I tracked what data they use, and we now have some data on analytics.wikimedia.org/datasets that none of them need (pasted at the end of this comment). I would love to go through the apache logs on thorium to see whether these files are ever accessed. @elukey: can I either get read access to thorium:/var/log/apache2/datasets_access*, or get a copy of the last couple of weeks in my home directory on thorium? If nobody has looked at them for a few weeks, we can delete them from this new clean structure and warn people that we're going to remove them for good from their original place in limn-public-data (soon, not yet). Thanks!

├── ee
│   └── datasets
│       └── enwiki
├── flow
│   └── datafiles
│       ├── archive_20151029171100
│       └── flow_betafeature
├── language
│   └── datafiles
└── metrics
    ├── echo
    │   ├── crosswiki_betafeature
    │   ├── days_to_read
    │   └── monthly_production_and_consumption_of_notifications
    ├── ee
    │   ├── daily_edits
    │   ├── daily_edits_by_anon_users
    │   ├── daily_edits_by_bot_users
    │   ├── daily_edits_by_nonbot_reg_users
    │   ├── daily_unique_anon_editors
    │   └── daily_unique_nonbot_reg_editors

@Milimetric: please check /home/milimetric/apache_logs_datasets on thorium; it should be good :) (you are the only one with read permissions for those files)

Thank you very much. Ok, so, searching through this month of data, I found that the following are basically *never* used by anything other than a crawler (which made me think we should have a robots.txt in general). So I will delete these after mentioning it at standup tomorrow (a sketch of the log scan follows the lists below):

├── ee
│   └── datasets
│       └── enwiki
├── language
│   └── datafiles
└── metrics
    ├── echo
    │   ├── crosswiki_betafeature

And the following are accessed, albeit rarely, so we can keep them and I'll follow up with the right people:

├── flow
│   └── datafiles
│       ├── archive_20151029171100
│       └── flow_betafeature
└── metrics
    ├── echo
    │   ├── days_to_read
    │   └── monthly_production_and_consumption_of_notifications
    ├── ee
    │   ├── daily_edits
    │   ├── daily_edits_by_anon_users
    │   ├── daily_edits_by_bot_users
    │   ├── daily_edits_by_nonbot_reg_users
    │   ├── daily_unique_anon_editors
    │   └── daily_unique_nonbot_reg_editors
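
A sketch of the kind of scan used here, assuming the default combined log format (the crawler filter is deliberately crude):

zcat -f /home/milimetric/apache_logs_datasets/* \
  | grep -viE 'bot|crawl|spider' \
  | awk '{print $7}' \
  | grep '^/limn-public-data/' \
  | sort | uniq -c | sort -rn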

This is done for now, in that the new site is up and the reports we control are syncing to it. The old unused reports have been deleted, and what's left to do is announce and explain how this works. I'll make a longer-running subtask for that.