
An API for monitoring dumps of WMF wikis
Closed, Resolved (Public)

Description

Session title

An API for monitoring dumps of WMF wikis

Main topic

Useful, consistent, and well documented APIs

Type of activity

Unconference Session

Description



=== 1. The problem ===
People tracking the generation of the various dumps of our wiki projects must either screenscrape https://dumps.wikimedia.org/backup-index.html, check the dumpruninfo.txt file in the directory for the dump of a particular wiki on a given date, or, in the case of other sorts of dumps, check directory listings via the web to see what's going on.

With the rewrite of the dumps, we have the opportunity to include a monitoring/stats API from the start of the design.

=== 2. Expected outcome ===
* List requirements for a monitoring/stats API for all forms of dumps; what do users need?
* Review internal WMF projects that may have code or data we can reuse
* Discuss and compile a list of third-party software that could be adapted for our use
* Put together a roadmap (timetable and tasks) for an implementation that works with Dumps 1.0 (the current infrastructure)

=== 3. Current status of the discussion ===
Starting it now. Note that I can't dedicate coding time to this now as I am working on another piece of the rewrite this quarter.

=== 4. Links ===
* This task is part of T128755, scheduled for work during the Jan-Mar 2017 quarter.

==Proposed by==
[[ https://www.mediawiki.org/wiki/User:ArielGlenn | Ariel Glenn ]]

==Preferred group size==
The more the merrier

==Supplies needed==
An easel with white paper or a whiteboard, with markers

==Interested attendees (sign up below)==

# Add your name here

Event Timeline

Kickstarting the discussion based on some notes I made when I was planning for T92966.

User story

As someone who archives the dumps of the wikis, I would want the API to provide information about what kinds of dumps are available for me to download and archive. As I only archive dumps that are fully complete, I would also like to know the status of a dump (whether it is complete, still in progress, or otherwise). Having a list of the files available in a dump would be nice as well.

Technical information

The Dumps API can be separated into the following types of endpoints.

Information on individual dumps

An individual dump is a dump of a wiki on a certain date (e.g. the English Wikipedia's dump on October 01, 2016). When users want to use a database dump of a wiki, this API endpoint will provide the information they need about that particular run.

Possible information about the dump to provide in this API endpoint:

  • status: The current status of the dump (e.g. in-progress, done, failed, error).
  • completion: An estimate of how far a dump has progressed to completion in percentage format (related to T105693).
  • filelist: An array of all available (and completed) files for download.

When the user asks for individual files, the following information can be provided:

  • filename: The name of the file.
  • md5sum: The MD5 checksum for the file.
  • sha1sum: The SHA1 checksum for the file.
  • filesize: The size of the file (in bytes).
  • url: The URL to the file on the download server.

When the user asks for the individual stages of the dump, the following information can be provided (a combined response sketch follows this list):

  • id: An identifier for the dump stage (such as metahistory7zdump for the full history dumps in .7z format).
  • status: The status of the dump stage (e.g. done, in-progress, waiting, failed and skipped).
  • eta: An estimate of the time at which a dump stage will be complete; null if the status of the dump stage is not in-progress.
  • updated: The timestamp when the status of the dump stage has changed.
  • filelist: An array of all available files for download as part of the dump stage.
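To make this concrete, here is a rough sketch of what a response from this endpoint might look like, combining the dump-level, per-file, and per-stage fields listed above. The field names, the "stages" grouping, and all values are illustrative only, not a final format:

{
    "status": "in-progress",
    "completion": 72,
    "filelist": [
        {
            "filename": "enwiki-20161001-stub-meta-history.xml.gz",
            "md5sum": "placeholder md5 hex digest",
            "sha1sum": "placeholder sha1 hex digest",
            "filesize": 123456789,
            "url": "https://dumps.wikimedia.org/enwiki/20161001/enwiki-20161001-stub-meta-history.xml.gz"
        }
    ],
    "stages": [
        {
            "id": "metahistory7zdump",
            "status": "waiting",
            "eta": null,
            "updated": "2016-10-01 12:34:56",
            "filelist": []
        }
    ]
}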

Information on available dumps for each wiki

When users want to get a list of all available dumps for a wiki, this API endpoint will provide that list along with basic information about each run.

Possible information to provide in this API endpoint (a small sketch follows the list):

  • dumps: An array of all available dumps for a given wiki, with the following information about the individual dump in a child array:
    • date: The date of the individual dump in %Y%m%d format.
    • status: The current status of the dump (e.g. in-progress, done, failed, error).
  • latest: The date of the latest complete dump in %Y%m%d format.
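For example, a response for a single wiki might look roughly like this (the dates and statuses shown are made up purely for illustration):

{
    "dumps": [
        {"date": "20160901", "status": "done"},
        {"date": "20161001", "status": "in-progress"}
    ],
    "latest": "20160901"
}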

Information about the archive status of each dump

Ideally, we can get information about the individual dump's existence on the Internet Archive. The rough idea is to provide the URL to the Internet Archive item for the dump once it is archived. It would be even better if information about older dumps could be kept, so that users can be directed to the individual items on the Internet Archive.

However, I am not sure about the usefulness of such functionality in the API yet. If it is indeed useful for users, we will also need to figure out how to update the system on the archive progress of each dump.

Miscellaneous

We can also add additional information into the various APIs above:

  • Total size of the individual dump (in bytes). This gives an estimate of how large a dump is (useful for planning a download of the whole dump, at least for me when archiving).
  • An array of URLs to the mirror sites for each individual file. Users can use this information to download directly from the mirror sites instead, which may also give our mirrors more visibility.

Thanks for this very detailed report of needs, @Hydriz! Now we need some folks who track the status of ongoing dumps to add such stories. Guess I need to go nag people again to subscribe to this ticket.

@ArielGlenn Hey! As the developer summit is less than four weeks away, we are working on a plan to incorporate the ‘unconference sessions’ that have been proposed so far as well as those that will be generated on the spot. Could you confirm whether you plan to facilitate this session at the summit? Also, if your answer is 'YES', I would like to encourage you to update/arrange the task description fields to appear in the following format:

Session title
Main topic
Type of activity
Description (move ‘The Problem’, ‘Expected Outcome’, ‘Current status of the discussion’ and ‘Links’ to this section)
Proposed by (your name linked to your MediaWiki user page, or a profile elsewhere on the internet)
Preferred group size
Supplies needed (anything you would need to run the session, e.g. post-its)
Interested attendees (sign up below)

  1. Add your name here

We will be reaching out to the summit participants next week asking them to express their interest in unconference sessions by signing up.

To maintain consistency, please consider referring to the template in the following task description: https://phabricator.wikimedia.org/T149564.

@srishakatux Yep, I'm planning on running this. I'll update the fields today, thanks for the heads up.

To the facilitator of this session: We have updated the unconference page with more instructions and FAQs. Please review it in detail before the summit: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/Unconference. If there are any questions or confusion, please ask! If your session gets a spot on the schedule, we would like you to read the session guidelines in detail: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/Session_Guidelines. We would also expect you to recruit two to three note-takers, a remote moderator, and (optionally) an advocate on the spot before the beginning of your session. Instructions for each role are outlined in the guidelines. Physical versions of the role cards will be available in all the session rooms! See you at the summit! :)

Summarizing the api-related suggestions from the unconference session:

  • Having human-readable descriptive text for each file produced by the dumps, explaining likely use cases (as the current HTML page does), is valuable.
  • XML is easy to read even without the schema. Having dump contents that remain easy for humans to read in the future, or an equally accessible explanation of the fields, is a must.
  • It might be nice to have statistics about downloaders of the various dump files (which formats are popular, which steps are used or not, etc.). (See T155060)
  • Multilingual descriptions of dump content for humans, with a machine-readable interface.
  • Signed md5/sha1 files? Could use a special GPG key, just as releases do.
  • Filtering dumps to produce a subset of article content on demand, e.g. "just wikiproject X" or "just en wiktionary articles for German words".
  • XML or JSON format for the information in dumpruninfo.txt.
  • Soon it will be possible to provide dumps of structured data from Wikidata that concern only content on Commons; how do we advertise these?

Action items from the session:

  • summarize list of desired features of api
  • solicit more input from folks on mailing lists
  • add json formatted equivalent to current dumps and advertise this
  • follow up on T155060 (stats for downloaders)
  • talk with Bryan Davis about outreach survey to dumps users (labs? stats100* people? other?)
  • TBD

@bd808 We need to talk outreach surveys sometime soonish.

The hard work for the Tool Labs surveys was done by @leila and @yuvipanda who actually came up with the survey questions and format. I would recommend opening a task for designing the survey itself and adding #research-and-data to get some help from the experts on writing your questions. You will also need to decide how you want to advertise the survey. For the annual Tool Labs survey we have gone with a direct email approach using Tool Labs project membership and email addresses attached to Wikitech accounts. For dumps you may want to do something completely different, but again the folks in Research have a lot more experience than I do in figuring out what the reasonable options are.

@ArielGlenn I'm happy to help with the outreach survey, especially because researchers use the dumps heavily. :) Given the scope of this project, I think designing this survey will be time consuming, and we will need to set aside time for it. Would running the survey in the April-June time period make sense with your timelines?

You know us data consumers, we're greedy. But if April-June is what's reasonable, then let's shoot for that timeframe. Thanks a lot for being willing to help out.

@ArielGlenn sure, just one request: please ping me before we finalize the goals for April-June so we can assess the scope of this work together. This will help me set aside enough time for it in the April-June quarter.

Proposing a first take at a dumps API for the current dumps, written in such a way that it could be re-used/extended for the dumps rewrite. It will use the information available in the dumpruninfo.txt file and the md5/sha1 sums files; no new data will be added.

Requests would look like:

dumpsapi/v1/wikis=XX,YY,ZZ&dates=XX,YY,ZZ&format=json&query=status

If 'wikis' is omitted, information on all wikis will be returned. If 'dates' is omitted, information on the latest run for the specified wikis will be returned. Available formats and queries might be expanded in the rewrite. We do versioning right away ('v1') instead of waiting until we realize it would have been a good idea.
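As a rough sketch of how a client might use this (the base URL is an assumption and the endpoint does not exist yet; the sketch also uses a conventional query string rather than the exact syntax above):

# Hypothetical client for the proposed status endpoint; base URL and
# parameter handling are assumptions, not a deployed service.
import requests

BASE = "https://dumps.wikimedia.org/dumpsapi/v1"  # assumed location

def get_status(wikis=None, dates=None):
    """Fetch dump run status for the given wikis and dates as json.

    Omitting wikis means all wikis; omitting dates means the latest run."""
    params = {"format": "json", "query": "status"}
    if wikis:
        params["wikis"] = ",".join(wikis)
    if dates:
        params["dates"] = ",".join(dates)
    resp = requests.get(BASE, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()

# Example: status of the latest run for two wikis
# print(get_status(wikis=["elwiki", "tenwiki"]))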

The output format could be as follows, one entry for each wiki:

{
  "wiki": "XXX",
  "jobs": [
    {
      "job": "jobname",
      "status": "in-progress/waiting/etc",
      "started": "timestamp",
      "completed": "timestamp",
      "files": [
        {
          "filename": "basename",
          "url": "url to download",
          "size": "size in bytes"
        },
        {
          "filename": "other basename",
          "size": "other size",
          "md5sum": "string",
          "sha1": "string"
        },
        ...
      ]
    },
    {
      "job": "secondjobname",
      "status": ...
    }
  ]
}
  • If md5sum or sha1 values are not available, because the job has not run, is in progress, or the sums have not yet been computed, they will not be provided.
  • If a file is not yet complete, its size may still be provided; if a file has not been started, no size will be provided.
  • Not all filenames for a job may be available in the status while the job is in progress; if filenames are known in advance, they may be provided.

Could use uwsgi python {2,3} framework for this. We already have puppet manifests for providing such services.
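A minimal sketch of the sort of WSGI app uwsgi could serve, assuming the combined status has already been written to a single json file (the file path is an assumption, not the actual setup):

# Minimal WSGI application sketch: serve a pre-generated combined status file.
# The path below is assumed for illustration only.
import json

STATUS_FILE = "/mnt/dumpsdata/index.json"  # assumed location of combined status

def application(environ, start_response):
    """Return the combined dump status as JSON; 503 if the file is missing."""
    try:
        with open(STATUS_FILE, "rb") as fh:
            body = fh.read()
        status = "200 OK"
    except OSError:
        body = json.dumps({"error": "status information unavailable"}).encode("utf-8")
        status = "503 Service Unavailable"
    headers = [("Content-Type", "application/json"),
               ("Content-Length", str(len(body)))]
    start_response(status, headers)
    return [body]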

What is job/jobname in this context?
Did you consider also (optionally?) including human-readable (multilingual?) descriptions?

Could use uwsgi python {2,3} framework for this. We already have puppet manifests for providing such services.

Hm… I think we should try to stick to what we use for web services in other places (PHP, Node, …?) unless there are strong reasons not to (existing infrastructure in another environment might be one, for example).

job/jobname is the name of the dump "job", e.g. xmlstubsdump, pagetabledump, and so on. It is the value of the first entry on each line in the dumpruninfo file.

Human readable descriptions and/or multilingual descriptions would be done as part of the rewrite project; I don't want to try to implement them for the existing dumps when there's nothing there to support such features.

The dump scripts (except Import/Export) are in python, and the API would use some functions from that library to, for example, retrieve information about the dumps configuration or existing runs. For the rewrite, if that's done in some other language, then I expect the API implementation to do the same. There seem to be a number of uwsgi services currently deployed, some of which are python-based.

Test implementation of the API, mostly as described above, for the current dumps. I've stashed it here: https://github.com/apergos/dumpstatusapi until a repo is set up in gerrit (request pending).

Next up will be a patch to the current dump code doing basically what's described in T92966, i.e. writing json files in each directory at dump run time and having the monitor script collect all the information into a single json file which can be parsed by a backend for this status script.

This is going to be bare bones, since we want to focus most energy on work on the Dumps-Rewrite project.
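A very rough sketch of the collection step described above, just to illustrate its shape; the directory layout and file names here are assumptions, not the actual monitor implementation:

# Illustrative only: sweep each wiki's latest run directory for a per-run
# status file and combine everything into one index.json. Paths and file
# names are assumptions.
import glob
import json
import os

DUMPS_ROOT = "/mnt/dumpsdata/public"  # assumed top-level directory, one subdir per wiki

def collect_latest_statuses(root=DUMPS_ROOT):
    """Gather each wiki's latest per-run status json into one structure."""
    combined = {"wikis": {}}
    for wiki_dir in sorted(glob.glob(os.path.join(root, "*"))):
        run_dirs = sorted(glob.glob(os.path.join(wiki_dir, "[0-9]" * 8)))
        if not run_dirs:
            continue
        status_path = os.path.join(run_dirs[-1], "dumpstatus.json")  # assumed name
        try:
            with open(status_path) as fh:
                combined["wikis"][os.path.basename(wiki_dir)] = json.load(fh)
        except (OSError, ValueError):
            continue
    return combined

def write_index(root=DUMPS_ROOT):
    """Write the combined status to index.json in the top-level directory."""
    with open(os.path.join(root, "index.json"), "w") as fh:
        json.dump(collect_latest_statuses(root), fh, indent=4)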

Change 335007 had a related patch set uploaded (by ArielGlenn):
sample uwsgi app that would produce json status output for dumps

https://gerrit.wikimedia.org/r/335007

Quite glad that the proposed idea is largely based on my recommendations, I am humbled. :D

Regardless, a list of all available dates should be returned together with details about the latest dump date available for a particular wiki. Querying for all available dumps in order to archive them is a legitimate use case for me, and including this should be relatively cheap. Also, will you allow users to make API queries that specify that only certain fields be returned, or will you just return all available details in each query?

The rest of the format looks pretty good so far to me. Thanks for the work!

@Hydriz Yep, your ideas were indeed the basis.

I likely will not implement your other suggestions for the current dumps; however, once the dust settles on these patches (and the patch to come that makes current dumps write out json files for use by the status script) I will certainly accept patches from others. The reason for this is that I need to be focussing mostly on the rewrite and less on new features for the current dumps. I am sneaking the existing work in, in the context of "working on the status API for the rewrite" ;-)

https://gerrit.wikimedia.org/r/#/c/336395/ writes some json output alongside the text files for:

  • md5/sha1 sums
  • dumpruninfo
  • status info in index.html

This could be scooped up by monitor.py and combined into a single json file for all wikis, easily parseable by an API script or retrievable on request by a client.

Change 336395 had a related patch set uploaded (by ArielGlenn):
[operations/dumps] write hash sums, dumpruninfo, status report additionally in json

https://gerrit.wikimedia.org/r/336395

Change 342310 had a related patch set uploaded (by ArielGlenn):
[operations/dumps] use the various json outputs to write a combined file for status api use

https://gerrit.wikimedia.org/r/342310

Change 342311 had a related patch set uploaded (by ArielGlenn):
[operations/dumps] have dump monitor collect current run json files and produce index.json

https://gerrit.wikimedia.org/r/342311

New versions of the patches are coming in, for the dumps to generate json files themselves. Sample output is below. These files will be generated for each wiki and dump run date during the run. The monitor will sweep the latest dump dirs for all wikis and collect all the json into one index.json file in the main directory. Tested and ready to go, but it will need to wait for the end of the current run before deployment.

{   "version": "0.8",
    "jobs": {
        "abstractsdump": {
            "files": {
                "tenwiki-20170311-abstract1.xml": {
                    "md5": "a67b77d2039477526e471369603d98b1",
                    "sha1": "422a476e6125fc61a4f67160dff4cdf079c79bff",
                    "size": 244709,
                    "url": "/mydumps/tenwiki/20170311/tenwiki-20170311-abstract1.xml"
                },
                "tenwiki-20170311-abstract2.xml": {
                    "md5": "d652857de75f75d3d317dfd143a3f0b3",
                    "sha1": "6eef20c208f1d4ca32ab0d88e6332c3e1665fd80",
                    "size": 110626,
                    "url": "/mydumps/tenwiki/20170311/tenwiki-20170311-abstract2.xml"
                }
            },
            "status": "done",
            "updated": "2017-03-11 18:46:15"
        },
        "abstractsdumprecombine": {
            "files": {
                "tenwiki-20170311-abstract.xml": {
                    "md5": "6cff197c33a5a553ad6cc9d5157513f5",
                    "sha1": "0ea39558ecf0843867c43a73ac37247d4f885c78",
                    "size": 355320,
                    "url": "/mydumps/tenwiki/20170311/tenwiki-20170311-abstract.xml"
                }
            },
            "status": "done",
            "updated": "2017-03-11 18:46:17"
        },
       ...
}
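As a small illustration of how a consumer could use output shaped like the sample above, listing the completed jobs and the files they produced (the local file name in the usage example is hypothetical):

# Sketch of a consumer of per-run status json shaped like the sample above.
import json

def completed_files(status_path):
    """Yield (jobname, filename, url) for every file of every completed job."""
    with open(status_path) as fh:
        report = json.load(fh)
    for jobname, jobinfo in sorted(report.get("jobs", {}).items()):
        if jobinfo.get("status") != "done":
            continue
        for filename, fileinfo in sorted(jobinfo.get("files", {}).items()):
            yield jobname, filename, fileinfo.get("url")

# Example usage against a downloaded per-run status file (hypothetical name):
# for job, name, url in completed_files("tenwiki-20170311-status.json"):
#     print(job, name, url)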

Change 336395 merged by ArielGlenn:
[operations/dumps] write hash sums, dumpruninfo, status report additionally in json

https://gerrit.wikimedia.org/r/336395

Change 342310 merged by ArielGlenn:
[operations/dumps] use the various json outputs to write a combined file for status api use

https://gerrit.wikimedia.org/r/342310

Change 342311 merged by ArielGlenn:
[operations/dumps] have dump monitor collect current run json files and produce index.json

https://gerrit.wikimedia.org/r/342311

Deployed and we'll see how the output looks tomorrow when the next dump run starts.

Looks pretty good! While this may not be the final form of output, especially after the entire dump production process is rewritten, this is a fine basis, so I'm closing this ticket.

Change 335007 abandoned by ArielGlenn:
sample uwsgi app that would produce json status output for dumps

Reason:
Not necessary; the various json files produced as dumps run cover this need

https://gerrit.wikimedia.org/r/335007