
Assess data dumps collection
Closed, Declined (Public)


  1. Review and update the landing page (see T307003)
  2. Search for and review existing phab tasks related to documentation
  3. Assess the content area and describe:
    • your process: what was a fruitful assessment technique, and what was a dead end?
    • your findings: what are the most needed improvements in this set of docs, and how would you group those improvements into related tasks? How much overlap is there between the priority improvements in your assessment and what is already tracked in phab tasks or mentioned on Talk pages?

Event Timeline

TBurmeister changed the task status from Open to In Progress. Sep 13 2022, 4:03 PM
TBurmeister triaged this task as Medium priority.


  • Read the prose on the page, looking for comprehensibility, simple language vs terms that are jargon or hard to understand, non-inclusive language, typos.
  • While reading, note superfluous links: those that don't offer any necessary or useful additional info.
  • Click all links on the page to see if they work and quickly assess the length and potential freshness of the linked page.
  • Identify links that point to the same subpages.
  • Get a sense of how many of the linked pages live in different domains or platforms and are not subpages of this page.
  • Attempt to assess the intended audience for each section of the page: is the page mixing together content that is clearly for different audiences?
  • Note if the page is translated and which templates, if any, it uses.
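
The link-related steps above (finding links that point to the same subpage and counting links that leave the page's own domain) could be partly automated. The following is only a minimal sketch of one way to do that, not the process actually used for this assessment. It assumes the page's rendered HTML is already available as a string; `LinkCollector`, `audit_links`, the sample markup, and the example.com/example.org URLs are all hypothetical names made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links and drop #fragments so that links to
            # different sections of one page count as the same target.
            absolute, _fragment = urldefrag(urljoin(self.base_url, href))
            self.links.append(absolute)


def audit_links(html, base_url):
    """Return (duplicate link targets, count of unique targets per external host)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    counts = Counter(collector.links)
    duplicates = {url: n for url, n in counts.items() if n > 1}
    base_host = urlparse(base_url).netloc
    by_host = Counter(urlparse(url).netloc for url in counts)
    external = {host: n for host, n in by_host.items() if host != base_host}
    return duplicates, external


# Hypothetical sample markup, not the real landing page.
SAMPLE_HTML = """
<p><a href="/wiki/Data_dumps/FAQ">FAQ</a>
<a href="/wiki/Data_dumps/FAQ#size">How big?</a>
<a href="https://example.org/repo">Code</a>
<a href="/wiki/Data_dumps/Tools">Tools</a></p>
"""
BASE_URL = "https://example.com/wiki/Data_dumps"

duplicates, external = audit_links(SAMPLE_HTML, BASE_URL)
print(duplicates)
print(external)
```

A script like this only flags candidates. Whether a repeated link is actually redundant, or whether a linked page is fresh and useful, still needs a human read.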

Dead ends:

  • I want to make stylistic improvements to this page, but the overall impact of doing so seems negligible given the retranslation that would be necessary. Removing phrases like "our wikis' content" and getting rid of the superfluous linked examples in "They are still useful even so" doesn't seem worth it.

Improvements to consider:

  • There are many links on the page, and some of them point to the same subpages. All duplicate links should be removed -- it should only be necessary to link to something once.
  • Some of the subpages could be combined; many of them are rather short and cover related topics.
  • Group all related info together: any links about data format should be together, links about licensing should be together, links about tools, etc.
  • Divide links by audience. Some of the links to code repos and very detailed dump maintenance info would not be useful for users of the dumps; there should be separate sections or pages for dump maintainers vs. dump users.
  • Structure landing page sections (and ideally also subpages) around common user tasks. From the landing page, it's not clear which format of data dump I should investigate if I'm interested in, for example, user behavior on Wikipedia vs. the types of files on Commons. The sentence buried in the intro ("The dumps are used by researchers and in offline reader projects, for archiving, for bot editing of the wikis, and for provision of the data in an easily queryable format, among other things") would actually be a good way to guide readers to the type of data dump/format that is most useful for their project -- instead of requiring them to learn about the different types of dumps and their licenses, benefits, drawbacks, etc.

I'm now going to follow the process documented at mw:Documentation/Toolkit/Collection_audit to assess the documentation landscape for the topic of "data access". I already know that this work is going to expand beyond this collection to include docs at:

Therefore, I'm going to decline this task and T312995 in favor of tracking this larger scope of work in the parent task, T312997.