<Product Research> WikiWho migration to production
Closed, Declined | Public

Description

Request Status: New Request
Request Type: research

Request Title: WikiWho Migration to the WMF

  • Request Description: WikiWho is a service (API + Datasets) for mining changes and attribution information from wiki pages. The academic institution that hosts the service is “sunsetting” the service and wants to support us in its migration. The service will likely go down in January unless we migrate it.

Project Purpose: To bring the WikiWho article attribution service in-house. It is used by XTools, Who-Wrote-That, and Education-Program-Dashboard, among other applications and researchers.

  • Indicate Priority Level: High
  • Main Requestors: Community Tech (PM: Natalia Rodriguez)
  • Ideal Delivery Date: January 2022
  • Stakeholders: <list stakeholder, team/org>

Request Documentation

Document Type | Required? | Document/Link
Related PHAB Tickets | Yes | T288840: Migrate WikiWho service to VPS
Product One Pager | No | <add link here>
Product Requirements Document (PRD) | No | <add link here>
Product Roadmap | No | <add link here>
Product Planning/Business Case | No | https://docs.google.com/presentation/d/16vmwb_t3DDDfQ26tQ8U1_GK3Ky3ZAlPgELNvQgT3onI/edit#slide=id.gf326185eb3_0_57
Product Brief | No | <add link here>
Other Links | No | <add links here>

Event Timeline

DAbad changed the task status from Open to In Progress. Oct 14 2021, 3:44 PM

Initial Wikiwho Migration Discussion
Attendees: Community Tech + Platform PM

Action Items:

  • Community Tech will continue to explore short-term options leveraging existing tools, while Platform PMs will explore how we can support this with API Platform, Data Platform, etc.
  • We will reconvene in 2-3 weeks to discuss viable short, medium, and long-term opportunities

Notes:
Context:

  • WikiWho is relied on by several internal projects
  • The API gets approximately 300,000 requests per month

Component Needs:

  • Data Persistence - WikiWho currently generates a very large volume of data, as it stores revisions against each word of an article. The content is currently stored on disk in Python pickle format, which the current host supports. We'd expect to need at least 4 TB of storage in its current form, though 5 TB would be ideal.
  • EventListener - To continuously build upon the dataset, it reads from EventStreams and appends to the pickle files as new revisions are created. We would also need to replicate this process.
  • Algorithm Support - There is a WikiWho algorithm that is triggered by the event listener to generate the data. We would also need to be able to deploy it and trigger it from EventStreams.
  • WikiWho API - In order for the tools to leverage the data, there is a WikiWho API that would need to be refactored/migrated to the WMF. Endpoints would have to be modified based on the solutions defined for data persistence.
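
The EventListener and persistence steps above can be sketched roughly as follows. This is a minimal illustration, not the current host's implementation. It assumes the stream arrives as Server-Sent Events with one JSON payload per `data:` line. The field names (`database`, `page_id`, `rev_id`), the per-page `.p` file layout, and all helper names are hypothetical, and the stored value is simplified to a list of revision IDs rather than per-word revision data.

```python
import json
import pickle
from pathlib import Path
from typing import Iterable, Iterator


def parse_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield JSON payloads from Server-Sent Events 'data:' lines.

    Assumes each event carries its JSON on a single 'data:' line.
    """
    for line in lines:
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload:
                yield json.loads(payload)


def append_revision(store_dir: Path, event: dict) -> Path:
    """Append the revision ID to a per-page pickle file (hypothetical layout)."""
    path = store_dir / f"{event['page_id']}.p"
    revisions = []
    if path.exists():
        with path.open("rb") as fh:
            revisions = pickle.load(fh)
    revisions.append(event["rev_id"])
    with path.open("wb") as fh:
        pickle.dump(revisions, fh)
    return path


def process_stream(lines: Iterable[str], store_dir: Path, wiki: str = "enwiki") -> int:
    """Persist revisions for one wiki and return how many were stored."""
    stored = 0
    for event in parse_sse_events(lines):
        if event.get("database") != wiki:
            continue  # skip events from other wikis
        append_revision(store_dir, event)
        stored += 1
    return stored
```

In a real deployment the lines would come from a long-lived streaming HTTP response rather than an in-memory list, and the attribution algorithm would run at the point where `append_revision` is called here.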

General Needs:

  • Currently the code is in Python and Community Tech does not have Python experts to do the necessary refactoring
  • Persistence remains an open question. Where to store the data, what hardware to buy, etc. still need to be determined.
  • A potential MVP that the team is considering is using WMCS. Given some of the challenges we are currently having in https://phabricator.wikimedia.org/T289582 trying to do the same, we may need alternative short-term solutions.
DAbad triaged this task as High priority.

@DAbad: Would T288840: Migrate WikiWho service to VPS be a parent or subtask of this task? If it is, feel free to set that semantic connection via Edit Related Tasks.... Thanks!

API Questions:

2021/10/14 - Platform PMs met to discuss the request and break down the work required. Next steps include:

  • Focusing on a data persistence (storage) solution and evaluating whether we can leverage Data Platform in the short term (Luke)
  • Evaluating API requirements (Seve)
  • All open questions will be posted to this ticket

Data Platform Notes:

  • Based on the storage and growth estimates, we should be OK to onboard the datasets in our Cassandra DB
  • Data persistence should be simple enough, but we don't have anything implemented yet for event streaming and triggering services

First off, let me thank you all for the quick assistance!

  • Does the API have any reference documentation?

There are also more detailed internal docs on setting up the API server from scratch. They are in a private GitHub repo. I can try to get these docs prepared for you, but I imagine most of it won't apply to our stack. https://github.com/wikiwho/WikiWho is the actual algorithm. How we store and serve that output will probably differ greatly from how Gesis (the current maintainer) does it. See F34639572 for a high-level overview of their stack.

There is one other component that we will need: https://github.com/wikiwho/WhoColor/. This is basically an extension to the WikiWho algorithm, which adds annotated HTML so you can visually represent attribution data of an article. This is the API endpoint used by Who-Wrote-That.

  • Is there a known set of consumers?

As for Wikimedia specifically, we know of three major consumers: XTools, Who-Wrote-That and Education-Program-Dashboard. Gesis has informed us they receive roughly 292,500 requests a month (~9,750 a day). XTools caches its results; uncached it would by itself make roughly 8,000 requests a day.
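
As a rough back-of-the-envelope check on the figures above, the monthly total can be converted into an average rate. This assumes a 30-day month (which is what the ~9,750 a day figure implies) and evenly spread traffic, so it says nothing about peak load. The variable names are illustrative only.

```python
# Figures reported by Gesis, as quoted above.
MONTHLY_REQUESTS = 292_500
DAYS_PER_MONTH = 30          # assumption: 30-day month
SECONDS_PER_DAY = 86_400

daily = MONTHLY_REQUESTS / DAYS_PER_MONTH
per_second = daily / SECONDS_PER_DAY

print(f"{daily:,.0f} requests/day")        # 9,750 requests/day
print(f"{per_second:.2f} requests/second") # 0.11 requests/second
```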

  • Do we have any SLO/SLAs defined for the service?

No.

  • Where is the API currently hosted?

On Gesis' internal servers. We do not have access to these servers.

Some additional comments on the database side after discussion with SRE and Platform team today:

  • Postgres DB and Pickle files are not supported by SRE
  • Longer term, we would need to redesign the service so that it does not use these

11/1 Platform PM Triage Meeting:

  • Need to look at how the existing service will fit into the new framework
  • Next step: if the plan was to rely on hosting it on Cloud Services, it should instead use Kubernetes
  • Data Persistence: putting it on Cassandra would be reasonable to do by mid-June
    • Event streams + refactoring data model
  • DA to follow-up on timeline
DAbad renamed this task from <Research> WikiWho Migration to <Product Research> WikiWho Migration. Dec 3 2021, 2:42 PM

2021-12-08 Technology Steering Committee Discussion
Notes:

  • Enterprise has been interested in how to potentially leverage WikiWho, as well as how we would redesign it
  • The current timeline is not feasible for Data Platform to meet; however, we will keep this as a use case

Action Items:

  • Review how this could fit into future Data Platform, API Platform, etc. (include Enterprise)

An update: our VPS installation has been up and running for some time now, and performing very well. It appears we may even be able to acquire more storage capacity to add more languages. Re-engineering the stack to work in production will be a huge effort that I don't think Community-Tech has time for in the near-term. That said, we are still interested in bringing this to production, but are considering it low-priority for now.

Thanks to everyone who has investigated and helped with this effort so far! We don't want to forget about it, but I believe for the time being there's no need to dedicate more time to this unless you feel compelled to do so :) I will let someone else adjust the priority as they see fit. I will remove T288840 as a parent task, as that was repurposed to be about only the VPS installation, which is now resolved.

Thanks again!

One more update...

Today @Andrew took the system down to do some hypervisor maintenance, and he noted that the system is using server-local storage rather than Ceph, that this is probably just a mistake, and that we can avoid future downtime if we rebuild the server and use distributed storage instead. (We can keep the Cinder volume and attach it to the new server.) We'll do that if/when we rebuild the server.

Also, we still have one element of the system that doesn't come back up automatically after a reboot: the init.d celery daemon (sudo /etc/init.d/celeryd start). When we rebuild the system, we should also switch that init.d celery config out and use a systemd service instead.

After receiving a report of the service not working, I logged in and found celery down again. I've restarted it.

MusikAnimal lowered the priority of this task from High to Low. Jun 2 2022, 8:46 PM

To be clear, this task is (or was) about research around deploying WikiWho to WMF production. I left my summary at T293386#7715345. With the VPS instance performing swimmingly, I don't think this task is high-priority and it could even be closed, assuming no one has any plans to work on this effort further (CommTech does not). We do want to add more languages to our WikiWho installation but that can be done with the existing VPS infrastructure.

MusikAnimal renamed this task from <Product Research> WikiWho Migration to <Product Research> WikiWho migration to production. Jun 2 2022, 8:46 PM

@lbowmaker: Any feedback to the previous comment? Thanks.

Aklapper changed the task status from Resolved to Declined. Jun 30 2022, 11:15 AM

(Resetting task status as it seems this was not resolved)