
Trending by location page views — Security & Research & Enterprise
Open, Needs Triage, Public

Description

Context

Enterprise wants to create a new event type in their APIs that signals trending readership, rather than being tied to edits or revisions as the current APIs are.

This idea has garnered interest from Google and from unsigned leads that could become reusers. Customers are eager to disambiguate language from location. There is particular interest in helping the humans and algorithms that decide what content to surface become more reactive.

@fkaelin believes this could set an interesting precedent for using streaming data that is currently untapped. Foundation-internal and Enterprise solutions abound in that data lake. Hal Triedman believes this project represents original work that can enable the data Fabian has seen to be put to use in such solutions. Other potentially interested groups include T&S, Disinformation and Research, Fundraising Tech, Community Editing, and Comms.

As of now, pageviews on Wikipedia are only processed in batches with 24-hour latency. With live(-ish) pageview (PV) data, Enterprise, the community, and our reusers can make more informed decisions based on real-time information about who consumes Wikipedia, and when, where, and how. This unblocks the ability to build a whole suite of tools and to gain insight on reading for everyone.

Description

In the Enterprise API, the event would look like the following:

Trending_page:
      id: 26912;
      title: Sapindales;
      Location: Italy;
  • Expected Deliverable. What is the ideal outcome or result of your request?
  • Research team creates a customized stream of pageview events according to Privacy team’s specifications, emitting events at regular intervals.
  • Privacy team ingests the customized pageview event stream into Spark at regular intervals, runs DP pipeline to anonymize the data, does geographic modeling/formatting in line with WME specifications, and passes it along to WME.
  • WME ingests privatized data from Privacy Team into its infrastructure and serves it to customers.
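The three deliverables above can be sketched as a single function, purely for illustration. This ticket does not specify the DP mechanism, the privacy budget, or the release threshold, so everything here is an assumption: the Laplace mechanism, the `epsilon` and `threshold` parameters, the assumption that each reader contributes at most one pageview per interval (sensitivity 1), and the hypothetical names `laplace_noise` and `trending_events`. The output keys mirror the `Trending_page` example in the Description. A real pipeline would run in Spark and follow the Privacy team's specifications.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def trending_events(pageviews, epsilon=1.0, threshold=10.0, rng=None):
    """Turn one interval of pageviews into privatized trending events.

    pageviews: iterable of (page_id, title, location) tuples.
    epsilon, threshold: illustrative values, not taken from the ticket.
    """
    rng = rng or random.Random()
    # Step 1 (Research): aggregate the pageview stream for this interval.
    counts = Counter(pageviews)
    # Step 2 (Privacy): add noise scaled to sensitivity / epsilon.
    # Assumes sensitivity 1, i.e. one pageview per reader per interval.
    scale = 1.0 / epsilon
    events = []
    for (page_id, title, location), n in counts.items():
        noisy = n + laplace_noise(scale, rng)
        # Suppress low counts; only keys above the threshold are released.
        if noisy >= threshold:
            # Step 3 (WME): emit an event in the shape shown in the Description.
            events.append(
                {"trending_page": {"id": page_id, "title": title, "Location": location}}
            )
    return events


if __name__ == "__main__":
    views = [(26912, "Sapindales", "Italy")] * 25 + [(1, "Example", "France")] * 2
    print(trending_events(views, epsilon=1.0, threshold=10.0, rng=random.Random(0)))
```

Adding noise only to the keys that were actually observed is a simplification. How to handle unobserved page and location pairs, and how to bound each reader's contribution, would be for the Privacy team to decide.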

Estimated Effort

@fkaelin can you expand here, please?

  • Priority

Medium-high priority for Enterprise.

I need this task resolved in:

  • 3 to 6 months

For use by WMF Research team; please leave everything below as it is:

  1. Does the request serve one of the existing Research team's audiences? If yes, choose the primary audience. (1 of 4)
  2. What is the type of work requested?
  3. What is the impact of responding to this request?
    • Support a technology or policy need of one or more WM projects
    • Advance the understanding of the WM projects.
    • Something else. If you choose this option, please explain briefly the impact below.