
Develop a tool or integrate feature in existing one to visualize WMCS edits data
Open, HighPublic

Description

This task is about developing a tool to visualize edits coming from Wikimedia Cloud Services. I'm inclined to integrate the visualization into an existing analytics tool where these stats might be appropriate and relevant to showcase.

Looking at https://en.wikipedia.org/wiki/Wikipedia:Statistics, a home for these statistics could be Wikistats: https://stats.wikimedia.org/v2/#/all-projects.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jun 26 2019, 7:30 PM
srishakatux triaged this task as Normal priority. · Jun 26 2019, 7:31 PM

@bd808 I expressed interest in working on code-related stuff to @Bmueller and she proposed this idea :) First, just making sure this is not on your TODO list?
Then, I am considering integrating the visualization into the Wikistats 2 tool; what do you think about that? Birgit mentioned Pageviews, but I think that is more about views than edits, and this visualization won't fit there. Also, who should I reach out to when I dig deeper into investigating where the data is coming from, you or Andrew?

bd808 added a comment. · Jun 26 2019, 8:16 PM

@srishakatux I would be happy to help you get started on this! It is something I have been thinking about working on in my "spare time", but getting it done is a lot more important to me than doing it myself. The Python script that I have for collecting the data from the Analytics MySQL database and computing the total edits vs. edits from Cloud VPS IP addresses is:

wmcs-edits.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# Tabulate total edit actions and edit actions originating from Cloud VPS
# instances in a date range per-wiki.
#
# Copyright (c) 2019 Wikimedia Foundation and contributors
#
# This program is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option)
# any later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
# FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
# more details.
#
# You should have received a copy of the GNU General Public License along
# with this program.  If not, see <http://www.gnu.org/licenses/>.
import argparse
import datetime
import ipaddress
import itertools
import logging

import dns.resolver
import pymysql

WMCS_NETWORKS = [
    ipaddress.IPv4Network(net) for net in [
        # eqiad
        '10.68.0.0/24',
        '10.68.16.0/21',
        '172.16.0.0/21',
        '10.68.32.0/24',
        '10.68.48.0/24',
        # codfw
        '10.196.0.0/24',
        '10.196.16.0/21',
        '172.16.128.0/21',
        '10.196.32.0/24',
        '10.196.48.0/24',
    ]
]

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def parse_date(s):
    try:
        return datetime.datetime.strptime(s, '%Y-%m-%d').date()
    except ValueError:
        raise argparse.ArgumentTypeError('Not a valid date: "%s"' % s)


def strcspn(string, pred):
    return len(list(itertools.takewhile(lambda x: x not in pred, string)))


def pairwise(iterable):
    a = iter(iterable)
    return zip(a, a)


def conf_file(name):
    with open('/srv/mediawiki-config/{}'.format(name)) as f:
        return f.read()


def dblist(name):
    dbs = []
    for line in conf_file("dblists/{}.dblist".format(name)).splitlines():
        line = line[0:strcspn(line, '#')].strip()
        if not line:
            # skip blank and comment-only lines
            continue
        if line[0:2] == '%%':
            dbs = eval_dblist(line)
        else:
            dbs.append(line)
    return set(dbs)


def eval_dblist(line):
    terms = line.strip('% ').split()
    result = dblist(terms.pop())
    for op, part in pairwise(terms):
        if op == '+':
            result = result | dblist(part)
        elif op == '-':
            result = result - dblist(part)
    return list(result)


def get_public_open_wikis():
    return dblist('all') - dblist('closed') - dblist('private')


def get_slice(dbname):
    for s in ['s1', 's2', 's4', 's5', 's6', 's7', 's8']:
        if dbname in dblist(s):
            return s
    return 's3'


def get_conn(dbname):
    ans = dns.resolver.query(
        '_{}-analytics._tcp.eqiad.wmnet'.format(get_slice(dbname)),
        'SRV'
    )[0]
    return pymysql.connect(
        host=str(ans.target),
        port=ans.port,
        db=dbname,
        read_default_file='/etc/mysql/conf.d/analytics-research-client.cnf',
        charset='utf8mb4',
        autocommit=False,
        cursorclass=pymysql.cursors.DictCursor,
    )


def get_edit_counts(dbname, startts, endts):
    """Count WMCS edits and total edits for given date range in given db."""
    wmcs_edits = 0
    total_edits = 0
    connection = get_conn(dbname)
    with connection.cursor() as cur:
        cur.execute("""
        SELECT cuc_ip FROM cu_changes
        WHERE cuc_timestamp > %s AND cuc_timestamp < %s
        """, (startts, endts))
        for row in cur:
            total_edits += 1
            try:
                ip = ipaddress.IPv4Address(row['cuc_ip'].decode('utf-8'))
            except ipaddress.AddressValueError:
                # not an IPv4 address, skip it
                continue
            for network in WMCS_NETWORKS:
                if ip in network:
                    wmcs_edits += 1
                    break
    return {
        'wmcs': wmcs_edits,
        'total': total_edits
    }


def calc_wmcs_edits(starttime, endtime):
    """Calculate the number of total / WMCS edits for all open wikis in a
    given time period.

    Return a dict of 'dbname' => {'total': <total-edit-count>, 'wmcs': <wmcs-edit-count>}"""
    stats = {}
    for dbname in get_public_open_wikis():
        logger.info('Processing %s', dbname)
        try:
            stats[dbname] = get_edit_counts(dbname, starttime, endtime)
        except pymysql.MySQLError:
            logger.exception('Skipping %s', dbname)
    return stats


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Count total and WMCS edit actions per wiki')
    parser.add_argument(
        '-s', '--start',
        metavar='YYYY-MM-DD', required=True, type=parse_date,
        help='Start date (inclusive)')
    parser.add_argument(
        '-e', '--end',
        metavar='YYYY-MM-DD', type=parse_date,
        help='End date (exclusive)')
    args = parser.parse_args()

    if not args.end:
        args.end = args.start + datetime.timedelta(1)

    data = calc_wmcs_edits(
        args.start.strftime('%Y%m%d000000'),
        args.end.strftime('%Y%m%d000000'),
    )

    grand_total = 0
    wmcs_total = 0
    for wiki in sorted(data.keys()):
        t = data[wiki]['total']
        w = data[wiki]['wmcs']
        grand_total += t
        wmcs_total += w
        print('{},{},{}'.format(wiki, t, w))
    print('{},{},{}'.format('TOTAL', grand_total, wmcs_total))

I have been running this ad hoc to tabulate the data at a monthly granularity for February–May. I think that the script could be adapted and used with Reportupdater to compute and store the results in https://analytics.wikimedia.org/datasets/periodic/reports/metrics/ in the tab-separated values format preferred by the Analytics tools. From there I think you could get help from folks on the Analytics team in deciding between Dashiki and Wikistats 2 for visualizing the data.

I had to get new permissions on the Analytics cluster (T220892: Membership in "researchers" group for Bryan Davis) to access the database that stores the CheckUser data that this data collection script needs. You will probably need to do the same to implement and test Reportupdater integration.

Nuria added a subscriber: Nuria. · Mon, Sep 16, 9:04 PM

Sounds like Bryan has a CSV with counts of (wiki, total edits, wmcs edits). I think we can build a dashboard on top of that CSV easily, and further updates to it can come from data in Hadoop. @bd808 we can do a fast tryout with the data you have if you copy the data to a publicly accessible directory. See: https://wikitech.wikimedia.org/wiki/Analytics/Ad_hoc_datasets

Since it only has three fields (wiki, total edits, wmcs edits), the data should be safe to release.

FYI that Wikistats 2 is probably not the best place for this data. https://stats.wikimedia.org/v2/#/all-projects displays data aggregated for all wikis around some classical metrics like "active editors". Edits from labs seem a better fit for a dataset of their own.
If we want a public dashboard, let's make one from the data stored in the CSV @bd808 has been creating (Dashiki is probably best for public data; for internal data we have Superset). We can do a fast tryout once the files are moved to a publicly accessible endpoint.

Nuria added a subscriber: mforns. · Edited · Tue, Sep 17, 8:31 PM

Moving from the other ticket: I think we can probably add a column to geoeditors-daily that says whether the edit happened on cloud. We can add a column to the table below, and they can compute aggregates via reportupdater or Oozie and serve them with a Dashiki dashboard. cc @mforns and @JAllemandou for comments

https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/geoeditors/monthly/insert_geoeditors_daily_data.hql

Yes! if we manage to use that UDF and populate a hive table with it, I think it would be easy to configure a reportupdater job to generate the desired reports.

@Nuria Why the geoeditors table though? I guess it's the only one that updates daily?

Nuria added a comment. · Tue, Sep 17, 9:15 PM

@mforns cause geoeditors daily is updated from cu_changes (in hadoop) which is the only table that has IPs and it is the only one that would be able to tell you whether an edit came from cloud internal ips, makes sense?
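For concreteness, here is a minimal Python sketch of the classification such a column would encode, reusing the Cloud VPS network list from Bryan's script above. The function name `network_origin` and the labels are hypothetical; the real refinery UDF may differ.

```python
import ipaddress

# Cloud VPS instance networks, same list as the script earlier in
# this task (eqiad first, then codfw).
WMCS_NETWORKS = [
    ipaddress.IPv4Network(net) for net in [
        '10.68.0.0/24', '10.68.16.0/21', '172.16.0.0/21',
        '10.68.32.0/24', '10.68.48.0/24',
        '10.196.0.0/24', '10.196.16.0/21', '172.16.128.0/21',
        '10.196.32.0/24', '10.196.48.0/24',
    ]
]


def network_origin(ip_str):
    """Label an edit's source IP as 'wmcs' or 'other'.

    IPv6 and unparseable values fall through to 'other'. The label
    names here are assumptions, not the UDF's actual vocabulary.
    """
    try:
        ip = ipaddress.IPv4Address(ip_str)
    except ipaddress.AddressValueError:
        return 'other'
    if any(ip in net for net in WMCS_NETWORKS):
        return 'wmcs'
    return 'other'


print(network_origin('10.68.16.5'))   # wmcs
print(network_origin('203.0.113.9'))  # other
```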

@Nuria yep, makes sense.

Thanks everyone for your help and sorry for the delay! I took some time understanding various pieces here :)
Here is an update and some questions:
I made small changes to Bryan's script and for now added two files enwiki.tsv and fawiki.tsv to /srv/published-datasets/wmcs directory. Both these files show one-day edit data for these two wikis, including the cloud edits with timestamps.

@bd808 I will only be able to get the data that falls in the 90-day timeframe, right, and not anything beyond that?

@JAllemandou Could you share your HQL query or some instructions to pull the data in T233504#5516053?

@Nuria Could you point me to next steps from here? My sense is that for Dashiki, I will need to create or modify two configuration files on Meta-wiki, make changes to config.yaml in analytics/dashiki codebase, and have a Cloud VPS instance perhaps. I am guessing that for the layout we would need something like the compare as for vital-signs that would help populate data from different wikis.

@srishakatux dashiki is the last thing we would do; we just need a config file, and there should not be any changes needed in the codebase. You got it right: we just deploy the result of dashiki reading the config (and thus creating a JavaScript file) to a Cloud VPS instance. I think the vital-signs layout would do, but before we get there let's focus on the data.

Is there value in the data being per wiki? Or do we want total aggregates (that is: edits_in_cloud/total_edits for all projects)? It is always easier to extract a signal from data by looking at ratios than total numbers, so even if you have total numbers in the file I would visualize ratios.
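A quick sketch of the aggregate ratio described above; the dict shape mirrors Bryan's collection script, and the counts in the example are made up.

```python
def overall_wmcs_ratio(per_wiki):
    """Collapse per-wiki counts into one edits_in_cloud/total_edits ratio.

    `per_wiki` uses the same shape as the collection script's output:
    {'dbname': {'total': <int>, 'wmcs': <int>}, ...}.
    """
    total = sum(c['total'] for c in per_wiki.values())
    wmcs = sum(c['wmcs'] for c in per_wiki.values())
    return wmcs / total if total else 0.0


# Made-up counts: 50 of 1000 edits from WMCS -> 5%.
counts = {'enwiki': {'total': 900, 'wmcs': 45},
          'fawiki': {'total': 100, 'wmcs': 5}}
print('{:.2%}'.format(overall_wmcs_ratio(counts)))  # 5.00%
```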

Right now the data is not daily, yet the column headers say daily. Probably some changes are needed there?

Timestamp WMCS_Daily_Edits Total_Daily_Edits
20190806000035 1 1
20190806000042 0 1
20190806000128 0 1
20190806000138 0 2
20190806000210 0 1
20190806000221 1 1
20190806000301 1 1
20190806000351 1 1
20190806000411 1 1
20190806000426 1 1
20190806000441 1 1

Yeah, storing aggregates makes more sense than storing data per wiki. I think that's how we were gathering data in the past via the script. I got influenced by some of the other formats in analytics/datasets. I've dropped a new TSV file wmcs_edits.tsv in the /srv/published-datasets/wmcs directory; the data in it, shown below, covers the edits from August 2019.

WIKI    TOTAL EDITS     WMCS EDITS      WMCS %
abwiki  617     5       0.81
acewiki 480     6       1.25
adywiki 138     1       0.72
afwiki  14197   82      0.58
afwikibooks     22      1       4.55
afwikiquote     32      1       3.12
afwiktionary    299     2       0.67
akwiki  199     2       1.01
alswiki 4269    1203    28.18
amwiki  1790    27      1.51
amwikimedia     15      0       0.0

@bd808 I will only be able to get the data that falls in the 90-day timeframe, right, and not anything beyond that?

Yes, the CheckUser data is purged after 90 days, so at any point that is as far as you can see into the past. I have CSV files in /home/bd808/projects/wmcs-edits on stat1007 with (wiki, total_edits, wmcs_edits) rows in them that we can use to backfill your TSV dataset back to February 2019. These aggregations are per month, which is all we need to show the broad trends that we are interested in highlighting in the dashboard.

Milimetric raised the priority of this task from Normal to High.
Milimetric moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
Milimetric added a project: Analytics-Kanban.

Hi @srishakatux - I'm sorry for not answering sooner, I've been sick the whole past week :(
Let's synchronize on how to generate the data.
In my mind the plan is to:

  1. Have the network_origin column added to geoeditors-daily (T233504 and associated CR https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/538613/)
  2. Create a reportupdater query generating the aggregated data with the desired dimensions every month, storing results in a new file per month

If this plan is Ok for you, we can start devising the reportupdater query as the column addition patch is done and mostly reviewed :)

Also, I'll let @Milimetric speak about visualization and how we want to set this up, as I'm very ignorant in that regard :)

@srishakatux / @bd808: what would you like to see in your dashboard? Off the top of my head I can see how the following things might be useful:

  • overall % of edits coming from WMCS over time (have to look at the data to see if monthly or daily is more useful)

Let me know if you want to see other nuances about the data or answer other questions, and we can brainstorm from there.

@Milimetric @JAllemandou Hello! Thank you both for your help. This project is a learning project for me :) So, for the pieces that are feasible to do by the folks outside your team, I would love to give a shot at them. I might be a bit slow and reach out to you via IRC for questions if that is okay with you.
Perhaps setting the ReportUpdater (I have a task for it already here T232671), or configuration file that needs to be created on Meta for generating the dashboard.

@srishakatux / @bd808: what would you like to see in your dashboard? Off the top of my head I can see how the following things might be useful:

  • overall % of edits coming from WMCS over time (have to look at the data to see if monthly or daily is more useful)

I think to start with we need this and monthly data would be more useful.

bd808 reassigned this task from JAllemandou to srishakatux. · Thu, Oct 3, 12:17 AM

Milimetric reassigned this task from srishakatux to JAllemandou. · Mon, Sep 30, 4:22 PM

@Milimetric I don't quite understand what happened here, so I'm reverting the ownership change. Part of the intent of this task was to give @srishakatux a concrete project to work on that would get her more familiar with the data sources and tooling that the Analytics team maintains. When we opened T232664: Membership in "researchers" group for Srishti Sethi to get her the equivalent access that I have to the Analytics systems, @Nuria noticed this task and had some ideas about how to make the raw numbers we want easier to get at. Great! More than great actually, but after that things seem to have taken a turn where the Analytics team now wants to do all the work rather than letting us learn by doing.

Not learning the tooling now is going to slow us down in the long term as we find other metrics we want to track and report on. We very much expect to need help learning the subtleties of using ReportUpdater and Dashiki, but we do want to learn. I understand that giving us mentorship will probably cost the Analytics team more time than having local experts just complete the work this time. The intent however is that it will save both teams time in the longer term by spreading knowledge. Hopefully we will even be able to contribute back to the documentation on Wikitech as we find things that seemed obvious to the original documentation creators, but are not obvious to folks with less experience using the software stack.

Nuria added a comment. · Thu, Oct 3, 4:49 AM

Ah, sorry about changing ownership. I meant to assign to Analytics the subtask T233504: add whether an edit happened on cloud VPS to geoeditors-daily dataset

Nuria added a comment. · Thu, Oct 3, 5:02 AM

@srishakatux you can use reportupdater to pull data out of Hive. There is a change in flight migrating reports from MySQL to Hive that, while unrelated to this one, you can take a look at for inspiration: https://gerrit.wikimedia.org/r/#/c/analytics/reportupdater-queries/+/540159/

@Milimetric @JAllemandou Hello! Thank you both for your help. This project is a learning project for me :) So, for the pieces that are feasible to do by the folks outside your team, I would love to give a shot at them. I might be a bit slow and reach out to you via IRC for questions if that is okay with you.
Perhaps setting the ReportUpdater (I have a task for it already here T232671), or configuration file that needs to be created on Meta for generating the dashboard.

@srishakatux: Please ping us on IRC whenever you want; we'll move at your pace, trying to make it a good learning experience :) I see Nuria has added you as a reviewer to https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/538613/ - I'll let you comment/question on this before merging it. And then reportupdater when you are ready :)

One question on the interaction with @Milimetric about needed data: monthly aggregation of the percentage of edits coming from WMCS I get. I want to confirm that "overall" means we don't keep the numbers per-project (I prefer to triple-check :)

bd808 added a comment. · Thu, Oct 3, 4:52 PM

One question on the interaction with @Milimetric about needed data: monthly aggregation of the percentage of edits coming from WMCS I get. I want to confirm that "overall" means we don't keep the numbers per-project (I prefer to triple-check :)

The data we have been collecting manually so far is aggregated monthly by wiki. I think we would like the same granularity going forward so that we can easily answer questions like "What percentage of wikidata edits were done from Cloud Services in July?" and "What percentage of global edits were done from Cloud Services in Q2?".
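With monthly-by-wiki rows, the first kind of question is a simple lookup; a sketch, with entirely made-up numbers (the global Q2 question is the same idea, summing across wikis and months before dividing):

```python
def wmcs_share(rows, wiki, month):
    """Percent of one wiki's edits in one month that came from WMCS.

    `rows` is an iterable of (month, wiki, total_edits, wmcs_edits)
    tuples, i.e. the monthly-by-wiki granularity discussed here.
    """
    for m, w, total, wmcs in rows:
        if m == month and w == wiki:
            return 100.0 * wmcs / total
    return None


# Made-up numbers, just to show the shape of the query:
rows = [('2019-07', 'wikidatawiki', 200000, 120000),
        ('2019-07', 'enwiki', 500000, 25000)]
print(wmcs_share(rows, 'wikidatawiki', '2019-07'))  # 60.0
```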

Understood @bd808 - Thanks for the answer :)

@Milimetric I don't quite understand what happened here, so I'm reverting the ownership change.

Aaah, sorry Bryan, never meant to steal tasks! I was just back from parental leave and didn't have enough context. Hopefully all is clear now. Also, we don't just fully support you all learning our tools, we much prefer it. Anytime you or anyone else can learn more about our platform and provide feedback about how it can be more widely useful and easier to understand, an Analytics faerie gets its wings. Thank you!