
Create a WMCS edits dashboard via Dashiki
Closed, Resolved · Public

Description

This task is about developing a tool to visualize edits coming from Wikimedia Cloud Services. I'm inclined to integrate the visualization into an existing analytics tool where these stats would be appropriate and relevant to showcase.

Looking at https://en.wikipedia.org/wiki/Wikipedia:Statistics, a home for these statistics could be Wikistats: https://stats.wikimedia.org/v2/#/all-projects.

Relevant links:
https://wmcs-edits.wmflabs.org/
https://meta.wikimedia.org/wiki/Config:Dashiki:WMCSEdits
https://meta.wikimedia.org/wiki/WMCS_Edits_Dashboard

(There is nothing more remaining in this task, bugs can be filed separately. Huge thank you to Analytics team for your help & contributions to this project :))

Event Timeline

@bd808 I expressed interest in working on code-related stuff to @Bmueller and she proposed this idea :) First, just making sure this is not already on your TODO list?
Then, I am considering integrating the visualization into the Wikistats 2 tool; what do you think about that? Birgit mentioned Pageviews, but I think that is more about views than edits, and this visualization won't fit there. Also, who should I reach out to when I dig deeper into investigating where the data is coming from, you or Andrew?

@srishakatux I would be happy to help you get started on this! It is something I have been thinking about working on in my "spare time", but getting it done is a lot more important to me than doing it myself. The Python script that I have for collecting the data from the Analytics MySQL database and computing total edits vs. edits from Cloud VPS IP addresses is:

wmcs-edits.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# Tabulate total edit actions and edit actions originating from Cloud VPS
# instances in a date range per-wiki.
#
# Copyright (c) 2019 Wikimedia Foundation and contributors
#
# This program is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option)
# any later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
# FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
# more details.
#
# You should have received a copy of the GNU General Public License along
# with this program.  If not, see <http://www.gnu.org/licenses/>.
import argparse
import datetime
import dns.resolver
import ipaddress
import itertools
import logging
import os
import sys

import pymysql

WMCS_NETWORKS = [
    ipaddress.IPv4Network(net) for net in [
        # eqiad
        '10.68.0.0/24',
        '10.68.16.0/21',
        '172.16.0.0/21',
        '10.68.32.0/24',
        '10.68.48.0/24',
        # codfw
        '10.196.0.0/24',
        '10.196.16.0/21',
        '172.16.128.0/21',
        '10.196.32.0/24',
        '10.196.48.0/24',
    ]
]

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def parse_date(s):
    try:
        return datetime.datetime.strptime(s, '%Y-%m-%d').date()
    except ValueError:
        raise argparse.ArgumentTypeError('Not a valid date: "%s"' % s)


def strcspn(string, pred):
    return len(list(itertools.takewhile(lambda x: x not in pred, string)))


def pairwise(iterable):
    a = iter(iterable)
    return zip(a, a)


def conf_file(name):
    return open('/srv/mediawiki-config/{}'.format(name)).read()


def dblist(name):
    dbs = []
    for line in conf_file("dblists/{}.dblist".format(name)).splitlines():
        line = line[0:strcspn(line, '#')].strip()
        if not line:
            # skip blank and comment-only lines
            continue
        if line[0:2] == '%%':
            dbs = eval_dblist(line)
        else:
            dbs.append(line)
    return set(dbs)


def eval_dblist(line):
    terms = line.strip('% ').split()
    result = dblist(terms.pop())
    for op, part in pairwise(terms):
        if op == '+':
            result = result | dblist(part)  # set union; sets do not support '+'
        elif op == '-':
            result = result - dblist(part)
    return list(result)


def get_public_open_wikis():
    return dblist('all') - dblist('closed') - dblist('private')


def get_slice(dbname):
    for s in ['s1', 's2', 's4', 's5', 's6', 's7', 's8']:
        if dbname in dblist(s):
            return s
    return 's3'


def get_conn(dbname):
    ans = dns.resolver.query(
        '_{}-analytics._tcp.eqiad.wmnet'.format(get_slice(dbname)),
        'SRV'
    )[0]
    return pymysql.connect(
        host=str(ans.target),
        port=ans.port,
        db=dbname,
        read_default_file='/etc/mysql/conf.d/analytics-research-client.cnf',
        charset='utf8mb4',
        autocommit=False,
        cursorclass=pymysql.cursors.DictCursor,
    )


def get_edit_counts(dbname, startts, endts):
    """Count WMCS edits and total edits for given date range in given db."""
    wmcs_edits = 0
    total_edits = 0
    connection = get_conn(dbname)
    with connection.cursor() as cur:
        cur.execute("""
        SELECT cuc_ip FROM cu_changes
        WHERE cuc_timestamp > %s AND cuc_timestamp < %s
        """, (startts, endts))
        for row in cur:
            total_edits += 1
            try:
                ip = ipaddress.IPv4Address(row['cuc_ip'].decode('utf-8'))
            except ipaddress.AddressValueError:
                # not an IPv4 address, skip it
                continue
            for network in WMCS_NETWORKS:
                if ip in network:
                    wmcs_edits += 1
                    break
    return {
        'wmcs': wmcs_edits,
        'total': total_edits
    }


def calc_wmcs_edits(starttime, endtime):
    """Calculate the number of total / WMCS edits for all open wikis in a
    given time period.

    Return a dict of 'dbname' => {'total': <total-edit-count>, 'wmcs': <wmcs-edit-count>}"""
    stats = {}
    for dbname in get_public_open_wikis():
        logging.info('Processing %s', dbname)
        try:
            stats[dbname] = get_edit_counts(dbname, starttime, endtime)
        except pymysql.MySQLError:
            logging.exception('Skipping %s', dbname)
    return stats


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Tabulate total and WMCS edit counts per wiki')
    parser.add_argument(
        '-s', '--start',
        metavar='YYYY-MM-DD', required=True, type=parse_date,
        help='Start date (inclusive)')
    parser.add_argument(
        '-e', '--end',
        metavar='YYYY-MM-DD', type=parse_date,
        help='End date (exclusive)')
    args = parser.parse_args()

    if not args.end:
        args.end = args.start + datetime.timedelta(1)

    data = calc_wmcs_edits(
        args.start.strftime('%Y%m%d000000'),
        args.end.strftime('%Y%m%d000000'),
    )

    grand_total = 0
    wmcs_total = 0
    for wiki in sorted(data.keys()):
        t = data[wiki]['total']
        w = data[wiki]['wmcs']
        grand_total += t
        wmcs_total += w
        print('{},{},{}'.format(wiki, t, w))
    print('{},{},{}'.format('TOTAL', grand_total, wmcs_total))

I have been running this ad hoc to tabulate the data at a monthly granularity for February-May. I think the script could be adapted and used with Reportupdater to compute and store the results in https://analytics.wikimedia.org/datasets/periodic/reports/metrics/ in the tab-separated values format preferred by the Analytics tools. From there I think you could get help from folks on the Analytics team in deciding on Dashiki or Wikistats 2 for visualizing the data.
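As a rough illustration of that adaptation, here is a minimal, untested sketch that reuses calc_wmcs_edits() from the script above and emits one dated, tab-separated row per month; the column names and the monthly date handling are assumptions, not the final Reportupdater contract:

# Hedged sketch: reshape the per-wiki counts from calc_wmcs_edits() into a
# single dated TSV row (date, total_edits, wmcs_edits) for one month.
import datetime

def monthly_tsv_row(year, month):
    start = datetime.date(year, month, 1)
    # First day of the following month, used as the exclusive end bound.
    end = (start.replace(day=28) + datetime.timedelta(days=4)).replace(day=1)
    data = calc_wmcs_edits(
        start.strftime('%Y%m%d000000'), end.strftime('%Y%m%d000000'))
    total = sum(v['total'] for v in data.values())
    wmcs = sum(v['wmcs'] for v in data.values())
    return '{}\t{}\t{}'.format(start.isoformat(), total, wmcs)

print('date\ttotal_edits\twmcs_edits')
print(monthly_tsv_row(2019, 5))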

I had to get new permissions on the Analytics cluster (T220892: Membership in "researchers" group for Bryan Davis) to access the database that stores the CheckUser data that this data collection script needs. You will probably need to do the same to implement and test Reportupdater integration.

Sounds like Bryan has a CSV with counts of wiki, total, and WMCS edits. I think we can build a dashboard on top of that CSV easily, and further updates to it can come from data in Hadoop. @bd808 we can do a fast tryout with the data you have if you copy it to a publicly accessible directory. See: https://wikitech.wikimedia.org/wiki/Analytics/Ad_hoc_datasets

Since it only has a few fields (wiki, total edits, WMCS edits), the data should be safe to release.

FYI, Wikistats 2 is probably not the best place for this data. https://stats.wikimedia.org/v2/#/all-projects displays data aggregated for all wikis around some classical metrics like "active editors". Edits from labs seem a better fit for a dataset of their own.
If we want a public dashboard, let's make one from the data stored in the CSVs @bd808 has been creating (Dashiki is probably best for public data; for internal data we have Superset). We can do a fast tryout once the files are moved to a publicly accessible endpoint.

Moving from the other ticket: I think we can probably add a column to geoeditors-daily that indicates whether the edit happened on cloud. We can add a column to the table below, and aggregates can then be computed via Reportupdater or Oozie and served with a Dashiki dashboard. cc @mforns and @JAllemandou for comments

https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/geoeditors/monthly/insert_geoeditors_daily_data.hql
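For context, the classification such a column implies is straightforward; here is a minimal Python sketch using the same Cloud VPS ranges as the script above (the returned labels are placeholders, not the column's actual values):

# Hedged sketch of the per-IP classification behind a network_origin-style
# column. WMCS ranges mirror WMCS_NETWORKS above; labels are illustrative.
import ipaddress

WMCS_NETWORKS = [ipaddress.ip_network(n) for n in [
    '10.68.0.0/24', '10.68.16.0/21', '172.16.0.0/21',
    '10.68.32.0/24', '10.68.48.0/24',
    '10.196.0.0/24', '10.196.16.0/21', '172.16.128.0/21',
    '10.196.32.0/24', '10.196.48.0/24',
]]

def network_origin(ip_string):
    try:
        ip = ipaddress.ip_address(ip_string)
    except ValueError:
        return 'unknown'
    if any(ip in net for net in WMCS_NETWORKS):
        return 'wmcs'
    return 'other'

assert network_origin('10.68.16.5') == 'wmcs'
assert network_origin('203.0.113.7') == 'other'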

Yes! If we manage to use that UDF and populate a Hive table with it, I think it would be easy to configure a Reportupdater job to generate the desired reports.

@Nuria Why the geoeditors table though? I guess it's the only one that updates daily?

@mforns because geoeditors_daily is updated from cu_changes (in Hadoop), which is the only table that has IPs and thus the only one that can tell you whether an edit came from cloud-internal IPs. Makes sense?

Thanks everyone for your help and sorry for the delay! I took some time to understand the various pieces here :)
Here is an update and some questions:
I made small changes to Bryan's script and for now added two files, enwiki.tsv and fawiki.tsv, to the /srv/published-datasets/wmcs directory. Both files show one day of edit data for these two wikis, including the cloud edits, with timestamps.

@bd808 I will only be able to get data that falls within the 90-day timeframe, right, and not anything beyond that?

@JAllemandou Could you share your HQL query or some instructions to pull the data in T233504#5516053?

@Nuria Could you point me to the next steps from here? My sense is that for Dashiki, I will need to create or modify two configuration files on Meta-Wiki, make changes to config.yaml in the analytics/dashiki codebase, and perhaps have a Cloud VPS instance. I am guessing that for the layout we would need something like the compare layout used for vital-signs, which would help populate data from different wikis.

@srishakatux Dashiki is the last thing we would do; we just need a config file, and there should not be any changes needed in the codebase. You got it right: we just deploy the result of Dashiki reading the config (and thus creating a JavaScript file) to a Cloud VPS instance. I think the vital-signs layout would do, but before we get there let's focus on the data.

Is there value in the data being per wiki? Or do we want total aggregates (that is: edits_in_cloud/total_edits for all projects)? It is always easier to extract a signal from data by looking at ratios than at total numbers, so even if you have total numbers in the file I would visualize ratios.

Right now the data is not daily, yet the column headers say daily. Probably some changes are needed there?

Timestamp WMCS_Daily_Edits Total_Daily_Edits
20190806000035 1 1
20190806000042 0 1
20190806000128 0 1
20190806000138 0 2
20190806000210 0 1
20190806000221 1 1
20190806000301 1 1
20190806000351 1 1
20190806000411 1 1
20190806000426 1 1
20190806000441 1 1
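For illustration, the per-edit rows above could be rolled up into true daily counts before publishing. A minimal sketch, assuming tab-separated input with the header shown and YYYYMMDDhhmmss timestamps:

# Hedged sketch: aggregate per-edit rows into one (day, wmcs, total) row per day.
import csv
from collections import defaultdict

def daily_totals(path):
    totals = defaultdict(lambda: [0, 0])  # YYYYMMDD -> [wmcs, total]
    with open(path, newline='') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            day = row['Timestamp'][:8]  # YYYYMMDD prefix of the timestamp
            totals[day][0] += int(row['WMCS_Daily_Edits'])
            totals[day][1] += int(row['Total_Daily_Edits'])
    return totals

for day, (wmcs, total) in sorted(daily_totals('enwiki.tsv').items()):
    print('{}\t{}\t{}'.format(day, wmcs, total))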

Yeah, storing aggregates makes more sense than storing data per wiki. I think that’s how we were gathering data in the past via the script; I got influenced by some of the other formats in analytics/datasets. I’ve dropped a new TSV file, wmcs_edits.tsv, in the /srv/published-datasets/wmcs directory; the data in it, shown below, covers the edits from August 2019.

WIKI    TOTAL EDITS     WMCS EDITS      WMCS %
abwiki  617     5       0.81
acewiki 480     6       1.25
adywiki 138     1       0.72
afwiki  14197   82      0.58
afwikibooks     22      1       4.55
afwikiquote     32      1       3.12
afwiktionary    299     2       0.67
akwiki  199     2       1.01
alswiki 4269    1203    28.18
amwiki  1790    27      1.51
amwikimedia     15      0       0.0

@bd808 I will only be able to get data that falls within the 90-day timeframe, right, and not anything beyond that?

Yes, the checkuser data is purged after 90 days, so at any point that is as far back as you can see. I have CSV files in /home/bd808/projects/wmcs-edits on stat1007 with (wiki, total_edits, wmcs_edits) rows in them that we can use to backfill your TSV dataset back to February 2019. These aggregations are per month, which is all we need to show the broad trends that we are interested in highlighting in the dashboard.
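If it helps, converting those monthly (wiki, total_edits, wmcs_edits) CSVs into the published TSV layout is mechanical; a minimal sketch, with the input and output file names as placeholders:

# Hedged sketch: add the WMCS % column and write the TSV layout shown earlier.
# Assumes the backfill CSV has no header row; file names are placeholders.
import csv

with open('2019-02.csv', newline='') as src, \
        open('wmcs_edits_2019-02.tsv', 'w', newline='') as dst:
    writer = csv.writer(dst, delimiter='\t')
    writer.writerow(['WIKI', 'TOTAL EDITS', 'WMCS EDITS', 'WMCS %'])
    for wiki, total, wmcs in csv.reader(src):
        total, wmcs = int(total), int(wmcs)
        pct = round(100.0 * wmcs / total, 2) if total else 0.0
        writer.writerow([wiki, total, wmcs, pct])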

Milimetric raised the priority of this task from Medium to High.
Milimetric moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
Milimetric added a project: Analytics-Kanban.

Hi @srishakatux - I'm sorry for not answering sooner, I've been sick the whole past week :(
Let's synchronize on how to generate the data.
In my mind the plan is to:

  1. Have the network_origin column added to geoeditors-daily (T233504 and associated CR https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/538613/)
  2. Create a Reportupdater query generating the aggregated data with the desired dimensions every month, storing results in a new file per month

If this plan is OK with you, we can start devising the Reportupdater query, since the column-addition patch is done and mostly reviewed :)
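To make step 2 concrete, here is a minimal, untested sketch of the kind of monthly aggregation such a report could compute, run via the hive CLI. The table name (wmf.geoeditors_daily), the edit_count column, the network_origin column, and its 'wmcs'-style value are assumptions pending the patch above, and Reportupdater would supply the real date range:

# Hedged sketch only: not the final report script.
import subprocess

MONTH = '2019-10'  # placeholder; Reportupdater supplies the month to compute

QUERY = """
SELECT '{month}-01'                                    AS date,
       SUM(IF(network_origin = 'wmcs', edit_count, 0)) AS wmcs_edits,
       SUM(edit_count)                                  AS total_edits
FROM wmf.geoeditors_daily
WHERE month = '{month}'
""".format(month=MONTH)

print('date\twmcs_edits\ttotal_edits')
print(subprocess.check_output(['hive', '-S', '-e', QUERY]).decode('utf-8'), end='')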

Also, I'll let @Milimetric speak about visualization and how we want to set this up, as I'm very ignorant in that regard :)

@srishakatux / @bd808: what would you like to see in your dashboard? Off the top of my head I can see how the following things might be useful:

  • overall % of edits coming from WMCS over time (have to look at the data to see if monthly or daily is more useful)

Let me know if you want to see other nuances about the data or answer other questions, and we can brainstorm from there.

@Milimetric @JAllemandou Hello! Thank you both for your help. This project is a learning project for me :) So, for the pieces that are feasible for folks outside your team to do, I would love to give them a shot. I might be a bit slow and reach out to you via IRC for questions, if that is okay with you.
Perhaps setting up ReportUpdater (I have a task for it already: T232671), or the configuration file that needs to be created on Meta for generating the dashboard.

@srishakatux / @bd808: what would you like to see in your dashboard? Off the top of my head I can see how the following things might be useful:

  • overall % of edits coming from WMCS over time (have to look at the data to see if monthly or daily is more useful)

I think to start with we need this, and monthly data would be more useful.

Milimetric reassigned this task from srishakatux to JAllemandou. Mon, Sep 30, 4:22 PM

@Milimetric I don't quite understand what happened here, so I'm reverting the ownership change. Part of the intent of this task was to give @srishakatux a concrete project to work on that would get her more familiar with the data sources and tooling that the Analytics team maintains. When we opened T232664: Membership in "researchers" group for Srishti Sethi to get her the equivalent access that I have to the Analytics systems, @Nuria noticed this task and had some ideas about how to make the raw numbers we want easier to get at. Great! More than great actually, but after that things seem to have taken a turn where the Analytics team now wants to do all the work rather than letting us learn by doing.

Not learning the tooling now is going to slow us down in the long term as we find other metrics we want to track and report on. We very much expect to need help learning the subtleties of using ReportUpdater and Dashiki, but we do want to learn. I understand that giving us mentorship will probably cost the Analytics team more time than having local experts just complete the work this time. The intent however is that it will save both teams time in the longer term by spreading knowledge. Hopefully we will even be able to contribute back to the documentation on Wikitech as we find things that seemed obvious to the original documentation creators, but are not obvious to folks with less experience using the software stack.

@srishakatux you can use Reportupdater to pull data out of Hive. There is a change in place that is migrating reports from MySQL to Hive; while unrelated to this one, you can take a look at it for inspiration: https://gerrit.wikimedia.org/r/#/c/analytics/reportupdater-queries/+/540159/

@Milimetric @JAllemandou Hello! Thank you both for your help. This project is a learning project for me :) So, for the pieces that are feasible for folks outside your team to do, I would love to give them a shot. I might be a bit slow and reach out to you via IRC for questions, if that is okay with you.
Perhaps setting up ReportUpdater (I have a task for it already: T232671), or the configuration file that needs to be created on Meta for generating the dashboard.

@srishakatux: Please ping us on IRC whenever you want - we'll move at your pace, trying to make it a good learning experience :) I have seen Nuria has added you as a reviewer to https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/538613/ - I'll let you comment/question on this before merging it. And then Reportupdater when you are ready :)

One question on the interaction with @Milimetric about needed data: the monthly aggregation of the percentage of edits coming from WMCS I get - I want to confirm that "overall" means we don't keep the numbers per-project (I prefer to triple-check :)

One question on the interaction with @Milimetric about needed data: the monthly aggregation of the percentage of edits coming from WMCS I get - I want to confirm that "overall" means we don't keep the numbers per-project (I prefer to triple-check :)

The data we have been collecting manually so far is aggregated monthly by wiki. I think we would like the same granularity going forward so that we can easily answer questions like "What percentage of wikidata edits were done from Cloud Services in July?" and "What percentage of global edits were done from Cloud Services in Q2?".

Understood @bd808 - Thanks for the answer :)

@Milimetric I don't quite understand what happened here, so I'm reverting the ownership change.

aaah, sorry Bryan, never meant to steal tasks! I was just back from parental leave and didn't have enough context. Hopefully all is clear now. Also, we don't just fully support you all learning our tools, we much prefer it. Any time you or anyone else can learn more about our platform and provide feedback about how it can be more widely useful and easier to understand, an Analytics faerie gets its wings. Thank you!

Developer-Advocacy Hi team - here is the first version of the dashboard, built with help from Analytics folks: https://wmcs-edits.wmflabs.org :)

It would be great to have your feedback and hear what you think is missing. One thing to note - I have used the data from the spreadsheet Bryan has been maintaining for the past months. In the process of working on the dashboard, I learned that it makes sense to finalize the dashboard layout first, and then write/complete the Hive queries to periodically output data in a format suitable for the visualization layout. The dashboard we currently have includes three views:

  • Timeseries - Here, we show a visualization of the percentage of Wikimedia Cloud Services edits made to ~10 wikis monthly. These wikis include popular wikis such as Wikipedia, Wikidata, Commons, and others that have a high total or WMCS edit count. For example, here is how I would read one of the lines in the graph: “7% of edits made to English Wikipedia came from WMCS”.
  • Hierarchical - It shows a pie chart of WMCS edits by wiki. Here is how I would interpret one of the slices: "83% of total WMCS edits came from Wikidatawiki".
  • Tabular - This view shows tabular data, exactly what we were maintaining in the spreadsheets. It shows total edits, WMCS edits, and the % of WMCS edits for all Wikimedia wikis.

A few things that I think we could improve in the visualization:

  • Change “allwikis” in Timeseries and “otherwikis” in Hierarchical view to all caps, so that they look a bit different from others.
  • We could add ~10-20 more wikis in Timeseries and Hierarchical views.

Thoughts?

Nice @srishakatux! We can also compute this data daily so you have more fine-grained data, should that be of interest.

mforns raised the priority of this task from High to Needs Triage. Nov 21 2019, 6:20 PM
mforns triaged this task as High priority.
srishakatux renamed this task from "Develop a tool or integrate feature in existing one to visualize WMCS edits data" to "Create a WMCS edits dashboard via Dashiki". Dec 6 2019, 7:06 PM

Failed to load resource: the server responded with a status of 404 ()

Due to https://wmcs-edits.wmflabs.org/favicon.ico

@Jayprakash12345 a slow internet connection might be the issue; it is loading fine for me here and on another network for a colleague :)

Note from @Bmueller via chat:

  • I think we need a small explanation for the different views
  • Investigate how filtering by dates can be supported in the hierarchical view

By the way, you can filter by date in the hierarchical view; there is a little date selector on the left. While answering that, I noticed that the reports stopped a month ago. This is the error; I'm investigating:

Jan  6 15:00:00 stat1007 reportupdater-wmcs[257397]: 2020-01-06 15:00:00,698 - INFO - Starting execution.
Jan  6 15:00:00 stat1007 reportupdater-wmcs[257397]: 2020-01-06 15:00:00,698 - INFO - Writing the pid file.
Jan  6 15:00:00 stat1007 reportupdater-wmcs[257397]: 2020-01-06 15:00:00,719 - INFO - Executing "<Report key=wikis_by_wmcs_edits_percent type=script granularity=months lag=86400 is_funnel=False first_date=2019-10-01 start=2019-12-01 end=2020-01-01 db_key=None sql_template=None script=/srv/reportupdater/jobs/reportupdater-queries/wmcs/wikis_by_wmcs_edits_percent explode_by={} max_data_points=None graphite={} results={'header': '[]', 'data': '0 rows'} group=None>"...
Jan  6 15:01:30 stat1007 reportupdater-wmcs[257397]: Traceback (most recent call last):
Jan  6 15:01:30 stat1007 reportupdater-wmcs[257397]:   File "/srv/reportupdater/jobs/reportupdater-queries/wmcs/dynamic_pivot.py", line 21, in <module>
Jan  6 15:01:30 stat1007 reportupdater-wmcs[257397]:     pivoted[wiki][date] += number
Jan  6 15:01:30 stat1007 reportupdater-wmcs[257397]: NameError: name 'number' is not defined
Jan  6 15:01:30 stat1007 reportupdater-wmcs[257397]: 2020-01-06 15:01:30,279 - ERROR - Report "wikis_by_wmcs_edits_percent" could not be executed because of error: object of type 'NoneType' has no len()
Jan  6 15:01:30 stat1007 reportupdater-wmcs[257397]: Traceback (most recent call last):
Jan  6 15:01:30 stat1007 reportupdater-wmcs[257397]:   File "/srv/reportupdater/reportupdater/reportupdater/executor.py", line 131, in execute_script_report
Jan  6 15:01:30 stat1007 reportupdater-wmcs[257397]:     report.results = self.normalize_results(report, None, tsv_reader)
Jan  6 15:01:30 stat1007 reportupdater-wmcs[257397]:   File "/srv/reportupdater/reportupdater/reportupdater/executor.py", line 171, in normalize_results
Jan  6 15:01:30 stat1007 reportupdater-wmcs[257397]:     empty_row = [report.start] + [None] * (len(normalized_header) - 1)
Jan  6 15:01:30 stat1007 reportupdater-wmcs[257397]: TypeError: object of type 'NoneType' has no len()
Jan  6 15:01:30 stat1007 reportupdater-wmcs[257397]: 2020-01-06 15:01:30,307 - INFO - Deleting the pid file.
Jan  6 15:01:30 stat1007 reportupdater-wmcs[257397]: 2020-01-06 15:01:30,307 - INFO - Execution complete.

Ok, I traced this to the source data not being present. I am rerunning the jobs now and this should update by itself.

srishakatux lowered the priority of this task from High to Medium. Jan 12 2020, 4:54 PM

@Milimetric It looks like there is no data generated for the tabular and hierarchical views for the month of January, and no data at all for February. Could you or @JAllemandou help investigate?

Change 571947 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/reportupdater-queries@master] Add more delay to wmcs reports

https://gerrit.wikimedia.org/r/571947

Change 571947 merged by Mforns:
[analytics/reportupdater-queries@master] Add more delay to wmcs reports

https://gerrit.wikimedia.org/r/571947

I checked the report files, the logs, and the data in edit_hourly. I believe the problem was that the mediawiki_history data set was not available on 2 Feb 2020, and thus the edit_hourly table was not yet populated with January data, so the reports failed.

The delay of the reports was 1 day, which I think is too little time. Usually (if nothing goes wrong) mediawiki_history is ready on the 3rd of the month, so I increased the reports' delay to 4 days. I think this will prevent the problem in the coming months.
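For reference, the merged change above amounts to bumping that delay; a hypothetical excerpt of the Reportupdater config, where the field names mirror the report attributes visible in the log above (type, granularity, lag in seconds) but the surrounding file layout is an assumption:

reports:
    wikis_by_wmcs_edits_percent:
        type: script
        granularity: months
        lag: 345600  # 4 days in seconds, up from 86400 (1 day)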

I also cleaned the report files, so when reportupdater runs next, the reports should be correctly updated,
and after a bit they will also be synched to the public endpoint, so the dashboard can show them.
Please, wait for a couple hours and check :]

I checked in https://analytics.wikimedia.org/datasets/periodic/reports/metrics/wmcs/ and all files are now complete with latest data.
The dashboard is still showing old data for me, but it must be a caching problem that will resolve itself in a bit.

@mforns thank you for looking into this! I still don't see February data in the reports. Is there a reason for that?

@srishakatux no problemo! Yes, the month of February is still not over, so reportupdater will wait until then to calculate the corresponding metrics.

@mforns OOPS! I forgot in all these months how this was supposed to work :D

Hehe, no problem, I'm also constantly checking these in my brain, especially with weekly reports.

Relevant links:
https://wmcs-edits.wmflabs.org/
https://meta.wikimedia.org/wiki/Config:Dashiki:WMCSEdits
https://meta.wikimedia.org/wiki/WMCS_Edits_Dashboard

(There is nothing more remaining in this task, bugs can be filed separately. Huge thank you to Analytics team for your help & contributions to this project :))