Page MenuHomePhabricator

Identify common abuse filters that affect translations
Closed, ResolvedPublic

Description

Content Translation allows to create new articles, and the produced content goes through the usual edit filters. As a result, some of these filters may prevent the publication of content or raise some warning.

There are many of these filters since they are created by each community. This mean that most of the time, Content Translation can only show generic messaging. However, knowing the issues that most commonly affect users of Content Translation would allow to provide some extra guidance.

This ticket is aimed to capture that initial investigation, where for the set of most used languages, the filters with most incidences are identified, and their associated issues described. This will be helpful to define how to provide additional guidance and help for those cases.


Plan:

Dashboard (requires permissions to access): https://superset.wikimedia.org/superset/dashboard/cx-abuse-filter/

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I have scripts to get this info easily. Here are some samples from the last few days:

20180301
2: kk | abusefilter-warning | abusefilter-warning-test-edit
2: it | abusefilter-warning | abusefilter-warning-tag-obsoleto
2: pl | abusefilter-warning | abusefilter-warning-content-translation
2: en | abusefilter-disallowed | abusefilter-warning-large-nonenglish-addition
2: es | abusefilter-disallowed | abusefilter-warning-8
2: es | abusefilter-disallowed | abusefilter-warning-3
3: pt | abusefilter-warning | abusefilter-warning-nowiki
3: en | abusefilter-warning | abusefilter-warning-cx
5: en | abusefilter-disallowed | abusefilter-warning-cx

20180302
2: en | abusefilter-disallowed | abusefilter-warning-external-images
2: pl | abusefilter-warning | abusefilter-warning-content-translation
2: cs | abusefilter-disallowed | Abusefilter-warning-reflist
2: es | abusefilter-disallowed | abusefilter-disallowed
2: pt | abusefilter-warning | abusefilter-warning-fontes
2: pt | abusefilter-disallowed | abusefilter-warning-nowiki
2: bg | abusefilter-warning | abusefilter-warning-sources2
3: en | abusefilter-disallowed | abusefilter-warning-cx

20180303
2: en | abusefilter-disallowed | abusefilter-warning-external-images
2: cs | abusefilter-disallowed | Abusefilter-warning-reflist
2: cs | abusefilter-disallowed | abusefilter-warning-přímý-odkaz-do-jiné-Wikipedie
2: pt | abusefilter-warning | abusefilter-warning-nowiki
2: ar | abusefilter-warning | abusefilter-warning-ref template
3: en | abusefilter-disallowed | abusefilter-warning-email
4: es | abusefilter-disallowed | abusefilter-disallowed
5: en | abusefilter-disallowed | abusefilter-warning-cx

20180304
2: ru | abusefilter-warning | abusefilter-warning-create-empty-page
2: pl | abusefilter-warning | abusefilter-warning-content-translation
2: pt | abusefilter-warning | abusefilter-warning-fontes
2: ar | abusefilter-disallowed | abusefilter-warning-veryshortarticle
2: ro | abusefilter-warning | abusefilter-warning-lint
3: en | abusefilter-warning | abusefilter-warning-cx
7: en | abusefilter-disallowed | abusefilter-warning-cx

20180305
2: it | abusefilter-warning | abusefilter-warning-tag-obsoleto
2: ru | abusefilter-disallowed | abusefilter-warning-create-empty-page
2: ro | abusefilter-warning | Abusefilter-warning-biodata-removal
2: en | abusefilter-disallowed | abusefilter-disallowed
2: en | abusefilter-disallowed | abusefilter-warning-large-nonenglish-addition
2: fa | abusefilter-disallowed | abusefilter-warning-short-new-article
2: ar | abusefilter-disallowed | abusefilter-warning-ref template
2: es | abusefilter-disallowed | abusefilter-warning-3
3: pt | abusefilter-warning | abusefilter-warning-fontes
4: pl | abusefilter-warning | abusefilter-warning-content-translation
7: en | abusefilter-warning | abusefilter-warning-cx
8: en | abusefilter-disallowed | abusefilter-warning-cx

20180306
2: it | abusefilter-warning | abusefilter-warning-tag-obsoleto
2: ru | abusefilter-warning | abusefilter-warning-inappropriate-category
2: pt | abusefilter-warning | abusefilter-warning-fontes
2: zh | abusefilter-warning | abusefilter-warning-short-new-article
2: ar | abusefilter-warning | abusefilter-warning-ref template
3: pt | abusefilter-warning | abusefilter-warning-nowiki
3: en | abusefilter-warning | abusefilter-warning-cx
6: en | abusefilter-disallowed | abusefilter-warning-cx

20180307
2: ur | abusefilter-disallowed | abusefilter-disallowed
2: cs | abusefilter-disallowed | abusefilter-warning-subjektivní
2: en | abusefilter-disallowed | abusefilter-warning-large-nonenglish-addition
2: pt | abusefilter-warning | abusefilter-warning-nowiki
2: id | abusefilter-warning | abusefilter-warning
2: cs | abusefilter-warning | Abusefilter-warning-reflist
2: ro | abusefilter-warning | abusefilter-warning-lint
2: es | abusefilter-disallowed | abusefilter-warning-3
3: en | abusefilter-warning | abusefilter-warning-cx
5: en | abusefilter-disallowed | abusefilter-warning-cx

20180308
2: en | abusefilter-disallowed | abusefilter-warning-links-containing-username
2: nl | abusefilter-warning | abusefilter-warning-schuttingtaal
2: es | abusefilter-disallowed | abusefilter-warning-13
2: it | abusefilter-warning | abusefilter-warning-tag-obsoleto
2: pl | abusefilter-warning | abusefilter-warning-content-translation
2: el | abusefilter-warning | abusefilter-warning
2: fr | abusefilter-disallowed | abusefilter-warning-16
2: pt | abusefilter-warning | abusefilter-warning-nowiki
2: hi | abusefilter-warning | abusefilter-warning-articletitle
2: pt | abusefilter-disallowed | abusefilter-warning-ofensa
2: cs | abusefilter-disallowed | Abusefilter-warning-reflist
2: cs | abusefilter-warning | Abusefilter-warning-reflist
2: nl | abusefilter-warning | abusefilter-warning-titel
3: pt | abusefilter-disallowed | abusefilter-warning-nowiki
3: es | abusefilter-disallowed | abusefilter-disallowed
6: en | abusefilter-warning | abusefilter-warning-cx
18: en | abusefilter-disallowed | abusefilter-warning-cx

20180309
2: fr | abusefilter-disallowed | abusefilter-warning-16
2: cs | abusefilter-disallowed | abusefilter-warning-přímý-odkaz-do-jiné-Wikipedie
2: hi | abusefilter-warning | abusefilter-warning-articletitle
2: es | abusefilter-disallowed | abusefilter-disallowed
2: pt | abusefilter-warning | abusefilter-warning-fontes
2: pt | abusefilter-disallowed | abusefilter-warning-nowiki
2: cs | abusefilter-warning | Abusefilter-warning-reflist
9: en | abusefilter-disallowed | abusefilter-warning-cx

20180310
2: pt | abusefilter-warning | abusefilter-warning-nowiki
2: en | abusefilter-disallowed | abusefilter-disallowed
2: es | abusefilter-disallowed | abusefilter-warning-3
3: zh | abusefilter-warning | abusefilter-warning-spam
3: pt | abusefilter-disallowed | abusefilter-warning-ofensa
3: pt | abusefilter-disallowed | abusefilter-warning-nowiki
3: ar | abusefilter-disallowed | abusefilter-warning-veryshortarticle
9: en | abusefilter-disallowed | abusefilter-warning-cx

20180311
2: cs | abusefilter-disallowed | Abusefilter-warning-reflist
2: ru | abusefilter-warning | abusefilter-warning-inappropriate-category
2: nl | abusefilter-warning | abusefilter-warning-tekstnacatofiw
2: pt | abusefilter-warning | abusefilter-warning-fontes
2: zh | abusefilter-warning | abusefilter-warning-spam
2: es | abusefilter-disallowed | abusefilter-warning-3
3: en | abusefilter-warning | abusefilter-warning-cx
3: en | abusefilter-disallowed | abusefilter-warning-cx
5: fa | abusefilter-disallowed | abusefilter-disallowed

As you see, there are some common ones, but it changes frequently.

If we want to formalize a regular check of this, this will be possible, but let's discuss how do we want to do it.

I had a great meeting about this with @JAllemandou and @mforns yesterday. Here is the summary.

The original idea that @Neil_P._Quinn_WMF and I had was to do it in Turnilo. @JAllemandou agrees that Turnilo is a good tool for visualizing data of this kind. However, Turnilo is a frontend for Druid, which is designed for storing very large datasets, whereas this dataset will have no more than several dozens of datasets per day. He suggests using Superset as a frontend, which provides similar features and is more appropriate for data of this size.

Here is the roadmap for actually implementing this:

  • T204273: Update the way in which CX logs publish failures. I can do it.
  • Set up a table with four columns: date, project (enwiki, dewiki, etc.), abuse_filter_name, count. (Open question for @JAllemandou and @mforns: In which database will this table live?)
  • Make a query that reads the data about the frequency of different AbuseFilters every day from EventLogging, and inserts them into the four-column table. I can do it.
  • Schedule this query using Report Updater. I can probably do it myself, but will need review from @JAllemandou, @mforns or @Milimetric.
  • Set up a Superset dashboard that visualizes this data. I have no experience with this, so I'll need help from @JAllemandou, @mforns or @Milimetric.

Is this blocked on picking a database? Were @JAllemandou and @mforns thinking about hive or mysql? If mysql, then feel free to make a table in the staging database. If hive, wmf database seems fine, we can figure out if we need to split up that database later on.

Is this blocked on picking a database?

The current important blocker is getting https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ContentTranslation/+/460895/ merged. But since you've brought this up: has anything changed lately in how new EventLogging schemas and tables are created? Will I have to get it approved or whitelisted anywhere? Or do we just merge it and deploy it?

Were @JAllemandou and @mforns thinking about hive or mysql? If mysql, then feel free to make a table in the staging database. If hive, wmf database seems fine, we can figure out if we need to split up that database later on.

Even though the ContentTranslation EventLogging patch is not merged yet, it's probably good to start thinking of the table.

Mysql will probably be easier for me, but @JAllemandou and @mforns should say what is better.

Were @JAllemandou and @mforns thinking about hive or mysql?

But since you've brought this up: has anything changed lately in how new EventLogging schemas and tables are created? Will I have to get it approved or whitelisted anywhere? Or do we just merge it and deploy it?

I think both MySQL and Hive would work. If we chose MySQL, we should add the schema to the newly EL-MySQL whitelist, so that it gets inserted there. If we chose Hive, it would probably be a more long term solution, because we're heading towards deprecating MySQL log database in the future.

@Amire80 In one of our last stand/up meetings, we brought up this task, and some of our team members recalled that Superset was not working properly with labs MySQL databases. We are making sure that's true. @JAllemandou, you said log db was working for you in Superset?

It works in that it is queryable, but for small data only (times out otherwise). The thing I have not tested is to build charts out of data.

So, the EventLogging schema for this task started working today and it already has several events logged. It will take a few more days to see how many events do we have per day, but I expect no more than several hundreds per day, possibly even less than a hundred.

It may, however, add up if we want to see a lot of data for many months.

So here's, more or less, the query that I'll want to run every day to collect the events to the new table:

select
  id,
  wiki,
  count(wiki) as count
from (
  select
    distinct(concat(
      event_filterId,
      event_sourceLanguage,
      event_sourceTitle,
      event_targetLanguage,
      event_token
    )) as session,
    timestamp,
    event_filterId as id,
    wiki
  from
    log.ContentTranslationAbuseFilter_18472730
  where
    timestamp like '20181027%'
) as source
group by
  id;

@JAllemandou, @mforns at this point I need your help with setting up the table, and perhaps improving this query.

Hi @Amire80

I think it looks good to start!

Probably you'll want a date field YYYY-MM-DD to the output, no? This way Superset will be able to plot timely progression.
Also, you'll have to add the insert into <table> before the select, so that data gets inserted into the new table.
And one way to ensure that the table exists and has the desired schema is adding a create-table-if-not-exists statement before the query.

Also, you'll have to add the insert into <table> before the select, so that data gets inserted into the new table.
And one way to ensure that the table exists and has the desired schema is adding a create-table-if-not-exists statement before the query.

Oh, so I should just simply create my own table like that, in an SQL script scheduled with report updater? I thought I'd need to do it with a DBA or something :)

@Amire80

Oh, so I should just simply create my own table like that, in an SQL script scheduled with report updater? I thought I'd need to do it with a DBA or something :)

If we store the data in the 'staging' database in analytics-store host, then we don't need DBAs :]. The user that reportupdater uses to connect to that host has write permits to 'staging', so it can create tables and insert rows.

OK, here's another attempt:

create table if not exists staging.cx_abuse_filter_daily (
  id bigint(20) not null auto_increment,
  date varchar(10),
  project varbinary(1024),
  abuse_filter_id bigint(20),
  count bigint(20),
  primary key (id)
);

insert into
  staging.cx_abuse_filter_daily (date, project, abuse_filter_id, count)
select
  date(20181120000000) as date,
  wiki as project,
  abuse_filter_id,
  count(wiki) as count
from (
  select
    distinct(concat(
      event_filterId,
      event_sourceLanguage,
      event_sourceTitle,
      event_targetLanguage,
      event_token
    )) as session,
    event_filterId as abuse_filter_id,
    wiki
  from
    log.ContentTranslationAbuseFilter_18472730
  where
    timestamp like '20181120%'
) as source
group by
  abuse_filter_id;

I tested this on stat1007 by connecting to mysql -u research -h db1108.eqiad.wmnet.

Does this look like something that can be scheduled?

("20181120" will be replaced by a variable, but I'm not sure how to do it.)

@Amire80

I think this looks really good!
I might be missing some detail, but seems that it would work.

The placeholder for the date field that reportupdater expects is:
date('{from_timestamp}') as date,
And in the where section you'll have to use:
timestamp >= '{from_timestamp}' and timestamp < '{to_timestamp}'

Oh, and one last thing. As we're using RU to populate another table, instead of updating a report, we have to trick it to make it think that it is generating a report.
As a separate last statement, we have to add something like:
select date('{from_timestamp}') as date, true as done;
This will output a fake result, which RU uses to keep track of which dates have been executed.

After those small changes, I think you should be able to run a RU job and see it working.
You will need to put this query in a file, and you'll also need to write a short config file for RU to know how and when to run the query.
You can read the docs here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Reportupdater
And please, ask for help any time! We can even meet in Hangouts and pair on this.
Unless you have already another repo for reportupdater queries, you should put your query and config in a new folder here: https://gerrit.wikimedia.org/r/#/admin/projects/analytics/reportupdater-queries

Change 475618 had a related patch set uploaded (by Amire80; owner: Amire80):
[analytics/limn-language-data@master] Add a scheduled job for daily CX abuse filters statistics

https://gerrit.wikimedia.org/r/475618

Change 475618 merged by jenkins-bot:
[analytics/limn-language-data@master] Add a scheduled job for daily CX abuse filters statistics

https://gerrit.wikimedia.org/r/475618

Change 476081 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Add analytics reportupdater job for language cx

https://gerrit.wikimedia.org/r/476081

Change 476081 merged by Ottomata:
[operations/puppet@production] Add analytics reportupdater job for language cx

https://gerrit.wikimedia.org/r/476081

Change 476101 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/limn-language-data@master] Correct credentials file in cx config file

https://gerrit.wikimedia.org/r/476101

Change 476101 merged by jenkins-bot:
[analytics/limn-language-data@master] Correct credentials file in cx config file

https://gerrit.wikimedia.org/r/476101

@Amire80

I just checked and data looks good in analytics-slave::staging.cx_abuse_filter_daily.

Is it accessible on the web? On the web I still see a file with date and done.

@Amire80 Oh! No, the table you created will not be accessible on the web. What is accessible on the web are reports created by RU. As we used RU in this alternate way to insert data into another database, that data will not be copied over any report, thus not appearing on the web.

Oh, I get it now 🤦🏼

Cool, so the Superset dashboard is the next step! I'll really need help with that, 'cause I have no Superset experience.

Change 476493 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/limn-language-data@master] Correct date type in cx query

https://gerrit.wikimedia.org/r/476493

Change 476493 merged by jenkins-bot:
[analytics/limn-language-data@master] Correct date type in cx query

https://gerrit.wikimedia.org/r/476493

Update: A basic Superset dashboard is set up, but to make it more useful, there should probably be a meeting between myself, @Pginer-WMF and @mforns to discuss the details of how does it look like.

Pginer-WMF lowered the priority of this task from High to Medium.Apr 5 2019, 4:40 PM

@Amire80 I couldn't find any other task that refers to fixing the broken job.
Maybe it was in an email... or conversation? I couln't find them either.
We can use this task for that anyway, no?

The chart works as is, but as a follow-up, I made T223958.