
Measure the impact of Tainted References Wikidata feature
Closed, Declined · Public

Description

Context: Tainted References feature on Wikidata is intended to make mismatched statement value/reference pairs more prominent to Wikidata editors.

Definitions:

  • tainted reference: mismatching statement value and reference pair
  • edit triggering a tainted reference: edit changing exclusively a value of the statement
  • edit cleaning the tainted reference: one of the following
    • edit changing the reference of the statement on which tainted reference has been previously triggered.
    • edit removing the reference of the statement on which tainted reference has been previously triggered.
    • edit reverting the edit triggering a tainted reference

Goal: Fewer mismatching value/reference pairs exist.

We want to measure how many tainted references are triggered, and how many of these are being cleaned.
To have comparable figures, we need baseline values for the period before enabling the new feature (a baseline does not exist yet).

In the first iteration we only need to look at the next edit by the same author, making the data simpler, but we might want to extend this later.

Goal: Triggered mismatches do get cleaned up and don’t pile up.

We want to measure how many of the tainted references that have been triggered are eventually cleaned, and how long it takes.
Again, we would need to compare with a baseline, and this metric is related to the previous one (at least conceptually; technically they might be measured completely separately).

Technical considerations

  • Wikibase does not help much in identifying triggering and cleaning edits
  • Edits (MediaWiki revisions) changing a statement in any way (without much detail on what has changed: value, reference, qualifier, a combination of these) could be filtered by considering only revisions whose comment field contains a value of the format /* wbsetclaim-update:N||N */ [[Property:PNNN]]: XYZ, where N, NNN, and XYZ are actual numbers/values.
  • Further reasoning about what the edit changed might only be possible by inspecting the change made by the edit (revision), i.e. comparing the JSON object representations of an item before and after
  • For identifying revisions (edits) changing the same statement (e.g. to be able to recognize whether the tainted reference has been cleaned), relying on the statement's unique ID might help. It will still likely involve analyzing the JSON structure of the item data, as the identifier of the statement is not exposed in the comment or any other field.
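The comment-based filtering described above could be sketched as follows (a rough Python sketch; the function name and the exact regex are my own, derived only from the comment format shown above, and would need validation against real data):

```python
import re

# Matches the documented autocomment format, e.g.
# "/* wbsetclaim-update:2||1 */ [[Property:P150]]: [[Q18805608]]"
# (a third numeric segment, as in "2||1|1", appears in some summaries).
WBSETCLAIM_UPDATE = re.compile(
    r"^/\* wbsetclaim-update:\d+\|\|\d+(?:\|\d+)? \*/ "
    r"\[\[Property:(P\d+)\]\]: (.*)$"
)

def parse_update_comment(comment):
    """Return (property_id, rendered_value) for a wbsetclaim-update
    edit summary, or None for any other kind of edit."""
    m = WBSETCLAIM_UPDATE.match(comment)
    return (m.group(1), m.group(2)) if m else None
```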

Event Timeline

Restricted Application added a subscriber: Aklapper. · Dec 11 2019, 3:45 PM

@Jan_Dittrich @Lydia_Pintscher in the description there is my interpretation of the metrics Jan has defined. Please review, point out anything unclear, and feel free to fix all the mistakes I've certainly made.
@GoranSMilovanovic I tried to keep the rough description of metrics sparse. I am sure some further refinement and definition is going to be necessary. This is just a first draft. I'd be happy to work together with you, Tainted References developers, Jan and Lydia on it.
@Tarrow @noarave @Addshore mind reviewing what I wrote, in particular on "triggering" and "cleaning" conditions, and details on statement change representation in Wikibase and MediaWiki?

My initial observations - please comment:

  • The wb_changes schema might be what we need?
  • This schema is poorly documented (see the doc pages); for an example on what is unclear to me:
mysql:research@dbstore1005.eqiad.wmnet [wikidatawiki]> select min(change_time), max(change_time) from wb_changes;
+------------------+------------------+
| min(change_time) | max(change_time) |
+------------------+------------------+
| 20191210140003   | 20191213140444   |
+------------------+------------------+
  • Q. why do we get to see the data for three days only?
  • Proposed approach based on this schema, very sketchy at this point:
    • group by entity,
    • observe all entity changes (the change_info field),
    • figure out when a revision counts as a case of tainted reference (statement value changed, reference(s) untouched),
    • parse the subsequent revisions of the same entity (by the same user exclusively, or not?) and see if the reference(s) for the relevant statement change too.

Also:
@noarave @WMDE-leszek
T231731 defines some conditions on what counts as a tainted reference (i.e. when a tainted reference warning is triggered); it might be very helpful to have access to these data. Where do these data live?

@Jan_Dittrich Could we schedule a hangouts session to discuss this task, if you find some time, please?

GoranSMilovanovic added a comment (edited). Dec 13 2019, 2:46 PM

My initial observations (continued) - please comment:

From the wmf.mediawiki_history table (Data Lake, Hadoop):

select page_title, event_comment from wmf.mediawiki_history where event_entity='revision' and event_type='create' and wiki_db='wikidatawiki' and snapshot='2019-11' and event_comment rlike 'wbsetclaim-update' limit 10;

we find:

page_title      event_comment
Q23895474       /* wbsetclaim-update:2||1 */ [[Property:P150]]: [[Q18805608]]
Q508568         /* wbsetclaim-update:2||1|1 */ [[Property:P106]]: [[Q774306]]
Q3505042        /* wbsetclaim-update:2||1 */ [[Property:P279]]: [[Q16911701]]
Q152362         /* wbsetclaim-update:2||1 */ [[Property:P18]]: Mihai Răzvan Ungureanu 2013-11-23.jpg
Q3418516        /* wbsetclaim-update:2||1|2 */ [[Property:P1006]]: 072590327
Q5187           /* wbsetclaim-update:2||1|1 */ [[Property:P150]]: [[Q1026761]]
Q4531589        /* wbsetclaim-update:2||1|1 */ [[Property:P570]]: 12 January 2018
Q4071572        /* wbsetclaim-update:2||1 */ [[Property:P166]]: [[Q791135]]
Q32236148       /* wbsetclaim-update:2||1 */ [[Property:P625]]: 54°41'36"N, 129°4'6"E

so yes, the method proposed by @WMDE-leszek is helpful for recognizing when the value of a statement in some particular entity changes; we can also get revision_id, event_timestamp, and event_user_id from this table.

Proposed approach (corrected):
(1) fetch a relevant revision (i.e. the one having wbsetclaim-update in the event_comment field),
(2) use the API to collect the JSON representation of the revised entity by revision-id,
(3) look at all (or some, constrained by some time frame?) subsequent revisions of the same entity,
(4) use the API to collect the JSON representations from these subsequent revisions;
(5) compare the JSON representations to see if the change in the value of the statement was followed by a change in the reference(s) of the respective statement or not.
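Steps (2) and (4) could look roughly like this in Python (a sketch; `entity_data_url` and `fetch_entity_json` are hypothetical helper names, and the URL pattern assumes the Special:EntityData endpoint with its revision parameter):

```python
import json
import urllib.request

# Special:EntityData serves an entity's JSON at a specific revision;
# the wbgetentities API module does not offer this.
ENTITY_DATA = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json?revision={rev}"

def entity_data_url(qid, rev):
    """Build the Special:EntityData URL for one entity at one revision."""
    return ENTITY_DATA.format(qid=qid, rev=rev)

def fetch_entity_json(qid, rev):
    """Fetch the JSON representation of an entity at a given revision;
    the response nests the entity under "entities" -> QID."""
    with urllib.request.urlopen(entity_data_url(qid, rev)) as resp:
        return json.load(resp)["entities"][qid]
```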

Maybe we can use the wb_changes schema in place of the API since the following is claimed for its change_info field: "Stores the new full page data in JSON format. (?)"

Please comment.

Thanks, that looks like you've put quite some thinking into it already!
Just a quick answer so far; I'll review all your thoughts soon (tm)

T231731 defines some conditions on what counts as a tainted reference (i.e. when a tainted reference warning is triggered); it might be very helpful to have access to these data. Where do these data live?

Those counts will be available in Grafana/Graphite/Statsd. <- does this sound right @Tarrow @noarave ?
Not sure if this is relevant, but I believe tracking these will only start when the new feature is deployed (which is probably not surprising, but potentially tracking of when a tainted situation was triggered could have started before)

I do not think we want to use the wb_changes table.
wmf.mediawiki_history in Hadoop is probably the right way to go while we are figuring this out from edit summaries.

(2) use the API to collect the JSON representation of the revised entity by revision-id,

Note, this will have to be done using Special:EntityData and the revision parameter (wbgetentities doesn't have this functionality)

Wikibase does not help much in identifying triggering and cleaning edits

It could do though?

Without adding anything to wikibase i guess the general approach has to be:

  • Find revisions that touch statement mainsnak values and/or references
    • Considerations:
      • These values can be touched using a variety of different API modules and with a variety of different summaries, so not just wbsetclaim-update; if anything, working with a blacklist of summaries might be easier (eliminate things that only touch terms, for example)
      • This could be simplified if the definition of a tainted reference had something to do with being made by a real user via our UI, but maybe we don't want to say that.
  • Fetch the entity on either side of the change, see what happened, and classify that?
  • Once this has been done for a window of data, try to figure out exactly what is happening to the statements based on the classifications?

I would be pro a call to discuss this.

@Addshore Well, now it sounds even more complicated than in the ticket description.

I am for a call on this too. Let me just provide a few observations in relation to what has been said and suggested until now.

I do not think we want to use the wb_changes table.

Why? Its documentation says that the change_info field "Stores the new full page data in JSON format", and given that this schema also holds the change_revision_id (docs: "This is equal to the rev_id of the edit made by user"), this sounds exactly like what we need? Once again, under these assumptions, we (1) might use wmf.mediawiki_history to select the revisions that we are interested in (as @WMDE-leszek explains in the task description), then (2) use wb_changes to fetch the JSON representations, selecting by rev_id as a key, and then (3) compare the JSON representations to see if a change in a statement value is followed by a change in the statement's references. What is wrong with this approach, before we abandon it?

(2) use the API to collect the JSON representation of the revised entity by revision-id,
Note, this will have to be done using Special:EntityData and the revision parameter (wbgetentities doesn't have this functionality)

@Addshore Thanks for a hint on this one. However, I am not sure if wmf.mediawiki_history and the API are the way to go at all, since:

  • the wmf.mediawiki_history receives a monthly update only, constraining our reporting to monthly updates too, while
  • making tons of API calls also does not sound feasible.

Without adding anything to Wikibase, I guess the general approach has to be:
Find revisions that touch statement mainsnak values and/or references
Considerations:
These values can be touched using a variety of different API modules and with a variety of different summaries, so not just wbsetclaim-update; if anything, working with a blacklist of summaries might be easier (eliminate things that only touch terms, for example)

@Addshore Define "summaries" and "blacklist of summaries" please.

Fetch the entity either side of the change and see what happened and classify that?

@Addshore Now this I don't even understand.

Once this has been done for a window of data try to figure out exactly what is happening to the statements based on the classifications?

@Addshore I don't understand what you mean here either.

Q. why do we get to see the data for three days only?

Because that is more than enough time for the data to be used by the dispatch process (which is what this table is designed for)

Why? Its documentation says that the change_info field "Stores the new full page data in JSON format", and given that this schema also holds the change_revision_id (docs: "This is equal to the rev_id of the edit made by user") this sounds exactly as what we need?

I just removed that bit of outdated documentation, but perhaps it would still give us some useful data?
I have an example below, which includes the property ID of the statement change in the edit.

{
  "compactDiff": "{\"arrayFormatVersion\":1,\"labelChanges\":[],\"descriptionChanges\":[],\"statementChanges\":[\"P31\"],\"siteLinkChanges\":[],\"otherChanges\":false}",
  "metadata": {
    "page_id": 11980867,
    "parent_id": 874258972,
    "comment": "\/* wbcreateclaim-create:1| *\/ [[Property:P31]]: [[Q7725634]], [[:toollabs:quickstatements\/#\/batch\/23701|batch #23701]]",
    "rev_id": 1075231392,
    "user_text": "XXXX",
    "central_user_id": XXXX,
    "bot": 0
  }
}
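For what it's worth, a small Python sketch of unpacking such a change_info blob; note the compactDiff value is itself a JSON-encoded string, so it needs a second parse (`statement_changes` is a hypothetical helper name):

```python
import json

def statement_changes(change_info_blob):
    """Return the property IDs whose statements an edit touched."""
    outer = json.loads(change_info_blob)
    # compactDiff is a JSON document embedded as a string value.
    compact = json.loads(outer["compactDiff"])
    return compact["statementChanges"]

# Abbreviated version of the example above (metadata trimmed).
example = (
    '{"compactDiff":"{\\"arrayFormatVersion\\":1,\\"labelChanges\\":[],'
    '\\"descriptionChanges\\":[],\\"statementChanges\\":[\\"P31\\"],'
    '\\"siteLinkChanges\\":[],\\"otherChanges\\":false}",'
    '"metadata":{"rev_id":1075231392}}'
)
# statement_changes(example) yields ["P31"]
```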

compare the JSON representations to see if a change in statement value is followed by a change in a statement's reference

We should be careful here not to only look at the change immediately after the first change.

@Addshore Thanks for a hint on this one. However, I am not sure if wmf.mediawiki_history and the API are the way to go at all, since:

  • the wmf.mediawiki_history receives a monthly update only, constraining our reporting to monthly updates too, while
  • making tons of API calls also does not sound feasible.

Another option would be dumps, but then you have the same issue (not regular updates)
In terms of the Hadoop mediawiki_history table, right now I don't think this task is clear about whether we just want to find this data for a historical period (generate it for the last year) or have very regular updates.
If the first, then there would be no problem using a monthly-updated table such as mediawiki_history.
If moving forward we want this to be more real-time, then all of the data we need from mediawiki_history is also in the MediaWiki SQL tables for Wikidata, and the same data can be retrieved from there.

@Addshore Define "summaries" and "blacklist of summaries" please.

For example, "wbsetlabel" or "wbsetdescription" summaries will never touch statements.
"wbeditentity" summaries, however, can touch a statement.
"wbsetclaim" summaries will touch a statement.
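A sketch of what such a summary-prefix filter might look like (the prefix list below is illustrative, taken only from the modules named above; a real blacklist would need to cover all Wikibase API modules):

```python
# Summaries whose autocomment prefix means the edit can never
# touch a statement (terms only). Illustrative, not exhaustive.
NEVER_TOUCH_STATEMENTS = ("/* wbsetlabel", "/* wbsetdescription")

def may_touch_statement(comment):
    """False for summaries that can never touch a statement;
    True otherwise (wbsetclaim always does, wbeditentity might)."""
    return not comment.startswith(NEVER_TOUCH_STATEMENTS)
```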

We opened T231731 as a means to count future tainted-refs triggering and some related data.

What I'm not clear on about all the above discussion is exactly what data we're trying to collect. It seems to me that the baseline we should be aiming for should focus only on events that we're interested in changing (i.e. edits to statements via the UI)

Specifically, whether we want to include only places where the tainted-references feature would have been triggered or not. If we only want to see places where the feature would have been triggered (e.g. by a UI change), then a necessary (but not sufficient) criterion for this would be selecting only those that use wbsetclaim, since the UI edits always use these.

If moving forward we want this to be more real time then all of the data we need in mediawiki_history is also in the SQL tables for mediawiki for wikidata and the same data can be retrieved from there.

We might get some or all of this data in the future from live stats collection (which will help us separate out UI and non-UI edits)

We just had a call and here are some of the comments

We opened T231731 as a means to count future tainted-refs triggering and some related data.

What I'm not clear on about all the above discussion is exactly what data we're trying to collect. It seems to me that the baseline we should be aiming for should try to focus only on events that we're interested in changing (i.e. edits to statements via the ui)

The baseline will include only edits by users, and only edits via API modules that the UI uses.
Not totally sure if we can easily filter beyond that, however, in terms of UI vs. other things using the API.
Maybe we also need to filter out some edits tagged in a certain way?

Specifically whether we want to include only places where the tainted-references feature would have been triggered or not. If we only want to see places where the feature would have been triggered (e.g by a UI change) that a necessary (but not sufficient) criteria for this would be selecting only those that use wbsetclaim since the UI edits always use these.

Yup, wbsetclaim summaries will be in the filter for revisions that are looked at

If moving forward we want this to be more real time then all of the data we need in mediawiki_history is also in the SQL tables for mediawiki for wikidata and the same data can be retrieved from there.

We might get some or all of this data in the future from live stats collection (which will help us separate out UI and non-UI edits)

mediawiki_history will be used

In terms of figuring out which revisions show a tainted-references popup, we think it would be good to add event logging for this specific event, tracking the revision ID that triggered the popup.

Also, moving forward, rather than guessing all of this from edit summaries, it would be a lot nicer to have all of this in event logging.
But doing that for a historical baseline doesn't really work, so the mediawiki_history and summary-based approach will be needed for now.
Unless we delay rollout of the feature.

GoranSMilovanovic added a comment (edited). Dec 18 2019, 12:12 PM

@Addshore @Jan_Dittrich Here is the summary of the approach to collect the baseline data, following our meeting today:

Step 1. Select revisions where the value of a statement is changed

  • we will use the wmf.mediawiki_history table in the WMF Data Lake;
  • we select revisions by event_comment following @WMDE-leszek's approach: see the task description and my experiment in T240466#5739380;
  • we then look for parent revision IDs, because this approach indicates any change and not specifically a change in the value of a statement (thanks @Addshore for this observation);
  • we fetch the JSON representations of the two revisions (the target one + its parent revision) from https://www.wikidata.org/wiki/Special:EntityData/,
  • diff the JSONs, and
  • sort revisions where the value of a statement changed from those where something else happened.
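The diffing step of Step 1 might be sketched like this (assuming the standard Wikibase entity JSON serialization, where claims are keyed by property ID and each statement carries an "id", "mainsnak", and optional "references"; the function name is hypothetical):

```python
def value_changed_refs_untouched(parent_entity, entity, property_id):
    """True if some statement of property_id changed its mainsnak value
    between the two revisions while its references stayed identical."""
    # Index statements by their unique ID, since the statement ID is the
    # only reliable way to match statements across revisions.
    old = {s["id"]: s for s in parent_entity.get("claims", {}).get(property_id, [])}
    new = {s["id"]: s for s in entity.get("claims", {}).get(property_id, [])}
    for sid in old.keys() & new.keys():
        if (old[sid]["mainsnak"] != new[sid]["mainsnak"]
                and old[sid].get("references") == new[sid].get("references")):
            return True
    return False
```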

From Step 1. we have a table of rev_ids where a value in the statement changed. Now,

Step 2. For each revision obtained in Step 1.,

  • we look for the subsequent N = 3 (a parameter whose value needs some experimentation) revisions of the same entity,
  • compare the JSON representations of the subsequent revisions with the original one
  • to see if the references of the revised statement had changed too or not.
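The per-revision comparison in Step 2 could be sketched as follows (again assuming the standard entity JSON serialization; treating full statement removal as "cleaned" is my assumption, per the reverting condition in the task description, and both function names are hypothetical):

```python
def find_statement(entity, statement_id):
    """Locate a statement by its unique ID anywhere in the entity."""
    for statements in entity.get("claims", {}).values():
        for s in statements:
            if s["id"] == statement_id:
                return s
    return None

def reference_cleaned(tainting_entity, later_entity, statement_id):
    """True if the statement's references changed or were removed between
    the tainting revision and a later revision of the same entity."""
    before = find_statement(tainting_entity, statement_id)
    after = find_statement(later_entity, statement_id)
    if before is None:
        return False  # nothing to compare against
    if after is None:
        return True   # statement gone, e.g. the triggering edit was reverted
    return before.get("references") != after.get("references")
```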

The ballpark numbers in this approach:

  • In Step 1. we collect the data until we have approx. 200 tainted references recognized @Jan_Dittrich ;
  • we estimate the probability of obtaining a tainted reference from this sample of wmf.mediawiki_history;
  • In Step 2. we look at three (3) revisions following the one triggering the tainted reference to see
  • if the same user who triggered a tainted reference also revised the reference(s) of the statement, and
  • we estimate the probability of a spontaneous resolution of a tainted reference from these data.

The initial experiment should let us learn better what parameter values (sampling, and how far to look for a change in references in the future revisions) to use.

The initial report should be ready by January 10, 2020.

One other occurrence that popped into my head is vandalism and reverts.
What is described above doesn't take into account the following:

  • a statement value is vandalized
  • that statement value is reverted

As far as I know, this would be counted as a tainted reference with our current approach?

There are many other edge-case patterns that I imagine we might come across, but as discussed in the meeting I won't bother listing them all; maybe we will come across them via experimentation with the baseline.

Addshore moved this task from incoming to in progress on the Wikidata board. Feb 27 2020, 11:03 PM

@WMDE-leszek Please, what is the status of this task?

WMDE-leszek closed this task as Declined. May 4 2020, 6:38 AM

Given the lack of efficient access to the data that would allow recognizing the relevant use case, WMDE has decided to not perform this analysis/measurement.