Investigate usage of "text" in AbuseFilter rules on wikidata.org
Closed, ResolvedPublic

Description

We want to reduce the amount of "text" provided to AbuseFilter from Wikibase entities in T205252.
Before we can do that we need to see what rules are in place and which bits of the text are actually used.

This should cover:

For example

  • Statement GUIDs are provided as one of the lines in the "text". So are strings such as the following used by abuse filter rules? "Q56596767$199BCB00-D1ED-40A5-B001-439BC5F434F7"
  • The rank of statement is also included as a line such as "normal". Is this used in abuse filter rules?
  • etc.

Event Timeline

Addshore triaged this task as Normal priority.
Addshore moved this task from Backlog to Questions on the wikidata-tech-focus board.
Addshore raised the priority of this task from Normal to High.

Just a comment: "_text" variables are for page title, and so are "_prefixedtext" variables. So, if you're interested in covering such variables, then you also have to include "_title" and "_prefixedtitle" per T173889.

So, this ticket only cares about the "wikitext" in all of its forms, not the title. We should update the description!

Daimona updated the task description. Feb 5 2019, 11:52 AM
Daimona updated the task description. Feb 5 2019, 11:56 AM

Description updated! Searching for all of the variables yields 76 matches. Checking by hand is feasible, but not optimal. Is there a list of what data we're looking for (e.g. GUIDs and rank, mentioned in task desc)? I'd like to see if I can extract a regex from there.
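As a rough starting point for such a regex: the example GUID in the task description has the shape entity ID, "$", UUID. The Python sketch below assumes that shape, and assumes ranks appear in rules as the quoted literals "preferred", "normal" or "deprecated". The sample rule strings are made up for illustration and are not real filters from wikidata.org; `find_text_usage` is a hypothetical helper name.

```python
import re

# Assumed shape, based on the example GUID in the task description:
# an entity ID (Q/P/L + digits), a "$", then a UUID.
GUID_RE = re.compile(
    r"[QPL]\d+\$[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}"
)
# A statement rank appearing as a quoted literal inside a rule.
RANK_RE = re.compile(r"""["'](preferred|normal|deprecated)["']""")


def find_text_usage(rule: str) -> dict:
    """Report which kinds of entity "text" data a rule appears to match on."""
    return {
        "guid": bool(GUID_RE.search(rule)),
        "rank": bool(RANK_RE.search(rule)),
    }


# Hypothetical rule bodies, for illustration only.
sample_rules = [
    'added_lines contains "Q56596767$199BCB00-D1ED-40A5-B001-439BC5F434F7"',
    'added_lines rlike "deprecated"',
    'new_wikitext irlike "some spam phrase"',
]

for rule in sample_rules:
    print(find_text_usage(rule), rule)
```

A scan like this would only narrow down the 76 matches; rules that build patterns dynamically or match partial strings would still need a check by hand.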

Rules in entity namespaces (Item, Property, Lexeme)

Nothing for lexemes yet.

Statement GUIDs are provided as one of the lines in the "text". So are strings such as the following used by abuse filter rules? "Q56596767$199BCB00-D1ED-40A5-B001-439BC5F434F7"

This has been making abuse filter matching harder.

The rank of statement is also included as a line such as "normal". Is this used in abuse filter rules?

Sometimes.

Just a random comment: data actually used by existing abuse filters like the rank can be moved from added_lines to new AF variables defined via hooks.

Statement GUIDs are provided as one of the lines in the "text". So are strings such as the following used by abuse filter rules? "Q56596767$199BCB00-D1ED-40A5-B001-439BC5F434F7"

This has been making abuse filter matching harder.

Yup, the format just being a collection of lines is a pretty insane thing to have to try to match.

Just a random comment: data actually used by existing abuse filters like the rank can be moved from added_lines to new AF variables defined via hooks.

Indeed, to know what to move to different vars we would need some sort of overview of all of the elements used.

Is statement GUID used?
Is language ever used?
Are the reference etc hashes ever used?
Are various elements of some data types ever used? (datetimes have lots of 0s, for example for before, after etc)

Is statement GUID used?

Probably not.

Is language ever used?

It is usually matched against in summary.

Are the reference etc hashes ever used?

Probably not.

Are various elements of some data types ever used? (datetimes have lots of 0s, for example for before, after etc)

No. There are two filters which deal with complex datatypes (#55 and #93) but they don't need it. That doesn't mean we wouldn't want to create filters to check for invalid data, though...

So we could get rid of:

  • statement GUIDs
  • all hashes
  • language keys (actually already done for items)
  • some keys from complex values:
    • timevalues (before, after)
    • possibly some others.

We can either keep the current approach, which defines keys to be ignored at all levels of the JSON, or create a slightly more complex, layered method of filtering.

The current list is:

		return [
			'language',
			'site',
			'type',
		];

and could be something like:

		return [
			'language',
			'site',
			'type',
			'hash',
			'id',
			'before',
			'after',
		];

but this would have some slightly unexpected consequences: 'id' is a pretty generic key, so as well as dropping the statement GUIDs we would also be excluding the target ID for statements when they are added, etc.
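To illustrate the 'id' problem, here is a minimal Python sketch of ignoring keys at every level of the JSON. It is not the Wikibase implementation: `text_lines` is a hypothetical helper, the statement structure is heavily simplified, and the hash value is a placeholder.

```python
def text_lines(value, ignored_keys):
    """Flatten decoded JSON into string leaves, skipping ignored keys at every level."""
    if isinstance(value, dict):
        lines = []
        for key, child in value.items():
            if key in ignored_keys:
                continue  # drops the key and everything nested under it
            lines.extend(text_lines(child, ignored_keys))
        return lines
    if isinstance(value, list):
        lines = []
        for child in value:
            lines.extend(text_lines(child, ignored_keys))
        return lines
    return [str(value)]


# Simplified, hypothetical statement; real entity JSON has more fields.
statement = {
    "id": "Q56596767$199BCB00-D1ED-40A5-B001-439BC5F434F7",
    "rank": "normal",
    "type": "statement",
    "mainsnak": {
        "hash": "abc123",  # placeholder, not a real hash
        "property": "P31",
        "datavalue": {
            "type": "wikibase-entityid",
            "value": {"id": "Q5"},
        },
    },
}

current = {"language", "site", "type"}
proposed = current | {"hash", "id", "before", "after"}

print(text_lines(statement, current))
print(text_lines(statement, proposed))
```

With the proposed list, the GUID and the hash disappear as intended, but so does the target "Q5", because it also lives under a key named 'id'.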

This will need some slight refactoring in EntityContent and related classes; we need an efficient way to customize this per entity type.
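One possible shape for a layered, per-entity-type filter is sketched below in Python. Everything here is an assumption made for illustration: the names `IGNORED_PATHS`, `filtered_lines` and `entity_text`, the idea of keying rules by key path, and the simplified entity structure (claims as a flat list, placeholder hash and GUID values). It is not how EntityContent works.

```python
# Hypothetical per-entity-type rules: each rule is a path of dict keys.
# List indexes are not part of a path. Sets keep each lookup cheap.
IGNORED_PATHS = {
    "item": {
        ("type",),
        ("claims", "id"),  # statement GUID only
        ("claims", "mainsnak", "hash"),
        ("claims", "mainsnak", "datavalue", "value", "before"),
        ("claims", "mainsnak", "datavalue", "value", "after"),
    },
}


def filtered_lines(value, ignored_paths, path=()):
    """Flatten decoded JSON into string leaves, skipping exact key paths."""
    if isinstance(value, dict):
        lines = []
        for key, child in value.items():
            child_path = path + (key,)
            if child_path in ignored_paths:
                continue
            lines.extend(filtered_lines(child, ignored_paths, child_path))
        return lines
    if isinstance(value, list):
        lines = []
        for child in value:
            lines.extend(filtered_lines(child, ignored_paths, path))
        return lines
    return [str(value)]


def entity_text(entity):
    """Pick the rule set by entity type; unknown types are left unfiltered."""
    rules = IGNORED_PATHS.get(entity.get("type"), frozenset())
    return filtered_lines(entity, rules)


# Simplified, hypothetical item with one entity-valued and one time-valued statement.
item = {
    "type": "item",
    "claims": [
        {
            "id": "Q56596767$199BCB00-D1ED-40A5-B001-439BC5F434F7",
            "rank": "normal",
            "mainsnak": {
                "hash": "abc123",  # placeholder
                "property": "P31",
                "datavalue": {"value": {"id": "Q5"}},
            },
        },
        {
            "id": "Q56596767$PLACEHOLDER-GUID",  # placeholder
            "rank": "normal",
            "mainsnak": {
                "hash": "def456",  # placeholder
                "property": "P585",
                "datavalue": {
                    "value": {
                        "time": "+2019-02-05T00:00:00Z",
                        "before": 0,
                        "after": 0,
                        "precision": 11,
                    }
                },
            },
        },
    ],
}

print(entity_text(item))
```

Because the rule for 'id' is tied to the statement level, the target "Q5" survives while the GUIDs, hashes and before/after values are dropped. The cost is that each entity type needs its own rule set kept in sync with its JSON layout.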

Addshore closed this task as Resolved. Thu, Jun 20, 9:53 PM
Addshore claimed this task.

Going to close this investigation ticket now, as we have made some headway and know which step we will take next to chip some time off saving.
Will leave the parent ticket open for that work.

Restricted Application added a project: User-Addshore. Thu, Jun 20, 9:53 PM