[L] Exclude media topics from section topics dataset
Closed, Resolved, Public

Description

The evaluation of the section topics sample showed that media outlets (BBC News, Reuters, France Info, etc.) are tagged as section topics, often with high relevance scores. They add no descriptive or contextual value, pollute the data, and should not be tagged as section topics.

**AC**
Media keywords are not identified as section topics

Note: Cormac mentioned that it might be worth checking that there isn't any weirdness in the score computation that makes these popular media keywords score highly. If there is, then the problem is a different one and this ticket can be resolved.

Event Timeline

MarkTraceur renamed this task from Exclude media topics from section topics dataset to [L] Exclude media topics from section topics dataset. Dec 1 2022, 5:57 PM

There isn't much consistency in the statements on media outlet items (see the selection below).
Most have one or more P31 (instance of) statements whose value is one of these, or a subclass thereof:

  • Q11033 # mass media
  • Q1193236 # news media
  • Q2943864 # news satire
  • Q27881073 # fake news website
  • Q1331793 # media company
  • Q24354647 # editorial team
  • Q2001305 # television channel
  • Q56611639 # media industry
  • Q38926 # news
  • Q11578774 # broadcasting program

There are 1,506,831 items matching these.

Two other properties are also commonly used to describe media outlets: P452 (industry) and P136 (genre). Their value is often one of the Q-ids mentioned earlier, or one of:

  • Q11033 # mass media
  • Q56611639 # media industry
  • Q11030 # journalism
  • Q25245117 # telecommunications
  • Q3972943 # publishing

Combining all these properties and items, plus Q1002697 (periodical), we get this query, which returns 1,686,161 results:

SELECT DISTINCT ?item WHERE {
  { wd:Q11033 ^wdt:P279*/^(wdt:P31|wdt:P136|wdt:P452) ?item . } # instance of/genre/industry = mass media
  UNION
  { wd:Q1193236 ^wdt:P279*/^(wdt:P31|wdt:P136|wdt:P452) ?item . } # instance of/genre/industry = news media
  UNION
  { wd:Q2943864 ^wdt:P279*/^(wdt:P31|wdt:P136|wdt:P452) ?item . } # instance of/genre/industry = news satire
  UNION
  { wd:Q27881073 ^wdt:P279*/^(wdt:P31|wdt:P136|wdt:P452) ?item . } # instance of/genre/industry = fake news website
  UNION
  { wd:Q1331793 ^wdt:P279*/^(wdt:P31|wdt:P136|wdt:P452) ?item . } # instance of/genre/industry = media company
  UNION
  { wd:Q24354647 ^wdt:P279*/^(wdt:P31|wdt:P136|wdt:P452) ?item . } # instance of/genre/industry = editorial team
  UNION
  { wd:Q2001305 ^wdt:P279*/^(wdt:P31|wdt:P136|wdt:P452) ?item . } # instance of/genre/industry = television channel
  UNION
  { wd:Q56611639 ^wdt:P279*/^(wdt:P31|wdt:P136|wdt:P452) ?item . } # instance of/genre/industry = media industry
  UNION
  { wd:Q38926 ^wdt:P279*/^(wdt:P31|wdt:P136|wdt:P452) ?item . } # instance of/genre/industry = news
  UNION
  { wd:Q11578774 ^wdt:P279*/^(wdt:P31|wdt:P136|wdt:P452) ?item . } # instance of/genre/industry = broadcasting program
  UNION
  { wd:Q11030 ^wdt:P279*/^(wdt:P31|wdt:P136|wdt:P452) ?item . } # instance of/genre/industry = journalism
  UNION
  { wd:Q25245117 ^wdt:P279*/^(wdt:P31|wdt:P136|wdt:P452) ?item . } # instance of/genre/industry = telecommunications
  UNION
  { wd:Q3972943 ^wdt:P279*/^(wdt:P31|wdt:P136|wdt:P452) ?item . } # instance of/genre/industry = publishing
  UNION
  { wd:Q1002697 ^wdt:P279*/^(wdt:P31|wdt:P136|wdt:P452) ?item . } # instance of/genre/industry = periodical
}
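For readability, the same selection could probably be written with a single VALUES clause instead of one UNION branch per class. The following is only a sketch: the helper name `build_media_query` and the `MEDIA_CLASSES` mapping are made up for illustration, the Q-ids and labels are copied from the query above (each listed once), and the generated query has not been run against the Wikidata Query Service, so the result count is not confirmed to match.

```python
# Q-ids and labels copied from the query in this comment, each listed once.
# The labels only end up in SPARQL comments.
MEDIA_CLASSES = {
    "Q11033": "mass media",
    "Q1193236": "news media",
    "Q2943864": "news satire",
    "Q27881073": "fake news website",
    "Q1331793": "media company",
    "Q24354647": "editorial team",
    "Q2001305": "television channel",
    "Q56611639": "media industry",
    "Q38926": "news",
    "Q11578774": "broadcasting program",
    "Q11030": "journalism",
    "Q25245117": "telecommunications",
    "Q3972943": "publishing",
    "Q1002697": "periodical",
}


def build_media_query(classes=MEDIA_CLASSES):
    """Build one SPARQL query covering all classes (hypothetical helper)."""
    values = "\n".join(
        f"    wd:{qid}  # {label}" for qid, label in classes.items()
    )
    return (
        "SELECT DISTINCT ?item WHERE {\n"
        "  VALUES ?class {\n"
        f"{values}\n"
        "  }\n"
        "  # instance of / genre / industry = ?class or any subclass of it\n"
        "  ?class ^wdt:P279*/^(wdt:P31|wdt:P136|wdt:P452) ?item .\n"
        "}"
    )


if __name__ == "__main__":
    print(build_media_query())
```

Keeping the classes in one mapping makes it harder to list a class twice and easier to add or drop one later.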

This casts a very wide net, and it's practically impossible to validate how relevant all the results are. I went over a couple hundred results, though, and none of the ones I recognized looked like false positives.

Does the above list of properties/items for identifying media outlets (and excluding such topics) make sense? OK to move forward with this list?


Alternatively, we could check for items that have a property we only expect on media outlets (e.g. P449 (original broadcaster)). I haven't found that particularly helpful, though: such properties are so specific that they're missing for many items, in which case we're back to also including P31 etc. I don't think we should pursue this.


A selection of media outlet items and their relevant-ish statements:

Reuters:

  • instance of = news agency
  • industry = telecommunications

Associated Press:

  • instance of = news agency
  • instance of = photo agency
  • industry = news media
  • industry = media industry

CNN:

  • instance of = United States cable news
  • industry = journalism

ABC News:

  • instance of = television station
  • industry = journalism

Fox News:

  • instance of = United States cable news
  • industry = journalism

Late Night with Conan O'Brien:

  • instance of = television series
  • genre = talk show

The Onion:

  • instance of = weekly newspaper
  • instance of = news satire
  • industry = publishing

Playboy:

  • instance of = magazine
  • instance of = men's magazine
  • instance of = nude magazine

BBC News:

  • instance of = news desk
  • instance of = news broadcasting
  • industry = journalism

Bild:

  • instance of = daily newspaper
  • genre = tabloid journalism

Al Jazeera:

  • instance of = broadcaster
  • instance of = television station

France 24:

  • instance of = television channel
  • instance of = television station
  • industry = journalism

Charlie Hebdo:

  • instance of = newspaper
  • instance of = satirical newspaper
  • genre = political satire
  • genre = satirical newspaper

Het Nieuwsblad:

  • instance of = daily newspaper

Het Journaal:

  • instance of = television program
  • genre = news program

VTM Nieuws:

  • instance of = television program

Terzake:

  • instance of = television program
  • genre = current affairs

Vive le Vélo:

  • instance of = television program
  • genre = talk show

Humo:

  • instance of = magazine

Dag Allemaal:

  • instance of = periodical


LGTM -- let's get @mfossati's and @AUgolnikova-WMF's eyes on it.

Thanks @matthiasmullie for the deep dive in the Wikidata ontology!
Totally agree we can't come up with a complete solution.

I made some explorations as well:

  • the most generic class looks like media, which has 18,387 subclasses (query). Probably the widest net we could cast
  • querying all media instances times out: there are far too many
  • the more specific mass media class has 792 subclasses (query)
  • there are 271,628 mass media instances (query). Perhaps too narrow a net
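The linked queries are not reproduced in this thread, so here is a rough sketch of what such count queries might look like for a given class. The function names are hypothetical, and the exact path expressions are an assumption: `wdt:P279*` also matches the class itself, so counts may differ slightly from the numbers above depending on how the original queries were written.

```python
import re


def _check_qid(qid: str) -> str:
    """Reject anything that is not a plain Wikidata Q-id such as 'Q11033'."""
    if not re.fullmatch(r"Q[1-9]\d*", qid):
        raise ValueError(f"not a Wikidata Q-id: {qid!r}")
    return qid


def subclass_count_query(qid: str) -> str:
    """Count classes reachable from `qid` via subclass of (P279)."""
    return (
        "SELECT (COUNT(DISTINCT ?class) AS ?count) WHERE {\n"
        f"  ?class wdt:P279* wd:{_check_qid(qid)} .\n"
        "}"
    )


def instance_count_query(qid: str) -> str:
    """Count items that are an instance of (P31) `qid` or any subclass."""
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {\n"
        f"  ?item wdt:P31/wdt:P279* wd:{_check_qid(qid)} .\n"
        "}"
    )


if __name__ == "__main__":
    # Q11033 is mass media, as listed earlier in this task.
    print(subclass_count_query("Q11033"))
    print(instance_count_query("Q11033"))
```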

Conclusion: OK to move forward with @matthiasmullie's net, looks like the best trade-off.
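To make the acceptance criterion concrete, the exclusion step could look roughly like the sketch below. It is not the merged implementation: the `SectionTopic` record, its fields, and `exclude_media_topics` are hypothetical names, and the Q-ids in the usage example are placeholders rather than real Wikidata items. It assumes the media Q-ids returned by the query have already been loaded into a set.

```python
from typing import Iterable, List, NamedTuple


class SectionTopic(NamedTuple):
    """Hypothetical row shape for a section topic; not the real schema."""
    page: str
    section: str
    topic_qid: str
    score: float


def exclude_media_topics(
    rows: Iterable[SectionTopic], media_qids: Iterable[str]
) -> List[SectionTopic]:
    """Drop every row whose topic is in the media outlet set."""
    media = frozenset(media_qids)
    return [row for row in rows if row.topic_qid not in media]


if __name__ == "__main__":
    # Placeholder Q-ids, for illustration only.
    rows = [
        SectionTopic("Page A", "History", "Q_PLACEHOLDER_BBC_NEWS", 0.93),
        SectionTopic("Page A", "History", "Q_PLACEHOLDER_OTHER", 0.41),
    ]
    kept = exclude_media_topics(rows, {"Q_PLACEHOLDER_BBC_NEWS"})
    print(kept)
```

The filter ignores the score entirely, so a media outlet is dropped even when its relevance score is high, which is the case the task description complains about.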

Thanks! LGTM for the current version

It's merged, and the bot agrees! 😄 Closing.

Yes, but what happens the day the bot disagrees?!