
📊Filter out “page redirects”
Closed, Resolved | Public | 5 Estimated Story Points

Description

Page redirects should not be present in ImageMatching data.
This specific article type might need different logic than the rest of QID-based filters.

Acceptance criteria

  • Page redirect items have been filtered out from the production dataset.
  • Provide statistics on the number of pages that have been filtered out.

Note

  • We should investigate why some of these articles are still creeping into the dataset (filtering should already be in place).
  • This has elements of a spike. We need to investigate the most appropriate approach to filter out these items.

Event Timeline

I did some validation, and the filtering logic used in the notebook seems consistent with the output we have for the 2021-01 and 2021-02 runs on the whole set of languages. I could not find any "page redirect" instance in the data.

Redirect pages are a special filtering case. By default, categories are filtered based on their Wikidata "instance of" QID. Redirects, however, are excluded by selecting only pages where mediawiki_page.page_is_redirect is false in the Wikipedia dumps.
The documentation for the page_is_redirect column states that a value of 1 indicates the page is a redirect; it is 0 in all other cases.
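As a rough illustration of this filter, and of the "statistics on filtered pages" acceptance criterion, the sketch below replays the idea on a toy table using Python's built-in sqlite3. The table layout and rows are made up for illustration; only the page_is_redirect semantics (1 = redirect, 0 = otherwise) come from the column documentation, and the helper names are hypothetical.

```python
import sqlite3

# Toy stand-in for wmf_raw.mediawiki_page; rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table mediawiki_page "
    "(wiki_db text, page_id integer, page_is_redirect integer)"
)
conn.executemany(
    "insert into mediawiki_page values (?, ?, ?)",
    [
        ("enwiki", 1, 0),
        ("enwiki", 2, 1),  # redirect
        ("enwiki", 3, 0),
        ("ptwiki", 1, 1),  # redirect
        ("ptwiki", 2, 0),
    ],
)


def non_redirect_pages(conn):
    """Hypothetical helper: keep only pages that are not redirects."""
    return conn.execute(
        "select wiki_db, page_id from mediawiki_page "
        "where page_is_redirect = 0 order by wiki_db, page_id"
    ).fetchall()


def redirect_stats(conn):
    """Hypothetical helper: per wiki, (pages kept, pages filtered out)."""
    rows = conn.execute(
        """
        select wiki_db,
               sum(case when page_is_redirect = 0 then 1 else 0 end) as kept,
               sum(case when page_is_redirect = 1 then 1 else 0 end) as filtered_out
        from mediawiki_page
        group by wiki_db
        order by wiki_db
        """
    ).fetchall()
    return {wiki: (kept, filtered_out) for wiki, kept, filtered_out in rows}


print(non_redirect_pages(conn))
print(redirect_stats(conn))
```

On the real tables the same aggregation would presumably also need to be restricted to a snapshot; that detail is left out here to keep the sketch small.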

The query below shows that no articles with page_is_redirect=1 are present in the model output:

select im.snapshot, mp.page_is_redirect, count(*) as articles
from imagerec im
join wmf_raw.mediawiki_page as mp
  on im.wiki_db = mp.wiki_db
  -- compare page ids as strings
  and cast(im.page_id as string) = cast(mp.page_id as string)
  and im.snapshot = mp.snapshot
where im.wiki_db != '' and im.snapshot >= '2021-01'
group by im.snapshot, mp.page_is_redirect

Results in:

snapshot  page_is_redirect  articles
2021-01   false             4098151
2021-02   false             4571301

As an additional validation step, I checked for "page redirect" Wikidata items (e.g. https://www.wikidata.org/wiki/Q21528878) and found no instance in the data.
For example, a crude grep through the raw output returns no match for Q21528878. The query

select instance_of, count(*)
from imagerec_parquet
where wiki_db != '' and snapshot = '2021-02' and instance_of like '%21528878%'
group by instance_of;

returns no entries (only the empty result header):

instance_of  _c1
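The QID check above can be replayed on a toy table to see how it behaves when a redirect item does slip through. This is only a sketch using Python's built-in sqlite3: the rows, the "toy-dirty" snapshot label, the qid_matches helper, and the assumption that instance_of holds a plain QID string are all illustrative and not taken from the production data.

```python
import sqlite3

REDIRECT_QID = "Q21528878"  # Wikidata "page redirect" item named in the task

# Toy stand-in for imagerec_parquet; rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table imagerec_parquet "
    "(wiki_db text, snapshot text, page_id integer, instance_of text)"
)
conn.executemany(
    "insert into imagerec_parquet values (?, ?, ?, ?)",
    [
        ("enwiki", "2021-02", 1, "Q5"),
        ("enwiki", "2021-02", 2, "Q515"),
        ("ptwiki", "2021-02", 3, "Q5"),
        # Deliberately injected redirect item in a made-up snapshot.
        ("ptwiki", "toy-dirty", 4, "Q21528878"),
    ],
)


def qid_matches(conn, snapshot, qid):
    """Hypothetical helper: same substring match as the query in the task."""
    digits = qid.lstrip("Q")
    return conn.execute(
        """
        select instance_of, count(*)
        from imagerec_parquet
        where wiki_db != '' and snapshot = ? and instance_of like ?
        group by instance_of
        """,
        (snapshot, f"%{digits}%"),
    ).fetchall()


print(qid_matches(conn, "2021-02", REDIRECT_QID))
print(qid_matches(conn, "toy-dirty", REDIRECT_QID))
```

Note that the substring pattern can also match longer QIDs that merely contain the same digits, so it errs on the side of over-matching; an empty result is therefore still sufficient evidence that the redirect item is absent.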
sdkim claimed this task.
sdkim updated the task description.
sdkim moved this task from To Do to Done on the Image-Suggestions board.