Pageviews complete dumps have lots of rows with article name = '-'
Closed, Resolved · Public · BUG REPORT

Description

https://dumps.wikimedia.org/other/pageview_complete/readme.html

Steps to replicate the issue (include links if applicable):

What happens?:

  • Observe that many (most?) rows have '-' as the article title

What should have happened instead?:

  • The article title field should be the actual article title

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

N/A

Other information (browser name/version, screenshots, etc.):

Event Timeline

Restricted Application added a subscriber: Aklapper.

Hi @Audiodude, thank you for reporting your issue.

I have double checked the data and here are my findings for month 2024-04:

  • Automated traffic: 237 708 298 lines, among which 23 703 029 have the '-' title (~10%)
  • User traffic: 260 087 587 lines, among which 2 039 218 have the '-' title (~0.8%)
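As a quick sanity check, the two shares above can be recomputed from the reported line counts. This is only a sketch of the arithmetic; the helper name `dash_share` is hypothetical and not part of any Wikimedia tooling.

```python
def dash_share(dash_lines: int, total_lines: int) -> float:
    """Return the percentage of lines whose page title is '-'."""
    return 100.0 * dash_lines / total_lines


# Line counts for month 2024-04, copied from the findings above.
automated = dash_share(23_703_029, 237_708_298)
user = dash_share(2_039_218, 260_087_587)

print(f"automated traffic: {automated:.2f}% of lines have the '-' title")
print(f"user traffic:      {user:.2f}% of lines have the '-' title")
```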

The '-' value in the page-title field is the default we use when we don't know the title of the requested page, for instance for a URL like https://en.wikipedia.org/?curid=1786419 (the URL gives us the page ID but not the title). Users rarely access pages this way (see the relatively small percentage of lines with the '-' title), but it is more common in automated traffic, for instance when someone scans Wikipedia.
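To illustrate why a `curid` URL ends up with a '-' title, here is a minimal sketch of URL-based title extraction. It is an assumption about the general idea, not the actual Wikimedia pipeline: the function `title_from_url` and the constant `UNKNOWN_TITLE` are hypothetical names, and only the `/wiki/<title>` path form and the `title=` query parameter are handled.

```python
from urllib.parse import parse_qs, unquote, urlparse

# Placeholder used when the URL alone does not reveal the title.
UNKNOWN_TITLE = "-"


def title_from_url(url: str) -> str:
    """Best-effort page title from a request URL (hypothetical helper)."""
    parts = urlparse(url)

    # Canonical article URLs carry the title in the path: /wiki/<title>
    if parts.path.startswith("/wiki/"):
        title = unquote(parts.path[len("/wiki/"):])
        return title or UNKNOWN_TITLE

    # index.php-style URLs may carry it as a query parameter: ?title=<title>
    query = parse_qs(parts.query)
    if "title" in query:
        return query["title"][0]

    # Anything else (e.g. ?curid=<pageId>) identifies the page by ID only.
    return UNKNOWN_TITLE


print(title_from_url("https://en.wikipedia.org/wiki/Albert_Einstein"))
print(title_from_url("https://en.wikipedia.org/?curid=1786419"))
```

Under these assumptions, the first URL yields a title and the second falls through to '-', since resolving a page ID to a title would need a lookup that the URL itself cannot provide.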

We know that lacking the actual page-title is cumbersome, but as of now we have not devised a solution for this problem as it doesn't occur a lot on user traffic.

Audiodude claimed this task.

Awesome, thank you so much for the explanation! It might be worth adding that to the README here: https://dumps.wikimedia.org/other/pageview_complete/readme.html

> but as of now we have not devised a solution for this problem

A solution: https://phabricator.wikimedia.org/T366004#9837455 ;)