
Access to HTTP 404 logs for Wiktionary
Open, Low, Public

Description

Many users use Wiktionary by entering the word they are looking up directly in the URL, and if the entry doesn't exist, they get a 404 (page not found). Is it possible to get access to these 404 logs and share them with the community, so that the community can create entries for nonexistent pages that readers are looking up?


Version: unspecified
Severity: enhancement

Details

Reference
bz32514

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 12:01 AM
bzimport set Reference to bz32514.
bzimport added a subscriber: Unknown Object (MLST).

We don't log this information. You should be able to use the page views (http://dumps.wikimedia.org/other/) and compare it to the list of existing pages.

This would take some work, but I think it is your best bet.

We log /some/ of this information, but not all. We have 1:1000 sampled Apache logs that we use for internal analysis, but we don't release these publicly because they contain private data. I guess we could release anonymized versions of them, but we don't do that currently.

The page view statistics at e.g. http://stats.grok.se are obtained using the UDP logger that counts the requests for each page but doesn't write a log line to disk for each request (disk I/O tends to be the limiting factor here, AIUI). With the UDP log stream it should definitely be possible to produce 404 statistics.

Reopening because this isn't as impossible as suggested in comment 1.

The sampled logs wouldn't be of use here anyways; most misses would probably not even show up there.
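To put a rough number on that, here is a small sketch of the arithmetic. It assumes each request is sampled independently with probability 1/1000, which is a simplification of however the sampling is actually done, and the request count of 50 is only an illustrative figure.

```python
def chance_in_sample(requests: int, rate: float = 1 / 1000) -> float:
    """Probability that a title appears at least once in a sampled log.

    Assumes every request is kept independently with probability `rate`
    (a simplification; the real sampler may behave differently).
    """
    return 1 - (1 - rate) ** requests


if __name__ == "__main__":
    # Hypothetical example: a missing page requested 50 times.
    print(f"{chance_in_sample(50):.3f}")
```

Under that assumption, a title requested 50 times has only about a 5% chance of appearing in the sample at all.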

We could write code to grab 404 statistics, but that still wouldn't cover a chunk of the cases here (URLs that are well formed but point to an article that isn't written yet). Comparing a list of known viewed titles against known titles on the project is the best bet for right now, and doable immediately by anyone who can script a little bit.
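A minimal sketch of that comparison, for anyone who wants a starting point. It assumes the page view files have space-separated lines of the form "project title count bytes", that "en.d" is the project code for the English Wiktionary, and that the existing titles come from a one-title-per-line file; all of these should be checked against the actual dumps. The function name is made up for illustration.

```python
from collections import Counter
from typing import Iterable


def missing_title_counts(pageview_lines: Iterable[str],
                         existing_titles: Iterable[str],
                         project: str = "en.d") -> Counter:
    """Sum view counts for requested titles that are not in the title list."""
    existing = {title.strip() for title in existing_titles}
    counts: Counter = Counter()
    for line in pageview_lines:
        parts = line.split()
        # Skip malformed lines and lines for other projects.
        if len(parts) < 3 or parts[0] != project:
            continue
        title = parts[1]
        try:
            views = int(parts[2])
        except ValueError:
            continue
        if title not in existing:
            counts[title] += views
    return counts


if __name__ == "__main__":
    sample = ["en.d cat 10 100", "en.d floofy 3 50", "en.d floofy 4 50"]
    for title, views in missing_title_counts(sample, ["cat", "dog"]).most_common(300):
        print(views, title)
```

Titles in the two sources may be encoded differently (percent-encoding, underscores versus spaces), so in practice they would need to be normalized before the lookup.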

EN.WP.ST47 wrote:

November 1-3, en.wikt, top 300 requested non-existent pages

I have attached the 300 most commonly requested pages that do not exist on the English Wiktionary. These pages were requested 50 times or more in a three-day period. Some of the titles look a little strange, for example "%25D8%25AC%25D9%2585%25D8%25A7%25D8%25B9", as though they were URL-encoded twice, but the names used are the ones I got from the pageviews dumps. These mostly contain URL fragments (index.php is the number 1 most requested, but there are also some strange ones) as well as years and Unicode gunk. I'll let the script keep running and update the attachment with data for the full month.

Nice, can a monthly log of this be made available somewhere like dumps.wikimedia.org? My original request was made with the Tamil Wiktionary in mind; the URL encoding needs to be decoded for the final output to be useful, and Unicode might not be junk there. After we get the data from across Wiktionaries over a period, we could probably find patterns to exclude junk and give some useful data to the community.
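As a rough sketch of the decoding step, something like the following would turn a repeatedly percent-encoded title back into readable text. It assumes titles are UTF-8 and encoded at most a few times; the function name and the round limit are made up for illustration.

```python
from urllib.parse import unquote


def decode_title(raw: str, max_rounds: int = 3) -> str:
    """Undo repeated percent-encoding and show underscores as spaces."""
    title = raw
    for _ in range(max_rounds):
        decoded = unquote(title)  # decodes %XX sequences as UTF-8
        if decoded == title:
            break  # nothing left to decode
        title = decoded
    return title.replace("_", " ")


if __name__ == "__main__":
    print(decode_title("%25D8%25AC%25D9%2585%25D8%25A7%25D8%25B9"))
```

Decoding until the string stops changing would mangle a title that legitimately contains a literal percent sequence, so the output is best treated as a hint for human review.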

So what is this currently, a request for an automated regularly updated filtered version of the udp logs? (Maybe to be sent to the analytics team?)
A request for a tool which works on stats.grok.se data or its replacement of the mysterious future?
Please clarify summary and component.

wikimaas wrote:

Not exactly the same, but I developed a small JavaScript extension that is handy for dealing with the 404s described in Srikanth Logi's scenario.

See http://en.wiktionary.org/wiki/Wiktionary:Beer_parlour/2013/February#Yahoo_Pipe_for_404s for the thread.

I realize it would probably be better if the API were parsed entirely inside Wiktionary, without using Pipes, but that's the idea.

Srikanth: Could you answer comment 6, please?

[removing ops keyword -> analytics area]

sumanah wrote:

Pinging Srikanth once more. :)

Restricted Application added a subscriber: Aklapper. Mar 13 2016, 10:14 AM