
Review Commons Queries for SDAW
Closed, ResolvedPublic

Description

As suggested in one of our recent SDAW meetings, review Commons queries to determine the distribution of queries along various dimensions, and look for interesting patterns.

Topics include review of distribution of scripts and languages, usage of keywords, frequency distribution of queries and results, keyword counts, and any other patterns that pop up.

I'm not 100% sure of the scope of T252544, so this may or may not include that task.

See also: T252544: Gather statistics on head and tail query distribution on Commons

Event Timeline

TJones renamed this task from Review Commons Queries to Review Commons Queries for SDAW.Jul 17 2020, 9:01 PM

The full report is on MediaWiki, and the quick summary is below.

In three months' worth of likely-human queries issued on Commons, over 90% are in the Latin script, about 50% are in English, almost 25% are names, and almost 10% are porn-related.

Among the most common queries, 8 of the top 10 and 66 of the top 100 are porn-related, but even the most common queries are not really that common: only 6 of the more than 1.04M unique (lightly normalized) queries were searched 1,000 times or more, and only 660 were searched 50 times or more. Over 950K were searched only once. There is not really a head; it's pretty much all long tail.
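
For reference, a minimal sketch of how such a frequency breakdown could be computed; the file name and the normalization rules here are assumptions for illustration, not the report's actual pipeline:

```python
import re
from collections import Counter

def normalize(query: str) -> str:
    """Light normalization (illustrative): lowercase, trim, collapse whitespace."""
    return re.sub(r"\s+", " ", query.strip().lower())

# Hypothetical input: one raw query string per line.
with open("commons_queries.txt", encoding="utf-8") as f:
    counts = Counter(normalize(line) for line in f if line.strip())

print("unique queries:", len(counts))
print("searched 1,000+ times:", sum(1 for c in counts.values() if c >= 1000))
print("searched 50+ times:", sum(1 for c in counts.values() if c >= 50))
print("searched only once:", sum(1 for c in counts.values() if c == 1))
```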

In a sample of 100 random queries (the long tail), 30 were specific things, 22 people, 14 places, 11 organizations, and 12 were porn. 60 queries were narrow and fairly specific, 17 were fairly broad, and 22 were in the middle. (Broad queries were often one word.)

In a sample of the 100 most common queries (the head-ish), 66 were porn, 7 were looking for "facts", 7 were specific things, 6 were current events, 5 were people. 24 queries were narrow and fairly specific, 46 were fairly broad, and 27 were in the middle. (Broad queries were often one word.)

Only 1.6% of queries used a namespace, and 0.9% had a file extension. Boolean and special operators were very rare.
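
As an illustration of the kind of pattern checks involved, here is a hedged sketch; the regexes are assumptions and are cruder than whatever the actual analysis used:

```python
import re

# Illustrative patterns only; not the report's actual detection rules.
NAMESPACE_RE = re.compile(r"^\s*\w+:")                         # e.g. "Category:cats"
FILE_EXT_RE = re.compile(r"\.(jpe?g|png|gif|svg|tiff?|pdf|ogg|webm)\b", re.IGNORECASE)
OPERATOR_RE = re.compile(r'(^|\s)(AND|OR|NOT)(\s|$)|["*~-]')   # crude boolean/special check

def classify(query: str) -> dict:
    return {
        "has_namespace": bool(NAMESPACE_RE.search(query)),
        "has_file_extension": bool(FILE_EXT_RE.search(query)),
        "has_operator": bool(OPERATOR_RE.search(query)),
    }

print(classify("Category:Brian Cox"))
print(classify('"salt mine" -himalayan'))
```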

10% of queries got zero results. Less than 1% got a million results or more.

If we break queries on whitespace and punctuation (less than ideal, but easy), 66% of queries are one or two words; 93% are four words or fewer.
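
A quick sketch of that word-count breakdown (the sample queries are placeholders; the real analysis ran over the full corpus):

```python
import re
from collections import Counter

def word_count(query: str) -> int:
    """Break on whitespace and punctuation, as described above (crude but easy)."""
    return len([t for t in re.split(r"[\W_]+", query) if t])

sample = ["brian cox", "albrt einstein", "Is this khewre salt himalayan"]  # placeholders
dist = Counter(word_count(q) for q in sample)
total = sum(dist.values())
for n in sorted(dist):
    print(f"{n} word(s): {dist[n] / total:.0%}")
```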

Thanks @TJones, this is really great - and generally what we expected.

Is it possible to get some numbers on how many haswbstatement: queries there are and what those queries look like? This would be helpful in knowing how often the "files depicting" feature in the main dropdown is being used, how successful "files depicting" searches are, and might have some impact on ranking decisions in MediaSearch. (Let me know if this should be a new ticket)

@CBogen, it was a quick and easy check. The short answer is that there were none. I added the long answer to my write up, and I'll repeat it here:

There are no special keywords with colons other than namespaces in my sample. ... I found two instances of haswbstatemen, six haswbstatement, and one sshaswb, none with any other search terms. There was one malformed query: haswbstatementP180=Q42133786, but it is also missing the colon.
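
For what it's worth, a minimal sketch of the kind of scan that finds colon keywords and haswbstatement near-misses; the patterns are illustrative assumptions, not the check actually run:

```python
import re

COLON_KEYWORD_RE = re.compile(r"\b([a-z_]+):", re.IGNORECASE)

def colon_keywords(query: str) -> list:
    """Return colon-prefixed keywords (namespaces included) found in a query."""
    return COLON_KEYWORD_RE.findall(query)

def looks_like_haswbstatement(query: str) -> bool:
    """Loose substring check that also catches malformed variants like
    'haswbstatemen', 'sshaswb', or 'haswbstatementP180=Q42133786'."""
    return "haswb" in query.lower()

print(colon_keywords("haswbstatement:P180=Q42133786"))            # ['haswbstatement']
print(looks_like_haswbstatement("haswbstatementP180=Q42133786"))  # True
```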

Keep in mind that my sample is not 100% of all queries because of the filters I used to exclude queries from non-humans and not-normal humans. Editors and others who issue >100 queries in a day would be filtered, for example. My sample has about 1.5M queries and Erik's Top N queries has about 3M, so there may be more well-formed haswbstatement queries in the full dataset.
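
A rough sketch of the >100-queries-per-day filter described above; the log schema is an assumption, and the real filters (bot detection, etc.) are more involved than this:

```python
from collections import defaultdict

MAX_QUERIES_PER_DAY = 100  # threshold mentioned above

def filter_likely_human(rows):
    """rows: iterable of (user_key, date, query) tuples (hypothetical schema).
    Drops all queries from any (user, day) pair that exceeds the threshold."""
    rows = list(rows)
    per_user_day = defaultdict(int)
    for user_key, date, _ in rows:
        per_user_day[(user_key, date)] += 1
    heavy = {k for k, n in per_user_day.items() if n > MAX_QUERIES_PER_DAY}
    return [(u, d, q) for (u, d, q) in rows if (u, d) not in heavy]
```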

It would also be great if we could do an analysis of the 10% of queries that give zero results. What types of queries are they? This will help us determine whether they're something we should target.

I've added the Zero-Results Query Sub-Corpus Analysis to my write up.

Summary:

In three months' worth of likely-human queries issued on Commons, zero-result queries make up about 10% of all queries (which is less than the zero-results rate on Wikipedias). Subjectively, the zero-results queries seem to have less junk than on Wikipedia, and so may be more salvageable. Also, there seem to be more spelling errors/typos in the zero-results queries.

80% of the zero-results queries are in the Latin script (compared to over 90% in the total corpus). Only 32% are in English (vs 50%), and roughly 25% are names (same as overall). Only 6.5% are porn-related (vs 9.5% overall).

Only 31 of the top 100 most common zero-results queries are porn-related, vs 66 overall.

Zero-results queries are more heavily skewed toward unique queries.

In a sample of 200 random zero-results queries (the long tail), 37% were about specific things, 20.5% people, 13% places, 5% facts, 3% organizations, and 6.5% were porn. This is roughly similar to the full corpus, with a bit less porn. 60% of zero-results queries were narrow and fairly specific, 10% were fairly broad, and 22.5% were in the middle. (Broad zero-results queries were often one word.) This is very similar to the full corpus.

In a sample of the 100 most common zero-results queries (the head-ish), 31 were porn, 28 were specific things, 23 were people. This is much more specific and has half the porn of the full corpus. 57 queries were narrow and fairly specific, 31 were fairly broad, and 9 were in the middle. (Broad zero-results queries were often one word.) This is skewed much more toward narrow queries compared to the full corpus.

Breaking on whitespace and punctuation (less than ideal, but easy), 60% of queries are one or two words; 86% are four words or fewer. These rates are slightly lower than in the full corpus. More than half of all high-token queries (10+ words) give zero results.

Spelling errors seem more common in the zero-results queries (and there is less junk in the zero-results queries than in Wikipedia data); 32% of the random sample of zero-results queries have spelling errors, and 38% of the top 100 zero-results queries have spelling errors. "Did you mean" suggestions and the completion suggester do okay, but could be much better. The current completion suggester doesn't have much to work with because it is limited to page/file/category names, which are not always good matches with what people are searching for. T250436 could be a big help!

The most common zero-results queries are very specific and don't show much variation under normalization (e.g., variation in capitalization or punctuation), which I interpret as either one person repeating the search over and over, someone linking to the search results, or a similar "non-organic" source.

Thanks Trey, this is great!

Is it possible to get some examples of the "specific" searches that are returning zero results, e.g. the 41 about specific people and the 26 about specific places? We'd like to understand whether they are about people or places that are simply not covered on commons, or whether there's another reason they're returning zero results.

The other thing that @Ramsey-WMF and I took from this is that we should integrate "did you mean" into media search. I'm thinking I should create a new ticket for that. It would also be great to look into potential approaches to improve "did you mean" - has exploration on this been done before?

@CBogen, I'll try to get some examples, but we have to keep PII in mind. It shouldn't be a problem, though.

It would also be great to look into potential approaches to improve "did you mean" - has exploration on this been done before?

Improving DYM ("did you mean") is the focus of the Glent project (the sub-parts of which we also call "Method 0", "Method 1", and "Method 2" when Erik is not around to remind us that he gave things more mnemonic names). See T212884 and sub-tasks for more.

Method 0 is in production now; it's high precision but low impact/low recall.

Method 1 ran into some trouble because it was much higher recall and some of the suggestions were not very good. This led to a side project to create a much better (though more expensive) version of edit distance that gives better (higher precision) results for Method 1.

The issues with Method 1 led to abandoning the A/B test and re-thinking the scoring to be based on both frequency and edit distance, and a better method for combining scores from Method 0 and Method 1. This still needs a little bit of analysis and implementation.
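
The exact scoring isn't spelled out here, but as a rough illustration of combining frequency with edit distance (the names, weights, and similarity function below are all assumptions, not Glent's actual formula):

```python
import math
from difflib import SequenceMatcher  # stand-in for Glent's custom edit distance

def edit_similarity(query: str, suggestion: str) -> float:
    """Rough string similarity in [0, 1]."""
    return SequenceMatcher(None, query, suggestion).ratio()

def combined_score(query: str, suggestion: str, suggestion_freq: int) -> float:
    """Frequent suggestions win, but only when they stay close to the query."""
    return math.log1p(suggestion_freq) * edit_similarity(query, suggestion)

print(combined_score("albrt einstein", "albert einstein", 5000))
print(combined_score("albrt einstein", "albert camus", 8000))
```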

Method 2 is specific to CJK languages to handle the fact that they aren't alphabetic, so edit distance is less meaningful. It still needs to be reviewed, but I need outside help because I can't grok typos in CJK languages easily.

When I'm not working on SDAW, I'm trying to get back to the Method 1 analysis and implementation, and then the Method 2 analysis (which, one hopes, will lead to implementation, as the early quick look suggests it should).

we should integrate "did you mean" into media search. I'm thinking I should create a new ticket for that.

Yeah... probably. It does do well for some of the more concrete zero-results queries that don't have too many errors. albrt einstein gets corrected nicely, for example. For less targeted searches (Is this khewre salt himalayan) it is hit and miss (that specific example works well). It can give ridiculous suggestions, though—hence the Glent project.

ah, thank you! This closed a lot of the gaps in my understanding of Glent :)

One of the ideas @Ramsey-WMF and I were tossing around was to possibly incorporate Wikidata terms to improve 'did you mean'. Sounds like something that might fall under Method 2?

URL bits in queries

@CBogen & @Ramsey-WMF: One thing I forgot to mention in my summary above (though it is in the full write up) is that ~9% (1 in 11) of the queries that have zero results have what looks like Google Image Search URL cruft in the query. For example (I've changed the tbnid and docid to random strings just in case they are meaningful):

  • Brian Cox (physicist)&tbnid=6yhg46yhsgdetyh&vet=1&docid=8ikash-ujs7uhg&w=960&h=1440&q=Professor+Brian+Cox&source=sh/x/im

Both Brian Cox (physicist) and Professor Brian Cox are fine queries with good results.

I have no idea how this extra cruft is getting into the queries, and whether it is a long-term problem or something that will resolve itself over time. Someone more familiar with connecting queries and user agents and IPs and such could try to find the source of this. tbnid is the key term to look for in the queries.
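
For anyone who wants to poke at this, a small sketch of pulling the cruft apart (it assumes the cruft always starts at "&tbnid=", which may not hold in general):

```python
import re
from urllib.parse import parse_qs

CRUFT_RE = re.compile(r"&tbnid=")  # key marker mentioned above

def split_cruft(query: str):
    """Split a crufty query into its leading search text and the
    Google-Image-style parameters that follow."""
    m = CRUFT_RE.search(query)
    if not m:
        return query, {}
    head = query[: m.start()]
    params = parse_qs(query[m.start() + 1 :])
    return head, params

head, params = split_cruft(
    "Brian Cox (physicist)&tbnid=6yhg46yhsgdetyh&vet=1&docid=8ikash-ujs7uhg"
    "&w=960&h=1440&q=Professor+Brian+Cox&source=sh/x/im"
)
print(head)             # Brian Cox (physicist)
print(params.get("q"))  # ['Professor Brian Cox']
```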

Searching Commons for tbnid returns ~850 results, which mostly look like URLs for Google Image searches. Some may be in the auxiliary text, which is weird. I didn't look too closely...

Anyway, figuring out this tbnid situation could either fix a lot of that 9% of the zero-result queries (if it can be fixed at the source), or we can stick with the status quo and at least know that ~9% of the zero-result queries are really not our fault! :)

I created T260292 to track adding 'did you mean' to Media Search. I'm thinking this work is on the Structured Data team side, but if you feel differently let me know!

ah, thank you! This closed a lot of the gaps in my understanding of Glent :)

One of the ideas @Ramsey-WMF and I were tossing around was to possibly incorporate Wikidata terms to improve 'did you mean'. Sounds like something that might fall under Method 2?

Method 2 uses dictionaries and other sources of likely errors for CJK input to generate suggestions. There are many input methods (IME) for Chinese characters, for example; choosing the wrong option from the drop-down list suggested by the IME would be a likely error.

Longer-term, we could look at using Wikidata to improve DYM (we've talked about that before, but we'd have to review the options again); that would be a new project, though.

I created T260292 to track adding 'did you mean' to Media Search. I'm thinking this work is on the Structured Data team side, but if you feel differently let me know!

Sounds right to me.

URL bits in queries

  • Brian Cox (physicist)&tbnid=6yhg46yhsgdetyh&vet=1&docid=8ikash-ujs7uhg&w=960&h=1440&q=Professor+Brian+Cox&source=sh/x/im

I've noticed a new pattern here; all of the beginning parts of these queries match Commons categories. My new hypothesis is that there is a tool of some sort out there that takes a query on Google Images and tries to determine the matching category on Commons. If that's the case, the tool is clearly broken. Given the volume, it's either a bot or a moderate number of people using it.
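
A hypothetical way to test that hypothesis (category_titles would have to come from a category dump or the API; this is only an assumption sketch):

```python
def matches_commons_category(query: str, category_titles: set) -> bool:
    """Check whether the text before '&tbnid=' exactly matches a known category title."""
    head = query.split("&tbnid=", 1)[0].strip()
    return head in category_titles

print(matches_commons_category(
    "Brian Cox (physicist)&tbnid=6yhg46yhsgdetyh&vet=1",
    {"Brian Cox (physicist)"},
))  # True
```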

Is it possible to get some examples of the "specific" searches that are returning zero results, e.g. the 41 about specific people and the 26 about specific places? We'd like to understand whether they are about people or places that are simply not covered on commons, or whether there's another reason they're returning zero results.

I checked a bunch of people and places for PII and listed them in my write-up. Since I was checking for and removing PII, I added notes on what's going on with the people and places and my best guess as to why they got no results. I have notes for 16 people, a summary for 20 other people (without listing them; researching people is tedious), notes on 3 other, less specific kinds/groups of people, and notes on 25 places.

During the much closer review, some queries changed categories and I recognized a few more typos. The stats reported above are still in the right ballpark, though.

Thanks Trey! The biggest takeaway for me is that DYM is not catching enough misspellings, typos, or extra spaces. So a project to improve DYM is probably worthwhile in the future.

It seems like a lot of the cases where foreign language queries turn up zero results could be significantly improved by the eventual proliferation of depicts statements, too.

I think I'm happy with this analysis at this point unless @Ramsey-WMF has any other questions or follow up requests!

Gehel claimed this task.