
Gather statistics on head and tail query distribution on Commons
Closed, ResolvedPublic

Description

The SD team is interested in learning more about the distribution of head and tail queries on Commons. This will help determine the direction in which we go with design in the new Media Search prototype.

For example, do we get mostly broad, general searches as head queries, which users would need help narrowing down? Or are people getting stuck on really specific tail queries, where help broadening searches or suggesting related searches might be useful?

Event Timeline

@CBogen: So rather than just gathering statistics on the distribution of query frequency (which I plan to do for T258297), for this task you also want someone to manually review a sample of queries from the head and tail and get a sense of their specificity? If so, do you have a sense of the necessary sample size? Determining query intent can be a lot of work. Or you can cheat and look at query term count as a proxy for specificity—one-word queries are generally less specific, for example.

@TJones Yes, we were trying to review a sample of queries and get a sense of specificity, but don't have a sense of the necessary sample size. However, we've used other methods (user design research, mainly) to determine the direction we're going with concept chips, so this is no longer urgent. T257361 is still really important to the design of concept chips, though. It was assigned to @EBernhardson, but maybe it falls under T258297?

T257361 includes the automatic semantic grouping that Erik worked on, so that's for him to do. It is arguably related to T258297, but as a practical matter they should stay separate. As part of T258297 I'll get a sense of what is head and what is tail, at least. If it doesn't take too long, I'll see if I can do a meaningful review of the head and tail for this ticket, too.

I'll look at this a bit as part of T258297.

The full report is on MediaWiki; a quick summary is below, with the bits most relevant to this ticket bolded.

In three months' worth of likely-human queries issued on Commons, over 90% are in the Latin script, about 50% are in English, almost 25% are names, and almost 10% are porn-related.

Among the most common queries, 8 of the top 10 and 66 of the top 100 are porn-related, but even the most common queries are not really that common. Out of over 1.04M unique (lightly normalized) queries, only 6 were searched 1,000 times or more, and only 660 were searched 50 times or more. Over 950K were searched only once. There is not really a head; it's pretty much all long tail.
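For anyone who wants to reproduce this kind of frequency breakdown on their own query sample, here is a minimal sketch. It is not the code used for the report. The normalization (lowercasing, trimming, collapsing whitespace) is an assumption about what "lightly normalized" might mean, and the function names and toy data are made up for illustration.

```python
import re
from collections import Counter


def normalize(query: str) -> str:
    """Light normalization (an assumption, not the report's exact rules):
    lowercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", query.strip().lower())


def frequency_summary(queries, thresholds=(50, 1000)):
    """Count how often each normalized query occurs and summarize the
    shape of the distribution: unique queries, queries seen only once,
    and queries seen at least `t` times for each threshold `t`."""
    counts = Counter(normalize(q) for q in queries if q.strip())
    summary = {
        "unique_queries": len(counts),
        "singletons": sum(1 for c in counts.values() if c == 1),
    }
    for t in thresholds:
        summary[f"searched_{t}_or_more"] = sum(
            1 for c in counts.values() if c >= t
        )
    return summary


# Toy data; real thresholds would be more like the 50 and 1,000 used above.
sample = ["Cat", "cat ", "  CAT", "eiffel  tower", "Eiffel Tower", "mona lisa"]
summary = frequency_summary(sample, thresholds=(2, 3))
print(summary)
```

If nearly all unique queries land in the singletons bucket and almost none clear the higher thresholds, that is the "no real head" pattern described above.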

In a sample of 100 random queries (the long tail), 30 were specific things, 22 people, 14 places, 11 organizations, and 12 were porn. 60 queries were narrow and fairly specific, 17 were fairly broad, and 22 were in the middle. (Broad queries were often one word.)

In a sample of the 100 most common queries (the head-ish), 66 were porn, 7 were looking for "facts", 7 were specific things, 6 were current events, 5 were people. 24 queries were narrow and fairly specific, 46 were fairly broad, and 27 were in the middle. (Broad queries were often one word.)

Only 1.6% of queries used a namespace, and 0.9% had a file extension. Boolean and special operators were very rare.

10% of queries got zero results. Less than 1% got a million results or more.

If we break queries on whitespace and punctuation (less than ideal, but easy), 66% of queries are one or two words; 93% are four words or fewer.
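A rough sketch of that word-count approach, in case it is useful for follow-up questions. The splitting rule (any run of non-word characters or underscores) is my assumption about "whitespace and punctuation", and the helper names and sample queries are hypothetical. Note that text in scripts written without spaces would be counted as one word per unbroken run, which is part of why this is less than ideal.

```python
import re
from collections import Counter

# Split on runs of non-word characters (whitespace, punctuation) and underscores.
# In Python 3, \W is Unicode-aware for str patterns by default.
TOKEN_SPLIT = re.compile(r"[\W_]+")


def word_count(query: str) -> int:
    """Number of non-empty tokens after splitting on whitespace/punctuation."""
    return len([t for t in TOKEN_SPLIT.split(query) if t])


def cumulative_share(queries):
    """Map each observed word count n to the fraction of queries
    that have n words or fewer."""
    counts = Counter(word_count(q) for q in queries)
    total = sum(counts.values())
    running = 0
    shares = {}
    for n in sorted(counts):
        running += counts[n]
        shares[n] = running / total
    return shares


sample = [
    "cat",
    "eiffel tower",
    "new york city skyline",
    "File:Mona_Lisa.jpg",  # namespace and extension each count as a "word"
    "st. peter's basilica rome at night",
]
shares = cumulative_share(sample)
for n, share in shares.items():
    print(f"{n} words or fewer: {share:.0%}")
```

The same word counts could also serve as the cheap specificity proxy mentioned earlier in the thread, with one-word queries treated as likely broad.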

This is great, thanks for your work on this @TJones! Excited to dig into this more and see if there are any potential UX changes that could improve search on Commons.

@mwilliams, if you think of any other similar questions about Commons queries while looking through the write-up, let me know, and I'll see if I can answer them.
