We've been talking a lot about how to distinguish automated traffic in search, not so we can exclude it but so we can tag it. At the moment our options are maintaining and expanding hideously complex regular expressions, hand-coding classifications manually, or giving up.
What if we asked (Wikimedia-specific) clients that are making automated requests to provide something that could be written to the x_analytics field? The field is already used by some clients, which suggests the approach is viable, and because it is cleanly key-value separated it is far easier and less computationally expensive to handle than user agents, which require painful regular expressions.
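To illustrate why key-value separation is cheaper than user-agent matching, here's a minimal sketch of parsing an x_analytics-style string; it assumes the common "key=value" pairs delimited by semicolons, and the field contents ("client", "automated") are hypothetical examples, not an agreed schema:

```python
def parse_x_analytics(field: str) -> dict:
    """Parse a semicolon-delimited key=value string into a dict.

    One linear pass, no regular expressions, unlike user-agent matching.
    """
    pairs = {}
    for item in field.split(";"):
        if "=" in item:
            key, _, value = item.partition("=")
            pairs[key.strip()] = value.strip()
    return pairs

# A hypothetical client self-identifying as automated:
parse_x_analytics("client=my-bot;automated=1")
# → {'client': 'my-bot', 'automated': '1'}
```

Tagging then becomes a dictionary lookup rather than a regex pass over every request.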
Oliver will write up a discussion draft to circulate.