Fri, Jul 20
This is (unsurprisingly) still going on. Over the last 30 days we have ~10k unique fields. A quick parse shows the average day has at least 1000 fields, and the worst ~1600. This seems like a problem that needs a better solution than asking developers to pretty-please log within an (unwritten) set of guidelines. As time goes by, a pre-defined list of a hundred or so fields that will be indexed starts to seem reasonable. It would be nice if we could copy the text content of everything else into some unstructured indexed field so it stays searchable, but it's not the end of the world if we can't.
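A minimal sketch of what that could look like, assuming events arrive as flat-ish dicts; the whitelist contents and the `extra_text` field name are made up for illustration:

```python
import json

# Hypothetical whitelist of fields that get their own indexed mapping.
INDEXED_FIELDS = {"@timestamp", "host", "channel", "level", "message", "wiki"}

def normalize(event):
    """Split a log event into whitelisted fields plus one catch-all
    text field holding everything else, so stray fields stay
    searchable without each getting its own index mapping."""
    out = {k: v for k, v in event.items() if k in INDEXED_FIELDS}
    extras = {k: v for k, v in event.items() if k not in INDEXED_FIELDS}
    if extras:
        # Flatten the rest into one unstructured, full-text blob.
        out["extra_text"] = " ".join(
            "{}={}".format(k, json.dumps(v, default=str))
            for k, v in sorted(extras.items())
        )
    return out

print(normalize({"@timestamp": "2018-07-20T12:00:00Z", "level": "ERROR",
                 "message": "boom", "some_random_field": {"a": 1}}))
```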
Thu, Jul 19
Parsing is one of the more expensive things in mediawiki. Because of that expense, the ParserOutput is serialized into a multi-layer cache (memcached, then mysql), so for the most part you shouldn't have to actually do the parsing. This could probably be done via the parse api, but we explicitly ask other people not to do that and to download dumps instead (across all wikis it's something like 250-300M pages to parse). Some code could probably be written and run in the production job queue to iterate through all the pages and emit data to be picked up in analytics, but I can't really guess how long that would run or how much concurrency would be reasonable. There are also alternate datastores like restbase/cassandra that keep the rendered html in a database; I don't know enough about the storage model to say whether that would be easy to source from.
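For the dump route, a rough sketch of streaming pages out of a pages-articles XML dump using only the stdlib; the dump filename and schema version are placeholders, and a real run would hand the wikitext off to an actual parser:

```python
import bz2
import xml.etree.ElementTree as ET

# Hypothetical local dump path; real dumps come from dumps.wikimedia.org.
DUMP = "enwiki-latest-pages-articles.xml.bz2"
# The export schema version varies between dumps; check the file header.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(path):
    """Stream (title, wikitext) pairs out of a pages-articles dump
    without loading the whole file into memory."""
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext("{0}revision/{0}text".format(NS))
                yield title, text or ""
                elem.clear()  # drop the parsed subtree as we go

for i, (title, text) in enumerate(iter_pages(DUMP)):
    # ...hand `text` off to whatever does the parsing/analysis...
    if i >= 2:
        break
```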
Wed, Jul 18
Ran a quick test of the data volume we will be shipping over kafka; it looks like we will be generating around 2-3GB of compressed (~15GB uncompressed) data into kafka from the once-a-week batches.
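That works out to roughly a 5-7x compression ratio. A back-of-the-envelope way to reproduce that kind of estimate, with a made-up record shape standing in for the real batch data:

```python
import gzip
import json
import random

# Hypothetical record shape; substitute a sample from the real batch.
records = [json.dumps({"page_id": i, "score": random.random()}).encode()
           for i in range(100_000)]

raw = b"\n".join(records)
compressed = gzip.compress(raw)

ratio = len(raw) / len(compressed)
print("uncompressed: %.1f MB" % (len(raw) / 1e6))
print("compressed:   %.1f MB (%.1fx)" % (len(compressed) / 1e6, ratio))
# Multiply the measured ratio by the expected weekly batch size to
# project the per-batch volume landing in kafka.
```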