Tue, Aug 20
@mpopov Any movement here? No huge rush but this will let us stop generating all events twice
We need to rework our updater a little bit to share some expensive work before the partitioned jobs, but pull the ContentHandler data per-job. Shouldn't be that much work, but needs to be done on our end so the cirrusSearchElasticaWrite job can be partitioned
Mon, Aug 19
Try to space out the elastic nodes in the row evenly across the 1G racks.
Fri, Aug 16
Thu, Aug 15
Wed, Aug 14
Known problem related to a recent deployment; updates are currently backlogged about 12 hours. The plan to deal with this is T230495, while some short-term hacks are being worked on to alleviate the current backlog.
It looks like enwikisource has added a suggestion to Special:Search to enable subphrase completion matching. This has resulted in ~421 users who have turned it on, and we don't see any who have turned it back off to return to the default. This suggests we could possibly move forward with making subphrase matching the default on wikisource.
Thu, Aug 8
All wikis are writing to cloudelastic now. It will still be a few days to catch up on writes since July 29, the day the dump was made. Also, the commonswiki_file import somehow only brought in ~25M out of 50M items. The saneitizer is working on fixing that, but it will take a bit.
Wed, Aug 7
Tue, Aug 6
Think I found it:
Looked into this a little bit (on cloudelastic1001.wikimedia.org); no solution yet:
Mon, Aug 5
I reviewed the draft on mw.org; everything there looks accurate as far as I'm aware. I didn't realize that implicit and explicit AND behave differently. The on-wiki documentation doesn't feel scary enough for what's really going on, but I'm not sure how to make it more explicit that this thing behaves oddly and not the way users expect.
Resolution: The IPv6 address I set, which was the IPv6 version of the IPv4 address, was incorrect. Rather, a new IPv6 address within the LVS range needed to be used. Brandon applied a fix with https://gerrit.wikimedia.org/r/#/c/528215/ and https://gerrit.wikimedia.org/r/#/c/528216/ and everything now looks to be working as expected with respect to IPv6.
Something suspicious I noticed: in lvs/configuration.yaml we define our IPv6 address in a different format than the other services do. The table below contains example addresses for a few services. In particular, note that addresses like 2620:0:860::ed1a::1 are not valid IPv6 addresses; I'm not actually sure what they are.
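As a quick sanity check that this format really is malformed, a standard parser rejects it; a minimal sketch (the single-"::" variant is only an illustrative contrast, not the address that was actually deployed):

```python
# Illustration: an address with two "::" runs is rejected by a standard parser.
# The second address is a hypothetical single-"::" variant shown for contrast.
import ipaddress

for addr in ("2620:0:860::ed1a::1", "2620:0:860:ed1a::1"):
    try:
        print(addr, "->", ipaddress.ip_address(addr))
    except ValueError as err:
        print(addr, "-> invalid:", err)
```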
Backfill was done as follows, with guidance from https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Backfilling#Backfilling_a_kafka_eventlogging_%3CSchema%3E_topic
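(The actual commands followed the linked guide; purely as a generic illustration of the replay idea, consuming raw events and re-producing them into the per-schema topic, it is something along these lines, with the broker, topic names, and filter all being placeholders:)

```python
# Generic sketch only, not the procedure actually run: replay events from a raw
# source topic into a per-schema eventlogging topic. Broker, topic names, and
# the filter below are placeholders.
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "eventlogging-client-side",          # placeholder source topic
    bootstrap_servers="localhost:9092",  # placeholder broker
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,
)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for message in consumer:
    if b"SearchSatisfaction" in message.value:  # crude placeholder schema filter
        producer.send("eventlogging_SearchSatisfaction", message.value)

producer.flush()
```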
Fri, Aug 2
Thu, Aug 1
Actually, it looks like wikidata has two separate statements on Q30099997:
- the use of social media by Donald Trump, 45th President of the United States
- Donald Trump's use of social media...
For Trump travel ban (Q30949469) we have:
English description: "A travel ban by the 45th President of ***the United States, Donald*** J Trump"
Tue, Jul 30
In a quick test over a few dozen queries it looks like this is faster than our existing ranking. On reflection that makes sense: before, we had to look up various term statistics and perform rescore queries, whereas here we only generate a random number for each visited document. Uploaded a patch that adds the random rescore functionality.
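For reference, a rough sketch of the shape of such a query against elasticsearch; the host, index, base query, and window size are placeholders rather than what the patch actually uses:

```python
# Sketch of a random rescore: the base query matches documents cheaply, then the
# rescore phase replaces their scores with a random value instead of looking up
# term statistics. Host, index, and query values are placeholders.
import json
import requests

body = {
    "query": {"match": {"text": "example query"}},
    "rescore": {
        "window_size": 8192,
        "query": {
            "rescore_query": {
                "function_score": {"random_score": {}, "boost_mode": "replace"}
            },
            "query_weight": 0.0,
            "rescore_query_weight": 1.0,
        },
    },
}
resp = requests.get(
    "http://localhost:9200/enwiki_content/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(body),
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```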
I suppose there is also the significant terms aggregation; it's similar to the aggregation above, but it tries to take into account the frequency in the total document collection vs the frequency in the result set. Essentially this orders structured data statements by how much more likely they are to be found in the result set than in the overall document collection:
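A rough sketch of what that request could look like, in contrast to the plain top-N counting; the host, index, and field names are guesses for illustration, not the query actually run:

```python
# Sketch of a significant_terms aggregation over structured data statements: it
# scores statements by how over-represented they are in the result set relative
# to the whole index. Host, index, and field names are illustrative guesses.
import json
import requests

body = {
    "query": {"match": {"text": "example query"}},
    "size": 0,
    "aggs": {
        "related_statements": {
            "significant_terms": {"field": "statement_keywords", "size": 10}
        }
    },
}
resp = requests.get(
    "http://localhost:9200/commonswiki_file/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(body),
)
for bucket in resp.json()["aggregations"]["related_statements"]["buckets"]:
    print(bucket["key"], bucket["doc_count"], "in results vs", bucket["bg_count"], "overall")
```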
It all depends on which code path gets the gateway timeout. Essentially every place cirrus talks to elasticsearch can end up erroring out. In terms of the most common code paths:
Script used to collect the above results. Note this needs direct access to elasticsearch, as cirrussearch does not yet support this query: P8829
Checked back into this and it's looking much better. July 16 had 2,500 gateway timeouts per 12 hours; since deploying, the highest 12-hour period has had 250 gateway timeouts. Might be worth continuing to look into, but knocking this down an order of magnitude is probably sufficient.
Mon, Jul 29
Dashboard with daily did-you-mean (DYM) metrics over the last 7 days: https://superset.wikimedia.org/superset/dashboard/40/
Evaluated this with respect to our did-you-mean suggestions and it seems like a plausible path forward. Since this was only a spike, I will create a separate task to put together the appropriate oozie workflows to generate the data on an ongoing basis.
@Ramsey-WMF This is basically what we can get "for free" from elasticsearch as related structured data. Essentially this just counts and displays the top N structured data statements over the search result set.
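Roughly, that boils down to a plain terms aggregation over the matched documents; a minimal sketch, with the host, index, and field names being guesses for illustration:

```python
# Minimal sketch of counting the top N structured data statements over a result
# set with a terms aggregation. Host, index, and field names are guesses.
import json
import requests

body = {
    "query": {"match": {"text": "example query"}},
    "size": 0,
    "aggs": {
        "top_statements": {"terms": {"field": "statement_keywords", "size": 10}}
    },
}
resp = requests.get(
    "http://localhost:9200/commonswiki_file/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(body),
)
for bucket in resp.json()["aggregations"]["top_statements"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```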
Thu, Jul 25
I haven't worked with Flow in 3 or 4 years; my understanding is that no one is interested in these metrics and the future of structured discussions is something different. I think these are safe to clean up.
Wed, Jul 24
I hadn't previously thought about re-publishing a new version of the same dataset. It does seem possible. For the purposes of my implementation the actual prefix is immaterial: the container is used to decide what kind of upload this is, and the exact prefix is only an implementation detail.
Metrics we should use moving forward
Tue, Jul 23
We left this in waiting/blocked because we weren't sure where the 5xxs were coming from. It looks probable this has the same root cause as T228063, so closing this for now. We can re-open if the problem persists.
Follow-up will be in T216058: test-import the backing data into druid and evaluate whether one of the druid interfaces can visualize our metrics.
reference druid schema for mediawiki history: https://docs.google.com/document/d/1jzrE3xdyEHed4Ek5ORRedOlEeH-i111hdmG3tBTF8QU/edit#