Thu, May 16
Went through and made some test charts in Superset against the test tables I generated with live data. It looks like we have everything we need, but I'm going to make one change to the collection scripts to simplify things.
Mon, May 13
Security would be interested in us investigating the access-control mechanisms in OpenSearch, so that access is more limited than "anyone with a network connection to the cluster".
Mon, May 6
The four sub-tickets were combined into a single GitLab MR with two calculations, found at: https://gitlab.wikimedia.org/repos/search-platform/notebooks/-/merge_requests/3. These currently populate two daily-partitioned Hive tables, which I've filled with data for March and April. Going forward I expect we will want to move the metrics calculation to Airflow, and decide which metrics are worth dashboarding.
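For reference, the shape of the daily load is roughly the following (a sketch only; the table and column names here are made up, not the ones from the MR):

```python
# Sketch of filling one daily partition from Spark; my_db.search_metrics_daily
# and the column names are hypothetical stand-ins for the real tables.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def write_daily_metrics(day: str) -> None:
    """Overwrite a single dt partition so reruns of a day are idempotent."""
    spark.sql(f"""
        INSERT OVERWRITE TABLE my_db.search_metrics_daily
        PARTITION (dt = '{day}')
        SELECT metric_name, metric_value
        FROM my_db.search_metrics_staging
        WHERE dt = '{day}'
    """)

for day in ("2024-03-01", "2024-03-02"):
    write_daily_metrics(day)
```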
Four tickets were combined into a single ticket with two calculations, found in the patch above:
Three tickets were combined into a single calculation and found in the patch above:
Fri, May 3
I've worked through most of this and have it calculating the last two months of metrics now.
Wed, May 1
The general issue here, missing search suggestions, is resolved and the temporary mitigations put in place have been rolled back. I'm calling this issue done. One of the root causes, network connectivity, has been resolved. The other root cause, promoting a bad index, is tracked in T363521. Some changes have already been put in place to make this code more resilient to network failures, but more might still be done.
Fri, Apr 26
The consumer seems generally stable. Stabilizing it involved changes to the application for better error handling, along with the taskmanager memory increase mentioned above. The pods had been running uninterrupted for a week until we brought them down yesterday to verify some new alerting.
Poked at the data-engineering-alerts archive; it looks like these were firing daily and then stopped on Apr 10. I think we can optimistically call this fixed?
Per the data-engineering-alerts list archive these were triggering daily alerts for the two weeks prior to 2024-04-10 and haven't been emitted since. That is two days after the fix was applied, which is slightly curious. But I remember something about event refinement operating over a window of hours, so maybe it took some time for the backlog to pass. I'm willing to call this complete with the errors stopped.
Root cause of the network issue has been tracked down in T363516#9748908: a layer-2 issue with LVS and new racks. With that fixed this error should trigger less frequently, but we should still apply some resiliency updates to the related code.
Decided to delay bringing traffic back to eqiad until Monday. To be confident in the daily indices we would probably want to rebuild them all, but that takes many hours and would finish only a few hours before I'm heading out for the weekend, which didn't seem like a great time to bring traffic back. The daily rebuilds will run; we can look at them on Monday and bring traffic back if everything is back to normal.
I poked around a little, but I'm not sure how to check whether that fix solved the issue or not. I submitted a join request to the data-engineering-alerts mailing list and can check the archives for current frequency once accepted. I assume these alerts are also recorded by whatever sends them, but I wasn't sure where that is.
These look to have subsided, now 12 in the last 4 days.
Thu, Apr 25
One thing we do have in logstash, although not specifically from the script running in eqiad, is a surprising (to me) number of general network errors talking to the elasticsearch cluster. Looking at the Host overview dashboard for mwmaint1002 for today, I can see intermittent network errors from 03:00 until 06:50; our completion indices build ran from 02:30 to 06:45. Over the last 7 days there are consistently network errors during this time period. I'm assuming we are causing those, but we could try running the build at a different time of day.
Started looking over this the other day. Some data we have available:
Wrote a terrible bash script to compare titlesuggest doc counts between the two clusters. It suggests the problem isn't limited to enwiki.
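Roughly the same check in Python, for posterity (the hostnames here are placeholders, not the real cluster endpoints):

```python
# Compare titlesuggest doc counts between the two clusters; a large gap on a
# wiki suggests that datacenter's build of the index was truncated.
import json
import urllib.request

# Placeholder endpoints, not the real cluster URLs.
CLUSTERS = {
    "eqiad": "https://search.eqiad.example:9243",
    "codfw": "https://search.codfw.example:9243",
}

def doc_count(base_url: str, index: str) -> int:
    with urllib.request.urlopen(f"{base_url}/{index}/_count") as resp:
        return json.load(resp)["count"]

for wiki in ("enwiki", "dewiki", "frwiki"):
    index = f"{wiki}_titlesuggest"
    counts = {dc: doc_count(url, index) for dc, url in CLUSTERS.items()}
    mismatch = max(counts.values()) > 1.5 * min(counts.values())
    print(index, counts, "<-- mismatch" if mismatch else "")
```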
Decided against shuffling traffic; the rebuild is almost complete already for enwiki. I can see in the logs where the enwiki eqiad build jumped from 44% to complete, but no reason why; nothing in logstash for that period either. I've created T363521 to put something in place to prevent this in the future.
Hmm, I can confirm this is happening. The completion index is built fresh every day in each datacenter. Usually they are the same, but somehow the eqiad index is about half the size of the codfw index (6.7G vs 14.5G). Autocomplete is fairly high-traffic; we should probably shift autocomplete traffic to codfw until this can be fixed, which probably requires a rebuild and a couple of hours.
Apr 18 2024
This chart should (eventually) contain the same data as gehel posted above. As of this moment only 5 days are calculated, but the aggregate percentages have already settled in. I only spent a couple of minutes making the chart, so this probably isn't the best way to present the data, but as an example: https://superset.wikimedia.org/explore/?slice_id=3368
Apr 17 2024
One potential improvement we talked about: the initial method of configuring the saneitizer adds new pieces to the flink execution graph. This means you have to play around with some dangerous options to pause saneitization, losing the current saneitization state in the process. We should update the flag that toggles saneitization so that the operator still connects to the graph but never emits any events or state changes when disabled (see the sketch below). The general idea is that the shape of the graph should not change due to configuration changes, as graph-shape changes require careful deployments.
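A minimal sketch of that pattern (plain Python, not actual Flink code): the operator stays wired into the graph and keeps its state, and the flag only gates whether it does anything.

```python
# The saneitizer stays part of the execution graph whether enabled or not;
# disabling it just makes it inert, so toggling the flag never changes the
# graph shape or throws away saneitization state.
class SaneitizerOperator:
    def __init__(self, enabled: bool):
        self.enabled = enabled

    def process(self, record, emit):
        if not self.enabled:
            return  # still in the graph, but emits no events or state changes
        for fix in self.check_page(record):
            emit(fix)

    def check_page(self, record):
        ...  # the actual per-page sanity checks
```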
Initial deployment has been a bit rocky; in particular the saneitizer is visiting pages with error states we haven't seen in normal operation yet. Overall this is probably good: we would have run into pages with these error states eventually, and the saneitizer is simply speeding that process up. The pipeline has been running for a couple of hours now without issues. If it's still running without restarts by tomorrow we can probably consider the initial deployment complete.
Apr 16 2024
This looks to be all caught back up from our side.
Apr 15 2024
All indices on cloudelastic look to be recreated now as well. It wasn't running this whole time; it just took me a while to get around to verifying the operation and finishing the couple of wikis that failed the first two times through.
It was backfilling over the weekend but got stuck around Feb 6th. It's back to processing hourlies; I expect they will keep decreasing for at least 12 more hours of processing at current rates, as long as it doesn't get stuck again. Basically what happened is there is a daily cleanup of old data, and because this job was backfilling old data, the bits it had calculated were deleted in the middle of its work and it stopped. I've paused the cleanup process until the backfill completes.
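The failure mode, roughly (the retention period here is illustrative):

```python
# Why the backfill got its data deleted out from under it: the daily cleanup
# drops any partition older than the retention cutoff, and backfilled February
# partitions are by definition older than the cutoff.
from datetime import date, timedelta

RETENTION_DAYS = 60  # illustrative, not the real retention period

def partitions_to_drop(partitions: list[date], today: date) -> list[date]:
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return [p for p in partitions if p < cutoff]

# A backfill writing Feb 6th data in mid-April lands squarely in this list.
print(partitions_to_drop([date(2024, 2, 6), date(2024, 4, 10)], date(2024, 4, 15)))
```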
Apr 12 2024
They are stored and processing through now at a rate of something like one hour of data per minute. It should catch up soon enough.
Hmm, indeed it looks like hourly transfers have been stuck for quite some time. Somehow Airflow thinks there are two hours running and it never failed them; it is still waiting for them to complete even though nothing is running. It looks like we never set an SLA on this dag, so its failures probably don't get properly recognized. I've reset the two tasks that were stuck and will see how I can get these all moving again, along with adding an SLA so it properly alerts (sketch below).
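For the SLA piece, a minimal sketch (the dag id, task, and timings are placeholders, not the real dag definition):

```python
# Adding an SLA so a stuck hourly transfer actually produces an alert instead
# of silently sitting in a running state forever.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hourly_transfer",  # placeholder dag id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=True,
) as dag:
    transfer = BashOperator(
        task_id="transfer_hour",
        # Placeholder command (trailing space avoids template-file lookup).
        bash_command="transfer_one_hour.sh ",
        # If the task hasn't completed within 3 hours of its scheduled time,
        # Airflow records an SLA miss and sends the alert email.
        sla=timedelta(hours=3),
    )
```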
Apr 11 2024
Adam suggested taking an easier way out and using the actor_signature definition of a unique device. This hashes together a couple of values from the web request to create a fingerprint. The absolute number won't really be comparable to the overall unique-devices metric, but we can calculate a percentage of actor_signatures and assume it's in the same ballpark.
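Roughly the shape of that technique (illustrative only; the real actor_signature definition and the exact fields it hashes live in the data platform code):

```python
# A weak per-device fingerprint: hash a few request fields together. The
# specific fields here are illustrative, not the actual actor_signature spec.
import hashlib

def actor_signature(ip: str, user_agent: str, accept_language: str) -> str:
    raw = "\n".join((ip, user_agent, accept_language))
    return hashlib.sha256(raw.encode("utf8")).hexdigest()

print(actor_signature("198.51.100.7", "Mozilla/5.0 ...", "en-US"))
```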
Worked through most of this and can compute plausible single-day stats with a notebook (on stat1007, ticket number prefixed to the file name in my home dir). Will come back to it once the other metrics are figured out, then extend this to calculate 90 days of dailies and offer monthly and ~quarterly numbers over those daily stats. To follow up on the above:
Apr 9 2024
Started to look into this a bit more closely. We will probably need to do custom work for each endpoint we want to classify. To start with:
In terms of the requested dimensions:
Apr 8 2024
If we want this to be directly comparable to page views, then I imagine it should be implemented as a classifier against the web requests table. We would miss a few narrow cases with cross-domain search results (sister search), but I suspect the referrer attached to page views is sufficient to classify them as from-search or not (rough sketch below).
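The classification itself could be fairly simple, something like the following (the referer patterns here are illustrative; each endpoint would likely need its own patterns):

```python
# Classify a page view as from-search based on its referer. The host/path
# patterns are illustrative; a real classifier would need one case per
# search endpoint we care about.
from urllib.parse import urlparse, parse_qs

def is_from_search(referer: str) -> bool:
    if not referer:
        return False
    parsed = urlparse(referer)
    if not parsed.netloc.endswith(".wikipedia.org"):
        return False
    # Special:Search result pages
    if parsed.path.startswith("/wiki/Special:Search"):
        return True
    # index.php?search=... style requests
    return parsed.path == "/w/index.php" and "search" in parse_qs(parsed.query)
```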
If we want a very simple count: we currently record a weak fingerprint of the browser, which is basically a hash of the IP address and the username. Due to the way this data is collected it does not include cached results, primarily short autocompletes and the related articles. This can be counted over whatever time dimension we want. The downside is that it's not directly comparable to anything; it is an absolute number, and the directionality would be meaningful, but as a standalone datapoint it would be hard to say these sessions represent x% of all unique devices.
Moving to waiting; we need to see whether changing the log buffering fixed the issue or not.
I took a look over the actual event generation, but I can't see why meta.dt would be outdated. Our request logging does cache some things, but the meta information isn't one of them. We fetch the value for meta.dt from wfTimestamp() (the global clock) and immediately provide the event to logging. Logging does put the request into a second deferred update, but as long as we are running from inside a deferred update, the system guarantees that any deferred submitted while running a deferred runs immediately afterwards (via a scope-stack abstraction).
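The guarantee in question, as a toy model (Python standing in for MediaWiki's DeferredUpdates machinery):

```python
# Toy model of the scope-stack guarantee: an update added while another update
# is running executes immediately after it, not at some arbitrary later point,
# so the event (with its fresh meta.dt) is handed to logging without delay.
class DeferredUpdates:
    def __init__(self):
        self.queue = []

    def add(self, fn):
        # Callable from request code or from inside a running update.
        self.queue.append(fn)

    def do_updates(self):
        # Drains at the end of the request; anything add()ed by a running
        # update lands on the same queue and runs right after it.
        while self.queue:
            self.queue.pop(0)()

updates = DeferredUpdates()
updates.add(lambda: updates.add(lambda: print("runs right after the outer update")))
updates.do_updates()
```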
Mar 25 2024
The interface message, as provided above, is search-result-size. The English version, provided by dev, is as follows:
Mar 5 2024
We are in the process of deploying a new updater for CirrusSearch, with cloudelastic as the first destination cluster. Duplicates could be a result of that, and are good to report so we can get everything working great before moving on to the primary search clusters.
Feb 29 2024
@bking this is likely related to the transition of cloudelastic to private IPs? I'll take a look later if you don't have ideas.
Feb 26 2024
I suspect that at the time we initially set up global-search we didn't have the cloudelastic.wikimedia.org alias up and running yet, but now that it exists we should certainly point at it instead of at individual servers.
Feb 22 2024
To review the documentation changes (there are also two revisions from bking mixed in there): https://wikitech.wikimedia.org/w/index.php?title=Search&diff=2153071&oldid=2127290
Example query of the REST API (could be nicer if we installed curl or wget, or exposed the REST API directly):
KUBECONFIG=/etc/kubernetes/cirrus-streaming-updater-deploy-staging.config kubectl \
    exec \
    flink-app-consumer-search-backfill-5b9f979487-dsqsb \
    -c flink-main-container \
    -- \
    python3 -c 'import urllib.request; print(urllib.request.urlopen("http://localhost:8081/v1/jobs").read().decode("utf8"))'
On further review, simply documenting the various commands to run seemed error-prone. The attached patch adds a Python script that simplifies away most of the reindexing and backfill work to ease future burden.
Feb 15 2024
Was supposed to go out in the backport window today, but train problems blocked that. This is a pretty safe patch though; I'll ship it a little later.
It seems the patch didn't actually make it into wmf.18 as expected; jenkins-bot never finished the merge, so it was only deployed to wmf.17. I'll get it shipped to wmf.18 too.
Feb 14 2024
I've been reviewing our options for backfilling and trying to come up with a plan; I think the following will work:
This looks resolved now; the bi-hourly spikes have gone away since the Monday deployment.