Now live: https://klaxon.wikimedia.org/ and successfully tested today!
Dec 23 2020
Dec 22 2020
Thanks for the report.
Dec 21 2020
In T269324#6705379, @Marostegui wrote: @CDanis I was taking a look at the section values and I have seen: flavour which is sort of new to me and seems to accept regular or external. But I haven't been able to find what are each for on https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/conftool/+/refs/heads/master/conftool/extensions/dbconfig/README.md
sX seem to be regular and x1, esX seem to be external, what's the difference between them?
Dec 16 2020
I'll put this in puppet-private once I have the other puppetization ready :)
Not totally sure about the best fit here vs Beta, but you might also consider using the 'staging' k8s cluster.
So I guess we need a separate task for paging and the check_librenms deprecation?
Dec 14 2020
Does this mean we can deprecate the check_librenms Icinga integration?
Sure, always happy to help :)
Dec 9 2020
Dec 8 2020
Tagging @BBlack in the hopes he knows anything offhand about shipping new GeoLite2 or full GeoIP2 files to Beta Cluster
Dec 3 2020
Dec 2 2020
The max lifetime of any object in the Traffic CDN is 24 hours. Are you sure they're being cached there? Can you give a full example URL?
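For reference, a quick way to check whether a given URL is actually being served from the edge cache is to look at the response headers; a minimal sketch in Python (the header names below, e.g. x-cache and age, are assumptions about what the CDN exposes):

```
# Sketch: print cache-related response headers for a URL.
# Header names (x-cache, age, cache-control) are assumptions about what the edge exposes.
import requests

def check_cache_headers(url):
    resp = requests.get(url, stream=True)  # stream=True so we don't pull down the whole body
    for name in ("x-cache", "age", "cache-control"):
        print(f"{name}: {resp.headers.get(name, '<absent>')}")
    resp.close()

check_cache_headers("https://en.wikipedia.org/static/favicon/wikipedia.ico")  # placeholder URL
```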
Dec 1 2020
In T259312#6660843, @MattCleinman wrote: I believe I read something that they're now caching the app site association file on the Apple CDN, so I'd expect fewer hits...
FWIW, having looked at the past week of webrequest data, I've started to wonder whether the files on the subdomains matter all that much:
Nov 30 2020
https://thankyou.wikipedia.org/.well-known/apple-app-site-association now live, please let me know if it helps :)
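If anyone wants to double-check from their side, a minimal sketch (just a fetch-and-parse, nothing deployment-specific):

```
# Sketch: confirm the association file is reachable and is valid JSON.
import json
import requests

resp = requests.get("https://thankyou.wikipedia.org/.well-known/apple-app-site-association")
resp.raise_for_status()
data = json.loads(resp.text)   # Apple requires the body to be JSON
print(sorted(data.keys()))     # an "applinks" key is expected for universal links
```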
Filed T269046
Nov 25 2020
Sure -- SRE will get that deployed, but we're going to wait until Monday Nov 30th, in the interest of not making any changes right before the US holiday weekend.
Nov 24 2020
Nov 23 2020
Today we tried powercycling the anchor while I was watching on serial console. It didn't output a thing. As far as I can tell, we need replacement hardware.
@Emdosis This task relates to very low-level issues stopping the PDFs from being delivered at all. Please open a new task for the different problem you reported, thanks :)
Zero NEL reports of http.response.invalid.content_length_mismatch on PDFs since 2020-11-18T13:28:14, so @BBlack's find was definitely the issue.
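The check here amounts to filtering NEL events by report type and URL suffix; a rough sketch (the field names are assumptions about how the reports are indexed, not the actual schema):

```
# Sketch: count NEL reports of a given type against PDF URLs.
# Field names ("type", "url") are assumptions about the event schema.
def count_pdf_mismatches(events):
    return sum(
        1
        for e in events
        if e.get("type") == "http.response.invalid.content_length_mismatch"
        and e.get("url", "").lower().endswith(".pdf")
    )
```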
Nov 19 2020
BTW, re #2, I'm happy to deploy more patches to the site association file. The turnaround on that should be much faster going forward (the first deploy was slow because we had to create a separate docroot and configure Apache to read from it).
Thanks! Can we try having you powercycle it while one of us (either you or me, at your preference) watches the serial console?
Nov 18 2020
Quick question: when the time comes, will it be possible to dump all the old NEL events out of the existing index and import them into the new index?
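For reference, one plausible mechanism is Elasticsearch's _reindex API; a minimal sketch (the cluster URL and index names are placeholders, not the real ones):

```
# Sketch: copy documents from the old NEL index into the new one via _reindex.
# Cluster URL and index names are placeholders.
import requests

body = {
    "source": {"index": "old-nel-index-*"},
    "dest": {"index": "new-nel-index"},
}
resp = requests.post(
    "https://elasticsearch.example.org:9200/_reindex",
    json=body,
    params={"wait_for_completion": "false"},  # large copies run better as a background task
)
resp.raise_for_status()
print(resp.json())  # includes a task id to poll when run asynchronously
```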
Adding to what Brandon says, we do have evidence that it happens on edge DCs other than just eqiad and esams. Here's a user report from eqsin: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2020.11.12/network-error?id=AXW8XPuZBpCrxDVR4i4O&_g=h@44136fa
Nov 17 2020
Papaul, could you please attach atlas-codfw to one of the SCS servers so we can take a look via serial console? Thanks!
Nov 15 2020
Nov 9 2020
Not sure I can capture the whole discussion but I'll try:
Nov 6 2020
Patches look ready to go -- as discussed with @Ejegg we'll get this deployed Monday.
Nov 5 2020
Sorry, there's a small mess here -- parts of the relevant behavior are specified in the Puppet repo (where Apache2 config files live), and parts are in the mediawiki-config repo.
Nov 4 2020
As discussed, here's a start on the query: https://w.wiki/k6F
Both thresholds in there need some tuning, but it's a start.
In T263263#6603182, @Tchanders wrote: Thanks @CDanis, this is really helpful. We're currently responding to performance and security reviews via our test environment, using the free databases. We may have more questions when we're ready to move over to production.
Sounds good!
For now, do you know how often the databases are updated? I see that some haven't been updated since summer, but some have been updated since T264838#6534393 (those with Oct 4 date now have Nov 1, so I would guess monthly or weekly?)...
Yeah, sorry, there's a variety of legacy files also in that directory.
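If it's useful, the build date baked into each .mmdb file is easy to check; a small sketch (the path is just an example):

```
# Sketch: print the build date embedded in a MaxMind .mmdb file.
# The path is only an example; point it at whichever database you care about.
import datetime
import maxminddb

reader = maxminddb.open_database("/usr/share/GeoIP/GeoIP2-City.mmdb")
built = datetime.datetime.fromtimestamp(reader.metadata().build_epoch, tz=datetime.timezone.utc)
print(f"database built: {built:%Y-%m-%d}")
reader.close()
```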
Nov 3 2020
Thanks for the report @AlexisJazz.
In T267089#6599101, @Peachey88 wrote: For information required to report connectivity issues: https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue
In T262869#6565461, @Nemo_bis wrote: We'll prepare at least a lightweight incident report in the coming days.
Did this happen? I couldn't find it. Sorry if I looked in the wrong places.
(I realise other reports may have higher priority, for instance https://wikitech.wikimedia.org/wiki/Incident_documentation/20200814-isp-unreachable .)
Nov 1 2020
db1091 had some hardware failure again about 01:11 UTC.
Oct 30 2020
At approximately 23:00 on 28 Oct, the featured feed for frwiki became too large to store as a single value in our memcached:
Memcached error for key "WANCache:frwiki:featured-feeds:1:fr|#|v" on server "127.0.0.1:11213": ITEM TOO BIG
https://logstash.wikimedia.org/goto/e0d82ea5e0c9883a61b200eff570a140
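For context, memcached refuses any value larger than its configured item size limit (1 MB by default), which is what the ITEM TOO BIG error above means. A rough illustration of the kind of guard that avoids it (the function is made up, not MediaWiki's actual caching code):

```
# Illustrative sketch: skip caching values above memcached's item size limit.
# 1 MB is memcached's default max item size (-I); the function itself is made up.
MAX_ITEM_BYTES = 1024 * 1024

def cache_set(client, key, value):
    if len(value) > MAX_ITEM_BYTES:
        # Too big for a single memcached item; split, compress, or skip caching.
        return False
    client.set(key, value)
    return True
```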
This isn't limited to just esams; it is in fact happening across all cache clusters.
I've written some proposed followups in the task description, feel free to comment or edit :)
The dynamicDefaults idea SGTM as well!
Oct 29 2020
In T266807#6589404, @hnowlan wrote: Kartotherian's logging was a complicating factor that could be improved - there are many log messages that look like potentially critical errors that are actually quite benign. Additionally, the timeouts caused by backlogged queries were not obvious as timeouts in responses.
Filed T266820 to bring codfw maps to production
This was a capacity issue.
Visidata looks amazing, thanks! I love this idea.
Oct 28 2020
In T262626#6572889, @Krinkle wrote: I suppose that could indeed be addressed in various ways, either by partial or complete omission from Logstash and using more restricted stores instead, or by e.g. adding the Geo and ASN data in a processing step between edge intake to EventGate and Kafka/Logstash. E.g. a simple python processor like we have for Webperf events could presumably add any and all properties we want fairly trivially,
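As a rough illustration of the kind of processor being described (purely a sketch; the database paths and event field names are assumptions, not the real pipeline's):

```
# Sketch: enrich an event with Geo and ASN data before it is shipped to Logstash.
# Database paths and field names are assumptions.
import geoip2.database
import geoip2.errors

city_reader = geoip2.database.Reader("/usr/share/GeoIP/GeoLite2-City.mmdb")
asn_reader = geoip2.database.Reader("/usr/share/GeoIP/GeoLite2-ASN.mmdb")

def enrich(event):
    ip = event.get("client_ip")
    if not ip:
        return event
    try:
        event["geo_country"] = city_reader.city(ip).country.iso_code
        asn = asn_reader.asn(ip)
        event["as_number"] = asn.autonomous_system_number
        event["as_org"] = asn.autonomous_system_organization
    except geoip2.errors.AddressNotFoundError:
        pass  # IP not in the databases; leave the event unenriched
    return event
```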
Oct 26 2020
Oct 24 2020
In T266373#6576161, @Framawiki wrote: See also this Grafana dashboard showing decrease of daily PDF rendering by Proton from 80k to 20k, since beginning of August.
Oct 23 2020
Similar to @AntiCompositeNumber's comment, this somewhat reminds me of T205619.
Thanks @Tsevener! Detailed reply below -- and thanks for the questions, they were helpful! -- but here's a high-level summary of what I found and what I'm thinking right now:
I would also like for someone to investigate if systemctl reload php7.2-fpm clears opcache.
Oct 22 2020
Hi @Tsevener -- wanted to check in about something. Is the version of the app WikipediaApp/6.7.2.1780 the version expected to have smoothing of widget refreshes over the whole hour? As far as I can tell that isn't happening:
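To be explicit about what I mean by smoothing (a sketch of the general idea, not the app's actual implementation): each client would pick a stable offset within the hour, so refreshes land uniformly across the hour instead of all at the top of it.

```
# Sketch of the general idea; not the app's actual code.
import hashlib

def refresh_offset_seconds(client_id):
    """Deterministic offset in [0, 3600) so each client refreshes at a stable,
    uniformly distributed point within the hour."""
    digest = hashlib.sha256(client_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 3600
```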
In T262626#6564427, @Ottomata wrote:
Oct 21 2020
Oct 20 2020
In T264398#6565366, @Gilles wrote: I'm not sure how frontend servers are picked to serve requests (hashed by IP? URL?), but this seems like a noteworthy piece of data, which might be further confirmation of what seems to be a cluster-wide phenomenon in my previous comment.
100% of the TranslatablePage errors were on mw2328. The error occurred on no other appserver.
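For anyone who wants to reproduce the check, it is just a tally of the error events by server; roughly (the host field name is an assumption about the Logstash schema):

```
# Sketch: count error events per appserver to spot a single misbehaving host.
# The "host" field name is an assumption about how the events are indexed.
from collections import Counter

def errors_by_host(events):
    return Counter(e.get("host", "unknown") for e in events)
```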
Oct 14 2020
In T261694#6538860, @MSantos wrote: @CDanis and @Dzahn as per T261424#6538173, is there anything else to be done for the 3rd party block in the traffic layer?
This will be taking effect over the next half hour.
Oct 13 2020
In T261424#6539494, @JMinor wrote: Yes, I think everyone who requested to be added to the allow list has been added. There were a couple questions on the mailing list related to OSM chapters usage. I have reached out to those folks a couple times, but received no answer. So, I think we are clear to merge the change, as is.