Dec 23 2020

CDanis added a comment to T270324: launch Klaxon: manual paging app for trusted users to escalate urgent issues to SRE.

Now live: https://klaxon.wikimedia.org/
and successfully tested today!

Dec 23 2020, 8:30 PM · SRE-OnFire, SRE
CDanis committed rLPRI169fd34beee6: faux secrets for Klaxon.
faux secrets for Klaxon
Dec 23 2020, 2:20 AM

Dec 22 2020

Ladsgroup awarded T270324: launch Klaxon: manual paging app for trusted users to escalate urgent issues to SRE a Love token.
Dec 22 2020, 9:56 PM · SRE-OnFire, SRE
CDanis committed rOSCTfe06f4da22b1: dbctl: README: document section 'flavor'.
dbctl: README: document section 'flavor'
Dec 22 2020, 3:43 PM
CDanis closed T270664: Was unable to connect (esams) for about 20 minutes as Resolved.

Thanks for the report.

Dec 22 2020, 3:06 PM · Traffic, SRE, netops

Dec 21 2020

CDanis added a comment to T269324: Productionize x2 databases.

@CDanis I was taking a look at the section values and noticed flavour, which is new to me and seems to accept regular or external. I haven't been able to find what each is for in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/conftool/+/refs/heads/master/conftool/extensions/dbconfig/README.md

sX seem to be regular, while x1 and esX seem to be external. What's the difference between them?

Dec 21 2020, 9:10 PM · Performance-Team (Radar), Patch-For-Review, DBA

Dec 16 2020

CDanis added a comment to T270325: API key for the production 'wikimedia' VictorOps environment.

I'll put this in puppet-private once I have the other puppetization ready :)

Dec 16 2020, 7:14 PM · observability, SRE
CDanis renamed T270325: API key for the production 'wikimedia' VictorOps environment from API key production 'wikimedia' VictorOps environment to API key for the production 'wikimedia' VictorOps environment.
Dec 16 2020, 5:59 PM · observability, SRE
CDanis created T270325: API key for the production 'wikimedia' VictorOps environment.
Dec 16 2020, 5:55 PM · observability, SRE
CDanis created T270324: launch Klaxon: manual paging app for trusted users to escalate urgent issues to SRE.
Dec 16 2020, 5:51 PM · SRE-OnFire, SRE
CDanis added a comment to T270169: Bootstrap Tegola vector-tile server with baseline MVT schema from OSM bright.

Not totally sure about the best fit here vs Beta, but you might also
consider using the 'staging' k8s cluster.

Dec 16 2020, 1:46 PM · Product-Infrastructure-Team-Backlog-Deprecated (Kanban), Maps
CDanis added a comment to T267018: LibreNMS sends its alerts to Alertmanager, resulting in email notifications to network operations.

So I guess we need a separate task for paging and the check_librenms
deprecation?

Dec 16 2020, 1:40 PM · netops, SRE, User-fgiunchedi, observability

Dec 14 2020

CDanis added a comment to T267018: LibreNMS sends its alerts to Alertmanager, resulting in email notifications to network operations.

Does this mean we can deprecate the check_librenms Icinga integration?

Dec 14 2020, 5:14 PM · netops, SRE, User-fgiunchedi, observability
CDanis added a comment to T269324: Productionize x2 databases.

Sure, always happy to help :)

Dec 14 2020, 2:00 PM · Performance-Team (Radar), Patch-For-Review, DBA

Dec 9 2020

CDanis closed T267800: Access to #mediawiki_security IRC channel for DannyS712 as Resolved.
Dec 9 2020, 2:31 PM · User-DannyS712, SRE

Dec 8 2020

CDanis updated subscribers of T268294: Ensure the necessary data files are present and accessible on beta for IP Info to function.

Tagging @BBlack in the hopes he knows anything offhand about shipping new GeoLite2 or full GeoIP2 files to Beta Cluster

Dec 8 2020, 9:31 PM · Anti-Harassment, IP Info

Dec 3 2020

CDanis closed T269370: test as Invalid.
Dec 3 2020, 4:24 PM
CDanis added a comment to T269370: test.

signed

Dec 3 2020, 4:22 PM
CDanis created T269370: test.
Dec 3 2020, 4:22 PM

Dec 2 2020

CDanis added a comment to T268927: Some PostgreSQL replicas are not fully updated.

The max lifetime of any object in the Traffic CDN is 24 hours. Are you sure they're being cached there? Can you give a full example URL?

Dec 2 2020, 1:39 PM · SRE, Maps (Kartographer)

Dec 1 2020

CDanis added a comment to T259312: Deal with donatewiki Thank You page launching in apps.

I believe I read something that they're now caching the app site association file on the Apple CDN, so I'd expect fewer hits...

Dec 1 2020, 7:38 PM · iOS-app-v6.8.2, Fundraising Sprint We all meet again, Patch-For-Review, Wikimedia-Apache-configuration, SRE, Fundraising Sprint Vagranty McVagrantface, Wikipedia-Android-App-Backlog, Thank-You-Page, Fundraising-Backlog, Android-app-Bugs, Wikipedia-iOS-App-Backlog
CDanis added a comment to T259312: Deal with donatewiki Thank You page launching in apps.

FWIW, having looked at the past week of webrequest data, I've started to wonder as to whether or not the files on the subdomains matter all that much:

Dec 1 2020, 7:31 PM · iOS-app-v6.8.2, Fundraising Sprint We all meet again, Patch-For-Review, Wikimedia-Apache-configuration, SRE, Fundraising Sprint Vagranty McVagrantface, Wikipedia-Android-App-Backlog, Thank-You-Page, Fundraising-Backlog, Android-app-Bugs, Wikipedia-iOS-App-Backlog

Nov 30 2020

CDanis added a comment to T259312: Deal with donatewiki Thank You page launching in apps.

https://thankyou.wikipedia.org/.well-known/apple-app-site-association now live, please let me know if it helps :)

Nov 30 2020, 10:27 PM · iOS-app-v6.8.2, Fundraising Sprint We all meet again, Patch-For-Review, Wikimedia-Apache-configuration, SRE, Fundraising Sprint Vagranty McVagrantface, Wikipedia-Android-App-Backlog, Thank-You-Page, Fundraising-Backlog, Android-app-Bugs, Wikipedia-iOS-App-Backlog
CDanis closed T267714: ripe-atlas-codfw is down as Resolved.

Filed T269046

Nov 30 2020, 10:02 PM · Infrastructure-Foundations, netops, SRE
CDanis updated the task description for T257527: automatically collect network error reports from users' browsers (Network Error Logging API).
Nov 30 2020, 7:21 PM · Product-Data-Infrastructure, SRE, Goal, Epic

Nov 25 2020

CDanis added a comment to T259312: Deal with donatewiki Thank You page launching in apps.

Sure -- SRE will get that deployed, but we're going to wait until Monday
Nov 30th, in the interest of not making any changes right before the US
holiday weekend.

Nov 25 2020, 8:18 PM · iOS-app-v6.8.2, Fundraising Sprint We all meet again, Patch-For-Review, Wikimedia-Apache-configuration, SRE, Fundraising Sprint Vagranty McVagrantface, Wikipedia-Android-App-Backlog, Thank-You-Page, Fundraising-Backlog, Android-app-Bugs, Wikipedia-iOS-App-Backlog

Nov 24 2020

CDanis edited P13393 eet.py.
Nov 24 2020, 3:44 PM
CDanis edited P13393 eet.py.
Nov 24 2020, 3:27 PM
CDanis created P13393 eet.py.
Nov 24 2020, 3:16 PM

Nov 23 2020

CDanis reassigned T267714: ripe-atlas-codfw is down from Papaul to faidon.

Today we tried powercycling the anchor while I was watching on serial console. It didn't output a thing. As far as I can tell, we need replacement hardware.

Nov 23 2020, 4:17 PM · Infrastructure-Foundations, netops, SRE
CDanis added a comment to T266373: Connection closed while downloading PDF of articles.

@Emdosis This task relates to very low-level issues stopping the PDFs from being delivered at all. Please open a new task for the different problem you reported, thanks :)

Nov 23 2020, 2:16 PM · Traffic, Web-Team-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog-Deprecated, serviceops, SRE, Desktop Improvements (Vector 2022), Wikimedia-production-error
CDanis added a comment to T266373: Connection closed while downloading PDF of articles.

Zero NEL reports of http.response.invalid.content_length_mismatch on PDFs since 2020-11-18T13:28:14, so @BBlack's find was definitely the issue.

Nov 23 2020, 1:14 PM · Traffic, Web-Team-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog-Deprecated, serviceops, SRE, Desktop Improvements (Vector 2022), Wikimedia-production-error

Nov 19 2020

CDanis added a comment to T259312: Deal with donatewiki Thank You page launching in apps.

BTW, re #2, I'm happy to deploy more patches to the site association file. The turnaround on that should be much faster going forward (since first we had to create a separate docroot and configure Apache to read from it).

Nov 19 2020, 8:55 PM · iOS-app-v6.8.2, Fundraising Sprint We all meet again, Patch-For-Review, Wikimedia-Apache-configuration, SRE, Fundraising Sprint Vagranty McVagrantface, Wikipedia-Android-App-Backlog, Thank-You-Page, Fundraising-Backlog, Android-app-Bugs, Wikipedia-iOS-App-Backlog
CDanis added a comment to T267714: ripe-atlas-codfw is down.

Thanks! Could you try powercycling it while one of us (either you or myself, at your preference) is watching the serial console?

Nov 19 2020, 8:44 PM · Infrastructure-Foundations, netops, SRE

Nov 18 2020

CDanis added a comment to T265938: Create a separate logstash ElasticSearch index for schemaed events.

Quick question: when the time comes, will it be possible to dump all the old NEL events out of the existing index and import them into the new index?

Nov 18 2020, 4:09 PM · Wikimedia-Logstash, observability, Analytics, Product-Data-Infrastructure
CDanis added a comment to T266373: Connection closed while downloading PDF of articles.

Adding to what Brandon says, we do have evidence that it happens on edge DCs other than just eqiad and esams. Here's a user report from eqsin: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2020.11.12/network-error?id=AXW8XPuZBpCrxDVR4i4O&_g=h@44136fa

Nov 18 2020, 1:38 PM · Traffic, Web-Team-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog-Deprecated, serviceops, SRE, Desktop Improvements (Vector 2022), Wikimedia-production-error

Nov 17 2020

CDanis reassigned T267714: ripe-atlas-codfw is down from faidon to Papaul.

Papaul, could you please attach atlas-codfw to one of the SCS servers so we can take a look via serial console? Thanks!

Nov 17 2020, 2:41 PM · Infrastructure-Foundations, netops, SRE

Nov 15 2020

CDanis created T267870: ms-be1022 smart storage battery failure; disk sdb possibly bad.
Nov 15 2020, 2:07 PM · SRE-swift-storage, ops-eqiad, SRE

Nov 9 2020

CDanis updated subscribers of T265938: Create a separate logstash ElasticSearch index for schemaed events.

Not sure I can capture the whole discussion but I'll try:

Nov 9 2020, 8:22 PM · Wikimedia-Logstash, observability, Analytics, Product-Data-Infrastructure

Nov 6 2020

CDanis added a comment to T259312: Deal with donatewiki Thank You page launching in apps.

Patches look ready to go -- as discussed with @Ejegg we'll get this deployed Monday.

Nov 6 2020, 5:59 PM · iOS-app-v6.8.2, Fundraising Sprint We all meet again, Patch-For-Review, Wikimedia-Apache-configuration, SRE, Fundraising Sprint Vagranty McVagrantface, Wikipedia-Android-App-Backlog, Thank-You-Page, Fundraising-Backlog, Android-app-Bugs, Wikipedia-iOS-App-Backlog
CDanis created T267409: grafana email alerting broken?.
Nov 6 2020, 3:04 PM · Patch-For-Review, SRE, observability

Nov 5 2020

CDanis added a comment to T259312: Deal with donatewiki Thank You page launching in apps.

Sorry, there's a small mess here -- parts of the relevant behavior are specified in the Puppet repo (where Apache2 config files live), and parts are in the mediawiki-config repo.

Nov 5 2020, 5:40 PM · iOS-app-v6.8.2, Fundraising Sprint We all meet again, Patch-For-Review, Wikimedia-Apache-configuration, SRE, Fundraising Sprint Vagranty McVagrantface, Wikipedia-Android-App-Backlog, Thank-You-Page, Fundraising-Backlog, Android-app-Bugs, Wikipedia-iOS-App-Backlog
CDanis added projects to T259312: Deal with donatewiki Thank You page launching in apps: SRE, Wikimedia-Apache-configuration.
Nov 5 2020, 5:00 PM · iOS-app-v6.8.2, Fundraising Sprint We all meet again, Patch-For-Review, Wikimedia-Apache-configuration, SRE, Fundraising Sprint Vagranty McVagrantface, Wikipedia-Android-App-Backlog, Thank-You-Page, Fundraising-Backlog, Android-app-Bugs, Wikipedia-iOS-App-Backlog

Nov 4 2020

CDanis added a comment to T267176: alert on too many close-to-saturated appservers / apiservers.

As discussed, here's a start on the query: https://w.wiki/k6F
Both thresholds in there need some tuning, but it's a start.

Nov 4 2020, 9:03 PM · Patch-For-Review, User-jijiki, serviceops, observability, SRE
CDanis added a comment to T263263: Access and use the MaxMind database for IPInfo.

Thanks @CDanis, this is really helpful. We're currently responding to performance and security reviews via our test environment, using the free databases. We may have more questions when we're ready to move over to production.

Sounds good!

For now, do you know how often the databases are updated? I see that some haven't been updated since summer, but some have been updated since T264838#6534393 (those with Oct 4 date now have Nov 1, so I would guess monthly or weekly?)...

Yeah, sorry, there's a variety of legacy files also in that directory.

Nov 4 2020, 4:05 PM · IP Info, Anti-Harassment, CheckUser, Tech-Product API Roadmap
CDanis created T267176: alert on too many close-to-saturated appservers / apiservers.
Nov 4 2020, 2:20 AM · Patch-For-Review, User-jijiki, serviceops, observability, SRE

Nov 3 2020

CDanis updated the title for P13151 ✔️ cdanis@lvs1015.eqiad.wmnet ~ 🕕🍺 sudo ipvsadm -L -t api.svc.eqiad.wmnet:https | tail -n+4 | sort -gk5 | phaste from Masterwork From Distant Lands to ✔️ cdanis@lvs1015.eqiad.wmnet ~ 🕕🍺 sudo ipvsadm -L -t api.svc.eqiad.wmnet:https | tail -n+4 | sort -gk5 | phaste.
Nov 3 2020, 11:02 PM
CDanis updated the title for P13150 ✔️ cdanis@lvs1015.eqiad.wmnet ~ 🕕🍺 sudo ipvsadm -L -t api.svc.eqiad.wmnet:https from Masterwork From Distant Lands to ✔️ cdanis@lvs1015.eqiad.wmnet ~ 🕕🍺 sudo ipvsadm -L -t api.svc.eqiad.wmnet:https.
Nov 3 2020, 11:00 PM
CDanis updated the title for P13149 nel-kafkacat-exporter.py from Command-Line Input to nel-kafkacat-exporter.py.
Nov 3 2020, 9:19 PM
CDanis created P13149 nel-kafkacat-exporter.py.
Nov 3 2020, 9:18 PM
CDanis closed T267089: Was unable to connect to most Wikimedia sites for a few minutes as Resolved.

Thanks for the report @AlexisJazz.

Nov 3 2020, 1:39 PM · netops, SRE, Traffic
CDanis added a comment to T267089: Was unable to connect to most Wikimedia sites for a few minutes.

For information required to report connectivity issues: https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue

Nov 3 2020, 1:38 PM · netops, SRE, Traffic
CDanis added a comment to T262869: Wikimedia projects not reachable for some Telecom Italia users.

We'll prepare at least a lightweight incident report in the coming days.

Did this happen? I couldn't find it. Sorry if I looked in the wrong places.

(I realise others may have higher priority. For instance this one https://wikitech.wikimedia.org/wiki/Incident_documentation/20200814-isp-unreachable .)

Nov 3 2020, 12:48 AM · Infrastructure-Foundations, Traffic, SRE, netops

Nov 1 2020

CDanis reopened T225060: db1091 crashed as "Open".

db1091 had some hardware failure again about 01:11 UTC.

Nov 1 2020, 1:31 AM · Patch-For-Review, ops-eqiad, SRE, DBA

Oct 30 2020

CDanis added a subtask for T257527: automatically collect network error reports from users' browsers (Network Error Logging API): T266906: update logging ES's template index to type the 'age' field as an integer.
Oct 30 2020, 9:38 PM · Product-Data-Infrastructure, SRE, Goal, Epic
CDanis added a parent task for T266906: update logging ES's template index to type the 'age' field as an integer: T257527: automatically collect network error reports from users' browsers (Network Error Logging API).
Oct 30 2020, 9:38 PM · observability, SRE
CDanis created T266906: update logging ES's template index to type the 'age' field as an integer.
Oct 30 2020, 9:38 PM · observability, SRE
CDanis closed T266865: Very long response time on frwiki main page as Resolved.

At approximately 23:00 on 28 Oct, the featured feed for frwiki became too large to be stored as a value in our memcached:
Memcached error for key "WANCache:frwiki:featured-feeds:1:fr|#|v" on server "127.0.0.1:11213": ITEM TOO BIG

Screenshot_20201030_145412.png (672×1 px, 65 KB)

https://logstash.wikimedia.org/goto/e0d82ea5e0c9883a61b200eff570a140
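
For illustration, a minimal sketch of checking whether a serialized value would fit under memcached's item size limit before it is written. It assumes memcached's default limit of 1 MiB (configurable with `-I`); the limit actually configured here is not stated above. `fits_in_memcached` is a hypothetical helper, and `pickle` stands in for whatever serialization the real cache client uses. The check is approximate, since the real limit also covers the key and per-item overhead.

```python
import pickle
import zlib

# Assumption: memcached's default maximum item size (1 MiB, set with -I).
MEMCACHED_DEFAULT_ITEM_LIMIT = 1024 * 1024


def fits_in_memcached(value, limit=MEMCACHED_DEFAULT_ITEM_LIMIT, compress=False):
    """Return (fits, size_in_bytes) for the serialized form of value.

    Hypothetical helper: pickle is only a stand-in serializer, and the
    comparison ignores key length and memcached's per-item overhead.
    """
    payload = pickle.dumps(value)
    if compress:
        payload = zlib.compress(payload)
    size = len(payload)
    return size <= limit, size


if __name__ == "__main__":
    feed = b"x" * (2 * 1024 * 1024)  # a 2 MiB stand-in for an oversized feed
    print(fits_in_memcached(feed))
    print(fits_in_memcached(feed, compress=True))
```

A guard like this would let a caller log the size or skip caching instead of hitting ITEM TOO BIG on every request.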

Oct 30 2020, 6:55 PM · Wikimedia-Incident, serviceops, Traffic, SRE, Performance-Team, Performance Issue
CDanis edited projects for T266865: Very long response time on frwiki main page, added: serviceops; removed netops.

This isn't limited to just esams; it is in fact happening across all cache clusters.

Oct 30 2020, 5:52 PM · Wikimedia-Incident, serviceops, Traffic, SRE, Performance-Team, Performance Issue
CDanis created T266886: Augment NEL reports with a computed timestamp-of-generation.
Oct 30 2020, 5:09 PM · Data-Engineering-Icebox, Analytics
CDanis added a comment to T266807: Kartotherian/Maps outage followups, 2020-10-29.

I've written some proposed followups in the task description, feel free to comment or edit :)

Oct 30 2020, 3:26 PM · SRE-OnFire, Product-Infrastructure-Team-Backlog-Deprecated, Sustainability (Incident Followup), Maps (Kartotherian), SRE
CDanis updated the task description for T266807: Kartotherian/Maps outage followups, 2020-10-29.
Oct 30 2020, 3:24 PM · SRE-OnFire, Product-Infrastructure-Team-Backlog-Deprecated, Sustainability (Incident Followup), Maps (Kartotherian), SRE
CDanis added a comment to T263466: EventGate idea: use presence of schema properties in http.(request|response)_headers to automatically set header values in event data.

the dynamicDefaults idea SGTM as well!

Oct 30 2020, 3:01 PM · Patch-For-Review, Better Use Of Data, Product-Data-Infrastructure, Event-Platform, Analytics

Oct 29 2020

CDanis added a comment to T266807: Kartotherian/Maps outage followups, 2020-10-29.

Kartotherian's logging was a complicating factor that could be improved - there are many log messages that look like potentially critical errors that are actually quite benign. Additionally, the timeouts caused by backlogged queries were not obvious as timeouts in responses.

Filed T266820 to bring codfw maps to production

Oct 29 2020, 6:00 PM · SRE-OnFire, Product-Infrastructure-Team-Backlog-Deprecated, Sustainability (Incident Followup), Maps (Kartotherian), SRE
CDanis renamed T266807: Kartotherian/Maps outage followups, 2020-10-29 from Kartotherian/Maps issues, 2020-10-29 to Kartotherian/Maps outage followups, 2020-10-29.
Oct 29 2020, 5:45 PM · SRE-OnFire, Product-Infrastructure-Team-Backlog-Deprecated, Sustainability (Incident Followup), Maps (Kartotherian), SRE
CDanis added projects to T266807: Kartotherian/Maps outage followups, 2020-10-29: SRE, Maps (Kartotherian).

This was a capacity issue.

Oct 29 2020, 5:38 PM · SRE-OnFire, Product-Infrastructure-Team-Backlog-Deprecated, Sustainability (Incident Followup), Maps (Kartotherian), SRE
CDanis created T266807: Kartotherian/Maps outage followups, 2020-10-29.
Oct 29 2020, 4:50 PM · SRE-OnFire, Product-Infrastructure-Team-Backlog-Deprecated, Sustainability (Incident Followup), Maps (Kartotherian), SRE
CDanis changed the visibility for T266775: Stalls on db1075 (s3) replica db.
Oct 29 2020, 3:45 PM · Datacenter-Switchover, User-Urbanecm, DynamicPageList (Wikimedia), MediaWiki-General, DBA
CDanis removed a project from T266775: Stalls on db1075 (s3) replica db: Security.
Oct 29 2020, 3:45 PM · Datacenter-Switchover, User-Urbanecm, DynamicPageList (Wikimedia), MediaWiki-General, DBA
CDanis added a comment to T265765: Quick data exploration CLI.

Visidata looks amazing, thanks! I love this idea.

Oct 29 2020, 2:07 PM · Data-Engineering-Icebox, Patch-For-Review, Analytics
CDanis created T266784: distribute tunnelencabulator in wmf-sre-laptop.
Oct 29 2020, 1:03 PM · wmf-sre-laptop, SRE
CDanis added a subtask for T244761: Script to point SRE local machine traffic to another LB: T266783: move tunnelencabulator's repo to a Wikimedia-owned space.
Oct 29 2020, 1:03 PM · SRE
CDanis added a parent task for T266783: move tunnelencabulator's repo to a Wikimedia-owned space: T244761: Script to point SRE local machine traffic to another LB.
Oct 29 2020, 1:03 PM · Infrastructure-Foundations
CDanis created T266783: move tunnelencabulator's repo to a Wikimedia-owned space.
Oct 29 2020, 1:03 PM · Infrastructure-Foundations

Oct 28 2020

CDanis added a comment to T262626: Remove http.client_ip from EventGate default schema (again).

I suppose that could indeed be addressed in various ways, either by partial or complete omission from Logstash and using more restricted stores instead, or by e.g. adding Geo and ASN data in a processing step between edge intake to EventGate and Kafka/Logstash. E.g. a simple Python processor like we have for Webperf events could presumably add any and all properties we want fairly trivially,
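
As a rough sketch of the processing step suggested in that comment, the snippet below enriches an event with Geo and ASN data. It is not the Webperf processor or any existing service: the `geo` and `asn` field names, the `lookup` function and the in-memory network table are all hypothetical, and dropping `http.client_ip` after enrichment is an assumption drawn from the task title.

```python
import copy
import ipaddress

# Hypothetical lookup table standing in for a real GeoIP/ASN database.
# 192.0.2.0/24 and AS64496 are reserved for documentation use.
FAKE_NETWORKS = {
    ipaddress.ip_network("192.0.2.0/24"): {"country": "ZZ", "asn": 64496},
}


def lookup(ip_text):
    """Return geo/ASN info for an IP string, or None if unknown or invalid."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return None
    for network, info in FAKE_NETWORKS.items():
        if ip in network:
            return info
    return None


def enrich_event(event):
    """Return a copy of event with derived fields added and client_ip removed."""
    out = copy.deepcopy(event)
    http = out.get("http", {})
    client_ip = http.pop("client_ip", None)  # assumption: the raw IP is dropped
    if client_ip is not None:
        info = lookup(client_ip)
        if info is not None:
            out["geo"] = {"country": info["country"]}  # hypothetical field
            out["asn"] = info["asn"]                   # hypothetical field
    return out


if __name__ == "__main__":
    print(enrich_event({"http": {"client_ip": "192.0.2.10", "method": "GET"}}))
```

The input event is left unmodified, so the same function could sit between a consumer and a producer without side effects on the original message.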

Oct 28 2020, 3:24 PM · Data-Engineering, Better Use Of Data, Analytics-Kanban, Product-Analytics, Product-Data-Infrastructure, observability, Privacy Engineering, Analytics, Event-Platform

Oct 26 2020

CDanis created P13070 Command-Line Input.
Oct 26 2020, 3:28 PM
CDanis closed T241374: fastnetmon misreports attack type and protocol as Resolved.
Oct 26 2020, 12:52 PM · SRE, netops

Oct 24 2020

CDanis edited projects for T266373: Connection closed while downloading PDF of articles, added: serviceops, Proton; removed Traffic.
Oct 24 2020, 4:37 PM · Traffic, Web-Team-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog-Deprecated, serviceops, SRE, Desktop Improvements (Vector 2022), Wikimedia-production-error
CDanis added a comment to T266373: Connection closed while downloading PDF of articles.

See also this Grafana dashboard showing increase of daily PDF rendering by Proton from 80k to 20k, since beginning of August.

Oct 24 2020, 4:33 PM · Traffic, Web-Team-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog-Deprecated, serviceops, SRE, Desktop Improvements (Vector 2022), Wikimedia-production-error

Oct 23 2020

CDanis added a comment to T247454: 502 Server Hangup Error on esams for "Upload a new version of this file" on Special:Upload on Commons.

Similar to @AntiCompositeNumber's comment, this somewhat reminds me of T205619.

Oct 23 2020, 11:22 PM · SRE, Traffic, SRE-swift-storage, Wikimedia-production-error, MediaWiki-Uploading, Commons
CDanis added a comment to T264881: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC.

Thanks @Tsevener ! Detailed reply below -- and thanks for the questions, they were helpful! -- but here's a high-level summary of what I found and what I'm thinking right now:

Oct 23 2020, 6:16 PM · SRE, Traffic, iOS-app-Bugs, Wikipedia-iOS-App-Backlog
CDanis added a comment to T243009: Add option in Scap to restart php-fpm for emergency deployments, and skip depooling/pooling servers.

I would also like someone to investigate whether systemctl reload php7.2-fpm clears opcache.

Oct 23 2020, 3:00 PM · Patch-For-Review, Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)), Sustainability (Incident Followup), Release-Engineering-Team (Deployment services), Scap

Oct 22 2020

CDanis lowered the priority of T264881: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC from High to Medium.
Oct 22 2020, 4:21 PM · SRE, Traffic, iOS-app-Bugs, Wikipedia-iOS-App-Backlog
CDanis added a comment to T264881: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC.

Hi @Tsevener -- wanted to check in about something. Is the version of the app WikipediaApp/6.7.2.1780 the version expected to have smoothing of widget refreshes over the whole hour? As far as I can tell that isn't happening:

image.png (736×1 px, 57 KB)
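
To make "smoothing over the whole hour" concrete, here is one possible approach, sketched in Python purely for illustration. It is not the app's implementation: `refresh_offset_seconds` and the device identifier are hypothetical. Each client derives a stable offset from a hash of its identifier, so refreshes spread across the period instead of all firing at the top of the hour.

```python
import hashlib


def refresh_offset_seconds(device_id, period_seconds=3600):
    """Map a device identifier to a stable offset within the refresh period.

    Hypothetical helper: hashing keeps the offset constant per device while
    distributing devices roughly uniformly across the period.
    """
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % period_seconds


if __name__ == "__main__":
    for name in ("device-a", "device-b", "device-c"):
        offset = refresh_offset_seconds(name)
        print(f"{name}: refresh at minute {offset // 60}, second {offset % 60}")
```

A random offset chosen once per install would work similarly; the hash variant just avoids having to persist the chosen value.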

Oct 22 2020, 4:21 PM · SRE, Traffic, iOS-app-Bugs, Wikipedia-iOS-App-Backlog
CDanis added a comment to T262626: Remove http.client_ip from EventGate default schema (again).

@CDanis @Krinkle this does leave http.client_ip in the w3c/reportingapi/network_error events, which do get ingested into logstash. Should we remove it there too, or leave it?

Oct 22 2020, 12:39 AM · Data-Engineering, Better Use Of Data, Analytics-Kanban, Product-Analytics, Product-Data-Infrastructure, observability, Privacy Engineering, Analytics, Event-Platform

Oct 21 2020

CDanis created T266194: wikifeeds-production-tls-proxy regularly exceeding its k8s CPU reservation.
Oct 21 2020, 10:11 PM · Kubernetes, Wikifeeds, serviceops
CDanis closed T266108: Call to undefined method MediaWiki\Session\UserInfo::isVerifidd() as Resolved.
Oct 21 2020, 1:58 PM · MediaWiki-extensions-CentralAuth, Wikimedia-production-error
CDanis closed T266108: Call to undefined method MediaWiki\Session\UserInfo::isVerifidd(), a subtask of T245183: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.), as Resolved.
Oct 21 2020, 1:58 PM · Wikimedia-production-error, serviceops, SRE

Oct 20 2020

CDanis added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

I'm not sure how frontend servers are picked to serve requests (hashed by IP? URL?), but this seems like a noteworthy piece of data, which might be further confirmation of what seems to be a cluster-wide phenomenon in my previous comment.

Oct 20 2020, 8:18 PM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic
CDanis removed a subtask for T263180: 1.36.0-wmf.14 deployment blockers: T266052: Interface 'MediaWiki\EditPafe\IEditObject' not found.
Oct 20 2020, 5:52 PM · Release-Engineering-Team-TODO, Release, Train Deployments
CDanis removed a parent task for T266052: Interface 'MediaWiki\EditPafe\IEditObject' not found: T263180: 1.36.0-wmf.14 deployment blockers.
Oct 20 2020, 5:52 PM · Wikimedia-production-error
CDanis added a subtask for T245183: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.): T266051: Class 'TranslatablePage' not found.
Oct 20 2020, 5:52 PM · Wikimedia-production-error, serviceops, SRE
CDanis removed a subtask for T263180: 1.36.0-wmf.14 deployment blockers: T266051: Class 'TranslatablePage' not found.
Oct 20 2020, 5:51 PM · Release-Engineering-Team-TODO, Release, Train Deployments
CDanis edited parent tasks for T266051: Class 'TranslatablePage' not found, added: T245183: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.); removed: T263180: 1.36.0-wmf.14 deployment blockers.
Oct 20 2020, 5:51 PM · Wikimedia-production-error
CDanis added a comment to T266051: Class 'TranslatablePage' not found.

100% of the TranslatablePage errors were on mw2328. The error occurred on no other appserver.

Oct 20 2020, 5:51 PM · Wikimedia-production-error

Oct 14 2020

CDanis added a comment to T261694: Support maps serving for affiliate sites via an allow list.

@CDanis and @Dzahn as per T261424#6538173, is there anything else to be done for the 3rd party block in the traffic layer?

Oct 14 2020, 3:54 PM · Epic, Traffic, Maps, Product-Infrastructure-Team-Backlog-Deprecated, SRE
Elitre awarded T261424: Limit maps serving to Wikimedia hosted sites only a Barnstar token.
Oct 14 2020, 3:49 PM · Maps, Traffic, SRE, Product-Infrastructure-Team-Backlog-Deprecated
CDanis closed T261424: Limit maps serving to Wikimedia hosted sites only as Resolved.

This will be taking effect over the next half hour.

Oct 14 2020, 2:24 PM · Maps, Traffic, SRE, Product-Infrastructure-Team-Backlog-Deprecated

Oct 13 2020

CDanis claimed T261424: Limit maps serving to Wikimedia hosted sites only.
Oct 13 2020, 5:42 PM · Maps, Traffic, SRE, Product-Infrastructure-Team-Backlog-Deprecated
CDanis added a comment to T261424: Limit maps serving to Wikimedia hosted sites only.

Yes, I think everyone who requested to be added to the allow list has been added. There were a couple questions on the mailing list related to OSM chapters usage. I have reached out to those folks a couple times, but received no answer. So, I think we are clear to merge the change, as is.

Oct 13 2020, 5:42 PM · Maps, Traffic, SRE, Product-Infrastructure-Team-Backlog-Deprecated