[02:04:13] Traffic, MediaWiki-Platform-Team (Radar), Patch-For-Review, User-notice: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510#11155999 (Krinkle)
[02:27:51] FIRING: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2009 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[02:32:51] RESOLVED: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2009 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[03:33:07] Traffic, MediaWiki-Platform-Team (Radar), Patch-For-Review, User-notice: [Rollout Phase 3] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510#11156031 (Krinkle)
[03:37:51] FIRING: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2010 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[03:42:51] RESOLVED: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2010 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[03:44:21] FIRING: [2x] FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2009 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[03:49:21] RESOLVED: [2x] FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2009 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[04:27:51] FIRING: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2009 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[04:32:51] RESOLVED: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2009 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[05:19:51] FIRING: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2009 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[05:24:51] RESOLVED: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2009 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[06:47:40] FIRING: VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=99&var-site=eqsin&var-instance=cp5026 - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[07:02:40] RESOLVED: VarnishHighThreadCount: Varnish's thread count on cp5026:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=99&var-site=eqsin&var-instance=cp5026 - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[08:16:51] FIRING: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2009 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[08:21:51] RESOLVED: FermMSS: Unexpected MSS value on 10.2.1.27:80 @ ms-fe2009 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=codfw&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[08:22:50] netops, Infrastructure-Foundations, SRE: GRE Interfaces statistics not being returned by Juniper MX via gnmi - https://phabricator.wikimedia.org/T403936 (cmooney) NEW p:Triage→Low
[08:22:59] netops, Infrastructure-Foundations, SRE: GRE Interfaces statistics not being returned by Juniper MX via gnmi - https://phabricator.wikimedia.org/T403936#11156395 (cmooney)
[08:23:01] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#11156396 (cmooney)
[10:12:13] Traffic, Data-Engineering: Reduce noise from duplicate sequence-gap alerts on HaProxy-webrequests - https://phabricator.wikimedia.org/T401383#11156883 (Vgutierrez) local tests show that HAProxy issued `46410639` but it never reached the kafka cluster, probably because haproxykafka failed to parse it for...
[10:13:28] Traffic, Patch-For-Review: Add an Allow header on 405 responses - https://phabricator.wikimedia.org/T403767#11156886 (Vgutierrez) Open→Resolved a:Vgutierrez `vgutierrez@cp6016:~$ curl -X TRACE -i https://en.wikipedia.org HTTP/2 405 content-length: 146 cache-control: no-cache content-type...
[10:13:57] FIRING: SystemdUnitFailed: bird.service on durum3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:18:57] RESOLVED: SystemdUnitFailed: bird.service on durum3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:36:32] Traffic, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11156961 (elukey) This patch https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1185877 should solve the last issue with Redfish, but it require...
[11:00:56] Traffic, Data-Engineering: Reduce noise from duplicate sequence-gap alerts on HaProxy-webrequests - https://phabricator.wikimedia.org/T401383#11157028 (Vgutierrez) probably unrelated but I've found what could be a HAProxy bug related to `%rt` being increased twice per request: https://github.com/hapro...
[11:11:25] FIRING: SystemdUnitCrashLoop: dnsdist.service crashloop on doh3006:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[11:21:25] RESOLVED: SystemdUnitCrashLoop: dnsdist.service crashloop on doh3006:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[11:33:52] FYI, I'm temporarily removing durum3005 from active service for approx. 10 minutes, the underlying VM was installed with non-DRBD when the routed Ganeti cluster in esams was single-node and I'm now moving it back to DRBD
[11:46:40] and it's back
[11:48:01] FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum3005:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=esams&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[11:53:00] RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum3005:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=esams&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:29:27] FYI, I'm temporarily removing doh3005 from active service for approx. 10 minutes, the underlying VM was installed with non-DRBD when the routed Ganeti cluster in esams was single-node and I'm now moving it back to DRBD
[12:39:38] ack /cc sukhe
[12:42:00] FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh3005:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=esams&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[12:42:26] and it's back
[12:47:00] RESOLVED: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on doh3005:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=esams&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[14:30:39] netops, Infrastructure-Foundations, SRE: Create alerting for saturation on sub-rated interfaces - https://phabricator.wikimedia.org/T374614#11157827 (cmooney) I have set the bandwidth to '6000000000' either side manually in the UI so let's see how it goes.
[14:42:28] netops, Infrastructure-Foundations, Prod-Kubernetes, serviceops: WikiKube clusters close to exhausting Calico IPPool allocations - https://phabricator.wikimedia.org/T375845#11157897 (cmooney) Is there anything remaining to do on this task? Looks like we have enough space now after the change in...
[14:54:09] Traffic, MediaWiki-Platform-Team, Reader Experience Team: Toggling desktop view doesn't toggle user back into mobile mode - https://phabricator.wikimedia.org/T403866#11157986 (Tgr) a:Krinkle
[15:05:00] Traffic, MediaWiki-Platform-Team (Radar): [Rollout Phase 1] Implement unified mobile routing and enable on wikitech.wikimedia.org - https://phabricator.wikimedia.org/T401595#11158093 (Krinkle) >>! In T401595#11153001, @Jdlrobson-WMF wrote: > […] presumably we should be getting more mobile (Minerva...
[17:52:07] Traffic, SRE, Patch-For-Review: apt-staging: add headers to prevent CDN caching - https://phabricator.wikimedia.org/T402284#11159159 (Dzahn) Deployed! And we spot checked it on cp1011. It isn't caching anymore now.
[18:08:46] netops, Infrastructure-Foundations, SRE: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11159215 (Papaul) BGP is up on mr1-eqsin cr2/3-eqsin ` mr1-eqsin# run show route protocol ospf inet.0: 198 destinations, 200 routes (198 active, 0 holddown, 0 hidden) Res...
[18:47:29] Traffic, DNS, SRE: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? - https://phabricator.wikimedia.org/T389333#11159330 (ssingh) We merged the NOOP change that implements the new YAML-based `pdns-recursor` config. We have not enabled it anywhere yet because we don't have a host that ha...
[18:50:41] Traffic, DNS, SRE: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? - https://phabricator.wikimedia.org/T389333#11159335 (ssingh) One of the related issues here is figuring out what `pdns-recursor` settings are actually applied, since they vary across prod DNS, Wikimedia DNS, and WMCS re...
[19:27:01] [non-urgent] hello traffic friends - as prep for the upcoming migration of MediaWiki to PHP 8.3, there's an ATS script that I'd like to (re)introduce [0] via [1][2].
[19:27:01] do you folks prefer to actively review these kinds of changes? (e.g., vs. review by the team that "owns" the script alone)
[19:27:01] [0] https://phabricator.wikimedia.org/T403655
[19:27:01] [1] https://gerrit.wikimedia.org/r/1184914
[19:27:02] [2] https://gerrit.wikimedia.org/r/1184915
[19:27:57] in any case, at least for the second change (i.e., the one that actually affects the behavior of ATS), I would be following the usual "disable-puppet -> test on a single host -> incremental rollout with cumin" approach
[19:28:42] looking shortly
[19:30:11] sukhe: thanks, and please take your time :)
[19:31:35] Traffic, DNS, SRE: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? - https://phabricator.wikimedia.org/T389333#11159569 (CDobbins) @ssingh: that's right. I thought about this a bit over the weekend, and I think the easiest approach is going to be doing a clean install in a VM, grabbing...
[19:38:50] swfrench-wmf: looks good, feel free to take ownership of this :>
[19:38:53] thanks for checking as always
[19:42:23] sukhe: great, thank you very much! I'll follow up within the team for review and come back here just to announce when I'm messing around with cache hosts :)
[19:42:30] (likely tomorrow)
[20:14:47] Traffic: Consider rate limiting non-standard thumbnail sizes - https://phabricator.wikimedia.org/T402792#11159912 (matmarex) We've gotten the changes merged in the GitHub repo, and synced to the script at Meta, after resolving some more conflicts between the two (https://meta.wikimedia.org/wiki/MediaWiki_tal...
[20:20:48] swfrench-wmf: noted and please let us know if we can help :>
[20:21:05] (since you are doing most of the work!)
[20:23:06] Traffic, MediaWiki-Platform-Team, MobileFrontend, Reader Experience Team: Toggling desktop view doesn't toggle user back into mobile mode - https://phabricator.wikimedia.org/T403866#11159938 (Krinkle) a:Krinkle→Jdlrobson-WMF @Jdlrobson-WMF It looks like you're using Chrome Mobile on Andro...
[20:53:56] Traffic: Consider rate limiting non-standard thumbnail sizes - https://phabricator.wikimedia.org/T402792#11160068 (Ladsgroup) @matmarex I'm thankful for the work you have done on chasing down and getting it updated.
[22:05:36] Traffic, MediaWiki-Platform-Team, MobileFrontend, Reader Experience Team: Toggling desktop view doesn't toggle user back into mobile mode - https://phabricator.wikimedia.org/T403866#11160409 (Jdlrobson-WMF) a:Jdlrobson-WMF→Krinkle > @Jdlrobson-WMF It looks like you're using Chrome Mobile...
[22:05:39] Traffic, MediaWiki-Platform-Team: Mobile domain removal: Random page doesn't work with mobile/desktop switching - https://phabricator.wikimedia.org/T404023 (Jdlrobson-WMF) NEW