[05:06:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[05:11:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[08:23:10] Traffic, Data-Engineering (Q3 2025 January 1st - March 31th): Migrate Benthos `webrequest_sampled_live` to feed from HAProxy data - https://phabricator.wikimedia.org/T390029#10692775 (JAllemandou) Awesome, thank you @elukey :)
[09:21:05] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[09:26:00] RESOLVED: [2x] PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[09:29:00] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[09:32:51] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10693017 (ayounsi) Alarms graphing is working well. {F58951374} On this dashboard as well: https://grafana.wikimedia.org/d/fb...
[09:39:00] RESOLVED: [2x] PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[09:42:00] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[09:42:32] hello! I have another ATS lua change, this time a little more complex than usual. the code already has serviceops review but wanted to get an okay from ye if possible https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131748
[09:42:54] basically we want to roll out changes to all wikis bar a few and this was the most straightforward way to do it
[09:47:00] RESOLVED: [2x] PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[09:48:34] hi hnowlan, let me check
[09:50:49] thanks!
[09:53:00] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[09:57:18] hnowlan looks good to me, but maybe someone with more experience in lua can have a more authoritative opinion
[11:09:51] FIRING: FermMSS: Unexpected MSS value on 10.2.2.44:443 @ registry1004 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=eqiad&var-cluster=misc - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[11:20:18] also happy to add more tests if it'd give certainty
[11:21:58] Traffic: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227#10693244 (Fabfur)
[11:38:01] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389884#10693273 (ayounsi) Open→Invalid The alert was too sensitive, I made https://gerrit.wikimedia.org/r/c/operations/alerts/+/1132591 to improve it.
[11:40:04] netops, Infrastructure-Foundations, SRE, Patch-For-Review: No Juniper alarms in SNMP for MX204 - https://phabricator.wikimedia.org/T241105#10693284 (ayounsi) Closing this task as we now have alerting for all the MX running a not too old Junos (and we're upgrading Junos in T364092).
[11:40:08] netops, Infrastructure-Foundations, SRE, Patch-For-Review: No Juniper alarms in SNMP for MX204 - https://phabricator.wikimedia.org/T241105#10693287 (ayounsi) Stalled→Resolved a:ayounsi
[11:43:23] RESOLVED: FermMSS: Unexpected MSS value on 10.2.2.44:443 @ registry1004 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=eqiad&var-cluster=misc - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[11:51:01] fabfur: --^
[12:18:36] tnx!
[12:18:40] good job!
[13:08:49] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10693542 (cmooney)
[13:45:59] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10693761 (Ladsgroup) Update: if all goes well, this should be done in two to three weeks.
[13:49:29] Domains, Traffic: [toolforge] transfer/adopt toolsbeta.org domain to the foundation - https://phabricator.wikimedia.org/T362253#10693901 (Andrew) > Hello, Doneva! > > We have recently acquired a new domain for one of our test/dev clusters, toolsbeta.org. This will mirror the existing domain toolforge.or...
[13:56:19] Domains, Traffic: [toolforge] transfer/adopt toolsbeta.org domain to the foundation - https://phabricator.wikimedia.org/T362253#10693953 (Andrew) Oops, seems Doneva is no longer at markmonitor, so I need to get new contact info from Rob.
[14:03:07] fabfur: would it be okay with you if I did a test rollout of that change to a single host (via disabling puppet etc) and see if it's safe?
[14:03:59] hnowlan: which change is this? just catching up (fabfur might be out for lunch)
[14:05:11] sukhe: ahh good to know - it's this one https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131748
[14:06:53] ok happy to look at it in ~10 mins or so
[14:10:41] thanks!
[14:15:30] ok for me, sorry was afk
[14:26:05] looking now
[14:39:06] hnowlan: looks good based on whatever context I could muster :). please feel free to deploy, on a single host and then roll it out
[14:39:11] or let us know if you want us to do it
[14:44:36] thanks! I will give it a go now
[14:47:28] ah, we have to wait a little bit while warming caches - it'll be tomorrow morning
[14:47:48] ok!
[14:51:24] netops, Infrastructure-Foundations: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10694188 (cmooney) +1. I think I disabled it on the fasw a while ago as it was unable to connect to them, and I was worried about wasting clock cycles trying. But since their upgrades I t...
[14:55:51] netops, Infrastructure-Foundations: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10694205 (joanna_borun) p:Triage→Medium
[14:55:58] netops, Infrastructure-Foundations: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052#10694206 (ayounsi) a:ayounsi
[15:34:34] Traffic: Unify CDN ats/haproxy/varnish upgrade cookbooks - https://phabricator.wikimedia.org/T390094#10694350 (BCornwall) I think having to have documentation for cookbooks so simple supports my argument that abstractions are not the answer here.
[15:36:28] Traffic: Unify CDN ats/haproxy/varnish upgrade cookbooks - https://phabricator.wikimedia.org/T390094#10694359 (Volans) >>! In T390094#10685679, @Fabfur wrote: > Not much experience in writing cookbooks but I like the approach of having a `__init__` with common code and minimal code on the specific cookbooks,...
[20:00:56] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10695612 (xcollazo) Hello, I would like to exercise this rule by running a very heavy Presto query. Is t...
[20:23:58] Traffic, Patch-For-Review: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737#10695703 (BCornwall) Stalled→Resolved
[20:24:06] Traffic, Patch-For-Review: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737#10695708 (BCornwall)
[21:10:27] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10695918 (BTullis) >>! In T381389#10695612, @xcollazo wrote: > Hello, I would like to exercise this rule...