[09:57:45] Traffic, Patch-For-Review, Release Pipeline (Blubber), Release-Engineering-Team (Priority Backlog 📥): Remove blubberoid LVS/k8s service - https://phabricator.wikimedia.org/T365742#9866873 (jijiki) I will take care of the rest when I find some time, thank you!
[11:37:38] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867080 (cmooney) Detailed steps are in P64182
[11:55:56] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867102 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=54328f3a-52e5-42cd-bdf1-26ee5617a4d5) set by cmooney@cumin1002 for 0:40:00 on 1 host(s) and their...
[12:15:09] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867139 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=512f5f90-4832-4c61-b0eb-75b61fcd6f8c) set by cmooney@cumin1002 for 1:30:00 on 18 host(s) and thei...
[12:25:39] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867154 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=76763bfc-4091-4d8a-b3f8-e84d96a9bd49) set by cmooney@cumin1002 for 0:40:00 on 1 host(s) and their...
[12:42:15] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9867210 (MatthewVernon) @Eevans are you OK to do this, please? Should just be a case of checking `swift-dispersion...
[12:43:42] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9867212 (MatthewVernon) @Eevans would you be OK to handle this as well, please? It's a bit more involved as you'll...
[12:47:41] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9867218 (MatthewVernon) @Eevans you OK to handle this, please? Should just be a quick cluster health check afterwa...
[12:57:40] FIRING: [6x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[12:57:52] ^ hmm
[12:58:55] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&refresh=1m
[12:59:19] spike of 503s on mw-web-ro
[12:59:48] https://grafana.wikimedia.org/goto/jb8cbQsIR?orgId=1
[13:00:29] a deploy is starting, I wonder if we should notify them
[13:00:57] topranks: what's the timestamp for enabling pybal on lvs1019?
[13:01:19] vgutierrez: it's enabled now
[13:01:29] I konw
[13:01:32] *know
[13:01:36] 12:54:51 UTC
[13:01:42] 503s started at 12:58
[13:02:02] going down already
[13:02:12] 12:54:41
[13:02:40] FIRING: [18x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:04:41] Traffic, Patch-For-Review: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466#9867253 (Vgutierrez)
[13:04:57] they hardly can be a co-incidence
[13:07:40] FIRING: [20x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:12:40] FIRING: [30x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:16:16] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867277 (cmooney) The first phase of this is complete, ssw1-e1-eqiad has been upgraded. I am going to pause before completing ssw1-f1-eqiad as some of the output is stran...
[13:18:27] topranks: I've also seen a spike on lvs2013 at the same time
[13:18:56] hmm ok
[13:19:02] perhaps just a co-incidence
[13:19:32] I'm pausing my work now anyway to validate ssw1-e1-eqiad is healthy after upgrade, so we can see if things remain stable
[13:27:40] FIRING: [21x] VarnishHighThreadCount: Varnish's thread count on cp1104:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:30:28] effie, claime: issues in k8s lvs in eqiad could steer traffic to codfw?
[13:32:16] I don't think so?
[13:32:40] RESOLVED: [12x] VarnishHighThreadCount: Varnish's thread count on cp1104:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:37:06] there was a bunch of 405s at the same time
[13:37:56] And a spike in latency
[13:37:58] https://grafana.wikimedia.org/goto/hLvbfQsIg?orgId=1
[14:15:52] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867516 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2e3e9f53-54b4-4b8d-b9d6-ab280392b41c) set by cmooney@cumin1002 for 2:00:00 on 3 host(s) and their...
[14:59:57] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867739 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e84998aa-eea9-43ce-9047-23b408d134b5) set by cmooney@cumin1002 for 1:30:00 on 15 host(s) and thei...
[15:04:43] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867757 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8ea52962-5718-4917-aeee-12b979b25d42) set by cmooney@cumin1002 for 1:30:00 on 1 host(s) and their...
[15:06:31] would it help to depool the eqiad front edge?
[15:06:39] it would take the ro traffic off of that side anyways
[16:42:33] vgutierrez: did you arrive at any theory as to what happened earlier?
[16:42:58] the network maintenances are done so I was gonna work on moving the links back to the spines when John is back on site in a few mins
[16:43:27] i.e. shut PyBal on lvs1019, allow traffic to move to lvs1020, recable lvs1019, then reset...
[16:43:30] and repeat for the other two
[16:43:48] sukhe: interested to get your thoughts too, that *should* be ok right?
[16:46:48] topranks: we had a big spike on the internal lvs on both DCs so it should be related to external traffic
[16:47:20] A small cache busting event
[16:47:23] yeah I think so
[16:47:36] timing just made it seem related
[16:47:52] I'll failover lvs1019 to lvs1020 in that case and keep a close eye on things
[16:49:39] topranks: I think while the timing was suspicious, the events were not related and the dashboards confirm that. but yeah, we can monitor the situation when you do it
[16:49:42] no concerns from me
[16:49:58] ok cool, thanks both for the input :)
[18:11:32] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9868580 (ssingh) Moving the links working out well (which I think this is the first time?) is a big take away from this task; glad to hear it went nicely!
[18:59:57] Traffic, MW-Interfaces-Team, serviceops: map the /api/ prefix to /w/rest.php - https://phabricator.wikimedia.org/T364400#9868777 (daniel)
[21:13:14] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869306 (RobH) p:Medium→High @Jclark-ctr or @VRiley-WMF: Would one of you be able to take care of this on your next on-site visit? We have light on the drm...
[21:32:58] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869358 (wiki_willy) a:Jclark-ctr
[21:34:26] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869359 (wiki_willy) Valerie is on vacation, so assigning to John
[21:44:37] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869378 (RobH)
[22:30:47] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869444 (Jclark-ctr) Installed cross connect link came up on port. cableid #5229
[22:33:00] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869454 (RobH)
[22:33:56] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869455 (RobH) Open→Resolved Looks good to me on this end, thank you!