[02:19:29] Traffic, Performance-Team, SRE, serviceops: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[02:19:50] Traffic, Performance-Team, SRE, serviceops: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) I edited the task description with a proposed rollout plan, and I renamed the task to encompass the actual work, not just deciding on the work.
[06:22:38] (LVSHighRX) firing: Excessive RX traffic on lvs2009:9100 (ens2f0np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2009 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[06:37:15] this says # page, but didn't page
[06:37:38] (LVSHighRX) resolved: Excessive RX traffic on lvs2009:9100 (ens2f0np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2009 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[08:20:51] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (ayounsi) a:Papaul @papaul, nice! We should keep all the same switch's uplinks on the same breakout cable: So instead of doing: 0/0 - asw2-c-eqiad:xe-2/0/[44...
[08:34:07] hello
[08:34:44] the icinga-wm| PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL alert in -operations seems related to a lot of new AAAA records for lvs's VLANs names
[08:34:53] were they added to Netbox without running the sre.dns.netbox cookbook?
[08:34:59] could you please take care of it?
[08:35:58] the changes are all for lvs[2007-2010]
[08:36:13] that would be brett's work
[08:36:27] I mean he was working on that last EU night
[08:37:26] also the IPv6 for the vlan's names don't follow our mapping format, so not sure how they were set
[08:37:56] (HAProxyEdgeTrafficDrop) firing: 54% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:37:58] are we sure those are needed in the DNS?
[08:39:58] per https://phabricator.wikimedia.org/T179026 yes
[08:40:42] those are mngtmpaddr AFAICT
[08:42:56] (HAProxyEdgeTrafficDrop) resolved: 55% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:43:19] vgutierrez: the problem of leaving the changes uncommitted is that if anyone runs the dns cookbook, either manually or via the decommission or makevm cookbooks, they will get the same extra diff and will get stuck and probably come here to ask anyway :)
[08:51:16] in addition, if I'm not mistaken, the DNS names don't match the iface names, unless I'm reading them wrong
[08:51:48] I addressed that on -sre
[08:52:04] I'm assuming that the iface names come from stretch times
[08:52:16] and those got bumped on the upgrade to buster
[08:52:25] predictable interface names :)
[08:52:50] TBH I don't know if it makes any sense to add mngtmpaddr addresses to DNS
[08:53:33] and all of this has been triggered by a task complaining about some lvs hosts not having AAAA records
[08:53:45] maybe some clarification is needed after all
[08:54:00] the original AAAA records task was only about the primary IP of the host
[08:54:21] and I can confirm that as of yesterday lvs[1013-1016,2007-2010] are missing it, while all the other lvs hosts have it
[08:55:03] I don't know the context behind T179026, so can't say what that was supposed to be about
[08:55:03] T179026: LVS IPv6 IPs should all be recorded in DNS - https://phabricator.wikimedia.org/T179026
[08:55:25] that's T179025
[08:55:25] T179025: LVS hosts should have static-mapped IPv6 on all virtual interfaces - https://phabricator.wikimedia.org/T179025
[08:55:44] and that would imply getting rid of those mngtmpaddr first
[08:55:53] I see
[08:56:40] could we remove those AAAA records from netbox and re-import them if needed later?
[08:57:15] remove for sure; for the re-import I need to check how easy that would be
[08:57:49] I'd hate to remove them now and force brett to re-add them this afternoon if they were needed after all
[08:57:58] * volans same
[08:59:14] what I can do is to remove the dns name only for now (that's what triggers the diff) and those should be easier to re-add, let me see if I can do that in a way that simplifies the re-addition
[09:02:44] thx <3
[09:12:02] vgutierrez: in which of the above tasks do you suggest I put my update so that it's not lost in IRC/
[09:12:05] ?
[09:13:16] I'm failing to find the original AAA records task
[09:14:10] *AAAA
[09:14:59] T271144 ?
[09:14:59] T271144: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144
[09:15:17] yep, T271144
[09:15:21] ok
[09:15:23] (18 secs late)
[09:38:04] vgutierrez: dns cookbook is back to a noop! I'm updating the task with the details
[09:38:21] I've left a tmux with my code in netbox1002 so that we could easily re-add them if needed (and the data will also be in the task)
[09:52:27] vgutierrez: I've updated T271144 with all the details. All good
[09:52:28] T271144: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144
[13:42:37] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul)
[14:47:07] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Configure cloudsw1-e4-eqiad and cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T304936 (Cmjohnson)
[15:28:37] Thank you volans and vgutierrez. I apologize for not committing that. I *thought* I had to but couldn't find any mention of it in [[Netbox]]. I see now there's a link on that page but I missed it
[17:33:01] Traffic, Beta-Cluster-Infrastructure, SRE, Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (AlexisJazz)
[17:41:24] Traffic, Beta-Cluster-Infrastructure, SRE, User-DannyS712: 502 error on beta commons - https://phabricator.wikimedia.org/T250103 (AlexisJazz)
[17:43:46] Traffic, Beta-Cluster-Infrastructure, SRE, User-DannyS712: 502 error on beta commons - https://phabricator.wikimedia.org/T250103 (AlexisJazz)
[17:44:40] Traffic, Beta-Cluster-Infrastructure, SRE, User-DannyS712: 502 error on beta commons - https://phabricator.wikimedia.org/T250103 (AlexisJazz)
[19:00:42] Traffic, Infrastructure-Foundations, SRE, SRE-tools, IPv6: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144 (BCornwall) Thank you for doing that, @Volans ; I apologize for forgetting to run the cookbook. I'm a little confused here regarding onl...
[22:00:56] Traffic, DNS, SRE, WMF-Legal, and 4 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (Varnent) >>! In T310738#8035453, @Dzahn wrote: >>>! In T310738#8033789, @LSobanski wrote: >> @Varnent After chatting about this...