[09:35:35] netops, Infrastructure-Foundations: mr1-eqsin performance issue - https://phabricator.wikimedia.org/T362522#9766569 (cmooney) p:High→Medium
[09:49:16] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 22.2R3 - https://phabricator.wikimedia.org/T364092 (cmooney) NEW p:Triage→Medium
[09:50:59] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 22.2R3 - https://phabricator.wikimedia.org/T364092#9766638 (cmooney)
[09:55:38] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 22.2R3 - https://phabricator.wikimedia.org/T364092#9766653 (ayounsi) Both Junos 22.2R3-Sx and Junos 22.4R3 are the latest recommended. fyi, I went with 22.4R3 in magru.
[10:16:30] netops, Infrastructure-Foundations, ops-codfw, SRE: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095 (cmooney) NEW p:Triage→Medium
[10:16:52] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9766721 (cmooney)
[10:16:53] netops, Infrastructure-Foundations, ops-codfw, SRE: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9766720 (cmooney)
[10:25:35] netops, Infrastructure-Foundations, ops-codfw, SRE: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097 (cmooney) NEW p:Triage→Medium
[10:25:48] netops, Infrastructure-Foundations, ops-codfw, SRE: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9766766 (cmooney)
[10:25:49] netops, Infrastructure-Foundations, ops-codfw, SRE: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766767 (cmooney)
[10:28:05] netops, Infrastructure-Foundations, ops-codfw, SRE: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766769 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b27eb80b-98ee-43fb-8026-b02b3e00b5d4) set by cmooney@cumin1002 for 14 days, 0:00:00 on 3 host(s) and their...
[10:30:00] sukhe: the magru02 Ganeti cluster is also ready now for VM creation (confirmed by creating bast7001)
[10:35:36] netops, Infrastructure-Foundations, ops-codfw, SRE, Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766810 (cmooney) Device has been removed from LibreNMS now. I also downtimed it for 2 weeks just in case I mess up the order of anything.
[10:37:58] moritzm: thank you!
[10:39:14] moritzm: the microcode mitigations check is flagging on all instances we did. I am sure you know, but I thought I should point it out in case it was forgotten
[10:39:18] "sudo gnt-cluster modify -H kvm" etc
[10:40:58] yeah, I forgot to do that yesterday, but applied it to the cluster settings this morning, so that all new VMs get it right away
[10:41:17] we can retroactively apply it to the running VMs
[10:41:17] thanks!
[10:41:31] they'll need a qemu reboot, but no issue given they are all being set up
[10:42:38] yeah
[10:42:58] possibly stupid question: is there a reason we don't apply them as part of the ganeti setup cookbooks?
[10:43:29] it's a cluster setting, it's not needed as a per-VM setting
[10:43:46] the VMs from yesterday we just created before the cluster setting was made
[10:44:52] netops, Infrastructure-Foundations, ops-codfw, SRE, Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766846 (cmooney)
[10:45:21] yeah I meant in how you set up the cluster
[10:45:29] but I think now I realize that there isn't a cookbook for that
[10:45:33] it's just sudo gnt-cluster init ?
[10:46:15] yeah, this step is just the gnt-cluster init
[10:46:31] but we can't immediately set the CPU flags during the init, that's a second command
[10:46:39] ok, yeah. well we do have the Icinga check so that's the automatic reminder :)
[10:50:23] netops, Infrastructure-Foundations, ops-codfw, SRE, Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9766856 (cmooney)
[10:57:13] sukhe: ok to reboot ncredir7001, doh7001 and durum7001 now?
[10:58:33] yep please do
[11:13:10] netops, Infrastructure-Foundations, SRE: Adjust IBGP route-reflector spine/leaf automation to support separate client clusters - https://phabricator.wikimedia.org/T364103 (cmooney) NEW p:Triage→Medium
[11:18:06] sukhe: all done and alerts cleared
[11:18:17] thanks! <3
[11:35:31] XioNoX: very grateful for being able to just do "enable BGP" on netbox and run homer for the BGP configuration (Cathal just shared). thanks!
[11:36:32] happy that it's useful :)
[11:51:44] netops, Infrastructure-Foundations, ops-codfw, SRE, Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767123 (cmooney)
[12:07:50] netops, Infrastructure-Foundations, ops-codfw, SRE, Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767210 (cmooney)
[12:41:59] XioNoX, topranks: DNS sync shows me a diff for removing irb-2017.lsw1-a1 and irb-2021.lsw1-a1, safe to proceed?
[12:42:16] moritzm: yep
[12:42:18] thx !
[12:42:25] ack, merging
[12:44:24] thanks, sorry you're just ahead of me :)
[12:45:27] always happy to remove things :-)
[12:50:03] the DNS sync failed, does that ring a bell to anyone? https://paste.debian.net/hidden/d30f749f/
[12:50:15] I could simply retry, maybe it's just some short-lived race
[12:50:36] no
[12:50:49] moritzm: give me a min, that's me
[12:50:55] ack
[12:51:08] there is nothing in the range anymore - but the 'include' for the reverse is in the static zonefile in the dns repo
[12:51:11] I'll make a patch
[12:55:34] ok
[13:25:56] moritzm: sorry for the hassle there, dns stuff fixed now and I've pushed any pending changes
[13:26:01] thx!
[13:40:01] netops, Infrastructure-Foundations, ops-codfw, SRE: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767654 (cmooney)
[13:41:07] netops, Infrastructure-Foundations, ops-codfw, SRE: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767659 (cmooney) a:Papaul @papaul I think this one is ready to be moved to rack D1 now.
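
To make the Ganeti cluster-setup step discussed above ([10:39]–[10:46]) concrete, here is a minimal, hypothetical sketch of how the "second command" could be folded into a setup cookbook. It assumes a spicerack-style run() entry point; the master hostname, the cpu_type hypervisor parameter and its value are illustrative assumptions, not taken from the log or from any existing cookbook.

def run(args, spicerack):
    """Apply the cluster-level KVM CPU-flag setting right after gnt-cluster init (sketch)."""
    # Target the Ganeti master of the new cluster (hostname is illustrative).
    master = spicerack.remote().query("ganeti7001.magru.wmnet")

    # As noted in the discussion, the CPU flags cannot be set as part of the
    # init itself, so they are applied as a follow-up cluster-level command;
    # the exact hypervisor parameter and value here are assumptions.
    master.run_sync("gnt-cluster modify -H kvm:cpu_type=host")

Existing VMs created before such a step ran would still need the qemu restart mentioned above to pick the new parameter up.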
[13:41:31] netops, Infrastructure-Foundations, ops-codfw, SRE: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767661 (cmooney)
[14:04:57] netops, Infrastructure-Foundations, ops-codfw, SRE, Patch-For-Review: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9767770 (cmooney)
[14:18:49] topranks: XioNoX: sorry, one more question
[14:18:58] I enabled the BGP field on netbox for the durum and doh hosts
[14:19:08] running homer though shows no diff and no changes applied
[14:19:15] BGP session also not established
[14:19:29] is there an intermediary step?
[14:19:43] sukhe: shouldn't be - we are on a call trying to get transit up, will look shortly
[14:19:49] oh sorry
[14:19:50] np
[14:20:25] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:22:28] sukhe: on which host are you running homer?
[14:22:48] cumin1002
[14:23:00] I did a run-puppet-agent to pull in the changes
[14:23:07] let me try that again maybe
[14:24:53] run-puppet-agent shouldn't be needed
[14:25:05] INFO:homer:Homer run completed successfully on 2 devices: ['asw1-b3-magru.mgmt.magru.wmnet', 'asw1-b4-magru.mgmt.magru.wmnet']
[14:25:08] no diff though
[14:25:51] anyway, certainly not urgent
[14:42:42] sukhe: from Telxius "The team is working on resolving the problem as soon as possible. We will let you know as soon as we have an update."
[14:45:25] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:48:09] XioNoX: thanks!
[14:57:14] still looking into that homer/bgp issue
[15:01:11] no problem, not urgent fwiw. we can look into it on Monday as well!
[15:01:24] I'm off most of next week :)
[15:04:09] XioNoX: I had a quick look too and was scratching my head
[15:06:37] but I can look again if you don't spot it
[15:10:37] alright, so it doesn't work for that use case, i.e. VM BGP with an L3 ToR
[15:10:48] we can look for example at https://netbox.wikimedia.org/virtualization/virtual-machines/459/
[15:11:30] that's where it (doesn't) happen: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/homer/deploy/+/refs/heads/master/plugins/wmf-netbox.py#134
[15:12:06] I'm actually surprised it doesn't get removed by automation in esams/drmrs
[15:12:16] I was just about to ask!
[15:12:29] I think we just add it back in the YAML for now?
[15:12:51] manual or yaml would work, yeah
[15:12:56] as soon as you mentioned it, the obvious problem of working out what VM peer should be on what switch occurred to me
[15:14:06] I mean I guess you can work back from the VM to the cluster - and if the cluster is in one rack work it out?
[15:15:22] yeah, I don't think it's too difficult, just an oversight when I wrote the initial automation
[15:16:06] np
[15:16:22] could also work it out from the VM's IP address potentially
[15:18:40] and of course Homer is trying to add the peering to the magru CRs
[15:19:37] I think for now we set BGP=false for them in netbox to deal with that
[15:21:40] yeah, like the other sites
[15:33:15] how did the dns boxes work though?
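
The missing lookup described above ([15:10]–[15:33]) - working back from a VM to the L3 ToR switch its BGP session should land on, either via the Ganeti cluster's rack or via the VM's IP - could look roughly like the pynetbox sketch below. This is not the actual wmf-netbox.py plugin code; the filter names, the "asw" role slug and the fallback behaviour are assumptions.

import pynetbox

nb = pynetbox.api("https://netbox.wikimedia.org", token="...")  # token elided


def tor_switch_for_vm(vm_name):
    """Return the switch a VM's BGP peering belongs on (rough sketch)."""
    vm = nb.virtualization.virtual_machines.get(name=vm_name)

    # Option 1 from the discussion: VM -> Ganeti cluster -> hypervisors -> rack.
    hypervisors = nb.dcim.devices.filter(cluster_id=vm.cluster.id)
    racks = {device.rack.id for device in hypervisors if device.rack}

    if len(racks) != 1:
        # Cluster spans several racks (or rack unknown): fall back to option 2,
        # matching the VM's primary IP against the switch-facing prefixes.
        raise NotImplementedError("derive the switch from the VM's IP instead")

    # Assumes top-of-rack switches carry an "asw"-style device role in Netbox.
    return nb.dcim.devices.get(rack_id=racks.pop(), role="asw")

Either way, the result would let the plugin emit the VM peering on the correct ToR switch instead of trying to add it to the magru CRs. The conversation continues below with why the dns boxes did not hit this.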
[15:33:54] sukhe: they're physical servers
[15:33:57] the dns box is an easier case as it's directly connected to the switch it's peering with
[15:34:24] the VM is one degree of separation further, and it's determining that relationship that isn't baked in
[15:34:59] ah ok
[15:35:04] XioNoX: I think I'll just add these manually for now
[15:35:05] yeah fair
[15:35:15] topranks: yeah that's fine
[15:35:27] I prepped a patch to add them to the YAML, but with that it's trying to remove the peerings to the dns servers :(
[15:35:45] I thought it merged the YAML ones + the ones from netbox, at least from what I remember that was the intent of the code
[15:35:51] I thought they were merging both dicts
[15:35:54] yeah
[15:36:22] oh, it merges the bgp_groups
[15:36:29] but not the sub-keys
[15:36:29] ah ok
[15:36:34] right
[15:39:04] SRE-tools, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9768239 (Gehel)
[15:40:06] sukhe: BGP to all 4 VMs is now up :)
[15:41:01] thank you both!
[15:46:40] doh7002 working also :)
[15:46:44] https://www.irccloud.com/pastebin/0PQwMMKd/
[15:58:23] :D
[15:59:21] so just running down the list, we need to update homer to announce the /24
[15:59:25] but that's OK, we will do it later
[15:59:30] thanks again for the help folks!
[15:59:42] yeah, we can do it any time, just let us know when you want
[16:00:20] it's separate from the ns2 anycast range,
[16:01:03] so we can make the change to announce the wikidough range on its own
[16:01:03] yeah, nice
[16:01:11] I guess we can do it together
[16:01:19] not that wikidough gets that much traffic but yeah
[16:03:05] win 14
[17:04:56] netops, Infrastructure-Foundations, probenet, SRE, and 2 others: Add probenet configuration for magru - https://phabricator.wikimedia.org/T362902#9768659 (CDanis)
[17:12:04] sukhe: "I can confirm that this case is escalated within our organization. My colleagues from the Support Team reported a faulty card that is preventing them from continuing their investigation. Once the technology vendor solves that issue, we will be able to provide some news."
[17:12:11] Telxius
[17:28:29] haha thanks
[17:28:47] let's hope this escalation will lead to something!
[17:33:57] I bet that faulty card was breaking their ARP resolution
[17:34:25] so now when they get that replaced they only need to fix the xconnect they broke during their first troubleshooting pass :P
[17:36:11] topranks: didn't you already say it was a faulty card before they confirmed it? :P
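
To make the merge behaviour from the [15:35]–[15:36] exchange concrete - the plugin combines the top-level bgp_groups keys but not their sub-keys, which is why a group redefined in the YAML wipes out the Netbox-generated peers - here is a minimal sketch; the group name, peer names and addresses are invented for the example.

yaml_groups = {"anycast": {"doh7001": {"ip": "10.0.0.1"}}}
netbox_groups = {"anycast": {"dns7001": {"ip": "10.0.0.2"}}}

# Top-level (shallow) merge: the whole "anycast" sub-dict from the YAML wins,
# so the Netbox-generated dns7001 peering disappears from the rendered config.
shallow = {**netbox_groups, **yaml_groups}
assert "dns7001" not in shallow["anycast"]

# What the discussion expected: also merge one level down, keeping both peers.
deep = {
    group: {**netbox_groups.get(group, {}), **yaml_groups.get(group, {})}
    for group in netbox_groups.keys() | yaml_groups.keys()
}
assert set(deep["anycast"]) == {"doh7001", "dns7001"}

With a per-group merge like the second form, the manual YAML entries and the Netbox-generated ones could coexist in the same bgp_group, which is what the patch mentioned above was assuming.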