[00:26:46] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:21:46] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:50:30] Puppet, Wikidata, Wikidata Dev Team, wmde-wikidata-tech, and 2 others: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072#9819960 (Manuel)
[08:05:22] More reasons to not do business with Cogent https://www.reddit.com/r/networking/comments/1cu13bv/cogent_depeering_tata/
[08:06:22] and https://hostux.social/@alarig/112481875248210287
[08:08:22] SRE-tools, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9820025 (MoritzMuehlenhoff)
[09:07:04] volans, jayme, should we give the rename cookbook a try ? do you have a guinea pig host in mind?
[09:10:35] XioNoX: not really, we had a couple but these needed to be finished quickly so h.ugh did them manually
[09:10:58] but you could try with some random k8s node that is named mwXXX
[09:11:39] currently we're doing some maintenance on the codfw k8s cluster though. So maybe better to wait until that is finished :)
[09:13:23] I don't have any, sorry :) but happy to be around when testing
[09:15:01] jayme: no pb, when is it scheduled to be finished ? I'd rather not lose the momentum that cookbook got
[09:15:10] eheh
[09:15:44] it will take a couple of hours - so maybe plan for tomorrow if that's okay
[09:16:09] jayme: oh yeah that's great !
[09:17:54] XioNoX: do we have a phab task for the cookbook? We could coordinate there
[09:20:29] jayme: not that I'm aware of, is there a phab task for renaming the mw hosts ? :)
[09:20:51] no, I don't think we're really planning on doing that tbh
[09:21:07] more like waiting for the nodes to rotate out
[09:21:24] we could maybe during the next reimage, though
[09:22:09] ah, I thought there was a need for this cookbook
[09:22:24] what's the blocker to rename the mw if we have automation for it ?
[09:22:45] none, really - it's just work still, even with automation
[09:23:18] I think we can pick this up with the next k8s/os upgrade of the k8s worker fleet
[09:23:41] oh ok, any idea when it's planned for
[09:23:42] ?
[09:23:51] can we at least rename any future migration of mw hosts to k8s?
[09:24:04] def. month ahead XioNoX
[09:24:31] volans: sure, for the remaining ones we can do with the cookbook ofc
[09:24:58] jayme: if it's just months I think it's ok
[09:25:02] how many of the remaining ones are in codfw A/B rows?
[09:25:08] to understand how the move-vlan is needed for them
[09:28:36] XioNoX: multiple months ;)
[09:28:43] volans: there is a list here https://docs.google.com/spreadsheets/d/1VqgWZxmP6LqUgFChIvV5BYvHqr1ZhUh17iXgJ26_1UM/edit#gid=1295795675
[09:28:50] I'll hopefully be done with sretest2002 later today (using it for dhcp testing)
[09:28:55] hello! I'm afraid I am back :( I've set bgp to true for one of the hosts from yesterday, and when running `homer 'lsw1-b6-codfw*' commit 'new k8s control node'` I get "ERROR:homer_plugins.wmf-netbox:No BGP group found for wikikube-ctrl2001"
[09:29:36] hnowlan: I'll take a look
[09:29:50] this hostname prefix "wikikube-ctrl" is new right?
[09:30:10] needs to be added to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/homer/deploy/+/refs/heads/master/plugins/wmf-netbox.py#16
[09:30:14] yep
[09:30:24] ahh thanks
[09:30:40] and https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions ;)
[09:30:57] ah now that last one I'd have missed thanks
[09:31:30] hnowlan: I guess what we need to know is how the BGP needs to be set up
[09:31:31] I'm sure there are lots of them missing
[09:31:43] in terms of AS number, routes that will be announced
[09:31:55] I suspect these are a carbon copy of an existing setup though are they?
[09:32:08] in which case we can point them at an existing group
[09:33:09] yea they're the same as the old kubemaster
[09:33:34] ok give me a few mins
[09:33:43] thank you!
[09:34:43] hnowlan: I'll leave adding to the wikitech list to you
[09:35:02] grand, thanks
[09:38:35] XioNoX: we have ~212 unique prefixes extracted from cumin's A:all, I can paste them somewhere if needed ;)
[09:41:00] volans: feel free to paste them over there https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions ;)
[09:43:44] XioNoX: only if you tell me how I can keep the order by name while editing :D
[09:45:28] more seriously, I don't think that page is a good source of truth, it should fold into the various ideas we had to move them to Netbox
[09:45:45] but not an easy one :)
[09:46:01] totally agree
[09:46:48] hnowlan: will there be a `wikikube` too ? or only `wikikube-ctrl` ?
[09:47:06] (or -worker)
[09:50:53] XioNoX: good question
[09:51:29] looking at the current list, naming seems to be all over the place :)
[09:51:30] I think expecting wikikube-workerXXXX would be fair
[09:51:45] yeah, especially for "wikikube" it's a mess
[09:52:26] but if we rename some of them during the next OS reimage, we can also rename them all and let them become wikikube-workerXXXX
[09:52:44] topranks: can you add ^ to https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/1034853 ?
[09:53:01] jayme: that would be awesome, yeah
[09:53:17] XioNoX: good shout!
[09:53:48] jayme: "wikikube-worker" should also be the same as "wikikube-ctrl" in terms of bgp?
[09:54:01] i.e. same group/setup as kubemaster?
[09:54:28] same as "kubernetes" to be precise, but as that's the same as kubemaster: yes
[09:54:39] XioNoX: https://wikitech.wikimedia.org/w/index.php?title=SRE%2FInfrastructure_naming_conventions&diff=2179925&oldid=2179921
[09:54:51] cool
[09:55:20] if they all share the same bgp setup could we change it to match patterns?
[09:55:45] to avoid having to name them all
[09:56:11] volans: I prefer to have them explicitly defined, less error-prone for only a few more lines
[09:56:13] added now if someone wants to +1
[09:56:28] k
[09:56:33] yeah I kind of agree best to list them individually
[09:56:52] topranks: +1 on the change, thx !
[09:56:55] if it got really long we'd probably need to reconsider
[09:57:00] cheers :)
[09:58:29] XioNoX: i've added https://phabricator.wikimedia.org/T365571
[09:58:50] jayme: thx! I'm still down to test the cookbook tomorrow, so it's ready for when we will really need it
[09:59:19] yeah, absolutely
[10:06:51] hnowlan: I've merged that and updated the cumin hosts
[10:07:02] ran homer against lsw1-b6-codfw to test and it went fine
[10:07:14] BGP is established with wikikube-ctrl2001
[10:07:34] great! Thanks a lot
[10:07:46] np!
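Editor's note: the exchange above settles on listing each hostname prefix explicitly rather than matching patterns, and the failure mode was a prefix missing from that list. A minimal sketch of that idea follows. It is not the actual Homer plugin code: the names `PREFIX_TO_BGP_GROUP`, `hostname_prefix` and `bgp_group_for` are hypothetical, and the group name `"kubernetes"` is only assumed from the remark that these hosts share the existing kubernetes/kubemaster setup.

```python
import re

# Hypothetical explicit mapping of hostname prefix -> BGP group.
# The real list lives in the Homer wmf-netbox plugin and is not reproduced here.
PREFIX_TO_BGP_GROUP = {
    "kubernetes": "kubernetes",
    "kubemaster": "kubernetes",
    "wikikube-ctrl": "kubernetes",
    "wikikube-worker": "kubernetes",
}


def hostname_prefix(hostname: str) -> str:
    """Drop any domain part and the trailing digits from a hostname."""
    short = hostname.split(".", 1)[0]
    return re.sub(r"\d+$", "", short)


def bgp_group_for(hostname: str) -> str:
    """Return the BGP group for a host, failing loudly on unknown prefixes."""
    prefix = hostname_prefix(hostname)
    try:
        return PREFIX_TO_BGP_GROUP[prefix]
    except KeyError:
        # Mirrors the wording of the error quoted in the log (assumption:
        # the real plugin may log rather than raise).
        raise LookupError(f"No BGP group found for {hostname}") from None
```

With an explicit table, a new prefix such as `wikikube-ctrl` fails until someone adds it on purpose, which is the "less error-prone for only a few more lines" trade-off mentioned above; a pattern match would accept it silently.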
[10:07:59] confirmed established session from wikikube-ctrl2001
[10:23:30] netops, Infrastructure-Foundations, SRE: Move device AS numbers out of Homer YAML and source from Netbox - https://phabricator.wikimedia.org/T365572 (cmooney) NEW p:Triage→Low
[10:27:06] homer, SRE-tools, Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#9820426 (Volans) I had a quick look to understand our options in terms of parallelization. Keeping in mind the usual 3 possible approaches: multi-process, multi-thread, async....
[10:42:15] homer, SRE-tools, Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#9820460 (cmooney) One observation is that the config generation could be parallelized separate to the router transport. i.e. once the globbing on hostnames is done spawn separ...
[10:57:10] homer, SRE-tools, Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#9820482 (Volans) Sorry for not mentioning it, the parallelization of the configuration generation was implicit to me, and also easier, but ideally we should parallelize both an...
[11:10:50] netops, Infrastructure-Foundations, SRE: Move device AS numbers out of Homer YAML and source from Netbox - https://phabricator.wikimedia.org/T365572#9820507 (cmooney) The other thing that strikes me here is how to manage the individual-device ASNs, for instance 4265003001 on asw1-bw27-esams. And als...
[13:13:58] netops, Infrastructure-Foundations, Patch-For-Review: Juniper: use export-format state-data json compact - https://phabricator.wikimedia.org/T362523#9820973 (ayounsi) Tested on a MX204 running Junos 21.2 and 22.4R3.25, the returned JSON is invalid... See for example a basic `show interfaces xe-0/1/2...
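Editor's note: the T250415 comments above suggest generating device configurations in parallel once the hostname glob has been resolved, independently of the router transport. A rough multi-thread sketch of that one step is below. It is an assumption-laden illustration, not Homer's design: `generate_config` and `generate_all` are hypothetical names, the rendering is a placeholder, and the task comments also list multi-process and async as options.

```python
from concurrent.futures import ThreadPoolExecutor
from fnmatch import fnmatch


def generate_config(device: str) -> str:
    """Hypothetical stand-in for rendering one device's configuration."""
    return f"# config for {device}\n"


def generate_all(devices, pattern, max_workers=4):
    """Resolve the glob first, then render the matching devices in parallel."""
    matched = sorted(d for d in devices if fnmatch(d, pattern))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so configs line up with `matched`.
        configs = list(pool.map(generate_config, matched))
    return dict(zip(matched, configs))
```

For example, `generate_all(["lsw1-b6-codfw", "asw1-bw27-esams"], "lsw1-*-codfw")` renders only the first device. Pushing the results to the devices would remain a separate step, which is the separation the comments describe.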
[13:47:20] netops, cloud-services-team, Infrastructure-Foundations, SRE, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9821206 (aborrero) Open→Stalled blocking until {T364984} is fixed, so we don't risk having another cloudvirt offline.
[14:23:56] XioNoX: to be frank I thought you would probably do the testing of the cookbooks 😇
[14:24:49] jayme: oh yeah, I'm happy to do the testing!
[14:25:04] just that it requires more than 1 host to test them fully
[14:25:17] and of course I'll let you do the depooling :)
[14:26:00] ah, cool! <3 You can definitely get a second one :)
[14:26:53] XioNoX: requirement is it needs to be in row A or B, right?
[14:27:11] jayme: for the move-vlan, indeed
[14:30:38] ack. I'll clear out kubernetes2032 then as well (so you have 2 nodes in row B for testing)
[14:31:46] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:43:29] XioNoX: I'll be around tomorrow until like 10:00Z - then I'll leave commuting into vacation :) but hnowlan will absolutely be equipped to help/support the tests (he does not know that I'm volunteering him, though :))
[14:44:34] jayme: alright, ping me when you get online and we can get going
[14:48:56] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:49:01] XioNoX: ack
[14:53:20] XioNoX: there's probably some prep work to do, adding something to the preseed file. I can take care of that. Anything that would require special attention given the fact that the nodes are BGP peers?
[14:54:30] double check that BGP is disabled and it's not receiving prod traffic
[14:55:04] then once renamed and/or moved, run homer to update the peerings
[15:02:36] so you think we should disable BGP (via netbox and homer run) before?
[15:06:06] jayme: if we want to be double safe sure, but in theory the host shouldn't accept any BGP or advertise any prefix if it's properly depooled
[15:09:50] I think we should be fine (although we would produce BGP alerts during the downtime). There is no workload running on the nodes, so there can't be any traffic really.
[15:10:23] apart from stuff that comes via LVS directly to a hostport, which should not be the case because pooled=inactive
[15:10:48] cool yeah
[15:16:46] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:23:08] do we need a task for the arelion's email about IPv6 migration for Wikimedia Foundation?
[15:23:21] yep
[15:23:23] s/email/emails/
[16:08:55] SRE-tools, Infrastructure-Foundations, Spicerack: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9822235 (joanna_borun)
[16:20:12] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9822300 (Volans)
[17:07:33] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9822720 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2002.wikim...
[17:56:03] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9822855 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host sretest2002.wikimedia...
[18:05:32] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9822898 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2002.wikim...
[18:06:15] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9822895 (cmooney) Ok seems like we have a solution. I added the "forward-only" statement to the EVPN switches in codfw row A...
[18:33:33] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9822951 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cmooney@cumin1002 for hosts: `sretest2002.wikimedia...
[18:43:24] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9822970 (cmooney) @Papaul @Jhancock.wm I'm done with sretest2002 now and ran the decom cookbook so feel free to put it back i...
[19:16:46] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:28:56] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:39:55] Mail, collaboration-services, Infrastructure-Foundations, SRE: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9823354 (jhathaway)