[08:14:51] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[08:16:45] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[08:19:21] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[08:24:17] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[09:26:12] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (hnowlan)
[10:30:09] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row D switch...
[10:46:47] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row D switch...
[11:04:49] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (BTullis)
[12:17:47] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ssingh)
[12:19:41] netops, Infrastructure-Foundations, SRE: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (ayounsi) > +1 extending the lifetime is just delaying the issue and increasing the possibility its forgotten or missed Yes and no. It depends on how much we can automate it with...
[12:43:42] (SystemdUnitFailed) firing: netbox_ganeti_drmrs01_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:49:15] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3a841f97-aecd-4c7a-8eb4-8acd1caa15b3) set by ayounsi@cumin1001 for 2:00:00 on 189 ho...
[12:53:21] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MatthewVernon)
[12:54:13] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[12:58:42] (SystemdUnitFailed) resolved: netbox_ganeti_drmrs01_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:10:12] (SystemdUnitFailed) firing: (2) httpbb_hourly_appserver.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:12:38] netbox, DC-Ops, Infrastructure-Foundations: Netbox device's platform field inconsistency - https://phabricator.wikimedia.org/T336623 (Volans) @wiki_willy what's DCOps's position on this? Those are the current platform used: https://netbox.wikimedia.org/dcim/platforms/
[13:24:51] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[13:25:49] netops, Infrastructure-Foundations, SRE: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539 (ayounsi)
[13:25:57] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi) Open→Resolved a:ayounsi All stacks have been upgraded. Hopefully for the last time!
[13:28:07] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[13:28:50] netops, Infrastructure-Foundations, SRE: all network devices must run OpenSSH >= 7.2p1 but != 7.4p1 - https://phabricator.wikimedia.org/T254013 (ayounsi) Stalled→Resolved a:ayounsi Done with all the sub-tasks upgrades.
[13:43:11] with today's upgrade can we start using ed ssh keys on the network devices too?
[13:43:45] volans: :)
[13:44:04] \o/
[13:44:08] * volans sending patch :D
[13:44:20] I haven't tested, but we should be good to go, yeah
[13:44:32] volans: feel free to be the guinea pig
[13:45:17] volans: I updated the root password (and its hash), rolling it to all devices as we speak
[13:45:25] nice!
[13:46:11] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/920279
[13:46:32] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS...
[13:47:03] I'll wait your run to finish to merge
[13:47:07] and then I'll test on one devie
[13:47:10] *device
[13:47:23] volans: I'm going to stop my run
[13:47:32] about to test your key on asw-a-codfw
[13:47:38] ack
[13:47:56] yeah there is a catch
[13:48:13] there always is
[13:48:18] tiny one
[13:48:28] volans: https://github.com/wikimedia/operations-homer-public/blob/master/templates/includes/system/login.conf#L13
[13:48:47] so we need a quick "if start with....
[13:49:02] ack I can do it
[13:49:08] cool, thx
[13:49:16] pushing your key manually in the meantime
[13:49:23] wait a sec
[13:49:52] * sukhe waiting second in line if the key update goes well
[13:49:52] :P
[13:50:18] XioNoX: I can't ssh right now with my old key to asw-a-codfw.mgmt.codfw.wmnet, so I'm checking my ssh config
[13:51:56] lol typo in my ssh :D -codfw.mgmt.eqiad.wmnet
[13:52:18] eh
[13:52:46] ok
[13:52:49] I'm in with the old key
[13:52:51] go ahead
[13:54:33] volans: done, give it a try
[13:54:40] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row D switches...
[13:54:50] maybe silly question, but what's the point of keeping " Riccardo Coccioli (volans) " ?
[13:55:19] it's the comment, I guess it gets saved on the host so in case you have to manually modify them
[13:55:25] you know which key is which
[13:55:39] ssh works, checking with -vvv to ensure it's using the correct one
[13:56:02] yep seems all good
[13:56:07] I'll prepare the patch for the template
[13:57:15] nice and thanks
[13:57:28] I'll open a task then
[13:59:01] XioNoX: try this one https://gerrit.wikimedia.org/r/c/operations/homer/public/+/920281
[13:59:35] looks good
[13:59:59] should I merge and try a run on the cumin host?
[14:00:10] I don't want to interfere with your runs :D
[14:00:14] this can ofc wait
[14:00:30] sure, you can test with asw-a-codfw, should be a noop for your key
[14:00:38] if you merge both
[14:00:54] ack
[14:01:04] I stopped my run
[14:04:02] ack, running on asw-a-codfw diff
[14:04:12] empty diff, I'll try asw-b-codfw
[14:04:20] if that's ok with you XioNoX
[14:04:34] yep, go for it
[14:10:12] (SystemdUnitFailed) resolved: httpbb_hourly_appserver.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:10:21] volans: how is it going?
[14:10:26] XioNoX: all good diff clean and ssh works
[14:10:30] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row D switches...
[14:10:47] not sure if you want me to test some other devices
[14:10:51] that might be a different model
[14:11:12] or just roll-it out and I can do an ssh test to all of them automatically
[14:11:24] just to be sure we're not kicking ourselves ou
[14:11:25] *out
[14:11:41] in any case I'd not change the homer key for a while, to be on the safe side
[14:12:00] netops, Infrastructure-Foundations: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (ayounsi) p:Triage→Medium
[14:12:08] volans: yeah I'm not changing mine today
[14:12:14] as long as the device accepts it it's fine
[14:12:30] I can take care of rolling it out don't worry
[14:12:41] volans, sukhe, I opened https://phabricator.wikimedia.org/T336769
[14:13:13] thanks!
[14:13:17] prepping the patch
[14:13:36] ack, thx
[14:13:44] * volans updating his ssh config
[14:13:49] sukhe: thx, will wait for yours to roll it out
[14:14:59] netops, Infrastructure-Foundations: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (Volans)
[14:16:55] volans: do you use your regular prod key or a dedicated one?
[14:17:06] I used my regular key as well
[14:17:13] but I think probably a dedicated one might be better
[14:17:18] not sure
[14:17:44] well, not sure if it brings more security and if the hassle is worth it
[14:17:57] ok, I have no thoughts either :)
[14:18:54] I added "It's ok to use the same key as prod, but you can use a dedicated key if you prefer." to the task
[14:18:59] XioNoX: I tried with my prod one right now, but ofc if you want a dedicated one I can just create it
[14:19:01] netops, Infrastructure-Foundations, Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (ayounsi)
[14:19:02] no hassle
[14:19:09] what's your preference?
[14:19:42] volans: up to the user
[14:20:43] k
[14:21:18] ok, I am ready to roll out but given that it takes time, maybe volans can go first since he was here first
[14:21:27] or if not, happy to roll out first too
[14:21:49] sukhe: I'll push them all at the same time :)
[14:22:11] oh ok sure :)
[14:22:12] thanks then!
[14:24:53] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (herron)
[14:26:42] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ayounsi) Open→Resolved a:ayounsi Upgrade went very well. Thanks everybody! That was the last one!
[14:26:52] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[14:26:59] sukhe: can you try to ssh to asw-a-codfw?
[14:29:56] XioNoX: trying
[14:30:18] I am in!
[14:30:23] nice!
[14:44:45] oh mm
[14:45:00] XioNoX: I didn't realize that another homer run was in progress
[14:45:08] I was adding a new DNS host, I will abort
[14:45:43] sukhe: go for it, it will handle conflict fine if any
[14:45:53] ok, just wanted to make sure :)
[14:46:00] especially if you do it on a few routers
[14:46:17] oh yeah just cr*-codfw*
[14:48:27] sukhe: cool, go for it
[14:48:37] all done
[14:48:41] cool
[14:48:54] was just volan.s and my key and the DNS change
[14:59:49] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bul...
[16:30:27] netbox, DC-Ops, Infrastructure-Foundations: Netbox device's platform field inconsistency - https://phabricator.wikimedia.org/T336623 (wiki_willy) Agreed, I don't think there's any need to continue using "platform" in Netbox, especially since more than half the devices don't have it currently filled o...
[20:16:43] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (taavi)
[21:07:41] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans) Result of the testing with Cathal. I first want to thank @cmooney for all the help with JunOS-magics, that was pre...
[22:01:22] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (cmooney) Awesome work getting it working @volans big thanks to you too :) >>! In T336485#8857232, @Volans wrote: > HTTP i...
[22:08:35] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (cmooney)
[23:57:53] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (ssingh)