[02:14:17] (SystemdUnitFailed) firing: (4) krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:34:17] (SystemdUnitFailed) firing: (4) krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:37:49] (SystemdUnitFailed) firing: (4) krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:16:14] puppet-compiler, Infrastructure-Foundations, SRE: PCC failing for an LVS host (false negative) even after manually updating facts - https://phabricator.wikimedia.org/T334680 (Volans) >>! In T334680#8791310, @Dzahn wrote: > But since the compilers are running in cloud VPS and there it's neither of the...
[05:23:27] jbond: FYI ^^^ for when you're online, also there are unrelated failed systemd units on pcc-db1001
[05:31:16] netbox, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (Volans) >>! In T296832#8791457, @cmooney wrote: > In terms of next steps we obviously need to keep things consistent....
[07:25:53] netops, Infrastructure-Foundations, SRE: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (ayounsi) Slightly relevant - https://wikitech.wikimedia.org/wiki/Juniper_TLS_certificate_install
[08:07:48] (SystemdUnitFailed) firing: (4) krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:09:17] (SystemdUnitFailed) firing: (4) krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:42:43] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (cmooney)
[09:59:32] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:04:59] netops, Infrastructure-Foundations, observability: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (ayounsi)
[10:17:12] puppet-compiler, Infrastructure-Foundations, SRE: PCC failing for an LVS host (false negative) even after manually updating facts - https://phabricator.wikimedia.org/T334680 (jbond) Open→In progress p:Triage→Medium a:jbond
[10:17:48] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:33:10] puppet-compiler, Infrastructure-Foundations, SRE: PCC failing for an LVS host (false negative) even after manually updating facts - https://phabricator.wikimedia.org/T334680 (jbond) Thanks for the debugging, the issue was because the facts were not updating, which happened because there was/is an i...
[10:34:16] puppet-compiler, Infrastructure-Foundations, SRE: PCC failing for an LVS host (false negative) even after manually updating facts - https://phabricator.wikimedia.org/T334680 (jbond) In progress→Resolved going to tentatively close this but please reopen if you still see the issue
[10:44:17] (SystemdUnitFailed) firing: (4) krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:04:42] puppet-compiler, Infrastructure-Foundations, SRE: PCC failing for an LVS host (false negative) even after manually updating facts - https://phabricator.wikimedia.org/T334680 (ssingh) Thanks to everyone who worked on debugging/resolving this! I will try it again for the reimages in eqiad to see how it...
[11:07:16] netops, Infrastructure-Foundations, SRE: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (ayounsi) {P47077}
[11:08:11] netops, Infrastructure-Foundations: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (ayounsi)
[11:08:40] netops, Infrastructure-Foundations: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (ayounsi)
[11:08:47] netops, Infrastructure-Foundations, SRE, observability, Patch-For-Review: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (ayounsi)
[11:09:23] netops, Infrastructure-Foundations: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (ayounsi)
[11:09:29] netops, Infrastructure-Foundations, SRE: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (ayounsi)
[11:09:40] netops, Infrastructure-Foundations, SRE: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (ayounsi)
[11:09:48] netops, Infrastructure-Foundations, SRE: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (ayounsi)
[11:10:19] netops, Infrastructure-Foundations: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (ayounsi)
[14:42:07] topranks: hi!
[14:42:12] just reimaged lvs1019 https://netbox.wikimedia.org/dcim/devices/3653/interfaces/
[14:42:38] so I guess in this case I should delete enp94s0f0np0 and then rename ens2f0np0 to enp94s0f0np0? (which is what the new name for the interface is)
[14:44:17] (SystemdUnitFailed) firing: (4) krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:47:49] (SystemdUnitFailed) firing: (4) krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:58:02] netops, Data-Engineering, Data-Persistence, Discovery-Search, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ayounsi)
[14:58:16] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[14:58:50] netops, Data-Engineering, Data-Persistence, Discovery-Search, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ayounsi)
[15:00:01] netops, DBA, Data-Engineering, Discovery-Search, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Marostegui)
[15:09:07] netops, Infrastructure-Foundations, SRE, ops-codfw: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (Papaul) p:Triage→Medium
[15:30:17] netops, DBA, Data-Engineering, Discovery-Search, and 8 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[16:32:48] (SystemdUnitFailed) firing: (5) krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:48:50] idrac redfish task not finishing
[16:49:07] * jbond ignore that
[16:54:06] sukhe: hey sorry for the late response somehow missed you
[16:54:31] let me take a look at the host, in general shouldn't be too hard, but the eqiad LVS are a little more complex with the cross-cage link
[16:54:42] topranks: thanks and np!
[16:54:52] lvs1018,19 done
[16:54:57] so you can look at both if that's fine
[16:55:03] or just tell me and I am happy to update
[16:59:19] ok great
[17:07:15] Alright I did lvs1019 there, it's not tricky as I'm quite familiar with the setup & netbox, but might be easier if I do the other one too
[17:07:18] Steps were:
[17:07:24] https://www.irccloud.com/pastebin/Zt2ywf7m/
[17:07:40] sukhe: did you make a note of any cable labels before the re-image?
[17:07:59] topranks: for 1019 yes
[17:08:07] I did forget for 1018 it seems :D
[17:08:09] btw it's probably not too important, we realise from this incident that many cable labels have been deleted through bullseye upgrades
[17:08:20] ah cool - well at least we've the 1019 ones
[17:08:35] sharing
[17:08:46] https://share.riseup.net/#YVb9dmoPEVZVESwxLj8-PA
[17:09:05] netops, DBA, Data-Engineering, Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (colewhite)
[17:09:34] hey nice, riseup is cool :)
[17:09:43] netops, DBA, Data-Engineering, Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (colewhite)
[17:09:54] yeah!
[17:10:02] let me know if you need an invite :P
[17:10:36] (the share tool is free)
[17:11:44] cool thanks! yeah I used it for something a few years back, it's escaping me what, but seemed like a cool initiative
[17:15:27] so, back to lvs1019
[17:15:36] there is a further complication looking at your saved screenshot
[17:15:36] yeah...
[17:15:42] oh?
[17:15:58] in the screenshot it shows lvs1019 connected to asw2-d7 7/0/23
[17:16:34] but the updated info (which is from lldp, and thus correct), shows a connected to asw2-d2 xe-2/0/4
[17:17:30] it's like the b-end of that link was moved from the switch in D7 to D2 at some stage ¯\_(ツ)_/¯
[17:18:05] and then netbox never updated ?
[17:18:27] anyway it's no worry, chances are the cable label is the same I'll set it to what it was
[17:18:49] sorry in a meeting
[17:18:49] but reading
[17:26:38] sukhe: no worries, I did lvs1019 there
[17:27:42] made some assumptions the cables that go to each row are still labelled the same, even though where on that row they went has changed since the original Netbox import that populated the data in your screenshot happened.
[17:32:27] thank you!
[17:32:45] you can do lvs1018 next if that's fine or combine lvs1018 and 17 later (17 reimaging in progress)
[17:33:02] happy to do it myself as well
[17:33:51] netops, Infrastructure-Foundations, SRE, observability, Patch-For-Review: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (colewhite) Getting Prometheus to scrape a new metrics endpoint is pretty straightforward. When the exporter is up and running and firewall r...
[17:36:58] sukhe: lvs1018 is done now too
[17:37:10] <3
[17:37:25] much appreciated, I would have been completely lost
[17:37:30] being honest there is nothing super-complex here, but it's probably easiest on both of us if I just do it
[17:37:38] I had rather have that
[17:37:44] but I also want to be respectful of your time :)
[17:37:59] yeah no worries, and this is bespoke enough that it's not gonna eat up much time on a regular basis
[17:40:13] just ping me when 1017 is done I'll have a look
[17:40:20] thank you!
[17:40:31] will you be around in an hour or so?
[17:40:36] don't want to ping you since it's late for you
[17:41:26] you can of course do it later, thankfully
[17:41:34] as in nothing breaks on Netbox being outdated
[17:44:49] sukhe: yeah I'll still be online so feel free to ping me
[18:27:36] topranks: hello :) lvs1017 is done
[18:27:40] pre-reimage: https://share.riseup.net/#wWO3cB0bVeDygROzpZUPGA
[18:27:56] great stuff will have a look now in a few :)
[18:28:07] np :)
[18:51:09] I am stepping out for some fresh air but will be back soon. thanks again
[19:37:44] sukhe: lvs1017 updated now
[19:44:40] topranks: many thanks!
[19:44:48] this completes this round
[20:34:17] (SystemdUnitFailed) firing: (5) krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:32:49] (SystemdUnitFailed) firing: (5) krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:34:17] (SystemdUnitFailed) firing: (5) httpbb_hourly_appserver.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed