[03:13:55] CAS-SSO, Infrastructure-Foundations, Phabricator: Phabricator profile "LDAP User" link goes to Wikitech, but users no longer need to have a Wikitech account - https://phabricator.wikimedia.org/T366766 (matmarex) NEW
[03:43:44] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:40:45] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:22:53] XioNoX: Regarding the Netbox 4 upgrade, all (most?) of our reports will break and require modification. The Report API/classes have been removed and we need to rewrite them to use Script. At least that's my understanding. I think it's fine to attempt an upgrade of Netbox-Next and then attempt to fix the reports there. Just in case you hadn't seen the changes to the reporting
[07:23:46] https://netboxlabs.com/docs/netbox/en/stable/customization/reports/ <- Okay, deprecated, though I had little luck getting them running on a test installation
[07:29:02] slyngs: yeah I saw, I didn't know they kept backward compatibility though, that's useful
[07:30:05] slyngs: the release notes also say "The legacy reports functionality has been dropped. Reports will be automatically converted to custom scripts on upgrade."
[07:30:39] I put the next steps on https://phabricator.wikimedia.org/T336275#9863897 let me know what you think
[07:31:05] but the idea is to have a dedicated netbox 4 test server, with its own deployment path and branch
[07:31:20] and leave prod/next alone for now
[07:32:23] Nice that it can convert automatically, I didn't see that. Does that mean that we need to install a 3.X first, load the reports and then upgrade? I suppose it does
[07:33:18] no idea :) slyngs from the doc you linked, the changes are quite easy to do too
[07:33:42] most likely will convert on the fly, not the code on disk
[07:33:58] like module-based cookbooks are actually class-based cookbooks inside spicerack :D
[07:33:59] I think they oversimplified it. The Script things need a "run" method
[07:47:55] CAS-SSO, Infrastructure-Foundations, Phabricator: Phabricator profile "LDAP User" link goes to Wikitech, but users no longer need to have a Wikitech account - https://phabricator.wikimedia.org/T366766#9866375 (taavi) Note this is not a new thing with idm - the Toolforge admin console has been creatin...
[07:48:44] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:15:00] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9866440 (ayounsi)
[08:21:47] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9866456 (ayounsi)
[08:43:44] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:12:03] slyngs: hi, T366779 seems to be from something you're working on, can you make sure that gets fixed?
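Note on the 07:22-07:33 Netbox thread: the point that a Script needs a "run" method can be pictured with a minimal sketch. Everything here is an assumption for illustration: the small `Script` class is a stand-in written so the example runs without Netbox installed (inside Netbox the base class would be imported from `extras.scripts`), and `DeviceNamingCheck`, its `test_lowercase_names` method and the `names` input are hypothetical, not one of the existing reports.

```python
"""Sketch: a report-style check wrapped in a Script-style class with run().

ASSUMPTION: the `Script` class below is a stand-in for this example only,
not Netbox's implementation. The check itself is hypothetical.
"""


class Script:
    """Stand-in base class (hypothetical): it only collects log entries."""

    def __init__(self):
        self.messages = []

    def log_success(self, message):
        self.messages.append(("success", message))

    def log_failure(self, message):
        self.messages.append(("failure", message))


class DeviceNamingCheck(Script):
    """Hypothetical check; names are passed in rather than queried."""

    def test_lowercase_names(self, names):
        # A report-style class would have exposed this test_* method directly.
        for name in names:
            if name == name.lower():
                self.log_success(f"{name}: ok")
            else:
                self.log_failure(f"{name}: name is not lowercase")

    def run(self, data, commit):
        # Scripts are entered through run(), so the old test_* methods
        # are called from here.
        self.test_lowercase_names(data["names"])
        failures = sum(1 for level, _ in self.messages if level == "failure")
        return f"{failures} failure(s)"


if __name__ == "__main__":
    check = DeviceNamingCheck()
    print(check.run({"names": ["netbox1002", "MX2001"]}, commit=False))
```

Whether the automatic conversion mentioned in the release notes makes a hand-written `run` unnecessary is exactly the open question in the thread; the sketch only shows the shape of a manual rewrite.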
[09:12:04] T366779: PuppetDisabled Puppet disabled on cloudidm2001-dev:9100 - https://phabricator.wikimedia.org/T366779
[09:13:03] taavi: Yes, though I'm considering moving the installation to another host, so as not to deal with the firewalling and network
[09:15:12] taavi: Do you see any issue with maybe moving the cloud IDM installation to cloudweb2002-dev, so that it's in the correct network?
[09:58:44] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:00:19] slyngs: I'm not sure what the issue with the current host is, but AFAIK that should be fine as long as it doesn't cause issues with mediawiki or the other services running there. but I would check with Andrew first
[10:02:37] taavi: It's mostly that we'd need to poke at the firewall. The existing host is on a network that cannot reach the dev-LDAP server without some extra firewall openings. That could be fine, but given that it's all for dev/testing it's nicer to just keep it all within the cloud network.
[10:03:50] But I'll check with Andrew :-)
[10:08:10] the cloudwebs are currently in the public vlan so they're not in the cloud-hosts network either, so either they already have firewall rules that we could simply copy to cloudtestidms, or we would need to poke the firewall anyway
[10:08:22] i guess we don't have an easy way to do ganeti vms in cloud-hosts?
[10:10:33] It can be done, the question is if we should.
[10:55:45] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:37:38] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867080 (cmooney) Detailed steps are in P64182
[11:55:56] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867102 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=54328f3a-52e5-42cd-bdf1-26ee5617a4d5) set by cmooney@cumin1002 for 0:40:00 on 1 host(s) and their...
[12:15:09] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867139 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=512f5f90-4832-4c61-b0eb-75b61fcd6f8c) set by cmooney@cumin1002 for 1:30:00 on 18 host(s) and thei...
[12:25:39] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867154 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=76763bfc-4091-4d8a-b3f8-e84d96a9bd49) set by cmooney@cumin1002 for 0:40:00 on 1 host(s) and their...
[12:35:29] volans: can we just merge these when we're okay with it? https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/967166, https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/967165
[12:42:15] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9867210 (MatthewVernon) @Eevans are you OK to do this, please? Should just be a case of checking `swift-dispersion...
[12:43:42] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9867212 (MatthewVernon) @Eevans would you be OK to handle this as well, please? It's a bit more involved as you'll...
[12:45:51] jayme: hey, it was in my todo list to do another round of pings for those locks CRs still open
[12:45:59] sure if you're happy with the parameters
[12:46:24] there were open comments IIRC
[12:47:04] they seemed more like acknowledgements to me :)
[12:47:11] :)
[12:47:22] then go ahead, might need rebase
[12:47:41] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9867218 (MatthewVernon) @Eevans you OK to handle this, please? Should just be a quick cluster health check afterwa...
[13:16:16] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867277 (cmooney) The first phase of this is complete, ssw1-e1-eqiad has been upgraded. I am going to pause before completing ssw1-f1-eqiad as some of the output is stran...
[13:29:22] SRE-tools, Infrastructure-Foundations: Add option to exclude nodes from reboot by uptime or last reboot date - https://phabricator.wikimedia.org/T366797 (Clement_Goubert) NEW
[13:29:49] claime: I added some more info in the doc - https://wikitech.wikimedia.org/w/index.php?title=Server_Lifecycle&diff=2188910&oldid=2183069 not sure at which point it's worth automating it all, but at least it's there
[13:31:26] XioNoX: Thanks, it's good that it's at least documented, I don't know if it's worth automating either
[13:31:45] Especially since we don't really know the cutoff for "can't upgrade automatically"
[13:32:35] I'm trying to figure out how to get the idrac version from the OS
[13:49:19] it's exposed as a puppet fact
[13:49:44] XioNoX: ^
[13:50:28] ah? I'm looking at the facts, but couldn't find it
[13:51:08] firmware and firmware_idrac
[13:51:24] https://puppetboard.wikimedia.org/node/cumin2002.codfw.wmnet
[13:54:10] ah right, it didn't show on the `facter` output when sshing to the host
[13:55:35] facter -p
[13:55:41] always
[13:56:53] hmmmm
[13:56:57] so yeah it should be easy to automate, but also we don't know for sure when old is too old
[13:57:29] hmmm
[13:57:31] for what?
[13:57:34] we do
[13:57:41] what are you trying to do?
[13:58:37] volans: for the rename cookbook https://wikitech.wikimedia.org/w/index.php?title=Server_Lifecycle&diff=2188930&oldid=2188910
[13:59:10] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/hardware/upgrade-firmware.py#896
[13:59:14] why reinvent the wheel?
[14:00:11] see also T328593
[14:00:11] T328593: redfish: minimum version support - https://phabricator.wikimedia.org/T328593
[14:01:45] volans: what do you mean?
[14:01:58] just do the same check
[14:02:13] ah yeah
[14:15:52] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867516 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2e3e9f53-54b4-4b8d-b9d6-ab280392b41c) set by cmooney@cumin1002 for 2:00:00 on 3 host(s) and their...
[14:30:54] SRE-tools, Infrastructure-Foundations: Add option to exclude nodes from reboot by uptime or last reboot date - https://phabricator.wikimedia.org/T366797#9867585 (elukey)
[14:31:24] volans: <3 https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1039732 (cc claime)
[14:31:47] volans: <3 https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1039732 (cc claime)
[14:59:57] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867739 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e84998aa-eea9-43ce-9047-23b408d134b5) set by cmooney@cumin1002 for 1:30:00 on 15 host(s) and thei...
[15:04:43] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867757 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8ea52962-5718-4917-aeee-12b979b25d42) set by cmooney@cumin1002 for 1:30:00 on 1 host(s) and their...
[16:02:00] claime, volans, success https://www.irccloud.com/pastebin/md7rygSx/
[16:02:23] perfect
[16:03:22] Nice <3
[16:09:47] * elukey just discovered the 1000 lines of code of the upgrade-firmware.py cookbook
[17:33:27] ahahahah
[17:33:30] good luck
[18:11:32] netops, Infrastructure-Foundations, SRE: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9868580 (ssingh) Moving the links working out well (which I think this is the first time?) is a big take away from this task; glad to hear it went nicely!
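Note on the 13:32-14:02 iDRAC thread: the "same check" is a minimum-version comparison against the firmware version, which the log says is exposed as the `firmware_idrac` fact (visible with `facter -p`). A minimal sketch of that kind of comparison follows. It is not the code from upgrade-firmware.py: `MINIMUM_IDRAC_VERSION` is a made-up placeholder rather than the real cutoff, and `parse_version` and `is_too_old` are hypothetical helpers that assume plain dotted-integer versions.

```python
"""Sketch: decide whether a firmware version is below a minimum.

ASSUMPTION: MINIMUM_IDRAC_VERSION is a placeholder for illustration, not
the threshold used by any cookbook, and versions are dotted integers.
"""
from typing import Tuple

MINIMUM_IDRAC_VERSION = "4.0.0.0"  # hypothetical threshold


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string into a tuple of integers."""
    return tuple(int(part) for part in version.strip().split("."))


def is_too_old(current: str, minimum: str = MINIMUM_IDRAC_VERSION) -> bool:
    """Return True if `current` sorts strictly below `minimum`."""
    cur = parse_version(current)
    low = parse_version(minimum)
    # Pad the shorter tuple with zeros so "4.0" and "4.0.0.0" compare equal.
    width = max(len(cur), len(low))
    cur += (0,) * (width - len(cur))
    low += (0,) * (width - len(low))
    return cur < low


if __name__ == "__main__":
    for version in ("3.30.30.30", "4.40.00.00"):
        print(version, "too old" if is_too_old(version) else "ok")
```

Comparing integer tuples rather than raw strings avoids ordering "10.x" before "9.x".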
[18:17:16] Packaging, Thumbor, Wikimedia-SVG-rendering, User-notice: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549#9868598 (Pppery) Suggested wording for tech news: The software used to create previews of SVG files has been updated to a new version, fixing many longstanding bug...
[19:50:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:51:53] Packaging, Thumbor, Wikimedia-SVG-rendering, User-notice: Update librsvg to version > 2.44.10 (2.50.3) - https://phabricator.wikimedia.org/T265549#9869228 (Aklapper)
[21:13:14] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869306 (RobH) p:Medium→High @Jclark-ctr or @VRiley-WMF: Would one of you be able to take care of this on your next on-site visit? We have light on the drm...
[21:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:32:58] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869358 (wiki_willy) a:Jclark-ctr
[21:34:26] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869359 (wiki_willy) Valerie is on vacation, so assigning to John
[21:44:37] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869378 (RobH)
[21:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:30:46] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869444 (Jclark-ctr) Installed cross connect link came up on port. cableid #5229
[22:33:00] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869454 (RobH)
[22:33:56] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869455 (RobH) Open→Resolved Looks good to me on this end, thank you!
[23:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:48:49] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed