[00:20:46] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:16:25] jhathaway: FYI it seems that it fails randomly every now and then and journalctl has only one line with "ERROR:/usr/local/bin/vrts_aliases:Connection unexpectedly closed"
[09:20:44] FYI, aux-k8s-etcd1002 will briefly go down for a reboot of an underlying ganeti node
[09:50:46] FIRING: [22x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:48:45] RESOLVED: [22x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:30:56] ^ these were caused by the reboots of the PKI servers, all the clients access the service via the discovery record which gets failed over by the cookbook, so there was no real impact
[12:01:03] k
[13:17:54] volans: thanks, moritzm added a patch to give some debug info
[13:28:22] I have a general question about debmonitor
[13:28:28] I am checking https://debmonitor.wikimedia.org/packages/libc6
[13:29:12] if I copy a version in the top panel "Debian versions" and paste it in "Filter", most of the times I don't get what I need
[13:29:31] from the code IIUC we match only the package name
[13:29:31] host_packages = HostPackage.objects.filter(package__name=name)
[13:29:35] image_packages = ImagePackage.objects.filter(package__name=name)
[13:29:38] is it the case?
[13:29:58] intuitively I'd love to also filter for version
[13:30:27] and moreover, are the installed (blue)/ pending (yellow) values cached or accurate?
[13:30:51] so, from the bottom to the top
[13:30:53] For one version "2.36-9+deb12u3", I see 4 reported to be installed
[13:31:03] but if I review I don't find anything
[13:31:09] or better, it is not easy to filter
[13:31:11] values are queried in the DB there is no cache for them
[13:31:32] the filter is done in JS client side
[13:31:50] ahh right ok it doesn't call anything in view.py
[13:32:00] yeah it was too quick, should've thought about it
[13:32:07] so you have to check the JS code, not the python one, and IIRC (but has been a long time since I wrote that code) there was a bit of an issue with filtering and the way we show the data
[13:32:25] basically what you see as OS: Debian VErsion: 2.238-10 and then a list
[13:32:33] is actually a table with
[13:32:47] Debian 2.28-10 docker....
[13:32:50] Debian 2.28-10 docker....
[13:33:24] the values are repeated in each line, is then the JS library datatables that does the grouping
[13:33:52] my brain was probably trying to keep me away from js code
[13:34:10] what are you looking for, the ones that have a specific version or the ones upgradable to a specific version?
[13:34:59] the former
[13:35:04] more specifically, docker images
[13:35:42] mmmh interesting, it seems that the + in the version is the problem
[13:35:53] the original problem is that it filters also in the upgradable version
[13:36:11] so filtering for "foo" will show you rows with version foo but also upgradable to foo
[13:36:12] where is the js file with the logic?
[13:36:15] and often is not what you want
[13:36:24] FYI, aux-k8s-etcd1003 will briefly go down for a reboot of an underlying ganeti node
[13:37:42] elukey: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/debmonitor/+/refs/heads/master/debmonitor/templates/base_table.html
[13:38:07] *but*...
[13:38:20] if you put your hands in there, I think we should update the various CSS/JS libraries :D
[13:40:17] for a simple fitler
[13:40:20] elukey: anyway, if you want the images with a specific version
[13:40:23] *filter fix? :D
[13:40:36] just filter by "docker"
[13:40:48] and then scroll until you get the group with your version
[13:41:52] page 23 for me
[13:42:38] elukey: https://etherpad.wikimedia.org/p/volans-tmp3
[13:42:53] yep yep saw it
[13:42:59] thanks, will open a task to improve this..
[13:43:03] not pretty nor practical :D
[13:43:22] but filtering by 2.36-9
[13:43:28] works, so I guess is the +
[13:44:52] * elukey nods
[13:47:21] I also don't find consistency with the services reported with that version, maybe it is docker-report
[13:50:46] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:28:24] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9880091 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=adbdaf29-9da2-42ea-b64e-fc6d141eaf9e) se...
[14:52:21] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9880192 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=22e81c7a-3dde-4cd2-9376-bd003c744dc6) se...
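Note on the "+" filter behaviour discussed between 13:35:42 and 13:43:28: the following is a minimal Python sketch of one possible cause, assuming the filter box ends up treating its input as a regular expression. The log does not confirm that this is what happens, and the real filter runs in JavaScript in the browser. The function names and the sample row (including the image name) are hypothetical and only illustrate why "2.36-9" could match while "2.36-9+deb12u3" does not.

```python
import re


def regex_filter(term: str, cell: str) -> bool:
    """Treat the filter text as a regular expression (assumed behaviour)."""
    return re.search(term, cell) is not None


def literal_filter(term: str, cell: str) -> bool:
    """Escape regex metacharacters so '+' and '.' are matched literally."""
    return re.search(re.escape(term), cell) is not None


# Hypothetical table row; the image name is made up for illustration.
row = "Debian 2.36-9+deb12u3 docker-registry.example/some-image"
version = "2.36-9+deb12u3"

# As a regex, "9+" means "one or more 9", so the literal "+" in the row
# is never consumed and the full version fails to match.
print(regex_filter(version, row))
# The shorter prefix contains no quantifier, so it still matches.
print(regex_filter("2.36-9", row))
# Escaping the term makes the full version match again.
print(literal_filter(version, row))
```

If the assumption holds, escaping the user input before it reaches the search call would be the equivalent fix on the JavaScript side; if it does not hold, the cause has to be looked for elsewhere in base_table.html.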
[14:56:45] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9880204 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d67744a2-77a0-40dc-aff6-4af804b0b5ce) se...
[15:45:06] topranks: by any chance do you know why this is a VIP? https://netbox.wikimedia.org/ipam/ip-addresses/16114/
[15:46:06] volans: I don't specifically, but I think Arzhel was using that to test routed ganeti and BGP stuff
[15:46:29] so perhaps just experiments or some edge-case he came up against in testing
[15:49:53] o/ I have a complaint! at https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Move_existing_server_between_rows/racks,_changing_IPs in the update netbox step, it says to delete all interfaces _except mgmt_ and then to run the provision script, but the provision script won't run if the mgmt interface is there
[15:50:36] I would ping v.olans except that your message has pinged him three different ways
[15:57:02] lol
[15:57:17] XD
[15:58:52] that bits have been changed recently and not by me :D so let's see if there is some incongruency :D
[16:02:28] kamila_: which device?
[16:02:49] I was doing it on wikikube-ctrl1002 but I've since moved on, I'll have another sacrifice ready once done with this one
[16:04:11] volans: should I ping you and wait once I get to that step?
[16:04:51] do you have handy the error you got?
[16:06:54] also, did the mgmt IP change? because if it did we have other problems to fix
[16:07:18] this time it didn't, I got lucky (?)
[16:07:24] last time doing the same thing it did
[16:08:32] and I hope it was reverted manually then
[16:08:32] I don't have the error at hand, but I _think_ it was something like `interfaces already defined`
[16:08:49] wait, a log should exist, right?
lemme see if I can find it
[16:09:01] I can check them no worries, I've the hostname now :D
[16:09:30] ok
[16:12:00] kamila_: wikikube-ctrl1002 is the old or new name?
[16:15:49] volans: it's the old and new name
[16:15:56] the name didn't change, just a new NIC and new location
[16:16:10] and from where did you run the cookbook?
[16:16:32] I don't see a run of the provision
[16:17:04] wait a sec, you mean the provision script in netbox, not the cookbook
[16:17:07] sorry long day
[16:17:53] kamila_: was this one? https://netbox.wikimedia.org/extras/scripts/results/5914067/
[16:18:39] volans: no, an earlier one
[16:18:46] with port 12 and cable id 4883
[16:18:58] (because $reasons '^^)
[16:19:00] the case of existing mgmt is handled
[16:19:15] as long as it has a dns name
[16:19:40] I am pretty sure an earlier run with mgmt being the only interface errored out
[16:19:41] see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox-extras/+/refs/heads/master/customscripts/provision_server.py#241
[16:19:50] I'll find the log in a sec
[16:19:57] it needs to have an IP and also a DNS name
[16:22:34] netops, Infrastructure-Foundations, SRE: Sub-optimal cloud routing for WMCS in eqiad when link fails - https://phabricator.wikimedia.org/T367203 (cmooney) NEW p:Triage→Low
[16:23:24] volans: https://netbox.wikimedia.org/extras/scripts/results/5914012/
[16:24:11] homer, SRE-tools, Infrastructure-Foundations: Homer: optimize API calls to Netbox - https://phabricator.wikimedia.org/T271864#9880763 (elukey)
[16:24:48] kamila_: I understand but we'd need to know if that interface had an IP and a DNS name attached to it, I think it didn't have the dns name
[16:25:02] it's possible it didn't
[16:25:10] because the decom cookbook does remove them all now, and it was not doing that in the past
[16:25:17] (this change is from months ago though)
[16:25:24] but I did use the --keep-dns-name parameter
[16:25:47] the decom cookbook was run earlier
today from cumin1002
[16:26:05] *--keep-mgmt-dns I mean
[16:26:42] 2024-06-11 14:30:24,568 kamila 1935975 [INFO] Skipping removal of DNS names on interface mgmt
[16:31:46] weird...
[16:38:55] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9880841 (cmooney)
[16:39:12] homer, SRE-tools, Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#9880854 (elukey) I had a chat with Riccardo about a possible first change that could help one of the use cases mentioned (a sort of version-0 of the final solution) could be si...
[16:39:17] kamila_: if you have your next patient in the next 10m I can have a look but after that I'll have to step out
[16:39:52] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9880864 (cmooney)
[16:41:05] volans: that's not happening, I haven't even decommissioned it yet... but I can leave it in that state so you can look at it tomorrow
[16:41:47] ack, thx
[16:41:51] lmk the details
[16:41:52] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9880893 (cmooney)
[17:06:26] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9881020 (VRiley-WMF) Swapped 40Base-LR4 in port et-0/0/53.
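Note on the provision precondition stated between 16:19:00 and 16:19:57 (an existing mgmt interface is accepted as long as it has an IP and a DNS name): the following is a rough Python sketch of that rule only. It does not reproduce provision_server.py; the `Interface` dataclass, the `can_provision` function, the messages, and the example address and hostname are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Interface:
    """Hypothetical, simplified stand-in for a Netbox interface record."""
    name: str
    mgmt_only: bool = False
    ip_address: Optional[str] = None
    dns_name: Optional[str] = None


def can_provision(interfaces: List[Interface]) -> Tuple[bool, str]:
    """Return (allowed, reason) following the rule described in the chat."""
    # Any leftover non-mgmt interface blocks provisioning.
    leftovers = [i.name for i in interfaces if not i.mgmt_only]
    if leftovers:
        return False, "non-mgmt interfaces still present: " + ", ".join(leftovers)
    # A remaining mgmt interface must carry both an IP and a DNS name.
    for iface in interfaces:
        if not iface.ip_address:
            return False, f"{iface.name} has no IP address"
        if not iface.dns_name:
            return False, f"{iface.name} has an IP but no DNS name"
    return True, "ok"


# Example values are placeholders (documentation IP range, example.org).
complete = Interface("mgmt", True, "192.0.2.10", "host1001.mgmt.example.org")
missing_dns = Interface("mgmt", True, "192.0.2.10", None)
print(can_provision([complete]))
print(can_provision([missing_dns]))
```

Under this reading, a mgmt interface whose DNS name was removed earlier would be rejected even though it is the only interface left, which is the scenario being investigated in the conversation.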
[17:36:02] Mail, collaboration-services, Infrastructure-Foundations, SRE: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9881242 (Dwisehaupt)
[17:36:18] Mail, fundraising-tech-ops, Infrastructure-Foundations: Update fundraising mail settings to use new production mx hosts - https://phabricator.wikimedia.org/T366740#9881238 (Dwisehaupt) Open→Resolved The changes have been pushed to eqiad hosts and log entries look good. Closing.
[18:30:13] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9881669 (cmooney) p:Medium→Low Thanks for the help with this @VRiley-WMF. The link has now b...
[18:46:17] SRE-tools, Infrastructure-Foundations, Spicerack: Enhancement: view status of all running cookbooks on demand - https://phabricator.wikimedia.org/T367210#9881758 (Volans) p:Triage→Medium Locally I have a 90% done draft of a list locks cookbook I started a while ago that will show all the exis...
[19:20:21] SRE-tools, Spicerack: Tab completion for cookbook names - https://phabricator.wikimedia.org/T367230 (taavi) NEW
[19:20:24] SRE-tools, Spicerack: Tab completion for cookbook names - https://phabricator.wikimedia.org/T367231 (taavi) NEW
[19:20:53] SRE-tools, Infrastructure-Foundations, Spicerack: Tab completion for cookbook names - https://phabricator.wikimedia.org/T367231#9881913 (taavi) →Duplicate dup:T367230
[19:20:54] SRE-tools, Infrastructure-Foundations, Spicerack: Tab completion for cookbook names - https://phabricator.wikimedia.org/T367230#9881916 (taavi)
[20:51:22] SRE-tools, Infrastructure-Foundations, Spicerack: Tab completion for cookbook names - https://phabricator.wikimedia.org/T367230#9882207 (Volans) @taavi technically it already can, taking advantage of filesystem autocompletion ;).
As specified in the `cookbook -h` help message, the cookbook "name" can...
[21:05:59] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9882267 (VRiley-WMF) You're welcome @cmooney We do have spares if they are needed in the future. Clos...
[21:06:10] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9882268 (VRiley-WMF) Open→Resolved