[00:03:33] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:33] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:03:33] (SystemdUnitFailed) firing: (2) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:18:33] (SystemdUnitFailed) firing: (3) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:18:33] (SystemdUnitFailed) firing: (3) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:08:33] (SystemdUnitFailed) firing: (4) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:17:42] moritzm: FYI the above errors seem to be caused by being unable to connect to puppetdb1002 on 443
[07:23:33] (SystemdUnitFailed) firing: (4) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:36:25] hello, any objection that I do a re-image of one of the eqias sretest?
[07:36:28] eqiad*
[07:38:02] ok for me
[07:42:55] thx, in progress
[07:49:56] speaking of reimages, I've kicked off a reimage of titan1001 (new hw generation) and it either didn't get any PXE action or the "use pxe at the next reboot" ipmitool command didn't work, does any of this ring a bell? i.e. the host didn't come back in d-i and simply rebooted
[07:52:08] I guess it is management console time
[07:52:37] godog: did you get a specific error or the host just rebooted?
[07:53:16] in the past with older HW it could happen that the force pxe command was successful (and is also checked with another call) but then the host just rebooted normally
[07:53:20] but I haven't seen that in a while
[07:53:57] volans: no specific error I could see no
[07:54:05] retrying while looking at mgmt console now
[07:54:40] yeah mgmt console is the way to go
[07:58:08] ok pxe attempted and no answers from dhcp
[07:58:47] https://phabricator.wikimedia.org/P52492
[07:59:07] https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#DHCP_issues
[07:59:13] godog: which NIC firmware is it using?
[07:59:44] I was testing a DHCP change in eqiad, but it works fine so most likely not related
[08:00:16] unrelated, godog I'm curious about https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/957238 are we planning to drop one? :D
[08:01:09] XioNoX: not sure tbh, though the host was provisioned by dcops just fine via the reimage cookbook AFAICS, I need to change from raid1 to raid0 hence the reimage
[08:01:28] volans: I'd like to eventually! starting with removing references to the old one
[08:01:36] ack
[08:02:28] in my old age I've become insufferable to cli commands with underscores!
[08:02:55] or underscores where we could use dashes for that matter
[08:03:11] :D
[08:04:04] wikitech search doesn't help as it matches both even with quotes :(
[08:04:19] but they seem all updated already
[08:04:45] yeah I took care of wikitech a little while back
[08:09:21] ok testing again with tcpdump running on the install server, I did not see messages from dhcpd on install1004 though
[08:09:24] install1004:~$ grep 00:62:0b:c8:72:a0 -i /var/log/messages
[08:11:44] godog: can you give it another try, I rolled back my change
[08:11:57] mhh yeah now it works XioNoX
[08:11:59] maybe there is a specificity as the host is in row F
[08:12:05] haha
[08:12:07] so annoying
[08:12:13] topranks: ^
[08:12:44] heh, bad timing
[08:13:12] or good timing I guess, depending on how you want to look at it, found the problem pretty quickly
[08:13:56] rotfl
[08:14:18] sorry about the trouble
[08:14:26] so "forward-only" works if the relay is behind a configured interface, but breaks hosts when the relay is behind a non-configured interface (E/F switches)
[08:14:42] I would tell Arzhel to not break the network but I'll get as a reply that we need to switch to option 97... so I'll just shut up :-P
[08:15:20] jokes apart, would opt 97 really solve all those issues? if so we should prioritize it I guess
[08:15:23] volans: we were discussing that with topranks 5min ago, how much it would make things easier
[08:15:25] XioNoX: UGH
[08:15:44] volans: not solve all the issues, but make things much easier
[08:16:01] without creating *new* issues? :D
[08:16:07] yeah
[08:16:10] ok then
[08:16:15] it would remove complex router/switch config
[08:16:30] this somewhat makes sense - I am fairly sure in T337345 I had tried "forward-only"
[08:16:32] and thus remove some moving parts in the dhcp flow
[08:16:53] I was trying to think if I had - or why not after our chat earlier
[08:17:21] ok
[08:17:39] I should set up ipsec between the relay and the install server so the router doesn't mess with it
[08:17:39] From way back when I first looked at the "dhcp relay" vs bootp config when doing the evpn stuff I was trying to use "forward only"
[08:17:47] as it seemed like the simplest - and all that we needed
[08:18:14] Probably do it over an MVGRE tunnel with MACSEC just to be sure
[08:18:19] :)
[08:18:34] (SystemdUnitFailed) firing: (3) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:18:45] (SystemdUnitFailed) firing: (3) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:18:56] not sure if there is a way to have relays on both managed and unmanaged interfaces
[08:19:44] https://www.irccloud.com/pastebin/fXOwJqhV/
[08:19:53] maybe all-interfaces is the good keyword
[08:20:02] :)
[08:21:13] it's worth a shot
[08:21:22] godog: could you re-try the re-image?
[08:21:28] I've been going through the tasks - seems I didn't document well the testing I did back then
[08:21:46] But as I recall we hit on the solution through trial-and-error
[08:21:52] and we "trialled" nearly everything
[08:22:01] XioNoX: not the same reimage, I can try titan1002 in row D
[08:22:07] so I'd be worried that "all interfaces" will fix row F, and break row A-D
[08:22:21] topranks: yeah I was going to test sretest again
[08:22:40] option 97 is starting to look really good :P
[08:23:17] XioNoX: where are your ganeti test servers?
[08:23:27] godog: sure, that helps, but if you have something in row E/F that would be better :)
[08:23:54] XioNoX: I don't atm :(
[08:23:59] topranks: row B
[08:24:48] it occurs to me that the dhcp-relay the host is doing is only being blocked as the packet comes in on an interface that's configured for dhcp-relay
[08:24:53] XioNoX: if you are ok to wait though I can reimage titan1001 once it has finished its puppet run and I've verified sth else
[08:28:51] all-interfaces doesn't work for my ganeti relay anyway
[08:35:40] godog: no need for the re-image anymore, thx
[08:35:46] ok
[08:35:49] XioNoX: ack
[08:35:54] I was gonna say I can do that from codfw (similar to e/f)
[08:39:06] dhcp, always a pain :)
[08:40:23] SRE-tools, Infrastructure-Foundations, Spicerack: Cookbook should ask for confirmation at beginning of execution - https://phabricator.wikimedia.org/T345370 (Vgutierrez) >>! In T345370#9147408, @BCornwall wrote: > @Vgutierrez Is this something that should be addressed in the cookbook? > > Your idea...
[08:41:26] Option 97 + "forward-only" everywhere is probably what we need to target
[08:41:34] be done with all the edge-cases and nerd knobs
[08:41:41] ack
[10:43:38] TIL bookworm's d-i initrd.gz is 350MB (or twice the size of bullseye's) no wonder it takes a while for d-i to boot from e.g. codfw
[10:43:53] i.e. https://apt.wikimedia.org/tftpboot/bookworm-installer/debian-installer/amd64/
[10:52:56] can we have active active apt? :)
[10:55:17] in principle I'd say so, whether it is worth it I'm not sure
[11:46:29] netops, Infrastructure-Foundations, SRE, Traffic-Icebox, and 2 others: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (Volans) I found a bit of time to play with some of the above mentioned solutions and those are my findings. ####...
[12:10:48] volans: thanks, I'll investigate and fix it to use the newer puppetdbs
[12:15:07] netops, Ganeti, Infrastructure-Foundations, SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (cmooney) After speaking to @ayounsi I have a better idea of how we intend to use the "routed mode" ganeti. In many ways it's similar to what I propose above: * Both ha...
[12:15:17] thx
[12:23:33] (SystemdUnitFailed) firing: (3) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:31:49] fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/957283
[12:32:20] <3
[12:59:33] netops, Ganeti, Infrastructure-Foundations, SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (akosiaris)
[13:11:01] Is there a way to have Puppet "register" a systemd service not installed by Puppet. I need to notify a service to restart, but the service is installed by a deb package
[13:21:55] you can simply do:
[13:22:18] service { 'foo':
[13:22:35] and then enable/ensure/hasrestart etc)
[13:22:53] https://www.puppet.com/docs/puppet/7/types/service.html
[13:23:16] that's a Puppet builtin
[13:23:51] a more powerful internal abstraction that we have is systemd::service
[13:24:37] it has some additional settings to configure monitoring of the service and what happens in case of errors
[13:25:05] and a very useful override mechanism, where you can e.g. override one line from the upstream systemd unit shipped in the deb
[13:25:18] but for Bitu a simple service is most certainly fine
[13:25:59] especially since Bitu upstream tends to be very receptive of feature requests voiced by Wikimedia :-)
[14:23:33] (SystemdUnitFailed) firing: (3) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:24:58] moritzm: somehow is still trying to connect to puppetdb1002
[14:25:37] ah no my bad
[14:25:53] is just the alerting, ofc it runs once a day
[14:30:25] yeah, it should recover by tomorrow
[14:48:33] (SystemdUnitFailed) firing: (3) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:26:15] netops, Infrastructure-Foundations, SRE, Traffic-Icebox, and 2 others: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (cmooney) The problem getting them by ASN is that there may be "collateral damage" sometimes. i.e. If you pull th...
[15:48:33] (SystemdUnitFailed) firing: (3) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:39:04] We're happy to announce that your RIPE Atlas anchor is functioning properly and is now connected to the RIPE Atlas network. https://atlas.ripe.net/probes/7261/
[17:43:33] (SystemdUnitFailed) firing: (4) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:51:31] XioNoX: \o/
[18:53:33] (SystemdUnitFailed) firing: (4) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:28:33] (SystemdUnitFailed) firing: (4) httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:28:33] (SystemdUnitFailed) firing: (4) httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed