[03:00:36] (SystemdUnitFailed) firing: (2) envoyproxy.service Failed on debmonitor1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:54:23] (SystemdUnitFailed) firing: (3) envoyproxy.service Failed on debmonitor1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:13:40] netops, DBA, Infrastructure-Foundations, SRE: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870 (Marostegui) Once T355862 is done, es2021 needs to be switched back to be es4 slave (reverting all this T356064)
[06:18:16] netops, DBA, Infrastructure-Foundations, SRE: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870 (Marostegui)
[07:34:23] (SystemdUnitFailed) firing: (3) envoyproxy.service Failed on debmonitor1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:49:23] (SystemdUnitFailed) firing: (4) envoyproxy.service Failed on debmonitor1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:53:27] Why does that say the service is failing when it's not
[07:55:37] (SystemdUnitFailed) firing: (3) nftables.service Failed on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:59:23] (SystemdUnitFailed) firing: (3) nftables.service Failed on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:04:23] (SystemdUnitFailed) firing: (4) nftables.service Failed on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:09:23] (SystemdUnitFailed) firing: (4) nftables.service Failed on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:10:49] slyngs: /etc/nftables/input/10_ganeti_migration.nft:3:1-36: Error: Statement after terminal statement has no effect
[08:10:52] ip saddr { 10.192.21.6, 10.192.6.6 } tcp dport { 8102 } accept
[08:11:44] I was more thinking about the envoyproxy.service on debmonitor1003
[08:12:01] ah sorry :D
[08:12:04] then moritzm ^^^ :D
[08:12:45] The other one is interesting as well, but for different reasons :-)
[08:15:37] (SystemdUnitFailed) firing: (4) nftables.service Failed on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:18:21] volans: yeah, there's syntax error in a rule specific to the routed access, Arzhel is looking into it
[08:18:30] yep
[08:18:36] ah ok thx
[08:18:46] the actual error is in the vm_all_in.nft rule
[08:18:55] the error message was a bit confusing as that file seems fine
[08:19:00] and I was looking at the others
[08:20:38] can someone check that nftables is now happy on ganeti2033
[08:20:50] I think it's good (and just needed a newline)
[08:21:29] XioNoX: looks fine
[08:21:32] XioNoX: thanks
[08:21:35] :)
[08:24:23] (SystemdUnitFailed) firing: (4) nftables.service Failed on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:24:59] volans, moritzm: https://gerrit.wikimedia.org/r/c/operations/puppet/+/994663 (should be an easy +1)
[08:25:23] systemctl status nftables.service is happy
[08:26:09] and nft list ruleset seems sane
[08:26:48] XioNoX: CI is failing
[08:27:02] volans: yeah, indentation...
[08:27:07] already sent a patch
[08:27:19] XioNoX: looking
[08:27:48] looks sane to me
[08:48:22] hmm, looks like it doesn't understand the \n as new line
[08:48:35] the puppet -> conf file translation
[08:59:09] that's the fix hopefully https://gerrit.wikimedia.org/r/c/operations/puppet/+/994667
[09:00:13] mmmh
[09:00:24] if the previous didn't put a newline for \n
[09:00:39] why escaping the " should make it work?
[09:03:37] https://serverfault.com/questions/688569/transform-n-to-new-line-although-input-variable-is-single-quoted-in-puppet
[09:03:54] `If a variable is double quoted and it contains a \n then a new line is created`
[09:05:55] sorry I missed that you changed the quote
[09:06:14] my bad
[09:20:51] alright, fixed !
[09:21:29] hx
[09:21:32] thx
[09:21:45] thank you !
[09:22:01] still lots of work to do but `ssh sretest2005.codfw.wmnet` works
[09:22:24] puppet fails, not sure what the error means though
[09:23:41] and the host is in a weird state network wise, its interface is a /23 instead of a /32, but getting there
[09:25:48] (PuppetFailure) firing: Puppet has failed on ganeti2034:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:27:23] it seems it's a puppet7 host, but the cert is created on puppetmaster1001 (as a puppet5 host)
[09:27:41] so the reimage either wasn't able to autodetect puppet7 or you didn't specify 7
[09:28:54] XioNoX: ^^^
[09:29:20] noted
[09:29:31] thx
[09:34:23] (SystemdUnitFailed) firing: (2) nftables.service Failed on ganeti2034:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:35:48] (PuppetFailure) resolved: Puppet has failed on ganeti2034:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:35:49] (PuppetZeroResources) firing: Puppet has failed generate resources on testvm2006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[13:19:57] netops, Ganeti, Infrastructure-Foundations, SRE, Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (ayounsi) Current status, ignoring IPv6 for now. The cluster VIP is dynamically announced from the primary cluster node. Limitation from `isc-dhcp...
[13:34:24] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:08:15] netops, Ganeti, Infrastructure-Foundations, SRE, Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (Volans) Indeed the `dhcrelay` not working as expected is a bit annoying also because if we run a dhcrelay for each VM, we'd need to hook also at VM...
[14:15:32] moritzm: I think you have puppet disabled on netbox-dev2002 to debug the connection issue from cumin1002, do you still need it disabled?
[14:25:05] no, I'll re-enable now
[14:26:06] coincidently I've just added a summary to the related Phab task: https://phabricator.wikimedia.org/T356174#9502325
[14:27:29] netbox, Data-Engineering, Data-Persistence-Backup, Infrastructure-Foundations, and 2 others: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (jcrespo) Let me know when you have a new backup i...
[14:30:47] thx
[14:31:03] for both, I've replied in the task
[14:41:56] one thing I found is that we need proper absent handling for the ulogd class, I'll make patches for this
[14:42:31] I forced a puppet run on netbox-dev2002, can be used again
[14:58:15] thx, it ended up not being needed for me as we have backup disabled there, sorry :)
[15:00:26] ok :-)
[15:10:10] netbox, Data-Engineering, Data-Persistence-Backup, Infrastructure-Foundations, and 2 others: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (Volans) @jcrespo we do have our first backup in `...
[15:14:59] netbox, Data-Engineering, Data-Persistence-Backup, Infrastructure-Foundations, and 2 others: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (jcrespo) Let me do a manual run first and then we...
[15:41:22] netops, Ganeti, Infrastructure-Foundations, SRE, Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin2002 for hosts: `testvm2006.codfw.wmnet` - testvm2006.codfw.wmnet (**...
[16:07:53] If someone have time to have a look I can't figure out what's wrong with CI in there https://gerrit.wikimedia.org/r/c/operations/puppet/+/994223
[16:29:23] XioNoX: I can take a look, unless someone else already has
[16:30:13] thx
[16:36:35] netbox, Data-Engineering, Data-Persistence-Backup, Infrastructure-Foundations, and 2 others: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (Volans) All tests with @jcrespo for netbox were s...
[16:37:50] netbox, Data-Engineering, Data-Persistence-Backup, Infrastructure-Foundations, and 2 others: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (Volans) Open→Resolved
[16:39:04] netbox, Data-Engineering, Data-Persistence-Backup, Infrastructure-Foundations, and 2 others: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655 (Volans) Resolved→Open Sorry, moving in th...
[17:29:25] (SystemdUnitFailed) firing: (2) isc-dhcp-server.service Failed on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:30:44] XioNoX: fixed, with_content takes a regex, which seems like a poor api for comparing big blocks of texts as we are doing, so I needed to escape the regex characters. A better refactor would be to compare the literal strings, since we don't seem to be using the regex functionality.
[17:32:27] ohhhh
[17:32:42] jhathaway: thanks so much !
[17:33:19] netops, Ganeti, Infrastructure-Foundations, SRE, Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (ayounsi) For the latter, some more debug: I added ` shared-network "test" { subnet 10.192.24.1 netmask 255.255.255.255 { opti...
[17:42:30] happy to
[21:30:37] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed