[00:03:54] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:21:36] CAS-SSO, Infrastructure-Foundations, Patch-For-Review, Release-Engineering-Team (Quid Pro Crow 🦃): Correct IDP login page Privacy Policy - https://phabricator.wikimedia.org/T350129 (Aklapper)
[04:03:55] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:03:55] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:26] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[08:08:55] (SystemdUnitFailed) firing: (3) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:13:55] (SystemdUnitFailed) firing: (3) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:57:42] Puppet, MediaModeration (MediaModeration 2.0), Trust and Safety Product Sprint (Sprint Bodhrán): [S] Add mediamoderation_scan to the private tables list on puppet - https://phabricator.wikimedia.org/T351095 (Dreamy_Jazz)
[09:13:34] Puppet, MediaModeration (MediaModeration 2.0), Trust and Safety Product Sprint (Sprint Bodhrán): [S] Add mediamoderation_scan to the private tables list on puppet - https://phabricator.wikimedia.org/T351095 (Dreamy_Jazz) Open→Resolved
[09:34:19] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (fgiunchedi)
[10:04:56] Puppet, Wikidata, Wikidata Analytics, wmde-wikidata-tech, Technical-Debt: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072 (Lucas_Werkmeister_WMDE)
[10:10:37] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (jbond) Some additional information * puppet7 agents can talk to both centrallog1002 and ce...
[10:21:43] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (jbond) from a very simple test this appears to only affect buster ` # in the following eve...
[10:25:48] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (klausman)
[10:38:35] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (jbond) Feels like this could be related to https://bugs.debian.org/cgi-bin/bugreport.cgi?bu...
[10:49:31] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (klausman)
[10:56:41] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (jbond) > > edit: or possibly this one https://github.com/rsyslog/rsyslog/issues/4035 ok i...
[11:13:55] (SystemdUnitFailed) firing: (3) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:17:01] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (fgiunchedi) I can confirm that e.g. bookworm hosts are sending syslog fine, e.g. titan1002:...
[11:17:53] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (klausman)
[11:19:51] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (fgiunchedi) Ditto bullseye: ` centrallog2002:~$ tail -5 /srv/syslog/thanos-fe1001/syslog.l...
[11:23:55] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (Jelto)
[11:38:17] jbond, jhathaway: jayme hit an issue with puppet that seems related to the devenv
[11:38:42] Cannot run program "/srv/puppet_code/environments/production/utils/get_config7.sh" (in directory "."): error=2, No such file or directory on node kubestage2002.codfw.wmnet
[11:38:50] but AFAIK that's for the devenv...
[11:39:04] that happened to me on running a reimage for kubestage2002
[11:39:29] I see that environment.conf does call that
[11:41:37] that's run on the puppetmaster/server, right?
[11:42:47] if I run
[11:42:48] $ sudo cumin 'A:puppetserver' '/srv/git/operations/puppet/utils/get_config7.sh production'
[11:42:56] it works on all 5 servers
[11:47:16] jayme: AFAICT it's transient, does the cookbook allow you to retry the puppet run?
[11:47:29] because I tried a noop run via install_console and it does work
[11:48:21] volans: yes, it does allow retry - but I did not retry as of now as this looked spooky and related to puppet7
[11:49:11] true, also the noop run was already run by the cookbook to populate icinga
[11:49:17] so noop worked
[11:49:26] and then the first run hit this, interesting
[11:49:44] for reference (for john/jesse) the command that failed is: puppet agent --onetime --no-daemonize --verbose --no-splay --show_diff --no-usecacheonfailure
[11:51:57] volans: if transient then my best guess would be T350809
[11:51:58] T350809: Sporadic puppet failures - https://phabricator.wikimedia.org/T350809
[11:52:17] I'm not 100% sure it's transient, might be noop vs normal run yet
[11:52:20] do you have a timestamp so I can check if there was a puppet-merge
[11:52:34] 2023-11-14 09:55:00,626
[11:53:18] (that's when it failed, it started at 2023-11-14 09:54:54,266)
[11:54:18] jbond: I'll let you decide if jayme should retry or wait for additional debugging :D
[11:55:03] I've no problem waiting. I did the reimage just because of puppet7 - so I'm not in a rush
[11:56:43] jayme: please retry, I see a puppet-merge happened at that time so I'm 99% sure that will be the cause
[11:56:55] s/will/was/ the cause
[11:57:13] ack
[11:57:21] so this is kind of a race condition?
[11:57:48] jayme: yes, it's documented in T350809 and jhath.away has been exploring workarounds
[11:57:49] T350809: Sporadic puppet failures - https://phabricator.wikimedia.org/T350809
[12:04:09] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[12:11:22] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (jbond) >>! In T351181#9329892, @jbond wrote: >> >> edit: or possibly this one https://gith...
[12:13:55] (SystemdUnitFailed) firing: (3) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:16:57] jbond: retry looks okay
[12:19:46] jayme: great thanks
[12:22:30] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[12:47:40] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[12:55:05] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[12:55:09] netops, DBA, Infrastructure-Foundations, SRE, and 2 others: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (Marostegui) I think we are going to need to tweak this a bit more: ` -rw-rw---- 1 mysql mysql 61G Nov 14 12:44 syslog.ibd ` 61GB is quite large for what this is, t...
[13:06:01] jbond: can I re-run decom for puppetdb2002 ?
[13:11:05] volans: yes
[13:11:27] ack thx
[13:13:40] doh, not found in netbox as it was a vm :D
[13:13:51] idempotency for vms apparently needs to be improved :D
[13:14:12] I'll pick another one already decomm'ed
[13:16:24] oh noo, no primary ip doesn't allow to run it
[13:16:25] ok
[13:16:29] we'll see with the next run
[13:17:23] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[13:22:06] SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, SRE, and 2 others: Update reimage cookbooks to work with puppet7 - https://phabricator.wikimedia.org/T348319 (Volans) In progress→Resolved This is now done.
[13:32:56] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[13:57:14] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (klausman)
[14:13:55] (SystemdUnitFailed) firing: (3) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:50:05] XioNoX thought this article was interesting on running Debian on a Mellanox switch, https://lobste.rs/s/n6vtps/debian_on_mellanox_sn2700_32x100g
[14:52:19] jhathaway: thanks!
[15:13:55] (SystemdUnitFailed) firing: (3) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:30:23] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[16:10:07] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[16:13:18] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (jbond) p:Triage→High
[16:25:22] netbox, Infrastructure-Foundations, Maps, Puppet-Infrastructure, and 2 others: Postgres puppet modules use MD5 for users by default - https://phabricator.wikimedia.org/T300048 (jbond) Open→Resolved a:jbond going to close this as I think it's resolved but please reopen if not
[16:42:00] netops, DBA, Infrastructure-Foundations, SRE, and 2 others: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (Ladsgroup) FWIW, the rows are almost all like this: ` +-----------+----------+----------+-------+------------+---------------------+---------+-----------------------...
[17:16:54] Packaging, Infrastructure-Foundations, Wikidata, Wikidata-Query-Service, [DEPRECATED] wdwb-tech: Migrate WDQS to Java 11 - https://phabricator.wikimedia.org/T316103 (thcipriani) >>! In T316103#8375080, @bking wrote: > Looking on [[ https://github.com/blazegraph/database/issues?q=is%3Aissue+is...
[17:43:55] (SystemdUnitFailed) firing: (3) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:55:24] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[18:43:55] (SystemdUnitFailed) firing: (3) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:34:50] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[19:56:02] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (Dzahn)
[19:57:05] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (Dzahn) stewards: https://gerrit.wikimedia.org/r/c/operations/puppet/+/973863 peopleweb: https://gerrit.wikimedia.org/r/c/operations/puppet/+/973855 etherp...
[20:33:06] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (Dzahn)
[22:43:55] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed