[00:03:55] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:03:56] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:55:03] netops, DBA, Infrastructure-Foundations, SRE, and 3 others: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (ayounsi) Resolved→Open Thanks @Ladsgroup yeah some devices got way too verbose at sending debug logs and we don't use debug level logs for alerting so the ab...
[07:59:23] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[08:03:56] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:18:57] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[09:27:49] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[10:48:55] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (jbond) Well i have updated apt1001 to 8.2102.0-2~deb10u1 and i still see the problem so tha...
[10:48:56] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:50:41] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[10:56:45] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (MoritzMuehlenhoff) >>! In T351181#9333629, @jbond wrote: > Well i have updated apt1001 to 8...
[11:00:47] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (MoritzMuehlenhoff) >>! In T351181#9333641, @MoritzMuehlenhoff wrote: >>>! In T351181#933362...
[11:05:59] (PuppetFailure) firing: Puppet has failed on bast2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:20:59] (PuppetFailure) resolved: Puppet has failed on bast2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:48:56] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:51:46] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:53:01] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[12:08:26] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (jbond) i have tested using openssl and that works so ill prepare a patch to switch all buster to openssl
[12:14:23] topranks: lmk if you need a hand to test the reimage for the dhcp cleanup
[12:14:32] and thanks for fixing that
[12:15:15] volans: np, I'll test in codfw for the "evpn" use-case anyway
[12:15:49] in terms of the scenario we don't want it to run I'll need to try and find a victim machine, is there an sretest host you know of that could be used?
[12:17:10] usually any sretest is fine, I'm not using any at the moment
[12:17:19] not sure if other are
[12:20:06] cool
[12:33:49] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest2003.codfw.wmnet with OS bullseye
[13:14:18] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2003.codfw.wmnet with OS bullseye...
[13:16:44] netops, Infrastructure-Foundations, SRE: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (cmooney) p: Triage→High
[13:25:55] topranks, XioNoX: there are pending dns changes for ae1-1117 in eqiad, are you planning to run the dns cookbook?
[13:29:53] netops, Infrastructure-Foundations, SRE: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (cmooney) Reset completed, the card came back up briefly but quickly failed again ` cmooney@re0.cr1-esams> show chassis fpc 1 detail Slot 1 information: State...
[13:38:04] volans: my bad was in the middle of removing other bits
[13:38:06] running now
[13:38:15] k thx
[13:38:31] btw reimage and dhcp clear worked in codfw
[13:38:39] "Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)"
[13:39:02] ^^ I'll need to submit another patch to clear the row e/f reference in this log but all good, confirmed on switch commands issued fine
[13:42:29] netops, Infrastructure-Foundations, SRE: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (cmooney) Some logs following the issue of the "request chassis fpc online slot 1" command: {F41507770}
[13:43:56] (SystemdUnitFailed) firing: (3) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:46:45] great
[13:48:41] netops, Infrastructure-Foundations, SRE: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (ayounsi) JTAC case 2023-1115-011066 opened.
[13:48:48] netops, Infrastructure-Foundations, SRE: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (ayounsi) a: ayounsi
[13:54:15] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[13:54:34] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (jbond) Open→Resolved a: jbond i have rolled out a change so that buster machines use openss...
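[editor's note] For context on the T351181 fix above: rsyslog ships two TLS netstream drivers, "gtls" (GnuTLS) and "ossl" (OpenSSL), and the rollout switched Buster clients to the latter. A minimal hedged sketch of what such a client-side forwarding config can look like — the CA file path, port, and exact parameters are illustrative assumptions, not the actual puppet-managed config:

```
# Use the OpenSSL netstream driver ("ossl") instead of GnuTLS ("gtls").
# CA path and port below are illustrative, not the production values.
global(DefaultNetstreamDriver="ossl"
       DefaultNetstreamDriverCAFile="/etc/ssl/certs/wmf-ca.pem")

action(type="omfwd" target="centrallog2002.codfw.wmnet" port="6514"
       protocol="tcp"
       StreamDriver="ossl" StreamDriverMode="1"
       StreamDriverAuthMode="x509/name"
       StreamDriverPermittedPeers="centrallog2002.codfw.wmnet")
```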
[14:14:26] (SystemdUnitFailed) firing: (4) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:23:36] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (fgiunchedi) Thank you for looking into this and fixing the issue, I can confirm the errors I'm seeing...
[14:26:57] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (MoritzMuehlenhoff) I'll also open a separate task to eventually also move Bullseye and Bookworm hosts...
[14:28:56] (SystemdUnitFailed) firing: (4) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:31:57] netops, DBA, Infrastructure-Foundations, SRE, and 2 others: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (Ladsgroup) Thanks for the patch! I hope it'll make a dent, I'll monitor it. While I was monitoring it, I tried this: ` root@db1217.eqiad.wmnet[librenms]> select * f...
[14:43:31] netops, DBA, Infrastructure-Foundations, SRE, and 2 others: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (Marostegui) @ayounsi okay to truncate that table?
[14:43:56] (SystemdUnitFailed) firing: (4) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:50:20] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye
[14:50:47] netops, Infrastructure-Foundations, SRE: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (ayounsi)
[14:53:18] SRE-tools, homer, Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415 (ayounsi) FYI, this limitation is becoming more and more problematic for deploying a change to the whole infra.
[14:55:59] (PuppetFailure) firing: Puppet has failed on ganeti1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:01:39] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (Southparkfan)
[15:05:10] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (Southparkfan) Production migration from the gnutls driver to the openssl driver can be tracked in T324...
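[editor's note] Regarding the parallelization request in T250415 above: a purely illustrative sketch of running a per-device deploy step concurrently with a bounded worker pool. `configure_device` and every name here are hypothetical placeholders, not Homer's actual API:

```python
"""Hypothetical sketch of per-device parallelism for a Homer-style tool."""
from concurrent.futures import ThreadPoolExecutor, as_completed


def configure_device(fqdn: str) -> str:
    # Placeholder for the real "generate diff + commit" work per device.
    return f"{fqdn}: ok"


def run_all(devices: list[str], max_workers: int = 5) -> dict[str, str]:
    """Run the per-device step concurrently, collecting a result per host."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(configure_device, d): d for d in devices}
        for fut in as_completed(futures):
            device = futures[fut]
            try:
                results[device] = fut.result()
            except Exception as exc:  # keep going if one device fails
                results[device] = f"error: {exc}"
    return results
```

A thread pool is enough here because per-device work is dominated by network I/O; the `max_workers` bound keeps the tool from hammering every router at once.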
[15:05:59] (PuppetFailure) resolved: Puppet has failed on ganeti1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:07:25] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[15:09:51] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1001.eqiad.wmnet with OS bullseye
[15:26:35] volans: topranks: let's find some time to talk about how to resolve the authdns-update issue today
[15:26:44] I mean, not today, but the one that happened today!
[15:28:59] sukhe: sure but what do you mean by resolve? if you change netbox and then run the dns cookbook and it removes some include files you should remove the includes first from the ops repo. At most we could check that deleted files are not included in the dns repo maybe.
[15:29:13] if I understand correctly what happened (but correct me if I'm wrong)
[15:29:56] volans: this is not the first time this has happened and I don't think it will be the last. essentially if we can figure out a way to alert us before things start failing?
[15:30:10] any thoughts on that since you know the Netbox side better?
[15:30:30] without running the dns cookbook nothing you change on netbox affects dns
[15:30:40] if you change netbox and don't run it, icinga alerts
[15:30:46] after a while
[15:32:04] right but what's a good solution for this? since this has happened in the past as well. we had an issue where the status was changed and the PTR include was not removed and so authdns-update was broken till it was
[15:32:30] one can argue that the right order might help but there are other corner cases too? like for example if there is an include for the PTR not the A record, this again fails?
[15:32:41] which happens when a host is decommissioned and there are no IPs allocated in the subnet
[15:32:52] volans: we discussed before about potentially just having the netbox cookbook generate a single file for each of the zones gdnsd has defined?
[15:33:18] I think we are mixing very different "failure" scenarios
[15:33:57] yes. the output unless I am missing something though is basically the same from what I have observed in the past cases
[15:34:22] so like there would be a single "include" statement in 10.in-addr.arpa, and the file it pointed to (generated by cookbook) had numerous $ORIGIN statements followed by the specific records for the subnet in question?
[15:34:26] one is that an $INCLUDE statement is including a file that doesn't exists anymore
[15:34:47] ^^ I think this is the scenario we could hopefully prevent
[15:34:59] another one is that the dns repo CI fails because of some check that fails (duplicate/missing records)
[15:36:21] volans: correct
[15:36:34] in both cases unless I am mistaken, we only discover it till we run authdns-update
[15:36:44] but the cookbook does run it
[15:37:09] the uncommitted changes in Netbox message is helpful but I think the low SNR on that negates that
[15:37:25] volans: runs it how?
[15:37:59] (PuppetFailure) firing: Puppet has failed on bast2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:38:05] sukhe: utils/deploy-check.py -g {netbox} --deploy
[15:38:35] where netbox=/srv/git/netbox_dns_snippets
[15:39:12] so my question first is, what failed today? was the dns cookbook run successful?
[15:40:13] the alternative option could be to have gdnsd on the netbox hosts configured (without listening, just to validate things) and run the check there with the path to the netbox stuff to the temporary path where the cookbook is creating the tree
[15:40:14] Today's issue was the "include statement for file that doesn't exist" problem
[15:40:19] and allow to commit only if it passes
[15:41:03] I was waiting on the +1 for the dns repo change to merge the remove of the "include" lines, but the records had already been deleted in Netbox
[15:41:37] and that's the wrong order :D
[15:42:07] did the cookbook pass or fail though?
[15:42:07] indeed, but sometimes it's not always clear you're removing the last IP in a block
[15:42:13] cookbook failed
[15:42:16] I think I might be in the minority here but I don't think the order should fail stuff
[15:42:20] ok so it did fail
[15:42:24] there was no confusion about what was happening
[15:42:29] given the criticality of the service and our dependence on authdns-update
[15:42:49] as in, we should not count on the order being followed correctly
[15:43:08] sukhe: but the cookbook did fail, telling the operator there was an issue no?
[15:43:26] volans: correct but sometimes even resolving this takes a while
[15:43:33] the only way to "prevent" it is to have gdnsd on the netbox host where the script runs
[15:43:38] to be able to run that before committing
[15:43:39] and this is assuming the person knows anything about Netbox or gdnsd
[15:44:04] *but* that will block things in the other way around, just be aware of that
[15:44:09] can we automatically generate the include statements too?
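[editor's note] The "$INCLUDE pointing at a deleted file" failure mode discussed above ("At most we could check that deleted files are not included in the dns repo") could be caught before commit with a trivial scan. This is an illustrative sketch only — the real ops/dns repo layout, zone directory names, and the cookbook's temporary tree are assumptions here:

```python
#!/usr/bin/env python3
"""Hypothetical pre-commit check: verify that every $INCLUDE in a zone
file points to a file that actually exists. Paths are illustrative,
not the actual ops/dns repo layout."""
import pathlib
import re
import sys

# Match "$INCLUDE <path>" at the start of a line in a master zone file.
INCLUDE_RE = re.compile(r'^\$INCLUDE\s+(\S+)', re.MULTILINE)


def missing_includes(zone_text: str, search_root: pathlib.Path) -> list[str]:
    """Return the $INCLUDE targets that do not exist under search_root."""
    return [
        target
        for target in INCLUDE_RE.findall(zone_text)
        if not (search_root / target).exists()
    ]


if __name__ == "__main__":
    root = pathlib.Path(sys.argv[1]) if len(sys.argv) > 1 else pathlib.Path(".")
    bad: list[str] = []
    for zone in root.glob("zones/*"):  # assumed layout: zones/ next to snippets
        bad.extend(missing_includes(zone.read_text(), root))
    if bad:
        print("missing $INCLUDE targets:", ", ".join(bad))
        sys.exit(1)
```

Run against the temporary tree the cookbook builds, this would fail the run (and block the commit) before authdns-update ever sees the broken include.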
[15:44:19] taavi: we don't want them all
[15:44:29] it's a human decision
[15:45:31] sukhe: to clarify, if we do as I said above, then if there is a change in netbox that breaks things, that means that if we need to quickly deploy a netbox-driven change for whatever reason we're blocked, unless you're familiar and know how to use the --emergency-manual-edit flag
[15:46:24] if that's deemed ok we can go in that direction
[15:46:47] we probably need a diagram because I'm a bit lost
[15:46:59] it's confusing and I don't think there are any other easy answers
[15:47:01] I'm not getting why couldn't we have a single "include" file in each reverse zone, and that included file (generated by netbox) had all the records, with appropriate $ORIGIN statements before each?
[15:47:09] volans: since you know the most, is there any reason we can't soft-fail on a missing include?
[15:47:11] not sure I understand "taavi: we don't want them all"
[15:47:18] would that break something? I can't think of anything obvious but maybe I need to think more
[15:47:37] IIRC that's gdnsd failing
[15:47:41] sukhe: that's probably on the gdnsd?
[15:47:45] yeah
[15:47:49] XioNoX: I think that means we want to let Netbox generate records, but the file to go unused cos there is no matching "include" in the zonefile
[15:47:59] (PuppetFailure) resolved: Puppet has failed on bast2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:48:19] right, that's the check-deploy
[15:48:25] XioNoX: I meant that we don't want to include all files and alll the includes go in specific places
[15:48:38] I mean in the other way around, all the config in generated by netbox, and there are a few includes for the static parts of the config
[15:48:57] uh? I never proposed that :D
[15:49:14] what do you mean?
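[editor's note] To illustrate the single-include idea floated above: each hand-maintained reverse zone would carry exactly one $INCLUDE, and the generated file would use $ORIGIN to scope each subnet's records. All hostnames and paths below are hypothetical:

```
; 10.in-addr.arpa (hand-maintained): one include, never edited per-subnet
$INCLUDE netbox/10.in-addr.arpa

; netbox/10.in-addr.arpa (generated): an $ORIGIN per subnet, then its PTRs
$ORIGIN 0.64.10.in-addr.arpa.
1    1H IN PTR  gw.example.wmnet.
$ORIGIN 16.64.10.in-addr.arpa.
5    1H IN PTR  sretest9999.example.wmnet.
```

With that layout, adding or removing a subnet only changes the generated file, so the "include for a file that no longer exists" failure mode goes away — at the cost of the hand-maintained zone no longer controlling which subnets are published.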
[15:49:46] I mean I am in the camp of what happened today that if we have this failing, we should page immediately
[15:50:03] that might sound drastic but we would have not been able to depool esams had this not been resolved
[15:50:13] and sometimes the error itself is not obvious and it takes a while to figure out
[15:50:15] volans: netbox having a bigger view on the dns config
[15:50:33] XioNoX: it doesn't have
[15:50:35] volans: let's set up a time to go over this if that's fine?
[15:50:38] we can collect our thoughts
[15:50:48] sukhe: sorry but the page would not have helped you today
[15:50:53] it would have paged around the same time
[15:50:58] yeah not today at least
[15:51:15] but two weeks ago we had a similar issue, that was discoverd because a cookbook was failing
[15:51:18] and it was hidden otherwise :)
[15:51:33] I bet someone had a cookbook run failed
[15:51:40] if one runs the dns cookbook and ignores the failure...
[15:51:49] that's a bigger issue than order problems MHO
[15:51:52] *IMHO
[15:52:11] but I agree with the general problem
[15:52:12] I see your point. but I also have seen how people are lost around this issue
[15:52:14] we can make it more resilient
[15:52:18] yeah basically that
[15:52:33] the thing is that somethings failing might be OK-ish I guess? but this one is not
[15:53:24] sukhe: so you have a canary dns host by any chance
[15:53:24] ?
[15:53:40] one that is not answering any dns request
[15:53:52] volans: right now no
[15:53:58] but if you want, I can make one for you
[15:54:14] ok, then the easiest solution is to have gdnsd on the netbox hosts, where the files are generated and test them before commit
[15:54:46] CFSSL-PKI, Infrastructure-Foundations: PKI: configure a check for ocsp - https://phabricator.wikimedia.org/T350688 (jbond) The following command should be be able to be used to check ` $ openssl ocsp -issuer /etc/cfssl/signers/debmonitor/ca/debmonitor.pem -cert /etc/debmonitor/ssl/debmonitor__pki1001_eq...
[15:55:07] my dream here I guess is different - which is to not have to make edits in the DNS repo when new subnets are added/removed
[15:55:44] topranks: that's another useful angle, but narrower in scope (fixes only one specific issue)
[15:55:52] yes indeed
[15:56:14] and as you know I've proposed that we reduce the number of INCLUDEs since day one
[15:56:23] volans: I think that would be helpful by itself for sure
[15:56:25] we needed them specific to be able to migrate to netbox slowly
[15:57:28] ok yeah that makes sense
[15:57:50] also PTR ORIGINs are not very flexible in size :D
[16:03:59] netops, DBA, Infrastructure-Foundations, SRE, and 2 others: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (Marostegui) Open→Resolved Per my chat with Arzhel in irc, table truncated! `root@db1119.eqiad.wmnet[librenms]> truncate table syslog; Query OK, 0 rows affect...
[16:22:59] (PuppetFailure) firing: Puppet has failed on ping1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:32:55] sukhe, topranks, XioNoX: if you're free there are the automation office hours upcoming in ~half an hour we could use for that :D
[16:32:59] (PuppetFailure) resolved: Puppet has failed on ping1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:33:46] hahah
[16:33:50] sorry for not thinking about it earlier
[16:33:55] volans: I would love to join, but I have an interview in ~35
[16:34:05] how long are the office hours for? one hour I am assuming?
[16:34:14] yes, a bit less
[16:34:22] too bad, no prob, we can find any other time :D
[17:54:05] Packaging, Infrastructure-Foundations, Phabricator, collaboration-services: build python-phabricator package for bullseye (and bookworm?) - https://phabricator.wikimedia.org/T351333 (Dzahn)
[18:03:39] Packaging, Infrastructure-Foundations, Phabricator, collaboration-services: build python-phabricator package for bullseye (and bookworm?) - https://phabricator.wikimedia.org/T351333 (Dzahn) Before we do this the question is: Do phab/phorge servers actually use it? Because in older tasks linked...
[18:08:45] Packaging, Infrastructure-Foundations, Phabricator, collaboration-services: build python-phabricator package for bullseye (and bookworm?) - https://phabricator.wikimedia.org/T351333 (Dzahn) https://debmonitor.wikimedia.org/packages/python-phabricator ^ so alert* (icinga) and seaborgium (openldap...
[18:09:04] Packaging, Infrastructure-Foundations, Phabricator, collaboration-services: build python-phabricator package for bullseye (and bookworm?) - https://phabricator.wikimedia.org/T351333 (taavi) >>! In T351333#9335050, @Dzahn wrote: > (bonus question: How does it get on the openldap / icinga servers i...
[18:12:33] CFSSL-PKI, Infrastructure-Foundations: PKI: configure a check for ocsp - https://phabricator.wikimedia.org/T350688 (jbond) looking at the ocsp file using the following command suggests that something with ocprefresh is not rworking correctly as the response is from kafka ca ` sudo openssl ocsp -respin /...
[18:15:45] Packaging, Infrastructure-Foundations, Phabricator, collaboration-services: build python-phabricator package for bullseye (and bookworm?) - https://phabricator.wikimedia.org/T351333 (Dzahn) Same thing. Both only exist on buster: ` [apt1001:~] $ sudo -i reprepro ls python-phabricator python-phab...
[18:17:18] Packaging, Infrastructure-Foundations, Phabricator, collaboration-services: build python-phabricator package for bullseye (and bookworm?) - https://phabricator.wikimedia.org/T351333 (Volans) Open→Invalid There is no python2 in our setup of `bullseye` or `bookworm`. `python3-phabricator` i...
[18:36:48] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[18:43:56] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:56:49] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1001.eqiad.wmnet with OS bullseye...
[19:05:05] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[19:11:53] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (Dzahn)
[22:43:57] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:48:37] \o We're working on bringing in some new cloudelastic hosts (`cloudelastic10[07-10]`). After the first puppet run, we see `Error: The CRL issued by 'CN=Wikimedia_Internal_Root_CA,OU=Cloud Services,O=Wikimedia Foundation\, Inc,L=San Francisco,ST=California,C=US' has expired, verify time is synchronized` upon trying to run puppet again
[22:48:44] Notably, this happened both on `cloudelastic1007` which is not on the new puppet7 as well as on `cloudelastic1008` which has been migrated
[22:51:28] andrewbogott: Is ryankemper's problem familiar? ^
[22:53:28] FWiW, for 1008, the reimage "failed" but the host was up via console and I was able to login and run puppet from the console, which made it accessible from SSH. Maybe that initial puppet run was pointing to the wrong master or something?