[07:06:34] 10Puppet, 10Infrastructure-Foundations: Puppetdb: audit existing configuration - https://phabricator.wikimedia.org/T291538 (10Volans) p:05Triage→03Medium
[07:22:17] 10Puppet, 10Infrastructure-Foundations: Numa fact: puppetdb has the fact for only ~60% of the fleet - https://phabricator.wikimedia.org/T291539 (10Volans) p:05Triage→03Medium
[07:22:26] 10Puppet, 10Infrastructure-Foundations: Puppetdb: not refreshed on config change? - https://phabricator.wikimedia.org/T291540 (10Volans) p:05Triage→03Medium
[07:22:45] 10Puppet, 10Infrastructure-Foundations: Numa fact: puppetdb has the fact for only ~60% of the fleet - https://phabricator.wikimedia.org/T291539 (10Volans) I've opened T291540 for the more general case.
[07:27:00] 10Puppet, 10Infrastructure-Foundations: Host distribution across puppetmasters - https://phabricator.wikimedia.org/T291541 (10Volans) p:05Triage→03Medium
[07:27:33] sorry for the spam, while looking at the numa issue on puppetdb I found a bunch of things that I think need fixing and opened the above 4 tasks ^^^ :)
[07:30:06] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Cookbooks repository: avoid stale code in master branch - https://phabricator.wikimedia.org/T287465 (10Volans) 05Open→03Resolved AFAIK it's all working fine, resolving, feel free to re-open if you encounter any issue.
[07:32:03] 10Puppet, 10Infrastructure-Foundations: Hosts distribution across puppetmasters - https://phabricator.wikimedia.org/T291541 (10Volans)
[08:03:56] 10SRE-tools, 10Infrastructure-Foundations, 10SRE, 10serviceops-radar: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (10Volans) Has been a while since we discussed this but the problem still stands and I think we need to get some progress here. What would be the best way...
[08:41:08] 10netops, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10SRE, 10bacula: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo) I see a huge improvement on the "stability" (if you...
[09:15:23] 10Puppet, 10Infrastructure-Foundations: Numa fact: puppetdb has the fact for only ~60% of the fleet - https://phabricator.wikimedia.org/T291539 (10jbond) 05Open→03In progress
[09:35:10] 10Puppet, 10Infrastructure-Foundations: Numa fact: puppetdb has the fact for only ~60% of the fleet - https://phabricator.wikimedia.org/T291539 (10jbond) I sent a `kill -HUP` to the puppetdb service, looks like I missed sending it on puppetdb2002. Sending HUP is AFAIK mostly undocumented. T...
[09:55:01] 10Puppet, 10Infrastructure-Foundations: Puppetdb: not refreshed on config change? - https://phabricator.wikimedia.org/T291540 (10jbond) Currently this is a conscious decision. When puppetdb is restarting, all submissions to it are rejected, which generally causes the "wide spread puppet issues" alert. Rece...
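The T291539 thread above is about how many hosts expose the numa fact in PuppetDB. A minimal sketch of how that coverage could be measured with the standard PuppetDB v4 query API; the hostname, port and lack of TLS/auth below are assumptions, not taken from the log:

#!/usr/bin/env python3
"""Rough check of numa fact coverage in PuppetDB (context: T291539)."""
import requests

# Assumption: the v4 query API is reachable locally without TLS or auth.
PUPPETDB = "http://localhost:8080"


def count(endpoint: str) -> int:
    """Return how many objects a PuppetDB v4 endpoint returns."""
    resp = requests.get(f"{PUPPETDB}/pdb/query/v4/{endpoint}", timeout=30)
    resp.raise_for_status()
    return len(resp.json())


nodes = count("nodes")           # active nodes known to PuppetDB
with_numa = count("facts/numa")  # one entry per node reporting the numa fact
pct = 100 * with_numa / nodes if nodes else 0.0
print(f"{with_numa}/{nodes} nodes report the numa fact ({pct:.1f}%)")

A result well below 100% would match the ~60% figure in the task title.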
[10:15:16] jobo, jbond if you're both ok with the current text on the doc I'll put it in wikitech and send the patch to spicerack to link it
[10:19:32] volans: it's a go from me :)
[10:24:24] from me as well 👍
[10:24:55] thanks, doing
[10:37:03] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Puppetdb: audit existing configuration - https://phabricator.wikimedia.org/T291538 (10jbond) a:03jbond
[10:37:15] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Puppetdb: audit existing configuration - https://phabricator.wikimedia.org/T291538 (10jbond) 05Open→03In progress
[10:38:17] 10Puppet, 10Infrastructure-Foundations: Numa fact: puppetdb has the fact for only ~60% of the fleet - https://phabricator.wikimedia.org/T291539 (10jbond) This should be fixed now, and I have re-imported the facts to the compiler hosts
[10:38:58] 10Puppet, 10Infrastructure-Foundations: Numa fact: puppetdb has the fact for only ~60% of the fleet - https://phabricator.wikimedia.org/T291539 (10jbond) 05In progress→03Resolved a:03jbond
[10:43:07] jobo, jbond: {done} https://wikitech.wikimedia.org/wiki/Spicerack#How_to_contribute
[10:46:14] and https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/722841
[10:49:44] Thanks!
[11:30:20] 10Puppet, 10Infrastructure-Foundations: Hosts distribution across puppetmasters - https://phabricator.wikimedia.org/T291541 (10jbond) > its resolution depends entirely on the order of the search Seems like hosts don't have the $site.wmnet as the first entry in the search path, this is probably an easy fix. bu...
[11:39:36] jbond: what is your ETA for 722838 ?
[11:41:32] volans, topranks, looks like the maintenance parser stuff is also a standalone python library: https://github.com/networktocode/circuit-maintenance-parser
[11:43:51] effie: it's complete
[11:44:07] can I run puppet on deploy1002?
[11:44:08] oh nice! thanks for that XioNoX.
[11:44:09] everything should be enabled again now
[11:44:13] oh great!
[11:44:24] effie: let me know if it's still disabled
[11:44:45] all good, thank you
[11:44:51] sorry to nag you about it
[11:44:52] great thx
[12:28:24] XioNoX: nice!
[12:43:01] 10Puppet, 10Infrastructure-Foundations: Hosts distribution across puppetmasters - https://phabricator.wikimedia.org/T291541 (10Volans) >>! In T291541#7371589, @jbond wrote: >> its resolution depends entirely on the order of the search > Seems like hosts don't have the $site.wmnet as the first entry in the sear...
[12:57:07] jbond: what would you think about moving spicerack to proper semantic versioning, starting with 1.0.0 in the next release?
[12:58:02] we're at 0.0.59. The only drawback I see is that, as we do some breaking change now and then, the major version number might bump more often than not
[12:58:15] and we might end up with spicerack 60.1.2 not too far in the future
[12:58:38] at the same time... nobody is using that number AFAIK and so it shouldn't be a problem I think
[12:59:01] I merged https://gerrit.wikimedia.org/r/c/operations/homer/public/+/722551, but didn't run "homer commit", shouldn't a 'homer "cr*" diff' show me the change?
[13:00:18] unless it's already configured like that in the CRs, then yes
[13:00:23] it should show you the diff
[13:01:13] that's odd, as it currently only says "# No diff"
[13:02:17] did you run puppet to pull the merged change?
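The spicerack versioning exchange above (0.0.59 today, possibly 1.0.0 next) hinges on the semantic versioning rule that every breaking change bumps the major number. A small self-contained sketch of that rule, purely illustrative and not spicerack's actual release tooling; the bump() helper and the example versions are hypothetical:

"""Illustration of the semver bump rules discussed for spicerack releases."""


def bump(version: str, change: str) -> str:
    """Return the next version string for a 'major', 'minor' or 'patch' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":   # breaking API change
        return f"{major + 1}.0.0"
    if change == "minor":   # backwards-compatible feature
        return f"{major}.{minor + 1}.0"
    if change == "patch":   # bug fix only
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")


print(bump("1.0.0", "major"))  # 2.0.0 -- one breaking change per release is
print(bump("2.0.0", "major"))  # 3.0.0    how a spicerack 60.x could be reached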
[13:05:04] doh, ofc
[13:05:43] totally missed that, running puppet and re-running the diff
[13:09:32] volans: i would say let's move to semantic versioning
[13:10:28] works fine now, sorry for the noise :-)
[13:10:34] moritzm: no prob!
[13:12:26] 10Puppet, 10Infrastructure-Foundations: Hosts distribution across puppetmasters - https://phabricator.wikimedia.org/T291541 (10jbond) > Basically instead of puppet have puppet.eqiad.wmnet, that might point to a different host, even in a different datacenter for temporary failover purposes. Ahh I see, what you...
[14:10:31] wiki-mail-codfw is in dns.git, but also in Netbox, some leftover from the migration?
[14:11:04] checking
[14:12:20] moritzm: apparently so
[14:12:52] how did you find it?
[14:13:17] in netbox it's marked as VIP, is that correct?
[14:14:08] ah, the double result from a dig I guess
[14:14:25] totally unrelated, Keith and I are currently tracking down the last mail floating through mx1001 for the bullseye reimage
[14:14:46] and at first I was unsure whether it's in netbox or dns.git
[14:48:20] moritzm, jbond: can I grab one of the sretest hosts for reimage testing again?
[14:50:21] good with me
[14:51:41] ack, please go ahead
[14:57:12] thx
[14:57:18] I might need to send a quick patch to the cookbook first
[15:19:47] did anything change in the installer setup today?
[15:19:52] mx1001 fails to reimage
[15:20:08] what's the failure?
[15:20:12] with bullseye, but the reimage of mx2001 went fine
[15:20:25] can't find the kernel udebs
[15:20:40] but AFAICT there has been zero change on the Debian kernel side
[15:21:28] those should be inside the image, right?
[15:22:02] the d-i image gets correctly started
[15:22:19] but then it fails to grab the additional kernel modules used during installation
[15:22:47] what I was asking is whether it has to download them or they should be part of the image
[15:23:03] -rw-rw-r-- 1 root root 144750509 Sep 22 15:07 initrd.gz
[15:23:13] this has today's timestamp
[15:23:22] they are downloaded from the running image
[15:23:23] in /var/lib/puppet/volatile/tftpboot/bullseye-installer/debian-installer/amd64
[15:30:15] we've reverted the mx1001 reimage for now, mx1001 is powered back on and bringing down the router filters
[15:30:43] I'll test if I can reproduce this on a different VM
[15:32:42] ack
[15:57:05] * volans actually running the reimage on sretest now
[16:23:58] 10CAS-SSO, 10Infrastructure-Foundations, 10GitLab (Auth & Access), 10Patch-For-Review, and 2 others: Open gitlab.wikimedia.org to all users with Wikimedia developer accounts - https://phabricator.wikimedia.org/T288162 (10brennen)
[16:50:20] moritzm: fwiw reimage is failing on sretest1002 too
[16:50:31] same reason
[16:57:55] I see update-netboot-image bullseye was run on puppetmaster1001, but I dunno who ran it or when yet
[17:17:26] yeah, I ran it before to refresh the bullseye image, but it didn't make a difference
[17:17:40] I'll debug this more closely tomorrow, not sure what's happening there
[17:17:55] but it's a good data point that this isn't limited to Ganeti VMs
[17:35:11] ok, I'll leave sretest1002 stuck in d-i
[17:35:36] you can try the experimental reimage cookbook on it if you want too
[19:22:36] won't make a difference, the issue is within d-i
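The wiki-mail-codfw exchange earlier in the log ("the double result from a dig") is the kind of symptom that can show up when a name is defined both in dns.git and in the Netbox-generated zone data. A minimal sketch of such a check using dnspython; the fully qualified form of the name is a guess, and "more than one answer" is only a heuristic since resolvers may deduplicate identical records:

"""Flag names that return multiple A records (context: the wiki-mail-codfw discussion)."""
import dns.resolver

# Assumption: this is the fully qualified form of the name mentioned in the log.
NAME = "wiki-mail-codfw.wikimedia.org"

answers = dns.resolver.resolve(NAME, "A")
addresses = sorted(rdata.address for rdata in answers)
print(f"{NAME} -> {addresses}")
if len(addresses) > 1:
    print("more than one A record: worth checking whether the name is defined "
          "in both dns.git and the Netbox-generated zone snippets")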