[00:45:49] (PuppetFailure) firing: Puppet has failed on es1036:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:00:49] (PuppetFailure) firing: (2) Puppet has failed on es1036:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:00:49] (PuppetFailure) firing: (2) Puppet has failed on es1036:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:06:52] btullis: can you double check why an-redacteddb is alerting here and on sre-data-persistence@wikimedia.org if it doesn't belong to us, but to your team?
[06:40:38] moritzm: you can go ahead with any of the hosts I mentioned in the email
[07:24:41] marostegui: ack, thanks
[07:27:18] the alert for an-redacteddb1001 showing up here is puzzling, though. Looking at /etc/wikimedia/contacts.yaml on an-redacteddb1001, only Data Platform is listed there
[07:28:00] moritzm: maybe there's some regex somewhere that matches if it says db...?
[07:29:23] I'd hope not :-) At least I've never seen alerts for netboxdb[12]002 showing up here either
[07:29:36] yeah, I hope that too XD
[07:29:46] Let's see if btullis has some ideas
[07:57:57] there was a Puppet failure on an-redacteddb1001, I've just fixed it, maybe this was also due to the initial Puppet run not having been completed, we'll see
[08:00:26] moritzm: I saw some errors on new hosts that are being installed, related to cert expiration
[08:00:45] Was that the same thing? I saw dcops installing hosts yesterday night, but I have no idea what stage they are at right now
[08:01:24] Error: The CRL issued by 'CN=Wikimedia_Internal_Root_CA,OU=Cloud Services,O=Wikimedia Foundation\, Inc,L=San Francisco,ST=California,C=US' has expired, verify time is synchronized
[08:01:27] That is what I saw
[08:01:38] Again, I don't know if they are still working on them or what
[08:09:06] there might have been some issue with the reimage cookbook, the host was configured for Puppet 7 in Hiera, but the Puppet 5 agent was still installed. I fixed that (the error you mentioned is a symptom of that) and now Puppet is working fine
[08:09:24] probably some later step of the Puppet run properly declares the alert ownership
[08:09:58] moritzm: In this case I was asking in case the es1040 error is the same (this host is in the middle of DC-Ops installation, but I saw it alerting on puppet)
[08:10:19] So I don't know what the deal with it is, as it is accessible but still failing on puppet with that error
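The mismatch moritzm diagnoses above (host configured for Puppet 7 in Hiera while the Puppet 5 agent is still installed, with the expired-CRL error as a symptom) is easy to spot on the host itself. A minimal Python sketch of such a check follows; it is not an existing WMF tool, the expected major version is a hard-coded assumption, and `puppet --version` is assumed to be on the PATH:

```python
#!/usr/bin/env python3
"""Minimal sketch: flag the mismatch described above, where Hiera says a host
should be on Puppet 7 but an older agent is still installed. Not an existing
WMF tool; the expected major version is a hard-coded assumption."""
import subprocess
import sys

EXPECTED_MAJOR = 7  # assumption: the major version Hiera configures for the host


def installed_major() -> int:
    # `puppet --version` prints something like "7.23.0"
    out = subprocess.run(
        ["puppet", "--version"], capture_output=True, text=True, check=True
    ).stdout.strip()
    return int(out.split(".")[0])


def main() -> int:
    major = installed_major()
    if major != EXPECTED_MAJOR:
        print(f"mismatch: agent is Puppet {major}, expected Puppet {EXPECTED_MAJOR}")
        return 1
    print(f"ok: agent matches expected Puppet {major}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```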
[08:11:22] ack. I'll have a look at es1040 in ~ 5m
[08:12:05] yeah no rush, I've not been pinged or anything by DCOps, but as I saw the alerts I was curious
[08:22:03] for es1036 I think DC ops might have passed the wrong arguments to the reimage cookbook: es1036 has the Puppet 7 agent packages installed, but insetup::data_persistence defaults to Puppet 5
[08:22:23] let me check in the cookbook logs whether they passed -p 7 to the cookbook
[08:24:21] I wonder why they went with puppet 7 by default
[08:24:31] We didn't say anything about it, just bookworm
[08:24:44] I guess db2196 is the same issue too
[08:25:21] probably simply because all other roles default to Puppet 7 at this point
[08:25:32] maybe we should update the racking template to make it an explicit field?
[08:26:26] moritzm: What would happen if we go for Puppet 7 by default on insetup::data_persistence and then, once the host gets switched to its final role... that final role is Puppet 5 for now?
[08:26:29] Would that be a mess?
[08:26:45] * Emperor would imagine that not working very well
[08:26:48] BICBW!
[08:26:54] yeah, I think it won't work
[08:27:08] Just thinking about making dc-ops' life easier
[08:27:21] If not, I will just comment on the racking tasks and let them know to use -p 5 when installing
[08:27:27] yeah, that won't work. we'd need to explicitly unregister the insetup host from Puppet 7 and move it to 5
[08:27:39] right, so I will let them know to use -p 5 instead then, sounds good?
[08:27:45] basically add a reverse of the migration cookbooks we're currently using for 5->7
[08:28:15] sounds good, I have been unable to find the arguments used in the spicerack logs on cumin1002, but that's the only reasonable theory
[08:28:39] I can actually try it myself
[08:29:00] in other cookbook news, dbproxy2001 is on Puppet 7 now, will proceed with db2132
[08:29:35] cool
[08:30:46] moritzm: How about this limbo? I cannot do --new -p 5 because the host is already installed, but using --new is the only way to be able to use -p 5
[08:31:38] hmmh, good question. let's wait for Riccardo to be around, I'm sure he knows a clever hack right away
[08:31:49] cool :)
[08:40:05] * volans_ reading backlog
[08:41:58] which host? the backlog is confusing :D
[08:42:06] and what you're trying to do
[08:42:14] volans: es1036 for instance
[08:43:13] volans: we believe dc-ops attempted to reimage es1036 with -p 7 for Puppet 7, but we still don't want those hosts on Puppet 7. So I tried to reimage it with -p5, but as the host is already showing up in puppet (and it is accessible) I cannot use --new -p5, and using --new is the only way to be able to pass -p5
[08:43:15] if that makes sense
[08:43:28] This is all guessing, as DCOps didn't ping me or anything about it, but I saw the errors
[08:43:35] And same case with db2196 (I'd guess)
[08:43:52] ok checking
[08:44:21] so the cookbook was run 4 times, all of them with: Executing cookbook sre.hosts.reimage with args: ['-t', 'T355269', '--os', 'bookworm', 'es1036', '--new']
[08:44:22] T355269: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269
[08:44:44] in those cases the cookbook asks the operator for the puppet version, checking the logs for the answer
[08:45:39] 2024-02-26 22:19:56,138 jclark 2293555 [INFO interactive.py:86 in ask_input] User input is: "7"
[08:45:57] so yes, this one was installed with puppet 7
[08:47:03] right, so can I do just --new if the host is already up and in puppet?
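The log check volans performs above (finding the sre.hosts.reimage invocations and the interactive answer for the Puppet version) can be scripted. A rough Python sketch under stated assumptions: the spicerack log directory on cumin1002 and the exact matched substrings are guesses based only on the lines quoted in the conversation:

```python
#!/usr/bin/env python3
"""Rough sketch: scan spicerack logs for reimage cookbook runs of a given host
and for the interactive Puppet-version answers. The log path and matched
substrings are assumptions based on the lines quoted above."""
from pathlib import Path

LOG_DIR = Path("/var/log/spicerack")  # assumption: the real path may differ
HOST = "es1036"                       # host from the conversation


def scan(host: str) -> None:
    if not LOG_DIR.is_dir():
        raise SystemExit(f"{LOG_DIR} not found (adjust LOG_DIR)")
    for logfile in sorted(LOG_DIR.rglob("*.log")):
        for line in logfile.read_text(errors="replace").splitlines():
            # cookbook invocation lines, e.g.
            # "Executing cookbook sre.hosts.reimage with args: [...]"
            if "Executing cookbook sre.hosts.reimage" in line and host in line:
                print(f"{logfile}: {line.strip()}")
            # interactive answers, e.g. 'User input is: "7"'
            # (crude: not tied to a specific host or question)
            elif "User input is:" in line:
                print(f"{logfile}: {line.strip()}")


if __name__ == "__main__":
    scan(HOST)
```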
[08:47:03] as for how to revert to p5, the quickest way is to disable puppet on the host, downtime it (unless it's already in a role that doesn't alert) and then remove it from puppetdb
[08:47:12] at that point we can run with --new and choose 5
[08:47:21] if you want I can remove it from puppetdb for you :D
[08:47:54] yeah, but we need to remove a bunch, let me give you the list
[08:48:00] sure
[08:48:05] :/
[08:48:16] es10[35-40]
[08:48:32] volans: and db2196
[08:49:03] ok, do they need downtime too?
[08:49:09] or are they with notifications disabled?
[08:49:17] they are insetup
[08:49:53] ok, operating on: 'db2196.codfw.wmnet,es[1035-1036,1040].eqiad.wmnet'
[08:49:57] thanks
[08:49:57] for confirmation
[08:50:02] +1
[08:50:09] are the missing ones ok?
[08:50:15] the fact is, only 35, 36 and 40 are there
[08:50:30] yeah, probably the others were not installed yet
[08:50:36] ack
[08:51:51] I've told DCOps to go for 5 in the next reimage
[08:53:44] sounds good
[08:53:51] thanks
[08:54:02] it's running
[08:55:35] marostegui: all done, you can proceed with the reimage with --new and -p5 (or answer 5 when prompted) at your will. The important part is that puppet doesn't get re-enabled before the reimage.
[08:55:53] trying
[08:56:24] volans: it went through, let's see if it all installs correctly now, thanks!
[08:56:35] I've let john know too
[08:56:43] no prob, sorry for the trouble
[09:00:49] (PuppetFailure) firing: (2) Puppet has failed on es1036:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:15:49] (PuppetFailure) firing: (2) Puppet has failed on es1036:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:20:49] (PuppetFailure) resolved: (2) Puppet has failed on es1036:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:37:46] Sorry to bother you, I would like to record the output of a database depool command. Is there a host I can depool for a few seconds to do so, or could you maybe send me a screenshot of you doing it?
[10:46:08] for example, I would like to depool db2117 for a few seconds, then repool it
[10:54:47] arnaudb: around?
[10:59:01] around, sorry jynus
[10:59:39] See above, let me know if it is ok for me to do so, or just let me know of another host, etc.
[10:59:46] yes yes, I was checking for db2117
[10:59:53] lgtm for depooling!
[10:59:54] I am not in a hurry
[11:00:17] ok, will depool, record stuff and repool it, should be safe I think
[11:00:24] you can even downtime it if you need to :)
[11:00:41] nope, what I am doing is just documentation, no maintenance at all
[11:00:45] 0:-D
[11:08:30] Oh sorry folks, I've just seen the ping from this morning. Is an-redacteddb1001 still alerting in the wrong place? I updated the contacts from your team to ours last night, as soon as I spotted it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1006600
[11:11:45] all looking good, arnaudb
[11:12:47] I am now going to attempt to depool db1150 (this is a mistake and won't work, but I want to record that too)
[11:13:33] btullis: nothing from it since late yesterday, so I think we're good, thanks
[11:13:35] giving a heads-up in case I mess something up
[12:25:42] thanks for the notification jynus!
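For the documentation exercise jynus describes (depool db2117 briefly, capture the output, repool), here is a hedged sketch of recording such a run to a file. The dbctl subcommands shown are an assumption about the CLI rather than something confirmed in the log, and the output file name is made up:

```python
#!/usr/bin/env python3
"""Hedged sketch: depool a host, capture the command output for documentation,
then repool it. The dbctl subcommands below are an assumption about the CLI
(they are not shown in the log); host and output file are examples only."""
import subprocess

HOST = "db2117"              # example host from the conversation
OUTFILE = "depool-demo.txt"  # hypothetical output file for the documentation


def run(cmd: list[str]) -> str:
    """Run a command and return a transcript-style string of it and its output."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return f"$ {' '.join(cmd)}\n{proc.stdout}{proc.stderr}\n"


def main() -> None:
    transcript = ""
    for cmd in (
        ["dbctl", "instance", HOST, "depool"],                          # assumed syntax
        ["dbctl", "config", "commit", "-m", f"Depool {HOST} for docs"],
        ["dbctl", "instance", HOST, "pool"],
        ["dbctl", "config", "commit", "-m", f"Repool {HOST}"],
    ):
        transcript += run(cmd)
    with open(OUTFILE, "w") as fh:
        fh.write(transcript)
    print(f"transcript written to {OUTFILE}")


if __name__ == "__main__":
    main()
```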
[14:40:56] All the new es hosts have IPv6 even though I asked dc ops not to add it :(
[14:43:43] volans: is this still the correct procedure to clean them up? https://phabricator.wikimedia.org/T270101#6688993
[14:45:08] marostegui: how many?
[14:45:18] I can do it programmatically if there are too many
[14:45:36] volans: all of these: es10[35-40]
[14:46:10] but I am seeing that on the new codfw hosts too
[14:46:12] so those are a lot more
[14:46:17] sigh
[14:46:20] let me send you the list
[14:46:26] open a task
[14:46:30] for dcops, I'll do them
[14:46:35] thanks
[14:46:38] I will do that now
[14:46:42] thanks
[14:51:10] volans: https://phabricator.wikimedia.org/T358594 I have not added your team's tag, not sure if I should
[14:51:20] thanks, no worries
[15:11:44] marostegui: AAAA records removed from netbox and dns, do you want me to also clear the dns recursors' cache?
[15:12:14] volans: if you want to, thanks a lot :)
[15:40:10] topranks: today's hosts for T355870 are ready
[15:40:10] T355870: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870
[15:40:32] arnaudb: thanks!
[16:18:36] arnaudb: all done with the move, thanks :)
[16:19:02] servers are repooling, well done topranks! next round tomorrow x)
[16:19:14] almost there :)
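As a follow-up to the AAAA cleanup and the recursor cache flush above, a small Python sketch that resolves each host over IPv6 and reports anything still returned (a stale cache would show up here). The host list is illustrative, built from the es10[35-40] range mentioned in the conversation; the new codfw hosts aren't listed in the log, so db2196 stands in as an example:

```python
#!/usr/bin/env python3
"""Sketch: verify the AAAA records are really gone by resolving each host over
IPv6. Host list is illustrative, based on the ranges mentioned above."""
import socket

HOSTS = [f"es{n}.eqiad.wmnet" for n in range(1035, 1041)] + ["db2196.codfw.wmnet"]


def ipv6_addrs(host: str) -> list[str]:
    """Return any IPv6 addresses the resolver still hands out for the host."""
    try:
        infos = socket.getaddrinfo(host, None, socket.AF_INET6)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


for host in HOSTS:
    addrs = ipv6_addrs(host)
    print(f"{host}: {', '.join(addrs) if addrs else 'no AAAA/IPv6 returned'}")
```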