[07:52:11] hi, I have a patch for the package_builder Puppet module. It is to ensure the `eatmydata` package is installed on CI regardless of the Debian version being used.
[07:52:48] I already deployed it on the CI Puppet master, could use a puppet-merge for it please :) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135966
[08:07:26] <_joe_> hashar: you mean you'd use a code review?
[08:07:46] yeah that as well :-]
[08:08:01] <_joe_> I'm sure someone in collab can help with that
[08:08:09] I don't know who manages package_builder now though
[08:09:11] <_joe_> tbh, the extra_packages thing seems to be quite the hack
[08:09:19] <_joe_> and it's used only for eatmydata?
[08:10:33] yeah, originally I had hardcoded it, but that package should not be installed in prod
[08:10:41] ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/676008/1/modules/package_builder/manifests/pbuilder_base.pp#59 )
[08:11:24] it is feeding `cowbuilder --extrapackages`
[09:28:49] Is there anything wrong with puppetserver2003? It is taking ages to get past that stage on a puppet-merge
[09:29:20] marostegui: if you encounter my CR pls merge :)
[09:29:49] fabfur: I am already merging mine, it's been trying to merge for 5 minutes
[09:30:01] ah ack
[09:30:02] Yours wasn't there yet when I +2ed mine
[09:30:16] np, I'll merge when you're done
[09:30:22] ssh: connect to host puppetserver2003.codfw.wmnet port 22: Connection timed out
[09:30:23] ERROR: puppet-merge on puppetserver2003.codfw.wmnet (ops) failed
[09:30:24] here we go
[09:30:41] but the host is accessible though
[09:31:06] fabfur: you can try now with yours
[09:31:11] ack
[09:32:46] same issue?
[09:32:56] puppetserver2003 timouts
[09:33:00] *timeouts
[09:50:46] 20 minutes ago mine went through just fine.
[09:50:54] OK: puppet-merge on puppetserver2003.codfw.wmnet (ops) succeeded
[09:51:25] puppetserver2003 is a Host being setup by Infrastructure Foundations SREs with ferm (insetup::infrastructure_foundations_ferm)
[09:51:28] I am sorry, what?
[09:52:01] ah, that matches site.pp
[09:52:18] moritzm: ^
[09:52:49] ok, this https://gerrit.wikimedia.org/r/c/operations/puppet/+/1166161 explains it
[09:52:58] and the related commits ofc
[09:57:09] o/
[09:57:26] lemme check if we missed removing it from the puppet merge list
[09:57:34] if anybody hasn't done that already
[10:02:04] my first patch had a typo, but this one fixed it up: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1166758
[10:02:17] and puppetserver2003 has the correct insetup role now it seems
[10:03:16] some of the logic in puppet-merge acts on the role name, so merges should work fine now, but if it's still an issue, we need to check where we missed dropping it
[10:03:46] I checked /srv/git/private/.git/hooks on puppetserver1001, the post-commit hook is correctly not showing 2003, so I guess it was probably bad luck for marostegui when merging?
[10:04:02] in the sense that it was happening before the puppet merge
[10:04:11] sorry, puppet run on puppetserver1001
[10:04:51] there shouldn't be any slowdown from now on when puppet-merging, please ping us if not
[10:05:19] thanks for looking into it!
[10:34:37] of course, earlier I wrote the wrong motivation, since it was puppet-merge related and not puppet-private related
[10:34:49] in any case, /etc/puppet-merge/shell_config.conf looks clean, so the same applies
[10:35:10] (in that file you have the list of target nodes etc.. so same conclusion as before, bad luck in timing :)
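For reference, the symptom above (ssh to a puppet-merge target timing out on port 22) can be narrowed down with a quick probe of the target hosts. This is only a minimal sketch with a hard-coded host list; the real list lives in /etc/puppet-merge/shell_config.conf, whose format isn't shown in this log.

```python
#!/usr/bin/env python3
"""Probe puppet-merge target hosts for ssh (port 22) reachability."""
import socket

# Hypothetical target list; in practice this would come from
# /etc/puppet-merge/shell_config.conf on the merge host.
TARGETS = [
    "puppetserver1001.eqiad.wmnet",
    "puppetserver2003.codfw.wmnet",
]


def ssh_reachable(host: str, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to port 22 succeeds within the timeout."""
    try:
        with socket.create_connection((host, 22), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for host in TARGETS:
        status = "ok" if ssh_reachable(host) else "timeout/unreachable"
        print(f"{host}: {status}")
```

Running something like this from the merge host while a puppet-merge hangs shows whether the problem is plain network reachability or something on the target itself.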
[12:19:00] Wow, we are close to T400000
[12:43:03] marostegui: I filed https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1000000 I don't feel the need for another round number report just yet ;)
[12:43:37] hahahaha
[12:45:20] https://phabricator.wikimedia.org/T100000 still not solved!
[12:45:26] wow, that was 2021, and the forthcoming trixie release will be the first without old-pcre in it
[12:50:36] similarly I'm to blame for https://gerrit.wikimedia.org/r/1000000, so I'll leave this round number to someone else
[13:44:35] XioNoX: puppet is disabled on some of the Ganeti hosts. I think that's expected given the message, but pointing it out here since it's been some time (4811 minutes) and the last run was on July 4
[13:51:40] Dear on-callers: I will shortly begin the last of the sessionstore reimages (if no one objects ofc). No impact expected. -- Me
[14:06:00] sukhe: ah yeah! enabling it now
[14:07:30] sukhe: done
[14:07:56] XioNoX: thanks! I only discovered this when rolling out a C:bird change, so I thought to check
[14:08:35] sukhe: yeah, working on testing the Bird "onlink" contract work, there are still some niggles to fix
[14:08:50] but I shouldn't forget to return the host to a clean state
[14:23:12] (copied from #-operations): heads-up, I'm adding the kafka-jumbo1016 host to the kafka-jumbo cluster. I've run puppet on all kafka and zk hosts, so kafka could start on 1016, meaning I'm not *expecting* alerts to fire. These kafka alerts are directed to team:sre and go to klaxon. If I was somehow wrong, I'm sorry in advance
[14:56:40] Amir1: ok to merge yours?
[14:56:46] Use table catalog for fullViews
[14:56:56] taavi: yeah, but also I just merged another one
[14:57:15] I'll merge both of yours then in addition to mine?
[14:58:12] sounds good
[14:58:13] thanks
[14:58:42] done
[15:15:12] Ok, this is weird. I reimaged sessionstore1006; it came back up normally after the first reboot, but after the second I'm seeing cloudgw1002 as the hostname on the serial console (for which there is no DNS), and sessionstore1006 is now inaccessible from the network
[15:21:39] it is cloudgw1002, I think. Logging in via the serial console, it's insetup::wmcs, no networking, and the last puppet run was on Feb 4
[15:23:20] urandom: compare the serial from racadm with the one on netbox, just to make sure they weren't swapped
[15:23:49] volans: the serial?
[15:24:20] serial number
[15:24:26] oh, right.
[15:24:36] (but don't paste them here ;) _
[15:24:37] )
[15:25:18] they were bought at very different times and their serials don't look alike, so probably not the case here
[15:25:33] but it is one mismatch that can happen, for example
[15:26:37] well... it's the right serial number for sessionstore1006
[15:26:40] the service tab
[15:26:42] the service tag
[15:26:56] use `getsvctag`
[15:27:19] it matches sessionstore1006, not cloudgw1002
[15:27:34] ack
[15:29:09] ok, so... the machine *is* cloudgw1002, but it's only been up for 26 mins
[15:29:27] which tracks with the last reboot of sessionstore1006
[15:37:12] I'm not sure I understand how the hardware works here. Connecting to the DRAC, the machine has to be sessionstore1006: the service tag matches, and it has 8 x 447GB SSDs (which another machine could have, but it would be a weird coincidence). But the serial console connects to an apparently decommissioned machine that, until half an hour ago, hadn't been active since February. The virtual console from the DRAC webui also does.
[15:37:34] And that one is tty1?
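A minimal sketch of the check suggested above (compare the service tag reported by the DRAC via `getsvctag` with the serial recorded in Netbox), assuming pynetbox is available, a Netbox API token exists, and the management interface is reachable over ssh; the Netbox URL and the mgmt hostname pattern are placeholders, not the real values.

```python
#!/usr/bin/env python3
"""Compare the DRAC service tag with the serial recorded in Netbox."""
import subprocess

import pynetbox

NETBOX_URL = "https://netbox.example.org"  # placeholder
NETBOX_TOKEN = "REDACTED"                  # placeholder


def drac_service_tag(mgmt_host: str) -> str:
    """Run `racadm getsvctag` on the management controller via ssh."""
    result = subprocess.run(
        ["ssh", mgmt_host, "racadm", "getsvctag"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def netbox_serial(device_name: str) -> str:
    """Return the serial Netbox has on record for the device, if any."""
    nb = pynetbox.api(NETBOX_URL, token=NETBOX_TOKEN)
    device = nb.dcim.devices.get(name=device_name)
    return device.serial if device else ""


if __name__ == "__main__":
    name = "sessionstore1006"
    mgmt = f"{name}.mgmt.eqiad.wmnet"  # assumed mgmt naming scheme
    tag = drac_service_tag(mgmt)
    recorded = netbox_serial(name)
    verdict = "match" if tag == recorded else "MISMATCH"
    print(f"DRAC reports {tag!r}, Netbox has {recorded!r}: {verdict}")
```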
[15:38:16] so does the DRAC not directly interface with the server?
[15:39:01] cloudgw1002 is marked offline in netbox, so it should not be in the rack
[15:39:08] https://wikitech.wikimedia.org/wiki/Server_Lifecycle#States
[15:39:42] yeah, clearly this thing has been powered off until half an hour ago
[15:39:55] no no, it should be racked, physically disconnects
[15:39:58] *disconnected
[15:45:08] how are the DRAC and server interfaced?
[15:45:26] can you cross the out-of-band access?
[15:45:44] * urandom is so confused
[15:45:50] the DRAC is either an embedded board on the motherboard or an extra board (not so common anymore)
[15:45:56] it has its own dedicated ethernet port
[15:46:13] if something has indeed been cross-connected, it would be that ethernet
[15:46:29] but in that case, the service tag should not match
[15:46:34] right?
[15:46:50] which makes what you witness pretty confusing
[15:46:57] plus the IP would be wrong
[15:48:37] this explains why it could boot up at all:
[15:48:37] https://phabricator.wikimedia.org/T386810#10563397
[15:48:42] the bold line
[15:49:05] but if it's offline in netbox it should be unplugged and uncabled
[15:50:25] we probably need VRiley's help here to make sure our assumptions (e.g. that it is indeed unplugged and uncabled) are correct
[15:51:01] I can take a look at that
[15:51:59] VRiley: thanks!
[15:54:25] akosiaris: I think I might have an idea
[15:56:27] no, cloudgw1002 didn't have the same mgmt IP address as sessionstore1006, it was 10.65.0.223
[15:56:53] and it wouldn't explain other issues
[16:24:10] akosiaris: it looks like cloudgw1002 has been decommed and it is no longer in the cabinet
[16:25:27] volans, urandom: ^
[16:25:51] VRiley: thanks. This makes all of this very ... interesting?
[16:26:56] urandom: does the shell you are in offer any hint as to when that system was installed? Like timestamps on /var/lib/dpkg/info ?
[16:26:59] akosiaris: yeah, sorry, to close the loop from #-dcops... there were reused drives added before it was reimaged, the wipe had failed (as volan.s pointed out earlier), and (I'm guessing) the devices reordered between the first reboot, where it came up as sessionstore1006, and the second
[16:27:18] aah
[16:27:21] ok that explains it
[16:27:23] thanks!
[16:27:34] I feel much better :)
[16:27:40] same here
[16:30:11] Yes, I seem to remember this now. We were asked to source drives for these units to add space.
[17:05:23] just so everyone knows, we're aware of the cirrussearch cluster issues in EQIAD
[17:41:14] The cirrussearch cluster is healthy again, I've just repooled eqiad
[19:59:11] FYI, bast6003 will go down for a bit to switch the disk type back to DRBD following the Ganeti update
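To close the loop on the reused-drive mixup above: a minimal sketch of a pre-reimage check that uses wipefs to list leftover filesystem/RAID signatures on the local disks, so a failed wipe is noticed before the host can boot an old installation. Device selection is simplified, and reading signatures with wipefs typically requires root.

```python
#!/usr/bin/env python3
"""List leftover filesystem/RAID signatures on local disks with wipefs."""
import json
import subprocess


def local_disks() -> list[str]:
    """Return /dev paths of whole disks, as reported by lsblk."""
    out = subprocess.run(
        ["lsblk", "--json", "--nodeps", "-o", "NAME,TYPE"],
        capture_output=True, text=True, check=True,
    ).stdout
    devices = json.loads(out)["blockdevices"]
    return [f"/dev/{d['name']}" for d in devices if d["type"] == "disk"]


def leftover_signatures(device: str) -> str:
    """Return wipefs output describing any signatures found on the device."""
    return subprocess.run(
        ["wipefs", device],
        capture_output=True, text=True, check=True,
    ).stdout.strip()


if __name__ == "__main__":
    for dev in local_disks():
        sigs = leftover_signatures(dev)
        print(f"{dev}: {'clean' if not sigs else sigs}")
```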