[10:30:23] marostegui: wanna try this in cumin?
[10:30:24] sudo cookbook -c /home/ladsgroup/cookbooks/config.yaml sre.mysql.clone --source db1124.eqiad.wmnet --target db1133.eqiad.wmnet --primary db1125.eqiad.wmnet
[10:30:37] Amir1: please use the test-cookbook one
[10:30:53] that's what I got from another SRE
[10:31:08] also if you've time I'd like to discuss your reply on why you didn't use the existing mysql modules
[10:34:18] sure
[10:37:35] your approach to run mysql commands via the remote is the same as the mysql_legacy module, so I was wondering why you can't use that one
[10:39:25] or, alternatively, given that the operation doesn't need parallelism, use the mysql module that speaks mysql natively
[10:39:39] yeah but I want to switch all of that at the same time once wmfdb is in place and stable
[10:40:03] it doesn't make sense to switch to something just to switch it again in a few months
[10:40:27] switch what? this is a new cookbook
[10:40:29] it's not switching to anything
[10:40:37] it's writing it using the current tools
[10:40:52] not re-duplicating part of the effort in the cookbook, making it simpler and shorter
[10:40:58] all access to mysql dbs should go through wmfdb, that way it could know which section it's operating on and so on
[10:41:25] MysqlLegacy is section-aware
[10:41:33] but it's legacy
[10:41:38] I didn't rename it
[10:41:42] your team did :)
[10:41:42] why should I use a legacy system
[10:41:50] because we are deprecating it
[10:42:00] and replaced it with the mysql one
[10:42:14] which also should be deprecated in the long term
[10:42:33] and I'm not sure it was our team btw, we don't have cookbooks, I think it was WMCS
[10:42:38] let me double check
[10:43:12] does wmfdb have parallelism support?
[10:43:26] will the DC switchover migrate to use wmfdb or keep using mysql_legacy?
[10:44:04] hopefully
[10:44:12] we'll see
[10:45:51] some of the mysql commands are run during the RO period of the switchover, that's why concurrency (or async) is important in that moment
[10:46:54] the whole automation of mysql in spicerack and elsewhere needs rework, people need to come to an agreement on what should stay there and what should be removed. That's not really the scope of the cookbook
[10:47:09] (and document agreements)
[10:47:50] I'm just saying the cookbook is re-implementing things already available in spicerack, and could use those until this agreement and actual new tools are available
[10:48:21] I agree that we need to find an agreement and a longer-term plan on how to handle db automation
[10:48:25] but those are legacy and deprecated and should not be depended on
[10:48:31] and that's totally out of scope for this cookbook :)
[10:48:49] legacy != deprecated
[10:49:10] legacy and should be deprecated
[10:49:10] if it was officially deprecated it would raise deprecation warnings
[10:49:27] but it's currently a critical part of the dc switchover
[10:49:38] I know, it doesn't make it less legacy
[10:50:25] but you're doing the same thing in the cookbook, I'm just saying it would actually be less code to use it, and fewer hardcoded things
[10:51:27] I understand but I prefer not to introduce more dependencies on legacy systems
[10:52:50] you'll have to migrate this anyway to the $new system at some point, I don't see the benefit tbh
[11:32:32] (as for the test-cookbook you can see https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Test_before_merging )
[13:41:19] volans: we seem less sure that failing install yesterday was firmware, what would the next steps for troubleshooting be?
[13:42:28] urandom: hey, what's the current status, what was tried and where is it failing?
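The concurrency argument made at 10:45:51 above can be sketched independently of spicerack: during the read-only window, a serial loop over hosts costs roughly the sum of per-host latencies, while a concurrent run costs roughly the maximum. A minimal illustration, where `run_on_host` is a hypothetical stand-in for a remote mysql command (not a real spicerack or wmfdb API):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_on_host(host: str) -> str:
    """Stand-in for a remote mysql command; the sleep simulates latency."""
    time.sleep(0.1)
    return f"{host}: OK"

hosts = [f"db1{100 + i}.eqiad.wmnet" for i in range(8)]

start = time.monotonic()
serial = [run_on_host(h) for h in hosts]  # ~8 x 0.1s, one after another
serial_t = time.monotonic() - start

start = time.monotonic()
with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
    parallel = list(pool.map(run_on_host, hosts))  # ~0.1s, all at once
parallel_t = time.monotonic() - start

assert serial == parallel  # pool.map preserves input order
print(f"serial {serial_t:.2f}s, parallel {parallel_t:.2f}s")
```

In a minutes-long read-only window, that sum-vs-max difference across many db sections is exactly why the switchover tooling needs concurrency (or async) support.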
[13:45:11] the drac was upgraded (so that we could use the cookbook for the nic firmware), and the nic was upgraded to the latest from /srv/firmware
[13:45:37] apparently the bullseye version conflict issue was centered around the 10G nics, this is an embedded 1G
[13:45:59] ok, remind me the hostname
[13:46:00] which is why those firmware upgrades were failing, it was a firmware version that wasn't applicable
[13:46:05] sessionstore2001
[13:46:29] is the host up right now?
[13:47:17] as I can't seem to be able to ssh into it
[13:50:06] No, it's not, it's stuck in failed install limbo
[13:50:27] Having tried and failed to pxe boot
[13:50:29] was the old OS already wiped?
[13:50:41] if pxe fails to boot it falls back to disk
[13:51:32] yeah, it's back to disk but without connectivity
[13:51:40] I guess because it's been removed from puppet?
[13:51:53] nah
[13:52:05] can I run the reimage cookbook on it?
[13:52:14] yeah!
[13:53:02] which OS?
[13:53:05] I was focused on the fact the cookbook was failing, and had assumed the host would be unreachable at that stage
[13:53:07] bullseye
[13:56:43] I don't see any DHCP request incoming to the install server
[14:01:20] could it be using the wrong interface?
[14:01:37] was any physical change made to the host?
[14:01:50] the iface with PXE is the first one and we always use that one
[14:02:01] no physical change, no
[14:02:02] unless the host has been plugged into the wrong iface all this time
[14:02:39] IPMI: Boot to PXE Boot Requested by iDRAC
[14:03:26] Booting from BRCM MBA Slot 0400 v21.6.0
[14:04:06] let me check the mac addresses
[14:04:12] I got the one used for PXE
[14:04:37] D0:94:66:8F:CA:FE?
[14:04:57] correct
[14:05:06] now I'll check which HW iface that corresponds to
[14:05:11] port 1
[14:09:55] hey XioNoX thanks for coming
[14:10:38] so to try to remove the network and the dhcp from the equation, I was wondering if you could check if we get any DHCP packets on the switch for the reimage of sessionstore2001
[14:10:51] it's connected to asw-b4-codfw ge-4/0/16
[14:11:04] it's not easy to check for that
[14:11:22] urandom: in the end did you update the nic firmware or just the idrac?
[14:11:28] both
[14:11:44] there is no mac address learned on that port
[14:11:44] urandom: and you're sure it's the correct version for the NIC?
[14:12:04] also the interface is down
[14:12:24] been down for 22h
[14:12:27] volans: I am certain of nothing
[14:12:33] interesting
[14:12:47] but the Dell site said so, and the wrong one refused to apply
[14:13:08] 2: eno1: mtu 1500 qdisc mq state UP group default qlen 1000
[14:13:25] the plot thickens...
[14:13:37] so it's up from the host PoV
[14:13:49] volans: anything interesting in lldp?
[14:14:16] the fact is empty
[14:14:17] lldp => { parent =>
[14:14:18] }
[14:15:12] volans: anything if you tcpdump it?
[14:15:40] checking
[14:16:04] arp and lldp
[14:16:12] out only I guess?
[14:16:41] AFAICT yes
[14:16:57] so 302 dcops to check the cabling
[14:17:04] maybe replace the SFP-T
[14:17:21] could the firmware upgrade "break" the cabling? :D
[14:17:42] anything L1, so in theory, yeah
[14:17:52] but yeah I agree with Arzhel urandom, I'd have dcops check the cabling
[14:17:52] so...we had this issue before any firmware upgrades
[14:17:58] the firmware upgrades were a reaction
[14:18:02] ah
[14:18:26] which I mean, we could have had a different problem before/after... but 🤷‍♂️
[14:18:52] sweet, I'll have them check!
[14:21:31] to be continued...
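Comparing the PXE MAC reported by the iDRAC with whatever the switch has (or hasn't) learned is easier once both are in one canonical form, since vendors disagree on formatting (iDRAC uses colon-separated uppercase, Juniper uses dotted lowercase groups). A small illustrative helper, with nothing to do with any WMF tooling; the dotted value below is a made-up example of the Juniper format for the MAC shown in the log:

```python
def normalize_mac(mac: str) -> str:
    """Canonicalize a MAC address to lowercase colon-separated form.

    Accepts colon, hyphen, or Juniper-style dotted formats by simply
    keeping the hex digits and re-grouping them.
    """
    digits = "".join(c for c in mac.lower() if c in "0123456789abcdef")
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))

idrac_mac = "D0:94:66:8F:CA:FE"   # as shown by the iDRAC above
switch_mac = "d094.668f.cafe"     # hypothetical Juniper-style rendering
assert normalize_mac(idrac_mac) == normalize_mac(switch_mac)
```

With both sides normalized, "no mac address learned on that port" becomes a reliable signal rather than a possible formatting mismatch.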
:)
[14:24:22] what would be the protocol for this? I've updated the ticket we originally opened with dcops (https://phabricator.wikimedia.org/T340055), should I ping them on #wikimedia-dcops too?
[14:26:31] usually that's the best way, yes
[14:29:21] oh, a reply already...
[14:30:54] XioNoX: https://phabricator.wikimedia.org/T340055#8956022 ?
[14:31:32] I'd bet on a faulty SFP-T
[14:36:05] volans, XioNoX: so should I retry the reimage?
[14:36:35] I don't have the history but I guess?
[14:38:14] 🤞
[14:38:42] ✌️
[14:41:31] * volans crossing fingers
[14:43:16] I'm getting the feeling that it is not working
[14:43:46] are you attached to the console?
[14:43:49] what's it doing?
[14:44:05] login prompt :/
[14:44:19] the switch interface is down, but it flapped 5min ago
[14:44:45] wow.
[14:45:50] how does it look on the server side? up as well?
[14:45:56] it shows up
[14:46:12] tcpdump shows nothing but outbound (arp)
[14:46:37] so weird
[14:47:08] did you install some kind of NIC-killer firmware? :)
[14:47:29] it came out of /srv/firmware 🤷‍♂️
[14:47:40] but also, it did this before the firmware was upgraded
[14:48:31] pinged Jen on -dcops
[14:50:48] could it be a faulty port on the switch?
[14:51:20] unlikely
[14:51:26] alright the port is up
[14:51:38] and I'm learning the server's mac
[14:51:42] time for a new try?
[14:51:42] I can ssh now so the link is definitely up
[14:51:48] ya
[14:52:01] * urandom giggles nervously
[14:52:42] so I guess...try again?
[14:53:20] urandom: yep
[14:53:32] here goes...
[14:56:53] link down for 2min now
[14:57:04] it's trying to dhcp boot now
[14:57:07] so... that's not good
[14:57:30] grub
[14:57:32] it's like after the reboot the link is not detected
[14:59:47] so weird
[14:59:59] I tried to disable/enable the interface but it didn't help
[15:00:28] 🤯
[15:01:44] so what is left, nic & switch port?
[15:04:19] it's soon reaching its 5-year anniversary, should we spend more time on it or replace it?
[15:04:57] would we even have something to replace it with?
[15:05:13] I kind of need to get it back into production reasonably soon
[15:05:14] just thinking out loud
[15:05:17] yeah
[15:05:29] I was looking to see if it has 10G nics but nope
[15:07:07] yeah we could try a different switch port or the other NIC to narrow down the issue
[15:08:45] which is easier?
[15:10:28] I guess the server's NIC, but I don't know if something is needed to tell it to pxe boot on the other nic? (volans ?)
[15:10:43] I think it would be
[15:10:48] yes we need to change the setting
[15:11:01] but it's easy to do
[15:11:09] ok, let's do that then
[15:11:40] as it's an SFP based switch port and we replaced the SFP I doubt the issue is on this side
[15:13:06] * urandom lights a black candle
[15:13:31] volans: what needs to be changed? netbox?
[15:13:55] urandom: no, just the host setting, I can do that
[15:16:55] urandom: it's not right-now-urgent (I'm not going to deploy until UK-tomorrow), but could you have a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/932197 at some point during your working day, please?
[15:17:29] Emperor: sure
[15:26:38] TY :)
[20:05:30] hi there. in May you set up db1110.eqiad.wmnet for us to use for testing
[20:05:40] now though I can't ssh to db1110.eqiad.wmnet
[20:05:53] can you see what the status is for that machine?
[20:06:07] I just get "closed by remote host" on ssh
[20:06:50] db1110 sounds like one of the old ones we probably refreshed to db1210
[20:06:52] https://gerrit.wikimedia.org/r/c/operations/puppet/+/914727/
[20:06:57] see that change
[20:07:10] I don't know the details, so I could be wrong, let me check phabricator
[20:07:14] the point is that he picked an old one
[20:07:17] afaict
[20:07:31] see https://phabricator.wikimedia.org/T335092
[20:07:35] https://phabricator.wikimedia.org/T335011
[20:07:40] there is mention of db1118 as well
[20:08:07] eh, hehe
[20:08:17] that was decom'ed a few days after it was given to us as a test host
[20:08:21] so it got decommissioned, what is the replacement, I don't know :D
[20:08:35] the point was that we could use it _because_ it wasn't used in prod
[20:09:03] so the logical replacement is db1210, is phorge on m3?
[20:09:31] may 3: https://phabricator.wikimedia.org/T335092#8823316
[20:09:41] may 9: https://phabricator.wikimedia.org/T335011#8836410
[20:10:04] those 2 things conflict
[20:10:07] (basically by sheer luck, anything from db1106 to db1125 got replaced by exactly 100 up, so db1115 -> db1215)
[20:10:51] yeah, let's get back to this, what you're testing on is m3?
[20:11:44] the point would have been that we have nothing to do with m3
[20:11:50] ah, okay
[20:12:02] we needed a copy of a prod database
[20:12:11] somewhere where it can't interfere
[20:12:13] db1210 is s5, so definitely not the replacement for the testing
[20:12:38] let me check db1118 or db1218
[20:13:03] nope, both are s1
[20:13:44] is it possible that he used db1118 instead
[20:13:50] I think there was a mismatch somewhere, I have to ask Manuel tomorrow
[20:13:51] and there is a typo
[20:14:13] db1118 should also be decommissioned but it's currently serving s1
[20:14:14] on the relevant task he said "I am going to use db1110 instead"
[20:14:19] but also mentions db1118
[20:14:29] maybe he wanted to say "I am going to use db1118 instead"
[20:15:21] nah..
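The "exactly 100 up" renumbering mentioned at 20:10:07 above can be captured in a few lines. This is only an illustrative sketch of that observed pattern, not real WMF tooling; the range check matters because the pattern is only claimed for db1106 through db1125:

```python
import re

def refreshed_host(old: str) -> str:
    """Map a refreshed eqiad db host to its replacement (old number + 100).

    Per the observation above, db1106-db1125 happened to be replaced by
    db1206-db1225; for anything outside that range, raise rather than guess.
    """
    m = re.fullmatch(r"db(\d{4})((?:\.\S+)?)", old)
    if not m or not 1106 <= int(m.group(1)) <= 1125:
        raise ValueError(f"{old!r} is not in the refreshed db1106-db1125 range")
    return f"db{int(m.group(1)) + 100}{m.group(2)}"

assert refreshed_host("db1110") == "db1210"
assert refreshed_host("db1115.eqiad.wmnet") == "db1215.eqiad.wmnet"
```

As the rest of the conversation shows, the number mapping is the easy part; whether the replacement actually serves the same section (m3 vs s5 vs s1) still has to be checked separately.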
"Marostegui renamed this task from Move db1118 to m3 to Move db1110 to m3." [20:17:39] thanks Amir, yea, let's ask him. also left a ticket comment [20:18:12] what we wanted here was "make a copy of all the phab prod databases" [20:18:36] so that we can test the phab-phorge upgrade [20:18:40] without a risk to phab prod [20:19:08] we want to proof that there is no schema change or if there is, what it is exactly.. when doing that upgrade to new upstream [20:19:22] so the intention was "some old host that is not in prod" [20:22:10] mutante: probably i forgot it was a test host and decommissioned [20:22:35] I'll find another one next week [20:23:34] marostegui: ok, thank you !:) [20:23:40] sorry about that [20:23:52] np, sorry for taking so long to actually use it [20:23:56] we had .. other issues [20:24:17] we have a test VM now to talk to that test DB :)