[04:56:33] I am going to start with the switchover pre work
[05:28:04] marostegui: morning :((
[05:31:34] :***
[05:31:50] kormat: can you review https://gerrit.wikimedia.org/r/c/operations/dns/+/736115 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/736114 ?
[05:31:58] again?
[05:32:05] oh!
[05:32:08] you did already!
[05:32:13] thanks
[05:32:16] just, yeah. ;)
[05:54:10] o/
[05:54:18] o/
[05:59:54] morning
[06:00:19] hey!
[06:00:23] we are in -operations
[06:20:29] I am going to switch s1 backups to start generating them from buster
[06:35:41] bd808: https://phabricator.wikimedia.org/T288093#7470015 would that date work for you?
[06:43:13] marostegui: let's hope it works this time ;)
[06:43:20] (not specifying which of those 2 statement i'm referring to)
[10:35:10] sigh, something is wrong with my IPv6 today :-/
[10:35:48] ITYM ::-/
[10:41:01] 🐟
[10:41:21] <::><
[11:05:36] sigh, ADSL went pop
[11:41:32] ...at least when the router rebooted the v6 problem went away /o\
[11:42:51] "conflicted yay"
[16:50:44] marostegui: something odd with db2081. are you working on it?
[16:57:56] nope, I haven't touched that one
[16:59:08] alright. it seems like puppet failed on it an hour ago, for unclear reasons. and there's a problem with the ssh host key
[16:59:12] what's up with it? I'm not next to my laptop now but I could be in 5 minutes if it is urgent, but if it can wait till tomorrow that's better, I'm very tired
[16:59:15] investigating
[16:59:16] uh
[16:59:21] that's strange
[16:59:21] nono, definitely not urgent. just weird.
[16:59:32] the other thing that's strange: i agree with you.
[17:06:59] I don't see the failure here: https://puppetboard.wikimedia.org/node/db2081.codfw.wmnet
[17:08:13] volans: curious. there was an icinga fail for it
[17:12:59] kormat: was part of the widespread failures with no resources?
[17:13:02] https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1
[17:13:37] volans: no resources, yes, that was it.
i was blissfully unaware of wider issues. thanks :)
[17:13:39] that one might not get to puppetboard IIRC, but not 100% sure
[17:13:50] volans: out of interest, can you ssh into db2081?
[17:14:32] host key changed
[17:14:36] yeah...
[17:14:41] like.. how?
[17:14:49] have you tried to conenct to console?
[17:14:52] *connect
[17:14:53] yes
[17:14:56] what's there?
[17:14:59] nothing obvious there
[17:15:09] console is at the normal getty prompt
[17:15:15] and logging in?
[17:15:16] cumin has no issues sshing to the host
[17:15:49] i didn't try logging in via the console
[17:18:07] ssh-keyscan from cumin gives 3 different host keys,
[17:18:18] one of which exists in https://config-master.wikimedia.org/known_hosts.ecdsa
[17:18:49] so how come my ssh client isn't seeing it
[17:18:53] the one in /etc/ssh/ssh_known_hosts on cumin1001 is the same of my local one
[17:30:49] o/ in production what is slave_parallel_mode set to for mariadb replication?
[17:32:03] `conservative`, it appears
[17:32:23] thanks, im going to try that with my current bug then
[17:32:44] in a setup with replication im running into a `Waiting for table metadata lock` when running update.php and replication gets stuck
[17:33:14] generally confusing mw, but `slave_parallel_mode` `optimistic` makes me suspicious
[17:36:39] oooo
[17:36:40] Default Value: optimistic (>= MariaDB 10.5.1), conservative (<= MariaDB 10.5.0)
[17:36:54] right, it might have changed when we updated
[17:37:10] we not meaning wmf *goes off to continue rambling to himself*
[17:38:58] Nope, I still get stuck on a `Waiting for table metadata lock` hmmm
[17:46:40] marostegui: yes, that date & time will work for me
[17:46:51] PROBLEM - pt-heartbeat-wikimedia service on db2081 is CRITICAL: NRPE: Command check_pt-heartbeat-wikimedia-state not defined https://wikitech.wikimedia.org/wiki/MariaDB/pt-heartbeat
[17:52:42] kormat: it being weird still ^
[17:53:00] Numerous not defined just went off
[18:00:21] * volans having a look
[18:01:47]
seems network related
[18:04:58] well that's spooky. the ssh hostkey issue went away for me
[18:05:09] yes I'm in in both ssh and console
[18:05:13] but can't ping other hosts
[18:05:38] and now it's pinging... mmmh
[18:06:29] before I got
[18:06:30] --- 10.192.32.49 ping statistics ---
[18:06:30] 21 packets transmitted, 0 received, 100% packet loss, time 495ms
[18:21:14] ok found the issue kormat
[18:21:40] moritzm: you still around by any chance?
[18:22:28] volans: just as I went to look, what was it? network issue?
[18:22:46] I think a typo in ganeti-test2002 /etc/network/interfaces, has the same IP
[18:22:52] but should have https://netbox.wikimedia.org/dcim/devices/3458/interfaces/
[18:22:58] 10.192.0.74 vs 10.192.0.7
[18:23:31] ah ok yeah. and all the host key stuff makes sense too.
[18:23:39] good sleuthing :)
[18:23:58] RECOVERY - pt-heartbeat-wikimedia service on db2081 is OK: OK - pt-heartbeat-wikimedia is inactive https://wikitech.wikimedia.org/wiki/MariaDB/pt-heartbeat
[18:28:18] volans: 🤦‍♀️
[18:28:26] host rebooted
[18:28:31] let's see if all goes well now
[18:29:55] volans: hmmh, checking
[18:30:22] moritzm: so, I might have used a larger hammer than needed in this case
[18:30:48] I fixed /e/n/i and then rebooted, the host is up but I can ssh, checking ganeti-test2002 right now
[18:31:17] but I applied the rule "DB >> ganeti-test" when picking which hammer to use ;)
[18:31:40] yeah, totally
[18:32:51] I added the bridge for the ganeti setup, but I'm wondering what might have gone wrong there, what did you change in e/n/i?
[18:33:04] moritzm: 10.192.0.74 vs 10.192.0.7
[18:33:10] .7 is db2081
[18:33:19] I guess was just a missing '4' at the end
[18:33:27] ah, what a mess
[18:35:18] I really need to get started with systemd-networkd so that we can bring some sanity in there...
[18:35:22] moritzm: but I have mess up something too, it doesn't like private dev to come up
[18:35:38] yeah, I'm having a look over the mgmt currently
[18:35:45] thanks for fixing up my mess
[18:35:48] I'm in too
[18:37:54] we can simply reboot, it's the sanest way to make it reload, ifupdown isn't very good at being restarted
[18:38:03] I have rebooted already
[18:38:13] doesn't like the current config
[18:38:15] ipv6 is up
[18:38:20] v4 not
[18:38:44] actually
[18:38:48] private is not up
[18:39:34] just leave it as-is, the node is not added to the Ganeti cluster yet, I'll fix it up tomorrow
[18:39:52] ok, you can ssh to mgmt and login as root
[18:39:52] and the insetup hosts are also exempt from monitoring anyway
[18:39:58] ack
[18:40:00] I was trying to repro ganeti-test2001's config
[18:40:13] but at least there is no trace of 10.192.0.7 in there right now
[18:40:16] and so that's safe
[18:41:14] ack, I'll fix it up tomorrow
[18:41:17] moritzm: I'm curious where it got the address, also prometheus-rsyslog-exporter's config had the wrong address
[18:43:14] I've fixed both /etc/nagios/nrpe_local.cfg01 and /etc/rsyslog.d/10-exporter.conf
[18:46:19] that's strange, those surely weren't edited
[18:46:34] might have got the IP from the actual one
[18:46:37] assigned to private
[18:47:49] moritzm: ok got private up, I had to comment one line
[18:47:58] it's marked as commented by volans in /e/n/i
[18:48:15] at least now you can ssh normally ;)
[18:50:21] /lib/systemd/system/nic-saturation-exporter.service needed fix too
[18:51:12] ack, thanks!
[18:51:29] systemctl is now clean
[18:53:20] now I just need to convince puppetdb to get the correct IP so that the exported known hosts are correct
[18:54:23] ok puppet run completed, now puppetdb has the correct data
[18:55:05] within 30m all hosts will have it fixed
[18:55:18] all yours moritzm for tomorrow :)
[18:55:57] cool, thanks :-)
[20:58:25] kormat: if you get a chance, could you look over https://gerrit.wikimedia.org/r/c/operations/puppet/+/736019/15/modules/profile/manifests/analytics/database/meta_new.pp? I'd love to get to work on that when I start my morning tomorrow.
[20:58:26] thank you!
[21:10:45] ottomata: sure thing
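
Note on the db2081 incident above: the root cause reported in the log was ganeti-test2002 being configured with 10.192.0.7 (db2081's address) instead of 10.192.0.74. A minimal sketch of how that kind of conflict could be caught is below. It assumes the configured addresses and the intended (inventory) addresses are already available as plain host-to-IP dictionaries; the function names and the data layout are hypothetical and not taken from any tool mentioned in the log.

```python
from collections import defaultdict
from ipaddress import ip_address


def find_duplicate_ips(configured):
    """Return {ip: [hosts]} for every address claimed by more than one host."""
    owners = defaultdict(list)
    for host, addr in configured.items():
        # ip_address() validates the string and normalises its form
        owners[ip_address(addr)].append(host)
    return {str(ip): sorted(hosts) for ip, hosts in owners.items() if len(hosts) > 1}


def find_inventory_mismatches(configured, inventory):
    """Return {host: (configured_ip, expected_ip)} where the two disagree."""
    return {
        host: (configured[host], expected)
        for host, expected in inventory.items()
        if host in configured and ip_address(configured[host]) != ip_address(expected)
    }


# Hypothetical input reproducing the situation described in the log.
configured = {"db2081": "10.192.0.7", "ganeti-test2002": "10.192.0.7"}
inventory = {"db2081": "10.192.0.7", "ganeti-test2002": "10.192.0.74"}

print(find_duplicate_ips(configured))
print(find_inventory_mismatches(configured, inventory))
```

The duplicate check only needs the configured addresses, so it would flag the clash even without an inventory. The mismatch check additionally shows which host is the one that is wrong.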