[08:34:30] Hello! at 10 UTC I'm going to upgrade eqsin's switches, it's going to be a hard downtime, but I'm going to depool the site shortly and downtime as many things as I can.
[08:55:30] thank you
[08:58:26] thank you for the heads up XioNoX
[10:00:46] alright, time for eqsin's upgrade
[10:04:58] godspeed
[10:06:13] o7
[10:30:28] status update, one of the two switches doesn't want to come back up healthy from the upgrade... The other is fine. Opened a critical ticket with JTAC
[10:31:51] so all the devices in that rack are down until it's fixed: https://netbox.wikimedia.org/dcim/devices/?q=&rack_id=78&status=active
[10:31:57] obviously keeping the site depooled
[10:32:24] :(
[10:42:59] vgutierrez: out of curiosity, and despite the lack of redundancy at many levels now, could eqsin run without those CP/LVS servers https://netbox.wikimedia.org/dcim/devices/?q=&rack_id=78&status=active&role_id=1 ?
[10:43:53] we could lose one lvs, but those are too many cp servers to lose
[10:44:14] that's 50% of the cp servers
[10:44:58] that's what I thought, thanks
[11:05:36] JTAC's only option is to have someone onsite to format/install the OS...
[11:05:44] I'll follow up with DCops
[11:10:55] random shot in the dark so please take it with a grain of salt, can we roll back the upgrade in the meantime?
[11:17:34] godog: unfortunately not, I tried with JTAC, but even booting on the recovery images doesn't work
[11:18:49] so currently we have a nice costly brick :D
[11:19:10] pretty much, yeah
[11:20:05] I think I ACKed everything relevant here
[11:22:17] sigh
[11:22:31] XioNoX: thanks for the update anyways!
[11:22:39] a very expensive brick indeed
[16:05:44] hi, denisse, mutante. This was an eventful, but no-outages, EU morning. A switch in eqsin failed and only recently got back online, all while the site was depooled. Netops are still looking at it.
[16:26:33] jynus: thanks, alright!
[16:27:07] eqsin not lucky these days
[16:33:43] mutante: the p*ges acked yesterday on your shift are still in acked state - I think it is because the lost routers/links/redundancy are still down, but unsure of that
[16:34:18] mentioning it not because I worry about eqsin, but because I'm not sure how often it will try to page again if not resolved
[16:36:52] sigh. ok.. acked is not good enough?
[16:37:04] I did not get the nagging mails that I got previously
[16:37:11] I honestly don't know
[16:37:27] maybe I am thinking of icinga, and it doesn't happen on victorops
[16:37:53] when you say "in acked state" which tool are you looking at?
[16:37:53] but it is not resolved, so wanted to flag that the issue they are reporting is not resolved
[16:38:06] (victorops) let me get you the link
[16:38:07] ok, thanks
[17:39:38] What does it mean when I have a phantom
[17:39:43] https://www.irccloud.com/pastebin/FQhYCIAp/
[17:40:05] on every puppet run? I was thinking it was a typo in a config rule but now I'm not so sure
[17:40:13] since ferm isn't complaining
[17:42:50] andrewbogott: that would indicate ferm isn't running
[17:43:18] It's not supposed to run, though, ferm isn't a persistent service.
[17:43:30] As far as I know it just configures iptables and exits.
[17:44:06] Puppet is saying it's been told somewhere to ensure it's running
[17:45:34] Yeah, I understand what puppet thinks, just not why it thinks that.
[17:50:53] andrewbogott: which host?
[17:51:55] taavi: all the cloudcontrols
[17:53:54] * andrewbogott eating lunch and getting fresh eyes
[17:54:51] https://phabricator.wikimedia.org/T323324
[17:55:30] something is in the role that makes this happen... because it doesn't happen globally but on all those using that role
[17:55:50] weird though(tm)
[18:02:14] hm, same issue
[18:05:32] <_joe_> I would suppose some ferm rule fails
[18:08:42] <_joe_> one of the multiple ones that are peculiar to these hosts, like I assume the galera ones and the api backends
[18:10:15] yep but I assume those would be visible on `systemctl status ferm`
[18:11:26] at least `ferm --lines --noexec /etc/ferm/ferm.conf` doesn't error out, so it shouldn't be a syntax error
[18:13:56] <_joe_> I suspect we might have hit a ferm bug
[18:14:08] <_joe_> because on cloudcontrol it always tries to reload the rules
[18:14:20] <_joe_> like some can't be applied / not detected / something
[18:14:47] when I refresh ferm on the cli it returns 0 which should satisfy whatever status check is happening
[18:14:51] looks like ferm-status errors out
[18:15:02] ok, that's something!
[18:16:24] andrewbogott: https://gerrit.wikimedia.org/r/879115
[18:16:26] and puppet configures the ferm service like service { 'ferm': ensure => 'running', status => '/usr/local/sbin/ferm-status' }, so I think that'd be why puppet thinks a restart is needed
[18:17:22] taavi: that looks like it. I'm surprised that ferm doesn't complain on refresh, though -- typically that's what I've seen when there's a mistake in a rule.
[18:17:24] and I don't seem to have access to clouddumps boxes so can't debug those in the same way
[18:18:00] indeed. this might be a bug in the ferm-status script?
[18:18:23] meaning that ferm properly resolves the hostname but ferm-status doesn't?
[18:19:10] yea, I suspect that iptables takes the hostnames and resolves them internally but ferm-status doesn't support that because our convention is to use @resolve instead
[18:20:53] I'll see if I can find a similar issue on clouddumps after I finish my sandwich
[18:31:11] taavi: your patch worked on the cloudcontrols
[18:35:14] well dang, ferm-status returns 1 on clouddumps but produces no output
[19:00:33] mutante: You okay with me merging the SPDX license headers?
[19:00:47] ah, jinx
[19:00:56] brett: yes :)
[19:01:25] I was _not_ about to merge "remove vip from profile::dns::recursor", that was scary :)
[19:01:47] :)
[19:01:47] I don't know what you're talking about, nothing can go wrong ;)
[19:02:00] hrhr
[19:04:58] ottomata: quick shot https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/879122
[19:05:26] should have gone to -serviceops
[19:07:12] oh hi
[19:07:14] just saw
[22:43:49] Emperor: heyas, you about? question on the ms-fe racking tasks
[22:43:57] the orders are for 3 hosts, but you put in 4 hosts of details
[22:46:52] Emperor: sorry, you didn't fill that out, my bad, was someone else; you just commented on the config!
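[editor's note] The ferm "phantom restart" diagnosed in the 17:39-18:35 thread can be sketched as follows. This is an illustrative shell sketch, not the real Puppet or ferm-status code: `ferm_status` is a hypothetical stand-in for `/usr/local/sbin/ferm-status`, assumed to exit non-zero when the live iptables rules don't match ferm.conf (e.g. because iptables resolved a bare hostname internally while ferm-status, expecting `@resolve`, could not match it). Puppet's `service { 'ferm': ensure => 'running', status => ... }` treats any non-zero exit from the status command as "stopped", hence the restart on every run.

```shell
#!/bin/sh
# Sketch only: simulates Puppet's service status logic, not real Puppet code.
# ferm_status stands in for /usr/local/sbin/ferm-status (hypothetical here).

ferm_status() {
  # Pretend the live iptables rules don't match what the script derives
  # from ferm.conf (the hostname-vs-@resolve mismatch from the thread),
  # so the status check exits non-zero.
  return 1
}

puppet_service_check() {
  # Puppet runs the custom status command; any non-zero exit is read as
  # "service not running", and ensure => 'running' then triggers a start,
  # even though ferm itself is a oneshot that configures iptables and exits.
  if ferm_status; then
    echo "ferm considered running; no action"
  else
    echo "ferm considered stopped; Puppet restarts it"
  fi
}

puppet_service_check
```

This is why `ferm --lines --noexec /etc/ferm/ferm.conf` passing (a syntax check) and a manual refresh returning 0 were both consistent with Puppet still flapping: the restart was driven by the separate status script's exit code, not by ferm itself failing.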