[07:02:48] hello folks
[07:03:07] I just checked an-worker1106, it needs a reboot, going to work on it with Ben later on
[07:03:34] going to ack it for the moment
[08:09:24] Hi all, for a while we have been getting an alert about prometheus being restarted that points to a dead dashboard (https://grafana.wikimedia.org/d/000000271/prometheus-stats), anyone know if that was moved or something?
[08:11:44] godog: ^ maybe you know?
[08:14:56] dcaro: yeah seems likely, needs to point to https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server
[08:15:57] godog: okok, I'll change it in puppet then :), thanks!
[08:17:01] dcaro: sure np and thank you! and even that dashboard isn't 100% correct but better than a broken one
[08:50:41] godog: I have this alert https://alerts.wikimedia.org/?q=alertname%3D~beware%20possible%20monitoring%20artifacts that seems to be a false positive, the server in question has been up for quite a long time: https://grafana-rw.wikimedia.org/d/GWvEXWDZk/prometheus-server?orgId=1&var-datasource=eqiad%20prometheus%2Flabs&refresh=1m&forceLogin=true
[08:50:49] any idea on how to debug?
[09:06:45] dcaro: interesting, the "while decoding json" error suggests the query didn't run at all, i.e. non-parsable response
[09:07:34] 2021-07-20T09:06:444202620:0:861:3:208:80:154:88proxy-server/503299GEThttp://cloudmetrics1002.eqiad.wmnet/labs/api/v1/query?qu
[09:07:43] err, ok no spaces but that's a 503
[09:20:19] interesting, I'll pull that thread
[09:23:54] It seems that there's a bunch of meta.json files with size 0, preventing the server from coming up :/
[09:32:26] hmm... weird thing that the uptime in the graph shows >month :/
[09:34:06] I think that's because we have two cloudmetrics, and the datasource is pointing to a floating dns record (prometheus-labmon.eqiad.wmnet is an alias for cloudmetrics1001.eqiad.wmnet.)
[09:34:24] so the graph the alert points to is for cloudmetrics1001 :S
[09:47:09] ah, yeah that'd explain it
[09:50:11] _joe_: our switch is back
[09:50:46] but if we lose another row again, we will have similar issues with unreachable servers not being depooled
[09:50:55] <_joe_> effie: you mean the A2 codfw switch?
[09:50:58] yes
[09:51:18] so, should we consider lowering the api depool threshold to 0.5?
[09:51:25] it is back to 0.7
[09:51:42] <_joe_> so I'm perplexed by one thing
[09:51:57] <_joe_> why does losing a single switch make half of the api servers unreachable?
[09:52:41] <_joe_> in theory we should use 0.66 as a depool threshold, as servers should be perfectly balanced across at least 3 rows, if not 4
[09:52:52] so...
for some reason lvs2009 and lvs2010 get row A connectivity from the very same switch
[09:53:18] <_joe_> oh ok then that's the underlying issue
[09:53:25] that switch (asw-a2) crashed, so lvs2009 and lvs2010 lost row A connectivity, rendering every mw server hosted in row A unreachable
[09:53:45] <_joe_> that should still be less than 30% of all api servers though
[09:53:58] yeah, I've opened T286879 and T286881 to work on that issue
[09:53:58] T286881: Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881
[09:53:58] T286879: lvs2007, lvs2009 and lvs2010 connected to the same row A switch - https://phabricator.wikimedia.org/T286879
[09:55:18] <_joe_> our resiliency model is that we should tolerate losing one row with minimal disruption
[09:55:55] <_joe_> so in theory depool_threshold = (1 - X) where X = max(servers_in_row / tot_servers)
[09:56:36] from a cursory look at netbox it should be ~26 mw API hosts in row A
[09:56:56] out of 64 matching A:mw-api-codfw
[09:57:59] that's 0.40... so 0.70 is too restrictive
[09:58:00] but please double check because I did it quickly, comparing https://netbox.wikimedia.org/dcim/devices/?q=mw2&status=active&sort=name with the cumin result
[09:58:36] (expanding the list with nodeset -S '\n' -e "$CUMIN_RESULT")
[09:58:47] _joe_: assuming you mean X = max(tot_servers/servers_in_row) [otherwise X is always 1]
[09:59:20] in fact ignore me
[09:59:32] #mathsarehard
[09:59:40] :)
[10:04:41] actually 27
[10:04:51] cumin 'A:mw-api-codfw' 'ip a | grep 10\.192\.0'
[10:05:05] 42.2%
[10:07:21] <_joe_> that's way too many :/
[10:07:32] <_joe_> so we have a problem of imbalance of servers
[10:07:52] <_joe_> effie: go with 0.6 or 0.5 but state we need to rebalance servers
[10:08:10] +1
[10:08:49] I agree for 0.5, but there was still 1 server again unreachable but pooled
[10:08:57] judging by the restbase errors
[10:09:32] we can create a task to either rerack some servers or do it after we switch back to eqiad
[10:09:35] with 0.5?
[10:09:52] as soon as we applied the 0.5 threshold pybal depooled all of them
[10:13:34] 16:29 vgutierrez: restarting pybal on lvs2009 to decrease api depool threshold --> from the SAL
[10:13:54] and the last pybal error saying it couldn't depool an api server is from 15:38
[10:14:06] Jul 16 15:38:02 lvs2009 pybal[31136]: [api-https_443] ERROR: Could not depool server mw2297.codfw.wmnet because of too many down!
[10:14:06] Jul 16 15:38:03 lvs2009 pybal[31136]: [api_80] ERROR: Could not depool server mw2297.codfw.wmnet because of too many down!
[10:14:19] (all of those timestamps are UTC)
[10:14:47] is that threshold needed?
[10:15:38] ofc grep -q has a nicer output in the prev command, but I wanted to check it :D
[10:16:00] XioNoX: yes, because we don't want to depool everything if everything is broken, better to keep some backends
[10:16:13] that's why pybal has it and we've always had it for all backends
[10:16:32] this also covers an issue with the pybal check itself, for example due to a misconfiguration or something
[10:17:40] what do you mean by "everything is broken"?
[10:18:08] if we have an overload and the pybal check thinks that all backends are unable to serve
[10:18:10] check misconfig makes sense though
[10:18:19] without the threshold it would depool everything
[10:18:28] did that ever happen?
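For reference, a minimal sketch of the threshold arithmetic discussed above (not an official tool), assuming the 27-of-64 row A count from the quoted cumin check is right: the largest per-row fraction of backends bounds how high the pybal depool threshold can safely be set.

```sh
# Sketch of depool_threshold = 1 - X, with X = max(servers_in_row / tot_servers);
# the 27/64 row A figure is taken from the cumin result above.
awk 'BEGIN {
  row_a = 27; total = 64
  frac = row_a / total                               # ~0.422: losing row A removes ~42% of backends
  printf "row A fraction:        %.3f\n", frac
  printf "max usable threshold:  %.2f\n", 1 - frac   # ~0.58, so 0.7 is too restrictive
}'
```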
[10:18:34] also the last report of an API server being down and pooled (zgrep api pybal.log.4.gz | fgrep "enabled/down/pooled") is from 16:29:23, and pybal was restarted to apply the 0.5 depool threshold for API at 16:29:30
[10:18:54] effie: so it looks to me like 0.5 was enough
[10:19:17] then there was something else going on with restbase
[10:19:36] I will check again, good to know
[11:49:04] Quick question regarding SSH setup: I have followed the wmf-laptop-sre guidelines, so I have my two SSH agents, a managed ~/.ssh/config, and I have run wmf-update-known-hosts-production.
[11:50:45] However, I can't get access to the ganeti cluster, e.g. `ssh ganeti01.svc.eqiad.wmnet`, because it isn't in the known hosts and I think it is a floating VIP. https://wikitech.wikimedia.org/wiki/Ganeti#Connect_to_an_existing_cluster
[11:51:45] How do other people deal with this? Just disable StrictHostKeyChecking temporarily each time the VIP moves?
[11:58:07] btullis: until you get an actual answer, a workaround is to ssh to `ganeti1009.eqiad.wmnet`
[11:58:42] yeah, the MOTD will tell you the current Ganeti master (in case it changed)
[11:59:09] I never used the SVC address to log in to a Ganeti cluster ever, maybe we should simply drop that from the wikitech page, not sure
[11:59:17] Ah, OK. Thanks both.
[12:19:37] update the wikitech docs
[12:21:17] "job=minio site={codfw,eqiad}" alert is me, I am debugging (cannot ack because it could hide unrelated errors)
[12:21:45] btullis: if you also check out the dns repo (gerrit.wikimedia.org:29418/operations/dns.git)
[12:22:23] then you can use `wmf-update-known-hosts-production /path/to/dns/repo` and it will add entries for any cnames
[12:22:56] although not sure it helps in this case as ganeti01.svc.eqiad.wmnet is a dns discovery address and not a cname (volans may be able to add more)
[12:24:11] Oh, thanks jbond. I didn't know about that one.
[12:30:25] jbond: i checked, it does not
[12:31:22] ack thx kormat
[13:34:39] jbond, btullis: yes CNAMEs are added that way, but the Ganeti master has a floating IP and can potentially change anytime, that's why it's not covered by wmf-update-known-hosts-production, it can't know which one is the master without actually ssh-ing into a ganeti node
[13:40:00] OK, thanks. Looks like moritzm has updated the wikitech page. I see that there's no reference to `gnt-instance migrate` on that page, so I guess it's not something we do much at the guest level. Just evacuate a whole cluster node.
[13:41:15] yeah, practically all important ganeti commands are run from the master, in fact I've never logged into the SVC myself...
[13:42:35] for completeness we also have the Ganeti RAPI endpoint in read-only mode open internally, with some integration with Spicerack (see https://doc.wikimedia.org/spicerack/master/api/spicerack.ganeti.html )
[13:44:12] I am going to flip maps back to codfw, with additional capacity, so it will most likely be quiet
[13:44:49] * volans loves hnowlan_'s optimism when it's related to maps :)
[13:45:38] * volans fingers crossed
[13:45:47] I said the flip will most likely be quiet, there are so many other things that could go wrong that I'll keep quiet about :P
[13:45:56] lol
[14:54:33] "Prometheus jobs reduced availability" alert should be gone soon
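Circling back to the Ganeti SSH question above: the simpler route is the one given in the log (ssh straight to the current master node, e.g. ganeti1009.eqiad.wmnet), but as a minimal sketch of the per-host relaxation btullis asked about, one could confine the loosened host-key checking to the floating SVC alias alone. The dedicated known_hosts file below is an assumption, not existing WMF setup.

```sh
# Hypothetical workaround for the floating Ganeti master alias only:
# relax checking just for this host, into a throwaway known_hosts file.
mkdir -p "$HOME/.ssh/known_hosts.d"
ssh -o StrictHostKeyChecking=accept-new \
    -o UserKnownHostsFile="$HOME/.ssh/known_hosts.d/ganeti-svc" \
    ganeti01.svc.eqiad.wmnet
# when the master (and thus the host key behind the alias) moves, clear it:
#   rm "$HOME/.ssh/known_hosts.d/ganeti-svc"
```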
[14:58:27] Ok folks, we will be kicking off T286069 in a few minutes.
[14:58:28] T286069: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069
[14:58:33] Switch buffer re-partition - Eqiad Row D
[14:58:45] nice, I was just about to ping you :)
[14:58:57] Just doing some final checks and verifying all required actions have been taken first.
[14:59:31] Will use this channel to keep everyone informed - hopefully it will be short and sweet.
[15:06:35] There is one remaining task I'm not sure has been done before we proceed.
[15:06:42] puppetmaster1002 - Disable puppet fleet wide
[15:07:01] topranks: i can do that, one sec
[15:07:02] jbond, moritzm: is that something you can take care of?
[15:07:05] ok super :)
[15:07:31] <_joe_> do we really need to disable puppet everywhere?
[15:07:36] <_joe_> anyways, not for now
[15:07:41] <_joe_> sorry
[15:09:22] _joe_: not really, we could depool the puppetmaster, however for a change that should only take a small amount of time it's quicker to disable puppet everywhere
[15:09:40] yeah, and it's only for a relatively brief period, and disabling Puppet keeps monitoring free from unrelated Puppet failure scatter
[15:09:44] <_joe_> yeah I thought it was quicker to let puppet fail on a bunch of hosts :P
[15:09:54] could have also done that :)
[15:10:15] <_joe_> again, we can talk about this at a later time, sorry
[15:10:21] those that have an in-flight puppet run might fail randomly I guess
[15:10:26] and cause some noise
[15:11:16] topranks: puppet disabled
[15:11:29] ok great.
[15:11:55] I guess it's that "speak now or forever hold your peace" time....
[15:12:30] good luck :)
[15:12:53] :rocket
[15:12:55] dammit
[15:12:58] 🚀
[15:13:15] Ok... committing the config now.
[15:13:28] 🤞
[15:14:27] gogogo
[15:14:36] ok it's taken the config. No immediate sign of significant loss
[15:14:58] is it done? not even my ssh session dropped
[15:15:04] I had a few packets lost on my test ping
[15:15:55] not complaining, if we learned something last week it's that we should expect the worst :-D
[15:16:00] nothing on mine, I think you got lucky
[15:16:02] that was smooth :)
[15:16:11] yeah no packet loss on mine either
[15:16:15] topranks: ack, re-enabling puppet
[15:16:18] are you sure it was applied?
[15:16:19] :-P
[15:16:29] nice job :)
[15:16:59] good job indeed!
[15:17:11] <_joe_> topranks: I'm disappointed
[15:17:20] <_joe_> I want a refund of my ticket!
[15:17:22] haha
[15:17:31] <_joe_> I was promised explosions and network partitions
[15:17:39] I have popcorn ready for nothing now
[15:17:44] he's still got three more tries to earn a t-shirt
[15:17:51] not a split brain in sight :)
[15:17:55] 544 packets transmitted, 544 received, 0% packet loss
[15:18:13] XioNoX: what -i did you have for your ping?
[15:18:13] Murphy's law: if we hadn't been so cautious it would/could have been a disaster.
[15:18:19] that was suspiciously straightforward...
[15:18:29] It's still a very sensitive change, so caution was certainly best.
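For the curious, a minimal sketch of the kind of loss check mentioned above: a timestamped ping from one bastion to another kept running across the maintenance window. The hosts, interval, and log path are assumptions, not the exact command that was used.

```sh
# Hypothetical loss check across a maintenance window. -D prefixes each
# reply with an epoch timestamp, so gaps line up with the time of the change.
ping -i 0.2 -D bast1003.wikimedia.org | tee /tmp/eqiad-rowD-ping.log
# stop with Ctrl-C once the window closes; the summary line
# ("N packets transmitted, M received, X% packet loss") plus any jumps in
# the timestamps show whether the commit caused real loss.
```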
[15:18:31] I'm tempted to say that junos did not apply the change :D
[15:18:40] <_joe_> I concur with volans
[15:19:01] <_joe_> I mean the rule is: a large change is applied, then junos crashes to acknowledge it applied the change
[15:19:13] loool
[15:19:27] volans: bast2002 to bast1003, it froze then they showed up later on, so the end result was no real loss
[15:19:33] <_joe_> vgutierrez: I'm pretty sure we've been told by JTAC that's the expected behaviour
[15:19:56] yeah, the reason we were super careful is because the last time a "trivial change" was done, the equipment didn't like it, and we have been super careful since then (even if most of the time it worked nicely)
[15:20:24] <_joe_> thanks for the backstory jynus, I wasn't aware
[15:20:27] if there is no impact maybe we should do the other 3 rows now
[15:20:34] (jk of course)
[15:21:04] _joe_, I was telling topranks why we freaked out when he wanted to touch switches :-DDD
[15:21:04] XioNoX: +1
[15:21:14] didn't even flap the interfaces
[15:21:18] also, nice job everyone
[15:21:45] <_joe_> XioNoX: why not?
[15:21:53] <_joe_> we still have the maint window open right?
[15:22:06] _joe_, manuel wanted to be around for some of the rows
[15:22:19] jynus: yes, it's always a little tricky when asking the hardware to go do something, and for me the "virtual chassis" hive-mind setup we have was an added concern as I've not used that too much in the past.
[15:22:25] the maint window is only for row D AFAIK
[15:22:34] I'd suggest to stick to the original schedule or announce properly any changes
[15:22:39] (manuel is not around today)
[15:22:39] let's not tempt fate
[15:22:39] yeah, just for D
[15:22:45] <_joe_> XioNoX / topranks https://www.metaflix.com/wp-content/uploads/2020/09/Slim-Pickens-Riding-Bomb-in-Dr.-Strangelove-Movie.jpg
[15:23:06] haha Dr. Strangelove, brilliant :)
[15:23:13] _joe_, you are super sarcastinc today, more than usual, I would say :-D
[15:23:18] *sarcastic
[15:23:19] I'd rather sit on it a bit just in case
[15:23:39] <_joe_> on the bomb?
[15:23:47] haha
[15:36:53] <_joe_> topranks / XioNoX can I go on and perform a deployment now?
[15:37:13] yeah everything is looking very healthy - I see no reason to delay.
[15:37:19] I'll update the task now to that effect
[15:37:36] <_joe_> thanks
[15:38:21] _joe_: hold on
[15:38:32] <_joe_> topranks: sure
[15:38:37] sry... seems I left out some of the config. Maybe you'll get your drama after all.
[15:38:44] Give me a moment :)
[15:40:56] <_joe_> sure
[15:46:19] _joe_: ok you should be good to go now, thanks
[15:46:29] no drama it would appear.
[15:46:42] <_joe_> ack
[15:52:58] * kormat sadly puts away the 🍿
[16:05:03] lol
[16:35:44] the spam on #operations is due to maintenance, nothing really exploding badly
[16:35:52] Data Engineering is working on it
[16:36:42] making sure alerting works? :)
[16:40:36] exactly, yes
[16:41:06] BigData needs BigAlerts
[16:56:40] created https://phabricator.wikimedia.org/T287027
[16:56:43] (as follow up)
[17:41:38] kormat: is our sinking ship still pointed in the right direction?