[07:30:36] greetings
[09:05:47] morning
[09:23:16] morning
[09:31:27] looking for a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1258948
[10:01:47] LGTM
[10:02:20] shortly I'll be stopping puppet on cloudrabbit* to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1254877
[10:03:47] ack
[10:08:28] hmm, apparently haproxy by default sends HTTP/1.0 health checks and istio wants to speak at least 1.1 :D
[10:08:41] /o\
[10:12:45] also apparently istio serves traffic on a different port than where the healthz endpoint lives
[10:15:28] fixing 1.0 -> 1.1: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1258980
[10:39:13] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1180 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1259000
[10:43:40] +1d
[10:45:18] ty
[10:53:08] heads up that toolforge is now sending 5% of traffic via istio
[10:53:40] \o/ \o/ \o/ neat
[10:53:55] 🎉
[11:05:21] nice
[11:14:02] awesome \o/
[11:18:18] my main question at the moment is why the network usage is so asymmetrical
[11:18:21] https://grafana.wmcloud.org/d/TJuKfnt4z/tool-dashboard?orgId=1&from=now-3h&to=now&timezone=utc&var-cluster=P8433460076D33992&var-namespace=istio-gateway&viewPanel=panel-13
[11:18:42] given it's a reverse proxy, I would expect the input and output to roughly match
[11:53:12] a bit odd indeed
[11:54:08] I did rabbit codfw btw, which seems to have worked fine though I am still investigating why rabbitmqctl list_queues times out, will do eqiad this afternoon
[11:54:21] to be clear the timeouts are not related to the change
[12:13:07] hmm, the nginx ingress traffic has even increased a little bit
[12:13:14] (instead of going down)
[12:14:04] there's a similar imbalance on toolsbeta too
[12:14:19] (though as the traffic is way less, it might get biased easily)
[12:14:40] the traffic levels sent to istio are still low enough that it can just be explained by the rest of Europe and now the Americas waking up
[12:15:42] hmm, it tools it seems that istio network output throughput has passed nginx no? (might be looking at the graphs wrong)
[12:15:49] *in tools
[12:16:31] (~25Mb/s istio, ~21Mb/s nginx)
[12:49:39] * dcaro lunch
[13:02:41] fyi folks, the CloudVPSDesignateLeaks alert that is firing right now is because of something interesting: a user created a VM in two vlans at once (which I've just noticed is something that Horizon seems to support). I have a note out to the user who created the VM to see if there's a reason; going to leave the alert acked but ignored until I figure out if this is something we need to support or not.
[13:05:42] andrewbogott: semi-relatedly: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1259079/
[13:06:48] oh yeah, petr just emailed me so I'll follow up about that.
[13:07:15] ah, I don't have those emails but just noticed the server was gone
[13:09:14] cc'd you on followup email
[13:10:06] petr is one of our very oldest and most sporadic volunteers, nice to see him still maintaining huggle one way or another.
[13:11:02] for fixing the VLAN thing, I guess one option is removing the port with the legacy VLAN, and then manually fixing the v4 DNS record?
[13:11:53] that will /probably/ work, although the vlans seem to also be in the nova records somewhere too. So it might still show up in the UI with the second vlan even after neutron knows it's detached.
[13:12:06] We will find out if petr opts for that approach.
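
For reference, a rough sketch of what the port-removal option floated at 13:11 could look like from the OpenStack CLI. All names are placeholders, nothing here is taken from the actual incident, and the designate flag spelling (--record vs --records) varies between client versions:

    # list the ports attached to the instance and spot the one on the legacy VLAN
    openstack port list --server <vm-name>
    # detach that port on the neutron side
    openstack server remove port <vm-name> <port-uuid>
    # point the existing A record at the address that remains (designate side;
    # the recordset may need to be referenced by ID rather than name)
    openstack recordset set <zone-name>. <vm-name>.<zone-name>. --record <remaining-ipv4>
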
[13:17:28] godog: I still need to reboot cloudrabbit nodes for T419948, which provides an opportunity for Science. Is there anything you want to test/watch/learn during those reboots? Should I save them for you?
[13:21:48] andrewbogott: yes re: learn, what's the reaction on the openstack side to a rolling reboot; I suspect the answer might be restarting heat/designate like I just did for T418444, since I noticed the api latencies going up
[13:21:48] T418444: Increased openstack latency and rabbitmq rolling restarts on certificate update - https://phabricator.wikimedia.org/T418444
[13:22:01] i.e. https://grafana.wikimedia.org/d/UUmLqqX4k/wmcs-openstack-api-latency?
[13:22:20] yeah, in my experience most services handle puppet flips fine but designate does not
[13:23:04] what's puppet flips in this context?
[13:23:15] um... a typo for 'rabbit flips'
[13:23:30] *flaps
[13:23:32] bah
[13:24:02] anyway -- want me to start the reboots now and we'll just see how it goes?
[13:24:22] sure why not
[13:24:43] designate/heat latencies are spiking, though heat seems to be recovering
[13:24:58] in eqiad that is
[13:24:59] ok, I'm doing the haproxy nodes and then will do rabbit, will ping in a few when I start
[13:25:09] sgtm thank you
[13:26:02] in the meantime I'm opening a new task and gathering data on why 'rabbitmqctl list_queues' times out
[13:26:11] I mean it lists local queues, then times out
[13:28:06] and then messes up your terminal session forever, right?
[13:28:50] (whenever I ctrl-c out of a rabbitmqctl command, then when I try to type other commands into the terminal about 50% of my keystrokes get swallowed somewhere)
[13:29:22] that I haven't tried yet, it times out after 60s
[13:46:19] godog: I think I was optimistic thinking that everything but designate could handle rabbit restarts, I see quite a few complaints in logs. Typically this is the time when I'd run the 'restart all openstack services' cookbook but I can hold off if you want to dig.
[13:47:03] andrewbogott: thank you for checking in, please go ahead with the restarts
[13:47:54] ok, doing. What that looks like: "sudo cookbook wmcs.openstack.restart_openstack --cluster-name eqiad1 --all"
[13:48:00] it takes a while because of all the cloudvirts
[13:48:13] ack
[13:50:21] taavi: do you think this can be merged, or is there any blocker I'm not aware of? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1256417
[13:57:17] dhinus: not aware of anything, but I haven't been following that super closely; if you think that is ready then go for it
[13:58:12] thanks, I have never changed that map before, but it looks like it's the only change that's needed to change the haproxy mapping
[13:58:19] and the new hosts are in sync
[13:58:44] I'll give it a go and double check I can run queries on those sections
[13:58:46] dhinus: I would maybe add the new ones first and only then remove the old ones
[13:58:54] taavi: ack, I
[13:58:56] I
[13:59:01] sorry, I'll split
[13:59:24] since you can (and need to) use conftool to select which ones are pooled anyway
[14:08:35] yep makes sense, I split the patch in two
[14:10:41] godog: everything is working so now I'm going to start rebooting rabbit nodes and break things again :)
[14:24:17] andrewbogott: lol ok
[14:46:00] heat seems to still be having some issues
[14:46:09] latency is still increasing
[14:48:04] Looking at the noisy logs after rabbit reboots... some of them are services upset that the rabbit host is gone (could fix by moving rabbit behind haproxy probably), some are about dropped messages. WHY does rebooting rabbit nodes cause messages to be lost? Quorum queues ought to make that very unlikely if only one node is down at a time...
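
A few plain rabbitmqctl/rabbitmq-queues invocations that could help with both the list_queues timeout from 13:26 (the 60s matches rabbitmqctl's default per-command timeout) and the quorum-queue question above; queue and vhost names are placeholders, and this is only a sketch, not an existing cookbook:

    # raise the per-command timeout from the 60s default while investigating the hang
    sudo rabbitmqctl list_queues --timeout 300 name type durable messages consumers
    # check for partitioned or unresponsive cluster members
    sudo rabbitmq-diagnostics cluster_status
    # for a queue that lost messages, confirm it is a quorum queue and inspect its Raft membership
    sudo rabbitmq-queues quorum_status <queue-name> --vhost <vhost>

If list_queues reports the affected queues as type 'classic' rather than 'quorum', that would be one plausible explanation for the lost messages, since non-replicated classic queues live on a single node.
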
[14:48:28] Still I suspect that the haproxy switch would be a good first step
[14:49:05] shouldn't the clients reconnect to a different node?
[14:50:00] /me is starting to get confused about which technology does which kind of HA
[14:50:47] yes, they should
[14:50:51] taavi: btw. something weird is going on with istio no? the traffic is now >40Mb/s? (it doubled)
[14:50:59] https://grafana.wmcloud.org/d/TJuKfnt4z/tool-dashboard?orgId=1&from=now-6h&to=now&timezone=utc&var-cluster=P8433460076D33992&var-namespace=istio-gateway&viewPanel=panel-13
[14:51:24] dcaro: it's likely that some of those error messages precede a failover. But I need to read the code and figure out how the failover is actually meant to work. Maybe there are settings we can tune there
[14:54:12] agree
[14:55:00] but I'm also thinking, maybe oslo's failover code is permanent garbage and we should just stop relying on it.
[14:55:04] Does anyone have a theory about these two 'failed to update Puppet repository' alerts? Was there a gerrit outage?
[14:55:38] dcaro: indeed - I'll have a look
[14:55:48] andrewbogott: I merged some labs/private placeholders recently, so probably that
[14:56:01] oh so it's merge conflicts
[14:56:18] that's fine, I was just confused about it being in two places at once
[16:20:06] I'm about to reboot cloudcumin2001, I don't see anyone logged in or any screen/tmux or any cookbook running, lmk if I should hold though
[16:34:48] * volans proceeding
[16:43:07] * dhinus off
[16:43:16] {done} will do cloudcumin1001 tomorrow morning (less people around)
[16:46:39] I was logged into cloudcumin1001 but just disconnected
[16:47:33] thx
[18:14:07] * dcaro off
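
On the "settings we can tune there" point from 14:51: these are the stock oslo.messaging knobs in the [oslo_messaging_rabbit] section that govern reconnect and failover behaviour. The values below are illustrative, close to the library defaults, and not a recommendation tested against this cluster:

    [oslo_messaging_rabbit]
    # how the client picks the next host from transport_url when the current one drops
    kombu_failover_strategy = shuffle
    # delay before reconnecting after an AMQP consumer cancel notification
    kombu_reconnect_delay = 1.0
    # initial retry interval, backoff and cap when (re)connecting to RabbitMQ
    rabbit_retry_interval = 1
    rabbit_retry_backoff = 2
    rabbit_interval_max = 30
    # heartbeats let clients notice a dead node faster than TCP timeouts would
    heartbeat_timeout_threshold = 60
    heartbeat_rate = 2
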