[01:25:24] I'm not sure but toolsadmin doesn't let me create a tool named scheherazade, it seems this is returning false https://toolsadmin.wikimedia.org/tools/api/toolname/scheherazade but I'm sure no other tool exists with this name, nor even an ldap username. It seems it might be returning false for everything? https://toolsadmin.wikimedia.org/tools/api/toolname/Foooooo
[01:34:09] Amir1: that's going to be T380384
[01:34:10] T380384: [toolsadmin] Striker cannot create Developer accounts or tools with names matching existing SUL accounts - https://phabricator.wikimedia.org/T380384
[01:34:37] https://meta.wikimedia.org/wiki/Special:CentralAuth/scheherazade
[01:53:09] ah thanks!
[01:56:49] Is there a way to create the tool in the backend? I need it before the start of Women in Red on March 8
[08:43:50] there's a bunch of alerts flapping, looking
[08:55:24] for some reason the ingresses are hanging sometimes
[09:00:31] I did a deploy of ingress-admission earlier today, that seems to have triggered a restart on the ingresses due to config changes, though not sure why
[09:07:55] doing curl from the haproxies, ~1/4 of the requests time out, the rest respond fine
[09:08:00] root@tools-k8s-haproxy-5:~# time curl -v http://tools-k8s-ingress-9.tools.eqiad1.wikimedia.cloud:30002 -H "Host: wm-lol.toolforge.org"
[09:10:20] !log restarting ingress pods
[09:10:20] dcaro: Not expecting to hear !log here
[09:13:02] resource wise everything looked ok
[09:13:57] getting consistent replies now :/ maybe one of the pods got messed up
[09:15:05] mmm
[09:15:54] no smoking gun in the stats, though there's some disturbance
[09:15:57] https://usercontent.irccloud-cdn.com/file/wEIgF2SS/image.png
[09:16:28] it is known that the nginx in the ingress often struggles with the amount of tools, especially during re-config and re-deploy
[09:16:34] pods take a long time to become ready and such
[09:17:22] I wonder if we would need to revisit the setup to make it more resilient
[09:17:44] maybe introduce some kind of sharding, so each pod has to handle less config
[09:17:47] I did not find this restart to lag though
[09:17:54] ok
[09:18:05] it was quite smooth, it's still using ~60
[09:18:12] 60% of the memory it was using before
[09:18:27] so it might need to warm up a little more to get to the same point
[09:18:54] yeah, not all webservices get requests at the same time I guess
[09:19:30] it's still not hitting the limits by far (3G limit, it's using 600M)
[09:21:53] there's some increase in incoming traffic, though not many connections
[09:21:56] https://usercontent.irccloud-cdn.com/file/xXHIJFNl/image.png
[09:22:09] but the errors started before that
[09:22:41] that spike is when things destabilized
[09:22:43] https://usercontent.irccloud-cdn.com/file/DrUUAxP4/image.png
[09:23:11] prometheus seems to be detecting haproxy <-> ingress problems
[09:23:13] https://usercontent.irccloud-cdn.com/file/SvBJh6p2/image.png
[09:23:33] yep, alerts are still there
[09:23:46] -6 just recovered
[09:24:45] haproxy stopped having retries though
[09:24:47] https://usercontent.irccloud-cdn.com/file/5pNj6dqz/image.png
[09:25:17] and errors
[09:25:19] https://usercontent.irccloud-cdn.com/file/5QJjU7QI/image.png
[09:25:23] so I think they might clear
[09:25:58] ok
[09:27:07] hmm, this is still flapping though https://tools-prometheus.wmflabs.org/tools/graph?g0.expr=haproxy_server_up&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=30m
[09:27:10] did you create a phab ticket? if not, I can create one
[09:28:08] I did not, can you please?
[09:28:15] I'll add all this stuff there
[09:28:19] sure
[09:28:45] thanks!
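For reference, a rough sketch of how the "~1/4 of the requests time out" figure can be quantified from a haproxy node, building on the curl command quoted above (the 100-request count and 5s timeout are arbitrary choices for illustration, not values used during the incident):

    # Run from a tools-k8s-haproxy-* node: hit the ingress NodePort repeatedly
    # with a Host header and count how many requests fail or exceed a 5s timeout.
    fails=0
    for i in $(seq 1 100); do
      curl -s -o /dev/null --max-time 5 \
        -H "Host: wm-lol.toolforge.org" \
        http://tools-k8s-ingress-9.tools.eqiad1.wikimedia.cloud:30002/ || fails=$((fails+1))
    done
    echo "$fails/100 requests failed or timed out"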
[09:29:24] it started flapping at the same time the backend errors started
[09:29:27] https://usercontent.irccloud-cdn.com/file/LXzSekxk/image.png
[09:33:01] T387959
[09:33:01] T387959: toolforge: ingress errors 2025-03-05 - https://phabricator.wikimedia.org/T387959
[09:33:43] this is the first alert I see in the -feed channel
[09:33:50] https://usercontent.irccloud-cdn.com/file/6RcIsbvT/image.png
[09:34:13] at 01:52 UTC+1
[09:34:25] do you think conntrack on cloudvirt1039 could affect it?
[09:34:54] it could! but that self-resolved a long time ago
[09:38:15] this is the first log entry in haproxy-6 for a backend being down
[09:38:15] Mar 05 00:50:57 tools-k8s-haproxy-6 haproxy[520]: Server k8s-ingress/tools-k8s-ingress-9.tools.eqiad1.wikimedia.cloud is DOWN, reason: Layer4 timeout
[09:38:22] UTC
[09:38:43] that could definitely be a conntrack problem
[09:39:42] I've not been able to manually get curl to hang for a while now
[09:39:57] but it still flaps on prometheus/haproxy
[09:46:53] journal is consistently writing 0.5M/s on the haproxy vms xd
[09:47:21] probably unrelated, but not awesome I guess
[09:51:31] about what? what is in the journal?
[09:53:31] haproxy mostly
[09:54:49] it makes the journal rotate every 5-15 min
[10:00:38] I'm unable to make nc fail :/
[10:01:02] not sure why haproxy is failing so often
[10:02:19] 1k requests at 0.1s each passed without any errors so far
[10:03:02] for ingress-8, checking ingress-9
[10:03:59] I'm trying to correlate ingress worker nodes with hypervisors
[10:05:25] ack
[10:07:33] cloudvirt1039 was the one alerting about conntrack
[10:08:08] but the ingress workers are running on cloudvirt1059/1037/1057
[10:10:21] okok
[10:13:09] interestingly enough there are no more backend retries or errors in the stats :/
[10:13:17] but backends keep flapping up/down
[10:13:33] I'm now looking into reducing the log level on haproxy
[10:13:44] https://usercontent.irccloud-cdn.com/file/ALJiYuFn/image.png
[10:17:01] no manual failed checks yet :/
[10:17:06] should I restart haproxy?
[10:19:23] mmmm
[10:19:43] let me make sure I understand what is going on
[10:19:57] there is an alert about haproxy-5
[10:20:08] for the k8s-ingress-api-gateway backend
[10:20:14] is that the only failure at the moment?
[10:22:21] yep, the alerts are flapping, but I'm unable to check if there's any actual user impact right now (I was clearly able to make it fail before)
[10:22:45] haproxy is taking the backends up and down (that's what the alerts reflect)
[10:22:47] https://tools-prometheus.wmflabs.org/tools/graph?g0.expr=haproxy_server_up%20%3C%201&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h
[10:22:57] Mar 05 10:21:11 tools-k8s-haproxy-5 haproxy[528]: Server k8s-ingress-api-gateway/tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud is DOWN, reason: Layer4 timeout, info: " at initial connection step of tcp-check", check duration: 10001ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[10:23:16] for both k8s-ingress and k8s-ingress-api-gateway
[10:24:03] I'll have to go in a bit
[10:24:35] you can restart haproxy, but I have doubts it will change behavior :-(
[10:24:39] let's do it anyway!
[10:24:55] wait, let's merge the log config change in the same run?
[10:25:01] so only one restart is required
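For context, one common way to cut haproxy's journal volume is to stop logging successful connections. This is only a sketch of that approach, not necessarily what the actual patch linked below does:

    # /etc/haproxy/haproxy.cfg (snippet) -- assumes logging goes to the local syslog socket
    defaults
        log global
        # skip log lines for connections/requests that completed without errors
        option dontlog-normal
        # alternatively, raise the minimum severity that gets emitted at all:
        # log /dev/log local0 notice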
[10:25:43] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1124744
[10:26:44] +2d
[10:27:26] ok, waiting for CI to finish
[10:27:33] ack
[10:28:12] when the patch is merged, a simple puppet run should result in haproxy being restarted
[10:28:41] ok, we will not know though if what helped is the logging or just the restart
[10:28:47] (in case it gets fixed)
[10:29:25] well, I'm hoping the logging change is mostly cosmetic, and that it makes the journal more useful for triaging other system errors, for example
[10:30:06] patch merged, local puppetmaster rebased ✅
[10:30:15] dcaro: please run puppet agent
[10:30:37] on it
[10:31:12] (BTW, I tested this change in toolsbeta beforehand)
[10:31:18] it seems to be running already
[10:31:50] Mar 05 10:30:51 tools-k8s-haproxy-6 puppet-agent[1438227]: (/Stage[main]/Haproxy::Cloud::Base/Service[haproxy]) Triggered 'refresh' from 1 event
[10:32:29] done in both
[10:32:52] the alert is resolved, let's see how long it lasts
[10:33:03] this is new
[10:33:04] Mar 05 10:32:30 tools-k8s-haproxy-5 haproxy[1435570]: Server k8s-ingress-api-gateway/tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud is DOWN, reason: Layer4 timeout, info: " at initial connection step of tcp-check", check duration: 10000ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[10:33:13] :/
[10:33:25] and this on the other haproxy
[10:33:26] Mar 05 10:32:36 tools-k8s-haproxy-6 haproxy[1438431]: Server k8s-ingress-api-gateway/tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud is DOWN, reason: Layer4 timeout, info: " at initial connection step of tcp-check", check duration: 10001ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[10:33:41] gtg, things seem kinda stable user-wise
[10:33:45] ok
[10:33:47] I'll be back in ~30m
[10:34:20] manual checks are still passing though...
[10:34:27] I'm using:
[10:34:27] dcaro@tools-k8s-haproxy-6:~$ count=0; while sleep 0.1; do nc -z tools-k8s-ingress-9.tools.eqiad1.wikimedia.cloud 30002 || break; count=$((count+1)); echo "$count passed"; done; echo "Passed $count times before failing";
[10:34:28] mmmm
[10:34:34] I don't see a NodePort on tcp/30003
[10:34:38] it has not failed once
[10:35:05] it's up now
[10:35:05] Mar 05 10:34:26 tools-k8s-haproxy-6 haproxy[1438431]: Server k8s-ingress-api-gateway/tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud is UP, reason: Layer4 check passed, check duration: 1ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[10:35:11] nevermind, I see it now
[10:35:20] https://usercontent.irccloud-cdn.com/file/Q5NEzZ5Q/image.png
[10:35:24] anyhow, got to run, be back in a bit
[10:37:41] o/
[11:17:05] I'm increasing the api-gateway replicas from 2 to 3
[11:17:18] as there was no pod running on ingress-8
[11:20:43] Is it bound to ingress nodes?
[11:20:49] yes
[11:20:56] the deployment has a toleration
[11:21:27] Okok
[11:21:51] now -9 is down
[11:21:52] Mar 05 11:21:31 tools-k8s-haproxy-5 haproxy[1435570]: Server k8s-ingress-api-gateway/tools-k8s-ingress-9.tools.eqiad1.wikimedia.cloud is DOWN, reason: Layer4 timeout, info: " at initial connection step of tcp-check", check duration: 10000ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[11:21:53] xD
[11:22:50] https://www.irccloud.com/pastebin/DqPDwMvB/
[11:23:20] SYN_SENT but no reply?? I wonder if there are problems in the overlay network, with calico
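Two quick ways to check whether those SYNs ever make it to the ingress worker; the NodePort matches the one discussed above, while the interface name is an assumption and would need adjusting:

    # On the haproxy node: count conntrack entries for the NodePort that never got a reply
    sudo conntrack -L -p tcp 2>/dev/null | grep 30002 | grep -c UNREPLIED

    # On the ingress worker (e.g. tools-k8s-ingress-9): watch for incoming SYNs on the NodePort
    sudo tcpdump -ni eth0 'tcp port 30002 and tcp[tcpflags] & tcp-syn != 0'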
[11:24:43] do you see the other side of the traffic?
[11:24:50] as in, packets arriving
[11:25:25] mmm to make it easy to debug
[11:26:04] I would need to a) downscale the api-gateway deployment to 1 pod b) downscale the haproxy backends for api-gateway to one worker (which doesn't contain the pod)
[11:26:13] that way I could see the packets traversing calico
[11:26:38] does that make sense?
[11:28:53] I'll do this in tools-k8s-haproxy-6 which is the standby haproxy
[11:28:54] why point to a worker without the pod?
[11:29:19] dhinus: so that k8s executes the nodeport NAT
[11:29:30] dcaro: ^^^
[11:30:05] we can test both, same node and different node
[11:30:12] and verify how calico behaves
[11:30:32] haproxy goes directly to ports 3000x, the nodeport is on 443 no?
[11:31:14] server tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud:30003 check
[11:31:16] the nodeport is 30003
[11:31:46] https://usercontent.irccloud-cdn.com/file/z4sIsvDz/image.png
[11:32:33] downscaled to 1 pod, running on ingress-7
[11:33:06] okok
[11:33:07] haproxy-6 has a single backend configured, ingress-9
[11:34:39] so far, the healthchecks are working
[11:34:54] do you want me to create more traffic?
[11:35:56] no, wait a moment
[11:36:02] okok
[11:36:05] I'm trying to verify something
[11:36:45] functional tests are passing
[11:36:46] note how all flows are registered in conntrack now
[11:36:48] (fyi)
[11:36:48] https://www.irccloud.com/pastebin/BNzpSBRa/
[11:37:07] (compared to SYN_SENT + UNREPLIED from before)
[11:38:28] there's been a few minutes without flaps
[11:39:19] since downscaling maybe
[11:42:12] interesting
[11:42:54] why would downscaling api-gateway affect connectivity for ingress-nginx?
[11:43:18] mmmm I believe the flapping was only for api-gateway specifically?
[11:43:40] yep, for the last hour it seems, before it was both
[11:43:43] :/
[11:44:44] maybe it's a different worker having issues?
[11:44:57] can you scale up and see if it fails again?
[11:45:01] we could reboot the ingress workers
[11:45:12] yeah, let me scale up again
[11:45:39] scaled to 2
[11:47:59] did you change haproxy too?
[11:48:13] (no hiccups so far)
[11:49:46] haproxy on the primary was unchanged
[11:49:54] I had live-hacked the standby
[11:50:13] haproxy-5 is primary, haproxy-6 standby
[11:50:20] I'll remove the live-hack from haproxy-6 now
[11:50:53] okok
[11:52:13] from haproxy-6 to ingress-9 shows as down, might be the reload
[11:52:35] the ingress-api-gateway backend
[11:53:08] hmm, but that one is down no? no replica running there
[11:53:27] correct
[11:53:37] https://usercontent.irccloud-cdn.com/file/faUwE2HG/image.png
[11:53:52] but then I'm confused as to how nodeports are supposed to work
[11:54:10] it's up now...
[11:54:12] wait, maybe we need to just change the healthcheck itself, from a TCP check to an HTTP one
[11:54:24] in a similar way to what we did with the api-server itself
[11:55:36] https://www.irccloud.com/pastebin/sX64MovC/
[11:55:45] it's replying
[11:56:23] we could have a check like this:
[11:56:24] curl --insecure -v https://tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud:30003/healthz
[11:57:31] for the api-gateway yep, that should work
[11:58:09] btw. no more flaps so far
[11:58:17] should we scale up to 3?
[12:01:34] ok
[12:01:48] scaled
[12:02:40] if it does not fail now, that means some state got flushed scaling down and up :/
[12:03:09] maybe that forced calico to repopulate some rules or something?
[12:03:24] yeah, good theory
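On the idea above of switching the haproxy check from a plain TCP check to an HTTP check against /healthz, a minimal sketch of what that could look like in the backend config; this is an illustration only, not necessarily what the actual puppet patch ended up doing:

    backend k8s-ingress-api-gateway
        # probe the gateway's health endpoint instead of just opening a TCP connection
        option httpchk GET /healthz
        # the NodePort serves TLS, so the check itself has to speak SSL too
        server tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud:30003 check check-ssl verify none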
[12:03:28] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/697
[12:04:13] LGTM
[12:05:36] no errors so far :/, let's wait a bit, I suspect it's not going to fail again in a while
[12:07:01] ok
[12:10:27] dcaro: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1124756
[12:10:45] this one we should be careful rolling out, in case we make a mess
[12:17:02] Yep, +1d
[12:17:22] You might be able to test some by hand, not sure if it will be passing a host header
[12:17:43] Catching up... it seems like there are two different things happening, alertwise, is that right? A toolforge ingress issue and cloudcontrol1005 being broken?
[12:18:12] andrewbogott: cloudcontrol1005 is old news, from yesterday, awaiting DCops operations: see T387828
[12:18:13] T387828: openstack galera no recent writes 2025-03-04, suspected network hardware problem - https://phabricator.wikimedia.org/T387828
[12:18:51] old news but I assume it's the reason for all the openstack complaints
[12:19:00] Yep
[12:19:04] andrewbogott: yeah, most likely
[12:19:11] (most at least)
[12:19:30] There's the designate ones that look related, but I have not checked
[12:19:49] I need to think a bit about why that would break dns, it probably has to do with the stage where pdns and designate want to 100% agree on state which might not be possible if 1005 is broken...
[12:21:15] Also, regarding the toolforge-legacy thing (so many alerts!) should I merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1123797 and sweep the performance problem under the rug?
[12:23:05] Hmm, in my experience the time it was taking might be longer than 10s sometimes, but ok
[12:23:32] I want to spend some time investigating though (have not had the focus yet)
[12:23:54] But that can happen even with that merged
[12:24:16] I'm basically fine telling people that still use toolserver.org that they need to update their endpoints if they want fast responses. Less sure that that's fair for .wmflabs endpoints but it maybe is :)
[12:25:07] Did someone already try just increasing the # of apache threads by a lot? I spent a bit of time reading puppet code and couldn't see how to do that easily.
[12:28:21] andrewbogott: that would maybe be a good thing to do
[12:28:34] the same happened to me: I looked and it wasn't obvious
[12:29:11] ok, maybe I'll have another go at that before changing the timing. I see you already scaled up the VM which /probably/ doubled the workers but I'm not actually sure.
[12:30:03] By the memory counts it already has the right amount, it was nowhere near that though
[12:31:00] dcaro: can you tell me more? Do threads automatically scale based on available memory?
[12:31:08] As in, it's set to 150 procs, and each uses ~25M, that's a bit shy of 3.25G
[12:31:22] They only scale when needed
[12:31:31] oh, I see.
[12:31:36] (there's a way to set a minimum though)
[12:31:39] so maybe just scaling up the VM again is the next thing to try
[12:31:52] that causes downtime for the endpoints, do we care about that?
[12:32:22] There are some logs of it running out of threads, but they were few and not very recent, so it's not "only" that
[12:32:41] I think some downtime is acceptable
[12:32:55] ok, I'll try that in a bit.
[12:32:57] If we scale up, we can increase the max procs
[12:33:28] this seems weird, a dedicated webserver struggling to be a webserver
[12:38:31] Yep xd
[12:38:51] Last time the slowdown was only on HTTPS traffic
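For the apache worker question: with mpm_event the knobs that bound concurrency look roughly like the snippet below. The numbers are placeholders for illustration, not the values actually configured on the legacy redirector:

    # mpm_event.conf (illustrative values only)
    <IfModule mpm_event_module>
        StartServers              4
        ServerLimit              16
        ThreadsPerChild          25
        # hard cap on simultaneously processing requests (<= ServerLimit * ThreadsPerChild)
        MaxRequestWorkers       400
        # how many extra idle/keepalive connections each process may hold per idle thread
        AsyncRequestWorkerFactor  4
    </IfModule>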
[12:58:52] no more flapping at all...
[12:59:17] I restarted all the ingress-nginx pods, but did not restart the api-gateway ones, maybe that's why the ingress-nginx errors went away earlier
[13:00:47] added a note in the runbook
[13:06:42] I just manually changed for now the tools-legacy apache config for http traffic to redirect to `RewriteRule ^/([^/]*)/?(.*)$ https://$1.toolforge.org/$2 [L,NE,R]`
[13:08:01] that skips https if http was hit first
[13:08:19] downside is that newer tools will get redirected too (as opposed to only old ones)
[13:09:26] oh, it took 4s for a reply now
[13:09:54] 14 secs (to http://tools.wmflabs.org/wm-lol)
[13:14:14] yep, it's being slow on both http and https
[13:14:39] oh, not anymore :/, /me getting confused
[13:16:08] are http requests getting redirected twice? Like, http://foo.toolserver.org -> https://foo.toolserver.org -> https://.wmcloud.org?
[13:16:54] handy grafana link for the tools-legacy probe: https://grafana.wmcloud.org/goto/-OuSu-tHz?orgId=1
[13:17:08] it was mostly stable overnight, then it started flapping again
[13:18:23] interestingly sometimes the probes fail continuously for 20 minutes
[13:19:06] fluctuations in traffic + load could explain that right?
[13:19:06] andrewbogott: yep, they are
[13:19:19] just changed the https config to use the regex as well
[13:19:25] seems way more responsive now
[13:19:43] andrewbogott: let me try to correlate the failures with the amount of traffic
[13:19:46] So we can change the rule to skip that first redirect?
[13:20:20] yep, did it manually to test for now, will add it to puppet if it's not crashing anything
[13:20:33] great
[13:21:11] Since toolserver predates widespread use of https... I'd expect that to drop the load by 50%
[13:24:51] I don't see any obvious correlation between probe failures and the node_load or node_network_transmit/receive metrics
[13:25:38] that's... concerning
[13:26:36] now I'm redirecting with the right http code xd
[13:27:44] the other suspicious thing I see is probes failed 100% for several hours last Friday and Saturday
[13:27:55] then suddenly they started to be 100% successful again
[13:28:37] I guess we're maybe surfing around the threshold of the probe timeout?
[13:29:36] that's my experience kinda
[13:29:53] the times vary from 0.something to >10s
[13:30:17] for https most of the time, http being more stable
[13:30:29] are there stats on request times or something?
[13:30:42] I was also looking for those but I'm not sure we have the metric
[13:32:51] okok, just now fixed all the probes, using a regex redirect and http-redirect->toolforge.org hacks
[13:33:04] let's see if that helps
[13:33:32] I guess that the main reason to have all the redirects explicitly stated is to avoid new tools using the old domain
[13:37:34] we don't have any apache metrics btw, we can consider adding an apache prometheus exporter if we need to debug further
[13:38:48] good point yep
[13:39:08] got one failure, looking
[13:39:32] https still taking 5s on a test now :/
[13:40:26] http seems still fast
[13:54:19] taavi: what is 'fresh-node' in the striker makefile?
[13:54:39] dcaro: <--- yes, that was the idea
[14:00:19] I got a meeting now, I've left it with the old redirects for https, copied them over to http (to avoid the double redirect) and have increased the AsyncRequestWorkerFactor to 4
[14:00:20] https://httpd.apache.org/docs/current/mod/event.html#asyncrequestworkerfactor
[14:01:04] it seems to be using a bit more cpu, that's good
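A quick way to see whether a request to the legacy domain is still bouncing through a double redirect, and how long it takes, using the same test tool as above:

    # Follow the redirect chain and print the status line and Location header of each hop
    curl -sIL http://tools.wmflabs.org/wm-lol | grep -iE '^(HTTP|location)'

    # Time a single request (without following redirects) to spot a slow hop
    curl -s -o /dev/null -w '%{http_code} -> %{redirect_url} (%{time_total}s)\n' \
      http://tools.wmflabs.org/wm-lol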
[14:08:57] topranks: do we have an agenda for today's network sync? otherwise we may skip it?
[14:10:38] I've nothing in particular for it, no
[14:10:50] The v6 thing we will work on as discussed, nothing much other than that
[14:11:05] so no probs to skip if that works for you
[14:12:23] I have a question! But I can just ask it here... topranks, we're talking about buying bigger cloudvirts which would mean consolidating more VMs onto fewer servers. So for instance if we got 2x capacity cloudvirts we'd be cramming 2x the current network traffic onto a single host.
[14:12:49] sounds like a sensible idea
[14:12:52] Do you have any concerns about that? I spent a few minutes looking for dashboards that would show per-cloudvirt network activity but didn't immediately find anything.
[14:13:28] Specifically, if we were already 60% of the way to saturating the nic of any of our current cloudvirts...
[14:14:40] andrewbogott: https://gerrit.wikimedia.org/g/fresh
[14:14:40] you can look here:
[14:14:43] https://grafana.wikimedia.org/goto/sPcwiPtHg
[14:14:57] (related, we are for sure filling up the conntrack table on some cloudvirts but I assume I can fix that by just changing a puppet setting)
[14:15:10] The busiest seem to do about 3Gb/sec max.... so in theory we have some room to grow
[14:16:04] andrewbogott: yeah it's just a sysctl - net.netfilter.nf_conntrack_max
[14:16:11] there is a sysctl puppet class if nothing specific
[14:16:38] topranks: ok, so room to grow 2x but not 4x
[14:16:40] andrewbogott: it may be worth getting the new servers with 25Gb NICs and connecting them at that speed
[14:16:54] Can that be done with the existing switches?
[14:16:57] but not all the switches support that, only those in E4/F4 (newer models)
[14:17:16] ok, got it
[14:18:05] thank you, I think that's what I need to know for now! I will try to get this on a task and cc you.
[14:18:21] yeah please do, we can consider it in more detail
[14:18:22] thanks
[14:25:37] topranks: if we get a server with a 25G nic and put it in a rack with the old switches, can it still be cabled up to a 10G switch port?
[14:26:13] yeah the 25G NICs do 10G too, and are basically the same price
[14:26:20] so it makes sense to get all servers with them
[14:26:33] great, that's what I'll ask for then
[14:26:37] it may even be our default now (I know it is for supermicro)
[14:27:03] we could also consider juggling some hosts around, moving 10G ones into C8/D5, freeing space in E4/F4 for 25G ones. But I guess we can make that call closer to the time
[14:27:15] to your point the 25G NIC on the order leaves us with the option
[14:27:58] yep. And if we only go to 2x capacity then we don't need to juggle.
[14:35:34] ok, let's skip the network sync then
[14:38:20] I'm enjoying that we've crossed a tipping point where we're using less and less rack space rather than more and more. Hopefully the increased cpu and drive densities don't start the datacenter on fire.
[14:41:47] I wonder if we will hit the Jevons paradox (https://en.wikipedia.org/wiki/Jevons_paradox)
[14:46:33] with ceph storage we absolutely will. With compute, I kind of doubt it.
[14:47:30] I'm sure there are lots of 10 TB storage applications waiting to appear as soon as we're able to support them.
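Going back to the conntrack point from [14:14:57]: checking how close a cloudvirt is to the table limit, and raising it, is indeed just a sysctl. The value below is a placeholder, and in practice the change would go through the puppet sysctl class rather than a hand-edited file:

    # Current usage vs. limit on a cloudvirt
    sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

    # Raise the limit persistently (placeholder value)
    echo 'net.netfilter.nf_conntrack_max = 1048576' | sudo tee /etc/sysctl.d/90-conntrack.conf
    sudo sysctl --system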
[15:12:50] just noticed the alert about cloudelastic, it's the first time I see it
[15:12:59] anyone looking into it?
[15:13:14] (btw. no more legacy flaps since the manual changes)
[15:13:16] dcaro: is it related to the email sent to cloud@?
[15:13:29] let me check the email
[15:13:52] .... I had not opened the email yet today....
[15:13:54] hmm I don't find it in the archives, but I received it
[15:14:15] Subject: [Cloud] Awareness: Cloudelastic is migrating from Elasticsearch to Opensearch within the next week or so
[15:14:23] yep that one
[15:14:47] might be yes, though they say no interruptions are expected, I'll add a note in the task
[15:19:42] FYI I'm merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1124756
[15:21:20] ack, let me know if you need assistance
[15:21:29] (to check whatever/etc.)
[15:21:33] arturo: We clashed on a puppet-merge. Please feel free to merge mine.
[15:21:46] btullis: thanks, I just merged this
[15:21:47] Btullis: Remove sudo privileges for journalctl from airflow instance admins (d1c2775ff7)
[15:22:02] yup, thanks.
[15:22:10] cheers
[15:28:17] I think something is not right with the healthcheck patch
[15:29:54] hmm... I think that the mpm_event config in legacy-redirector is not managed by puppet :/
[15:30:32] arturo: anyhow, what's the issue?
[15:31:13] dcaro: only toolsbeta for now, the regular ingress may be failing all healthchecks
[15:31:27] puppet is disabled on tools, so the change did not make it to the live systems
[15:31:38] okok
[15:31:43] do you recall the command to see the backends of haproxy?
[15:31:45] from the socket
[15:31:56] socat something
[15:32:31] echo "show stat" | sudo socat stdio /var/run/haproxy.sock
[15:32:33] something like that
[15:33:07] yeah, got it
[15:33:08] echo "show stat" | socat unix-connect:/run/haproxy/haproxy.sock stdio
[15:33:14] but the output is basically unreadable
[15:37:08] SSL handshake failure,,0,0,0,0,,,,Layer6 invalid response
[15:40:18] I think it might have enabled the stats endpoint, you might need to get an ssh tunnel and go to /stats
[15:42:15] I mean, that was the backend error
[15:43:03] dcaro: I think the regular k8s-ingress does not listen on HTTPS, but k8s-ingress-api-gateway does
[15:46:02] dcaro: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1124829 <--- I think we need this
[15:52:18] dcaro: I think I'm ready to run puppet on tools, potential for outage warning
[15:52:27] did that work?
[15:52:33] yes, the last patch seemed to work
[15:53:00] I'll run haproxy-6 which is the standby first anyway
[15:53:02] okok
[15:54:03] haproxy-6 seems OK
[15:54:09] now, to the real real, haproxy-5
[15:54:57] change applied, haproxy restarted
[15:55:07] no backends DOWN apparently
[15:55:25] I can navigate toolforge.org stuff
[15:56:07] arturo: you already tried cycling power from mgmt on cloudcontrol1005?
[15:56:14] andrewbogott: yes
[15:56:30] ok. I'll eat lunch then and wait for Val
[15:56:47] dcaro: I think we are fine, I'll close the ticket. Thanks for the assistance
[15:56:52] andrewbogott: ack
[15:57:05] \o/
[15:57:06] thanks
[15:58:27] andrewbogott: just got a recovery notice for cloudcontrol1005
[15:58:58] interesting...
[15:59:28] most likely DCops worked their magic
[16:01:14] she says she replaced the sfp
[16:02:46] * arturo nods
[16:03:09] I'll let you handle the server from here, not sure what would need cleanup after the couple of days of being off the network
[16:06:32] did you depool galera via puppet? Or change the primary?
[16:06:41] neither
[16:06:46] ok
[16:06:52] gtg for a bit, be back in ~30min
[16:06:54] so probably it'll just bounce back but I'll keep an eye out
[18:02:34] anyone around to look at ToolforgeKubernetesNodeNotReady tools (Ready page prometheus true wmcs) ?
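For the NodeNotReady alert, two quick checks from a control node: list each node's Ready condition directly from the API server, and ask the tools prometheus what it currently sees (the metric name below is a guess based on kube-state-metrics, and may not be what the alert rule actually uses):

    # Print each node's Ready condition as reported by the API server
    kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'

    # Query prometheus directly for nodes it considers not Ready (metric name is an assumption)
    curl -sG 'https://tools-prometheus.wmflabs.org/tools/api/v1/query' \
      --data-urlencode 'query=kube_node_status_condition{condition="Ready",status="true"} == 0'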
[18:03:35] arturo, dcaro, several toolforge things seem to be breaking, at least one looks related to earlier ingress changes
[18:05:52] andrewbogott: I'm about to log off (back later), but a quick "kubectl get nodes" is showing all nodes as Ready
[18:06:05] ok, so maybe just a flap
[18:06:54] I see the alert in alertmanager, but I can't see it in prometheus
[18:07:05] which is confusing
[18:08:15] back in a bit sorry
[19:10:14] I don't see anything wrong in the cluster, but the alert is still firing in alertmanager
[19:10:43] the underlying metric in prometheus looks fine though
[19:13:17] there's a mismatch between the 2 prometheus hosts
[19:13:19] this happened before
[19:13:52] I'm restarting the systemctl unit in tools-prometheus-7
[19:18:21] I think the alerts have cleared
[19:18:47] * dhinus offline, ping me if you see something not working
[20:09:04] * dcaro back
[20:09:09] sorry, family stuff going on
[20:11:11] so it was a prometheus thing?
[20:11:15] :/
[20:11:27] last time that was caused by network issues
[20:14:27] (and the ingress stuff this morning was likely also a network issue)
[20:15:09] on the bright side, the tools-legacy-redirector is still working without errors
[20:15:40] dcaro: I think we're good for now
[20:15:40] anyhow, I don't see anything wrong now, ping me here or in telegram if you see any other issues
[20:15:45] And, great, about the redirector!
[20:15:50] thanks for checking in
[20:16:20] np, sorry for the late reply
[20:16:31] (had no good connectivity and missed the notification)
[20:16:42] cya in a bit!