[14:52:02] marostegui, we've been investigating a pybal issue that apparently is related to dbproxy1018
[14:52:45] https://grafana.wikimedia.org/goto/vSRg8PQVk?orgId=1
[14:53:48] IdleConnection seemed to be flapping a lot (aka connecting/disconnecting from dbproxy too quickly) from ~08:00 to ~14:00 today
[14:54:00] it roughly matches your ack on icinga
[14:54:43] and it completely matches the icinga alert
[14:54:57] vgutierrez: I had no idea dbproxy1018 (wmcs proxies) had any implication on pybal
[14:55:10] but yes, it is part of the outage at https://phabricator.wikimedia.org/T337446
[14:55:19] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 14 down 1: https://wikitech.wikimedia.org/wiki/HAProxy at 08:08 UTC
[14:55:30] yes, I am aware of that alert
[14:55:37] But there's not much I can do if I want to get this fixed
[14:55:52] marostegui: dbproxy1018 is exposed via high-traffic2 LVS through the wikireplicas service
[14:56:13] vgutierrez: but it only affects wmcs users, right?
[14:56:36] wikireplicas maybe, high-traffic2 handles upload.wikimedia.org traffic as well
[14:56:58] but does that issue affects upload.wikimedia.org too?
[14:57:27] potentially it could impact inbound traffic on upload.wm.o in eqiad yes
[14:57:47] Then I have no idea what to do, because there will be more of those in the next few days
[14:57:55] There is no other way for me to get this fixed
[14:58:26] * vgutierrez reading the task
[14:58:49] vgutierrez: Not much to read, I basically have to stop two clouddb* hosts for a few hours to reclone them
[14:59:00] And they are behind dbproxy1018 and dbproxy1019, which are wmcs proxies
[14:59:12] why that would impact haproxy ability of having a TCP connection open?
[14:59:22] That I don't know
[14:59:26] lack of backend servers?
[14:59:31] I guess
[14:59:41] I don't know, I don't own this service at all
[15:01:02] could we set dbproxy1018 as inactive during that maintenance window? dcaro, arturo?
[15:01:31] There is also dbproxy1019 involved on all this
[15:01:58] In case it matters
[15:04:05] yep.. actually both hosts were impacted
[15:04:11] (from pybal's PoV)
[15:05:46] so assuming that during the maintenance window the dbproxies are unable to process incoming requests we would like to flag them as inactive to prevent pybal from healthchecking them
[15:07:19] but still pybal should not choke if a backend is not healthy/unable to respond to healthchecks
[15:08:41] volans: yep
[15:08:45] totally agree with that
[15:53:56] <_joe_> vgutierrez: pybal will chocke worse if it has no backends defined in a pool
[15:54:06] <_joe_> I would suggest to remove healthchecks for that service
[23:19:19] Trying to look at these videoscaler alerts, but my power keeps dropping :/
[23:19:28] https://usercontent.irccloud-cdn.com/file/RA4EwqGG/image.png