[08:47:44] Why haven't the dbproxies had their process reloaded for 6 days??
[08:48:30] The alert has been sitting there for 6 days?
[08:48:33] Is this known?
[08:52:22] it's known, something got stuck, we can take a look together after the unfreeze
[08:52:27] stuck?
[08:52:46] I mean, this doesn't have to wait till the unfreeze, they really have to be up
[08:54:43] Right now they show the master down, so if we had to fail over to them because the other pair crashed, we'd have an outage
[08:57:02] do you know if it needs some startup script to be fixed or can we just restart haproxy by hand?
[08:57:14] you need to reload the service
[08:57:50] does it need some coordination with the active nodes or can we just start haproxy itself?
[08:58:08] they are standby hosts, so please go ahead and reload on the affected hosts
[08:58:39] ok, I'm starting them one at a time and monitoring the logs
[09:00:34] also, I've noticed there are dashboards for haproxies for other teams (doing HTTP load balancing) but I could not find one for ours
[09:01:15] so far I created https://grafana.wikimedia.org/d/fc48lf4/dbproxy but I could not find metrics that are specific to haproxy
[09:03:15] meeting
[10:00:56] federico3: you can go ahead and do all of them, just check they are standby
[10:01:08] I mean, if they weren't, we'd be down :)
[10:01:23] did you log in and start them?
[10:01:49] No, I didn't do anything; as I said, I saw the alerts and they really have to be up
[10:06:39] I'm rechecking: the processes are up, I see them listening on 3306 and logging. Meanwhile the alerts are not clearing, and I found one metric to add to the dashboard: check_haproxy_failover, which is showing errors
[10:08:11] federico3: Just: echo "show stat" | socat /run/haproxy/haproxy.sock stdio
[10:08:18] on the dbproxy host to see if both hosts are up
[10:09:49] oh wow, after *re*starting haproxy on dbproxy1022 the failover metric cleared
[10:10:30] that's the whole point of the process
[10:11:02] uh?
[10:11:32] ?
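The `show stat` check suggested in the log can be scripted. A minimal sketch, assuming the standard `/run/haproxy/haproxy.sock` admin socket path quoted above; the sample column set in the parser's docstring is a small subset of the ~80 columns real HAProxy output contains, but the header-driven parsing handles either:

```python
import csv
import socket


def haproxy_server_states(stats_text):
    """Parse HAProxy 'show stat' CSV output into {(proxy, server): status}.

    The first line is a header starting with '# '; the 'status' column
    holds UP/DOWN/etc. Aggregate FRONTEND/BACKEND rows are skipped so
    only individual backend servers (the MariaDB hosts) are reported.
    """
    lines = stats_text.strip().splitlines()
    header = lines[0].lstrip("# ").split(",")
    states = {}
    for row in csv.DictReader(lines[1:], fieldnames=header):
        if row["svname"] not in ("FRONTEND", "BACKEND"):
            states[(row["pxname"], row["svname"])] = row["status"]
    return states


def query_stats(sock_path="/run/haproxy/haproxy.sock"):
    """Equivalent of: echo "show stat" | socat /run/haproxy/haproxy.sock stdio"""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall(b"show stat\n")
        chunks = []
        while True:
            data = s.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()
```

Run on a dbproxy host, `haproxy_server_states(query_stats())` would show at a glance whether both backends are UP, which is the check being done by hand in the log.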
[10:16:18] I mean: https://phabricator.wikimedia.org/P89927 - it looks like the configuration has a very high "99999999" threshold, so it would never recover without a restart
[10:17:06] yes, that's why you need a reload: [08:57:14] you need to reload the service
[10:22:43] I found "check inter 3s fall 20 rise 99999999" in the configuration, so they would fail over from the active backend mariadb to the backup once and then never flip back, but I'm not sure why they showed the active backend as failed in the first place
[10:23:01] That's intended, to avoid flapping in masters
[10:23:41] https://wikitech.wikimedia.org/wiki/HAProxy
[10:36:43] I mean: haproxy was restarted as part of the upgrade, and some proxies detected the active backend as failed after the restart while some did not: https://grafana.wikimedia.org/d/fc48lf4/dbproxy?orgId=1&from=2026-03-18T15%3A36%3A36.290Z&to=2026-03-18T17%3A01%3A37.970Z&timezone=utc
[10:41:20] federico3: all the ones that do not show a red there were never restarted
[10:41:31] all the ones you restarted started with the backend as DOWN, as that's expected
[10:49:54] dbproxy2005 (for example) had been rebooted and haproxy started, and it showed the backends as up: https://phabricator.wikimedia.org/P89928
[11:13:35] federico3: That is completely expected for some hosts; haproxy "remembers" the state of the backend hosts from before it gets stopped, so it could be one of several issues here (but again, expected), check: https://phabricator.wikimedia.org/P89928#363514 In any case it is expected, and it is better that a master comes back as down; that's why the reload is needed.
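The "check inter 3s fall 20 rise 99999999" behaviour discussed above can be illustrated with a hypothetical config fragment (listen block name and server addresses are invented; only the quoted check parameters come from the log):

```
# Hypothetical dbproxy listen block -- names/addresses are placeholders.
listen mariadb
    bind *:3306
    mode tcp
    # inter 3s      : run a health check every 3 seconds
    # fall 20       : 20 consecutive failures (~60s) mark the server DOWN
    # rise 99999999 : ~9.5 years of consecutive passing checks would be
    #                 needed to mark it UP again, i.e. effectively never --
    #                 this prevents flapping back and forth on a master
    server db-master db1.example.org:3306 check inter 3s fall 20 rise 99999999
    server db-backup db2.example.org:3306 check inter 3s fall 20 rise 99999999 backup
```

This matches the behaviour seen in the log: once a proxy has failed over to the backup, only a reload of haproxy (which resets check state) brings the master back into service, which is intentional anti-flapping behaviour.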
[13:01:53] PROBLEM - MariaDB sustained replica lag on s2 on db2225 is CRITICAL: 16 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2225&var-port=9104
[13:02:53] RECOVERY - MariaDB sustained replica lag on s2 on db2225 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2225&var-port=9104
[13:41:54] uhm, why am I not seeing this in https://grafana.wikimedia.org/d/fcdr7bv/mariadb-aggregated-replication-lag?orgId=1&from=now-1h&to=now&timezone=utc ...
[13:43:23] Probably because it is not even here: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-3h&to=now&timezone=utc&var-job=$__all&var-server=db2225&var-port=9104&refresh=1m&viewPanel=panel-6
[13:43:41] yes, I'm looking at the same dash as well
[13:44:48] but then where's the alert generated from...?
[13:45:43] it's in puppet instead of the alerts repo: https://gerrit.wikimedia.org/g/operations/puppet/+/91824db0db3a867655e54f61c78bbc551fcddb22/modules/profile/manifests/mariadb/replication_lag.pp
[13:48:35] If I plot "scalar(avg_over_time(mysql_slave_status_seconds_behind_master{instance="db2225:9104"}[5m]))" in Grafana "Explore" I get nothing
[13:49:12] federico3: let's not spend much time on this, we have the DC switchover happening in 10 minutes
[13:49:42] It's probably worth switching to pt-heartbeat based alerts at some point
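The "I get nothing" check above can also be done outside Grafana, against the Prometheus HTTP API directly. A minimal sketch using the exact query from the log; the Prometheus base URL is a placeholder, not the real WMF endpoint:

```python
import json
import urllib.parse


# Exact PromQL query quoted in the log for db2225's replication lag.
PROM_QUERY = ('scalar(avg_over_time('
              'mysql_slave_status_seconds_behind_master{instance="db2225:9104"}[5m]))')


def query_url(prom_api="http://prometheus.example.org/api/v1", query=PROM_QUERY):
    """Build an instant-query URL for the Prometheus HTTP API
    (GET /api/v1/query?query=...)."""
    return f"{prom_api}/query?{urllib.parse.urlencode({'query': query})}"


def parse_scalar(body):
    """Extract the scalar value from a Prometheus API response.

    Returns None on an error response or an empty result. Note that
    scalar() over an empty vector yields NaN, which would match the
    empty plot seen in Grafana "Explore" above.
    """
    data = json.loads(body) if isinstance(body, str) else body
    if data.get("status") != "success":
        return None
    result = data["data"]["result"]  # scalar result: [timestamp, "value"]
    if not result:
        return None
    return float(result[1])
```

Fetching `query_url()` with any HTTP client and feeding the body to `parse_scalar` would confirm whether the metric is being scraped at all, separating a missing-metric problem from a dashboard problem.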