[13:44:49] * dhinus paged checker.tools.wmflabs.org/toolschecker: Redis set/get
[13:45:01] the alert went back to green by itself
[13:46:26] * dhinus paged again (same alert)
[13:47:00] and back to green again
[13:58:34] all 3 hosts are up&running, Grafana shows a small dip in network usage from 13:38 to 13:48 UTC
[14:01:23] * dhinus paged for a third time (same alert)
[14:03:36] back to green
[14:09:42] * dhinus paged for a fourth time (same alert)
[14:12:56] journalctl doesn't show much at the time when the alert fired, apart from "sssd_sudo[672878]: Shutting down (status = 0)"
[14:13:24] followed by "systemd[1]: sssd-pam.service: Succeeded."
[14:14:56] the first 3 times, the alert resolved itself in 1 min, but now it's still firing after 5 mins
[14:15:52] redis-cli info replication shows "slave" on 2 hosts, and "ERR max number of clients reached" on the third host
[14:18:56] "systemctl status redis-server.service" is showing "failed" on tools-redis-7, and I cannot restart it
[14:21:02] * dhinus paged for a fifth time (same alert)
[14:21:15] reading https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Redis
[14:24:35] "max number of clients reached" seems familiar, that was last fixed via a restart
[14:24:55] dhinus: iirc we use a non-default service unit name for the redis server
[14:25:26] thanks taavi
[14:25:57] I guess the unit is redis-instance-tcp_6379.service
[14:26:07] sounds like it
[14:26:17] that one is running on all 3 hosts
[14:28:03] restart it on the one that's saying too many connections
[14:28:14] it's now working again without me doing anything...
[14:28:35] i can explain more on a not sunday, but tl;dr is that only one of three of the hosts receives client traffic at a time
[14:28:58] sure we can discuss tomorrow, let me know if there's anything else you can think of apart restarting the affected host
[14:33:19] * dhinus paged again
[14:33:51] I'll try restarting on the primary (tools-redis-7) where I'm seeing the ERR max number of clients reached
[14:34:58] trying to failover first as described in https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Redis
[14:35:14] redis-cli -p 26379
[14:35:17] sentinel failover toolforge
[14:35:30] the failover won't help here
[14:35:34] just restart the redis server
[14:35:37] the failover worked, the new master is tools-redis-6
[14:35:47] I'll restart tools-redis-7 (or should I restart the new master)?
[14:36:02] the one that was previously having issues
[14:36:07] yep, 7
[14:36:21] systemctl restart redis-instance-tcp_6379.service
[14:36:50] or should I reboot the entire host?
[14:39:17] "systemctl restart" did restart the service with no errors, not sure if it's gonna help
[14:47:08] the "max number of clients" error was only happening on tools-redis-7 (verified with journalctl -g "max number of clients")
[14:47:31] it's not happening anymore after I restarted the unit at 14:36 UTC
[15:11:25] I created a runbook as I don't think we had one for this alert: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Redis
[15:38:01] I took the chance to test the new Incident Response Process and wrote an incident doc at https://docs.google.com/document/d/1r1TbKFo9yQl0gTJt2D0mJBLSYE09UKF4dlVF7xUS6mo/edit
[15:38:31] that doc is shared with WMF staff only, tomorrow I will write a public Incident Report on wikitech
[15:39:19] (I'm not sure this was serious enough to deserve an incident doc+report but it was a good chance to test the new process)