[13:44:49] * dhinus paged checker.tools.wmflabs.org/toolschecker: Redis set/get
[13:45:01] <dhinus>	 the alert went back to green by itself
[13:46:26] * dhinus paged again (same alert)
[13:47:00] <dhinus>	 and back to green again
[13:58:34] <dhinus>	 all 3 hosts are up&running, Grafana shows a small dip in network usage from 13:38 to 13:48 UTC
[14:01:23] * dhinus paged for a third time (same alert)
[14:03:36] <dhinus>	 back to green
[14:09:42] * dhinus paged for a fourth time (same alert)
[14:12:56] <dhinus>	 journalctl doesn't show much at the time when the alert fired, apart from "sssd_sudo[672878]: Shutting down (status = 0)"
[14:13:24] <dhinus>	 followed by "systemd[1]: sssd-pam.service: Succeeded."
[14:14:56] <dhinus>	 the first 3 times, the alert resolved itself in 1 min, but now it's still firing after 5 mins
[14:15:52] <dhinus>	 redis-cli info replication shows "slave" on 2 hosts, and "ERR max number of clients reached" on the third host
[14:18:56] <dhinus>	 "systemctl status redis-server.service" is showing "failed" on tools-redis-7, and I cannot restart it
[14:21:02] * dhinus paged for a fifth time (same alert)
[14:21:15] <dhinus>	 reading https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Redis
[14:24:35] <taavi>	 "max number of clients reached" seems familiar, that was last fixed via a restart
[14:24:55] <taavi>	 dhinus: iirc we use a non-default service unit name for the redis server
[14:25:26] <dhinus>	 thanks taavi 
[14:25:57] <dhinus>	 I guess the unit is redis-instance-tcp_6379.service
[14:26:07] <taavi>	 sounds like it
[14:26:17] <dhinus>	 that one is running on all 3 hosts
[14:28:03] <taavi>	 restart it on the one that's saying too many connections
[14:28:14] <dhinus>	 it's now working again without me doing anything...
[14:28:35] <taavi>	 i can explain more on a not sunday, but tl;dr is that only one of three of the hosts receives client traffic at a time
[14:28:58] <dhinus>	 sure we can discuss tomorrow, let me know if there's anything else you can think of apart from restarting the affected host
[14:33:19] * dhinus paged again
[14:33:51] <dhinus>	 I'll try restarting on the primary (tools-redis-7) where I'm seeing the ERR max number of clients reached
[14:34:58] <dhinus>	 trying to failover first as described in https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Redis
[14:35:14] <dhinus>	 redis-cli -p 26379
[14:35:17] <dhinus>	 sentinel failover toolforge
[14:35:30] <taavi>	 the failover won't help here
[14:35:34] <taavi>	 just restart the redis server
[14:35:37] <dhinus>	 the failover worked, the new master is tools-redis-6
[14:35:47] <dhinus>	 I'll restart tools-redis-7 (or should I restart the new master)?
[14:36:02] <taavi>	 the one that was previously having issues
[14:36:07] <dhinus>	 yep, 7
[14:36:21] <dhinus>	 systemctl restart redis-instance-tcp_6379.service
[14:36:50] <dhinus>	 or should I reboot the entire host?
[14:39:17] <dhinus>	 "systemctl restart" did restart the service with no errors, not sure if it's gonna help
[14:47:08] <dhinus>	 the "max number of clients" error was only happening on tools-redis-7 (verified with journalctl -g "max number of clients")
[14:47:31] <dhinus>	 it's not happening anymore after I restarted the unit at 14:36 UTC
[15:11:25] <dhinus>	 I created a runbook as I don't think we had one for this alert: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Redis
[15:38:01] <dhinus>	 I took the chance to test the new Incident Response Process and wrote an incident doc at https://docs.google.com/document/d/1r1TbKFo9yQl0gTJt2D0mJBLSYE09UKF4dlVF7xUS6mo/edit
[15:38:31] <dhinus>	 that doc is shared with WMF staff only, tomorrow I will write a public Incident Report on wikitech
[15:39:19] <dhinus>	 (I'm not sure this was serious enough to deserve an incident doc+report but it was a good chance to test the new process)
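
Note: the triage step from the log (running "redis-cli info replication" on each host and reading the reply) could be scripted roughly as below. This is only a sketch: the function name classify_replication_reply is hypothetical and not part of any Toolforge tooling, and it assumes the reply contains either a "role:master" / "role:slave" line or the "max number of clients reached" error text quoted above. The failover and restart commands appear only as comments, copied from the log.

```shell
#!/bin/sh
# Hypothetical helper: classify the text returned by
# `redis-cli info replication` on a single host.
# Reads the reply on stdin and prints one of:
#   max_clients | primary | replica | unknown
classify_replication_reply() {
    reply=$(cat)
    case "$reply" in
        # Error text seen on tools-redis-7 in the log above
        *"max number of clients reached"*) echo "max_clients" ;;
        # Role line as reported in the replication section of INFO
        *role:master*) echo "primary" ;;
        *role:slave*) echo "replica" ;;
        *) echo "unknown" ;;
    esac
}

# Possible usage on each Redis host (not run by this sketch):
#   redis-cli info replication 2>&1 | classify_replication_reply
#
# If a host reports max_clients, the steps taken in the log were:
#   redis-cli -p 26379 sentinel failover toolforge        # optional, per taavi
#   systemctl restart redis-instance-tcp_6379.service     # on the affected host
```

The error check comes first so that a host refusing connections is never reported as healthy; everything else is plain substring matching and makes no assumption about which host is currently the primary.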