[07:02:56] morning. it seems like the fourohfour tool has started flapping?
[07:03:32] Warning Unhealthy 23m (x1314 over 43h) kubelet Liveness probe failed: Get "http://192.168.13.170:8000/_/fourohfour-healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[07:24:16] greetings
[07:24:23] x1314 seems like flapping alright
[07:27:10] seems like there was a clear increase of 4xx status codes emitted early in the morning of march 1st?
[07:27:43] https://grafana.wmcloud.org/goto/r_D5owdDR?orgId=1
[07:29:12] yes, definitely looks like it
[08:32:27] morning
[09:06:30] morning
[09:10:11] looks like four-oh-four got more resources earlier today? or one more replica most likely?
[09:17:26] morning! so is it fixed, four-oh-four?
[09:17:53] doesn't look like it, it just flapped again
[09:17:59] also good morning to all
[09:18:42] is someone looking at it already? (trying to avoid stepping on anyone's toes)
[09:19:16] I'm not, no
[09:21:27] I added more replicas to see if that would help, but it seems like the answer is no
[09:21:40] I also haven't found any obvious reasons for the traffic increase
[09:22:42] ack, taavi let me know if you want another pair of eyes
[09:22:59] if you have time please do
[09:23:31] ack, I'll have a look then before getting into anything else, I'll report back if I find something or not xd
[11:46:47] * dcaro lunch
[13:01:25] mmhh fourohfour still flapping, I'll poke too
[13:02:53] yes please, so far all I've done is delay the flapping (by increasing replicas and/or deleting crashloop pods)
[13:05:12] ok! yeah definitely there's a bunch of pods in crashloop now
[13:06:52] the behavior I've seen is that the pods eventually start failing the healthchecks due to timeout and get restarted
[13:09:22] ah yeah that explains the crashloop indeed
[13:12:55] there was some restart of cloudlb1001/1002 haproxies, is that known?
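The kubelet error quoted above ("context deadline exceeded while awaiting headers") means the health endpoint did not answer within the probe's timeout, not that it returned an error status. A minimal sketch of that behavior, using a local stand-in server rather than the real tool (the `/_/fourohfour-healthz` path is from the log; the delay and timeout values are made up for illustration):

```python
import http.server
import threading
import time
import urllib.error
import urllib.request

def probe(url: str, timeout: float) -> bool:
    """Roughly what an HTTP liveness probe does: GET with a deadline,
    healthy only if a 2xx/3xx arrives before the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:  # covers URLError, TimeoutError, connection errors
        return False

class Handler(http.server.BaseHTTPRequestHandler):
    delay = 0.0  # toggled below to simulate a backend that got slow

    def do_GET(self):
        time.sleep(self.delay)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # keep the demo quiet
        pass

class QuietServer(http.server.HTTPServer):
    def handle_error(self, request, client_address):
        pass  # ignore broken pipes from timed-out clients

server = QuietServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/_/fourohfour-healthz"

healthy = probe(url, timeout=1.0)    # fast response: probe passes
Handler.delay = 2.0                  # backend slows down...
unhealthy = probe(url, timeout=0.5)  # ...deadline exceeded, probe fails
print(healthy, unhealthy)
```

This is why a slow dependency (redis, LDAP) shows up as restarts: the process is alive but can't answer within the deadline, so the probe fails and kubelet kills the pod, matching the "failing healthchecks due to timeout" behavior described later in the log.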
[13:13:26] it seems confd restarted them due to changes in wikireplicas-backend config
[13:21:51] I wasn't aware of that, no
[13:23:36] it showed in the alerts for a few minutes, now it's gone
[13:23:44] (it was "haproxy recently restarted" or similar)
[13:26:07] imho at this point the question about fourohfour is: what changed to make this start failing? are we just getting plain more traffic? did some crawler suddenly start running the javascript which calls the expensive endpoints? did redis or ldap suddenly get slower?
[13:26:51] I did not see any changes in traffic from the prometheus data so far
[13:28:14] I wanted to check the redis + ldap times from the code somehow, did not get to it yet, but that will not give "changes", just current times
[13:30:20] all very good questions
[13:30:52] hmm... we might be gathering redis stats already, maybe we can pull something from prometheus (we currently don't have a dashboard afaik)
[13:32:02] this looks like something
[13:32:06] https://usercontent.irccloud-cdn.com/file/KtbCid2t/image.png
[13:36:01] weird, because tools-redis-6 is not primary, it's 5
[13:37:15] oh, that's both 5 and 6 that have the bump, what's 7 doing?
[13:38:08] hmm... by the graphs it seems 7 is the primary, let me double check
[13:38:38] ok, I'll hold off looking for now
[13:39:06] yep
[13:39:06] sentinel says the primary is 7, so the hiera value is not correct
[13:39:16] which hiera value?
[13:40:14] profile::toolforge::redis_sentinel::primary
[13:40:35] https://gerrit.wikimedia.org/g/cloud/instance-puppet/+/1573ec61f5dc69a05b5907f39085478b61eb21fb/tools/tools-redis.yaml
[13:40:39] that is a terribly named value
[13:40:44] xd
[13:40:45] (yes, I checked, it's my fault)
[13:43:21] i.e.
that controls clustering in the case where a new cluster is turned up from scratch, or when everything has been offline, but once things are up it's up to sentinel to move the primary role around
[13:44:25] yep, I suspected sentinel would take over
[13:44:53] the commands that the secondaries started running are 'set' it seems, not sure why
[13:54:51] not finding much in the journal logs, got a meeting now so will stop for some time
[13:58:56] there's a spike in the number of sessions though until this morning, but the alerts on fourohfour started yesterday morning
[19:07:52] * dcaro off
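The "sentinel says the primary is 7" check above was presumably done with redis-cli against a sentinel node (`SENTINEL get-master-addr-by-name <name>`), which is the authoritative answer once sentinel has started failing the primary role around, regardless of what the hiera bootstrap value says. As a minimal sketch of what goes over the wire, with a made-up master name (`toolforge-redis`) and a canned reply standing in for a live sentinel:

```python
def encode_resp(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings, the wire
    format redis/sentinel accept (redis-cli does this for you)."""
    out = [f"*{len(args)}\r\n".encode()]
    for a in args:
        data = a.encode()
        out.append(f"${len(data)}\r\n".encode() + data + b"\r\n")
    return b"".join(out)

def parse_bulk_array(payload: bytes) -> list[str]:
    """Parse a flat RESP array of bulk strings; enough for
    get-master-addr-by-name, which replies with [ip, port]."""
    lines = payload.split(b"\r\n")
    assert lines[0].startswith(b"*"), "expected a RESP array"
    n = int(lines[0][1:])
    # bulk strings alternate "$<len>" / "<data>" lines after the header
    return [lines[2 + 2 * i].decode() for i in range(n)]

# Hypothetical master name; the real one lives in the sentinel config.
cmd = encode_resp("SENTINEL", "get-master-addr-by-name", "toolforge-redis")

# A reply shaped like what sentinel would send if some host held the
# primary role (IP and port here are illustrative, not from the log):
sample_reply = b"*2\r\n$10\r\n172.16.0.7\r\n$4\r\n6379\r\n"
print(parse_bulk_array(sample_reply))  # ['172.16.0.7', '6379']
```

This also illustrates why the hiera `primary` value drifting out of date is mostly cosmetic: consumers that ask sentinel get the live answer, and the puppet value only matters when bootstrapping from scratch, as explained above.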