[00:01:31] I restarted apache2 on logstash1023 and logged it in SAL. I'll add this info to T337818 and I'll be monitoring the alert to see if it triggers again.
[00:01:32] T337818: apache2 cpu-stuck on logstash hosts causes kafka logging lag - https://phabricator.wikimedia.org/T337818
[00:25:31] I see a steady increase in consumer group lag for the apifeatureusage consumer group: https://grafana.wikimedia.org/goto/fDU7N9JSk
[00:29:59] The consumer group lag graph shows that the number of events is decreasing; I think the issue may be resolved. :)
[00:30:12] I'll keep monitoring it...
[00:48:35] The consumer group lag looks healthy now. https://grafana.wikimedia.org/goto/OxMJvrJSk
[00:57:16] (LogstashKafkaConsumerLag) resolved: Too many messages in logging-eqiad for group apifeatureusage - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:47:16] (LogstashKafkaConsumerLag) firing: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:47:40] ^ Taking a look.
[14:50:20] The consumer group lag graph shows that the lag is decreasing; it may resolve on its own.
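The consumer-group lag being watched above is, per partition, the gap between the broker's log-end offset and the group's last committed offset; the alert fires when the total across partitions stays too high. A minimal sketch of that computation, with made-up offset numbers for illustration (in practice the offsets come from the Kafka admin API or `kafka-consumer-groups.sh --describe`):

```python
# Consumer-group lag: sum over partitions of
# (log-end offset - committed offset). All numbers below are
# hypothetical; real values come from the Kafka brokers.

def group_lag(log_end_offsets, committed_offsets):
    """Total lag for one consumer group across its partitions.

    A partition with no committed offset counts as fully behind.
    """
    return sum(
        log_end_offsets[p] - committed_offsets.get(p, 0)
        for p in log_end_offsets
    )

# Example: three partitions of a hypothetical logging topic.
log_end = {0: 1_500, 1: 2_000, 2: 1_200}
committed = {0: 1_400, 1: 1_990, 2: 1_200}

print(group_lag(log_end, committed))  # 110: the group is 110 messages behind
```

A "steady increase" in this number means consumers are falling behind producers; a decreasing value, as observed at 00:29:59, means they are catching up.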
[15:06:49] herron: o/
[15:06:57] FYI, I am deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/1012404
[15:07:05] to drop a lot of istio labels
[15:08:15] elukey 👍
[18:02:16] (LogstashKafkaConsumerLag) resolved: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[20:29:04] Hello team, I see several email alerts of FAIL: debmonitor-client coming from logstash1011; it looks like its SSL certificate has expired (SSLV3_ALERT_CERTIFICATE_EXPIRED).
[20:32:29] ^ I'm taking a look.
[20:32:54] denisse: that host has had puppet disabled for almost 18 days due to a broken disk, but that should never be the case
[20:35:51] yes, I was going to say the same re: disabled puppet
[20:36:24] https://phabricator.wikimedia.org/T359612
[20:41:49] Thank you both, do you know if there's something we need to do about that host?
[20:42:45] I think c.white has a plan; it's disabled via puppet for now, although in a somewhat fragile state
[20:42:45] I'm wondering if we're planning on decommissioning it; if so, I think we could silence those alerts.
[20:43:05] it's not out of puppetdb, so it's a ghost host and it's flagged by the netbox report for that; it shouldn't be kept that way
[20:43:28] Okay, thanks herron.
[20:43:33] yeah, maybe move it to spare or something like that if the intention is to keep it online but not running services, but check with c.white
[20:43:37] the two options are running puppet or powering it off, IMHO
[20:44:07] puppet should never be disabled for more than a few days
[20:45:45] Thanks volans. I think Cole is OOO; I wonder if we should turn off the host. herron, volans, what do you think?
[20:45:59] what is it currently doing?
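The debmonitor-client failure above (SSLV3_ALERT_CERTIFICATE_EXPIRED) is the classic symptom of a certificate outliving its `notAfter` date while puppet, which would normally renew it, is disabled. A small sketch of checking time-to-expiry from a certificate's `notAfter` field using only the Python standard library; the date values are hypothetical:

```python
import ssl
from datetime import datetime, timezone

def days_until_expiry(not_after: str, now: datetime) -> int:
    """Days left before a certificate's notAfter timestamp.

    `not_after` uses the OpenSSL text form, e.g. 'Mar 20 12:00:00 2024 GMT',
    which ssl.cert_time_to_seconds() parses as GMT.
    """
    expiry_ts = ssl.cert_time_to_seconds(not_after)
    expiry = datetime.fromtimestamp(expiry_ts, tz=timezone.utc)
    return (expiry - now).days

# Hypothetical example: a cert that expired before the alert fired.
now = datetime(2024, 3, 20, 20, 29, tzinfo=timezone.utc)
print(days_until_expiry("Mar 15 00:00:00 2024 GMT", now))  # negative: already expired
```

With puppet disabled for 18 days, any certificate whose remaining validity was shorter than the disable window would cross zero unnoticed, which is why long puppet disables are flagged below.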
[20:46:19] well, I think the priority should be on not causing unexpected issues in the OpenSearch logging cluster
[20:46:25] volans: sorry, I don't understand that question.
[20:46:38] is it part of any cluster? is it in production in any way?
[20:46:41] nothing, essentially; it had a disk failure and is depooled
[20:47:50] ok, and what's the plan for that host? I don't see a dcops-related task for the disk replacement
[20:48:18] it has passed the 5-year mark, so I guess it's due for replacement
[20:49:22] so if there is no plan for fixing it, I'd say decommission it; if there is a plan to fix it, power it off, mark it as failed in Netbox, and notify dcops
[20:49:23] yes, so basically this: https://phabricator.wikimedia.org/T352517#9618909
[20:49:25] my 2 cents :)
[20:51:34] thanks, yes, I agree overall, although I also don't want to interfere with the work that c.white already has in flight on it. but I will definitely check in when he's back about either bringing puppet back into sync or turning the host down
[20:52:17] sorry, I'm not sure I follow. If the host is not doing anything, how is it affecting the setup and provisioning of the new hosts?
[20:53:35] opensearch is not running, so it's just an idle server, depooled from all services
[21:30:10] logstash1011 is ready for decom. Filed T360950
[21:30:13] T360950: decommission logstash101[012] - https://phabricator.wikimedia.org/T360950
[22:58:37] Hi team, I see some emails from SplunkOnCall Unassigned Overrides (Override For: Giuseppe Lavagetto). Do you know if there's something we should do with those alerts on our side?