[00:07:23] FIRING: SystemdUnitFailed: grafana-ldap-users-sync.service on grafana1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:07:23] FIRING: SystemdUnitFailed: grafana-ldap-users-sync.service on grafana1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:07:23] FIRING: SystemdUnitFailed: grafana-ldap-users-sync.service on grafana1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:02:18] I bet that's a renamed user ^ taking a look
[09:03:52] HCoplin -> HCoplin-WMF I think
[09:06:28] indeed, nice
[09:08:51] unfortunately a known issue, task is T374190
[09:08:52] T374190: grafana-ldap-users-sync breaks on renamed users - https://phabricator.wikimedia.org/T374190
[09:10:30] or maybe not, a different issue this time
[09:10:35] I was taking a look as well, but it seems that the error is slightly different from the one in the task. In this case, it says: Could not fetch metadata for UID
[09:10:37] eh :)
[09:11:52] Or maybe it has the same root cause (a rename) but manifests in a slightly different way
[09:12:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:14:32] Searching for the user with ldapsearch, I can only find a result for the one without the suffix
[09:16:16] indeed, unclear yet where hcoplin-wmf comes from, I was expecting to find it in grafana users and no luck so far
[09:17:23] Looking at the code, it seems that the uid originates from the LDAP API
[09:20:31] mmhh ok next guess is the hcoplin-wmf user is referenced in groups and it doesn't actually exist in ldap?
[09:29:56] uid=HCoplin-WMF is listed as a user in cn=wmf, but the user doesn't exist?
[09:30:44] that's what it looks like so far
[09:31:28] at least I can't find uid=HCoplin-WMF
[09:31:44] there's an access request here: https://phabricator.wikimedia.org/T387459
[09:32:17] but then the person already has shell access for cn=wmf as "hcoplin"
[09:33:15] mmhh I'm tempted to remove HCoplin-WMF from wmf for now and comment on the access request
[09:33:40] moritzm tappof what do you think ^ ?
[09:34:02] +1
[09:35:35] sounds good, but we should also loop in anyone who made the change to cn=wmf to prevent it from happening again?
[09:35:50] it wasn't done via Bitu, maybe Alex as part of clinic duty?
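[Editor's note] The failure mode diagnosed above (the cn=wmf group references a uid that has no corresponding user entry in LDAP, so the sync script's metadata lookup fails) can be sketched roughly as follows. This is a hypothetical illustration with made-up data and function names, not the actual grafana-ldap-users-sync code:

```python
# Hypothetical sketch: a group lists a member uid with no matching user
# entry, so fetching that user's metadata fails. The data below stands in
# for LDAP state; none of this is the real sync implementation.

# Pretend LDAP state: user entries keyed by (lowercased) uid, plus the
# membership list of the cn=wmf group.
ldap_users = {
    "hcoplin": {"cn": "HCoplin", "mail": "..."},  # the entry that exists
}
wmf_group_members = ["hcoplin", "HCoplin-WMF"]  # extra member, no entry

def fetch_metadata(uid):
    """Look up a user's entry by uid (case-insensitive, as a simplification)."""
    entry = ldap_users.get(uid.lower())
    if entry is None:
        # Mirrors the error seen in the unit's journal.
        raise LookupError(f"Could not fetch metadata for UID {uid}")
    return entry

def dangling_members(members, users):
    """Return group members that have no user entry at all."""
    return [uid for uid in members if uid.lower() not in users]

print(dangling_members(wmf_group_members, ldap_users))  # ['HCoplin-WMF']
```

Removing the dangling member from the group (as done below) makes every `fetch_metadata` call succeed again, which is why the alert resolved after the cleanup.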
[09:36:02] yes indeed, I'll reach out
[09:41:13] ok we're back, there's another unrelated issue I'll open a task for
[09:41:20] nothing blocking
[09:42:23] RESOLVED: SystemdUnitFailed: grafana-ldap-users-sync.service on grafana1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:43:42] https://phabricator.wikimedia.org/T387553 for the curious
[11:42:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:27:05] lag in eqiad is gone already for ^ and codfw is recovering
[13:57:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:58:10] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:02:55] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:05:12] that was a flood of messages from etcd.php btw re: ms3 being unknown
[16:43:30] ^^ Amir1
[16:43:55] it should be fixed now?
[16:44:26] yup, just for awareness :D
[16:46:06] thanks. OTOH, I removed 600,000 logs every hour via a couple of patches I did last night :D
[16:46:41] nice
[17:07:39] Amir1: Amazing, thank you! I'm guessing that these logs were generated by CentralAuth?
[17:08:19] yup, that was 400,000 and then 200,000 per hour was CommunityConfiguration
[17:09:04] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1123361 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CommunityConfiguration/+/1123464
[17:11:12] 🎉🎉🎉 \o/
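[Editor's note] The LogstashKafkaConsumerLag alerts in this log boil down to a simple quantity: per partition, lag is the broker's log-end offset minus the consumer group's committed offset, and the alert fires when the total grows too large. A toy illustration with made-up numbers (not the actual Prometheus alerting query):

```python
# Toy illustration of Kafka consumer lag, as alerted on above:
#   lag(partition) = log end offset - committed consumer group offset
# The offsets below are invented; in production these come from
# Prometheus metrics scraped off the Kafka brokers.

end_offsets = {0: 120_500, 1: 98_200, 2: 110_000}  # latest offset per partition
committed = {0: 119_000, 1: 98_200, 2: 104_500}    # group's committed offsets

lag = {p: end_offsets[p] - committed[p] for p in end_offsets}
total_lag = sum(lag.values())

print(lag)        # {0: 1500, 1: 0, 2: 5500}
print(total_lag)  # 7000
```

Lag shrinking back to zero on its own, as happened between 13:27 and 14:02 above, just means the consumers caught up with the backlog faster than new messages arrived.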