[00:07:23] FIRING: SystemdUnitFailed: grafana-ldap-users-sync.service on grafana1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:07:23] FIRING: SystemdUnitFailed: grafana-ldap-users-sync.service on grafana1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:07:23] FIRING: SystemdUnitFailed: grafana-ldap-users-sync.service on grafana1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:02:18] I bet that's a renamed user ^ taking a look
[09:03:52] HCoplin -> HCoplin-WMF I think
[09:06:28] indeed, nice
[09:08:51] unfortunately a known issue, task is T374190
[09:08:52] T374190: grafana-ldap-users-sync breaks on renamed users - https://phabricator.wikimedia.org/T374190
[09:10:30] or maybe not, a different issue this time
[09:10:35] I was taking a look as well, but it seems that the error is slightly different from the one in the task. In this case, it says: Could not fetch metadata for UID
[09:10:37] eh :)
[09:11:52] Or maybe it has the same root cause (a rename) but manifests in a slightly different way
[09:12:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:14:32] Searching for the user with ldapsearch, I can only find a result for the one without the suffix
[09:16:16] indeed, unclear yet where hcoplin-wmf comes from, I was expecting to find it in grafana users and no luck so far
[09:17:23] Looking at the code, it seems that the uid originates from the LDAP API
[09:20:31] mmhh ok next guess is the hcoplin-wmf user is referenced in groups and it doesn't actually exist in ldap?
[09:29:56] uid=HCoplin-WMF is listed as a user in cn=wmf, but the user doesn't exist?
[09:30:44] that's what it looks like so far
[09:31:28] at least I can't find uid=HCoplin-WMF
[09:31:44] there's an access request here: https://phabricator.wikimedia.org/T387459
[09:32:17] but then the person already has shell access for cn=wmf as "hcoplin"
[09:33:15] mmhh I'm tempted to remove HCoplin-WMF from wmf for now and comment on the access request
[09:33:40] moritzm tappof what do you think ^ ?
[09:34:02] +1
[09:35:35] sounds good, but we should also loop in anyone who made the change to cn=wmf to prevent it from happening again?
[09:35:50] it wasn't done via Bitu, maybe Alex as part of clinic duty?
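[Editor's note] The failure mode diagnosed above (the cn=wmf group references a uid that has no corresponding user entry in LDAP, so the sync script's metadata lookup fails) can be sketched roughly as follows. This is a hypothetical illustration with made-up data and function names, not the actual grafana-ldap-users-sync code:

```python
# Hypothetical sketch: a group lists a member uid with no matching user
# entry, so fetching that user's metadata fails. The data below stands in
# for LDAP state; none of this is the real sync implementation.

# Pretend LDAP state: user entries keyed by (lowercased) uid, plus the
# membership list of the cn=wmf group.
ldap_users = {
    "hcoplin": {"cn": "HCoplin", "mail": "..."},  # the entry that exists
}
wmf_group_members = ["hcoplin", "HCoplin-WMF"]  # extra member, no entry

def fetch_metadata(uid):
    """Look up a user's entry by uid (case-insensitive, as a simplification)."""
    entry = ldap_users.get(uid.lower())
    if entry is None:
        # Mirrors the error seen in the unit's journal.
        raise LookupError(f"Could not fetch metadata for UID {uid}")
    return entry

def dangling_members(members, users):
    """Return group members that have no user entry at all."""
    return [uid for uid in members if uid.lower() not in users]

print(dangling_members(wmf_group_members, ldap_users))  # ['HCoplin-WMF']
```

Removing the dangling member from the group (as done below) makes every `fetch_metadata` call succeed again, which is why the alert resolved after the cleanup.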
[09:36:02] yes indeed, I'll reach out
[09:41:13] ok we're back, there's another unrelated issue I'll open a task for
[09:41:20] nothing blocking
[09:42:23] RESOLVED: SystemdUnitFailed: grafana-ldap-users-sync.service on grafana1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:43:42] https://phabricator.wikimedia.org/T387553 for the curious
[11:42:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:27:05] lag in eqiad is gone already for ^ and codfw is recovering
[13:57:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:58:10] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:02:55] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:05:12] that was a flood of messages from etcd.php btw re: ms3 being unknown
[16:43:30] ^^ Amir1
[16:43:55] it should be fixed now?
[16:44:26] yup, just for awareness :D
[16:46:06] thanks. OTOH, I removed 600,000 logs every hour via a couple of patches I did last night :D
[16:46:41] nice
[17:07:39] Amir1: Amazing, thank you! I'm guessing that these logs were generated by CentralAuth?
[17:08:19] yup, that was 400,000 and then 200,000 per hour was CommunityConfiguration
[17:09:04] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1123361 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CommunityConfiguration/+/1123464
[17:11:12] 🎉🎉🎉 \o/
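[Editor's note] The LogstashKafkaConsumerLag alerts in this log boil down to a simple quantity: per partition, lag is the broker's log-end offset minus the consumer group's committed offset, and the alert fires when the total grows too large. A toy illustration with made-up numbers (not the actual Prometheus alerting query):

```python
# Toy illustration of Kafka consumer lag, as alerted on above:
#   lag(partition) = log end offset - committed consumer group offset
# The offsets below are invented; in production these come from
# Prometheus metrics scraped off the Kafka brokers.

end_offsets = {0: 120_500, 1: 98_200, 2: 110_000}  # latest offset per partition
committed = {0: 119_000, 1: 98_200, 2: 104_500}    # group's committed offsets

lag = {p: end_offsets[p] - committed[p] for p in end_offsets}
total_lag = sum(lag.values())

print(lag)        # {0: 1500, 1: 0, 2: 5500}
print(total_lag)  # 7000
```

Lag shrinking back to zero on its own, as happened between 13:27 and 14:02 above, just means the consumers caught up with the backlog faster than new messages arrived.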