[00:09:06] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1175185
[00:09:06] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1175185 (owner: TrainBranchBot)
[00:30:35] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1175185 (owner: TrainBranchBot)
[01:00:48] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[01:11:40] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 10m 52s)
[01:22:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:32:59] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[01:32:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[01:34:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:06:56] (CR) Eevans: [V:+2 C:+2] convenience script to cleanup Cassandra instance state [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/1156924 (owner: Eevans)
[02:41:36] PROBLEM - MariaDB Replica Lag: s3 #page on db2205 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:58:37] RECOVERY - MariaDB Replica Lag: s3 #page on db2205 is OK: OK slave_sql_lag Replication lag: 11.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:05:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:09:29] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:10:53] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:13:20] PROBLEM - Disk space on an-worker1118 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 158112 MB (4% inode=99%): /var/lib/hadoop/data/e 156899 MB (4% inode=99%): /var/lib/hadoop/data/m 162003 MB (4% inode=99%): /var/lib/hadoop/data/k 163770 MB (4% inode=99%): /var/lib/hadoop/data/f 161447 MB (4% inode=99%): /var/lib/hadoop/data/g 157247 MB (4% inode=99%): /var/lib/hadoop/data/h 159302 MB (4% inode=99%): /var/lib/hadoop/data
[03:13:20] 3 MB (4% inode=99%): /var/lib/hadoop/data/j 155157 MB (4% inode=99%): /var/lib/hadoop/data/c 147408 MB (3% inode=99%): /var/lib/hadoop/data/l 161283 MB (4% inode=99%): /var/lib/hadoop/data/b 160530 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1118&var-datasource=eqiad+prometheus/ops
[03:43:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::ee38:73ff:fe75:38c4 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:48:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::ee38:73ff:fe75:38c4 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:00:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:09:44] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:17:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:22:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-xjd7v - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[05:27:48] FIRING: [2x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-hw8sw - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[05:32:59] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[05:32:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[05:36:14] PROBLEM - Disk space on an-worker1122 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 142011 MB (3% inode=99%): /var/lib/hadoop/data/e 144293 MB (3% inode=99%): /var/lib/hadoop/data/f 148107 MB (3% inode=99%): /var/lib/hadoop/data/b 145139 MB (3% inode=99%): /var/lib/hadoop/data/g 145837 MB (3% inode=99%): /var/lib/hadoop/data/d 131241 MB (3% inode=99%): /var/lib/hadoop/data/j 152154 MB (4% inode=99%): /var/lib/hadoop/data
[05:36:14] 0 MB (3% inode=99%): /var/lib/hadoop/data/h 146210 MB (3% inode=99%): /var/lib/hadoop/data/l 147476 MB (3% inode=99%): /var/lib/hadoop/data/k 144648 MB (3% inode=99%): /var/lib/hadoop/data/m 144432 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1122&var-datasource=eqiad+prometheus/ops
[05:37:53] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:45:04] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 3 (gerrit1003, ...), Fresh: 134 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:52:53] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:02:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:09:44] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:09:29] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:11:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:42:48] FIRING: [2x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-hw8sw - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[07:45:04] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:46:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:47:48] RESOLVED: [2x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-hw8sw - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[09:32:59] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[09:32:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[09:36:01] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[09:39:16] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[09:39:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1157 (T400854)', diff saved to https://phabricator.wikimedia.org/P80481 and previous config saved to /var/cache/conftool/dbconfig/20250802-093924-ladsgroup.json
[09:39:27] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[09:44:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T400854)', diff saved to https://phabricator.wikimedia.org/P80482 and previous config saved to /var/cache/conftool/dbconfig/20250802-094416-ladsgroup.json
[09:45:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:50:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:59:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P80483 and previous config saved to /var/cache/conftool/dbconfig/20250802-095923-ladsgroup.json
[10:02:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:07:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:14:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P80484 and previous config saved to /var/cache/conftool/dbconfig/20250802-101431-ladsgroup.json
[10:29:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T400854)', diff saved to https://phabricator.wikimedia.org/P80485 and previous config saved to /var/cache/conftool/dbconfig/20250802-102938-ladsgroup.json
[10:29:43] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[10:29:54] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[10:30:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T400854)', diff saved to https://phabricator.wikimedia.org/P80486 and previous config saved to /var/cache/conftool/dbconfig/20250802-103001-ladsgroup.json
[10:34:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T400854)', diff saved to https://phabricator.wikimedia.org/P80487 and previous config saved to /var/cache/conftool/dbconfig/20250802-103452-ladsgroup.json
[10:34:56] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[10:39:44] (Abandoned) Jforrester: ZObjectContentHandler::fillParserOutput: Don't try to add bad links [extensions/WikiLambda] (wmf/1.45.0-wmf.12) - https://gerrit.wikimedia.org/r/1174455 (https://phabricator.wikimedia.org/T400521) (owner: Jforrester)
[10:46:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:47:06] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[10:50:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P80488 and previous config saved to /var/cache/conftool/dbconfig/20250802-104959-ladsgroup.json
[10:51:53] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:56:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:05:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P80489 and previous config saved to /var/cache/conftool/dbconfig/20250802-110507-ladsgroup.json
[11:09:29] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[11:20:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T400854)', diff saved to https://phabricator.wikimedia.org/P80490 and previous config saved to /var/cache/conftool/dbconfig/20250802-112015-ladsgroup.json
[11:20:18] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[11:20:30] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[11:20:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T400854)', diff saved to https://phabricator.wikimedia.org/P80491 and previous config saved to /var/cache/conftool/dbconfig/20250802-112037-ladsgroup.json
[11:25:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T400854)', diff saved to https://phabricator.wikimedia.org/P80492 and previous config saved to /var/cache/conftool/dbconfig/20250802-112527-ladsgroup.json
[11:25:31] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[11:40:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P80493 and previous config saved to /var/cache/conftool/dbconfig/20250802-114035-ladsgroup.json
[11:55:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P80494 and previous config saved to /var/cache/conftool/dbconfig/20250802-115542-ladsgroup.json
[12:10:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-cjwwr - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[12:10:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T400854)', diff saved to https://phabricator.wikimedia.org/P80495 and previous config saved to /var/cache/conftool/dbconfig/20250802-121050-ladsgroup.json
[12:10:54] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[12:11:05] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[12:11:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T400854)', diff saved to https://phabricator.wikimedia.org/P80496 and previous config saved to /var/cache/conftool/dbconfig/20250802-121112-ladsgroup.json
[12:15:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T400854)', diff saved to https://phabricator.wikimedia.org/P80497 and previous config saved to /var/cache/conftool/dbconfig/20250802-121557-ladsgroup.json
[12:16:01] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[12:31:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P80498 and previous config saved to /var/cache/conftool/dbconfig/20250802-123105-ladsgroup.json
[12:36:55] SRE, LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11055358 (Novem_Linguae) SSH public key: https://en.wikipedia.org/w/index.php?title=User:Novem_Linguae/sandbox&oldid=1303853373
[12:41:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[12:46:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P80499 and previous config saved to /var/cache/conftool/dbconfig/20250802-124612-ladsgroup.json
[12:47:02] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[12:56:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[13:01:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T400854)', diff saved to https://phabricator.wikimedia.org/P80500 and previous config saved to /var/cache/conftool/dbconfig/20250802-130120-ladsgroup.json
[13:01:23] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[13:01:36] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[13:01:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T400854)', diff saved to https://phabricator.wikimedia.org/P80501 and previous config saved to /var/cache/conftool/dbconfig/20250802-130143-ladsgroup.json
[13:06:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T400854)', diff saved to https://phabricator.wikimedia.org/P80502 and previous config saved to /var/cache/conftool/dbconfig/20250802-130629-ladsgroup.json
[13:06:38] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[13:21:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P80503 and previous config saved to /var/cache/conftool/dbconfig/20250802-132137-ladsgroup.json
[13:30:48] FIRING: [2x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-9sqf5 - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[13:32:59] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[13:32:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[13:36:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P80504 and previous config saved to /var/cache/conftool/dbconfig/20250802-133645-ladsgroup.json
[13:51:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T400854)', diff saved to https://phabricator.wikimedia.org/P80505 and previous config saved to /var/cache/conftool/dbconfig/20250802-135152-ladsgroup.json
[13:51:56] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[13:52:09] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[13:52:27] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[13:52:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T400854)', diff saved to https://phabricator.wikimedia.org/P80506 and previous config saved to /var/cache/conftool/dbconfig/20250802-135234-ladsgroup.json
[13:57:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T400854)', diff saved to https://phabricator.wikimedia.org/P80507 and previous config saved to /var/cache/conftool/dbconfig/20250802-135748-ladsgroup.json
[13:57:52] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[14:12:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P80508 and previous config saved to /var/cache/conftool/dbconfig/20250802-141256-ladsgroup.json
[14:28:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P80509 and previous config saved to /var/cache/conftool/dbconfig/20250802-142803-ladsgroup.json
[14:30:48] FIRING: [2x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-9sqf5 - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[14:43:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T400854)', diff saved to https://phabricator.wikimedia.org/P80510 and previous config saved to /var/cache/conftool/dbconfig/20250802-144311-ladsgroup.json
[14:43:15] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[14:43:28] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[14:47:07] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[14:50:42] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[14:50:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T400854)', diff saved to https://phabricator.wikimedia.org/P80511 and previous config saved to /var/cache/conftool/dbconfig/20250802-145049-ladsgroup.json
[14:50:53] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[14:57:40] PROBLEM - Disk space on an-worker1139 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 186598 MB (4% inode=99%): /var/lib/hadoop/data/f 175758 MB (4% inode=99%): /var/lib/hadoop/data/j 154886 MB (4% inode=99%): /var/lib/hadoop/data/m 142301 MB (3% inode=99%): /var/lib/hadoop/data/h 182880 MB (4% inode=99%): /var/lib/hadoop/data/k 167780 MB (4% inode=99%): /var/lib/hadoop/data/e 185857 MB (4% inode=99%): /var/lib/hadoop/data
[14:57:40] 5 MB (4% inode=99%): /var/lib/hadoop/data/b 176261 MB (4% inode=99%): /var/lib/hadoop/data/d 180304 MB (4% inode=99%): /var/lib/hadoop/data/i 156832 MB (4% inode=99%): /var/lib/hadoop/data/l 181466 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1139&var-datasource=eqiad+prometheus/ops
[15:04:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T400854)', diff saved to https://phabricator.wikimedia.org/P80512 and previous config saved to /var/cache/conftool/dbconfig/20250802-150446-ladsgroup.json
[15:04:49] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[15:09:29] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[15:09:44] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:19:44] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:19:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P80513 and previous config saved to /var/cache/conftool/dbconfig/20250802-151953-ladsgroup.json
[15:35:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P80514 and previous config saved to /var/cache/conftool/dbconfig/20250802-153501-ladsgroup.json
[15:50:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T400854)', diff saved to https://phabricator.wikimedia.org/P80515 and previous config saved to /var/cache/conftool/dbconfig/20250802-155008-ladsgroup.json
[15:50:12] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[15:50:25] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance
[15:50:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T400854)', diff saved to https://phabricator.wikimedia.org/P80516 and previous config saved to /var/cache/conftool/dbconfig/20250802-155032-ladsgroup.json
[15:50:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-9sqf5 - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[16:04:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T400854)', diff saved to https://phabricator.wikimedia.org/P80517 and previous config saved to /var/cache/conftool/dbconfig/20250802-160426-ladsgroup.json
[16:04:30] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[16:19:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P80518 and previous config saved to /var/cache/conftool/dbconfig/20250802-161933-ladsgroup.json
[16:26:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-r7zkv - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[16:34:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P80519 and previous config saved to /var/cache/conftool/dbconfig/20250802-163441-ladsgroup.json
[16:49:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T400854)', diff saved to https://phabricator.wikimedia.org/P80520 and previous config saved to /var/cache/conftool/dbconfig/20250802-164949-ladsgroup.json
[16:49:52] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[16:50:04] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance
[16:50:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T400854)', diff saved to https://phabricator.wikimedia.org/P80521 and previous config saved to /var/cache/conftool/dbconfig/20250802-165012-ladsgroup.json
[16:52:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[16:57:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[17:04:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T400854)', diff saved to https://phabricator.wikimedia.org/P80522 and previous config saved to /var/cache/conftool/dbconfig/20250802-170407-ladsgroup.json
[17:04:10] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[17:19:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P80523 and previous config saved to /var/cache/conftool/dbconfig/20250802-171914-ladsgroup.json
[17:32:59] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[17:32:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [17:33:48] !log clean up some misbehaving thumbor pods [17:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P80524 and previous config saved to /var/cache/conftool/dbconfig/20250802-173422-ladsgroup.json [17:37:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [17:37:44] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [17:49:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T400854)', diff saved to https://phabricator.wikimedia.org/P80525 and previous config saved to /var/cache/conftool/dbconfig/20250802-174929-ladsgroup.json [17:49:33] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [17:49:45] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2190.codfw.wmnet with reason: Maintenance [17:49:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T400854)', diff saved 
to https://phabricator.wikimedia.org/P80526 and previous config saved to /var/cache/conftool/dbconfig/20250802-174952-ladsgroup.json [18:04:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T400854)', diff saved to https://phabricator.wikimedia.org/P80527 and previous config saved to /var/cache/conftool/dbconfig/20250802-180406-ladsgroup.json [18:04:13] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [18:19:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P80528 and previous config saved to /var/cache/conftool/dbconfig/20250802-181914-ladsgroup.json [18:34:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P80529 and previous config saved to /var/cache/conftool/dbconfig/20250802-183421-ladsgroup.json [18:46:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-r7zkv - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [18:49:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T400854)', diff saved to https://phabricator.wikimedia.org/P80530 and previous config saved to /var/cache/conftool/dbconfig/20250802-184929-ladsgroup.json [18:49:33] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [18:49:45] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2194.codfw.wmnet with reason: Maintenance [18:49:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T400854)', diff saved to https://phabricator.wikimedia.org/P80531 and previous 
config saved to /var/cache/conftool/dbconfig/20250802-184952-ladsgroup.json [19:04:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T400854)', diff saved to https://phabricator.wikimedia.org/P80532 and previous config saved to /var/cache/conftool/dbconfig/20250802-190408-ladsgroup.json [19:04:12] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [19:09:29] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:12:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-jjjrw - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [19:19:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P80533 and previous config saved to /var/cache/conftool/dbconfig/20250802-191915-ladsgroup.json [19:34:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P80534 and previous config saved to /var/cache/conftool/dbconfig/20250802-193423-ladsgroup.json [19:38:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[19:38:44] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [19:41:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:49:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T400854)', diff saved to https://phabricator.wikimedia.org/P80535 and previous config saved to /var/cache/conftool/dbconfig/20250802-194931-ladsgroup.json [19:49:35] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [19:49:46] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2209.codfw.wmnet with reason: Maintenance [19:49:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2209 (T400854)', diff saved to https://phabricator.wikimedia.org/P80536 and previous config saved to /var/cache/conftool/dbconfig/20250802-194953-ladsgroup.json [20:04:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T400854)', diff saved to https://phabricator.wikimedia.org/P80537 and previous config saved to /var/cache/conftool/dbconfig/20250802-200405-ladsgroup.json [20:04:09] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [20:19:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff 
saved to https://phabricator.wikimedia.org/P80538 and previous config saved to /var/cache/conftool/dbconfig/20250802-201913-ladsgroup.json [20:34:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P80539 and previous config saved to /var/cache/conftool/dbconfig/20250802-203421-ladsgroup.json [20:44:58] RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Sat 30 Aug 2025 08:10:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [20:49:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T400854)', diff saved to https://phabricator.wikimedia.org/P80540 and previous config saved to /var/cache/conftool/dbconfig/20250802-204928-ladsgroup.json [20:49:32] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [20:49:44] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2227.codfw.wmnet with reason: Maintenance [20:49:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2227 (T400854)', diff saved to https://phabricator.wikimedia.org/P80541 and previous config saved to /var/cache/conftool/dbconfig/20250802-204951-ladsgroup.json [21:00:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [21:04:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T400854)', diff saved to https://phabricator.wikimedia.org/P80542 and previous config saved to 
/var/cache/conftool/dbconfig/20250802-210406-ladsgroup.json [21:04:10] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [21:04:36] FIRING: RESTGatewayBackendErrorsHigh: rest-gateway: high 5xx errors from mw-api-int_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DRESTGatewayBackendErrorsHigh [21:09:36] RESOLVED: RESTGatewayBackendErrorsHigh: rest-gateway: high 5xx errors from mw-api-int_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DRESTGatewayBackendErrorsHigh [21:10:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [21:16:14] PROBLEM - Disk space on an-worker1122 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 153469 MB (4% inode=99%): /var/lib/hadoop/data/e 149492 MB (3% inode=99%): /var/lib/hadoop/data/f 152705 MB (4% inode=99%): /var/lib/hadoop/data/b 151615 MB (4% inode=99%): /var/lib/hadoop/data/g 152946 MB (4% inode=99%): /var/lib/hadoop/data/d 145862 MB (3% inode=99%): /var/lib/hadoop/data/j 152100 MB (4% inode=99%): /var/lib/hadoop/data [21:16:14] 3 MB (4% inode=99%): /var/lib/hadoop/data/h 156217 MB (4% 
inode=99%): /var/lib/hadoop/data/l 149914 MB (3% inode=99%): /var/lib/hadoop/data/k 150590 MB (4% inode=99%): /var/lib/hadoop/data/m 145081 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1122&var-datasource=eqiad+prometheus/ops [21:19:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P80543 and previous config saved to /var/cache/conftool/dbconfig/20250802-211914-ladsgroup.json [21:32:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-jjjrw - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [21:34:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P80544 and previous config saved to /var/cache/conftool/dbconfig/20250802-213421-ladsgroup.json [21:49:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T400854)', diff saved to https://phabricator.wikimedia.org/P80546 and previous config saved to /var/cache/conftool/dbconfig/20250802-214929-ladsgroup.json [21:49:33] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [21:49:45] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2239.codfw.wmnet with reason: Maintenance [22:04:56] (03PS1) 10NMW03: Add rights to bypass spam blacklists for azwiki sysops and interface-admins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175222 (https://phabricator.wikimedia.org/T400428) [22:05:45] (03CR) 10CI reject: [V:04-1] Add rights to bypass spam blacklists for azwiki sysops and interface-admins [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1175222 (https://phabricator.wikimedia.org/T400428) (owner: 10NMW03) [22:06:41] (03PS2) 10NMW03: Add rights to bypass spam blacklists for azwiki sysops and interface-admins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175222 (https://phabricator.wikimedia.org/T400428) [22:56:14] PROBLEM - Disk space on an-worker1122 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 148105 MB (3% inode=99%): /var/lib/hadoop/data/e 154172 MB (4% inode=99%): /var/lib/hadoop/data/f 164746 MB (4% inode=99%): /var/lib/hadoop/data/b 167843 MB (4% inode=99%): /var/lib/hadoop/data/g 156931 MB (4% inode=99%): /var/lib/hadoop/data/d 163245 MB (4% inode=99%): /var/lib/hadoop/data/j 169070 MB (4% inode=99%): /var/lib/hadoop/data [22:56:14] 4 MB (4% inode=99%): /var/lib/hadoop/data/h 153462 MB (4% inode=99%): /var/lib/hadoop/data/l 166420 MB (4% inode=99%): /var/lib/hadoop/data/k 167604 MB (4% inode=99%): /var/lib/hadoop/data/m 154190 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1122&var-datasource=eqiad+prometheus/ops [23:09:29] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:38:03] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1175224 [23:38:03] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1175224 (owner: 10TrainBranchBot) [23:38:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[23:38:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [23:41:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:52:12] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1175224 (owner: 10TrainBranchBot)