[00:00:02] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1172727 (owner: 10TrainBranchBot) [00:08:02] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1172833 [00:08:02] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1172833 (owner: 10TrainBranchBot) [00:25:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:28:40] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1172833 (owner: 10TrainBranchBot) [01:46:14] PROBLEM - Disk space on an-worker1129 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/b 155838 MB (4% inode=99%): /var/lib/hadoop/data/l 156446 MB (4% inode=99%): /var/lib/hadoop/data/k 147920 MB (3% inode=99%): /var/lib/hadoop/data/c 152385 MB (4% inode=99%): /var/lib/hadoop/data/d 153421 MB (4% inode=99%): /var/lib/hadoop/data/e 153047 MB (4% inode=99%): /var/lib/hadoop/data/g 155067 MB (4% inode=99%): /var/lib/hadoop/data [01:46:14] 0 MB (4% inode=99%): /var/lib/hadoop/data/i 150841 MB (4% inode=99%): /var/lib/hadoop/data/j 158611 MB (4% inode=99%): /var/lib/hadoop/data/m 152235 MB (4% inode=99%): /var/lib/hadoop/data/f 158683 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1129&var-datasource=eqiad+prometheus/ops [02:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:54:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:13:20] PROBLEM - Disk space on an-worker1118 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 149109 MB (3% inode=99%): /var/lib/hadoop/data/e 153806 MB (4% inode=99%): /var/lib/hadoop/data/m 158735 MB (4% inode=99%): /var/lib/hadoop/data/k 158871 MB (4% inode=99%): /var/lib/hadoop/data/f 157254 MB (4% inode=99%): /var/lib/hadoop/data/g 156963 MB (4% inode=99%): /var/lib/hadoop/data/h 158380 MB (4% inode=99%): /var/lib/hadoop/data [03:13:20] 5 MB (4% inode=99%): /var/lib/hadoop/data/j 149046 MB (3% inode=99%): /var/lib/hadoop/data/c 155561 MB (4% inode=99%): /var/lib/hadoop/data/l 157365 MB (4% inode=99%): /var/lib/hadoop/data/b 154168 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1118&var-datasource=eqiad+prometheus/ops [03:37:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [03:37:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [05:09:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:19:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:57:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-27rht - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [06:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:54:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250727T0700) [07:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:37:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [07:37:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [07:50:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [07:52:13] !incidents [07:52:14] 6513 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [07:52:29] !ack 6513 [07:52:29] 6513 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [07:53:40] looks like eqsin and availability is recovering already, I'll check the traffic in superset [07:55:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [07:59:40] some more details in _security [08:18:17] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-27rht - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [08:45:02] (03CR) 10C. Scott Ananian: [C:03+1] Deploy Parsoid Read Views to 39 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172711 (https://phabricator.wikimedia.org/T400510) (owner: 10Arlolra) [10:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:54:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:31:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:33:18] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:37:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [11:37:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [12:31:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:33:18] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:19:31] 10SRE-swift-storage, 06Commons, 10media-backups: File not found: /v1/AUTH_mw/wikipedia-commons-local-public ... for 3 files - https://phabricator.wikimedia.org/T400567 (10Raymond) 03NEW [14:20:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-hc58s - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [14:54:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:09:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:19:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:48] FIRING: [2x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-hc58s - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [15:37:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [15:37:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [16:31:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:40:48] FIRING: [2x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-hc58s - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [17:00:13] (03PS4) 10Anzx: aswikisource: add publisher (প্ৰকাশক) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172865 (https://phabricator.wikimedia.org/T399269) [17:00:13] (03CR) 10Anzx: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172865 (https://phabricator.wikimedia.org/T399269) (owner: 10Anzx) [17:22:17] FIRING: ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:27:17] FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:44:08] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: sync [17:44:49] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [17:47:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [17:47:44] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [17:50:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-k9mv6 - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [18:54:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:54:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32391312 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:59:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32391312 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:11:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [19:26:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [20:31:41] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:27:32] FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:54:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:22:26] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:26:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:37:53] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1172927 [23:37:53] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1172927 (owner: 10TrainBranchBot) [23:39:46] (03PS3) 10Krinkle: Set wgCentralBannerRecorder to /beacon/… instead of //example.org/beacon/… [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161757 (https://phabricator.wikimedia.org/T400586) [23:50:05] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1172927 (owner: 10TrainBranchBot)