[00:07:53] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1176784 [00:07:53] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1176784 (owner: 10TrainBranchBot) [00:43:26] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1176784 (owner: 10TrainBranchBot) [00:44:36] FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [01:00:33] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [01:12:02] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 28s) [02:36:37] FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [02:53:20] PROBLEM - Disk space on an-worker1118 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 159589 MB (4% inode=99%): /var/lib/hadoop/data/e 155202 MB (4% inode=99%): /var/lib/hadoop/data/m 158149 MB (4% inode=99%): /var/lib/hadoop/data/k 158437 MB (4% inode=99%): /var/lib/hadoop/data/f 159381 MB (4% inode=99%): /var/lib/hadoop/data/g 154439 MB (4% inode=99%): /var/lib/hadoop/data/h 159074 MB (4% inode=99%): /var/lib/hadoop/data [02:53:20] 9 MB (3% inode=99%): /var/lib/hadoop/data/j 159410 MB (4% inode=99%): /var/lib/hadoop/data/c 159072 MB (4% inode=99%): /var/lib/hadoop/data/l 152294 MB (4% inode=99%): /var/lib/hadoop/data/b 157168 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1118&var-datasource=eqiad+prometheus/ops [03:00:32] PROBLEM - Disk space on an-worker1121 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 157151 MB (4% inode=99%): /var/lib/hadoop/data/h 148603 MB (3% inode=99%): /var/lib/hadoop/data/b 157434 MB (4% inode=99%): /var/lib/hadoop/data/k 157351 MB (4% inode=99%): /var/lib/hadoop/data/m 150148 MB (3% inode=99%): /var/lib/hadoop/data/f 153267 MB (4% inode=99%): /var/lib/hadoop/data/j 154526 MB (4% inode=99%): /var/lib/hadoop/data [03:00:32] 4 MB (4% inode=99%): /var/lib/hadoop/data/l 158695 MB (4% inode=99%): /var/lib/hadoop/data/i 157051 MB (4% inode=99%): /var/lib/hadoop/data/g 157564 MB (4% inode=99%): /var/lib/hadoop/data/c 156482 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1121&var-datasource=eqiad+prometheus/ops [03:04:32] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [03:09:32] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:13:20] PROBLEM - Disk space on an-worker1118 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 155410 MB (4% inode=99%): /var/lib/hadoop/data/e 151165 MB (4% inode=99%): /var/lib/hadoop/data/m 155809 MB (4% inode=99%): /var/lib/hadoop/data/k 156888 MB (4% inode=99%): /var/lib/hadoop/data/f 151719 MB (4% inode=99%): /var/lib/hadoop/data/g 152604 MB (4% inode=99%): /var/lib/hadoop/data/h 153686 MB (4% inode=99%): /var/lib/hadoop/data [04:13:20] 8 MB (3% inode=99%): /var/lib/hadoop/data/j 153259 MB (4% inode=99%): /var/lib/hadoop/data/c 151491 MB (4% inode=99%): /var/lib/hadoop/data/l 150325 MB (4% inode=99%): /var/lib/hadoop/data/b 151208 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1118&var-datasource=eqiad+prometheus/ops [04:16:48] (03CR) 10Brian Wolff: "I worked on this patch back when i used to work for the security team. I don't anymore." [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [04:44:36] FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [05:08:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:09:32] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:13:10] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:15:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [06:20:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [06:36:37] FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250810T0700) [07:04:32] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [07:09:32] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:44:41] FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [08:47:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:47:14] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:41:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:51:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:19:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:29:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:36:38] FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [11:04:32] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:09:32] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:44:36] FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [14:36:38] FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [15:04:32] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:08:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:32] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:19:32] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:52:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:40:20] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:40:36] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:43:10] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54368 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:43:26] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:44:36] FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [18:36:38] FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [19:04:32] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:09:32] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:19:32] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:44:36] FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [22:36:38] FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [22:52:33] 06SRE, 10MediaWiki-extensions-QuickInstantCommons, 10MediaWiki-File-management, 06MediaWiki-Platform-Team, and 2 others: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11072670 (10Tgr) a:03Tgr [23:04:32] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [23:09:32] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:19:32] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:21:20] 06SRE, 10MediaWiki-extensions-QuickInstantCommons, 10MediaWiki-File-management, 06MediaWiki-Platform-Team, and 2 others: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11072676 (10Tgr) LocalSettings code snippet that in... [23:22:41] 06SRE, 10MediaWiki-extensions-QuickInstantCommons, 10MediaWiki-File-management, 06MediaWiki-Platform-Team, and 2 others: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11072677 (10Tgr) (This doesn't add a referer, which... [23:30:02] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1176781 (owner: 10TrainBranchBot) [23:37:53] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1176858 [23:37:53] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1176858 (owner: 10TrainBranchBot) [23:52:12] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1176858 (owner: 10TrainBranchBot)