[00:03:06] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:03:52] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11070139 (VRiley-WMF)
[00:06:44] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:08:10] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1176554
[00:08:10] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1176554 (owner: TrainBranchBot)
[00:08:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:19:48] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1047.eqiad.wmnet with OS bullseye
[00:19:56] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11070152 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bullseye
[00:29:37] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1176554 (owner: TrainBranchBot)
[01:00:47] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[01:07:40] PROBLEM - Disk space on an-worker1120 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 160815 MB (4% inode=99%): /var/lib/hadoop/data/m 157583 MB (4% inode=99%): /var/lib/hadoop/data/d 155051 MB (4% inode=99%): /var/lib/hadoop/data/b 156917 MB (4% inode=99%): /var/lib/hadoop/data/e 158613 MB (4% inode=99%): /var/lib/hadoop/data/g 156523 MB (4% inode=99%): /var/lib/hadoop/data/f 159898 MB (4% inode=99%): /var/lib/hadoop/data
[01:07:40] 4 MB (4% inode=99%): /var/lib/hadoop/data/i 156572 MB (4% inode=99%): /var/lib/hadoop/data/j 159482 MB (4% inode=99%): /var/lib/hadoop/data/l 157500 MB (4% inode=99%): /var/lib/hadoop/data/c 149125 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1120&var-datasource=eqiad+prometheus/ops
[01:11:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:12:58] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 12m 11s)
[01:21:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:41:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-esams and Deutsche Telekom (2001:7f8:1::a500:3320:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[01:56:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-esams and Deutsche Telekom (2001:7f8:1::a500:3320:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[03:04:31] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[03:09:31] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:10:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[03:25:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[05:08:09] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:39:20] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250808T0600)
[06:13:09] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:47:33] SRE, DNS, Traffic: Set mediawiki.gr, wikipedia.pt, and wiktionary.org.uk NS records to WMF - https://phabricator.wikimedia.org/T401438#11070323 (geraki) Not sure why mediawiki.gr is still pointing to nswebhost.com—probably due to oversight or simply forgotten while waiting for a potential use. Feel f...
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250808T0700)
[07:04:31] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[07:09:31] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:11:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:21:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[08:00:32] PROBLEM - Disk space on an-worker1127 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 164660 MB (4% inode=99%): /var/lib/hadoop/data/g 159900 MB (4% inode=99%): /var/lib/hadoop/data/j 172994 MB (4% inode=99%): /var/lib/hadoop/data/c 164953 MB (4% inode=99%): /var/lib/hadoop/data/b 179912 MB (4% inode=99%): /var/lib/hadoop/data/l 162889 MB (4% inode=99%): /var/lib/hadoop/data/k 171280 MB (4% inode=99%): /var/lib/hadoop/data
[08:00:32] 3 MB (4% inode=99%): /var/lib/hadoop/data/i 165100 MB (4% inode=99%): /var/lib/hadoop/data/m 171744 MB (4% inode=99%): /var/lib/hadoop/data/d 142160 MB (3% inode=99%): /var/lib/hadoop/data/h 174760 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1127&var-datasource=eqiad+prometheus/ops
[08:24:05] gerrit seems down?
[08:24:49] Hmm, seems to work from inside the cluster.
[08:52:04] PROBLEM - Disk space on an-worker1140 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/g 147154 MB (3% inode=99%): /var/lib/hadoop/data/b 183243 MB (4% inode=99%): /var/lib/hadoop/data/c 181238 MB (4% inode=99%): /var/lib/hadoop/data/d 157278 MB (4% inode=99%): /var/lib/hadoop/data/e 167350 MB (4% inode=99%): /var/lib/hadoop/data/f 174260 MB (4% inode=99%): /var/lib/hadoop/data/h 175648 MB (4% inode=99%): /var/lib/hadoop/data
[08:52:04] 8 MB (4% inode=99%): /var/lib/hadoop/data/j 157852 MB (4% inode=99%): /var/lib/hadoop/data/k 174475 MB (4% inode=99%): /var/lib/hadoop/data/l 180509 MB (4% inode=99%): /var/lib/hadoop/data/m 188724 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1140&var-datasource=eqiad+prometheus/ops
[09:24:13] (Abandoned) Phuedx: DNM: MetricsPlatform: Disable logged-in experiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1176533 (owner: Phuedx)
[09:36:11] (CR) Cathal Mooney: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1175887 (owner: Ayounsi)
[09:38:45] SRE-OnFire, WMDE-TechWish-Maintenance, Sustainability (Incident Followup): Split out reusable Parsoid+Cite analysis module from scraper - https://phabricator.wikimedia.org/T401334#11070553 (awight)
[09:39:35] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[09:40:32] (CR) Cathal Mooney: [C: +1] gNMI: initial Nokia support [puppet] - https://gerrit.wikimedia.org/r/1175887 (owner: Ayounsi)
[09:40:46] (CR) Cathal Mooney: [C: +1] Replace SONIC grpc port with Nokia's in MR ACLs [homer/public] - https://gerrit.wikimedia.org/r/1175872 (owner: Ayounsi)
[09:53:20] PROBLEM - Disk space on an-worker1118 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 156206 MB (4% inode=99%): /var/lib/hadoop/data/e 155470 MB (4% inode=99%): /var/lib/hadoop/data/m 155471 MB (4% inode=99%): /var/lib/hadoop/data/k 156272 MB (4% inode=99%): /var/lib/hadoop/data/f 153939 MB (4% inode=99%): /var/lib/hadoop/data/g 149367 MB (3% inode=99%): /var/lib/hadoop/data/h 156848 MB (4% inode=99%): /var/lib/hadoop/data
[09:53:20] 7 MB (4% inode=99%): /var/lib/hadoop/data/j 151330 MB (4% inode=99%): /var/lib/hadoop/data/c 154766 MB (4% inode=99%): /var/lib/hadoop/data/l 155704 MB (4% inode=99%): /var/lib/hadoop/data/b 158365 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1118&var-datasource=eqiad+prometheus/ops
[10:54:02] (PS1) Zabe: Stop writing to cl_to and cl_collation on small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1176661 (https://phabricator.wikimedia.org/T399579)
[11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250808T0700)
[11:00:05] jelto, arnoldokoth, and mutante: Time to do the GitLab version upgrades deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250808T1100).
[11:04:31] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[11:09:31] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[11:12:04] PROBLEM - Disk space on an-worker1140 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/g 181811 MB (4% inode=99%): /var/lib/hadoop/data/b 173932 MB (4% inode=99%): /var/lib/hadoop/data/c 154521 MB (4% inode=99%): /var/lib/hadoop/data/d 182669 MB (4% inode=99%): /var/lib/hadoop/data/e 170863 MB (4% inode=99%): /var/lib/hadoop/data/f 172285 MB (4% inode=99%): /var/lib/hadoop/data/h 153530 MB (4% inode=99%): /var/lib/hadoop/data
[11:12:04] 1 MB (5% inode=99%): /var/lib/hadoop/data/j 158160 MB (4% inode=99%): /var/lib/hadoop/data/k 160489 MB (4% inode=99%): /var/lib/hadoop/data/l 149223 MB (3% inode=99%): /var/lib/hadoop/data/m 189529 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1140&var-datasource=eqiad+prometheus/ops
[12:13:20] PROBLEM - Disk space on an-worker1118 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 160718 MB (4% inode=99%): /var/lib/hadoop/data/e 156650 MB (4% inode=99%): /var/lib/hadoop/data/m 156608 MB (4% inode=99%): /var/lib/hadoop/data/k 156993 MB (4% inode=99%): /var/lib/hadoop/data/f 153474 MB (4% inode=99%): /var/lib/hadoop/data/g 153807 MB (4% inode=99%): /var/lib/hadoop/data/h 160711 MB (4% inode=99%): /var/lib/hadoop/data
[12:13:20] 6 MB (4% inode=99%): /var/lib/hadoop/data/j 150108 MB (3% inode=99%): /var/lib/hadoop/data/c 159120 MB (4% inode=99%): /var/lib/hadoop/data/l 160738 MB (4% inode=99%): /var/lib/hadoop/data/b 160602 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1118&var-datasource=eqiad+prometheus/ops
[13:07:40] PROBLEM - Disk space on an-worker1120 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 159073 MB (4% inode=99%): /var/lib/hadoop/data/m 158498 MB (4% inode=99%): /var/lib/hadoop/data/d 148371 MB (3% inode=99%): /var/lib/hadoop/data/b 157416 MB (4% inode=99%): /var/lib/hadoop/data/e 163913 MB (4% inode=99%): /var/lib/hadoop/data/g 159405 MB (4% inode=99%): /var/lib/hadoop/data/f 158170 MB (4% inode=99%): /var/lib/hadoop/data
[13:07:40] 7 MB (4% inode=99%): /var/lib/hadoop/data/i 154788 MB (4% inode=99%): /var/lib/hadoop/data/j 158195 MB (4% inode=99%): /var/lib/hadoop/data/l 158206 MB (4% inode=99%): /var/lib/hadoop/data/c 154552 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1120&var-datasource=eqiad+prometheus/ops
[13:39:35] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[14:22:41] (PS2) Mhorsey: Add ce_event_topics to mariadb tables catalog [puppet] - https://gerrit.wikimedia.org/r/1175514 (https://phabricator.wikimedia.org/T399302)
[14:22:44] (CR) Ladsgroup: [C: +2] Add ce_event_topics to mariadb tables catalog [puppet] - https://gerrit.wikimedia.org/r/1175514 (https://phabricator.wikimedia.org/T399302) (owner: Mhorsey)
[14:22:46] (CR) Ladsgroup: [V: +2 C: +2] Add ce_event_topics to mariadb tables catalog [puppet] - https://gerrit.wikimedia.org/r/1175514 (https://phabricator.wikimedia.org/T399302) (owner: Mhorsey)
[14:46:36] (PS1) Krinkle: multiversion: Fix manage-dblist "add" for Beta Cluster as well [mediawiki-config] - https://gerrit.wikimedia.org/r/1176708
[14:47:21] (CR) Krinkle: [C: +1] build: upgrade QUnit [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1175475 (owner: Hashar)
[14:47:30] (CR) CI reject: [V: -1] multiversion: Fix manage-dblist "add" for Beta Cluster as well [mediawiki-config] - https://gerrit.wikimedia.org/r/1176708 (owner: Krinkle)
[15:03:46] (CR) Zabe: multiversion: Fix manage-dblist "add" for Beta Cluster as well (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1176708 (owner: Krinkle)
[15:04:32] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[15:08:09] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:31] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[15:14:48] (CR) Krinkle: multiversion: Fix manage-dblist "add" for Beta Cluster as well (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1176708 (owner: Krinkle)
[15:19:31] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:35:12] (PS1) Krinkle: tests: Improve false-positive testOnlyExistingWikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1176714
[15:35:12] (PS1) Krinkle: WmfConfig: Document why 'preinstall' is indexed [mediawiki-config] - https://gerrit.wikimedia.org/r/1176715
[15:35:12] (PS1) Krinkle: manage-dblist: Remove mention of non-existant "preinstall-labs" [mediawiki-config] - https://gerrit.wikimedia.org/r/1176716
[15:35:23] (PS2) Krinkle: manage-dblist: Remove mention of non-existant "preinstall-labs" [mediawiki-config] - https://gerrit.wikimedia.org/r/1176708
[15:35:51] (Abandoned) Krinkle: manage-dblist: Remove mention of non-existant "preinstall-labs" [mediawiki-config] - https://gerrit.wikimedia.org/r/1176716 (owner: Krinkle)
[15:38:16] (PS2) Krinkle: WmfConfig: Document why 'preinstall' is indexed [mediawiki-config] - https://gerrit.wikimedia.org/r/1176715
[15:38:16] (PS3) Krinkle: manage-dblist: Remove mention of non-existant "preinstall-labs" [mediawiki-config] - https://gerrit.wikimedia.org/r/1176708
[15:39:48] (PS3) Krinkle: WmfConfig: Document why 'preinstall' is indexed [mediawiki-config] - https://gerrit.wikimedia.org/r/1176715
[15:39:48] (PS4) Krinkle: manage-dblist: Remove mention of non-existant "preinstall-labs" [mediawiki-config] - https://gerrit.wikimedia.org/r/1176708
[15:39:54] (CR) Krinkle: manage-dblist: Remove mention of non-existant "preinstall-labs" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1176708 (owner: Krinkle)
[15:43:52] (CR) Zabe: [C: +1] manage-dblist: Remove mention of non-existant "preinstall-labs" [mediawiki-config] - https://gerrit.wikimedia.org/r/1176708 (owner: Krinkle)
[15:56:55] (PS2) Krinkle: tests: Improve false-positive testOnlyExistingWikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1176714
[15:58:21] (PS3) Krinkle: tests: Improve false-positive testOnlyExistingWikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1176714
[15:58:26] (PS4) Krinkle: WmfConfig: Document why 'preinstall' is indexed [mediawiki-config] - https://gerrit.wikimedia.org/r/1176715
[15:58:33] (PS5) Krinkle: manage-dblist: Remove mention of non-existant "preinstall-labs" [mediawiki-config] - https://gerrit.wikimedia.org/r/1176708
[16:12:04] PROBLEM - Disk space on an-worker1140 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/g 149249 MB (3% inode=99%): /var/lib/hadoop/data/b 167623 MB (4% inode=99%): /var/lib/hadoop/data/c 179263 MB (4% inode=99%): /var/lib/hadoop/data/d 171248 MB (4% inode=99%): /var/lib/hadoop/data/e 159643 MB (4% inode=99%): /var/lib/hadoop/data/f 149715 MB (3% inode=99%): /var/lib/hadoop/data/h 145477 MB (3% inode=99%): /var/lib/hadoop/data
[16:12:04] 7 MB (4% inode=99%): /var/lib/hadoop/data/j 162535 MB (4% inode=99%): /var/lib/hadoop/data/k 158048 MB (4% inode=99%): /var/lib/hadoop/data/l 161013 MB (4% inode=99%): /var/lib/hadoop/data/m 166517 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1140&var-datasource=eqiad+prometheus/ops
[16:19:10] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 578349112 and 37 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:20:10] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7279080 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:42:10] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 189253200 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:45:10] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:34:10] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 397872320 and 25 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:35:10] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 23816 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:39:35] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[17:59:10] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 725250272 and 45 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:00:10] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 27488 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:06:20] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:06:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:06:34] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:07:10] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54369 bytes in 0.141 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:07:24] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:09:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:09:26] SRE, Continuous-Integration-Infrastructure, Release-Engineering-Team, Traffic: Investigate ATS/Varnish serving stall/cached Zuul status json - https://phabricator.wikimedia.org/T341548#11071139 (Krinkle) >>! In T341548#9004449, @hashar wrote: > The backend service providing the data for https://i...
[18:23:10] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 475687480 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:27:10] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 6654632 and 49 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:49:10] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 179793080 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:50:10] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 79792 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:04:31] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[19:06:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:09:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.684s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:09:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:09:31] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[19:12:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:14:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.684s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:22:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:09:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:09:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:23:50] (PS2) Krinkle: MobileUrlCallback: Disable for thankyou.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1174578 (https://phabricator.wikimedia.org/T400855)
[20:50:26] (PS3) Krinkle: Disable MobileFrontend on thankyou.wikipedia.org and nostalgia.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1174578 (https://phabricator.wikimedia.org/T400855)
[20:51:15] (CR) CI reject: [V: -1] Disable MobileFrontend on thankyou.wikipedia.org and nostalgia.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1174578 (https://phabricator.wikimedia.org/T400855) (owner: Krinkle)
[20:52:24] SRE, DNS, Traffic-Icebox, Mobile, Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11071303 (Krinkle)
[20:58:20] SRE, DNS, Traffic-Icebox, Mobile, Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11071306 (Krinkle) >>! From the task description: > * login.m.wikimedia.org I've ticked this as "not needed". This wiki intentionally does not have mobile view...
[21:09:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:09:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:10:22] (PS1) Krinkle: wikipedia.org: Fix grouping of wikis and non-wikis [dns] - https://gerrit.wikimedia.org/r/1176725 (https://phabricator.wikimedia.org/T152882)
[21:10:55] (PS2) Krinkle: wikipedia.org: Fix grouping of wikis and non-wikis [dns] - https://gerrit.wikimedia.org/r/1176725 (https://phabricator.wikimedia.org/T152882)
[21:39:36] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[22:13:20] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:13:34] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:16:10] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54368 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:16:24] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:04:31] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[23:09:31] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[23:38:19] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1176734
[23:38:19] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1176734 (owner: TrainBranchBot)
[23:51:19] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1176734 (owner: TrainBranchBot)