[00:08:06] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1176867
[00:08:06] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1176867 (owner: 10TrainBranchBot)
[00:28:25] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1176867 (owner: 10TrainBranchBot)
[00:44:36] <jinxer-wm>	 FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[01:00:36] <logmsgbot>	 !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[01:12:20] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 43s)
[02:36:38] <jinxer-wm>	 FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[02:46:05] <wikibugs>	 (03PS1) 10Andrew Bogott: Nova vendordata: puppetize VMs with 'puppet-agent' package present [puppet] - 10https://gerrit.wikimedia.org/r/1176911
[02:47:28] <wikibugs>	 (03PS2) 10Andrew Bogott: Nova vendordata: puppetize VMs with 'puppet-agent' package present [puppet] - 10https://gerrit.wikimedia.org/r/1176911
[02:48:24] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Nova vendordata: puppetize VMs with 'puppet-agent' package present [puppet] - 10https://gerrit.wikimedia.org/r/1176911 (owner: 10Andrew Bogott)
[03:04:32] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[03:09:32] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:19:32] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:00:32] <icinga-wm>	 PROBLEM - Disk space on an-worker1121 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 158691 MB (4% inode=99%): /var/lib/hadoop/data/h 150065 MB (3% inode=99%): /var/lib/hadoop/data/b 155308 MB (4% inode=99%): /var/lib/hadoop/data/k 154367 MB (4% inode=99%): /var/lib/hadoop/data/m 153079 MB (4% inode=99%): /var/lib/hadoop/data/f 155134 MB (4% inode=99%): /var/lib/hadoop/data/j 151748 MB (4% inode=99%): /var/lib/hadoop/data
[04:00:32] <icinga-wm>	 0 MB (4% inode=99%): /var/lib/hadoop/data/l 172033 MB (4% inode=99%): /var/lib/hadoop/data/i 156946 MB (4% inode=99%): /var/lib/hadoop/data/g 156190 MB (4% inode=99%): /var/lib/hadoop/data/c 151811 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1121&var-datasource=eqiad+prometheus/ops
[04:44:36] <jinxer-wm>	 FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[05:08:10] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:13:10] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:21:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:31:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:36:38] <jinxer-wm>	 FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250811T0700).
[07:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:02:08] <hashar>	 easy one :)
[07:04:32] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[07:08:22] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11072888 (10Novem_Linguae) Hello friends. Anything I can do to help to keep this moving?  * This is currently on the "NDA Pending" column of the #LDAP-Access-Requests board. That can prob...
[07:09:32] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:38:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1176095 (owner: 10Slyngshede)
[07:49:20] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:49:36] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:51:34] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 7.614 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:52:10] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54368 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:54:28] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] data.yaml: re-add email [puppet] - 10https://gerrit.wikimedia.org/r/1176095 (owner: 10Slyngshede)
[07:54:37] <moritzm>	 !log installing openjdk-11 security updates
[07:54:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:49] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2088.codfw.wmnet with OS bullseye
[07:56:59] <wikibugs>	 06SRE, 10SRE-swift-storage: Swift device names should not contain underscores - https://phabricator.wikimedia.org/T401387#11072965 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2088.codfw.wmnet with OS bullseye
[08:05:23] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Fix PXE miss-configurations - https://phabricator.wikimedia.org/T396717#11072975 (10klausman) >>! In T396717#11060794, @Papaul wrote: > @klausman hello hope all is well. Is it possible to give us a day and time when you will be available to help us work on those s...
[08:06:28] <wikibugs>	 (03PS1) 10Vgutierrez: haproxykafka: Reduce socket deadline to 500ms [puppet] - 10https://gerrit.wikimedia.org/r/1177319 (https://phabricator.wikimedia.org/T400039)
[08:08:13] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1177319 (https://phabricator.wikimedia.org/T400039) (owner: 10Vgutierrez)
[08:09:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1003.eqiad.wmnet
[08:09:29] <wikibugs>	 (03PS1) 10MVernon: swift: re-add SM C-J hosts with JBOD card [puppet] - 10https://gerrit.wikimedia.org/r/1177320 (https://phabricator.wikimedia.org/T401387)
[08:10:41] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2088.codfw.wmnet with reason: host reimage
[08:13:38] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1177319 (https://phabricator.wikimedia.org/T400039) (owner: 10Vgutierrez)
[08:13:54] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2088.codfw.wmnet with reason: host reimage
[08:14:14] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] haproxykafka: Reduce socket deadline to 500ms [puppet] - 10https://gerrit.wikimedia.org/r/1177319 (https://phabricator.wikimedia.org/T400039) (owner: 10Vgutierrez)
[08:14:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1003.eqiad.wmnet
[08:16:07] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1175552 (https://phabricator.wikimedia.org/T303725) (owner: 10Dzahn)
[08:18:26] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] "The hostnames match the 2 hosts flagged READY in the related tasks" [puppet] - 10https://gerrit.wikimedia.org/r/1177320 (https://phabricator.wikimedia.org/T401387) (owner: 10MVernon)
[08:20:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve1012.eqiad.wmnet
[08:27:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1012.eqiad.wmnet
[08:29:57] <joelyrookewmde>	 Hi, I'm planning to add wikidata support for tlwikisource as per https://phabricator.wikimedia.org/T388658. Let me know if running these maintenance scripts would block any deployments
[08:31:25] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2088.codfw.wmnet with OS bullseye
[08:31:36] <wikibugs>	 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Swift device names should not contain underscores - https://phabricator.wikimedia.org/T401387#11073027 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2088.codfw.wmnet with OS bullseye completed: - ms-be208...
[08:31:48] <wikibugs>	 (03CR) 10Aklapper: [V:03+2 C:03+2] "Applies cleanly locally (apart from some "trailing whitespace" noise). Thank you (as usual)!" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1175157 (https://phabricator.wikimedia.org/T399604) (owner: 10Pppery)
[08:33:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve1013.eqiad.wmnet
[08:36:41] <vgutierrez>	 !log reducing haproxykafka socket batch deadline to 500ms - T400039
[08:36:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:45] <stashbot>	 T400039: Haproxykafka silently stops sending request data to kafka - https://phabricator.wikimedia.org/T400039
[08:39:14] <wikibugs>	 (03CR) 10MVernon: [C:03+2] swift: re-add SM C-J hosts with JBOD card [puppet] - 10https://gerrit.wikimedia.org/r/1177320 (https://phabricator.wikimedia.org/T401387) (owner: 10MVernon)
[08:40:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1013.eqiad.wmnet
[08:44:36] <jinxer-wm>	 FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[08:50:00] <joelyrookewmde>	 !log joelyrookewmde@deploy1003:~$ foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https
[08:50:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:15] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2229.codfw.wmnet with reason: Maintenance
[09:00:46] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: text-frontend: enforcement of UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119)
[09:00:46] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: upload-frontend: apply UA policy to cache-upload as well [puppet] - 10https://gerrit.wikimedia.org/r/1177324 (https://phabricator.wikimedia.org/T400119)
[09:03:25] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] text-frontend: enforcement of UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto)
[09:04:12] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2204.codfw.wmnet with reason: Maintenance
[09:06:40] <wikibugs>	 (03PS7) 10Giuseppe Lavagetto: text-frontend: enforcement of UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119)
[09:06:40] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: upload-frontend: apply UA policy to cache-upload as well [puppet] - 10https://gerrit.wikimedia.org/r/1177324 (https://phabricator.wikimedia.org/T400119)
[09:07:38] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] "please fix the syntax errors" [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto)
[09:08:03] <wikibugs>	 (CR) Vgutierrez: [C:-1] upload-frontend: apply UA policy to cache-upload as well (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1177324 (https://phabricator.wikimedia.org/T400119) (owner: Giuseppe Lavagetto)
[09:09:57] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2213.codfw.wmnet with reason: Maintenance
[09:11:34] <wikibugs>	 (03PS8) 10Giuseppe Lavagetto: text-frontend: enforcement of UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119)
[09:13:11] <joelyrookewmde>	 !log Finished populateSitesTable for tlwikisource [as per T388658]
[09:13:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:15] <stashbot>	 T388658: Add Wikidata support for tlwikisource - https://phabricator.wikimedia.org/T388658
[09:14:40] <wikibugs>	 (03PS9) 10Giuseppe Lavagetto: text-frontend: enforcement of UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119)
[09:18:36] <joelyrookewmde>	 WIT team is done running maintenance scripts for adding wikidata support! Hasta la vista!
[09:20:18] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis tlwikisource in section s5
[09:23:16] <wikibugs>	 (03PS10) 10Giuseppe Lavagetto: text-frontend: enforcement of UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119)
[09:24:49] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] api-gateway: Conditional restbase compatibility headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172599 (https://phabricator.wikimedia.org/T400346) (owner: 10Clément Goubert)
[09:25:01] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Checking sanitization for wikis tlwikisource in section s5
[09:25:11] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis tlwikisource in section s5
[09:26:44] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: Conditional restbase compatibility headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172599 (https://phabricator.wikimedia.org/T400346) (owner: 10Clément Goubert)
[09:27:36] <wikibugs>	 (03PS5) 10STran: Enable temporary accounts for special/non-standard/private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672)
[09:27:36] <wikibugs>	 (03PS1) 10STran: Defer to * group for per-wiki temp account permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177327 (https://phabricator.wikimedia.org/T400672)
[09:27:44] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[09:27:59] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[09:28:56] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[09:29:08] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[09:31:26] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[09:31:39] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[09:32:14] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Managing sanitization for wikis tlwikisource in section s5
[09:32:58] <wikibugs>	 (03CR) 10STran: "Done in I1bf2269a3b78e301d0eac853bbfc61b14aa03015 to split deploy up." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: 10STran)
[09:34:24] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177327 (https://phabricator.wikimedia.org/T400672) (owner: 10STran)
[09:34:44] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis tlwikisource in section s5
[09:35:19] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[09:35:36] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2205.codfw.wmnet with reason: Maintenance
[09:36:04] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[09:37:40] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply
[09:37:54] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[09:38:00] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[09:38:25] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[09:38:39] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[09:38:51] <wikibugs>	 (03CR) 10Máté Szabó: [C:03+1] Defer to * group for per-wiki temp account permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177327 (https://phabricator.wikimedia.org/T400672) (owner: 10STran)
[09:38:58] <logmsgbot>	 fceratto@cumin1002 sanitize-wiki (PID 640009) is awaiting input
[09:40:45] <wikibugs>	 (03PS1) 10David Caro: harbor: for newer than bullseye use system package [puppet] - 10https://gerrit.wikimedia.org/r/1177330
[09:43:03] <wikibugs>	 (03PS2) 10David Caro: harbor: for newer than bullseye use system package [puppet] - 10https://gerrit.wikimedia.org/r/1177330
[09:46:26] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Checking sanitization for wikis tlwikisource in section s5
[09:46:40] <wikibugs>	 (03CR) 10David Caro: "I'm able to start harbor with docker compose and the containers are happy (did not test playing with it)." [puppet] - 10https://gerrit.wikimedia.org/r/1177330 (owner: 10David Caro)
[09:49:14] <wikibugs>	 (03CR) 10Majavah: [C:03+1] harbor: for newer than bullseye use system package [puppet] - 10https://gerrit.wikimedia.org/r/1177330 (owner: 10David Caro)
[09:49:52] <wikibugs>	 (03CR) 10David Caro: [C:03+2] harbor: for newer than bullseye use system package [puppet] - 10https://gerrit.wikimedia.org/r/1177330 (owner: 10David Caro)
[09:52:24] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] eventgate-analytics: update kafka-jumbo-eqiad broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176479 (https://phabricator.wikimedia.org/T397447) (owner: 10Brouberol)
[09:53:20] <wikibugs>	 (03PS1) 10Majavah: P:toolforge: aptly: Add trixie-tools/toolsbeta repos [puppet] - 10https://gerrit.wikimedia.org/r/1177333 (https://phabricator.wikimedia.org/T401574)
[09:54:23] <wikibugs>	 (03Merged) 10jenkins-bot: eventgate-analytics: update kafka-jumbo-eqiad broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176479 (https://phabricator.wikimedia.org/T397447) (owner: 10Brouberol)
[09:55:25] <logmsgbot>	 !log brouberol@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[09:55:33] <logmsgbot>	 !log brouberol@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[09:56:03] <wikibugs>	 (CR) Jelto: gerrit: add daemons ssh host key to known_hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1175114 (https://phabricator.wikimedia.org/T398401) (owner: Hashar)
[09:56:14] <wikibugs>	 (03PS1) 10Aqu: Analytics - Refine eventlogging_MediaWikiPingback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177334 (https://phabricator.wikimedia.org/T369845)
[09:57:06] <brouberol>	 !log redeploying eventgate-analytics to remove references to soon-to-be decommissioned brokers - T397447
[09:57:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:09] <stashbot>	 T397447: Take kafka-jumbo100[7-9] out of service, ready for decom - https://phabricator.wikimedia.org/T397447
[09:57:13] <logmsgbot>	 !log brouberol@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply
[09:57:25] <wikibugs>	 (03PS2) 10Majavah: P:toolforge: aptly: Add trixie-tools/toolsbeta repos [puppet] - 10https://gerrit.wikimedia.org/r/1177333 (https://phabricator.wikimedia.org/T401574)
[09:57:25] <wikibugs>	 (03PS1) 10Majavah: debian: Add trixie as a valid codename [puppet] - 10https://gerrit.wikimedia.org/r/1177335 (https://phabricator.wikimedia.org/T391083)
[09:57:38] <logmsgbot>	 !log brouberol@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply
[09:58:03] <wikibugs>	 (03PS11) 10Giuseppe Lavagetto: text-frontend: enforcement of UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119)
[09:58:05] <logmsgbot>	 !log brouberol@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[09:58:09] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6531/co" [puppet] - 10https://gerrit.wikimedia.org/r/1177333 (https://phabricator.wikimedia.org/T401574) (owner: 10Majavah)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250811T1000)
[10:00:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:cassandra-dev: Java updates - jmm@cumin2002
[10:01:08] <logmsgbot>	 !log brouberol@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[10:02:52] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] eventstreams-internal: update kafka-jumbo-eqiad broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176480 (https://phabricator.wikimedia.org/T397447) (owner: 10Brouberol)
[10:04:01] <brouberol>	 !log redeploying eventstreams-internal to remove references to soon-to-be decommissioned brokers - T397447
[10:04:03] <logmsgbot>	 !log brouberol@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[10:04:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:06] <stashbot>	 T397447: Take kafka-jumbo100[7-9] out of service, ready for decom - https://phabricator.wikimedia.org/T397447
[10:04:39] <logmsgbot>	 !log brouberol@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
[10:05:09] <logmsgbot>	 !log brouberol@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply
[10:06:17] <logmsgbot>	 !log brouberol@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply
[10:07:34] <logmsgbot>	 !log brouberol@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply
[10:08:39] <logmsgbot>	 !log brouberol@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply
[10:09:28] <wikibugs>	 (03PS12) 10Arnaudb: gerrit: Switchover gerrit1003 → gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1172625 (https://phabricator.wikimedia.org/T338470)
[10:09:40] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] mw-page-content-change-enrich: update kafka-jumbo-eqiad broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176481 (https://phabricator.wikimedia.org/T397447) (owner: 10Brouberol)
[10:09:45] <wikibugs>	 (03CR) 10CI reject: [V:04-1] gerrit: Switchover gerrit1003 → gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1172625 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb)
[10:10:04] <wikibugs>	 (03CR) 10Arnaudb: [C:04-2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1172625 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb)
[10:10:59] <logmsgbot>	 !log brouberol@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[10:11:40] <wikibugs>	 (03Merged) 10jenkins-bot: mw-page-content-change-enrich: update kafka-jumbo-eqiad broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176481 (https://phabricator.wikimedia.org/T397447) (owner: 10Brouberol)
[10:13:27] <logmsgbot>	 !log brouberol@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[10:13:32] <logmsgbot>	 !log brouberol@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[10:14:56] <logmsgbot>	 !log brouberol@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[10:14:59] <logmsgbot>	 !log brouberol@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[10:16:23] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] eventgate-analytics-external: update kafka-jumbo-eqiad broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176487 (https://phabricator.wikimedia.org/T397447) (owner: 10Brouberol)
[10:20:32] <logmsgbot>	 !log brouberol@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[10:20:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:cassandra-dev: Java updates - jmm@cumin2002
[10:20:40] <logmsgbot>	 !log brouberol@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
[10:21:46] <logmsgbot>	 !log brouberol@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[10:22:14] <logmsgbot>	 !log brouberol@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[10:23:52] <logmsgbot>	 !log brouberol@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[10:24:22] <logmsgbot>	 !log brouberol@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[10:28:17] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2220.codfw.wmnet with reason: Maintenance
[10:30:17] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:30:39] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1045.eqiad.wmnet with OS bookworm
[10:30:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11073267 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1045.eqiad.wmnet with OS bookworm
[10:32:10] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[10:32:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11073271 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[10:35:04] <wikibugs>	 (03CR) 10Joal: [C:03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177334 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu)
[10:35:17] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:36:38] <jinxer-wm>	 FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[10:37:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1140695 (https://phabricator.wikimedia.org/T393173) (owner: 10Majavah)
[10:40:31] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:docker::builder: Build a Trixie image [puppet] - 10https://gerrit.wikimedia.org/r/1140695 (https://phabricator.wikimedia.org/T393173) (owner: 10Majavah)
[10:44:30] <moritzm>	 !log installing batik security updates
[10:44:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:46] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2165.codfw.wmnet with reason: Maintenance
[10:45:54] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: text-frontend: enforcement of UA policy (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto)
[10:47:52] <wikibugs>	 (03PS12) 10Giuseppe Lavagetto: text-frontend: enforcement of UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119)
[10:49:36] <taavi>	 !log manually built first trixie docker image T393173
[10:49:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:39] <stashbot>	 T393173: Publish Wikimedia trixie base Docker image - https://phabricator.wikimedia.org/T393173
[10:49:47] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2179.codfw.wmnet with reason: Maintenance
[10:52:29] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Enable Java security updates - klausman@cumin1003
[10:52:59] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[10:54:52] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] text-frontend: enforcement of UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto)
[10:55:26] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[10:59:44] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:00:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1177335 (https://phabricator.wikimedia.org/T391083) (owner: 10Majavah)
[11:01:06] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[11:01:36] <wikibugs>	 (03CR) 10Majavah: [C:03+2] debian: Add trixie as a valid codename [puppet] - 10https://gerrit.wikimedia.org/r/1177335 (https://phabricator.wikimedia.org/T391083) (owner: 10Majavah)
[11:01:57] <moritzm>	 !log installing djvulibre security updates
[11:01:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:54] <wikibugs>	 (03CR) 10FNegri: [C:03+1] P:toolforge: aptly: Add trixie-tools/toolsbeta repos [puppet] - 10https://gerrit.wikimedia.org/r/1177333 (https://phabricator.wikimedia.org/T401574) (owner: 10Majavah)
[11:03:41] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge: aptly: Add trixie-tools/toolsbeta repos [puppet] - 10https://gerrit.wikimedia.org/r/1177333 (https://phabricator.wikimedia.org/T401574) (owner: 10Majavah)
[11:04:32] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[11:05:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Investigate whether we can add RAM to dumpsdata100[4-7] from any decommissioned hosts - https://phabricator.wikimedia.org/T401299#11073333 (10VRiley-WMF) @BTullis I was looking into the unit dumpsdata1004-5 and it looks like basic support has ended on May 31st 2024. Then dumpsda...
[11:06:07] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:06:18] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+1] Defer to * group for per-wiki temp account permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177327 (https://phabricator.wikimedia.org/T400672) (owner: 10STran)
[11:06:21] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] text-frontend: enforcement of UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto)
[11:09:32] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[11:09:44] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:10:12] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Enable Java security updates - klausman@cumin1003
[11:10:26] <wikibugs>	 (03CR) 10Clément Goubert: "Couple questions for context" [puppet] - 10https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) (owner: 10Effie Mouzeli)
[11:10:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:11:48] <jinxer-wm>	 FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:13:23] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Enable Java security updates - klausman@cumin1003
[11:14:30] <wikibugs>	 (03CR) 10Máté Szabó: profile::hcaptcha::proxy: config improvements (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) (owner: 10Effie Mouzeli)
[11:15:22] <wikibugs>	 (03PS5) 10Clément Goubert: profile::hcaptcha::proxy: config improvements [puppet] - 10https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) (owner: 10Effie Mouzeli)
[11:15:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:15:42] <wikibugs>	 (03CR) 10Clément Goubert: profile::hcaptcha::proxy: config improvements (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) (owner: 10Effie Mouzeli)
[11:17:04] <wikibugs>	 (03Abandoned) 10Nikerabbit: WIP: services/machinetranslation: adjust startup probe delays [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162985 (https://phabricator.wikimedia.org/T386889) (owner: 10Klausman)
[11:19:04] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[11:22:31] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt clouddb1026 - vriley@cumin1002"
[11:22:36] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt clouddb1026 - vriley@cumin1002"
[11:22:36] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:23:28] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:24:22] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1046
[11:24:36] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1046
[11:25:25] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1046.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:26:48] <jinxer-wm>	 RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:29:13] <claime>	 mszabo: I'll +2 and merge the nginx CR, run it on urldownloader1003 for syntax check, and then ping you when I run puppet on urldownloader1004 so you can test, sounds good?
[11:29:44] <wikibugs>	 (03CR) 10Clément Goubert: profile::hcaptcha::proxy: config improvements (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) (owner: 10Effie Mouzeli)
[11:30:26] <wikibugs>	 (03PS1) 10Majavah: Add Trixie images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1177345 (https://phabricator.wikimedia.org/T400255)
[11:31:07] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Enable Java security updates - klausman@cumin1003
[11:32:30] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[11:32:44] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11073374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[11:33:19] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1045.eqiad.wmnet with OS bookworm
[11:33:32] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11073375 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1045.eqiad.wmnet with OS bookworm
[11:39:09] <wikibugs>	 (03PS2) 10Majavah: Add Trixie images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1177345 (https://phabricator.wikimedia.org/T400255)
[11:46:33] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] profile::hcaptcha::proxy: config improvements [puppet] - 10https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) (owner: 10Effie Mouzeli)
[11:46:53] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[11:47:10] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] eventlogging: remove reference to kafka-jumbo1007 [alerts] - 10https://gerrit.wikimedia.org/r/1176485 (https://phabricator.wikimedia.org/T397447) (owner: 10Brouberol)
[11:51:20] <wikibugs>	 (03PS1) 10Btullis: Move the journalnode on an-worker1018 to an-worker1141 [puppet] - 10https://gerrit.wikimedia.org/r/1177349 (https://phabricator.wikimedia.org/T397166)
[11:51:22] <wikibugs>	 (03PS1) 10Btullis: Move the journalnode on an-worker1078 to an-worker1178 [puppet] - 10https://gerrit.wikimedia.org/r/1177350 (https://phabricator.wikimedia.org/T397166)
[11:51:23] <wikibugs>	 (03PS1) 10Btullis: Move the journalnode on analytics1072 to an-worker1126 [puppet] - 10https://gerrit.wikimedia.org/r/1177351 (https://phabricator.wikimedia.org/T397166)
[11:51:25] <wikibugs>	 (03PS1) 10Btullis: Move the journalnode on an-worker1090 to an-worker1142 [puppet] - 10https://gerrit.wikimedia.org/r/1177352 (https://phabricator.wikimedia.org/T397166)
[11:51:27] <wikibugs>	 (03PS1) 10Btullis: Remove the 52 decommissioning hosts from the analytics_hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1177353 (https://phabricator.wikimedia.org/T397172)
[11:53:04] <logmsgbot>	 vriley@cumin1002 provision (PID 751870) is awaiting input
[11:53:17] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1046.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:53:34] <claime>	 mszabo: done, live on urldownloader1004
[12:02:13] <zabe>	 jouncebot: nowandnext
[12:02:13] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 57 minute(s)
[12:02:13] <jouncebot>	 In 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250811T1300)
[12:03:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.166s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:03:22] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Stop writing to cl_to and cl_collation on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176661 (https://phabricator.wikimedia.org/T399579) (owner: 10Zabe)
[12:05:07] <wikibugs>	 (03Merged) 10jenkins-bot: Stop writing to cl_to and cl_collation on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176661 (https://phabricator.wikimedia.org/T399579) (owner: 10Zabe)
[12:05:50] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1176661|Stop writing to cl_to and cl_collation on small wikis (T399579)]]
[12:05:54] <stashbot>	 T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579
[12:08:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.379s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:09:47] <wikibugs>	 (03PS3) 10Brouberol: Decommission kafka-jumbo1007 [puppet] - 10https://gerrit.wikimedia.org/r/1176489 (https://phabricator.wikimedia.org/T397447)
[12:09:47] <wikibugs>	 (03PS3) 10Brouberol: Decommission kafka-jumbo1008 [puppet] - 10https://gerrit.wikimedia.org/r/1176490 (https://phabricator.wikimedia.org/T397447)
[12:09:47] <wikibugs>	 (03PS3) 10Brouberol: Decommission kafka-jumbo1009 [puppet] - 10https://gerrit.wikimedia.org/r/1176491 (https://phabricator.wikimedia.org/T397447)
[12:12:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.913s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:12:53] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1176489 (https://phabricator.wikimedia.org/T397447) (owner: 10Brouberol)
[12:14:29] <wikibugs>	 (03CR) 10Zabe: "It has been two weeks now, I think we can do this now. What do you think?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174035 (https://phabricator.wikimedia.org/T400755) (owner: 10Zabe)
[12:15:01] <logmsgbot>	 !log hashar@deploy1003 Started deploy [integration/docroot@1c2af1f]: build: Upgrade eslint-config-wikimedia to 0.31.0
[12:15:14] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2212.codfw.wmnet with reason: Maintenance
[12:15:15] <logmsgbot>	 !log hashar@deploy1003 Finished deploy [integration/docroot@1c2af1f]: build: Upgrade eslint-config-wikimedia to 0.31.0 (duration: 00m 13s)
[12:17:30] <wikibugs>	 (03PS1) 10Jelto: gitlab: raise throttling thresholds [puppet] - 10https://gerrit.wikimedia.org/r/1177363 (https://phabricator.wikimedia.org/T400971)
[12:18:03] <wikibugs>	 (03PS4) 10Brouberol: Decommission kafka-jumbo1009 [puppet] - 10https://gerrit.wikimedia.org/r/1176491 (https://phabricator.wikimedia.org/T397447)
[12:18:36] <wikibugs>	 (03CR) 10Arnaudb: "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1177363 (https://phabricator.wikimedia.org/T400971) (owner: 10Jelto)
[12:19:48] <wikibugs>	 (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6542/co" [puppet] - 10https://gerrit.wikimedia.org/r/1176491 (https://phabricator.wikimedia.org/T397447) (owner: 10Brouberol)
[12:19:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1177365 (owner: 10L10n-bot)
[12:20:20] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1175552 (https://phabricator.wikimedia.org/T303725) (owner: 10Dzahn)
[12:22:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.174s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:23:46] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177370
[12:23:57] <wikibugs>	 (03CR) 10Arnaudb: gerrit: add daemons ssh host key to known_hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175114 (https://phabricator.wikimedia.org/T398401) (owner: 10Hashar)
[12:24:21] <wikibugs>	 10SRE-tools, 10Spicerack: DeprecationWarning: datetime.datetime.utcnow() is deprecated - https://phabricator.wikimedia.org/T401581 (10taavi) 03NEW
[12:24:55] <wikibugs>	 (03CR) 10Arnaudb: gerrit: add daemons ssh host key to known_hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175114 (https://phabricator.wikimedia.org/T398401) (owner: 10Hashar)
[12:29:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.292s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:34:35] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1176661|Stop writing to cl_to and cl_collation on small wikis (T399579)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:34:38] <stashbot>	 T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579
[12:36:44] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[12:38:17] <wikibugs>	 (03CR) 10Hashar: [C:03+2] build: upgrade QUnit [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1175475 (owner: 10Hashar)
[12:39:07] <zabe>	 it is very slow
[12:39:13] <wikibugs>	 (03Merged) 10jenkins-bot: build: upgrade QUnit [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1175475 (owner: 10Hashar)
[12:39:30] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gitlab: raise throttling thresholds [puppet] - 10https://gerrit.wikimedia.org/r/1177363 (https://phabricator.wikimedia.org/T400971) (owner: 10Jelto)
[12:39:53] <logmsgbot>	 !log hashar@deploy1003 Started deploy [gerrit/gerrit@7d55b4f]: build: upgrade QUnit
[12:40:05] <logmsgbot>	 !log hashar@deploy1003 Finished deploy [gerrit/gerrit@7d55b4f]: build: upgrade QUnit (duration: 00m 12s)
[12:42:38] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:43:43] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177334 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu)
[12:44:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.531s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:44:36] <jinxer-wm>	 FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[12:48:20] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:49:05] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] "I have little context around what the script does as a whole, but the code looks correct based assuming the input yaml and txt are formatt" [puppet] - 10https://gerrit.wikimedia.org/r/1175171 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup)
[12:49:20] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1176661|Stop writing to cl_to and cl_collation on small wikis (T399579)]] (duration: 43m 30s)
[12:49:24] <stashbot>	 T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579
[12:50:29] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Move the journalnode on an-worker1018 to an-worker1141 [puppet] - 10https://gerrit.wikimedia.org/r/1177349 (https://phabricator.wikimedia.org/T397166) (owner: 10Btullis)
[12:50:46] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Move the journalnode on an-worker1078 to an-worker1178 [puppet] - 10https://gerrit.wikimedia.org/r/1177350 (https://phabricator.wikimedia.org/T397166) (owner: 10Btullis)
[12:51:14] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Move the journalnode on analytics1072 to an-worker1126 [puppet] - 10https://gerrit.wikimedia.org/r/1177351 (https://phabricator.wikimedia.org/T397166) (owner: 10Btullis)
[12:52:10] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Move the journalnode on an-worker1090 to an-worker1142 [puppet] - 10https://gerrit.wikimedia.org/r/1177352 (https://phabricator.wikimedia.org/T397166) (owner: 10Btullis)
[12:53:28] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Remove the 52 decommissioning hosts from the analytics_hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1177353 (https://phabricator.wikimedia.org/T397172) (owner: 10Btullis)
[12:56:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:56:31] <icinga-wm>	 PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:56:51] <icinga-wm>	 PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy
[12:58:10] <jinxer-wm>	 FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250811T1300).
[13:00:05] <jouncebot>	 Tran and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:20] <Tran>	 👋
[13:00:24] <wikibugs>	 (03PS2) 10Btullis: Move the journalnode on an-worker1090 to an-worker1108 [puppet] - 10https://gerrit.wikimedia.org/r/1177352 (https://phabricator.wikimedia.org/T397166)
[13:00:28] <phuedx>	 👋
[13:01:15] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1045.eqiad.wmnet with OS bookworm
[13:01:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.445s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:01:27] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11073586 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1045.eqiad.wmnet with OS bookworm executed with errors: - cloudcephosd104...
[13:01:28] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[13:01:34] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11073587 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[13:01:38] <Lucas_WMDE>	 I can’t deploy today, sorry
[13:01:47] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[13:01:57] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11073588 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm executed with errors: - cloudcephosd104...
[13:02:05] <Tran>	 No problem, I'm (probably) capable of deploying. Thanks for the heads up! I'll go ahead and start then.
[13:02:29] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[13:02:31] <phuedx>	 I can deploy mine
[13:02:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11073592 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[13:02:49] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[13:02:58] <moritzm>	 !log installing libcommons-lang-java security updates
[13:03:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11073593 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm executed with errors: - cloudcephosd104...
[13:03:01] <Lucas_WMDE>	 sounds good, thanks!
[13:03:21] <icinga-wm>	 RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:03:58] <wikibugs>	 (03PS2) 10Btullis: Move the journalnode on an-worker1080 to an-worker1141 [puppet] - 10https://gerrit.wikimedia.org/r/1177349 (https://phabricator.wikimedia.org/T397166)
[13:04:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177327 (https://phabricator.wikimedia.org/T400672) (owner: 10STran)
[13:04:50] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[13:04:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11073595 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[13:05:11] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[13:05:20] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11073596 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm executed with errors: - cloudcephosd104...
[13:06:07] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:06:11] <wikibugs>	 (03Merged) 10jenkins-bot: Defer to * group for per-wiki temp account permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177327 (https://phabricator.wikimedia.org/T400672) (owner: 10STran)
[13:06:25] <logmsgbot>	 !log stran@deploy1003 Started scap sync-world: Backport for [[gerrit:1177327|Defer to * group for per-wiki temp account permissions (T400672)]]
[13:06:29] <stashbot>	 T400672: Deploy temporary accounts to special/non-standard/private wikis - https://phabricator.wikimedia.org/T400672
[13:06:31] <icinga-wm>	 PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:06:43] <icinga-wm>	 PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:06:43] <icinga-wm>	 PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:08:41] <icinga-wm>	 RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 555 bytes in 8.840 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:08:41] <icinga-wm>	 RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 565 bytes in 7.656 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:08:48] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] Move the journalnode on an-worker1080 to an-worker1141 [puppet] - 10https://gerrit.wikimedia.org/r/1177349 (https://phabricator.wikimedia.org/T397166) (owner: 10Btullis)
[13:08:59] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] Move the journalnode on an-worker1078 to an-worker1178 [puppet] - 10https://gerrit.wikimedia.org/r/1177350 (https://phabricator.wikimedia.org/T397166) (owner: 10Btullis)
[13:09:07] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] Move the journalnode on analytics1072 to an-worker1126 [puppet] - 10https://gerrit.wikimedia.org/r/1177351 (https://phabricator.wikimedia.org/T397166) (owner: 10Btullis)
[13:09:16] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] Move the journalnode on an-worker1090 to an-worker1108 [puppet] - 10https://gerrit.wikimedia.org/r/1177352 (https://phabricator.wikimedia.org/T397166) (owner: 10Btullis)
[13:10:30] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:10:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=wdqs-main.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[13:11:02] <vgutierrez>	 !incidents
[13:11:03] <sirenbot>	 6565 (UNACKED)  ATSBackendErrorsHigh cache_text sre (wdqs-main.discovery.wmnet esams)
[13:11:07] <vgutierrez>	 !ack 6565
[13:11:08] <sirenbot>	 6565 (ACKED)  ATSBackendErrorsHigh cache_text sre (wdqs-main.discovery.wmnet esams)
[13:11:21] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[13:11:24] <Tran>	 My spiderpig deploy seems to have failed for reasons unrelated to my config change. Can I kick off another spiderpig on the change even though it merged or am I going to be manually recovering from this?
[13:11:28] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11073606 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[13:11:43] <icinga-wm>	 PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:11:43] <icinga-wm>	 PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:11:44] <Emperor>	 oncallers, are you OK? We all got paged
[13:11:56] <vgutierrez>	 Emperor: I acked the issue in a few seconds?
[13:12:09] <Emperor>	 that's weird
[13:12:10] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:12:19] <fabfur>	 yep
[13:12:22] <vgutierrez>	 unless I've missed a page of course
[13:12:25] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: upload-frontend: apply UA policy to cache-upload as well [puppet] - 10https://gerrit.wikimedia.org/r/1177324 (https://phabricator.wikimedia.org/T400119)
[13:12:44] <moritzm>	 vgutierrez: no, I've also only gotten it a minute ago, there was no preceding one
[13:13:15] <Emperor>	 looks like it got sent straight to batphone
[13:13:34] <vgutierrez>	 so I guess that's for tappof 
[13:13:58] <Emperor>	 looks like a repeat of T371244
[13:13:59] <stashbot>	 T371244: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244
[13:14:05] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Move the journalnode on an-worker1080 to an-worker1141 [puppet] - 10https://gerrit.wikimedia.org/r/1177349 (https://phabricator.wikimedia.org/T397166) (owner: 10Btullis)
[13:14:21] <icinga-wm>	 RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:14:33] <icinga-wm>	 RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:14:33] <icinga-wm>	 RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:14:41] <icinga-wm>	 RECOVERY - Squid on install1004 is OK: TCP OK - 0.002 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy
[13:14:49] <logmsgbot>	 !log stran@deploy1003 Started scap sync-world: Backport for [[gerrit:1177327|Defer to * group for per-wiki temp account permissions (T400672)]]
[13:14:52] <stashbot>	 T400672: Deploy temporary accounts to special/non-standard/private wikis - https://phabricator.wikimedia.org/T400672
[13:15:07] <vgutierrez>	 Tran: could your deploy change explain the WDQS errors?
[13:15:20] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "Patch passes varnishtests now." [puppet] - 10https://gerrit.wikimedia.org/r/1177324 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto)
[13:15:51] <jinxer-wm>	 RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=wdqs-main.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[13:15:54] <Dreamy_Jazz>	 I don't think so, as it should be a no-op in terms of functionality
[13:15:59] <vgutierrez>	 ack, thx
[13:16:24] <Tran>	 ^ beat me to it. It's just removing a config for loginwiki
[13:16:30] <Dreamy_Jazz>	 :D
[13:18:10] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:20:20] <logmsgbot>	 !log stran@deploy1003 stran: Backport for [[gerrit:1177327|Defer to * group for per-wiki temp account permissions (T400672)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:20:23] <stashbot>	 T400672: Deploy temporary accounts to special/non-standard/private wikis - https://phabricator.wikimedia.org/T400672
[13:21:00] <wikibugs>	 06SRE, 06SRE-OnFire, 06SRE Observability: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244#11073627 (10MatthewVernon) We had a repeat today with [[ https://portal.victorops.com/ui/wikimedia/incident/6565/details | incident 6565 ]], which paged everyone imm...
[13:22:25] <tappof>	 vgutierrez: Emperor Yes, it seems that it got sent straight to the batphone... I don’t know why, since the escalation policies seem correct. I’m looking into it...
[13:22:32] <wikibugs>	 (03PS2) 10CDanis: [WIP] haproxy: silent-drop as early as possible [puppet] - 10https://gerrit.wikimedia.org/r/1176302
[13:22:37] <logmsgbot>	 !log stran@deploy1003 stran: Continuing with sync
[13:24:37] <Emperor>	 tappof: my update to T371244 might be of interest/help
[13:24:38] <stashbot>	 T371244: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244
[13:24:52] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:25:22] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1045.eqiad.wmnet with OS bookworm
[13:25:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11073636 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1045.eqiad.wmnet with OS bookworm
[13:26:15] <wikibugs>	 (03CR) 10Vgutierrez: upload-frontend: apply UA policy to cache-upload as well (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1177324 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto)
[13:26:38] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Decommission kafka-jumbo1007 [puppet] - 10https://gerrit.wikimedia.org/r/1176489 (https://phabricator.wikimedia.org/T397447) (owner: 10Brouberol)
[13:30:01] <logmsgbot>	 !log stran@deploy1003 Finished scap sync-world: Backport for [[gerrit:1177327|Defer to * group for per-wiki temp account permissions (T400672)]] (duration: 15m 12s)
[13:30:05] <stashbot>	 T400672: Deploy temporary accounts to special/non-standard/private wikis - https://phabricator.wikimedia.org/T400672
[13:30:37] <Tran>	 all you, phuedx. Thanks for your patience 🙇
[13:30:47] <phuedx>	 Tran: ACK
[13:31:02] <wikibugs>	 06SRE, 06SRE-OnFire, 06SRE Observability: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244#11073654 (10MatthewVernon) [whatever we're doing for 24/7 oncall really needs to get this right, or we'll be paging everyone in the small hours]
[13:31:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177334 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu)
[13:31:54] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2147.codfw.wmnet with reason: Maintenance
[13:32:02] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T399249)', diff saved to https://phabricator.wikimedia.org/P80986 and previous config saved to /var/cache/conftool/dbconfig/20250811-133201-fceratto.json
[13:32:05] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[13:33:15] <wikibugs>	 (03Merged) 10jenkins-bot: Analytics - Refine eventlogging_MediaWikiPingback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177334 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu)
[13:33:28] <logmsgbot>	 !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1177334|Analytics - Refine eventlogging_MediaWikiPingback (T369845)]]
[13:33:32] <stashbot>	 T369845: [Refine Refactoring] Refine jobs should be scheduled by Airflow: deployment - https://phabricator.wikimedia.org/T369845
[13:34:29] <wikibugs>	 (03PS4) 10Brouberol: Decommission kafka-jumbo1008 [puppet] - 10https://gerrit.wikimedia.org/r/1176490 (https://phabricator.wikimedia.org/T397447)
[13:34:56] <wikibugs>	 (03PS5) 10Brouberol: Decommission kafka-jumbo1009 [puppet] - 10https://gerrit.wikimedia.org/r/1176491 (https://phabricator.wikimedia.org/T397447)
[13:35:15] <logmsgbot>	 !log phuedx@deploy1003 aqu, phuedx: Backport for [[gerrit:1177334|Analytics - Refine eventlogging_MediaWikiPingback (T369845)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:35:32] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Decommission kafka-jumbo1008 [puppet] - 10https://gerrit.wikimedia.org/r/1176490 (https://phabricator.wikimedia.org/T397447) (owner: 10Brouberol)
[13:35:41] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Decommission kafka-jumbo1009 [puppet] - 10https://gerrit.wikimedia.org/r/1176491 (https://phabricator.wikimedia.org/T397447) (owner: 10Brouberol)
[13:36:28] <phuedx>	 Checked the stream config change is coming through. LGTM
[13:36:32] <logmsgbot>	 !log phuedx@deploy1003 aqu, phuedx: Continuing with sync
[13:41:45] <logmsgbot>	 !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1177334|Analytics - Refine eventlogging_MediaWikiPingback (T369845)]] (duration: 08m 16s)
[13:41:48] <stashbot>	 T369845: [Refine Refactoring] Refine jobs should be scheduled by Airflow: deployment - https://phabricator.wikimedia.org/T369845
[13:41:52] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Check list of PXE miss-configs for codfw - https://phabricator.wikimedia.org/T401442#11073712 (10Jhancock.wm) @klausman to keep things organized, this is a condensed list for the codfw site. we have 8 servers that need your attention here. It would be best practice to depool the...
[13:42:02] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11073714 (10VRiley-WMF) I have been working on cloudcephosd1045 and cloudcephosd1047 and both of those units are giving me a lot of issues. Still troubleshooting them at the moment.
[13:42:20] <phuedx>	 aqu: The config change for eventlogging_MediaWikiPingback has been deployed
[13:43:37] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Increase the default batch size of puppet.run() - https://phabricator.wikimedia.org/T397687#11073717 (10JMeybohm) >>! In T397687#11000793, @Volans wrote: > @JMeybohm do you have a specific use case that cannot/is hard to solve simply changing the `batch...
[13:46:09] <wikibugs>	 (03PS1) 10Majavah: realm: Replace legacy facts when checking Cloud VPS hostname [puppet] - 10https://gerrit.wikimedia.org/r/1177399 (https://phabricator.wikimedia.org/T401586)
[13:47:07] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6545/console" [puppet] - 10https://gerrit.wikimedia.org/r/1177399 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah)
[13:49:04] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] realm: Replace legacy facts when checking Cloud VPS hostname [puppet] - 10https://gerrit.wikimedia.org/r/1177399 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah)
[13:49:13] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: upload-frontend: apply UA policy to cache-upload as well (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1177324 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto)
[13:49:19] <wikibugs>	 06SRE, 10SRE-swift-storage: Swift device names should not contain underscores - https://phabricator.wikimedia.org/T401387#11073736 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon
[13:50:58] <wikibugs>	 (03PS1) 10Majavah: monitoring: Drop use of legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177401 (https://phabricator.wikimedia.org/T401586)
[13:53:02] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:53:31] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2001.codfw.wmnet
[13:53:32] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2001.codfw.wmnet
[13:53:36] <wikibugs>	 (03PS2) 10Majavah: monitoring: Drop use of legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177401 (https://phabricator.wikimedia.org/T401586)
[13:55:45] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[13:56:54] <wikibugs>	 (03PS1) 10Majavah: apt: Replace use of legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586)
[13:56:56] <wikibugs>	 (03PS1) 10Majavah: P:wmcs::instance: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177404 (https://phabricator.wikimedia.org/T401586)
[13:58:05] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] upload-frontend: apply UA policy to cache-upload as well [puppet] - 10https://gerrit.wikimedia.org/r/1177324 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto)
[13:58:34] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] upload-frontend: apply UA policy to cache-upload as well [puppet] - 10https://gerrit.wikimedia.org/r/1177324 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto)
[14:00:07] <wikibugs>	 06SRE, 10SRE-Access-Requests: Request to add dsaez to analytics-research-admins - https://phabricator.wikimedia.org/T400344#11073850 (10tappof) @Miriam, just a gentle reminder: would you please approve @diego 's request? Thank you.
[14:00:30] <logmsgbot>	 jhancock@cumin1003 provision (PID 1565883) is awaiting input
[14:01:59] <logmsgbot>	 jhancock@cumin1003 provision (PID 1565622) is awaiting input
[14:02:41] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[14:02:47] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:03:41] <wikibugs>	 (03PS1) 10Jelto: gitlab: disable nftables throttling temporarily [puppet] - 10https://gerrit.wikimedia.org/r/1177405 (https://phabricator.wikimedia.org/T400971)
[14:04:35] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Move the journalnode on an-worker1078 to an-worker1178 [puppet] - 10https://gerrit.wikimedia.org/r/1177350 (https://phabricator.wikimedia.org/T397166) (owner: 10Btullis)
[14:04:48] <wikibugs>	 (03PS2) 10Btullis: Move the journalnode on an-worker1078 to an-worker1178 [puppet] - 10https://gerrit.wikimedia.org/r/1177350 (https://phabricator.wikimedia.org/T397166)
[14:05:57] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gitlab: disable nftables throttling temporarily [puppet] - 10https://gerrit.wikimedia.org/r/1177405 (https://phabricator.wikimedia.org/T400971) (owner: 10Jelto)
[14:05:59] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Move the journalnode on an-worker1078 to an-worker1178 [puppet] - 10https://gerrit.wikimedia.org/r/1177350 (https://phabricator.wikimedia.org/T397166) (owner: 10Btullis)
[14:07:14] <wikibugs>	 (03CR) 10Ssingh: "Commenting from Traffic's side: this is in some ways, a trivial patch for us because we are simply setting an additional header. The chall" [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins)
[14:08:49] <icinga-wm>	 PROBLEM - Kafka Broker Server #page on kafka-jumbo1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[14:08:49] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2001.codfw.wmnet
[14:08:50] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[14:08:51] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2001.codfw.wmnet
[14:09:00] <vgutierrez>	 !incidents
[14:09:01] <sirenbot>	 6566 (UNACKED)  kafka-jumbo1008/Kafka Broker Server (paged)
[14:09:01] <sirenbot>	 6565 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (wdqs-main.discovery.wmnet esams)
[14:09:07] <vgutierrez>	 !ack 6566
[14:09:07] <sirenbot>	 6566 (ACKED)  kafka-jumbo1008/Kafka Broker Server (paged)
[14:09:08] <sukhe>	 this is klausman above I think
[14:09:13] <sukhe>	 > ml-serve2001.codfw.wmnet
[14:09:28] <vgutierrez>	 ml-serve != kafka-jumbo1008?
[14:09:33] <moritzm>	 brouberol: related to your maintenance?
[14:09:35] <klausman>	 yeah, I was about to say :)
[14:09:35] <moritzm>	 ^
[14:09:48] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[14:09:49] <sukhe>	 yeah, sorry, I misread
[14:09:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Check list of PXE miss-configs for codfw - https://phabricator.wikimedia.org/T401442#11073898 (10Jhancock.wm) a:05Papaul→03Jhancock.wm
[14:09:58] <klausman>	 no worries :)
[14:10:06] <brouberol>	 let me silence them, I'm decommissioning these hosts, sorry
[14:10:22] <icinga-wm>	 PROBLEM - Kafka Broker Server #page on kafka-jumbo1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[14:10:30] <moritzm>	 ok
[14:10:37] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2002.codfw.wmnet
[14:10:39] <vgutierrez>	      Active: failed (Result: timeout) since Mon 2025-08-11 14:06:33 UTC; 3min 41s ago
[14:10:39] <vgutierrez>	     Process: 101153 ExecStart=/usr/bin/kafka-server-start ${KAFKA_CONFIG}/server.properties (code=killed, signal=KILL)
[14:10:45] <moritzm>	 !incidents
[14:10:45] <sirenbot>	 6566 (ACKED)  kafka-jumbo1008/Kafka Broker Server (paged)
[14:10:46] <sirenbot>	 6567 (UNACKED)  kafka-jumbo1009/Kafka Broker Server (paged)
[14:10:46] <sirenbot>	 6565 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (wdqs-main.discovery.wmnet esams)
[14:10:54] <vgutierrez>	 oh those hosts are decommissioned?
[14:10:57] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[14:10:58] <moritzm>	 !ack 6567
[14:10:59] <sirenbot>	 6567 (ACKED)  kafka-jumbo1009/Kafka Broker Server (paged)
[14:11:08] <moritzm>	 vgutierrez: they are in the process of being decommed
[14:11:19] <vgutierrez>	 lovely
[14:12:35] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[14:13:08] <brouberol>	 > oh those hosts are decommissioned?
[14:13:08] <brouberol>	 I was about to run the cookbook yes
[14:16:02] <brouberol>	 ok, I've silenced all I could see in alertmanager. <sigh> sorry about that. I wanted to stop kafka on these hosts instead of decommissioning them cold, which could have caused errors upstream/downstream
[14:17:45] <logmsgbot>	 !log brouberol@cumin1003 START - Cookbook sre.hosts.decommission for hosts kafka-jumbo1007.eqiad.wmnet
[14:20:37] <logmsgbot>	 !log klausman@cumin1003 END (ERROR) - Cookbook sre.k8s.pool-depool-node (exit_code=97) depool for host ml-serve2002.codfw.wmnet
[14:20:41] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2002.codfw.wmnet
[14:20:42] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2002.codfw.wmnet
[14:22:06] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve2002.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[14:23:31] <logmsgbot>	 !log brouberol@cumin1003 START - Cookbook sre.dns.netbox
[14:24:55] <icinga-wm>	 PROBLEM - Host ml-serve2002 is DOWN: PING CRITICAL - Packet loss = 100%
[14:26:55] <logmsgbot>	 !log brouberol@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-jumbo1007.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1003"
[14:26:59] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10SRE Observability (FY2025/2026-Q1): More frequent Puppet runs on the alert hosts? - https://phabricator.wikimedia.org/T398444#11073996 (10jhathaway) p:05Triage→03Medium
[14:27:16] <logmsgbot>	 !log brouberol@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-jumbo1007.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1003"
[14:27:16] <logmsgbot>	 !log brouberol@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:27:17] <logmsgbot>	 !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kafka-jumbo1007.eqiad.wmnet
[14:27:42] <logmsgbot>	 !log brouberol@cumin1003 START - Cookbook sre.hosts.decommission for hosts kafka-jumbo1008.eqiad.wmnet
[14:29:50] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:30:04] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250811T1430)
[14:30:57] <wikibugs>	 10SRE-tools, 10Observability-Alerting, 10SRE Observability (FY2025/2026-Q1): Cookbook sre.hosts.remove_downtime does not remove silences - https://phabricator.wikimedia.org/T395032#11074018 (10CDanis)
[14:33:24] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Thanks for the patch, confirming NOOP change for the domains." [dns] - 10https://gerrit.wikimedia.org/r/1176725 (https://phabricator.wikimedia.org/T152882) (owner: 10Krinkle)
[14:33:27] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] wikipedia.org: Fix grouping of wikis and non-wikis [dns] - 10https://gerrit.wikimedia.org/r/1176725 (https://phabricator.wikimedia.org/T152882) (owner: 10Krinkle)
[14:33:39] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[14:34:34] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[14:34:43] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve2002.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[14:35:09] <icinga-wm>	 RECOVERY - Host ml-serve2002 is UP: PING OK - Packet loss = 0%, RTA = 30.31 ms
[14:35:15] <logmsgbot>	 !log brouberol@cumin1003 START - Cookbook sre.dns.netbox
[14:35:44] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2002.codfw.wmnet
[14:35:45] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2002.codfw.wmnet
[14:36:38] <jinxer-wm>	 FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[14:36:46] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2003.codfw.wmnet
[14:36:47] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2003.codfw.wmnet
[14:37:20] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Logstash Access for gergesshamon - https://phabricator.wikimedia.org/T399421#11074054 (10tappof) 05Stalled→03Declined Hello @Gerges, As per policy, we cannot proceed further with your request until you have found a WMF staff member willing to sponsor and support it. I’ll...
[14:39:12] <logmsgbot>	 !log brouberol@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-jumbo1008.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1003"
[14:39:29] <logmsgbot>	 !log brouberol@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-jumbo1008.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1003"
[14:39:29] <logmsgbot>	 !log brouberol@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:39:30] <logmsgbot>	 !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kafka-jumbo1008.eqiad.wmnet
[14:39:50] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: ml-serve2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:40:25] <wikibugs>	 (03PS1) 10Bking: stat hosts: Alert on memory stalls [alerts] - 10https://gerrit.wikimedia.org/r/1177410 (https://phabricator.wikimedia.org/T401589)
[14:41:46] <logmsgbot>	 !log brouberol@cumin1003 START - Cookbook sre.hosts.decommission for hosts kafka-jumbo1009.eqiad.wmnet
[14:42:17] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[14:42:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11074092 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm executed with errors: - cloudcephosd104...
[14:42:51] <wikibugs>	 (03PS2) 10Bking: stat hosts: Alert on memory stalls [alerts] - 10https://gerrit.wikimedia.org/r/1177410 (https://phabricator.wikimedia.org/T401589)
[14:45:08] <wikibugs>	 (03PS3) 10Bking: stat hosts: Alert on memory stalls [alerts] - 10https://gerrit.wikimedia.org/r/1177410 (https://phabricator.wikimedia.org/T401589)
[14:46:45] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[14:47:49] <logmsgbot>	 !log brouberol@cumin1003 START - Cookbook sre.dns.netbox
[14:48:53] <wikibugs>	 (PS1) Andrew Bogott: vendordata.txt: add a standardized cloud-init final message [puppet] - https://gerrit.wikimedia.org/r/1177412 (https://phabricator.wikimedia.org/T401584)
[14:49:41] <icinga-wm>	 PROBLEM - Host ml-serve2003 is DOWN: PING CRITICAL - Packet loss = 100%
[14:51:29] <logmsgbot>	 !log brouberol@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-jumbo1009.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1003"
[14:52:24] <logmsgbot>	 !log brouberol@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-jumbo1009.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1003"
[14:52:24] <logmsgbot>	 !log brouberol@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:52:25] <logmsgbot>	 !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kafka-jumbo1009.eqiad.wmnet
[14:52:53] <brouberol>	 !log kafka-jumbo1007->9 are now decommissioned - T397447
[14:52:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:56] <stashbot>	 T397447: Take kafka-jumbo100[7-9] out of service, ready for decom - https://phabricator.wikimedia.org/T397447
[14:54:03] <wikibugs>	 (CR) Majavah: Add Trixie images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1177345 (https://phabricator.wikimedia.org/T400255) (owner: Majavah)
[14:54:50] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:55:36] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1045.eqiad.wmnet with OS bookworm
[14:55:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11074165 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1045.eqiad.wmnet with OS bookworm executed with errors: - cloudcephosd104...
[14:56:51] <wikibugs>	 (CR) Btullis: stat hosts: Alert on memory stalls (2 comments) [alerts] - https://gerrit.wikimedia.org/r/1177410 (https://phabricator.wikimedia.org/T401589) (owner: Bking)
[14:58:47] <icinga-wm>	 RECOVERY - Host ml-serve2003 is UP: PING OK - Packet loss = 0%, RTA = 30.45 ms
[14:59:50] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: ml-serve2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[15:00:08] <wikibugs>	 (CR) Andrew Bogott: [C:+2] vendordata.txt: add a standardized cloud-init final message [puppet] - https://gerrit.wikimedia.org/r/1177412 (https://phabricator.wikimedia.org/T401584) (owner: Andrew Bogott)
[15:01:06] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[15:01:21] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[15:04:32] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[15:06:12] <wikibugs>	 (CR) Joal: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1176295 (https://phabricator.wikimedia.org/T398236) (owner: CDanis)
[15:06:13] <wikibugs>	 (PS1) Muehlenhoff: Record LDAP access for thiemowmde [puppet] - https://gerrit.wikimedia.org/r/1177414 (https://phabricator.wikimedia.org/T400374)
[15:06:27] <wikibugs>	 SRE, Infrastructure-Foundations: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#11074227 (Jdforrester-WMF)
[15:06:39] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2003.codfw.wmnet
[15:06:41] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2003.codfw.wmnet
[15:06:42] <wikibugs>	 (CR) Joal: [C:+1] "LGTM ! Thanks for cleaning this up :)" [puppet] - https://gerrit.wikimedia.org/r/1176296 (https://phabricator.wikimedia.org/T398236) (owner: CDanis)
[15:08:10] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:20] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2004.codfw.wmnet
[15:09:21] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2004.codfw.wmnet
[15:09:32] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[15:09:46] <wikibugs>	 (PS2) Btullis: Move the journalnode on analytics1072 to an-worker1126 [puppet] - https://gerrit.wikimedia.org/r/1177351 (https://phabricator.wikimedia.org/T397166)
[15:12:30] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[15:13:45] <wikibugs>	 (CR) Btullis: [C:+2] Move the journalnode on analytics1072 to an-worker1126 [puppet] - https://gerrit.wikimedia.org/r/1177351 (https://phabricator.wikimedia.org/T397166) (owner: Btullis)
[15:15:02] <wikibugs>	 (PS1) Andrew Bogott: wmcs-image-create: change expected message for cloud-init finish [puppet] - https://gerrit.wikimedia.org/r/1177419 (https://phabricator.wikimedia.org/T401584)
[15:15:35] <icinga-wm>	 PROBLEM - Host ml-serve2004 is DOWN: PING CRITICAL - Packet loss = 100%
[15:19:29] <moritzm>	 !log installing node-form-data security updates
[15:19:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:32] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:20:03] <jinxer-wm>	 FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[15:20:50] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[15:22:46] <wikibugs>	 (PS6) BryanDavis: varnish: Allow customising "contact noc@" error [puppet] - https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: Majavah)
[15:23:17] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe2020.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:23:36] <wikibugs>	 (CR) BryanDavis: "PS6 was a manual rebase to fix edit conflicts with Ia9ba82c994ba5761a569ac31dcc04fc062f53686" [puppet] - https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: Majavah)
[15:24:15] <icinga-wm>	 RECOVERY - Host ml-serve2004 is UP: PING OK - Packet loss = 0%, RTA = 30.53 ms
[15:24:20] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2020.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:25:50] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[15:28:51] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[15:29:29] <wikibugs>	 ops-codfw, SRE, DC-Ops: Check list of PXE miss-configs for codfw - https://phabricator.wikimedia.org/T401442#11074407 (Jhancock.wm)
[15:29:49] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2004.codfw.wmnet
[15:29:50] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2004.codfw.wmnet
[15:30:05] <jouncebot>	 jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250811T1530).
[15:30:20] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11074412 (tappof) Hello @Novem_Linguae, Yes, it’s safe to remove the shell access checklist from the original post. Moreover, I’ve just added you to the NDA group, so your request has b...
[15:30:55] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2009.codfw.wmnet
[15:30:56] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2009.codfw.wmnet
[15:34:30] <wikibugs>	 (PS4) Bking: stat hosts: Alert on memory stalls [alerts] - https://gerrit.wikimedia.org/r/1177410 (https://phabricator.wikimedia.org/T401589)
[15:35:59] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-fe2020.codfw.wmnet with OS bullseye
[15:36:08] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q1:rack/setup/install ms-fe20[17-20] - https://phabricator.wikimedia.org/T401225#11074436 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-fe2020.codfw.wmnet with OS bullseye
[15:38:34] <wikibugs>	 (CR) CDanis: [C:+2] benthos: webrequest_sampled_live: remove client_port [puppet] - https://gerrit.wikimedia.org/r/1176295 (https://phabricator.wikimedia.org/T398236) (owner: CDanis)
[15:38:37] <wikibugs>	 (CR) CDanis: [C:+2] turnilo: webrequest_sampled_live: remove client_port [puppet] - https://gerrit.wikimedia.org/r/1176296 (https://phabricator.wikimedia.org/T398236) (owner: CDanis)
[15:38:38] <wikibugs>	 (CR) Bking: stat hosts: Alert on memory stalls (2 comments) [alerts] - https://gerrit.wikimedia.org/r/1177410 (https://phabricator.wikimedia.org/T401589) (owner: Bking)
[15:42:26] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.dns.netbox
[15:43:04] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[15:43:23] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[15:43:24] <wikibugs>	 (PS1) Muehlenhoff: Bump the version numbers for Java images based on latest security releases [docker-images/production-images] - https://gerrit.wikimedia.org/r/1177424
[15:43:47] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[15:44:00] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[15:45:15] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[15:45:27] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[15:45:46] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2049 to codfw - jhancock@cumin1002"
[15:45:54] <logmsgbot>	 !log jhancock@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2049 to codfw - jhancock@cumin1002"
[15:45:54] <logmsgbot>	 !log jhancock@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:46:21] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host es2049
[15:46:30] <logmsgbot>	 !log jhancock@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2049
[15:47:23] <wikibugs>	 (CR) CDanis: [C:+2] benthos webrequest: Add wmfuniq X-Analytics sub-field [puppet] - https://gerrit.wikimedia.org/r/1176299 (https://phabricator.wikimedia.org/T400753) (owner: CDanis)
[15:47:25] <wikibugs>	 (CR) CDanis: [C:+2] turnilo: webrequest: add wmfuniq X-Analytics sub-field [puppet] - https://gerrit.wikimedia.org/r/1176300 (https://phabricator.wikimedia.org/T400753) (owner: CDanis)
[15:49:20] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host es2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:50:09] <wikibugs>	 (PS1) Papaul: Add scs-e3-codfw to monitoring [puppet] - https://gerrit.wikimedia.org/r/1177427
[15:50:35] <wikibugs>	 (CR) CI reject: [V:-1] Add scs-e3-codfw to monitoring [puppet] - https://gerrit.wikimedia.org/r/1177427 (owner: Papaul)
[15:51:04] <wikibugs>	 (PS2) Andrew Bogott: wmcs-image-create: change expected message for cloud-init finish [puppet] - https://gerrit.wikimedia.org/r/1177419 (https://phabricator.wikimedia.org/T401584)
[15:51:04] <wikibugs>	 (PS1) Andrew Bogott: vendordata.txt: further attempts to clean up Trixie behavior [puppet] - https://gerrit.wikimedia.org/r/1177428 (https://phabricator.wikimedia.org/T401584)
[15:51:28] <wikibugs>	 (PS2) Papaul: Add scs-e3-codfw to monitoring [puppet] - https://gerrit.wikimedia.org/r/1177427 (https://phabricator.wikimedia.org/T401310)
[15:52:00] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[15:52:16] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[15:52:22] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2020.codfw.wmnet with reason: host reimage
[15:52:48] <wikibugs>	 (CR) Papaul: [C:+2] Add scs-e3-codfw to monitoring [puppet] - https://gerrit.wikimedia.org/r/1177427 (https://phabricator.wikimedia.org/T401310) (owner: Papaul)
[15:55:01] <wikibugs>	 (CR) Andrew Bogott: [C:+2] wmcs-image-create: change expected message for cloud-init finish [puppet] - https://gerrit.wikimedia.org/r/1177419 (https://phabricator.wikimedia.org/T401584) (owner: Andrew Bogott)
[15:55:21] <wikibugs>	 (CR) Andrew Bogott: [C:+2] vendordata.txt: further attempts to clean up Trixie behavior [puppet] - https://gerrit.wikimedia.org/r/1177428 (https://phabricator.wikimedia.org/T401584) (owner: Andrew Bogott)
[15:58:39] <wikibugs>	 (Restored) Federico Ceratto: sre.mysql.upgrade: Switch to Host, apt-get and mysql helpers [cookbooks] - https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: Arnaudb)
[15:59:48] <wikibugs>	 ops-codfw, SRE, DC-Ops, Patch-For-Review: Add scs-e3-codfw to monitoring - https://phabricator.wikimedia.org/T401310#11074563 (Papaul) Open→Resolved  complete
[16:00:27] <wikibugs>	 SRE, SRE-OnFire, SRE Observability: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244#11074566 (Dzahn) cc: @Kappakayala (since she was now looking into maintenance of that spreadsheet and the roster in general)
[16:01:46] <logmsgbot>	 !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:02:09] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2020.codfw.wmnet with reason: host reimage
[16:03:17] <wikibugs>	 (CR) Vgutierrez: [C:+1] varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - https://gerrit.wikimedia.org/r/1154085 (owner: CDobbins)
[16:03:58] <icinga-wm>	 PROBLEM - Host ml-serve2009 is DOWN: PING CRITICAL - Packet loss = 100%
[16:04:19] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device lsw1-f1-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T401411#11074582 (Jhancock.wm) Open→Resolved a:Jhancock.wm provisioned the server i forgot to do last week. alert has cleared.
[16:07:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[16:08:50] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2009.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2009.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:09:38] <wikibugs>	 (PS1) CDanis: feat: haproxy allhdrs [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1177431
[16:09:48] <wikibugs>	 (PS3) Btullis: Move the journalnode on an-worker1090 to an-worker1108 [puppet] - https://gerrit.wikimedia.org/r/1177352 (https://phabricator.wikimedia.org/T397166)
[16:10:20] <wikibugs>	 (PS2) CDanis: feat: haproxy allhdrs [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1177431
[16:10:38] <wikibugs>	 (CR) CDanis: [V:+2 C:+2] feat: haproxy allhdrs [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1177431 (owner: CDanis)
[16:10:48] <logmsgbot>	 !log cdanis@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "feat: haproxy allhdrs - cdanis@cumin1003"
[16:10:50] <logmsgbot>	 !log cdanis@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: feat: haproxy allhdrs - cdanis@cumin1003
[16:11:42] <logmsgbot>	 !log cdanis@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: feat: haproxy allhdrs - cdanis@cumin1003
[16:11:44] <logmsgbot>	 !log cdanis@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "feat: haproxy allhdrs - cdanis@cumin1003"
[16:12:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[16:16:06] <icinga-wm>	 RECOVERY - Host ml-serve2009 is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms
[16:18:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1174578 (https://phabricator.wikimedia.org/T400855) (owner: Krinkle)
[16:18:50] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: ml-serve2009.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2009.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:19:32] <icinga-wm>	 PROBLEM - Host ml-serve2009 is DOWN: PING CRITICAL - Packet loss = 100%
[16:19:39] <logmsgbot>	 !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1174578|Disable MobileFrontend on thankyou.wikipedia.org and nostalgia.wikipedia.org (T400855 T152882)]]
[16:19:44] <stashbot>	 T400855: Decide how to configure $wgMobileUrlCallback during mobile domain sunset - https://phabricator.wikimedia.org/T400855
[16:19:45] <stashbot>	 T152882: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882
[16:21:28] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1174578|Disable MobileFrontend on thankyou.wikipedia.org and nostalgia.wikipedia.org (T400855 T152882)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:23:50] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2009.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2009.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:24:08] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Continuing with sync
[16:25:02] <wikibugs>	 (CR) Btullis: [C:+2] Move the journalnode on an-worker1090 to an-worker1108 [puppet] - https://gerrit.wikimedia.org/r/1177352 (https://phabricator.wikimedia.org/T397166) (owner: Btullis)
[16:26:35] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[16:26:55] <wikibugs>	 (PS2) Krinkle: Remove unused wgChronologyProtectorStash [mediawiki-config] - https://gerrit.wikimedia.org/r/1083288 (https://phabricator.wikimedia.org/T336004)
[16:26:55] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[16:26:56] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2020.codfw.wmnet with OS bullseye
[16:27:07] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q1:rack/setup/install ms-fe20[17-20] - https://phabricator.wikimedia.org/T401225#11074678 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ms-fe2020.codfw.wmnet with OS bullseye completed: - ms-fe2020 (**WA...
[16:27:39] <wikibugs>	 (Abandoned) Krinkle: Remove unused wgChronologyProtectorStash [mediawiki-config] - https://gerrit.wikimedia.org/r/1083288 (https://phabricator.wikimedia.org/T336004) (owner: Krinkle)
[16:27:57] <wikibugs>	 (PS1) Andrew Bogott: vendordata.txt: allow downgrading puppet version [puppet] - https://gerrit.wikimedia.org/r/1177438 (https://phabricator.wikimedia.org/T401584)
[16:28:12] <wikibugs>	 (CR) CI reject: [V:-1] vendordata.txt: allow downgrading puppet version [puppet] - https://gerrit.wikimedia.org/r/1177438 (https://phabricator.wikimedia.org/T401584) (owner: Andrew Bogott)
[16:28:12] <wikibugs>	 (PS2) Andrew Bogott: vendordata.txt: allow downgrading puppet version [puppet] - https://gerrit.wikimedia.org/r/1177438 (https://phabricator.wikimedia.org/T401584)
[16:28:27] <wikibugs>	 (PS3) Andrew Bogott: vendordata.txt: allow downgrading puppet version [puppet] - https://gerrit.wikimedia.org/r/1177438 (https://phabricator.wikimedia.org/T401584)
[16:29:05] <wikibugs>	 (CR) Jforrester: [C:+1] "<3" [mediawiki-config] - https://gerrit.wikimedia.org/r/1174578 (https://phabricator.wikimedia.org/T400855) (owner: Krinkle)
[16:29:18] <wikibugs>	 (CR) Andrew Bogott: [C:+2] vendordata.txt: allow downgrading puppet version [puppet] - https://gerrit.wikimedia.org/r/1177438 (https://phabricator.wikimedia.org/T401584) (owner: Andrew Bogott)
[16:29:31] <logmsgbot>	 !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1174578|Disable MobileFrontend on thankyou.wikipedia.org and nostalgia.wikipedia.org (T400855 T152882)]] (duration: 09m 52s)
[16:29:37] <stashbot>	 T400855: Decide how to configure $wgMobileUrlCallback during mobile domain sunset - https://phabricator.wikimedia.org/T400855
[16:29:37] <stashbot>	 T152882: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882
[16:30:26] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1176714 (owner: Krinkle)
[16:30:26] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1176715 (owner: Krinkle)
[16:30:26] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1176708 (owner: Krinkle)
[16:31:32] <wikibugs>	 (Merged) jenkins-bot: tests: Improve false-positive testOnlyExistingWikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1176714 (owner: Krinkle)
[16:31:33] <wikibugs>	 (Merged) jenkins-bot: WmfConfig: Document why 'preinstall' is indexed [mediawiki-config] - https://gerrit.wikimedia.org/r/1176715 (owner: Krinkle)
[16:31:35] <wikibugs>	 (Merged) jenkins-bot: manage-dblist: Remove mention of non-existant "preinstall-labs" [mediawiki-config] - https://gerrit.wikimedia.org/r/1176708 (owner: Krinkle)
[16:31:49] <logmsgbot>	 !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1176714|tests: Improve false-positive testOnlyExistingWikis]], [[gerrit:1176715|WmfConfig: Document why 'preinstall' is indexed]], [[gerrit:1176708|manage-dblist: Remove mention of non-existant "preinstall-labs"]]
[16:32:25] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q1:rack/setup/install ms-fe20[17-20] - https://phabricator.wikimedia.org/T401225#11074719 (Jhancock.wm) Open→Resolved
[16:32:34] <wikibugs>	 (CR) JHathaway: [C:+1] Bump the version numbers for Java images based on latest security releases [docker-images/production-images] - https://gerrit.wikimedia.org/r/1177424 (owner: Muehlenhoff)
[16:32:48] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q1:rack/setup/install ms-fe20[17-20] - https://phabricator.wikimedia.org/T401225#11074723 (Jhancock.wm) @MatthewVernon these are complete.
[16:33:40] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1176714|tests: Improve false-positive testOnlyExistingWikis]], [[gerrit:1176715|WmfConfig: Document why 'preinstall' is indexed]], [[gerrit:1176708|manage-dblist: Remove mention of non-existant "preinstall-labs"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:38:47] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11074762 (Jhancock.wm)
[16:40:37] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Continuing with sync
[16:41:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.256s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:42:37] <logmsgbot>	 !log jgreen@cumin1002 START - Cookbook sre.dns.netbox
[16:44:36] <jinxer-wm>	 FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[16:44:50] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware: decommission frdb1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T401611#11074788 (Jgreen) a:Jgreen→None
[16:46:02] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware: decommission frdb1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T401611#11074795 (Jgreen)
[16:46:13] <logmsgbot>	 !log jgreen@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove host frdb1003.frack.eqiad.wmnet from DNS for decommissioning - jgreen@cumin1002"
[16:46:14] <logmsgbot>	 !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1176714|tests: Improve false-positive testOnlyExistingWikis]], [[gerrit:1176715|WmfConfig: Document why 'preinstall' is indexed]], [[gerrit:1176708|manage-dblist: Remove mention of non-existant "preinstall-labs"]] (duration: 14m 26s)
[16:46:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.393s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:46:18] <logmsgbot>	 !log jgreen@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove host frdb1003.frack.eqiad.wmnet from DNS for decommissioning - jgreen@cumin1002"
[16:46:18] <logmsgbot>	 !log jgreen@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:46:49] <wikibugs>	 (03PS29) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608)
[16:47:29] <wikibugs>	 (03PS3) 10Anzx: tlwikisource: add author ( Manunulat ) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176509 (https://phabricator.wikimedia.org/T388654)
[16:49:29] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ores-extension: add threshold for revertrisk in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177446 (https://phabricator.wikimedia.org/T400590)
[16:50:48] <wikibugs>	 (03CR) 10Anzx: tlwikisource: add author ( Manunulat ) namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176509 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx)
[16:51:59] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176509 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx)
[16:55:46] <icinga-wm>	 RECOVERY - Host ml-serve2009 is UP: PING OK - Packet loss = 0%, RTA = 30.59 ms
[16:58:50] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: ml-serve2009.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2009.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250811T1700)
[17:00:04] <jouncebot>	 ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250811T1700).
[17:02:29] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6548/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[17:04:03] <wikibugs>	 (03PS30) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608)
[17:04:54] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6549/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[17:10:36] <wikibugs>	 (03PS31) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608)
[17:13:11] <wikibugs>	 (03PS32) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608)
[17:14:34] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6551/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[17:14:56] <wikibugs>	 (03CR) 10CDobbins: dnsrecursor: add recursor.yml.erb (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[17:20:17] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2009.codfw.wmnet
[17:20:18] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2009.codfw.wmnet
[17:22:51] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gerrit: add NEL headers to apache [puppet] - 10https://gerrit.wikimedia.org/r/1175552 (https://phabricator.wikimedia.org/T303725) (owner: 10Dzahn)
[17:23:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.621s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:23:45] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host es2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:27:50] <logmsgbot>	 !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:28:14] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2049']
[17:28:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.11s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:28:26] <logmsgbot>	 !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es2049']
[17:31:41] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725#11075022 (10Dzahn)
[17:32:50] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725#11075027 (10Dzahn) @CDanis Deployed on Gerrit. It should be sending the headers now. I added some check boxes for other services to the ticket descrip...
[17:33:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.495s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:34:10] <wikibugs>	 (03CR) 10CDobbins: [C:03+2] varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - 10https://gerrit.wikimedia.org/r/1154085 (owner: 10CDobbins)
[17:43:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.433s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:46:45] <wikibugs>	 (03PS1) 10Andrew Bogott: sssd: fix service notification/restart for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1177450 (https://phabricator.wikimedia.org/T401584)
[17:50:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.253s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:55:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.25s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:57:18] <wikibugs>	 (03PS1) 10Dzahn: icinga: add NEL headers to httpd config [puppet] - 10https://gerrit.wikimedia.org/r/1177452 (https://phabricator.wikimedia.org/T303725)
[17:57:44] <wikibugs>	 (03CR) 10CI reject: [V:04-1] icinga: add NEL headers to httpd config [puppet] - 10https://gerrit.wikimedia.org/r/1177452 (https://phabricator.wikimedia.org/T303725) (owner: 10Dzahn)
[17:59:57] <wikibugs>	 (03PS2) 10Dzahn: icinga: add NEL headers to httpd config [puppet] - 10https://gerrit.wikimedia.org/r/1177452 (https://phabricator.wikimedia.org/T303725)
[18:00:07] <wikibugs>	 (03CR) 10Btullis: [C:03+1] Bump the version numbers for Java images based on latest security releases [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1177424 (owner: 10Muehlenhoff)
[18:01:29] <wikibugs>	 (03CR) 10Catrope: admin: stop using groups parsoid-roots and parsoid-admin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300) (owner: 10Dzahn)
[18:02:03] <wikibugs>	 (03PS1) 10Dzahn: lists: add NEL headers to apache [puppet] - 10https://gerrit.wikimedia.org/r/1177455 (https://phabricator.wikimedia.org/T303725)
[18:02:14] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[18:02:56] <wikibugs>	 (03CR) 10A smart kitten: varnish: Allow customising "contact noc@" error (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah)
[18:03:26] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] varnish: Implement translation analytics vars [puppet] - 10https://gerrit.wikimedia.org/r/1152114 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall)
[18:03:39] <wikibugs>	 (03PS15) 10CDobbins: dnsrecursor: remove hardcoded values and tidy up [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608)
[18:05:15] <wikibugs>	 (03PS2) 10Dzahn: lists: add NEL headers to apache [puppet] - 10https://gerrit.wikimedia.org/r/1177455 (https://phabricator.wikimedia.org/T303725)
[18:05:16] <wikibugs>	 (03PS1) 10Dzahn: contint/integration: add NEL headers to apache [puppet] - 10https://gerrit.wikimedia.org/r/1177456 (https://phabricator.wikimedia.org/T303725)
[18:07:36] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] varnish: Update User-Agent Policy url in error messages [puppet] - 10https://gerrit.wikimedia.org/r/1172435 (https://phabricator.wikimedia.org/T400421) (owner: 10BryanDavis)
[18:07:52] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "This change is ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[18:11:33] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1177452 (https://phabricator.wikimedia.org/T303725) (owner: 10Dzahn)
[18:11:53] <wikibugs>	 10ops-esams, 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#11075106 (10RobH) CS1147925   Support,    > We've been investigating an ongoing temperature issue in our servers, and would like to have the following done at...
[18:12:22] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[18:13:47] <wikibugs>	 (03PS4) 10BCornwall: varnish: Update User-Agent Policy url in error messages [puppet] - 10https://gerrit.wikimedia.org/r/1172435 (https://phabricator.wikimedia.org/T400421) (owner: 10BryanDavis)
[18:15:03] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:16:15] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1047
[18:16:24] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1047
[18:17:01] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:18:54] <wikibugs>	 (03CR) 10Muehlenhoff: admin: stop using groups parsoid-roots and parsoid-admin (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300) (owner: 10Dzahn)
[18:21:05] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:21:16] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11075133 (10wiki_willy) a:05joanna_borun→03MoritzMuehlenhoff Hi @MoritzMuehlenhoff - are you able to confirm the racking details and update the site.pp info on this one?   Thanks,...
[18:21:43] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[18:21:51] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11075138 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[18:22:15] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6552/co" [puppet] - 10https://gerrit.wikimedia.org/r/1172435 (https://phabricator.wikimedia.org/T400421) (owner: 10BryanDavis)
[18:22:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11075140 (10wiki_willy) a:05joanna_borun→03MoritzMuehlenhoff Hi @MoritzMuehlenhoff - are you able to help confirm the racking details and update site.pp on this one?  Thanks, Willy
[18:29:53] <logmsgbot>	 jhancock@cumin1002 reimage (PID 1178073) is awaiting input
[18:36:38] <jinxer-wm>	 FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[18:37:50] <logmsgbot>	 !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host es2049.codfw.wmnet with OS bookworm
[18:37:58] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11075185 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host es2049.codfw.wmnet with OS bookworm
[18:48:48] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for EGardner-WMF - https://phabricator.wikimedia.org/T401622 (10egardner) 03NEW
[18:50:26] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1047.eqiad.wmnet with reason: host reimage
[18:54:09] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1047.eqiad.wmnet with reason: host reimage
[18:55:54] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] varnish: Update User-Agent Policy url in error messages [puppet] - 10https://gerrit.wikimedia.org/r/1172435 (https://phabricator.wikimedia.org/T400421) (owner: 10BryanDavis)
[18:56:08] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] varnish: Update User-Agent Policy url in error messages [puppet] - 10https://gerrit.wikimedia.org/r/1172435 (https://phabricator.wikimedia.org/T400421) (owner: 10BryanDavis)
[19:01:21] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[19:03:02] <wikibugs>	 (03PS1) 10Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1177469
[19:03:09] <wikibugs>	 (03PS1) 10Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1177470
[19:03:16] <wikibugs>	 (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1177471
[19:04:32] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[19:05:24] <wikibugs>	 (03PS8) 10BCornwall: varnish: Replace X-Page-ID with variable [puppet] - 10https://gerrit.wikimedia.org/r/1152118 (https://phabricator.wikimedia.org/T373550)
[19:09:32] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[19:15:24] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[19:15:58] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[19:15:59] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1047.eqiad.wmnet with OS bookworm
[19:16:12] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11075320 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm completed: - cloudcephosd1047 (**PASS**...
[19:16:34] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11075321 (10VRiley-WMF)
[19:18:09] <wikibugs>	 (03CR) 10Xcollazo: "I can see that this would eliminate the `kiwix/zim` mirrored ZIM files." [puppet] - 10https://gerrit.wikimedia.org/r/1039246 (owner: 10Milimetric)
[19:18:57] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11075328 (10VRiley-WMF) after troubleshooting this with @Papaul for a bit, we found that cloudcephosd1045 has a bad port on the NIC, and cloudcephosd1047 was plugged into a port that was rate...
[19:18:59] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] profile::pyrra::filesystem::slo: refactor the class [puppet] - 10https://gerrit.wikimedia.org/r/1176503 (owner: 10Elukey)
[19:20:18] <jinxer-wm>	 FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[19:20:46] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1152118 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall)
[19:21:40] <wikibugs>	 (03PS5) 10BCornwall: varnish: Replace X-Include-PV with include_pv var [puppet] - 10https://gerrit.wikimedia.org/r/1152311 (https://phabricator.wikimedia.org/T373550)
[19:25:34] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1152311 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall)
[19:32:34] <wikibugs>	 (03PS1) 10Zabe: Initial configuration for madwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177476 (https://phabricator.wikimedia.org/T391747)
[19:35:46] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725#11075445 (10Dzahn)
[19:38:38] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177477
[19:39:01] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1177452/6555/alert2002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1177452 (https://phabricator.wikimedia.org/T303725) (owner: 10Dzahn)
[19:39:29] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] Fixed tabs to spaces. [dns] - 10https://gerrit.wikimedia.org/r/1131978 (owner: 10SCherukuwada)
[19:39:30] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11075475 (10Jhancock.wm) @Papaul install of the os went through without issue on es2049. But looks like it's going to fail here. I did find entries for the servers in site and p...
[19:39:33] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Initial configuration for madwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177476 (https://phabricator.wikimedia.org/T391747) (owner: 10Zabe)
[19:39:50] <wikibugs>	 (03CR) 10BCornwall: Fixed tabs to spaces. [dns] - 10https://gerrit.wikimedia.org/r/1131978 (owner: 10SCherukuwada)
[19:40:09] <wikibugs>	 (03Abandoned) 10BCornwall: Fixed tabs to spaces. [dns] - 10https://gerrit.wikimedia.org/r/1131978 (owner: 10SCherukuwada)
[19:40:27] <wikibugs>	 (03Merged) 10jenkins-bot: Initial configuration for madwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177476 (https://phabricator.wikimedia.org/T391747) (owner: 10Zabe)
[19:40:46] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1177476|Initial configuration for madwikisource (T391747)]]
[19:40:50] <stashbot>	 T391747: Create Wikisource Madurese - https://phabricator.wikimedia.org/T391747
[19:42:53] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1177476|Initial configuration for madwikisource (T391747)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[19:43:17] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[19:43:35] <wikibugs>	 (03CR) 10A smart kitten: varnish: Update User-Agent Policy url in error messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1172435 (https://phabricator.wikimedia.org/T400421) (owner: 10BryanDavis)
[19:47:48] <wikibugs>	 (03PS1) 10BCornwall: fixup! varnish: Update User-Agent Policy url in error messages [puppet] - 10https://gerrit.wikimedia.org/r/1177479
[19:48:01] <wikibugs>	 (03PS1) 10Dzahn: Revert "icinga: add NEL headers to httpd config" [puppet] - 10https://gerrit.wikimedia.org/r/1177480
[19:48:15] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Revert "icinga: add NEL headers to httpd config" [puppet] - 10https://gerrit.wikimedia.org/r/1177480 (owner: 10Dzahn)
[19:48:20] <wikibugs>	 (03CR) 10Dzahn: [V:03+2 C:03+2] Revert "icinga: add NEL headers to httpd config" [puppet] - 10https://gerrit.wikimedia.org/r/1177480 (owner: 10Dzahn)
[19:48:34] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1177476|Initial configuration for madwikisource (T391747)]] (duration: 07m 48s)
[19:48:38] <stashbot>	 T391747: Create Wikisource Madurese - https://phabricator.wikimedia.org/T391747
[19:49:10] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] varnish: Update User-Agent Policy url in error messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1172435 (https://phabricator.wikimedia.org/T400421) (owner: 10BryanDavis)
[19:49:36] <wikibugs>	 (03CR) 10Dzahn: [V:03+2 C:03+2] "Aug 11 19:42:51 alert2002 apachectl[1322774]: AH00526: Syntax error on line 18 of /etc/apache2/sites-enabled/50-requestctl-wikimedia-org.c" [puppet] - 10https://gerrit.wikimedia.org/r/1177480 (owner: 10Dzahn)
[19:50:08] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] varnish: Update User-Agent Policy url in error messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1172435 (https://phabricator.wikimedia.org/T400421) (owner: 10BryanDavis)
[19:50:33] <wikibugs>	 (03PS2) 10BCornwall: fixup! varnish: Update User-Agent Policy url in error messages [puppet] - 10https://gerrit.wikimedia.org/r/1177479 (https://phabricator.wikimedia.org/T400421)
[19:51:12] <wikibugs>	 (03PS1) 10Zabe: Activate madwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177482 (https://phabricator.wikimedia.org/T391747)
[19:53:01] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Activate madwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177482 (https://phabricator.wikimedia.org/T391747) (owner: 10Zabe)
[19:53:52] <wikibugs>	 (03Merged) 10jenkins-bot: Activate madwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177482 (https://phabricator.wikimedia.org/T391747) (owner: 10Zabe)
[19:54:13] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1177482|Activate madwikisource (T391747)]]
[19:54:18] <stashbot>	 T391747: Create Wikisource Madurese - https://phabricator.wikimedia.org/T391747
[19:55:10] <wikibugs>	 (03PS7) 10BryanDavis: varnish: Allow customising "contact noc@" error [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah)
[19:55:10] <wikibugs>	 (03PS1) 10BryanDavis: varnish: Update User-Agent Policy url in error messages (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/1177483 (https://phabricator.wikimedia.org/T400421)
[19:56:23] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1177482|Activate madwikisource (T391747)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[19:57:07] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[19:57:07] <logmsgbot>	 !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2049.codfw.wmnet with OS bookworm
[19:57:08] <wikibugs>	 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#11075534 (10BCornwall) 05In progress→03Resolved Basically the final say Dell has had:  > Hello Brett, > > I have been engaged on this case and have reviewed the TSRs and the information in this case...
[19:57:13] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11075537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host es2049.codfw.wmnet with OS bookworm executed with errors: - es2049 (*...
[19:58:44] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1177479 (https://phabricator.wikimedia.org/T400421) (owner: 10BCornwall)
[19:59:17] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Outdated link to User-Agent Policy in Varnish 403 and 429 responses - https://phabricator.wikimedia.org/T400421#11075548 (10BCornwall) 05Open→03In progress p:05Triage→03Medium
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250811T2000).
[20:00:05] <jouncebot>	 anzx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:01:40] <wikibugs>	 (03CR) 10BryanDavis: "Follow up to Iae2900dbb4b71de21f09165a9166500a4f84a351 after Ia9ba82c994ba5761a569ac31dcc04fc062f53686 added more usage of the legacy URL." [puppet] - 10https://gerrit.wikimedia.org/r/1177483 (https://phabricator.wikimedia.org/T400421) (owner: 10BryanDavis)
[20:01:47] <zabe>	 I can deploy
[20:01:49] <wikibugs>	 (03PS1) 10Vgutierrez: conftool: Drop rsa-2048 cert from hiddenparma httpd config [puppet] - 10https://gerrit.wikimedia.org/r/1177486
[20:01:52] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] fixup! varnish: Update User-Agent Policy url in error messages [puppet] - 10https://gerrit.wikimedia.org/r/1177479 (https://phabricator.wikimedia.org/T400421) (owner: 10BCornwall)
[20:02:28] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] conftool: Drop rsa-2048 cert from hiddenparma httpd config [puppet] - 10https://gerrit.wikimedia.org/r/1177486 (owner: 10Vgutierrez)
[20:02:28] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1177482|Activate madwikisource (T391747)]] (duration: 08m 14s)
[20:02:32] <stashbot>	 T391747: Create Wikisource Madurese - https://phabricator.wikimedia.org/T391747
[20:02:45] <zabe>	 anzx: around?
[20:02:45] <wikibugs>	 (03CR) 10BryanDavis: varnish: Allow customising "contact noc@" error (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah)
[20:03:33] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "I tried to deploy unrelated change https://gerrit.wikimedia.org/r/c/operations/puppet/+/1177452  and restarted apache on alert2002 and it " [puppet] - 10https://gerrit.wikimedia.org/r/1177486 (owner: 10Vgutierrez)
[20:03:52] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] conftool: Drop rsa-2048 cert from hiddenparma httpd config [puppet] - 10https://gerrit.wikimedia.org/r/1177486 (owner: 10Vgutierrez)
[20:04:06] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "first need https://gerrit.wikimedia.org/r/c/operations/puppet/+/1177486" [puppet] - 10https://gerrit.wikimedia.org/r/1177452 (https://phabricator.wikimedia.org/T303725) (owner: 10Dzahn)
[20:04:23] <wikibugs>	 (03PS1) 10Dzahn: Revert^2 "icinga: add NEL headers to httpd config" [puppet] - 10https://gerrit.wikimedia.org/r/1177487
[20:06:27] <wikibugs>	 (03Abandoned) 10BryanDavis: varnish: Update User-Agent Policy url in error messages (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/1177483 (https://phabricator.wikimedia.org/T400421) (owner: 10BryanDavis)
[20:06:44] <wikibugs>	 (03CR) 10BryanDavis: [C:03+1] fixup! varnish: Update User-Agent Policy url in error messages [puppet] - 10https://gerrit.wikimedia.org/r/1177479 (https://phabricator.wikimedia.org/T400421) (owner: 10BCornwall)
[20:07:32] <wikibugs>	 (03PS1) 10Zabe: Initial configuration for rkiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177488 (https://phabricator.wikimedia.org/T392490)
[20:07:34] <wikibugs>	 (03PS1) 10Zabe: Activate rkiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177489 (https://phabricator.wikimedia.org/T392490)
[20:08:47] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Initial configuration for rkiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177488 (https://phabricator.wikimedia.org/T392490) (owner: 10Zabe)
[20:09:01] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] fixup! varnish: Update User-Agent Policy url in error messages [puppet] - 10https://gerrit.wikimedia.org/r/1177479 (https://phabricator.wikimedia.org/T400421) (owner: 10BCornwall)
[20:09:27] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Revert^2 "icinga: add NEL headers to httpd config" [puppet] - 10https://gerrit.wikimedia.org/r/1177487 (owner: 10Dzahn)
[20:09:40] <wikibugs>	 (03Merged) 10jenkins-bot: Initial configuration for rkiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177488 (https://phabricator.wikimedia.org/T392490) (owner: 10Zabe)
[20:10:29] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Outdated link to User-Agent Policy in Varnish 403 and 429 responses - https://phabricator.wikimedia.org/T400421#11075598 (10BCornwall) Should be all good on the varnish side now.
[20:10:56] <wikibugs>	 (03PS1) 10Zabe: Initial configuration for minwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177490 (https://phabricator.wikimedia.org/T395452)
[20:11:51] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Initial configuration for minwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177490 (https://phabricator.wikimedia.org/T395452) (owner: 10Zabe)
[20:12:40] <wikibugs>	 (03Merged) 10jenkins-bot: Initial configuration for minwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177490 (https://phabricator.wikimedia.org/T395452) (owner: 10Zabe)
[20:13:29] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11075623 (10Papaul) @Jhancock.wm indeed the server is already in site.pp and also I don't see its puppet cert on the wrong puppet server. so i need to look more and see why it...
[20:13:30] <wikibugs>	 (03PS1) 10Zabe: Initial configuration for zghwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177492 (https://phabricator.wikimedia.org/T399684)
[20:14:12] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Initial configuration for zghwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177492 (https://phabricator.wikimedia.org/T399684) (owner: 10Zabe)
[20:15:08] <wikibugs>	 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11075627 (10XenoRyet)
[20:15:46] <wikibugs>	 (03Merged) 10jenkins-bot: Initial configuration for zghwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177492 (https://phabricator.wikimedia.org/T399684) (owner: 10Zabe)
[20:16:08] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1177488|Initial configuration for rkiwiki (T392490)]], [[gerrit:1177490|Initial configuration for minwikibooks (T395452)]], [[gerrit:1177492|Initial configuration for zghwiktionary (T399684)]]
[20:16:15] <stashbot>	 T392490: Create Wikipedia Arakan - https://phabricator.wikimedia.org/T392490
[20:16:16] <stashbot>	 T395452: Create Wikibooks Minangkabau - https://phabricator.wikimedia.org/T395452
[20:16:16] <stashbot>	 T399684: Create Wiktionary Standard Moroccan Tamazight - https://phabricator.wikimedia.org/T399684
[20:18:12] <wikibugs>	 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11075645 (10greg) summary of discussion during fr-tech backlog triage call: We're pretty certain if the only thing chang...
[20:18:22] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1177488|Initial configuration for rkiwiki (T392490)]], [[gerrit:1177490|Initial configuration for minwikibooks (T395452)]], [[gerrit:1177492|Initial configuration for zghwiktionary (T399684)]]
[20:20:17] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1177488|Initial configuration for rkiwiki (T392490)]], [[gerrit:1177490|Initial configuration for minwikibooks (T395452)]], [[gerrit:1177492|Initial configuration for zghwiktionary (T399684)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:20:38] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Check list of PXE miss-configs for codfw - https://phabricator.wikimedia.org/T401442#11075658 (10Jhancock.wm)
[20:21:04] <wikibugs>	 (03CR) 10BryanDavis: Add Trixie images (032 comments) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1177345 (https://phabricator.wikimedia.org/T400255) (owner: 10Majavah)
[20:21:30] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[20:27:04] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1177488|Initial configuration for rkiwiki (T392490)]], [[gerrit:1177490|Initial configuration for minwikibooks (T395452)]], [[gerrit:1177492|Initial configuration for zghwiktionary (T399684)]] (duration: 08m 42s)
[20:27:11] <stashbot>	 T392490: Create Wikipedia Arakan - https://phabricator.wikimedia.org/T392490
[20:27:11] <stashbot>	 T395452: Create Wikibooks Minangkabau - https://phabricator.wikimedia.org/T395452
[20:27:12] <stashbot>	 T399684: Create Wiktionary Standard Moroccan Tamazight - https://phabricator.wikimedia.org/T399684
[20:27:35] <mutante>	 many new wikis incoming? we thank you, zabe
[20:29:04] <zabe>	 yw :)
[20:33:18] <wikibugs>	 (03PS2) 10Zabe: Activate rkiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177489 (https://phabricator.wikimedia.org/T392490)
[20:33:28] <wikibugs>	 (03PS3) 10Zabe: Activate rkiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177489 (https://phabricator.wikimedia.org/T392490)
[20:33:35] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Activate rkiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177489 (https://phabricator.wikimedia.org/T392490) (owner: 10Zabe)
[20:34:22] <wikibugs>	 (03Merged) 10jenkins-bot: Activate rkiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177489 (https://phabricator.wikimedia.org/T392490) (owner: 10Zabe)
[20:36:56] <wikibugs>	 10ops-codfw, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634 (10Jhancock.wm) 03NEW
[20:47:27] <wikibugs>	 10ops-codfw, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11075768 (10Jhancock.wm)
[20:47:33] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725#11075767 (10Dzahn)
[20:50:52] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1177489|Activate rkiwiki (T392490)]]
[20:50:56] <stashbot>	 T392490: Create Wikipedia Arakan - https://phabricator.wikimedia.org/T392490
[20:51:52] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11075775 (10Novem_Linguae)
[20:54:13] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1177489|Activate rkiwiki (T392490)]]
[20:56:20] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1177489|Activate rkiwiki (T392490)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:56:24] <stashbot>	 T392490: Create Wikipedia Arakan - https://phabricator.wikimedia.org/T392490
[20:57:34] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: Time to snap out of that daydream and deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250811T2100).
[21:00:36] <sbassett>	 zabe: I assume you need a few more minutes to wrap up the backport window?
[21:00:39] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11075795 (10Novem_Linguae) Thanks! I just tried to log into a couple of NDA tools such as Superset and Icinga and got "Authentication Failure. Service access denied due to missing privile...
[21:01:06] <zabe>	 sbassett: yes sorry, the current backport will need 2-3 more minutes to finish
[21:01:26] <sbassett>	 np
[21:02:44] <wikibugs>	 (03PS2) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1177470 (owner: 10Ncmonitor)
[21:03:07] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1177489|Activate rkiwiki (T392490)]] (duration: 08m 54s)
[21:03:11] <stashbot>	 T392490: Create Wikipedia Arakan - https://phabricator.wikimedia.org/T392490
[21:03:28] <zabe>	 sbassett: you can re-add your patches now
[21:03:37] <sbassett>	 zabe: tx
[21:09:31] <sbassett>	 Hey all - I have a few extension-related security patches to deploy right now.
[21:10:30] <wikibugs>	 (03PS3) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1177470 (owner: 10Ncmonitor)
[21:12:06] <wikibugs>	 (03PS4) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1177470 (owner: 10Ncmonitor)
[21:13:20] <icinga-wm>	 PROBLEM - Disk space on an-worker1118 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 153435 MB (4% inode=99%): /var/lib/hadoop/data/e 158497 MB (4% inode=99%): /var/lib/hadoop/data/m 152261 MB (4% inode=99%): /var/lib/hadoop/data/k 159463 MB (4% inode=99%): /var/lib/hadoop/data/f 150139 MB (3% inode=99%): /var/lib/hadoop/data/g 159576 MB (4% inode=99%): /var/lib/hadoop/data/h 153306 MB (4% inode=99%): /var/lib/hadoop/data
[21:13:20] <icinga-wm>	 5 MB (4% inode=99%): /var/lib/hadoop/data/j 159270 MB (4% inode=99%): /var/lib/hadoop/data/c 153332 MB (4% inode=99%): /var/lib/hadoop/data/l 154576 MB (4% inode=99%): /var/lib/hadoop/data/b 156126 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1118&var-datasource=eqiad+prometheus/ops
[21:14:19] <wikibugs>	 (03PS1) 10Zabe: Activate minwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177503 (https://phabricator.wikimedia.org/T395452)
[21:19:48] <sbassett>	 !log Deployed security fix for T397580
[21:19:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:45] <zabe>	 sbassett: could you ping me once you are done?
[21:21:05] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T399249)', diff saved to https://phabricator.wikimedia.org/P80989 and previous config saved to /var/cache/conftool/dbconfig/20250811-212105-fceratto.json
[21:21:09] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[21:26:55] <sbassett>	 zabe: yes.  I’m done with 1, in the middle of 2 and have one more after that.
[21:27:47] <sbassett>	 !log Deployed security fix for T399627 (#2)
[21:27:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:09] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[21:31:12] <sbassett>	 Er, scap sync-file doesn’t seem to be working for me anymore.
[21:32:34] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudcephosd1046 - vriley@cumin1002"
[21:32:39] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudcephosd1046 - vriley@cumin1002"
[21:32:39] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:32:52] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1046
[21:35:56] <logmsgbot>	 vriley@cumin1002 configure-switch-interfaces (PID 1368640) is awaiting input
[21:36:13] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P80990 and previous config saved to /var/cache/conftool/dbconfig/20250811-213612-fceratto.json
[21:36:21] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1046
[21:36:45] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1046.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:37:32] <logmsgbot>	 !log sbassett@deploy1003 Started scap sync-world: Security deployments
[21:39:40] <logmsgbot>	 !log sbassett@deploy1003 Finished scap sync-world: Security deployments (duration: 02m 18s)
[21:45:41] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Outdated link to User-Agent Policy in Varnish 403 and 429 responses - https://phabricator.wikimedia.org/T400421#11075943 (10bd808) 05In progress→03Resolved a:03bd808
[21:51:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P80991 and previous config saved to /var/cache/conftool/dbconfig/20250811-215120-fceratto.json
[21:52:50] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1046.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:56:52] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Outdated link to User-Agent Policy in Varnish 403 and 429 responses - https://phabricator.wikimedia.org/T400421#11075978 (10BCornwall) @bd808 there's still https://gerrit.wikimedia.org/r/c/operations/puppet/+/1172434, isn't there?
[21:59:19] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Outdated link to User-Agent Policy in Varnish 403 and 429 responses - https://phabricator.wikimedia.org/T400421#11075982 (10bd808) >>! In T400421#11075978, @BCornwall wrote: > @bd808 there's still https://gerrit.wikimedia.org/r/c/operations/puppet/+/1172434, isn't th...
[21:59:37] <wikibugs>	 (03PS2) 10BryanDavis: wmcs: Update URL in comment in maintain_dbusers.py [puppet] - 10https://gerrit.wikimedia.org/r/1172434 (https://phabricator.wikimedia.org/T400421)
[22:00:12] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1046.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:00:38] <wikibugs>	 (03CR) 10BryanDavis: [C:03+1] "Do my +1 and fnegri's +1 add up to a +2? Nope. Gerrit math is weird. ;)" [puppet] - 10https://gerrit.wikimedia.org/r/1172434 (https://phabricator.wikimedia.org/T400421) (owner: 10BryanDavis)
[22:02:49] <Amir1>	 zabe: are you done with creations so I can pick up the clean up or should I wait? There are way too many :D
[22:02:52] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Activate minwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177503 (https://phabricator.wikimedia.org/T395452) (owner: 10Zabe)
[22:03:45] <wikibugs>	 (03Merged) 10jenkins-bot: Activate minwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177503 (https://phabricator.wikimedia.org/T395452) (owner: 10Zabe)
[22:03:48] <zabe>	 Amir1: They are created database-wise, I only need to "activate" them.
[22:04:00] <Amir1>	 ah, cool
[22:04:06] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1177503|Activate minwikibooks (T395452)]]
[22:04:10] <stashbot>	 T395452: Create Wikibooks Minangkabau - https://phabricator.wikimedia.org/T395452
[22:06:15] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1177503|Activate minwikibooks (T395452)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:06:28] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T399249)', diff saved to https://phabricator.wikimedia.org/P80992 and previous config saved to /var/cache/conftool/dbconfig/20250811-220627-fceratto.json
[22:06:32] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[22:06:43] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2155.codfw.wmnet with reason: Maintenance
[22:06:50] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T399249)', diff saved to https://phabricator.wikimedia.org/P80993 and previous config saved to /var/cache/conftool/dbconfig/20250811-220650-fceratto.json
[22:07:15] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[22:08:02] <logmsgbot>	 !log zabe@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[22:08:40] <logmsgbot>	 !log zabe@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[22:09:07] <wikibugs>	 (03PS1) 10Zabe: Activate zghwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177509 (https://phabricator.wikimedia.org/T399684)
[22:11:11] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1046.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:11:17] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Activate zghwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177509 (https://phabricator.wikimedia.org/T399684) (owner: 10Zabe)
[22:11:52] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1046.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:12:21] <wikibugs>	 (03Merged) 10jenkins-bot: Activate zghwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177509 (https://phabricator.wikimedia.org/T399684) (owner: 10Zabe)
[22:12:33] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1177503|Activate minwikibooks (T395452)]] (duration: 08m 27s)
[22:12:37] <stashbot>	 T395452: Create Wikibooks Minangkabau - https://phabricator.wikimedia.org/T395452
[22:12:54] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1177509|Activate zghwiktionary (T399684)]]
[22:12:57] <stashbot>	 T399684: Create Wiktionary Standard Moroccan Tamazight - https://phabricator.wikimedia.org/T399684
[22:14:59] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1177509|Activate zghwiktionary (T399684)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:15:32] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[22:17:41] <wikibugs>	 (03PS4) 10Ladsgroup: mariadb: Add CI rule to compare tables catalog and filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/1175171 (https://phabricator.wikimedia.org/T398946)
[22:18:55] <zabe>	 Amir1: so you can cleanup rkiwiki, minwikibooks, zghwiktionary, madwikisource and tlwikisource
[22:19:15] <Amir1>	 thanks
[22:20:48] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1177509|Activate zghwiktionary (T399684)]] (duration: 07m 54s)
[22:20:52] <stashbot>	 T399684: Create Wiktionary Standard Moroccan Tamazight - https://phabricator.wikimedia.org/T399684
[22:21:14] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] mariadb: Add CI rule to compare tables catalog and filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/1175171 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup)
[22:24:08] <wikibugs>	 10ops-esams, 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#11076035 (10RobH) Please note I've saved the photos of the audit to the DC Ops google drive folder, under "2025 magru temp audit".  Cold intake temps range fro...
[22:24:37] <wikibugs>	 (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177511
[22:24:37] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177511 (owner: 10Zabe)
[22:24:40] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1046.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:25:28] <wikibugs>	 (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177511 (owner: 10Zabe)
[22:25:41] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1046.eqiad.wmnet with OS bookworm
[22:25:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11076048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1046.eqiad.wmnet with OS bookworm
[22:25:53] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1177511|Update interwiki cache]]
[22:27:50] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1177511|Update interwiki cache]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:28:44] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[22:31:08] <icinga-wm>	 PROBLEM - Disk space on deploy1003 is CRITICAL: DISK CRITICAL - free space: /srv 17598 MB (5% inode=69%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops
[22:31:24] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis rkiwiki, minwikibooks, zghwiktionary, madwikisource, tlwikisource in section s5
[22:34:01] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1177511|Update interwiki cache]] (duration: 08m 08s)
[22:36:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11076066 (10VRiley-WMF)
[22:36:38] <jinxer-wm>	 FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[22:39:15] <wikibugs>	 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11076079 (10Ladsgroup)
[22:40:51] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Managing sanitization for wikis rkiwiki, minwikibooks, zghwiktionary, madwikisource, tlwikisource in section s5
[22:41:06] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[22:42:03] <Amir1>	 zabe: I don't know if it's related, but deploy1003 is running out of disk on /srv
[22:44:32] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  cloudcephosd1044 - vriley@cumin1002"
[22:44:37] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  cloudcephosd1044 - vriley@cumin1002"
[22:44:37] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:45:01] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1044
[22:46:46] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1044
[22:47:20] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:48:28] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[22:48:34] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[22:48:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T400854)', diff saved to https://phabricator.wikimedia.org/P80994 and previous config saved to /var/cache/conftool/dbconfig/20250811-224841-ladsgroup.json
[22:48:46] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[22:49:07] <zabe>	 The disk space is concerning but not entirely new; it was at 92.7% 6 hours ago
[22:50:10] <thcipriani>	 I see we have 4 versions on disk at the moment. I think we should have three.
[22:50:23] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T400854)', diff saved to https://phabricator.wikimedia.org/P80995 and previous config saved to /var/cache/conftool/dbconfig/20250811-225023-ladsgroup.json
[22:51:14] <thcipriani>	 hrm, no, the last time we ran clean was fine.
[22:51:14] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11076119 (10KFrancis) >>! In T400176#11074412, @tappof wrote: > Hello @Novem_Linguae, > Yes, it’s safe to remove the shell access checklist from the original post. Moreover, I’ve just add...
[22:51:57] <thcipriani>	 it removed wmf.10, but left .11, .12, and .13. Which, now that I think about it, seems about right. We recently added php-next into the mix.
[22:53:08] <zabe>	 Another thing is that we currently have a 26GB "homedirs" folder in /srv
[22:53:31] <brennen>	 zipped up homedirs from old deploy box?
[22:53:49] <rzl>	 yeah I was about to say, that's where we stashed homedirs from mwmaint before throwing em out
[22:54:02] <rzl>	 I believe c.laime shuffled some volume sizes around to make room
[22:54:45] <thcipriani>	 ah
[22:55:19] <logmsgbot>	 vriley@cumin1002 provision (PID 1442382) is awaiting input
[22:56:18] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1046.eqiad.wmnet with reason: host reimage
[22:58:50] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11076145 (10VRiley-WMF) submitted request SR214138427 for cloudcephosd1045
[22:59:46] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1046.eqiad.wmnet with reason: host reimage
[23:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250811T2300)
[23:00:45] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11076157 (10VRiley-WMF)
[23:01:21] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[23:04:32] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[23:05:22] <logmsgbot>	 vriley@cumin1002 provision (PID 1442382) is awaiting input
[23:05:31] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P80996 and previous config saved to /var/cache/conftool/dbconfig/20250811-230530-ladsgroup.json
[23:05:58] <wikibugs>	 (03PS1) 10Aaron Schulz: [WIP] Route old /api/rest_v1/?specs endpoints to static JSON files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177514 (https://phabricator.wikimedia.org/T397203)
[23:07:28] <wikibugs>	 (03PS1) 10Aaron Schulz: [DNM] Route "/api/rest_v1/?spec" requests to the rest gateway [puppet] - 10https://gerrit.wikimedia.org/r/1177515 (https://phabricator.wikimedia.org/T397203)
[23:09:32] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[23:10:05] <wikibugs>	 06SRE, 06serviceops-radar: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647 (10Zabe) 03NEW
[23:10:10] <wikibugs>	 06SRE, 06serviceops-radar: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11076202 (10Zabe) p:05Triage→03Unbreak!
[23:11:48] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:19:17] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[23:20:18] <jinxer-wm>	 FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[23:20:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P80997 and previous config saved to /var/cache/conftool/dbconfig/20250811-232038-ladsgroup.json
[23:22:22] <logmsgbot>	 vriley@cumin1002 reimage (PID 1420001) is awaiting input
[23:35:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T400854)', diff saved to https://phabricator.wikimedia.org/P80998 and previous config saved to /var/cache/conftool/dbconfig/20250811-233545-ladsgroup.json
[23:35:50] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[23:36:01] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[23:37:06] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[23:37:13] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T400854)', diff saved to https://phabricator.wikimedia.org/P80999 and previous config saved to /var/cache/conftool/dbconfig/20250811-233712-ladsgroup.json
[23:38:23] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1177521
[23:38:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1177521 (owner: 10TrainBranchBot)
[23:40:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T400854)', diff saved to https://phabricator.wikimedia.org/P81000 and previous config saved to /var/cache/conftool/dbconfig/20250811-234007-ladsgroup.json
[23:43:58] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[23:43:59] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1068.eqiad.wmnet with OS bookworm
[23:44:12] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#11076268 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1068.eqiad.wmnet with OS bookworm compl...
[23:44:37] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[23:44:38] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1046.eqiad.wmnet with OS bookworm
[23:44:52] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11076269 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1046.eqiad.wmnet with OS bookworm completed: - cloudcephosd1046 (**PASS**...
[23:46:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11076270 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors: - cloudcephosd104...
[23:46:47] <wikibugs>	 (PS1) Dzahn: create th.wikimedia.org for Wikimedia Thailand [dns] - https://gerrit.wikimedia.org/r/1177522 (https://phabricator.wikimedia.org/T400001)
[23:52:09] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1177521 (owner: TrainBranchBot)
[23:55:03] <jinxer-wm>	 RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[23:55:15] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P81001 and previous config saved to /var/cache/conftool/dbconfig/20250811-235515-ladsgroup.json
[23:56:20] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1044.eqiad.wmnet with OS bookworm
[23:56:34] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11076281 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1044.eqiad.wmnet with OS bookworm