Wikimedia IRC logs browser - #wikimedia-operations

2025-10-28 00:00:28 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 00:04:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 00:06:25 <logmsgbot> !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host tcp-proxy3002.esams.wmnet with OS trixie
2025-10-28 00:06:37 <wikibugs> SRE, collaboration-services, Infrastructure-Foundations, vm-requests, Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11316600 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy3002.es...
2025-10-28 00:12:17 <wikibugs> (PS1) Zabe: Initial configuration for minwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1199089 (https://phabricator.wikimedia.org/T408317)
2025-10-28 00:12:58 <wikibugs> (PS1) Zabe: Initial configuration for pcmwikiqoute [mediawiki-config] - https://gerrit.wikimedia.org/r/1199090 (https://phabricator.wikimedia.org/T408317)
2025-10-28 00:13:23 <wikibugs> (PS2) Zabe: Initial configuration for pcmwikiqoute [mediawiki-config] - https://gerrit.wikimedia.org/r/1199090 (https://phabricator.wikimedia.org/T408318)
2025-10-28 00:13:51 <zabe> jouncebot: nowandnext
2025-10-28 00:13:52 <jouncebot> No deployments scheduled for the next 1 hour(s) and 46 minute(s)
2025-10-28 00:13:52 <jouncebot> In 1 hour(s) and 46 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0200)
2025-10-28 00:13:55 <wikibugs> (CR) Zabe: [C:+2] Initial configuration for minwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1199089 (https://phabricator.wikimedia.org/T408317) (owner: Zabe)
2025-10-28 00:14:21 <wikibugs> (CR) Zabe: [C:+2] Initial configuration for pcmwikiqoute [mediawiki-config] - https://gerrit.wikimedia.org/r/1199090 (https://phabricator.wikimedia.org/T408318) (owner: Zabe)
2025-10-28 00:14:48 <wikibugs> (Merged) jenkins-bot: Initial configuration for minwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1199089 (https://phabricator.wikimedia.org/T408317) (owner: Zabe)
2025-10-28 00:15:11 <wikibugs> (Merged) jenkins-bot: Initial configuration for pcmwikiqoute [mediawiki-config] - https://gerrit.wikimedia.org/r/1199090 (https://phabricator.wikimedia.org/T408318) (owner: Zabe)
2025-10-28 00:16:44 <logmsgbot> !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1199090|Initial configuration for pcmwikiqoute (T408318)]], [[gerrit:1199089|Initial configuration for minwikisource (T408317)]]
2025-10-28 00:16:53 <stashbot> T408318: Create Wikiquote Nigerian Pidgin - https://phabricator.wikimedia.org/T408318
2025-10-28 00:16:54 <stashbot> T408317: Create Wikisource Minangkabau - https://phabricator.wikimedia.org/T408317
2025-10-28 00:20:04 <wikibugs> (PS1) Zabe: Activate minwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1199091 (https://phabricator.wikimedia.org/T408317)
2025-10-28 00:20:33 <wikibugs> (PS1) Zabe: Activate pcmwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1199092 (https://phabricator.wikimedia.org/T408318)
2025-10-28 00:24:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 00:25:28 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 00:30:56 <wikibugs> (PS6) Scott French: P:cache::varnish::frontend: render known-client rate limit VCL [puppet] - https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220)
2025-10-28 00:34:02 <wikibugs> (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: Scott French)
2025-10-28 00:37:01 <icinga-wm> PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
2025-10-28 00:39:41 <wikibugs> (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1199093
2025-10-28 00:39:41 <wikibugs> (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1199093 (owner: TrainBranchBot)
2025-10-28 00:42:42 <logmsgbot> !log zabe@deploy2002 zabe: Backport for [[gerrit:1199090|Initial configuration for pcmwikiqoute (T408318)]], [[gerrit:1199089|Initial configuration for minwikisource (T408317)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-28 00:42:48 <stashbot> T408318: Create Wikiquote Nigerian Pidgin - https://phabricator.wikimedia.org/T408318
2025-10-28 00:42:48 <stashbot> T408317: Create Wikisource Minangkabau - https://phabricator.wikimedia.org/T408317
2025-10-28 00:43:00 <logmsgbot> !log zabe@deploy2002 zabe: Continuing with sync
2025-10-28 00:44:01 <icinga-wm> RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 9.117 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
2025-10-28 00:52:01 <icinga-wm> PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
2025-10-28 00:53:55 <icinga-wm> RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 3.562 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
2025-10-28 00:55:28 <jinxer-wm> FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
2025-10-28 00:56:00 <wikibugs> (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1199093 (owner: TrainBranchBot)
2025-10-28 00:57:20 <logmsgbot> !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199090|Initial configuration for pcmwikiqoute (T408318)]], [[gerrit:1199089|Initial configuration for minwikisource (T408317)]] (duration: 40m 37s)
2025-10-28 00:57:26 <stashbot> T408318: Create Wikiquote Nigerian Pidgin - https://phabricator.wikimedia.org/T408318
2025-10-28 00:57:27 <stashbot> T408317: Create Wikisource Minangkabau - https://phabricator.wikimedia.org/T408317
2025-10-28 00:58:47 <wikibugs> (CR) Zabe: [C:+2] Activate minwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1199091 (https://phabricator.wikimedia.org/T408317) (owner: Zabe)
2025-10-28 00:59:21 <wikibugs> (CR) Zabe: [C:+2] Activate pcmwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1199092 (https://phabricator.wikimedia.org/T408318) (owner: Zabe)
2025-10-28 00:59:40 <wikibugs> (Merged) jenkins-bot: Activate minwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1199091 (https://phabricator.wikimedia.org/T408317) (owner: Zabe)
2025-10-28 01:00:09 <wikibugs> (Merged) jenkins-bot: Activate pcmwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1199092 (https://phabricator.wikimedia.org/T408318) (owner: Zabe)
2025-10-28 01:00:54 <logmsgbot> !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
2025-10-28 01:04:01 <wikibugs> (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1199096
2025-10-28 01:04:01 <wikibugs> (CR) Zabe: [C:+2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1199096 (owner: Zabe)
2025-10-28 01:04:55 <wikibugs> (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1199096 (owner: Zabe)
2025-10-28 01:08:13 <wikibugs> (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1199097
2025-10-28 01:08:13 <wikibugs> (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1199097 (owner: TrainBranchBot)
2025-10-28 01:14:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 01:14:14 <logmsgbot> !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 19s)
2025-10-28 01:14:28 <logmsgbot> !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1199092|Activate pcmwikisource (T408318)]], [[gerrit:1199091|Activate minwikisource (T408317)]], [[gerrit:1199096|Update interwiki cache]]
2025-10-28 01:14:34 <stashbot> T408318: Create Wikiquote Nigerian Pidgin - https://phabricator.wikimedia.org/T408318
2025-10-28 01:14:34 <stashbot> T408317: Create Wikisource Minangkabau - https://phabricator.wikimedia.org/T408317
2025-10-28 01:15:28 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 01:16:31 <wikibugs> ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1203 - https://phabricator.wikimedia.org/T408446#11316916 (Jclark-ctr) →Duplicate dup:T408359
2025-10-28 01:16:32 <wikibugs> ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.10.17 - 2025.11.07): Degraded RAID on an-worker1203 - https://phabricator.wikimedia.org/T408359#11316918 (Jclark-ctr)
2025-10-28 01:18:42 <logmsgbot> !log zabe@deploy2002 zabe: Backport for [[gerrit:1199092|Activate pcmwikisource (T408318)]], [[gerrit:1199091|Activate minwikisource (T408317)]], [[gerrit:1199096|Update interwiki cache]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-28 01:22:32 <logmsgbot> !log zabe@deploy2002 zabe: Continuing with sync
2025-10-28 01:23:01 <icinga-wm> PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
2025-10-28 01:23:57 <icinga-wm> RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 4.728 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
2025-10-28 01:30:52 <wikibugs> (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1199097 (owner: TrainBranchBot)
2025-10-28 01:32:35 <logmsgbot> !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199092|Activate pcmwikisource (T408318)]], [[gerrit:1199091|Activate minwikisource (T408317)]], [[gerrit:1199096|Update interwiki cache]] (duration: 18m 07s)
2025-10-28 01:32:41 <stashbot> T408318: Create Wikiquote Nigerian Pidgin - https://phabricator.wikimedia.org/T408318
2025-10-28 01:32:41 <stashbot> T408317: Create Wikisource Minangkabau - https://phabricator.wikimedia.org/T408317
2025-10-28 01:33:39 <Jhs> zabe, pcmwikisource??
2025-10-28 01:33:57 <zabe> no worries
2025-10-28 01:34:03 <zabe> I know its pcmwikiquote
2025-10-28 01:34:04 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 01:34:11 <zabe> its just the commit message that is wrong
2025-10-28 01:34:13 <Jhs> ah, ok, good :)
2025-10-28 01:39:40 <jinxer-wm> FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 01:49:04 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 01:50:37 <wikibugs> (PS1) Andrew Bogott: rabbitmq: rename config file on Trixie [puppet] - https://gerrit.wikimedia.org/r/1199100 (https://phabricator.wikimedia.org/T406516)
2025-10-28 01:50:47 <wikibugs> (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1199100 (https://phabricator.wikimedia.org/T406516) (owner: Andrew Bogott)
2025-10-28 01:53:22 <wikibugs> (CR) Andrew Bogott: [C:+2] rabbitmq: rename config file on Trixie [puppet] - https://gerrit.wikimedia.org/r/1199100 (https://phabricator.wikimedia.org/T406516) (owner: Andrew Bogott)
2025-10-28 01:54:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 01:55:28 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 02:00:04 <jouncebot> Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0200)
2025-10-28 02:07:56 <wikibugs> (PS1) TrainBranchBot: Branch commit for wmf/1.45.0-wmf.25 [core] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199103 (https://phabricator.wikimedia.org/T405681)
2025-10-28 02:07:58 <wikibugs> (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.45.0-wmf.25 [core] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199103 (https://phabricator.wikimedia.org/T405681) (owner: TrainBranchBot)
2025-10-28 02:17:59 <icinga-wm> PROBLEM - Host cloudrabbit2002-dev is DOWN: PING CRITICAL - Packet loss = 100%
2025-10-28 02:19:29 <icinga-wm> RECOVERY - Host cloudrabbit2002-dev is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms
2025-10-28 02:20:28 <jinxer-wm> FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
2025-10-28 02:23:43 <wikibugs> (Merged) jenkins-bot: Branch commit for wmf/1.45.0-wmf.25 [core] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199103 (https://phabricator.wikimedia.org/T405681) (owner: TrainBranchBot)
2025-10-28 02:24:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 02:25:28 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 03:00:05 <jouncebot> Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0300)
2025-10-28 03:02:39 <wikibugs> (PS1) TrainBranchBot: testwikis to 1.45.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1199109 (https://phabricator.wikimedia.org/T405681)
2025-10-28 03:02:41 <wikibugs> (CR) TrainBranchBot: [C:+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1199109 (https://phabricator.wikimedia.org/T405681) (owner: TrainBranchBot)
2025-10-28 03:03:33 <wikibugs> (Merged) jenkins-bot: testwikis to 1.45.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1199109 (https://phabricator.wikimedia.org/T405681) (owner: TrainBranchBot)
2025-10-28 03:04:01 <logmsgbot> !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.45.0-wmf.25 refs T405681
2025-10-28 03:04:06 <stashbot> T405681: 1.45.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T405681
2025-10-28 03:14:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 03:15:28 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 03:20:57 <wikibugs> SRE, collaboration-services, Infrastructure-Foundations, vm-requests, Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11317298 (Dzahn)
2025-10-28 03:24:01 <wikibugs> SRE, collaboration-services, Infrastructure-Foundations, vm-requests, Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11317299 (Dzahn)
2025-10-28 03:29:06 <wikibugs> (PS1) Arlolra: ExtensionDistributor: Mark 1.45 as beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1199113 (https://phabricator.wikimedia.org/T408466)
2025-10-28 03:30:28 <jinxer-wm> FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
2025-10-28 03:37:53 <wikibugs> (PS1) C. Scott Ananian: Forward-compatibility: allow output flags to be serialized in `OutputFlags` [core] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199114 (https://phabricator.wikimedia.org/T292868)
2025-10-28 03:38:26 <wikibugs> (CR) C. Scott Ananian: [C:+2] "Backport patch to wmf.25 which just missed the cut." [core] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199114 (https://phabricator.wikimedia.org/T292868) (owner: C. Scott Ananian)
2025-10-28 03:39:02 <wikibugs> (PS1) C. Scott Ananian: ParserOutput: Add deprecation warnings for ParserOutput::getLanguageLinks() [core] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199115
2025-10-28 03:39:12 <wikibugs> (CR) C. Scott Ananian: [C:+2] "Backport patch to wmf.25 which just missed the cut." [core] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199115 (owner: C. Scott Ananian)
2025-10-28 03:39:45 <wikibugs> (PS1) C. Scott Ananian: Implement a DOM version of the DeduplicateStyles pass [core] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199116 (https://phabricator.wikimedia.org/T405929)
2025-10-28 03:39:56 <wikibugs> (CR) C. Scott Ananian: [C:+2] "Backport patch to wmf.25 which just missed the cut." [core] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199116 (https://phabricator.wikimedia.org/T405929) (owner: C. Scott Ananian)
2025-10-28 03:44:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 03:45:28 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 03:51:51 <logmsgbot> !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.45.0-wmf.25 refs T405681 (duration: 47m 50s)
2025-10-28 03:51:55 <stashbot> T405681: 1.45.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T405681
2025-10-28 03:53:15 <wikibugs> (Merged) jenkins-bot: Forward-compatibility: allow output flags to be serialized in `OutputFlags` [core] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199114 (https://phabricator.wikimedia.org/T292868) (owner: C. Scott Ananian)
2025-10-28 03:55:43 <wikibugs> (Merged) jenkins-bot: ParserOutput: Add deprecation warnings for ParserOutput::getLanguageLinks() [core] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199115 (owner: C. Scott Ananian)
2025-10-28 03:55:47 <wikibugs> (Merged) jenkins-bot: Implement a DOM version of the DeduplicateStyles pass [core] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199116 (https://phabricator.wikimedia.org/T405929) (owner: C. Scott Ananian)
2025-10-28 04:00:04 <jouncebot> Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0400)
2025-10-28 04:02:40 <logmsgbot> !log mwpresync@deploy2002 Pruned MediaWiki: 1.45.0-wmf.22 (duration: 02m 38s)
2025-10-28 04:29:08 <wikibugs> (PS1) C. Scott Ananian: ParserOutput: 'ParseUsedOptions' need not be present in serialized form [core] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199117
2025-10-28 04:29:49 <wikibugs> (CR) C. Scott Ananian: [C:+2] "Pull late patch into the branch cut." [core] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199117 (owner: C. Scott Ananian)
2025-10-28 04:30:28 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 04:34:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 04:38:26 <wikibugs> (PS1) C. Scott Ananian: Expose the list of behavior switch magic words to Parsoid [core] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199118 (https://phabricator.wikimedia.org/T407290)
2025-10-28 04:39:15 <wikibugs> (CR) C. Scott Ananian: [C:+2] "Late patch onto the train" [core] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199118 (https://phabricator.wikimedia.org/T407290) (owner: C. Scott Ananian)
2025-10-28 04:43:39 <wikibugs> (Merged) jenkins-bot: ParserOutput: 'ParseUsedOptions' need not be present in serialized form [core] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199117 (owner: C. Scott Ananian)
2025-10-28 04:45:28 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 04:49:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 04:54:38 <wikibugs> (Merged) jenkins-bot: Expose the list of behavior switch magic words to Parsoid [core] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199118 (https://phabricator.wikimedia.org/T407290) (owner: C. Scott Ananian)
2025-10-28 04:55:28 <jinxer-wm> FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
2025-10-28 04:57:25 <jinxer-wm> FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 05:00:28 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 05:04:01 <icinga-wm> PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
2025-10-28 05:04:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 05:05:53 <icinga-wm> RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30030 bytes in 0.587 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
2025-10-28 05:09:04 <jinxer-wm> FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-28 05:15:01 <icinga-wm> PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
2025-10-28 05:18:53 <icinga-wm> RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 1.421 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
2025-10-28 05:34:04 <jinxer-wm> RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-28 05:39:40 <jinxer-wm> FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 05:50:28 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 06:00:05 <jouncebot> Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0600)
2025-10-28 06:00:05 <jouncebot> marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0600).
2025-10-28 06:03:42 <wikibugs> (CR) Krinkle: ExtensionDistributor: Mark 1.45 as beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1199113 (https://phabricator.wikimedia.org/T408466) (owner: Arlolra)
2025-10-28 06:05:28 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 06:09:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 06:16:48 <wikibugs> ops-ulsfo, DC-Ops, Infrastructure-Foundations, netops: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510 (Papaul) NEW
2025-10-28 06:20:28 <jinxer-wm> FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
2025-10-28 06:43:12 <wikibugs> ops-ulsfo, DC-Ops, Infrastructure-Foundations, netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511 (Papaul) NEW
2025-10-28 06:43:42 <wikibugs> ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510#11317386 (Papaul) p:Triage→Medium
2025-10-28 06:43:54 <wikibugs> ops-ulsfo, DC-Ops, Infrastructure-Foundations, netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11317387 (Papaul) p:Triage→Medium
2025-10-28 06:44:56 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis pcmwikiquote in section s5
2025-10-28 06:53:41 <logmsgbot> !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Managing sanitization for wikis pcmwikiquote in section s5
2025-10-28 06:54:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 06:54:43 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis minwikisource in section s5
2025-10-28 06:55:28 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 07:00:05 <jouncebot> Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0700). nyaa~
2025-10-28 07:00:05 <jouncebot> sefehpisikler: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2025-10-28 07:01:32 <logmsgbot> marostegui@cumin1003 sanitize-wiki (PID 343895) is awaiting input
2025-10-28 07:10:45 <logmsgbot> !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Managing sanitization for wikis minwikisource in section s5
2025-10-28 07:30:28 <jinxer-wm> FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
2025-10-28 07:43:11 <marostegui> !log Deploy schema change on the master x1 T407587
2025-10-28 07:43:15 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-28 07:43:15 <stashbot> T407587: Apply ce_event_contributions schema changes in production (x1) - https://phabricator.wikimedia.org/T407587
2025-10-28 07:43:35 <wikibugs> ('PS1) ''Muehlenhoff: Failover idp.w.o [dns] - ''https://gerrit.wikimedia.org/r/1199225'
2025-10-28 07:44:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 07:47:29 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 28 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca"; [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199026 (https://phabricator.wikimedia.org/T408428) (owner: ''Kosta Harlan)'
2025-10-28 07:47:54 <kostajh> marostegui: I'd like to create database tables in x1 for two wikis for the above config patch, can you check the command I am going to run?
2025-10-28 07:49:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 07:50:28 <kostajh> jouncebot: nowandnext
2025-10-28 07:50:28 <jouncebot> For the next 0 hour(s) and 9 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0700)
2025-10-28 07:50:28 <jouncebot> In 2 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1000)
2025-10-28 07:50:45 <kostajh> also, marostegui are you done deploying?
2025-10-28 07:51:44 <kostajh> I'll take that as a "yes"
2025-10-28 07:51:49 <marostegui> kostajh: Yeah, go for anything
2025-10-28 07:51:53 <marostegui> You need :)
2025-10-28 07:52:07 <marostegui> kostajh: Show me the command
2025-10-28 07:52:52 <kostajh> marostegui: `php maintenance/mysql.php --cluster extension1 --wiki loginwiki ./extensions/CheckUser/schema/mysql/tables-virtual-checkuser-generated.sql`
2025-10-28 07:53:41 <marostegui> kostajh: I guess that is correct; I guess you'd run another one for metawiki
2025-10-28 07:54:21 <kostajh> yeah
2025-10-28 07:54:26 <kostajh> ok, I will try it
2025-10-28 07:55:28 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 07:56:00 <wikibugs> 'ops-ulsfo, ''SRE, ''DC-Ops, ''Infrastructure-Foundations, ''netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11317482 (''cmooney) @papaul looks good! Nothing jumping out at me as problematic in terms of the connectivity plan. I don't think it makes sense to use 40G tho...'
2025-10-28 07:56:02 <kostajh> marostegui: hm, mwscript sql.php has a `--wiki` and a `--wikidb` flag
2025-10-28 07:56:12 <kostajh> should I specify both as `loginwiki` ?
2025-10-28 07:56:23 <marostegui> kostajh: I am not sure, I am not familiar with this procedure :(
2025-10-28 07:56:27 <kostajh> just reading over `mwscript sql.php --help`
2025-10-28 07:56:31 <marostegui> As we don't use it
2025-10-28 07:56:39 <marostegui> (DBAs do not create tables in prod)
2025-10-28 07:58:00 <kostajh> ok
2025-10-28 07:58:10 <kostajh> it seems to have worked
2025-10-28 07:58:41 <kostajh> I will deploy my config patch now
2025-10-28 07:58:45 <wikibugs> ('PS1) ''Brouberol: opensearch-operator: watch the 3 opensearch namespaces in dse-k8s-eqiad [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199226 (https://phabricator.wikimedia.org/T404874)'
2025-10-28 07:59:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 07:59:12 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199026 (https://phabricator.wikimedia.org/T408428) (owner: ''Kosta Harlan)'
2025-10-28 07:59:20 <wikibugs> ('PS2) ''Brouberol: opensearch-operator: watch the 3 opensearch namespaces in dse-k8s-eqiad [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199226 (https://phabricator.wikimedia.org/T404874)'
2025-10-28 08:00:01 <wikibugs> ('Merged) ''jenkins-bot: CheckUser: Enable SI on metawiki and loginwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199026 (https://phabricator.wikimedia.org/T408428) (owner: ''Kosta Harlan)'
2025-10-28 08:01:04 <wikibugs> ('CR) ''Slyngshede: [C:''+1] Failover idp.w.o [dns] - ''https://gerrit.wikimedia.org/r/1199225 (owner: ''Muehlenhoff)'
2025-10-28 08:02:10 <logmsgbot> !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1199026|CheckUser: Enable SI on metawiki and loginwiki (T408428)]]
2025-10-28 08:02:15 <stashbot> T408428: Suggested investigations: Enable on Metawiki and Loginwiki - https://phabricator.wikimedia.org/T408428
2025-10-28 08:02:40 <wikibugs> ('CR) ''Kosta Harlan: "For next time: could you please schedule this as a backport? It was unexpected to see this when I went to deploy a config patch this morni" [core] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1199117 (owner: ''C. Scott Ananian)'
2025-10-28 08:02:43 <jinxer-wm> FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
2025-10-28 08:04:16 <wikibugs> ('CR) ''Muehlenhoff: [C:''+2] Failover idp.w.o [dns] - ''https://gerrit.wikimedia.org/r/1199225 (owner: ''Muehlenhoff)'
2025-10-28 08:04:24 <logmsgbot> !log jmm@dns1004 START - running authdns-update
2025-10-28 08:05:11 <logmsgbot> !log jmm@dns1004 END - running authdns-update
2025-10-28 08:07:43 <jinxer-wm> RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
2025-10-28 08:11:12 <logmsgbot> !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle for host ml-serve2001
2025-10-28 08:11:14 <logmsgbot> !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.powercycle (exit_code=99) for host ml-serve2001
2025-10-28 08:13:13 <gehel> !log restarting blazegraph on wdqs1019 - free allocator decreasing - `sudo depool; sleep 30; sudo systemctl restart wdqs-blazegraph.service; sleep 30; sudo pool`
2025-10-28 08:13:16 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
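Editor's note: the one-liner logged above chains its steps with `;`, so each step runs even if the previous one failed. As a rough sketch only, the same sequence could be run so that it stops at the first failing step. The step list is copied from the log entry. The `run_steps` helper and its `dry_run` flag are hypothetical names for illustration. Whether a failed restart should leave the host depooled is a judgment call for the operator, not something the log states.

```python
import subprocess

# Steps copied from the !log entry; `depool` and `pool` are assumed to be
# commands available on the target host, as in the logged one-liner.
STEPS = [
    ["sudo", "depool"],
    ["sleep", "30"],
    ["sudo", "systemctl", "restart", "wdqs-blazegraph.service"],
    ["sleep", "30"],
    ["sudo", "pool"],
]


def run_steps(steps, dry_run=True):
    """Run steps in order and return the ones that completed.

    With dry_run=True nothing is executed; the steps are only listed.
    With dry_run=False, check=True makes subprocess.run raise
    CalledProcessError on a non-zero exit, so later steps are skipped.
    """
    done = []
    for step in steps:
        if not dry_run:
            subprocess.run(step, check=True)
        done.append(" ".join(step))
    return done


# Hypothetical helper, shown in dry-run mode so it is safe to try anywhere.
for line in run_steps(STEPS):
    print(line)
```

In dry-run mode this only prints the five commands in order; passing `dry_run=False` would execute them on the current host.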
2025-10-28 08:14:39 <kostajh> waiting on image building, which will probably take ~30 minutes
2025-10-28 08:17:13 <wikibugs> ('PS18) ''Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies [puppet] - ''https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259)'
2025-10-28 08:18:20 <logmsgbot> !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle for host ml-serve2001
2025-10-28 08:18:27 <logmsgbot> !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.powercycle (exit_code=99) for host ml-serve2001
2025-10-28 08:19:22 <wikibugs> ('CR) ''Jelto: [V:''+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7480/co"; [puppet] - ''https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) (owner: ''Jelto)'
2025-10-28 08:21:56 <wikibugs> ('PS19) ''Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies [puppet] - ''https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259)'
2025-10-28 08:23:33 <wikibugs> ('CR) ''Brouberol: [C:''+2] opensearch-operator: watch the 3 opensearch namespaces in dse-k8s-eqiad [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199226 (https://phabricator.wikimedia.org/T404874) (owner: ''Brouberol)'
2025-10-28 08:23:56 <wikibugs> ('CR) ''Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies (''4 comments) [puppet] - ''https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) (owner: ''Jelto)'
2025-10-28 08:24:55 <icinga-wm> RECOVERY - Host ml-serve2001 is UP: PING OK - Packet loss = 0%, RTA = 30.50 ms
2025-10-28 08:25:54 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
2025-10-28 08:26:21 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
2025-10-28 08:27:48 <wikibugs> ('PS7) ''Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - ''https://gerrit.wikimedia.org/r/1198928'
2025-10-28 08:28:07 <logmsgbot> !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1199026|CheckUser: Enable SI on metawiki and loginwiki (T408428)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-28 08:28:12 <stashbot> T408428: Suggested investigations: Enable on Metawiki and Loginwiki - https://phabricator.wikimedia.org/T408428
2025-10-28 08:28:38 <moritzm> !log installing openjdk-11 security updates
2025-10-28 08:28:41 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-28 08:29:04 <jinxer-wm> RESOLVED: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
2025-10-28 08:29:38 <kostajh> testing
2025-10-28 08:29:55 <logmsgbot> !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2001.codfw.wmnet
2025-10-28 08:29:58 <logmsgbot> !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2001.codfw.wmnet
2025-10-28 08:33:09 <logmsgbot> !log kharlan@deploy2002 kharlan: Continuing with sync
2025-10-28 08:34:06 <wikibugs> ('PS1) ''Santiago Faci: xLab: Deploying v1.1.0 release to staging [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199228 (https://phabricator.wikimedia.org/T406729)'
2025-10-28 08:34:53 <wikibugs> ('PS1) ''Brouberol: opensearch-operator: add a separator between tenant role and rolebinding resources [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199230 (https://phabricator.wikimedia.org/T404874)'
2025-10-28 08:35:30 <wikibugs> ('PS2) ''Santiago Faci: xLab: Deploying v1.1.0 release to staging [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199228 (https://phabricator.wikimedia.org/T406729)'
2025-10-28 08:36:31 <wikibugs> ('PS3) ''Santiago Faci: xLab: Deploying v1.1.0 release to staging [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199228 (https://phabricator.wikimedia.org/T406729)'
2025-10-28 08:46:15 <wikibugs> ('PS1) ''Kosta Harlan: hCaptcha: Enable on loginwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199231 (https://phabricator.wikimedia.org/T408428)'
2025-10-28 08:49:07 <logmsgbot> !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199026|CheckUser: Enable SI on metawiki and loginwiki (T408428)]] (duration: 46m 57s)
2025-10-28 08:49:16 <stashbot> T408428: Suggested investigations: Enable on Metawiki and Loginwiki - https://phabricator.wikimedia.org/T408428
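Editor's note: the duration reported by scap can be cross-checked against the timestamps of the "Started scap sync-world" (08:02:10) and "Finished scap sync-world" (08:49:07) entries in this log. A minimal sketch, assuming the channel timestamps are close to the actual start and end times (they record when the message reached IRC, so a difference of about a second is possible); `elapsed` is a hypothetical helper name.

```python
from datetime import datetime

# Timestamp layout used at the start of every line in this log.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def elapsed(start: str, end: str) -> str:
    """Return the gap between two log timestamps as 'Xm Ys'."""
    delta = (datetime.strptime(end, LOG_TIME_FORMAT)
             - datetime.strptime(start, LOG_TIME_FORMAT))
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes}m {seconds}s"


# Started / Finished timestamps of the first backport in this log.
print(elapsed("2025-10-28 08:02:10", "2025-10-28 08:49:07"))  # 46m 57s
```

The result matches the "duration: 46m 57s" that scap reported for this sync.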
2025-10-28 08:49:30 <kostajh> I'm going to sync another patch, unless someone else needs to deploy
2025-10-28 08:49:36 <kostajh> jouncebot: nowandnext
2025-10-28 08:49:36 <jouncebot> No deployments scheduled for the next 1 hour(s) and 10 minute(s)
2025-10-28 08:49:36 <jouncebot> In 1 hour(s) and 10 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1000)
2025-10-28 08:50:13 <wikibugs> ('CR) ''Mszwarc: [C:''+1] hCaptcha: Enable on loginwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199231 (https://phabricator.wikimedia.org/T408428) (owner: ''Kosta Harlan)'
2025-10-28 08:50:41 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199231 (https://phabricator.wikimedia.org/T408428) (owner: ''Kosta Harlan)'
2025-10-28 08:51:21 <wikibugs> ('PS3) ''Arthur taylor: Enable the MEX / wbui2025 beta feature on testwikidata [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737)'
2025-10-28 08:51:33 <wikibugs> ('PS8) ''Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - ''https://gerrit.wikimedia.org/r/1198928'
2025-10-28 08:51:38 <wikibugs> ('Merged) ''jenkins-bot: hCaptcha: Enable on loginwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199231 (https://phabricator.wikimedia.org/T408428) (owner: ''Kosta Harlan)'
2025-10-28 08:52:06 <logmsgbot> !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1199231|hCaptcha: Enable on loginwiki (T408428)]]
2025-10-28 08:53:11 <wikibugs> ('PS9) ''Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - ''https://gerrit.wikimedia.org/r/1198928'
2025-10-28 08:53:38 <logmsgbot> !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle for host ml-serve2001
2025-10-28 08:53:52 <logmsgbot> !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host ml-serve2001
2025-10-28 08:54:47 <wikibugs> ('CR) ''DCausse: [C:''+1] cirrus: Start near match A/B test (''1 comment) [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199054 (https://phabricator.wikimedia.org/T408154) (owner: ''Ebernhardson)'
2025-10-28 08:55:27 <icinga-wm> PROBLEM - Host ml-serve2001 is DOWN: PING CRITICAL - Packet loss = 100%
2025-10-28 08:55:28 <jinxer-wm> FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
2025-10-28 08:56:31 <logmsgbot> !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1199231|hCaptcha: Enable on loginwiki (T408428)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-28 08:56:50 <stashbot> T408428: Suggested investigations: Enable on Metawiki and Loginwiki - https://phabricator.wikimedia.org/T408428
2025-10-28 08:56:55 <icinga-wm> RECOVERY - Host ml-serve2001 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms
2025-10-28 08:57:40 <jinxer-wm> FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 08:58:26 <wikibugs> ('CR) ''Brouberol: [C:''+2] opensearch-operator: add a separator between tenant role and rolebinding resources [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199230 (https://phabricator.wikimedia.org/T404874) (owner: ''Brouberol)'
2025-10-28 08:58:45 <logmsgbot> !log kharlan@deploy2002 kharlan: Continuing with sync
2025-10-28 08:59:55 <logmsgbot> !log jmm@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:cassandra-dev: OpenJDK security updates - jmm@cumin2002
2025-10-28 08:59:58 <wikibugs> ('PS1) ''Gehel: Hadoop: Introduce tmpreaper to cleanup /tmp [puppet] - ''https://gerrit.wikimedia.org/r/1199233 (https://phabricator.wikimedia.org/T396582)'
2025-10-28 09:02:01 <wikibugs> ('CR) ''CI reject: [V:''-1] Hadoop: Introduce tmpreaper to cleanup /tmp [puppet] - ''https://gerrit.wikimedia.org/r/1199233 (https://phabricator.wikimedia.org/T396582) (owner: ''Gehel)'
2025-10-28 09:02:46 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
2025-10-28 09:05:17 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
2025-10-28 09:06:59 <wikibugs> ('CR) ''Clément Goubert: [C:''+1] Route /page/lint(.*) to the gateway on test2wiki [puppet] - ''https://gerrit.wikimedia.org/r/1199032 (https://phabricator.wikimedia.org/T384216) (owner: ''Aaron Schulz)'
2025-10-28 09:07:15 <wikibugs> ('CR) ''Filippo Giunchedi: "> > Nice find! Yes I think that ought to work and cater for module unload too. And yes I think there shouldn't be too many modules." [puppet] - ''https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: ''JHathaway)'
2025-10-28 09:08:40 <logmsgbot> !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199231|hCaptcha: Enable on loginwiki (T408428)]] (duration: 16m 35s)
2025-10-28 09:08:45 <stashbot> T408428: Suggested investigations: Enable on Metawiki and Loginwiki - https://phabricator.wikimedia.org/T408428
2025-10-28 09:14:40 <wikibugs> ('CR) ''Brouberol: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1199233 (https://phabricator.wikimedia.org/T396582) (owner: ''Gehel)'
2025-10-28 09:14:44 <wikibugs> ('CR) ''Brouberol: Hadoop: Introduce tmpreaper to cleanup /tmp (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1199233 (https://phabricator.wikimedia.org/T396582) (owner: ''Gehel)'
2025-10-28 09:15:50 <godog> gehel: FYI these days systemd-tmpfiles has replaced tmpreaper, check out e.g. modules/icinga/manifests/init.pp
2025-10-28 09:20:04 <logmsgbot> !log jmm@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:cassandra-dev: OpenJDK security updates - jmm@cumin2002
2025-10-28 09:20:28 <gehel> godog: Oh, nice! I'm too old school!
2025-10-28 09:21:56 <godog> nice indeed, one line config file and you're done
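Editor's note: the "one line config file" mentioned above refers to the tmpfiles.d format, where each entry is a single whitespace-separated line of the form Type, Path, Mode, User, Group, Age. A small sketch of what such a line looks like, built with a hypothetical `tmpfiles_line` helper. The path `/tmp/example-app` and the `7d` age are made-up illustration values, not taken from the Puppet change discussed here.

```python
def tmpfiles_line(path: str, age: str, entry_type: str = "e",
                  mode: str = "-", user: str = "-", group: str = "-") -> str:
    """Format one tmpfiles.d entry: Type Path Mode User Group Age.

    Type 'e' adjusts an existing directory and removes its contents
    based on age; '-' leaves a field at its default.
    """
    return " ".join([entry_type, path, mode, user, group, age])


# Hypothetical path and age, for illustration only.
print(tmpfiles_line("/tmp/example-app", "7d"))  # e /tmp/example-app - - - 7d
```

A line like this would go in a file under `/etc/tmpfiles.d/`, and systemd-tmpfiles applies the age-based cleanup on its periodic clean run.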
2025-10-28 09:22:41 <wikibugs> ('CR) ''Elukey: [C:''+2] Use Thanos rules for Pyrra error metrics for xLab [puppet] - ''https://gerrit.wikimedia.org/r/1199023 (https://phabricator.wikimedia.org/T398869) (owner: ''Dr0ptp4kt)'
2025-10-28 09:29:06 <wikibugs> ('Abandoned) ''Gehel: Hadoop: Introduce tmpreaper to cleanup /tmp [puppet] - ''https://gerrit.wikimedia.org/r/1199233 (https://phabricator.wikimedia.org/T396582) (owner: ''Gehel)'
2025-10-28 09:30:52 <wikibugs> ('PS1) ''Majavah: P:toolforge::k8s::haproxy: Use hourly logrotate [puppet] - ''https://gerrit.wikimedia.org/r/1199238 (https://phabricator.wikimedia.org/T408457)'
2025-10-28 09:30:56 <wikibugs> ('CR) ''Elukey: LVS: Add druid-public-coordinator to service list (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1198499 (https://phabricator.wikimedia.org/T406222) (owner: ''Stevemunene)'
2025-10-28 09:31:32 <wikibugs> ('CR) ''Elukey: LVS: etcd data for druid-public-coordinator (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: ''Stevemunene)'
2025-10-28 09:34:13 <logmsgbot> !log klausman@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Roll-restart for Java security updates - klausman@cumin1003
2025-10-28 09:36:43 <logmsgbot> !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox
2025-10-28 09:36:45 <wikibugs> 'SRE, ''envoy, ''serviceops, ''Patch-For-Review: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663#11317841 (''LSobanski) Untagging #collaboration-services based on https://phabricator.wikimedia.org/T403663#11196043'
2025-10-28 09:37:12 <wikibugs> ('PS1) ''Gehel: Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - ''https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582)'
2025-10-28 09:38:07 <wikibugs> ('CR) ''Stevemunene: LVS: Add druid-public-coordinator to service list (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1198499 (https://phabricator.wikimedia.org/T406222) (owner: ''Stevemunene)'
2025-10-28 09:38:27 <wikibugs> ('CR) ''Arthur taylor: Enable the MEX / wbui2025 beta feature on testwikidata [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) (owner: ''Arthur taylor)'
2025-10-28 09:39:32 <logmsgbot> !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2025-10-28 09:39:40 <jinxer-wm> FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 09:39:47 <logmsgbot> !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox
2025-10-28 09:39:54 <wikibugs> ('PS2) ''Gehel: Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - ''https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582)'
2025-10-28 09:40:07 <wikibugs> ('CR) ''Gehel: "check-experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: ''Gehel)'
2025-10-28 09:40:13 <wikibugs> ('CR) ''Brouberol: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: ''Gehel)'
2025-10-28 09:41:00 <wikibugs> 'SRE, ''collaboration-services, ''Traffic, ''Release-Engineering-Team (Radar), ''WMF-NDA: Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532 (''LSobanski) ''NEW'
2025-10-28 09:41:29 <wikibugs> ('CR) ''FNegri: [C:''+1] P:toolforge::k8s::haproxy: Use hourly logrotate [puppet] - ''https://gerrit.wikimedia.org/r/1199238 (https://phabricator.wikimedia.org/T408457) (owner: ''Majavah)'
2025-10-28 09:41:49 <wikibugs> ('PS1) ''Majavah: aptrepo: Retire kubeadm/1.29 components [puppet] - ''https://gerrit.wikimedia.org/r/1199240'
2025-10-28 09:41:50 <wikibugs> ('PS1) ''Majavah: aptrepo: Import Kubeadm/1.31 packages [puppet] - ''https://gerrit.wikimedia.org/r/1199241 (https://phabricator.wikimedia.org/T372697)'
2025-10-28 09:41:58 <wikibugs> ('CR) ''CI reject: [V:''-1] Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - ''https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: ''Gehel)'
2025-10-28 09:42:05 <wikibugs> ('CR) ''Majavah: [C:''+2] P:toolforge::k8s::haproxy: Use hourly logrotate [puppet] - ''https://gerrit.wikimedia.org/r/1199238 (https://phabricator.wikimedia.org/T408457) (owner: ''Majavah)'
2025-10-28 09:42:32 <logmsgbot> !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2025-10-28 09:42:54 <wikibugs> ('PS3) ''Gehel: Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - ''https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582)'
2025-10-28 09:42:58 <logmsgbot> !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb1014.eqiad.wmnet
2025-10-28 09:43:07 <wikibugs> ('CR) ''Gehel: "check-experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: ''Gehel)'
2025-10-28 09:43:20 <wikibugs> ('CR) ''Brouberol: Hadoop: cleanup /tmp with systemd::tmpfile (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: ''Gehel)'
2025-10-28 09:43:35 <wikibugs> ('CR) ''Brouberol: Hadoop: cleanup /tmp with systemd::tmpfile (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: ''Gehel)'
2025-10-28 09:43:42 <wikibugs> 'SRE, ''collaboration-services, ''Traffic, ''Release-Engineering-Team (Radar), ''WMF-NDA: Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11317892 (''LSobanski) p:''Triage''High'
2025-10-28 09:43:59 <wikibugs> ('CR) ''Brouberol: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: ''Gehel)'
2025-10-28 09:44:21 <wikibugs> ('PS1) ''Jelto: aptrepo::staging: add job to clear incoming folder [puppet] - ''https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527)'
2025-10-28 09:44:21 <wikibugs> 'SRE, ''collaboration-services, ''Traffic, ''Release-Engineering-Team (Radar), ''WMF-NDA: Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11317895 (''LSobanski)'
2025-10-28 09:44:22 <wikibugs> 'SRE, ''collaboration-services, ''Infrastructure-Foundations, ''vm-requests, ''Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11317894 (''LSobanski)'
2025-10-28 09:44:27 <wikibugs> ('CR) ''Gehel: Hadoop: cleanup /tmp with systemd::tmpfile (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: ''Gehel)'
2025-10-28 09:45:01 <wikibugs> ('Abandoned) ''Brouberol: growthbook: remove all traces of mongoDB from the chart [deployment-charts] - ''https://gerrit.wikimedia.org/r/1197589 (https://phabricator.wikimedia.org/T406579) (owner: ''Brouberol)'
2025-10-28 09:45:30 <wikibugs> ('CR) ''Muehlenhoff: [C:''+1] "Looks good, two nits inline" [puppet] - ''https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: ''Gehel)'
2025-10-28 09:45:48 <wikibugs> ('CR) ''Stevemunene: [C:''+1] Definition of a ferretdb chart [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198977 (https://phabricator.wikimedia.org/T406579) (owner: ''Brouberol)'
2025-10-28 09:46:25 <wikibugs> ('CR) ''Stevemunene: [C:''+1] ferretdb-growthbook: define helmfile and values [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198978 (https://phabricator.wikimedia.org/T406579) (owner: ''Brouberol)'
2025-10-28 09:48:52 <logmsgbot> !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1014.eqiad.wmnet
2025-10-28 09:49:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 09:49:13 <wikibugs> ('CR) ''Brouberol: [C:''+2] cloudnative-pg-cluster: allow direct access to the DB when pooling is disabled [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198974 (https://phabricator.wikimedia.org/T406578) (owner: ''Brouberol)'
2025-10-28 09:49:16 <wikibugs> ('CR) ''Brouberol: [C:''+2] cloudnative-pg-cluster: set env vars disabling s3 security feature not implemented in radosgw [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198975 (https://phabricator.wikimedia.org/T406578) (owner: ''Brouberol)'
2025-10-28 09:49:17 <wikibugs> ('CR) ''Brouberol: [C:''+2] postgresql-growthbook: define a custom PG image, libraries and post init SQL [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198514 (https://phabricator.wikimedia.org/T406578) (owner: ''Brouberol)'
2025-10-28 09:49:24 <logmsgbot> !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb1012.eqiad.wmnet
2025-10-28 09:49:25 <wikibugs> ('CR) ''Brouberol: [C:''+2] Definition of a ferretdb chart [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198977 (https://phabricator.wikimedia.org/T406579) (owner: ''Brouberol)'
2025-10-28 09:49:27 <wikibugs> ('CR) ''Brouberol: [C:''+2] ferretdb-growthbook: define helmfile and values [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198978 (https://phabricator.wikimedia.org/T406579) (owner: ''Brouberol)'
2025-10-28 09:50:11 <wikibugs> ('PS4) ''Gehel: Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - ''https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582)'
2025-10-28 09:50:18 <wikibugs> ('CR) ''Gehel: Hadoop: cleanup /tmp with systemd::tmpfile (''2 comments) [puppet] - ''https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: ''Gehel)'
2025-10-28 09:50:28 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 09:51:14 <wikibugs> ('Merged) ''jenkins-bot: cloudnative-pg-cluster: allow direct access to the DB when pooling is disabled [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198974 (https://phabricator.wikimedia.org/T406578) (owner: ''Brouberol)'
2025-10-28 09:51:28 <wikibugs> ('Merged) ''jenkins-bot: cloudnative-pg-cluster: set env vars disabling s3 security feature not implemented in radosgw [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198975 (https://phabricator.wikimedia.org/T406578) (owner: ''Brouberol)'
2025-10-28 09:51:42 <wikibugs> ('Merged) ''jenkins-bot: postgresql-growthbook: define a custom PG image, libraries and post init SQL [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198514 (https://phabricator.wikimedia.org/T406578) (owner: ''Brouberol)'
2025-10-28 09:51:52 <wikibugs> ('Merged) ''jenkins-bot: Definition of a ferretdb chart [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198977 (https://phabricator.wikimedia.org/T406579) (owner: ''Brouberol)'
2025-10-28 09:51:54 <wikibugs> ('Merged) ''jenkins-bot: ferretdb-growthbook: define helmfile and values [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198978 (https://phabricator.wikimedia.org/T406579) (owner: ''Brouberol)'
2025-10-28 09:51:57 <logmsgbot> !log klausman@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Roll-restart for Java security updates - klausman@cumin1003
2025-10-28 09:52:15 <logmsgbot> !log klausman@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Roll-restart for Java security updates - klausman@cumin1003
2025-10-28 09:53:20 <wikibugs> ('CR) ''Mark Bergsma: [C:''+1] admin: add dpogorzelski to ops-limited [puppet] - ''https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) (owner: ''Kamila Součková)'
2025-10-28 09:54:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 09:54:05 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to ops-limited for dpogorzelski - https://phabricator.wikimedia.org/T407955#11317933 (''mark) Approved in Gerrit!'
2025-10-28 09:54:07 <wikibugs> ('PS2) ''Tiziano Fogli: nrpe2nodexp: use service description as alertname [puppet] - ''https://gerrit.wikimedia.org/r/1199242 (https://phabricator.wikimedia.org/T395446)'
2025-10-28 09:54:18 <klausman> looking at that alert
2025-10-28 09:55:27 <logmsgbot> !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1012.eqiad.wmnet
2025-10-28 09:55:59 <wikibugs> ('CR) ''Brouberol: [C:''+1] Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - ''https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: ''Gehel)'
2025-10-28 09:59:57 <wikibugs> ('CR) ''Elukey: LVS: Add druid-public-coordinator to service list (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1198499 (https://phabricator.wikimedia.org/T406222) (owner: ''Stevemunene)'
2025-10-28 10:00:05 <jouncebot> Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1000)
2025-10-28 10:01:34 <wikibugs> ('CR) ''Stevemunene: LVS: etcd data for druid-public-coordinator (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: ''Stevemunene)'
2025-10-28 10:02:53 <wikibugs> ('CR) ''Clément Goubert: wikikube: Add wikikube-worker2[248-330] (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) (owner: ''Jasmine)'
2025-10-28 10:03:44 <wikibugs> ('PS2) ''Jelto: aptrepo::staging: add job to clear incoming folder [puppet] - ''https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527)'
2025-10-28 10:03:53 <wikibugs> ('CR) ''Clément Goubert: [C:''+2] taskgen: Update calico IPPool check [puppet] - ''https://gerrit.wikimedia.org/r/1191671 (https://phabricator.wikimedia.org/T375845) (owner: ''Clément Goubert)'
2025-10-28 10:05:20 <wikibugs> ('CR) ''Jelto: [V:''+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7482/co" [puppet] - ''https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) (owner: ''Jelto)'
2025-10-28 10:05:28 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 10:05:32 <wikibugs> ('PS2) ''Daniel Kinzler: rest-gateway: Create metrics mapping for ratelimit service [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199008 (https://phabricator.wikimedia.org/T408183)'
2025-10-28 10:09:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 10:09:22 <wikibugs> ('PS1) ''JavierMonton: Disable default user-agent collection. [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964)'
2025-10-28 10:09:37 <jinxer-wm> FIRING: Failing Rate (Dashboard - Desktop & Mobile): <no value> - https://alerts.wikimedia.org/?q=alertname%3DFailing+Rate+%28Dashboard+-+Desktop+%26+Mobile%29
2025-10-28 10:10:00 <logmsgbot> !log klausman@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Roll-restart for Java security updates - klausman@cumin1003
2025-10-28 10:10:32 <wikibugs> ('PS1) ''Fabfur: P:cache:haproxy: introduce ua classes [puppet] - ''https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060)'
2025-10-28 10:13:06 <wikibugs> ('PS1) ''Huei Tan: alertmanager: route Language and Product Localization team alerts [puppet] - ''https://gerrit.wikimedia.org/r/1199248 (https://phabricator.wikimedia.org/T376535)'
2025-10-28 10:14:14 <wikibugs> ('PS2) ''Huei Tan: alertmanager: route Language and Product Localization team alerts [puppet] - ''https://gerrit.wikimedia.org/r/1199248 (https://phabricator.wikimedia.org/T376535)'
2025-10-28 10:14:21 <wikibugs> ('PS3) ''Huei Tan: alertmanager: route Language and Product Localization team alerts [puppet] - ''https://gerrit.wikimedia.org/r/1199248 (https://phabricator.wikimedia.org/T376535)'
2025-10-28 10:14:25 <wikibugs> 'sre-alert-triage, ''Infrastructure-Foundations, ''netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T407833#11318022 (''cmooney) ''Open''Resolved I removed these additional sessions last week but got distracted and didn't come back to edi...'
2025-10-28 10:20:28 <jinxer-wm> FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
2025-10-28 10:22:05 <wikibugs> ('CR) ''Klausman: [C:''+1] admin: add dpogorzelski to ops-limited [puppet] - ''https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) (owner: ''Kamila Součková)'
2025-10-28 10:26:59 <wikibugs> ('CR) ''Elukey: LVS: etcd data for druid-public-coordinator (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: ''Stevemunene)'
2025-10-28 10:28:47 <wikibugs> ('CR) ''Hnowlan: [C:''+1] Route /page/lint(.*) to the gateway on test2wiki [puppet] - ''https://gerrit.wikimedia.org/r/1199032 (https://phabricator.wikimedia.org/T384216) (owner: ''Aaron Schulz)'
2025-10-28 10:29:37 <jinxer-wm> RESOLVED: Failing Rate (Dashboard - Desktop & Mobile): <no value> - https://alerts.wikimedia.org/?q=alertname%3DFailing+Rate+%28Dashboard+-+Desktop+%26+Mobile%29
2025-10-28 10:29:41 <wikibugs> ('CR) ''Hnowlan: [C:''+1] trafficserver: action api to rest-gateway group0 10% [puppet] - ''https://gerrit.wikimedia.org/r/1198929 (https://phabricator.wikimedia.org/T408223) (owner: ''Clément Goubert)'
2025-10-28 10:30:23 <wikibugs> ('CR) ''Stevemunene: LVS: etcd data for druid-public-coordinator (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: ''Stevemunene)'
2025-10-28 10:30:51 <wikibugs> ('CR) ''Clément Goubert: [C:''+2] Route /page/lint(.*) to the gateway on test2wiki [puppet] - ''https://gerrit.wikimedia.org/r/1199032 (https://phabricator.wikimedia.org/T384216) (owner: ''Aaron Schulz)'
2025-10-28 10:32:14 <wikibugs> ('CR) ''Fabfur: "as @Elukey correctly pointed out, the procedure needs to be followed here, happy to review it again later" [puppet] - ''https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: ''Stevemunene)'
2025-10-28 10:34:27 <wikibugs> ('CR) ''Muehlenhoff: [C:''+1] "Looks good" [puppet] - ''https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: ''Gehel)'
2025-10-28 10:37:02 <wikibugs> 'SRE-SLO, ''Experimentation Lab (Experiment Platform Sprint 14), ''OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11318126 (''elukey)'
2025-10-28 10:37:46 <wikibugs> ('CR) ''Dpogorzelski: [C:''+1] admin: add dpogorzelski to ops-limited [puppet] - ''https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) (owner: ''Kamila Součková)'
2025-10-28 10:38:01 <wikibugs> 'SRE-SLO, ''Experimentation Lab (Experiment Platform Sprint 14), ''OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11318132 (''elukey) We finally have all three SLO published in Pyrra: https://slo.wikimedia.org/?search=xlab Let's wait a couple of weeks to observe the new SL...'
2025-10-28 10:41:58 <wikibugs> ('CR) ''Clément Goubert: [C:''+2] trafficserver: action api to rest-gateway group0 10% [puppet] - ''https://gerrit.wikimedia.org/r/1198929 (https://phabricator.wikimedia.org/T408223) (owner: ''Clément Goubert)'
2025-10-28 10:43:27 <wikibugs> ('CR) ''Muehlenhoff: "That would work, alternative proposal inline (which doesn't interfere with people working late in the American timezones)." [puppet] - ''https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) (owner: ''Jelto)'
2025-10-28 10:44:32 <wikibugs> ('PS1) ''Fabfur: P:cache:haproxy: don't repeat contact validation regex [puppet] - ''https://gerrit.wikimedia.org/r/1199251 (https://phabricator.wikimedia.org/T408060)'
2025-10-28 10:44:52 <wikibugs> ('CR) ''Fabfur: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: ''Fabfur)'
2025-10-28 10:45:33 <wikibugs> ('CR) ''Hnowlan: [C:''+1] trafficserver: action api to rest-gateway group0 100% [puppet] - ''https://gerrit.wikimedia.org/r/1198931 (https://phabricator.wikimedia.org/T408223) (owner: ''Clément Goubert)'
2025-10-28 10:45:57 <wikibugs> ('CR) ''Hnowlan: [C:''+1] trafficserver: action api to rest-gateway group1 10% [puppet] - ''https://gerrit.wikimedia.org/r/1198932 (https://phabricator.wikimedia.org/T408223) (owner: ''Clément Goubert)'
2025-10-28 10:46:11 <wikibugs> ('CR) ''Hnowlan: [C:''+1] trafficserver: action api to rest-gateway group1 50% [puppet] - ''https://gerrit.wikimedia.org/r/1198933 (https://phabricator.wikimedia.org/T408223) (owner: ''Clément Goubert)'
2025-10-28 10:46:22 <wikibugs> ('CR) ''Hnowlan: [C:''+1] trafficserver: action api to rest-gateway group1 100% [puppet] - ''https://gerrit.wikimedia.org/r/1198934 (https://phabricator.wikimedia.org/T408223) (owner: ''Clément Goubert)'
2025-10-28 10:46:47 <wikibugs> ('CR) ''Hnowlan: [C:''+1] trafficserver: action api to rest-gateway group2 10% [puppet] - ''https://gerrit.wikimedia.org/r/1198935 (https://phabricator.wikimedia.org/T408223) (owner: ''Clément Goubert)'
2025-10-28 10:47:02 <wikibugs> ('CR) ''Hnowlan: [C:''+1] trafficserver: action api to rest-gateway group2 50% [puppet] - ''https://gerrit.wikimedia.org/r/1198936 (https://phabricator.wikimedia.org/T408223) (owner: ''Clément Goubert)'
2025-10-28 10:47:11 <wikibugs> ('CR) ''Hnowlan: [C:''+1] trafficserver: action api to rest-gateway group2 100% [puppet] - ''https://gerrit.wikimedia.org/r/1198937 (https://phabricator.wikimedia.org/T408223) (owner: ''Clément Goubert)'
2025-10-28 10:47:24 <wikibugs> ('CR) ''Hnowlan: [C:''+1] trafficserver: action api to rest-gateway enwiki 10% [puppet] - ''https://gerrit.wikimedia.org/r/1198938 (https://phabricator.wikimedia.org/T408223) (owner: ''Clément Goubert)'
2025-10-28 10:50:03 <wikibugs> ('PS2) ''Clément Goubert: trafficserver: action api to rest-gateway group0 50% [puppet] - ''https://gerrit.wikimedia.org/r/1198930 (https://phabricator.wikimedia.org/T408223)'
2025-10-28 10:50:37 <moritzm> !log installing openjdk-17 security updates
2025-10-28 10:50:40 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-28 10:51:07 <wikibugs> ('CR) ''Hnowlan: [C:''+1] trafficserver: action api to rest-gateway enwiki 50% [puppet] - ''https://gerrit.wikimedia.org/r/1198939 (https://phabricator.wikimedia.org/T408223) (owner: ''Clément Goubert)'
2025-10-28 10:51:17 <wikibugs> ('CR) ''Hnowlan: [C:''+1] trafficserver: action api to rest-gateway enwiki 100% [puppet] - ''https://gerrit.wikimedia.org/r/1198940 (https://phabricator.wikimedia.org/T408223) (owner: ''Clément Goubert)'
2025-10-28 10:51:35 <wikibugs> ('CR) ''Hnowlan: [C:''+1] trafficserver: action api to rest-gateway cleanup [puppet] - ''https://gerrit.wikimedia.org/r/1198941 (https://phabricator.wikimedia.org/T408223) (owner: ''Clément Goubert)'
2025-10-28 10:57:25 <jinxer-wm> RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 10:58:50 <logmsgbot> !log zabe@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply
2025-10-28 11:00:03 <logmsgbot> !log zabe@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
2025-10-28 11:11:50 <wikibugs> ('PS1) ''Stevemunene: druid: add druid-coordinator to druid public worker role [puppet] - ''https://gerrit.wikimedia.org/r/1199256 (https://phabricator.wikimedia.org/T406222)'
2025-10-28 11:14:51 <wikibugs> ('CR) ''Mahmoud-abdelsattar: [C:''+1] Enable the MEX / wbui2025 beta feature on testwikidata [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) (owner: ''Arthur taylor)'
2025-10-28 11:14:54 <wikibugs> ('PS2) ''Stevemunene: druid: add druid-coordinator to druid public worker role [puppet] - ''https://gerrit.wikimedia.org/r/1199256 (https://phabricator.wikimedia.org/T406222)'
2025-10-28 11:20:08 <wikibugs> ('PS3) ''Stevemunene: LVS: etcd data for druid-public-coordinator [puppet] - ''https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222)'
2025-10-28 11:20:12 <wikibugs> ('PS4) ''Stevemunene: LVS: Add druid-public-coordinator to service list [puppet] - ''https://gerrit.wikimedia.org/r/1198499 (https://phabricator.wikimedia.org/T406222)'
2025-10-28 11:21:24 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) (owner: ''Arthur taylor)'
2025-10-28 11:24:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 11:25:28 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 11:27:48 <wikibugs> ('PS1) ''Muehlenhoff: osm: Remove obsolete spec files [puppet] - ''https://gerrit.wikimedia.org/r/1199260 (https://phabricator.wikimedia.org/T381565)'
2025-10-28 11:29:06 <wikibugs> ('CR) ''Muehlenhoff: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1199260 (https://phabricator.wikimedia.org/T381565) (owner: ''Muehlenhoff)'
2025-10-28 11:29:26 <wikibugs> ('PS10) ''Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - ''https://gerrit.wikimedia.org/r/1198928'
2025-10-28 11:30:32 <logmsgbot> !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle for host ml-serve2001
2025-10-28 11:31:39 <icinga-wm> PROBLEM - Host ml-serve2001 is DOWN: PING CRITICAL - Packet loss = 100%
2025-10-28 11:31:48 <Msz2001> I'm going to do a deployment to private code, related to Suggested Investigations
2025-10-28 11:32:03 <wikibugs> ('CR) ''Elukey: [C:''+1] osm: Remove obsolete spec files [puppet] - ''https://gerrit.wikimedia.org/r/1199260 (https://phabricator.wikimedia.org/T381565) (owner: ''Muehlenhoff)'
2025-10-28 11:33:55 <icinga-wm> RECOVERY - Host ml-serve2001 is UP: PING OK - Packet loss = 0%, RTA = 30.43 ms
2025-10-28 11:35:59 <wikibugs> ('CR) ''Muehlenhoff: [C:''+2] osm: Remove obsolete spec files [puppet] - ''https://gerrit.wikimedia.org/r/1199260 (https://phabricator.wikimedia.org/T381565) (owner: ''Muehlenhoff)'
2025-10-28 11:37:33 <wikibugs> ('PS1) ''Brouberol: cloudnative-pg-cluster: allow release values to override the pg_hba field [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199261 (https://phabricator.wikimedia.org/T406578)'
2025-10-28 11:37:56 <wikibugs> ('PS1) ''Brouberol: postgresql-growthbook: allow IPv4/6 remote TCP connections for the app user/db [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199262 (https://phabricator.wikimedia.org/T406578)'
2025-10-28 11:40:35 <logmsgbot> !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host ml-serve2001
2025-10-28 11:41:07 <logmsgbot> !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle for host sretest2010
2025-10-28 11:42:12 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: Unresponsive management for ms-be1090.mgmt:22 - https://phabricator.wikimedia.org/T408478#11318289 (''Jclark-ctr)'
2025-10-28 11:42:13 <wikibugs> 'ops-eqiad, ''SRE, ''SRE-swift-storage, ''DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11318292 (''Jclark-ctr) →''Duplicate dup:''T408478'
2025-10-28 11:42:50 <wikibugs> ('PS1) ''Mvolz: Update Zotero to node22 [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199263 (https://phabricator.wikimedia.org/T393434)'
2025-10-28 11:42:53 <logmsgbot> !log fceratto@cumin1003 START - Cookbook sre.hosts.decommission for hosts es2026.codfw.wmnet
2025-10-28 11:42:53 <logmsgbot> !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host sretest2010
2025-10-28 11:43:31 <wikibugs> 'ops-eqiad, ''SRE, ''SRE-swift-storage, ''DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11318295 (''Jclark-ctr) ''Duplicate''Open Closed by mistake'
2025-10-28 11:44:07 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: Unresponsive management for ms-be1090.mgmt:22 - https://phabricator.wikimedia.org/T408478#11318299 (''Jclark-ctr) ''Open''Resolved a:''Jclark-ctr Down due to work with card install T400877'
2025-10-28 11:44:34 <logmsgbot> !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling reboot on A:swift-fe-codfw
2025-10-28 11:45:40 <wikibugs> ('CR) ''Slyngshede: [C:''+1] admin: add dpogorzelski to ops-limited [puppet] - ''https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) (owner: ''Kamila Součková)'
2025-10-28 11:47:44 <wikibugs> ('PS1) ''Muehlenhoff: osm_sync_lag.sh: Fix default to current directory [puppet] - ''https://gerrit.wikimedia.org/r/1199265 (https://phabricator.wikimedia.org/T381565)'
2025-10-28 11:47:57 <wikibugs> ('CR) ''Stevemunene: [C:''+1] cloudnative-pg-cluster: allow release values to override the pg_hba field [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199261 (https://phabricator.wikimedia.org/T406578) (owner: ''Brouberol)'
2025-10-28 11:48:04 <wikibugs> ('CR) ''Stevemunene: [C:''+1] postgresql-growthbook: allow IPv4/6 remote TCP connections for the app user/db [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199262 (https://phabricator.wikimedia.org/T406578) (owner: ''Brouberol)'
2025-10-28 11:48:52 <logmsgbot> !log fceratto@cumin1003 START - Cookbook sre.dns.netbox
2025-10-28 11:49:06 <wikibugs> ('CR) ''Brouberol: [C:''+2] cloudnative-pg-cluster: allow release values to override the pg_hba field [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199261 (https://phabricator.wikimedia.org/T406578) (owner: ''Brouberol)'
2025-10-28 11:49:08 <wikibugs> ('CR) ''Brouberol: [C:''+2] postgresql-growthbook: allow IPv4/6 remote TCP connections for the app user/db [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199262 (https://phabricator.wikimedia.org/T406578) (owner: ''Brouberol)'
2025-10-28 11:49:19 <wikibugs> ('PS2) ''Brouberol: postgresql-growthbook: allow IPv4/6 remote TCP connections for the app user/db [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199262 (https://phabricator.wikimedia.org/T406578)'
2025-10-28 11:50:43 <wikibugs> ('CR) ''Brouberol: [V:''+2 C:''+2] postgresql-growthbook: allow IPv4/6 remote TCP connections for the app user/db [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199262 (https://phabricator.wikimedia.org/T406578) (owner: ''Brouberol)'
2025-10-28 11:50:47 <wikibugs> ('CR) ''Brouberol: [V:''+2 C:''+2] cloudnative-pg-cluster: allow release values to override the pg_hba field [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199261 (https://phabricator.wikimedia.org/T406578) (owner: ''Brouberol)'
2025-10-28 11:54:33 <wikibugs> ('PS2) ''Fabfur: P:cache:haproxy: introduce ua classes [puppet] - ''https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060)'
2025-10-28 11:54:35 <wikibugs> ('CR) ''Fabfur: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: ''Fabfur)'
2025-10-28 11:54:36 <logmsgbot> fceratto@cumin1003 decommission (PID 372416) is awaiting input
2025-10-28 11:59:27 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11318342 (''Neslihan_Turan_WMDE) Hi, sorry for the delay. I had a problem accessing Slack but now I managed to send my public key to Amir. My public key is already...'
2025-10-28 12:00:04 <jouncebot> Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1200)
2025-10-28 12:00:36 <Msz2001> Noting that I'll finish my deployment to private code in 2-3 minutes
2025-10-28 12:01:16 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11318344 (''Jclark-ctr) @VRiley-WMF Hey, just a heads up — the fiber was installed with RX-to-RX and TX-to-TX, so the polarity wasn’t verified. Make sure to check polarity next time to avoid c...'
2025-10-28 12:04:38 <Msz2001> !log Deployed changes to Suggested Investigations
2025-10-28 12:04:41 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-28 12:04:44 <Msz2001> I'm finished with deploying
2025-10-28 12:08:08 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11318379 (''cmooney) >>! In T396065#11318344, @Jclark-ctr wrote: > @cmooney link is up Ok great yep BGP looking good I've added it now. ` cmooney@ssw1-e1-eqiad> show bgp summary group core |...'
2025-10-28 12:08:51 <wikibugs> ('PS1) ''Muehlenhoff: maps: Stop installing osm2pgsql and osmborder [puppet] - ''https://gerrit.wikimedia.org/r/1199271 (https://phabricator.wikimedia.org/T381565)'
2025-10-28 12:09:14 <wikibugs> ('PS1) ''Cathal Mooney: ssw1-e1-eqiad: Add BGP peering to ssw1-d8-eqiad [homer/public] - ''https://gerrit.wikimedia.org/r/1199272 (https://phabricator.wikimedia.org/T396065)'
2025-10-28 12:12:05 <wikibugs> ('CR) ''Vgutierrez: [C:''-1] P:cache:haproxy: introduce ua classes (''4 comments) [puppet] - ''https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: ''Fabfur)'
2025-10-28 12:16:35 <wikibugs> ('CR) ''Dpogorzelski: [C:''+1] "Done" [puppet] - ''https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) (owner: ''Kamila Součková)'
2025-10-28 12:19:43 <wikibugs> ('CR) ''Hnowlan: [C:''+1] trafficserver: action api to rest-gateway group0 50% [puppet] - ''https://gerrit.wikimedia.org/r/1198930 (https://phabricator.wikimedia.org/T408223) (owner: ''Clément Goubert)'
2025-10-28 12:19:57 <wikibugs> ('CR) ''Cathal Mooney: [C:''+2] ssw1-e1-eqiad: Add BGP peering to ssw1-d8-eqiad [homer/public] - ''https://gerrit.wikimedia.org/r/1199272 (https://phabricator.wikimedia.org/T396065) (owner: ''Cathal Mooney)'
2025-10-28 12:21:15 <wikibugs> ('Merged) ''jenkins-bot: ssw1-e1-eqiad: Add BGP peering to ssw1-d8-eqiad [homer/public] - ''https://gerrit.wikimedia.org/r/1199272 (https://phabricator.wikimedia.org/T396065) (owner: ''Cathal Mooney)'
2025-10-28 12:24:09 <logmsgbot> !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2026.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003"
2025-10-28 12:26:28 <kostajh> Msz2001: is deploying a follow up
2025-10-28 12:27:14 <logmsgbot> fceratto@cumin1003 decommission (PID 372416) is awaiting input
2025-10-28 12:27:27 <kostajh> these issues appeared after the previous deploy https://logstash.wikimedia.org/goto/d13b6c9cd8e42929d855b4c081e43484
2025-10-28 12:35:20 <Msz2001> Deployed
2025-10-28 12:44:45 <wikibugs> ('PS1) ''Stevemunene: druid: Increase the size of the Druid broker cache size to 4GB [puppet] - ''https://gerrit.wikimedia.org/r/1199280 (https://phabricator.wikimedia.org/T408189)'
2025-10-28 12:45:22 <logmsgbot> !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2011.codfw.wmnet with reason: reboot
2025-10-28 12:46:03 <logmsgbot> !log sukhe@cumin1003 START - Cookbook sre.hosts.reboot-single for host pybal-test2003.codfw.wmnet
2025-10-28 12:49:18 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: Audit Eqiad Patch panels for variance from Netbox - https://phabricator.wikimedia.org/T408197#11318475 (''Jclark-ctr) a:''Jclark-ctr''None'
2025-10-28 12:49:48 <logmsgbot> !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2003.codfw.wmnet
2025-10-28 12:53:07 <logmsgbot> !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2026.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003"
2025-10-28 12:53:07 <logmsgbot> !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2025-10-28 12:53:08 <logmsgbot> !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2026.codfw.wmnet
2025-10-28 12:55:28 <jinxer-wm> FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
2025-10-28 13:00:05 <jouncebot> Urbanecm and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1300).
2025-10-28 13:00:06 <jouncebot> Bunnypranav and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2025-10-28 13:00:53 <logmsgbot> !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling reboot on A:swift-fe-codfw
2025-10-28 13:01:15 <MatmaRex> hi
2025-10-28 13:03:07 <MatmaRex> anyone deploying?
2025-10-28 13:04:25 <jinxer-wm> RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 13:06:09 <logmsgbot> !log sukhe@cumin1003 START - Cookbook sre.hosts.reboot-single for host lvs2011.codfw.wmnet
2025-10-28 13:06:13 <wikibugs> ('PS5) ''Gehel: Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - ''https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582)'
2025-10-28 13:07:15 <wikibugs> ('PS2) ''Muehlenhoff: Shift tile eqiad invalidation to the bookworm master [puppet] - ''https://gerrit.wikimedia.org/r/1195717 (https://phabricator.wikimedia.org/T381565)'
2025-10-28 13:08:08 <wikibugs> ('CR) ''CDanis: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies (''5 comments) [puppet] - ''https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) (owner: ''Jelto)'
2025-10-28 13:08:23 <wikibugs> ('CR) ''Gehel: [C:''+2] Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - ''https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: ''Gehel)'
2025-10-28 13:10:29 <wikibugs> ('Abandoned) ''Muehlenhoff: Shift tile eqiad invalidation to the bookworm master [puppet] - ''https://gerrit.wikimedia.org/r/1195717 (https://phabricator.wikimedia.org/T381565) (owner: ''Muehlenhoff)'
2025-10-28 13:11:13 <wikibugs> ('CR) ''Muehlenhoff: "The mwdebug servers are gone" [puppet] - ''https://gerrit.wikimedia.org/r/1178528 (https://phabricator.wikimedia.org/T360636) (owner: ''Muehlenhoff)'
2025-10-28 13:11:20 <wikibugs> ('PS2) ''Muehlenhoff: Remove obsolete appserver cergen certs [puppet] - ''https://gerrit.wikimedia.org/r/1178528 (https://phabricator.wikimedia.org/T360636)'
2025-10-28 13:14:04 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
2025-10-28 13:14:54 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
2025-10-28 13:17:38 <xSavitar> MatmaRex, I can help if you'll assist with testing :)
2025-10-28 13:17:46 <logmsgbot> !log sukhe@cumin1003 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host lvs2011.codfw.wmnet
2025-10-28 13:17:50 <xSavitar> Are you still around?
2025-10-28 13:17:58 <MatmaRex> hi :) thanks
2025-10-28 13:18:28 <wikibugs> 'ops-codfw, ''DC-Ops, ''Traffic: lvs2011 hardware issue after reboot - https://phabricator.wikimedia.org/T408549 (''ssingh) ''NEW'
2025-10-28 13:18:29 <xSavitar> Seems like Bunnypranav is not around
2025-10-28 13:18:36 <wikibugs> 'ops-codfw, ''DC-Ops, ''Traffic: lvs2011 hardware issue after reboot - https://phabricator.wikimedia.org/T408549#11318574 (''ssingh) p:''Triage''High'
2025-10-28 13:18:37 <xSavitar> So I'll just quickly do MatmaRex's
2025-10-28 13:18:50 <bunnypranav> Hi!
2025-10-28 13:19:07 <bunnypranav> Bit late, apologies. I'm fine with waiting
2025-10-28 13:19:46 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by derick@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199074 (https://phabricator.wikimedia.org/T408447) (owner: ''Bartosz Dziewoński)'
2025-10-28 13:20:03 <xSavitar> bunnypranav, okay! Will signal you once I'm done, thanks!
2025-10-28 13:20:13 <bunnypranav> Sure :)
2025-10-28 13:20:39 <wikibugs> ('Merged) ''jenkins-bot: Make wgVectorMaxWidthOptions specify Special:Userlogin correctly [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199074 (https://phabricator.wikimedia.org/T408447) (owner: ''Bartosz Dziewoński)'
2025-10-28 13:21:13 <logmsgbot> !log derick@deploy2002 Started scap sync-world: Backport for [[gerrit:1199074|Make wgVectorMaxWidthOptions specify Special:Userlogin correctly (T408447)]]
2025-10-28 13:21:19 <stashbot> T408447: Under Vector 2022 on Wikimedia wikis, page width is different between Special:UserLogin and Special:CreateAccount - https://phabricator.wikimedia.org/T408447
2025-10-28 13:23:23 <wikibugs> ('PS1) ''Mszwarc: Remove hCaptcha site key from private/readme.php [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199291'
2025-10-28 13:23:50 <wikibugs> ('CR) ''Kosta Harlan: [C:''+1] Remove hCaptcha site key from private/readme.php [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199291 (owner: ''Mszwarc)'
2025-10-28 13:24:14 <kostajh> xSavitar MatmaRex we need to sync the above patch ^
2025-10-28 13:24:15 <wikibugs> ('PS14) ''Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - ''https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574)'
2025-10-28 13:25:04 <kostajh> are either of you able to sync that? it should be a no-op. if not, either me or Msz2001 can do it
2025-10-28 13:25:08 <logmsgbot> !log derick@deploy2002 derick, matmarex: Backport for [[gerrit:1199074|Make wgVectorMaxWidthOptions specify Special:Userlogin correctly (T408447)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-28 13:25:12 <xSavitar> kostajh, sure! After bunnypranav or now?
2025-10-28 13:25:25 <xSavitar> MatmaRex, you can test
2025-10-28 13:25:26 <kostajh> as soon as possible, I'd say
2025-10-28 13:25:49 <MatmaRex> my change looks good
2025-10-28 13:25:53 <xSavitar> Okay, once MatmaRex is done testing, maybe you can take over before bunnypranav (just an idea). That is if bunnypranav is up for it.
2025-10-28 13:26:05 <xSavitar> MatmaRex, okay will sync now.
2025-10-28 13:26:06 <bunnypranav> I'm fine, can wait if needed.
2025-10-28 13:26:12 <logmsgbot> !log derick@deploy2002 derick, matmarex: Continuing with sync
2025-10-28 13:26:38 <xSavitar> kostajh, okay bunnypranav agrees. I'll poke you once MatmaRex's patch is done syncing.
2025-10-28 13:27:39 <xSavitar> kostajh, I can also help in doing it.
2025-10-28 13:28:22 <wikibugs> ('CR) ''Ottomata: Disable default user-agent collection. (''1 comment) [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964) (owner: ''JavierMonton)'
2025-10-28 13:29:02 <kostajh> thank you!
2025-10-28 13:29:17 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-28 13:29:30 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-28 13:29:39 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-28 13:29:46 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-28 13:29:49 <wikibugs> ('PS15) ''Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - ''https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574)'
2025-10-28 13:29:49 <wikibugs> ('CR) ''Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present (''3 comments) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: ''Pmiazga)'
2025-10-28 13:32:10 <logmsgbot> !log derick@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199074|Make wgVectorMaxWidthOptions specify Special:Userlogin correctly (T408447)]] (duration: 10m 56s)
2025-10-28 13:32:14 <stashbot> T408447: Under Vector 2022 on Wikimedia wikis, page width is different between Special:UserLogin and Special:CreateAccount - https://phabricator.wikimedia.org/T408447
2025-10-28 13:33:05 <wikibugs> ('CR) ''Muehlenhoff: "Looks good to me!" [software/transferpy] - ''https://gerrit.wikimedia.org/r/1180570 (https://phabricator.wikimedia.org/T393692) (owner: ''Muehlenhoff)'
2025-10-28 13:33:19 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by derick@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199291 (owner: ''Mszwarc)'
2025-10-28 13:33:36 <xSavitar> kostajh, so nothing to test I suppose?
2025-10-28 13:33:45 <kostajh> xSavitar: nothing to test
2025-10-28 13:33:57 <xSavitar> Ack! Will just sync it when it's time then, thanks~
2025-10-28 13:34:01 <xSavitar> *!
2025-10-28 13:34:16 <wikibugs> ('Merged) ''jenkins-bot: Remove hCaptcha site key from private/readme.php [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199291 (owner: ''Mszwarc)'
2025-10-28 13:34:48 <logmsgbot> !log derick@deploy2002 Started scap sync-world: Backport for [[gerrit:1199291|Remove hCaptcha site key from private/readme.php]]
2025-10-28 13:35:35 <MatmaRex> thanks for deploying xSavitar
2025-10-28 13:35:59 <wikibugs> 'SRE, ''collaboration-services, ''Traffic, ''Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11318699 (''LSobanski)'
2025-10-28 13:36:22 <xSavitar> MatmaRex, thank you :)
2025-10-28 13:38:53 <logmsgbot> !log derick@deploy2002 mszwarc, derick: Backport for [[gerrit:1199291|Remove hCaptcha site key from private/readme.php]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-28 13:39:16 <logmsgbot> !log derick@deploy2002 mszwarc, derick: Continuing with sync
2025-10-28 13:39:42 <wikibugs> 'ops-ulsfo, ''SRE, ''DC-Ops, ''Infrastructure-Foundations, ''netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11318700 (''Papaul) @cmooney thanks for the feedback, I will upgrade the diagram to match the 100G links between the core routers and the switches and the type of...'
2025-10-28 13:42:43 <xSavitar> bunnypranav, 64% done, will hand over to you in a few mins.
2025-10-28 13:42:56 <bunnypranav> sure!
2025-10-28 13:43:46 <logmsgbot> !log derick@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199291|Remove hCaptcha site key from private/readme.php]] (duration: 08m 58s)
2025-10-28 13:43:55 <xSavitar> bunnypranav over to you.
2025-10-28 13:44:18 <xSavitar> and thank you for your patience. 🙏🏽
2025-10-28 13:44:27 <bunnypranav> No worries
2025-10-28 13:45:21 <bunnypranav> I need your help as well: the patch creates a namespace; do we need to run any maintenance scripts?
2025-10-28 13:46:17 <bunnypranav> btw, the namespace is "R:", and they already use that prefix, technically in the mainspace, so I assume the former.
2025-10-28 13:46:25 <bunnypranav> xSavitar: ^^^
2025-10-28 13:46:38 <anzx> bunnypranav: run namespacedupes
2025-10-28 13:46:49 <xSavitar> anzx beat me to it.
2025-10-28 13:47:23 <bunnypranav> I assume the pages won't be lost, right?
2025-10-28 13:49:30 <wikibugs> ('PS2) ''JavierMonton: Disable default user-agent collection. [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964)'
2025-10-28 13:49:32 <xSavitar> bunnypranav, I think everything should be fine.
2025-10-28 13:49:36 <anzx> bunnypranav: https://www.mediawiki.org/wiki/Manual:NamespaceDupes.php add prefix to check whether any pages that are lost, unmoved, or need to be moved manually can be retrieved
2025-10-28 13:49:53 <xSavitar> Are there any pages that are already in that namespace? In the past?
2025-10-28 13:50:12 <xSavitar> I guess I shouldn't say namespace but prefixed by R:
2025-10-28 13:50:28 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 13:50:37 <xSavitar> After running that script, everything should work correctly and they should be part of the R: and R_talk: namespace I suppose.
2025-10-28 13:51:14 <bunnypranav> Okay!
2025-10-28 13:51:19 <xSavitar> runs for a meeting...
2025-10-28 13:51:28 <bunnypranav> xSavitar: BTW I need you to deploy it for me, I am just a volunteer.
2025-10-28 13:51:57 <wikibugs> ('CR) ''Giuseppe Lavagetto: "I think the patch goes in the right direction, but is overcomplicated and misses a couple things:" [puppet] - ''https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: ''Fabfur)'
2025-10-28 13:52:12 <xSavitar> bunnypranav, Oh I could do that but having a meeting now. Will you be fine doing the next backport window? That is if another deployer isn't around to help.
2025-10-28 13:52:15 <wikibugs> ('CR) ''JavierMonton: Disable default user-agent collection. (''1 comment) [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964) (owner: ''JavierMonton)'
2025-10-28 13:52:27 <xSavitar> I thought you would be the one deploying, apologies, I would have asked.
2025-10-28 13:52:31 <bunnypranav> The next window is 1:30 am for me
2025-10-28 13:52:49 <bunnypranav> It's fine
2025-10-28 13:53:22 <xSavitar> Oops :(, I'll ping you here in a few hours (later this evening). If there is an open window, we can deploy your patch.
2025-10-28 13:53:39 <xSavitar> Otherwise, we can do it tomorrow afternoon (that's when I'll be available).
2025-10-28 13:53:54 <xSavitar> Is that okay by you?
2025-10-28 13:54:21 <wikibugs> ('CR) ''Clément Goubert: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present (''2 comments) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: ''Pmiazga)'
2025-10-28 13:54:28 <bunnypranav> Fine, I'll see if I am available tomorrow.
2025-10-28 13:54:46 <bunnypranav> These deploy windows are pretty tough for Asian timezones
2025-10-28 13:55:10 <xSavitar> bunnypranav, FYI - this is the docs for adding a new namespace: https://wikitech.wikimedia.org/wiki/Adding_namespaces
2025-10-28 13:55:15 <xSavitar> I hope it's still up to date.
2025-10-28 13:55:19 <bunnypranav> Can I ping you in a few hours once I am available as well?
2025-10-28 13:55:34 <xSavitar> bunnypranav, yes ping me please. I want to help.
2025-10-28 13:55:48 <bunnypranav> Thank you so much!
2025-10-28 13:56:01 <xSavitar> bunnypranav, no thank you for all the work. 🙏🏽
2025-10-28 13:56:12 <bunnypranav> :D
2025-10-28 13:56:31 <xSavitar> Re tz friendliness, maybe you can ask on #wikimedia-releng about it.
2025-10-28 13:56:52 <xSavitar> But we have several of these windows per day, so I suppose at least one of them is friendly to your TZ
2025-10-28 13:57:11 <xSavitar> goes AFK to attend a meeting.
2025-10-28 13:57:28 <bunnypranav> Checked the wikitech page earlier, commit is fine; just needed confirmation on the maintenance scripts
2025-10-28 13:58:07 <bunnypranav> yeah, the afternoon one was fine, today I was busy for the morning one, so couldn't schedule for it.
2025-10-28 14:00:05 <jouncebot> Deploy window Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1400)
2025-10-28 14:01:51 <wikibugs> ('CR) ''Elukey: [C:''+1] osm_sync_lag.sh: Fix default to current directory [puppet] - ''https://gerrit.wikimedia.org/r/1199265 (https://phabricator.wikimedia.org/T381565) (owner: ''Muehlenhoff)'
2025-10-28 14:02:12 <wikibugs> ('CR) ''Elukey: [C:''+1] maps: Stop installing osm2pgsql and osmborder [puppet] - ''https://gerrit.wikimedia.org/r/1199271 (https://phabricator.wikimedia.org/T381565) (owner: ''Muehlenhoff)'
2025-10-28 14:02:41 <wikibugs> ('CR) ''Elukey: [C:''+1] LVS: etcd data for druid-public-coordinator [puppet] - ''https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: ''Stevemunene)'
2025-10-28 14:02:58 <wikibugs> ('CR) ''Elukey: [C:''+1] LVS: Add druid-public-coordinator to service list (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1198499 (https://phabricator.wikimedia.org/T406222) (owner: ''Stevemunene)'
2025-10-28 14:03:12 <wikibugs> ('CR) ''Elukey: [C:''+1] druid: add druid-coordinator to druid public worker role [puppet] - ''https://gerrit.wikimedia.org/r/1199256 (https://phabricator.wikimedia.org/T406222) (owner: ''Stevemunene)'
2025-10-28 14:05:32 <wikibugs> ('PS16) ''Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - ''https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574)'
2025-10-28 14:05:51 <wikibugs> ('PS1) ''Brouberol: global_config: add an urldownloader external service [puppet] - ''https://gerrit.wikimedia.org/r/1199297 (https://phabricator.wikimedia.org/T408012)'
2025-10-28 14:09:58 <wikibugs> ('CR) ''Brouberol: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1199297 (https://phabricator.wikimedia.org/T408012) (owner: ''Brouberol)'
2025-10-28 14:10:46 <wikibugs> ('PS5) ''Daniel Kinzler: api-gateway: make cookie name configurable for testing [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128)'
2025-10-28 14:10:58 <wikibugs> ('CR) ''CI reject: [V:''-1] api-gateway: make cookie name configurable for testing [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) (owner: ''Daniel Kinzler)'
2025-10-28 14:13:00 <wikibugs> ('PS1) ''Federico Ceratto: sanitize-wiki: log into phabricator [cookbooks] - ''https://gerrit.wikimedia.org/r/1199301 (https://phabricator.wikimedia.org/T408512)'
2025-10-28 14:14:10 <wikibugs> ('PS1) ''Muehlenhoff: Update account meta data for khantstop [puppet] - ''https://gerrit.wikimedia.org/r/1199302'
2025-10-28 14:14:48 <wikibugs> ('CR) ''Ottomata: [C:''+1] "I didn't look very deep to check each config, but LGTM!" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964) (owner: ''JavierMonton)'
2025-10-28 14:17:08 <wikibugs> ('PS17) ''Clément Goubert: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - ''https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: ''Pmiazga)'
2025-10-28 14:19:50 <wikibugs> ('PS18) ''Clément Goubert: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - ''https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: ''Pmiazga)'
2025-10-28 14:19:50 <wikibugs> ('PS7) ''Clément Goubert: api-gateway: support per-route rate limit groups for rest gateway [deployment-charts] - ''https://gerrit.wikimedia.org/r/1192879 (owner: ''Daniel Kinzler)'
2025-10-28 14:20:06 <wikibugs> ('PS8) ''Jasmine: wikikube: Add wikikube-worker2[248-330] [puppet] - ''https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859)'
2025-10-28 14:20:28 <jinxer-wm> FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
2025-10-28 14:21:24 <wikibugs> ('CR) ''Kamila Součková: [C:''+2] admin: add dpogorzelski to ops-limited [puppet] - ''https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) (owner: ''Kamila Součková)'
2025-10-28 14:23:12 <wikibugs> ('PS7) ''Clément Goubert: api-gateway: make cookie name configurable for testing [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) (owner: ''Daniel Kinzler)'
2025-10-28 14:23:15 <wikibugs> ('CR) ''Clare Ming: [C:''+2] xLab: Deploying v1.1.0 release to staging [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199228 (https://phabricator.wikimedia.org/T406729) (owner: ''Santiago Faci)'
2025-10-28 14:24:26 <wikibugs> ('CR) ''Jasmine: wikikube: Add wikikube-worker2[248-330] (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) (owner: ''Jasmine)'
2025-10-28 14:24:53 <wikibugs> ('Merged) ''jenkins-bot: xLab: Deploying v1.1.0 release to staging [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199228 (https://phabricator.wikimedia.org/T406729) (owner: ''Santiago Faci)'
2025-10-28 14:26:09 <wikibugs> ('PS1) ''Majavah: toolforge::toolviews: Output proper Prometheus metrics [puppet] - ''https://gerrit.wikimedia.org/r/1199305 (https://phabricator.wikimedia.org/T408457)'
2025-10-28 14:26:39 <wikibugs> ('CR) ''Andrew Bogott: [C:''+1] clean-stale-puppet-certs: Remove nodes from PuppetDB where enabled [puppet] - ''https://gerrit.wikimedia.org/r/1198299 (owner: ''Majavah)'
2025-10-28 14:27:21 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''Traffic: lvs2011 hardware issue after reboot - https://phabricator.wikimedia.org/T408549#11318894 (''Jhancock.wm) logged into idrac and found following error. ` A critical diagnostic event occurred in the memory device at B2. Contact your service provider for assistance in...'
2025-10-28 14:27:56 <wikibugs> ('CR) ''Kamila Součková: [C:''+1] "LGTM :-)" [puppet] - ''https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) (owner: ''Jasmine)'
2025-10-28 14:28:19 <wikibugs> ('CR) ''CI reject: [V:''-1] toolforge::toolviews: Output proper Prometheus metrics [puppet] - ''https://gerrit.wikimedia.org/r/1199305 (https://phabricator.wikimedia.org/T408457) (owner: ''Majavah)'
2025-10-28 14:28:33 <wikibugs> ('PS20) ''Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies [puppet] - ''https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259)'
2025-10-28 14:29:23 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to ops-limited for dpogorzelski - https://phabricator.wikimedia.org/T407955#11318896 (''Raine)'
2025-10-28 14:29:35 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964) (owner: ''JavierMonton)'
2025-10-28 14:30:05 <jouncebot> Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1430)
2025-10-28 14:30:18 <wikibugs> ('PS19) ''Clément Goubert: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - ''https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: ''Pmiazga)'
2025-10-28 14:30:58 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''netops, ''Observability-Alerting: Nokia OSPF alerts not working - https://phabricator.wikimedia.org/T408378#11318918 (''tappof) I saw the alerts on the ALERTS metric: https://w.wiki/FqSi . I think there was a silence rule in place, so you didn't get any notifications....'
2025-10-28 14:31:46 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''Traffic: lvs2011 hardware issue after reboot - https://phabricator.wikimedia.org/T408549#11318932 (''ssingh) ''Open''Resolved a:''ssingh Thanks for the help @Jhancock.wm. Marking this as resolved for now.'
2025-10-28 14:32:33 <wikibugs> ('PS9) ''Clément Goubert: api-gateway: support per-route rate limit groups for rest gateway [deployment-charts] - ''https://gerrit.wikimedia.org/r/1192879 (owner: ''Daniel Kinzler)'
2025-10-28 14:33:26 <wikibugs> 'SRE-SLO, ''Experimentation Lab (Experiment Platform Sprint 14), ''OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11318939 (''dr0ptp4kt) >>! In T398869#11318126, @elukey wrote: > We finally have all three SLO published in Pyrra: https://slo.wikimedia.org/?search=xlab Thank...'
2025-10-28 14:33:50 <wikibugs> ('PS9) ''Clément Goubert: api-gateway: make cookie name configurable for testing [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) (owner: ''Daniel Kinzler)'
2025-10-28 14:35:17 <wikibugs> ('CR) ''CI reject: [V:''-1] api-gateway: make cookie name configurable for testing [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) (owner: ''Daniel Kinzler)'
2025-10-28 14:36:02 <icinga-wm> PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
2025-10-28 14:37:38 <wikibugs> 'SRE, ''SRE-Unowned, ''Maps, ''Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11318965 (''elukey) Ran the diff testing tool between eqiad and codfw: ` | | ssim | |-----:|---------:| | 0.05 | 0.974994 | | 0.1 | 0.990161 | | 0.2 | 0.998943 | | 0.25 |...'
2025-10-28 14:37:46 <wikibugs> ('PS1) ''Brouberol: growthbook: deploy a more modern version against ferretdb [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199310 (https://phabricator.wikimedia.org/T408397)'
2025-10-28 14:39:48 <wikibugs> ('PS1) ''Federico Ceratto: site.pp, es2026.yaml: Decommission es2026 [puppet] - ''https://gerrit.wikimedia.org/r/1199311 (https://phabricator.wikimedia.org/T408385)'
2025-10-28 14:40:48 <hashar> jouncebot: nowandnext
2025-10-28 14:40:48 <jouncebot> For the next 0 hour(s) and 19 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1430)
2025-10-28 14:40:48 <jouncebot> In 0 hour(s) and 19 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1500)
2025-10-28 14:41:36 <hashar> I am restarting both CI Jenkins and Gerrit
2025-10-28 14:42:07 <hashar> !log Restarting Gerrit
2025-10-28 14:42:10 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-28 14:44:46 <wikibugs> ('CR) ''Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies (''4 comments) [puppet] - ''https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) (owner: ''Jelto)'
2025-10-28 14:45:08 <hashar> !log Restarted CI Jenkins
2025-10-28 14:45:11 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-28 14:45:44 <wikibugs> ('CR) ''Majavah: [C:''+2] clean-stale-puppet-certs: Remove nodes from PuppetDB where enabled [puppet] - ''https://gerrit.wikimedia.org/r/1198299 (owner: ''Majavah)'
2025-10-28 14:45:45 <hashar> Gerrit/Jenkins/Zuul are all up and running
2025-10-28 14:46:02 <icinga-wm> RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30030 bytes in 9.007 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
2025-10-28 14:46:34 <wikibugs> ('CR) ''Andrea Denisse: [C:''+1] "lgtm, thank you!" [puppet] - ''https://gerrit.wikimedia.org/r/1199248 (https://phabricator.wikimedia.org/T376535) (owner: ''Huei Tan)'
2025-10-28 14:46:59 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''netops, ''Observability-Alerting: Nokia OSPF alerts not working - https://phabricator.wikimedia.org/T408378#11319051 (''cmooney) >>! In T408378#11318918, @tappof wrote: > I saw the alerts on the ALERTS metric: https://w.wiki/FqSi . Ok thanks for that! That is a good...'
2025-10-28 14:47:21 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''Data-Platform-SRE (2025.10.17 - 2025.11.07), ''Essential-Work: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11319065 (''RobH)'
2025-10-28 14:47:42 <wikibugs> ('PS1) ''Clément Goubert: api-gateway: Release patch for ratelimit test [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128)'
2025-10-28 14:48:29 <wikibugs> ('PS2) ''Clément Goubert: api-gateway: Release patch for ratelimit test [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128)'
2025-10-28 14:49:22 <wikibugs> ('CR) ''Clément Goubert: "Due to rebasing issues, I've squashed all the patch stack for the next phase of testing in one, plus renaming group to policy." [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) (owner: ''Clément Goubert)'
2025-10-28 14:49:56 <wikibugs> ('CR) ''CI reject: [V:''-1] api-gateway: Release patch for ratelimit test [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) (owner: ''Clément Goubert)'
2025-10-28 14:50:02 <icinga-wm> PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
2025-10-28 14:50:54 <icinga-wm> RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30036 bytes in 0.463 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
2025-10-28 14:51:04 <wikibugs> ('PS3) ''Clément Goubert: api-gateway: Release patch for ratelimit test [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128)'
2025-10-28 14:51:36 <wikibugs> ('PS1) ''Cathal Mooney: team-netops: ospf alert: add pint disable promql/series [alerts] - ''https://gerrit.wikimedia.org/r/1199332 (https://phabricator.wikimedia.org/T408378)'
2025-10-28 14:52:06 <wikibugs> ('CR) ''Pmiazga: api-gateway: Release patch for ratelimit test (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) (owner: ''Clément Goubert)'
2025-10-28 14:52:32 <wikibugs> ('CR) ''CI reject: [V:''-1] api-gateway: Release patch for ratelimit test [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) (owner: ''Clément Goubert)'
2025-10-28 14:52:33 <logmsgbot> !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad
2025-10-28 14:52:37 <wikibugs> ('PS2) ''Majavah: toolforge::toolviews: Output proper Prometheus metrics [puppet] - ''https://gerrit.wikimedia.org/r/1199305 (https://phabricator.wikimedia.org/T408457)'
2025-10-28 14:52:37 <wikibugs> ('PS1) ''Majavah: toolforge::toolviews: Fix footgun with default values [puppet] - ''https://gerrit.wikimedia.org/r/1199333'
2025-10-28 14:54:03 <wikibugs> ('PS1) ''Gehel: hadoop: cleanup /tmp from directories as well as files [puppet] - ''https://gerrit.wikimedia.org/r/1199334 (https://phabricator.wikimedia.org/T396582)'
2025-10-28 14:55:01 <wikibugs> ('PS3) ''Cwhite: site: initial setup for new logging-sd hosts [puppet] - ''https://gerrit.wikimedia.org/r/1199062 (https://phabricator.wikimedia.org/T406796)'
2025-10-28 14:55:07 <wikibugs> ('CR) ''CI reject: [V:''-1] toolforge::toolviews: Output proper Prometheus metrics [puppet] - ''https://gerrit.wikimedia.org/r/1199305 (https://phabricator.wikimedia.org/T408457) (owner: ''Majavah)'
2025-10-28 14:56:33 <wikibugs> ('PS4) ''Clément Goubert: api-gateway: Release patch for ratelimit test [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128)'
2025-10-28 14:57:38 <logmsgbot> !log dancy@deploy2002 Installing scap version "4.218.0" for 2 host(s)
2025-10-28 14:57:57 <wikibugs> ('CR) ''Clément Goubert: api-gateway: Release patch for ratelimit test (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) (owner: ''Clément Goubert)'
2025-10-28 14:57:58 <wikibugs> ('CR) ''CI reject: [V:''-1] api-gateway: Release patch for ratelimit test [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) (owner: ''Clément Goubert)'
2025-10-28 14:58:11 <wikibugs> ('CR) ''FNegri: [C:''+1] toolforge::toolviews: Fix footgun with default values [puppet] - ''https://gerrit.wikimedia.org/r/1199333 (owner: ''Majavah)'
2025-10-28 14:59:11 <wikibugs> ('PS2) ''Majavah: toolforge::toolviews: Fix footgun with default values [puppet] - ''https://gerrit.wikimedia.org/r/1199333'
2025-10-28 14:59:24 <logmsgbot> !log dancy@deploy2002 Installation of scap version "4.218.0" completed for 2 hosts
2025-10-28 14:59:59 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting access to ops-limited for dpogorzelski - https://phabricator.wikimedia.org/T407955#11319159 (''Raine) ''Open''Resolved Done, ping me in case of trouble :-)'
2025-10-28 15:00:05 <jouncebot> jelto, arnoldokoth, and mutante: It is that lovely time of the day again! You are hereby commanded to deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1500).
2025-10-28 15:00:31 <jelto> no my calendar says it's in one hour
2025-10-28 15:00:41 <taavi> daylight confusion time
2025-10-28 15:01:05 <wikibugs> ('PS21) ''Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies [puppet] - ''https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259)'
2025-10-28 15:01:46 <wikibugs> ('CR) ''Majavah: [C:''+2] toolforge::toolviews: Fix footgun with default values [puppet] - ''https://gerrit.wikimedia.org/r/1199333 (owner: ''Majavah)'
2025-10-28 15:02:31 <wikibugs> ('PS3) ''Majavah: toolforge::toolviews: Output proper Prometheus metrics [puppet] - ''https://gerrit.wikimedia.org/r/1199305 (https://phabricator.wikimedia.org/T408457)'
2025-10-28 15:04:14 <wikibugs> 'SRE, ''Traffic, ''FY2025-26 WE3.3 Engaging core audiences, ''Reader Experience Team (REx Sprint 8 [Q2 Oct 21-Nov 3]): [Reading Lists] Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526#11319191 (''Jdrewniak) When I talked to #traffic about this topic...'
2025-10-28 15:04:35 <wikibugs> ('CR) ''CI reject: [V:''-1] toolforge::toolviews: Output proper Prometheus metrics [puppet] - ''https://gerrit.wikimedia.org/r/1199305 (https://phabricator.wikimedia.org/T408457) (owner: ''Majavah)'
2025-10-28 15:05:42 <logmsgbot> !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on phab1004.eqiad.wmnet with reason: reboot for kernel
2025-10-28 15:06:00 <wikibugs> ('PS4) ''Majavah: toolforge::toolviews: Output proper Prometheus metrics [puppet] - ''https://gerrit.wikimedia.org/r/1199305 (https://phabricator.wikimedia.org/T408457)'
2025-10-28 15:06:19 <logmsgbot> !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on phab2002.codfw.wmnet with reason: reboot for kernel
2025-10-28 15:06:34 <wikibugs> ('PS22) ''Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies [puppet] - ''https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259)'
2025-10-28 15:07:20 <wikibugs> 'SRE, ''SRE-Unowned, ''Maps, ''Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11319213 (''elukey) @TheDJ Hi! As FYI we now have eqiad and codfw on the new stack, both eqiad and codfw are pooled :)'
2025-10-28 15:07:23 <wikibugs> ('CR) ''Jelto: [V:''+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7486/co" [puppet] - ''https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) (owner: ''Jelto)'
2025-10-28 15:09:04 <jinxer-wm> FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-28 15:09:07 <logmsgbot> !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
2025-10-28 15:09:11 <logmsgbot> !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
2025-10-28 15:09:19 <logmsgbot> !log brennen@deploy2002 Started deploy [phabricator/deployment@5fbb350]: deploy phab1004 for T408575
2025-10-28 15:09:21 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply
2025-10-28 15:09:24 <stashbot> T408575: Deploy Phabricator/Phorge 2025-10-28 - https://phabricator.wikimedia.org/T408575
2025-10-28 15:09:25 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
2025-10-28 15:09:53 <logmsgbot> !log brennen@deploy2002 Finished deploy [phabricator/deployment@5fbb350]: deploy phab1004 for T408575 (duration: 00m 34s)
2025-10-28 15:10:12 <logmsgbot> !log brennen@deploy2002 Started deploy [phabricator/deployment@5fbb350]: deploy phab1004 for T408575
2025-10-28 15:11:37 <wikibugs> 'ops-codfw, ''DC-Ops, ''Machine-Learning-Team: DIMM_A2 errors for ml-serve2001 - https://phabricator.wikimedia.org/T408516#11319244 (''elukey)'
2025-10-28 15:11:41 <swfrench-wmf> !log applied mediawiki-common network policy updates in mw-script / mw-cron - T309738
2025-10-28 15:11:48 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-28 15:11:52 <stashbot> T309738: Move MediaWiki QueryPages computation to Hadoop - https://phabricator.wikimedia.org/T309738
2025-10-28 15:12:13 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11319246 (''Ladsgroup) >>! In T406590#11318342, @Neslihan_Turan_WMDE wrote: > Hi, sorry for the delay. I had a problem accessing Slack but now I managed to send my...'
2025-10-28 15:12:22 <wikibugs> 'ops-codfw, ''DC-Ops, ''Machine-Learning-Team: DIMM_A2 errors for ml-serve2001 - https://phabricator.wikimedia.org/T408516#11319258 (''elukey) The host is up after a powercycle, but it is still not serving any traffic. Adding dcops if they want to investigate it further, given the numerous occurrences of t...'
2025-10-28 15:13:24 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11319262 (''Ladsgroup) I confirmed the key out of band.'
2025-10-28 15:13:38 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11319266 (''Ladsgroup)'
2025-10-28 15:14:01 <wikibugs> ('PS1) ''Ottomata: AQS edit-analytics - deploy new edits/per_editor endpoint [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199337 (https://phabricator.wikimedia.org/T405041)'
2025-10-28 15:16:21 <logmsgbot> !log brennen@deploy2002 Finished deploy [phabricator/deployment@5fbb350]: deploy phab1004 for T408575 (duration: 06m 09s)
2025-10-28 15:16:33 <stashbot> T408575: Deploy Phabricator/Phorge 2025-10-28 - https://phabricator.wikimedia.org/T408575
2025-10-28 15:16:56 <wikibugs> ('PS5) ''Clément Goubert: api-gateway: Release patch for ratelimit test [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128)'
2025-10-28 15:19:26 <wikibugs> ('CR) ''Elukey: [C:''+1] Nokia: always set system cpm packet filter on devices [homer/public] - ''https://gerrit.wikimedia.org/r/1199056 (https://phabricator.wikimedia.org/T402577) (owner: ''Cathal Mooney)'
2025-10-28 15:20:05 <wikibugs> ('CR) ''Brouberol: [C:''+1] druid: Increase the size of the Druid broker cache size to 4GB [puppet] - ''https://gerrit.wikimedia.org/r/1199280 (https://phabricator.wikimedia.org/T408189) (owner: ''Stevemunene)'
2025-10-28 15:21:39 <wikibugs> 'ops-codfw, ''DC-Ops, ''Machine-Learning-Team: DIMM_A2 errors for ml-serve2001 - https://phabricator.wikimedia.org/T408516#11319327 (''Jhancock.wm) @elukey is it depooled? i wanna check some things out that might require some reboots.'
2025-10-28 15:23:36 <swfrench-wmf> !log disable-puppet on A:cp hosts for haproxy config change
2025-10-28 15:23:39 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-28 15:24:02 <icinga-wm> PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
2025-10-28 15:24:02 <wikibugs> ('CR) ''Stevemunene: [C:''+1] "LGTM!" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199310 (https://phabricator.wikimedia.org/T408397) (owner: ''Brouberol)'
2025-10-28 15:24:16 <wikibugs> 'ops-codfw, ''DC-Ops, ''Machine-Learning-Team: DIMM_A2 errors for ml-serve2001 - https://phabricator.wikimedia.org/T408516#11319338 (''elukey) @Jhancock.wm yep you can go ahead! Thanks :)'
2025-10-28 15:24:33 <wikibugs> ('CR) ''Scott French: "Thanks for the review!" [puppet] - ''https://gerrit.wikimedia.org/r/1193276 (https://phabricator.wikimedia.org/T403220) (owner: ''Scott French)'
2025-10-28 15:24:36 <wikibugs> ('CR) ''Scott French: [C:''+2] P:cache::haproxy: move x_requestctl setup into listen section [puppet] - ''https://gerrit.wikimedia.org/r/1193276 (https://phabricator.wikimedia.org/T403220) (owner: ''Scott French)'
2025-10-28 15:24:56 <icinga-wm> RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30037 bytes in 2.732 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
2025-10-28 15:25:28 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 15:27:27 <wikibugs> 'SRE, ''Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11319349 (''MoritzMuehlenhoff)'
2025-10-28 15:27:29 <wikibugs> ('PS6) ''Clément Goubert: api-gateway: Release patch for ratelimit test [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128)'
2025-10-28 15:27:55 <wikibugs> ('Abandoned) ''Clément Goubert: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - ''https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: ''Pmiazga)'
2025-10-28 15:28:05 <wikibugs> ('Abandoned) ''Clément Goubert: api-gateway: support per-route rate limit groups for rest gateway [deployment-charts] - ''https://gerrit.wikimedia.org/r/1192879 (owner: ''Daniel Kinzler)'
2025-10-28 15:28:11 <wikibugs> ('Abandoned) ''Clément Goubert: api-gateway: make cookie name configurable for testing [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) (owner: ''Daniel Kinzler)'
2025-10-28 15:29:37 <wikibugs> ('CR) ''CDanis: [C:''+1] "+1 from me! Although I don't think it's strictly necessary to make the same change on the public druid IMO" [puppet] - ''https://gerrit.wikimedia.org/r/1199280 (https://phabricator.wikimedia.org/T408189) (owner: ''Stevemunene)'
2025-10-28 15:34:04 <jinxer-wm> RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-28 15:34:52 <logmsgbot> !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ml-serve2001']
2025-10-28 15:34:53 <wikibugs> ('PS2) ''Arlolra: ExtensionDistributor: Mark 1.45 as beta [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199113 (https://phabricator.wikimedia.org/T408466)'
2025-10-28 15:35:13 <wikibugs> ('CR) ''Herron: [C:''+1] alertmanager: Add support for team mentions on the Slack template [puppet] - ''https://gerrit.wikimedia.org/r/1194321 (https://phabricator.wikimedia.org/T408145) (owner: ''Andrea Denisse)'
2025-10-28 15:36:36 <wikibugs> ('CR) ''Ottomata: [C:''+2] "Main patch has been reviewed, merging for deployment." [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199337 (https://phabricator.wikimedia.org/T405041) (owner: ''Ottomata)'
2025-10-28 15:36:49 <wikibugs> ('CR) ''Herron: [C:''+1] nrpe2nodexp: use service description as alertname [puppet] - ''https://gerrit.wikimedia.org/r/1199242 (https://phabricator.wikimedia.org/T395446) (owner: ''Tiziano Fogli)'
2025-10-28 15:37:56 <wikibugs> ('CR) ''Arlolra: ExtensionDistributor: Mark 1.45 as beta (''1 comment) [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199113 (https://phabricator.wikimedia.org/T408466) (owner: ''Arlolra)'
2025-10-28 15:38:23 <wikibugs> ('Merged) ''jenkins-bot: AQS edit-analytics - deploy new edits/per_editor endpoint [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199337 (https://phabricator.wikimedia.org/T405041) (owner: ''Ottomata)'
2025-10-28 15:41:52 <logmsgbot> !log otto@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
2025-10-28 15:43:49 <swfrench-wmf> !log rolling run-puppet-agent on A:cp hosts for haproxy config change
2025-10-28 15:43:52 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-28 15:44:52 <wikibugs> ('PS1) ''Kamila Součková: benthos-cache-invalidator: clean up releases [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199340'
2025-10-28 15:44:55 <logmsgbot> !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ml-serve2001']
2025-10-28 15:46:22 <wikibugs> 'ops-eqiad, ''DC-Ops: Unresponsive management for ms-be1090.mgmt:22 - https://phabricator.wikimedia.org/T408585 (''phaultfinder) ''NEW'
2025-10-28 15:46:39 <wikibugs> ('CR) ''CI reject: [V:''-1] benthos-cache-invalidator: clean up releases [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199340 (owner: ''Kamila Součková)'
2025-10-28 15:49:29 <wikibugs> ('PS1) ''PipelineBot: citoid: pipeline bot promote [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199403'
2025-10-28 15:51:12 <wikibugs> ('PS3) ''Ebernhardson: cirrus: Start near match A/B test [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199054 (https://phabricator.wikimedia.org/T408154)'
2025-10-28 15:51:12 <wikibugs> ('CR) ''Ebernhardson: cirrus: Start near match A/B test (''1 comment) [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199054 (https://phabricator.wikimedia.org/T408154) (owner: ''Ebernhardson)'
2025-10-28 15:51:58 <wikibugs> ('CR) ''CI reject: [V:''-1] cirrus: Start near match A/B test [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199054 (https://phabricator.wikimedia.org/T408154) (owner: ''Ebernhardson)'
2025-10-28 15:54:09 <jinxer-wm> FIRING: HelmReleaseBadStatus: Helm release edit-analytics/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=edit-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
2025-10-28 15:54:31 <wikibugs> ('PS23) ''Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies [puppet] - ''https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259)'
2025-10-28 15:58:29 <wikibugs> ('CR) ''Cathal Mooney: [C:''+2] Nokia: always set system cpm packet filter on devices [homer/public] - ''https://gerrit.wikimedia.org/r/1199056 (https://phabricator.wikimedia.org/T402577) (owner: ''Cathal Mooney)'
2025-10-28 15:59:04 <jinxer-wm> FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
2025-10-28 15:59:17 <wikibugs> ('PS4) ''Ebernhardson: cirrus: Start near match A/B test [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199054 (https://phabricator.wikimedia.org/T408154)'
2025-10-28 16:00:00 <wikibugs> ('Merged) ''jenkins-bot: Nokia: always set system cpm packet filter on devices [homer/public] - ''https://gerrit.wikimedia.org/r/1199056 (https://phabricator.wikimedia.org/T402577) (owner: ''Cathal Mooney)'
2025-10-28 16:00:04 <jouncebot> jhathaway and moritzm: Time to snap out of that daydream and deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1600).
2025-10-28 16:00:05 <jouncebot> No Gerrit patches in the queue for this window AFAICS.
2025-10-28 16:04:04 <jinxer-wm> RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
2025-10-28 16:05:20 <wikibugs> 'ops-magru, ''SRE, ''DC-Ops: MAGRU power maint - CHG0262056 - October 29-30, 2025 - https://phabricator.wikimedia.org/T408589 (''RobH) ''NEW p:''Triage''Low'
2025-10-28 16:05:50 <wikibugs> 'ops-magru, ''SRE, ''DC-Ops: MAGRU power maint - CHG0262056 - October 29-30, 2025 - https://phabricator.wikimedia.org/T408589#11319581 (''RobH) Please note the email required we give consent for the work so I did so via the email.'
2025-10-28 16:06:52 <wikibugs> 'ops-magru, ''SRE, ''DC-Ops, ''Infrastructure-Foundations, and 2 others: MAGRU power maint - CHG0262056 - October 29-30, 2025 - https://phabricator.wikimedia.org/T408589#11319592 (''RobH) @netops & #traffic: I don't expect any impact from this according to the notification but just FYI!'
2025-10-28 16:13:53 <wikibugs> 'SRE, ''PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592 (''Jdrewniak) ''NEW'
2025-10-28 16:14:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 16:14:23 <wikibugs> 'SRE, ''PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11319644 (''Jdrewniak)'
2025-10-28 16:15:28 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 16:24:08 <wikibugs> ('CR) ''Marostegui: sanitize-wiki: log into phabricator (''1 comment) [cookbooks] - ''https://gerrit.wikimedia.org/r/1199301 (https://phabricator.wikimedia.org/T408512) (owner: ''Federico Ceratto)'
2025-10-28 16:30:56 <wikibugs> ('PS1) ''Marostegui: instances.yaml: Remove es1031 [puppet] - ''https://gerrit.wikimedia.org/r/1199462 (https://phabricator.wikimedia.org/T408600)'
2025-10-28 16:31:37 <wikibugs> ('CR) ''Marostegui: [C:''+2] instances.yaml: Remove es1031 [puppet] - ''https://gerrit.wikimedia.org/r/1199462 (https://phabricator.wikimedia.org/T408600) (owner: ''Marostegui)'
2025-10-28 16:32:53 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove es1031 from dbctl T408600', diff saved to https://phabricator.wikimedia.org/P84315 and previous config saved to /var/cache/conftool/dbconfig/20251028-163252-marostegui.json
2025-10-28 16:32:59 <stashbot> T408600: decommission es1031.eqiad.wmnet - https://phabricator.wikimedia.org/T408600
2025-10-28 16:34:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 16:34:18 <wikibugs> ('PS1) ''Marostegui: mariadb: Decommission es1031 [puppet] - ''https://gerrit.wikimedia.org/r/1199463 (https://phabricator.wikimedia.org/T408600)'
2025-10-28 16:34:49 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts es1031.eqiad.wmnet
2025-10-28 16:35:08 <wikibugs> ('PS1) ''Ottomata: edit-analytics - bump to build on bookworm [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199464 (https://phabricator.wikimedia.org/T405041)'
2025-10-28 16:35:14 <wikibugs> ('PS1) ''Elukey: prometheus-amd-rocm: fix exporter for ROCm 7.0.2 [puppet] - ''https://gerrit.wikimedia.org/r/1199465 (https://phabricator.wikimedia.org/T403697)'
2025-10-28 16:35:28 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 16:35:34 <wikibugs> ('CR) ''Marostegui: [C:''+2] mariadb: Decommission es1031 [puppet] - ''https://gerrit.wikimedia.org/r/1199463 (https://phabricator.wikimedia.org/T408600) (owner: ''Marostegui)'
2025-10-28 16:36:00 <wikibugs> ('CR) ''Marostegui: "is it already removed from dbctl?" [puppet] - ''https://gerrit.wikimedia.org/r/1199311 (https://phabricator.wikimedia.org/T408385) (owner: ''Federico Ceratto)'
2025-10-28 16:36:06 <wikibugs> ('CR) ''Ottomata: [C:''+2] edit-analytics - bump to build on bookworm [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199464 (https://phabricator.wikimedia.org/T405041) (owner: ''Ottomata)'
2025-10-28 16:36:08 <wikibugs> ('PS1) ''Mszwarc: hCaptcha: Store risk score in cache, so that jobs can use it [extensions/ConfirmEdit] (wmf/1.45.0-wmf.24) - ''https://gerrit.wikimedia.org/r/1199466 (https://phabricator.wikimedia.org/T408542)'
2025-10-28 16:36:27 <wikibugs> ('PS1) ''Mszwarc: hCaptcha: Store risk score in cache, so that jobs can use it [extensions/ConfirmEdit] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1199467 (https://phabricator.wikimedia.org/T408542)'
2025-10-28 16:36:29 <wikibugs> ('PS2) ''Elukey: prometheus-amd-rocm: fix exporter for ROCm 7.0.2 [puppet] - ''https://gerrit.wikimedia.org/r/1199465 (https://phabricator.wikimedia.org/T403697)'
2025-10-28 16:37:45 <wikibugs> ('Merged) ''jenkins-bot: edit-analytics - bump to build on bookworm [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199464 (https://phabricator.wikimedia.org/T405041) (owner: ''Ottomata)'
2025-10-28 16:38:29 <logmsgbot> !log otto@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
2025-10-28 16:38:34 <logmsgbot> !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
2025-10-28 16:40:43 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.dns.netbox
2025-10-28 16:40:58 <jinxer-wm> FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
2025-10-28 16:41:03 <logmsgbot> !log otto@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
2025-10-28 16:41:18 <sukhe> !incidents
2025-10-28 16:41:18 <sirenbot> 6905 (UNACKED) NELHigh sre (thanos-rule@main tcp.timed_out)
2025-10-28 16:41:24 <logmsgbot> !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
2025-10-28 16:41:24 <_joe_> sukhe: hi
2025-10-28 16:41:29 <sukhe> !ack 6905
2025-10-28 16:41:30 <_joe_> !ack 6905
2025-10-28 16:41:30 <Raine> !ack 6905
2025-10-28 16:41:32 <sirenbot> 6905 (ACKED) NELHigh sre (thanos-rule@main tcp.timed_out)
2025-10-28 16:41:33 <sirenbot> 6905 (ACKED) NELHigh sre (thanos-rule@main tcp.timed_out)
2025-10-28 16:41:33 <sirenbot> 6905 (ACKED) NELHigh sre (thanos-rule@main tcp.timed_out)
2025-10-28 16:44:09 <jinxer-wm> RESOLVED: HelmReleaseBadStatus: Helm release edit-analytics/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=edit-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
2025-10-28 16:44:17 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1031.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
2025-10-28 16:44:36 <logmsgbot> !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1031.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
2025-10-28 16:44:36 <logmsgbot> !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2025-10-28 16:44:39 <logmsgbot> !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1031.eqiad.wmnet
2025-10-28 16:44:44 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i"; [extensions/ConfirmEdit] (wmf/1.45.0-wmf.24) - ''https://gerrit.wikimedia.org/r/1199466 (https://phabricator.wikimedia.org/T408542) (owner: ''Mszwarc)'
2025-10-28 16:45:20 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i"; [extensions/ConfirmEdit] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1199467 (https://phabricator.wikimedia.org/T408542) (owner: ''Mszwarc)'
2025-10-28 16:45:58 <jinxer-wm> RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
2025-10-28 16:49:36 <wikibugs> ('CR) ''Brouberol: [C:''+2] growthbook: deploy a more modern version against ferretdb [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199310 (https://phabricator.wikimedia.org/T408397) (owner: ''Brouberol)'
2025-10-28 16:50:39 <wikibugs> 'ops-eqiad, ''DBA, ''DC-Ops, ''decommission-hardware: decommission es1031.eqiad.wmnet - https://phabricator.wikimedia.org/T408600#11320015 (''Marostegui)'
2025-10-28 16:50:50 <wikibugs> 'ops-eqiad, ''DBA, ''DC-Ops, ''decommission-hardware: decommission es1031.eqiad.wmnet - https://phabricator.wikimedia.org/T408600#11320042 (''Marostegui) This is ready for #dc-ops'
2025-10-28 16:51:10 <logmsgbot> !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply
2025-10-28 16:51:22 <logmsgbot> !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply
2025-10-28 16:51:33 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook: apply
2025-10-28 16:51:39 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook: apply
2025-10-28 16:51:57 <logmsgbot> !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply
2025-10-28 16:52:17 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
2025-10-28 16:52:17 <logmsgbot> !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply
2025-10-28 16:52:37 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
2025-10-28 16:52:39 <wikibugs> ('PS1) ''Pppery: Update translation [phabricator/translations] (wmf/stable) - ''https://gerrit.wikimedia.org/r/1199469'
2025-10-28 16:53:11 <wikibugs> ('PS2) ''Pppery: Update translations [phabricator/translations] (wmf/stable) - ''https://gerrit.wikimedia.org/r/1199469'
2025-10-28 16:53:42 <wikibugs> ('PS3) ''Pppery: Update translations [phabricator/translations] (wmf/stable) - ''https://gerrit.wikimedia.org/r/1199469'
2025-10-28 16:55:28 <jinxer-wm> FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
2025-10-28 17:00:05 <jouncebot> swfrench-wmf: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki infrastructure (UTC late) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1700).
2025-10-28 17:00:13 <swfrench-wmf> o/
2025-10-28 17:00:26 <_joe_> jouncebot: cringe
2025-10-28 17:00:28 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 17:00:40 <swfrench-wmf> lol
2025-10-28 17:01:33 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199048 (https://phabricator.wikimedia.org/T405955) (owner: ''Scott French)'
2025-10-28 17:01:35 <wikibugs> ('PS4) ''JHathaway: sysctls: add optional module param to sysctl::parameters [puppet] - ''https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726)'
2025-10-28 17:02:23 <wikibugs> ('Merged) ''jenkins-bot: Enroll 10% of client sessions in PHP 8.3 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199048 (https://phabricator.wikimedia.org/T405955) (owner: ''Scott French)'
2025-10-28 17:02:56 <logmsgbot> !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1199048|Enroll 10% of client sessions in PHP 8.3 (T405955)]]
2025-10-28 17:03:06 <stashbot> T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
2025-10-28 17:05:09 <wikibugs> ('CR) ''JHathaway: "I wasn't aware of ConditionKernelModuleLoaded. I tried it on a qemu sid box, but I couldn't get it to work properly. I think this is becau" [puppet] - ''https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: ''JHathaway)'
2025-10-28 17:05:13 <wikibugs> ('CR) ''JHathaway: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: ''JHathaway)'
2025-10-28 17:05:23 <logmsgbot> !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1199048|Enroll 10% of client sessions in PHP 8.3 (T405955)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-28 17:05:47 <wikibugs> ('CR) ''Klausman: [C:''+1] prometheus-amd-rocm: fix exporter for ROCm 7.0.2 [puppet] - ''https://gerrit.wikimedia.org/r/1199465 (https://phabricator.wikimedia.org/T403697) (owner: ''Elukey)'
2025-10-28 17:07:09 <logmsgbot> !log swfrench@deploy2002 swfrench: Continuing with sync
2025-10-28 17:08:16 <icinga-wm> PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2025-10-28 17:08:40 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool sretest2003 T407352', diff saved to https://phabricator.wikimedia.org/P84316 and previous config saved to /var/cache/conftool/dbconfig/20251028-170840-marostegui.json
2025-10-28 17:08:46 <stashbot> T407352: Test config H 1P in external store - https://phabricator.wikimedia.org/T407352
2025-10-28 17:09:04 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 17:09:04 <jinxer-wm> FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 17:09:59 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es2040 to clone sretest2003 T407352', diff saved to https://phabricator.wikimedia.org/P84317 and previous config saved to /var/cache/conftool/dbconfig/20251028-170958-marostegui.json
2025-10-28 17:11:20 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es2040.codfw.wmnet,sretest2003.codfw.wmnet with reason: Cloning sretest2003 from es2040
2025-10-28 17:11:27 <logmsgbot> !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199048|Enroll 10% of client sessions in PHP 8.3 (T405955)]] (duration: 08m 30s)
2025-10-28 17:11:31 <wikibugs> ('PS1) ''Marostegui: sretest2003: Move it to es7 [puppet] - ''https://gerrit.wikimedia.org/r/1199472 (https://phabricator.wikimedia.org/T407352)'
2025-10-28 17:11:32 <stashbot> T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
2025-10-28 17:12:47 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto sretest2003.codfw.wmnet
2025-10-28 17:13:01 <wikibugs> 'SRE, ''Traffic, ''FY2025-26 WE3.3 Engaging core audiences, ''Reader Experience Team (REx Sprint 8 [Q2 Oct 21-Nov 3]): [Reading Lists] Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526#11320228 (''CDanis) Sounds good to me @Jdrewniak ! Thanks :)'
2025-10-28 17:13:11 <wikibugs> ('CR) ''Marostegui: [C:''+2] sretest2003: Move it to es7 [puppet] - ''https://gerrit.wikimedia.org/r/1199472 (https://phabricator.wikimedia.org/T407352) (owner: ''Marostegui)'
2025-10-28 17:13:29 <swfrench-wmf> part #1 of the infra window done. part #2 coming soon.
2025-10-28 17:13:44 <wikibugs> ('PS3) ''Fabfur: P:cache:haproxy: introduce ua classes [puppet] - ''https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060)'
2025-10-28 17:13:54 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11320235 (''Dzahn) a:''Neslihan_Turan_WMDE''None Thank you for taking care of that, Ladsgroup!'
2025-10-28 17:14:02 <wikibugs> ('CR) ''Fabfur: [C:''-1] "still addressing the comments" [puppet] - ''https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: ''Fabfur)'
2025-10-28 17:14:06 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11320237 (''Dzahn) ''Stalled''In progress'
2025-10-28 17:14:51 <wikibugs> ('CR) ''Andrea Denisse: [C:''+2] alertmanager: Add support for team mentions on the Slack template [puppet] - ''https://gerrit.wikimedia.org/r/1194321 (https://phabricator.wikimedia.org/T408145) (owner: ''Andrea Denisse)'
2025-10-28 17:18:54 <wikibugs> ('CR) ''Scott French: [C:''+2] mw-(api-int|jobrunner): Serve 5% of traffic on PHP 8.3 [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199047 (https://phabricator.wikimedia.org/T405955) (owner: ''Scott French)'
2025-10-28 17:20:45 <wikibugs> ('Merged) ''jenkins-bot: mw-(api-int|jobrunner): Serve 5% of traffic on PHP 8.3 [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199047 (https://phabricator.wikimedia.org/T405955) (owner: ''Scott French)'
2025-10-28 17:22:53 <logmsgbot> !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
2025-10-28 17:23:08 <logmsgbot> !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
2025-10-28 17:23:29 <logmsgbot> !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
2025-10-28 17:23:38 <logmsgbot> !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
2025-10-28 17:25:27 <logmsgbot> !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply
2025-10-28 17:25:40 <logmsgbot> !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply
2025-10-28 17:25:47 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''Infrastructure-Foundations, ''netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11320304 (''RobH) [[ https://docs.google.com/spreadsheets/d/13ow4JxrsQdz8KSsdBBNwvlrAuGKo8OHWcnR4RhXTYc0/edit?usp=sharing | Google Sheet listing of all affect...'
2025-10-28 17:26:03 <wikibugs> ('PS3) ''Elukey: prometheus-amd-rocm: fix exporter for ROCm 7.0.2 [puppet] - ''https://gerrit.wikimedia.org/r/1199465 (https://phabricator.wikimedia.org/T403697)'
2025-10-28 17:26:23 <logmsgbot> !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply
2025-10-28 17:26:31 <logmsgbot> !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply
2025-10-28 17:27:25 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
2025-10-28 17:27:37 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
2025-10-28 17:27:58 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
2025-10-28 17:28:04 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
2025-10-28 17:28:37 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply
2025-10-28 17:28:46 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply
2025-10-28 17:28:56 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply
2025-10-28 17:29:01 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply
2025-10-28 17:31:42 <wikibugs> ('PS5) ''BCornwall: varnish: Promote new m-dot redirect from 302/307 to 301/308 [puppet] - ''https://gerrit.wikimedia.org/r/1198429 (https://phabricator.wikimedia.org/T405931) (owner: ''Krinkle)'
2025-10-28 17:32:30 <wikibugs> ('CR) ''BCornwall: "I took the liberty to update two more tests to use 301s instead of 302s. varnishtests now pass. Mind giving that a lookover?" [puppet] - ''https://gerrit.wikimedia.org/r/1198429 (https://phabricator.wikimedia.org/T405931) (owner: ''Krinkle)'
2025-10-28 17:33:49 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Depool es2027 T408406', diff saved to https://phabricator.wikimedia.org/P84318 and previous config saved to /var/cache/conftool/dbconfig/20251028-173348-fceratto.json
2025-10-28 17:33:53 <stashbot> T408406: decommission es2027 - https://phabricator.wikimedia.org/T408406
2025-10-28 17:38:27 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''Infrastructure-Foundations, ''netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11320374 (''RobH)'
2025-10-28 17:38:46 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''Infrastructure-Foundations, ''netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11320386 (''RobH)'
2025-10-28 17:40:28 <wikibugs> ('CR) ''BCornwall: "Marking unresolved" [puppet] - ''https://gerrit.wikimedia.org/r/1198429 (https://phabricator.wikimedia.org/T405931) (owner: ''Krinkle)'
2025-10-28 17:46:12 <wikibugs> ('PS2) ''Federico Ceratto: site.pp, es2026.yaml: Decommission es2026 [puppet] - ''https://gerrit.wikimedia.org/r/1199311 (https://phabricator.wikimedia.org/T408385)'
2025-10-28 17:46:12 <wikibugs> ('PS1) ''Federico Ceratto: instances.yaml: remove es2027 from dbctl [puppet] - ''https://gerrit.wikimedia.org/r/1199476 (https://phabricator.wikimedia.org/T408406)'
2025-10-28 17:52:41 <wikibugs> 'SRE, ''PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11320455 (''Aklapper) > - The ability to access this page via a custom domain/subdomain (TBD) Wasn't that {T407156} instead of TBD?'
2025-10-28 17:57:24 <wikibugs> ('PS1) ''Ottomata: edit-analytics - image bump to fix path route [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199479 (https://phabricator.wikimedia.org/T405041)'
2025-10-28 17:57:36 <wikibugs> ('CR) ''Scott French: [C:''+1] {api,rest}-gateway: Update to Envoy 1.32.12 in staging [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199085 (https://phabricator.wikimedia.org/T405808) (owner: ''RLazarus)'
2025-10-28 17:57:38 <wikibugs> ('CR) ''Ottomata: [C:''+2] edit-analytics - image bump to fix path route [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199479 (https://phabricator.wikimedia.org/T405041) (owner: ''Ottomata)'
2025-10-28 17:59:19 <wikibugs> ('Merged) ''jenkins-bot: edit-analytics - image bump to fix path route [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199479 (https://phabricator.wikimedia.org/T405041) (owner: ''Ottomata)'
2025-10-28 18:00:05 <jouncebot> dduvall and dancy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1800).
2025-10-28 18:01:28 <logmsgbot> !log otto@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
2025-10-28 18:01:46 <logmsgbot> !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
2025-10-28 18:02:06 <logmsgbot> !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply
2025-10-28 18:02:21 <logmsgbot> !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply
2025-10-28 18:02:31 <logmsgbot> !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply
2025-10-28 18:03:02 <logmsgbot> !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply
2025-10-28 18:04:24 <jinxer-wm> FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 18:06:45 <wikibugs> ('PS1) ''TrainBranchBot: group0 to 1.45.0-wmf.25 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199481 (https://phabricator.wikimedia.org/T405681)'
2025-10-28 18:06:52 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Initiated by dduvall@deploy2002" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199481 (https://phabricator.wikimedia.org/T405681) (owner: ''TrainBranchBot)'
2025-10-28 18:07:42 <wikibugs> ('Merged) ''jenkins-bot: group0 to 1.45.0-wmf.25 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199481 (https://phabricator.wikimedia.org/T405681) (owner: ''TrainBranchBot)'
2025-10-28 18:08:15 <icinga-wm> RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2025-10-28 18:10:54 <wikibugs> ('PS1) ''Jdlrobson: Update QuickSurvey platforms [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199482'
2025-10-28 18:13:44 <wikibugs> ('PS2) ''Federico Ceratto: sanitize-wiki: log into phabricator [cookbooks] - ''https://gerrit.wikimedia.org/r/1199301 (https://phabricator.wikimedia.org/T408512)'
2025-10-28 18:14:43 <logmsgbot> !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.25 refs T405681
2025-10-28 18:14:48 <stashbot> T405681: 1.45.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T405681
2025-10-28 18:17:44 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11320544 (''Dzahn)'
2025-10-28 18:21:36 <wikibugs> ('PS1) ''Dzahn: admin: add SSH key and restricted group membership for neslihanturan [puppet] - ''https://gerrit.wikimedia.org/r/1199484 (https://phabricator.wikimedia.org/T406590)'
2025-10-28 18:23:19 <jinxer-wm> FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
2025-10-28 18:24:24 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 18:27:33 <wikibugs> 'SRE, ''collaboration-services, ''PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11320568 (''Dzahn)'
2025-10-28 18:27:56 <wikibugs> 'SRE, ''collaboration-services, ''PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11320570 (''Dzahn) added tag for the SRE subteam that owns microsites hosted on "miscweb" / kubernetes'
2025-10-28 18:28:19 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 18:29:48 <wikibugs> 'SRE, ''collaboration-services, ''PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11320577 (''Dzahn) This is certainly possible (hosting on kubernetes 'miscweb' alongside other microsites) and deployment via deployment servers, but does require...'
2025-10-28 18:32:33 <wikibugs> ('CR) ''Dzahn: aptrepo::staging: add job to clear incoming folder (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) (owner: ''Jelto)'
2025-10-28 18:33:04 <wikibugs> ('CR) ''Krinkle: [C:''+1] "Thanks. LGTM." [puppet] - ''https://gerrit.wikimedia.org/r/1198429 (https://phabricator.wikimedia.org/T405931) (owner: ''Krinkle)'
2025-10-28 18:33:19 <wikibugs> ('PS8) ''Krinkle: varnish: Remove temporary enable_m_redir flag [puppet] - ''https://gerrit.wikimedia.org/r/1198430 (https://phabricator.wikimedia.org/T405931)'
2025-10-28 18:35:09 <logmsgbot> !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudnet2005-dev.codfw.wmnet with OS trixie
2025-10-28 18:37:39 <wikibugs> ('PS1) ''Dzahn: add discovery records for gerrit as CNAMEs to public names [dns] - ''https://gerrit.wikimedia.org/r/1199486 (https://phabricator.wikimedia.org/T365259)'
2025-10-28 18:39:09 <wikibugs> ('CR) ''Dzahn: "Is this what you meant?" [dns] - ''https://gerrit.wikimedia.org/r/1199486 (https://phabricator.wikimedia.org/T365259) (owner: ''Dzahn)'
2025-10-28 18:43:56 <wikibugs> ('PS2) ''Dzahn: add discovery records for gerrit as CNAMEs to public names [dns] - ''https://gerrit.wikimedia.org/r/1199486 (https://phabricator.wikimedia.org/T365259)'
2025-10-28 18:49:51 <wikibugs> ('CR) ''Kamila Součková: [C:''+1] admin: add SSH key and restricted group membership for neslihanturan [puppet] - ''https://gerrit.wikimedia.org/r/1199484 (https://phabricator.wikimedia.org/T406590) (owner: ''Dzahn)'
2025-10-28 18:50:28 <wikibugs> ('CR) ''Pmiazga: [C:''+1] "LGTM" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) (owner: ''Clément Goubert)'
2025-10-28 18:51:36 <wikibugs> ('CR) ''Dzahn: [C:''+2] admin: add SSH key and restricted group membership for neslihanturan [puppet] - ''https://gerrit.wikimedia.org/r/1199484 (https://phabricator.wikimedia.org/T406590) (owner: ''Dzahn)'
2025-10-28 18:51:55 <logmsgbot> !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2005-dev.codfw.wmnet with reason: host reimage
2025-10-28 18:58:03 <wikibugs> ('CR) ''Muehlenhoff: site: initial setup for new logging-sd hosts (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1199062 (https://phabricator.wikimedia.org/T406796) (owner: ''Cwhite)'
2025-10-28 19:00:02 <logmsgbot> !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2005-dev.codfw.wmnet with reason: host reimage
2025-10-28 19:15:13 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11320789 (''Dzahn) @Neslihan_Turan_WMDE Your user has just been created on the deployment server now. You have the access. Do you need any other info how to config...'
2025-10-28 19:15:32 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11320792 (''Dzahn) ''In progress''→Resolved a:''Dzahn'
2025-10-28 19:16:32 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11320807 (''Dzahn) ` deploy1003:~] $ id neslihanturan uid=17901(neslihanturan) gid=500(wikidev) groups=500(wikidev),706(restricted),714(airflow-deployers) `'
2025-10-28 19:23:31 <logmsgbot> !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2005-dev.codfw.wmnet with OS trixie
2025-10-28 19:24:16 <logmsgbot> !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudnet2006-dev.codfw.wmnet with OS trixie
2025-10-28 19:26:41 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy7001.magru.wmnet
2025-10-28 19:26:43 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.dns.netbox
2025-10-28 19:28:48 <logmsgbot> !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices2004-dev.codfw.wmnet with OS trixie
2025-10-28 19:29:22 <wikibugs> ('CR) ''JHathaway: Add the sre.hosts.powercycle cookbook (''1 comment) [cookbooks] - ''https://gerrit.wikimedia.org/r/1198928 (owner: ''Elukey)'
2025-10-28 19:30:28 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy7001.magru.wmnet - dzahn@cumin2002"
2025-10-28 19:30:32 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy7001.magru.wmnet - dzahn@cumin2002"
2025-10-28 19:30:33 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2025-10-28 19:30:33 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy7001.magru.wmnet on all recursors
2025-10-28 19:30:36 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy7001.magru.wmnet on all recursors
2025-10-28 19:31:10 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy7001.magru.wmnet - dzahn@cumin2002"
2025-10-28 19:31:16 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy7001.magru.wmnet - dzahn@cumin2002"
2025-10-28 19:31:28 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy7001.magru.wmnet with OS trixie
2025-10-28 19:31:41 <wikibugs> 'SRE, ''collaboration-services, ''Infrastructure-Foundations, ''vm-requests, ''Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11320900 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host...'
2025-10-28 19:31:59 <icinga-wm> PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2025-10-28 19:32:39 <jinxer-wm> FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
2025-10-28 19:37:00 <wikibugs> ('CR) ''JHathaway: [C:''+2] dmarc: add dmarc monitoring records to more domains [dns] - ''https://gerrit.wikimedia.org/r/1198598 (https://phabricator.wikimedia.org/T404884) (owner: ''JHathaway)'
2025-10-28 19:37:57 <logmsgbot> !log jhathaway@dns1004 START - running authdns-update
2025-10-28 19:38:19 <jinxer-wm> FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-28 19:39:23 <logmsgbot> !log jhathaway@dns1004 END - running authdns-update
2025-10-28 19:40:28 <logmsgbot> !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2006-dev.codfw.wmnet with reason: host reimage
2025-10-28 19:44:32 <logmsgbot> !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2006-dev.codfw.wmnet with reason: host reimage
2025-10-28 19:45:09 <logmsgbot> !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices2004-dev.codfw.wmnet with reason: host reimage
2025-10-28 19:48:59 <logmsgbot> !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices2004-dev.codfw.wmnet with reason: host reimage
2025-10-28 19:51:41 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy7002.magru.wmnet
2025-10-28 19:51:43 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.dns.netbox
2025-10-28 19:54:56 <wikibugs> 'ops-eqiad, ''SRE, ''SRE-swift-storage, ''DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11320930 (''VRiley-WMF) Attempted to swap the unit and it wouldn't power back on. Swapped it back out with the old one, and it still won't power on. Check...'
2025-10-28 19:57:08 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy7002.magru.wmnet - dzahn@cumin2002"
2025-10-28 19:57:34 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy7002.magru.wmnet - dzahn@cumin2002"
2025-10-28 19:57:35 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2025-10-28 19:57:35 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy7002.magru.wmnet on all recursors
2025-10-28 19:57:39 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy7002.magru.wmnet on all recursors
2025-10-28 19:58:11 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy7002.magru.wmnet - dzahn@cumin2002"
2025-10-28 19:58:19 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy7002.magru.wmnet - dzahn@cumin2002"
2025-10-28 19:58:50 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy7002.magru.wmnet with OS trixie
2025-10-28 19:59:04 <wikibugs> 'SRE, ''collaboration-services, ''Infrastructure-Foundations, ''vm-requests, ''Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11320950 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host...'
2025-10-28 20:00:05 <jouncebot> RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T2000).
2025-10-28 20:00:05 <jouncebot> Msz2001: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2025-10-28 20:00:41 <wikibugs> ('CR) ''BCornwall: [C:''+2] DNSRepository: Automated MarkMonitor domain sync [dns] - ''https://gerrit.wikimedia.org/r/1196775 (owner: ''Ncmonitor)'
2025-10-28 20:00:50 <Msz2001> I'm going to deploy
2025-10-28 20:00:55 <logmsgbot> !log brett@dns1004 START - running authdns-update
2025-10-28 20:01:44 <logmsgbot> !log brett@dns1004 END - running authdns-update
2025-10-28 20:02:14 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by mszwarc@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.24) - ''https://gerrit.wikimedia.org/r/1199466 (https://phabricator.wikimedia.org/T408542) (owner: ''Mszwarc)'
2025-10-28 20:02:14 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by mszwarc@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1199467 (https://phabricator.wikimedia.org/T408542) (owner: ''Mszwarc)'
2025-10-28 20:03:29 <wikibugs> ('Merged) ''jenkins-bot: hCaptcha: Store risk score in cache, so that jobs can use it [extensions/ConfirmEdit] (wmf/1.45.0-wmf.24) - ''https://gerrit.wikimedia.org/r/1199466 (https://phabricator.wikimedia.org/T408542) (owner: ''Mszwarc)'
2025-10-28 20:04:04 <wikibugs> ('Merged) ''jenkins-bot: hCaptcha: Store risk score in cache, so that jobs can use it [extensions/ConfirmEdit] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1199467 (https://phabricator.wikimedia.org/T408542) (owner: ''Mszwarc)'
2025-10-28 20:04:41 <logmsgbot> !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1199466|hCaptcha: Store risk score in cache, so that jobs can use it (T408542)]], [[gerrit:1199467|hCaptcha: Store risk score in cache, so that jobs can use it (T408542)]]
2025-10-28 20:04:53 <stashbot> T408542: hCaptcha: Store risk score in global memcache key - https://phabricator.wikimedia.org/T408542
2025-10-28 20:06:58 <logmsgbot> !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1199466|hCaptcha: Store risk score in cache, so that jobs can use it (T408542)]], [[gerrit:1199467|hCaptcha: Store risk score in cache, so that jobs can use it (T408542)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-28 20:07:36 <logmsgbot> !log mszwarc@deploy2002 mszwarc: Continuing with sync
2025-10-28 20:08:40 <logmsgbot> !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2006-dev.codfw.wmnet with OS trixie
2025-10-28 20:12:08 <logmsgbot> !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199466|hCaptcha: Store risk score in cache, so that jobs can use it (T408542)]], [[gerrit:1199467|hCaptcha: Store risk score in cache, so that jobs can use it (T408542)]] (duration: 07m 27s)
2025-10-28 20:12:16 <stashbot> T408542: hCaptcha: Store risk score in global memcache key - https://phabricator.wikimedia.org/T408542
2025-10-28 20:13:41 <wikibugs> ('CR) ''BCornwall: [V:''+2 C:''+2] varnish: Promote new m-dot redirect from 302/307 to 301/308 [puppet] - ''https://gerrit.wikimedia.org/r/1198429 (https://phabricator.wikimedia.org/T405931) (owner: ''Krinkle)'
2025-10-28 20:14:49 <wikibugs> ('CR) ''BCornwall: [C:''+2] varnishtest: Remove logfile support [puppet] - ''https://gerrit.wikimedia.org/r/1199068 (https://phabricator.wikimedia.org/T408202) (owner: ''BCornwall)'
2025-10-28 20:14:55 <wikibugs> ('CR) ''BCornwall: varnishtest: Remove logfile support [puppet] - ''https://gerrit.wikimedia.org/r/1199068 (https://phabricator.wikimedia.org/T408202) (owner: ''BCornwall)'
2025-10-28 20:17:26 <wikibugs> 'SRE, ''vrts, ''Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11321005 (''Peachey88)'
2025-10-28 20:20:38 <wikibugs> 'SRE, ''vrts, ''Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11321044 (''jhathaway) @Krd thanks, I'm investigating, not sure of the cause either.'
2025-10-28 20:20:57 <wikibugs> 'SRE, ''vrts, ''Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11321045 (''jhathaway) p:''Triage''→High'
2025-10-28 20:23:19 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 20:23:48 <wikibugs> 'SRE, ''vrts, ''Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11321064 (''Krd) Non-representative example: From MAILER-DAEMON Tue Oct 28 20:21:46 2025 Received: from mx-in1001.wikimedia.org ([2620:0:861:4:208:80:155:102]:55514) by vrts1003.eq...'
2025-10-28 20:24:24 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 20:25:02 <logmsgbot> !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host tcp-proxy7001.magru.wmnet with OS trixie
2025-10-28 20:25:03 <logmsgbot> !log dzahn@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host tcp-proxy7001.magru.wmnet
2025-10-28 20:25:12 <wikibugs> 'SRE, ''vrts, ''Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11321067 (''Krd) It appears to me that we are accepting bounces from phishing e-mails sent with fake sender info@wikipedia.org.'
2025-10-28 20:25:20 <wikibugs> 'SRE, ''collaboration-services, ''Infrastructure-Foundations, ''vm-requests, ''Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321069 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-...'
2025-10-28 20:26:24 <wikibugs> 'SRE, ''vrts, ''Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11321073 (''Krd) The 219.240.37.89 looks like a common factor. Can we block this source IP for SMTP as a first measure?'
2025-10-28 20:29:20 <Msz2001> !log Deployed change to private Suggested Investigations code
2025-10-28 20:29:23 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-28 20:29:34 <Msz2001> Freeing the window, I deployed all that I planned
2025-10-28 20:33:03 <wikibugs> ('CR) ''Herron: [C:''+1] "LGTM once the ferm/nftables bit is sorted out!" [puppet] - ''https://gerrit.wikimedia.org/r/1199062 (https://phabricator.wikimedia.org/T406796) (owner: ''Cwhite)'
2025-10-28 20:33:19 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 20:34:24 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 20:38:41 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy7002.magru.wmnet with reason: host reimage
2025-10-28 20:44:42 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy7002.magru.wmnet with reason: host reimage
2025-10-28 20:48:40 <wikibugs> 'SRE, ''envoy, ''serviceops, ''Patch-For-Review: Envoy config updates from v1.29 - https://phabricator.wikimedia.org/T404036#11321177 (''RLazarus) ''Open''→Resolved'
2025-10-28 20:49:22 <wikibugs> 'SRE, ''vrts, ''Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11321182 (''jhathaway) >>! In T408632#11321073, @Krd wrote: > The 219.240.37.89 looks like a common factor. Can we block this source IP for SMTP as a first measure? done, though a pr...'
2025-10-28 20:50:25 <apine> Hello, all! The Abstract Wikipedia team needs to do a semi-urgent deployment of backend services. I notice that the Web Team deployment window is coming up in ten minutes, but is rarely used.
2025-10-28 20:50:36 <apine> Will the Web Team be using that window today, or can I grab it?
2025-10-28 20:52:10 <logmsgbot> marostegui@cumin1003 clone (PID 543428) is awaiting input
2025-10-28 20:57:20 <wikibugs> 'SRE-Access-Requests, ''LDAP-Access-Requests: Grant Access to wmf LDAP and analytics-privatedata-users shell group for SherryYang-WMF - https://phabricator.wikimedia.org/T408639 (''SherryYang-WMF) ''NEW'
2025-10-28 20:58:19 <jinxer-wm> FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
2025-10-28 21:00:05 <jouncebot> Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T2100)
2025-10-28 21:01:37 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy7002.magru.wmnet with OS trixie
2025-10-28 21:01:37 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host tcp-proxy7002.magru.wmnet
2025-10-28 21:01:48 <wikibugs> 'SRE, ''collaboration-services, ''Infrastructure-Foundations, ''vm-requests, ''Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321209 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-...'
2025-10-28 21:16:08 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: Unresponsive management for ms-be1090.mgmt:22 - https://phabricator.wikimedia.org/T408585#11321251 (''wiki_willy) a:''VRiley-WMF'
2025-10-28 21:17:55 <wikibugs> 'ops-eqiad, ''SRE, ''DBA, ''DC-Ops, ''decommission-hardware: decommission es1031.eqiad.wmnet - https://phabricator.wikimedia.org/T408600#11321257 (''wiki_willy) a:''VRiley-WMF'
2025-10-28 21:18:16 <wikibugs> ('PS4) ''Cwhite: site: initial setup for new logging-sd hosts [puppet] - ''https://gerrit.wikimedia.org/r/1199062 (https://phabricator.wikimedia.org/T406796)'
2025-10-28 21:19:15 <wikibugs> 'SRE, ''collaboration-services, ''Infrastructure-Foundations, ''vm-requests, ''Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321260 (''Dzahn)'
2025-10-28 21:21:58 <wikibugs> ('PS1) ''Cory Massaro: Wikifunctions: Upgrade orchestrator from 2025-10-22-011302 to 2025-10-28-205854. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199504 (https://phabricator.wikimedia.org/T406540)'
2025-10-28 21:24:08 <wikibugs> ('CR) ''Muehlenhoff: [C:''+1] "LGTM" [puppet] - ''https://gerrit.wikimedia.org/r/1199062 (https://phabricator.wikimedia.org/T406796) (owner: ''Cwhite)'
2025-10-28 21:26:20 <wikibugs> ('PS1) ''Cory Massaro: Update function-evaluators from 2025-10-21-143846 to 2025-10-28-150053. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199505 (https://phabricator.wikimedia.org/T407718)'
2025-10-28 21:26:44 <wikibugs> ('PS2) ''Cory Massaro: Wikifunctions: Update function-evaluators from 2025-10-21-143846 to 2025-10-28-150053. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199505 (https://phabricator.wikimedia.org/T407718)'
2025-10-28 21:27:17 <wikibugs> ('CR) ''Bking: [C:''+1] hadoop: cleanup /tmp from directories as well as files [puppet] - ''https://gerrit.wikimedia.org/r/1199334 (https://phabricator.wikimedia.org/T396582) (owner: ''Gehel)'
2025-10-28 21:27:58 <wikibugs> 'SRE, ''Data-Engineering: stat1011: cannot create directory ‘/srv/published/datasets/one-off’: Permission denied - https://phabricator.wikimedia.org/T408641 (''Addshore) ''NEW'
2025-10-28 21:28:14 <logmsgbot> !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
2025-10-28 21:28:42 <logmsgbot> !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
2025-10-28 21:37:27 <wikibugs> ('CR) ''Cwhite: [C:''+2] site: initial setup for new logging-sd hosts (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1199062 (https://phabricator.wikimedia.org/T406796) (owner: ''Cwhite)'
2025-10-28 21:44:07 <wikibugs> ('PS1) ''JHathaway: postfix: add rspamd network discard map [puppet] - ''https://gerrit.wikimedia.org/r/1199507 (https://phabricator.wikimedia.org/T408632)'
2025-10-28 21:44:27 <wikibugs> ('CR) ''JHathaway: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1199507 (https://phabricator.wikimedia.org/T408632) (owner: ''JHathaway)'
2025-10-28 21:50:04 <wikibugs> ('CR) ''JHathaway: [C:''+2] postfix: add rspamd network discard map [puppet] - ''https://gerrit.wikimedia.org/r/1199507 (https://phabricator.wikimedia.org/T408632) (owner: ''JHathaway)'
2025-10-28 22:04:24 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-28 22:21:10 <wikibugs> ('PS1) ''Andrew Bogott: cloudservices2004-dev.yaml: use new, yaml-style pdns-recursor config [puppet] - ''https://gerrit.wikimedia.org/r/1199512'
2025-10-28 22:22:08 <wikibugs> 'SRE, ''vrts, ''Znuny, ''Patch-For-Review: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11321474 (''jhathaway) @Krd how else can I help?'
2025-10-28 22:22:55 <wikibugs> ('PS2) ''Andrew Bogott: cloudservices2004-dev.yaml: use new, yaml-style pdns-recursor config [puppet] - ''https://gerrit.wikimedia.org/r/1199512'
2025-10-28 22:23:02 <wikibugs> ('CR) ''Andrew Bogott: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1199512 (owner: ''Andrew Bogott)'
2025-10-28 22:23:27 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting access to 'restricted' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11321486 (''Dzahn) @thcipriani Turns out this ticket might change from "restricted" to a full deployment access request. How about your approval if that was the case?'
2025-10-28 22:24:24 <jinxer-wm> FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
2025-10-28 22:25:02 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11321488 (''Dzahn)'
2025-10-28 22:26:38 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11321493 (''Dzahn) edited ticket to change request from "restricted" to "deployment" after talking to Sean. We will redo the approvals for that but reuse the ticket.'
2025-10-28 22:28:08 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11321498 (''Dzahn) a:''Dzahn''thcipriani @seanleong-WMDE Could you add some context re: the request for deployment? @thcipriani for your consideration one more time'
2025-10-28 22:28:48 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11321503 (''Dzahn)'
2025-10-28 22:30:57 <wikibugs> ('PS3) ''Andrew Bogott: cloudservices2004-dev.yaml: use new, yaml-style pdns-recursor config [puppet] - ''https://gerrit.wikimedia.org/r/1199512'
2025-10-28 22:32:16 <wikibugs> ('CR) ''Andrew Bogott: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1199512 (owner: ''Andrew Bogott)'
2025-10-28 22:33:06 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy7001.magru.wmnet with OS trixie
2025-10-28 22:33:16 <wikibugs> ('CR) ''RLazarus: [C:''+2] {api,rest}-gateway: Update to Envoy 1.32.12 in staging [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199085 (https://phabricator.wikimedia.org/T405808) (owner: ''RLazarus)'
2025-10-28 22:33:20 <wikibugs> 'SRE, ''collaboration-services, ''Infrastructure-Foundations, ''vm-requests, ''Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321519 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host...'
2025-10-28 22:35:01 <wikibugs> ('Merged) ''jenkins-bot: {api,rest}-gateway: Update to Envoy 1.32.12 in staging [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199085 (https://phabricator.wikimedia.org/T405808) (owner: ''RLazarus)'
2025-10-28 22:37:50 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy2002.codfw.wmnet
2025-10-28 22:37:52 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.dns.netbox
2025-10-28 22:38:26 <wikibugs> ('PS4) ''Andrew Bogott: cloudservices2004-dev.yaml: use new, yaml-style pdns-recursor config [puppet] - ''https://gerrit.wikimedia.org/r/1199512'
2025-10-28 22:38:39 <wikibugs> ('CR) ''Andrew Bogott: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1199512 (owner: ''Andrew Bogott)'
2025-10-28 22:38:48 <logmsgbot> !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply
2025-10-28 22:39:00 <logmsgbot> !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
2025-10-28 22:41:21 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy2002.codfw.wmnet - dzahn@cumin2002"
2025-10-28 22:41:56 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy2002.codfw.wmnet - dzahn@cumin2002"
2025-10-28 22:41:56 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2025-10-28 22:41:57 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy2002.codfw.wmnet on all recursors
2025-10-28 22:42:00 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy2002.codfw.wmnet on all recursors
2025-10-28 22:42:13 <logmsgbot> !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
2025-10-28 22:42:21 <logmsgbot> !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
2025-10-28 22:42:32 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy2002.codfw.wmnet - dzahn@cumin2002"
2025-10-28 22:42:38 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy2002.codfw.wmnet - dzahn@cumin2002"
2025-10-28 22:42:58 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy2002.codfw.wmnet with OS trixie
2025-10-28 22:43:11 <wikibugs> ('PS5) ''Andrew Bogott: cloudservices2004-dev.yaml: use new, yaml-style pdns-recursor config [puppet] - ''https://gerrit.wikimedia.org/r/1199512'
2025-10-28 22:43:14 <wikibugs> 'SRE, ''collaboration-services, ''Infrastructure-Foundations, ''vm-requests, ''Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321544 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host...'
2025-10-28 22:43:16 <logmsgbot> dzahn@cumin2002 reimage (PID 1675734) is awaiting input
2025-10-28 22:43:22 <wikibugs> ('CR) ''Andrew Bogott: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1199512 (owner: ''Andrew Bogott)'
2025-10-28 22:43:33 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy3002.esams.wmnet with OS trixie
2025-10-28 22:43:52 <wikibugs> 'SRE, ''collaboration-services, ''Infrastructure-Foundations, ''vm-requests, ''Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321546 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host...'
2025-10-28 22:45:30 <wikibugs> 'SRE, ''collaboration-services, ''Infrastructure-Foundations, ''vm-requests, ''Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321550 (''Dzahn)'
2025-10-28 22:46:27 <wikibugs> 'SRE, ''collaboration-services, ''Infrastructure-Foundations, ''vm-requests, ''Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321551 (''Dzahn) All VMs exist now. --> https://netbox.wikimedia.org/search/?q=tcp-proxy some still need t...'
2025-10-28 22:57:32 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11321576 (''seanleong-WMDE) Hi, thanks @Dzahn. The ticket has been changed from "restricted" to "deployment", as this is part of the requirements to be a deployer, and "restricted" is...'
2025-10-28 22:58:51 <wikibugs> ('PS1) ''Scott French: mw-(api-ext|web): scale next releases to 20% of main [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199513 (https://phabricator.wikimedia.org/T405955)'
2025-10-28 22:58:52 <wikibugs> ('PS1) ''Scott French: mw-(api-int|jobrunner): serve 10% of traffic on PHP 8.3 [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199514 (https://phabricator.wikimedia.org/T405955)'
2025-10-28 22:58:55 <wikibugs> ('PS1) ''Scott French: Enroll 25% of client sessions in PHP 8.3 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199515 (https://phabricator.wikimedia.org/T405955)'
2025-10-28 22:59:25 <wikibugs> ('CR) ''RLazarus: [C:''+1] mw-(api-ext|web): scale next releases to 20% of main [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199513 (https://phabricator.wikimedia.org/T405955) (owner: ''Scott French)'
2025-10-28 22:59:29 <wikibugs> ('CR) ''RLazarus: [C:''+1] Enroll 25% of client sessions in PHP 8.3 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199515 (https://phabricator.wikimedia.org/T405955) (owner: ''Scott French)'
2025-10-28 22:59:32 <wikibugs> ('CR) ''RLazarus: [C:''+1] mw-(api-int|jobrunner): serve 10% of traffic on PHP 8.3 [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199514 (https://phabricator.wikimedia.org/T405955) (owner: ''Scott French)'
2025-10-28 23:00:52 <wikibugs> 'SRE: puppetdb import job on netbox fails - Cannot retrieve PuppetDB 'networking' facts about tcp-proxy3002 - https://phabricator.wikimedia.org/T408646 (''Dzahn) ''NEW'
2025-10-28 23:01:09 <wikibugs> 'SRE: puppetdb import job on netbox fails - Cannot retrieve PuppetDB 'networking' facts for new VMs - https://phabricator.wikimedia.org/T408646#11321593 (''Dzahn)'
2025-10-28 23:03:06 <wikibugs> 'SRE: puppetdb import job on netbox fails - Cannot retrieve PuppetDB 'networking' facts for new VMs - https://phabricator.wikimedia.org/T408646#11321597 (''Dzahn)'
2025-10-28 23:03:07 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11321598 (''seanleong-WMDE)'
2025-10-28 23:03:13 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy2002.codfw.wmnet with reason: host reimage
2025-10-28 23:06:46 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11321600 (''seanleong-WMDE)'
2025-10-28 23:09:37 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy2002.codfw.wmnet with reason: host reimage
2025-10-28 23:12:26 <jinxer-wm> FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-28 23:14:00 <wikibugs> 'SRE, ''LDAP-Access-Requests: Grant Access to wmf group for jpchev - https://phabricator.wikimedia.org/T408636#11321623 (''Dzahn) @Jpchev Hi there, are you a Wikimedia Foundation employee or contractor? Or are you asking for access as a volunteer? Any specific systems you have in mind?'
2025-10-28 23:14:33 <wikibugs> ('PS1) ''RLazarus: mw-*: Upgrade to Envoy 1.32.12 in the MW canary releases and mw-debug [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199519 (https://phabricator.wikimedia.org/T405808)'
2025-10-28 23:16:43 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11321629 (''seanleong-WMDE)'
2025-10-28 23:17:26 <jinxer-wm> RESOLVED: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-28 23:21:21 <wikibugs> 'SRE, ''collaboration-services, ''Infrastructure-Foundations: puppetdb import job on netbox fails - Cannot retrieve PuppetDB 'networking' facts for new VMs - https://phabricator.wikimedia.org/T408646#11321640 (''Dzahn)'
2025-10-28 23:25:53 <wikibugs> ('CR) ''Scott French: [C:''+1] mw-*: Upgrade to Envoy 1.32.12 in the MW canary releases and mw-debug [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199519 (https://phabricator.wikimedia.org/T405808) (owner: ''RLazarus)'
2025-10-28 23:26:33 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy2002.codfw.wmnet with OS trixie
2025-10-28 23:26:35 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host tcp-proxy2002.codfw.wmnet
2025-10-28 23:26:41 <logmsgbot> !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host tcp-proxy7001.magru.wmnet with OS trixie
2025-10-28 23:26:53 <wikibugs> 'SRE, ''collaboration-services, ''Infrastructure-Foundations, ''vm-requests, ''Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321656 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-...'
2025-10-28 23:26:57 <wikibugs> 'SRE, ''collaboration-services, ''Infrastructure-Foundations, ''vm-requests, ''Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321657 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-...'
2025-10-28 23:28:17 <wikibugs> ('CR) ''RLazarus: [C:''+2] mw-*: Upgrade to Envoy 1.32.12 in the MW canary releases and mw-debug [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199519 (https://phabricator.wikimedia.org/T405808) (owner: ''RLazarus)'
2025-10-28 23:28:24 <wikibugs> 'SRE, ''SRE-Access-Requests, ''LDAP-Access-Requests: Grant Access to wmf LDAP and analytics-privatedata-users shell group for SherryYang-WMF - https://phabricator.wikimedia.org/T408639#11321663 (''Dzahn) Hello @SherryYang-WMF, re: the "wmf" LDAP group Please take a look here: https://wikitech.wikimedia....'
2025-10-28 23:28:35 <rzl> jouncebot: nowandnext
2025-10-28 23:28:35 <jouncebot> No deployments scheduled for the next 0 hour(s) and 31 minute(s)
2025-10-28 23:28:35 <jouncebot> In 0 hour(s) and 31 minute(s): Abstract Wikipedia emergency deploy window (one-off) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T0000)
2025-10-28 23:28:58 <rzl> I'll deploy an envoy upgrade to mw-debug and the canaries
2025-10-28 23:30:17 <wikibugs> ('Merged) ''jenkins-bot: mw-*: Upgrade to Envoy 1.32.12 in the MW canary releases and mw-debug [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199519 (https://phabricator.wikimedia.org/T405808) (owner: ''RLazarus)'
2025-10-28 23:32:44 <logmsgbot> !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
2025-10-28 23:32:54 <jinxer-wm> FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
2025-10-28 23:33:12 <logmsgbot> !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
2025-10-28 23:33:40 <wikibugs> ('CR) ''Atieno: [C:''+1] ExtensionDistributor: Mark 1.45 as beta [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199113 (https://phabricator.wikimedia.org/T408466) (owner: ''Arlolra)'
2025-10-28 23:35:23 <logmsgbot> !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
2025-10-28 23:35:42 <logmsgbot> !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
2025-10-28 23:37:05 <logmsgbot> !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host tcp-proxy3002.esams.wmnet with OS trixie
2025-10-28 23:37:26 <wikibugs> 'SRE, ''collaboration-services, ''Infrastructure-Foundations, ''vm-requests, ''Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321678 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-...'
2025-10-28 23:38:01 <logmsgbot> !log rzl@deploy2002 Started scap sync-world: https://gerrit.wikimedia.org/r/1199519 T405808
2025-10-28 23:38:07 <stashbot> T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808
2025-10-28 23:39:24 <jinxer-wm> FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-28 23:39:38 <wikibugs> 'SRE, ''Data-Platform-SRE: Make the shell group analytics-privatedata-users less confusing - https://phabricator.wikimedia.org/T405517#11321680 (''Dzahn) The link above is common example. The user asks for `analytics-privatedata-users` (or is told to ask for it as part of some onboarding docs). But that is...'
2025-10-28 23:40:41 <logmsgbot> !log rzl@deploy2002 Finished scap sync-world: https://gerrit.wikimedia.org/r/1199519 T405808 (duration: 03m 34s)
2025-10-28 23:43:20 <wikibugs> 'SRE, ''collaboration-services, ''Infrastructure-Foundations, ''vm-requests, ''Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321712 (''Dzahn)'
2025-10-28 23:44:02 <wikibugs> ('PS1) ''Zabe: Using Hadoop for MostTranscludedPages on enwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199522 (https://phabricator.wikimedia.org/T309738)'
2025-10-28 23:44:06 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.dns.netbox
2025-10-28 23:46:42 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2025-10-28 23:48:57 <wikibugs> 'ops-ulsfo, ''SRE, ''DC-Ops, ''Infrastructure-Foundations, ''netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11321730 (''Papaul)'
2025-10-28 23:59:47 <wikibugs> ('PS1) ''Santiago Faci: Metrics Platform PHP client library: set performer_registration_dt as null when the user is anon [extensions/EventLogging] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1199524 (https://phabricator.wikimedia.org/T408547)'

This page is generated from SQL logs; static txt files are also available for download.