Wikimedia IRC logs browser - #wikimedia-operations

Displaying 1281 items:

2026-04-02 00:01:06 <wikibugs> (CR) Scott French: "Thanks, Raine!" [deployment-charts] - https://gerrit.wikimedia.org/r/1266250 (https://phabricator.wikimedia.org/T419274) (owner: Kamila Součková)
2026-04-02 00:09:10 <wikibugs> (CR) Scott French: "Thanks, Raine!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1266264 (https://phabricator.wikimedia.org/T419049) (owner: Kamila Součková)
2026-04-02 00:56:14 <logmsgbot> !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
2026-04-02 01:02:33 <icinga-wm> PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 284378408 and 30 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 01:06:33 <icinga-wm> RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7050408 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 01:06:35 <logmsgbot> !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
2026-04-02 01:08:23 <logmsgbot> !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
2026-04-02 01:09:22 <jinxer-wm> FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2026-04-02 01:11:46 <wikibugs> (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1266500
2026-04-02 01:11:46 <wikibugs> (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1266500 (owner: TrainBranchBot)
2026-04-02 01:18:44 <logmsgbot> !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
2026-04-02 01:19:48 <logmsgbot> !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
2026-04-02 01:24:09 <wikibugs> (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1266500 (owner: TrainBranchBot)
2026-04-02 01:30:13 <logmsgbot> !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
2026-04-02 01:30:53 <logmsgbot> !log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
2026-04-02 01:41:15 <logmsgbot> !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
2026-04-02 01:51:17 <jinxer-wm> FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
2026-04-02 01:54:29 <icinga-wm> PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
2026-04-02 02:00:56 <logmsgbot> !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
2026-04-02 02:01:29 <icinga-wm> RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
2026-04-02 02:06:11 <jinxer-wm> FIRING: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun
2026-04-02 02:07:20 <logmsgbot> !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 23s)
2026-04-02 02:09:13 <jinxer-wm> FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2026-04-02 02:34:13 <jinxer-wm> RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2026-04-02 02:46:33 <icinga-wm> PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 786199704 and 51 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 02:47:33 <icinga-wm> RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 03:09:23 <jinxer-wm> FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
2026-04-02 04:41:25 <jinxer-wm> FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2026-04-02 04:54:23 <jinxer-wm> RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
2026-04-02 04:55:25 <jinxer-wm> FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2026-04-02 05:00:42 <wikibugs> (CR) Giuseppe Lavagetto: [C:+1] wmnet: remove image-suggestion k8s ingress CNAMEs [dns] - https://gerrit.wikimedia.org/r/1198584 (https://phabricator.wikimedia.org/T368096) (owner: Scott French)
2026-04-02 05:09:37 <jinxer-wm> FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2026-04-02 05:16:23 <jinxer-wm> FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
2026-04-02 05:33:30 <jinxer-wm> FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
2026-04-02 05:51:32 <jinxer-wm> FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
2026-04-02 05:56:17 <jinxer-wm> FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
2026-04-02 06:00:05 <jouncebot> Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0600)
2026-04-02 06:00:05 <jouncebot> marostegui, Amir1, and federico3: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0600).
2026-04-02 06:06:11 <jinxer-wm> FIRING: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun
2026-04-02 06:10:25 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2026-04-02 06:15:12 <wikibugs> (CR) Muehlenhoff: [C:+1] "The patch looks good, but I left a comment on the comment :-)" [puppet] - https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: Bking)
2026-04-02 06:19:56 <wikibugs> (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309) (owner: 1F616EMO)
2026-04-02 06:29:22 <wikibugs> (PS2) 1F616EMO: arbcom_zhwiki: Enable SecurePoll without PII rights [mediawiki-config] - https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309)
2026-04-02 06:30:25 <jinxer-wm> FIRING: [3x] SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2026-04-02 06:52:10 <jinxer-wm> FIRING: [2x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
2026-04-02 06:56:17 <jinxer-wm> FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
2026-04-02 06:57:10 <jinxer-wm> RESOLVED: [2x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
2026-04-02 07:00:05 <jouncebot> Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0700).
2026-04-02 07:00:05 <jouncebot> georgekyz: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2026-04-02 07:00:24 <georgekyz> Good morning folks!
2026-04-02 07:00:59 <georgekyz> I am planning to deploy my patch now, is anybody around ?
2026-04-02 07:03:22 <georgekyz> I running it.
2026-04-02 07:03:34 <wikibugs> (CR) TrainBranchBot: [C:+2] "Approved by gkyziridis@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1266228 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
2026-04-02 07:04:26 <wikibugs> (Merged) jenkins-bot: EventStreamConfig: Add rr-multilingual prediction_change stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1266228 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
2026-04-02 07:05:19 <logmsgbot> !log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1266228|EventStreamConfig: Add rr-multilingual prediction_change stream (T415892)]]
2026-04-02 07:05:22 <stashbot> T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change - https://phabricator.wikimedia.org/T415892
2026-04-02 07:07:35 <logmsgbot> !log gkyziridis@deploy1003 gkyziridis: Backport for [[gerrit:1266228|EventStreamConfig: Add rr-multilingual prediction_change stream (T415892)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2026-04-02 07:08:03 <logmsgbot> !log gkyziridis@deploy1003 gkyziridis: Continuing with sync
2026-04-02 07:08:16 <georgekyz> syncing
2026-04-02 07:08:42 <wikibugs> SRE, Traffic: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11780898 (MoritzMuehlenhoff) p:Triage→Medium
2026-04-02 07:08:49 <icinga-wm> PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2026-04-02 07:12:19 <logmsgbot> !log gkyziridis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266228|EventStreamConfig: Add rr-multilingual prediction_change stream (T415892)]] (duration: 07m 00s)
2026-04-02 07:12:23 <stashbot> T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change - https://phabricator.wikimedia.org/T415892
2026-04-02 07:12:53 <georgekyz> the deployment finished successfully!
2026-04-02 07:13:09 <wikibugs> SRE, LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11780904 (MoritzMuehlenhoff) Was this linked in some onboarding doc that you followed? If so, it can be removed for now. We're currently reworking 2FA support in CAS and the originally...
2026-04-02 07:13:58 <wikibugs> (CR) Gkyziridis: [C:+2] ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
2026-04-02 07:16:01 <wikibugs> (Merged) jenkins-bot: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: Gkyziridis)
2026-04-02 07:20:49 <wikibugs> SRE, SRE-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11780907 (MoritzMuehlenhoff) Since Andrea is working as a contractor the tracking entry in data.yaml should use the The t...
2026-04-02 07:22:25 <wikibugs> SRE, SRE-Access-Requests: Requesting access to superset dashboard for mpostoronca - https://phabricator.wikimedia.org/T421471#11780912 (MoritzMuehlenhoff) In progress→Resolved a:hnowlan @MPostoronca-WMF Your access is enabled, so I'm rmarking this as resolved. If you run into any issues,...
2026-04-02 07:24:57 <logmsgbot> !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
2026-04-02 07:25:06 <logmsgbot> !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
2026-04-02 07:27:56 <wikibugs> (PS1) Jaime Nuche: ApiAuthManagerHelper: Accept fields with undefined label [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266861 (https://phabricator.wikimedia.org/T422027)
2026-04-02 07:29:00 <wikibugs> (CR) TrainBranchBot: [C:+2] "Approved by jnuche@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266861 (https://phabricator.wikimedia.org/T422027) (owner: Jaime Nuche)
2026-04-02 07:30:33 <logmsgbot> !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 64049
2026-04-02 07:32:13 <logmsgbot> !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 64049
2026-04-02 07:38:00 <logmsgbot> !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@fea7794]: deploy to freshly reimaged wdqs host
2026-04-02 07:38:05 <logmsgbot> !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@fea7794]: deploy to freshly reimaged wdqs host (duration: 00m 05s)
2026-04-02 07:38:07 <moritzm> !log purge prometheus-nginx-exporter from url downloaders, remnants of early hcapcha rollout
2026-04-02 07:38:08 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2026-04-02 07:40:36 <wikibugs> (PS1) Mszwarc: Disable external link analysis [mediawiki-config] - https://gerrit.wikimedia.org/r/1266866 (https://phabricator.wikimedia.org/T419837)
2026-04-02 07:40:42 <wikibugs> (Merged) jenkins-bot: ApiAuthManagerHelper: Accept fields with undefined label [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1266861 (https://phabricator.wikimedia.org/T422027) (owner: Jaime Nuche)
2026-04-02 07:41:06 <logmsgbot> !log jnuche@deploy1003 Started scap sync-world: Backport for [[gerrit:1266861|ApiAuthManagerHelper: Accept fields with undefined label (T422027)]]
2026-04-02 07:41:09 <stashbot> T422027: TypeError: MediaWiki\Api\ApiAuthManagerHelper::formatMessage(): Argument #3 ($message) must be of type MediaWiki\Message\Message, null given - https://phabricator.wikimedia.org/T422027
2026-04-02 07:41:17 <jinxer-wm> FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
2026-04-02 07:42:21 <Msz2001> I'll deploy a config change if there's nothing going on
2026-04-02 07:42:42 <Msz2001> (I see it is, I'll wit)
2026-04-02 07:42:45 <Msz2001> wait*
2026-04-02 07:43:08 <logmsgbot> !log jnuche@deploy1003 jnuche: Backport for [[gerrit:1266861|ApiAuthManagerHelper: Accept fields with undefined label (T422027)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2026-04-02 07:43:33 <logmsgbot> !log jnuche@deploy1003 jnuche: Continuing with sync
2026-04-02 07:46:17 <jinxer-wm> FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
2026-04-02 07:46:54 <wikibugs> (CR) Kosta Harlan: [C:+1] Disable external link analysis [mediawiki-config] - https://gerrit.wikimedia.org/r/1266866 (https://phabricator.wikimedia.org/T419837) (owner: Mszwarc)
2026-04-02 07:47:40 <logmsgbot> !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (, T421714) xfer wdqs-all from wdqs2016.codfw.wmnet -> wdqs1027.eqiad.wmnet, repooling both afterwards
2026-04-02 07:47:44 <stashbot> T421714: Data platform: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421714
2026-04-02 07:47:55 <logmsgbot> !log jnuche@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266861|ApiAuthManagerHelper: Accept fields with undefined label (T422027)]] (duration: 06m 39s)
2026-04-02 07:47:58 <stashbot> T422027: TypeError: MediaWiki\Api\ApiAuthManagerHelper::formatMessage(): Argument #3 ($message) must be of type MediaWiki\Message\Message, null given - https://phabricator.wikimedia.org/T422027
2026-04-02 07:48:54 <wikibugs> SRE, DC-Ops, Infrastructure-Foundations, netops, Sustainability (Incident Followup): ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11780961 (ayounsi)
2026-04-02 07:49:22 <wikibugs> (CR) TrainBranchBot: [C:+2] "Approved by mszwarc@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1266866 (https://phabricator.wikimedia.org/T419837) (owner: Mszwarc)
2026-04-02 07:50:16 <wikibugs> (Merged) jenkins-bot: Disable external link analysis [mediawiki-config] - https://gerrit.wikimedia.org/r/1266866 (https://phabricator.wikimedia.org/T419837) (owner: Mszwarc)
2026-04-02 07:50:17 <wikibugs> (PS1) Kevin Bazira: ml-services: enable multi-GPU setup using SHM to improve gpt isvc performance [deployment-charts] - https://gerrit.wikimedia.org/r/1266905 (https://phabricator.wikimedia.org/T418350)
2026-04-02 07:50:40 <logmsgbot> !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1266866|Disable external link analysis (T419837)]]
2026-04-02 07:50:43 <stashbot> T419837: Temporary measurement of outbound citation link clicks - https://phabricator.wikimedia.org/T419837
2026-04-02 07:50:56 <jinxer-wm> RESOLVED: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun
2026-04-02 07:51:23 <icinga-wm> PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
2026-04-02 07:52:23 <icinga-wm> PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
2026-04-02 07:52:40 <logmsgbot> !log mszwarc@deploy1003 mszwarc: Backport for [[gerrit:1266866|Disable external link analysis (T419837)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2026-04-02 07:53:58 <wikibugs> (CR) Muehlenhoff: [C:+2] Failover URL downloaders [dns] - https://gerrit.wikimedia.org/r/1266242 (owner: Muehlenhoff)
2026-04-02 07:54:14 <logmsgbot> !log jmm@dns1004 START - running authdns-update
2026-04-02 07:55:55 <logmsgbot> !log jmm@dns1004 END - running authdns-update
2026-04-02 07:56:39 <logmsgbot> !log mszwarc@deploy1003 mszwarc: Continuing with sync
2026-04-02 07:58:49 <icinga-wm> RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2026-04-02 08:00:05 <jouncebot> jnuche and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0800)
2026-04-02 08:00:53 <logmsgbot> !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266866|Disable external link analysis (T419837)]] (duration: 10m 13s)
2026-04-02 08:00:57 <stashbot> T419837: Temporary measurement of outbound citation link clicks - https://phabricator.wikimedia.org/T419837
2026-04-02 08:01:15 <jnuche> morning, I will begin the train shortly
2026-04-02 08:01:58 <wikibugs> (PS1) Arnaudb: apt-staging: error handling for restricted projects [puppet] - https://gerrit.wikimedia.org/r/1266920 (https://phabricator.wikimedia.org/T422070)
2026-04-02 08:02:03 <wikibugs> (CR) Arnaudb: [C:+2] apt-staging: error handling for restricted projects [puppet] - https://gerrit.wikimedia.org/r/1266920 (https://phabricator.wikimedia.org/T422070) (owner: Arnaudb)
2026-04-02 08:03:25 <wikibugs> (PS1) TrainBranchBot: group2 to 1.46.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1266924 (https://phabricator.wikimedia.org/T420480)
2026-04-02 08:03:28 <wikibugs> (CR) TrainBranchBot: [C:+2] "Initiated by jnuche@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1266924 (https://phabricator.wikimedia.org/T420480) (owner: TrainBranchBot)
2026-04-02 08:04:19 <wikibugs> (Merged) jenkins-bot: group2 to 1.46.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1266924 (https://phabricator.wikimedia.org/T420480) (owner: TrainBranchBot)
2026-04-02 08:07:49 <wikibugs> (CR) Ozge: [C:+1] ml-services: enable multi-GPU setup using SHM to improve gpt isvc performance [deployment-charts] - https://gerrit.wikimedia.org/r/1266905 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
2026-04-02 08:08:49 <icinga-wm> PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2026-04-02 08:10:28 <logmsgbot> !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.22 refs T420480
2026-04-02 08:10:31 <stashbot> T420480: 1.46.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T420480
2026-04-02 08:11:03 <wikibugs> (CR) Kevin Bazira: [C:+2] ml-services: enable multi-GPU setup using SHM to improve gpt isvc performance [deployment-charts] - https://gerrit.wikimedia.org/r/1266905 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
2026-04-02 08:11:59 <wikibugs> (PS1) Muehlenhoff: Update email record for andreawest [puppet] - https://gerrit.wikimedia.org/r/1266931 (https://phabricator.wikimedia.org/T420053)
2026-04-02 08:12:45 <wikibugs> SRE, SRE-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review: Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11781036 (MoritzMuehlenhoff) >>! In T420053#11778139, @AWesterinen wrote: > I still have the error,...
2026-04-02 08:13:10 <wikibugs> (Merged) jenkins-bot: ml-services: enable multi-GPU setup using SHM to improve gpt isvc performance [deployment-charts] - https://gerrit.wikimedia.org/r/1266905 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
2026-04-02 08:14:38 <wikibugs> (PS4) Volans: webproxies: allow cloudcumin to openstack [puppet] - https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360)
2026-04-02 08:14:38 <wikibugs> (CR) Volans: "PCC available at:" [puppet] - https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360) (owner: Volans)
2026-04-02 08:16:16 <wikibugs> ops-eqiad, DBA, DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111 (FCeratto-WMF) NEW
2026-04-02 08:16:17 <jinxer-wm> FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
2026-04-02 08:16:24 <wikibugs> (PS1) Brouberol: deployment_server: monitor the expirty of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)
2026-04-02 08:16:49 <wikibugs> (PS2) Brouberol: deployment_server: monitor the expirty of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)
2026-04-02 08:17:10 <logmsgbot> !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
2026-04-02 08:17:56 <wikibugs> (CR) Filippo Giunchedi: [C:+1] "LGTM, nice!" [puppet] - https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360) (owner: Volans)
2026-04-02 08:18:38 <wikibugs> (PS1) Arnaudb: aptrepo: add an alert for failed prepare [alerts] - https://gerrit.wikimedia.org/r/1266932 (https://phabricator.wikimedia.org/T422070)
2026-04-02 08:18:41 <wikibugs> (CR) Arnaudb: [C:+2] aptrepo: add an alert for failed prepare [alerts] - https://gerrit.wikimedia.org/r/1266932 (https://phabricator.wikimedia.org/T422070) (owner: Arnaudb)
2026-04-02 08:19:02 <wikibugs> (CR) CI reject: [V:-1] deployment_server: monitor the expirty of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: Brouberol)
2026-04-02 08:19:21 <wikibugs> (PS3) Brouberol: deployment_server: monitor the expirty of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)
2026-04-02 08:19:38 <wikibugs> (PS4) Brouberol: deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)
2026-04-02 08:20:00 <wikibugs> (Merged) jenkins-bot: aptrepo: add an alert for failed prepare [alerts] - https://gerrit.wikimedia.org/r/1266932 (https://phabricator.wikimedia.org/T422070) (owner: Arnaudb)
2026-04-02 08:20:57 <wikibugs> (CR) Ayounsi: [C:+1] "lgtm, pcc looks good too, to be carefully rolled out/tested." [puppet] - https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360) (owner: Volans)
2026-04-02 08:21:07 <wikibugs> (PS5) Brouberol: deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)
2026-04-02 08:23:15 <wikibugs> (CR) CI reject: [V:-1] deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: Brouberol)
2026-04-02 08:24:10 <wikibugs> (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8368/co" [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: Brouberol)
2026-04-02 08:24:15 <wikibugs> (PS6) Brouberol: deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)
2026-04-02 08:30:22 <volans> !log briefly disabling puppet on P:installserver::proxy to deploy g/1266885
2026-04-02 08:30:23 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2026-04-02 08:31:21 <wikibugs> (CR) Volans: [C:+2] webproxies: allow cloudcumin to openstack [puppet] - https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360) (owner: Volans)
2026-04-02 08:33:26 <wikibugs> (CR) Btullis: [C:+1] "Nice, thanks." [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: Brouberol)
2026-04-02 08:40:18 <wikibugs> (CR) Brouberol: [C:+2] deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: Brouberol)
2026-04-02 08:40:45 <XioNoX> slyngs, effie, I'm going to reboot mr1-esams for a software upgrade, it will go down for up to 20min, device itself is downtimed, but there might be some alerting noise from esams mgmt being unreachable
2026-04-02 08:41:15 <jinxer-wm> FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
2026-04-02 08:41:17 <jinxer-wm> FIRING: [3x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
2026-04-02 08:41:23 <jinxer-wm> RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
2026-04-02 08:41:40 <jinxer-wm> FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2026-04-02 08:42:00 <XioNoX> !log reboot mr1-esams - T416450
2026-04-02 08:42:03 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2026-04-02 08:42:04 <stashbot> T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450
2026-04-02 08:42:36 <effie> XioNoX: thank you, break a leg
2026-04-02 08:43:59 <icinga-wm> PROBLEM - ps1-by27-esams-infeed-load-tower-B-single-phase on ps1-by27-esams is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
2026-04-02 08:44:20 <wikibugs> 'SRE, ''LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781126 (''atsuko) Thanks, I'll update the onboarding.'
2026-04-02 08:44:32 <logmsgbot> !log dpogorzelski@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: sync
2026-04-02 08:44:42 <wikibugs> 'SRE, ''LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781127 (''atsuko) a:''atsuko'
2026-04-02 08:44:45 <logmsgbot> !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance
2026-04-02 08:44:53 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2149 (T419635)', diff saved to https://phabricator.wikimedia.org/P90206 and previous config saved to /var/cache/conftool/dbconfig/20260402-084452-fceratto.json
2026-04-02 08:44:56 <stashbot> T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
2026-04-02 08:45:07 <logmsgbot> !log dpogorzelski@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
2026-04-02 08:45:23 <icinga-wm> PROBLEM - Host ps1-by27-esams is DOWN: PING CRITICAL - Packet loss = 100%
2026-04-02 08:45:23 <icinga-wm> PROBLEM - Host ps1-bw27-esams is DOWN: PING CRITICAL - Packet loss = 100%
2026-04-02 08:45:32 <logmsgbot> !log dpogorzelski@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: sync
2026-04-02 08:45:39 <jinxer-wm> FIRING: [2x] CoreBGPDown: Core BGP session down between asw1-by27-esams and mr1-esams (10.80.127.5) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=asw1-by27-esams:9804&var-bgp_group=Management&var-bgp_neighbor=mr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDow
2026-04-02 08:46:09 <logmsgbot> !log dpogorzelski@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
2026-04-02 08:46:15 <jinxer-wm> FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
2026-04-02 08:46:17 <wikibugs> 'SRE, ''LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781130 (''atsuko)'
2026-04-02 08:47:08 <wikibugs> ('PS1) ''Gkyziridis: ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266941'
2026-04-02 08:47:29 <wikibugs> 'SRE, ''LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781133 (''atsuko)'
2026-04-02 08:47:50 <wikibugs> 'SRE, ''LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781135 (''atsuko) ''Open''Declined'
2026-04-02 08:49:13 <jinxer-wm> FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2026-04-02 08:49:49 <wikibugs> ('CR) ''Ilias Sarantopoulos: [C:''+1] ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266941 (owner: ''Gkyziridis)'
2026-04-02 08:49:54 <moritzm> !log added Atsuko to the cn=ops LDAP group T421860
2026-04-02 08:49:57 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2026-04-02 08:49:58 <stashbot> T421860: Requesting shell access and membership of the ops group for atsuko - https://phabricator.wikimedia.org/T421860
2026-04-02 08:50:23 <jinxer-wm> FIRING: GnmiTargetDown: asw1-bw27-esams is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
2026-04-02 08:50:39 <jinxer-wm> RESOLVED: [2x] CoreBGPDown: Core BGP session down between asw1-by27-esams and mr1-esams (10.80.127.5) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=asw1-by27-esams:9804&var-bgp_group=Management&var-bgp_neighbor=mr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPD
2026-04-02 08:50:47 <icinga-wm> RECOVERY - Host ps1-by27-esams is UP: PING OK - Packet loss = 0%, RTA = 81.26 ms
2026-04-02 08:50:47 <icinga-wm> RECOVERY - Host ps1-bw27-esams is UP: PING OK - Packet loss = 0%, RTA = 81.25 ms
2026-04-02 08:51:15 <jinxer-wm> FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
2026-04-02 08:51:27 <wikibugs> ('CR) ''Dpogorzelski: [C:''+1] ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266941 (owner: ''Gkyziridis)'
2026-04-02 08:51:32 <XioNoX> router is back up - 10min downtime
2026-04-02 08:52:15 <wikibugs> ('CR) ''Gkyziridis: [C:''+2] ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266941 (owner: ''Gkyziridis)'
2026-04-02 08:53:34 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting shell access and membership of the ops group for atsuko - https://phabricator.wikimedia.org/T421860#11781141 (''MoritzMuehlenhoff) ''Open''Resolved a:''MoritzMuehlenhoff @atsuko Your SSH access should now be working. You can e.g. try to connect to cumin1003.e...'
2026-04-02 08:54:13 <wikibugs> ('Merged) ''jenkins-bot: ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266941 (owner: ''Gkyziridis)'
2026-04-02 08:54:13 <jinxer-wm> RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2026-04-02 08:55:23 <jinxer-wm> RESOLVED: GnmiTargetDown: asw1-bw27-esams is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
2026-04-02 08:55:27 <logmsgbot> !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
2026-04-02 08:55:41 <logmsgbot> !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
2026-04-02 08:56:15 <jinxer-wm> RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
2026-04-02 08:57:47 <wikibugs> ('CR) ''Muehlenhoff: [C:''+2] Update Cumin alias for contint to also cover the spun-off Trixie role [puppet] - ''https://gerrit.wikimedia.org/r/1266215 (owner: ''Muehlenhoff)'
2026-04-02 08:58:49 <icinga-wm> RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2026-04-02 09:08:30 <jinxer-wm> FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards Has been acknowledged - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
2026-04-02 09:12:31 <wikibugs> ('PS1) ''Klausman: admin-ng: Allow ML/exp users to use describe verb on nodes [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266947'
2026-04-02 09:17:43 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T419635)', diff saved to https://phabricator.wikimedia.org/P90207 and previous config saved to /var/cache/conftool/dbconfig/20260402-091743-fceratto.json
2026-04-02 09:17:47 <stashbot> T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
2026-04-02 09:19:48 <moritzm> !log upgrading Envoy on the config-master servers to 1.35.9 T419637 T410975
2026-04-02 09:19:57 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2026-04-02 09:19:58 <stashbot> T419637: Upgrade Envoy to v1.35.9 - https://phabricator.wikimedia.org/T419637
2026-04-02 09:19:59 <stashbot> T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975
2026-04-02 09:21:37 <wikibugs> ('PS1) ''Gkyziridis: ml-services: Revert the changes and the model version into the previous state. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266948'
2026-04-02 09:23:16 <wikibugs> ('CR) ''Muehlenhoff: [C:''+1] "LGTM" [software/bitu] - ''https://gerrit.wikimedia.org/r/1265258 (owner: ''Slyngshede)'
2026-04-02 09:23:51 <wikibugs> ('CR) ''Gkyziridis: [C:''+2] ml-services: Revert the changes and the model version into the previous state. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266948 (owner: ''Gkyziridis)'
2026-04-02 09:25:57 <wikibugs> ('PS1) ''Volans: Add missing includes from Netbox exported data [dns] - ''https://gerrit.wikimedia.org/r/1266952 (https://phabricator.wikimedia.org/T422115)'
2026-04-02 09:26:07 <wikibugs> ('Merged) ''jenkins-bot: ml-services: Revert the changes and the model version into the previous state. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266948 (owner: ''Gkyziridis)'
2026-04-02 09:27:36 <logmsgbot> !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
2026-04-02 09:27:42 <logmsgbot> !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
2026-04-02 09:27:52 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P90208 and previous config saved to /var/cache/conftool/dbconfig/20260402-092751-fceratto.json
2026-04-02 09:28:30 <jinxer-wm> RESOLVED: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards Has been acknowledged - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
2026-04-02 09:29:31 <logmsgbot> !log jmm@cumin2002 START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master-codfw
2026-04-02 09:29:35 <logmsgbot> !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors
2026-04-02 09:29:39 <logmsgbot> !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors
2026-04-02 09:30:35 <wikibugs> ('PS4) ''Arnaudb: gerrit: fix Envoy idle timeout handling for slow HTTPS git requests [puppet] - ''https://gerrit.wikimedia.org/r/1266950 (https://phabricator.wikimedia.org/T420909)'
2026-04-02 09:33:23 <wikibugs> ('PS1) ''Arnaudb: gerrit: update sshd timeouts [puppet] - ''https://gerrit.wikimedia.org/r/1266149 (https://phabricator.wikimedia.org/T417996)'
2026-04-02 09:33:45 <logmsgbot> !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master-codfw
2026-04-02 09:33:47 <wikibugs> ('Abandoned) ''Arnaudb: gerrit: update timeouts for gitiles [puppet] - ''https://gerrit.wikimedia.org/r/1265448 (https://phabricator.wikimedia.org/T421904) (owner: ''Arnaudb)'
2026-04-02 09:37:53 <wikibugs> ('CR) ''Muehlenhoff: [C:''+2] Obsolete airflow-search-admins POSIX group [puppet] - ''https://gerrit.wikimedia.org/r/1242407 (owner: ''Muehlenhoff)'
2026-04-02 09:38:00 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P90209 and previous config saved to /var/cache/conftool/dbconfig/20260402-093759-fceratto.json
2026-04-02 09:39:25 <wikibugs> ('PS5) ''Arnaudb: gerrit: fix Envoy idle timeout handling for slow HTTPS git requests [puppet] - ''https://gerrit.wikimedia.org/r/1266950 (https://phabricator.wikimedia.org/T420909)'
2026-04-02 09:39:29 <wikibugs> ('CR) ''Effie Mouzeli: [C:''+1] image-suggestion: remove service configuration [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
2026-04-02 09:39:45 <wikibugs> ('CR) ''Arnaudb: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1266950 (https://phabricator.wikimedia.org/T420909) (owner: ''Arnaudb)'
2026-04-02 09:40:13 <wikibugs> ('CR) ''Effie Mouzeli: [C:''+1] profile::service_proxy::envoy: remove mw-parsoid [puppet] - ''https://gerrit.wikimedia.org/r/1265420 (https://phabricator.wikimedia.org/T420468) (owner: ''Elukey)'
2026-04-02 09:40:30 <wikibugs> ('PS2) ''Elukey: profile::service_proxy::envoy: remove mw-parsoid [puppet] - ''https://gerrit.wikimedia.org/r/1265420 (https://phabricator.wikimedia.org/T420468)'
2026-04-02 09:41:18 <wikibugs> ('CR) ''Arnaudb: [C:''+2] gerrit: fix Envoy idle timeout handling for slow HTTPS git requests [puppet] - ''https://gerrit.wikimedia.org/r/1266950 (https://phabricator.wikimedia.org/T420909) (owner: ''Arnaudb)'
2026-04-02 09:41:50 <wikibugs> ('Abandoned) ''Effie Mouzeli: profile::service_proxy::envoy: remove mw-parsoid [puppet] - ''https://gerrit.wikimedia.org/r/1265420 (https://phabricator.wikimedia.org/T420468) (owner: ''Elukey)'
2026-04-02 09:43:42 <wikibugs> ('CR) ''Ayounsi: [C:''+1] "thanks!" [dns] - ''https://gerrit.wikimedia.org/r/1266952 (https://phabricator.wikimedia.org/T422115) (owner: ''Volans)'
2026-04-02 09:45:33 <logmsgbot> !log javiermonton@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: sync
2026-04-02 09:45:42 <logmsgbot> !log javiermonton@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync
2026-04-02 09:46:56 <jinxer-wm> FIRING: GitlabPackagePullerFailedOnPrepare: Package puller has some run errors while preparing projects. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnPrepare
2026-04-02 09:47:41 <logmsgbot> !log javiermonton@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync
2026-04-02 09:48:02 <logmsgbot> !log javiermonton@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync
2026-04-02 09:48:03 <wikibugs> ('Abandoned) ''Majavah: hieradata: Add dumps.wikimedia.org CDN mapping [puppet] - ''https://gerrit.wikimedia.org/r/1237194 (https://phabricator.wikimedia.org/T306550) (owner: ''Majavah)'
2026-04-02 09:48:09 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T419635)', diff saved to https://phabricator.wikimedia.org/P90210 and previous config saved to /var/cache/conftool/dbconfig/20260402-094808-fceratto.json
2026-04-02 09:48:11 <stashbot> T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
2026-04-02 09:48:26 <logmsgbot> !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance
2026-04-02 09:48:34 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2156 (T419635)', diff saved to https://phabricator.wikimedia.org/P90211 and previous config saved to /var/cache/conftool/dbconfig/20260402-094834-fceratto.json
2026-04-02 09:48:37 <logmsgbot> !log javiermonton@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: sync
2026-04-02 09:48:58 <logmsgbot> !log javiermonton@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync
2026-04-02 09:53:07 <wikibugs> ('PS1) ''Muehlenhoff: Obsolete airflow-wmde-admins POSIX group [puppet] - ''https://gerrit.wikimedia.org/r/1266959'
2026-04-02 09:58:30 <wikibugs> ('CR) ''Muehlenhoff: [C:''+2] Update email record for andreawest [puppet] - ''https://gerrit.wikimedia.org/r/1266931 (https://phabricator.wikimedia.org/T420053) (owner: ''Muehlenhoff)'
2026-04-02 10:00:04 <jouncebot> Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1000)
2026-04-02 10:00:04 <jouncebot> dues: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2026-04-02 10:00:25 <wikibugs> ('CR) ''Daniel Kinzler: rest gateway: introduce policy for abstractwiki/wikifunctions (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: ''Daniel Kinzler)'
2026-04-02 10:02:05 <wikibugs> ('PS1) ''Volans: cumin: use webproxy to connect to openstack APIs [puppet] - ''https://gerrit.wikimedia.org/r/1266956 (https://phabricator.wikimedia.org/T420360)'
2026-04-02 10:02:05 <wikibugs> ('CR) ''Volans: "PCC available for cloudcumin1001 here:" [puppet] - ''https://gerrit.wikimedia.org/r/1266956 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
2026-04-02 10:03:22 <wikibugs> ('CR) ''Muehlenhoff: [C:''+2] thumbor: Update service image to latest rebuild [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266229 (owner: ''Muehlenhoff)'
2026-04-02 10:03:35 <wikibugs> ('PS1) ''Arnaudb: gerrit: update upstream_idle_timeout [puppet] - ''https://gerrit.wikimedia.org/r/1266962 (https://phabricator.wikimedia.org/T421827)'
2026-04-02 10:03:38 <wikibugs> ('CR) ''Arnaudb: [C:''+2] gerrit: update upstream_idle_timeout [puppet] - ''https://gerrit.wikimedia.org/r/1266962 (https://phabricator.wikimedia.org/T421827) (owner: ''Arnaudb)'
2026-04-02 10:04:15 <jinxer-wm> FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
2026-04-02 10:04:17 <wikibugs> ('PS1) ''Volans: Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963'
2026-04-02 10:04:26 <wikibugs> ('CR) ''CI reject: [V:''-1] Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963 (owner: ''Volans)'
2026-04-02 10:05:23 <logmsgbot> !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply
2026-04-02 10:05:32 <logmsgbot> !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply
2026-04-02 10:05:34 <wikibugs> ('PS2) ''Volans: Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360)'
2026-04-02 10:05:49 <wikibugs> ('CR) ''CI reject: [V:''-1] Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
2026-04-02 10:08:55 <wikibugs> ('CR) ''Filippo Giunchedi: [C:''+1] "LGTM" [puppet] - ''https://gerrit.wikimedia.org/r/1266956 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
2026-04-02 10:09:15 <jinxer-wm> FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
2026-04-02 10:09:42 <wikibugs> ('PS1) ''Mhorsey: Enable the CampaignEvents extension on incubator [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749)'
2026-04-02 10:10:18 <logmsgbot> !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply
2026-04-02 10:10:38 <wikibugs> ('CR) ''Daniel Kinzler: [C:''+2] rest gateway: define authed-user class [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280) (owner: ''Daniel Kinzler)'
2026-04-02 10:10:57 <wikibugs> ('CR) ''CI reject: [V:''-1] Enable the CampaignEvents extension on incubator [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749) (owner: ''Mhorsey)'
2026-04-02 10:11:19 <wikibugs> ('PS3) ''Volans: Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360)'
2026-04-02 10:11:41 <wikibugs> ('CR) ''Volans: [C:''+2] cumin: use webproxy to connect to openstack APIs [puppet] - ''https://gerrit.wikimedia.org/r/1266956 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
2026-04-02 10:12:36 <logmsgbot> !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
2026-04-02 10:12:49 <wikibugs> ('Merged) ''jenkins-bot: rest gateway: define authed-user class [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280) (owner: ''Daniel Kinzler)'
2026-04-02 10:13:17 <wikibugs> ('CR) ''Filippo Giunchedi: [C:''+1] "LGTM" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
2026-04-02 10:14:30 <logmsgbot> !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply
2026-04-02 10:15:11 <logmsgbot> !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker-exp2001.codfw.wmnet
2026-04-02 10:16:51 <jinxer-wm> FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
2026-04-02 10:16:52 <logmsgbot> !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
2026-04-02 10:16:54 <logmsgbot> !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
2026-04-02 10:17:00 <logmsgbot> !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
2026-04-02 10:17:05 <logmsgbot> !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
2026-04-02 10:17:14 <logmsgbot> !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
2026-04-02 10:17:19 <logmsgbot> !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
2026-04-02 10:17:24 <logmsgbot> !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
2026-04-02 10:17:32 <logmsgbot> !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
2026-04-02 10:17:36 <logmsgbot> !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
2026-04-02 10:17:40 <effie> !incidents
2026-04-02 10:17:40 <sirenbot> 7803 (UNACKED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
2026-04-02 10:17:46 <effie> !ack 7803
2026-04-02 10:17:46 <sirenbot> 7803 (ACKED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
2026-04-02 10:17:50 <logmsgbot> !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
2026-04-02 10:17:55 <logmsgbot> !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
2026-04-02 10:18:03 <logmsgbot> !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
2026-04-02 10:18:08 <logmsgbot> !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
2026-04-02 10:18:24 <logmsgbot> !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
2026-04-02 10:18:28 <logmsgbot> !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
2026-04-02 10:18:45 <logmsgbot> !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
2026-04-02 10:18:46 <logmsgbot> !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
2026-04-02 10:18:50 <logmsgbot> !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
2026-04-02 10:19:11 <logmsgbot> !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
2026-04-02 10:19:15 <logmsgbot> !log jayme@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
2026-04-02 10:19:17 <moritzm> !log installing freetype security updates
2026-04-02 10:19:20 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2026-04-02 10:19:25 <logmsgbot> !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker-exp2001.codfw.wmnet
2026-04-02 10:19:27 <logmsgbot> !log jayme@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
2026-04-02 10:19:30 <logmsgbot> !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
2026-04-02 10:19:31 <logmsgbot> !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
2026-04-02 10:19:35 <logmsgbot> !log jayme@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
2026-04-02 10:19:36 <logmsgbot> !log jayme@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
2026-04-02 10:21:06 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T419635)', diff saved to https://phabricator.wikimedia.org/P90212 and previous config saved to /var/cache/conftool/dbconfig/20260402-102105-fceratto.json
2026-04-02 10:21:09 <stashbot> T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
2026-04-02 10:21:41 <wikibugs> ('PS2) ''Mhorsey: Enable the CampaignEvents extension on incubator [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749)'
2026-04-02 10:22:45 <jinxer-wm> FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ...
2026-04-02 10:22:50 <jinxer-wm> fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
2026-04-02 10:23:16 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749) (owner: ''Mhorsey)'
2026-04-02 10:24:44 <wikibugs> 'SRE, ''SRE-tools, ''Infrastructure-Foundations, ''ServiceOps new, and 2 others: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11781519 (''Volans) Given this has been moved to the backlog I'll leave here a comment for our future selves: i...'
2026-04-02 10:26:33 <icinga-wm> PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 166195784 and 22 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 10:27:29 <wikibugs> ('PS1) ''Hashar: wm-checks-api: add tag for PostgreSQL jobs [software/gerrit] (deploy/wmf/stable-3.10) - ''https://gerrit.wikimedia.org/r/1266965'
2026-04-02 10:27:45 <jinxer-wm> FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
2026-04-02 10:28:33 <icinga-wm> RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3533304 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 10:30:40 <jinxer-wm> FIRING: [3x] SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2026-04-02 10:30:41 <wikibugs> 'SRE, ''DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781562 (''Peachey88)'
2026-04-02 10:31:02 <logmsgbot> !log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
2026-04-02 10:31:12 <wikibugs> 'SRE, ''DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781588 (''MBH) Many such servers: 26, 31. When just opening pages for read.'
2026-04-02 10:31:14 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P90213 and previous config saved to /var/cache/conftool/dbconfig/20260402-103113-fceratto.json
2026-04-02 10:31:25 <wikibugs> 'SRE, ''DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781591 (''Peachey88)'
2026-04-02 10:31:27 <logmsgbot> !log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
2026-04-02 10:32:19 <logmsgbot> !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet
2026-04-02 10:33:27 <wikibugs> ('CR) ''Cathal Mooney: [C:''+1] "LGTM!" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
2026-04-02 10:34:45 <wikibugs> 'SRE, ''DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781642 (''Thryduulf) I've been experiencing these errors intermittently on English Wikipedia today, but only on trying to save edits. Each time trying again has resulted in the save being successful.'
2026-04-02 10:37:41 <logmsgbot> !log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
2026-04-02 10:38:10 <wikibugs> ('CR) ''Daniel Kinzler: [C:''+2] rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - ''https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: ''Daniel Kinzler)'
2026-04-02 10:38:22 <wikibugs> ('CR) ''CI reject: [V:''-1] rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - ''https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: ''Daniel Kinzler)'
2026-04-02 10:39:14 <wikibugs> ('PS5) ''Daniel Kinzler: rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - ''https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581)'
2026-04-02 10:39:29 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: Degraded RAID on an-worker1148 - https://phabricator.wikimedia.org/T421892#11781672 (''Jclark-ctr) ''Open''Declined This automated ticket was opened by mistake; it was still being worked on in T411919'
2026-04-02 10:39:44 <wikibugs> ('CR) ''Daniel Kinzler: [C:''+2] rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - ''https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: ''Daniel Kinzler)'
2026-04-02 10:40:02 <logmsgbot> !log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
2026-04-02 10:41:22 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P90214 and previous config saved to /var/cache/conftool/dbconfig/20260402-104121-fceratto.json
2026-04-02 10:41:51 <wikibugs> ('Merged) ''jenkins-bot: rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - ''https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: ''Daniel Kinzler)'
2026-04-02 10:41:57 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421970#11781681 (''Jclark-ctr) ''Open''Resolved a:''Jclark-ctr rebalanced'
2026-04-02 10:43:18 <wikibugs> 'SRE, ''DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781698 (''Aklapper)'
2026-04-02 10:43:53 <logmsgbot> !log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin2002.codfw.wmnet
2026-04-02 10:44:33 <icinga-wm> PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 76721280 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 10:45:00 <logmsgbot> !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
2026-04-02 10:45:16 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: Power Supply - PS2 Status - issue on ml-serve1015:9290 - https://phabricator.wikimedia.org/T421599#11781704 (''Jclark-ctr) ''Open''Resolved replaced failed psu Outbound ticket for psu 1-258638557493'
2026-04-02 10:45:33 <icinga-wm> RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3553128 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 10:45:43 <logmsgbot> !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
2026-04-02 10:48:21 <icinga-wm> PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2026-04-02 10:48:23 <icinga-wm> PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2026-04-02 10:48:23 <icinga-wm> PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2026-04-02 10:48:23 <icinga-wm> PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2026-04-02 10:48:28 <A_smart_kitten> fwiw I just got 'cannot access the database: database servers in cluster31 are overloaded' when trying to save an edit on metawiki. worked fine on the second attempt.
2026-04-02 10:48:33 <icinga-wm> PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 298909248 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 10:49:26 <A_smart_kitten> oh i see it's already known, apologies :)
2026-04-02 10:49:33 <icinga-wm> RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 4010680 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 10:49:49 <icinga-wm> PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2026-04-02 10:49:49 <icinga-wm> PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2026-04-02 10:50:33 <wikibugs_> 'SRE, ''DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781731 (''Wellverywell) p:''Triage''Unbreak!'
2026-04-02 10:50:41 <wikibugs> 'SRE-Access-Requests, ''Data-Platform-SRE, ''Wikidata Platform Team: Request: wdqs shell access for user @AWesterinen-WMF - https://phabricator.wikimedia.org/T422141 (''gmodena) ''NEW'
2026-04-02 10:51:30 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T419635)', diff saved to https://phabricator.wikimedia.org/P90215 and previous config saved to /var/cache/conftool/dbconfig/20260402-105129-fceratto.json
2026-04-02 10:51:33 <stashbot> T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
2026-04-02 10:51:35 <logmsgbot> !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance
2026-04-02 10:51:43 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2177 (T419635)', diff saved to https://phabricator.wikimedia.org/P90216 and previous config saved to /var/cache/conftool/dbconfig/20260402-105142-fceratto.json
2026-04-02 10:52:14 <wikibugs> 'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781750 (''RhinosF1)'
2026-04-02 10:52:49 <icinga-wm> PROBLEM - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2026-04-02 10:54:13 <wikibugs> 'SRE-Access-Requests, ''Data-Platform-SRE, ''Wikidata Platform Team: Request: wdqs shell access for user AWesterinen-WMF - https://phabricator.wikimedia.org/T422141#11781774 (''gmodena)'
2026-04-02 10:56:57 <wikibugs> 'SRE-Access-Requests, ''Data-Platform-SRE, ''Wikidata Platform Team: Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11781779 (''gmodena)'
2026-04-02 10:57:52 <wikibugs> 'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781783 (''1F616EMO) I experienced such errors when diffing and saving edits.'
2026-04-02 10:58:15 <wikibugs> 'ops-eqiad, ''SRE, ''SRE-swift-storage, ''DC-Ops: Disk (sdt) failed in ms-be1065 - https://phabricator.wikimedia.org/T422011#11781785 (''Jclark-ctr) a:''Jclark-ctr This server is out of warranty. Replaced Drive slot 16 with matching 8tb sata drive'
2026-04-02 10:58:45 <wikibugs> 'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781790 (''Ladsgroup) We are on it.'
2026-04-02 10:59:47 <wikibugs> 'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781793 (''1F616EMO) Should I expect the coming backport window be cancelled or delayed due to this incident?'
2026-04-02 11:00:25 <wikibugs> ('PS4) ''Btullis: Add analytics-fr-tech system user and corresponding groups [puppet] - ''https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213)'
2026-04-02 11:00:25 <wikibugs> ('PS1) ''Btullis: Grant sudo privileges for the analytics-fr-tech-users group [puppet] - ''https://gerrit.wikimedia.org/r/1266980 (https://phabricator.wikimedia.org/T417213)'
2026-04-02 11:01:28 <wikibugs> 'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781814 (''RhinosF1) >>! In T422130#11781793, @1F616EMO wrote: > Should I expect the coming backport window be cancelled or delayed due to this incident? Very likely yes. A dep...'
2026-04-02 11:02:00 <wikibugs> ('CR) ''Btullis: [C:''-1] "Set to -1 pending the review by Infrastructure Foundations." [puppet] - ''https://gerrit.wikimedia.org/r/1266980 (https://phabricator.wikimedia.org/T417213) (owner: ''Btullis)'
2026-04-02 11:04:16 <wikibugs> ('PS1) ''Esanders: Fix suggestion mode availability check [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1266985 (https://phabricator.wikimedia.org/T422143)'
2026-04-02 11:04:20 <wikibugs> ('CR) ''Muehlenhoff: Add analytics-fr-tech system user and corresponding groups (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) (owner: ''Btullis)'
2026-04-02 11:05:20 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1266985 (https://phabricator.wikimedia.org/T422143) (owner: ''Esanders)'
2026-04-02 11:07:26 <wikibugs> 'ops-eqiad, ''SRE, ''SRE-swift-storage, ''DC-Ops: Disk (sdt) failed in ms-be1065 - https://phabricator.wikimedia.org/T422011#11781846 (''Jclark-ctr) After replacement Server showed drive as foreign. continued to fail to clear foreign config. Replaced drive again with new seagate 8tb sata drive'
2026-04-02 11:07:48 <wikibugs> 'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781847 (''1F616EMO) >>! In T422130#11781814, @RhinosF1 wrote: >>>! In T422130#11781793, @1F616EMO wrote: >> Should I expect the coming backport window be cancelled or delayed d...'
2026-04-02 11:13:15 <jinxer-wm> FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 11.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
2026-04-02 11:14:15 <jinxer-wm> FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
2026-04-02 11:20:54 <wikibugs> 'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781890 (''Lucas_Werkmeister_WMDE)'
2026-04-02 11:21:41 <jinxer-wm> FIRING: [8x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
2026-04-02 11:24:22 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T419635)', diff saved to https://phabricator.wikimedia.org/P90217 and previous config saved to /var/cache/conftool/dbconfig/20260402-112421-fceratto.json
2026-04-02 11:24:25 <stashbot> T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
2026-04-02 11:26:41 <jinxer-wm> FIRING: [55x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
2026-04-02 11:26:51 <jinxer-wm> FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
2026-04-02 11:27:00 <effie> !incidents
2026-04-02 11:27:00 <sirenbot> 7803 (ACKED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
2026-04-02 11:27:23 <jinxer-wm> FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
2026-04-02 11:27:49 <wikibugs> 'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11781903 (''BTullis)'
2026-04-02 11:27:50 <wikibugs> 'ops-eqiad, ''SRE, ''SRE-swift-storage, ''DC-Ops: Disk (sdt) failed in ms-be1065 - https://phabricator.wikimedia.org/T422011#11781904 (''Jclark-ctr) ''Open''Resolved'
2026-04-02 11:28:38 <wikibugs> 'ops-eqiad, ''SRE, ''DBA, ''DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11781909 (''Jclark-ctr) a:''Jclark-ctr'
2026-04-02 11:29:02 <wikibugs> ('CR) ''Jforrester: [C:''+1] REST: Publish ReadingLists v0 module in REST Sandbox [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619) (owner: ''KineticPelagic)'
2026-04-02 11:32:25 <icinga-wm> PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is CRITICAL: 6.702e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
2026-04-02 11:32:45 <jinxer-wm> FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
2026-04-02 11:34:30 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P90218 and previous config saved to /var/cache/conftool/dbconfig/20260402-113429-fceratto.json
2026-04-02 11:34:33 <icinga-wm> PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 97599648 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 11:35:23 <wikibugs> 'ops-eqiad, ''SRE, ''DBA, ''DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11781922 (''Jclark-ctr) updating bios firmware, expander firmware due to comms error on backplane, and idrac firmware additionally'
2026-04-02 11:35:33 <icinga-wm> RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3557000 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 11:36:54 <wikibugs> ('PS1) ''Arnaudb: gerrit: bump upstream_idle_timeout to 900s [puppet] - ''https://gerrit.wikimedia.org/r/1266989 (https://phabricator.wikimedia.org/T421904)'
2026-04-02 11:37:12 <wikibugs> 'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11781927 (''BTullis) I have validated all SSH keys via out-of...'
2026-04-02 11:37:15 <wikibugs> ('CR) ''Arnaudb: [C:''+2] gerrit: bump upstream_idle_timeout to 900s [puppet] - ''https://gerrit.wikimedia.org/r/1266989 (https://phabricator.wikimedia.org/T421904) (owner: ''Arnaudb)'
2026-04-02 11:37:23 <jinxer-wm> RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
2026-04-02 11:38:23 <icinga-wm> RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2026-04-02 11:39:19 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Wikidata Platform Team, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11781930 (''Gehel) p:''Triage''High'
2026-04-02 11:42:49 <icinga-wm> RECOVERY - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2026-04-02 11:44:38 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P90219 and previous config saved to /var/cache/conftool/dbconfig/20260402-114437-fceratto.json
2026-04-02 11:47:45 <jinxer-wm> FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
2026-04-02 11:48:15 <jinxer-wm> RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 15.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
2026-04-02 11:48:23 <wikibugs> 'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781968 (''Thryduulf) I've just encountered what I presume is the same error, this time when trying to use the reply tool [6a4d47bf-961e-4513-9b1f-c6970e11f156] Caught exception...'
2026-04-02 11:48:23 <wikibugs> ('PS5) ''Btullis: Add analytics-fr-tech system user and corresponding groups [puppet] - ''https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213)'
2026-04-02 11:48:24 <wikibugs> ('PS2) ''Btullis: Grant sudo privileges for the analytics-fr-tech-users group [puppet] - ''https://gerrit.wikimedia.org/r/1266980 (https://phabricator.wikimedia.org/T417213)'
2026-04-02 11:51:15 <wikibugs> ('PS1) ''Daniel Kinzler: rest gateway: refactor ratelimit integration test [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266995'
2026-04-02 11:51:51 <jinxer-wm> RESOLVED: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
2026-04-02 11:52:11 <logmsgbot> !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
2026-04-02 11:52:17 <wikibugs> ('Abandoned) ''Mvolz: citoid: pipeline bot promote [deployment-charts] - ''https://gerrit.wikimedia.org/r/1254925 (owner: ''PipelineBot)'
2026-04-02 11:52:24 <wikibugs> ('Abandoned) ''Mvolz: citoid: pipeline bot promote [deployment-charts] - ''https://gerrit.wikimedia.org/r/1254926 (owner: ''PipelineBot)'
2026-04-02 11:52:34 <wikibugs> ('Abandoned) ''Mvolz: citoid: pipeline bot promote [deployment-charts] - ''https://gerrit.wikimedia.org/r/1241846 (owner: ''PipelineBot)'
2026-04-02 11:52:44 <wikibugs> ('Abandoned) ''Mvolz: citoid: pipeline bot promote [deployment-charts] - ''https://gerrit.wikimedia.org/r/1258153 (owner: ''PipelineBot)'
2026-04-02 11:52:45 <jinxer-wm> RESOLVED: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
2026-04-02 11:52:55 <wikibugs> ('Abandoned) ''Mvolz: citoid: pipeline bot promote [deployment-charts] - ''https://gerrit.wikimedia.org/r/1254927 (owner: ''PipelineBot)'
2026-04-02 11:53:04 <wikibugs> ('Abandoned) ''Mvolz: citoid: pipeline bot promote [deployment-charts] - ''https://gerrit.wikimedia.org/r/1246819 (owner: ''PipelineBot)'
2026-04-02 11:54:15 <jinxer-wm> RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
2026-04-02 11:54:30 <wikibugs> ('CR) ''Ayounsi: [C:''+1] Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
2026-04-02 11:54:47 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T419635)', diff saved to https://phabricator.wikimedia.org/P90220 and previous config saved to /var/cache/conftool/dbconfig/20260402-115446-fceratto.json
2026-04-02 11:54:50 <stashbot> T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
2026-04-02 11:55:03 <logmsgbot> !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2190.codfw.wmnet with reason: Maintenance
2026-04-02 11:55:12 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2190 (T419635)', diff saved to https://phabricator.wikimedia.org/P90221 and previous config saved to /var/cache/conftool/dbconfig/20260402-115511-fceratto.json
2026-04-02 11:59:00 <edsanders> I have a high visibility UBN in for the deployment window - just waiting for it to merge
2026-04-02 11:59:59 <wikibugs> ('PS1) ''Brouberol: deployment_server: tweak the labels on opensearch_k8s_master_cert_expiry_seconds [puppet] - ''https://gerrit.wikimedia.org/r/1266999 (https://phabricator.wikimedia.org/T418175)'
2026-04-02 12:00:04 <jouncebot> Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1200)
2026-04-02 12:02:02 <edsanders> ah - timezone change - the window starts in one hour
2026-04-02 12:02:15 <jinxer-wm> FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
2026-04-02 12:03:19 <wikibugs> ('CR) ''Brouberol: [C:''+2] deployment_server: tweak the labels on opensearch_k8s_master_cert_expiry_seconds [puppet] - ''https://gerrit.wikimedia.org/r/1266999 (https://phabricator.wikimedia.org/T418175) (owner: ''Brouberol)'
2026-04-02 12:05:25 <jinxer-wm> FIRING: [3x] SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2026-04-02 12:06:41 <jinxer-wm> FIRING: [55x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
2026-04-02 12:07:15 <jinxer-wm> FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 23.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
2026-04-02 12:09:35 <logmsgbot> !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1373.eqiad.wmnet with OS trixie
2026-04-02 12:09:47 <logmsgbot> !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1374.eqiad.wmnet with OS trixie
2026-04-02 12:09:57 <logmsgbot> !log jayme@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1373
2026-04-02 12:09:57 <logmsgbot> !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1373
2026-04-02 12:10:08 <logmsgbot> !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1374
2026-04-02 12:10:08 <logmsgbot> !log jayme@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1374
2026-04-02 12:10:47 <p858snake|cloud> edsanders: fyi there is an incident at the moment (T422130) so the window might be affected
2026-04-02 12:10:48 <stashbot> T422130: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130
2026-04-02 12:11:02 <wikibugs> ('CR) ''JMeybohm: "recheck" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: ''Kamila Součková)'
2026-04-02 12:11:31 <wikibugs> ('CR) ''Volans: [C:''+2] Add missing includes from Netbox exported data [dns] - ''https://gerrit.wikimedia.org/r/1266952 (https://phabricator.wikimedia.org/T422115) (owner: ''Volans)'
2026-04-02 12:11:41 <jinxer-wm> FIRING: [55x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
2026-04-02 12:11:41 <logmsgbot> !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
2026-04-02 12:11:57 <logmsgbot> !log volans@dns1004 START - running authdns-update
2026-04-02 12:12:15 <jinxer-wm> RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 22.53% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
2026-04-02 12:12:19 <wikibugs> ('CR) ''JMeybohm: [C:''+1] admin-ng: Allow ML/exp users to use describe verb on nodes [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266947 (owner: ''Klausman)'
2026-04-02 12:12:30 <wikibugs> 'ops-eqiad, ''SRE, ''DBA, ''DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782064 (''Jclark-ctr) ''Open''Resolved ` A configuration related issue on the device Backplane is resolved. `'
2026-04-02 12:13:46 <logmsgbot> !log volans@dns1004 END - running authdns-update
2026-04-02 12:13:51 <jinxer-wm> FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
2026-04-02 12:14:26 <edsanders> p858snake I'd like to start my deployment asap, is everything on hold at the moment?
2026-04-02 12:16:13 <wikibugs> 'ops-eqiad, ''SRE, ''DBA, ''DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782090 (''FCeratto-WMF) Thanks!'
2026-04-02 12:16:41 <jinxer-wm> FIRING: [55x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
2026-04-02 12:17:17 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''observability, ''Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11782094 (''Jclark-ctr) @herron can you assist with updating puppet on this install ticket ?'
2026-04-02 12:18:38 <edsanders> Rhoni
2026-04-02 12:18:46 <edsanders> *typo
2026-04-02 12:18:51 <jinxer-wm> FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
2026-04-02 12:19:02 <effie> !incidents
2026-04-02 12:19:02 <sirenbot> 7804 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqiad)
2026-04-02 12:19:03 <sirenbot> 7803 (RESOLVED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
2026-04-02 12:19:12 <edsanders> RhinosF1: is there any chance of getting a UBN backported, despite T422130?
2026-04-02 12:19:13 <stashbot> T422130: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130
2026-04-02 12:19:32 <RhinosF1> edsanders: no idea why you are asking me
2026-04-02 12:19:32 <edsanders> (this: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/1266984)
2026-04-02 12:19:39 <edsanders> I saw you commented on the incident task
2026-04-02 12:19:42 <RhinosF1> You need to ask the IC
2026-04-02 12:19:45 <jinxer-wm> FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
2026-04-02 12:19:50 <RhinosF1> I suggest in #wikimedia-sre
2026-04-02 12:19:53 <edsanders> Thanks
2026-04-02 12:19:53 <RhinosF1> Much quieter there
2026-04-02 12:20:01 <wikibugs> ('CR) ''JMeybohm: Upgrade aux-k8s-codfw to k8s 1.31 (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1265426 (https://phabricator.wikimedia.org/T414486) (owner: ''Elukey)'
2026-04-02 12:20:08 <wikibugs> ('CR) ''JMeybohm: [C:''+1] admin_ng: upgrade aux-k8s-codfw to k8s 1.31 [deployment-charts] - ''https://gerrit.wikimedia.org/r/1265427 (https://phabricator.wikimedia.org/T414486) (owner: ''Elukey)'
2026-04-02 12:21:41 <jinxer-wm> FIRING: [55x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
2026-04-02 12:22:32 <logmsgbot> !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1373.eqiad.wmnet with reason: host reimage
2026-04-02 12:22:35 <logmsgbot> !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1374.eqiad.wmnet with reason: host reimage
2026-04-02 12:22:40 <wikibugs> 'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11782127 (''taavi)'
2026-04-02 12:24:40 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11782144 (''Jclark-ctr) a:''Jclark-ctr'
2026-04-02 12:25:18 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11782146 (''Jclark-ctr)'
2026-04-02 12:26:41 <jinxer-wm> FIRING: [54x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
2026-04-02 12:26:43 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T419635)', diff saved to https://phabricator.wikimedia.org/P90222 and previous config saved to /var/cache/conftool/dbconfig/20260402-122642-fceratto.json
2026-04-02 12:26:46 <stashbot> T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
2026-04-02 12:27:05 <wikibugs> ('CR) ''Btullis: Add analytics-fr-tech system user and corresponding groups (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) (owner: ''Btullis)'
2026-04-02 12:27:44 <wikibugs> 'SRE, ''DNS, ''Infrastructure-Foundations, ''netbox, and 3 others: Missing includes in DNS repo from Netbox-generated snippets - https://phabricator.wikimedia.org/T422115#11782158 (''Volans) p:''Triage''Medium I've merged and release the fix, do you want to keep the task open to implement some form o...'
2026-04-02 12:28:49 <logmsgbot> !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for es1042.eqiad.wmnet
2026-04-02 12:28:50 <logmsgbot> !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es1042.eqiad.wmnet
2026-04-02 12:29:17 <logmsgbot> !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1373.eqiad.wmnet with reason: host reimage
2026-04-02 12:30:46 <logmsgbot> !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool es1042.eqiad.wmnet: Restoring section
2026-04-02 12:30:57 <wikibugs> 'ops-eqiad, ''SRE, ''DBA, ''DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782163 (''FCeratto-WMF) The host booted, I triggered a puppet run manually, started MariaDB, enabled alarming and checked that icinga is green and started pooling in to help with T422130'
2026-04-02 12:31:11 <wikibugs> ('CR) ''JMeybohm: service::catalog: add sophroid service catalog entry (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) (owner: ''Jasmine)'
2026-04-02 12:31:23 <wikibugs> ('CR) ''JMeybohm: [C:''+1] conftool: add sophroid etcd data [puppet] - ''https://gerrit.wikimedia.org/r/1248611 (https://phabricator.wikimedia.org/T418748) (owner: ''Jasmine)'
2026-04-02 12:31:41 <jinxer-wm> RESOLVED: [44x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
2026-04-02 12:31:46 <logmsgbot> !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
2026-04-02 12:31:57 <wikibugs> ('CR) ''JMeybohm: [C:''+1] wmnet: add sophroid svc IPs [dns] - ''https://gerrit.wikimedia.org/r/1248617 (https://phabricator.wikimedia.org/T418748) (owner: ''Jasmine)'
2026-04-02 12:32:20 <wikibugs> ('CR) ''Klausman: [V:''+2 C:''+2] admin-ng: Allow ML/exp users to use describe verb on nodes [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266947 (owner: ''Klausman)'
2026-04-02 12:32:32 <icinga-wm> PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 86555328 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 12:32:39 <wikibugs> ('PS1) ''Anne Tomasevich: Add logged-in reader retention instrument [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490)'
2026-04-02 12:32:46 <logmsgbot> !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool es1042.eqiad.wmnet: Restoring section
2026-04-02 12:32:49 <logmsgbot> !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1374.eqiad.wmnet with reason: host reimage
2026-04-02 12:32:58 <logmsgbot> !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool es1042.eqiad.wmnet: Restoring section
2026-04-02 12:32:59 <logmsgbot> !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool es1042.eqiad.wmnet: Restoring section
2026-04-02 12:33:10 <logmsgbot> !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool es1042: Restoring section
2026-04-02 12:33:25 <wikibugs> 'ops-eqiad, ''SRE, ''DBA, ''DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782172 (''ops-monitoring-bot) Starting pool of es1042 by fceratto@cumin1003: Restoring section'
2026-04-02 12:33:26 <wikibugs> ('CR) ''JMeybohm: [C:''-1] "This is the wrong file. Since you're targeting the aux cluster you need to add the pool there (`hieradata/role/common/aux_k8s/worker.yaml`" [puppet] - ''https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) (owner: ''Jasmine)'
2026-04-02 12:33:34 <icinga-wm> RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 200752 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 12:33:51 <jinxer-wm> FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
2026-04-02 12:34:33 <effie> !incidents
2026-04-02 12:34:33 <sirenbot> 7804 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqiad)
2026-04-02 12:34:33 <sirenbot> 7803 (RESOLVED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
2026-04-02 12:34:53 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: ''Anne Tomasevich)'
2026-04-02 12:35:52 <wikibugs> ('CR) ''JMeybohm: [C:''-1] role::kubernetes::worker: add sophroid to the lvs pools (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) (owner: ''Jasmine)'
2026-04-02 12:36:32 <wikibugs> ('CR) ''Aude: [C:''+1] Add logged-in reader retention instrument [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: ''Anne Tomasevich)'
2026-04-02 12:36:51 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P90224 and previous config saved to /var/cache/conftool/dbconfig/20260402-123650-fceratto.json
2026-04-02 12:38:22 <icinga-wm> RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2026-04-02 12:38:22 <icinga-wm> RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2026-04-02 12:38:22 <icinga-wm> RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2026-04-02 12:39:23 <jinxer-wm> FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2026-04-02 12:39:39 <wikibugs> ('Merged) ''jenkins-bot: admin-ng: Allow ML/exp users to use describe verb on nodes [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266947 (owner: ''Klausman)'
2026-04-02 12:39:48 <icinga-wm> RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2026-04-02 12:39:48 <icinga-wm> RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2026-04-02 12:41:20 <wikibugs> 'ops-eqiad, ''SRE, ''SRE-swift-storage, ''DC-Ops: Disk (sdw) failed in ms-be1069 - https://phabricator.wikimedia.org/T421986#11782182 (''Jclark-ctr) a:''Jclark-ctr ` 2026-01-12 21:59:21 An unrecoverable disk media error occurred on Disk 20 in Backplane 2 of Integrated RAID Controller 1. Part Number =...'
2026-04-02 12:41:31 <logmsgbot> !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
2026-04-02 12:41:32 <jinxer-wm> FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
2026-04-02 12:41:40 <jinxer-wm> FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2026-04-02 12:41:43 <wikibugs> 'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782184 (''BTullis) I have run `cross-validate-accounts` for...'
2026-04-02 12:42:33 <wikibugs> 'ops-eqiad, ''SRE, ''SRE-swift-storage, ''DC-Ops: Disk (sdw) failed in ms-be1069 - https://phabricator.wikimedia.org/T421986#11782190 (''Jclark-ctr) ''Open''Resolved'
2026-04-02 12:44:17 <logmsgbot> !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
2026-04-02 12:45:04 <logmsgbot> !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
2026-04-02 12:45:29 <logmsgbot> !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1373.eqiad.wmnet with OS trixie
2026-04-02 12:45:51 <logmsgbot> !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
2026-04-02 12:46:59 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P90225 and previous config saved to /var/cache/conftool/dbconfig/20260402-124659-fceratto.json
2026-04-02 12:48:33 <logmsgbot> !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1042: Restoring section
2026-04-02 12:48:58 <wikibugs> 'ops-eqiad, ''SRE, ''DBA, ''DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782211 (''ops-monitoring-bot) Completed pooling of es1042 by fceratto@cumin1003: Restoring section'
2026-04-02 12:49:21 <logmsgbot> !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1374.eqiad.wmnet with OS trixie
2026-04-02 12:49:22 <jinxer-wm> FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2026-04-02 12:49:36 <wikibugs> 'ops-eqiad, ''SRE, ''SRE-swift-storage, ''DC-Ops: Disk (sdw) failed in ms-be1069 - https://phabricator.wikimedia.org/T421986#11782217 (''MatthewVernon) Thanks for the quick fixes @Jclark-ctr :-)'
2026-04-02 12:50:15 <jinxer-wm> FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
2026-04-02 12:50:19 <logmsgbot> !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
2026-04-02 12:54:43 <jasmine_> hi folks, just a reminder that we will be repooling codfw at 14:00 UTC today
2026-04-02 12:55:15 <jinxer-wm> RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 22.27% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
2026-04-02 12:55:32 <icinga-wm> PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 468938744 and 45 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 12:56:20 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11782255 (''Jclark-ctr) @Jgreen replaced cable link came up. Sorry for delay'
2026-04-02 12:56:37 <logmsgbot> !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
2026-04-02 12:57:07 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T419635)', diff saved to https://phabricator.wikimedia.org/P90227 and previous config saved to /var/cache/conftool/dbconfig/20260402-125707-fceratto.json
2026-04-02 12:57:11 <stashbot> T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
2026-04-02 12:57:25 <logmsgbot> !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2194.codfw.wmnet with reason: Maintenance
2026-04-02 12:57:32 <icinga-wm> RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 120 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 12:57:33 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2194 (T419635)', diff saved to https://phabricator.wikimedia.org/P90228 and previous config saved to /var/cache/conftool/dbconfig/20260402-125732-fceratto.json
2026-04-02 12:58:15 <jinxer-wm> FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 20.99% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
2026-04-02 13:00:05 <jouncebot> Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1300).
2026-04-02 13:00:05 <jouncebot> manfredi, HouseOfM, edsanders, and annet: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2026-04-02 13:00:14 <Lucas_WMDE> o/
2026-04-02 13:00:23 <annet> o/
2026-04-02 13:00:24 <Lucas_WMDE> I can deploy but I need to catch up with the incident first
2026-04-02 13:00:32 <Lucas_WMDE> not sure if it’s okay to deploy at the moment
2026-04-02 13:00:41 <edsanders> last I heard it isn't
2026-04-02 13:01:01 <edsanders> I've also asked to deploy my UBN asap once the incident is resolved
2026-04-02 13:01:14 <Lucas_WMDE> https://www.wikimediastatus.net/incidents/kq46rrxd2yy4 is still up
2026-04-02 13:01:25 <jinxer-wm> RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2026-04-02 13:02:04 <wikibugs> 'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11782282 (''Aklapper)'
2026-04-02 13:02:15 <Lucas_WMDE> I agree that edsanders’ change seems top priority once we can deploy at all
2026-04-02 13:02:19 <wikibugs> ('PS1) ''Btullis: Grant the WME engineering team production access suitable for Airflow [puppet] - ''https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214)'
2026-04-02 13:03:07 <wikibugs> ('CR) ''CI reject: [V:''-1] Grant the WME engineering team production access suitable for Airflow [puppet] - ''https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: ''Btullis)'
2026-04-02 13:03:51 <jinxer-wm> FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
2026-04-02 13:08:53 <wikibugs> ('PS2) ''Btullis: Grant the WME engineering team production access suitable for Airflow [puppet] - ''https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214)'
2026-04-02 13:09:39 <wikibugs> ('CR) ''CI reject: [V:''-1] Grant the WME engineering team production access suitable for Airflow [puppet] - ''https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: ''Btullis)'
2026-04-02 13:13:31 <wikibugs> ('PS3) ''Btullis: Grant the WME engineering team production access suitable for Airflow [puppet] - ''https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214)'
2026-04-02 13:15:21 <wikibugs> ('PS1) ''Ayounsi: Add Mayotte to geo-maps - prefer drmrs [dns] - ''https://gerrit.wikimedia.org/r/1267042'
2026-04-02 13:16:34 <icinga-wm> PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 47811456 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 13:17:24 <Lucas_WMDE> (the codfw repool is being pulled ahead, if that solves the incident then we *may* be able to deploy one or two patches in the window after all)
2026-04-02 13:17:33 <logmsgbot> !log jasmine@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool codfw [reason: no reason specified, T414486]
2026-04-02 13:17:37 <stashbot> T414486: Upgrade AUX clusters to kubernetes 1.31 - https://phabricator.wikimedia.org/T414486
2026-04-02 13:17:46 <logmsgbot> !log jasmine@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool codfw [reason: no reason specified, T414486]
2026-04-02 13:18:15 <jinxer-wm> RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.66% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
2026-04-02 13:18:33 <logmsgbot> !log jasmine@cumin1003 START - Cookbook sre.discovery.datacenter pool all active/active services in codfw: maintenance - T414486
2026-04-02 13:19:15 <jinxer-wm> FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
2026-04-02 13:20:31 <wikibugs> ('CR) ''Btullis: [C:''-1] "I'm just waiting for final approval from Haroon on the ticket, for his 6 reports." [puppet] - ''https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: ''Btullis)'
2026-04-02 13:20:32 <icinga-wm> RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3981016 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 13:22:09 <sukhe> !incidents
2026-04-02 13:22:09 <sirenbot> 7804 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqiad)
2026-04-02 13:22:09 <sirenbot> 7803 (RESOLVED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
2026-04-02 13:23:50 <wikibugs> 'ops-eqiad, ''DC-Ops, ''Infrastructure-Foundations, ''netops: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#11782358 (''Jclark-ctr)'
2026-04-02 13:27:16 <wikibugs> ('PS1) ''JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216)'
2026-04-02 13:28:30 <jinxer-wm> RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 23.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
2026-04-02 13:28:51 <jinxer-wm> FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
2026-04-02 13:29:15 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T419635)', diff saved to https://phabricator.wikimedia.org/P90229 and previous config saved to /var/cache/conftool/dbconfig/20260402-132914-fceratto.json
2026-04-02 13:29:18 <stashbot> T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
2026-04-02 13:29:35 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11782375 (''Jgreen) >>! In T417295#11782255, @Jclark-ctr wrote: > @Jgreen replaced cable link came up. Sorry for delay @Jclark-ctr looks good, it's imaging now. Thanks!'
2026-04-02 13:29:52 <wikibugs> 'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17), ''Patch-For-Review: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782376 (''BTullis) This patch for the...'
2026-04-02 13:30:15 <jinxer-wm> FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 22.99% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
2026-04-02 13:30:53 <wikibugs> ('CR) ''Muehlenhoff: [C:''+1] "Patch looks good, can be merged once approval is done" [puppet] - ''https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: ''Btullis)'
2026-04-02 13:31:11 <wikibugs> ('CR) ''Eevans: [C:''+2] charts/cassandra-http-gateway: configurable Cassandra keyspace [deployment-charts] - ''https://gerrit.wikimedia.org/r/1259188 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
2026-04-02 13:31:42 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11782380 (''Jclark-ctr)'
2026-04-02 13:32:31 <wikibugs> ('CR) ''Muehlenhoff: [C:''+1] "Looks good" [puppet] - ''https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) (owner: ''Btullis)'
2026-04-02 13:32:44 <wikibugs> ('CR) ''Eevans: [C:''+2] services: add linked-artifacts service [deployment-charts] - ''https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
2026-04-02 13:33:51 <jinxer-wm> FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
2026-04-02 13:33:58 <sukhe> !ack
2026-04-02 13:33:59 <sirenbot> All incidents are already acked.
2026-04-02 13:34:45 <jinxer-wm> FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
2026-04-02 13:34:51 <wikibugs> ('Merged) ''jenkins-bot: services: add linked-artifacts service [deployment-charts] - ''https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
2026-04-02 13:35:15 <jinxer-wm> RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 21.97% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
2026-04-02 13:35:57 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11782401 (''Jclark-ctr) @VRiley-WMF Thanks for following up I had Sent the email with instructions to Papaul while I was out on Tuesday. This will require som...'
2026-04-02 13:36:52 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11782402 (''Jclark-ctr) ''Open''Resolved'
2026-04-02 13:37:45 <logmsgbot> !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
2026-04-02 13:39:24 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P90230 and previous config saved to /var/cache/conftool/dbconfig/20260402-133923-fceratto.json
2026-04-02 13:39:45 <jinxer-wm> FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
2026-04-02 13:41:28 <wikibugs> ('PS1) ''Kosta Harlan: hCaptcha: Emit Prometheus counter on health check failover [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267056 (https://phabricator.wikimedia.org/T421204)'
2026-04-02 13:41:47 <logmsgbot> !log jasmine@cumin1003 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in codfw: maintenance - T414486
2026-04-02 13:41:51 <stashbot> T414486: Upgrade AUX clusters to kubernetes 1.31 - https://phabricator.wikimedia.org/T414486
2026-04-02 13:42:15 <jinxer-wm> RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
2026-04-02 13:42:58 <logmsgbot> !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
2026-04-02 13:43:51 <jinxer-wm> RESOLVED: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
2026-04-02 13:44:45 <jinxer-wm> RESOLVED: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
2026-04-02 13:49:23 <wikibugs> ('CR) ''Lucas Werkmeister (WMDE): [C:''+2] "starting gate-and-submit ahead of deployment" [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1266985 (https://phabricator.wikimedia.org/T422143) (owner: ''Esanders)'
2026-04-02 13:49:32 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P90231 and previous config saved to /var/cache/conftool/dbconfig/20260402-134931-fceratto.json
2026-04-02 13:49:44 <Lucas_WMDE> ^ there’s some chance we’ll be able to deploy; otherwise I’ll undo that CR+2 (cc edsanders)
2026-04-02 13:50:16 <edsanders> I'm here
2026-04-02 13:50:22 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267056 (https://phabricator.wikimedia.org/T421204) (owner: ''Kosta Harlan)'
2026-04-02 13:50:41 <edsanders> are we ready to deploy?
2026-04-02 13:50:54 <Lucas_WMDE> I just got the go-ahead in the security channel, so i think yes
2026-04-02 13:50:55 <wikibugs> ('Merged) ''jenkins-bot: Fix suggestion mode availability check [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1266985 (https://phabricator.wikimedia.org/T422143) (owner: ''Esanders)'
2026-04-02 13:50:57 <cdanis> ye
2026-04-02 13:51:02 <Lucas_WMDE> spiders the pig
2026-04-02 13:51:15 <Lucas_WMDE> oh, that gate-and-submit was a lot faster than I expected
2026-04-02 13:51:25 <logmsgbot> !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1266985|Fix suggestion mode availability check (T422143)]]
2026-04-02 13:51:28 <stashbot> T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
2026-04-02 13:51:40 <wikibugs> ('PS2) ''Arnaudb: gerrit: add Cache-Control for Gitiles with mod_proxy [puppet] - ''https://gerrit.wikimedia.org/r/1266238 (https://phabricator.wikimedia.org/T409422)'
2026-04-02 13:51:40 <edsanders> Lucas_WMDE: thanks
2026-04-02 13:52:53 <wikibugs> ('CR) ''Btullis: [C:''+2] Add analytics-fr-tech system user and corresponding groups [puppet] - ''https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) (owner: ''Btullis)'
2026-04-02 13:53:09 <logmsgbot> !log lucaswerkmeister-wmde@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmne
2026-04-02 13:53:10 <logmsgbot> t/restricted/mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_
2026-04-02 13:53:10 <logmsgbot> dir=/srv/mediawiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 44s)
2026-04-02 13:53:28 <Lucas_WMDE> looks
2026-04-02 13:54:06 <Lucas_WMDE> I think the sudo docker-pusher failed with “blob upload unknown”?
2026-04-02 13:54:09 <Lucas_WMDE> let me try again…
2026-04-02 13:54:47 <logmsgbot> !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1266985|Fix suggestion mode availability check (T422143)]]
2026-04-02 13:55:45 <logmsgbot> !log lucaswerkmeister-wmde@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmne
2026-04-02 13:55:45 <logmsgbot> t/restricted/mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_
2026-04-02 13:55:45 <logmsgbot> dir=/srv/mediawiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 00m 58s)
2026-04-02 13:56:06 <Lucas_WMDE> :(
2026-04-02 13:56:25 <Lucas_WMDE> same error I think
2026-04-02 13:56:29 <Lucas_WMDE> “blob upload unknown”
2026-04-02 13:57:11 <wikibugs> 'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11782509 (''cmooney) We are hopeful the situation should have improved after codfw was repooled, adding additional capacity. Root cause of the circuit breaking is still being in...'
2026-04-02 13:57:15 <edsanders> oh dear
2026-04-02 13:58:03 <Lucas_WMDE> jasmine_: as the codfw repooler (thanks again), any idea if this could be related?
2026-04-02 13:58:17 <wikibugs> ('CR) ''Dpogorzelski: ml-serve: add modified kserve 0.17 chart (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: ''Dpogorzelski)'
2026-04-02 13:58:19 <wikibugs> ('PS1) ''Elukey: [WIP] Move linting to Ruff and apply code fixes [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475)'
2026-04-02 13:58:26 <Lucas_WMDE> I’m imagining something like, scap now has to push the new mw image to codfw, but something on codfw might not be ready for it…
2026-04-02 13:58:29 <Lucas_WMDE> just guessing though
2026-04-02 13:58:35 <edsanders> I'll try once more for luck
2026-04-02 13:58:48 <Lucas_WMDE> ok
2026-04-02 13:58:53 <logmsgbot> !log esanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1266985|Fix suggestion mode availability check (T422143)]]
2026-04-02 13:58:56 <stashbot> T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
2026-04-02 13:58:58 <Lucas_WMDE> I didn’t realize you can deploy, I should’ve asked ^^
2026-04-02 13:59:00 <Lucas_WMDE> sorry
2026-04-02 13:59:17 <jasmine_> lucas_wmde: looking
2026-04-02 13:59:20 <Lucas_WMDE> thx
2026-04-02 13:59:40 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T419635)', diff saved to https://phabricator.wikimedia.org/P90232 and previous config saved to /var/cache/conftool/dbconfig/20260402-135939-fceratto.json
2026-04-02 13:59:43 <stashbot> T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
2026-04-02 13:59:56 <hashar> jouncebot: nowandnext
2026-04-02 13:59:56 <jouncebot> For the next 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1300)
2026-04-02 13:59:56 <jouncebot> In 0 hour(s) and 0 minute(s): DC Switchover: Day 8 - Codfw Repool (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1400)
2026-04-02 13:59:57 <logmsgbot> !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2205.codfw.wmnet with reason: Maintenance
2026-04-02 14:00:04 <jouncebot> jasmine_: May I have your attention please! DC Switchover: Day 8 - Codfw Repool. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1400)
2026-04-02 14:00:05 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2205 (T419635)', diff saved to https://phabricator.wikimedia.org/P90233 and previous config saved to /var/cache/conftool/dbconfig/20260402-140004-fceratto.json
2026-04-02 14:00:08 <logmsgbot> !log esanders@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/
2026-04-02 14:00:08 <logmsgbot> mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/medi
2026-04-02 14:00:08 <logmsgbot> awiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 15s)
2026-04-02 14:00:48 <hashar> jasmine_: I need to reload the CI Jenkins
2026-04-02 14:01:05 <hashar> it does not take long, I don't think it affects the switchover
2026-04-02 14:03:07 <hashar> !log Jenkins CI: reloading configuration from disk to poll new nodes # T421114
2026-04-02 14:03:11 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2026-04-02 14:03:11 <Lucas_WMDE> hashar: FYI, codfw was already repooled to respond to the incident (but I’m not sure how complete it is)
2026-04-02 14:03:12 <stashbot> T421114: Rebuild all Jenkins agents VM to Bookworm to support Java 21 - https://phabricator.wikimedia.org/T421114
2026-04-02 14:03:17 <hashar> done
2026-04-02 14:03:27 <hashar> Lucas_WMDE: ah cool, thank you!
2026-04-02 14:03:48 <wikibugs> ('PS2) ''Elukey: [WIP] Move linting to Ruff and apply code fixes [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475)'
2026-04-02 14:03:48 <Lucas_WMDE> (we’re also still trying to deploy an UBN fix backport, but running into issues in scap)
2026-04-02 14:04:16 <wikibugs> ('CR) ''Elukey: [WIP] Move linting to Ruff and apply code fixes (''1 comment) [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: ''Elukey)'
2026-04-02 14:05:34 <wikibugs> ('CR) ''Elukey: "First pass! I have intentionally removed a lot of problems allowing exceptions for tests etc.., I think it would be impossible (and probab" [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: ''Elukey)'
2026-04-02 14:05:48 <wikibugs> ('CR) ''Ottomata: stream: mw-page-html-content-change-enrich (''2 comments) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
2026-04-02 14:06:07 <jasmine_> hashar: yes we repooled a little bit earlier than scheduled, codfw is back up now
2026-04-02 14:07:25 <hashar> jasmine_: thank you and congratulations
2026-04-02 14:08:22 <hnowlan> could/should we make the config reload a part of a repool/depool?
2026-04-02 14:09:00 <wikibugs> ('PS3) ''Bking: opensearch: handle IP changes for software firewall [puppet] - ''https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714)'
2026-04-02 14:09:05 <wikibugs> ('PS2) ''JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216)'
2026-04-02 14:09:07 <wikibugs> ('CR) ''Bking: [C:''+2] opensearch: handle IP changes for software firewall (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: ''Bking)'
2026-04-02 14:09:11 <wikibugs> ('CR) ''Bking: [V:''+2 C:''+2] opensearch: handle IP changes for software firewall [puppet] - ''https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: ''Bking)'
2026-04-02 14:09:16 <logmsgbot> !log esanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1266985|Fix suggestion mode availability check (T422143)]]
2026-04-02 14:09:18 <hashar> hnowlan: the Jenkins reload? Nope it is unrelated, I had to do it for some unrelated configuration changes I have made on Jenkins
2026-04-02 14:09:19 <stashbot> T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
2026-04-02 14:09:23 <Lucas_WMDE> I confess I’m a bit torn between “revert the backport so the deployment is in a known state” and “leave it to be rolled out with the next deploy because it’s small and we really want it deployed”
2026-04-02 14:09:26 <hnowlan> hashar: ah okay
2026-04-02 14:10:01 <hashar> hnowlan: and whenever I act on Jenkins/Zuul I try to remember to check the deployment calendar to ensure that is not going to break some ongoing deployment :]
2026-04-02 14:10:24 <wikibugs> ('PS3) ''JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216)'
2026-04-02 14:10:26 <icinga-wm> RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 11 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
2026-04-02 14:10:32 <logmsgbot> !log esanders@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/
2026-04-02 14:10:32 <logmsgbot> mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/medi
2026-04-02 14:10:32 <logmsgbot> awiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 16s)
2026-04-02 14:10:48 <Lucas_WMDE> still the same error
2026-04-02 14:11:17 <wikibugs> ('CR) ''JavierMonton: stream: mw-page-html-content-change-enrich (''2 comments) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
2026-04-02 14:11:45 <wikibugs> 'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17), ''Patch-For-Review: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782589 (''HShaikh) I approve these re...'
2026-04-02 14:11:47 <wikibugs> ('PS3) ''Elukey: [WIP] Move linting to Ruff and apply code fixes [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475)'
2026-04-02 14:12:31 <wikibugs> ('CR) ''Elukey: [WIP] Move linting to Ruff and apply code fixes (''1 comment) [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: ''Elukey)'
2026-04-02 14:13:23 <wikibugs> ('CR) ''Ottomata: "It is quite annoying that 'staging' AKA -next in dse-k8s is a different helmfile. It makes it hard to share common settings between 'stagi" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
2026-04-02 14:13:44 <wikibugs> 'SRE: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166 (''Lucas_Werkmeister_WMDE) ''NEW'
2026-04-02 14:13:47 <Lucas_WMDE> I filed T422166 for the deploy blocker (cc edsanders), not sure how it should be tagged
2026-04-02 14:13:48 <stashbot> T422166: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166
2026-04-02 14:14:06 <wikibugs> 'SRE: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166#11782617 (''Lucas_Werkmeister_WMDE) p:''Triage''Unbreak!'
2026-04-02 14:14:11 <Lucas_WMDE> cc jasmine_ ^ if you’re still looking into it
2026-04-02 14:14:18 <jasmine_> Lucas_WMDE: looking now if perhaps it's swift related see
2026-04-02 14:14:18 <jasmine_> [0] - https://wikitech.wikimedia.org/wiki/Docker-registry/Runbook
2026-04-02 14:14:55 <wikibugs> ('PS1) ''Ladsgroup: Bump maxConnCount [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1267062'
2026-04-02 14:15:28 <wikibugs> ('CR) ''CDanis: [C:''+1] Bump maxConnCount [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1267062 (owner: ''Ladsgroup)'
2026-04-02 14:16:05 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1267062 (owner: ''Ladsgroup)'
2026-04-02 14:16:50 <wikibugs> 'SRE: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166#11782637 (''Lucas_Werkmeister_WMDE) Timeline note: this comes hot on the tail of T422130, for which @jasmine_ repooled codfw slightly earlier than [scheduled](https://wikitech.wikimedia.org/w/index.php?title=Deployments&old...'
2026-04-02 14:16:54 <Lucas_WMDE> Amir1: good luck with that deploy
2026-04-02 14:16:59 <wikibugs> ('Merged) ''jenkins-bot: Bump maxConnCount [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1267062 (owner: ''Ladsgroup)'
2026-04-02 14:17:13 <logmsgbot> !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1267062|Bump maxConnCount]]
2026-04-02 14:17:15 <Lucas_WMDE> (I expect you’ll run into T422166)
2026-04-02 14:17:23 <Amir1> Lucas_WMDE: that hopefully should prevent it from happening?
2026-04-02 14:17:46 <Amir1> oh that's a different issue
2026-04-02 14:17:48 <Amir1> yay
2026-04-02 14:17:48 <Lucas_WMDE> yeah
2026-04-02 14:18:25 <logmsgbot> !log ladsgroup@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted
2026-04-02 14:18:25 <logmsgbot> /mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/med
2026-04-02 14:18:25 <logmsgbot> iawiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 11s)
2026-04-02 14:18:28 <Lucas_WMDE> yup :(
2026-04-02 14:19:23 <wikibugs> 'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17), ''Patch-For-Review: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782654 (''BTullis)'
2026-04-02 14:19:39 <wikibugs> ('CR) ''Btullis: [C:''+2] "Manager approval received." [puppet] - ''https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: ''Btullis)'
2026-04-02 14:23:17 <wikibugs> ('PS4) ''JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216)'
2026-04-02 14:23:24 <wikibugs> ('CR) ''JavierMonton: stream: mw-page-html-content-change-enrich (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
2026-04-02 14:23:36 <Lucas_WMDE> (further investigation happening in -sre FTR)
2026-04-02 14:24:35 <wikibugs> ('CR) ''CDanis: [C:''+1] Add Mayotte to geo-maps - prefer drmrs [dns] - ''https://gerrit.wikimedia.org/r/1267042 (owner: ''Ayounsi)'
2026-04-02 14:27:10 <wikibugs> 'SRE: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166#11782695 (''Scott_French) dockerd logs on deploy1003 for the above example: ` Apr 02 14:09:17 deploy1003 dockerd[1070]: time="2026-04-02T14:09:17.561327804Z" level=info msg="ignoring event" container=c8f32695fd426caa327d6d...'
2026-04-02 14:28:22 <wikibugs> ('CR) ''Volans: [C:''+2] Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
2026-04-02 14:28:30 <moritzm> !log installing pyasn1 security updates
2026-04-02 14:28:31 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2026-04-02 14:29:42 <wikibugs> ('Merged) ''jenkins-bot: Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
2026-04-02 14:30:05 <jouncebot> jasmine_: Time to snap out of that daydream and deploy DC Switchover: Day 8 - Codfw Repool. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1400).
2026-04-02 14:30:05 <jouncebot> Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1430)
2026-04-02 14:33:14 <wikibugs> 'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782729 (''BTullis) I have now modified the `airflow-platfor...'
2026-04-02 14:34:53 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T419635)', diff saved to https://phabricator.wikimedia.org/P90236 and previous config saved to /var/cache/conftool/dbconfig/20260402-143452-fceratto.json
2026-04-02 14:34:56 <stashbot> T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
2026-04-02 14:36:28 <wikibugs> 'SRE-tools, ''Cumin, ''Infrastructure-Foundations: Add proxy support to cumin openstack backend - https://phabricator.wikimedia.org/T420360#11782751 (''Volans) ''Open''Resolved The cloudcumin hosts are now using the webproxies to connect to the openstack APIs and the firewall rule has been reverted...'
2026-04-02 14:37:31 <wikibugs> 'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11782760 (''MoritzMuehlenhoff) p:''Unbreak!''Medium The immediate impact has been mitigated, reducing priority, the task might still be used to collect followups.'
2026-04-02 14:41:11 <Lucas_WMDE> huge spike of PHP warnings from ExperimentManager all of a sudden
2026-04-02 14:41:11 <wikibugs> ('PS1) ''Eevans: cassandra-http-gateway: update version to 0.4.1 (April Fool's) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112)'
2026-04-02 14:41:19 <Lucas_WMDE> (logspam-watch)
2026-04-02 14:42:09 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Wikidata Platform Team, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11782776 (''MoritzMuehlenhoff) What kind of access is needed? root access or simply shell access? We have exist...'
2026-04-02 14:42:17 <moritzm> !log installing libxml-parser-perl security updates
2026-04-02 14:42:18 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2026-04-02 14:44:33 <wikibugs> 'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782789 (''BTullis) You should also now be able to start con...'
2026-04-02 14:45:01 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P90237 and previous config saved to /var/cache/conftool/dbconfig/20260402-144500-fceratto.json
2026-04-02 14:46:38 <wikibugs> ('CR) ''Eevans: "recheck" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
2026-04-02 14:47:27 <wikibugs> ('CR) ''Elukey: ml-serve: add modified kserve 0.17 chart (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: ''Dpogorzelski)'
2026-04-02 14:48:34 <wikibugs> ('CR) ''Elukey: [C:''+1] "Final review - this is currently a ok-ish use case since we already run the same config in prod. We agreed to open a task and follow up on" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: ''Dpogorzelski)'
2026-04-02 14:49:26 <wikibugs> ('CR) ''JMeybohm: [C:''+1] cassandra-http-gateway: update version to 0.4.1 (April Fool's) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
2026-04-02 14:50:08 <wikibugs> ('CR) ''Eevans: [C:''+2] cassandra-http-gateway: update version to 0.4.1 (April Fool's) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
2026-04-02 14:50:17 <Lucas_WMDE> edsanders: are you still around and available to test your backport? (see -sre)
2026-04-02 14:50:45 <wikibugs> ('CR) ''Eevans: [V:''+2 C:''+2] cassandra-http-gateway: update version to 0.4.1 (April Fool's) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
2026-04-02 14:51:13 <logmsgbot> !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
2026-04-02 14:51:41 <logmsgbot> !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
2026-04-02 14:52:01 <wikibugs> 'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782824 (''BTullis) 4 Kerberos principals created and welcom...'
2026-04-02 14:52:25 <logmsgbot> !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
2026-04-02 14:52:40 <logmsgbot> !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
2026-04-02 14:53:40 <logmsgbot> !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
2026-04-02 14:53:54 <logmsgbot> !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
2026-04-02 14:54:02 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11782828 (''Jgreen) ''Open''Resolved hosts are up and running'
2026-04-02 14:55:09 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P90239 and previous config saved to /var/cache/conftool/dbconfig/20260402-145508-fceratto.json
2026-04-02 14:55:12 <Lucas_WMDE> (the ExperimentManager warning spike seems to have abated again fwiw)
2026-04-02 14:56:38 <logmsgbot> !log swfrench@deploy1003 Started scap sync-world: Manual sync-world to pick up 1267062, 1266985 - T422143
2026-04-02 14:56:41 <stashbot> T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
2026-04-02 14:56:44 <logmsgbot> !log pt1979@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mr1-eqiad,mr1-eqiad IPv6 with reason: switching from OSFP to BGP
2026-04-02 14:56:46 <Lucas_WMDE> \o/
2026-04-02 14:57:44 <logmsgbot> !log swfrench@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/
2026-04-02 14:57:44 <logmsgbot> mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/medi
2026-04-02 14:57:44 <logmsgbot> awiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 06s)
2026-04-02 14:58:20 <wikibugs> ('CR) ''Ssingh: "I am guessing this is based on probenet data? (not that everything else in the repo currently is but I am mostly curious)" [dns] - ''https://gerrit.wikimedia.org/r/1267042 (owner: ''Ayounsi)'
2026-04-02 14:59:32 <papaul> !log ongoing maintenance on mr1-eqiad
2026-04-02 14:59:33 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2026-04-02 14:59:40 <logmsgbot> !log swfrench@deploy1003 Started scap sync-world: Manual full-rebuild sync-world to pick up 1267062, 1266985 - T422143
2026-04-02 15:00:04 <jouncebot> jnuche and hashar: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1500)
2026-04-02 15:00:38 <wikibugs> ('CR) ''Ottomata: [C:''+1] stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
2026-04-02 15:02:15 <wikibugs> ('CR) ''Dzahn: [C:''+2] buildkitd: Bump buildkit image to wmf-v0.29.0 [puppet] - ''https://gerrit.wikimedia.org/r/1266395 (https://phabricator.wikimedia.org/T415284) (owner: ''Ahmon Dancy)'
2026-04-02 15:02:37 <wikibugs> ('CR) ''Muehlenhoff: [C:''+1] "Preseed notes often use globbing where applicable, but with our ongoing migration of all servers to UEFI for hardware there will be a lot " [puppet] - ''https://gerrit.wikimedia.org/r/1260727 (https://phabricator.wikimedia.org/T418929) (owner: ''Herron)'
2026-04-02 15:03:03 <logmsgbot> !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
2026-04-02 15:03:45 <logmsgbot> !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
2026-04-02 15:04:20 <icinga-wm> PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
2026-04-02 15:04:20 <icinga-wm> PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
2026-04-02 15:05:17 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T419635)', diff saved to https://phabricator.wikimedia.org/P90241 and previous config saved to /var/cache/conftool/dbconfig/20260402-150517-fceratto.json
2026-04-02 15:05:20 <stashbot> T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
2026-04-02 15:05:34 <logmsgbot> !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2227.codfw.wmnet with reason: Maintenance
2026-04-02 15:05:47 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2227 (T419635)', diff saved to https://phabricator.wikimedia.org/P90242 and previous config saved to /var/cache/conftool/dbconfig/20260402-150542-fceratto.json
2026-04-02 15:05:49 <wikibugs> ('PS1) ''Papaul: Remove OSFP from mr1-eqiad [homer/public] - ''https://gerrit.wikimedia.org/r/1267081 (https://phabricator.wikimedia.org/T421238)'
2026-04-02 15:06:35 <logmsgbot> !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
2026-04-02 15:07:05 <logmsgbot> !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
2026-04-02 15:07:55 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops: Q3:rack/setup/install cloudcephmon2007-dev - https://phabricator.wikimedia.org/T416396#11782910 (''Jhancock.wm)'
2026-04-02 15:08:45 <wikibugs> ('CR) ''Papaul: [C:''+2] Remove OSFP from mr1-eqiad [homer/public] - ''https://gerrit.wikimedia.org/r/1267081 (https://phabricator.wikimedia.org/T421238) (owner: ''Papaul)'
2026-04-02 15:09:29 <wikibugs> ('CR) ''JavierMonton: [C:''+2] stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
2026-04-02 15:11:23 <wikibugs> ('Merged) ''jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
2026-04-02 15:11:40 <moritzm> !log installing apache2 security updates
2026-04-02 15:11:41 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2026-04-02 15:12:20 <icinga-wm> RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
2026-04-02 15:12:20 <icinga-wm> RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
2026-04-02 15:12:45 <logmsgbot> !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
2026-04-02 15:12:59 <logmsgbot> !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
2026-04-02 15:16:17 <wikibugs> ('PS1) ''Papaul: Add back "replace osfp" to be able to remove it [homer/public] - ''https://gerrit.wikimedia.org/r/1267085 (https://phabricator.wikimedia.org/T421238)'
2026-04-02 15:20:29 <wikibugs> ('CR) ''Papaul: [C:''+2] Add back "replace osfp" to be able to remove it [homer/public] - ''https://gerrit.wikimedia.org/r/1267085 (https://phabricator.wikimedia.org/T421238) (owner: ''Papaul)'
2026-04-02 15:22:31 <logmsgbot> !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
2026-04-02 15:23:08 <papaul> !log maintenance complete on mr1-eqiad
2026-04-02 15:23:09 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2026-04-02 15:23:22 <logmsgbot> !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
2026-04-02 15:26:12 <swfrench-wmf> !log restarted docker-registry-restricted.service on registry200[45] - T422166
2026-04-02 15:26:14 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2026-04-02 15:26:14 <stashbot> T422166: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166
2026-04-02 15:26:28 <logmsgbot> !log swfrench@deploy1003 sync-world aborted: Manual full-rebuild sync-world to pick up 1267062, 1266985 - T422143 (duration: 26m 48s)
2026-04-02 15:26:31 <stashbot> T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
2026-04-02 15:27:38 <logmsgbot> !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
2026-04-02 15:27:46 <logmsgbot> !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
2026-04-02 15:31:16 <swfrench-wmf> !log restarted docker-registry-ml.service on registry200[45] - T422166
2026-04-02 15:31:18 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2026-04-02 15:31:19 <stashbot> T422166: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166
2026-04-02 15:32:34 <moritzm> !log installing freetype security updates
2026-04-02 15:32:35 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2026-04-02 15:32:59 <wikibugs> ('CR) ''Dzahn: [C:''+1] gerrit: adjust idleTimeout on Jetty [puppet] - ''https://gerrit.wikimedia.org/r/1262020 (https://phabricator.wikimedia.org/T421827) (owner: ''Arnaudb)'
2026-04-02 15:33:00 <logmsgbot> !log swfrench@deploy1003 Started scap sync-world: Manual full-rebuild sync-world to pick up 1267062, 1266985 (attempt 2) - T422143
2026-04-02 15:33:02 <stashbot> T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
2026-04-02 15:34:43 <wikibugs> ('PS4) ''Elukey: [WIP] Move linting to Ruff and apply code fixes [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475)'
2026-04-02 15:35:06 <wikibugs> ('CR) ''Elukey: [WIP] Move linting to Ruff and apply code fixes (''1 comment) [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: ''Elukey)'
2026-04-02 15:38:44 <wikibugs> ('CR) ''Elukey: "Local, venvs created (so not the first run):" [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: ''Elukey)'
2026-04-02 15:39:18 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T419635)', diff saved to https://phabricator.wikimedia.org/P90244 and previous config saved to /var/cache/conftool/dbconfig/20260402-153918-fceratto.json
2026-04-02 15:39:22 <stashbot> T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
2026-04-02 15:41:49 <wikibugs> ('CR) ''Dzahn: [V:''+1 C:''+1] "https://puppet-compiler.wmflabs.org/output/1256301/8370/" [puppet] - ''https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: ''A smart kitten)'
2026-04-02 15:41:50 <wikibugs> ('PS5) ''Elukey: Move linting to Ruff and apply code fixes [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475)'
2026-04-02 15:44:23 <jinxer-wm> FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2026-04-02 15:45:37 <wikibugs> ('PS14) ''Herron: site: opt-in insetup defaults by hostname prefix [puppet] - ''https://gerrit.wikimedia.org/r/1260727 (https://phabricator.wikimedia.org/T418929)'
2026-04-02 15:46:55 <wikibugs> ('CR) ''A smart kitten: "FWIW that [phab1004 NOOP result](https://puppet-compiler.wmflabs.org/output/1256301/8370/phab1004.eqiad.wmnet/index.html) seems wrong - it" [puppet] - ''https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: ''A smart kitten)'
2026-04-02 15:46:59 <wikibugs> ('CR) ''A smart kitten: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: ''A smart kitten)'
2026-04-02 15:48:31 <jinxer-wm> FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards Has improved - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
2026-04-02 15:48:57 <wikibugs> ('CR) ''A smart kitten: "(FWIW @dzahn@wikimedia.org, feel free to shoot me a message in IRC if you want to sync-up e.g. if/when deploying/testing this patch. I'm n" [puppet] - ''https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: ''A smart kitten)'
2026-04-02 15:49:08 <wikibugs> ('CR) ''Herron: [C:''+2] "thanks for the review!" [puppet] - ''https://gerrit.wikimedia.org/r/1260727 (https://phabricator.wikimedia.org/T418929) (owner: ''Herron)'
2026-04-02 15:49:22 <jinxer-wm> FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2026-04-02 15:49:26 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P90245 and previous config saved to /var/cache/conftool/dbconfig/20260402-154925-fceratto.json
2026-04-02 15:50:05 <logmsgbot> !log swfrench@deploy1003 swfrench: Manual full-rebuild sync-world to pick up 1267062, 1266985 (attempt 2) - T422143 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2026-04-02 15:50:09 <stashbot> T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
2026-04-02 15:50:10 <wikibugs> ('CR) ''A smart kitten: "(if I'm around in IRC at the time you'll be deploying this, that is; otherwise feel free to just deploy it if/when is good for you :) )" [puppet] - ''https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: ''A smart kitten)'
2026-04-02 15:51:13 <logmsgbot> !log swfrench@deploy1003 swfrench: Continuing with sync
2026-04-02 15:55:31 <wikibugs> ('PS3) ''Btullis: Remove the site.pp definitions for decommissioned an-worker hosts [puppet] - ''https://gerrit.wikimedia.org/r/1242516 (https://phabricator.wikimedia.org/T414948)'
2026-04-02 15:55:47 <wikibugs> ('PS1) ''Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175)'
2026-04-02 15:56:13 <wikibugs> ('CR) ''Dzahn: [V:''+1 C:''+1] "it's because puppet DB queries were introduced somewhere (not by your patch) which often breaks compiler runs (Failed to execute '/pdb/que" [puppet] - ''https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: ''A smart kitten)'
2026-04-02 15:59:23 <jinxer-wm> FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2026-04-02 15:59:35 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P90246 and previous config saved to /var/cache/conftool/dbconfig/20260402-155934-fceratto.json
2026-04-02 16:00:05 <jouncebot> No Gerrit patches in the queue for this window AFAICS.
2026-04-02 16:00:05 <jouncebot> jhathaway and rzl: Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1600). Please do the needful.
2026-04-02 16:00:34 <Lucas_WMDE> we’re so close to finishing the backport+config window lol
2026-04-02 16:00:49 <Lucas_WMDE> (with 1/4 patches deployed)
2026-04-02 16:01:31 <wikibugs> ('PS2) ''Herron: preseed: use efi for new kafka-logging hosts [puppet] - ''https://gerrit.wikimedia.org/r/1267102 (https://phabricator.wikimedia.org/T418929)'
2026-04-02 16:01:33 <wikibugs> ('CR) ''CI reject: [V:''-1] preseed: use efi for new kafka-logging hosts [puppet] - ''https://gerrit.wikimedia.org/r/1267102 (https://phabricator.wikimedia.org/T418929) (owner: ''Herron)'
2026-04-02 16:01:38 <wikibugs> ('PS2) ''Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175)'
2026-04-02 16:02:56 <logmsgbot> !log swfrench@deploy1003 Finished scap sync-world: Manual full-rebuild sync-world to pick up 1267062, 1266985 (attempt 2) - T422143 (duration: 29m 56s)
2026-04-02 16:02:59 <stashbot> T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
2026-04-02 16:02:59 <swfrench-wmf> \i/
2026-04-02 16:03:04 <Lucas_WMDE> \o/ \o/ \o/
2026-04-02 16:03:40 <Lucas_WMDE> !log UTC afternoon backport+config window (very belatedly) done ^^
2026-04-02 16:03:41 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2026-04-02 16:03:50 <Lucas_WMDE> thanks for figuring it out and deploying!
2026-04-02 16:04:08 <Lucas_WMDE> Amir1: your maxConnCount bump got deployed now btw ^
2026-04-02 16:04:15 <Amir1> thanks!
2026-04-02 16:04:22 <jinxer-wm> FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2026-04-02 16:05:09 <wikibugs> 'SRE, ''Datacenter-Switchover: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166#11783170 (''Scott_French) p:''Unbreak!''Medium This was a curious one. Many thanks to @elukey and @CDanis for the assistance. tl;dr - Cached connections in the (restricted) docker registry's...'
2026-04-02 16:05:26 <wikibugs> 'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11783179 (''Ahoelzl) I approve the addition of the listed WME...'
2026-04-02 16:05:40 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2026-04-02 16:09:13 <jinxer-wm> FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2026-04-02 16:09:23 <jinxer-wm> FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2026-04-02 16:09:43 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T419635)', diff saved to https://phabricator.wikimedia.org/P90247 and previous config saved to /var/cache/conftool/dbconfig/20260402-160942-fceratto.json
2026-04-02 16:09:46 <stashbot> T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
2026-04-02 16:09:59 <logmsgbot> !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2239.codfw.wmnet with reason: Maintenance
2026-04-02 16:10:44 <wikibugs> ('Abandoned) ''Federico Ceratto: wmnet: update CNAME records for DB masters to eqiad [dns] - ''https://gerrit.wikimedia.org/r/1255655 (https://phabricator.wikimedia.org/T416705) (owner: ''Federico Ceratto)'
2026-04-02 16:11:45 <wikibugs> ('PS3) ''Herron: preseed: use efi for new kafka-logging hosts [puppet] - ''https://gerrit.wikimedia.org/r/1267102 (https://phabricator.wikimedia.org/T418929)'
2026-04-02 16:12:31 <logmsgbot> !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
2026-04-02 16:12:43 <logmsgbot> !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
2026-04-02 16:12:55 <logmsgbot> !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
2026-04-02 16:13:01 <logmsgbot> !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
2026-04-02 16:14:07 <wikibugs> ('CR) ''Herron: [C:''+2] "ok! lets give this a try" [alerts] - ''https://gerrit.wikimedia.org/r/1262175 (https://phabricator.wikimedia.org/T418858) (owner: ''Herron)'
2026-04-02 16:14:23 <jinxer-wm> FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2026-04-02 16:15:28 <wikibugs> ('Merged) ''jenkins-bot: burrow: update expressions to handle multiple instances [alerts] - ''https://gerrit.wikimedia.org/r/1262175 (https://phabricator.wikimedia.org/T418858) (owner: ''Herron)'
2026-04-02 16:15:28 <wikibugs> ('CR) ''Dzahn: [V:''+1 C:''+2] phabricator: Set a custom default-mail-address for the test instance [puppet] - ''https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: ''A smart kitten)'
2026-04-02 16:15:53 <swfrench-wmf> jouncebot: nowandnext
2026-04-02 16:15:53 <jouncebot> For the next 0 hour(s) and 44 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1600)
2026-04-02 16:15:53 <jouncebot> In 0 hour(s) and 44 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1700)
2026-04-02 16:15:53 <jouncebot> In 0 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1700)
2026-04-02 16:16:55 <wikibugs> ('CR) ''Herron: [C:''+2] "thanks all!" [puppet] - ''https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) (owner: ''Herron)'
2026-04-02 16:18:02 <wikibugs> ('CR) ''Dzahn: [V:''+1 C:''+2] "deployed. confirmed it is a NOOP / no error on production host." [puppet] - ''https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: ''A smart kitten)'
2026-04-02 16:18:31 <wikibugs> ('CR) ''Scott French: "Thanks for the review!" [puppet] - ''https://gerrit.wikimedia.org/r/1198576 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
2026-04-02 16:19:10 <wikibugs> ('CR) ''Scott French: [C:''+2] deployment_server: absent image-suggestion k8s creds config [puppet] - ''https://gerrit.wikimedia.org/r/1198576 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
2026-04-02 16:23:33 <wikibugs> ('Restored) ''Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) (owner: ''Mmartorana)'
2026-04-02 16:24:14 <jinxer-wm> FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2026-04-02 16:24:35 <wikibugs> 'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11783241 (''BTullis) ''Open''Resolved p:''Triage'...
2026-04-02 16:25:39 <wikibugs> ('PS6) ''Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366)'
2026-04-02 16:25:48 <wikibugs> ('CR) ''CI reject: [V:''-1] config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) (owner: ''Mmartorana)'
2026-04-02 16:26:51 <wikibugs> ('Abandoned) ''Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) (owner: ''Mmartorana)'
2026-04-02 16:29:13 <jinxer-wm> FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2026-04-02 16:31:22 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: ''Anne Tomasevich)'
2026-04-02 16:32:25 <wikibugs> ('PS1) ''Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1267116 (https://phabricator.wikimedia.org/T421366)'
2026-04-02 16:33:19 <wikibugs> 'SRE-swift-storage, ''API Platform, ''Commons, ''MediaWiki-File-management, and 4 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#11783346 (''Ladsgroup) I was looking into this a bit yesterday (more general...'
2026-04-02 16:34:13 <jinxer-wm> FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2026-04-02 16:34:48 <wikibugs> ('CR) ''Btullis: data-platform: Add alerts for opensearch on k8s certificate expiry (''2 comments) [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: ''Bking)'
2026-04-02 16:37:32 <wikibugs> 'SRE, ''Traffic: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11783388 (''Alberto) Thank you very much for your help! I have correctly implemented the User-Agent in my LocalSettings.php for both MediaWiki core and the QuickInstantCommons...'
2026-04-02 16:39:14 <jinxer-wm> RESOLVED: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2026-04-02 16:39:22 <wikibugs> ('CR) ''Scott French: [C:''+2] deployment_server: remove absented image-suggestion k8s creds config [puppet] - ''https://gerrit.wikimedia.org/r/1198577 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
2026-04-02 16:39:46 <wikibugs> ('PS4) ''Scott French: deployment_server: remove absented image-suggestion k8s creds config [puppet] - ''https://gerrit.wikimedia.org/r/1198577 (https://phabricator.wikimedia.org/T368096)'
2026-04-02 16:40:30 <wikibugs> ('PS1) ''Daniel Kinzler: Revert "rest gateway: introduce policy for abstractwiki/wikifunctions" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267119'
2026-04-02 16:41:02 <wikibugs> ('CR) ''Daniel Kinzler: [C:''+2] "revert undeployed change" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267119 (owner: ''Daniel Kinzler)'
2026-04-02 16:43:22 <wikibugs> ('CR) ''Scott French: [C:''+2] deployment_server: remove absented image-suggestion k8s creds config [puppet] - ''https://gerrit.wikimedia.org/r/1198577 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
2026-04-02 16:44:00 <wikibugs> ('Merged) ''jenkins-bot: Revert "rest gateway: introduce policy for abstractwiki/wikifunctions" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267119 (owner: ''Daniel Kinzler)'
2026-04-02 16:45:27 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''observability, ''Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11783408 (''Jclark-ctr) a:''herron''Jclark-ctr'
2026-04-02 16:45:58 <wikibugs> ('PS1) ''Daniel Kinzler: rest gateway: introduce policy for abstractwiki/wikifunctions" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581)'
2026-04-02 16:47:02 <wikibugs> 'SRE-Access-Requests, ''Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189 (''prabhat) ''NEW'
2026-04-02 16:47:35 <wikibugs> ('PS1) ''Herron: kafkamon: update burrow ports [puppet] - ''https://gerrit.wikimedia.org/r/1267121 (https://phabricator.wikimedia.org/T418858)'
2026-04-02 16:47:47 <wikibugs> 'SRE-Access-Requests, ''Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11783451 (''prabhat)'
2026-04-02 16:49:51 <wikibugs> ('CR) ''Scott French: "Thank you both for the review!" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
2026-04-02 16:50:07 <wikibugs> ('CR) ''Scott French: [C:''+2] image-suggestion: remove service configuration [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
2026-04-02 16:52:26 <wikibugs> 'SRE-Access-Requests, ''Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11783519 (''ssingh) request and key confirmed out of band.'
2026-04-02 16:53:23 <logmsgbot> !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs3009.esams.wmnet} and A:liberica
2026-04-02 16:54:23 <jinxer-wm> RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2026-04-02 16:57:02 <logmsgbot> !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs3009.esams.wmnet} and A:liberica
2026-04-02 16:58:15 <wikibugs> ('Merged) ''jenkins-bot: image-suggestion: remove service configuration [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
2026-04-02 16:59:30 <logmsgbot> !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs3008.esams.wmnet} and A:liberica
2026-04-02 17:00:05 <jouncebot> bd808: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1700).
2026-04-02 17:00:05 <jouncebot> Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1700)
2026-04-02 17:00:07 <swfrench-wmf> o/
2026-04-02 17:00:25 <swfrench-wmf> I'll be deploying some admin_ng changes shortly
2026-04-02 17:02:25 <logmsgbot> !log swfrench@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
2026-04-02 17:03:03 <logmsgbot> !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs3008.esams.wmnet} and A:liberica
2026-04-02 17:03:30 <wikibugs> ('PS1) ''JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267128 (https://phabricator.wikimedia.org/T421216)'
2026-04-02 17:04:46 <wikibugs> ('CR) ''Ottomata: [C:''+1] stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267128 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
2026-04-02 17:05:13 <logmsgbot> !log swfrench@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
2026-04-02 17:05:34 <logmsgbot> !log swfrench@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
2026-04-02 17:07:04 <logmsgbot> !log swfrench@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
2026-04-02 17:07:04 <wikibugs> ('CR) ''JavierMonton: [C:''+2] stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267128 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
2026-04-02 17:08:15 <logmsgbot> !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
2026-04-02 17:08:37 <bd808> checks for things that need releasing
2026-04-02 17:09:06 <wikibugs> ('PS1) ''DCausse: search: add space-discount for wikidata custom prefix search profiles [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1267130 (https://phabricator.wikimedia.org/T420427)'
2026-04-02 17:09:09 <wikibugs> ('Merged) ''jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267128 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
2026-04-02 17:09:17 <bd808> nothing for my window this week</window>
2026-04-02 17:09:39 <wikibugs> ('PS4) ''Dzahn: ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config [puppet] - ''https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109)'
2026-04-02 17:10:12 <wikibugs> ('CR) ''CI reject: [V:''-1] ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config [puppet] - ''https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) (owner: ''Dzahn)'
2026-04-02 17:10:34 <logmsgbot> !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
2026-04-02 17:10:37 <wikibugs> ('CR) ''Dzahn: ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config (''3 comments) [puppet] - ''https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) (owner: ''Dzahn)'
2026-04-02 17:10:48 <logmsgbot> !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
2026-04-02 17:11:08 <logmsgbot> !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
2026-04-02 17:11:20 <wikibugs> ('PS5) ''Dzahn: ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config [puppet] - ''https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109)'
2026-04-02 17:11:31 <logmsgbot> !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
2026-04-02 17:11:50 <logmsgbot> !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
2026-04-02 17:12:02 <logmsgbot> !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
2026-04-02 17:12:12 <logmsgbot> !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
2026-04-02 17:12:40 <logmsgbot> !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
2026-04-02 17:13:56 <logmsgbot> !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
2026-04-02 17:14:49 <wikibugs> ('CR) ''Scott French: "Thanks for the review!" [dns] - ''https://gerrit.wikimedia.org/r/1198584 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
2026-04-02 17:15:32 <wikibugs> ('CR) ''Scott French: [C:''+2] wmnet: remove image-suggestion k8s ingress CNAMEs [dns] - ''https://gerrit.wikimedia.org/r/1198584 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
2026-04-02 17:15:41 <wikibugs> ('CR) ''Muehlenhoff: [C:''+1] "LGTM" [puppet] - ''https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) (owner: ''Dzahn)'
2026-04-02 17:16:11 <logmsgbot> !log swfrench@dns1004 START - running authdns-update
2026-04-02 17:18:08 <logmsgbot> !log swfrench@dns1004 END - running authdns-update
2026-04-02 17:20:27 <wikibugs> ('PS4) ''Scott French: service: remove image-suggestion [puppet] - ''https://gerrit.wikimedia.org/r/1198578 (https://phabricator.wikimedia.org/T368096)'
2026-04-02 17:26:28 <wikibugs> 'SRE-Access-Requests, ''Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11783746 (''prabhat)'
2026-04-02 17:27:48 <swfrench-wmf> alright, I believe I'm done with my side of this window
2026-04-02 17:28:10 <wikibugs> ('PS1) ''Eevans: cassandra-dev: add ferm srange for k8s staging [puppet] - ''https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444)'
2026-04-02 17:28:39 <wikibugs> ('CR) ''CI reject: [V:''-1] cassandra-dev: add ferm srange for k8s staging [puppet] - ''https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444) (owner: ''Eevans)'
2026-04-02 17:29:04 <wikibugs> ('PS1) ''Snwachukwu: Add rest gateway routes for video_plays path. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202)'
2026-04-02 17:31:23 <wikibugs> ('CR) ''Dzahn: [V:''+1 C:''+2] phabricator: Set a custom default-mail-address for the test instance (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: ''A smart kitten)'
2026-04-02 17:31:54 <wikibugs> ('CR) ''Mforns: [C:''+1] "LGTM!" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: ''Snwachukwu)'
2026-04-02 17:32:10 <wikibugs> ('PS2) ''Eevans: cassandra-dev: add ferm srange for k8s staging [puppet] - ''https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444)'
2026-04-02 17:35:42 <wikibugs> ('PS1) ''Scott French: fixtures: clean up reference to image-suggestion [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096)'
2026-04-02 17:36:02 <wikibugs> ('CR) ''Snwachukwu: [C:''+2] Add rest gateway routes for video_plays path. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: ''Snwachukwu)'
2026-04-02 17:36:07 <wikibugs> ('PS3) ''Eevans: cassandra-dev: add ferm srange for k8s staging [puppet] - ''https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444)'
2026-04-02 17:36:12 <wikibugs> ('CR) ''Eevans: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444) (owner: ''Eevans)'
2026-04-02 17:36:51 <wikibugs> ('PS1) ''Ssingh: admin: update SSH key for ptiwary [puppet] - ''https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189)'
2026-04-02 17:36:54 <wikibugs> ('CR) ''Snwachukwu: [C:''+2] "Thank you!" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: ''Snwachukwu)'
2026-04-02 17:37:00 <wikibugs> ('CR) ''Snwachukwu: [V:''+2 C:''+2] Add rest gateway routes for video_plays path. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: ''Snwachukwu)'
2026-04-02 17:39:23 <wikibugs> ('PS3) ''Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175)'
2026-04-02 17:39:32 <wikibugs> ('CR) ''Eevans: [C:''+2] cassandra-dev: add ferm srange for k8s staging [puppet] - ''https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444) (owner: ''Eevans)'
2026-04-02 17:39:46 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11783799 (''Jgreen) ''Open''Resolved boxes are imaged, in replication, and ready for traffic once pfw policy is done'
2026-04-02 17:40:49 <wikibugs> ('CR) ''CI reject: [V:''-1] data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: ''Bking)'
2026-04-02 17:42:20 <wikibugs> ('CR) ''Ottomata: [C:''+1] Add rest gateway routes for video_plays path. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: ''Snwachukwu)'
2026-04-02 17:42:35 <wikibugs> ('CR) ''Ssingh: "Request verified out of band, please feel free to do an additional check." [puppet] - ''https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: ''Ssingh)'
2026-04-02 17:44:20 <wikibugs> ('CR) ''Ayounsi: "That's a follow up from an email that was sent to noc@ from a local ISP." [dns] - ''https://gerrit.wikimedia.org/r/1267042 (owner: ''Ayounsi)'
2026-04-02 17:44:27 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Data-Engineering, ''Patch-For-Review: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11783815 (''HShaikh) As prabhat's manager I approve this request.'
2026-04-02 17:45:50 <wikibugs> ('CR) ''Ssingh: [C:''+1] "Ah I see it now -- my bad. Thanks." [dns] - ''https://gerrit.wikimedia.org/r/1267042 (owner: ''Ayounsi)'
2026-04-02 17:46:51 <jinxer-wm> FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2026-04-02 17:47:50 <wikibugs> ('PS1) ''Snwachukwu: Add rest gateway routes for video_plays path production. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267147 (https://phabricator.wikimedia.org/T415202)'
2026-04-02 17:49:08 <wikibugs> ('CR) ''Dzahn: [V:''+1 C:''+2] "I can see in compiler how this changes things on new instance "integration-agent-docker-1070" just created on https://phabricator.wikimedi"; [puppet] - ''https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) (owner: ''Hashar)'
2026-04-02 17:50:58 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11783859 (''Jgreen)'
2026-04-02 17:54:07 <wikibugs> 'SRE, ''DNS, ''Infrastructure-Foundations, ''netbox, and 2 others: Missing includes in DNS repo from Netbox-generated snippets - https://phabricator.wikimedia.org/T422115#11783873 (''ssingh) Thanks for fixing it but I agree that we need an alert for this otherwise we will miss this again.'
2026-04-02 17:55:40 <wikibugs> ('CR) ''Snwachukwu: [C:''+2] Add rest gateway routes for video_plays path production. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267147 (https://phabricator.wikimedia.org/T415202) (owner: ''Snwachukwu)'
2026-04-02 17:56:20 <wikibugs> ('CR) ''Dzahn: [V:''+1 C:''+2] "noop confirmed on contint prod hosts" [puppet] - ''https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) (owner: ''Hashar)'
2026-04-02 17:57:43 <wikibugs> ('Merged) ''jenkins-bot: Add rest gateway routes for video_plays path production. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267147 (https://phabricator.wikimedia.org/T415202) (owner: ''Snwachukwu)'
2026-04-02 17:58:30 <wikibugs> ('PS4) ''Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175)'
2026-04-02 17:59:52 <logmsgbot> !log ebysans@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
2026-04-02 18:00:10 <logmsgbot> !log ebysans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
2026-04-02 18:00:29 <logmsgbot> !log ebysans@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
2026-04-02 18:00:35 <wikibugs> ('CR) ''CI reject: [V:''-1] data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: ''Bking)'
2026-04-02 18:00:48 <logmsgbot> !log ebysans@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
2026-04-02 18:01:24 <wikibugs> ('CR) ''Jasmine: [C:''+1] "LGTM" [puppet] - ''https://gerrit.wikimedia.org/r/1198578 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
2026-04-02 18:01:51 <jinxer-wm> RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2026-04-02 18:04:40 <wikibugs> ('PS5) ''Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175)'
2026-04-02 18:05:15 <wikibugs> ('CR) ''Brouberol: [C:''+1] fixtures: clean up reference to image-suggestion [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
2026-04-02 18:06:00 <wikibugs> ('CR) ''CI reject: [V:''-1] data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: ''Bking)'
2026-04-02 18:07:21 <wikibugs> ('CR) ''Muehlenhoff: "One validation is fine, you can either go ahead and merge it or I'll take care of it via Clinic duty, either is fine." [puppet] - ''https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: ''Ssingh)'
2026-04-02 18:07:35 <wikibugs> ('PS6) ''Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175)'
2026-04-02 18:14:19 <wikibugs> ('CR) ''Bking: data-platform: Add alerts for opensearch on k8s certificate expiry (''2 comments) [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: ''Bking)'
2026-04-02 18:16:57 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11783930 (''Jclark-ctr) a:''Jgreen''Jclark-ctr'
2026-04-02 18:24:15 <wikibugs> ('CR) ''SBassett: [C:''+2] "Oh, whoops, I see the commit msg says "miscweb(research-landing-page): bump image version". Just to be clear, this change set is for" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1174750 (https://phabricator.wikimedia.org/T399132) (owner: ''Jly)'
2026-04-02 18:24:47 <logmsgbot> !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs5006.eqsin.wmnet} and A:liberica
2026-04-02 18:25:57 <jinxer-wm> FIRING: ProbeDown: Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2026-04-02 18:28:03 <logmsgbot> !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs5006.eqsin.wmnet} and A:liberica
2026-04-02 18:28:50 <wikibugs> ('PS3) ''SBassett: Allow-list some additional domains to the currently enforcing CSP [puppet] - ''https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: ''WikiBayer)'
2026-04-02 18:29:53 <sukhe> port 80!?
2026-04-02 18:30:57 <topranks> yeah I'm not sure why it's firing... sort of seems ok?
2026-04-02 18:30:57 <jinxer-wm> RESOLVED: ProbeDown: Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2026-04-02 18:31:19 <topranks> https://phabricator.wikimedia.org/P90248
2026-04-02 18:31:30 <wikibugs> ('CR) ''Scott French: "Thanks for the review!" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
2026-04-02 18:31:31 <wikibugs> ('CR) ''Scott French: [C:''+2] fixtures: clean up reference to image-suggestion [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
2026-04-02 18:31:38 <sukhe> topranks: yeah it resolved. haven't looked very deeply at what happened but can't see anything obvious
2026-04-02 18:31:56 <moritzm> same here
2026-04-02 18:31:56 <topranks> I gotta say the probe dashboard is absolutely incomprehensible to me, any time I have to visit it
2026-04-02 18:32:09 <topranks> I don't see any signs of general connectivity issues
2026-04-02 18:32:25 <moritzm> and ipv6 only?
2026-04-02 18:32:30 <sukhe> seems so yeah
2026-04-02 18:33:06 <topranks> yeah, tbh that is further evidence it is just an outlier failed connection, for whatever reason
2026-04-02 18:33:08 <sukhe> topranks: yep. we should improve that. it defaults to "All"
2026-04-02 18:33:11 <wikibugs> ('Merged) ''jenkins-bot: fixtures: clean up reference to image-suggestion [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
2026-04-02 18:33:16 <topranks> rather than a systemic problem like everyone is failing to connect
2026-04-02 18:33:26 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Wikidata Platform Team, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11784038 (''gmodena) >>! In T422141#11782776, @MoritzMuehlenhoff wrote: > What kind of access is needed? root ac...'
2026-04-02 18:33:51 <moritzm> don't see any specific signs of user-visible impact from graphs
2026-04-02 18:34:21 <wikibugs> ('CR) ''SBassett: [C:''+1] Allow-list some additional domains to the currently enforcing CSP [puppet] - ''https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: ''WikiBayer)'
2026-04-02 18:34:21 <wikibugs> ('CR) ''Ssingh: "Thanks, I will merge if I can find a reviewer otherwise feel free to take it later." [puppet] - ''https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: ''Ssingh)'
2026-04-02 18:35:37 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Wikidata Platform Team, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11784042 (''MoritzMuehlenhoff) >>! In T422141#11784038, @gmodena wrote: >>>! In T422141#11782776, @MoritzMuehlen...'
2026-04-02 18:35:58 <wikibugs> ('CR) ''Reedy: [C:''+1] Allow-list some additional domains to the currently enforcing CSP [puppet] - ''https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: ''WikiBayer)'
2026-04-02 18:37:05 <wikibugs> ('CR) ''Ssingh: [C:''+1] "Two reviews by the sec team, merging." [puppet] - ''https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: ''WikiBayer)'
2026-04-02 18:37:06 <wikibugs> ('CR) ''Ssingh: [C:''+2] Allow-list some additional domains to the currently enforcing CSP [puppet] - ''https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: ''WikiBayer)'
2026-04-02 18:37:12 <Reedy> haha
2026-04-02 18:37:13 <Reedy> consensus!
2026-04-02 18:37:39 <sukhe> Reedy: who am I to say no to two +1s?!
2026-04-02 18:38:57 <wikibugs> ('CR) ''Muehlenhoff: [C:''+1] "LGTM syntax-wise" [puppet] - ''https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: ''Ssingh)'
2026-04-02 18:39:25 <topranks> https://prometheus-eqiad.wikimedia.org/ops/graph?g0.expr=probe_success%7Baddress%3D%222620%3A0%3A861%3Aed1a%3A%3A1%22%2C%20instance%3D%22text%3A80%22%7D%5B20m%5D&g0.tab=1&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h
2026-04-02 18:39:33 <topranks> I really don't understand why that fired, but anyway
2026-04-02 18:40:27 <sukhe> topranks: doesn't add up yep
2026-04-02 18:40:32 <sukhe> anyway nothing to do here I feel
2026-04-02 18:40:49 <topranks> yep enough other stuff to worry about
2026-04-02 18:40:58 <moritzm> yeah, this feels like a one time blip, and if it happens again, we can still correlate further
2026-04-02 18:41:21 <wikibugs> ('CR) ''Ssingh: [C:''+2] admin: update SSH key for ptiwary [puppet] - ''https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: ''Ssingh)'
2026-04-02 18:41:50 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Missing physical volume on an-worker1159 - https://phabricator.wikimedia.org/T419129#11784073 (''Jclark-ctr) a:''Jclark-ctr''BTullis'
2026-04-02 18:41:52 <wikibugs> ('CR) ''Alex.sanford: [C:''+1] Allow-list some additional domains to the currently enforcing CSP (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: ''WikiBayer)'
2026-04-02 18:42:19 <wikibugs> 'ops-eqiad, ''DC-Ops, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206#11784074 (''Jclark-ctr) a:''Jclark-ctr''BTullis'
2026-04-02 18:44:03 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Wikidata Platform Team, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11784075 (''gmodena) >>! In T422141#11784042, @MoritzMuehlenhoff wrote: > We don't have a specific access group...'
2026-04-02 18:44:32 <wikibugs> ('PS1) ''Ottomata: dse-k8s - add common dir for mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267152 (https://phabricator.wikimedia.org/T360794)'
2026-04-02 18:45:52 <logmsgbot> !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
2026-04-02 18:46:50 <wikibugs> ('CR) ''Ottomata: [C:''+2] dse-k8s - add common dir for mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267152 (https://phabricator.wikimedia.org/T360794) (owner: ''Ottomata)'
2026-04-02 18:49:09 <wikibugs> ('Merged) ''jenkins-bot: dse-k8s - add common dir for mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267152 (https://phabricator.wikimedia.org/T360794) (owner: ''Ottomata)'
2026-04-02 18:51:31 <logmsgbot> cmooney@cumin1003 netbox (PID 2341745) is awaiting input
2026-04-02 18:51:57 <logmsgbot> !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names for new lumen 100g transport - cmooney@cumin1003"
2026-04-02 18:51:58 <wikibugs> ('PS1) ''Reedy: Undeploy Extension:StopForumSpam [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1267157 (https://phabricator.wikimedia.org/T422185)'
2026-04-02 18:52:24 <logmsgbot> !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names for new lumen 100g transport - cmooney@cumin1003"
2026-04-02 18:52:24 <logmsgbot> !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2026-04-02 18:52:28 <wikibugs> ('PS1) ''Cathal Mooney: Add INCLUDE statement for 2620:0:861:fe03::/64 subnet [dns] - ''https://gerrit.wikimedia.org/r/1267158 (https://phabricator.wikimedia.org/T395878)'
2026-04-02 18:53:17 <wikibugs> ('CR) ''Ssingh: [C:''+1] Add INCLUDE statement for 2620:0:861:fe03::/64 subnet [dns] - ''https://gerrit.wikimedia.org/r/1267158 (https://phabricator.wikimedia.org/T395878) (owner: ''Cathal Mooney)'
2026-04-02 18:54:38 <wikibugs> ('CR) ''Cathal Mooney: [C:''+2] Add INCLUDE statement for 2620:0:861:fe03::/64 subnet [dns] - ''https://gerrit.wikimedia.org/r/1267158 (https://phabricator.wikimedia.org/T395878) (owner: ''Cathal Mooney)'
2026-04-02 18:54:48 <wikibugs> ('PS1) ''Ottomata: dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794)'
2026-04-02 18:54:56 <wikibugs> ('CR) ''CI reject: [V:''-1] dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794) (owner: ''Ottomata)'
2026-04-02 18:55:10 <logmsgbot> !log cmooney@dns2005 START - running authdns-update
2026-04-02 18:55:19 <wikibugs> ('PS2) ''Ottomata: dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794)'
2026-04-02 18:56:09 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784108 (''AWesterinen-WMF) Retried ... no change'
2026-04-02 18:56:34 <logmsgbot> !log cmooney@dns2005 END - running authdns-update
2026-04-02 18:56:53 <wikibugs> ('CR) ''Jforrester: [C:''+1] Undeploy Extension:StopForumSpam [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1267157 (https://phabricator.wikimedia.org/T422185) (owner: ''Reedy)'
2026-04-02 18:57:10 <wikibugs> ('CR) ''Ottomata: [C:''+2] dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794) (owner: ''Ottomata)'
2026-04-02 18:59:14 <wikibugs> ('Merged) ''jenkins-bot: dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794) (owner: ''Ottomata)'
2026-04-02 19:00:25 <wikibugs> ('CR) ''Dzahn: [C:''+2] ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config [puppet] - ''https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) (owner: ''Dzahn)'
2026-04-02 19:01:19 <logmsgbot> !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
2026-04-02 19:01:23 <logmsgbot> !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
2026-04-02 19:02:03 <wikibugs> ('PS3) ''Elukey: opensearch-semantic-search-test: Add to services proxy [puppet] - ''https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: ''Bking)'
2026-04-02 19:04:43 <wikibugs> ('CR) ''Scott French: "Thanks for the review!" [puppet] - ''https://gerrit.wikimedia.org/r/1198578 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
2026-04-02 19:04:47 <wikibugs> ('CR) ''Scott French: [C:''+2] service: remove image-suggestion [puppet] - ''https://gerrit.wikimedia.org/r/1198578 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
2026-04-02 19:06:31 <wikibugs> ('PS1) ''Cathal Mooney: Management routers: set autonomous system number [homer/public] - ''https://gerrit.wikimedia.org/r/1267170 (https://phabricator.wikimedia.org/T421238)'
2026-04-02 19:09:11 <logmsgbot> !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on releases2003.codfw.wmnet with reason: T418109
2026-04-02 19:09:14 <stashbot> T418109: Update Jenkins hosts from Java 17 to Java 21 - https://phabricator.wikimedia.org/T418109
2026-04-02 19:09:30 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784127 (''MoritzMuehlenhoff) You still need to request "wmf" at https://idm.wikimedia.org/permissions/, so far you only r...'
2026-04-02 19:12:13 <wikibugs> ('PS1) ''Dzahn: jenkins: add profile::ci::docker to role [puppet] - ''https://gerrit.wikimedia.org/r/1267173 (https://phabricator.wikimedia.org/T418109)'
2026-04-02 19:16:13 <wikibugs> 'SRE, ''Datacenter-Switchover: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11784146 (''Scott_French)'
2026-04-02 19:16:44 <wikibugs> ('PS1) ''Ottomata: mw-page-html-content-change-enrich - tune backfill in staging release (-next) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267175 (https://phabricator.wikimedia.org/T421216)'
2026-04-02 19:19:50 <wikibugs> ('CR) ''Ottomata: [C:''+2] mw-page-html-content-change-enrich - tune backfill in staging release (-next) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267175 (https://phabricator.wikimedia.org/T421216) (owner: ''Ottomata)'
2026-04-02 19:21:50 <wikibugs> ('Merged) ''jenkins-bot: mw-page-html-content-change-enrich - tune backfill in staging release (-next) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267175 (https://phabricator.wikimedia.org/T421216) (owner: ''Ottomata)'
2026-04-02 19:23:41 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784167 (''AWesterinen-WMF) I tried to do that, but see no option for wmf. Only "logstash", "airflow" and "spiderpig".'
2026-04-02 19:24:12 <logmsgbot> !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
2026-04-02 19:24:16 <logmsgbot> !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
2026-04-02 19:33:06 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11784179 (''ssingh) ''Open''Resolved a:''ssingh Should now be rolled out everywhere, let us know if you have any issues with access.'
2026-04-02 19:35:49 <wikibugs> ('PS1) ''Dduvall: zuul: Move cross-profile references to hiera [puppet] - ''https://gerrit.wikimedia.org/r/1267177 (https://phabricator.wikimedia.org/T422207)'
2026-04-02 19:35:51 <wikibugs> ('PS1) ''Dduvall: zuul: Fix nodepool zookeeper configuration [puppet] - ''https://gerrit.wikimedia.org/r/1267178 (https://phabricator.wikimedia.org/T422207)'
2026-04-02 19:45:21 <wikibugs> ('PS2) ''Dduvall: zuul: Move cross-profile references to hiera [puppet] - ''https://gerrit.wikimedia.org/r/1267177 (https://phabricator.wikimedia.org/T422207)'
2026-04-02 19:45:21 <wikibugs> ('PS2) ''Dduvall: zuul: Fix nodepool zookeeper configuration [puppet] - ''https://gerrit.wikimedia.org/r/1267178 (https://phabricator.wikimedia.org/T422207)'
2026-04-02 19:46:02 <wikibugs> ('CR) ''Dduvall: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1267177 (https://phabricator.wikimedia.org/T422207) (owner: ''Dduvall)'
2026-04-02 19:48:46 <jinxer-wm> FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards Has improved - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
2026-04-02 19:56:29 <logmsgbot> !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
2026-04-02 19:56:32 <logmsgbot> !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
2026-04-02 19:56:48 <logmsgbot> !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
2026-04-02 19:56:50 <logmsgbot> !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
2026-04-02 19:57:36 <nya_1F616EMO> Is anyone here waiting for the UTC late backport window? And are there any blockers to the window?
2026-04-02 19:57:46 <logmsgbot> !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
2026-04-02 19:57:48 <logmsgbot> !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
2026-04-02 20:00:05 <jouncebot> RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T2000)
2026-04-02 20:00:05 <jouncebot> nya_1F616EMO and bwang: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2026-04-02 20:00:15 <nya_1F616EMO> o/
2026-04-02 20:00:26 <bwang> I'm here~!
2026-04-02 20:00:49 <nya_1F616EMO> prays for a deployer to show up
2026-04-02 20:02:56 <logmsgbot> !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
2026-04-02 20:03:03 <logmsgbot> !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
2026-04-02 20:04:18 <wikibugs> ('PS4) ''Bking: opensearch-semantic-search-test: Add to services proxy [puppet] - ''https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293)'
2026-04-02 20:05:17 <wikibugs> ('CR) ''Bking: opensearch-semantic-search-test: Add to services proxy (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: ''Bking)'
2026-04-02 20:05:40 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2026-04-02 20:07:12 <wikibugs> ('PS1) ''Ottomata: mw-page-html-content-change-enrich-next - use kafka jumbo external services [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267189 (https://phabricator.wikimedia.org/T421216)'
2026-04-02 20:07:58 <wikibugs> ('CR) ''Ottomata: [C:''+2] mw-page-html-content-change-enrich-next - use kafka jumbo external services [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267189 (https://phabricator.wikimedia.org/T421216) (owner: ''Ottomata)'
2026-04-02 20:08:04 <wikibugs> ('CR) ''Ottomata: [V:''+2 C:''+2] mw-page-html-content-change-enrich-next - use kafka jumbo external services [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267189 (https://phabricator.wikimedia.org/T421216) (owner: ''Ottomata)'
2026-04-02 20:08:51 <nya_1F616EMO> It seems like we're out of luck?
2026-04-02 20:09:35 <logmsgbot> !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
2026-04-02 20:09:44 <logmsgbot> !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
2026-04-02 20:12:35 <wikibugs> 'ops-magru: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T419298#11784255 (''phaultfinder)'
2026-04-02 20:13:27 <wikibugs> ('PS5) ''Bking: opensearch-cluster: Add support for service mesh [deployment-charts] - ''https://gerrit.wikimedia.org/r/1260795 (https://phabricator.wikimedia.org/T421293)'
2026-04-02 20:13:27 <wikibugs> ('CR) ''Bking: "Thanks for the course correction! I think we have a path forward here; we've added envoy TLS termination in 1248865 and monitoring for the" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1260795 (https://phabricator.wikimedia.org/T421293) (owner: ''Bking)'
2026-04-02 20:13:43 <Kemayo> I'd offer to do it, but there was a big breakage of the ability to scap deploy things this morning, so it might be a good idea to have a real deployer present who could recover from an error if it happened.
2026-04-02 20:13:57 <wikibugs> ('Abandoned) ''Bking: opensearch-cluster: Add support for service mesh [deployment-charts] - ''https://gerrit.wikimedia.org/r/1260795 (https://phabricator.wikimedia.org/T421293) (owner: ''Bking)'
2026-04-02 20:15:00 <nya_1F616EMO> One of my patches is a time-specific logo update for zhwikinews, and one is a non-time-specific SecurePoll deployment to a private wiki. I may propose to the local community to use CSS for the logo change; do you recommend doing so?
2026-04-02 20:17:01 <Kemayo> Feels inconvenient to deal with, given all the various logo sizes involved.
2026-04-02 20:17:17 <nya_1F616EMO> You mean to deploy?
2026-04-02 20:17:38 <nya_1F616EMO> Currently working on the CSS solution
2026-04-02 20:17:45 <nya_1F616EMO> (cuz there are no deployments on Fridays, as we all know)
2026-04-02 20:17:52 <Kemayo> If you and bwang don't mind, I could certainly kick off a spiderpig build with all your patches. If it breaks in the same way as it did before, it'd just fail to deploy even to testservers rather than ruining production.
2026-04-02 20:18:08 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T421439#11784278 (''VRiley-WMF) This ticket seems like it relates to another ticket https://phabricator.wikimedia.org/T413559 @Eevans this server is out of warrenty, would you like us to replace this disk or leave it...'
2026-04-02 20:18:34 <Kemayo> There's just a *chance* that it'll wedge us into a state where a releng person needs to look at things before any deploys can happen. 😅
2026-04-02 20:19:12 <nya_1F616EMO> I won't send out my SecurePoll patch in this state anyway; it's up to you whether to accept the zhwikinews logo change.
2026-04-02 20:20:05 <Kemayo> I'm fine giving it a shot.
2026-04-02 20:20:10 <Kemayo> bwang: Want yours in as well?
2026-04-02 20:21:26 <nya_1F616EMO> Wait, I found something that might be off
2026-04-02 20:21:44 <nya_1F616EMO> Let me check my patch for resolutions
2026-04-02 20:22:04 <Kemayo> Just let me know when you're happy with it, and if bwang hasn't shown up by then I can do just-yours.
2026-04-02 20:22:13 <nya_1F616EMO> Ah nvm, the script did the job for me
2026-04-02 20:22:25 <nya_1F616EMO> It successfully reduced the resolution to 135x135, nice
2026-04-02 20:22:33 <nya_1F616EMO> so good to go
2026-04-02 20:22:49 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by kemayo@deploy1003 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1264569 (https://phabricator.wikimedia.org/T420165) (owner: ''1F616EMO)'
2026-04-02 20:24:24 <wikibugs> ('CR) ''Bking: [C:''+2] data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: ''Bking)'
2026-04-02 20:24:56 <wikibugs> ('CR) ''Bking: [C:''+2] "Ben is out for the next 10 days, so I'm going to be bold and merge after addressing his concerns." [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: ''Bking)'
2026-04-02 20:25:02 <wikibugs> ('CR) ''Bking: [V:''+2 C:''+2] data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: ''Bking)'
2026-04-02 20:25:19 <wikibugs> ('Merged) ''jenkins-bot: zhwikinews: 20th anniversary logo change [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1264569 (https://phabricator.wikimedia.org/T420165) (owner: ''1F616EMO)'
2026-04-02 20:25:37 <logmsgbot> !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1264569|zhwikinews: 20th anniversary logo change (T420165)]]
2026-04-02 20:25:40 <stashbot> T420165: Requesting temporary logo change for zh.wikinews.org - https://phabricator.wikimedia.org/T420165
2026-04-02 20:28:46 <bwang> Sorry I was in a call
2026-04-02 20:28:52 <bwang> I'm still here and able to help test the backport
2026-04-02 20:29:16 <wikibugs> ('PS2) ''Clare Ming: Update the Test Kitchen maintenance script to target testwiki [puppet] - ''https://gerrit.wikimedia.org/r/1265525 (https://phabricator.wikimedia.org/T422209)'
2026-04-02 20:29:22 <logmsgbot> !log kemayo@deploy1003 1f616emo, kemayo: Backport for [[gerrit:1264569|zhwikinews: 20th anniversary logo change (T420165)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2026-04-02 20:29:40 <Kemayo> nya_1F616EMO: Can you verify your change?
2026-04-02 20:29:44 <nya_1F616EMO> testing
2026-04-02 20:30:44 <nya_1F616EMO> it works, tested on vector-2022, vector, monobook, timeless.
2026-04-02 20:31:03 <Kemayo> I will continue the deploy, then.
2026-04-02 20:31:06 <nya_1F616EMO> Thanks
2026-04-02 20:31:11 <logmsgbot> !log kemayo@deploy1003 1f616emo, kemayo: Continuing with sync
2026-04-02 20:33:09 <wikibugs> ('CR) ''1F616EMO: "Will re-schedule in the Monday, May 04 UTC morning backport window, together with Ia1a463ba01452b76b73ff6b59b821eae9154ddf8." [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309) (owner: ''1F616EMO)'
2026-04-02 20:33:21 <wikibugs> ('PS1) ''1F616EMO: Revert "zhwikinews: 20th anniversary logo change" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1264575 (https://phabricator.wikimedia.org/T420165)'
2026-04-02 20:33:35 <wikibugs> ('CR) ''1F616EMO: "Will re-schedule in the Monday, May 04 UTC morning backport window, together with Iea2390c01600b5f93c7b01f5605d887541c74b02." [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1264575 (https://phabricator.wikimedia.org/T420165) (owner: ''1F616EMO)'
2026-04-02 20:33:52 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Wikidata Platform Team, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11784305 (''MoritzMuehlenhoff) >>! In T422141#11784075, @gmodena wrote: >>>! In T422141#11784042, @MoritzMuehlen...'
2026-04-02 20:35:37 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784306 (''MoritzMuehlenhoff) >>! In T420053#11784167, @AWesterinen-WMF wrote: > I tried to do that, but see no option for...'
2026-04-02 20:37:23 <logmsgbot> !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264569|zhwikinews: 20th anniversary logo change (T420165)]] (duration: 11m 46s)
2026-04-02 20:37:26 <stashbot> T420165: Requesting temporary logo change for zh.wikinews.org - https://phabricator.wikimedia.org/T420165
2026-04-02 20:37:34 <icinga-wm> PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 182040496 and 19 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 20:38:32 <icinga-wm> RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3815080 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
2026-04-02 20:39:36 <Kemayo> nya_1F616EMO: Okay, should be live now.
2026-04-02 20:40:01 <nya_1F616EMO> Nice and verified the changes through prod.
2026-04-02 20:40:04 <nya_1F616EMO> Thank you for your help
2026-04-02 20:40:33 <wikibugs> ('CR) ''Cathal Mooney: "Do we have stats for RE? Is it that much better to eqsin on average than drmrs? From the geography it's not clear to me." [dns] - ''https://gerrit.wikimedia.org/r/1267042 (owner: ''Ayounsi)'
2026-04-02 20:43:58 <Kemayo> nya_1F616EMO: np!
2026-04-02 20:47:18 <bwang> Hi, are we still able to backport my patch?
2026-04-02 20:47:55 <Kemayo> bwang: sure, I can get it if you're willing to stick around until it's done.
2026-04-02 20:48:11 <bwang> Yes of course
2026-04-02 20:48:29 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: ''Anne Tomasevich)'
2026-04-02 20:51:01 <wikibugs> 'ops-eqiad, ''SRE, ''SRE-swift-storage, ''Data-Persistence, ''DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11784343 (''VRiley-WMF) Hey @elukey Thanks for working on this! Is there anything I can do from my end to assist with this? Let us know...'
2026-04-02 20:51:48 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11784345 (''VRiley-WMF)'
2026-04-02 20:51:52 <wikibugs> ('Merged) ''jenkins-bot: Add logged-in reader retention instrument [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: ''Anne Tomasevich)'
2026-04-02 20:52:10 <logmsgbot> !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1267008|Add logged-in reader retention instrument (T420490)]]
2026-04-02 20:52:13 <stashbot> T420490: [Logged in reader retention baseline] Launch A/A experiment - https://phabricator.wikimedia.org/T420490
2026-04-02 20:52:24 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11784348 (''VRiley-WMF)'
2026-04-02 20:53:51 <logmsgbot> !log kemayo@deploy1003 annet, kemayo: Backport for [[gerrit:1267008|Add logged-in reader retention instrument (T420490)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2026-04-02 20:54:13 <Kemayo> bwang: let me know when it's tested
2026-04-02 20:56:36 <bwang> checking now
2026-04-02 20:57:02 <wikibugs> ('PS1) ''DLynch: SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267204'
2026-04-02 20:57:19 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267204 (owner: ''DLynch)'
2026-04-02 20:58:54 <bwang> Looks good
2026-04-02 20:59:09 <Kemayo> Continuing, then.
2026-04-02 20:59:12 <logmsgbot> !log kemayo@deploy1003 annet, kemayo: Continuing with sync
2026-04-02 21:00:05 <jouncebot> Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T2100)
2026-04-02 21:01:02 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784373 (''AWesterinen-WMF) Updated my email and requested wmf access. But, I have a further problem. I tried to ssh in...'
2026-04-02 21:01:16 <Jdlrobson> Kemayo: let me know when you are done. I have a deploy but I need 15m to prep
2026-04-02 21:01:46 <Kemayo> Jdlrobson: Sure, I just have one more patch to get out after this, so that should fit into your timing pretty okay.
2026-04-02 21:03:50 <logmsgbot> !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1267008|Add logged-in reader retention instrument (T420490)]] (duration: 11m 40s)
2026-04-02 21:03:54 <stashbot> T420490: [Logged in reader retention baseline] Launch A/A experiment - https://phabricator.wikimedia.org/T420490
2026-04-02 21:04:06 <Kemayo> bwang: Live now.
2026-04-02 21:04:16 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267204 (owner: ''DLynch)'
2026-04-02 21:08:09 <wikibugs> ('PS2) ''Jasmine: role::kubernetes::worker: add sophroid to the lvs pools [puppet] - ''https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748)'
2026-04-02 21:15:33 <wikibugs> ('Merged) ''jenkins-bot: SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267204 (owner: ''DLynch)'
2026-04-02 21:15:47 <logmsgbot> !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]]
2026-04-02 21:17:26 <logmsgbot> !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2026-04-02 21:18:36 <logmsgbot> !log kemayo@deploy1003 kemayo: Continuing with sync
2026-04-02 21:23:09 <wikibugs> ('PS3) ''Jasmine: role::kubernetes::worker: add sophroid to the lvs pools [puppet] - ''https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748)'
2026-04-02 21:23:42 <wikibugs> ('CR) ''CI reject: [V:''-1] role::kubernetes::worker: add sophroid to the lvs pools [puppet] - ''https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) (owner: ''Jasmine)'
2026-04-02 21:23:51 <wikibugs> ('PS4) ''Jasmine: role::kubernetes::worker: add sophroid to the lvs pools [puppet] - ''https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748)'
2026-04-02 21:26:03 <wikibugs> 'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11784439 (''Od1n) FWIW, I'm still currently encountering this error on frwiki, and it prevents my local custom JS/CSS files from loading. Unexpectedly not loaded: * `Special:Myp...'
2026-04-02 21:26:25 <logmsgbot> !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]]
2026-04-02 21:26:41 <Kemayo> Jdlrobson: Sorry, the k8s deploy failed, which is making everything *fun*.
2026-04-02 21:27:13 <Jdlrobson> no worries
2026-04-02 21:27:19 <Jdlrobson> I'm appreciating the extra testing time :)
2026-04-02 21:28:05 <logmsgbot> !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2026-04-02 21:28:34 <logmsgbot> !log kemayo@deploy1003 kemayo: Continuing with sync
2026-04-02 21:32:44 <logmsgbot> !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]] (duration: 06m 18s)
2026-04-02 21:32:57 <Kemayo> Jdlrobson: okay, all yours!
2026-04-02 21:35:22 <Jdlrobson> thanks!
2026-04-02 21:35:45 <wikibugs> ('PS1) ''Jdlrobson: Fix section heading spacing on mobile [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882)'
2026-04-02 21:36:53 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by jdlrobson@deploy1003 using scap backport" [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882) (owner: ''Jdlrobson)'
2026-04-02 21:48:25 <wikibugs> ('CR) ''CI reject: [V:''-1] Fix section heading spacing on mobile [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882) (owner: ''Jdlrobson)'
2026-04-02 21:48:31 <jinxer-wm> FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
2026-04-02 21:49:08 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by jdlrobson@deploy1003 using scap backport" [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882) (owner: ''Jdlrobson)'
2026-04-02 21:49:14 <Jdlrobson> Flakey Wikibase test :(
2026-04-02 21:50:31 <wikibugs> ('Merged) ''jenkins-bot: Fix section heading spacing on mobile [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882) (owner: ''Jdlrobson)'
2026-04-02 21:51:01 <wikibugs> ('CR) ''SBassett: [C:''+1] Undeploy Extension:StopForumSpam [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1267157 (https://phabricator.wikimedia.org/T422185) (owner: ''Reedy)'
2026-04-02 21:58:21 <logmsgbot> !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1267214|Fix section heading spacing on mobile (T414882)]]
2026-04-02 21:58:24 <stashbot> T414882: Additional top margin for Parsoid outputs for sections with no lead - https://phabricator.wikimedia.org/T414882
2026-04-02 22:00:02 <logmsgbot> !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1267214|Fix section heading spacing on mobile (T414882)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2026-04-02 22:01:42 <logmsgbot> !log jdlrobson@deploy1003 jdlrobson: Continuing with sync
2026-04-02 22:03:51 <jinxer-wm> FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2026-04-02 22:05:10 <jinxer-wm> FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
2026-04-02 22:05:39 <jinxer-wm> FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
2026-04-02 22:05:54 <logmsgbot> !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1267214|Fix section heading spacing on mobile (T414882)]] (duration: 07m 33s)
2026-04-02 22:05:57 <stashbot> T414882: Additional top margin for Parsoid outputs for sections with no lead - https://phabricator.wikimedia.org/T414882
2026-04-02 22:06:51 <Jdlrobson> All done.
2026-04-02 22:08:51 <jinxer-wm> FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2026-04-02 22:10:38 <wikibugs> 'SRE, ''ServiceOps new, ''Datacenter-Switchover: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11784520 (''Scott_French) Moving this into #serviceops_new, since we're probably the right team to figure out how this should b...'
2026-04-02 22:11:34 <wikibugs> ('PS1) ''Eevans: Use cassandra-dev2001-a (instance) for lambda [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267229 (https://phabricator.wikimedia.org/T414112)'
2026-04-02 22:17:35 <wikibugs> ('CR) ''Eevans: [C:''+2] Use cassandra-dev2001-a (instance) for lambda [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267229 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
2026-04-02 22:19:31 <wikibugs> ('Merged) ''jenkins-bot: Use cassandra-dev2001-a (instance) for lambda [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267229 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
2026-04-02 22:20:22 <logmsgbot> !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
2026-04-02 22:20:36 <logmsgbot> !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
2026-04-02 22:40:10 <jinxer-wm> RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
2026-04-02 22:40:39 <jinxer-wm> FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
2026-04-02 22:43:51 <jinxer-wm> RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2026-04-02 22:45:39 <jinxer-wm> RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
2026-04-02 22:59:29 <wikibugs> ('PS1) ''Eevans: Add egress rule for cassandra-dev2001-a:50051 [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267251 (https://phabricator.wikimedia.org/T414112)'
2026-04-02 23:02:03 <wikibugs> ('CR) ''Eevans: [C:''+2] Add egress rule for cassandra-dev2001-a:50051 [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267251 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
2026-04-02 23:03:31 <jinxer-wm> FIRING: [2x] Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
2026-04-02 23:04:01 <wikibugs> ('Merged) ''jenkins-bot: Add egress rule for cassandra-dev2001-a:50051 [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267251 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
2026-04-02 23:06:01 <logmsgbot> !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
2026-04-02 23:06:07 <logmsgbot> !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
2026-04-02 23:28:38 <zabe> jouncebot: nowandnext
2026-04-02 23:28:38 <jouncebot> No deployments scheduled for the next 6 hour(s) and 31 minute(s)
2026-04-02 23:28:38 <jouncebot> In 6 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260403T0600)
2026-04-02 23:34:22 <wikibugs> ('CR) ''Zabe: [C:''+2] Start reading from new file table in dewiki and fawiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1264110 (https://phabricator.wikimedia.org/T416548) (owner: ''Zabe)'
2026-04-02 23:35:16 <wikibugs> ('Merged) ''jenkins-bot: Start reading from new file table in dewiki and fawiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1264110 (https://phabricator.wikimedia.org/T416548) (owner: ''Zabe)'
2026-04-02 23:35:42 <logmsgbot> !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1264110|Start reading from new file table in dewiki and fawiki (T416548)]]
2026-04-02 23:35:45 <stashbot> T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
2026-04-02 23:37:19 <logmsgbot> !log zabe@deploy1003 zabe: Backport for [[gerrit:1264110|Start reading from new file table in dewiki and fawiki (T416548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2026-04-02 23:37:40 <logmsgbot> !log zabe@deploy1003 zabe: Continuing with sync
2026-04-02 23:38:23 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T421439#11784707 (''Eevans) >>! In T421439#11784276, @VRiley-WMF wrote: > This ticket seems like it relates to another ticket https://phabricator.wikimedia.org/T413559 > > @Eevans this server is out of warrenty, would...'
2026-04-02 23:38:31 <jinxer-wm> RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
2026-04-02 23:39:52 <wikibugs> ('PS1) ''TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - ''https://gerrit.wikimedia.org/r/1267280'
2026-04-02 23:39:52 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - ''https://gerrit.wikimedia.org/r/1267280 (owner: ''TrainBranchBot)'
2026-04-02 23:41:52 <logmsgbot> !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264110|Start reading from new file table in dewiki and fawiki (T416548)]] (duration: 06m 10s)
2026-04-02 23:41:55 <stashbot> T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
2026-04-02 23:51:27 <wikibugs> ('Merged) ''jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - ''https://gerrit.wikimedia.org/r/1267280 (owner: ''TrainBranchBot)'
2026-04-02 23:51:34 <logmsgbot> !log zabe@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
2026-04-02 23:52:58 <logmsgbot> !log zabe@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
