|
2026-04-02 00:01:06
|
<wikibugs>
|
('CR) ''Scott French: "Thanks, Raine!" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266250 (https://phabricator.wikimedia.org/T419274) (owner: ''Kamila Součková)'
|
|
2026-04-02 00:09:10
|
<wikibugs>
|
('CR) ''Scott French: "Thanks, Raine!" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1266264 (https://phabricator.wikimedia.org/T419049) (owner: ''Kamila Součková)'
|
|
2026-04-02 00:56:14
|
<logmsgbot>
|
!log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
|
|
2026-04-02 01:02:33
|
<icinga-wm>
|
PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 284378408 and 30 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 01:06:33
|
<icinga-wm>
|
RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7050408 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 01:06:35
|
<logmsgbot>
|
!log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
|
|
2026-04-02 01:08:23
|
<logmsgbot>
|
!log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
|
|
2026-04-02 01:09:22
|
<jinxer-wm>
|
FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
|
|
2026-04-02 01:11:46
|
<wikibugs>
|
('PS1) ''TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - ''https://gerrit.wikimedia.org/r/1266500'
|
|
2026-04-02 01:11:46
|
<wikibugs>
|
('CR) ''TrainBranchBot: [C:''+2] Branch commit for wmf/next [core] (wmf/next) - ''https://gerrit.wikimedia.org/r/1266500 (owner: ''TrainBranchBot)'
|
|
2026-04-02 01:18:44
|
<logmsgbot>
|
!log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
|
|
2026-04-02 01:19:48
|
<logmsgbot>
|
!log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
|
|
2026-04-02 01:24:09
|
<wikibugs>
|
('Merged) ''jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - ''https://gerrit.wikimedia.org/r/1266500 (owner: ''TrainBranchBot)'
|
|
2026-04-02 01:30:13
|
<logmsgbot>
|
!log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
|
|
2026-04-02 01:30:53
|
<logmsgbot>
|
!log jasmine@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
|
|
2026-04-02 01:41:15
|
<logmsgbot>
|
!log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
|
|
2026-04-02 01:51:17
|
<jinxer-wm>
|
FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
|
|
2026-04-02 01:54:29
|
<icinga-wm>
|
PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
|
|
2026-04-02 02:00:56
|
<logmsgbot>
|
!log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
|
|
2026-04-02 02:01:29
|
<icinga-wm>
|
RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
|
|
2026-04-02 02:06:11
|
<jinxer-wm>
|
FIRING: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun
|
|
2026-04-02 02:07:20
|
<logmsgbot>
|
!log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 23s)
|
|
2026-04-02 02:09:13
|
<jinxer-wm>
|
FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
|
|
2026-04-02 02:34:13
|
<jinxer-wm>
|
RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
|
|
2026-04-02 02:46:33
|
<icinga-wm>
|
PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 786199704 and 51 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 02:47:33
|
<icinga-wm>
|
RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 03:09:23
|
<jinxer-wm>
|
FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
|
|
2026-04-02 04:41:25
|
<jinxer-wm>
|
FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
|
2026-04-02 04:54:23
|
<jinxer-wm>
|
RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
|
|
2026-04-02 04:55:25
|
<jinxer-wm>
|
FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
|
2026-04-02 05:00:42
|
<wikibugs>
|
('CR) ''Giuseppe Lavagetto: [C:''+1] wmnet: remove image-suggestion k8s ingress CNAMEs [dns] - ''https://gerrit.wikimedia.org/r/1198584 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
|
|
2026-04-02 05:09:37
|
<jinxer-wm>
|
FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
|
|
2026-04-02 05:16:23
|
<jinxer-wm>
|
FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
|
|
2026-04-02 05:33:30
|
<jinxer-wm>
|
FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
|
|
2026-04-02 05:51:32
|
<jinxer-wm>
|
FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
|
|
2026-04-02 05:56:17
|
<jinxer-wm>
|
FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
|
|
2026-04-02 06:00:05
|
<jouncebot>
|
Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0600)
|
|
2026-04-02 06:00:05
|
<jouncebot>
|
marostegui, Amir1, and federico3: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0600).
|
|
2026-04-02 06:06:11
|
<jinxer-wm>
|
FIRING: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun
|
|
2026-04-02 06:10:25
|
<jinxer-wm>
|
FIRING: [2x] SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
|
2026-04-02 06:15:12
|
<wikibugs>
|
('CR) ''Muehlenhoff: [C:''+1] "The patch looks good, but I left a comment on the comment :-)" [puppet] - ''https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: ''Bking)'
|
|
2026-04-02 06:19:56
|
<wikibugs>
|
('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309) (owner: ''1F616EMO)'
|
|
2026-04-02 06:29:22
|
<wikibugs>
|
('PS2) ''1F616EMO: arbcom_zhwiki: Enable SecurePoll without PII rights [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309)'
|
|
2026-04-02 06:30:25
|
<jinxer-wm>
|
FIRING: [3x] SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
|
2026-04-02 06:52:10
|
<jinxer-wm>
|
FIRING: [2x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
|
|
2026-04-02 06:56:17
|
<jinxer-wm>
|
FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
|
|
2026-04-02 06:57:10
|
<jinxer-wm>
|
RESOLVED: [2x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
|
|
2026-04-02 07:00:05
|
<jouncebot>
|
Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0700).
|
|
2026-04-02 07:00:05
|
<jouncebot>
|
georgekyz: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
|
|
2026-04-02 07:00:24
|
<georgekyz>
|
Good morning folks!
|
|
2026-04-02 07:00:59
|
<georgekyz>
|
I am planning to deploy my patch now, is anybody around?
|
|
2026-04-02 07:03:22
|
<georgekyz>
|
I'm running it.
|
|
2026-04-02 07:03:34
|
<wikibugs>
|
('CR) ''TrainBranchBot: [C:''+2] "Approved by gkyziridis@deploy1003 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1266228 (https://phabricator.wikimedia.org/T415892) (owner: ''Gkyziridis)'
|
|
2026-04-02 07:04:26
|
<wikibugs>
|
('Merged) ''jenkins-bot: EventStreamConfig: Add rr-multilingual prediction_change stream [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1266228 (https://phabricator.wikimedia.org/T415892) (owner: ''Gkyziridis)'
|
|
2026-04-02 07:05:19
|
<logmsgbot>
|
!log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1266228|EventStreamConfig: Add rr-multilingual prediction_change stream (T415892)]]
|
|
2026-04-02 07:05:22
|
<stashbot>
|
T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change - https://phabricator.wikimedia.org/T415892
|
|
2026-04-02 07:07:35
|
<logmsgbot>
|
!log gkyziridis@deploy1003 gkyziridis: Backport for [[gerrit:1266228|EventStreamConfig: Add rr-multilingual prediction_change stream (T415892)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
|
|
2026-04-02 07:08:03
|
<logmsgbot>
|
!log gkyziridis@deploy1003 gkyziridis: Continuing with sync
|
|
2026-04-02 07:08:16
|
<georgekyz>
|
syncing
|
|
2026-04-02 07:08:42
|
<wikibugs>
|
'SRE, ''Traffic: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11780898 (''MoritzMuehlenhoff) p:''Triage→''Medium'
|
|
2026-04-02 07:08:49
|
<icinga-wm>
|
PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
|
2026-04-02 07:12:19
|
<logmsgbot>
|
!log gkyziridis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266228|EventStreamConfig: Add rr-multilingual prediction_change stream (T415892)]] (duration: 07m 00s)
|
|
2026-04-02 07:12:23
|
<stashbot>
|
T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change - https://phabricator.wikimedia.org/T415892
|
|
2026-04-02 07:12:53
|
<georgekyz>
|
the deployment finished successfully!
|
|
2026-04-02 07:13:09
|
<wikibugs>
|
'SRE, ''LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11780904 (''MoritzMuehlenhoff) Was this linked in some onboarding doc that you followed? If so, it can be removed for now. We're currently reworking 2FA support in CAS and the originally...'
|
|
2026-04-02 07:13:58
|
<wikibugs>
|
('CR) ''Gkyziridis: [C:''+2] ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: ''Gkyziridis)'
|
|
2026-04-02 07:16:01
|
<wikibugs>
|
('Merged) ''jenkins-bot: ml-services: Deploy rr-multilingual gpu model and eventstream in prod. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266212 (https://phabricator.wikimedia.org/T415892) (owner: ''Gkyziridis)'
|
|
2026-04-02 07:20:49
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11780907 (''MoritzMuehlenhoff) Since Andrea is working as a contractor the tracking entry in data.yaml should use the The t...'
|
|
2026-04-02 07:22:25
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests: Requesting access to superset dashboard for mpostoronca - https://phabricator.wikimedia.org/T421471#11780912 (''MoritzMuehlenhoff) ''In progress→''Resolved a:''hnowlan @MPostoronca-WMF Your access is enabled, so I'm marking this as resolved. If you run into any issues,...'
|
|
2026-04-02 07:24:57
|
<logmsgbot>
|
!log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
|
|
2026-04-02 07:25:06
|
<logmsgbot>
|
!log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
|
|
2026-04-02 07:27:56
|
<wikibugs>
|
('PS1) ''Jaime Nuche: ApiAuthManagerHelper: Accept fields with undefined label [core] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1266861 (https://phabricator.wikimedia.org/T422027)'
|
|
2026-04-02 07:29:00
|
<wikibugs>
|
('CR) ''TrainBranchBot: [C:''+2] "Approved by jnuche@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1266861 (https://phabricator.wikimedia.org/T422027) (owner: ''Jaime Nuche)'
|
|
2026-04-02 07:30:33
|
<logmsgbot>
|
!log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 64049
|
|
2026-04-02 07:32:13
|
<logmsgbot>
|
!log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 64049
|
|
2026-04-02 07:38:00
|
<logmsgbot>
|
!log ryankemper@deploy1003 Started deploy [wdqs/wdqs@fea7794]: deploy to freshly reimaged wdqs host
|
|
2026-04-02 07:38:05
|
<logmsgbot>
|
!log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@fea7794]: deploy to freshly reimaged wdqs host (duration: 00m 05s)
|
|
2026-04-02 07:38:07
|
<moritzm>
|
!log purge prometheus-nginx-exporter from url downloaders, remnants of early hCaptcha rollout
|
|
2026-04-02 07:38:08
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
|
2026-04-02 07:40:36
|
<wikibugs>
|
('PS1) ''Mszwarc: Disable external link analysis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1266866 (https://phabricator.wikimedia.org/T419837)'
|
|
2026-04-02 07:40:42
|
<wikibugs>
|
('Merged) ''jenkins-bot: ApiAuthManagerHelper: Accept fields with undefined label [core] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1266861 (https://phabricator.wikimedia.org/T422027) (owner: ''Jaime Nuche)'
|
|
2026-04-02 07:41:06
|
<logmsgbot>
|
!log jnuche@deploy1003 Started scap sync-world: Backport for [[gerrit:1266861|ApiAuthManagerHelper: Accept fields with undefined label (T422027)]]
|
|
2026-04-02 07:41:09
|
<stashbot>
|
T422027: TypeError: MediaWiki\Api\ApiAuthManagerHelper::formatMessage(): Argument #3 ($message) must be of type MediaWiki\Message\Message, null given - https://phabricator.wikimedia.org/T422027
|
|
2026-04-02 07:41:17
|
<jinxer-wm>
|
FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
|
|
2026-04-02 07:42:21
|
<Msz2001>
|
I'll deploy a config change if there's nothing going on
|
|
2026-04-02 07:42:42
|
<Msz2001>
|
(I see it is, I'll wit)
|
|
2026-04-02 07:42:45
|
<Msz2001>
|
wait*
|
|
2026-04-02 07:43:08
|
<logmsgbot>
|
!log jnuche@deploy1003 jnuche: Backport for [[gerrit:1266861|ApiAuthManagerHelper: Accept fields with undefined label (T422027)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
|
|
2026-04-02 07:43:33
|
<logmsgbot>
|
!log jnuche@deploy1003 jnuche: Continuing with sync
|
|
2026-04-02 07:46:17
|
<jinxer-wm>
|
FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
|
|
2026-04-02 07:46:54
|
<wikibugs>
|
('CR) ''Kosta Harlan: [C:''+1] Disable external link analysis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1266866 (https://phabricator.wikimedia.org/T419837) (owner: ''Mszwarc)'
|
|
2026-04-02 07:47:40
|
<logmsgbot>
|
!log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (, T421714) xfer wdqs-all from wdqs2016.codfw.wmnet -> wdqs1027.eqiad.wmnet, repooling both afterwards
|
|
2026-04-02 07:47:44
|
<stashbot>
|
T421714: Data platform: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421714
|
|
2026-04-02 07:47:55
|
<logmsgbot>
|
!log jnuche@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266861|ApiAuthManagerHelper: Accept fields with undefined label (T422027)]] (duration: 06m 39s)
|
|
2026-04-02 07:47:58
|
<stashbot>
|
T422027: TypeError: MediaWiki\Api\ApiAuthManagerHelper::formatMessage(): Argument #3 ($message) must be of type MediaWiki\Message\Message, null given - https://phabricator.wikimedia.org/T422027
|
|
2026-04-02 07:48:54
|
<wikibugs>
|
'SRE, ''DC-Ops, ''Infrastructure-Foundations, ''netops, ''Sustainability (Incident Followup): ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11780961 (''ayounsi)'
|
|
2026-04-02 07:49:22
|
<wikibugs>
|
('CR) ''TrainBranchBot: [C:''+2] "Approved by mszwarc@deploy1003 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1266866 (https://phabricator.wikimedia.org/T419837) (owner: ''Mszwarc)'
|
|
2026-04-02 07:50:16
|
<wikibugs>
|
('Merged) ''jenkins-bot: Disable external link analysis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1266866 (https://phabricator.wikimedia.org/T419837) (owner: ''Mszwarc)'
|
|
2026-04-02 07:50:17
|
<wikibugs>
|
('PS1) ''Kevin Bazira: ml-services: enable multi-GPU setup using SHM to improve gpt isvc performance [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266905 (https://phabricator.wikimedia.org/T418350)'
|
|
2026-04-02 07:50:40
|
<logmsgbot>
|
!log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1266866|Disable external link analysis (T419837)]]
|
|
2026-04-02 07:50:43
|
<stashbot>
|
T419837: Temporary measurement of outbound citation link clicks - https://phabricator.wikimedia.org/T419837
|
|
2026-04-02 07:50:56
|
<jinxer-wm>
|
RESOLVED: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun
|
|
2026-04-02 07:51:23
|
<icinga-wm>
|
PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
|
|
2026-04-02 07:52:23
|
<icinga-wm>
|
PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
|
|
2026-04-02 07:52:40
|
<logmsgbot>
|
!log mszwarc@deploy1003 mszwarc: Backport for [[gerrit:1266866|Disable external link analysis (T419837)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
|
|
2026-04-02 07:53:58
|
<wikibugs>
|
('CR) ''Muehlenhoff: [C:''+2] Failover URL downloaders [dns] - ''https://gerrit.wikimedia.org/r/1266242 (owner: ''Muehlenhoff)'
|
|
2026-04-02 07:54:14
|
<logmsgbot>
|
!log jmm@dns1004 START - running authdns-update
|
|
2026-04-02 07:55:55
|
<logmsgbot>
|
!log jmm@dns1004 END - running authdns-update
|
|
2026-04-02 07:56:39
|
<logmsgbot>
|
!log mszwarc@deploy1003 mszwarc: Continuing with sync
|
|
2026-04-02 07:58:49
|
<icinga-wm>
|
RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
|
2026-04-02 08:00:05
|
<jouncebot>
|
jnuche and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T0800)
|
|
2026-04-02 08:00:53
|
<logmsgbot>
|
!log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1266866|Disable external link analysis (T419837)]] (duration: 10m 13s)
|
|
2026-04-02 08:00:57
|
<stashbot>
|
T419837: Temporary measurement of outbound citation link clicks - https://phabricator.wikimedia.org/T419837
|
|
2026-04-02 08:01:15
|
<jnuche>
|
morning, I will begin the train shortly
|
|
2026-04-02 08:01:58
|
<wikibugs>
|
('PS1) ''Arnaudb: apt-staging: error handling for restricted projects [puppet] - ''https://gerrit.wikimedia.org/r/1266920 (https://phabricator.wikimedia.org/T422070)'
|
|
2026-04-02 08:02:03
|
<wikibugs>
|
('CR) ''Arnaudb: [C:''+2] apt-staging: error handling for restricted projects [puppet] - ''https://gerrit.wikimedia.org/r/1266920 (https://phabricator.wikimedia.org/T422070) (owner: ''Arnaudb)'
|
|
2026-04-02 08:03:25
|
<wikibugs>
|
('PS1) ''TrainBranchBot: group2 to 1.46.0-wmf.22 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1266924 (https://phabricator.wikimedia.org/T420480)'
|
|
2026-04-02 08:03:28
|
<wikibugs>
|
('CR) ''TrainBranchBot: [C:''+2] "Initiated by jnuche@deploy1003" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1266924 (https://phabricator.wikimedia.org/T420480) (owner: ''TrainBranchBot)'
|
|
2026-04-02 08:04:19
|
<wikibugs>
|
('Merged) ''jenkins-bot: group2 to 1.46.0-wmf.22 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1266924 (https://phabricator.wikimedia.org/T420480) (owner: ''TrainBranchBot)'
|
|
2026-04-02 08:07:49
|
<wikibugs>
|
('CR) ''Ozge: [C:''+1] ml-services: enable multi-GPU setup using SHM to improve gpt isvc performance [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266905 (https://phabricator.wikimedia.org/T418350) (owner: ''Kevin Bazira)'
|
|
2026-04-02 08:08:49
|
<icinga-wm>
|
PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
|
2026-04-02 08:10:28
|
<logmsgbot>
|
!log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.22 refs T420480
|
|
2026-04-02 08:10:31
|
<stashbot>
|
T420480: 1.46.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T420480
|
|
2026-04-02 08:11:03
|
<wikibugs>
|
('CR) ''Kevin Bazira: [C:''+2] ml-services: enable multi-GPU setup using SHM to improve gpt isvc performance [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266905 (https://phabricator.wikimedia.org/T418350) (owner: ''Kevin Bazira)'
|
|
2026-04-02 08:11:59
|
<wikibugs>
|
('PS1) ''Muehlenhoff: Update email record for andreawest [puppet] - ''https://gerrit.wikimedia.org/r/1266931 (https://phabricator.wikimedia.org/T420053)'
|
|
2026-04-02 08:12:45
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Data-Platform-SRE (2026-03-27 - 2026-04-17), ''Patch-For-Review: Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11781036 (''MoritzMuehlenhoff) >>! In T420053#11778139, @AWesterinen wrote: > I still have the error,...'
|
|
2026-04-02 08:13:10
|
<wikibugs>
|
('Merged) ''jenkins-bot: ml-services: enable multi-GPU setup using SHM to improve gpt isvc performance [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266905 (https://phabricator.wikimedia.org/T418350) (owner: ''Kevin Bazira)'
|
|
2026-04-02 08:14:38
|
<wikibugs>
|
('PS4) ''Volans: webproxies: allow cloudcumin to openstack [puppet] - ''https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360)'
|
|
2026-04-02 08:14:38
|
<wikibugs>
|
('CR) ''Volans: "PCC available at:" [puppet] - ''https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
|
|
2026-04-02 08:16:16
|
<wikibugs>
|
'ops-eqiad, ''DBA, ''DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111 (''FCeratto-WMF) ''NEW'
|
|
2026-04-02 08:16:17
|
<jinxer-wm>
|
FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
|
|
2026-04-02 08:16:24
|
<wikibugs>
|
('PS1) ''Brouberol: deployment_server: monitor the expirty of the internal opensearch TLS certs [puppet] - ''https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)'
|
|
2026-04-02 08:16:49
|
<wikibugs>
|
('PS2) ''Brouberol: deployment_server: monitor the expirty of the internal opensearch TLS certs [puppet] - ''https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)'
|
|
2026-04-02 08:17:10
|
<logmsgbot>
|
!log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
|
|
2026-04-02 08:17:56
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [C:''+1] "LGTM, nice!" [puppet] - ''https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
|
|
2026-04-02 08:18:38
|
<wikibugs>
|
('PS1) ''Arnaudb: aptrepo: add an alert for failed prepare [alerts] - ''https://gerrit.wikimedia.org/r/1266932 (https://phabricator.wikimedia.org/T422070)'
|
|
2026-04-02 08:18:41
|
<wikibugs>
|
('CR) ''Arnaudb: [C:''+2] aptrepo: add an alert for failed prepare [alerts] - ''https://gerrit.wikimedia.org/r/1266932 (https://phabricator.wikimedia.org/T422070) (owner: ''Arnaudb)'
|
|
2026-04-02 08:19:02
|
<wikibugs>
|
('CR) ''CI reject: [V:''-1] deployment_server: monitor the expirty of the internal opensearch TLS certs [puppet] - ''https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: ''Brouberol)'
|
|
2026-04-02 08:19:21
|
<wikibugs>
|
('PS3) ''Brouberol: deployment_server: monitor the expirty of the internal opensearch TLS certs [puppet] - ''https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)'
|
|
2026-04-02 08:19:38
|
<wikibugs>
|
('PS4) ''Brouberol: deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - ''https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)'
|
|
2026-04-02 08:20:00
|
<wikibugs>
|
('Merged) ''jenkins-bot: aptrepo: add an alert for failed prepare [alerts] - ''https://gerrit.wikimedia.org/r/1266932 (https://phabricator.wikimedia.org/T422070) (owner: ''Arnaudb)'
|
|
2026-04-02 08:20:57
|
<wikibugs>
|
('CR) ''Ayounsi: [C:''+1] "lgtm, pcc looks good too, to be carefully rolled out/tested." [puppet] - ''https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
|
|
2026-04-02 08:21:07
|
<wikibugs>
|
('PS5) ''Brouberol: deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - ''https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)'
|
|
2026-04-02 08:23:15
|
<wikibugs>
|
('CR) ''CI reject: [V:''-1] deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - ''https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: ''Brouberol)'
|
|
2026-04-02 08:24:10
|
<wikibugs>
|
('CR) ''Brouberol: [V:''+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8368/co" [puppet] - ''https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: ''Brouberol)'
|
|
2026-04-02 08:24:15
|
<wikibugs>
|
('PS6) ''Brouberol: deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - ''https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175)'
|
|
2026-04-02 08:30:22
|
<volans>
|
!log briefly disabling puppet on P:installserver::proxy to deploy g/1266885
|
|
2026-04-02 08:30:23
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
|
2026-04-02 08:31:21
|
<wikibugs>
|
('CR) ''Volans: [C:''+2] webproxies: allow cloudcumin to openstack [puppet] - ''https://gerrit.wikimedia.org/r/1266885 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
|
|
2026-04-02 08:33:26
|
<wikibugs>
|
('CR) ''Btullis: [C:''+1] "Nice, thanks." [puppet] - ''https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: ''Brouberol)'
|
|
2026-04-02 08:40:18
|
<wikibugs>
|
('CR) ''Brouberol: [C:''+2] deployment_server: monitor the expiry of the internal opensearch TLS certs [puppet] - ''https://gerrit.wikimedia.org/r/1266935 (https://phabricator.wikimedia.org/T418175) (owner: ''Brouberol)'
|
|
2026-04-02 08:40:45
|
<XioNoX>
|
slyngs, effie, I'm going to reboot mr1-esams for a software upgrade, it will go down for up to 20min, device itself is downtimed, but there might be some alerting noise from esams mgmt being unreachable
|
|
2026-04-02 08:41:15
|
<jinxer-wm>
|
FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
|
|
2026-04-02 08:41:17
|
<jinxer-wm>
|
FIRING: [3x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
|
|
2026-04-02 08:41:23
|
<jinxer-wm>
|
RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
|
|
2026-04-02 08:41:40
|
<jinxer-wm>
|
FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
|
2026-04-02 08:42:00
|
<XioNoX>
|
!log reboot mr1-esams - T416450
|
|
2026-04-02 08:42:03
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
|
2026-04-02 08:42:04
|
<stashbot>
|
T416450: esams: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416450
|
|
2026-04-02 08:42:36
|
<effie>
|
XioNoX: thank you, break a leg
|
|
2026-04-02 08:43:59
|
<icinga-wm>
|
PROBLEM - ps1-by27-esams-infeed-load-tower-B-single-phase on ps1-by27-esams is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
|
|
2026-04-02 08:44:20
|
<wikibugs>
|
'SRE, ''LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781126 (''atsuko) Thanks, I'll update the onboarding.'
|
|
2026-04-02 08:44:32
|
<logmsgbot>
|
!log dpogorzelski@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: sync
|
|
2026-04-02 08:44:42
|
<wikibugs>
|
'SRE, ''LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781127 (''atsuko) a:''atsuko'
|
|
2026-04-02 08:44:45
|
<logmsgbot>
|
!log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance
|
|
2026-04-02 08:44:53
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2149 (T419635)', diff saved to https://phabricator.wikimedia.org/P90206 and previous config saved to /var/cache/conftool/dbconfig/20260402-084452-fceratto.json
|
|
2026-04-02 08:44:56
|
<stashbot>
|
T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
|
|
2026-04-02 08:45:07
|
<logmsgbot>
|
!log dpogorzelski@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
|
|
2026-04-02 08:45:23
|
<icinga-wm>
|
PROBLEM - Host ps1-by27-esams is DOWN: PING CRITICAL - Packet loss = 100%
|
|
2026-04-02 08:45:23
|
<icinga-wm>
|
PROBLEM - Host ps1-bw27-esams is DOWN: PING CRITICAL - Packet loss = 100%
|
|
2026-04-02 08:45:32
|
<logmsgbot>
|
!log dpogorzelski@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: sync
|
|
2026-04-02 08:45:39
|
<jinxer-wm>
|
FIRING: [2x] CoreBGPDown: Core BGP session down between asw1-by27-esams and mr1-esams (10.80.127.5) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=asw1-by27-esams:9804&var-bgp_group=Management&var-bgp_neighbor=mr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDow
|
|
2026-04-02 08:46:09
|
<logmsgbot>
|
!log dpogorzelski@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
|
|
2026-04-02 08:46:15
|
<jinxer-wm>
|
FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
|
|
2026-04-02 08:46:17
|
<wikibugs>
|
'SRE, ''LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781130 (''atsuko)'
|
|
2026-04-02 08:47:08
|
<wikibugs>
|
('PS1) ''Gkyziridis: ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266941'
|
|
2026-04-02 08:47:29
|
<wikibugs>
|
'SRE, ''LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781133 (''atsuko)'
|
|
2026-04-02 08:47:50
|
<wikibugs>
|
'SRE, ''LDAP-Access-Requests: Enable SSO MFA YubiKey authentication for atsuko - https://phabricator.wikimedia.org/T422026#11781135 (''atsuko) ''Open→''Declined'
|
|
2026-04-02 08:49:13
|
<jinxer-wm>
|
FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
|
|
2026-04-02 08:49:49
|
<wikibugs>
|
('CR) ''Ilias Sarantopoulos: [C:''+1] ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266941 (owner: ''Gkyziridis)'
|
|
2026-04-02 08:49:54
|
<moritzm>
|
!log added Atsuko to the cn=ops LDAP group T421860
|
|
2026-04-02 08:49:57
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
|
2026-04-02 08:49:58
|
<stashbot>
|
T421860: Requesting shell access and membership of the ops group for atsuko - https://phabricator.wikimedia.org/T421860
|
|
2026-04-02 08:50:23
|
<jinxer-wm>
|
FIRING: GnmiTargetDown: asw1-bw27-esams is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
|
|
2026-04-02 08:50:39
|
<jinxer-wm>
|
RESOLVED: [2x] CoreBGPDown: Core BGP session down between asw1-by27-esams and mr1-esams (10.80.127.5) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=asw1-by27-esams:9804&var-bgp_group=Management&var-bgp_neighbor=mr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPD
|
|
2026-04-02 08:50:47
|
<icinga-wm>
|
RECOVERY - Host ps1-by27-esams is UP: PING OK - Packet loss = 0%, RTA = 81.26 ms
|
|
2026-04-02 08:50:47
|
<icinga-wm>
|
RECOVERY - Host ps1-bw27-esams is UP: PING OK - Packet loss = 0%, RTA = 81.25 ms
|
|
2026-04-02 08:51:15
|
<jinxer-wm>
|
FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
|
|
2026-04-02 08:51:27
|
<wikibugs>
|
('CR) ''Dpogorzelski: [C:''+1] ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266941 (owner: ''Gkyziridis)'
|
|
2026-04-02 08:51:32
|
<XioNoX>
|
router is back up - 10min downtime
|
|
2026-04-02 08:52:15
|
<wikibugs>
|
('CR) ''Gkyziridis: [C:''+2] ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266941 (owner: ''Gkyziridis)'
|
|
2026-04-02 08:53:34
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests: Requesting shell access and membership of the ops group for atsuko - https://phabricator.wikimedia.org/T421860#11781141 (''MoritzMuehlenhoff) ''Open→''Resolved a:''MoritzMuehlenhoff @atsuko Your SSH access should now be working. You can e.g. try to connect to cumin1003.e...'
|
|
2026-04-02 08:54:13
|
<wikibugs>
|
('Merged) ''jenkins-bot: ml-serices: Remove the gpu from revertrisk-multilingual model and add more cpu power. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266941 (owner: ''Gkyziridis)'
|
|
2026-04-02 08:54:13
|
<jinxer-wm>
|
RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
|
|
2026-04-02 08:55:23
|
<jinxer-wm>
|
RESOLVED: GnmiTargetDown: asw1-bw27-esams is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
|
|
2026-04-02 08:55:27
|
<logmsgbot>
|
!log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
|
|
2026-04-02 08:55:41
|
<logmsgbot>
|
!log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
|
|
2026-04-02 08:56:15
|
<jinxer-wm>
|
RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
|
|
2026-04-02 08:57:47
|
<wikibugs>
|
('CR) ''Muehlenhoff: [C:''+2] Update Cumin alias for contint to also cover the spun-off Trixie role [puppet] - ''https://gerrit.wikimedia.org/r/1266215 (owner: ''Muehlenhoff)'
|
|
2026-04-02 08:58:49
|
<icinga-wm>
|
RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
|
2026-04-02 09:08:30
|
<jinxer-wm>
|
FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards Has been acknowledged - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
|
|
2026-04-02 09:12:31
|
<wikibugs>
|
('PS1) ''Klausman: admin-ng: Allow ML/exp users to use describe verb on nodes [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266947'
|
|
2026-04-02 09:17:43
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T419635)', diff saved to https://phabricator.wikimedia.org/P90207 and previous config saved to /var/cache/conftool/dbconfig/20260402-091743-fceratto.json
|
|
2026-04-02 09:17:47
|
<stashbot>
|
T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
|
|
2026-04-02 09:19:48
|
<moritzm>
|
!log upgrading Envoy on the config-master servers to 1.35.9 T419637 T410975
|
|
2026-04-02 09:19:57
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
|
2026-04-02 09:19:58
|
<stashbot>
|
T419637: Upgrade Envoy to v1.35.9 - https://phabricator.wikimedia.org/T419637
|
|
2026-04-02 09:19:59
|
<stashbot>
|
T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975
|
|
2026-04-02 09:21:37
|
<wikibugs>
|
('PS1) ''Gkyziridis: ml-services: Revert the changes and the model version into the previous state. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266948'
|
|
2026-04-02 09:23:16
|
<wikibugs>
|
('CR) ''Muehlenhoff: [C:''+1] "LGTM" [software/bitu] - ''https://gerrit.wikimedia.org/r/1265258 (owner: ''Slyngshede)'
|
|
2026-04-02 09:23:51
|
<wikibugs>
|
('CR) ''Gkyziridis: [C:''+2] ml-services: Revert the changes and the model version into the previous state. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266948 (owner: ''Gkyziridis)'
|
|
2026-04-02 09:25:57
|
<wikibugs>
|
('PS1) ''Volans: Add missing includes from Netbox exported data [dns] - ''https://gerrit.wikimedia.org/r/1266952 (https://phabricator.wikimedia.org/T422115)'
|
|
2026-04-02 09:26:07
|
<wikibugs>
|
('Merged) ''jenkins-bot: ml-services: Revert the changes and the model version into the previous state. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266948 (owner: ''Gkyziridis)'
|
|
2026-04-02 09:27:36
|
<logmsgbot>
|
!log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
|
|
2026-04-02 09:27:42
|
<logmsgbot>
|
!log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
|
|
2026-04-02 09:27:52
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P90208 and previous config saved to /var/cache/conftool/dbconfig/20260402-092751-fceratto.json
|
|
2026-04-02 09:28:30
|
<jinxer-wm>
|
RESOLVED: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards Has been acknowledged - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
|
|
2026-04-02 09:29:31
|
<logmsgbot>
|
!log jmm@cumin2002 START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master-codfw
|
|
2026-04-02 09:29:35
|
<logmsgbot>
|
!log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors
|
|
2026-04-02 09:29:39
|
<logmsgbot>
|
!log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors
|
|
2026-04-02 09:30:35
|
<wikibugs>
|
('PS4) ''Arnaudb: gerrit: fix Envoy idle timeout handling for slow HTTPS git requests [puppet] - ''https://gerrit.wikimedia.org/r/1266950 (https://phabricator.wikimedia.org/T420909)'
|
|
2026-04-02 09:33:23
|
<wikibugs>
|
('PS1) ''Arnaudb: gerrit: update sshd timeouts [puppet] - ''https://gerrit.wikimedia.org/r/1266149 (https://phabricator.wikimedia.org/T417996)'
|
|
2026-04-02 09:33:45
|
<logmsgbot>
|
!log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master-codfw
|
|
2026-04-02 09:33:47
|
<wikibugs>
|
('Abandoned) ''Arnaudb: gerrit: update timeouts for gitiles [puppet] - ''https://gerrit.wikimedia.org/r/1265448 (https://phabricator.wikimedia.org/T421904) (owner: ''Arnaudb)'
|
|
2026-04-02 09:37:53
|
<wikibugs>
|
('CR) ''Muehlenhoff: [C:''+2] Obsolete airflow-search-admins POSIX group [puppet] - ''https://gerrit.wikimedia.org/r/1242407 (owner: ''Muehlenhoff)'
|
|
2026-04-02 09:38:00
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P90209 and previous config saved to /var/cache/conftool/dbconfig/20260402-093759-fceratto.json
|
|
2026-04-02 09:39:25
|
<wikibugs>
|
('PS5) ''Arnaudb: gerrit: fix Envoy idle timeout handling for slow HTTPS git requests [puppet] - ''https://gerrit.wikimedia.org/r/1266950 (https://phabricator.wikimedia.org/T420909)'
|
|
2026-04-02 09:39:29
|
<wikibugs>
|
('CR) ''Effie Mouzeli: [C:''+1] image-suggestion: remove service configuration [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
|
|
2026-04-02 09:39:45
|
<wikibugs>
|
('CR) ''Arnaudb: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1266950 (https://phabricator.wikimedia.org/T420909) (owner: ''Arnaudb)'
|
|
2026-04-02 09:40:13
|
<wikibugs>
|
('CR) ''Effie Mouzeli: [C:''+1] profile::service_proxy::envoy: remove mw-parsoid [puppet] - ''https://gerrit.wikimedia.org/r/1265420 (https://phabricator.wikimedia.org/T420468) (owner: ''Elukey)'
|
|
2026-04-02 09:40:30
|
<wikibugs>
|
('PS2) ''Elukey: profile::service_proxy::envoy: remove mw-parsoid [puppet] - ''https://gerrit.wikimedia.org/r/1265420 (https://phabricator.wikimedia.org/T420468)'
|
|
2026-04-02 09:41:18
|
<wikibugs>
|
('CR) ''Arnaudb: [C:''+2] gerrit: fix Envoy idle timeout handling for slow HTTPS git requests [puppet] - ''https://gerrit.wikimedia.org/r/1266950 (https://phabricator.wikimedia.org/T420909) (owner: ''Arnaudb)'
|
|
2026-04-02 09:41:50
|
<wikibugs>
|
('Abandoned) ''Effie Mouzeli: profile::service_proxy::envoy: remove mw-parsoid [puppet] - ''https://gerrit.wikimedia.org/r/1265420 (https://phabricator.wikimedia.org/T420468) (owner: ''Elukey)'
|
|
2026-04-02 09:43:42
|
<wikibugs>
|
('CR) ''Ayounsi: [C:''+1] "thanks!" [dns] - ''https://gerrit.wikimedia.org/r/1266952 (https://phabricator.wikimedia.org/T422115) (owner: ''Volans)'
|
|
2026-04-02 09:45:33
|
<logmsgbot>
|
!log javiermonton@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: sync
|
|
2026-04-02 09:45:42
|
<logmsgbot>
|
!log javiermonton@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync
|
|
2026-04-02 09:46:56
|
<jinxer-wm>
|
FIRING: GitlabPackagePullerFailedOnPrepare: Package puller has some run errors while preparing projects. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnPrepare
|
|
2026-04-02 09:47:41
|
<logmsgbot>
|
!log javiermonton@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync
|
|
2026-04-02 09:48:02
|
<logmsgbot>
|
!log javiermonton@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync
|
|
2026-04-02 09:48:03
|
<wikibugs>
|
('Abandoned) ''Majavah: hieradata: Add dumps.wikimedia.org CDN mapping [puppet] - ''https://gerrit.wikimedia.org/r/1237194 (https://phabricator.wikimedia.org/T306550) (owner: ''Majavah)'
|
|
2026-04-02 09:48:09
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T419635)', diff saved to https://phabricator.wikimedia.org/P90210 and previous config saved to /var/cache/conftool/dbconfig/20260402-094808-fceratto.json
|
|
2026-04-02 09:48:11
|
<stashbot>
|
T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
|
|
2026-04-02 09:48:26
|
<logmsgbot>
|
!log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance
|
|
2026-04-02 09:48:34
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2156 (T419635)', diff saved to https://phabricator.wikimedia.org/P90211 and previous config saved to /var/cache/conftool/dbconfig/20260402-094834-fceratto.json
|
|
2026-04-02 09:48:37
|
<logmsgbot>
|
!log javiermonton@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: sync
|
|
2026-04-02 09:48:58
|
<logmsgbot>
|
!log javiermonton@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync
|
|
2026-04-02 09:53:07
|
<wikibugs>
|
('PS1) ''Muehlenhoff: Obsolete airflow-wmde-admins POSIX group [puppet] - ''https://gerrit.wikimedia.org/r/1266959'
|
|
2026-04-02 09:58:30
|
<wikibugs>
|
('CR) ''Muehlenhoff: [C:''+2] Update email record for andreawest [puppet] - ''https://gerrit.wikimedia.org/r/1266931 (https://phabricator.wikimedia.org/T420053) (owner: ''Muehlenhoff)'
|
|
2026-04-02 10:00:04
|
<jouncebot>
|
Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1000)
|
|
2026-04-02 10:00:04
|
<jouncebot>
|
dues: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
|
|
2026-04-02 10:00:25
|
<wikibugs>
|
('CR) ''Daniel Kinzler: rest gateway: introduce policy for abstractwiki/wikifunctions (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: ''Daniel Kinzler)'
|
|
2026-04-02 10:02:05
|
<wikibugs>
|
('PS1) ''Volans: cumin: use webproxy to connect to openstack APIs [puppet] - ''https://gerrit.wikimedia.org/r/1266956 (https://phabricator.wikimedia.org/T420360)'
|
|
2026-04-02 10:02:05
|
<wikibugs>
|
('CR) ''Volans: "PCC available for cloudcumin1001 here:" [puppet] - ''https://gerrit.wikimedia.org/r/1266956 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
|
|
2026-04-02 10:03:22
|
<wikibugs>
|
('CR) ''Muehlenhoff: [C:''+2] thumbor: Update service image to latest rebuild [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266229 (owner: ''Muehlenhoff)'
|
|
2026-04-02 10:03:35
|
<wikibugs>
|
('PS1) ''Arnaudb: gerrit: update upstream_idle_timeout [puppet] - ''https://gerrit.wikimedia.org/r/1266962 (https://phabricator.wikimedia.org/T421827)'
|
|
2026-04-02 10:03:38
|
<wikibugs>
|
('CR) ''Arnaudb: [C:''+2] gerrit: update upstream_idle_timeout [puppet] - ''https://gerrit.wikimedia.org/r/1266962 (https://phabricator.wikimedia.org/T421827) (owner: ''Arnaudb)'
|
|
2026-04-02 10:04:15
|
<jinxer-wm>
|
FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
|
|
2026-04-02 10:04:17
|
<wikibugs>
|
('PS1) ''Volans: Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963'
|
|
2026-04-02 10:04:26
|
<wikibugs>
|
('CR) ''CI reject: [V:''-1] Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963 (owner: ''Volans)'
|
|
2026-04-02 10:05:23
|
<logmsgbot>
|
!log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply
|
|
2026-04-02 10:05:32
|
<logmsgbot>
|
!log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply
|
|
2026-04-02 10:05:34
|
<wikibugs>
|
('PS2) ''Volans: Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360)'
|
|
2026-04-02 10:05:49
|
<wikibugs>
|
('CR) ''CI reject: [V:''-1] Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
|
|
2026-04-02 10:08:55
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [C:''+1] "LGTM" [puppet] - ''https://gerrit.wikimedia.org/r/1266956 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
|
|
2026-04-02 10:09:15
|
<jinxer-wm>
|
FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
|
|
2026-04-02 10:09:42
|
<wikibugs>
|
('PS1) ''Mhorsey: Enable the CampaignEvents extension on incubator [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749)'
|
|
2026-04-02 10:10:18
|
<logmsgbot>
|
!log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply
|
|
2026-04-02 10:10:38
|
<wikibugs>
|
('CR) ''Daniel Kinzler: [C:''+2] rest gateway: define authed-user class [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280) (owner: ''Daniel Kinzler)'
|
|
2026-04-02 10:10:57
|
<wikibugs>
|
('CR) ''CI reject: [V:''-1] Enable the CampaignEvents extension on incubator [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749) (owner: ''Mhorsey)'
|
|
2026-04-02 10:11:19
|
<wikibugs>
|
('PS3) ''Volans: Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360)'
|
|
2026-04-02 10:11:41
|
<wikibugs>
|
('CR) ''Volans: [C:''+2] cumin: use webproxy to connect to openstack APIs [puppet] - ''https://gerrit.wikimedia.org/r/1266956 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
|
|
2026-04-02 10:12:36
|
<logmsgbot>
|
!log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
|
|
2026-04-02 10:12:49
|
<wikibugs>
|
('Merged) ''jenkins-bot: rest gateway: define authed-user class [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266237 (https://phabricator.wikimedia.org/T420280) (owner: ''Daniel Kinzler)'
|
|
2026-04-02 10:13:17
|
<wikibugs>
|
('CR) ''Filippo Giunchedi: [C:''+1] "LGTM" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
|
|
2026-04-02 10:14:30
|
<logmsgbot>
|
!log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply
|
|
2026-04-02 10:15:11
|
<logmsgbot>
|
!log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker-exp2001.codfw.wmnet
|
|
2026-04-02 10:16:51
|
<jinxer-wm>
|
FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
|
|
2026-04-02 10:16:52
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
|
|
2026-04-02 10:16:54
|
<logmsgbot>
|
!log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
|
|
2026-04-02 10:17:00
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
|
|
2026-04-02 10:17:05
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
|
|
2026-04-02 10:17:14
|
<logmsgbot>
|
!log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
|
|
2026-04-02 10:17:19
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
|
|
2026-04-02 10:17:24
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
|
|
2026-04-02 10:17:32
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
|
|
2026-04-02 10:17:36
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
|
|
2026-04-02 10:17:40
|
<effie>
|
!incidents
|
|
2026-04-02 10:17:40
|
<sirenbot>
|
7803 (UNACKED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
|
|
2026-04-02 10:17:46
|
<effie>
|
!ack 7803
|
|
2026-04-02 10:17:46
|
<sirenbot>
|
7803 (ACKED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
|
|
2026-04-02 10:17:50
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
|
|
2026-04-02 10:17:55
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
|
|
2026-04-02 10:18:03
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
|
|
2026-04-02 10:18:08
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
|
|
2026-04-02 10:18:24
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
|
|
2026-04-02 10:18:28
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
|
|
2026-04-02 10:18:45
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
|
|
2026-04-02 10:18:46
|
<logmsgbot>
|
!log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
|
|
2026-04-02 10:18:50
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
|
|
2026-04-02 10:19:11
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
|
|
2026-04-02 10:19:15
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
|
|
2026-04-02 10:19:17
|
<moritzm>
|
!log installing freetype security updates
|
|
2026-04-02 10:19:20
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
|
2026-04-02 10:19:25
|
<logmsgbot>
|
!log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker-exp2001.codfw.wmnet
|
|
2026-04-02 10:19:27
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
|
|
2026-04-02 10:19:30
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
|
|
2026-04-02 10:19:31
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
|
|
2026-04-02 10:19:35
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
|
|
2026-04-02 10:19:36
|
<logmsgbot>
|
!log jayme@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
|
|
2026-04-02 10:21:06
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T419635)', diff saved to https://phabricator.wikimedia.org/P90212 and previous config saved to /var/cache/conftool/dbconfig/20260402-102105-fceratto.json
|
|
2026-04-02 10:21:09
|
<stashbot>
|
T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
|
|
2026-04-02 10:21:41
|
<wikibugs>
|
('PS2) ''Mhorsey: Enable the CampaignEvents extension on incubator [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749)'
|
|
2026-04-02 10:22:45
|
<jinxer-wm>
|
FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ...
|
|
2026-04-02 10:22:50
|
<jinxer-wm>
|
fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
|
|
2026-04-02 10:23:16
|
<wikibugs>
|
('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1266964 (https://phabricator.wikimedia.org/T421749) (owner: ''Mhorsey)'
|
|
2026-04-02 10:24:44
|
<wikibugs>
|
'SRE, ''SRE-tools, ''Infrastructure-Foundations, ''ServiceOps new, and 2 others: Support locking cookbooks run except for switchover related cookbooks - https://phabricator.wikimedia.org/T330997#11781519 (''Volans) Given this has been moved to the backlog I'll leave here a comment for our future selves: i...'
|
|
2026-04-02 10:26:33
|
<icinga-wm>
|
PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 166195784 and 22 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 10:27:29
|
<wikibugs>
|
('PS1) ''Hashar: wm-checks-api: add tag for PostgreSQL jobs [software/gerrit] (deploy/wmf/stable-3.10) - ''https://gerrit.wikimedia.org/r/1266965'
|
|
2026-04-02 10:27:45
|
<jinxer-wm>
|
FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
|
|
2026-04-02 10:28:33
|
<icinga-wm>
|
RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3533304 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 10:30:40
|
<jinxer-wm>
|
FIRING: [3x] SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
|
2026-04-02 10:30:41
|
<wikibugs>
|
'SRE, ''DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781562 (''Peachey88)'
|
|
2026-04-02 10:31:02
|
<logmsgbot>
|
!log daniel@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
|
|
2026-04-02 10:31:12
|
<wikibugs>
|
'SRE, ''DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781588 (''MBH) Many such servers: 26, 31. When just opening pages for read.'
|
|
2026-04-02 10:31:14
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P90213 and previous config saved to /var/cache/conftool/dbconfig/20260402-103113-fceratto.json
|
|
2026-04-02 10:31:25
|
<wikibugs>
|
'SRE, ''DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781591 (''Peachey88)'
|
|
2026-04-02 10:31:27
|
<logmsgbot>
|
!log daniel@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
|
|
2026-04-02 10:32:19
|
<logmsgbot>
|
!log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet
|
|
2026-04-02 10:33:27
|
<wikibugs>
|
('CR) ''Cathal Mooney: [C:''+1] "LGTM!" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
|
|
2026-04-02 10:34:45
|
<wikibugs>
|
'SRE, ''DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781642 (''Thryduulf) I've been experiencing these errors intermittently on English Wikipedia today, but only on trying to save edits. Each time trying again has resulted in the save being successful.'
|
|
2026-04-02 10:37:41
|
<logmsgbot>
|
!log daniel@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
|
|
2026-04-02 10:38:10
|
<wikibugs>
|
('CR) ''Daniel Kinzler: [C:''+2] rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - ''https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: ''Daniel Kinzler)'
|
|
2026-04-02 10:38:22
|
<wikibugs>
|
('CR) ''CI reject: [V:''-1] rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - ''https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: ''Daniel Kinzler)'
|
|
2026-04-02 10:39:14
|
<wikibugs>
|
('PS5) ''Daniel Kinzler: rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - ''https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581)'
|
|
2026-04-02 10:39:29
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops: Degraded RAID on an-worker1148 - https://phabricator.wikimedia.org/T421892#11781672 (''Jclark-ctr) ''Open→''Declined This ticket automated ticket was opened by mistake it was still being worked on in In T411919'
|
|
2026-04-02 10:39:44
|
<wikibugs>
|
('CR) ''Daniel Kinzler: [C:''+2] rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - ''https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: ''Daniel Kinzler)'
|
|
2026-04-02 10:40:02
|
<logmsgbot>
|
!log daniel@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
|
|
2026-04-02 10:41:22
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P90214 and previous config saved to /var/cache/conftool/dbconfig/20260402-104121-fceratto.json
|
|
2026-04-02 10:41:51
|
<wikibugs>
|
('Merged) ''jenkins-bot: rest gateway: introduce policy for abstractwiki/wikifunctions [deployment-charts] - ''https://gerrit.wikimedia.org/r/1265333 (https://phabricator.wikimedia.org/T421581) (owner: ''Daniel Kinzler)'
|
|
2026-04-02 10:41:57
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421970#11781681 (''Jclark-ctr) ''Open→''Resolved a:''Jclark-ctr rebalanced'
|
|
2026-04-02 10:43:18
|
<wikibugs>
|
'SRE, ''DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781698 (''Aklapper)'
|
|
2026-04-02 10:43:53
|
<logmsgbot>
|
!log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin2002.codfw.wmnet
|
|
2026-04-02 10:44:33
|
<icinga-wm>
|
PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 76721280 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 10:45:00
|
<logmsgbot>
|
!log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
|
|
2026-04-02 10:45:16
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops: Power Supply - PS2 Status - issue on ml-serve1015:9290 - https://phabricator.wikimedia.org/T421599#11781704 (''Jclark-ctr) ''Open→''Resolved replaced failed psu Outbound ticket for psu 1-258638557493'
|
|
2026-04-02 10:45:33
|
<icinga-wm>
|
RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3553128 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 10:45:43
|
<logmsgbot>
|
!log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
|
|
2026-04-02 10:48:21
|
<icinga-wm>
|
PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
|
2026-04-02 10:48:23
|
<icinga-wm>
|
PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
|
2026-04-02 10:48:23
|
<icinga-wm>
|
PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
|
2026-04-02 10:48:23
|
<icinga-wm>
|
PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
|
2026-04-02 10:48:28
|
<A_smart_kitten>
|
fwiw I just got 'cannot access the database: database servers in cluster31 are overloaded' when trying to save an edit on metawiki. worked fine on the second attempt.
|
|
2026-04-02 10:48:33
|
<icinga-wm>
|
PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 298909248 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 10:49:26
|
<A_smart_kitten>
|
oh i see it's already known, apologies :)
|
|
2026-04-02 10:49:33
|
<icinga-wm>
|
RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 4010680 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 10:49:49
|
<icinga-wm>
|
PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
|
2026-04-02 10:49:49
|
<icinga-wm>
|
PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
|
2026-04-02 10:50:33
|
<wikibugs_>
|
'SRE, ''DBA: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781731 (''Wellverywell) p:''Triage→''Unbreak!'
|
|
2026-04-02 10:50:41
|
<wikibugs>
|
'SRE-Access-Requests, ''Data-Platform-SRE, ''Wikidata Platform Team: Request: wdqs shell access for user @AWesterinen-WMF - https://phabricator.wikimedia.org/T422141 (''gmodena) ''NEW'
|
|
2026-04-02 10:51:30
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T419635)', diff saved to https://phabricator.wikimedia.org/P90215 and previous config saved to /var/cache/conftool/dbconfig/20260402-105129-fceratto.json
|
|
2026-04-02 10:51:33
|
<stashbot>
|
T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
|
|
2026-04-02 10:51:35
|
<logmsgbot>
|
!log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance
|
|
2026-04-02 10:51:43
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2177 (T419635)', diff saved to https://phabricator.wikimedia.org/P90216 and previous config saved to /var/cache/conftool/dbconfig/20260402-105142-fceratto.json
|
|
2026-04-02 10:52:14
|
<wikibugs>
|
'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781750 (''RhinosF1)'
|
|
2026-04-02 10:52:49
|
<icinga-wm>
|
PROBLEM - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
|
2026-04-02 10:54:13
|
<wikibugs>
|
'SRE-Access-Requests, ''Data-Platform-SRE, ''Wikidata Platform Team: Request: wdqs shell access for user AWesterinen-WMF - https://phabricator.wikimedia.org/T422141#11781774 (''gmodena)'
|
|
2026-04-02 10:56:57
|
<wikibugs>
|
'SRE-Access-Requests, ''Data-Platform-SRE, ''Wikidata Platform Team: Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11781779 (''gmodena)'
|
|
2026-04-02 10:57:52
|
<wikibugs>
|
'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781783 (''1F616EMO) I experienced such errors when diffing and saving edits.'
|
|
2026-04-02 10:58:15
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''SRE-swift-storage, ''DC-Ops: Disk (sdt) failed in ms-be1065 - https://phabricator.wikimedia.org/T422011#11781785 (''Jclark-ctr) a:''Jclark-ctr This server is out of warranty. Replaced Drive slot 16 with matching 8tb sata drive'
|
|
2026-04-02 10:58:45
|
<wikibugs>
|
'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781790 (''Ladsgroup) We are on it.'
|
|
2026-04-02 10:59:47
|
<wikibugs>
|
'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781793 (''1F616EMO) Should I expect the coming backport window be cancelled or delayed due to this incident?'
|
|
2026-04-02 11:00:25
|
<wikibugs>
|
('PS4) ''Btullis: Add analytics-fr-tech system user and corresponding groups [puppet] - ''https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213)'
|
|
2026-04-02 11:00:25
|
<wikibugs>
|
('PS1) ''Btullis: Grant sudo privileges for the analytics-fr-tech-users group [puppet] - ''https://gerrit.wikimedia.org/r/1266980 (https://phabricator.wikimedia.org/T417213)'
|
|
2026-04-02 11:01:28
|
<wikibugs>
|
'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781814 (''RhinosF1) >>! In T422130#11781793, @1F616EMO wrote: > Should I expect the coming backport window be cancelled or delayed due to this incident? Very likely yes. A dep...'
|
|
2026-04-02 11:02:00
|
<wikibugs>
|
('CR) ''Btullis: [C:''-1] "Set to -1 pending the review by Infrastructure Foundations." [puppet] - ''https://gerrit.wikimedia.org/r/1266980 (https://phabricator.wikimedia.org/T417213) (owner: ''Btullis)'
|
|
2026-04-02 11:04:16
|
<wikibugs>
|
('PS1) ''Esanders: Fix suggestion mode availability check [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1266985 (https://phabricator.wikimedia.org/T422143)'
|
|
2026-04-02 11:04:20
|
<wikibugs>
|
('CR) ''Muehlenhoff: Add analytics-fr-tech system user and corresponding groups (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) (owner: ''Btullis)'
|
|
2026-04-02 11:05:20
|
<wikibugs>
|
('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc"; [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1266985 (https://phabricator.wikimedia.org/T422143) (owner: ''Esanders)'
|
|
2026-04-02 11:07:26
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''SRE-swift-storage, ''DC-Ops: Disk (sdt) failed in ms-be1065 - https://phabricator.wikimedia.org/T422011#11781846 (''Jclark-ctr) After replacement Server showed drive as foreign. continued to fail to clear foreign config. Replaced drive again with new seagate 8tb sata drive'
|
|
2026-04-02 11:07:48
|
<wikibugs>
|
'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781847 (''1F616EMO) >>! In T422130#11781814, @RhinosF1 wrote: >>>! In T422130#11781793, @1F616EMO wrote: >> Should I expect the coming backport window be cancelled or delayed d...'
|
|
2026-04-02 11:13:15
|
<jinxer-wm>
|
FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 11.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
|
|
2026-04-02 11:14:15
|
<jinxer-wm>
|
FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
|
|
2026-04-02 11:20:54
|
<wikibugs>
|
'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781890 (''Lucas_Werkmeister_WMDE)'
|
|
2026-04-02 11:21:41
|
<jinxer-wm>
|
FIRING: [8x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
|
|
2026-04-02 11:24:22
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T419635)', diff saved to https://phabricator.wikimedia.org/P90217 and previous config saved to /var/cache/conftool/dbconfig/20260402-112421-fceratto.json
|
|
2026-04-02 11:24:25
|
<stashbot>
|
T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
|
|
2026-04-02 11:26:41
|
<jinxer-wm>
|
FIRING: [55x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
|
|
2026-04-02 11:26:51
|
<jinxer-wm>
|
FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
|
|
2026-04-02 11:27:00
|
<effie>
|
!incidents
|
|
2026-04-02 11:27:00
|
<sirenbot>
|
7803 (ACKED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
|
|
2026-04-02 11:27:23
|
<jinxer-wm>
|
FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
|
|
2026-04-02 11:27:49
|
<wikibugs>
|
'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11781903 (''BTullis)'
|
|
2026-04-02 11:27:50
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''SRE-swift-storage, ''DC-Ops: Disk (sdt) failed in ms-be1065 - https://phabricator.wikimedia.org/T422011#11781904 (''Jclark-ctr) ''Open→''Resolved'
|
|
2026-04-02 11:28:38
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DBA, ''DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11781909 (''Jclark-ctr) a:''Jclark-ctr'
|
|
2026-04-02 11:29:02
|
<wikibugs>
|
('CR) ''Jforrester: [C:''+1] REST: Publish ReadingLists v0 module in REST Sandbox [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1264856 (https://phabricator.wikimedia.org/T419619) (owner: ''KineticPelagic)'
|
|
2026-04-02 11:32:25
|
<icinga-wm>
|
PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is CRITICAL: 6.702e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
|
|
2026-04-02 11:32:45
|
<jinxer-wm>
|
FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
|
|
2026-04-02 11:34:30
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P90218 and previous config saved to /var/cache/conftool/dbconfig/20260402-113429-fceratto.json
|
|
2026-04-02 11:34:33
|
<icinga-wm>
|
PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 97599648 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 11:35:23
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DBA, ''DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11781922 (''Jclark-ctr) updating bios firmware , expander firmware due to coms error on backplain. and idrac firmware additionally'
|
|
2026-04-02 11:35:33
|
<icinga-wm>
|
RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3557000 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 11:36:54
|
<wikibugs>
|
('PS1) ''Arnaudb: gerrit: bump upstream_idle_timeout to 900s [puppet] - ''https://gerrit.wikimedia.org/r/1266989 (https://phabricator.wikimedia.org/T421904)'
|
|
2026-04-02 11:37:12
|
<wikibugs>
|
'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11781927 (''BTullis) I have validated all SSH keys via out-of...'
|
|
2026-04-02 11:37:15
|
<wikibugs>
|
('CR) ''Arnaudb: [C:''+2] gerrit: bump upstream_idle_timeout to 900s [puppet] - ''https://gerrit.wikimedia.org/r/1266989 (https://phabricator.wikimedia.org/T421904) (owner: ''Arnaudb)'
|
|
2026-04-02 11:37:23
|
<jinxer-wm>
|
RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
|
|
2026-04-02 11:38:23
|
<icinga-wm>
|
RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
|
2026-04-02 11:39:19
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Wikidata Platform Team, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11781930 (''Gehel) p:''Triage→''High'
|
|
2026-04-02 11:42:49
|
<icinga-wm>
|
RECOVERY - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
|
2026-04-02 11:44:38
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P90219 and previous config saved to /var/cache/conftool/dbconfig/20260402-114437-fceratto.json
|
|
2026-04-02 11:47:45
|
<jinxer-wm>
|
FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
|
|
2026-04-02 11:48:15
|
<jinxer-wm>
|
RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 15.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
|
|
2026-04-02 11:48:23
|
<wikibugs>
|
'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11781968 (''Thryduulf) I've just encountered what I presume is the same error, this time when trying to use the reply tool [6a4d47bf-961e-4513-9b1f-c6970e11f156] Caught exception...'
|
|
2026-04-02 11:48:23
|
<wikibugs>
|
('PS5) ''Btullis: Add analytics-fr-tech system user and corresponding groups [puppet] - ''https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213)'
|
|
2026-04-02 11:48:24
|
<wikibugs>
|
('PS2) ''Btullis: Grant sudo privileges for the analytics-fr-tech-users group [puppet] - ''https://gerrit.wikimedia.org/r/1266980 (https://phabricator.wikimedia.org/T417213)'
|
|
2026-04-02 11:51:15
|
<wikibugs>
|
('PS1) ''Daniel Kinzler: rest gateway: refactor ratelimit integration test [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266995'
|
|
2026-04-02 11:51:51
|
<jinxer-wm>
|
RESOLVED: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
|
|
2026-04-02 11:52:11
|
<logmsgbot>
|
!log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
|
|
2026-04-02 11:52:17
|
<wikibugs>
|
('Abandoned) ''Mvolz: citoid: pipeline bot promote [deployment-charts] - ''https://gerrit.wikimedia.org/r/1254925 (owner: ''PipelineBot)'
|
|
2026-04-02 11:52:24
|
<wikibugs>
|
('Abandoned) ''Mvolz: citoid: pipeline bot promote [deployment-charts] - ''https://gerrit.wikimedia.org/r/1254926 (owner: ''PipelineBot)'
|
|
2026-04-02 11:52:34
|
<wikibugs>
|
('Abandoned) ''Mvolz: citoid: pipeline bot promote [deployment-charts] - ''https://gerrit.wikimedia.org/r/1241846 (owner: ''PipelineBot)'
|
|
2026-04-02 11:52:44
|
<wikibugs>
|
('Abandoned) ''Mvolz: citoid: pipeline bot promote [deployment-charts] - ''https://gerrit.wikimedia.org/r/1258153 (owner: ''PipelineBot)'
|
|
2026-04-02 11:52:45
|
<jinxer-wm>
|
RESOLVED: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
|
|
2026-04-02 11:52:55
|
<wikibugs>
|
('Abandoned) ''Mvolz: citoid: pipeline bot promote [deployment-charts] - ''https://gerrit.wikimedia.org/r/1254927 (owner: ''PipelineBot)'
|
|
2026-04-02 11:53:04
|
<wikibugs>
|
('Abandoned) ''Mvolz: citoid: pipeline bot promote [deployment-charts] - ''https://gerrit.wikimedia.org/r/1246819 (owner: ''PipelineBot)'
|
|
2026-04-02 11:54:15
|
<jinxer-wm>
|
RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
|
|
2026-04-02 11:54:30
|
<wikibugs>
|
('CR) ''Ayounsi: [C:''+1] Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
|
|
2026-04-02 11:54:47
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T419635)', diff saved to https://phabricator.wikimedia.org/P90220 and previous config saved to /var/cache/conftool/dbconfig/20260402-115446-fceratto.json
|
|
2026-04-02 11:54:50
|
<stashbot>
|
T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
|
|
2026-04-02 11:55:03
|
<logmsgbot>
|
!log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2190.codfw.wmnet with reason: Maintenance
|
|
2026-04-02 11:55:12
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2190 (T419635)', diff saved to https://phabricator.wikimedia.org/P90221 and previous config saved to /var/cache/conftool/dbconfig/20260402-115511-fceratto.json
|
|
2026-04-02 11:59:00
|
<edsanders>
|
I have a high visibility UBN in for the deployment window - just waiting for it to merge
|
|
2026-04-02 11:59:59
|
<wikibugs>
|
('PS1) ''Brouberol: deployment_server: tweak the labels on opensearch_k8s_master_cert_expiry_seconds [puppet] - ''https://gerrit.wikimedia.org/r/1266999 (https://phabricator.wikimedia.org/T418175)'
|
|
2026-04-02 12:00:04
|
<jouncebot>
|
Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1200)
|
|
2026-04-02 12:02:02
|
<edsanders>
|
ah - timezone change - the window starts in one hour
|
|
2026-04-02 12:02:15
|
<jinxer-wm>
|
FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
|
|
2026-04-02 12:03:19
|
<wikibugs>
|
('CR) ''Brouberol: [C:''+2] deployment_server: tweak the labels on opensearch_k8s_master_cert_expiry_seconds [puppet] - ''https://gerrit.wikimedia.org/r/1266999 (https://phabricator.wikimedia.org/T418175) (owner: ''Brouberol)'
|
|
2026-04-02 12:05:25
|
<jinxer-wm>
|
FIRING: [3x] SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
|
2026-04-02 12:06:41
|
<jinxer-wm>
|
FIRING: [55x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
|
|
2026-04-02 12:07:15
|
<jinxer-wm>
|
FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 23.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
|
|
2026-04-02 12:09:35
|
<logmsgbot>
|
!log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1373.eqiad.wmnet with OS trixie
|
|
2026-04-02 12:09:47
|
<logmsgbot>
|
!log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1374.eqiad.wmnet with OS trixie
|
|
2026-04-02 12:09:57
|
<logmsgbot>
|
!log jayme@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1373
|
|
2026-04-02 12:09:57
|
<logmsgbot>
|
!log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1373
|
|
2026-04-02 12:10:08
|
<logmsgbot>
|
!log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1374
|
|
2026-04-02 12:10:08
|
<logmsgbot>
|
!log jayme@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1374
|
|
2026-04-02 12:10:47
|
<p858snake|cloud>
|
edsanders: fyi there is an incident at the moment (T422130) so the window might be affected
|
|
2026-04-02 12:10:48
|
<stashbot>
|
T422130: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130
|
|
2026-04-02 12:11:02
|
<wikibugs>
|
('CR) ''JMeybohm: "recheck" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: ''Kamila Součková)'
|
|
2026-04-02 12:11:31
|
<wikibugs>
|
('CR) ''Volans: [C:''+2] Add missing includes from Netbox exported data [dns] - ''https://gerrit.wikimedia.org/r/1266952 (https://phabricator.wikimedia.org/T422115) (owner: ''Volans)'
|
|
2026-04-02 12:11:41
|
<jinxer-wm>
|
FIRING: [55x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
|
|
2026-04-02 12:11:41
|
<logmsgbot>
|
!log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
|
|
2026-04-02 12:11:57
|
<logmsgbot>
|
!log volans@dns1004 START - running authdns-update
|
|
2026-04-02 12:12:15
|
<jinxer-wm>
|
RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 22.53% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
|
|
2026-04-02 12:12:19
|
<wikibugs>
|
('CR) ''JMeybohm: [C:''+1] admin-ng: Allow ML/exp users to use describe verb on nodes [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266947 (owner: ''Klausman)'
|
|
2026-04-02 12:12:30
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DBA, ''DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782064 (''Jclark-ctr) ''Open→''Resolved ` A configuration related issue on the device Backplane is resolved. `'
|
|
2026-04-02 12:13:46
|
<logmsgbot>
|
!log volans@dns1004 END - running authdns-update
|
|
2026-04-02 12:13:51
|
<jinxer-wm>
|
FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
|
|
2026-04-02 12:14:26
|
<edsanders>
|
p858snake I'd like to start my deployment asap, is everything on hold at the moment?
|
|
2026-04-02 12:16:13
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DBA, ''DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782090 (''FCeratto-WMF) Thanks!'
|
|
2026-04-02 12:16:41
|
<jinxer-wm>
|
FIRING: [55x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
|
|
2026-04-02 12:17:17
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops, ''observability, ''Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11782094 (''Jclark-ctr) @herron can you assist with updating puppet on this install ticket?'
|
|
2026-04-02 12:18:38
|
<edsanders>
|
Rhoni
|
|
2026-04-02 12:18:46
|
<edsanders>
|
*typo
|
|
2026-04-02 12:18:51
|
<jinxer-wm>
|
FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
|
|
2026-04-02 12:19:02
|
<effie>
|
!incidents
|
|
2026-04-02 12:19:02
|
<sirenbot>
|
7804 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqiad)
|
|
2026-04-02 12:19:03
|
<sirenbot>
|
7803 (RESOLVED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
|
|
2026-04-02 12:19:12
|
<edsanders>
|
RhinosF1: is there any chance of getting a UBN backported, despite T422130?
|
|
2026-04-02 12:19:13
|
<stashbot>
|
T422130: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130
|
|
2026-04-02 12:19:32
|
<RhinosF1>
|
edsanders: no idea why you are asking me
|
|
2026-04-02 12:19:32
|
<edsanders>
|
(this: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/1266984)
|
|
2026-04-02 12:19:39
|
<edsanders>
|
I saw you commented on the incident task
|
|
2026-04-02 12:19:42
|
<RhinosF1>
|
You need to ask the IC
|
|
2026-04-02 12:19:45
|
<jinxer-wm>
|
FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
|
|
2026-04-02 12:19:50
|
<RhinosF1>
|
I suggest in #wikimedia-sre
|
|
2026-04-02 12:19:53
|
<edsanders>
|
Thanks
|
|
2026-04-02 12:19:53
|
<RhinosF1>
|
Much quieter there
|
|
2026-04-02 12:20:01
|
<wikibugs>
|
('CR) ''JMeybohm: Upgrade aux-k8s-codfw to k8s 1.31 (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1265426 (https://phabricator.wikimedia.org/T414486) (owner: ''Elukey)'
|
|
2026-04-02 12:20:08
|
<wikibugs>
|
('CR) ''JMeybohm: [C:''+1] admin_ng: upgrade aux-k8s-codfw to k8s 1.31 [deployment-charts] - ''https://gerrit.wikimedia.org/r/1265427 (https://phabricator.wikimedia.org/T414486) (owner: ''Elukey)'
|
|
2026-04-02 12:21:41
|
<jinxer-wm>
|
FIRING: [55x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
|
|
2026-04-02 12:22:32
|
<logmsgbot>
|
!log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1373.eqiad.wmnet with reason: host reimage
|
|
2026-04-02 12:22:35
|
<logmsgbot>
|
!log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1374.eqiad.wmnet with reason: host reimage
|
|
2026-04-02 12:22:40
|
<wikibugs>
|
'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11782127 (''taavi)'
|
|
2026-04-02 12:24:40
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11782144 (''Jclark-ctr) a:''Jclark-ctr'
|
|
2026-04-02 12:25:18
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11782146 (''Jclark-ctr)'
|
|
2026-04-02 12:26:41
|
<jinxer-wm>
|
FIRING: [54x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
|
|
2026-04-02 12:26:43
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T419635)', diff saved to https://phabricator.wikimedia.org/P90222 and previous config saved to /var/cache/conftool/dbconfig/20260402-122642-fceratto.json
|
|
2026-04-02 12:26:46
|
<stashbot>
|
T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
|
|
2026-04-02 12:27:05
|
<wikibugs>
|
('CR) ''Btullis: Add analytics-fr-tech system user and corresponding groups (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) (owner: ''Btullis)'
|
|
2026-04-02 12:27:44
|
<wikibugs>
|
'SRE, ''DNS, ''Infrastructure-Foundations, ''netbox, and 3 others: Missing includes in DNS repo from Netbox-generated snippets - https://phabricator.wikimedia.org/T422115#11782158 (''Volans) p:''Triage→''Medium I've merged and release the fix, do you want to keep the task open to implement some form o...'
|
|
2026-04-02 12:28:49
|
<logmsgbot>
|
!log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for es1042.eqiad.wmnet
|
|
2026-04-02 12:28:50
|
<logmsgbot>
|
!log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es1042.eqiad.wmnet
|
|
2026-04-02 12:29:17
|
<logmsgbot>
|
!log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1373.eqiad.wmnet with reason: host reimage
|
|
2026-04-02 12:30:46
|
<logmsgbot>
|
!log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool es1042.eqiad.wmnet: Restoring section
|
|
2026-04-02 12:30:57
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DBA, ''DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782163 (''FCeratto-WMF) The host booted, I triggered a puppet run manually, started MariaDB, enabled alarming and checked that icinga is green and started pooling in to help with T422130'
|
|
2026-04-02 12:31:11
|
<wikibugs>
|
('CR) ''JMeybohm: service::catalog: add sophroid service catalog entry (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1260767 (https://phabricator.wikimedia.org/T418748) (owner: ''Jasmine)'
|
|
2026-04-02 12:31:23
|
<wikibugs>
|
('CR) ''JMeybohm: [C:''+1] conftool: add sophroid etcd data [puppet] - ''https://gerrit.wikimedia.org/r/1248611 (https://phabricator.wikimedia.org/T418748) (owner: ''Jasmine)'
|
|
2026-04-02 12:31:41
|
<jinxer-wm>
|
RESOLVED: [44x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
|
|
2026-04-02 12:31:46
|
<logmsgbot>
|
!log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
|
|
2026-04-02 12:31:57
|
<wikibugs>
|
('CR) ''JMeybohm: [C:''+1] wmnet: add sophroid svc IPs [dns] - ''https://gerrit.wikimedia.org/r/1248617 (https://phabricator.wikimedia.org/T418748) (owner: ''Jasmine)'
|
|
2026-04-02 12:32:20
|
<wikibugs>
|
('CR) ''Klausman: [V:''+2 C:''+2] admin-ng: Allow ML/exp users to use describe verb on nodes [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266947 (owner: ''Klausman)'
|
|
2026-04-02 12:32:32
|
<icinga-wm>
|
PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 86555328 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 12:32:39
|
<wikibugs>
|
('PS1) ''Anne Tomasevich: Add logged-in reader retention instrument [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490)'
|
|
2026-04-02 12:32:46
|
<logmsgbot>
|
!log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool es1042.eqiad.wmnet: Restoring section
|
|
2026-04-02 12:32:49
|
<logmsgbot>
|
!log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1374.eqiad.wmnet with reason: host reimage
|
|
2026-04-02 12:32:58
|
<logmsgbot>
|
!log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool es1042.eqiad.wmnet: Restoring section
|
|
2026-04-02 12:32:59
|
<logmsgbot>
|
!log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) pool es1042.eqiad.wmnet: Restoring section
|
|
2026-04-02 12:33:10
|
<logmsgbot>
|
!log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool es1042: Restoring section
|
|
2026-04-02 12:33:25
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DBA, ''DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782172 (''ops-monitoring-bot) Starting pool of es1042 by fceratto@cumin1003: Restoring section'
|
|
2026-04-02 12:33:26
|
<wikibugs>
|
('CR) ''JMeybohm: [C:''-1] "This is the wrong file. Since you're targeting the aux cluster you need to add the pool there (`hieradata/role/common/aux_k8s/worker.yaml`" [puppet] - ''https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) (owner: ''Jasmine)'
|
|
2026-04-02 12:33:34
|
<icinga-wm>
|
RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 200752 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 12:33:51
|
<jinxer-wm>
|
FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
|
|
2026-04-02 12:34:33
|
<effie>
|
!incidents
|
|
2026-04-02 12:34:33
|
<sirenbot>
|
7804 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqiad)
|
|
2026-04-02 12:34:33
|
<sirenbot>
|
7803 (RESOLVED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
|
|
2026-04-02 12:34:53
|
<wikibugs>
|
('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc"; [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: ''Anne Tomasevich)'
|
|
2026-04-02 12:35:52
|
<wikibugs>
|
('CR) ''JMeybohm: [C:''-1] role::kubernetes::worker: add sophroid to the lvs pools (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) (owner: ''Jasmine)'
|
|
2026-04-02 12:36:32
|
<wikibugs>
|
('CR) ''Aude: [C:''+1] Add logged-in reader retention instrument [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: ''Anne Tomasevich)'
|
|
2026-04-02 12:36:51
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P90224 and previous config saved to /var/cache/conftool/dbconfig/20260402-123650-fceratto.json
|
|
2026-04-02 12:38:22
|
<icinga-wm>
|
RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
|
2026-04-02 12:38:22
|
<icinga-wm>
|
RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
|
2026-04-02 12:38:22
|
<icinga-wm>
|
RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
|
2026-04-02 12:39:23
|
<jinxer-wm>
|
FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
|
|
2026-04-02 12:39:39
|
<wikibugs>
|
('Merged) ''jenkins-bot: admin-ng: Allow ML/exp users to use describe verb on nodes [deployment-charts] - ''https://gerrit.wikimedia.org/r/1266947 (owner: ''Klausman)'
|
|
2026-04-02 12:39:48
|
<icinga-wm>
|
RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
|
2026-04-02 12:39:48
|
<icinga-wm>
|
RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
|
|
2026-04-02 12:41:20
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''SRE-swift-storage, ''DC-Ops: Disk (sdw) failed in ms-be1069 - https://phabricator.wikimedia.org/T421986#11782182 (''Jclark-ctr) a:''Jclark-ctr ` 2026-01-12 21:59:21 An unrecoverable disk media error occurred on Disk 20 in Backplane 2 of Integrated RAID Controller 1. Part Number =...'
|
|
2026-04-02 12:41:31
|
<logmsgbot>
|
!log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
|
|
2026-04-02 12:41:32
|
<jinxer-wm>
|
FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
|
|
2026-04-02 12:41:40
|
<jinxer-wm>
|
FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
|
2026-04-02 12:41:43
|
<wikibugs>
|
'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782184 (''BTullis) I have run `cross-validate-accounts` for...'
|
|
2026-04-02 12:42:33
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''SRE-swift-storage, ''DC-Ops: Disk (sdw) failed in ms-be1069 - https://phabricator.wikimedia.org/T421986#11782190 (''Jclark-ctr) ''Open→''Resolved'
|
|
2026-04-02 12:44:17
|
<logmsgbot>
|
!log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
|
|
2026-04-02 12:45:04
|
<logmsgbot>
|
!log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
|
|
2026-04-02 12:45:29
|
<logmsgbot>
|
!log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1373.eqiad.wmnet with OS trixie
|
|
2026-04-02 12:45:51
|
<logmsgbot>
|
!log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
|
|
2026-04-02 12:46:59
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P90225 and previous config saved to /var/cache/conftool/dbconfig/20260402-124659-fceratto.json
|
|
2026-04-02 12:48:33
|
<logmsgbot>
|
!log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1042: Restoring section
|
|
2026-04-02 12:48:58
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DBA, ''DC-Ops: es1042 not starting after powercycle - https://phabricator.wikimedia.org/T422111#11782211 (''ops-monitoring-bot) Completed pooling of es1042 by fceratto@cumin1003: Restoring section'
|
|
2026-04-02 12:49:21
|
<logmsgbot>
|
!log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1374.eqiad.wmnet with OS trixie
|
|
2026-04-02 12:49:22
|
<jinxer-wm>
|
FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
|
|
2026-04-02 12:49:36
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''SRE-swift-storage, ''DC-Ops: Disk (sdw) failed in ms-be1069 - https://phabricator.wikimedia.org/T421986#11782217 (''MatthewVernon) Thanks for the quick fixes @Jclark-ctr :-)'
|
|
2026-04-02 12:50:15
|
<jinxer-wm>
|
FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
|
|
2026-04-02 12:50:19
|
<logmsgbot>
|
!log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
|
|
2026-04-02 12:54:43
|
<jasmine_>
|
hi folks, just a reminder that we will be repooling codfw at 14:00 utc today
|
|
2026-04-02 12:55:15
|
<jinxer-wm>
|
RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 22.27% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
|
|
2026-04-02 12:55:32
|
<icinga-wm>
|
PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 468938744 and 45 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 12:56:20
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11782255 (''Jclark-ctr) @Jgreen replaced cable link came up. Sorry for delay'
|
|
2026-04-02 12:56:37
|
<logmsgbot>
|
!log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
|
|
2026-04-02 12:57:07
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T419635)', diff saved to https://phabricator.wikimedia.org/P90227 and previous config saved to /var/cache/conftool/dbconfig/20260402-125707-fceratto.json
|
|
2026-04-02 12:57:11
|
<stashbot>
|
T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
|
|
2026-04-02 12:57:25
|
<logmsgbot>
|
!log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2194.codfw.wmnet with reason: Maintenance
|
|
2026-04-02 12:57:32
|
<icinga-wm>
|
RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 120 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 12:57:33
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2194 (T419635)', diff saved to https://phabricator.wikimedia.org/P90228 and previous config saved to /var/cache/conftool/dbconfig/20260402-125732-fceratto.json
|
|
2026-04-02 12:58:15
|
<jinxer-wm>
|
FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 20.99% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
|
|
2026-04-02 13:00:05
|
<jouncebot>
|
Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1300).
|
|
2026-04-02 13:00:05
|
<jouncebot>
|
manfredi, HouseOfM, edsanders, and annet: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
|
|
2026-04-02 13:00:14
|
<Lucas_WMDE>
|
o/
|
|
2026-04-02 13:00:23
|
<annet>
|
o/
|
|
2026-04-02 13:00:24
|
<Lucas_WMDE>
|
I can deploy but I need to catch up with the incident first
|
|
2026-04-02 13:00:32
|
<Lucas_WMDE>
|
not sure if it’s okay to deploy at the moment
|
|
2026-04-02 13:00:41
|
<edsanders>
|
last I heard it isn't
|
|
2026-04-02 13:01:01
|
<edsanders>
|
I've also asked to deploy my UBN asap once the incident is resolved
|
|
2026-04-02 13:01:14
|
<Lucas_WMDE>
|
https://www.wikimediastatus.net/incidents/kq46rrxd2yy4 is still up
|
|
2026-04-02 13:01:25
|
<jinxer-wm>
|
RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
|
2026-04-02 13:02:04
|
<wikibugs>
|
'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11782282 (''Aklapper)'
|
|
2026-04-02 13:02:15
|
<Lucas_WMDE>
|
I agree that edsanders’ change seems top priority once we can deploy at all
|
|
2026-04-02 13:02:19
|
<wikibugs>
|
('PS1) ''Btullis: Grant the WME engineering team production access suitable for Airflow [puppet] - ''https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214)'
|
|
2026-04-02 13:03:07
|
<wikibugs>
|
('CR) ''CI reject: [V:''-1] Grant the WME engineering team production access suitable for Airflow [puppet] - ''https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: ''Btullis)'
|
|
2026-04-02 13:03:51
|
<jinxer-wm>
|
FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
|
|
2026-04-02 13:08:53
|
<wikibugs>
|
('PS2) ''Btullis: Grant the WME engineering team production access suitable for Airflow [puppet] - ''https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214)'
|
|
2026-04-02 13:09:39
|
<wikibugs>
|
('CR) ''CI reject: [V:''-1] Grant the WME engineering team production access suitable for Airflow [puppet] - ''https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: ''Btullis)'
|
|
2026-04-02 13:13:31
|
<wikibugs>
|
('PS3) ''Btullis: Grant the WME engineering team production access suitable for Airflow [puppet] - ''https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214)'
|
|
2026-04-02 13:15:21
|
<wikibugs>
|
('PS1) ''Ayounsi: Add Mayotte to geo-maps - prefer drmrs [dns] - ''https://gerrit.wikimedia.org/r/1267042'
|
|
2026-04-02 13:16:34
|
<icinga-wm>
|
PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 47811456 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 13:17:24
|
<Lucas_WMDE>
|
(the codfw repool is being pulled ahead, if that solves the incident then we *may* be able to deploy one or two patches in the window after all)
|
|
2026-04-02 13:17:33
|
<logmsgbot>
|
!log jasmine@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool codfw [reason: no reason specified, T414486]
|
|
2026-04-02 13:17:37
|
<stashbot>
|
T414486: Upgrade AUX clusters to kubernetes 1.31 - https://phabricator.wikimedia.org/T414486
|
|
2026-04-02 13:17:46
|
<logmsgbot>
|
!log jasmine@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool codfw [reason: no reason specified, T414486]
|
|
2026-04-02 13:18:15
|
<jinxer-wm>
|
RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.66% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
|
|
2026-04-02 13:18:33
|
<logmsgbot>
|
!log jasmine@cumin1003 START - Cookbook sre.discovery.datacenter pool all active/active services in codfw: maintenance - T414486
|
|
2026-04-02 13:19:15
|
<jinxer-wm>
|
FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
|
|
2026-04-02 13:20:31
|
<wikibugs>
|
('CR) ''Btullis: [C:''-1] "I'm just waiting for final approval from Haroon on the ticket, for his 6 reports." [puppet] - ''https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: ''Btullis)'
|
|
2026-04-02 13:20:32
|
<icinga-wm>
|
RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3981016 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 13:22:09
|
<sukhe>
|
!incidents
|
|
2026-04-02 13:22:09
|
<sirenbot>
|
7804 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqiad)
|
|
2026-04-02 13:22:09
|
<sirenbot>
|
7803 (RESOLVED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
|
|
2026-04-02 13:23:50
|
<wikibugs>
|
'ops-eqiad, ''DC-Ops, ''Infrastructure-Foundations, ''netops: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#11782358 (''Jclark-ctr)'
|
|
2026-04-02 13:27:16
|
<wikibugs>
|
('PS1) ''JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216)'
|
|
2026-04-02 13:28:30
|
<jinxer-wm>
|
RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 23.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
|
|
2026-04-02 13:28:51
|
<jinxer-wm>
|
FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
|
|
2026-04-02 13:29:15
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T419635)', diff saved to https://phabricator.wikimedia.org/P90229 and previous config saved to /var/cache/conftool/dbconfig/20260402-132914-fceratto.json
|
|
2026-04-02 13:29:18
|
<stashbot>
|
T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
|
|
2026-04-02 13:29:35
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11782375 (''Jgreen) >>! In T417295#11782255, @Jclark-ctr wrote: > @Jgreen replaced cable link came up. Sorry for delay @Jclark-ctr looks good, it's imaging now. Thanks!'
|
|
2026-04-02 13:29:52
|
<wikibugs>
|
'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17), ''Patch-For-Review: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782376 (''BTullis) This patch for the...'
|
|
2026-04-02 13:30:15
|
<jinxer-wm>
|
FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 22.99% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
|
|
2026-04-02 13:30:53
|
<wikibugs>
|
('CR) ''Muehlenhoff: [C:''+1] "Patch looks good, can be merged once approval is done" [puppet] - ''https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: ''Btullis)'
|
|
2026-04-02 13:31:11
|
<wikibugs>
|
('CR) ''Eevans: [C:''+2] charts/cassandra-http-gateway: configurable Cassandra keyspace [deployment-charts] - ''https://gerrit.wikimedia.org/r/1259188 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
|
|
2026-04-02 13:31:42
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11782380 (''Jclark-ctr)'
|
|
2026-04-02 13:32:31
|
<wikibugs>
|
('CR) ''Muehlenhoff: [C:''+1] "Looks good" [puppet] - ''https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) (owner: ''Btullis)'
|
|
2026-04-02 13:32:44
|
<wikibugs>
|
('CR) ''Eevans: [C:''+2] services: add linked-artifacts service [deployment-charts] - ''https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
|
|
2026-04-02 13:33:51
|
<jinxer-wm>
|
FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
|
|
2026-04-02 13:33:58
|
<sukhe>
|
!ack
|
|
2026-04-02 13:33:59
|
<sirenbot>
|
All incidents are already acked.
|
|
2026-04-02 13:34:45
|
<jinxer-wm>
|
FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
|
|
2026-04-02 13:34:51
|
<wikibugs>
|
('Merged) ''jenkins-bot: services: add linked-artifacts service [deployment-charts] - ''https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
|
|
2026-04-02 13:35:15
|
<jinxer-wm>
|
RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 21.97% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
|
|
2026-04-02 13:35:57
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11782401 (''Jclark-ctr) @VRiley-WMF Thanks for following up I had Sent the email with instructions to Papaul while I was out on Tuesday. This will require som...'
|
|
2026-04-02 13:36:52
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11782402 (''Jclark-ctr) ''Open→''Resolved'
|
|
2026-04-02 13:37:45
|
<logmsgbot>
|
!log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
|
|
2026-04-02 13:39:24
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P90230 and previous config saved to /var/cache/conftool/dbconfig/20260402-133923-fceratto.json
|
|
2026-04-02 13:39:45
|
<jinxer-wm>
|
FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
|
|
2026-04-02 13:41:28
|
<wikibugs>
|
('PS1) ''Kosta Harlan: hCaptcha: Emit Prometheus counter on health check failover [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267056 (https://phabricator.wikimedia.org/T421204)'
|
|
2026-04-02 13:41:47
|
<logmsgbot>
|
!log jasmine@cumin1003 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in codfw: maintenance - T414486
|
|
2026-04-02 13:41:51
|
<stashbot>
|
T414486: Upgrade AUX clusters to kubernetes 1.31 - https://phabricator.wikimedia.org/T414486
|
|
2026-04-02 13:42:15
|
<jinxer-wm>
|
RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
|
|
2026-04-02 13:42:58
|
<logmsgbot>
|
!log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
|
|
2026-04-02 13:43:51
|
<jinxer-wm>
|
RESOLVED: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
|
|
2026-04-02 13:44:45
|
<jinxer-wm>
|
RESOLVED: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
|
|
2026-04-02 13:49:23
|
<wikibugs>
|
('CR) ''Lucas Werkmeister (WMDE): [C:''+2] "starting gate-and-submit ahead of deployment" [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1266985 (https://phabricator.wikimedia.org/T422143) (owner: ''Esanders)'
|
|
2026-04-02 13:49:32
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P90231 and previous config saved to /var/cache/conftool/dbconfig/20260402-134931-fceratto.json
|
|
2026-04-02 13:49:44
|
<Lucas_WMDE>
|
^ there’s some chance we’ll be able to deploy; otherwise I’ll undo that CR+2 (cc edsanders)
|
|
2026-04-02 13:50:16
|
<edsanders>
|
I'm here
|
|
2026-04-02 13:50:22
|
<wikibugs>
|
('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc"; [extensions/ConfirmEdit] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267056 (https://phabricator.wikimedia.org/T421204) (owner: ''Kosta Harlan)'
|
|
2026-04-02 13:50:41
|
<edsanders>
|
are we ready to deploy?
|
|
2026-04-02 13:50:54
|
<Lucas_WMDE>
|
I just got the go-ahead in the security channel, so i think yes
|
|
2026-04-02 13:50:55
|
<wikibugs>
|
('Merged) ''jenkins-bot: Fix suggestion mode availability check [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1266985 (https://phabricator.wikimedia.org/T422143) (owner: ''Esanders)'
|
|
2026-04-02 13:50:57
|
<cdanis>
|
ye
|
|
2026-04-02 13:51:02
|
<Lucas_WMDE>
|
spiders the pig
|
|
2026-04-02 13:51:15
|
<Lucas_WMDE>
|
oh, that gate-and-submit was a lot faster than I expected
|
|
2026-04-02 13:51:25
|
<logmsgbot>
|
!log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1266985|Fix suggestion mode availability check (T422143)]]
|
|
2026-04-02 13:51:28
|
<stashbot>
|
T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
|
|
2026-04-02 13:51:40
|
<wikibugs>
|
('PS2) ''Arnaudb: gerrit: add Cache-Control for Gitiles with mod_proxy [puppet] - ''https://gerrit.wikimedia.org/r/1266238 (https://phabricator.wikimedia.org/T409422)'
|
|
2026-04-02 13:51:40
|
<edsanders>
|
Lucas_WMDE: thanks
|
|
2026-04-02 13:52:53
|
<wikibugs>
|
('CR) ''Btullis: [C:''+2] Add analytics-fr-tech system user and corresponding groups [puppet] - ''https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) (owner: ''Btullis)'
|
|
2026-04-02 13:53:09
|
<logmsgbot>
|
!log lucaswerkmeister-wmde@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmne
|
|
2026-04-02 13:53:10
|
<logmsgbot>
|
t/restricted/mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_
|
|
2026-04-02 13:53:10
|
<logmsgbot>
|
dir=/srv/mediawiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 44s)
|
|
2026-04-02 13:53:28
|
<Lucas_WMDE>
|
looks
|
|
2026-04-02 13:54:06
|
<Lucas_WMDE>
|
I think the sudo docker-pusher failed with “blob upload unknown”?
|
|
2026-04-02 13:54:09
|
<Lucas_WMDE>
|
let me try again…
|
|
2026-04-02 13:54:47
|
<logmsgbot>
|
!log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1266985|Fix suggestion mode availability check (T422143)]]
|
|
2026-04-02 13:55:45
|
<logmsgbot>
|
!log lucaswerkmeister-wmde@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmne
|
|
2026-04-02 13:55:45
|
<logmsgbot>
|
t/restricted/mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_
|
|
2026-04-02 13:55:45
|
<logmsgbot>
|
dir=/srv/mediawiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 00m 58s)
|
|
2026-04-02 13:56:06
|
<Lucas_WMDE>
|
:(
|
|
2026-04-02 13:56:25
|
<Lucas_WMDE>
|
same error I think
|
|
2026-04-02 13:56:29
|
<Lucas_WMDE>
|
“blob upload unknown”
|
|
2026-04-02 13:57:11
|
<wikibugs>
|
'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11782509 (''cmooney) We are hopeful the situation should have improved after codfw was repooled, adding additional capacity. Root cause of the circuit breaking is still being in...'
|
|
2026-04-02 13:57:15
|
<edsanders>
|
oh dear
|
|
2026-04-02 13:58:03
|
<Lucas_WMDE>
|
jasmine_: as the codfw repooler (thanks again), any idea if this could be related?
|
|
2026-04-02 13:58:17
|
<wikibugs>
|
('CR) ''Dpogorzelski: ml-serve: add modified kserve 0.17 chart (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: ''Dpogorzelski)'
|
|
2026-04-02 13:58:19
|
<wikibugs>
|
('PS1) ''Elukey: [WIP] Move linting to Ruff and apply code fixes [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475)'
|
|
2026-04-02 13:58:26
|
<Lucas_WMDE>
|
I’m imagining something like, scap now has to push the new mw image to codfw, but something on codfw might not be ready for it…
|
|
2026-04-02 13:58:29
|
<Lucas_WMDE>
|
just guessing though
|
|
2026-04-02 13:58:35
|
<edsanders>
|
I'll try once more for luck
|
|
2026-04-02 13:58:48
|
<Lucas_WMDE>
|
ok
|
|
2026-04-02 13:58:53
|
<logmsgbot>
|
!log esanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1266985|Fix suggestion mode availability check (T422143)]]
|
|
2026-04-02 13:58:56
|
<stashbot>
|
T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
|
|
2026-04-02 13:58:58
|
<Lucas_WMDE>
|
I didn’t realize you can deploy, I should’ve asked ^^
|
|
2026-04-02 13:59:00
|
<Lucas_WMDE>
|
sorry
|
|
2026-04-02 13:59:17
|
<jasmine_>
|
lucas_wmde: looking
|
|
2026-04-02 13:59:20
|
<Lucas_WMDE>
|
thx
|
|
2026-04-02 13:59:40
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T419635)', diff saved to https://phabricator.wikimedia.org/P90232 and previous config saved to /var/cache/conftool/dbconfig/20260402-135939-fceratto.json
|
|
2026-04-02 13:59:43
|
<stashbot>
|
T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
|
|
2026-04-02 13:59:56
|
<hashar>
|
jouncebot: nowandnext
|
|
2026-04-02 13:59:56
|
<jouncebot>
|
For the next 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1300)
|
|
2026-04-02 13:59:56
|
<jouncebot>
|
In 0 hour(s) and 0 minute(s): DC Switchover: Day 8 - Codfw Repool (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1400)
|
|
2026-04-02 13:59:57
|
<logmsgbot>
|
!log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2205.codfw.wmnet with reason: Maintenance
|
|
2026-04-02 14:00:04
|
<jouncebot>
|
jasmine_: May I have your attention please! DC Switchover: Day 8 - Codfw Repool. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1400)
|
|
2026-04-02 14:00:05
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2205 (T419635)', diff saved to https://phabricator.wikimedia.org/P90233 and previous config saved to /var/cache/conftool/dbconfig/20260402-140004-fceratto.json
|
|
2026-04-02 14:00:08
|
<logmsgbot>
|
!log esanders@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/
|
|
2026-04-02 14:00:08
|
<logmsgbot>
|
mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/medi
|
|
2026-04-02 14:00:08
|
<logmsgbot>
|
awiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 15s)
|
|
2026-04-02 14:00:48
|
<hashar>
|
jasmine_: I need to reload the CI Jenkins
|
|
2026-04-02 14:01:05
|
<hashar>
|
it does not take long, I don't think it affects the switchover
|
|
2026-04-02 14:03:07
|
<hashar>
|
!log Jenkins CI: reloading configuration from disk to poll new nodes # T421114
|
|
2026-04-02 14:03:11
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
|
2026-04-02 14:03:11
|
<Lucas_WMDE>
|
hashar: FYI, codfw was already repooled to respond to the incident (but I’m not sure how complete it is)
|
|
2026-04-02 14:03:12
|
<stashbot>
|
T421114: Rebuild all Jenkins agents VM to Bookworm to support Java 21 - https://phabricator.wikimedia.org/T421114
|
|
2026-04-02 14:03:17
|
<hashar>
|
done
|
|
2026-04-02 14:03:27
|
<hashar>
|
Lucas_WMDE: ah cool, thank you!
|
|
2026-04-02 14:03:48
|
<wikibugs>
|
('PS2) ''Elukey: [WIP] Move linting to Ruff and apply code fixes [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475)'
|
|
2026-04-02 14:03:48
|
<Lucas_WMDE>
|
(we’re also still trying to deploy an UBN fix backport, but running into issues in scap)
|
|
2026-04-02 14:04:16
|
<wikibugs>
|
('CR) ''Elukey: [WIP] Move linting to Ruff and apply code fixes (''1 comment) [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: ''Elukey)'
|
|
2026-04-02 14:05:34
|
<wikibugs>
|
('CR) ''Elukey: "First pass! I have intentionally removed a lot of problems allowing exceptions for tests etc.., I think it would be impossible (and probab" [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: ''Elukey)'
|
|
2026-04-02 14:05:48
|
<wikibugs>
|
('CR) ''Ottomata: stream: mw-page-html-content-change-enrich (''2 comments) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
|
|
2026-04-02 14:06:07
|
<jasmine_>
|
hashar: yes we repooled a little bit earlier than scheduled, codfw is back up now
|
|
2026-04-02 14:07:25
|
<hashar>
|
jasmine_: thank you and congratulations
|
|
2026-04-02 14:08:22
|
<hnowlan>
|
could/should we make the config reload a part of a repool/depool?
|
|
2026-04-02 14:09:00
|
<wikibugs>
|
('PS3) ''Bking: opensearch: handle IP changes for software firewall [puppet] - ''https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714)'
|
|
2026-04-02 14:09:05
|
<wikibugs>
|
('PS2) ''JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216)'
|
|
2026-04-02 14:09:07
|
<wikibugs>
|
('CR) ''Bking: [C:''+2] opensearch: handle IP changes for software firewall (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: ''Bking)'
|
|
2026-04-02 14:09:11
|
<wikibugs>
|
('CR) ''Bking: [V:''+2 C:''+2] opensearch: handle IP changes for software firewall [puppet] - ''https://gerrit.wikimedia.org/r/1266372 (https://phabricator.wikimedia.org/T421714) (owner: ''Bking)'
|
|
2026-04-02 14:09:16
|
<logmsgbot>
|
!log esanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1266985|Fix suggestion mode availability check (T422143)]]
|
|
2026-04-02 14:09:18
|
<hashar>
|
hnowlan: the Jenkins reload? Nope it is unrelated, I had to do it for some unrelated configuration changes I have made on Jenkins
|
|
2026-04-02 14:09:19
|
<stashbot>
|
T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
|
|
2026-04-02 14:09:23
|
<Lucas_WMDE>
|
I confess I’m a bit torn between “revert the backport so the deployment is in a known state” and “leave it to be rolled out with the next deploy because it’s small and we really want it deployed”
|
|
2026-04-02 14:09:26
|
<hnowlan>
|
hashar: ah okay
|
|
2026-04-02 14:10:01
|
<hashar>
|
hnowlan: and whenever I act on Jenkins/Zuul I try to remember to check the deployment calendar to ensure that is not going to break some ongoing deployment :]
|
|
2026-04-02 14:10:24
|
<wikibugs>
|
('PS3) ''JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216)'
|
|
2026-04-02 14:10:26
|
<icinga-wm>
|
RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 11 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
|
|
2026-04-02 14:10:32
|
<logmsgbot>
|
!log esanders@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/
|
|
2026-04-02 14:10:32
|
<logmsgbot>
|
mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/medi
|
|
2026-04-02 14:10:32
|
<logmsgbot>
|
awiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 16s)
|
|
2026-04-02 14:10:48
|
<Lucas_WMDE>
|
still the same error
|
|
2026-04-02 14:11:17
|
<wikibugs>
|
('CR) ''JavierMonton: stream: mw-page-html-content-change-enrich (''2 comments) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
|
|
2026-04-02 14:11:45
|
<wikibugs>
|
'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17), ''Patch-For-Review: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782589 (''HShaikh) I approve these re...'
|
|
2026-04-02 14:11:47
|
<wikibugs>
|
('PS3) ''Elukey: [WIP] Move linting to Ruff and apply code fixes [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475)'
|
|
2026-04-02 14:12:31
|
<wikibugs>
|
('CR) ''Elukey: [WIP] Move linting to Ruff and apply code fixes (''1 comment) [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: ''Elukey)'
|
|
2026-04-02 14:13:23
|
<wikibugs>
|
('CR) ''Ottomata: "It is quite annoying that 'staging' AKA -next in dse-k8s is a different helmfile. It makes it hard to share common settings between 'stagi" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
|
|
2026-04-02 14:13:44
|
<wikibugs>
|
'SRE: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166 (''Lucas_Werkmeister_WMDE) ''NEW'
|
|
2026-04-02 14:13:47
|
<Lucas_WMDE>
|
I filed T422166 for the deploy blocker (cc edsanders), not sure how it should be tagged
|
|
2026-04-02 14:13:48
|
<stashbot>
|
T422166: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166
|
|
2026-04-02 14:14:06
|
<wikibugs>
|
'SRE: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166#11782617 (''Lucas_Werkmeister_WMDE) p:''Triage→''Unbreak!'
|
|
2026-04-02 14:14:11
|
<Lucas_WMDE>
|
cc jasmine_ ^ if you’re still looking into it
|
|
2026-04-02 14:14:18
|
<jasmine_>
|
Lucas_WMDE: looking now if perhaps it's swift related see
|
|
2026-04-02 14:14:18
|
<jasmine_>
|
[0] - https://wikitech.wikimedia.org/wiki/Docker-registry/Runbook
|
|
2026-04-02 14:14:55
|
<wikibugs>
|
('PS1) ''Ladsgroup: Bump maxConnCount [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1267062'
|
|
2026-04-02 14:15:28
|
<wikibugs>
|
('CR) ''CDanis: [C:''+1] Bump maxConnCount [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1267062 (owner: ''Ladsgroup)'
|
|
2026-04-02 14:16:05
|
<wikibugs>
|
('CR) ''TrainBranchBot: [C:''+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1267062 (owner: ''Ladsgroup)'
|
|
2026-04-02 14:16:50
|
<wikibugs>
|
'SRE: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166#11782637 (''Lucas_Werkmeister_WMDE) Timeline note: this comes hot on the tail of T422130, for which @jasmine_ repooled codfw slightly earlier than [scheduled](https://wikitech.wikimedia.org/w/index.php?title=Deployments&old...'
|
|
2026-04-02 14:16:54
|
<Lucas_WMDE>
|
Amir1: good luck with that deploy
|
|
2026-04-02 14:16:59
|
<wikibugs>
|
('Merged) ''jenkins-bot: Bump maxConnCount [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1267062 (owner: ''Ladsgroup)'
|
|
2026-04-02 14:17:13
|
<logmsgbot>
|
!log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1267062|Bump maxConnCount]]
|
|
2026-04-02 14:17:15
|
<Lucas_WMDE>
|
(I expect you’ll run into T422166)
|
|
2026-04-02 14:17:23
|
<Amir1>
|
Lucas_WMDE: that hopefully should prevent it from happening?
|
|
2026-04-02 14:17:46
|
<Amir1>
|
oh that's a different issue
|
|
2026-04-02 14:17:48
|
<Amir1>
|
yay
|
|
2026-04-02 14:17:48
|
<Lucas_WMDE>
|
yeah
|
|
2026-04-02 14:18:25
|
<logmsgbot>
|
!log ladsgroup@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted
|
|
2026-04-02 14:18:25
|
<logmsgbot>
|
/mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/med
|
|
2026-04-02 14:18:25
|
<logmsgbot>
|
iawiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 11s)
|
|
2026-04-02 14:18:28
|
<Lucas_WMDE>
|
yup :(
|
|
2026-04-02 14:19:23
|
<wikibugs>
|
'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17), ''Patch-For-Review: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782654 (''BTullis)'
|
|
2026-04-02 14:19:39
|
<wikibugs>
|
('CR) ''Btullis: [C:''+2] "Manager approval received." [puppet] - ''https://gerrit.wikimedia.org/r/1267031 (https://phabricator.wikimedia.org/T421214) (owner: ''Btullis)'
|
|
2026-04-02 14:23:17
|
<wikibugs>
|
('PS4) ''JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216)'
|
|
2026-04-02 14:23:24
|
<wikibugs>
|
('CR) ''JavierMonton: stream: mw-page-html-content-change-enrich (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
|
|
2026-04-02 14:23:36
|
<Lucas_WMDE>
|
(further investigation happening in -sre FTR)
|
|
2026-04-02 14:24:35
|
<wikibugs>
|
('CR) ''CDanis: [C:''+1] Add Mayotte to geo-maps - prefer drmrs [dns] - ''https://gerrit.wikimedia.org/r/1267042 (owner: ''Ayounsi)'
|
|
2026-04-02 14:27:10
|
<wikibugs>
|
'SRE: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166#11782695 (''Scott_French) dockerd logs on deploy1003 for the above example: ` Apr 02 14:09:17 deploy1003 dockerd[1070]: time="2026-04-02T14:09:17.561327804Z" level=info msg="ignoring event" container=c8f32695fd426caa327d6d...'
|
|
2026-04-02 14:28:22
|
<wikibugs>
|
('CR) ''Volans: [C:''+2] Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
|
|
2026-04-02 14:28:30
|
<moritzm>
|
!log installing pyasn1 security updates
|
|
2026-04-02 14:28:31
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
|
2026-04-02 14:29:42
|
<wikibugs>
|
('Merged) ''jenkins-bot: Revert "cr-cloud: allow cumin/cloudcumin traffic" [homer/public] - ''https://gerrit.wikimedia.org/r/1266963 (https://phabricator.wikimedia.org/T420360) (owner: ''Volans)'
|
|
2026-04-02 14:30:05
|
<jouncebot>
|
jasmine_: Time to snap out of that daydream and deploy DC Switchover: Day 8 - Codfw Repool. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1400).
|
|
2026-04-02 14:30:05
|
<jouncebot>
|
Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1430)
|
|
2026-04-02 14:33:14
|
<wikibugs>
|
'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782729 (''BTullis) I have now modified the `airflow-platfor...'
|
|
2026-04-02 14:34:53
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T419635)', diff saved to https://phabricator.wikimedia.org/P90236 and previous config saved to /var/cache/conftool/dbconfig/20260402-143452-fceratto.json
|
|
2026-04-02 14:34:56
|
<stashbot>
|
T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
|
|
2026-04-02 14:36:28
|
<wikibugs>
|
'SRE-tools, ''Cumin, ''Infrastructure-Foundations: Add proxy support to cumin openstack backend - https://phabricator.wikimedia.org/T420360#11782751 (''Volans) ''Open→''Resolved The cloudcumin hosts are now using the webproxies to connect to the openstack APIs and the firewall rule has been reverted...'
|
|
2026-04-02 14:37:31
|
<wikibugs>
|
'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11782760 (''MoritzMuehlenhoff) p:''Unbreak!→''Medium The immediate impact has been mitigated, reducing priority, the task might still be used to collect followups.'
|
|
2026-04-02 14:41:11
|
<Lucas_WMDE>
|
huge spike of PHP warnings from ExperimentManager all of a sudden
|
|
2026-04-02 14:41:11
|
<wikibugs>
|
('PS1) ''Eevans: cassandra-http-gateway: update version to 0.4.1 (April Fool's) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112)'
|
|
2026-04-02 14:41:19
|
<Lucas_WMDE>
|
(logspam-watch)
|
|
2026-04-02 14:42:09
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Wikidata Platform Team, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11782776 (''MoritzMuehlenhoff) What kind of access is needed? root access or simply shell access? We have exist...'
|
|
2026-04-02 14:42:17
|
<moritzm>
|
!log installing libxml-parser-perl security updates
|
|
2026-04-02 14:42:18
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
|
2026-04-02 14:44:33
|
<wikibugs>
|
'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782789 (''BTullis) You should also now be able to start con...'
|
|
2026-04-02 14:45:01
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P90237 and previous config saved to /var/cache/conftool/dbconfig/20260402-144500-fceratto.json
|
|
2026-04-02 14:46:38
|
<wikibugs>
|
('CR) ''Eevans: "recheck" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
|
|
2026-04-02 14:47:27
|
<wikibugs>
|
('CR) ''Elukey: ml-serve: add modified kserve 0.17 chart (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: ''Dpogorzelski)'
|
|
2026-04-02 14:48:34
|
<wikibugs>
|
('CR) ''Elukey: [C:''+1] "Final review - this is currently a ok-ish use case since we already run the same config in prod. We agreed to open a task and follow up on" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: ''Dpogorzelski)'
|
|
2026-04-02 14:49:26
|
<wikibugs>
|
('CR) ''JMeybohm: [C:''+1] cassandra-http-gateway: update version to 0.4.1 (April Fool's) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
|
|
2026-04-02 14:50:08
|
<wikibugs>
|
('CR) ''Eevans: [C:''+2] cassandra-http-gateway: update version to 0.4.1 (April Fool's) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
|
|
2026-04-02 14:50:17
|
<Lucas_WMDE>
|
edsanders: are you still around and available to test your backport? (see -sre)
|
|
2026-04-02 14:50:45
|
<wikibugs>
|
('CR) ''Eevans: [V:''+2 C:''+2] cassandra-http-gateway: update version to 0.4.1 (April Fool's) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267075 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
|
|
2026-04-02 14:51:13
|
<logmsgbot>
|
!log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
|
|
2026-04-02 14:51:41
|
<logmsgbot>
|
!log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
|
|
2026-04-02 14:52:01
|
<wikibugs>
|
'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11782824 (''BTullis) 4 Kerberos principals created and welcom...'
|
|
2026-04-02 14:52:25
|
<logmsgbot>
|
!log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
|
|
2026-04-02 14:52:40
|
<logmsgbot>
|
!log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
|
|
2026-04-02 14:53:40
|
<logmsgbot>
|
!log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
|
|
2026-04-02 14:53:54
|
<logmsgbot>
|
!log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
|
|
2026-04-02 14:54:02
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11782828 (''Jgreen) ''Open→''Resolved hosts are up and running'
|
|
2026-04-02 14:55:09
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P90239 and previous config saved to /var/cache/conftool/dbconfig/20260402-145508-fceratto.json
|
|
2026-04-02 14:55:12
|
<Lucas_WMDE>
|
(the ExperimentManager warning spike seems to have abated again fwiw)
|
|
2026-04-02 14:56:38
|
<logmsgbot>
|
!log swfrench@deploy1003 Started scap sync-world: Manual sync-world to pick up 1267062, 1266985 - T422143
|
|
2026-04-02 14:56:41
|
<stashbot>
|
T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
|
|
2026-04-02 14:56:44
|
<logmsgbot>
|
!log pt1979@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mr1-eqiad,mr1-eqiad IPv6 with reason: switching from OSFP to BGP
|
|
2026-04-02 14:56:46
|
<Lucas_WMDE>
|
\o/
|
|
2026-04-02 14:57:44
|
<logmsgbot>
|
!log swfrench@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.21,1.46.0-wmf.22,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/
|
|
2026-04-02 14:57:44
|
<logmsgbot>
|
mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.243.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/medi
|
|
2026-04-02 14:57:44
|
<logmsgbot>
|
awiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.243.0) (duration: 01m 06s)
|
|
2026-04-02 14:58:20
|
<wikibugs>
|
('CR) ''Ssingh: "I am guessing this is based on probenet data? (not that everything else in the repo currently is but I am mostly curious)" [dns] - ''https://gerrit.wikimedia.org/r/1267042 (owner: ''Ayounsi)'
|
|
2026-04-02 14:59:32
|
<papaul>
|
!log ongoing maintenance on mr1-eqiad
|
|
2026-04-02 14:59:33
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
|
2026-04-02 14:59:40
|
<logmsgbot>
|
!log swfrench@deploy1003 Started scap sync-world: Manual full-rebuild sync-world to pick up 1267062, 1266985 - T422143
|
|
2026-04-02 15:00:04
|
<jouncebot>
|
jnuche and hashar: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1500)
|
|
2026-04-02 15:00:38
|
<wikibugs>
|
('CR) ''Ottomata: [C:''+1] stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
|
|
2026-04-02 15:02:15
|
<wikibugs>
|
('CR) ''Dzahn: [C:''+2] buildkitd: Bump buildkit image to wmf-v0.29.0 [puppet] - ''https://gerrit.wikimedia.org/r/1266395 (https://phabricator.wikimedia.org/T415284) (owner: ''Ahmon Dancy)'
|
|
2026-04-02 15:02:37
|
<wikibugs>
|
('CR) ''Muehlenhoff: [C:''+1] "Preseed notes often use globbing where applicable, but with our ongoing migration of all servers to UEFI for hardware there will be a lot " [puppet] - ''https://gerrit.wikimedia.org/r/1260727 (https://phabricator.wikimedia.org/T418929) (owner: ''Herron)'
|
|
2026-04-02 15:03:03
|
<logmsgbot>
|
!log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
|
|
2026-04-02 15:03:45
|
<logmsgbot>
|
!log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
|
|
2026-04-02 15:04:20
|
<icinga-wm>
|
PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
|
|
2026-04-02 15:04:20
|
<icinga-wm>
|
PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
|
|
2026-04-02 15:05:17
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T419635)', diff saved to https://phabricator.wikimedia.org/P90241 and previous config saved to /var/cache/conftool/dbconfig/20260402-150517-fceratto.json
|
|
2026-04-02 15:05:20
|
<stashbot>
|
T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
|
|
2026-04-02 15:05:34
|
<logmsgbot>
|
!log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2227.codfw.wmnet with reason: Maintenance
|
|
2026-04-02 15:05:47
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2227 (T419635)', diff saved to https://phabricator.wikimedia.org/P90242 and previous config saved to /var/cache/conftool/dbconfig/20260402-150542-fceratto.json
|
|
2026-04-02 15:05:49
|
<wikibugs>
|
('PS1) ''Papaul: Remove OSFP from mr1-eqiad [homer/public] - ''https://gerrit.wikimedia.org/r/1267081 (https://phabricator.wikimedia.org/T421238)'
|
|
2026-04-02 15:06:35
|
<logmsgbot>
|
!log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
|
|
2026-04-02 15:07:05
|
<logmsgbot>
|
!log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
|
|
2026-04-02 15:07:55
|
<wikibugs>
|
'ops-codfw, ''SRE, ''DC-Ops: Q3:rack/setup/install cloudcephmon2007-dev - https://phabricator.wikimedia.org/T416396#11782910 (''Jhancock.wm)'
|
|
2026-04-02 15:08:45
|
<wikibugs>
|
('CR) ''Papaul: [C:''+2] Remove OSFP from mr1-eqiad [homer/public] - ''https://gerrit.wikimedia.org/r/1267081 (https://phabricator.wikimedia.org/T421238) (owner: ''Papaul)'
|
|
2026-04-02 15:09:29
|
<wikibugs>
|
('CR) ''JavierMonton: [C:''+2] stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
|
|
2026-04-02 15:11:23
|
<wikibugs>
|
('Merged) ''jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267052 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
|
|
2026-04-02 15:11:40
|
<moritzm>
|
!log installing apache2 security updates
|
|
2026-04-02 15:11:41
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
|
2026-04-02 15:12:20
|
<icinga-wm>
|
RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
|
|
2026-04-02 15:12:20
|
<icinga-wm>
|
RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
|
|
2026-04-02 15:12:45
|
<logmsgbot>
|
!log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
|
|
2026-04-02 15:12:59
|
<logmsgbot>
|
!log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
|
|
2026-04-02 15:16:17
|
<wikibugs>
|
('PS1) ''Papaul: Add back "replace osfp" to be able to remove it [homer/public] - ''https://gerrit.wikimedia.org/r/1267085 (https://phabricator.wikimedia.org/T421238)'
|
|
2026-04-02 15:20:29
|
<wikibugs>
|
('CR) ''Papaul: [C:''+2] Add back "replace osfp" to be able to remove it [homer/public] - ''https://gerrit.wikimedia.org/r/1267085 (https://phabricator.wikimedia.org/T421238) (owner: ''Papaul)'
|
|
2026-04-02 15:22:31
|
<logmsgbot>
|
!log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
|
|
2026-04-02 15:23:08
|
<papaul>
|
!log maintenance complete on mr1-eqiad
|
|
2026-04-02 15:23:09
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
|
2026-04-02 15:23:22
|
<logmsgbot>
|
!log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
|
|
2026-04-02 15:26:12
|
<swfrench-wmf>
|
!log restarted docker-registry-restricted.service on registry200[45] - T422166
|
|
2026-04-02 15:26:14
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
|
2026-04-02 15:26:14
|
<stashbot>
|
T422166: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166
|
|
2026-04-02 15:26:28
|
<logmsgbot>
|
!log swfrench@deploy1003 sync-world aborted: Manual full-rebuild sync-world to pick up 1267062, 1266985 - T422143 (duration: 26m 48s)
|
|
2026-04-02 15:26:31
|
<stashbot>
|
T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
|
|
2026-04-02 15:27:38
|
<logmsgbot>
|
!log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
|
|
2026-04-02 15:27:46
|
<logmsgbot>
|
!log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
|
|
2026-04-02 15:31:16
|
<swfrench-wmf>
|
!log restarted docker-registry-ml.service on registry200[45] - T422166
|
|
2026-04-02 15:31:18
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
|
2026-04-02 15:31:19
|
<stashbot>
|
T422166: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166
|
|
2026-04-02 15:32:34
|
<moritzm>
|
!log installing freetype security updates
|
|
2026-04-02 15:32:35
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
|
2026-04-02 15:32:59
|
<wikibugs>
|
('CR) ''Dzahn: [C:''+1] gerrit: adjust idleTimeout on Jetty [puppet] - ''https://gerrit.wikimedia.org/r/1262020 (https://phabricator.wikimedia.org/T421827) (owner: ''Arnaudb)'
|
|
2026-04-02 15:33:00
|
<logmsgbot>
|
!log swfrench@deploy1003 Started scap sync-world: Manual full-rebuild sync-world to pick up 1267062, 1266985 (attempt 2) - T422143
|
|
2026-04-02 15:33:02
|
<stashbot>
|
T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
|
|
2026-04-02 15:34:43
|
<wikibugs>
|
('PS4) ''Elukey: [WIP] Move linting to Ruff and apply code fixes [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475)'
|
|
2026-04-02 15:35:06
|
<wikibugs>
|
('CR) ''Elukey: [WIP] Move linting to Ruff and apply code fixes (''1 comment) [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: ''Elukey)'
|
|
2026-04-02 15:38:44
|
<wikibugs>
|
('CR) ''Elukey: "Local, venvs created (so not the first run):" [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: ''Elukey)'
|
|
2026-04-02 15:39:18
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T419635)', diff saved to https://phabricator.wikimedia.org/P90244 and previous config saved to /var/cache/conftool/dbconfig/20260402-153918-fceratto.json
|
|
2026-04-02 15:39:22
|
<stashbot>
|
T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
|
|
2026-04-02 15:41:49
|
<wikibugs>
|
('CR) ''Dzahn: [V:''+1 C:''+1] "https://puppet-compiler.wmflabs.org/output/1256301/8370/" [puppet] - ''https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: ''A smart kitten)'
|
|
2026-04-02 15:41:50
|
<wikibugs>
|
('PS5) ''Elukey: Move linting to Ruff and apply code fixes [software/spicerack] - ''https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475)'
|
|
2026-04-02 15:44:23
|
<jinxer-wm>
|
FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
|
|
2026-04-02 15:45:37
|
<wikibugs>
|
('PS14) ''Herron: site: opt-in insetup defaults by hostname prefix [puppet] - ''https://gerrit.wikimedia.org/r/1260727 (https://phabricator.wikimedia.org/T418929)'
|
|
2026-04-02 15:46:55
|
<wikibugs>
|
('CR) ''A smart kitten: "FWIW that [phab1004 NOOP result](https://puppet-compiler.wmflabs.org/output/1256301/8370/phab1004.eqiad.wmnet/index.html) seems wrong - it" [puppet] - ''https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: ''A smart kitten)'
|
|
2026-04-02 15:46:59
|
<wikibugs>
|
('CR) ''A smart kitten: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: ''A smart kitten)'
|
|
2026-04-02 15:48:31
|
<jinxer-wm>
|
FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards Has improved - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
|
|
2026-04-02 15:48:57
|
<wikibugs>
|
('CR) ''A smart kitten: "(FWIW @dzahn@wikimedia.org, feel free to shoot me a message in IRC if you want to sync-up e.g. if/when deploying/testing this patch. I'm n" [puppet] - ''https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: ''A smart kitten)'
|
|
2026-04-02 15:49:08
|
<wikibugs>
|
('CR) ''Herron: [C:''+2] "thanks for the review!" [puppet] - ''https://gerrit.wikimedia.org/r/1260727 (https://phabricator.wikimedia.org/T418929) (owner: ''Herron)'
|
|
2026-04-02 15:49:22
|
<jinxer-wm>
|
FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
|
|
2026-04-02 15:49:26
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P90245 and previous config saved to /var/cache/conftool/dbconfig/20260402-154925-fceratto.json
|
|
2026-04-02 15:50:05
|
<logmsgbot>
|
!log swfrench@deploy1003 swfrench: Manual full-rebuild sync-world to pick up 1267062, 1266985 (attempt 2) - T422143 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
|
|
2026-04-02 15:50:09
|
<stashbot>
|
T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
|
|
2026-04-02 15:50:10
|
<wikibugs>
|
('CR) ''A smart kitten: "(if I'm around in IRC at the time you'll be deploying this, that is; otherwise feel free to just deploy it if/when is good for you :) )" [puppet] - ''https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: ''A smart kitten)'
|
|
2026-04-02 15:51:13
|
<logmsgbot>
|
!log swfrench@deploy1003 swfrench: Continuing with sync
|
|
2026-04-02 15:55:31
|
<wikibugs>
|
('PS3) ''Btullis: Remove the site.pp definitions for decommissioned an-worker hosts [puppet] - ''https://gerrit.wikimedia.org/r/1242516 (https://phabricator.wikimedia.org/T414948)'
|
|
2026-04-02 15:55:47
|
<wikibugs>
|
('PS1) ''Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175)'
|
|
2026-04-02 15:56:13
|
<wikibugs>
|
('CR) ''Dzahn: [V:''+1 C:''+1] "it's because puppet DB queries were introduced somewhere (not by your patch) which often breaks compiler runs (Failed to execute '/pdb/que" [puppet] - ''https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: ''A smart kitten)'
|
|
2026-04-02 15:59:23
|
<jinxer-wm>
|
FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
|
|
2026-04-02 15:59:35
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P90246 and previous config saved to /var/cache/conftool/dbconfig/20260402-155934-fceratto.json
|
|
2026-04-02 16:00:05
|
<jouncebot>
|
No Gerrit patches in the queue for this window AFAICS.
|
|
2026-04-02 16:00:05
|
<jouncebot>
|
jhathaway and rzl: Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1600). Please do the needful.
|
|
2026-04-02 16:00:34
|
<Lucas_WMDE>
|
we’re so close to finishing the backport+config window lol
|
|
2026-04-02 16:00:49
|
<Lucas_WMDE>
|
(with 1/4 patches deployed)
|
|
2026-04-02 16:01:31
|
<wikibugs>
|
('PS2) ''Herron: preseed: use efi for new kafka-logging hosts [puppet] - ''https://gerrit.wikimedia.org/r/1267102 (https://phabricator.wikimedia.org/T418929)'
|
|
2026-04-02 16:01:33
|
<wikibugs>
|
('CR) ''CI reject: [V:''-1] preseed: use efi for new kafka-logging hosts [puppet] - ''https://gerrit.wikimedia.org/r/1267102 (https://phabricator.wikimedia.org/T418929) (owner: ''Herron)'
|
|
2026-04-02 16:01:38
|
<wikibugs>
|
('PS2) ''Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175)'
|
|
2026-04-02 16:02:56
|
<logmsgbot>
|
!log swfrench@deploy1003 Finished scap sync-world: Manual full-rebuild sync-world to pick up 1267062, 1266985 (attempt 2) - T422143 (duration: 29m 56s)
|
|
2026-04-02 16:02:59
|
<stashbot>
|
T422143: Suggestion mode showing for all users - https://phabricator.wikimedia.org/T422143
|
|
2026-04-02 16:02:59
|
<swfrench-wmf>
|
\i/
|
|
2026-04-02 16:03:04
|
<Lucas_WMDE>
|
\o/ \o/ \o/
|
|
2026-04-02 16:03:40
|
<Lucas_WMDE>
|
!log UTC afternoon backport+config window (very belatedly) done ^^
|
|
2026-04-02 16:03:41
|
<stashbot>
|
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
|
|
2026-04-02 16:03:50
|
<Lucas_WMDE>
|
thanks for figuring it out and deploying!
|
|
2026-04-02 16:04:08
|
<Lucas_WMDE>
|
Amir1: your maxConnCount bump got deployed now btw ^
|
|
2026-04-02 16:04:15
|
<Amir1>
|
thanks!
|
|
2026-04-02 16:04:22
|
<jinxer-wm>
|
FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
|
|
2026-04-02 16:05:09
|
<wikibugs>
|
'SRE, ''Datacenter-Switchover: scap can’t deploy: blob upload unknown - https://phabricator.wikimedia.org/T422166#11783170 (''Scott_French) p:''Unbreak!→''Medium This was a curious one. Many thanks to @elukey and @CDanis for the assistance. tl;dr - Cached connections in the (restricted) docker registry's...'
|
|
2026-04-02 16:05:26
|
<wikibugs>
|
'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11783179 (''Ahoelzl) I approve the addition of the listed WME...'
|
|
2026-04-02 16:05:40
|
<jinxer-wm>
|
FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
|
2026-04-02 16:09:13
|
<jinxer-wm>
|
FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
|
|
2026-04-02 16:09:23
|
<jinxer-wm>
|
FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
|
|
2026-04-02 16:09:43
|
<logmsgbot>
|
!log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T419635)', diff saved to https://phabricator.wikimedia.org/P90247 and previous config saved to /var/cache/conftool/dbconfig/20260402-160942-fceratto.json
|
|
2026-04-02 16:09:46
|
<stashbot>
|
T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635
|
|
2026-04-02 16:09:59
|
<logmsgbot>
|
!log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2239.codfw.wmnet with reason: Maintenance
|
|
2026-04-02 16:10:44
|
<wikibugs>
|
('Abandoned) ''Federico Ceratto: wmnet: update CNAME records for DB masters to eqiad [dns] - ''https://gerrit.wikimedia.org/r/1255655 (https://phabricator.wikimedia.org/T416705) (owner: ''Federico Ceratto)'
|
|
2026-04-02 16:11:45
|
<wikibugs>
|
('PS3) ''Herron: preseed: use efi for new kafka-logging hosts [puppet] - ''https://gerrit.wikimedia.org/r/1267102 (https://phabricator.wikimedia.org/T418929)'
|
|
2026-04-02 16:12:31
|
<logmsgbot>
|
!log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
|
|
2026-04-02 16:12:43
|
<logmsgbot>
|
!log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
|
|
2026-04-02 16:12:55
|
<logmsgbot>
|
!log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
|
|
2026-04-02 16:13:01
|
<logmsgbot>
|
!log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
|
|
2026-04-02 16:14:07
|
<wikibugs>
|
('CR) ''Herron: [C:''+2] "ok! lets give this a try" [alerts] - ''https://gerrit.wikimedia.org/r/1262175 (https://phabricator.wikimedia.org/T418858) (owner: ''Herron)'
|
|
2026-04-02 16:14:23
|
<jinxer-wm>
|
FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
|
|
2026-04-02 16:15:28
|
<wikibugs>
|
('Merged) ''jenkins-bot: burrow: update expressions to handle multiple instances [alerts] - ''https://gerrit.wikimedia.org/r/1262175 (https://phabricator.wikimedia.org/T418858) (owner: ''Herron)'
|
|
2026-04-02 16:15:28
|
<wikibugs>
|
('CR) ''Dzahn: [V:''+1 C:''+2] phabricator: Set a custom default-mail-address for the test instance [puppet] - ''https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: ''A smart kitten)'
|
|
2026-04-02 16:15:53
|
<swfrench-wmf>
|
jouncebot: nowandnext
|
|
2026-04-02 16:15:53
|
<jouncebot>
|
For the next 0 hour(s) and 44 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1600)
|
|
2026-04-02 16:15:53
|
<jouncebot>
|
In 0 hour(s) and 44 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1700)
|
|
2026-04-02 16:15:53
|
<jouncebot>
|
In 0 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1700)
|
|
2026-04-02 16:16:55
|
<wikibugs>
|
('CR) ''Herron: [C:''+2] "thanks all!" [puppet] - ''https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) (owner: ''Herron)'
|
|
2026-04-02 16:18:02
|
<wikibugs>
|
('CR) ''Dzahn: [V:''+1 C:''+2] "deployed. confirmed it is a NOOP / no error on production host." [puppet] - ''https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: ''A smart kitten)'
|
|
2026-04-02 16:18:31
|
<wikibugs>
|
('CR) ''Scott French: "Thanks for the review!" [puppet] - ''https://gerrit.wikimedia.org/r/1198576 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
|
|
2026-04-02 16:19:10
|
<wikibugs>
|
('CR) ''Scott French: [C:''+2] deployment_server: absent image-suggestion k8s creds config [puppet] - ''https://gerrit.wikimedia.org/r/1198576 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
|
|
2026-04-02 16:23:33
|
<wikibugs>
|
('Restored) ''Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) (owner: ''Mmartorana)'
|
|
2026-04-02 16:24:14
|
<jinxer-wm>
|
FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
|
|
2026-04-02 16:24:35
|
<wikibugs>
|
'SRE-Access-Requests, ''LDAP-Access-Requests, ''Wikimedia Enterprise, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team - https://phabricator.wikimedia.org/T421214#11783241 (''BTullis) ''Open→''Resolved p:''Triage→'...
|
|
2026-04-02 16:25:39
|
<wikibugs>
|
('PS6) ''Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366)'
|
|
2026-04-02 16:25:48
|
<wikibugs>
|
('CR) ''CI reject: [V:''-1] config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) (owner: ''Mmartorana)'
|
|
2026-04-02 16:26:51
|
<wikibugs>
|
('Abandoned) ''Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) (owner: ''Mmartorana)'
|
|
2026-04-02 16:29:13
|
<jinxer-wm>
|
FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
|
|
2026-04-02 16:31:22
|
<wikibugs>
|
('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: ''Anne Tomasevich)'
|
|
2026-04-02 16:32:25
|
<wikibugs>
|
('PS1) ''Mmartorana: config: Enable EmailConfirmationBanner on selected wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1267116 (https://phabricator.wikimedia.org/T421366)'
|
|
2026-04-02 16:33:19
|
<wikibugs>
|
'SRE-swift-storage, ''API Platform, ''Commons, ''MediaWiki-File-management, and 4 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#11783346 (''Ladsgroup) I was looking into this a bit yesterday (more general...'
|
|
2026-04-02 16:34:13
|
<jinxer-wm>
|
FIRING: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
|
|
2026-04-02 16:34:48
|
<wikibugs>
|
('CR) ''Btullis: data-platform: Add alerts for opensearch on k8s certificate expiry (''2 comments) [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: ''Bking)'
|
|
2026-04-02 16:37:32
|
<wikibugs>
|
'SRE, ''Traffic: IP Block/Throttling relief request: urbipedia.org - Bot attack mitigated - https://phabricator.wikimedia.org/T421650#11783388 (''Alberto) Thank you very much for your help! I have correctly implemented the User-Agent in my LocalSettings.php for both MediaWiki core and the QuickInstantCommons...'
|
|
2026-04-02 16:39:14
|
<jinxer-wm>
|
RESOLVED: [4x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
|
|
2026-04-02 16:39:22
|
<wikibugs>
|
('CR) ''Scott French: [C:''+2] deployment_server: remove absented image-suggestion k8s creds config [puppet] - ''https://gerrit.wikimedia.org/r/1198577 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
|
|
2026-04-02 16:39:46
|
<wikibugs>
|
('PS4) ''Scott French: deployment_server: remove absented image-suggestion k8s creds config [puppet] - ''https://gerrit.wikimedia.org/r/1198577 (https://phabricator.wikimedia.org/T368096)'
|
|
2026-04-02 16:40:30
|
<wikibugs>
|
('PS1) ''Daniel Kinzler: Revert "rest gateway: introduce policy for abstractwiki/wikifunctions" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267119'
|
|
2026-04-02 16:41:02
|
<wikibugs>
|
('CR) ''Daniel Kinzler: [C:''+2] "revert undeployed change" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267119 (owner: ''Daniel Kinzler)'
|
|
2026-04-02 16:43:22
|
<wikibugs>
|
('CR) ''Scott French: [C:''+2] deployment_server: remove absented image-suggestion k8s creds config [puppet] - ''https://gerrit.wikimedia.org/r/1198577 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
|
|
2026-04-02 16:44:00
|
<wikibugs>
|
('Merged) ''jenkins-bot: Revert "rest gateway: introduce policy for abstractwiki/wikifunctions" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267119 (owner: ''Daniel Kinzler)'
|
|
2026-04-02 16:45:27
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops, ''observability, ''Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11783408 (''Jclark-ctr) a:''herron→''Jclark-ctr'
|
|
2026-04-02 16:45:58
|
<wikibugs>
|
('PS1) ''Daniel Kinzler: rest gateway: introduce policy for abstractwiki/wikifunctions" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581)'
|
|
2026-04-02 16:47:02
|
<wikibugs>
|
'SRE-Access-Requests, ''Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189 (''prabhat) ''NEW'
|
|
2026-04-02 16:47:35
|
<wikibugs>
|
('PS1) ''Herron: kafkamon: update burrow ports [puppet] - ''https://gerrit.wikimedia.org/r/1267121 (https://phabricator.wikimedia.org/T418858)'
|
|
2026-04-02 16:47:47
|
<wikibugs>
|
'SRE-Access-Requests, ''Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11783451 (''prabhat)'
|
|
2026-04-02 16:49:51
|
<wikibugs>
|
('CR) ''Scott French: "Thank you both for the review!" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
|
|
2026-04-02 16:50:07
|
<wikibugs>
|
('CR) ''Scott French: [C:''+2] image-suggestion: remove service configuration [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
|
|
2026-04-02 16:52:26
|
<wikibugs>
|
'SRE-Access-Requests, ''Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11783519 (''ssingh) request and key confirmed out of band.'
|
|
2026-04-02 16:53:23
|
<logmsgbot>
|
!log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs3009.esams.wmnet} and A:liberica
|
|
2026-04-02 16:54:23
|
<jinxer-wm>
|
RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
|
|
2026-04-02 16:57:02
|
<logmsgbot>
|
!log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs3009.esams.wmnet} and A:liberica
|
|
2026-04-02 16:58:15
|
<wikibugs>
|
('Merged) ''jenkins-bot: image-suggestion: remove service configuration [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198580 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
|
|
2026-04-02 16:59:30
|
<logmsgbot>
|
!log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs3008.esams.wmnet} and A:liberica
|
|
2026-04-02 17:00:05
|
<jouncebot>
|
bd808: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1700).
|
|
2026-04-02 17:00:05
|
<jouncebot>
|
Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T1700)
|
|
2026-04-02 17:00:07
|
<swfrench-wmf>
|
o/
|
|
2026-04-02 17:00:25
|
<swfrench-wmf>
|
I'll be deploying some admin_ng changes shortly
|
|
2026-04-02 17:02:25
|
<logmsgbot>
|
!log swfrench@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
|
|
2026-04-02 17:03:03
|
<logmsgbot>
|
!log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs3008.esams.wmnet} and A:liberica
|
|
2026-04-02 17:03:30
|
<wikibugs>
|
('PS1) ''JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267128 (https://phabricator.wikimedia.org/T421216)'
|
|
2026-04-02 17:04:46
|
<wikibugs>
|
('CR) ''Ottomata: [C:''+1] stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267128 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
|
|
2026-04-02 17:05:13
|
<logmsgbot>
|
!log swfrench@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
|
|
2026-04-02 17:05:34
|
<logmsgbot>
|
!log swfrench@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
|
|
2026-04-02 17:07:04
|
<logmsgbot>
|
!log swfrench@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
|
|
2026-04-02 17:07:04
|
<wikibugs>
|
('CR) ''JavierMonton: [C:''+2] stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267128 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
|
|
2026-04-02 17:08:15
|
<logmsgbot>
|
!log swfrench@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
|
|
2026-04-02 17:08:37
|
<bd808>
|
checks for things that need releasing
|
|
2026-04-02 17:09:06
|
<wikibugs>
|
('PS1) ''DCausse: search: add space-discount for wikidata custom prefix search profiles [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1267130 (https://phabricator.wikimedia.org/T420427)'
|
|
2026-04-02 17:09:09
|
<wikibugs>
|
('Merged) ''jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267128 (https://phabricator.wikimedia.org/T421216) (owner: ''JavierMonton)'
|
|
2026-04-02 17:09:17
|
<bd808>
|
nothing for my window this week</window>
|
|
2026-04-02 17:09:39
|
<wikibugs>
|
('PS4) ''Dzahn: ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config [puppet] - ''https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109)'
|
|
2026-04-02 17:10:12
|
<wikibugs>
|
('CR) ''CI reject: [V:''-1] ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config [puppet] - ''https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) (owner: ''Dzahn)'
|
|
2026-04-02 17:10:34
|
<logmsgbot>
|
!log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
|
|
2026-04-02 17:10:37
|
<wikibugs>
|
('CR) ''Dzahn: ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config (''3 comments) [puppet] - ''https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) (owner: ''Dzahn)'
|
|
2026-04-02 17:10:48
|
<logmsgbot>
|
!log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
|
|
2026-04-02 17:11:08
|
<logmsgbot>
|
!log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
|
|
2026-04-02 17:11:20
|
<wikibugs>
|
('PS5) ''Dzahn: ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config [puppet] - ''https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109)'
|
|
2026-04-02 17:11:31
|
<logmsgbot>
|
!log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
|
|
2026-04-02 17:11:50
|
<logmsgbot>
|
!log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
|
|
2026-04-02 17:12:02
|
<logmsgbot>
|
!log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
|
|
2026-04-02 17:12:12
|
<logmsgbot>
|
!log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
|
|
2026-04-02 17:12:40
|
<logmsgbot>
|
!log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
|
|
2026-04-02 17:13:56
|
<logmsgbot>
|
!log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
|
|
2026-04-02 17:14:49
|
<wikibugs>
|
('CR) ''Scott French: "Thanks for the review!" [dns] - ''https://gerrit.wikimedia.org/r/1198584 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
|
|
2026-04-02 17:15:32
|
<wikibugs>
|
('CR) ''Scott French: [C:''+2] wmnet: remove image-suggestion k8s ingress CNAMEs [dns] - ''https://gerrit.wikimedia.org/r/1198584 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
|
|
2026-04-02 17:15:41
|
<wikibugs>
|
('CR) ''Muehlenhoff: [C:''+1] "LGTM" [puppet] - ''https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) (owner: ''Dzahn)'
|
|
2026-04-02 17:16:11
|
<logmsgbot>
|
!log swfrench@dns1004 START - running authdns-update
|
|
2026-04-02 17:18:08
|
<logmsgbot>
|
!log swfrench@dns1004 END - running authdns-update
|
|
2026-04-02 17:20:27
|
<wikibugs>
|
('PS4) ''Scott French: service: remove image-suggestion [puppet] - ''https://gerrit.wikimedia.org/r/1198578 (https://phabricator.wikimedia.org/T368096)'
|
|
2026-04-02 17:26:28
|
<wikibugs>
|
'SRE-Access-Requests, ''Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11783746 (''prabhat)'
|
|
2026-04-02 17:27:48
|
<swfrench-wmf>
|
alright, I believe I'm done with my side of this window
|
|
2026-04-02 17:28:10
|
<wikibugs>
|
('PS1) ''Eevans: cassandra-dev: add ferm srange for k8s staging [puppet] - ''https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444)'
|
|
2026-04-02 17:28:39
|
<wikibugs>
|
('CR) ''CI reject: [V:''-1] cassandra-dev: add ferm srange for k8s staging [puppet] - ''https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444) (owner: ''Eevans)'
|
|
2026-04-02 17:29:04
|
<wikibugs>
|
('PS1) ''Snwachukwu: Add rest gateway routes for video_plays path. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202)'
|
|
2026-04-02 17:31:23
|
<wikibugs>
|
('CR) ''Dzahn: [V:''+1 C:''+2] phabricator: Set a custom default-mail-address for the test instance (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: ''A smart kitten)'
|
|
2026-04-02 17:31:54
|
<wikibugs>
|
('CR) ''Mforns: [C:''+1] "LGTM!" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: ''Snwachukwu)'
|
|
2026-04-02 17:32:10
|
<wikibugs>
|
('PS2) ''Eevans: cassandra-dev: add ferm srange for k8s staging [puppet] - ''https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444)'
|
|
2026-04-02 17:35:42
|
<wikibugs>
|
('PS1) ''Scott French: fixtures: clean up reference to image-suggestion [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096)'
|
|
2026-04-02 17:36:02
|
<wikibugs>
|
('CR) ''Snwachukwu: [C:''+2] Add rest gateway routes for video_plays path. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: ''Snwachukwu)'
|
|
2026-04-02 17:36:07
|
<wikibugs>
|
('PS3) ''Eevans: cassandra-dev: add ferm srange for k8s staging [puppet] - ''https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444)'
|
|
2026-04-02 17:36:12
|
<wikibugs>
|
('CR) ''Eevans: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444) (owner: ''Eevans)'
|
|
2026-04-02 17:36:51
|
<wikibugs>
|
('PS1) ''Ssingh: admin: update SSH key for ptiwary [puppet] - ''https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189)'
|
|
2026-04-02 17:36:54
|
<wikibugs>
|
('CR) ''Snwachukwu: [C:''+2] "Thank you!" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: ''Snwachukwu)'
|
|
2026-04-02 17:37:00
|
<wikibugs>
|
('CR) ''Snwachukwu: [V:''+2 C:''+2] Add rest gateway routes for video_plays path. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: ''Snwachukwu)'
|
|
2026-04-02 17:39:23
|
<wikibugs>
|
('PS3) ''Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175)'
|
|
2026-04-02 17:39:32
|
<wikibugs>
|
('CR) ''Eevans: [C:''+2] cassandra-dev: add ferm srange for k8s staging [puppet] - ''https://gerrit.wikimedia.org/r/1267133 (https://phabricator.wikimedia.org/T421444) (owner: ''Eevans)'
|
|
2026-04-02 17:39:46
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11783799 (''Jgreen) ''Open→''Resolved boxes are imaged, in replication, and ready for traffic once pfw policy is done'
|
|
2026-04-02 17:40:49
|
<wikibugs>
|
('CR) ''CI reject: [V:''-1] data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: ''Bking)'
|
|
2026-04-02 17:42:20
|
<wikibugs>
|
('CR) ''Ottomata: [C:''+1] Add rest gateway routes for video_plays path. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267136 (https://phabricator.wikimedia.org/T415202) (owner: ''Snwachukwu)'
|
|
2026-04-02 17:42:35
|
<wikibugs>
|
('CR) ''Ssingh: "Request verified out of band, please feel free to do an additional check." [puppet] - ''https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: ''Ssingh)'
|
|
2026-04-02 17:44:20
|
<wikibugs>
|
('CR) ''Ayounsi: "That's a follow up from an email that was sent to noc@ from a local ISP." [dns] - ''https://gerrit.wikimedia.org/r/1267042 (owner: ''Ayounsi)'
|
|
2026-04-02 17:44:27
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Data-Engineering, ''Patch-For-Review: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11783815 (''HShaikh) As prabhat's manager I approve this request.'
|
|
2026-04-02 17:45:50
|
<wikibugs>
|
('CR) ''Ssingh: [C:''+1] "Ah I see it now -- my bad. Thanks." [dns] - ''https://gerrit.wikimedia.org/r/1267042 (owner: ''Ayounsi)'
|
|
2026-04-02 17:46:51
|
<jinxer-wm>
|
FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
|
|
2026-04-02 17:47:50
|
<wikibugs>
|
('PS1) ''Snwachukwu: Add rest gateway routes for video_plays path production. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267147 (https://phabricator.wikimedia.org/T415202)'
|
|
2026-04-02 17:49:08
|
<wikibugs>
|
('CR) ''Dzahn: [V:''+1 C:''+2] "I can see in compiler how this changes things on new instance "integration-agent-docker-1070" just created on https://phabricator.wikimedi" [puppet] - ''https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) (owner: ''Hashar)'
|
|
2026-04-02 17:50:58
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11783859 (''Jgreen)'
|
|
2026-04-02 17:54:07
|
<wikibugs>
|
'SRE, ''DNS, ''Infrastructure-Foundations, ''netbox, and 2 others: Missing includes in DNS repo from Netbox-generated snippets - https://phabricator.wikimedia.org/T422115#11783873 (''ssingh) Thanks for fixing it but I agree that we need an alert for this otherwise we will miss this again.'
|
|
2026-04-02 17:55:40
|
<wikibugs>
|
('CR) ''Snwachukwu: [C:''+2] Add rest gateway routes for video_plays path production. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267147 (https://phabricator.wikimedia.org/T415202) (owner: ''Snwachukwu)'
|
|
2026-04-02 17:56:20
|
<wikibugs>
|
('CR) ''Dzahn: [V:''+1 C:''+2] "noop confirmed on contint prod hosts" [puppet] - ''https://gerrit.wikimedia.org/r/1260659 (https://phabricator.wikimedia.org/T418109) (owner: ''Hashar)'
|
|
2026-04-02 17:57:43
|
<wikibugs>
|
('Merged) ''jenkins-bot: Add rest gateway routes for video_plays path production. [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267147 (https://phabricator.wikimedia.org/T415202) (owner: ''Snwachukwu)'
|
|
2026-04-02 17:58:30
|
<wikibugs>
|
('PS4) ''Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175)'
|
|
2026-04-02 17:59:52
|
<logmsgbot>
|
!log ebysans@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
|
|
2026-04-02 18:00:10
|
<logmsgbot>
|
!log ebysans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
|
|
2026-04-02 18:00:29
|
<logmsgbot>
|
!log ebysans@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
|
|
2026-04-02 18:00:35
|
<wikibugs>
|
('CR) ''CI reject: [V:''-1] data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: ''Bking)'
|
|
2026-04-02 18:00:48
|
<logmsgbot>
|
!log ebysans@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
|
|
2026-04-02 18:01:24
|
<wikibugs>
|
('CR) ''Jasmine: [C:''+1] "LGTM" [puppet] - ''https://gerrit.wikimedia.org/r/1198578 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
|
|
2026-04-02 18:01:51
|
<jinxer-wm>
|
RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
|
|
2026-04-02 18:04:40
|
<wikibugs>
|
('PS5) ''Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175)'
|
|
2026-04-02 18:05:15
|
<wikibugs>
|
('CR) ''Brouberol: [C:''+1] fixtures: clean up reference to image-suggestion [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
|
|
2026-04-02 18:06:00
|
<wikibugs>
|
('CR) ''CI reject: [V:''-1] data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: ''Bking)'
|
|
2026-04-02 18:07:21
|
<wikibugs>
|
('CR) ''Muehlenhoff: "One validation is fine, you can either go ahead and merge it or I'll take care of it via Clinic duty, either is fine." [puppet] - ''https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: ''Ssingh)'
|
|
2026-04-02 18:07:35
|
<wikibugs>
|
('PS6) ''Bking: data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175)'
|
|
2026-04-02 18:14:19
|
<wikibugs>
|
('CR) ''Bking: data-platform: Add alerts for opensearch on k8s certificate expiry (''2 comments) [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: ''Bking)'
|
|
2026-04-02 18:16:57
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11783930 (''Jclark-ctr) a:''Jgreen→''Jclark-ctr'
|
|
2026-04-02 18:24:15
|
<wikibugs>
|
('CR) ''SBassett: [C:''+2] "Oh, whoops, I see the commit msg says "miscweb(research-landing-page): bump image version". Just to be clear, this change set is for" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1174750 (https://phabricator.wikimedia.org/T399132) (owner: ''Jly)'
|
|
2026-04-02 18:24:47
|
<logmsgbot>
|
!log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs5006.eqsin.wmnet} and A:liberica
|
|
2026-04-02 18:25:57
|
<jinxer-wm>
|
FIRING: ProbeDown: Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
|
|
2026-04-02 18:28:03
|
<logmsgbot>
|
!log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs5006.eqsin.wmnet} and A:liberica
|
|
2026-04-02 18:28:50
|
<wikibugs>
|
('PS3) ''SBassett: Allow-list some additional domains to the currently enforcing CSP [puppet] - ''https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: ''WikiBayer)'
|
|
2026-04-02 18:29:53
|
<sukhe>
|
port 80!?
|
|
2026-04-02 18:30:57
|
<topranks>
|
yeah I'm not sure why it's firing... sort of seems ok?
|
|
2026-04-02 18:30:57
|
<jinxer-wm>
|
RESOLVED: ProbeDown: Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
|
|
2026-04-02 18:31:19
|
<topranks>
|
https://phabricator.wikimedia.org/P90248
|
|
2026-04-02 18:31:30
|
<wikibugs>
|
('CR) ''Scott French: "Thanks for the review!" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
|
|
2026-04-02 18:31:31
|
<wikibugs>
|
('CR) ''Scott French: [C:''+2] fixtures: clean up reference to image-suggestion [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
|
|
2026-04-02 18:31:38
|
<sukhe>
|
topranks: yeah it resolved. haven't looked very deeply into what happened but can't see anything obvious
|
|
2026-04-02 18:31:56
|
<moritzm>
|
same here
|
|
2026-04-02 18:31:56
|
<topranks>
|
I gotta say the probe dashboard is absolutely incomprehensible to me, any time I have to visit it
|
|
2026-04-02 18:32:09
|
<topranks>
|
I don't see any signs of general connectivity issues
|
|
2026-04-02 18:32:25
|
<moritzm>
|
and ipv6 only?
|
|
2026-04-02 18:32:30
|
<sukhe>
|
seems so yeah
|
|
2026-04-02 18:33:06
|
<topranks>
|
yeah, tbh that is further evidence it is just an outlier failed connection, for whatever reason
|
|
2026-04-02 18:33:08
|
<sukhe>
|
topranks: yep. we should improve that. it defaults to "All"
|
|
2026-04-02 18:33:11
|
<wikibugs>
|
('Merged) ''jenkins-bot: fixtures: clean up reference to image-suggestion [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267137 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
|
|
2026-04-02 18:33:16
|
<topranks>
|
rather than a systemic problem like everyone is failing to connect
|
|
2026-04-02 18:33:26
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Wikidata Platform Team, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11784038 (''gmodena) >>! In T422141#11782776, @MoritzMuehlenhoff wrote: > What kind of access is needed? root ac...'
|
|
2026-04-02 18:33:51
|
<moritzm>
|
don't see any specific signs of user-visible impact from graphs
|
|
2026-04-02 18:34:21
|
<wikibugs>
|
('CR) ''SBassett: [C:''+1] Allow-list some additional domains to the currently enforcing CSP [puppet] - ''https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: ''WikiBayer)'
|
|
2026-04-02 18:34:21
|
<wikibugs>
|
('CR) ''Ssingh: "Thanks, I will merge if I can find a reviewer otherwise feel free to take it later." [puppet] - ''https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: ''Ssingh)'
|
|
2026-04-02 18:35:37
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Wikidata Platform Team, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11784042 (''MoritzMuehlenhoff) >>! In T422141#11784038, @gmodena wrote: >>>! In T422141#11782776, @MoritzMuehlen...'
|
|
2026-04-02 18:35:58
|
<wikibugs>
|
('CR) ''Reedy: [C:''+1] Allow-list some additional domains to the currently enforcing CSP [puppet] - ''https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: ''WikiBayer)'
|
|
2026-04-02 18:37:05
|
<wikibugs>
|
('CR) ''Ssingh: [C:''+1] "Two reviews by the sec team, merging." [puppet] - ''https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: ''WikiBayer)'
|
|
2026-04-02 18:37:06
|
<wikibugs>
|
('CR) ''Ssingh: [C:''+2] Allow-list some additional domains to the currently enforcing CSP [puppet] - ''https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: ''WikiBayer)'
|
|
2026-04-02 18:37:12
|
<Reedy>
|
haha
|
|
2026-04-02 18:37:13
|
<Reedy>
|
consensus!
|
|
2026-04-02 18:37:39
|
<sukhe>
|
Reedy: who am I to say no to two +1s?!
|
|
2026-04-02 18:38:57
|
<wikibugs>
|
('CR) ''Muehlenhoff: [C:''+1] "LGTM syntax-wise" [puppet] - ''https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: ''Ssingh)'
|
|
2026-04-02 18:39:25
|
<topranks>
|
https://prometheus-eqiad.wikimedia.org/ops/graph?g0.expr=probe_success%7Baddress%3D%222620%3A0%3A861%3Aed1a%3A%3A1%22%2C%20instance%3D%22text%3A80%22%7D%5B20m%5D&g0.tab=1&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h
|
|
2026-04-02 18:39:33
|
<topranks>
|
I really don't understand why that fired, but anyway
|
|
2026-04-02 18:40:27
|
<sukhe>
|
topranks: doesn't add up yep
|
|
2026-04-02 18:40:32
|
<sukhe>
|
anyway nothing to do here I feel
|
|
2026-04-02 18:40:49
|
<topranks>
|
yep enough other stuff to worry about
|
|
2026-04-02 18:40:58
|
<moritzm>
|
yeah, this feels like a one-time blip, and if it happens again, we can still correlate further
|
|
2026-04-02 18:41:21
|
<wikibugs>
|
('CR) ''Ssingh: [C:''+2] admin: update SSH key for ptiwary [puppet] - ''https://gerrit.wikimedia.org/r/1267142 (https://phabricator.wikimedia.org/T422189) (owner: ''Ssingh)'
|
|
2026-04-02 18:41:50
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Missing physical volume on an-worker1159 - https://phabricator.wikimedia.org/T419129#11784073 (''Jclark-ctr) a:''Jclark-ctr→''BTullis'
|
|
2026-04-02 18:41:52
|
<wikibugs>
|
('CR) ''Alex.sanford: [C:''+1] Allow-list some additional domains to the currently enforcing CSP (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1263948 (https://phabricator.wikimedia.org/T421637) (owner: ''WikiBayer)'
|
|
2026-04-02 18:42:19
|
<wikibugs>
|
'ops-eqiad, ''DC-Ops, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206#11784074 (''Jclark-ctr) a:''Jclark-ctr→''BTullis'
|
|
2026-04-02 18:44:03
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Wikidata Platform Team, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11784075 (''gmodena) >>! In T422141#11784042, @MoritzMuehlenhoff wrote: > We don't have a specific access group...'
|
|
2026-04-02 18:44:32
|
<wikibugs>
|
('PS1) ''Ottomata: dse-k8s - add common dir for mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267152 (https://phabricator.wikimedia.org/T360794)'
|
|
2026-04-02 18:45:52
|
<logmsgbot>
|
!log cmooney@cumin1003 START - Cookbook sre.dns.netbox
|
|
2026-04-02 18:46:50
|
<wikibugs>
|
('CR) ''Ottomata: [C:''+2] dse-k8s - add common dir for mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267152 (https://phabricator.wikimedia.org/T360794) (owner: ''Ottomata)'
|
|
2026-04-02 18:49:09
|
<wikibugs>
|
('Merged) ''jenkins-bot: dse-k8s - add common dir for mw-page-html-content-change-enrich [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267152 (https://phabricator.wikimedia.org/T360794) (owner: ''Ottomata)'
|
|
2026-04-02 18:51:31
|
<logmsgbot>
|
cmooney@cumin1003 netbox (PID 2341745) is awaiting input
|
|
2026-04-02 18:51:57
|
<logmsgbot>
|
!log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names for new lumen 100g transport - cmooney@cumin1003"
|
|
2026-04-02 18:51:58
|
<wikibugs>
|
('PS1) ''Reedy: Undeploy Extension:StopForumSpam [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1267157 (https://phabricator.wikimedia.org/T422185)'
|
|
2026-04-02 18:52:24
|
<logmsgbot>
|
!log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns names for new lumen 100g transport - cmooney@cumin1003"
|
|
2026-04-02 18:52:24
|
<logmsgbot>
|
!log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
|
|
2026-04-02 18:52:28
|
<wikibugs>
|
('PS1) ''Cathal Mooney: Add INCLUDE statement for 2620:0:861:fe03::/64 subnet [dns] - ''https://gerrit.wikimedia.org/r/1267158 (https://phabricator.wikimedia.org/T395878)'
|
|
2026-04-02 18:53:17
|
<wikibugs>
|
('CR) ''Ssingh: [C:''+1] Add INCLUDE statement for 2620:0:861:fe03::/64 subnet [dns] - ''https://gerrit.wikimedia.org/r/1267158 (https://phabricator.wikimedia.org/T395878) (owner: ''Cathal Mooney)'
|
|
2026-04-02 18:54:38
|
<wikibugs>
|
('CR) ''Cathal Mooney: [C:''+2] Add INCLUDE statement for 2620:0:861:fe03::/64 subnet [dns] - ''https://gerrit.wikimedia.org/r/1267158 (https://phabricator.wikimedia.org/T395878) (owner: ''Cathal Mooney)'
|
|
2026-04-02 18:54:48
|
<wikibugs>
|
('PS1) ''Ottomata: dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794)'
|
|
2026-04-02 18:54:56
|
<wikibugs>
|
('CR) ''CI reject: [V:''-1] dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794) (owner: ''Ottomata)'
|
|
2026-04-02 18:55:10
|
<logmsgbot>
|
!log cmooney@dns2005 START - running authdns-update
|
|
2026-04-02 18:55:19
|
<wikibugs>
|
('PS2) ''Ottomata: dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794)'
|
|
2026-04-02 18:56:09
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784108 (''AWesterinen-WMF) Retried ... no change'
|
|
2026-04-02 18:56:34
|
<logmsgbot>
|
!log cmooney@dns2005 END - running authdns-update
|
|
2026-04-02 18:56:53
|
<wikibugs>
|
('CR) ''Jforrester: [C:''+1] Undeploy Extension:StopForumSpam [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1267157 (https://phabricator.wikimedia.org/T422185) (owner: ''Reedy)'
|
|
2026-04-02 18:57:10
|
<wikibugs>
|
('CR) ''Ottomata: [C:''+2] dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794) (owner: ''Ottomata)'
|
|
2026-04-02 18:59:14
|
<wikibugs>
|
('Merged) ''jenkins-bot: dse-k8s - set flinkConfiguration properly after directory reorg [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267161 (https://phabricator.wikimedia.org/T360794) (owner: ''Ottomata)'
|
|
2026-04-02 19:00:25
|
<wikibugs>
|
('CR) ''Dzahn: [C:''+2] ci: Add 'Signed-by' keyfile reference to thirdparty APT repo config [puppet] - ''https://gerrit.wikimedia.org/r/1260766 (https://phabricator.wikimedia.org/T418109) (owner: ''Dzahn)'
|
|
2026-04-02 19:01:19
|
<logmsgbot>
|
!log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
|
|
2026-04-02 19:01:23
|
<logmsgbot>
|
!log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
|
|
2026-04-02 19:02:03
|
<wikibugs>
|
('PS3) ''Elukey: opensearch-semantic-search-test: Add to services proxy [puppet] - ''https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: ''Bking)'
|
|
2026-04-02 19:04:43
|
<wikibugs>
|
('CR) ''Scott French: "Thanks for the review!" [puppet] - ''https://gerrit.wikimedia.org/r/1198578 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
|
|
2026-04-02 19:04:47
|
<wikibugs>
|
('CR) ''Scott French: [C:''+2] service: remove image-suggestion [puppet] - ''https://gerrit.wikimedia.org/r/1198578 (https://phabricator.wikimedia.org/T368096) (owner: ''Scott French)'
|
|
2026-04-02 19:06:31
|
<wikibugs>
|
('PS1) ''Cathal Mooney: Management routers: set autonomous system number [homer/public] - ''https://gerrit.wikimedia.org/r/1267170 (https://phabricator.wikimedia.org/T421238)'
|
|
2026-04-02 19:09:11
|
<logmsgbot>
|
!log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on releases2003.codfw.wmnet with reason: T418109
|
|
2026-04-02 19:09:14
|
<stashbot>
|
T418109: Update Jenkins hosts from Java 17 to Java 21 - https://phabricator.wikimedia.org/T418109
|
|
2026-04-02 19:09:30
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784127 (''MoritzMuehlenhoff) You still need to request "wmf" at https://idm.wikimedia.org/permissions/, so far you only r...'
|
|
2026-04-02 19:12:13
|
<wikibugs>
|
('PS1) ''Dzahn: jenkins: add profile::ci::docker to role [puppet] - ''https://gerrit.wikimedia.org/r/1267173 (https://phabricator.wikimedia.org/T418109)'
|
|
2026-04-02 19:16:13
|
<wikibugs>
|
'SRE, ''Datacenter-Switchover: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11784146 (''Scott_French)'
|
|
2026-04-02 19:16:44
|
<wikibugs>
|
('PS1) ''Ottomata: mw-page-html-content-change-enrich - tune backfill in staging release (-next) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267175 (https://phabricator.wikimedia.org/T421216)'
|
|
2026-04-02 19:19:50
|
<wikibugs>
|
('CR) ''Ottomata: [C:''+2] mw-page-html-content-change-enrich - tune backfill in staging release (-next) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267175 (https://phabricator.wikimedia.org/T421216) (owner: ''Ottomata)'
|
|
2026-04-02 19:21:50
|
<wikibugs>
|
('Merged) ''jenkins-bot: mw-page-html-content-change-enrich - tune backfill in staging release (-next) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267175 (https://phabricator.wikimedia.org/T421216) (owner: ''Ottomata)'
|
|
2026-04-02 19:23:41
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784167 (''AWesterinen-WMF) I tried to do that, but see no option for wmf. Only "logstash", "airflow" and "spiderpig".'
|
|
2026-04-02 19:24:12
|
<logmsgbot>
|
!log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
|
|
2026-04-02 19:24:16
|
<logmsgbot>
|
!log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
|
|
2026-04-02 19:33:06
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Data-Engineering: Update production access key for ptiwary - https://phabricator.wikimedia.org/T422189#11784179 (''ssingh) ''Open→''Resolved a:''ssingh Should now be rolled out everywhere, let us know if you have any issues with access.'
|
|
2026-04-02 19:35:49
|
<wikibugs>
|
('PS1) ''Dduvall: zuul: Move cross-profile references to hiera [puppet] - ''https://gerrit.wikimedia.org/r/1267177 (https://phabricator.wikimedia.org/T422207)'
|
|
2026-04-02 19:35:51
|
<wikibugs>
|
('PS1) ''Dduvall: zuul: Fix nodepool zookeeper configuration [puppet] - ''https://gerrit.wikimedia.org/r/1267178 (https://phabricator.wikimedia.org/T422207)'
|
|
2026-04-02 19:45:21
|
<wikibugs>
|
('PS2) ''Dduvall: zuul: Move cross-profile references to hiera [puppet] - ''https://gerrit.wikimedia.org/r/1267177 (https://phabricator.wikimedia.org/T422207)'
|
|
2026-04-02 19:45:21
|
<wikibugs>
|
('PS2) ''Dduvall: zuul: Fix nodepool zookeeper configuration [puppet] - ''https://gerrit.wikimedia.org/r/1267178 (https://phabricator.wikimedia.org/T422207)'
|
|
2026-04-02 19:46:02
|
<wikibugs>
|
('CR) ''Dduvall: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1267177 (https://phabricator.wikimedia.org/T422207) (owner: ''Dduvall)'
|
|
2026-04-02 19:48:46
|
<jinxer-wm>
|
FIRING: Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards Has improved - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
|
|
2026-04-02 19:56:29
|
<logmsgbot>
|
!log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
|
|
2026-04-02 19:56:32
|
<logmsgbot>
|
!log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
|
|
2026-04-02 19:56:48
|
<logmsgbot>
|
!log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
|
|
2026-04-02 19:56:50
|
<logmsgbot>
|
!log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
|
|
2026-04-02 19:57:36
|
<nya_1F616EMO>
|
Is anyone here waiting for the UTC late backport window? And are there any blockers to the window?
|
|
2026-04-02 19:57:46
|
<logmsgbot>
|
!log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
|
|
2026-04-02 19:57:48
|
<logmsgbot>
|
!log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
|
|
2026-04-02 20:00:05
|
<jouncebot>
|
RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T2000)
|
|
2026-04-02 20:00:05
|
<jouncebot>
|
nya_1F616EMO and bwang: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
|
|
2026-04-02 20:00:15
|
<nya_1F616EMO>
|
o/
|
|
2026-04-02 20:00:26
|
<bwang>
|
Im here~!
|
|
2026-04-02 20:00:49
|
<nya_1F616EMO>
|
prays for a deployer to show up
|
|
2026-04-02 20:02:56
|
<logmsgbot>
|
!log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
|
|
2026-04-02 20:03:03
|
<logmsgbot>
|
!log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
|
|
2026-04-02 20:04:18
|
<wikibugs>
|
('PS4) ''Bking: opensearch-semantic-search-test: Add to services proxy [puppet] - ''https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293)'
|
|
2026-04-02 20:05:17
|
<wikibugs>
|
('CR) ''Bking: opensearch-semantic-search-test: Add to services proxy (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: ''Bking)'
|
|
2026-04-02 20:05:40
|
<jinxer-wm>
|
FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-nginx-exporter.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
|
|
2026-04-02 20:07:12
|
<wikibugs>
|
('PS1) ''Ottomata: mw-page-html-content-change-enrich-next - use kafka jumbo external services [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267189 (https://phabricator.wikimedia.org/T421216)'
|
|
2026-04-02 20:07:58
|
<wikibugs>
|
('CR) ''Ottomata: [C:''+2] mw-page-html-content-change-enrich-next - use kafka jumbo external services [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267189 (https://phabricator.wikimedia.org/T421216) (owner: ''Ottomata)'
|
|
2026-04-02 20:08:04
|
<wikibugs>
|
('CR) ''Ottomata: [V:''+2 C:''+2] mw-page-html-content-change-enrich-next - use kafka jumbo external services [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267189 (https://phabricator.wikimedia.org/T421216) (owner: ''Ottomata)'
|
|
2026-04-02 20:08:51
|
<nya_1F616EMO>
|
It seems like we're out of luck?
|
|
2026-04-02 20:09:35
|
<logmsgbot>
|
!log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
|
|
2026-04-02 20:09:44
|
<logmsgbot>
|
!log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
|
|
2026-04-02 20:12:35
|
<wikibugs>
|
'ops-magru: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T419298#11784255 (''phaultfinder)'
|
|
2026-04-02 20:13:27
|
<wikibugs>
|
('PS5) ''Bking: opensearch-cluster: Add support for service mesh [deployment-charts] - ''https://gerrit.wikimedia.org/r/1260795 (https://phabricator.wikimedia.org/T421293)'
|
|
2026-04-02 20:13:27
|
<wikibugs>
|
('CR) ''Bking: "Thanks for the course correction! I think we have a path forward here; we've added envoy TLS termination in 1248865 and monitoring for the" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1260795 (https://phabricator.wikimedia.org/T421293) (owner: ''Bking)'
|
|
2026-04-02 20:13:43
|
<Kemayo>
|
I'd offer to do it, but there was a big breakage of the ability to scap deploy things this morning, so it might be a good idea to have a real deployer present who could recover from an error if it happened.
|
|
2026-04-02 20:13:57
|
<wikibugs>
|
('Abandoned) ''Bking: opensearch-cluster: Add support for service mesh [deployment-charts] - ''https://gerrit.wikimedia.org/r/1260795 (https://phabricator.wikimedia.org/T421293) (owner: ''Bking)'
|
|
2026-04-02 20:15:00
|
<nya_1F616EMO>
|
One of my patches is a time-specific logo update for zhwikinews, and one is a non-time-specific SecurePoll deployment to a private wiki. I may propose to the local community to use CSS for the logo change; do you recommend doing so?
|
|
2026-04-02 20:17:01
|
<Kemayo>
|
Feels inconvenient to deal with, given all the various logo sizes involved.
|
|
2026-04-02 20:17:17
|
<nya_1F616EMO>
|
You mean to deploy?
|
|
2026-04-02 20:17:38
|
<nya_1F616EMO>
|
Currently working on the CSS solution
|
|
2026-04-02 20:17:45
|
<nya_1F616EMO>
|
(cuz there are no deployments on Fridays, as we all know)
|
|
2026-04-02 20:17:52
|
<Kemayo>
|
If you and bwang don't mind, I could certainly kick off a spiderpig build with all your patches. If it breaks in the same way as it did before, it'd just fail to deploy even to testservers rather than ruining production.
|
|
2026-04-02 20:18:08
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T421439#11784278 (''VRiley-WMF) This ticket seems like it relates to another ticket https://phabricator.wikimedia.org/T413559 @Eevans this server is out of warranty, would you like us to replace this disk or leave it...'
|
|
2026-04-02 20:18:34
|
<Kemayo>
|
There's just a *chance* that it'll wedge us into a state where a releng person needs to look at things before any deploys can happen. 😅
|
|
2026-04-02 20:19:12
|
<nya_1F616EMO>
|
I won't let go my SecurePoll patch anyways under this state, it'd be up to you on whether to accept that zhwikinews logo change.
|
|
2026-04-02 20:20:05
|
<Kemayo>
|
I'm fine giving it a shot.
|
|
2026-04-02 20:20:10
|
<Kemayo>
|
bwang: Want yours in as well?
|
|
2026-04-02 20:21:26
|
<nya_1F616EMO>
|
Wait, I found something that might be off
|
|
2026-04-02 20:21:44
|
<nya_1F616EMO>
|
Let me check my patch for resolutions
|
|
2026-04-02 20:22:04
|
<Kemayo>
|
Just let me know when you're happy with it, and if bwang hasn't shown up by then I can do just-yours.
|
|
2026-04-02 20:22:13
|
<nya_1F616EMO>
|
Ah nvm, the script did the job for me
|
|
2026-04-02 20:22:25
|
<nya_1F616EMO>
|
It successfully reduced the resolution to 135x135, nice
|
|
2026-04-02 20:22:33
|
<nya_1F616EMO>
|
so good to go
|
|
2026-04-02 20:22:49
|
<wikibugs>
|
('CR) ''TrainBranchBot: [C:''+2] "Approved by kemayo@deploy1003 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1264569 (https://phabricator.wikimedia.org/T420165) (owner: ''1F616EMO)'
|
|
2026-04-02 20:24:24
|
<wikibugs>
|
('CR) ''Bking: [C:''+2] data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: ''Bking)'
|
|
2026-04-02 20:24:56
|
<wikibugs>
|
('CR) ''Bking: [C:''+2] "Ben is out for the next 10 days, so I'm going to be bold and merge after addressing his concerns." [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: ''Bking)'
|
|
2026-04-02 20:25:02
|
<wikibugs>
|
('CR) ''Bking: [V:''+2 C:''+2] data-platform: Add alerts for opensearch on k8s certificate expiry [alerts] - ''https://gerrit.wikimedia.org/r/1267100 (https://phabricator.wikimedia.org/T418175) (owner: ''Bking)'
|
|
2026-04-02 20:25:19
|
<wikibugs>
|
('Merged) ''jenkins-bot: zhwikinews: 20th anniversary logo change [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1264569 (https://phabricator.wikimedia.org/T420165) (owner: ''1F616EMO)'
|
|
2026-04-02 20:25:37
|
<logmsgbot>
|
!log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1264569|zhwikinews: 20th anniversary logo change (T420165)]]
|
|
2026-04-02 20:25:40
|
<stashbot>
|
T420165: Requesting temporary logo change for zh.wikinews.org - https://phabricator.wikimedia.org/T420165
|
|
2026-04-02 20:28:46
|
<bwang>
|
Sorry I was in a call
|
|
2026-04-02 20:28:52
|
<bwang>
|
I'm still here and able to help test the backport
|
|
2026-04-02 20:29:16
|
<wikibugs>
|
('PS2) ''Clare Ming: Update the Test Kitchen maintenance script to target testwiki [puppet] - ''https://gerrit.wikimedia.org/r/1265525 (https://phabricator.wikimedia.org/T422209)'
|
|
2026-04-02 20:29:22
|
<logmsgbot>
|
!log kemayo@deploy1003 1f616emo, kemayo: Backport for [[gerrit:1264569|zhwikinews: 20th anniversary logo change (T420165)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
|
|
2026-04-02 20:29:40
|
<Kemayo>
|
nya_1F616EMO: Can you verify your change?
|
|
2026-04-02 20:29:44
|
<nya_1F616EMO>
|
testing
|
|
2026-04-02 20:30:44
|
<nya_1F616EMO>
|
it works, tested on vector-2022, vector, monobook, timeless.
|
|
2026-04-02 20:31:03
|
<Kemayo>
|
I will continue the deploy, then.
|
|
2026-04-02 20:31:06
|
<nya_1F616EMO>
|
Thanks
|
|
2026-04-02 20:31:11
|
<logmsgbot>
|
!log kemayo@deploy1003 1f616emo, kemayo: Continuing with sync
|
|
2026-04-02 20:33:09
|
<wikibugs>
|
('CR) ''1F616EMO: "Will re-schedule in the Monday, May 04 UTC morning backport window, together with Ia1a463ba01452b76b73ff6b59b821eae9154ddf8." [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1265959 (https://phabricator.wikimedia.org/T419309) (owner: ''1F616EMO)'
|
|
2026-04-02 20:33:21
|
<wikibugs>
|
('PS1) ''1F616EMO: Revert "zhwikinews: 20th anniversary logo change" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1264575 (https://phabricator.wikimedia.org/T420165)'
|
|
2026-04-02 20:33:35
|
<wikibugs>
|
('CR) ''1F616EMO: "Will re-schedule in the Monday, May 04 UTC morning backport window, together with Iea2390c01600b5f93c7b01f5605d887541c74b02." [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1264575 (https://phabricator.wikimedia.org/T420165) (owner: ''1F616EMO)'
|
|
2026-04-02 20:33:52
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Wikidata Platform Team, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Request: wdqs shell access for user AWesterinen - https://phabricator.wikimedia.org/T422141#11784305 (''MoritzMuehlenhoff) >>! In T422141#11784075, @gmodena wrote: >>>! In T422141#11784042, @MoritzMuehlen...'
|
|
2026-04-02 20:35:37
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784306 (''MoritzMuehlenhoff) >>! In T420053#11784167, @AWesterinen-WMF wrote: > I tried to do that, but see no option for...'
|
|
2026-04-02 20:37:23
|
<logmsgbot>
|
!log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264569|zhwikinews: 20th anniversary logo change (T420165)]] (duration: 11m 46s)
|
|
2026-04-02 20:37:26
|
<stashbot>
|
T420165: Requesting temporary logo change for zh.wikinews.org - https://phabricator.wikimedia.org/T420165
|
|
2026-04-02 20:37:34
|
<icinga-wm>
|
PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 182040496 and 19 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 20:38:32
|
<icinga-wm>
|
RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3815080 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
|
|
2026-04-02 20:39:36
|
<Kemayo>
|
nya_1F616EMO: Okay, should be live now.
|
|
2026-04-02 20:40:01
|
<nya_1F616EMO>
|
Nice and verified the changes through prod.
|
|
2026-04-02 20:40:04
|
<nya_1F616EMO>
|
Thank you for your help
|
|
2026-04-02 20:40:33
|
<wikibugs>
|
('CR) ''Cathal Mooney: "Do we have stats for RE? Is it that much better to eqsin on average than drmrs? From the geography it's not clear to me." [dns] - ''https://gerrit.wikimedia.org/r/1267042 (owner: ''Ayounsi)'
|
|
2026-04-02 20:43:58
|
<Kemayo>
|
nya_1F616EMO: np!
|
|
2026-04-02 20:47:18
|
<bwang>
|
Hi are we still able to back port my patch?
|
|
2026-04-02 20:47:55
|
<Kemayo>
|
bwang: sure, I can get it if you're willing to stick around until it's done.
|
|
2026-04-02 20:48:11
|
<bwang>
|
Yes of course
|
|
2026-04-02 20:48:29
|
<wikibugs>
|
('CR) ''TrainBranchBot: [C:''+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: ''Anne Tomasevich)'
|
|
2026-04-02 20:51:01
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''SRE-swift-storage, ''Data-Persistence, ''DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11784343 (''VRiley-WMF) Hey @elukey Thanks for working on this! Is there anything I can do from my end to assist with this? Let us know...'
|
|
2026-04-02 20:51:48
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops, ''Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11784345 (''VRiley-WMF)'
|
|
2026-04-02 20:51:52
|
<wikibugs>
|
('Merged) ''jenkins-bot: Add logged-in reader retention instrument [extensions/WikimediaEvents] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267008 (https://phabricator.wikimedia.org/T420490) (owner: ''Anne Tomasevich)'
|
|
2026-04-02 20:52:10
|
<logmsgbot>
|
!log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1267008|Add logged-in reader retention instrument (T420490)]]
|
|
2026-04-02 20:52:13
|
<stashbot>
|
T420490: [Logged in reader retention baseline] Launch A/A experiment - https://phabricator.wikimedia.org/T420490
|
|
2026-04-02 20:52:24
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops, ''Infrastructure-Foundations: Q3:rack/setup/install ganeti105[5667] - https://phabricator.wikimedia.org/T418903#11784348 (''VRiley-WMF)'
|
|
2026-04-02 20:53:51
|
<logmsgbot>
|
!log kemayo@deploy1003 annet, kemayo: Backport for [[gerrit:1267008|Add logged-in reader retention instrument (T420490)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
|
|
2026-04-02 20:54:13
|
<Kemayo>
|
bwang: let me know when it's tested
|
|
2026-04-02 20:56:36
|
<bwang>
|
checking now
|
|
2026-04-02 20:57:02
|
<wikibugs>
|
('PS1) ''DLynch: SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267204'
|
|
2026-04-02 20:57:19
|
<wikibugs>
|
('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267204 (owner: ''DLynch)'
|
|
2026-04-02 20:58:54
|
<bwang>
|
Looks good
|
|
2026-04-02 20:59:09
|
<Kemayo>
|
Continuing, then.
|
|
2026-04-02 20:59:12
|
<logmsgbot>
|
!log kemayo@deploy1003 annet, kemayo: Continuing with sync
|
|
2026-04-02 21:00:05
|
<jouncebot>
|
Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260402T2100)
|
|
2026-04-02 21:01:02
|
<wikibugs>
|
'SRE, ''SRE-Access-Requests, ''Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11784373 (''AWesterinen-WMF) Updated my email and requested wmf access. But, I have a further problem. I tried to ssh in...'
|
|
2026-04-02 21:01:16
|
<Jdlrobson>
|
Kemayo: let me know when you are done. I have a deploy but I need 15m to prep
|
|
2026-04-02 21:01:46
|
<Kemayo>
|
Jdlrobson: Sure, I just have one more patch to get out after this, so that should fit into your timing pretty okay.
|
|
2026-04-02 21:03:50
|
<logmsgbot>
|
!log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1267008|Add logged-in reader retention instrument (T420490)]] (duration: 11m 40s)
|
|
2026-04-02 21:03:54
|
<stashbot>
|
T420490: [Logged in reader retention baseline] Launch A/A experiment - https://phabricator.wikimedia.org/T420490
|
|
2026-04-02 21:04:06
|
<Kemayo>
|
bwang: Live now.
|
|
2026-04-02 21:04:16
|
<wikibugs>
|
('CR) ''TrainBranchBot: [C:''+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267204 (owner: ''DLynch)'
|
|
2026-04-02 21:08:09
|
<wikibugs>
|
('PS2) ''Jasmine: role::kubernetes::worker: add sophroid to the lvs pools [puppet] - ''https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748)'
|
|
2026-04-02 21:15:33
|
<wikibugs>
|
('Merged) ''jenkins-bot: SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise [extensions/VisualEditor] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267204 (owner: ''DLynch)'
|
|
2026-04-02 21:15:47
|
<logmsgbot>
|
!log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]]
|
|
2026-04-02 21:17:26
|
<logmsgbot>
|
!log kemayo@deploy1003 kemayo: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
|
|
2026-04-02 21:18:36
|
<logmsgbot>
|
!log kemayo@deploy1003 kemayo: Continuing with sync
|
|
2026-04-02 21:23:09
|
<wikibugs>
|
('PS3) ''Jasmine: role::kubernetes::worker: add sophroid to the lvs pools [puppet] - ''https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748)'
|
|
2026-04-02 21:23:42
|
<wikibugs>
|
('CR) ''CI reject: [V:''-1] role::kubernetes::worker: add sophroid to the lvs pools [puppet] - ''https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748) (owner: ''Jasmine)'
|
|
2026-04-02 21:23:51
|
<wikibugs>
|
('PS4) ''Jasmine: role::kubernetes::worker: add sophroid to the lvs pools [puppet] - ''https://gerrit.wikimedia.org/r/1260765 (https://phabricator.wikimedia.org/T418748)'
|
|
2026-04-02 21:26:03
|
<wikibugs>
|
'SRE, ''DBA, ''Wikimedia-Incident: Database servers in cluster(number) are overloaded - https://phabricator.wikimedia.org/T422130#11784439 (''Od1n) FWIW, I'm still currently encountering this error on frwiki, and it prevents my local custom JS/CSS files from loading. Unexpectedly not loaded: * `Special:Myp...'
|
|
2026-04-02 21:26:25
|
<logmsgbot>
|
!log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]]
|
|
2026-04-02 21:26:41
|
<Kemayo>
|
Jdlrobson: Sorry, the k8s deploy failed, which is making everything *fun*.
|
|
2026-04-02 21:27:13
|
<Jdlrobson>
|
no worries
|
|
2026-04-02 21:27:19
|
<Jdlrobson>
|
im appreciating the extra testing time :)
|
|
2026-04-02 21:28:05
|
<logmsgbot>
|
!log kemayo@deploy1003 kemayo: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
|
|
2026-04-02 21:28:34
|
<logmsgbot>
|
!log kemayo@deploy1003 kemayo: Continuing with sync
|
|
2026-04-02 21:32:44
|
<logmsgbot>
|
!log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1267204|SuggestedLinkEditCheck: fetchSuggestions return a jQuery.Promise]] (duration: 06m 18s)
|
|
2026-04-02 21:32:57
|
<Kemayo>
|
Jdlrobson: okay, all yours!
|
|
2026-04-02 21:35:22
|
<Jdlrobson>
|
thanks!
|
|
2026-04-02 21:35:45
|
<wikibugs>
|
('PS1) ''Jdlrobson: Fix section heading spacing on mobile [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882)'
|
|
2026-04-02 21:36:53
|
<wikibugs>
|
('CR) ''TrainBranchBot: [C:''+2] "Approved by jdlrobson@deploy1003 using scap backport" [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882) (owner: ''Jdlrobson)'
|
|
2026-04-02 21:48:25
|
<wikibugs>
|
('CR) ''CI reject: [V:''-1] Fix section heading spacing on mobile [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882) (owner: ''Jdlrobson)'
|
|
2026-04-02 21:48:31
|
<jinxer-wm>
|
FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
|
|
2026-04-02 21:49:08
|
<wikibugs>
|
('CR) ''TrainBranchBot: [C:''+2] "Approved by jdlrobson@deploy1003 using scap backport" [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882) (owner: ''Jdlrobson)'
|
|
2026-04-02 21:49:14
|
<Jdlrobson>
|
Flakey Wikibase test :(
|
|
2026-04-02 21:50:31
|
<wikibugs>
|
('Merged) ''jenkins-bot: Fix section heading spacing on mobile [skins/MinervaNeue] (wmf/1.46.0-wmf.22) - ''https://gerrit.wikimedia.org/r/1267214 (https://phabricator.wikimedia.org/T414882) (owner: ''Jdlrobson)'
|
|
2026-04-02 21:51:01
|
<wikibugs>
|
('CR) ''SBassett: [C:''+1] Undeploy Extension:StopForumSpam [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1267157 (https://phabricator.wikimedia.org/T422185) (owner: ''Reedy)'
|
|
2026-04-02 21:58:21
|
<logmsgbot>
|
!log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1267214|Fix section heading spacing on mobile (T414882)]]
|
|
2026-04-02 21:58:24
|
<stashbot>
|
T414882: Additional top margin for Parsoid outputs for sections with no lead - https://phabricator.wikimedia.org/T414882
|
|
2026-04-02 22:00:02
|
<logmsgbot>
|
!log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1267214|Fix section heading spacing on mobile (T414882)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
|
|
2026-04-02 22:01:42
|
<logmsgbot>
|
!log jdlrobson@deploy1003 jdlrobson: Continuing with sync
|
|
2026-04-02 22:03:51
|
<jinxer-wm>
|
FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
|
|
2026-04-02 22:05:10
|
<jinxer-wm>
|
FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
|
|
2026-04-02 22:05:39
|
<jinxer-wm>
|
FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
|
|
2026-04-02 22:05:54
|
<logmsgbot>
|
!log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1267214|Fix section heading spacing on mobile (T414882)]] (duration: 07m 33s)
|
|
2026-04-02 22:05:57
|
<stashbot>
|
T414882: Additional top margin for Parsoid outputs for sections with no lead - https://phabricator.wikimedia.org/T414882
|
|
2026-04-02 22:06:51
|
<Jdlrobson>
|
All done.
|
|
2026-04-02 22:08:51
|
<jinxer-wm>
|
FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
|
|
2026-04-02 22:10:38
|
<wikibugs>
|
'SRE, ''ServiceOps new, ''Datacenter-Switchover: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11784520 (''Scott_French) Moving this into #serviceops_new, since we're probably the right team to figure out how this should b...'
|
|
2026-04-02 22:11:34
|
<wikibugs>
|
('PS1) ''Eevans: Use cassandra-dev2001-a (instance) for lambda [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267229 (https://phabricator.wikimedia.org/T414112)'
|
|
2026-04-02 22:17:35
|
<wikibugs>
|
('CR) ''Eevans: [C:''+2] Use cassandra-dev2001-a (instance) for lambda [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267229 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
|
|
2026-04-02 22:19:31
|
<wikibugs>
|
('Merged) ''jenkins-bot: Use cassandra-dev2001-a (instance) for lambda [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267229 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
|
|
2026-04-02 22:20:22
|
<logmsgbot>
|
!log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
|
|
2026-04-02 22:20:36
|
<logmsgbot>
|
!log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
|
|
2026-04-02 22:40:10
|
<jinxer-wm>
|
RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
|
|
2026-04-02 22:40:39
|
<jinxer-wm>
|
FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
|
|
2026-04-02 22:43:51
|
<jinxer-wm>
|
RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
|
|
2026-04-02 22:45:39
|
<jinxer-wm>
|
RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
|
|
2026-04-02 22:59:29
|
<wikibugs>
|
('PS1) ''Eevans: Add egress rule for cassandra-dev2001-a:50051 [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267251 (https://phabricator.wikimedia.org/T414112)'
|
|
2026-04-02 23:02:03
|
<wikibugs>
|
('CR) ''Eevans: [C:''+2] Add egress rule for cassandra-dev2001-a:50051 [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267251 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
|
|
2026-04-02 23:03:31
|
<jinxer-wm>
|
FIRING: [2x] Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
|
|
2026-04-02 23:04:01
|
<wikibugs>
|
('Merged) ''jenkins-bot: Add egress rule for cassandra-dev2001-a:50051 [deployment-charts] - ''https://gerrit.wikimedia.org/r/1267251 (https://phabricator.wikimedia.org/T414112) (owner: ''Eevans)'
|
|
2026-04-02 23:06:01
|
<logmsgbot>
|
!log eevans@deploy1003 helmfile [staging] START helmfile.d/services/linked-artifacts: apply
|
|
2026-04-02 23:06:07
|
<logmsgbot>
|
!log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/linked-artifacts: apply
|
|
2026-04-02 23:28:38
|
<zabe>
|
jouncebot: nowandnext
|
|
2026-04-02 23:28:38
|
<jouncebot>
|
No deployments scheduled for the next 6 hour(s) and 31 minute(s)
|
|
2026-04-02 23:28:38
|
<jouncebot>
|
In 6 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260403T0600)
|
|
2026-04-02 23:34:22
|
<wikibugs>
|
('CR) ''Zabe: [C:''+2] Start reading from new file table in dewiki and fawiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1264110 (https://phabricator.wikimedia.org/T416548) (owner: ''Zabe)'
|
|
2026-04-02 23:35:16
|
<wikibugs>
|
('Merged) ''jenkins-bot: Start reading from new file table in dewiki and fawiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1264110 (https://phabricator.wikimedia.org/T416548) (owner: ''Zabe)'
|
|
2026-04-02 23:35:42
|
<logmsgbot>
|
!log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1264110|Start reading from new file table in dewiki and fawiki (T416548)]]
|
|
2026-04-02 23:35:45
|
<stashbot>
|
T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
|
|
2026-04-02 23:37:19
|
<logmsgbot>
|
!log zabe@deploy1003 zabe: Backport for [[gerrit:1264110|Start reading from new file table in dewiki and fawiki (T416548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
|
|
2026-04-02 23:37:40
|
<logmsgbot>
|
!log zabe@deploy1003 zabe: Continuing with sync
|
|
2026-04-02 23:38:23
|
<wikibugs>
|
'ops-eqiad, ''SRE, ''DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T421439#11784707 (''Eevans) >>! In T421439#11784276, @VRiley-WMF wrote: > This ticket seems like it relates to another ticket https://phabricator.wikimedia.org/T413559 > > @Eevans this server is out of warranty, would...'
|
|
2026-04-02 23:38:31
|
<jinxer-wm>
|
RESOLVED: Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
|
|
2026-04-02 23:39:52
|
<wikibugs>
|
('PS1) ''TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - ''https://gerrit.wikimedia.org/r/1267280'
|
|
2026-04-02 23:39:52
|
<wikibugs>
|
('CR) ''TrainBranchBot: [C:''+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - ''https://gerrit.wikimedia.org/r/1267280 (owner: ''TrainBranchBot)'
|
|
2026-04-02 23:41:52
|
<logmsgbot>
|
!log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1264110|Start reading from new file table in dewiki and fawiki (T416548)]] (duration: 06m 10s)
|
|
2026-04-02 23:41:55
|
<stashbot>
|
T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548
|
|
2026-04-02 23:51:27
|
<wikibugs>
|
('Merged) ''jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - ''https://gerrit.wikimedia.org/r/1267280 (owner: ''TrainBranchBot)'
|
|
2026-04-02 23:51:34
|
<logmsgbot>
|
!log zabe@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
|
|
2026-04-02 23:52:58
|
<logmsgbot>
|
!log zabe@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
|