[00:03:20] going ahead! [00:03:39] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/apertium: apply [00:04:10] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/apertium: apply [00:04:56] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/chart-renderer: apply [00:05:22] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply [00:05:28] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:05:31] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:06:53] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/commons-impact-analytics: apply [00:07:07] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/commons-impact-analytics: apply [00:07:18] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/data-gateway: apply [00:07:25] FIRING: [7x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:07:32] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [00:08:00] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [00:08:16] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [00:08:44] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [00:08:58] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [00:09:05] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/echostore: apply [00:10:10] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/echostore: apply [00:10:17] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [00:10:24] FIRING: [5x] 
PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:10:29] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [00:10:39] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [00:10:53] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [00:11:26] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [00:12:06] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [00:12:13] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [00:12:49] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [00:12:49] (PS2) Clare Ming: Update references to Test Kitchen [mediawiki-config] - https://gerrit.wikimedia.org/r/1218395 (https://phabricator.wikimedia.org/T407906) [00:13:07] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [00:13:39] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [00:13:45] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [00:14:20] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [00:14:46] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [00:15:30] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [00:15:49] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [00:17:05] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [00:17:33] !log rzl@deploy2002 helmfile [codfw] START 
helmfile.d/services/geo-analytics: apply [00:17:47] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply [00:18:02] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/image-suggestion: apply [00:18:23] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: apply [00:18:49] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [00:19:07] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [00:19:42] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [00:19:59] 🤞 [00:21:39] yeah I didn't think so [00:21:51] I'll let the timeout run though [00:26:12] PROBLEM - dump of s7 in eqiad on backupmon1001 is CRITICAL: dump for s7 at eqiad (db1171) taken more than a week ago: Most recent backup 2025-12-09 00:00:03 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:33:50] annnnnd [00:34:30] * swfrench-wmf makes drum roll hand motion [00:36:41] 🤨 [00:37:19] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [00:37:25] there 'tis [00:37:33] same story, moving on and I'll come back around to figure that one out later [00:37:41] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/media-analytics: apply [00:37:53] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply [00:40:07] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1219245 [00:40:07] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1219245 (owner: TrainBranchBot) [00:40:41] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [00:42:37] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [00:43:01] !log rzl@deploy2002 helmfile [codfw] START 
helmfile.d/services/mobileapps: apply [00:43:37] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [00:44:07] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [00:44:11] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [00:44:35] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/page-analytics: apply [00:44:47] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply [00:45:14] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply [00:46:25] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply [00:46:46] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/push-notifications: apply [00:47:18] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [00:48:23] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [00:48:26] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [00:48:35] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/recommendation-api: apply [00:48:54] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: apply [00:49:07] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/sessionstore: apply [00:49:23] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [00:50:04] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [00:50:44] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [00:52:22] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1219245 (owner: TrainBranchBot) [00:52:22] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/termbox: apply [00:53:02] !log 
rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/termbox: apply [00:53:38] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/toolhub: apply [00:54:12] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [00:54:23] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [00:54:42] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [00:54:59] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [00:55:23] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [00:55:33] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [00:56:11] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [00:56:16] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply [00:56:40] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [01:00:31] !log rzl@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [01:00:39] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:01:04] !log rzl@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [01:01:18] !log rzl@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [01:01:48] !log rzl@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [01:03:53] SRE, MW-on-K8s, serviceops: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11471277 (thcipriani) Recapping my understanding: - We deploy a change that changes a large number of files -- either a new version deploy (e.g., T408272#1136991... 
[01:07:25] FIRING: [7x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:10:28] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1219247 [01:10:28] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1219247 (owner: TrainBranchBot) [01:12:25] FIRING: [7x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:24:03] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 23m 23s) [01:33:37] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1219247 (owner: TrainBranchBot) [01:53:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:23:44] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:23:46] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:25:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [02:26:07] (PS1) MusikAnimal: Use CodeMirror instead of CodeEditor for beta feature users + vue mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1219250 (https://phabricator.wikimedia.org/T373711) [02:28:39] FIRING: [8x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [02:33:04] PROBLEM - LDAP -writable server- on serpens is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [02:45:14] (CR) TrainBranchBot: [C:+2] "Approved by musikanimal@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1219250 (https://phabricator.wikimedia.org/T373711) (owner: MusikAnimal) [02:46:03] (Merged) jenkins-bot: Use CodeMirror instead of CodeEditor for beta feature users + vue mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1219250 (https://phabricator.wikimedia.org/T373711) (owner: MusikAnimal) [02:47:01] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1219250|Use CodeMirror instead of CodeEditor for beta feature users + vue mode (T373711)]] [02:47:05] T373711: Add support for Scribunto, JavaScript, CSS, JSON and Vue to CodeMirror 6 - https://phabricator.wikimedia.org/T373711 [02:49:16] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1219250|Use CodeMirror instead of CodeEditor for beta feature users + vue mode (T373711)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[02:50:13] !log musikanimal@deploy2002 musikanimal: Continuing with sync [02:54:15] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1219250|Use CodeMirror instead of CodeEditor for beta feature users + vue mode (T373711)]] (duration: 07m 15s) [02:54:20] T373711: Add support for Scribunto, JavaScript, CSS, JSON and Vue to CodeMirror 6 - https://phabricator.wikimedia.org/T373711 [02:55:05] (PS1) MusikAnimal: codemirror.less: order the gutters [extensions/CodeMirror] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219255 (https://phabricator.wikimedia.org/T412884) [02:55:35] (PS1) MusikAnimal: CodeMirror: disable spellcheck for non-wikitext [extensions/CodeMirror] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219256 (https://phabricator.wikimedia.org/T412848) [02:57:25] FIRING: [7x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:26:49] FIRING: DiskSpace: Disk space serpens:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [03:30:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [03:37:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:41:28] SRE, 
Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11471476 (Krinkle) Open→Resolved [03:42:25] FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:45:24] FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:52:06] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - lswtest-d8-eqiad:ethernet-1/56 (Core: ssw1-d1-eqiad:ethernet-1/17 {#temp1848392398}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [04:05:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:11:33] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:21:33] RESOLVED: 
MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:47:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:55:48] FIRING: PuppetFailure: Puppet has failed on serpens:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:00:46] (PS1) MusikAnimal: extension.json: make activeLine on by default for non-wikitext [extensions/CodeMirror] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219269 (https://phabricator.wikimedia.org/T412886) [05:07:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:09:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:11:55] (PS1) MusikAnimal: CodeMirrorJavaScript: better descriptions for ESLint suggestions [extensions/CodeMirror] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219270 [05:22:46] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:23:27] (CR) TrainBranchBot: [C:+2] 
"Approved by musikanimal@deploy2002 using scap backport" [extensions/CodeMirror] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219255 (https://phabricator.wikimedia.org/T412884) (owner: MusikAnimal) [05:23:28] (CR) TrainBranchBot: [C:+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CodeMirror] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219256 (https://phabricator.wikimedia.org/T412848) (owner: MusikAnimal) [05:23:28] (CR) TrainBranchBot: [C:+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CodeMirror] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219269 (https://phabricator.wikimedia.org/T412886) (owner: MusikAnimal) [05:23:29] (CR) TrainBranchBot: [C:+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CodeMirror] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219270 (owner: MusikAnimal) [05:23:46] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:24:46] (Merged) jenkins-bot: codemirror.less: order the gutters [extensions/CodeMirror] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219255 (https://phabricator.wikimedia.org/T412884) (owner: MusikAnimal) [05:25:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [05:28:39] FIRING: [8x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:32:29] (Merged) jenkins-bot: CodeMirror: disable spellcheck for non-wikitext 
[extensions/CodeMirror] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219256 (https://phabricator.wikimedia.org/T412848) (owner: MusikAnimal) [05:32:30] (Merged) jenkins-bot: extension.json: make activeLine on by default for non-wikitext [extensions/CodeMirror] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219269 (https://phabricator.wikimedia.org/T412886) (owner: MusikAnimal) [05:32:31] (Merged) jenkins-bot: CodeMirrorJavaScript: better descriptions for ESLint suggestions [extensions/CodeMirror] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219270 (owner: MusikAnimal) [05:33:07] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1219255|codemirror.less: order the gutters (T412884)]], [[gerrit:1219256|CodeMirror: disable spellcheck for non-wikitext (T412848)]], [[gerrit:1219269|extension.json: make activeLine on by default for non-wikitext (T412886)]], [[gerrit:1219270|CodeMirrorJavaScript: better descriptions for ESLint suggestions]] [05:33:15] T412884: Move linter icons to the left of line numbers - https://phabricator.wikimedia.org/T412884 [05:33:16] T412848: Disable spellcheck in non-wikitext - https://phabricator.wikimedia.org/T412848 [05:33:16] T412886: Enable active line feature by default in non-wikitext - https://phabricator.wikimedia.org/T412886 [05:33:36] SRE, SRE-Access-Requests, Patch-For-Review: Add yubikey SSH key for 'denisse' - https://phabricator.wikimedia.org/T413006#11471604 (Marostegui) p:Triage→Medium @andrea.denisse I assume you'd handle this yourself or you'd need help from clinic duty? 
[05:34:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:35:21] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1219255|codemirror.less: order the gutters (T412884)]], [[gerrit:1219256|CodeMirror: disable spellcheck for non-wikitext (T412848)]], [[gerrit:1219269|extension.json: make activeLine on by default for non-wikitext (T412886)]], [[gerrit:1219270|CodeMirrorJavaScript: better descriptions for ESLint suggestions]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [05:35:37] SRE, SRE-Access-Requests, Patch-For-Review: Add FIDO-backed SSH key for aklapper - https://phabricator.wikimedia.org/T413009#11471608 (Marostegui) p:Triage→Medium I am happy to verify this out of band, I've tried to talk to you in irc to arrange it, but you aren't online. Please ping me when... [05:36:49] SRE, SRE-Access-Requests: Add FIDO ssh key(s) for ariel - https://phabricator.wikimedia.org/T413019#11471618 (Marostegui) p:Triage→Medium Do you need help from clinic duty? 
I saw it was already merged by @SLyngshede-WMF [05:37:19] SRE, SRE-Access-Requests: Yubikey-SSH-FIDO for cdobbins - https://phabricator.wikimedia.org/T412755#11471621 (Marostegui) Open→Stalled [05:37:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:37:26] SRE, SRE-Access-Requests: Yubikey-SSH-FIDO for cdobbins - https://phabricator.wikimedia.org/T412755#11471622 (Marostegui) Stalling til the access is verified [05:41:06] !log musikanimal@deploy2002 musikanimal: Continuing with sync [05:42:25] FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:45:11] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1219255|codemirror.less: order the gutters (T412884)]], [[gerrit:1219256|CodeMirror: disable spellcheck for non-wikitext (T412848)]], [[gerrit:1219269|extension.json: make activeLine on by default for non-wikitext (T412886)]], [[gerrit:1219270|CodeMirrorJavaScript: better descriptions for ESLint suggestions]] (duration: 12m 04s) [05:45:20] T412884: Move linter icons to the left of line numbers - https://phabricator.wikimedia.org/T412884 [05:45:20] T412848: Disable spellcheck in non-wikitext - https://phabricator.wikimedia.org/T412848 [05:45:21] T412886: Enable active line feature by default in non-wikitext - https://phabricator.wikimedia.org/T412886 [05:46:11] ops-codfw, SRE, DC-Ops: Power Supply Redundancy alert on db2247 - https://phabricator.wikimedia.org/T412935#11471629 (Marostegui) Thank you! 
[05:52:27] SRE, SRE-Access-Requests: Add FIDO ssh key(s) for ariel - https://phabricator.wikimedia.org/T413019#11471630 (Marostegui) a:ArielGlenn [05:52:58] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11471631 (Marostegui) This still requires manager approvals. [05:53:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:02:05] (CR) Phuedx: [C:+1] EventStreamConfig: enrich stream with more headers [mediawiki-config] - https://gerrit.wikimedia.org/r/1219234 (https://phabricator.wikimedia.org/T396562) (owner: Bearloga) [06:12:34] (PS1) Marostegui: control-mariadb-client-10.6-bullseye: Remove from repo [software] - https://gerrit.wikimedia.org/r/1219272 [06:14:06] (CR) Marostegui: [C:+2] control-mariadb-client-10.6-bullseye: Remove from repo [software] - https://gerrit.wikimedia.org/r/1219272 (owner: Marostegui) [06:14:33] (Merged) jenkins-bot: control-mariadb-client-10.6-bullseye: Remove from repo [software] - https://gerrit.wikimedia.org/r/1219272 (owner: Marostegui) [06:21:59] SRE, SRE-Access-Requests: Add FIDO ssh key(s) for ariel - https://phabricator.wikimedia.org/T413019#11471654 (Marostegui) >>! In T413019#11471618, @Marostegui wrote: > Do you need help from clinic duty? I saw it was already merged by @SLyngshede-WMF Sorry about this ^ it was merged by @ArielGlenn I was... 
[06:24:04] SRE, SRE-Access-Requests: Add FIDO ssh key(s) for ariel - https://phabricator.wikimedia.org/T413019#11471656 (ArielGlenn) [06:27:25] FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:30:54] SRE, SRE-Access-Requests, Patch-For-Review: Add yubikey SSH key for 'denisse' - https://phabricator.wikimedia.org/T413006#11471664 (Marostegui) [06:32:25] FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251218T0700) [07:00:05] marostegui, Amir1, and federico3: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251218T0700). 
[07:14:00] FIRING: [2x] ProbeDown: Service idm2001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:15:07] (PS1) ArielGlenn: Add the second yubikey FIDO-compliant ssh key for ariel [puppet] - https://gerrit.wikimedia.org/r/1219498 (https://phabricator.wikimedia.org/T413019) [07:15:49] (CR) CI reject: [V:-1] Add the second yubikey FIDO-compliant ssh key for ariel [puppet] - https://gerrit.wikimedia.org/r/1219498 (https://phabricator.wikimedia.org/T413019) (owner: ArielGlenn) [07:18:42] (PS2) ArielGlenn: Add the second yubikey FIDO-compliant ssh key for ariel [puppet] - https://gerrit.wikimedia.org/r/1219498 (https://phabricator.wikimedia.org/T413019) [07:26:49] FIRING: DiskSpace: Disk space serpens:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [07:39:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2019:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:45:24] FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:52:06] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - lswtest-d8-eqiad:ethernet-1/56 (Core: ssw1-d1-eqiad:ethernet-1/17 {#temp1848392398}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:55:04] RECOVERY - LDAP -writable server- on serpens is OK: LDAP OK - 0.096 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [07:55:46] !log bounced slapd on serpens after cleaning up a failed logrotate [07:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:16] (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1219498 (https://phabricator.wikimedia.org/T413019) (owner: ArielGlenn) [07:56:35] RESOLVED: DiskSpace: Disk space serpens:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [07:57:25] FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:59:00] RESOLVED: [2x] ProbeDown: Service idm2001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251218T0800). [08:00:05] Volker_E: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:00:24] (03CR) 10Elukey: "Hi folks! Please do include somebody from I/F the next time that you add something like this to cumin nodes, so we are aware of these use " [puppet] - 10https://gerrit.wikimedia.org/r/1216763 (https://phabricator.wikimedia.org/T327663) (owner: 10Matthieulec) [08:02:25] FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:05:48] RESOLVED: PuppetFailure: Puppet has failed on serpens:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:07:25] FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:16:45] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7837/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975) (owner: 10Dzahn) [08:17:25] FIRING: [5x] SystemdUnitFailed: logrotate.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:18:04] (03CR) 10Jelto: [V:03+1] "one comment 
in-line" [puppet] - 10https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975) (owner: 10Dzahn) [08:18:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:20:01] FIRING: [2x] ProbeDown: Service wdqs2019:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:21:43] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11471758 (10elukey) ` root@ms-fe2009:~# swift stat docker_registry_codfw Account: AUTH_docker... [08:22:34] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11471761 (10MatthewVernon) I understand why #sre-swift-storage got tagged, but: replication between eqiad an... 
[08:23:51] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11471763 (10MatthewVernon) p:05Triage→03High [08:38:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:48:06] (03PS1) 10STran: Revert^2 "Enable v2 non-emergency workflow by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219533 [08:48:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219533 (owner: 10STran) [08:56:19] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwikisource --logwiki=metawiki 'Anurag Bhattamishra' 'Renamed user d198c4f693b15534f61d97349d9d7d8e' # T413036 [08:56:24] T413036: Unblock stuck global rename of Renamed user d198c4f693b15534f61d97349d9d7d8e - https://phabricator.wikimedia.org/T413036 [08:57:04] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11471809 (10MoritzMuehlenhoff) >>! In T412807#11470167, @cmooney wrote: > So maybe we are on the right track. Of course how to properly pass the... 
[09:00:04] dancy and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251218T0900) [09:03:09] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11471835 (10MatthewVernon) OK, the above turns out not to be true, it's just what I thought was true for my... [09:22:18] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11471929 (10ABran-WMF) I found a [[ https://netbox.wikimedia.org/ipam/ip-addresses/6659/ | couple ]] of [[ https://netbox.wikimedia.or... [09:28:39] FIRING: [4x] CoreBGPDown: Core BGP session down between lswtest-d8-eqiad and ssw1-d1-eqiad (10.64.128.17) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:36:43] !log restart swift-container-sync on ms-be2081 T413008 [09:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:47] T413008: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008 [09:40:25] (03PS1) 10Alexandros Kosiaris: kube-state-metrics: Remove limits/requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219536 [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:53:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:04:19] (03PS1) 10Sergio Gimeno: GrowthExperiments: cleanup unnecessary GEUseMetricsPlatformExtension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219541 (https://phabricator.wikimedia.org/T411479) [10:06:02] (03CR) 10Elukey: [C:03+1] kube-state-metrics: Remove limits/requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219536 (owner: 10Alexandros Kosiaris) [10:11:01] (03CR) 10Sergio Gimeno: [C:04-1] "The dependency needs to rollout first, wait until first train of the year." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219541 (https://phabricator.wikimedia.org/T411479) (owner: 10Sergio Gimeno) [10:12:20] (03CR) 10Cathal Mooney: [C:03+1] hiera: enable video tos on cp7009 [puppet] - 10https://gerrit.wikimedia.org/r/1219185 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur) [10:13:33] (03CR) 10Fabfur: [C:03+2] hiera: enable video tos on cp7009 [puppet] - 10https://gerrit.wikimedia.org/r/1219185 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur) [10:15:48] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11472099 (10MatthewVernon) At least one of the problems is that the container is damaged - there are objects... 
[10:16:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturati [10:17:31] codfw-eqsin link [10:17:31] !incidents [10:17:32] 7210 (UNACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [10:17:36] !ack 7210 [10:17:37] 7210 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [10:18:32] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11472114 (10MatthewVernon) And we can see on that server that the sync is going nowhere... ` background.log:... [10:19:35] 10SRE-swift-storage, 10Ceph, 06serviceops, 06Release-Engineering-Team (Radar): Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11472116 (10MatthewVernon) Whatever we do, it should not involve trying to get swift to sync betw... [10:21:13] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11472133 (10MatthewVernon) And the summary: ` Dec 18 08:51:02 ms-be2081 container-sync: Since Thu Dec 18 07:... 
[10:24:01] (03PS1) 10Btullis: Record the fact that tchanders now has kerberos access [puppet] - 10https://gerrit.wikimedia.org/r/1219544 (https://phabricator.wikimedia.org/T411860) [10:24:56] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219186 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur) [10:26:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:2 (Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSatura [10:29:12] (03CR) 10JMeybohm: "I would argue that we should at least keep requests around to allow for proper scheduling decisions" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219536 (owner: 10Alexandros Kosiaris) [10:47:32] (03PS2) 10Cparle: EditWatchlistPaginate feature flag has been removed from MW code, so remove it from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214584 (https://phabricator.wikimedia.org/T410908) [10:48:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214584 (https://phabricator.wikimedia.org/T410908) (owner: 10Cparle) [10:52:31] (03PS1) 10Dpogorzelski: ml: add ml specific config [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1219552 [10:55:17] (03PS1) 10Dpogorzelski: ml-builder: clone production images [puppet] - 10https://gerrit.wikimedia.org/r/1219553 [10:58:24] (03CR) 10Fabfur: [C:03+2] hiera: enable video tos on upload@magru [puppet] - 10https://gerrit.wikimedia.org/r/1219186 (https://phabricator.wikimedia.org/T412785) 
(owner: 10Fabfur) [10:59:54] (03CR) 10Cathal Mooney: [C:03+1] hiera: enable video tos on upload@magru [puppet] - 10https://gerrit.wikimedia.org/r/1219186 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251218T1100) [11:00:07] (03PS2) 10Alexandros Kosiaris: kube-state-metrics: Remove limits/requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219536 [11:03:36] (03CR) 10Clément Goubert: [C:03+2] mediawiki::periodic_job: Add mesh_check_skip [puppet] - 10https://gerrit.wikimedia.org/r/1219161 (https://phabricator.wikimedia.org/T412818) (owner: 10Clément Goubert) [11:03:44] (03CR) 10Clément Goubert: [C:03+2] campaignevents: Skip mesh check in aggregateanswers [puppet] - 10https://gerrit.wikimedia.org/r/1219162 (https://phabricator.wikimedia.org/T412818) (owner: 10Clément Goubert) [11:04:02] (03CR) 10Clément Goubert: [C:03+2] campaignevents: Skip mesh check in aggregateanswers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1219162 (https://phabricator.wikimedia.org/T412818) (owner: 10Clément Goubert) [11:06:33] (03CR) 10Clément Goubert: [C:03+1] smokepy: send http requests in parallel [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219188 (owner: 10Daniel Kinzler) [11:07:08] (03CR) 10Alexandros Kosiaris: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219536 (owner: 10Alexandros Kosiaris) [11:21:36] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply [11:22:00] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [11:24:40] (03CR) 10Marostegui: [C:03+1] "Verified out of band. @mmuhlenhoff@wikimedia.org do you want to double check this or can I merge and deploy?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1219213 (https://phabricator.wikimedia.org/T413009) (owner: 10Aklapper) [11:26:02] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Add FIDO-backed SSH key for aklapper - https://phabricator.wikimedia.org/T413009#11472325 (10Marostegui) Verified out of band - discussion on-going on the patch [11:29:58] (03PS1) 10D3r1ck01: Rest: Add more debug logging for `Resource::getProfile()` [extensions/OAuth] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1219558 (https://phabricator.wikimedia.org/T409901) [11:30:19] (03PS1) 10D3r1ck01: Rest: Add more debug logging for `Resource::getProfile()` [extensions/OAuth] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219559 (https://phabricator.wikimedia.org/T409901) [11:31:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/OAuth] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1219558 (https://phabricator.wikimedia.org/T409901) (owner: 10D3r1ck01) [11:31:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/OAuth] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219559 (https://phabricator.wikimedia.org/T409901) (owner: 10D3r1ck01) [11:31:55] (03CR) 10JMeybohm: [C:03+1] kube-state-metrics: Remove limits/requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219536 (owner: 10Alexandros Kosiaris) [11:38:41] (03CR) 10Mszwarc: [C:03+1] Revert^2 "Enable v2 non-emergency workflow by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219533 (owner: 10STran) [11:39:04] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11472383 (10cmooney) >>! 
In T412807#11471809, @MoritzMuehlenhoff wrote: > Your hunch was spot-on! I did a little digging in the docs and found thi... [11:39:17] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2019:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:45:24] FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:47:11] (03PS1) 10Cathal Mooney: autoinstall: set timeout for network link detection for uefi mode [puppet] - 10https://gerrit.wikimedia.org/r/1219562 (https://phabricator.wikimedia.org/T412807) [11:48:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:52:06] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - lswtest-d8-eqiad:ethernet-1/56 (Core: ssw1-d1-eqiad:ethernet-1/17 {#temp1848392398}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:53:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate 
[11:55:36] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, things are working as expected in mgaru. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1219187 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur) [11:57:27] (03PS2) 10Dpogorzelski: ml: add ml specific config Adding docker-pkg config specific to the ML namespace instead of using a spearate repo since we have dependencies here that we rely on. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1219552 (https://phabricator.wikimedia.org/T394778) [11:57:49] (03CR) 10Fabfur: [C:03+2] hiera: enable video tos on cache upload [puppet] - 10https://gerrit.wikimedia.org/r/1219187 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur) [11:59:58] (03CR) 10Clément Goubert: [C:03+1] kube-state-metrics: Remove limits/requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219536 (owner: 10Alexandros Kosiaris) [12:01:59] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11472466 (10elukey) ` elukey@ms-be2081:~$ sudo journalctl -u swift-container-sync.service| egrep "\.db " | e... 
[12:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:17:40] FIRING: [2x] SystemdUnitFailed: logrotate.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:20:16] FIRING: [2x] ProbeDown: Service wdqs2019:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:20:24] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1219562 (https://phabricator.wikimedia.org/T412807) (owner: 10Cathal Mooney) [12:20:45] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1219544 (https://phabricator.wikimedia.org/T411860) (owner: 10Btullis) [12:21:42] (03CR) 10Cathal Mooney: [C:03+2] autoinstall: set timeout for network link detection for uefi mode [puppet] - 10https://gerrit.wikimedia.org/r/1219562 (https://phabricator.wikimedia.org/T412807) (owner: 10Cathal Mooney) [12:29:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:34:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:36:38] (03CR) 10Kamila Součková: [C:03+2] "Sorry, will do '^^" [puppet] - 10https://gerrit.wikimedia.org/r/1216763 (https://phabricator.wikimedia.org/T327663) (owner: 10Matthieulec) [12:41:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:41:23] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [12:41:32] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11472502 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host es2028.codfw.wmnet with OS trixie [12:41:46] (03PS3) 10Dpogorzelski: ml: add ml specific config [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1219552 (https://phabricator.wikimedia.org/T394778) [12:48:08] (03CR) 10Muehlenhoff: [C:03+2] Remove access for joelyrookewmde [puppet] - 10https://gerrit.wikimedia.org/r/1218080 (https://phabricator.wikimedia.org/T412508) (owner: 10Muehlenhoff) [12:51:04] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11472551 (10MoritzMuehlenhoff) [12:51:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:53:53] !log root@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Joely Rooke WMDE out of all services on: 2435 hosts [12:55:29] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, 
configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11472568 (10cmooney) Somewhat amazed but that seems to have worked! I guess I won't have to write a new firmware for this chipset after all. Loo... [12:57:20] (03PS1) 10Sergio Gimeno: UserImpact: stop using pre-computed impact in the user impact job [extensions/GrowthExperiments] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219582 (https://phabricator.wikimedia.org/T398500) [12:57:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219582 (https://phabricator.wikimedia.org/T398500) (owner: 10Sergio Gimeno) [12:57:40] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11472574 (10Marostegui) Do you have some logs about partman? @MoritzMuehlenhoff worked on the uefi partman it has been working fine, so I wonder i... 
[12:58:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:59:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251218T1300) [13:00:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:00:22] !incidents [13:00:22] 7211 (UNACKED) TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [13:00:22] 7210 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr3-eqsin:xe-0/1/0 (Arelion, IC-331929 200ms EVPN) {#11991_12273-3} xe-1/0/1:2 gnmi codfw) [13:00:29] !ack 7211 [13:00:29] 7211 (ACKED) TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [13:03:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:06:14] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219584 [13:09:39] (03CR) 10Dbrant: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219584 (owner: 10PipelineBot) [13:09:46] (03Abandoned) 10Dbrant: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218830 (owner: 10PipelineBot) [13:10:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:11:32] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219584 (owner: 10PipelineBot) [13:12:34] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11472643 (10cmooney) >>! In T412807#11472574, @Marostegui wrote: > Do you have some logs about partman? @MoritzMuehlenhoff worked on the uefi part... [13:12:42] !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:12:54] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11472651 (10MatthewVernon) There are false-positives in that list (e.g. the last one is a good object, but t... 
[13:13:01] !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [13:13:57] !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [13:14:24] !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [13:14:32] !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [13:14:56] !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [13:16:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:17:00] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Offboarding for joelyrookewmde - https://phabricator.wikimedia.org/T412508#11472669 (10MoritzMuehlenhoff) Access to Wikimedia production has been removed and the NDA-relevant LDAP/Phab groups have been removed. I've also reached out to have your Phabri... [13:17:11] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11472671 (10MatthewVernon) Earlier, we tested an approach (used before with ghost swift objects cf T327253)... 
[13:19:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [13:21:45] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:21:49] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11472687 (10MoritzMuehlenhoff) This seems to use the wrong recipe? AFAIK es2028 is being installed with UEFI, but it has the BIOS variant configur... [13:22:27] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11472688 (10MatthewVernon) A few other notes (beyond "we should stop using swift_container_sync already"): L... [13:27:06] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11472712 (10Marostegui) Damn, I was convinced I wrote uefi on the patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1218652/1/modules/pro... 
[13:28:54] FIRING: [4x] CoreBGPDown: Core BGP session down between lswtest-d8-eqiad and ssw1-d1-eqiad (10.64.128.17) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:33:13] (03PS1) 10Cathal Mooney: Debian intaller: set netcfg link_wait_timeout to 10 seconds [puppet] - 10https://gerrit.wikimedia.org/r/1219586 (https://phabricator.wikimedia.org/T412807) [13:33:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:33:52] (03PS1) 10C. Scott Ananian: Turn on Parsoid Read Views on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219587 [13:33:52] (03PS1) 10C. Scott Ananian: Turn on Parsoid Read Views on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219588 [13:34:07] (03PS1) 10Marostegui: installserver: Add efi recipe to es2028 [puppet] - 10https://gerrit.wikimedia.org/r/1219589 (https://phabricator.wikimedia.org/T412807) [13:36:19] cmooney@cumin1003 reimage (PID 2452715) is awaiting input [13:38:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:38:22] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1219589 (https://phabricator.wikimedia.org/T412807) (owner: 10Marostegui) [13:39:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 18 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219587 (owner: 10C. Scott Ananian) [13:39:40] (03CR) 10Marostegui: [C:03+2] installserver: Add efi recipe to es2028 [puppet] - 10https://gerrit.wikimedia.org/r/1219589 (https://phabricator.wikimedia.org/T412807) (owner: 10Marostegui) [13:40:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219588 (owner: 10C. Scott Ananian) [13:43:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:43:49] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2028.codfw.wmnet with OS trixie [13:44:08] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11472810 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host es2028.codfw.wmne... [13:48:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:49:27] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11472837 (10ayounsi) >>! 
In T286066#11471929, @ABran-WMF wrote: > I found a [[ https://netbox.wikimedia.org/ipam/ip-addresses/6659/ |... [13:52:47] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219590 [13:53:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:55:29] (03CR) 10Dbrant: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219590 (owner: 10PipelineBot) [13:57:57] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219590 (owner: 10PipelineBot) [13:58:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:59:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:59:49] 10ops-eqiad, 06DC-Ops: Alert for device ps1-f4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T413083 (10phaultfinder) 03NEW [13:59:57] !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251218T1400) [14:00:05] Tran, cormacparle, xSavitar, Sergi0, and cscott: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:13] I can’t deploy today, in a meeting [14:00:21] o/ I can deploy my own patch [14:00:24] !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:00:30] o/ [14:00:31] o/ [14:00:35] !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:00:37] I can self-deploy my patches as well. o/ [14:00:45] I can also self-deploy [14:01:20] !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:01:23] Tran, I guess you can begin since you're first in line :) [14:01:24] In order, then? Unless someone's needs to go in sooner rather than later [14:01:33] in order is fine with me [14:01:34] !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:01:41] same here [14:01:44] 👌 [14:01:49] 👍 I'll get started then. I need a bit of time to test as well. [14:01:55] not sure if I still have deploy rights, haven't done this in a long time [14:02:07] mine is just a no-op clean up thing [14:02:25] !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:02:33] cormacparle, I can help deploy yours [14:02:44] 👍 [14:02:46] Will you be around to test?
:) [14:03:11] I will, but the only test is "Special:EditWatchlist is not broken" [14:03:24] but yeah I'm here [14:03:37] cormacparle, also, re deployment, these days we use: https://wikitech.wikimedia.org/wiki/Scap/SpiderPig [14:03:40] PROBLEM - SSH on bast6003 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:03:51] cool [14:03:58] (03CR) 10Marostegui: [C:03+1] "Talked to Moritz, fine to merge" [puppet] - 10https://gerrit.wikimedia.org/r/1219213 (https://phabricator.wikimedia.org/T413009) (owner: 10Aklapper) [14:03:59] And it should be easy once have the necessary rights and added to the appropriate groups [14:04:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2019:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:04:03] (03CR) 10Marostegui: [C:03+2] admin: add fido backed ssh key for aklapper [puppet] - 10https://gerrit.wikimedia.org/r/1219213 (https://phabricator.wikimedia.org/T413009) (owner: 10Aklapper) [14:04:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219533 (owner: 10STran) [14:04:40] RECOVERY - SSH on bast6003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:04:59] (03Merged) 10jenkins-bot: Revert^2 "Enable v2 non-emergency workflow by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219533 (owner: 10STran) [14:05:01] RESOLVED: [2x] ProbeDown: Service wdqs2019:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2019:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:05:31] > And it should be easy once have the necessary rights and added to the appropriate groups [14:05:31] yeah I'm just not sure I still have the rights - I'll find out when Tran is finished i guess! [14:05:51] !log stran@deploy2002 Started scap sync-world: Backport for [[gerrit:1219533|Revert^2 "Enable v2 non-emergency workflow by default"]] [14:06:04] o/ [14:06:05] (03PS3) 10D3r1ck01: EditWatchlistPaginate feature flag has been removed from MW code, so remove it from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214584 (https://phabricator.wikimedia.org/T410908) (owner: 10Cparle) [14:06:21] i can also self-deploy [14:08:08] !log stran@deploy2002 stran: Backport for [[gerrit:1219533|Revert^2 "Enable v2 non-emergency workflow by default"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:08:11] cormacparle, per https://www.mediawiki.org/wiki/MediaWiki_1.46/wmf.7, it looks like we will have to wait 1 more week? [14:08:19] group2 wikis are still on wmf.5 [14:08:27] Which doesn't yet have the change [14:08:33] The core change I mean [14:08:55] ah! [14:09:01] good catch, you are right! [14:09:02] It would have been fine but the train didn't ride in one of the past weeks, there is no wmf.6 [14:09:22] So we might have to hold one for 1 more week. Next week Monday or Tuesday should be fine I suppose [14:09:45] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11472929 (10elukey) >>! In T413008#11472688, @MatthewVernon wrote: > I think the pragmatic next step is to d... 
[14:09:45] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:09:48] After the holidays is fine, it's just removing an unused flag from config [14:09:50] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1219586 (https://phabricator.wikimedia.org/T412807) (owner: 10Cathal Mooney) [14:10:05] thank you xSavitar ! I'll remove my patch from the list and go and have lunch [14:10:06] cormacparle, Ack! [14:10:10] Thank you [14:10:35] !log stran@deploy2002 stran: Continuing with sync [14:10:45] (03PS2) 10C. Scott Ananian: Turn on Parsoid Read Views on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219587 (https://phabricator.wikimedia.org/T413084) [14:10:47] (03PS2) 10C. Scott Ananian: Turn on Parsoid Read Views on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219588 (https://phabricator.wikimedia.org/T413084) [14:11:59] (03CR) 10Isabelle Hurbain-Palatin: [C:03+1] Turn on Parsoid Read Views on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219587 (https://phabricator.wikimedia.org/T413084) (owner: 10C. Scott Ananian) [14:12:32] (03CR) 10Isabelle Hurbain-Palatin: [C:03+1] Turn on Parsoid Read Views on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219588 (https://phabricator.wikimedia.org/T413084) (owner: 10C. Scott Ananian) [14:13:06] (03CR) 10D3r1ck01: "Copy/paste of IRC conversation why this wasn't deployed today, for reference." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214584 (https://phabricator.wikimedia.org/T410908) (owner: 10Cparle) [14:13:08] (03CR) 10Cparle: "unscheduled for deployment because the patch removing the flag from MW hasn't reached Group 2 yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214584 (https://phabricator.wikimedia.org/T410908) (owner: 10Cparle) [14:13:26] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:13:42] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 26.38 ms [14:14:41] !log stran@deploy2002 Finished scap sync-world: Backport for [[gerrit:1219533|Revert^2 "Enable v2 non-emergency workflow by default"]] (duration: 08m 50s) [14:15:13] xSavitar I'm done [14:15:28] Tran, thanks! I'll deploy now [14:15:36] sergi0, will poke you once I'm done. [14:15:51] (03CR) 10Cathal Mooney: [C:03+2] Debian intaller: set netcfg link_wait_timeout to 10 seconds [puppet] - 10https://gerrit.wikimedia.org/r/1219586 (https://phabricator.wikimedia.org/T412807) (owner: 10Cathal Mooney) [14:16:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [extensions/OAuth] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1219558 (https://phabricator.wikimedia.org/T409901) (owner: 10D3r1ck01) [14:16:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [extensions/OAuth] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219559 (https://phabricator.wikimedia.org/T409901) (owner: 10D3r1ck01) [14:16:50] perfect, ty [14:17:59] (03Merged) 10jenkins-bot: Rest: Add more debug logging for `Resource::getProfile()` [extensions/OAuth] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1219558 (https://phabricator.wikimedia.org/T409901) (owner: 10D3r1ck01) [14:18:03] (03Merged) 10jenkins-bot: Rest: Add more debug logging for `Resource::getProfile()` [extensions/OAuth] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219559 
(https://phabricator.wikimedia.org/T409901) (owner: 10D3r1ck01) [14:18:37] !log derick@deploy2002 Started scap sync-world: Backport for [[gerrit:1219558|Rest: Add more debug logging for `Resource::getProfile()` (T409901)]], [[gerrit:1219559|Rest: Add more debug logging for `Resource::getProfile()` (T409901)]] [14:18:41] T409901: TypeError: array_keys(): Argument #1 ($array) must be of type array, null given by $resourceServer->getScopes() - https://phabricator.wikimedia.org/T409901 [14:20:06] 10ops-codfw, 10SRE-swift-storage, 06DC-Ops: Q#:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088 (10RobH) 03NEW [14:20:28] 10ops-codfw, 10SRE-swift-storage, 06DC-Ops: Q#:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11473029 (10RobH) [14:20:37] 10ops-codfw, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be209[56] - https://phabricator.wikimedia.org/T413088#11473030 (10RobH) [14:20:49] !log derick@deploy2002 d3r1ck01, derick: Backport for [[gerrit:1219558|Rest: Add more debug logging for `Resource::getProfile()` (T409901)]], [[gerrit:1219559|Rest: Add more debug logging for `Resource::getProfile()` (T409901)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:21:17] 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089 (10RobH) 03NEW [14:21:29] nothing to test, will monitor Logstash after deploying [14:21:32] !log derick@deploy2002 d3r1ck01, derick: Continuing with sync [14:21:38] 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops: FY2526 Q3:rack/setup/install ms-be109[67] - https://phabricator.wikimedia.org/T413089#11473052 (10RobH) [14:25:31] !log derick@deploy2002 Finished scap sync-world: Backport for [[gerrit:1219558|Rest: Add more debug logging for `Resource::getProfile()` (T409901)]], [[gerrit:1219559|Rest: Add more debug logging for `Resource::getProfile()` (T409901)]] (duration: 06m 54s) [14:25:35] T409901: TypeError: array_keys(): Argument #1 ($array) must be of type array, null given by $resourceServer->getScopes() - https://phabricator.wikimedia.org/T409901 [14:26:01] sergi0, over to you. [14:26:05] I'm done [14:26:33] xSavitar alright, I will poke you cscott after [14:26:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:26:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219582 (https://phabricator.wikimedia.org/T398500) (owner: 10Sergio Gimeno) [14:27:00] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:27:53] sergi0: thanks! 
[14:28:37] !log installing rubygems security updates [14:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:33:15] (03PS1) 10Matthieulec: Keeping all services in the exported metrics. The switchover exclusion list should be applied on the final dashboard to filter out services data consistently. [puppet] - 10https://gerrit.wikimedia.org/r/1219595 (https://phabricator.wikimedia.org/T327663) [14:35:16] (03CR) 10CI reject: [V:04-1] Keeping all services in the exported metrics. The switchover exclusion list should be applied on the final dashboard to filter out services data consistently. [puppet] - 10https://gerrit.wikimedia.org/r/1219595 (https://phabricator.wikimedia.org/T327663) (owner: 10Matthieulec) [14:36:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:36:49] 06SRE, 10MW-on-K8s, 06serviceops: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11473183 (10Scott_French) @thcipriani - Thanks for pulling together T412265#11471277. Indeed, your understanding here is correct. //Cause// We believe this is an... 
[14:37:00] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:37:04] (03PS2) 10Matthieulec: export_service_type: Remove exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/1219595 (https://phabricator.wikimedia.org/T327663) [14:37:54] 06SRE, 10SRE-Access-Requests: Add FIDO-backed SSH key for aklapper - https://phabricator.wikimedia.org/T413009#11473194 (10Marostegui) 05Open→03Resolved a:03Marostegui The change has been merged, give it 20-30 minutes for it to spread across production and test it! Please reopen if you need anything... [14:40:45] (03Merged) 10jenkins-bot: UserImpact: stop using pre-computed impact in the user impact job [extensions/GrowthExperiments] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219582 (https://phabricator.wikimedia.org/T398500) (owner: 10Sergio Gimeno) [14:41:15] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1219582|UserImpact: stop using pre-computed impact in the user impact job (T398500)]] [14:41:19] T398500: [timebox: 3 days] Impact module: Support larger wgGEUserImpactMaxEdits - https://phabricator.wikimedia.org/T398500 [14:43:26] !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1219582|UserImpact: stop using pre-computed impact in the user impact job (T398500)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:43:48] Testing now [14:44:43] (03CR) 10Dzahn: [V:03+1] mx/spamassassin: allow overriding sa daemon package name in Hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975) (owner: 10Dzahn) [14:44:58] (03PS3) 10Dzahn: mx/spamassassin: allow overriding sa daemon package name in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975) [14:46:45] !log sgimeno@deploy2002 sgimeno: Continuing with sync [14:47:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:48:23] (03CR) 10Klausman: [C:03+1] Rework Makefile.build to ease additional distributions (031 comment) [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1219130 (owner: 10Elukey) [14:49:01] (03CR) 10Klausman: [C:03+1] ml-builder: clone production images [puppet] - 10https://gerrit.wikimedia.org/r/1219553 (owner: 10Dpogorzelski) [14:49:15] (03CR) 10Klausman: [C:03+1] ml: add ml specific config [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1219552 (https://phabricator.wikimedia.org/T394778) (owner: 10Dpogorzelski) [14:50:46] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1219582|UserImpact: stop using pre-computed impact in the user impact job (T398500)]] (duration: 09m 31s) [14:50:51] T398500: [timebox: 3 days] Impact module: Support larger wgGEUserImpactMaxEdits - https://phabricator.wikimedia.org/T398500 [14:51:00] cscott: all yours [14:51:18] thanks! 
[14:52:45] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:52:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219587 (https://phabricator.wikimedia.org/T413084) (owner: 10C. Scott Ananian) [14:53:38] (03Merged) 10jenkins-bot: Turn on Parsoid Read Views on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219587 (https://phabricator.wikimedia.org/T413084) (owner: 10C. Scott Ananian) [14:54:09] !log cscott@deploy2002 Started scap sync-world: Backport for [[gerrit:1219587|Turn on Parsoid Read Views on nlwiki (T413084)]] [14:54:13] T413084: Parsoid Read Views to deploy ~2025-12-18 (itwiki, nlwiki) - https://phabricator.wikimedia.org/T413084 [14:55:30] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:56:27] !log cscott@deploy2002 cscott: Backport for [[gerrit:1219587|Turn on Parsoid Read Views on nlwiki (T413084)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:57:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:58:52] !log cscott@deploy2002 cscott: Continuing with sync [15:01:22] (03CR) 10Btullis: [C:03+2] Record the fact that tchanders now has kerberos access [puppet] - 10https://gerrit.wikimedia.org/r/1219544 (https://phabricator.wikimedia.org/T411860) (owner: 10Btullis) [15:02:21] (03Abandoned) 10Dzahn: mx/spamassassin: allow overriding sa daemon package name in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975) (owner: 10Dzahn) [15:03:21] !log cscott@deploy2002 Finished scap sync-world: Backport for [[gerrit:1219587|Turn on Parsoid Read Views on nlwiki (T413084)]] (duration: 09m 12s) [15:03:26] T413084: Parsoid Read Views to deploy ~2025-12-18 (itwiki, nlwiki) - https://phabricator.wikimedia.org/T413084 [15:04:02] ok, i'm done, and I think I was the last one in the backport window [15:04:31] deployment calendar looks like it, thanks! 
[15:05:12] !log UTC afternoon backport+config window done [15:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:30] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:07:08] (03CR) 10Clément Goubert: [C:03+1] export_service_type: Remove exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/1219595 (https://phabricator.wikimedia.org/T327663) (owner: 10Matthieulec) [15:07:55] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11473341 (10Scott_French) Many thanks for investigating this @MatthewVernon and @elukey. It's interesting h... [15:08:54] 06SRE, 10MW-on-K8s, 06serviceops: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11473354 (10MatthewVernon) Adding or removing hosts from the swift rings will create more "churn" - you have to make incremental changes to the swift rings, deploy... 
[15:09:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:11:28] (03CR) 10Clément Goubert: rest-gateway: move values-minikube.minikube to service definition (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219222 (owner: 10Daniel Kinzler) [15:16:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:21:00] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:21:59] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11473425 (10elukey) I've executed the following from ms-fe2009, deleting the objects that Matthew highlighte... [15:23:44] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11473438 (10MatthewVernon) My unfounded suspicion is that the bad objects were trying to be uploaded during... 
[15:25:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:25:18] (03PS1) 10Cathal Mooney: hiera: disable video qos on cache upload [puppet] - 10https://gerrit.wikimedia.org/r/1219598 (https://phabricator.wikimedia.org/T412785) [15:26:01] (03CR) 10Clément Goubert: [C:03+2] export_service_type: Remove exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/1219595 (https://phabricator.wikimedia.org/T327663) (owner: 10Matthieulec) [15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251218T1530) [15:30:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:30:32] Elevated timeouts on api-ext, commons [15:34:09] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11473485 (10elukey) After the container-sync restart on ms-be2081, I noticed the following errors and I trie... 
[15:34:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:40:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:42:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:43:57] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11473497 (10elukey) ` [15:45:24] FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:47:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:52:06] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - lswtest-d8-eqiad:ethernet-1/56 (Core: ssw1-d1-eqiad:ethernet-1/17 
{#temp1848392398}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:52:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:56:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:00:05] dancy and hashar: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251218T1600) [16:01:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:02:39] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11473512 (10Papaul) ` Hi Papaul, I’ve replicated this issue in our lab. I’m escalating this to our next level of support for further investigation. I... 
[16:06:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:11:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:12:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:15:49] (03PS2) 10Elukey: Rework Makefile.build to ease additional distributions [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1219130 [16:15:49] (03PS2) 10Elukey: Add Trixie artifacts [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1219131 [16:16:09] (03CR) 10Elukey: Rework Makefile.build to ease additional distributions (031 comment) [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1219130 (owner: 10Elukey) [16:17:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:17:40] FIRING: [2x] SystemdUnitFailed: logrotate.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:19:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:23:18] (03PS3) 10Elukey: Rework Makefile.build to ease additional distributions [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1219130 [16:23:18] (03PS3) 10Elukey: Add Trixie artifacts [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1219131 [16:23:46] (03CR) 10Elukey: [V:03+2 C:03+2] "Removed a couple of PHONY targets that I added, they were making the script unusable." [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1219130 (owner: 10Elukey) [16:23:57] (03CR) 10Elukey: [V:03+2 C:03+2] Add Trixie artifacts [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1219131 (owner: 10Elukey) [16:24:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:25:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:27:12] !log elukey@deploy2002 Started deploy [docker-pkg/deploy@a8e9cb3]: (no justification provided) [16:27:19] !log elukey@deploy2002 Finished deploy [docker-pkg/deploy@a8e9cb3]: (no justification provided) (duration: 
00m 12s) [16:30:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:32:18] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11473631 (10MatthewVernon) Yes, I think we're at "give it some time", but I think we've unblocked replicatio... [16:35:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:40:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:42:50] !log elukey@deploy2002 Started deploy [docker-pkg/deploy@a8e9cb3]: (no justification provided) [16:43:04] !log elukey@deploy2002 Finished deploy [docker-pkg/deploy@a8e9cb3]: (no justification provided) (duration: 00m 15s) [16:43:17] (03PS3) 10Aaron Schulz: rest-gateway: support REST sandbox requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207267 (https://phabricator.wikimedia.org/T396807) [17:00:05] jhathaway and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251218T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:05:55] (03PS1) 10Aaron Schulz: restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219604 [17:07:56] (03CR) 10CI reject: [V:04-1] restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219604 (owner: 10Aaron Schulz) [17:14:18] (03PS1) 10Dreamy Jazz: CheckUser: Set $wgCheckUserLogMaxRangeToShowInLog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219606 (https://phabricator.wikimedia.org/T320769) [17:14:49] Anyone using the puppet window? If not, I'd like to use scap to apply a config change [17:15:28] nope, all yours [17:18:32] Thanks [17:18:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219606 (https://phabricator.wikimedia.org/T320769) (owner: 10Dreamy Jazz) [17:19:27] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-f4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T413083#11473731 (10Jclark-ctr) a:03Jclark-ctr Phase, BA:L3-L1, Active Power and AA:L3-L1, Active Power are over powered will need to be rebalanced off those branches [17:19:28] (03Merged) 10jenkins-bot: CheckUser: Set $wgCheckUserLogMaxRangeToShowInLog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219606 (https://phabricator.wikimedia.org/T320769) (owner: 10Dreamy Jazz) [17:19:46] (03PS1) 10Mforns: Add new image to page-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219607 (https://phabricator.wikimedia.org/T405041) [17:19:58] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1219606|CheckUser: Set $wgCheckUserLogMaxRangeToShowInLog (T320769)]] [17:20:02] T320769: Don't show over limit checks in the CheckUserLog or remove all over limit entries from enwiki - https://phabricator.wikimedia.org/T320769 [17:22:18] !log dreamyjazz@deploy2002 dreamyjazz: Backport for 
[[gerrit:1219606|CheckUser: Set $wgCheckUserLogMaxRangeToShowInLog (T320769)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:22:40] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [17:26:44] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1219606|CheckUser: Set $wgCheckUserLogMaxRangeToShowInLog (T320769)]] (duration: 06m 46s) [17:26:48] T320769: Don't show over limit checks in the CheckUserLog or remove all over limit entries from enwiki - https://phabricator.wikimedia.org/T320769 [17:26:58] I'm finished with my deployments [17:27:12] (03CR) 10Btullis: [C:03+2] Add new image to page-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219607 (https://phabricator.wikimedia.org/T405041) (owner: 10Mforns) [17:28:30] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11473749 (10MatthewVernon) ` Dec 18 14:36:26 ms-be2081 container-sync: Since Thu Dec 18 13:36:25 2025: 12 sy... 
[17:28:40] PROBLEM - Host wdqs1013 is DOWN: PING CRITICAL - Packet loss = 100% [17:28:54] FIRING: [4x] CoreBGPDown: Core BGP session down between lswtest-d8-eqiad and ssw1-d1-eqiad (10.64.128.17) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:29:13] (03Merged) 10jenkins-bot: Add new image to page-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219607 (https://phabricator.wikimedia.org/T405041) (owner: 10Mforns) [17:29:18] RECOVERY - Host wdqs1013 is UP: PING WARNING - Packet loss = 90%, RTA = 0.37 ms [17:34:06] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11473764 (10MatthewVernon) Found some more with `journalctl -o cat -u swift-container-sync.service -g 'Unkno... [18:00:05] bd808: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251218T1800). [18:00:06] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251218T1800) [18:25:47] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11473918 (10MatthewVernon) ` Dec 18 18:24:54 ms-be2081 container-sync: Since Thu Dec 18 17:24:44 2025: 12 sy... 
[18:28:11] (03PS3) 10Clare Ming: Update references to Test Kitchen [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218395 (https://phabricator.wikimedia.org/T407906) [18:29:19] (03CR) 10Clare Ming: "per discussion with team, we will not update stream names with `product_metrics` in the title" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218395 (https://phabricator.wikimedia.org/T407906) (owner: 10Clare Ming) [18:36:46] (03PS1) 10DDesouza: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219616 (https://phabricator.wikimedia.org/T402636) [18:42:44] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [18:42:47] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [18:42:48] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [18:42:51] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [18:42:53] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [18:42:55] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [18:42:57] (03CR) 10DDesouza: [C:03+2] miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219616 (https://phabricator.wikimedia.org/T402636) (owner: 10DDesouza) [18:44:48] (03Merged) 10jenkins-bot: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219616 (https://phabricator.wikimedia.org/T402636) (owner: 10DDesouza) [18:45:39] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [18:45:56] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [18:45:57] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [18:46:14] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [18:46:15] !log dani@deploy2002 helmfile [codfw] START 
helmfile.d/services/miscweb: apply [18:46:31] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [18:53:52] (03PS1) 10Tchanders: Don't collect CheckUser-specific temp account patrolling metrics on labs [puppet] - 10https://gerrit.wikimedia.org/r/1219619 (https://phabricator.wikimedia.org/T413101) [19:00:05] dancy and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251218T1900) [19:01:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11474018 (10VRiley-WMF) [19:02:11] o/ [19:02:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11474039 (10VRiley-WMF) wikikube-worker1360 B2 U18 wikikube-worker1361 B4 U36 wikikube-worker1362 C3 U37 wikikube-worker1363 C4 U28 wikikube-worker1364 C5 U31 wikikube... [19:02:53] (03PS1) 10TrainBranchBot: group2 to 1.46.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219621 (https://phabricator.wikimedia.org/T408277) [19:02:55] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219621 (https://phabricator.wikimedia.org/T408277) (owner: 10TrainBranchBot) [19:03:43] (03Merged) 10jenkins-bot: group2 to 1.46.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219621 (https://phabricator.wikimedia.org/T408277) (owner: 10TrainBranchBot) [19:05:26] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Propose a new set of standard thumbnail sizes - https://phabricator.wikimedia.org/T412971#11474062 (10bvibber) > iOS currently uses 2x320 = 640 on retina devices, which would end up being 960-downscaled. Is that too wasteful? c... 
[19:08:24] (03PS3) 10Daniel Kinzler: rest-gateway: move values-minikube.minikube to service definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219222 [19:09:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:10:00] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.7 refs T408277 [19:10:04] T408277: 1.46.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T408277 [19:13:39] FIRING: [6x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:35:15] (03PS1) 10Herron: admin: herron: add yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1219625 [19:45:24] FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:46:56] (03PS2) 10Andrea Denisse: admin: Remove non yubikey SSH key for denisse. 
[puppet] - 10https://gerrit.wikimedia.org/r/1219254 (https://phabricator.wikimedia.org/T413006) [19:49:56] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-f4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T413083#11474190 (10phaultfinder) [19:52:06] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - lswtest-d8-eqiad:ethernet-1/56 (Core: ssw1-d1-eqiad:ethernet-1/17 {#temp1848392398}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [20:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:12:56] (03PS1) 10D3r1ck01: Fetch user from primary DB when saving settings [core] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219628 (https://phabricator.wikimedia.org/T411804) [20:13:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219628 (https://phabricator.wikimedia.org/T411804) (owner: 10D3r1ck01) [20:16:40] (03CR) 10D3r1ck01: "Tentative schedule, if I don't make it, I can deploy any other time." 
[core] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219628 (https://phabricator.wikimedia.org/T411804) (owner: 10D3r1ck01) [20:17:40] FIRING: [2x] SystemdUnitFailed: logrotate.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:18:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218775 (https://phabricator.wikimedia.org/T412455) (owner: 10LorenMora) [20:20:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:20:32] (03CR) 10Dr0ptp4kt: trafficserver: Send /evt-502b/v2/events to intake-analytics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863) (owner: 10Milimetric) [20:25:00] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-f4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T413083#11474248 (10phaultfinder) [20:28:22] (03PS3) 10ArielGlenn: Add the second yubikey FIDO-compliant ssh key for ariel [puppet] - 10https://gerrit.wikimedia.org/r/1219498 (https://phabricator.wikimedia.org/T413019) [20:29:18] PROBLEM - Host wikikube-worker1053 is DOWN: PING CRITICAL - Packet loss = 100% [20:29:44] RECOVERY - Host wikikube-worker1053 is UP: PING OK - Packet loss = 0%, RTA = 199.33 ms [20:29:50] (03CR) 10ArielGlenn: [C:03+2] Add the second yubikey FIDO-compliant ssh key for ariel [puppet] - 10https://gerrit.wikimedia.org/r/1219498 (https://phabricator.wikimedia.org/T413019) (owner: 10ArielGlenn) [20:30:14] 
(03PS4) 10Daniel Kinzler: rest-gateway: move values-minikube.minikube to service definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219222 [20:39:29] 06SRE, 10SRE-Access-Requests: Add yubikey SSH key for 'denisse' - https://phabricator.wikimedia.org/T413006#11474280 (10andrea.denisse) >>! In T413006#11471604, @Marostegui wrote: > @andrea.denisse I assume you'd handle this yourelf or you'd need help from clinic duty? Hi Miguel, I'll handle this myself, than... [20:39:59] 06SRE, 10SRE-Access-Requests: Add yubikey SSH key for 'denisse' - https://phabricator.wikimedia.org/T413006#11474281 (10andrea.denisse) 05In progress→03Resolved [20:45:52] !log dancy@deploy2002 Installing scap version "4.230.0" for 2 host(s) [20:47:39] !log dancy@deploy2002 Installation of scap version "4.230.0" completed for 2 hosts [20:48:09] !log dancy@deploy2002 Installing scap version "4.230.0" for 1 host(s) [20:49:07] !log dancy@deploy2002 Installation of scap version "4.230.0" completed for 1 hosts [20:53:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [20:54:11] RESOLVED: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:59:53] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-f4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T413083#11474364 (10phaultfinder) [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251218T2100). 
[21:00:05] tgr, cscott, xSavitar, toyofuku, and Pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:06] Here [21:00:10] o/ [21:00:10] o/ [21:00:47] that's a lot of patches, who doesn't need separate testing? [21:00:50] Mine doesn't [21:00:53] i don't need separate testing [21:01:07] it's a change to a python script that doesn't run on the wiki at all but happens to be stored in the config repo, so a total no-op [21:01:41] (And yes, I scheduled it for deployment yesterday, but the lost track of time and no-showed the window. I'm sorry) [21:02:13] tgr_: do you feel comfortable doing the first three in one batch (yours, pppery's and mine?) [21:02:26] Mine needs separate testing - I can deploy myself, but would need to deploy it today (sorry) [21:02:27] yes [21:02:46] 06SRE, 10SRE-Access-Requests: Yubikey-SSH-FIDO for cdobbins - https://phabricator.wikimedia.org/T412755#11474371 (10CDobbins) 05Stalled→03Resolved [21:02:54] Let me amend my "need" - the world will not stop if it doesn't go out, but in an ideal world I would be able to deploy it (although all good with waiting a bit!) [21:03:12] tgr_: i'd say you might as well get started, then. we'll let toyofuku and xSavitar fight for second deploy when/if xSavitar shows up [21:04:07] toyofuku: seems like there should be plenty of time, especially if we do three of them at once. config patches are fairly quick. [21:04:10] > Change '1217790' has dependency '1203252' targeting the master branch [21:04:13] of MediaWiki code project 'mediawiki/core', but the dependency is not [21:04:16] present in recent train branch: wmf/1.46.0-wmf.5 [21:04:18] > This branch is a likely rollback target, so it is recommended that you [21:04:21] cherry-pick the dependency into that branch for rollback safety. [21:04:31] ^ scap is getting pretty smart! 
[21:05:11] (the patch will work with wmf.5, but it's cool that it warns about it) [21:05:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217790 (https://phabricator.wikimedia.org/T142542) (owner: 10Gergő Tisza) [21:05:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219588 (https://phabricator.wikimedia.org/T413084) (owner: 10C. Scott Ananian) [21:05:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [21:06:27] (03Merged) 10jenkins-bot: Remove LoggedOut cookie logic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217790 (https://phabricator.wikimedia.org/T142542) (owner: 10Gergő Tisza) [21:06:29] (03Merged) 10jenkins-bot: Turn on Parsoid Read Views on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219588 (https://phabricator.wikimedia.org/T413084) (owner: 10C. 
Scott Ananian) [21:06:33] (03Merged) 10jenkins-bot: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [21:06:54] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1217790|Remove LoggedOut cookie logic (T142542)]], [[gerrit:1219588|Turn on Parsoid Read Views on itwiki (T413084)]], [[gerrit:1217282|Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps (T405169)]] [21:07:02] T142542: LoggedOut cookie not set anymore - https://phabricator.wikimedia.org/T142542 [21:07:02] T413084: Parsoid Read Views to deploy ~2025-12-18 (itwiki, nlwiki) - https://phabricator.wikimedia.org/T413084 [21:07:03] T405169: logos/manage.py does not find 1.5x logo - https://phabricator.wikimedia.org/T405169 [21:08:54] !log tgr@deploy2002 pppery, tgr, cscott: Backport for [[gerrit:1217790|Remove LoggedOut cookie logic (T142542)]], [[gerrit:1219588|Turn on Parsoid Read Views on itwiki (T413084)]], [[gerrit:1217282|Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps (T405169)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:09:18] !log tgr@deploy2002 pppery, tgr, cscott: Continuing with sync [21:09:58] tgr_: yeah, looks good to me [21:10:17] (03PS1) 10CDobbins: prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [21:10:50] (03CR) 10CI reject: [V:04-1] prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [21:13:18] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7838/console" [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [21:13:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [21:13:23] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1217790|Remove LoggedOut cookie logic (T142542)]], [[gerrit:1219588|Turn on Parsoid Read Views on itwiki (T413084)]], [[gerrit:1217282|Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps (T405169)]] (duration: 06m 28s) [21:13:30] T142542: LoggedOut cookie not set anymore - https://phabricator.wikimedia.org/T142542 [21:13:30] T413084: Parsoid Read Views to deploy ~2025-12-18 (itwiki, nlwiki) - https://phabricator.wikimedia.org/T413084 [21:13:30] T405169: logos/manage.py does not find 1.5x logo - https://phabricator.wikimedia.org/T405169 [21:13:32] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - 
https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [21:14:10] toyofuku: all yours [21:14:21] Thank you!! [21:15:23] (03PS2) 10CDobbins: prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [21:15:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218775 (https://phabricator.wikimedia.org/T412455) (owner: 10LorenMora) [21:15:56] (03CR) 10CI reject: [V:04-1] prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [21:16:06] tgr_: thank you! [21:16:36] (03Merged) 10jenkins-bot: [Legal Footer] Deploy Legal Footer for Phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218775 (https://phabricator.wikimedia.org/T412455) (owner: 10LorenMora) [21:16:55] !log toyofuku@deploy2002 Started scap sync-world: Backport for [[gerrit:1218775|[Legal Footer] Deploy Legal Footer for Phase 1 wikis (T412455)]] [21:16:59] T412455: [Legal Footer] Turn on wmgUseFooterLegalContactLink config for English and German - https://phabricator.wikimedia.org/T412455 [21:18:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [21:18:26] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - 
https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [21:19:12] !log toyofuku@deploy2002 toyofuku, lmora: Backport for [[gerrit:1218775|[Legal Footer] Deploy Legal Footer for Phase 1 wikis (T412455)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:19:20] Testing rq [21:19:54] Looks good, continuing [21:19:57] !log toyofuku@deploy2002 toyofuku, lmora: Continuing with sync [21:23:59] !log toyofuku@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218775|[Legal Footer] Deploy Legal Footer for Phase 1 wikis (T412455)]] (duration: 07m 04s) [21:24:04] T412455: [Legal Footer] Turn on wmgUseFooterLegalContactLink config for English and German - https://phabricator.wikimedia.org/T412455 [21:25:14] All set, thank you so much!!! 
[21:39:26] (03PS3) 10CDobbins: prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [21:39:59] (03CR) 10CI reject: [V:04-1] prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [21:46:23] (03PS1) 10Scott French: Various UI improvements [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1219644 [21:49:54] (03CR) 10Scott French: [V:03+2 C:03+2] Various UI improvements [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1219644 (owner: 10Scott French) [21:50:55] !log swfrench@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Deploy: Various UI improvements - swfrench@cumin2002" [21:50:58] !log swfrench@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: Various UI improvements - swfrench@cumin2002 [21:51:48] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy: Various UI improvements - swfrench@cumin2002 [21:51:50] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Deploy: Various UI improvements - swfrench@cumin2002" [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251218T2200) [22:10:20] preparing to do a security deployment with scap [22:10:26] is anyone deploying right now? 
[22:11:08] !log cwhite@deploy2002 Started deploy [statsv/statsv@0751b0b]: T383563 [22:11:13] T383563: mw.track: support for histogram metrics - https://phabricator.wikimedia.org/T383563 [22:11:19] !log cwhite@deploy2002 Finished deploy [statsv/statsv@0751b0b]: T383563 (duration: 00m 10s) [22:13:25] (03PS4) 10CDobbins: prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [22:13:45] looks like that's a no [22:13:59] (03CR) 10CI reject: [V:04-1] prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [22:15:03] !log uploading corto 1.0.21 [22:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:03] (03PS5) 10CDobbins: prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [22:20:54] (03CR) 10CI reject: [V:04-1] prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [22:24:18] !log mstyles Deployed security patch for T384147 [22:25:01] 06SRE, 10Wikimedia-Mailing-lists: create list for WikiClub Moncton - https://phabricator.wikimedia.org/T413098#11474652 (10Ladsgroup) I think you meant `wikiclub-moncton@lists.wikimedia.org`? 
[22:25:18] (03PS6) 10CDobbins: prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [22:25:54] 06SRE, 10Wikimedia-Mailing-lists: create list for WikiClub Moncton - https://phabricator.wikimedia.org/T413098#11474659 (10SophieWMCA) [22:26:01] (03CR) 10CI reject: [V:04-1] prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [22:26:10] 06SRE, 10Wikimedia-Mailing-lists: create list for WikiClub Moncton - https://phabricator.wikimedia.org/T413098#11474660 (10SophieWMCA) >>! In T413098#11474652, @Ladsgroup wrote: > I think you meant `wikiclub-moncton@lists.wikimedia.org`? indeed! thanks for catching that typo :) [22:27:48] (03PS7) 10CDobbins: prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [22:28:30] (03CR) 10CI reject: [V:04-1] prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [22:29:56] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-f4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T413083#11474675 (10phaultfinder) [22:44:39] (03PS8) 10CDobbins: prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [22:57:10] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7840/console" [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [22:57:43] (03CR) 10RLazarus: [C:03+2] Remove LoggedOut cookie handling [puppet] - 10https://gerrit.wikimedia.org/r/1217774 (https://phabricator.wikimedia.org/T142542) (owner: 10Gergő Tisza) 
[23:10:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:11:17] (PS1) ArielGlenn: Remove the old non-fido-compliant ssh key for ariel [puppet] - https://gerrit.wikimedia.org/r/1219654 (https://phabricator.wikimedia.org/T413019) [23:13:54] FIRING: [6x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:14:01] SRE, SRE-Access-Requests, Patch-For-Review: Add FIDO ssh key(s) for ariel - https://phabricator.wikimedia.org/T413019#11474759 (ArielGlenn) [23:15:23] SRE, Wikimedia-Mailing-lists: create list for WikiClub Moncton - https://phabricator.wikimedia.org/T413098#11474766 (Ladsgroup) Open→Resolved a:Ladsgroup https://lists.wikimedia.org/postorius/lists/wikiclub-moncton.lists.wikimedia.org/members/owner/ I made it a public mailing list but yo...
[23:35:13] ops-eqiad, SRE, DC-Ops: Alert for device ps1-f4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T413083#11474779 (phaultfinder) [23:49:53] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for akhatun - https://phabricator.wikimedia.org/T413140 (AKhatun_WMF) NEW [23:52:06] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - lswtest-d8-eqiad:ethernet-1/56 (Core: ssw1-d1-eqiad:ethernet-1/17 {#temp1848392398}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:58:44] (PS2) Aaron Schulz: restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway [puppet] - https://gerrit.wikimedia.org/r/1219604 (https://phabricator.wikimedia.org/T396807) [23:59:43] (CR) Milimetric: trafficserver: Send /evt-502b/v2/events to intake-analytics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863) (owner: Milimetric)