[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260127T0000)
[00:00:08] jan_drewniak: I was eventually able to get everything deployed once that wrapped up. no worries!
[00:40:22] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1233317
[00:40:22] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1233317 (owner: TrainBranchBot)
[00:51:11] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11556167 (Krinkle) >>! In T414805#11555784, @Izno wrote: > https://en.wikipedia.org/wiki/User:Bradv/Scripts/ExpandDiffs.js#L-20 […]...
[00:52:37] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1233317 (owner: TrainBranchBot)
[01:04:03] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11556200 (Nux) Are icons really a problem? I mean, are there actually external sites using images loaded from Wikimedia as icons on...
[01:05:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[01:10:38] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1233324
[01:10:38] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1233324 (owner: TrainBranchBot)
[01:25:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[01:29:41] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:35:48] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1233324 (owner: TrainBranchBot)
[02:01:12] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:01:14] RECOVERY - dump of s1 in codfw on backupmon1001 is OK: Last dump for s1 at codfw (db2141) taken on 2026-01-27 00:00:14 (153 GiB, +0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:01:32] FIRING: [4x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:10:04] (PS1) TrainBranchBot: Branch commit for wmf/1.46.0-wmf.13 [core] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1233336 (https://phabricator.wikimedia.org/T413804)
[02:10:07] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.46.0-wmf.13 [core] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1233336 (https://phabricator.wikimedia.org/T413804) (owner: TrainBranchBot)
[02:13:54] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 12m 42s)
[02:20:57] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11556311 (Nux) Found two problems with the migration: 1. The docs for the editor suggest using 22px (though I guess you can replace...
[02:22:50] (Merged) jenkins-bot: Branch commit for wmf/1.46.0-wmf.13 [core] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1233336 (https://phabricator.wikimedia.org/T413804) (owner: TrainBranchBot)
[02:29:46] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11556320 (Ladsgroup) why not simply using https://upload.wikimedia.org/wikipedia/commons/thumb/6/61/Contribs_icon-black.svg/20px-Con...
[02:30:42] (PS1) Foks: AccountRecovery: Adding additional Zendesk fields [mediawiki-config] - https://gerrit.wikimedia.org/r/1233349 (https://phabricator.wikimedia.org/T414597)
[02:50:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:52:43] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11556343 (Nux) Because SVG is sharp on any dpi/ppi.
[02:56:18] PROBLEM - dump of m1 in codfw on backupmon1001 is CRITICAL: Last dump for m1 at codfw (db2160) taken on 2026-01-27 00:15:00 is 75 GiB, but the previous one was 89 GiB, a change of -15.9 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260127T0300)
[03:06:16] PROBLEM - dump of s1 in eqiad on backupmon1001 is CRITICAL: Last dump for s1 at eqiad (db1240) taken on 2026-01-27 00:00:10 is 153 GiB, but the previous one was 183 GiB, a change of -16.2 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:19:14] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:34:14] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:41:18] PROBLEM - dump of s8 in eqiad on backupmon1001 is CRITICAL: Last dump for s8 at eqiad (db1171) taken on 2026-01-27 00:00:03 is 183 GiB, but the previous one was 240 GiB, a change of -23.4 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260127T0400)
[04:02:09] (PS1) TrainBranchBot: testwikis to 1.46.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1233367 (https://phabricator.wikimedia.org/T413804)
[04:02:13] (CR) TrainBranchBot: [C:+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1233367 (https://phabricator.wikimedia.org/T413804) (owner: TrainBranchBot)
[04:03:04] (Merged) jenkins-bot: testwikis to 1.46.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1233367 (https://phabricator.wikimedia.org/T413804) (owner: TrainBranchBot)
[04:03:33] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.13 refs T413804
[04:03:39] T413804: 1.46.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T413804
[04:48:36] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.46.0-wmf.13 refs T413804 (duration: 45m 03s)
[04:48:41] T413804: 1.46.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T413804
[05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260127T0500)
[05:02:46] !log mwpresync@deploy2002 Pruned MediaWiki: 1.46.0-wmf.10 (duration: 02m 42s)
[05:09:14] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:29:41] FIRING: [9x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:35:10] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:39:14] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:41:18] PROBLEM - dump of m1 in eqiad on backupmon1001 is CRITICAL: Last dump for m1 at eqiad (db1217) taken on 2026-01-27 03:05:15 is 75 GiB, but the previous one was 90 GiB, a change of -15.9 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:01:32] FIRING: [4x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:36:54] (PS1) Clare Ming: Add ssh key for cjming new laptop [puppet] - https://gerrit.wikimedia.org/r/1233566
[06:41:14] PROBLEM - dump of s8 in codfw on backupmon1001 is CRITICAL: Last dump for s8 at codfw (db2198) taken on 2026-01-27 00:00:05 is 183 GiB, but the previous one was 240 GiB, a change of -23.4 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:42:52] (PS1) Marostegui: installserver: Do not format db2248 [puppet] - https://gerrit.wikimedia.org/r/1233569 (https://phabricator.wikimedia.org/T415358)
[06:44:52] (CR) Marostegui: [C:+2] installserver: Do not format db2248 [puppet] - https://gerrit.wikimedia.org/r/1233569 (https://phabricator.wikimedia.org/T415358) (owner: Marostegui)
[06:50:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260127T0700)
[07:00:05] marostegui, Amir1, and federico3: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260127T0700).
[07:18:08] !log marostegui@cumin1003 START - Cookbook sre.mysql.newdepool depool db2248: Reimage
[07:18:29] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newdepool (exit_code=0) depool db2248: Reimage
[07:19:06] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on db2248.codfw.wmnet with reason: reimage
[07:19:14] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:19:51] FIRING: CoreRouterInterfaceDropPercent: Core router normal + high priority queue drops are high on cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, ...
[07:19:51] MAC filter) {#1016}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#CoreRouterInterfaceDropPercent - https://grafana.wikimedia.org/d/5p97dAASz/network-device-interface-queues-and-error-stats?var-site=eqsin,var-instance=cr3-eqsin:9804&var-interface=xe-0/1/3 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDropPercent
[07:20:17] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2248.codfw.wmnet with OS trixie
[07:24:51] RESOLVED: CoreRouterInterfaceDropPercent: Core router normal + high priority queue drops are high on cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, ...
[07:24:51] MAC filter) {#1016}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#CoreRouterInterfaceDropPercent - https://grafana.wikimedia.org/d/5p97dAASz/network-device-interface-queues-and-error-stats?var-site=eqsin,var-instance=cr3-eqsin:9804&var-interface=xe-0/1/3 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDropPercent
[07:24:51] ACKNOWLEDGEMENT - dump of m1 in codfw on backupmon1001 is CRITICAL: Last dump for m1 at codfw (db2160) taken on 2026-01-27 00:15:00 is 75 GiB, but the previous one was 89 GiB, a change of -15.9 % Marostegui This is expected https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:24:51] ACKNOWLEDGEMENT - dump of m1 in eqiad on backupmon1001 is CRITICAL: Last dump for m1 at eqiad (db1217) taken on 2026-01-27 03:05:15 is 75 GiB, but the previous one was 90 GiB, a change of -15.9 % Marostegui This is expected https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:24:51] ACKNOWLEDGEMENT - dump of s1 in eqiad on backupmon1001 is CRITICAL: Last dump for s1 at eqiad (db1240) taken on 2026-01-27 00:00:10 is 153 GiB, but the previous one was 183 GiB, a change of -16.2 % Marostegui This is expected https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:24:51] ACKNOWLEDGEMENT - dump of s8 in codfw on backupmon1001 is CRITICAL: Last dump for s8 at codfw (db2198) taken on 2026-01-27 00:00:05 is 183 GiB, but the previous one was 240 GiB, a change of -23.4 % Marostegui This is expected https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:24:51] ACKNOWLEDGEMENT - dump of s8 in eqiad on backupmon1001 is CRITICAL: Last dump for s8 at eqiad (db1171) taken on 2026-01-27 00:00:03 is 183 GiB, but the previous one was 240 GiB, a change of -23.4 % Marostegui This is expected https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:36:14] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2248.codfw.wmnet with reason: host reimage
[07:39:36] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2248.codfw.wmnet with reason: host reimage
[08:00:05] Amir1, Urbanecm, and awight: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260127T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:01:32] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2248.codfw.wmnet with OS trixie
[08:04:42] SRE, Traffic: OGP lists fullsize thumbnail version of original instead the original itself - https://phabricator.wikimedia.org/T415598#11556682 (Bawolff) >>! In T415598#11555961, @AntiCompositeNumber wrote: >>>! In T415598#11555931, @TheDJ wrote: >> The ogp.me tag is listing the thumbnail variant of the...
[08:19:44] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db2248: After reimage
[08:20:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1184 with weight 0 T415238', diff saved to https://phabricator.wikimedia.org/P87950 and previous config saved to /var/cache/conftool/dbconfig/20260127-082020-marostegui.json
[08:20:26] T415238: Switchover s1 master (db1163 -> db1184) - https://phabricator.wikimedia.org/T415238
[08:20:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s1 T415238
[08:21:03] (CR) Marostegui: [C:+2] mariadb: Promote db1184 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1230144 (https://phabricator.wikimedia.org/T415238) (owner: Gerrit maintenance bot)
[08:24:46] !log Starting s1 eqiad failover from db1163 to db1184 - T415238
[08:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1184 to s1 primary T415238', diff saved to https://phabricator.wikimedia.org/P87951 and previous config saved to /var/cache/conftool/dbconfig/20260127-082502-marostegui.json
[08:25:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1163 T415238', diff saved to https://phabricator.wikimedia.org/P87952 and previous config saved to /var/cache/conftool/dbconfig/20260127-082542-marostegui.json
[08:25:49] T415238: Switchover s1 master (db1163 -> db1184) - https://phabricator.wikimedia.org/T415238
[08:27:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1163.eqiad.wmnet with reason: schema change
[08:28:20] (PS1) Marostegui: db1163: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1233641 (https://phabricator.wikimedia.org/T411163)
[08:28:49] !log Deploy schema change on old s1 master db1163 T411163 T411164
[08:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:03] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[08:29:03] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[08:29:07] (CR) Marostegui: [C:+2] db1163: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1233641 (https://phabricator.wikimedia.org/T411163) (owner: Marostegui)