[00:01:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86686 and previous config saved to /var/cache/conftool/dbconfig/20251217-000109-marostegui.json [00:01:16] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [00:01:16] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [00:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:16:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P86687 and previous config saved to /var/cache/conftool/dbconfig/20251217-001617-marostegui.json [00:17:54] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/apertium: apply [00:18:01] rolling some envoy updates, staging only [00:18:18] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/apertium: apply [00:20:07] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply [00:20:27] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [00:20:38] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [00:20:45] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:20:53] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [00:21:12] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [00:22:15] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [00:22:25] !log rzl@deploy2002 helmfile [staging] 
DONE helmfile.d/services/commons-impact-analytics: apply [00:22:32] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [00:22:54] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [00:23:22] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/data-gateway: apply [00:23:34] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [00:23:44] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [00:23:54] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [00:24:20] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/device-analytics: apply [00:24:30] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [00:24:37] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/echostore: apply [00:24:48] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/echostore: apply [00:25:24] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [00:25:37] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [00:25:43] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [00:25:55] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [00:26:15] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [00:26:27] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [00:26:37] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [00:26:49] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [00:27:01] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [00:27:12] !log rzl@deploy2002 helmfile [staging] 
DONE helmfile.d/services/eventgate-logging-external: apply [00:27:18] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [00:27:29] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [00:27:54] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [00:28:06] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [00:28:17] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [00:28:36] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [00:28:50] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [00:29:02] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [00:29:28] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [00:29:37] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [00:30:11] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [00:30:37] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [00:30:44] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply [00:31:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P86688 and previous config saved to /var/cache/conftool/dbconfig/20251217-003126-marostegui.json [00:31:29] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [00:32:34] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [00:33:50] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [00:34:50] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [00:37:30] !log rzl@deploy2002 
helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [00:37:42] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/media-analytics: apply [00:38:09] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [00:38:29] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [00:39:25] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [00:39:33] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [00:39:47] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [00:40:10] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1218860 [00:40:10] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1218860 (owner: TrainBranchBot) [00:41:36] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [00:41:42] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [00:42:16] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply [00:42:43] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [00:42:50] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [00:43:01] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [00:43:09] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply [00:43:19] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [00:43:33] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [00:43:39] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [00:43:45] 
!log rzl@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: apply [00:43:55] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply [00:45:28] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply [00:45:39] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [00:45:55] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [00:46:07] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [00:46:20] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [00:46:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86689 and previous config saved to /var/cache/conftool/dbconfig/20251217-004634-marostegui.json [00:46:36] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [00:46:40] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [00:46:40] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [00:46:45] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [00:46:51] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2220.codfw.wmnet with reason: Maintenance [00:46:57] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [00:47:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2220 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86690 and previous config saved to /var/cache/conftool/dbconfig/20251217-004659-marostegui.json [00:48:27] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [00:48:45] !log rzl@deploy2002 
helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [00:48:56] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [00:49:15] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [00:49:25] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [00:49:53] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [00:49:58] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [00:50:31] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [00:50:39] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply [00:50:50] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply [00:52:57] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1218860 (owner: TrainBranchBot) [00:56:19] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply [00:56:30] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [00:56:39] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [00:56:52] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [00:57:03] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [00:57:21] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [00:57:30] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [00:58:06] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [00:58:12] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [00:58:31] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [01:01:03] !log 
mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:10:13] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1218862 [01:10:13] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1218862 (owner: TrainBranchBot) [01:25:14] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 24m 10s) [01:34:32] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1218862 (owner: TrainBranchBot) [01:44:06] SRE, Maps, Traffic: Allow Wikimedia Maps usage on Wikidata for Firefox (Browser extension) - https://phabricator.wikimedia.org/T398588#11466958 (Aklapper) Open→Declined [01:48:05] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11466974 (Papaul) a:Papaul→ayounsi @ayounsi assigned back to you since you are working on it. 
thanks [01:55:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86691 and previous config saved to /var/cache/conftool/dbconfig/20251217-015538-marostegui.json [01:55:44] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [01:55:45] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [02:10:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P86692 and previous config saved to /var/cache/conftool/dbconfig/20251217-021046-marostegui.json [02:13:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T410589)', diff saved to https://phabricator.wikimedia.org/P86693 and previous config saved to /var/cache/conftool/dbconfig/20251217-021310-ladsgroup.json [02:13:14] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [02:25:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P86694 and previous config saved to /var/cache/conftool/dbconfig/20251217-022554-marostegui.json [02:28:19] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P86695 and previous config saved to /var/cache/conftool/dbconfig/20251217-022818-ladsgroup.json [02:30:24] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:41:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86696 and previous config 
saved to /var/cache/conftool/dbconfig/20251217-024103-marostegui.json [02:41:09] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [02:41:09] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [02:41:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1231.eqiad.wmnet with reason: Maintenance [02:41:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1231 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86697 and previous config saved to /var/cache/conftool/dbconfig/20251217-024127-marostegui.json [02:43:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P86698 and previous config saved to /var/cache/conftool/dbconfig/20251217-024326-ladsgroup.json [02:58:36] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T410589)', diff saved to https://phabricator.wikimedia.org/P86699 and previous config saved to /var/cache/conftool/dbconfig/20251217-025835-ladsgroup.json [02:58:40] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [02:58:52] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2227.codfw.wmnet with reason: Maintenance [02:59:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2227 (T410589)', diff saved to https://phabricator.wikimedia.org/P86700 and previous config saved to /var/cache/conftool/dbconfig/20251217-025900-ladsgroup.json [03:41:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86701 and previous config saved to /var/cache/conftool/dbconfig/20251217-034143-marostegui.json [03:41:50] T411163: Drop ar_sha1 from archive table in wmf 
production - https://phabricator.wikimedia.org/T411163 [03:41:50] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [03:45:24] FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:49:20] (CR) Dzahn: [C:-2] "this can go last after everything else, cleanup-only and it needs a typo fix" [puppet] - https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn) [03:54:19] PROBLEM - Host lsw1-e2-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:54:52] that is me [03:55:12] evening papaul :) thanks [03:55:31] rzl: hello [03:56:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P86702 and previous config saved to /var/cache/conftool/dbconfig/20251217-035651-marostegui.json [04:02:23] FIRING: GnmiTargetDown: lsw1-e2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [04:04:33] RECOVERY - Host lsw1-e2-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 31.67 ms [04:07:22] RESOLVED: GnmiTargetDown: lsw1-e2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [04:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:12:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P86703 and previous config saved to /var/cache/conftool/dbconfig/20251217-041200-marostegui.json [04:17:26] (CR) Dzahn: ats: gerrit: don't validate TLS host for now (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: CDanis) [04:27:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86704 and previous config saved to /var/cache/conftool/dbconfig/20251217-042708-marostegui.json [04:27:15] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [04:27:15] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [04:27:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1253.eqiad.wmnet with reason: Maintenance [04:27:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1253 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86705 and previous config saved to /var/cache/conftool/dbconfig/20251217-042733-marostegui.json [04:29:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86706 and previous config saved to /var/cache/conftool/dbconfig/20251217-042943-marostegui.json [04:44:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P86707 and previous config saved to /var/cache/conftool/dbconfig/20251217-044453-marostegui.json [04:51:29] SRE, Infrastructure-Foundations, netops: InboundInterfaceErrors alerts 
firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11467201 (Papaul) I took a quick look at this before getting the support ticket going on. On lsw1-e2-codfw we have ` Frame length statistics for m... [04:55:41] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 562521992 and 39 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:59:41] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 1936 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:00:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P86708 and previous config saved to /var/cache/conftool/dbconfig/20251217-050001-marostegui.json [05:01:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/0/6 (Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [05:01:51] FIRING: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - cr1-eqiad:xe-3/2/3 (Core: asw2-b-eqiad:xe-2/0/45 {#3457}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSaturation [05:01:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:02:07] PROBLEM - Swift https frontend on ms-fe1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [05:02:09] PROBLEM - Swift https backend on ms-fe1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [05:02:48] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:02:58] FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:02:59] RECOVERY - Swift https backend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Swift [05:02:59] RECOVERY - Swift https frontend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.206 second response time https://wikitech.wikimedia.org/wiki/Swift [05:04:12] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [05:04:13] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - 
https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [05:04:13] !incidents [05:04:14] 7196 (UNACKED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad) [05:04:14] 7197 (UNACKED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [05:04:14] 7198 (UNACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [05:04:14] 7199 (UNACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [05:04:15] 7195 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad) [05:04:24] !ack 7196 [05:04:24] 7196 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad) [05:04:28] !ack 7197 [05:04:29] 7197 (ACKED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [05:04:33] !ack 7198 [05:04:34] 7198 (ACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [05:04:37] !ack 7199 [05:04:37] 7199 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [05:06:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/0/6 (Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [05:06:51] RESOLVED: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - 
cr1-eqiad:xe-3/2/3 (Core: asw2-b-eqiad:xe-2/0/45 {#3457}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSaturation [05:06:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:08:18] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11467204 (Marostegui) [05:08:25] !incidents [05:08:25] 7199 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [05:08:25] 7200 (UNACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [05:08:25] 7198 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [05:08:26] 7197 (RESOLVED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [05:08:26] 7196 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad) [05:08:26] 7195 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad) [05:08:32] !ack 7200 [05:08:32] 7200 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [05:09:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:12] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [05:09:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [05:11:32] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11467205 (Marostegui) [05:12:18] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11467206 (Marostegui) p:Triage→Medium [05:15:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86709 and previous config saved to /var/cache/conftool/dbconfig/20251217-051509-marostegui.json [05:15:15] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [05:15:15] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [05:15:15] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [05:17:25] (PS5) Pppery: Handle languages with nonstandard plural rules [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217845 (https://phabricator.wikimedia.org/T412422) [05:21:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T411163 T411164)', diff 
saved to https://phabricator.wikimedia.org/P86710 and previous config saved to /var/cache/conftool/dbconfig/20251217-052117-marostegui.json [05:21:23] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [05:21:23] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [05:24:32] (PS6) Pppery: Handle languages with nonstandard plural rules [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217845 (https://phabricator.wikimedia.org/T412422) [05:24:57] SRE, LDAP-Access-Requests, Data-Platform-SRE (2025.11.07 - 2025.11.28): Access Admin menu in Airflow - https://phabricator.wikimedia.org/T412119#11467222 (Marostegui) Open→Resolved I believe this is all done - please reopen if not. Thanks Ben for handling this. [05:25:20] !incidents [05:25:20] 7200 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [05:25:20] 7199 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [05:25:21] 7198 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [05:25:21] 7197 (RESOLVED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [05:25:21] 7196 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad) [05:25:21] 7195 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad) [05:25:23] (PS7) Pppery: Handle languages with nonstandard plural rules [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217845 (https://phabricator.wikimedia.org/T412422) [05:27:57] (PS1) Marostegui: es2028: Add note [puppet] - https://gerrit.wikimedia.org/r/1218872 [05:29:00] (CR) 
Marostegui: "This is a noop" [puppet] - https://gerrit.wikimedia.org/r/1218872 (owner: Marostegui) [05:29:01] (CR) Marostegui: [C:+2] es2028: Add note [puppet] - https://gerrit.wikimedia.org/r/1218872 (owner: Marostegui) [05:30:14] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2222.codfw.wmnet with reason: schema change [05:33:24] (PS4) Pppery: Add an internal translation file for this repo's own strings [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217873 (https://phabricator.wikimedia.org/T412651) [05:34:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:36:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P86711 and previous config saved to /var/cache/conftool/dbconfig/20251217-053625-marostegui.json [05:51:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P86712 and previous config saved to /var/cache/conftool/dbconfig/20251217-055133-marostegui.json [06:06:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86713 and previous config saved to /var/cache/conftool/dbconfig/20251217-060641-marostegui.json [06:06:48] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [06:06:48] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [06:06:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on
db2221.codfw.wmnet with reason: Maintenance [06:07:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2221 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86714 and previous config saved to /var/cache/conftool/dbconfig/20251217-060706-marostegui.json [06:07:45] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.062 second response time https://wikitech.wikimedia.org/wiki/Swift [06:07:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:07:59] PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.056 second response time https://wikitech.wikimedia.org/wiki/Swift [06:07:59] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift [06:07:59] PROBLEM - Swift https frontend on ms-fe1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.062 second response time https://wikitech.wikimedia.org/wiki/Swift [06:07:59] PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.082 second response time https://wikitech.wikimedia.org/wiki/Swift [06:07:59] PROBLEM - Swift https frontend on ms-fe1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.088 second response time https://wikitech.wikimedia.org/wiki/Swift [06:07:59] PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.156 second response time 
https://wikitech.wikimedia.org/wiki/Swift [06:07:59] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:00] PROBLEM - Swift https backend on ms-fe1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.212 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:01] PROBLEM - Swift https backend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:01] PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.070 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:01] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.063 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:02] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1013.eqiad.wmnet, ms-fe1017.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1020.eqiad.wmnet, ms-fe1009.eqiad.wmnet, ms-fe1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:08:02] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.103 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:03] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1013.eqiad.wmnet, ms-fe1011.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1020.eqiad.wmnet, ms-fe1009.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1019.eqiad.wmnet, ms-fe1016.eqiad.wmnet, ms-fe1014.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:08:07] PROBLEM - Swift 
https frontend on ms-fe1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:08:07] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:08:07] PROBLEM - Swift https frontend on ms-fe1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:08:07] PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:08:07] PROBLEM - Swift https frontend on ms-fe1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:08:09] PROBLEM - Swift https backend on ms-fe1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:08:09] PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:08:57] RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:57] RECOVERY - Swift https frontend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.127 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:57] RECOVERY - Swift https frontend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.149 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:59] RECOVERY - Swift https backend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:59] RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/Swift [06:09:05] RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes 
in 6.278 second response time https://wikitech.wikimedia.org/wiki/Swift [06:09:11] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:09:12] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [06:09:13] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [06:09:57] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Swift [06:09:59] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.072 second response time https://wikitech.wikimedia.org/wiki/Swift [06:10:01] PROBLEM - Swift https backend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Swift [06:11:03] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.246 second response time https://wikitech.wikimedia.org/wiki/Swift [06:11:59] PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.093 second response time https://wikitech.wikimedia.org/wiki/Swift [06:11:59] PROBLEM - Swift https backend on ms-fe1012 is 
CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.105 second response time https://wikitech.wikimedia.org/wiki/Swift [06:12:05] RECOVERY - Swift https backend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 6.555 second response time https://wikitech.wikimedia.org/wiki/Swift [06:12:51] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [06:12:59] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.063 second response time https://wikitech.wikimedia.org/wiki/Swift [06:12:59] PROBLEM - Swift https frontend on ms-fe1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.580 second response time https://wikitech.wikimedia.org/wiki/Swift [06:13:01] PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.089 second response time https://wikitech.wikimedia.org/wiki/Swift [06:13:05] RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.473 second response time https://wikitech.wikimedia.org/wiki/Swift [06:13:05] RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.914 second response time https://wikitech.wikimedia.org/wiki/Swift [06:13:05] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.131 second response time https://wikitech.wikimedia.org/wiki/Swift [06:13:07] PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [06:14:01] PROBLEM - Swift https backend on ms-fe1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.060 second response time https://wikitech.wikimedia.org/wiki/Swift [06:14:05] RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.509 second response time https://wikitech.wikimedia.org/wiki/Swift [06:14:11] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:14:25] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2019.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2018.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2017.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:14:25] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2018.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:14:35] PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:14:37] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:14:43] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:14:57] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Swift [06:14:59] PROBLEM - Swift https frontend on ms-fe1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift 
[06:14:59] PROBLEM - Swift https backend on ms-fe1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.063 second response time https://wikitech.wikimedia.org/wiki/Swift [06:15:01] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [06:15:01] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.257 second response time https://wikitech.wikimedia.org/wiki/Swift [06:15:11] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:15:35] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.610 second response time https://wikitech.wikimedia.org/wiki/Swift [06:15:35] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:15:35] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:15:37] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:15:59] PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.076 second response time https://wikitech.wikimedia.org/wiki/Swift [06:15:59] PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.080 second response time https://wikitech.wikimedia.org/wiki/Swift [06:15:59] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.231 second response time https://wikitech.wikimedia.org/wiki/Swift [06:16:01] PROBLEM 
- Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift [06:16:01] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [06:16:07] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.886 second response time https://wikitech.wikimedia.org/wiki/Swift [06:16:11] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:16:25] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Swift [06:16:25] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:16:25] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:16:27] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.835 second response time https://wikitech.wikimedia.org/wiki/Swift [06:16:35] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.189 second response time https://wikitech.wikimedia.org/wiki/Swift [06:16:51] FIRING: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - cr1-eqiad:xe-3/2/3 (Core: asw2-b-eqiad:xe-2/0/45 {#3457}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSaturation [06:16:59] PROBLEM - Swift https frontend on 
ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.093 second response time https://wikitech.wikimedia.org/wiki/Swift [06:16:59] PROBLEM - Swift https backend on ms-fe1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.071 second response time https://wikitech.wikimedia.org/wiki/Swift [06:17:03] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.611 second response time https://wikitech.wikimedia.org/wiki/Swift [06:17:07] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.713 second response time https://wikitech.wikimedia.org/wiki/Swift [06:17:27] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.225 second response time https://wikitech.wikimedia.org/wiki/Swift [06:17:29] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.720 second response time https://wikitech.wikimedia.org/wiki/Swift [06:17:35] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 1.715 second response time https://wikitech.wikimedia.org/wiki/Swift [06:17:35] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.800 second response time https://wikitech.wikimedia.org/wiki/Swift [06:17:51] RESOLVED: [5x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [06:17:57] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:17:59] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.061 second response time https://wikitech.wikimedia.org/wiki/Swift [06:17:59] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.518 second response time https://wikitech.wikimedia.org/wiki/Swift [06:18:01] RECOVERY - Swift https frontend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.080 second response time https://wikitech.wikimedia.org/wiki/Swift [06:18:01] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.189 second response time https://wikitech.wikimedia.org/wiki/Swift [06:18:01] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.191 second response time https://wikitech.wikimedia.org/wiki/Swift [06:18:03] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.694 second response time https://wikitech.wikimedia.org/wiki/Swift [06:18:03] RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 3.548 second response time https://wikitech.wikimedia.org/wiki/Swift [06:18:07] RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.571 second response time https://wikitech.wikimedia.org/wiki/Swift [06:18:11] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:18:24] !incidents [06:18:25] 7201 (UNACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [06:18:25] 7202 (UNACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [06:18:25] 7203 (UNACKED) HaproxyUnavailable cache_upload global sre 
(thanos-rule@main) [06:18:25] 7205 (UNACKED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [06:18:26] 7204 (RESOLVED) [2x] TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 gnmi codfw) [06:18:26] 7200 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [06:18:26] 7199 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [06:18:26] 7198 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [06:18:26] 7197 (RESOLVED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [06:18:27] 7196 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad) [06:18:27] 7195 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad) [06:18:42] !ack 7205 [06:18:43] 7205 (ACKED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [06:18:49] !ack 7203 [06:18:49] 7203 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [06:18:51] !ack 7202 [06:18:52] 7202 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [06:18:53] !ack 7201 [06:18:54] 7201 (ACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [06:18:59] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.058 second response time https://wikitech.wikimedia.org/wiki/Swift [06:19:03] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.176 second response time https://wikitech.wikimedia.org/wiki/Swift [06:19:05] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP 
OK: HTTP/1.1 200 OK - 297 bytes in 4.495 second response time https://wikitech.wikimedia.org/wiki/Swift [06:19:05] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.998 second response time https://wikitech.wikimedia.org/wiki/Swift [06:19:11] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 9.864 second response time https://wikitech.wikimedia.org/wiki/Swift [06:19:11] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:19:11] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:19:35] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:19:35] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:20:01] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift [06:20:05] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.132 second response time https://wikitech.wikimedia.org/wiki/Swift [06:20:11] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:20:24] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:20:25] RECOVERY - Swift https backend on 
ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift [06:20:25] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift [06:20:43] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:20:59] PROBLEM - Swift https frontend on ms-fe1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Swift [06:21:01] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [06:21:01] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [06:21:01] PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.181 second response time https://wikitech.wikimedia.org/wiki/Swift [06:21:01] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 1.231 second response time https://wikitech.wikimedia.org/wiki/Swift [06:21:33] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.778 second response time https://wikitech.wikimedia.org/wiki/Swift [06:21:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [06:21:57] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.142 second response time https://wikitech.wikimedia.org/wiki/Swift [06:22:05] RECOVERY - Swift https backend 
on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 5.789 second response time https://wikitech.wikimedia.org/wiki/Swift [06:23:59] PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.158 second response time https://wikitech.wikimedia.org/wiki/Swift [06:24:01] RECOVERY - Swift https frontend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.484 second response time https://wikitech.wikimedia.org/wiki/Swift [06:24:01] RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.523 second response time https://wikitech.wikimedia.org/wiki/Swift [06:24:05] RECOVERY - Swift https backend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 5.819 second response time https://wikitech.wikimedia.org/wiki/Swift [06:24:59] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.068 second response time https://wikitech.wikimedia.org/wiki/Swift [06:25:01] PROBLEM - Swift https backend on ms-fe1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.078 second response time https://wikitech.wikimedia.org/wiki/Swift [06:25:01] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.190 second response time https://wikitech.wikimedia.org/wiki/Swift [06:25:01] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/Swift [06:25:01] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/Swift [06:25:01] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.246 second response time 
https://wikitech.wikimedia.org/wiki/Swift [06:25:05] RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 6.392 second response time https://wikitech.wikimedia.org/wiki/Swift [06:25:11] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:25:11] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:25:24] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:25:25] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2019.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2018.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2017.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:25:25] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2019.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2018.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2017.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:25:27] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.214 second response time https://wikitech.wikimedia.org/wiki/Swift [06:25:27] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 
bytes in 1.303 second response time https://wikitech.wikimedia.org/wiki/Swift [06:25:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:26:01] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.198 second response time https://wikitech.wikimedia.org/wiki/Swift [06:26:03] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.157 second response time https://wikitech.wikimedia.org/wiki/Swift [06:26:11] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 9.374 second response time https://wikitech.wikimedia.org/wiki/Swift [06:26:27] PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.217 second response time https://wikitech.wikimedia.org/wiki/Swift [06:26:29] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.263 second response time https://wikitech.wikimedia.org/wiki/Swift [06:26:59] PROBLEM - Swift https frontend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.050 second response time https://wikitech.wikimedia.org/wiki/Swift [06:26:59] PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.148 second response time https://wikitech.wikimedia.org/wiki/Swift [06:26:59] PROBLEM - Swift https backend on ms-fe1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.075 second response time https://wikitech.wikimedia.org/wiki/Swift [06:27:01] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: 
HTTP/1.1 200 OK - 506 bytes in 0.226 second response time https://wikitech.wikimedia.org/wiki/Swift [06:27:01] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.388 second response time https://wikitech.wikimedia.org/wiki/Swift [06:27:01] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.192 second response time https://wikitech.wikimedia.org/wiki/Swift [06:27:05] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.945 second response time https://wikitech.wikimedia.org/wiki/Swift [06:27:07] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.116 second response time https://wikitech.wikimedia.org/wiki/Swift [06:27:11] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:27:27] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.713 second response time https://wikitech.wikimedia.org/wiki/Swift [06:27:31] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.561 second response time https://wikitech.wikimedia.org/wiki/Swift [06:28:01] PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.139 second response time https://wikitech.wikimedia.org/wiki/Swift [06:28:07] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.291 second response time https://wikitech.wikimedia.org/wiki/Swift [06:28:25] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [06:28:25] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal 
[06:28:25] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:28:29] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.686 second response time https://wikitech.wikimedia.org/wiki/Swift [06:28:33] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift [06:28:43] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:29:01] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift [06:29:01] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 2.501 second response time https://wikitech.wikimedia.org/wiki/Swift [06:29:11] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:29:31] PROBLEM - Docker registry HTTPS interface on registry2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [06:29:37] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.096 second response time https://wikitech.wikimedia.org/wiki/Swift [06:29:59] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.232 second response time https://wikitech.wikimedia.org/wiki/Swift [06:30:01] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.457 second response time https://wikitech.wikimedia.org/wiki/Swift [06:30:03] 
RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.052 second response time https://wikitech.wikimedia.org/wiki/Swift [06:30:24] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:30:24] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:30:25] RECOVERY - Docker registry HTTPS interface on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 5.155 second response time https://wikitech.wikimedia.org/wiki/Docker [06:30:35] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.380 second response time https://wikitech.wikimedia.org/wiki/Swift [06:30:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:31:01] RECOVERY - Swift https frontend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.714 second response time https://wikitech.wikimedia.org/wiki/Swift [06:31:11] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:31:35] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:31:51] FIRING: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - 
https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [06:32:03] RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 3.936 second response time https://wikitech.wikimedia.org/wiki/Swift [06:32:11] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:32:18] !incidents [06:32:18] 7201 (ACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [06:32:18] 7202 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [06:32:18] 7203 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [06:32:19] 7205 (ACKED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [06:32:19] 7206 (UNACKED) [4x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [06:32:19] 7204 (RESOLVED) [2x] TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 gnmi codfw) [06:32:19] 7200 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [06:32:19] 7199 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [06:32:20] 7198 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [06:32:20] 7197 (RESOLVED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [06:32:21] 7196 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad) [06:32:21] 7195 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad) [06:32:27] !ack 7206 [06:32:28] 7206 (ACKED) [4x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [06:32:29] PROBLEM - Swift https frontend on ms-fe2018 is 
CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.289 second response time https://wikitech.wikimedia.org/wiki/Swift [06:32:29] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.569 second response time https://wikitech.wikimedia.org/wiki/Swift [06:33:01] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [06:33:07] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.363 second response time https://wikitech.wikimedia.org/wiki/Swift [06:33:09] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:33:25] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2019.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2017.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:33:29] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.566 second response time https://wikitech.wikimedia.org/wiki/Swift [06:33:43] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:33:59] PROBLEM - Swift https frontend on ms-fe1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.071 second response time https://wikitech.wikimedia.org/wiki/Swift [06:34:01] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 1.149 second response time https://wikitech.wikimedia.org/wiki/Swift [06:34:03] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.362 second response time https://wikitech.wikimedia.org/wiki/Swift 
[06:34:11] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:34:11] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:34:11] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:34:39] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 7.030 second response time https://wikitech.wikimedia.org/wiki/Swift [06:34:59] RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.371 second response time https://wikitech.wikimedia.org/wiki/Swift [06:35:01] PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.078 second response time https://wikitech.wikimedia.org/wiki/Swift [06:35:01] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [06:35:01] RECOVERY - Swift https frontend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.775 second response time https://wikitech.wikimedia.org/wiki/Swift [06:35:05] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.618 second response time https://wikitech.wikimedia.org/wiki/Swift [06:35:11] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:35:24] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:35:37] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:35:41] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 525440072 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [06:35:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:36:01] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.601 second response time https://wikitech.wikimedia.org/wiki/Swift [06:36:01] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.206 second response time https://wikitech.wikimedia.org/wiki/Swift [06:36:03] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.599 second response time https://wikitech.wikimedia.org/wiki/Swift [06:36:05] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.825 second response time https://wikitech.wikimedia.org/wiki/Swift [06:36:09] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.312 second response time https://wikitech.wikimedia.org/wiki/Swift [06:36:25] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2019.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:36:27] PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [06:36:33] PROBLEM - Swift 
https backend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.185 second response time https://wikitech.wikimedia.org/wiki/Swift [06:36:37] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:36:59] RECOVERY - Swift https frontend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.067 second response time https://wikitech.wikimedia.org/wiki/Swift [06:37:11] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:37:11] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:37:41] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [06:37:43] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:37:59] PROBLEM - Swift https frontend on ms-fe1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Swift [06:37:59] PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.096 second response time https://wikitech.wikimedia.org/wiki/Swift [06:38:01] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 1.060 second response time https://wikitech.wikimedia.org/wiki/Swift [06:38:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:38:29] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.789 second response time https://wikitech.wikimedia.org/wiki/Swift [06:38:35] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.960 second response time https://wikitech.wikimedia.org/wiki/Swift [06:38:35] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 8.036 second response time https://wikitech.wikimedia.org/wiki/Swift [06:38:35] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:38:37] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.096 second response time https://wikitech.wikimedia.org/wiki/Swift [06:38:59] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.056 second response time https://wikitech.wikimedia.org/wiki/Swift [06:38:59] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.301 second response time https://wikitech.wikimedia.org/wiki/Swift [06:39:01] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.241 second response time https://wikitech.wikimedia.org/wiki/Swift [06:39:03] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.469 second response time https://wikitech.wikimedia.org/wiki/Swift [06:39:03] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.498 second response time 
https://wikitech.wikimedia.org/wiki/Swift [06:39:11] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:39:11] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:39:25] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [06:39:57] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.142 second response time https://wikitech.wikimedia.org/wiki/Swift [06:40:01] RECOVERY - Swift https backend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 1.762 second response time https://wikitech.wikimedia.org/wiki/Swift [06:40:01] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 1.043 second response time https://wikitech.wikimedia.org/wiki/Swift [06:40:01] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.328 second response time https://wikitech.wikimedia.org/wiki/Swift [06:40:03] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.193 second response time https://wikitech.wikimedia.org/wiki/Swift [06:40:07] RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.444 second response time https://wikitech.wikimedia.org/wiki/Swift [06:40:07] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 9.270 second response time https://wikitech.wikimedia.org/wiki/Swift [06:40:11] PROBLEM - Swift https frontend on ms-fe2013 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:40:24] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:40:25] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:40:25] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift [06:40:25] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:41:01] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [06:41:01] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift [06:41:03] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.567 second response time https://wikitech.wikimedia.org/wiki/Swift [06:41:07] RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 7.007 second response time https://wikitech.wikimedia.org/wiki/Swift [06:41:43] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:42:01] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift [06:42:35] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.780 second response time 
https://wikitech.wikimedia.org/wiki/Swift [06:42:57] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:42:59] PROBLEM - Swift https frontend on ms-fe1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.090 second response time https://wikitech.wikimedia.org/wiki/Swift [06:42:59] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.280 second response time https://wikitech.wikimedia.org/wiki/Swift [06:42:59] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Swift [06:43:01] PROBLEM - Swift https backend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.297 second response time https://wikitech.wikimedia.org/wiki/Swift [06:43:03] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift [06:43:03] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.238 second response time https://wikitech.wikimedia.org/wiki/Swift [06:43:07] PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:43:09] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.957 second response time https://wikitech.wikimedia.org/wiki/Swift [06:43:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:43:25] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:43:25] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:43:35] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:43:37] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:43:37] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:43:43] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:43:57] RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Swift [06:43:57] RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift [06:44:01] RECOVERY - Swift https backend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.877 second response time https://wikitech.wikimedia.org/wiki/Swift [06:44:01] RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.896 second response time https://wikitech.wikimedia.org/wiki/Swift [06:44:01] PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: 
HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Swift [06:44:01] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [06:44:01] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.475 second response time https://wikitech.wikimedia.org/wiki/Swift [06:44:11] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:44:11] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:44:11] FIRING: [3x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:44:33] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.122 second response time https://wikitech.wikimedia.org/wiki/Swift [06:44:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:44:59] RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.933 second response time https://wikitech.wikimedia.org/wiki/Swift [06:45:01] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [06:45:03] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 1.191 second 
response time https://wikitech.wikimedia.org/wiki/Swift [06:45:03] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.317 second response time https://wikitech.wikimedia.org/wiki/Swift [06:45:03] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 3.579 second response time https://wikitech.wikimedia.org/wiki/Swift [06:45:03] RECOVERY - Swift https frontend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.822 second response time https://wikitech.wikimedia.org/wiki/Swift [06:45:03] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 3.293 second response time https://wikitech.wikimedia.org/wiki/Swift [06:45:07] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.961 second response time https://wikitech.wikimedia.org/wiki/Swift [06:45:25] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:45:25] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [06:45:25] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:45:27] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.244 second response time https://wikitech.wikimedia.org/wiki/Swift [06:45:33] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Swift [06:46:01] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [06:46:02] !incidents [06:46:03] 7202 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) 
[06:46:03] 7203 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [06:46:03] 7205 (ACKED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [06:46:03] 7206 (ACKED) [4x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [06:46:03] 7207 (UNACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [06:46:04] 7201 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [06:46:04] 7204 (RESOLVED) [2x] TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 gnmi codfw) [06:46:04] 7200 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [06:46:04] 7199 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [06:46:05] 7198 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [06:46:05] 7197 (RESOLVED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [06:46:06] 7196 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad) [06:46:06] 7195 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad) [06:46:13] !ack 7207 [06:46:13] 7207 (ACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [06:46:51] RESOLVED: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - cr1-eqiad:xe-3/2/3 (Core: asw2-b-eqiad:xe-2/0/45 {#3457}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSaturation [06:47:01] PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.121 second response time https://wikitech.wikimedia.org/wiki/Swift [06:47:59] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.068 second response time https://wikitech.wikimedia.org/wiki/Swift [06:47:59] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.161 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:07] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.586 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:12] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:48:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:48:53] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 7.932 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:57] RECOVERY - Swift https frontend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:57] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:57] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 
200 OK - 295 bytes in 0.263 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:57] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:57] RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.386 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:59] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:59] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:59] RECOVERY - Swift https backend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.317 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:59] RECOVERY - Swift https frontend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 2.087 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:59] RECOVERY - Swift https backend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:59] RECOVERY - Swift https frontend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.320 second response time https://wikitech.wikimedia.org/wiki/Swift [06:49:00] RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Swift [06:49:01] RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.955 second response time https://wikitech.wikimedia.org/wiki/Swift [06:49:01] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:49:01] RECOVERY - PyBal backends health check on 
lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:49:05] RECOVERY - Swift https frontend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.799 second response time https://wikitech.wikimedia.org/wiki/Swift [06:49:05] RECOVERY - Swift https backend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 6.908 second response time https://wikitech.wikimedia.org/wiki/Swift [06:49:07] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 7.496 second response time https://wikitech.wikimedia.org/wiki/Swift [06:49:11] RESOLVED: [3x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:49:51] FIRING: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - cr1-eqiad:xe-3/2/3 (Core: asw2-b-eqiad:xe-2/0/45 {#3457}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSaturation [06:49:59] RECOVERY - Swift https backend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.237 second response time https://wikitech.wikimedia.org/wiki/Swift [06:51:51] FIRING: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [06:52:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) #page - https://w.wiki/Gbyf - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [06:54:12] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [06:54:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [06:54:51] RESOLVED: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - cr1-eqiad:xe-3/2/3 (Core: asw2-b-eqiad:xe-2/0/45 {#3457}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSaturation [06:56:51] RESOLVED: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [06:57:51] RESOLVED: [3x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [06:59:11] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T0700) [07:04:11] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:07:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:07:48] FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:17:43] FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:22:43] FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:27:43] FIRING: [21x] 
CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:27:53] FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:32:43] FIRING: [21x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:37:43] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:42:43] FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:42:53] FIRING: [17x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:45:24] FIRING: PuppetFailure: Puppet 
has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:47:43] FIRING: [11x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:47:48] RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:52:43] RESOLVED: [10x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:00:05] Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:02:57] (CR) TrainBranchBot: [C:+2] "Approved by akosiaris@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1216750 (https://phabricator.wikimedia.org/T280718) (owner: Alexandros Kosiaris) [08:03:46] (Merged) jenkins-bot: Update fc-list to point to fc-list Tool [mediawiki-config] - https://gerrit.wikimedia.org/r/1216750 (https://phabricator.wikimedia.org/T280718) (owner: Alexandros Kosiaris) [08:04:41] !log akosiaris@deploy2002 Started scap sync-world: Backport for [[gerrit:1216750|Update fc-list to point to fc-list Tool (T280718)]] [08:04:45] T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practice - https://phabricator.wikimedia.org/T280718 [08:07:36] !log akosiaris@deploy2002 akosiaris: Backport for [[gerrit:1216750|Update fc-list to point to fc-list Tool (T280718)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:08:24] !log akosiaris@deploy2002 akosiaris: Continuing with sync [08:09:41] PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 77%, RTA = 8160.08 ms [08:10:03] RECOVERY - Host wikikube-worker1275 is UP: PING WARNING - Packet loss = 0%, RTA = 1393.21 ms [08:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:13:03] !log akosiaris@deploy2002 Finished scap sync-world: Backport for [[gerrit:1216750|Update fc-list to point to fc-list Tool (T280718)]] (duration: 08m 22s) [08:13:07] T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practice - https://phabricator.wikimedia.org/T280718 [08:26:37] !log installing jq security updates [08:26:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:43] (PS1) Elukey: scap: add ml-build1001 to the scap targets [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1219114 (https://phabricator.wikimedia.org/T412524) [08:27:49] (CR) Dpogorzelski: [C:+1] scap: add ml-build1001 to the scap targets [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1219114 (https://phabricator.wikimedia.org/T412524) (owner: Elukey) [08:29:05] (CR) Elukey: [V:+2 C:+2] scap: add ml-build1001 to the scap targets [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1219114 (https://phabricator.wikimedia.org/T412524) (owner: Elukey) [08:32:13] (PS1) Muehlenhoff: debdeploy: Remove buster from list of supported releases [puppet] - https://gerrit.wikimedia.org/r/1219115 [08:40:43] !log elukey@deploy2002 Started deploy [docker-pkg/deploy@4533f76]: Deploy docker-pkg [08:41:39] !log elukey@deploy2002 Finished deploy [docker-pkg/deploy@4533f76]: Deploy docker-pkg (duration: 01m 08s) [08:42:51] (PS1) Alexandros Kosiaris: Remove scap_proxy profile [puppet] - https://gerrit.wikimedia.org/r/1219117 (https://phabricator.wikimedia.org/T411508) [08:45:50] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1219117 (https://phabricator.wikimedia.org/T411508) (owner: Alexandros Kosiaris) [08:45:55] (CR) Alexandros Kosiaris: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1219117 (https://phabricator.wikimedia.org/T411508) (owner: Alexandros Kosiaris) [08:48:44] (CR) Elukey: [C:+1] debdeploy: Remove buster from list of supported releases [puppet] - https://gerrit.wikimedia.org/r/1219115 (owner: Muehlenhoff) [08:50:08] (CR) Muehlenhoff: [C:+2] admin_ng: bump kartotherian's cpu quotas to have smoother deployments [deployment-charts] - https://gerrit.wikimedia.org/r/1218737 (owner: Elukey) [08:50:58] (PS1) KartikMistry: Update 
cxserver to 2025-12-15-140202-production [deployment-charts] - https://gerrit.wikimedia.org/r/1219119 [08:54:21] (CR) Ayounsi: [C:+1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/1218315 (https://phabricator.wikimedia.org/T412458) (owner: Elukey) [08:58:04] SRE, Infrastructure-Foundations, netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11467826 (ayounsi) My guess is that SR-Linux < 25 doesn't have stats for mgmt0 (either not implemented yet or a bug), with the upgrade we've started... [09:00:05] dancy and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T0900) [09:00:08] (PS1) Elukey: images: add python3-build-trixie image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1219121 [09:01:14] (PS2) Elukey: images: add python3-build-trixie image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1219121 [09:03:13] (CR) Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/images/ ==" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1219121 (owner: Elukey) [09:04:28] !log jelto@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:05:34] !log jelto@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:06:35] !log jelto@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:07:21] !log jelto@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:09:38] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[09:10:04] (CR) Alexandros Kosiaris: "Adding Blake and Jasmine per comments in https://phabricator.wikimedia.org/T411508 for review (also feel free to deploy)" [puppet] - https://gerrit.wikimedia.org/r/1219117 (https://phabricator.wikimedia.org/T411508) (owner: Alexandros Kosiaris) [09:10:28] (CR) Alexandros Kosiaris: [C:+2] images: add python3-build-trixie image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1219121 (owner: Elukey) [09:12:05] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [09:12:37] (CR) KartikMistry: [C:+2] Update cxserver to 2025-12-15-140202-production [deployment-charts] - https://gerrit.wikimedia.org/r/1219119 (owner: KartikMistry) [09:13:02] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [09:13:39] (PS1) Daniel Kinzler: rest-gateway: log x-wmf- headers [deployment-charts] - https://gerrit.wikimedia.org/r/1219123 [09:13:55] !log installing nginx security updates [09:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:19] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[09:14:28] (Merged) jenkins-bot: Update cxserver to 2025-12-15-140202-production [deployment-charts] - https://gerrit.wikimedia.org/r/1219119 (owner: KartikMistry) [09:17:35] jouncebot: nowandnext [09:17:35] For the next 1 hour(s) and 42 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T0900) [09:17:35] In 1 hour(s) and 42 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1100) [09:18:01] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [09:18:53] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [09:23:54] (CR) Jelto: [C:+1] "lgtm, I deployed this on all wikikube clusters" [deployment-charts] - https://gerrit.wikimedia.org/r/1218813 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn) [09:26:35] SRE, Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11467881 (MoritzMuehlenhoff) [09:28:50] !log depool and disable puppet on cp7009 for haproxy qos testing (T412785) [09:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:54] T412785: Enable QoS for upload video files - https://phabricator.wikimedia.org/T412785 [09:32:05] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp7009.* [09:32:12] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp7009.* [09:36:02] (PS3) STran: Enable v2 non-emergency workflow by default [mediawiki-config] - https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) [09:37:25] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-staging-ctrl_6443: Servers ml-staging-ctrl2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:38:25] RECOVERY - 
PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:39:00] (PS4) STran: Enable v2 non-emergency workflow by default [mediawiki-config] - https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) [09:40:47] SRE-SLO, Observability-Metrics, Patch-For-Review: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521#11467954 (tappof) [09:46:42] (CR) Elukey: [V:+2] images: add python3-build-trixie image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1219121 (owner: Elukey) [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:54:35] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [09:55:10] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [09:55:41] (PS4) Elukey: DNM - Reimage: dup-uefi after the first puppet run [cookbooks] - https://gerrit.wikimedia.org/r/1218731 [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:37] (PS1) Muehlenhoff: kartotherian: Bump version to include latest libpng [deployment-charts] - https://gerrit.wikimedia.org/r/1219126 [09:59:15] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply 
[09:59:50] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [10:02:38] (CR) Muehlenhoff: [C:+2] debdeploy: Remove buster from list of supported releases [puppet] - https://gerrit.wikimedia.org/r/1219115 (owner: Muehlenhoff) [10:05:54] (CR) Mszwarc: [C:+1] Enable v2 non-emergency workflow by default [mediawiki-config] - https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) (owner: STran) [10:07:18] !log Updated cxserver to 2025-12-15-140202-production [10:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:50] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie [10:09:28] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) (owner: STran) [10:19:58] (PS1) Muehlenhoff: Remove puppetmaster::backend role and related Hiera settings [puppet] - https://gerrit.wikimedia.org/r/1219129 (https://phabricator.wikimedia.org/T365798) [10:22:52] (PS1) Elukey: Rework Makefile.build to ease additional distributions [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1219130 [10:22:52] (PS1) Elukey: Add Trixie artifacts [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1219131 [10:25:18] (CR) Elukey: [C:+1] kartotherian: Bump version to include latest libpng [deployment-charts] - https://gerrit.wikimedia.org/r/1219126 (owner: Muehlenhoff) [10:25:28] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1219129 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [10:25:34] (CR) Elukey: [C:+2] sre.hosts.provision: fix retry logic for the Supermicro BMC password [cookbooks] - 
https://gerrit.wikimedia.org/r/1218315 (https://phabricator.wikimedia.org/T412458) (owner: Elukey) [10:25:39] SRE, Infrastructure-Foundations, netops, Observability-Alerting: Improve port-utilisation alerting to take QoS into account - https://phabricator.wikimedia.org/T384052#11468080 (ayounsi) We can set the rule now as non-paging to start collecting data and test it. So we can gain trust in it before... [10:26:54] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [10:26:57] ops-codfw, ops-eqiad, SRE, DC-Ops, and 2 others: root user not on newest batches of supermicro servers. - https://phabricator.wikimedia.org/T412458#11468090 (elukey) @VRiley-WMF @Jclark-ctr the new code is merged, so you can test it once you have servers ready (I don't want to rush you). Please r... [10:27:17] (PS1) Filippo Giunchedi: metricsinfra: enable space-based retention up to 85% [puppet] - https://gerrit.wikimedia.org/r/1219132 (https://phabricator.wikimedia.org/T412927) [10:27:46] (CR) CI reject: [V:-1] metricsinfra: enable space-based retention up to 85% [puppet] - https://gerrit.wikimedia.org/r/1219132 (https://phabricator.wikimedia.org/T412927) (owner: Filippo Giunchedi) [10:30:23] (CR) Muehlenhoff: [C:+2] kartotherian: Bump version to include latest libpng [deployment-charts] - https://gerrit.wikimedia.org/r/1219126 (owner: Muehlenhoff) [10:30:24] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:30:52] (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7829/co" [puppet] - https://gerrit.wikimedia.org/r/1219132 (https://phabricator.wikimedia.org/T412927) (owner: Filippo 
Giunchedi) [10:33:03] !log jmm@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply [10:33:41] !log jmm@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [10:34:02] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [10:34:29] !log jmm@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: apply [10:35:33] !log jmm@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: apply [10:35:56] (CR) Elukey: [C:+1] Capirca: only show diff when running in "non-commit" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1218209 (https://phabricator.wikimedia.org/T361549) (owner: Ayounsi) [10:36:16] (CR) Elukey: [C:+1] Capirca: look for the most recent completed run [software/homer] - https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: Ayounsi) [10:36:19] !log jmm@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: apply [10:37:44] !log jmm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: apply [10:42:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86716 and previous config saved to /var/cache/conftool/dbconfig/20251217-104240-marostegui.json [10:42:46] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [10:42:47] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [10:44:23] (PS1) Muehlenhoff: Add missing stub secrets [labs/private] - https://gerrit.wikimedia.org/r/1219133 [10:45:17] (CR) Muehlenhoff: [V:+2 C:+2] Add missing stub secrets [labs/private] - https://gerrit.wikimedia.org/r/1219133 (owner: Muehlenhoff) [10:45:53] (PS2) Muehlenhoff: Remove puppetmaster::backend role and related 
Hiera settings [puppet] - https://gerrit.wikimedia.org/r/1219129 (https://phabricator.wikimedia.org/T365798) [10:47:35] (PS1) Filippo Giunchedi: typos: match .wmet [puppet] - https://gerrit.wikimedia.org/r/1219134 [10:50:06] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1219129 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [10:51:38] !log installing libssh security updates [10:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P86717 and previous config saved to /var/cache/conftool/dbconfig/20251217-105748-marostegui.json [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1100) [11:04:05] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1029.eqiad.wmnet with OS trixie [11:09:20] (CR) Majavah: [C:+1] metricsinfra: enable space-based retention up to 85% [puppet] - https://gerrit.wikimedia.org/r/1219132 (https://phabricator.wikimedia.org/T412927) (owner: Filippo Giunchedi) [11:12:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P86718 and previous config saved to /var/cache/conftool/dbconfig/20251217-111257-marostegui.json [11:14:15] (CR) Filippo Giunchedi: [V:+1 C:+2] "CI failure will be fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1219134 (only a typo)" [puppet] - https://gerrit.wikimedia.org/r/1219132 (https://phabricator.wikimedia.org/T412927) (owner: Filippo Giunchedi) [11:14:20] (CR) Filippo Giunchedi: [V:+2 C:+2] metricsinfra: enable space-based retention up to 85% [puppet] - https://gerrit.wikimedia.org/r/1219132 (https://phabricator.wikimedia.org/T412927) 
(owner: Filippo Giunchedi) [11:14:40] (CR) Muehlenhoff: "There's some noise in the PCC, which seems to be around stale PCC data, puppetmaster2002 is already gone e.g." [puppet] - https://gerrit.wikimedia.org/r/1219129 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [11:18:41] (CR) Silvan Heintze: [C:+1] "nice - now the symlinks are working in our local dev environment, too 👍" [dumps] - https://gerrit.wikimedia.org/r/1218317 (https://phabricator.wikimedia.org/T412726) (owner: Jakob) [11:22:38] (CR) FNegri: [C:+1] typos: match .wmet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1219134 (owner: Filippo Giunchedi) [11:23:32] !log dropped "trash" and "percona" databases in x1 [11:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:47] (PS1) Muehlenhoff: spamassassin: Remove OS check [puppet] - https://gerrit.wikimedia.org/r/1219137 [11:23:47] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for spamd [puppet] - https://gerrit.wikimedia.org/r/1219138 (https://phabricator.wikimedia.org/T135991) [11:23:58] (PS2) Filippo Giunchedi: typos: match .wmet [puppet] - https://gerrit.wikimedia.org/r/1219134 [11:24:14] (CR) Filippo Giunchedi: [C:+2] typos: match .wmet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1219134 (owner: Filippo Giunchedi) [11:25:32] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp7009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [11:25:32] PROBLEM - HAProxy HTTPS upload.wikimedia.org ECDSA on cp7009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [11:25:36] PROBLEM - haproxy process on cp7009 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [11:26:38] (CR) Elukey: [C:+1] Remove 
puppetmaster::backend role and related Hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1219129 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [11:26:45] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219137 (owner: 10Muehlenhoff) [11:28:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86719 and previous config saved to /var/cache/conftool/dbconfig/20251217-112805-marostegui.json [11:28:11] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2222.codfw.wmnet with reason: Maintenance [11:28:11] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [11:28:11] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [11:28:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2222 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86720 and previous config saved to /var/cache/conftool/dbconfig/20251217-112818-marostegui.json [11:29:32] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp7009 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2026-02-04 04:29:30 +0000 (expires in 48 days) https://wikitech.wikimedia.org/wiki/HTTPS [11:29:32] RECOVERY - HAProxy HTTPS upload.wikimedia.org ECDSA on cp7009 is OK: SSL OK - Certificate upload.wikimedia.org contains all required SANs:Certificate upload.wikimedia.org (ECDSA) valid until 2026-01-13 14:24:42 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/HTTPS [11:29:36] RECOVERY - haproxy process on cp7009 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [11:30:31] !log marostegui@cumin1003 dbctl commit (dc=all): 
'Repooling after maintenance db2222 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86721 and previous config saved to /var/cache/conftool/dbconfig/20251217-113031-marostegui.json [11:31:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219137 (owner: 10Muehlenhoff) [11:31:58] (03CR) 10Lucas Werkmeister (WMDE): throttle: Allow for overriding temp account creation limits (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008112 (https://phabricator.wikimedia.org/T357777) (owner: 10Kosta Harlan) [11:32:23] (03CR) 10Lucas Werkmeister (WMDE): lift throttle limits for Sing Lit 2025 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: 10Robertsky) [11:34:05] jouncebot: nowandnext [11:34:05] For the next 0 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1100) [11:34:05] In 0 hour(s) and 25 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1200) [11:35:10] 10ops-codfw, 06DC-Ops: Power Supply Redundancy alert on db2247 - https://phabricator.wikimedia.org/T412935 (10FCeratto-WMF) 03NEW [11:40:36] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Improve port-utilisation alerting to take QoS into account - https://phabricator.wikimedia.org/T384052#11468298 (10fgiunchedi) >>! In T384052#11462541, @cmooney wrote: > > https://grafana.wikimedia.org/goto/YOk1qBMDg > > In terms of... 
[11:42:26] !log installing libsndfile security updates [11:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:45:24] FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:45:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P86722 and previous config saved to /var/cache/conftool/dbconfig/20251217-114539-marostegui.json [11:47:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:54:31] (CR) Mvolz: [C:+2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1218294 (owner: PipelineBot) [11:56:18] (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1218294 (owner: PipelineBot) [12:00:05] mvolz: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1200).
[12:00:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P86723 and previous config saved to /var/cache/conftool/dbconfig/20251217-120047-marostegui.json [12:01:30] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:02:04] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:04:22] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [12:04:57] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [12:06:47] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [12:07:16] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [12:07:48] (PS11) Matthieulec: Add new script to export A/A and A/P service types from Cumin hosts. [puppet] - https://gerrit.wikimedia.org/r/1216763 (https://phabricator.wikimedia.org/T327663) [12:08:42] (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1214072 (owner: PipelineBot) [12:08:48] (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1217229 (owner: PipelineBot) [12:08:55] (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1217558 (owner: PipelineBot) [12:09:01] (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1217722 (owner: PipelineBot) [12:09:19] (CR) Clément Goubert: [C:+1] rest gateway: add smoke tests [deployment-charts] - https://gerrit.wikimedia.org/r/1215605 (owner: Daniel Kinzler) [12:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:15:32] !log installing pam security updates [12:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86724 and previous config saved to /var/cache/conftool/dbconfig/20251217-121556-marostegui.json [12:16:02] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [12:16:02] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [12:24:11] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:43:55] (03PS13) 10Daniel Kinzler: rest gateway: add smoke tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215605 [12:43:59] (03CR) 10Clément Goubert: [C:03+1] rest gateway: split anon class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217516 (https://phabricator.wikimedia.org/T410379) (owner: 10Daniel Kinzler) [12:59:13] (03PS8) 10Daniel Kinzler: rest gateway: split anon class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217516 (https://phabricator.wikimedia.org/T410379) [13:08:35] (03CR) 10Jelto: [C:03+1] "lgtm now, should be merged (and tested) in January" [dns] - 10https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [13:09:36] (03CR) 10Clément Goubert: [C:03+1] rest gateway: add smoke tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215605 (owner: 10Daniel Kinzler) [13:13:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255) (owner: 10Isabelle Hurbain-Palatin) [13:15:19] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7830/console" [puppet] - 10https://gerrit.wikimedia.org/r/1219137 (owner: 10Muehlenhoff) [13:15:25] (03PS1) 10Tiziano Fogli: Thanos/Store: add support for multi-instance setup [puppet] - 10https://gerrit.wikimedia.org/r/1219145 (https://phabricator.wikimedia.org/T412924) [13:15:25] (03PS1) 10Tiziano Fogli: Thanos/Store: add a ruler(s) dedicate store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) [13:15:38] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: add smoke tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215605 (owner: 10Daniel Kinzler) [13:15:42] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: split anon class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217516 (https://phabricator.wikimedia.org/T410379) (owner: 10Daniel Kinzler) [13:15:54] (03CR) 10CI reject: [V:04-1] Thanos/Store: add support for multi-instance setup [puppet] - 10https://gerrit.wikimedia.org/r/1219145 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [13:15:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: 10Isabelle Hurbain-Palatin) [13:16:29] (03PS2) 10Tiziano Fogli: Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) [13:17:32] (03Merged) 10jenkins-bot: rest gateway: add smoke tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215605 (owner: 
10Daniel Kinzler) [13:17:37] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11468560 (10ABran-WMF) [13:17:39] (03Merged) 10jenkins-bot: rest gateway: split anon class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217516 (https://phabricator.wikimedia.org/T410379) (owner: 10Daniel Kinzler) [13:21:24] (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm, thanks for the cleanup." [puppet] - 10https://gerrit.wikimedia.org/r/1219137 (owner: 10Muehlenhoff) [13:22:02] (03CR) 10Effie Mouzeli: [C:03+1] "The idea is excellent and aligns well with our future plans to add post-upgrade hooks for running smoke tests (as part of T412941). For no" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215605 (owner: 10Daniel Kinzler) [13:23:30] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11468568 (10ABran-WMF) >>! In T286066#11465434, @Dzahn wrote: > You can remove the "Prepare tcpproxy VMs for accepting traffic on the... 
[13:24:26] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1219138 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:25:50] (03CR) 10Effie Mouzeli: "rephrase: The idea is excellent and aligns well with potential future plans to add post-upgrade hooks for running smoke tests (for example" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215605 (owner: 10Daniel Kinzler) [13:25:54] (03CR) 10Jelto: [V:03+1 C:03+2] spamassassin: Remove OS check [puppet] - 10https://gerrit.wikimedia.org/r/1219137 (owner: 10Muehlenhoff) [13:25:58] (03CR) 10Jelto: [C:03+2] Enable profile::auto_restarts::service for spamd [puppet] - 10https://gerrit.wikimedia.org/r/1219138 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:27:49] (03PS2) 10Robertsky: lift throttle limits for Sing Lit 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) [13:27:53] !log upgtrade Envoy on an-web T410975 [13:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:57] T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975 [13:28:54] (03CR) 10CI reject: [V:04-1] lift throttle limits for Sing Lit 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: 10Robertsky) [13:29:26] (03CR) 10Robertsky: lift throttle limits for Sing Lit 2025 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: 10Robertsky) [13:29:33] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:31:02] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11468599 (10cmooney) >>! 
In T412807#11465779, @elukey wrote: > @cmooney I am +1 on testing something like `d-i netcfg/link_wait_timeout string 10`... [13:31:57] (03PS3) 10Robertsky: lift throttle limits for Sing Lit 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) [13:32:12] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for the clamav [puppet] - 10https://gerrit.wikimedia.org/r/1219147 (https://phabricator.wikimedia.org/T135991) [13:32:41] (03CR) 10CI reject: [V:04-1] lift throttle limits for Sing Lit 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: 10Robertsky) [13:32:42] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:34:35] (03PS1) 10Jelto: lists: remove duplicate spamd auto restart [puppet] - 10https://gerrit.wikimedia.org/r/1219148 [13:35:07] (03PS4) 10Robertsky: lift throttle limits for Sing Lit 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) [13:35:32] (03CR) 10Jelto: "puppet fails with" [puppet] - 10https://gerrit.wikimedia.org/r/1219148 (owner: 10Jelto) [13:35:52] (03PS2) 10Tiziano Fogli: Thanos/Store: add support for multi-instance setup [puppet] - 10https://gerrit.wikimedia.org/r/1219145 (https://phabricator.wikimedia.org/T412924) [13:35:54] (03PS3) 10Tiziano Fogli: Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) [13:36:23] !log installing apache2 security updates [13:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:56] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:37:09] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 
13Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11468622 (10ABran-WMF) [13:37:38] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7831/console" [puppet] - 10https://gerrit.wikimedia.org/r/1219148 (owner: 10Jelto) [13:38:58] (03PS1) 10Majavah: spec: Stop running tests on buster [puppet] - 10https://gerrit.wikimedia.org/r/1219149 [13:39:41] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, sorry for missing that" [puppet] - 10https://gerrit.wikimedia.org/r/1219148 (owner: 10Jelto) [13:39:42] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7832/co" [puppet] - 10https://gerrit.wikimedia.org/r/1218808 (owner: 10Majavah) [13:40:29] (03PS2) 10Muehlenhoff: Enable profile::auto_restarts::service for clamav [puppet] - 10https://gerrit.wikimedia.org/r/1219147 (https://phabricator.wikimedia.org/T135991) [13:40:42] (03CR) 10Jelto: [V:03+1 C:03+2] lists: remove duplicate spamd auto restart [puppet] - 10https://gerrit.wikimedia.org/r/1219148 (owner: 10Jelto) [13:40:56] (03CR) 10Clément Goubert: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1216763 (https://phabricator.wikimedia.org/T327663) (owner: 10Matthieulec) [13:41:18] (03CR) 10CI reject: [V:04-1] spec: Stop running tests on buster [puppet] - 10https://gerrit.wikimedia.org/r/1219149 (owner: 10Majavah) [13:42:26] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11468628 (10ABran-WMF) [13:44:40] !log upgtrade Envoy on grafana* T410975 [13:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:44] T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975 [13:45:54] !log 
daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [13:46:56] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:47:15] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [13:47:40] (03PS2) 10Majavah: spec: Stop running tests on buster [puppet] - 10https://gerrit.wikimedia.org/r/1219149 [13:52:56] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [13:53:19] (03CR) 10Filippo Giunchedi: [C:03+1] "Neat" [puppet] - 10https://gerrit.wikimedia.org/r/1218808 (owner: 10Majavah) [13:53:28] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [13:53:32] (03CR) 10Majavah: [V:03+1 C:03+2] P:mail::smarthost: Remove NRPE monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1218808 (owner: 10Majavah) [13:55:54] (03CR) 10Lucas Werkmeister (WMDE): lift throttle limits for Sing Lit 2025 (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: 10Robertsky) [13:57:31] (03CR) 10Tiziano Fogli: "I tested it on Pontoon. The catalog was applied without errors and gave me the following two processes:" [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [13:58:42] (03PS2) 10Daniel Kinzler: rest-gateway: log x-wmf-* headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219123 [13:58:56] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11468667 (10MoritzMuehlenhoff) >>! In T412807#11468599, @cmooney wrote: > @elukey yeah it probably won't work but it's worth a throw of the dice.... 
[13:59:28] (03PS5) 10Robertsky: lift throttle limits for Sing Lit 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1400). [14:00:05] Robertsky, Tran, and cscott: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:13] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11468678 (10cmooney) >>! In T412807#11468667, @MoritzMuehlenhoff wrote: > We don't configure netcfg/link_wait_timeout ourselves, 10 is the built-i... [14:00:13] o/ [14:00:43] (03CR) 10Robertsky: lift throttle limits for Sing Lit 2025 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: 10Robertsky) [14:01:32] o/ [14:02:17] will need help with deploying. [14:02:20] o/ [14:02:20] I can deploy [14:02:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: 10Robertsky) [14:02:50] thanks! 
[14:03:36] (03Merged) 10jenkins-bot: lift throttle limits for Sing Lit 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: 10Robertsky) [14:04:07] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1218853|lift throttle limits for Sing Lit 2025 (T412820)]] [14:04:11] T412820: Requesting temporary lift of IP cap for editathon on 27 Dec 2025 - https://phabricator.wikimedia.org/T412820 [14:06:20] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, robertsky: Backport for [[gerrit:1218853|lift throttle limits for Sing Lit 2025 (T412820)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:06:50] robertsky: anything to test on mwdebug for this change? [14:06:54] push ahead, changes can't be verified until the day. [14:06:59] yeah, makes sense [14:07:01] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, robertsky: Continuing with sync [14:09:26] !log installing pdns-recursor security updates [14:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:18] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218853|lift throttle limits for Sing Lit 2025 (T412820)]] (duration: 07m 10s) [14:11:22] T412820: Requesting temporary lift of IP cap for editathon on 27 Dec 2025 - https://phabricator.wikimedia.org/T412820 [14:11:39] I don’t see Tran yet [14:11:43] cscott: want to continue with your config change? 
[14:12:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.47% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:12:43] SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11468717 (ABran-WMF) [14:13:14] thanks! signing off. gotta get that dinner. ciao. [14:13:19] see you! [14:13:30] Lucas_WMDE: Tran is on their way [14:13:38] hi Tran :) [14:13:42] 👋 hi hi I'm a little late to the party, so sorry I was distracted by a meeting [14:13:52] no problem, we just finished deploying another change [14:13:55] do you want to deploy yours now? [14:14:04] yes please! Would you like me to or are you already there? 
[14:14:11] either works for me [14:14:21] Lucas_WMDE: i can wait (sorry, i was distracted) [14:14:25] I wouldn't say no if you did it :p [14:14:29] alright, sure ^^ [14:14:36] 🙇 [14:14:49] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) (owner: STran) [14:16:02] (CR) Clément Goubert: [C:+2] rest-gateway: log x-wmf-* headers [deployment-charts] - https://gerrit.wikimedia.org/r/1219123 (owner: Daniel Kinzler) [14:16:13] (Merged) jenkins-bot: Enable v2 non-emergency workflow by default [mediawiki-config] - https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) (owner: STran) [14:16:45] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1207845|Enable v2 non-emergency workflow by default (T410512 T412715)]] [14:16:51] T410512: Add support for maintaining legacy non-emergency flow during transition to v2 - https://phabricator.wikimedia.org/T410512 [14:16:51] T412715: Deploy Incident Reporting System to test2wiki - https://phabricator.wikimedia.org/T412715 [14:17:36] (PS1) Cathal Mooney: Trixie d-i preseed file: increase link_wait_timeout [puppet] - https://gerrit.wikimedia.org/r/1219154 (https://phabricator.wikimedia.org/T412807) [14:18:19] (Merged) jenkins-bot: rest-gateway: log x-wmf-* headers [deployment-charts] - https://gerrit.wikimedia.org/r/1219123 (owner: Daniel Kinzler) [14:18:23] !log installing redis security updates [14:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:55] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1219154 (https://phabricator.wikimedia.org/T412807) (owner: Cathal Mooney) [14:19:01] !log lucaswerkmeister-wmde@deploy2002 stran, lucaswerkmeister-wmde: Backport for [[gerrit:1207845|Enable v2 non-emergency workflow 
by default (T410512 T412715)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:19:17] Tran: can you test the change? [14:19:22] Yes, on it [14:19:31] (CR) Cathal Mooney: [C:+2] Trixie d-i preseed file: increase link_wait_timeout [puppet] - https://gerrit.wikimedia.org/r/1219154 (https://phabricator.wikimedia.org/T412807) (owner: Cathal Mooney) [14:22:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:22:33] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:22:49] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:27:08] Hm...I made the assumption that the train rolled out to group 1 today but it looks like there was a blocker [14:28:28] ah [14:28:40] I think this config can't go through [14:29:00] I don’t see a blocker in https://phabricator.wikimedia.org/T408277 but maybe they’re using the primary time slot this week https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1900 [14:29:09] ok, so abort sync and revert? [14:29:30] Yes I think so, sorry I should have confirmed (and will do so next time before scheduling the config change again) [14:29:37] alright [14:29:42] we can deploy the revert together with cscott’s change then [14:29:44] !log lucaswerkmeister-wmde@deploy2002 Sync cancelled. [14:29:49] works for me. 
[14:30:15] (PS1) Lucas Werkmeister (WMDE): Revert "Enable v2 non-emergency workflow by default" [mediawiki-config] - https://gerrit.wikimedia.org/r/1219158 (https://phabricator.wikimedia.org/T410512) [14:30:26] cscott: do you want to deploy or should I? [14:30:26] (PS1) STran: Revert "Enable v2 non-emergency workflow by default" [mediawiki-config] - https://gerrit.wikimedia.org/r/1219159 [14:30:45] i'm going to let you do it, since it is being combined with the revert [14:30:50] ok [14:30:54] oh you made the revert, thank you 🙇 I'll abandon mine [14:31:02] ah, ok :D [14:31:04] anything that happens after an aborted scap makes me nervous. ;) [14:31:09] (Abandoned) STran: Revert "Enable v2 non-emergency workflow by default" [mediawiki-config] - https://gerrit.wikimedia.org/r/1219159 (owner: STran) [14:31:27] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1219158 (https://phabricator.wikimedia.org/T410512) (owner: Lucas Werkmeister (WMDE)) [14:31:27] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin) [14:31:40] Tran: are there any potential errors we should look out for during the revert deploy? [14:32:18] (Merged) jenkins-bot: Revert "Enable v2 non-emergency workflow by default" [mediawiki-config] - https://gerrit.wikimedia.org/r/1219158 (https://phabricator.wikimedia.org/T410512) (owner: Lucas Werkmeister (WMDE)) [14:32:26] SRE, Infrastructure-Foundations, netops, Observability-Alerting: Improve port-utilisation alerting to take QoS into account - https://phabricator.wikimedia.org/T384052#11468772 (cmooney) >>! In T384052#11468080, @ayounsi wrote: > We can set the rule now as non-paging to start collecting data and... 
[14:32:33] (Merged) jenkins-bot: Activate post-processing cache on some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin) [14:33:04] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1219158|Revert "Enable v2 non-emergency workflow by default" (T410512 T412715)]], [[gerrit:1218806|Activate post-processing cache on some wikis (T348255)]] [14:33:11] T410512: Add support for maintaining legacy non-emergency flow during transition to v2 - https://phabricator.wikimedia.org/T410512 [14:33:11] T412715: Deploy Incident Reporting System to test2wiki - https://phabricator.wikimedia.org/T412715 [14:33:11] T348255: Parser cache infrastructure for OutputTransform - https://phabricator.wikimedia.org/T348255 [14:33:34] (CR) Filippo Giunchedi: Thanos/Store: add a ruler(s)-dedicated store gateway (5 comments) [puppet] - https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) (owner: Tiziano Fogli) [14:34:23] No, the config shouldn't have had any effect as the critical fields it would have enabled access to weren't deployed yet and it was meant to fallback gracefully [14:34:29] ok [14:34:32] thanks [14:35:18] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, ihurbain: Backport for [[gerrit:1219158|Revert "Enable v2 non-emergency workflow by default" (T410512 T412715)]], [[gerrit:1218806|Activate post-processing cache on some wikis (T348255)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:35:46] cscott: can you test your change? 
[14:37:48] yup, testing [14:41:12] !log installing tiff security updates [14:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:36] PROBLEM - haproxy process on cp7009 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [14:42:38] RECOVERY - haproxy process on cp7009 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [14:44:04] Lucas_WMDE: still checking [14:44:12] ack [14:46:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability: Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11468866 (10herron) [14:47:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q2): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11468868 (10herron) [14:47:18] Lucas_WMDE: ok, looks good [14:47:38] 10ops-codfw, 06SRE, 06DC-Ops, 06SRE Observability: Q2:rack/setup/install mwlog2003 - https://phabricator.wikimedia.org/T412229#11468869 (10herron) [14:47:47] ok, thanks! [14:47:50] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, ihurbain: Continuing with sync [14:47:54] 10ops-codfw, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q2): Q2:rack/setup/install mwlog2003 - https://phabricator.wikimedia.org/T412229#11468871 (10herron) [14:48:13] hm, there’s one warning in mwdebug logstash [14:48:17] Pool key 'simplewiki:parsoid-pcache:232335:|#|:idhash:useParsoid=1:revid:10648812' (ArticleView): Usage error: You may only aquire a single non-nowait lock. [14:48:19] is that relevant? 
[14:49:22] i was testing just now on simplewiki, let me see if i can reproduce that [14:49:48] * Lucas_WMDE searches further back in time [14:49:53] ok it’s happened before, at least [14:50:12] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [14:50:33] i'm wondering if it happens on purge, because that's part of what i did on simplewiki to test the new cache mechanism. [14:50:39] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [14:51:03] (03CR) 10Kamila Součková: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1216763 (https://phabricator.wikimedia.org/T327663) (owner: 10Matthieulec) [14:51:32] one of the previous logstash hits was apparently a purge too [14:51:38] https://logstash.wikimedia.org/app/dashboards#/doc/logstash-*/logstash-mediawiki-1-7.0.0-1-2025.12.15?id=EP9bIpsBVE0pYbVvzWpE [14:51:41] judging by its referrer [14:51:49] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1219158|Revert "Enable v2 non-emergency workflow by default" (T410512 T412715)]], [[gerrit:1218806|Activate post-processing cache on some wikis (T348255)]] (duration: 18m 45s) [14:51:50] but the others weren’t [14:51:56] T410512: Add support for maintaining legacy non-emergency flow during transition to v2 - https://phabricator.wikimedia.org/T410512 [14:51:57] T412715: Deploy Incident Reporting System to test2wiki - https://phabricator.wikimedia.org/T412715 [14:51:57] T348255: Parser cache infrastructure for OutputTransform - https://phabricator.wikimedia.org/T348255 [14:52:33] anything before that? the postprocessing cache is also enabled on idwiki, which is where that message came from. 
[14:52:37] (03PS1) 10Clément Goubert: mediawiki::periodic_job: Add mesh_check_skip [puppet] - 10https://gerrit.wikimedia.org/r/1219161 (https://phabricator.wikimedia.org/T412818) [14:53:09] one result on 4 December [14:53:30] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [14:53:38] nothing earlier in the last 90 days, at least in mwdebug logstash [14:53:42] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [14:54:15] i think this is fine to continue to deploy since it didn't result in any user-visible errors, and the 4 dec predates our code [14:54:15] oh. it’s… rather common in non-mwdebug logstash, if you remove the error “error channels” requirement [14:54:23] 580259 hits in the last 24 hours [14:54:32] we didn't deploy to idwiki until 15 dec (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1217768) [14:54:49] the 4 Dec one was officewiki [14:55:40] hm, that is more suspicious: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1215115 was the officewiki deploy and it was 4 dec. [14:55:57] logstash link for the 580k messages: https://logstash.wikimedia.org/goto/28dd0fe40e67c1a37039eeb6f4456f16 [14:56:20] almost all of those are on idwiki (560k) [14:56:28] then 4k on testwiki, 4k on dewiki, 2k on thwiki [14:56:47] * Lucas_WMDE goes back in time and hopes logstash won’t melt [14:57:22] yeah that definitely looks like a very sharp uptick [14:57:28] on 15 December [14:58:09] I think there’s some background noise in the poolcounter channel, but most of the "You may only acquire a single non-nowait lock" messages are likely due to the postprocessing cache [14:58:16] (03CR) 10Dzahn: [C:03+2] "thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1218813 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [14:58:20] (03PS1) 10Clément Goubert: campaignevents: Skip mesh check in aggregateanswers [puppet] - 10https://gerrit.wikimedia.org/r/1219162 (https://phabricator.wikimedia.org/T412818) [14:58:25] should I make a task or are you going to? [14:58:38] (03PS2) 10Clément Goubert: campaignevents: Skip mesh check in aggregateanswers [puppet] - 10https://gerrit.wikimedia.org/r/1219162 (https://phabricator.wikimedia.org/T412544) [14:58:42] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219161 (https://phabricator.wikimedia.org/T412818) (owner: 10Clément Goubert) [14:58:50] plenty of hits before dec 1, but all of those seem to be on Special:Contributions. So that seems like a different bug. [14:59:15] (03CR) 10CI reject: [V:04-1] campaignevents: Skip mesh check in aggregateanswers [puppet] - 10https://gerrit.wikimedia.org/r/1219162 (https://phabricator.wikimedia.org/T412544) (owner: 10Clément Goubert) [14:59:42] !log installing nodejs security updates [14:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:44] Lucas_WMDE: can you make the bug? I think we're still okay with the deploy, it's on small wikis and I believe what's happening is that we're doing a recursive lock acquisition, but the outer lock is sufficient for what we're doing. so it's a usage error but not a practical bug. [14:59:51] ok [15:00:01] (03PS1) 10Bking: stat hosts: remove load average alerts [alerts] - 10https://gerrit.wikimedia.org/r/1219163 (https://phabricator.wikimedia.org/T401589) [15:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1500) [15:00:11] but i'll hold off rolling out the postprocessing cache further until we better understand this & to prevent further logspam.
[15:00:14] (03PS3) 10Clément Goubert: campaignevents: Skip mesh check in aggregateanswers [puppet] - 10https://gerrit.wikimedia.org/r/1219162 (https://phabricator.wikimedia.org/T412818) [15:00:57] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510#11468976 (10ayounsi) [15:01:12] if you're making a bug for the recent messages, i'll make a bug for the pre-dec-3 messages (Special:Contributions) which look unrelated [15:01:19] (03PS1) 10Krinkle: scap: Remove unused php7_admin_port option [puppet] - 10https://gerrit.wikimedia.org/r/1219164 (https://phabricator.wikimedia.org/T224491) [15:01:32] (03PS3) 10Daimona Eaytoy: Stop setting $wgCampaignEventsEnableContributionTracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210716 (https://phabricator.wikimedia.org/T410939) [15:01:41] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [15:01:53] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [15:02:01] (03CR) 10Daimona Eaytoy: "(Memo: waiting for 1.46.0-wmf.7)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210716 (https://phabricator.wikimedia.org/T410939) (owner: 10Daimona Eaytoy) [15:03:08] 10SRE-swift-storage, 10Ceph, 06Release-Engineering-Team, 06serviceops: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11469007 (10MatthewVernon) [15:03:15] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [15:03:42] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [15:04:08] cscott: created T412959 [15:04:09] T412959: Logstash poolcounter warnings "Usage error: You may only aquire a single non-nowait lock" on wikis with post-processing cache enabled - https://phabricator.wikimedia.org/T412959 [15:04:12] PROBLEM - Host wikikube-worker1275 is DOWN: 
PING CRITICAL - Packet loss = 100% [15:04:15] !log UTC afternoon backport+config window done [15:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:40] RECOVERY - Host wikikube-worker1275 is UP: PING WARNING - Packet loss = 0%, RTA = 546.57 ms [15:05:02] Lucas_WMDE: ok, and I created T412960 for the pre-dec 4 instances. [15:05:03] T412960: Pool key 'dewiki:SpecialContributions:a:127.0.0.1' (SpecialContributions): Usage error: You may only aquire a single non-nowait lock. - https://phabricator.wikimedia.org/T412960 [15:05:10] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [15:05:11] thanks! [15:05:11] Lucas_WMDE: thanks! [15:05:37] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [15:06:53] !log add AAAA record to restbase1031.eqiad.wmnet - T271140 [15:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:58] T271140: Some Data Persistence clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271140 [15:07:29] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [15:07:37] (03PS2) 10Clément Goubert: mediawiki::periodic_job: Add mesh_check_skip [puppet] - 10https://gerrit.wikimedia.org/r/1219161 (https://phabricator.wikimedia.org/T412818) [15:07:37] (03PS4) 10Clément Goubert: campaignevents: Skip mesh check in aggregateanswers [puppet] - 10https://gerrit.wikimedia.org/r/1219162 (https://phabricator.wikimedia.org/T412818) [15:08:29] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219161 (https://phabricator.wikimedia.org/T412818) (owner: 10Clément Goubert) [15:09:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:15] 
FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:11:38] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [15:11:42] !log cmooney@cumin1003 START - Cookbook sre.hosts.move-vlan for host es2028 [15:11:42] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host es2028 [15:11:52] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11469092 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host es2028.codfw.wmnet with OS trixie [15:11:58] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add AAAA to restbase1031 - ayounsi@cumin1003" [15:12:03] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add AAAA to restbase1031 - ayounsi@cumin1003" [15:12:03] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:12:29] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache restbase1031.eqiad.wmnet on all recursors [15:12:33] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase1031.eqiad.wmnet on all recursors [15:13:20] (03CR) 10Dr0ptp4kt: [C:03+2] stat hosts: remove load average alerts [alerts] - 10https://gerrit.wikimedia.org/r/1219163 (https://phabricator.wikimedia.org/T401589) (owner: 10Bking) [15:14:26] (03CR) 10Clément Goubert: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219162 (https://phabricator.wikimedia.org/T412818) (owner: 10Clément Goubert) [15:14:34] (03Merged) 10jenkins-bot: stat hosts: remove load average alerts [alerts] - 10https://gerrit.wikimedia.org/r/1219163 (https://phabricator.wikimedia.org/T401589) (owner: 10Bking) [15:17:51] (03CR) 10Jelto: [C:03+1] "lgtm, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1219147 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:27:47] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11469199 (10cmooney) Hmmm so this didn't work, but also I see in the log file it still only waited 3 seconds (and indeed that is shorter than the... [15:28:35] !log upgrade Envoy on etherpad* T410975 [15:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:39] T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975 [15:29:08] (03PS1) 10Scott French: php8.3: rebuild to pick up new PHP packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1219169 [15:29:22] (03PS1) 10FNegri: P:openstack::base::opentofu: specify git branch [puppet] - 10https://gerrit.wikimedia.org/r/1219170 (https://phabricator.wikimedia.org/T373815) [15:29:37] 10SRE-swift-storage, 10Ceph, 06serviceops, 06Release-Engineering-Team (Radar): Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11469210 (10thcipriani) [15:29:51] (03CR) 10CI reject: [V:04-1] P:openstack::base::opentofu: specify git branch [puppet] - 10https://gerrit.wikimedia.org/r/1219170 (https://phabricator.wikimedia.org/T373815) (owner: 10FNegri) [15:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1500) [15:30:05] Deploy window Test 
Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1530) [15:30:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:32:19] (03PS1) 10Muehlenhoff: Record LDAP access for bmartinez [puppet] - 10https://gerrit.wikimedia.org/r/1219172 [15:34:11] FIRING: [3x] JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:23] (03CR) 10Kamila Součková: [C:03+2] Add new script to export A/A and A/P service types from Cumin hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/1216763 (https://phabricator.wikimedia.org/T327663) (owner: 10Matthieulec) [15:37:07] (03PS2) 10Muehlenhoff: Record LDAP access for bmartinez [puppet] - 10https://gerrit.wikimedia.org/r/1219172 [15:40:13] (03PS2) 10FNegri: P:openstack::base::opentofu: specify git branch [puppet] - 10https://gerrit.wikimedia.org/r/1219170 (https://phabricator.wikimedia.org/T373815) [15:42:10] (03PS1) 10Muehlenhoff: Pass link_wait_timeout tab-separated [puppet] - 10https://gerrit.wikimedia.org/r/1219174 [15:42:16] (03CR) 10Dzahn: [C:03+2] Enable profile::auto_restarts::service for clamav [puppet] - 10https://gerrit.wikimedia.org/r/1219147 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:45:24] FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:45:48] (03CR) 10Cathal Mooney: [C:03+1] Pass link_wait_timeout tab-separated [puppet] - 10https://gerrit.wikimedia.org/r/1219174 (owner: 10Muehlenhoff) [15:45:49] !log eevans@cumin1003 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=restbase,service=restbase-* [15:46:35] (03CR) 10Cathal Mooney: [C:03+2] Pass link_wait_timeout tab-separated [puppet] - 10https://gerrit.wikimedia.org/r/1219174 (owner: 10Muehlenhoff) [15:47:14] (03PS2) 10Muehlenhoff: Pass link_wait_timeout tab-separated [puppet] - 10https://gerrit.wikimedia.org/r/1219174 [15:47:29] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7833/co" [puppet] - 10https://gerrit.wikimedia.org/r/1219170 (https://phabricator.wikimedia.org/T373815) (owner: 10FNegri) [15:47:53] (03CR) 10Majavah: [V:03+1 C:03+1] P:openstack::base::opentofu: specify git branch [puppet] - 10https://gerrit.wikimedia.org/r/1219170 
(https://phabricator.wikimedia.org/T373815) (owner: 10FNegri) [15:49:45] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11469293 (10MoritzMuehlenhoff) [15:49:53] !log eevans@cumin1003 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=restbase,service=restbase-backend [15:50:25] !log eevans@cumin1003 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=restbase,service=restbase-https [15:50:52] !log eevans@cumin1003 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=restbase,service=restbase-ssl [15:51:47] (03CR) 10FNegri: [C:03+2] P:openstack::base::opentofu: specify git branch [puppet] - 10https://gerrit.wikimedia.org/r/1219170 (https://phabricator.wikimedia.org/T373815) (owner: 10FNegri) [15:53:09] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for bmartinez [puppet] - 10https://gerrit.wikimedia.org/r/1219172 (owner: 10Muehlenhoff) [15:53:46] (03CR) 10Ahmon Dancy: [C:03+1] scap: Remove unused php7_admin_port option [puppet] - 10https://gerrit.wikimedia.org/r/1219164 (https://phabricator.wikimedia.org/T224491) (owner: 10Krinkle) [15:54:04] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Propose a new set of standard thumbnail sizes - https://phabricator.wikimedia.org/T412971 (10MatthewVernon) 03NEW [15:54:11] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:03:19] 10ops-codfw, 06SRE, 07sre-alert-triage, 06DC-Ops, 06Infrastructure-Foundations: Alert in need of triage: SmartNotHealthy (instance sretest2006:9100) - https://phabricator.wikimedia.org/T412078#11469413 (10Jhancock.wm) if you zoom out to half a year, this alert has been active since the end of July. Could... 
[16:04:12] (03CR) 10Ahmon Dancy: "This change has broken puppet on deployment-mx03.deployment-prep. I filed https://phabricator.wikimedia.org/T412975" [puppet] - 10https://gerrit.wikimedia.org/r/1219137 (owner: 10Muehlenhoff) [16:09:12] (03CR) 10Clément Goubert: [C:03+2] team-sre/mw-cron: Improve dashboard and description [alerts] - 10https://gerrit.wikimedia.org/r/1218756 (https://phabricator.wikimedia.org/T412799) (owner: 10Clément Goubert) [16:10:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:10:47] (03Merged) 10jenkins-bot: team-sre/mw-cron: Improve dashboard and description [alerts] - 10https://gerrit.wikimedia.org/r/1218756 (https://phabricator.wikimedia.org/T412799) (owner: 10Clément Goubert) [16:12:01] (03CR) 10Dzahn: [C:03+2] "tested starting the new service on vrts2002 - looks fine" [puppet] - 10https://gerrit.wikimedia.org/r/1219147 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:15:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:16:50] 10ops-eqiad, 
06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11469513 (10Clement_Goubert) [16:16:58] (03CR) 10Scott French: [C:03+1] mediawiki::periodic_job: Add mesh_check_skip [puppet] - 10https://gerrit.wikimedia.org/r/1219161 (https://phabricator.wikimedia.org/T412818) (owner: 10Clément Goubert) [16:18:07] (03CR) 10Scott French: [C:03+1] campaignevents: Skip mesh check in aggregateanswers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1219162 (https://phabricator.wikimedia.org/T412818) (owner: 10Clément Goubert) [16:18:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11469518 (10Clement_Goubert) Updated racking plan to: - Row A: 0 - **Row B: 2** - **Row C: 3** - **Row D: 6** - **Row E: 1** - **Row F: 1** This would still leave us with A... [16:24:49] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply Redundancy alert on db2247 - https://phabricator.wikimedia.org/T412935#11469569 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated cable. alert has cleared. 
[16:25:24] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:27:32] (03PS1) 10Dzahn: mx/spamassassin: allow overriding sa daemon package name in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975) [16:29:36] (03CR) 10Ahmon Dancy: [C:03+1] mx/spamassassin: allow overriding sa daemon package name in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975) (owner: 10Dzahn) [16:30:32] PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp7009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [16:30:32] PROBLEM - HAProxy HTTPS upload.wikimedia.org ECDSA on cp7009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [16:30:38] PROBLEM - haproxy process on cp7009 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [16:31:28] ^^ I think this may be due to fabfur testing [16:31:33] !log eevans@cumin1003 START - Cookbook sre.hosts.reboot-single for host restbase1031.eqiad.wmnet [16:34:18] (03CR) 10Dzahn: [V:04-1] "you get the idea. 
would have to fix this though: https://puppet-compiler.wmflabs.org/output/1219180/7835/lists1004.wikimedia.org/change.li" [puppet] - 10https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975) (owner: 10Dzahn) [16:34:34] RECOVERY - HAProxy HTTPS upload.wikimedia.org ECDSA on cp7009 is OK: SSL OK - Certificate upload.wikimedia.org contains all required SANs:Certificate upload.wikimedia.org (ECDSA) valid until 2026-01-13 14:24:42 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:34:34] RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp7009 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2026-02-04 04:29:30 +0000 (expires in 48 days) https://wikitech.wikimedia.org/wiki/HTTPS [16:34:36] RECOVERY - haproxy process on cp7009 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [16:35:24] FIRING: [4x] ProbeDown: Service restbase1031-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:35:59] (03PS2) 10Dzahn: mx/spamassassin: allow overriding sa daemon package name in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975) [16:38:08] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1219180/7836/" [puppet] - 10https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975) (owner: 10Dzahn) [16:38:11] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1031.eqiad.wmnet [16:39:11] FIRING: [6x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:40:24] RESOLVED: [6x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:44:11] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:53:21] (03PS1) 10Fabfur: P:cache::haproxy: TOS for video files [puppet] - 10https://gerrit.wikimedia.org/r/1219182 (https://phabricator.wikimedia.org/T412785) [16:54:50] 10ops-codfw, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T412983 (10phaultfinder) 03NEW [16:56:32] (03CR) 10RLazarus: [C:03+1] php8.3: rebuild to pick up new PHP packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1219169 (owner: 10Scott French) [17:00:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:01:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:01:39] (03CR) 10Cathal Mooney: [C:03+1] P:cache::haproxy: TOS for video files [puppet] - 10https://gerrit.wikimedia.org/r/1219182 
(https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur) [17:01:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:02:04] that is all we need :D [17:02:17] :| [17:03:59] !log enabling puppet and repooling cp7009 (T412785) [17:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:04] T412785: Enable QoS for upload video files - https://phabricator.wikimedia.org/T412785 [17:04:39] (03CR) 10Fabfur: [C:03+2] P:cache::haproxy: TOS for video files [puppet] - 10https://gerrit.wikimedia.org/r/1219182 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur) [17:05:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:06:07] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11469757 (10Papaul) Ticket 05304338 has been submitted with Nokia [17:06:22] jouncebot: nowandnext [17:06:22] No deployments scheduled for the next 0 hour(s) and 53 minute(s) [17:06:22] In 0 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1800) [17:07:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11469771 (10cmooney) Hi @VRiley-WMF just to be aware please try to spread these as much as is practical evenly across the racks in each row. The "row-wide" view is sort of... 
[17:08:37] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp7009.* [17:09:28] (03PS1) 10Fabfur: hiera: enable video tos on cp7009 [puppet] - 10https://gerrit.wikimedia.org/r/1219185 (https://phabricator.wikimedia.org/T412785) [17:09:33] FYI, during the upcoming infra window, I'll be releasing some changes that will incur a full mediawiki image rebuild and deployment. depending on how quiet things are by ~ 17:20 UTC, I might get that (long) process started on the early side. [17:10:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:11:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:11:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:11:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:12:41] !log reprepro include php8.3_8.3.28-1+wmf11u2 in component/php83 [17:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:10] 
(03CR) 10JavierMonton: [C:03+1] Add javiermonton to kafka-jumbo-access group [puppet] - 10https://gerrit.wikimedia.org/r/1218337 (https://phabricator.wikimedia.org/T411774) (owner: 10Muehlenhoff) [17:16:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:16:19] (03CR) 10Scott French: [V:03+2] "`" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1219169 (owner: 10Scott French) [17:16:38] (03CR) 10Scott French: [V:03+2] "Thanks for the review!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1219169 (owner: 10Scott French) [17:16:49] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219185 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur) [17:17:18] (03CR) 10Scott French: [V:03+2 C:03+2] php8.3: rebuild to pick up new PHP packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1219169 (owner: 10Scott French) [17:18:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:20:05] (03PS1) 10Fabfur: hiera: enable video tos on upload@magru [puppet] - 10https://gerrit.wikimedia.org/r/1219186 (https://phabricator.wikimedia.org/T412785) [17:22:21] (03PS1) 10Fabfur: hiera: enable video tos on cache upload [puppet] - 10https://gerrit.wikimedia.org/r/1219187 
(https://phabricator.wikimedia.org/T412785) [17:23:39] (03PS1) 10Daniel Kinzler: smokepy: send http requests in parallel [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219188 [17:24:15] as noted previously, I am going to get this build / deploy process started shortly [17:24:22] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2028.codfw.wmnet with OS trixie [17:24:30] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11469876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host es2028.codfw.wmnet with OS trixie execu... [17:27:00] !log swfrench@deploy2002 Started scap sync-world: Rebuild deployment to pick up new production image [17:28:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:28:18] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [17:28:22] !log cmooney@cumin1003 START - Cookbook sre.hosts.move-vlan for host es2028 [17:28:22] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host es2028 [17:28:29] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11469907 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host es2028.codfw.wmnet with OS trixie [17:32:07] (03PS1) 10Krinkle: scap: Remove unused mwmaint config, obsolete wikitech/php7 comments [puppet] - 
10https://gerrit.wikimedia.org/r/1219189 (https://phabricator.wikimedia.org/T397017) [17:33:17] FIRING: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:33:30] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lswtest-d8-eqiad,lswtest-d8-eqiad IPv6 with reason: upgradiing sr-linux on lswtest-d8-eqiad [17:33:38] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11469916 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4ac5ae06-34f5-425c-b0df-bc77a3758cd3) set by cmooney@cumin1003 for 2:00:0... [17:34:00] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1006.eqiad.wmnet with reason: upgrading connected switch [17:36:32] (03PS1) 10Krinkle: scap: Add php_l10n build in Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/1219190 (https://phabricator.wikimedia.org/T99740) [17:39:36] (03CR) 10Ahmon Dancy: [C:03+1] scap: Remove unused mwmaint config, obsolete wikitech/php7 comments [puppet] - 10https://gerrit.wikimedia.org/r/1219189 (https://phabricator.wikimedia.org/T397017) (owner: 10Krinkle) [17:44:41] (03CR) 10Dzahn: [C:03+1] scap: Remove unused mwmaint config, obsolete wikitech/php7 comments [puppet] - 10https://gerrit.wikimedia.org/r/1219189 (https://phabricator.wikimedia.org/T397017) (owner: 10Krinkle) [17:45:52] (03CR) 10Cathal Mooney: [C:03+1] Pass link_wait_timeout tab-separated [puppet] - 10https://gerrit.wikimedia.org/r/1219174 (owner: 10Muehlenhoff) [17:45:56] (03CR) 10Cathal Mooney: [C:03+2] Pass link_wait_timeout tab-separated [puppet] - 
10https://gerrit.wikimedia.org/r/1219174 (owner: 10Muehlenhoff) [17:46:42] !log cmooney@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es2028.codfw.wmnet with OS trixie [17:46:50] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11469985 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host es2028.codfw.wmnet with OS trixie execu... [17:48:59] (03PS1) 10Cathal Mooney: lswtest-d8-eqiad: define srlinux_version var as v25.10.1 [homer/public] - 10https://gerrit.wikimedia.org/r/1219192 (https://phabricator.wikimedia.org/T412733) [17:50:59] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ssw1-d[1,8]-eqiad.mgmt with reason: upgradiing sr-linux on lswtest-d8-eqiad [17:51:15] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11469994 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ec73e489-e95a-4824-ad67-a99943eae0e7) set by cmoone... [17:51:29] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ssw1-d[1,8]-eqiad with reason: upgradiing sr-linux on lswtest-d8-eqiad [17:51:43] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11470001 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=98bc0d0a-c3e1-4862-b66a-e386322de608) set by cmoone... 
[17:51:46] !log upgrading OS on lswtest-d8-eqiad T412733 [17:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:49] T412733: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733 [17:53:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:54:24] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [17:54:55] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11470010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host es2028.codfw.wmnet with OS trixie [17:55:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [17:55:38] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11470015 (10Marostegui) [17:58:19] (03CR) 10Papaul: [C:03+1] lswtest-d8-eqiad: define srlinux_version var as v25.10.1 [homer/public] - 10https://gerrit.wikimedia.org/r/1219192 (https://phabricator.wikimedia.org/T412733) (owner: 10Cathal Mooney) [18:00:04] swfrench-wmf: #bothumor I � Unicode. All rise for MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1800). 
[18:01:30] (03CR) 10Cathal Mooney: [C:03+2] lswtest-d8-eqiad: define srlinux_version var as v25.10.1 [homer/public] - 10https://gerrit.wikimedia.org/r/1219192 (https://phabricator.wikimedia.org/T412733) (owner: 10Cathal Mooney) [18:02:50] (03Merged) 10jenkins-bot: lswtest-d8-eqiad: define srlinux_version var as v25.10.1 [homer/public] - 10https://gerrit.wikimedia.org/r/1219192 (https://phabricator.wikimedia.org/T412733) (owner: 10Cathal Mooney) [18:05:42] (03CR) 10RLazarus: [C:03+1] mediawiki::periodic_job: Add mesh_check_skip [puppet] - 10https://gerrit.wikimedia.org/r/1219161 (https://phabricator.wikimedia.org/T412818) (owner: 10Clément Goubert) [18:12:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11470078 (10VRiley-WMF) [18:13:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11470083 (10VRiley-WMF) [18:14:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.36% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:15:18] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11470088 (10cmooney) >>! In T412733#11467826, @ayounsi wrote: > My guess is that SR-Linux < 25 doesn't have stats for mgmt0 (eit... [18:23:55] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11470106 (10cmooney) @papaul lswtest-d8-eqiad is upgraded to v25.10.1 now for you. 
{F71107154 width=500} [18:24:15] mediawiki rebuild / deployment still chugging along [18:32:39] !log cmooney@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es2028.codfw.wmnet with OS trixie [18:32:50] SRE, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11470138 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host es2028.codfw.wmnet with OS trixie execu... [18:34:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:38:21] ops-eqiad, SRE, DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11470156 (Jclark-ctr) @RKemper I replaced the battery and that error has cleared. It still shows an error for Drive Slot 1. I’ve opened an RMA for the drive since it was pur...
[18:39:47] RESOLVED: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:42:48] !log swfrench@deploy2002 Finished scap sync-world: Rebuild deployment to pick up new production image (duration: 78m 01s) [18:43:30] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Yubikey-SSH-FIDO for cdobbins - https://phabricator.wikimedia.org/T412755#11470166 (10CDobbins) Sorry, I was (blindly) following the instructions on wikitech and didn't stop to think. I'll take care of this myself! [18:43:39] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11470167 (10cmooney) Unfortunately it wasn't just a quirk to do with the tabs v. spaces in the preseed file. I tried again and the same happens,... [18:46:57] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [18:47:15] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [18:48:10] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [18:48:24] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [18:48:45] alright, I'm done with mediawiki deployments for this window. as expected this took quite a while :) [18:54:47] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11470232 (10Jhancock.wm) if we use 1G copper, we don't need to order anything. I can probably get it pre-ran tomorrow. Then papaul or I can conne... 
[18:55:16] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1236.eqiad.wmnet with reason: Maintenance [19:00:05] dancy and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1900) [19:00:13] o/ [19:00:38] (PS1) TrainBranchBot: group1 to 1.46.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1219202 (https://phabricator.wikimedia.org/T408277) [19:00:41] (CR) TrainBranchBot: [C:+2] "Initiated by dancy@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1219202 (https://phabricator.wikimedia.org/T408277) (owner: TrainBranchBot) [19:01:37] (Merged) jenkins-bot: group1 to 1.46.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1219202 (https://phabricator.wikimedia.org/T408277) (owner: TrainBranchBot) [19:09:08] SRE, SRE-Access-Requests, Patch-For-Review: Yubikey-SSH-FIDO for cdobbins - https://phabricator.wikimedia.org/T412755#11470297 (Marostegui) Thank you - if you need help with the verification out band, let me know!
[19:11:40] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.7 refs T408277 [19:11:44] T408277: 1.46.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T408277 [19:13:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 15.16% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:23:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:26:34] FIRING: DiskSpace: Disk space serpens:9100:/ 3.327% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [19:27:00] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11470334 (10Papaul) We are seeing the same error on lswtest-d8 in eqiad ` in-error-packets 2466 ` [19:33:39] FIRING: [2x] CoreBGPDown: Core BGP session down between lswtest-d8-eqiad and ssw1-d1-eqiad (10.64.128.17) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:34:34] PROBLEM - Host lswtest-d8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [19:34:57] (03Abandoned) 10CDobbins: icinga: add cdobbins [puppet] - 
10https://gerrit.wikimedia.org/r/1014589 (owner: 10CDobbins) [19:36:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - lswtest-d8-eqiad:ethernet-1/56 (Core: ssw1-d1-eqiad:ethernet-1/17 {#temp1848392398}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=lswtest-d8-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:36:54] (03CR) 10CDobbins: [V:03+2] admin: add fido-based ssh access for cdobbins [puppet] - 10https://gerrit.wikimedia.org/r/1218360 (https://phabricator.wikimedia.org/T412755) (owner: 10CDobbins) [19:36:58] 10ops-eqiad, 06DC-Ops: Inbound errors on interface lswtest-d8-eqiad:mgmt0 () - https://phabricator.wikimedia.org/T413004 (10phaultfinder) 03NEW [19:37:00] PROBLEM - Host lswtest-d8-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [19:37:14] (03CR) 10CDobbins: [V:03+2 C:03+2] admin: add fido-based ssh access for cdobbins [puppet] - 10https://gerrit.wikimedia.org/r/1218360 (https://phabricator.wikimedia.org/T412755) (owner: 10CDobbins) [19:37:49] 10ops-drmrs: Alert for device asw1-b12-drmrs.mgmt.drmrs.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T413005 (10phaultfinder) 03NEW [19:42:04] (03PS1) 10Eric Gardner: Delay StickyHeaders section click instrumentation for slow loads [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1219209 (https://phabricator.wikimedia.org/T412857) [19:42:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1219209 (https://phabricator.wikimedia.org/T412857) (owner: 10Eric Gardner) [19:43:03] jouncebot nowandnext [19:43:03] For the next 1 hour(s) and 16 
minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1900) [19:43:03] In 1 hour(s) and 16 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T2100) [19:45:09] The train looks good so I'm okay with folks using the rest of the window for backports. [19:45:24] FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:46:07] (^ EricGardner) [19:47:22] 10SRE-Access-Requests: Add yubikey SSH key for 'denisse' - https://phabricator.wikimedia.org/T413006 (10andrea.denisse) 03NEW [19:48:29] 10SRE-Access-Requests: Add yubikey SSH key for 'denisse' - https://phabricator.wikimedia.org/T413006#11470411 (10andrea.denisse) 05Open→03In progress [19:51:51] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - lswtest-d8-eqiad:ethernet-1/56 (Core: ssw1-d1-eqiad:ethernet-1/17 {#temp1848392398}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:53:39] FIRING: [4x] CoreBGPDown: Core BGP session down between lswtest-d8-eqiad and ssw1-d1-eqiad (10.64.128.17) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:55:59] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06Release-Engineering-Team, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008 (10CDanis) 03NEW [19:56:19] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 06Release-Engineering-Team, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11470444 (10CDanis) 
This is at least High and possibly UBN! [20:04:33] (03Abandoned) 10Daniel Kinzler: api-gateway chart: add values-rest-staging.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211656 (owner: 10Daniel Kinzler) [20:05:05] (03Abandoned) 10Daniel Kinzler: rest-gateway: add prefix to all user IDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212239 (owner: 10Daniel Kinzler) [20:06:14] (03Abandoned) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) (owner: 10Daniel Kinzler) [20:08:03] 10SRE-Access-Requests: Add FIDO-backed SSH key for aklapper - https://phabricator.wikimedia.org/T413009 (10Aklapper) 03NEW [20:08:17] (03PS1) 10Aklapper: admin: add fido backed ssh key for aklapper [puppet] - 10https://gerrit.wikimedia.org/r/1219213 (https://phabricator.wikimedia.org/T413009) [20:10:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:15:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:17:52] (03PS1) 10Andrea Denisse: admin: Add 
yubikey SSH key for denisse. [puppet] - https://gerrit.wikimedia.org/r/1219211 (https://phabricator.wikimedia.org/T413006) [20:19:11] Hi, can this patch be merged?? CDobbins: admin: add fido-based ssh access for cdobbins (476b0919fe) [20:19:40] ChrisDobbins901_ ^ [20:20:18] yes. I thought I merged it 😳 [20:20:47] It was merged on Gerrit but not on the Puppet host, no worries, I'll merge it. :) [20:21:01] thank you 🤦🏽 [20:23:20] ops-codfw, SRE, DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T412983#11470512 (Jhancock.wm) Open→Resolved a:Jhancock.wm removing Phase, Active Power values until T401937 is resolved. [20:28:35] (PS1) C. Scott Ananian: ParserOutputAccess: don't use PoolCounter recursively [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1219216 (https://phabricator.wikimedia.org/T412959) [20:28:45] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1219216 (https://phabricator.wikimedia.org/T412959) (owner: C. Scott Ananian) [20:29:10] (PS1) C. Scott Ananian: ParserOutputAccess: don't use PoolCounter recursively [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219217 (https://phabricator.wikimedia.org/T412959) [20:30:03] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219217 (https://phabricator.wikimedia.org/T412959) (owner: C.
Scott Ananian) [20:30:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T410589)', diff saved to https://phabricator.wikimedia.org/P86725 and previous config saved to /var/cache/conftool/dbconfig/20251217-203012-ladsgroup.json [20:30:17] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [20:30:25] (PS1) Scott French: shellbox: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1219215 [20:32:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:34:17] ops-codfw, ops-eqiad, SRE, DC-Ops: Offline Script not completing - https://phabricator.wikimedia.org/T411551#11470555 (Jhancock.wm) i had a decomm ticket that passed without issues. T412783 [20:35:37] (CR) RLazarus: [C:+1] shellbox: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1219215 (owner: Scott French) [20:39:54] (CR) Scott French: "Thanks, Reuven!"
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1219215 (owner: 10Scott French) [20:39:58] (03CR) 10Scott French: [C:03+2] shellbox: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219215 (owner: 10Scott French) [20:40:26] (03CR) 10Scott French: [C:03+1] mw-videoscaler: Update to Envoy 1.35.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217609 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus) [20:42:24] (03Merged) 10jenkins-bot: shellbox: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219215 (owner: 10Scott French) [20:44:46] (03CR) 10RLazarus: [C:03+2] mw-videoscaler: Update to Envoy 1.35.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217609 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus) [20:45:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P86726 and previous config saved to /var/cache/conftool/dbconfig/20251217-204520-ladsgroup.json [20:45:24] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:46:30] (03Merged) 10jenkins-bot: mw-videoscaler: Update to Envoy 1.35.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217609 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus) [20:47:52] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [20:48:20] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [20:48:51] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [20:49:07] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [20:49:38] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply 
[20:49:41] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [20:49:52] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [20:49:55] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [20:50:21] (03PS1) 10Neriah: Enable protection indicators for ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219219 [20:50:23] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [20:50:41] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [20:50:59] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply [20:51:09] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply [20:51:12] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [20:51:33] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [20:52:05] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [20:52:30] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [21:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T2100). [21:00:05] cscott, Pppery, and EricGardner: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:22] * swfrench-wmf has some pending shellbox updates, but will hold off until the backport window wraps up [21:00:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P86727 and previous config saved to /var/cache/conftool/dbconfig/20251217-210029-ladsgroup.json [21:00:35] o/ [21:00:44] I'm through with what I was doing, also :) [21:00:59] my backports should go out before the mediawiki-config patch [21:01:02] (for now... *ominous chord* *thunderclap* *maniacal laughter*) [21:01:12] I'm here and can deploy my patches (a simple backport and a config patch) when others are done [21:01:22] i can spiderpig my patches as well. [21:01:37] This is the config patch (it's already merged so I could not add it to the schedule): https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1217799 [21:02:03] Neither of my patches should produce any user-facing changes [21:02:25] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:02:48] EricGardner: if it's merged already, will it go out at the new scap, or only at the next scap of mediawiki-config? [21:03:16] I'm not totally clear on that. The exact timing does not really matter, this is more of a housekeeping change. [21:03:29] We are just removing some reference to a dead project. [21:03:36] i can get started then, and then your config change will probably go out with my config change. [21:04:07] That would be great if you want to include it [21:04:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1219216 (https://phabricator.wikimedia.org/T412959) (owner: 10C. 
Scott Ananian) [21:04:38] (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219217 (https://phabricator.wikimedia.org/T412959) (owner: C. Scott Ananian) [21:08:59] (Merged) jenkins-bot: ParserOutputAccess: don't use PoolCounter recursively [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1219216 (https://phabricator.wikimedia.org/T412959) (owner: C. Scott Ananian) [21:09:03] (Merged) jenkins-bot: ParserOutputAccess: don't use PoolCounter recursively [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219217 (https://phabricator.wikimedia.org/T412959) (owner: C. Scott Ananian) [21:09:38] !log cscott@deploy2002 Started scap sync-world: Backport for [[gerrit:1219216|ParserOutputAccess: don't use PoolCounter recursively (T412959)]], [[gerrit:1219217|ParserOutputAccess: don't use PoolCounter recursively (T412959)]] [21:09:42] T412959: Logstash poolcounter warnings "Usage error: You may only aquire a single non-nowait lock" on wikis with post-processing cache enabled - https://phabricator.wikimedia.org/T412959 [21:11:50] !log cscott@deploy2002 cscott: Backport for [[gerrit:1219216|ParserOutputAccess: don't use PoolCounter recursively (T412959)]], [[gerrit:1219217|ParserOutputAccess: don't use PoolCounter recursively (T412959)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:13:20] ops-codfw, ops-eqiad, SRE, DC-Ops: Offline Script not completing - https://phabricator.wikimedia.org/T411551#11470629 (Papaul) Open→Resolved @Jhancock.wm thank you for the update. WE can resolve this task for now if it does happen again we can reopen. 
[21:14:23] !log cscott@deploy2002 cscott: Continuing with sync [21:15:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T410589)', diff saved to https://phabricator.wikimedia.org/P86728 and previous config saved to /var/cache/conftool/dbconfig/20251217-211537-ladsgroup.json [21:15:42] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [21:15:54] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2239.codfw.wmnet with reason: Maintenance [21:18:28] !log cscott@deploy2002 Finished scap sync-world: Backport for [[gerrit:1219216|ParserOutputAccess: don't use PoolCounter recursively (T412959)]], [[gerrit:1219217|ParserOutputAccess: don't use PoolCounter recursively (T412959)]] (duration: 08m 50s) [21:18:32] T412959: Logstash poolcounter warnings "Usage error: You may only aquire a single non-nowait lock" on wikis with post-processing cache enabled - https://phabricator.wikimedia.org/T412959 [21:19:14] (PS1) Daniel Kinzler: rest-gateway: move values-minikube.minikube to service definition [deployment-charts] - https://gerrit.wikimedia.org/r/1219222 [21:19:21] EricGardner: ok, my mediawiki-core patches are done. the config patch is next. do you want to do your core patch before the config, or does it not matter? [21:20:08] It doesn't matter for my change [21:20:31] ok, i'm going to do the config patches now then. 
[21:20:49] (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin) [21:20:58] (CR) CI reject: [V:-1] Enable post-processing cache for all Parsoid-rendered wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin) [21:21:17] (CR) CI reject: [V:-1] rest-gateway: move values-minikube.minikube to service definition [deployment-charts] - https://gerrit.wikimedia.org/r/1219222 (owner: Daniel Kinzler) [21:21:37] (PS7) C. Scott Ananian: Enable post-processing cache for all Parsoid-rendered wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin) [21:22:21] (CR) TrainBranchBot: "Approved by cscott@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin) [21:23:23] (Merged) jenkins-bot: Enable post-processing cache for all Parsoid-rendered wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin) [21:23:56] !log cscott@deploy2002 Started scap sync-world: Backport for [[gerrit:1218793|Enable post-processing cache for all Parsoid-rendered wikis (T348255)]], [[gerrit:1217799|Decommission Article Summaries (T411558)]] [21:24:01] T348255: Parser cache infrastructure for OutputTransform - https://phabricator.wikimedia.org/T348255 [21:24:02] T411558: ArticleSummaries: Decommission the extension (code changes) - https://phabricator.wikimedia.org/T411558 [21:25:44] (PS2) Daniel Kinzler: rest-gateway: move values-minikube.minikube to service definition [deployment-charts] - 
https://gerrit.wikimedia.org/r/1219222 [21:26:10] !log cscott@deploy2002 ksarabia, ihurbain, cscott: Backport for [[gerrit:1218793|Enable post-processing cache for all Parsoid-rendered wikis (T348255)]], [[gerrit:1217799|Decommission Article Summaries (T411558)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:27:56] (CR) CI reject: [V:-1] rest-gateway: move values-minikube.minikube to service definition [deployment-charts] - https://gerrit.wikimedia.org/r/1219222 (owner: Daniel Kinzler) [21:32:05] !log cscott@deploy2002 ksarabia, ihurbain, cscott: Continuing with sync [21:32:26] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11470692 (VRiley-WMF) [21:36:09] !log cscott@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218793|Enable post-processing cache for all Parsoid-rendered wikis (T348255)]], [[gerrit:1217799|Decommission Article Summaries (T411558)]] (duration: 12m 13s) [21:36:14] T348255: Parser cache infrastructure for OutputTransform - https://phabricator.wikimedia.org/T348255 [21:36:15] T411558: ArticleSummaries: Decommission the extension (code changes) - https://phabricator.wikimedia.org/T411558 [21:37:00] EricGardner: ok, i'm done. do you want to do your last patch yourself? [21:37:11] Sure, I can do that now [21:37:12] also, i'm not sure who is deploying pppery's patch [21:37:42] urbanecm: are you deploying pppery's patch? 
[21:39:51] I will start with my WikimediaEvents patch now [21:40:08] SRE, SRE-swift-storage, Infrastructure-Foundations, serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11470751 (thcipriani) [21:40:19] (CR) TrainBranchBot: [C:+2] "Approved by egardner@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1219209 (https://phabricator.wikimedia.org/T412857) (owner: Eric Gardner) [21:47:52] (Merged) jenkins-bot: Delay StickyHeaders section click instrumentation for slow loads [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1219209 (https://phabricator.wikimedia.org/T412857) (owner: Eric Gardner) [21:48:25] !log egardner@deploy2002 Started scap sync-world: Backport for [[gerrit:1219209|Delay StickyHeaders section click instrumentation for slow loads (T412857)]] [21:48:29] T412857: Sticky Headers: Distinguish automatic vs user-initiated section toggles - https://phabricator.wikimedia.org/T412857 [21:50:36] !log egardner@deploy2002 egardner: Backport for [[gerrit:1219209|Delay StickyHeaders section click instrumentation for slow loads (T412857)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:52:10] !log egardner@deploy2002 egardner: Continuing with sync [21:53:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:56:12] !log egardner@deploy2002 Finished scap sync-world: Backport for [[gerrit:1219209|Delay StickyHeaders section click instrumentation for slow loads (T412857)]] (duration: 07m 47s) [21:56:16] T412857: Sticky Headers: Distinguish automatic vs user-initiated section toggles - https://phabricator.wikimedia.org/T412857 [21:57:24] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, cjming: there's a volunteer patch on the schedule from pppery but I don't know who is supposed to deploy it. [22:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T2200) [22:07:18] I have some envoy updates to roll out, but happy to wait if that last patch is still going to go out :) [22:09:43] cscott, rzl: I recommend leaving the change undeployed and moving on. [22:09:50] SRE-Access-Requests: FIDO ssh key for ariel - https://phabricator.wikimedia.org/T413019 (ArielGlenn) NEW [22:10:07] SRE-Access-Requests: Add FIDO ssh key(s) for ariel - https://phabricator.wikimedia.org/T413019#11470889 (ArielGlenn) [22:10:28] rzl: any objections if I sneak in some shellbox updates before you start? 
[22:10:38] lest you pick them up :) [22:10:44] nope, fire away [22:10:58] ack, starting momentarily [22:11:01] you're also welcome to leave em for me, you'd just have to wait until I get all the way to S :P [22:11:56] (PS1) ArielGlenn: Add the first of two yubikey FIDO-compliant ssh keys for ariel [puppet] - https://gerrit.wikimedia.org/r/1219230 (https://phabricator.wikimedia.org/T413019) [22:12:18] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [22:12:32] (CR) CI reject: [V:-1] Add the first of two yubikey FIDO-compliant ssh keys for ariel [puppet] - https://gerrit.wikimedia.org/r/1219230 (https://phabricator.wikimedia.org/T413019) (owner: ArielGlenn) [22:12:53] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [22:12:59] rzl: thanks for offering! this probably warrants a wee bit more supervision than I'd want to burden you with, though. [22:13:24] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [22:13:38] 👍 [22:14:03] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [22:14:34] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [22:14:53] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [22:15:24] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [22:15:45] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [22:16:16] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [22:16:43] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [22:16:54] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1218775 (https://phabricator.wikimedia.org/T412455) (owner: LorenMora) [22:17:14] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [22:17:25] FIRING: [4x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:17:33] dancy: yep sounds good to me [22:17:57] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [22:18:32] (PS2) ArielGlenn: Add the first of two yubikey FIDO-compliant ssh keys for ariel [puppet] - https://gerrit.wikimedia.org/r/1219230 (https://phabricator.wikimedia.org/T413019) [22:19:25] rzl: I'll let that soak for 10m or so, then update codfw, then all yours [22:19:30] sgtm [22:30:00] service metrics and logstash look good. off to codfw we go. [22:30:18] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply [22:30:56] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [22:31:27] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [22:36:09] (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1219230 (https://phabricator.wikimedia.org/T413019) (owner: ArielGlenn) [22:41:58] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [22:42:25] FIRING: [5x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:43:15] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [22:43:50] !log 
swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [22:45:04] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [22:45:18] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [22:45:50] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [22:46:06] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [22:46:37] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [22:46:59] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [22:47:31] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [22:48:07] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [22:52:49] rzl: all yours. thanks for your patience! [22:52:59] thanks! [22:53:04] !log upload new version of corto [22:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:17] rolling out envoy 1.35.7 to eqiad services [22:55:20] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/apertium: apply [22:55:56] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply [22:56:33] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [22:57:05] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [22:58:02] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:58:08] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:59:26] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/commons-impact-analytics: apply [22:59:44] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/commons-impact-analytics: apply [23:00:04] Deploy window Web 
Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T2300) [23:03:24] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [23:03:43] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [23:03:50] (if anyone has plans to use the Web Team window today, I'm happy to pause for as long as you need!) [23:03:57] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [23:04:17] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [23:04:30] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [23:04:47] (PS1) Bearloga: EventStreamConfig: enrich stream with more headers [mediawiki-config] - https://gerrit.wikimedia.org/r/1219234 (https://phabricator.wikimedia.org/T396562) [23:04:48] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [23:04:59] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/echostore: apply [23:05:57] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/echostore: apply [23:06:16] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [23:06:32] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [23:06:48] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [23:07:14] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [23:07:33] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [23:08:11] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [23:08:25] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [23:08:47] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [23:08:55] !log rzl@deploy2002 
helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [23:09:11] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [23:09:21] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [23:09:47] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [23:10:05] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [23:10:35] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [23:10:53] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [23:11:41] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [23:11:57] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [23:12:14] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply [23:12:29] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/image-suggestion: apply [23:12:50] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply [23:13:09] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [23:13:34] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [23:13:45] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [23:14:58] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [23:17:47] (CR) CDanis: [C:+1] "thanks!" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1219234 (https://phabricator.wikimedia.org/T396562) (owner: Bearloga) [23:18:06] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [23:19:29] (CR) ArielGlenn: [C:+2] Add the first of two yubikey FIDO-compliant ssh keys for ariel [puppet] - https://gerrit.wikimedia.org/r/1219230 (https://phabricator.wikimedia.org/T413019) (owner: ArielGlenn) [23:26:49] FIRING: DiskSpace: Disk space serpens:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [23:30:44] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [23:30:48] this is going to time out soonish, same thing that happened last time I tried to deploy this serv-- yeah [23:31:01] moving on for now, I'll come back around to it [23:31:11] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [23:31:27] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [23:31:42] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [23:33:31] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [23:34:09] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [23:34:52] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [23:35:22] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [23:35:28] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [23:35:47] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [23:36:08] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [23:41:09] !log rzl@deploy2002 
helmfile [eqiad] START helmfile.d/services/proton: apply [23:42:03] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [23:42:14] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [23:42:54] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply [23:43:08] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [23:43:14] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [23:43:21] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: apply [23:43:47] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: apply [23:44:08] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [23:44:26] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [23:45:01] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [23:45:24] FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:45:34] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [23:45:54] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/termbox: apply [23:46:40] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [23:47:24] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [23:47:47] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [23:48:14] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [23:48:31] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [23:49:26] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: 
apply [23:49:47] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [23:50:33] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [23:51:02] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [23:51:18] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [23:52:03] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [23:52:06] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - lswtest-d8-eqiad:ethernet-1/56 (Core: ssw1-d1-eqiad:ethernet-1/17 {#temp1848392398}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:52:12] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply [23:52:38] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [23:53:39] FIRING: [4x] CoreBGPDown: Core BGP session down between lswtest-d8-eqiad and ssw1-d1-eqiad (10.64.128.17) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:54:11] letting that rest a moment for extremely responsible operations reasons (i.e. I want a snack) and then I'll roll the same thing in codfw