[00:01:10] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86686 and previous config saved to /var/cache/conftool/dbconfig/20251217-000109-marostegui.json
[00:01:16] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[00:01:16] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[00:10:24] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[00:16:18] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P86687 and previous config saved to /var/cache/conftool/dbconfig/20251217-001617-marostegui.json
[00:17:54] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/apertium: apply
[00:18:01] <rzl>	 rolling some envoy updates, staging only
[00:18:18] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/apertium: apply
[00:20:07] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply
[00:20:27] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply
[00:20:38] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[00:20:45] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:20:53] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply
[00:21:12] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[00:22:15] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply
[00:22:25] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply
[00:22:32] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[00:22:54] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[00:23:22] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/data-gateway: apply
[00:23:34] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply
[00:23:44] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[00:23:54] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[00:24:20] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/device-analytics: apply
[00:24:30] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply
[00:24:37] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/echostore: apply
[00:24:48] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/echostore: apply
[00:25:24] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[00:25:37] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
[00:25:43] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply
[00:25:55] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply
[00:26:15] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[00:26:27] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[00:26:37] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[00:26:49] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
[00:27:01] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[00:27:12] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[00:27:18] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply
[00:27:29] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply
[00:27:54] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply
[00:28:06] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[00:28:17] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[00:28:36] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
[00:28:50] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/geo-analytics: apply
[00:29:02] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply
[00:29:28] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/image-suggestion: apply
[00:29:37] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply
[00:30:11] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[00:30:37] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[00:30:44] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply
[00:31:26] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P86688 and previous config saved to /var/cache/conftool/dbconfig/20251217-003126-marostegui.json
[00:31:29] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply
[00:32:34] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply
[00:33:50] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply
[00:34:50] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[00:37:30] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[00:37:42] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/media-analytics: apply
[00:38:09] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply
[00:38:29] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[00:39:25] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[00:39:33] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[00:39:47] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[00:40:10] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1218860
[00:40:10] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1218860 (owner: TrainBranchBot)
[00:41:36] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[00:41:42] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[00:42:16] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply
[00:42:43] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply
[00:42:50] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply
[00:43:01] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply
[00:43:09] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply
[00:43:19] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply
[00:43:33] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[00:43:39] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[00:43:45] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: apply
[00:43:55] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply
[00:45:28] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[00:45:39] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[00:45:55] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply
[00:46:07] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[00:46:20] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[00:46:35] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86689 and previous config saved to /var/cache/conftool/dbconfig/20251217-004634-marostegui.json
[00:46:36] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[00:46:40] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[00:46:40] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[00:46:45] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply
[00:46:51] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2220.codfw.wmnet with reason: Maintenance
[00:46:57] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
[00:47:01] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2220 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86690 and previous config saved to /var/cache/conftool/dbconfig/20251217-004659-marostegui.json
[00:48:27] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply
[00:48:45] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[00:48:56] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply
[00:49:15] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply
[00:49:25] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply
[00:49:53] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply
[00:49:58] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply
[00:50:31] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply
[00:50:39] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply
[00:50:50] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply
[00:52:57] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1218860 (owner: TrainBranchBot)
[00:56:19] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply
[00:56:30] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/toolhub: apply
[00:56:39] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[00:56:52] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[00:57:03] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[00:57:21] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[00:57:30] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[00:58:06] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[00:58:12] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply
[00:58:31] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[01:01:03] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:10:13] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1218862
[01:10:13] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1218862 (owner: TrainBranchBot)
[01:25:14] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 24m 10s)
[01:34:32] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1218862 (owner: TrainBranchBot)
[01:44:06] <wikibugs>	 SRE, Maps, Traffic: Allow Wikimedia Maps usage on Wikidata for Firefox (Browser extension) - https://phabricator.wikimedia.org/T398588#11466958 (Aklapper) Open→Declined
[01:48:05] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11466974 (Papaul) a:Papaul→ayounsi @ayounsi assigned back to you since you are working on it. thanks
[01:55:38] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86691 and previous config saved to /var/cache/conftool/dbconfig/20251217-015538-marostegui.json
[01:55:44] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[01:55:45] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[02:10:47] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P86692 and previous config saved to /var/cache/conftool/dbconfig/20251217-021046-marostegui.json
[02:13:10] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T410589)', diff saved to https://phabricator.wikimedia.org/P86693 and previous config saved to /var/cache/conftool/dbconfig/20251217-021310-ladsgroup.json
[02:13:14] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[02:25:55] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P86694 and previous config saved to /var/cache/conftool/dbconfig/20251217-022554-marostegui.json
[02:28:19] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P86695 and previous config saved to /var/cache/conftool/dbconfig/20251217-022818-ladsgroup.json
[02:30:24] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:41:03] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86696 and previous config saved to /var/cache/conftool/dbconfig/20251217-024103-marostegui.json
[02:41:09] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[02:41:09] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[02:41:19] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[02:41:28] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1231 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86697 and previous config saved to /var/cache/conftool/dbconfig/20251217-024127-marostegui.json
[02:43:27] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P86698 and previous config saved to /var/cache/conftool/dbconfig/20251217-024326-ladsgroup.json
[02:58:36] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T410589)', diff saved to https://phabricator.wikimedia.org/P86699 and previous config saved to /var/cache/conftool/dbconfig/20251217-025835-ladsgroup.json
[02:58:40] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[02:58:52] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2227.codfw.wmnet with reason: Maintenance
[02:59:01] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2227 (T410589)', diff saved to https://phabricator.wikimedia.org/P86700 and previous config saved to /var/cache/conftool/dbconfig/20251217-025900-ladsgroup.json
[03:41:44] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86701 and previous config saved to /var/cache/conftool/dbconfig/20251217-034143-marostegui.json
[03:41:50] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[03:41:50] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[03:45:24] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:49:20] <wikibugs>	 (CR) Dzahn: [C:-2] "this can go last after everything else, cleanup-only and it needs a typo fix" [puppet] - https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[03:54:19] <icinga-wm>	 PROBLEM - Host lsw1-e2-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:54:52] <papaul>	 that is me
[03:55:12] <rzl>	 evening papaul :) thanks
[03:55:31] <papaul>	 rzl: hello
[03:56:52] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P86702 and previous config saved to /var/cache/conftool/dbconfig/20251217-035651-marostegui.json
[04:02:23] <jinxer-wm>	 FIRING: GnmiTargetDown: lsw1-e2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[04:04:33] <icinga-wm>	 RECOVERY - Host lsw1-e2-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 31.67 ms
[04:07:22] <jinxer-wm>	 RESOLVED: GnmiTargetDown: lsw1-e2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[04:10:24] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[04:12:00] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P86703 and previous config saved to /var/cache/conftool/dbconfig/20251217-041200-marostegui.json
[04:17:26] <wikibugs>	 (CR) Dzahn: ats: gerrit: don't validate TLS host for now (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: CDanis)
[04:27:09] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86704 and previous config saved to /var/cache/conftool/dbconfig/20251217-042708-marostegui.json
[04:27:15] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[04:27:15] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[04:27:25] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1253.eqiad.wmnet with reason: Maintenance
[04:27:33] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1253 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86705 and previous config saved to /var/cache/conftool/dbconfig/20251217-042733-marostegui.json
[04:29:45] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86706 and previous config saved to /var/cache/conftool/dbconfig/20251217-042943-marostegui.json
[04:44:53] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P86707 and previous config saved to /var/cache/conftool/dbconfig/20251217-044453-marostegui.json
[04:51:29] <wikibugs>	 SRE, Infrastructure-Foundations, netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11467201 (Papaul) I took a quick look at this before getting the support ticket going on.  On lsw1-e2-codfw we have  ` Frame length statistics for m...
[04:55:41] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 562521992 and 39 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:59:41] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 1936 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:00:02] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P86708 and previous config saved to /var/cache/conftool/dbconfig/20251217-050001-marostegui.json
[05:01:51] <jinxer-wm>	 FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/0/6 (Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[05:01:51] <jinxer-wm>	 FIRING: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - cr1-eqiad:xe-3/2/3 (Core: asw2-b-eqiad:xe-2/0/45 {#3457}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSaturation
[05:01:57] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:02:07] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[05:02:09] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[05:02:48] <jinxer-wm>	 FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:02:58] <jinxer-wm>	 FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:02:59] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Swift
[05:02:59] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.206 second response time https://wikitech.wikimedia.org/wiki/Swift
[05:04:12] <jinxer-wm>	 FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[05:04:13] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[05:04:13] <akosiaris>	 !incidents
[05:04:14] <sirenbot>	 7196 (UNACKED)  TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad)
[05:04:14] <sirenbot>	 7197 (UNACKED)  CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad)
[05:04:14] <sirenbot>	 7198 (UNACKED)  ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[05:04:14] <sirenbot>	 7199 (UNACKED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[05:04:15] <sirenbot>	 7195 (RESOLVED)  TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad)
[05:04:24] <akosiaris>	 !ack 7196
[05:04:24] <sirenbot>	 7196 (ACKED)  TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad)
[05:04:28] <akosiaris>	 !ack 7197
[05:04:29] <sirenbot>	 7197 (ACKED)  CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad)
[05:04:33] <akosiaris>	 !ack 7198
[05:04:34] <sirenbot>	 7198 (ACKED)  ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[05:04:37] <akosiaris>	 !ack 7199
[05:04:37] <sirenbot>	 7199 (ACKED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[05:06:51] <jinxer-wm>	 RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/0/6 (Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[05:06:51] <jinxer-wm>	 RESOLVED: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - cr1-eqiad:xe-3/2/3 (Core: asw2-b-eqiad:xe-2/0/45 {#3457}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSaturation
[05:06:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:08:18] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11467204 (Marostegui)
[05:08:25] <akosiaris>	 !incidents
[05:08:25] <sirenbot>	 7199 (ACKED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[05:08:25] <sirenbot>	 7200 (UNACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[05:08:25] <sirenbot>	 7198 (RESOLVED)  ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[05:08:26] <sirenbot>	 7197 (RESOLVED)  CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad)
[05:08:26] <sirenbot>	 7196 (RESOLVED)  TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad)
[05:08:26] <sirenbot>	 7195 (RESOLVED)  TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad)
[05:08:32] <akosiaris>	 !ack 7200
[05:08:32] <sirenbot>	 7200 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[05:09:11] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:12] <jinxer-wm>	 RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[05:09:13] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[05:11:32] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11467205 (Marostegui)
[05:12:18] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11467206 (Marostegui) p:Triage→Medium
[05:15:10] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86709 and previous config saved to /var/cache/conftool/dbconfig/20251217-051509-marostegui.json
[05:15:15] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[05:15:15] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[05:15:15] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[05:17:25] <wikibugs>	 (PS5) Pppery: Handle languages with nonstandard plural rules [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217845 (https://phabricator.wikimedia.org/T412422)
[05:21:17] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86710 and previous config saved to /var/cache/conftool/dbconfig/20251217-052117-marostegui.json
[05:21:23] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[05:21:23] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[05:24:32] <wikibugs>	 (PS6) Pppery: Handle languages with nonstandard plural rules [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217845 (https://phabricator.wikimedia.org/T412422)
[05:24:57] <wikibugs>	 SRE, LDAP-Access-Requests, Data-Platform-SRE (2025.11.07 - 2025.11.28): Access Admin menu in Airflow - https://phabricator.wikimedia.org/T412119#11467222 (Marostegui) Open→Resolved I believe this is all done - please reopen if not. Thanks Ben for handling this.
[05:25:20] <slyngs>	 !incidents
[05:25:20] <sirenbot>	 7200 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[05:25:20] <sirenbot>	 7199 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[05:25:21] <sirenbot>	 7198 (RESOLVED)  ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[05:25:21] <sirenbot>	 7197 (RESOLVED)  CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad)
[05:25:21] <sirenbot>	 7196 (RESOLVED)  TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad)
[05:25:21] <sirenbot>	 7195 (RESOLVED)  TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad)
[05:25:23] <wikibugs>	 (PS7) Pppery: Handle languages with nonstandard plural rules [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217845 (https://phabricator.wikimedia.org/T412422)
[05:27:57] <wikibugs>	 (PS1) Marostegui: es2028: Add note [puppet] - https://gerrit.wikimedia.org/r/1218872
[05:29:00] <wikibugs>	 (CR) Marostegui: "This is a noop" [puppet] - https://gerrit.wikimedia.org/r/1218872 (owner: Marostegui)
[05:29:01] <wikibugs>	 (CR) Marostegui: [C:+2] es2028: Add note [puppet] - https://gerrit.wikimedia.org/r/1218872 (owner: Marostegui)
[05:30:14] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2222.codfw.wmnet with reason: schema change
[05:33:24] <wikibugs>	 (PS4) Pppery: Add an internal translation file for this repo's own strings [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217873 (https://phabricator.wikimedia.org/T412651)
[05:34:11] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:36:25] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P86711 and previous config saved to /var/cache/conftool/dbconfig/20251217-053625-marostegui.json
[05:51:34] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P86712 and previous config saved to /var/cache/conftool/dbconfig/20251217-055133-marostegui.json
[06:06:42] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86713 and previous config saved to /var/cache/conftool/dbconfig/20251217-060641-marostegui.json
[06:06:48] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[06:06:48] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[06:06:58] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2221.codfw.wmnet with reason: Maintenance
[06:07:06] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2221 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86714 and previous config saved to /var/cache/conftool/dbconfig/20251217-060706-marostegui.json
[06:07:45] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.062 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:07:57] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:07:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.056 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:07:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:07:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.062 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:07:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.082 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:07:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.088 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:07:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.156 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:07:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:08:00] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.212 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:08:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:08:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.070 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:08:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.063 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:08:02] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1013.eqiad.wmnet, ms-fe1017.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1020.eqiad.wmnet, ms-fe1009.eqiad.wmnet, ms-fe1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:08:02] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.103 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:08:03] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1013.eqiad.wmnet, ms-fe1011.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1020.eqiad.wmnet, ms-fe1009.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1019.eqiad.wmnet, ms-fe1016.eqiad.wmnet, ms-fe1014.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:08:07] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:08:07] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:08:07] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:08:07] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:08:07] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:08:09] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:08:09] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:08:57] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:08:57] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.127 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:08:57] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.149 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:08:59] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:08:59] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:09:05] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 6.278 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:09:11] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:09:12] <jinxer-wm>	 FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[06:09:13] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[06:09:57] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:09:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.072 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:10:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:11:03] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.246 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:11:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.093 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:11:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.105 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:12:05] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 6.555 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:12:51] <jinxer-wm>	 FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[06:12:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.063 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:12:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.580 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:13:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.089 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:13:05] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.473 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:13:05] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.914 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:13:05] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.131 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:13:07] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:14:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.060 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:14:05] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.509 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:14:11] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:14:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2019.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2018.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2017.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:14:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2018.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:14:35] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:14:37] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:14:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:14:57] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:14:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:14:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.063 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:15:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:15:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.257 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:15:11] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:15:35] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.610 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:15:35] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:15:35] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:15:37] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:15:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.076 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:15:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.080 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:15:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.231 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:16:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:16:01] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:16:07] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.886 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:16:11] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:16:25] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:16:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:16:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:16:27] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.835 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:16:35] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:16:51] <jinxer-wm>	 FIRING: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - cr1-eqiad:xe-3/2/3 (Core: asw2-b-eqiad:xe-2/0/45 {#3457}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSaturation
[06:16:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.093 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:16:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.071 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:17:03] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.611 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:17:07] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.713 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:17:27] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.225 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:17:29] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.720 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:17:35] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 1.715 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:17:35] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.800 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:17:51] <jinxer-wm>	 RESOLVED: [5x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) #page - https://w.wiki/Gbyf  - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[06:17:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:17:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.061 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:17:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.518 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:18:01] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.080 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:18:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:18:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.191 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:18:03] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.694 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:18:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 3.548 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:18:07] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.571 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:18:11] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:18:24] <slyngs>	 !incidents
[06:18:25] <sirenbot>	 7201 (UNACKED)  ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[06:18:25] <sirenbot>	 7202 (UNACKED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[06:18:25] <sirenbot>	 7203 (UNACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[06:18:25] <sirenbot>	 7205 (UNACKED)  CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad)
[06:18:26] <sirenbot>	 7204 (RESOLVED)  [2x] TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 gnmi codfw)
[06:18:26] <sirenbot>	 7200 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[06:18:26] <sirenbot>	 7199 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[06:18:26] <sirenbot>	 7198 (RESOLVED)  ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[06:18:26] <sirenbot>	 7197 (RESOLVED)  CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad)
[06:18:27] <sirenbot>	 7196 (RESOLVED)  TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad)
[06:18:27] <sirenbot>	 7195 (RESOLVED)  TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad)
[06:18:42] <slyngs>	 !ack 7205
[06:18:43] <sirenbot>	 7205 (ACKED)  CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad)
[06:18:49] <slyngs>	 !ack 7203
[06:18:49] <sirenbot>	 7203 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[06:18:51] <slyngs>	 !ack 7202
[06:18:52] <sirenbot>	 7202 (ACKED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[06:18:53] <slyngs>	 !ack 7201
[06:18:54] <sirenbot>	 7201 (ACKED)  ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[06:18:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.058 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:19:03] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:19:05] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.495 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:19:05] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.998 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:19:11] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 9.864 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:19:11] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:19:11] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:19:35] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:19:35] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:20:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:20:05] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.132 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:20:11] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:20:24] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:20:25] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:20:25] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:20:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:20:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:21:01] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:21:01] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:21:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.181 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:21:01] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 1.231 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:21:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.778 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:21:51] <jinxer-wm>	 FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:21:57] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.142 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:22:05] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 5.789 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:23:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.158 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:24:01] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.484 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:24:01] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.523 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:24:05] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 5.819 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:24:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.068 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:25:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.078 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:25:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.190 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:25:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:25:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:25:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.246 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:25:05] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 6.392 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:25:11] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:25:11] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:25:24] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:25:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2019.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2018.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2017.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:25:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2019.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2018.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2017.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:25:27] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.214 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:25:27] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.303 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:25:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:26:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.198 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:26:03] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.157 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:26:11] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 9.374 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:26:27] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.217 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:26:29] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.263 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:26:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.050 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:26:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.148 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:26:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.075 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:27:01] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.226 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:27:01] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.388 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:27:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.192 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:27:05] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.945 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:27:07] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.116 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:27:11] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:27:27] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.713 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:27:31] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.561 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:28:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.139 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:28:07] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.291 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:28:25] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:28:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:28:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:28:29] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.686 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:28:33] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:28:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:29:01] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:29:01] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 2.501 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:29:11] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:29:31] <icinga-wm>	 PROBLEM - Docker registry HTTPS interface on registry2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[06:29:37] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.096 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:29:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.232 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:30:01] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.457 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:30:03] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.052 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:30:24] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:30:24] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:30:25] <icinga-wm>	 RECOVERY - Docker registry HTTPS interface on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 5.155 second response time https://wikitech.wikimedia.org/wiki/Docker
[06:30:35] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.380 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:30:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:31:01] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.714 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:31:11] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:31:35] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:31:51] <jinxer-wm>	 FIRING: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:32:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 3.936 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:32:11] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:32:18] <slyngs>	 !incidents
[06:32:18] <sirenbot>	 7201 (ACKED)  ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[06:32:18] <sirenbot>	 7202 (ACKED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[06:32:18] <sirenbot>	 7203 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[06:32:19] <sirenbot>	 7205 (ACKED)  CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad)
[06:32:19] <sirenbot>	 7206 (UNACKED)  [4x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[06:32:19] <sirenbot>	 7204 (RESOLVED)  [2x] TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 gnmi codfw)
[06:32:19] <sirenbot>	 7200 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[06:32:19] <sirenbot>	 7199 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[06:32:20] <sirenbot>	 7198 (RESOLVED)  ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[06:32:20] <sirenbot>	 7197 (RESOLVED)  CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad)
[06:32:21] <sirenbot>	 7196 (RESOLVED)  TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad)
[06:32:21] <sirenbot>	 7195 (RESOLVED)  TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad)
[06:32:27] <slyngs>	 !ack 7206
[06:32:28] <sirenbot>	 7206 (ACKED)  [4x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[06:32:29] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.289 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:32:29] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.569 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:33:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:33:07] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.363 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:33:09] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:33:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2019.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2017.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:33:29] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.566 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:33:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:33:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.071 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:34:01] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 1.149 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:34:03] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.362 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:34:11] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:34:11] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:34:11] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:34:39] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 7.030 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:34:59] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.371 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:35:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.078 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:35:01] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:35:01] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.775 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:35:05] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.618 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:35:11] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:35:24] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:35:37] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:35:41] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 525440072 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[06:35:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:36:01] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.601 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:36:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.206 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:36:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.599 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:36:05] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.825 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:36:09] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.312 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:36:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2019.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:36:27] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:36:33] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.185 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:36:37] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:36:59] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.067 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:37:11] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:37:11] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:37:41] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[06:37:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:37:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:37:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.096 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:38:01] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 1.060 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:38:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:38:29] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.789 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:38:35] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.960 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:38:35] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 8.036 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:38:35] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:38:37] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.096 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:38:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.056 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:38:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.301 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:39:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.241 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:39:03] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.469 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:39:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.498 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:39:11] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:39:11] <jinxer-wm>	 RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:39:25] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:39:57] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.142 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:40:01] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 1.762 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:40:01] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 1.043 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:40:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.328 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:40:03] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.193 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:40:07] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.444 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:40:07] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 9.270 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:40:11] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:40:24] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:40:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:40:25] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:40:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:41:01] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:41:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:41:03] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.567 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:41:07] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 7.007 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:41:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:42:01] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:42:35] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.780 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:42:57] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:42:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.090 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:42:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.280 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:42:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:43:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.297 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:43:03] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:43:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.238 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:43:07] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:43:09] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.957 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:43:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:43:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:43:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:43:35] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:43:37] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:43:37] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:43:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:43:57] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:43:57] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:44:01] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.877 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:44:01] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.896 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:44:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:44:01] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:44:01] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.475 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:44:11] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:44:11] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[06:44:11] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:44:33] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.122 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:44:57] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:44:59] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.933 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:45:01] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:45:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 1.191 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:45:03] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.317 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:45:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 3.579 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:45:03] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.822 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:45:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 3.293 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:45:07] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.961 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:45:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:45:25] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:45:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:45:27] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.244 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:45:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:46:01] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:46:02] <akosiaris>	 !incidents
[06:46:03] <sirenbot>	 7202 (ACKED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[06:46:03] <sirenbot>	 7203 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[06:46:03] <sirenbot>	 7205 (ACKED)  CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad)
[06:46:03] <sirenbot>	 7206 (ACKED)  [4x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[06:46:03] <sirenbot>	 7207 (UNACKED)  ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[06:46:04] <sirenbot>	 7201 (RESOLVED)  ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[06:46:04] <sirenbot>	 7204 (RESOLVED)  [2x] TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 gnmi codfw)
[06:46:04] <sirenbot>	 7200 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[06:46:04] <sirenbot>	 7199 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[06:46:05] <sirenbot>	 7198 (RESOLVED)  ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[06:46:05] <sirenbot>	 7197 (RESOLVED)  CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad)
[06:46:06] <sirenbot>	 7196 (RESOLVED)  TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad)
[06:46:06] <sirenbot>	 7195 (RESOLVED)  TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad)
[06:46:13] <akosiaris>	 !ack 7207
[06:46:13] <sirenbot>	 7207 (ACKED)  ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[06:46:51] <jinxer-wm>	 RESOLVED: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - cr1-eqiad:xe-3/2/3 (Core: asw2-b-eqiad:xe-2/0/45 {#3457}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSaturation
[06:47:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.121 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:47:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.068 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:47:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.161 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:48:07] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.586 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:48:12] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:48:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:48:53] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 7.932 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:48:57] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:48:57] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:48:57] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.263 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:48:57] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:48:57] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.386 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:48:59] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:48:59] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:48:59] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.317 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:48:59] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 2.087 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:48:59] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:48:59] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.320 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:49:00] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:49:01] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.955 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:49:01] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:49:01] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:49:05] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.799 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:49:05] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 6.908 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:49:07] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 7.496 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:49:11] <jinxer-wm>	 RESOLVED: [3x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:49:51] <jinxer-wm>	 FIRING: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - cr1-eqiad:xe-3/2/3 (Core: asw2-b-eqiad:xe-2/0/45 {#3457}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSaturation
[06:49:59] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.237 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:51:51] <jinxer-wm>	 FIRING: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:52:51] <jinxer-wm>	 FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[06:54:12] <jinxer-wm>	 RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[06:54:13] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[06:54:51] <jinxer-wm>	 RESOLVED: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - cr1-eqiad:xe-3/2/3 (Core: asw2-b-eqiad:xe-2/0/45 {#3457}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSaturation
[06:56:51] <jinxer-wm>	 RESOLVED: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:57:51] <jinxer-wm>	 RESOLVED: [3x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf  - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation
[06:59:11] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T0700)
[07:04:11] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:07:43] <jinxer-wm>	 FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:07:48] <jinxer-wm>	 FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:17:43] <jinxer-wm>	 FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:22:43] <jinxer-wm>	 FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:27:43] <jinxer-wm>	 FIRING: [21x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:27:53] <jinxer-wm>	 FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:32:43] <jinxer-wm>	 FIRING: [21x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:37:43] <jinxer-wm>	 FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:42:43] <jinxer-wm>	 FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:42:53] <jinxer-wm>	 FIRING: [17x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:45:24] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:47:43] <jinxer-wm>	 FIRING: [11x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:47:48] <jinxer-wm>	 RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:52:43] <jinxer-wm>	 RESOLVED: [10x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:02:57] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by akosiaris@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1216750 (https://phabricator.wikimedia.org/T280718) (owner: Alexandros Kosiaris)
[08:03:46] <wikibugs>	 (Merged) jenkins-bot: Update fc-list to point to fc-list Tool [mediawiki-config] - https://gerrit.wikimedia.org/r/1216750 (https://phabricator.wikimedia.org/T280718) (owner: Alexandros Kosiaris)
[08:04:41] <logmsgbot>	 !log akosiaris@deploy2002 Started scap sync-world: Backport for [[gerrit:1216750|Update fc-list to point to fc-list Tool (T280718)]]
[08:04:45] <stashbot>	 T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practice - https://phabricator.wikimedia.org/T280718
[08:07:36] <logmsgbot>	 !log akosiaris@deploy2002 akosiaris: Backport for [[gerrit:1216750|Update fc-list to point to fc-list Tool (T280718)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:08:24] <logmsgbot>	 !log akosiaris@deploy2002 akosiaris: Continuing with sync
[08:09:41] <icinga-wm>	 PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 77%, RTA = 8160.08 ms
[08:10:03] <icinga-wm>	 RECOVERY - Host wikikube-worker1275 is UP: PING WARNING - Packet loss = 0%, RTA = 1393.21 ms
[08:10:24] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:13:03] <logmsgbot>	 !log akosiaris@deploy2002 Finished scap sync-world: Backport for [[gerrit:1216750|Update fc-list to point to fc-list Tool (T280718)]] (duration: 08m 22s)
[08:13:07] <stashbot>	 T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practice - https://phabricator.wikimedia.org/T280718
[08:26:37] <moritzm>	 !log installing jq security updates
[08:26:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:43] <wikibugs>	 (PS1) Elukey: scap: add ml-build1001 to the scap targets [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1219114 (https://phabricator.wikimedia.org/T412524)
[08:27:49] <wikibugs>	 (CR) Dpogorzelski: [C:+1] scap: add ml-build1001 to the scap targets [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1219114 (https://phabricator.wikimedia.org/T412524) (owner: Elukey)
[08:29:05] <wikibugs>	 (CR) Elukey: [V:+2 C:+2] scap: add ml-build1001 to the scap targets [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1219114 (https://phabricator.wikimedia.org/T412524) (owner: Elukey)
[08:32:13] <wikibugs>	 (PS1) Muehlenhoff: debdeploy: Remove buster from list of supported releases [puppet] - https://gerrit.wikimedia.org/r/1219115
[08:40:43] <logmsgbot>	 !log elukey@deploy2002 Started deploy [docker-pkg/deploy@4533f76]: Deploy docker-pkg
[08:41:39] <logmsgbot>	 !log elukey@deploy2002 Finished deploy [docker-pkg/deploy@4533f76]: Deploy docker-pkg (duration: 01m 08s)
[08:42:51] <wikibugs>	 (PS1) Alexandros Kosiaris: Remove scap_proxy profile [puppet] - https://gerrit.wikimedia.org/r/1219117 (https://phabricator.wikimedia.org/T411508)
[08:45:50] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1219117 (https://phabricator.wikimedia.org/T411508) (owner: Alexandros Kosiaris)
[08:45:55] <wikibugs>	 (CR) Alexandros Kosiaris: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1219117 (https://phabricator.wikimedia.org/T411508) (owner: Alexandros Kosiaris)
[08:48:44] <wikibugs>	 (CR) Elukey: [C:+1] debdeploy: Remove buster from list of supported releases [puppet] - https://gerrit.wikimedia.org/r/1219115 (owner: Muehlenhoff)
[08:50:08] <wikibugs>	 (CR) Muehlenhoff: [C:+2] admin_ng: bump kartotherian's cpu quotas to have smoother deployments [deployment-charts] - https://gerrit.wikimedia.org/r/1218737 (owner: Elukey)
[08:50:58] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2025-12-15-140202-production [deployment-charts] - https://gerrit.wikimedia.org/r/1219119
[08:54:21] <wikibugs>	 (CR) Ayounsi: [C:+1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/1218315 (https://phabricator.wikimedia.org/T412458) (owner: Elukey)
[08:58:04] <wikibugs>	 SRE, Infrastructure-Foundations, netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11467826 (ayounsi) My guess is that SR-Linux < 25 doesn't have stats for mgmt0 (either not implemented yet or a bug), with the upgrade we've started...
[09:00:05] <jouncebot>	 dancy and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T0900)
[09:00:08] <wikibugs>	 (PS1) Elukey: images: add python3-build-trixie image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1219121
[09:01:14] <wikibugs>	 (PS2) Elukey: images: add python3-build-trixie image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1219121
[09:03:13] <wikibugs>	 (CR) Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/images/ ==" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1219121 (owner: Elukey)
[09:04:28] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:05:34] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:06:35] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[09:07:21] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[09:09:38] <logmsgbot>	 !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[09:10:04] <wikibugs>	 (CR) Alexandros Kosiaris: "Adding Blake and Jasmine per comments in https://phabricator.wikimedia.org/T411508 for review (also feel free to deploy)" [puppet] - https://gerrit.wikimedia.org/r/1219117 (https://phabricator.wikimedia.org/T411508) (owner: Alexandros Kosiaris)
[09:10:28] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] images: add python3-build-trixie image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1219121 (owner: Elukey)
[09:12:05] <logmsgbot>	 !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[09:12:37] <wikibugs>	 (CR) KartikMistry: [C:+2] Update cxserver to 2025-12-15-140202-production [deployment-charts] - https://gerrit.wikimedia.org/r/1219119 (owner: KartikMistry)
[09:13:02] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[09:13:39] <wikibugs>	 (PS1) Daniel Kinzler: rest-gateway: log x-wmf- headers [deployment-charts] - https://gerrit.wikimedia.org/r/1219123
[09:13:55] <moritzm>	 !log installing nginx security updates
[09:13:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:19] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:14:28] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2025-12-15-140202-production [deployment-charts] - https://gerrit.wikimedia.org/r/1219119 (owner: KartikMistry)
[09:17:35] <urbanecm>	 jouncebot: nowandnext
[09:17:35] <jouncebot>	 For the next 1 hour(s) and 42 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T0900)
[09:17:35] <jouncebot>	 In 1 hour(s) and 42 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1100)
[09:18:01] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[09:18:53] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[09:23:54] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm, I deployed this on all wikikube clusters" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218813 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[09:26:35] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11467881 (10MoritzMuehlenhoff)
[09:28:50] <fabfur>	 !log depool and disable puppet on cp7009 for haproxy qos testing (T412785)
[09:28:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:54] <stashbot>	 T412785: Enable QoS for upload video files - https://phabricator.wikimedia.org/T412785
[09:32:05] <logmsgbot>	 !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp7009.*
[09:32:12] <logmsgbot>	 !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp7009.*
[09:36:02] <wikibugs>	 (03PS3) 10STran: Enable v2 non-emergency workflow by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512)
[09:37:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-staging-ctrl_6443: Servers ml-staging-ctrl2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:38:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:39:00] <wikibugs>	 (03PS4) 10STran: Enable v2 non-emergency workflow by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512)
[09:40:47] <wikibugs>	 10SRE-SLO, 10Observability-Metrics, 13Patch-For-Review: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521#11467954 (10tappof)
[09:46:42] <wikibugs>	 (03CR) 10Elukey: [V:03+2] images: add python3-build-trixie image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1219121 (owner: 10Elukey)
[09:51:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:54:35] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[09:55:10] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[09:55:41] <wikibugs>	 (03PS4) 10Elukey: DNM - Reimage: dup-uefi after the first puppet run [cookbooks] - 10https://gerrit.wikimedia.org/r/1218731
[09:56:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:57:37] <wikibugs>	 (03PS1) 10Muehlenhoff: kartotherian: Bump version to include latest libpng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219126
[09:59:15] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[09:59:50] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[10:02:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] debdeploy: Remove buster from list of supported releases [puppet] - 10https://gerrit.wikimedia.org/r/1219115 (owner: 10Muehlenhoff)
[10:05:54] <wikibugs>	 (03CR) 10Mszwarc: [C:03+1] Enable v2 non-emergency workflow by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) (owner: 10STran)
[10:07:18] <kart_>	 !log Updated cxserver to 2025-12-15-140202-production
[10:07:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:50] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie
[10:09:28] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) (owner: 10STran)
[10:19:58] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove puppetmaster::backend role and related Hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1219129 (https://phabricator.wikimedia.org/T365798)
[10:22:52] <wikibugs>	 (03PS1) 10Elukey: Rework Makefile.build to ease additional distributions [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1219130
[10:22:52] <wikibugs>	 (03PS1) 10Elukey: Add Trixie artifacts [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1219131
[10:25:18] <wikibugs>	 (03CR) 10Elukey: [C:03+1] kartotherian: Bump version to include latest libpng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219126 (owner: 10Muehlenhoff)
[10:25:28] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219129 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[10:25:34] <wikibugs>	 (03CR) 10Elukey: [C:03+2] sre.hosts.provision: fix retry logic for the Supermicro BMC password [cookbooks] - 10https://gerrit.wikimedia.org/r/1218315 (https://phabricator.wikimedia.org/T412458) (owner: 10Elukey)
[10:25:39] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Improve port-utilisation alerting to take QoS into account - https://phabricator.wikimedia.org/T384052#11468080 (10ayounsi) We can set the rule now as non-paging to start collecting data and test it. So we can gain trust in it before...
[10:26:54] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[10:26:57] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, and 2 others: root user not on newest batches of supermicro servers. - https://phabricator.wikimedia.org/T412458#11468090 (10elukey) @VRiley-WMF @Jclark-ctr the new code is merged, so you can test it once you have servers ready (I don't want to rush you). Please r...
[10:27:17] <wikibugs>	 (03PS1) 10Filippo Giunchedi: metricsinfra: enable space-based retention up to 85% [puppet] - 10https://gerrit.wikimedia.org/r/1219132 (https://phabricator.wikimedia.org/T412927)
[10:27:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] metricsinfra: enable space-based retention up to 85% [puppet] - 10https://gerrit.wikimedia.org/r/1219132 (https://phabricator.wikimedia.org/T412927) (owner: 10Filippo Giunchedi)
[10:30:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] kartotherian: Bump version to include latest libpng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219126 (owner: 10Muehlenhoff)
[10:30:24] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[10:30:52] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7829/co" [puppet] - 10https://gerrit.wikimedia.org/r/1219132 (https://phabricator.wikimedia.org/T412927) (owner: 10Filippo Giunchedi)
[10:33:03] <logmsgbot>	 !log jmm@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply
[10:33:41] <logmsgbot>	 !log jmm@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply
[10:34:02] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[10:34:29] <logmsgbot>	 !log jmm@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: apply
[10:35:33] <logmsgbot>	 !log jmm@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: apply
[10:35:56] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Capirca: only show diff when running in "non-commit" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1218209 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi)
[10:36:16] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi)
[10:36:19] <logmsgbot>	 !log jmm@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: apply
[10:37:44] <logmsgbot>	 !log jmm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: apply
[10:42:41] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86716 and previous config saved to /var/cache/conftool/dbconfig/20251217-104240-marostegui.json
[10:42:46] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[10:42:47] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[10:44:23] <wikibugs>	 (03PS1) 10Muehlenhoff: Add missing stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1219133
[10:45:17] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Add missing stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1219133 (owner: 10Muehlenhoff)
[10:45:53] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove puppetmaster::backend role and related Hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1219129 (https://phabricator.wikimedia.org/T365798)
[10:47:35] <wikibugs>	 (03PS1) 10Filippo Giunchedi: typos: match .wmet [puppet] - 10https://gerrit.wikimedia.org/r/1219134
[10:50:06] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219129 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[10:51:38] <moritzm>	 !log installing libssh security updates
[10:51:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:49] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P86717 and previous config saved to /var/cache/conftool/dbconfig/20251217-105748-marostegui.json
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1100)
[11:04:05] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1029.eqiad.wmnet with OS trixie
[11:09:20] <wikibugs>	 (03CR) 10Majavah: [C:03+1] metricsinfra: enable space-based retention up to 85% [puppet] - 10https://gerrit.wikimedia.org/r/1219132 (https://phabricator.wikimedia.org/T412927) (owner: 10Filippo Giunchedi)
[11:12:57] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P86718 and previous config saved to /var/cache/conftool/dbconfig/20251217-111257-marostegui.json
[11:14:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] "CI failure will be fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1219134 (only a typo)" [puppet] - 10https://gerrit.wikimedia.org/r/1219132 (https://phabricator.wikimedia.org/T412927) (owner: 10Filippo Giunchedi)
[11:14:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] metricsinfra: enable space-based retention up to 85% [puppet] - 10https://gerrit.wikimedia.org/r/1219132 (https://phabricator.wikimedia.org/T412927) (owner: 10Filippo Giunchedi)
[11:14:40] <wikibugs>	 (03CR) 10Muehlenhoff: "There's some noise in the PCC, which seems to be around stale PCC data, puppetmaster2002 is already gone e.g." [puppet] - 10https://gerrit.wikimedia.org/r/1219129 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[11:18:41] <wikibugs>	 (03CR) 10Silvan Heintze: [C:03+1] "nice - now the symlinks are working in our local dev environment, too 👍" [dumps] - 10https://gerrit.wikimedia.org/r/1218317 (https://phabricator.wikimedia.org/T412726) (owner: 10Jakob)
[11:22:38] <wikibugs>	 (03CR) 10FNegri: [C:03+1] typos: match .wmet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1219134 (owner: 10Filippo Giunchedi)
[11:23:32] <Amir1>	 !log dropped "trash" and "percona" databases in x1
[11:23:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:47] <wikibugs>	 (03PS1) 10Muehlenhoff: spamassassin: Remove OS check [puppet] - 10https://gerrit.wikimedia.org/r/1219137
[11:23:47] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for spamd [puppet] - 10https://gerrit.wikimedia.org/r/1219138 (https://phabricator.wikimedia.org/T135991)
[11:23:58] <wikibugs>	 (03PS2) 10Filippo Giunchedi: typos: match .wmet [puppet] - 10https://gerrit.wikimedia.org/r/1219134
[11:24:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] typos: match .wmet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1219134 (owner: 10Filippo Giunchedi)
[11:25:32] <icinga-wm>	 PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp7009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[11:25:32] <icinga-wm>	 PROBLEM - HAProxy HTTPS upload.wikimedia.org ECDSA on cp7009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[11:25:36] <icinga-wm>	 PROBLEM - haproxy process on cp7009 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[11:26:38] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Remove puppetmaster::backend role and related Hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1219129 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[11:26:45] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219137 (owner: 10Muehlenhoff)
[11:28:06] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86719 and previous config saved to /var/cache/conftool/dbconfig/20251217-112805-marostegui.json
[11:28:11] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2222.codfw.wmnet with reason: Maintenance
[11:28:11] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[11:28:11] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[11:28:19] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2222 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86720 and previous config saved to /var/cache/conftool/dbconfig/20251217-112818-marostegui.json
[11:29:32] <icinga-wm>	 RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp7009 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2026-02-04 04:29:30 +0000 (expires in 48 days) https://wikitech.wikimedia.org/wiki/HTTPS
[11:29:32] <icinga-wm>	 RECOVERY - HAProxy HTTPS upload.wikimedia.org ECDSA on cp7009 is OK: SSL OK - Certificate upload.wikimedia.org contains all required SANs:Certificate upload.wikimedia.org (ECDSA) valid until 2026-01-13 14:24:42 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/HTTPS
[11:29:36] <icinga-wm>	 RECOVERY - haproxy process on cp7009 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[11:30:31] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86721 and previous config saved to /var/cache/conftool/dbconfig/20251217-113031-marostegui.json
[11:31:49] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219137 (owner: 10Muehlenhoff)
[11:31:58] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): throttle: Allow for overriding temp account creation limits (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008112 (https://phabricator.wikimedia.org/T357777) (owner: 10Kosta Harlan)
[11:32:23] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): lift throttle limits for Sing Lit 2025 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: 10Robertsky)
[11:34:05] <urbanecm>	 jouncebot: nowandnext
[11:34:05] <jouncebot>	 For the next 0 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1100)
[11:34:05] <jouncebot>	 In 0 hour(s) and 25 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1200)
[11:35:10] <wikibugs>	 10ops-codfw, 06DC-Ops: Power Supply Redundancy alert on db2247 - https://phabricator.wikimedia.org/T412935 (10FCeratto-WMF) 03NEW
[11:40:36] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Improve port-utilisation alerting to take QoS into account - https://phabricator.wikimedia.org/T384052#11468298 (10fgiunchedi) >>! In T384052#11462541, @cmooney wrote: >  > https://grafana.wikimedia.org/goto/YOk1qBMDg >  > In terms of...
[11:42:26] <moritzm>	 !log installing libsndfile security updates
[11:42:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:45:24] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:45:40] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P86722 and previous config saved to /var/cache/conftool/dbconfig/20251217-114539-marostegui.json
[11:47:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:54:31] <wikibugs>	 (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218294 (owner: 10PipelineBot)
[11:56:18] <wikibugs>	 (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218294 (owner: 10PipelineBot)
[12:00:05] <jouncebot>	 mvolz: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Citoid / Zotero . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1200).
[12:00:48] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P86723 and previous config saved to /var/cache/conftool/dbconfig/20251217-120047-marostegui.json
[12:01:30] <logmsgbot>	 !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply
[12:02:04] <logmsgbot>	 !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[12:04:22] <logmsgbot>	 !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply
[12:04:57] <logmsgbot>	 !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[12:06:47] <logmsgbot>	 !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[12:07:16] <logmsgbot>	 !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[12:07:48] <wikibugs>	 (03PS11) 10Matthieulec: Add new script to export A/A and A/P service types from Cumin hosts. [puppet] - 10https://gerrit.wikimedia.org/r/1216763 (https://phabricator.wikimedia.org/T327663)
[12:08:42] <wikibugs>	 (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214072 (owner: 10PipelineBot)
[12:08:48] <wikibugs>	 (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217229 (owner: 10PipelineBot)
[12:08:55] <wikibugs>	 (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217558 (owner: 10PipelineBot)
[12:09:01] <wikibugs>	 (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217722 (owner: 10PipelineBot)
[12:09:19] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] rest gateway: add smoke tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215605 (owner: 10Daniel Kinzler)
[12:10:24] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[12:15:32] <moritzm>	 !log installing pam security updates
[12:15:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:56] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86724 and previous config saved to /var/cache/conftool/dbconfig/20251217-121556-marostegui.json
[12:16:02] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[12:16:02] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[12:24:11] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[12:43:55] <wikibugs>	 (03PS13) 10Daniel Kinzler: rest gateway: add smoke tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215605
[12:43:59] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] rest gateway: split anon class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217516 (https://phabricator.wikimedia.org/T410379) (owner: 10Daniel Kinzler)
[12:59:13] <wikibugs>	 (03PS8) 10Daniel Kinzler: rest gateway: split anon class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217516 (https://phabricator.wikimedia.org/T410379)
[13:08:35] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm now, should be merged (and tested) in January" [dns] - 10https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[13:09:36] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] rest gateway: add smoke tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215605 (owner: 10Daniel Kinzler)
[13:13:27] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255) (owner: 10Isabelle Hurbain-Palatin)
[13:15:19] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7830/console" [puppet] - 10https://gerrit.wikimedia.org/r/1219137 (owner: 10Muehlenhoff)
[13:15:25] <wikibugs>	 (03PS1) 10Tiziano Fogli: Thanos/Store: add support for multi-instance setup [puppet] - 10https://gerrit.wikimedia.org/r/1219145 (https://phabricator.wikimedia.org/T412924)
[13:15:25] <wikibugs>	 (03PS1) 10Tiziano Fogli: Thanos/Store: add a ruler(s) dedicate store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924)
[13:15:38] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: add smoke tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215605 (owner: 10Daniel Kinzler)
[13:15:42] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: split anon class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217516 (https://phabricator.wikimedia.org/T410379) (owner: 10Daniel Kinzler)
[13:15:54] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Thanos/Store: add support for multi-instance setup [puppet] - 10https://gerrit.wikimedia.org/r/1219145 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli)
[13:15:58] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: 10Isabelle Hurbain-Palatin)
[13:16:29] <wikibugs>	 (03PS2) 10Tiziano Fogli: Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924)
[13:17:32] <wikibugs>	 (03Merged) 10jenkins-bot: rest gateway: add smoke tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215605 (owner: 10Daniel Kinzler)
[13:17:37] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11468560 (10ABran-WMF)
[13:17:39] <wikibugs>	 (03Merged) 10jenkins-bot: rest gateway: split anon class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217516 (https://phabricator.wikimedia.org/T410379) (owner: 10Daniel Kinzler)
[13:21:24] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm, thanks for the cleanup." [puppet] - 10https://gerrit.wikimedia.org/r/1219137 (owner: 10Muehlenhoff)
[13:22:02] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] "The idea is excellent and aligns well with our future plans to add post-upgrade hooks for running smoke tests (as part of T412941). For no" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215605 (owner: 10Daniel Kinzler)
[13:23:30] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11468568 (10ABran-WMF) >>! In T286066#11465434, @Dzahn wrote: > You can remove the "Prepare tcpproxy VMs for accepting traffic on the...
[13:24:26] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1219138 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[13:25:50] <wikibugs>	 (03CR) 10Effie Mouzeli: "rephrase: The idea is excellent and aligns well with potential future plans to add post-upgrade hooks for running smoke tests (for example" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215605 (owner: 10Daniel Kinzler)
[13:25:54] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] spamassassin: Remove OS check [puppet] - 10https://gerrit.wikimedia.org/r/1219137 (owner: 10Muehlenhoff)
[13:25:58] <wikibugs>	 (03CR) 10Jelto: [C:03+2] Enable profile::auto_restarts::service for spamd [puppet] - 10https://gerrit.wikimedia.org/r/1219138 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[13:27:49] <wikibugs>	 (03PS2) 10Robertsky: lift throttle limits for Sing Lit 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820)
[13:27:53] <moritzm>	 !log upgrade Envoy on an-web T410975
[13:27:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:57] <stashbot>	 T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975
[13:28:54] <wikibugs>	 (03CR) 10CI reject: [V:04-1] lift throttle limits for Sing Lit 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: 10Robertsky)
[13:29:26] <wikibugs>	 (03CR) 10Robertsky: lift throttle limits for Sing Lit 2025 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: 10Robertsky)
[13:29:33] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[13:31:02] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11468599 (10cmooney) >>! In T412807#11465779, @elukey wrote: > @cmooney I am +1 on testing something like `d-i netcfg/link_wait_timeout string 10`...
[13:31:57] <wikibugs>	 (03PS3) 10Robertsky: lift throttle limits for Sing Lit 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820)
[13:32:12] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for the clamav [puppet] - 10https://gerrit.wikimedia.org/r/1219147 (https://phabricator.wikimedia.org/T135991)
[13:32:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] lift throttle limits for Sing Lit 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: 10Robertsky)
[13:32:42] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[13:34:35] <wikibugs>	 (03PS1) 10Jelto: lists: remove duplicate spamd auto restart [puppet] - 10https://gerrit.wikimedia.org/r/1219148
[13:35:07] <wikibugs>	 (03PS4) 10Robertsky: lift throttle limits for Sing Lit 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820)
[13:35:32] <wikibugs>	 (03CR) 10Jelto: "puppet fails with" [puppet] - 10https://gerrit.wikimedia.org/r/1219148 (owner: 10Jelto)
[13:35:52] <wikibugs>	 (03PS2) 10Tiziano Fogli: Thanos/Store: add support for multi-instance setup [puppet] - 10https://gerrit.wikimedia.org/r/1219145 (https://phabricator.wikimedia.org/T412924)
[13:35:54] <wikibugs>	 (03PS3) 10Tiziano Fogli: Thanos/Store: add a ruler(s)-dedicated store gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924)
[13:36:23] <moritzm>	 !log installing apache2 security updates
[13:36:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:56] <icinga-wm>	 PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:37:09] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11468622 (10ABran-WMF)
[13:37:38] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7831/console" [puppet] - 10https://gerrit.wikimedia.org/r/1219148 (owner: 10Jelto)
[13:38:58] <wikibugs>	 (03PS1) 10Majavah: spec: Stop running tests on buster [puppet] - 10https://gerrit.wikimedia.org/r/1219149
[13:39:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good, sorry for missing that" [puppet] - 10https://gerrit.wikimedia.org/r/1219148 (owner: 10Jelto)
[13:39:42] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7832/co" [puppet] - 10https://gerrit.wikimedia.org/r/1218808 (owner: 10Majavah)
[13:40:29] <wikibugs>	 (03PS2) 10Muehlenhoff: Enable profile::auto_restarts::service for clamav [puppet] - 10https://gerrit.wikimedia.org/r/1219147 (https://phabricator.wikimedia.org/T135991)
[13:40:42] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] lists: remove duplicate spamd auto restart [puppet] - 10https://gerrit.wikimedia.org/r/1219148 (owner: 10Jelto)
[13:40:56] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1216763 (https://phabricator.wikimedia.org/T327663) (owner: 10Matthieulec)
[13:41:18] <wikibugs>	 (03CR) 10CI reject: [V:04-1] spec: Stop running tests on buster [puppet] - 10https://gerrit.wikimedia.org/r/1219149 (owner: 10Majavah)
[13:42:26] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11468628 (10ABran-WMF)
[13:44:40] <moritzm>	 !log upgrade Envoy on grafana* T410975
[13:44:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:44] <stashbot>	 T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975
[13:45:54] <logmsgbot>	 !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[13:46:56] <icinga-wm>	 RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:47:15] <logmsgbot>	 !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[13:47:40] <wikibugs>	 (03PS2) 10Majavah: spec: Stop running tests on buster [puppet] - 10https://gerrit.wikimedia.org/r/1219149
[13:52:56] <logmsgbot>	 !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[13:53:19] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "Neat" [puppet] - 10https://gerrit.wikimedia.org/r/1218808 (owner: 10Majavah)
[13:53:28] <logmsgbot>	 !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[13:53:32] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] P:mail::smarthost: Remove NRPE monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1218808 (owner: 10Majavah)
[13:55:54] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): lift throttle limits for Sing Lit 2025 (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: 10Robertsky)
[13:57:31] <wikibugs>	 (03CR) 10Tiziano Fogli: "I tested it on Pontoon. The catalog was applied without errors and gave me the following two processes:" [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli)
[13:58:42] <wikibugs>	 (03PS2) 10Daniel Kinzler: rest-gateway: log x-wmf-* headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219123
[13:58:56] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11468667 (10MoritzMuehlenhoff) >>! In T412807#11468599, @cmooney wrote: > @elukey yeah it probably won't work but it's worth a throw of the dice....
[13:59:28] <wikibugs>	 (03PS5) 10Robertsky: lift throttle limits for Sing Lit 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820)
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1400).
[14:00:05] <jouncebot>	 Robertsky, Tran, and cscott: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:13] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11468678 (10cmooney) >>! In T412807#11468667, @MoritzMuehlenhoff wrote: > We don't configure netcfg/link_wait_timeout ourselves, 10 is the built-i...
[14:00:13] <robertsky>	 o/
[14:00:43] <wikibugs>	 (03CR) 10Robertsky: lift throttle limits for Sing Lit 2025 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: 10Robertsky)
[14:01:32] <Lucas_WMDE>	 o/
[14:02:17] <robertsky>	 will need help with deploying.
[14:02:20] <cscott>	 o/
[14:02:20] <Lucas_WMDE>	 I can deploy
[14:02:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: 10Robertsky)
[14:02:50] <robertsky>	 thanks! 
[14:03:36] <wikibugs>	 (03Merged) 10jenkins-bot: lift throttle limits for Sing Lit 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: 10Robertsky)
[14:04:07] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1218853|lift throttle limits for Sing Lit 2025 (T412820)]]
[14:04:11] <stashbot>	 T412820: Requesting temporary lift of IP cap for editathon on 27 Dec 2025 - https://phabricator.wikimedia.org/T412820
[14:06:20] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, robertsky: Backport for [[gerrit:1218853|lift throttle limits for Sing Lit 2025 (T412820)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:06:50] <Lucas_WMDE>	 robertsky: anything to test on mwdebug for this change?
[14:06:54] <robertsky>	 push ahead, changes can't be verified until the day. 
[14:06:59] <Lucas_WMDE>	 yeah, makes sense
[14:07:01] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, robertsky: Continuing with sync
[14:09:26] <moritzm>	 !log installing pdns-recursor security updates
[14:09:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:18] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218853|lift throttle limits for Sing Lit 2025 (T412820)]] (duration: 07m 10s)
[14:11:22] <stashbot>	 T412820: Requesting temporary lift of IP cap for editathon on 27 Dec 2025 - https://phabricator.wikimedia.org/T412820
[14:11:39] <Lucas_WMDE>	 I don’t see Tran yet
[14:11:43] <Lucas_WMDE>	 cscott: want to continue with your config change?
[14:12:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.47% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:12:43] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11468717 (10ABran-WMF)
[14:13:14] <robertsky>	 thanks! signing off. gotta get that dinner. ciao.
[14:13:19] <Lucas_WMDE>	 see you!
[14:13:30] <phuedx>	 Lucas_WMDE: Tran is on their way
[14:13:38] <Lucas_WMDE>	 hi Tran :)
[14:13:42] <Tran>	 👋 hi hi I'm a little late to the party, so sorry I was distracted by a meeting
[14:13:52] <Lucas_WMDE>	 no problem, we just finished deploying another change
[14:13:55] <Lucas_WMDE>	 do you want to deploy yours now?
[14:14:04] <Tran>	 yes please! Would you like me to or are you already there?
[14:14:11] <Lucas_WMDE>	 either works for me
[14:14:21] <cscott>	 Lucas_WMDE: i can wait (sorry, i was distracted)
[14:14:25] <Tran>	 I wouldn't say no if you did it :p
[14:14:29] <Lucas_WMDE>	 alright, sure ^^
[14:14:36] <Tran>	 🙇
[14:14:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) (owner: 10STran)
[14:16:02] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] rest-gateway: log x-wmf-* headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219123 (owner: 10Daniel Kinzler)
[14:16:13] <wikibugs>	 (03Merged) 10jenkins-bot: Enable v2 non-emergency workflow by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) (owner: 10STran)
[14:16:45] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1207845|Enable v2 non-emergency workflow by default (T410512 T412715)]]
[14:16:51] <stashbot>	 T410512: Add support for maintaining legacy non-emergency flow during transition to v2 - https://phabricator.wikimedia.org/T410512
[14:16:51] <stashbot>	 T412715: Deploy Incident Reporting System to test2wiki - https://phabricator.wikimedia.org/T412715
[14:17:36] <wikibugs>	 (03PS1) 10Cathal Mooney: Trixie d-i preseed file: increase link_wait_timeout [puppet] - 10https://gerrit.wikimedia.org/r/1219154 (https://phabricator.wikimedia.org/T412807)
[14:18:19] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: log x-wmf-* headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219123 (owner: 10Daniel Kinzler)
[14:18:23] <moritzm>	 !log installing redis security updates
[14:18:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1219154 (https://phabricator.wikimedia.org/T412807) (owner: 10Cathal Mooney)
[14:19:01] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 stran, lucaswerkmeister-wmde: Backport for [[gerrit:1207845|Enable v2 non-emergency workflow by default (T410512 T412715)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:19:17] <Lucas_WMDE>	 Tran: can you test the change?
[14:19:22] <Tran>	 Yes, on it
[14:19:31] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Trixie d-i preseed file: increase link_wait_timeout [puppet] - 10https://gerrit.wikimedia.org/r/1219154 (https://phabricator.wikimedia.org/T412807) (owner: 10Cathal Mooney)
[14:22:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:22:33] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[14:22:49] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[14:27:08] <Tran>	 Hm...I made the assumption that the train rolled out to group 1 today but it looks like there was a blocker
[14:28:28] <Lucas_WMDE>	 ah
[14:28:40] <Tran>	 I think this config can't go through
[14:29:00] <Lucas_WMDE>	 I don’t see a blocker in https://phabricator.wikimedia.org/T408277 but maybe they’re using the primary time slot this week https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1900
[14:29:09] <Lucas_WMDE>	 ok, so abort sync and revert?
[14:29:30] <Tran>	 Yes I think so, sorry I should have confirmed (and will do so next time before scheduling the config change again)
[14:29:37] <Lucas_WMDE>	 alright
[14:29:42] <Lucas_WMDE>	 we can deploy the revert together with cscott’s change then
[14:29:44] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Sync cancelled.
[14:29:49] <cscott>	 works for me.
[14:30:15] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Revert "Enable v2 non-emergency workflow by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219158 (https://phabricator.wikimedia.org/T410512)
[14:30:26] <Lucas_WMDE>	 cscott: do you want to deploy or should I?
[14:30:26] <wikibugs>	 (03PS1) 10STran: Revert "Enable v2 non-emergency workflow by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219159
[14:30:45] <cscott>	 i'm going to let you do it, since it is being combined with the revert
[14:30:50] <Lucas_WMDE>	 ok
[14:30:54] <Tran>	 oh you made the revert, thank you 🙇 I'll abandon mine
[14:31:02] <Lucas_WMDE>	 ah, ok :D
[14:31:04] <cscott>	 anything that happens after an aborted scap makes me nervous. ;)
[14:31:09] <wikibugs>	 (03Abandoned) 10STran: Revert "Enable v2 non-emergency workflow by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219159 (owner: 10STran)
[14:31:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219158 (https://phabricator.wikimedia.org/T410512) (owner: 10Lucas Werkmeister (WMDE))
[14:31:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255) (owner: 10Isabelle Hurbain-Palatin)
[14:31:40] <Lucas_WMDE>	 Tran: are there any potential errors we should look out for during the revert deploy?
[14:32:18] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Enable v2 non-emergency workflow by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219158 (https://phabricator.wikimedia.org/T410512) (owner: 10Lucas Werkmeister (WMDE))
[14:32:26] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Improve port-utilisation alerting to take QoS into account - https://phabricator.wikimedia.org/T384052#11468772 (10cmooney) >>! In T384052#11468080, @ayounsi wrote: > We can set the rule now as non-paging to start collecting data and...
[14:32:33] <wikibugs>	 (03Merged) 10jenkins-bot: Activate post-processing cache on some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255) (owner: 10Isabelle Hurbain-Palatin)
[14:33:04] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1219158|Revert "Enable v2 non-emergency workflow by default" (T410512 T412715)]], [[gerrit:1218806|Activate post-processing cache on some wikis (T348255)]]
[14:33:11] <stashbot>	 T410512: Add support for maintaining legacy non-emergency flow during transition to v2 - https://phabricator.wikimedia.org/T410512
[14:33:11] <stashbot>	 T412715: Deploy Incident Reporting System to test2wiki - https://phabricator.wikimedia.org/T412715
[14:33:11] <stashbot>	 T348255: Parser cache infrastructure for OutputTransform - https://phabricator.wikimedia.org/T348255
[14:33:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: Thanos/Store: add a ruler(s)-dedicated store gateway (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli)
[14:34:23] <Tran>	 No, the config shouldn't have had any effect as the critical fields it would have enabled access to weren't deployed yet and it was meant to fallback gracefully
[14:34:29] <Lucas_WMDE>	 ok
[14:34:32] <Lucas_WMDE>	 thanks
[14:35:18] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, ihurbain: Backport for [[gerrit:1219158|Revert "Enable v2 non-emergency workflow by default" (T410512 T412715)]], [[gerrit:1218806|Activate post-processing cache on some wikis (T348255)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:35:46] <Lucas_WMDE>	 cscott: can you test your change?
[14:37:48] <cscott>	 yup, testing
[14:41:12] <moritzm>	 !log installing tiff security updates
[14:41:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:36] <icinga-wm>	 PROBLEM - haproxy process on cp7009 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[14:42:38] <icinga-wm>	 RECOVERY - haproxy process on cp7009 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[14:44:04] <cscott>	 Lucas_WMDE: still checking
[14:44:12] <Lucas_WMDE>	 ack
[14:46:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability: Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11468866 (10herron)
[14:47:16] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q2): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11468868 (10herron)
[14:47:18] <cscott>	 Lucas_WMDE: ok, looks good
[14:47:38] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06SRE Observability: Q2:rack/setup/install mwlog2003 - https://phabricator.wikimedia.org/T412229#11468869 (10herron)
[14:47:47] <Lucas_WMDE>	 ok, thanks!
[14:47:50] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, ihurbain: Continuing with sync
[14:47:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q2): Q2:rack/setup/install mwlog2003 - https://phabricator.wikimedia.org/T412229#11468871 (10herron)
[14:48:13] <Lucas_WMDE>	 hm, there’s one warning in mwdebug logstash
[14:48:17] <Lucas_WMDE>	 Pool key 'simplewiki:parsoid-pcache:232335:|#|:idhash:useParsoid=1:revid:10648812' (ArticleView): Usage error: You may only aquire a single non-nowait lock.
[14:48:19] <Lucas_WMDE>	 is that relevant?
[14:49:22] <cscott>	 i was testing just now on simplewiki, let me see if i can reproduce that
[14:49:48] * Lucas_WMDE searches further back in time
[14:49:53] <Lucas_WMDE>	 ok it’s happened before, at least
[14:50:12] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[14:50:33] <cscott>	 i'm wondering if it happens on purge, because that's part of what i did on simplewiki to test the new cache mechanism.
[14:50:39] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[14:51:03] <wikibugs>	 (03CR) 10Kamila Součková: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1216763 (https://phabricator.wikimedia.org/T327663) (owner: 10Matthieulec)
[14:51:32] <Lucas_WMDE>	 one of the previous logstash hits was apparently a purge too
[14:51:38] <Lucas_WMDE>	 https://logstash.wikimedia.org/app/dashboards#/doc/logstash-*/logstash-mediawiki-1-7.0.0-1-2025.12.15?id=EP9bIpsBVE0pYbVvzWpE
[14:51:41] <Lucas_WMDE>	 judging by its referrer
[14:51:49] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1219158|Revert "Enable v2 non-emergency workflow by default" (T410512 T412715)]], [[gerrit:1218806|Activate post-processing cache on some wikis (T348255)]] (duration: 18m 45s)
[14:51:50] <Lucas_WMDE>	 but the others weren’t
[14:51:56] <stashbot>	 T410512: Add support for maintaining legacy non-emergency flow during transition to v2 - https://phabricator.wikimedia.org/T410512
[14:51:57] <stashbot>	 T412715: Deploy Incident Reporting System to test2wiki - https://phabricator.wikimedia.org/T412715
[14:51:57] <stashbot>	 T348255: Parser cache infrastructure for OutputTransform - https://phabricator.wikimedia.org/T348255
[14:52:33] <cscott>	 anything before that?  the postprocessing cache is also enabled on idwiki, which is where that message came from.
[14:52:37] <wikibugs>	 (03PS1) 10Clément Goubert: mediawiki::periodic_job: Add mesh_check_skip [puppet] - 10https://gerrit.wikimedia.org/r/1219161 (https://phabricator.wikimedia.org/T412818)
[14:53:09] <Lucas_WMDE>	 one result on 4 December
[14:53:30] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[14:53:38] <Lucas_WMDE>	 nothing earlier in the last 90 days, at least in mwdebug logstash
[14:53:42] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[14:54:15] <cscott>	 i think this is fine to continue to deploy since it didn't result in any user-visible errors, and the 4 dec predates our code
[14:54:15] <Lucas_WMDE>	 oh. it’s… rather common in non-mwdebug logstash, if you remove the “error channels” requirement
[14:54:23] <Lucas_WMDE>	 580259 hits in the last 24 hours
[14:54:32] <cscott>	 we didn't deploy to idwiki until 15 dec (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1217768)
[14:54:49] <Lucas_WMDE>	 the 4 Dec one was officewiki
[14:55:40] <cscott>	 hm, that is more suspicious: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1215115 was the officewiki deploy and it was 4 dec.
[14:55:57] <Lucas_WMDE>	 logstash link for the 580k messages: https://logstash.wikimedia.org/goto/28dd0fe40e67c1a37039eeb6f4456f16
[14:56:20] <Lucas_WMDE>	 almost all of those are on idwiki (560k)
[14:56:28] <Lucas_WMDE>	 then 4k on testwiki, 4k on dewiki, 2k on thwiki
[14:56:47] * Lucas_WMDE goes back in time and hopes logstash won’t melt
[14:57:22] <Lucas_WMDE>	 yeah that definitely looks like a very sharp uptick
[14:57:28] <Lucas_WMDE>	 on 15 December
[14:58:09] <Lucas_WMDE>	 I think there’s some background noise in the poolcounter channel, but most of the "You may only acquire a single non-nowait lock" messages are likely due to the postprocessing cache
[14:58:16] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218813 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[14:58:20] <wikibugs>	 (03PS1) 10Clément Goubert: campaignevents: Skip mesh check in aggregateanswers [puppet] - 10https://gerrit.wikimedia.org/r/1219162 (https://phabricator.wikimedia.org/T412818)
[14:58:25] <Lucas_WMDE>	 should I make a task or are you going to?
[14:58:38] <wikibugs>	 (03PS2) 10Clément Goubert: campaignevents: Skip mesh check in aggregateanswers [puppet] - 10https://gerrit.wikimedia.org/r/1219162 (https://phabricator.wikimedia.org/T412544)
[14:58:42] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219161 (https://phabricator.wikimedia.org/T412818) (owner: 10Clément Goubert)
[14:58:50] <cscott>	 plenty of hits before dec 1, but all of those seem to be on Special:Contributions.  So that seems like a different bug.
[14:59:15] <wikibugs>	 (03CR) 10CI reject: [V:04-1] campaignevents: Skip mesh check in aggregateanswers [puppet] - 10https://gerrit.wikimedia.org/r/1219162 (https://phabricator.wikimedia.org/T412544) (owner: 10Clément Goubert)
[14:59:42] <moritzm>	 !log installing nodejs security updates
[14:59:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:44] <cscott>	 Lucas_WMDE: can you make the bug?  I think we're still okay with the deploy, it's on small wikis and I believe what's happening is that we're doing a recursive lock acquisition, but the outer lock is sufficient for what we're doing.  so it's a usage error but not a practical bug.
[14:59:51] <Lucas_WMDE>	 ok
[15:00:01] <wikibugs>	 (03PS1) 10Bking: stat hosts: remove load average alerts [alerts] - 10https://gerrit.wikimedia.org/r/1219163 (https://phabricator.wikimedia.org/T401589)
[15:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1500)
[15:00:11] <cscott>	 but i'll hold off rolling out the postprocessing cache further until we better understand this & to prevent further logspam.
[15:00:14] <wikibugs>	 (03PS3) 10Clément Goubert: campaignevents: Skip mesh check in aggregateanswers [puppet] - 10https://gerrit.wikimedia.org/r/1219162 (https://phabricator.wikimedia.org/T412818)
[15:00:57] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510#11468976 (10ayounsi)
[15:01:12] <cscott>	 if you're making a bug for the recent messages, i'll make a bug for the pre-dec-3 messages (Special:Contributions) which look unrelated
[15:01:19] <wikibugs>	 (03PS1) 10Krinkle: scap: Remove unused php7_admin_port option [puppet] - 10https://gerrit.wikimedia.org/r/1219164 (https://phabricator.wikimedia.org/T224491)
[15:01:32] <wikibugs>	 (03PS3) 10Daimona Eaytoy: Stop setting $wgCampaignEventsEnableContributionTracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210716 (https://phabricator.wikimedia.org/T410939)
[15:01:41] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[15:01:53] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[15:02:01] <wikibugs>	 (03CR) 10Daimona Eaytoy: "(Memo: waiting for 1.46.0-wmf.7)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210716 (https://phabricator.wikimedia.org/T410939) (owner: 10Daimona Eaytoy)
[15:03:08] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06Release-Engineering-Team, 06serviceops: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11469007 (10MatthewVernon)
[15:03:15] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[15:03:42] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[15:04:08] <Lucas_WMDE>	 cscott: created T412959
[15:04:09] <stashbot>	 T412959: Logstash poolcounter warnings "Usage error: You may only aquire a single non-nowait lock" on wikis with post-processing cache enabled - https://phabricator.wikimedia.org/T412959
[15:04:12] <icinga-wm>	 PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 100%
[15:04:15] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[15:04:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:40] <icinga-wm>	 RECOVERY - Host wikikube-worker1275 is UP: PING WARNING - Packet loss = 0%, RTA = 546.57 ms
[15:05:02] <cscott>	 Lucas_WMDE: ok, and I created T412960 for the pre-dec 4 instances.
[15:05:03] <stashbot>	 T412960: Pool key 'dewiki:SpecialContributions:a:127.0.0.1' (SpecialContributions): Usage error: You may only aquire a single non-nowait lock. - https://phabricator.wikimedia.org/T412960
[15:05:10] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[15:05:11] <Lucas_WMDE>	 thanks!
[15:05:11] <cscott>	 Lucas_WMDE: thanks!
[15:05:37] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[15:06:53] <XioNoX>	 !log add AAAA record to restbase1031.eqiad.wmnet - T271140
[15:06:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:58] <stashbot>	 T271140: Some Data Persistence clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271140
[15:07:29] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[15:07:37] <wikibugs>	 (03PS2) 10Clément Goubert: mediawiki::periodic_job: Add mesh_check_skip [puppet] - 10https://gerrit.wikimedia.org/r/1219161 (https://phabricator.wikimedia.org/T412818)
[15:07:37] <wikibugs>	 (03PS4) 10Clément Goubert: campaignevents: Skip mesh check in aggregateanswers [puppet] - 10https://gerrit.wikimedia.org/r/1219162 (https://phabricator.wikimedia.org/T412818)
[15:08:29] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219161 (https://phabricator.wikimedia.org/T412818) (owner: 10Clément Goubert)
[15:09:11] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:10:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:11:38] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie
[15:11:42] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.move-vlan for host es2028
[15:11:42] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host es2028
[15:11:52] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11469092 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host es2028.codfw.wmnet with OS trixie
[15:11:58] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add AAAA to restbase1031 - ayounsi@cumin1003"
[15:12:03] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add AAAA to restbase1031 - ayounsi@cumin1003"
[15:12:03] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:12:29] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache restbase1031.eqiad.wmnet on all recursors
[15:12:33] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase1031.eqiad.wmnet on all recursors
[15:13:20] <wikibugs>	 (03CR) 10Dr0ptp4kt: [C:03+2] stat hosts: remove load average alerts [alerts] - 10https://gerrit.wikimedia.org/r/1219163 (https://phabricator.wikimedia.org/T401589) (owner: 10Bking)
[15:14:26] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219162 (https://phabricator.wikimedia.org/T412818) (owner: 10Clément Goubert)
[15:14:34] <wikibugs>	 (03Merged) 10jenkins-bot: stat hosts: remove load average alerts [alerts] - 10https://gerrit.wikimedia.org/r/1219163 (https://phabricator.wikimedia.org/T401589) (owner: 10Bking)
[15:17:51] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1219147 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[15:27:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11469199 (10cmooney) Hmmm so this didn't work, but also I see in the log file it still only waited 3 seconds (and indeed that is shorter than the...
[15:28:35] <moritzm>	 !log upgrade Envoy on etherpad* T410975
[15:28:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:39] <stashbot>	 T410975: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975
[15:29:08] <wikibugs>	 (03PS1) 10Scott French: php8.3: rebuild to pick up new PHP packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1219169
[15:29:22] <wikibugs>	 (03PS1) 10FNegri: P:openstack::base::opentofu: specify git branch [puppet] - 10https://gerrit.wikimedia.org/r/1219170 (https://phabricator.wikimedia.org/T373815)
[15:29:37] <wikibugs>	 10SRE-swift-storage, 10Ceph, 06serviceops, 06Release-Engineering-Team (Radar): Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11469210 (10thcipriani)
[15:29:51] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:openstack::base::opentofu: specify git branch [puppet] - 10https://gerrit.wikimedia.org/r/1219170 (https://phabricator.wikimedia.org/T373815) (owner: 10FNegri)
[15:30:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1500)
[15:30:05] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1530)
[15:30:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:32:19] <wikibugs>	 (03PS1) 10Muehlenhoff: Record LDAP access for bmartinez [puppet] - 10https://gerrit.wikimedia.org/r/1219172
[15:34:11] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:35:23] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] Add new script to export A/A and A/P service types from Cumin hosts. [puppet] - 10https://gerrit.wikimedia.org/r/1216763 (https://phabricator.wikimedia.org/T327663) (owner: 10Matthieulec)
[15:37:07] <wikibugs>	 (03PS2) 10Muehlenhoff: Record LDAP access for bmartinez [puppet] - 10https://gerrit.wikimedia.org/r/1219172
[15:40:13] <wikibugs>	 (03PS2) 10FNegri: P:openstack::base::opentofu: specify git branch [puppet] - 10https://gerrit.wikimedia.org/r/1219170 (https://phabricator.wikimedia.org/T373815)
[15:42:10] <wikibugs>	 (03PS1) 10Muehlenhoff: Pass link_wait_timeout tab-separated [puppet] - 10https://gerrit.wikimedia.org/r/1219174
[15:42:16] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Enable profile::auto_restarts::service for clamav [puppet] - 10https://gerrit.wikimedia.org/r/1219147 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[15:45:24] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:45:48] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] Pass link_wait_timeout tab-separated [puppet] - 10https://gerrit.wikimedia.org/r/1219174 (owner: 10Muehlenhoff)
[15:45:49] <logmsgbot>	 !log eevans@cumin1003 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=restbase,service=restbase-*
[15:46:35] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Pass link_wait_timeout tab-separated [puppet] - 10https://gerrit.wikimedia.org/r/1219174 (owner: 10Muehlenhoff)
[15:47:14] <wikibugs>	 (03PS2) 10Muehlenhoff: Pass link_wait_timeout tab-separated [puppet] - 10https://gerrit.wikimedia.org/r/1219174
[15:47:29] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7833/co" [puppet] - 10https://gerrit.wikimedia.org/r/1219170 (https://phabricator.wikimedia.org/T373815) (owner: 10FNegri)
[15:47:53] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+1] P:openstack::base::opentofu: specify git branch [puppet] - 10https://gerrit.wikimedia.org/r/1219170 (https://phabricator.wikimedia.org/T373815) (owner: 10FNegri)
[15:49:45] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11469293 (10MoritzMuehlenhoff)
[15:49:53] <logmsgbot>	 !log eevans@cumin1003 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=restbase,service=restbase-backend
[15:50:25] <logmsgbot>	 !log eevans@cumin1003 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=restbase,service=restbase-https
[15:50:52] <logmsgbot>	 !log eevans@cumin1003 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=restbase,service=restbase-ssl
[15:51:47] <wikibugs>	 (03CR) 10FNegri: [C:03+2] P:openstack::base::opentofu: specify git branch [puppet] - 10https://gerrit.wikimedia.org/r/1219170 (https://phabricator.wikimedia.org/T373815) (owner: 10FNegri)
[15:53:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for bmartinez [puppet] - 10https://gerrit.wikimedia.org/r/1219172 (owner: 10Muehlenhoff)
[15:53:46] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+1] scap: Remove unused php7_admin_port option [puppet] - 10https://gerrit.wikimedia.org/r/1219164 (https://phabricator.wikimedia.org/T224491) (owner: 10Krinkle)
[15:54:04] <wikibugs>	 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Propose a new set of standard thumbnail sizes - https://phabricator.wikimedia.org/T412971 (10MatthewVernon) 03NEW
[15:54:11] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:03:19] <wikibugs>	 10ops-codfw, 06SRE, 07sre-alert-triage, 06DC-Ops, 06Infrastructure-Foundations: Alert in need of triage: SmartNotHealthy (instance sretest2006:9100) - https://phabricator.wikimedia.org/T412078#11469413 (10Jhancock.wm) if you zoom out to half a year, this alert has been active since the end of July. Could...
[16:04:12] <wikibugs>	 (03CR) 10Ahmon Dancy: "This change has broken puppet on deployment-mx03.deployment-prep. I filed https://phabricator.wikimedia.org/T412975" [puppet] - 10https://gerrit.wikimedia.org/r/1219137 (owner: 10Muehlenhoff)
[16:09:12] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] team-sre/mw-cron: Improve dashboard and description [alerts] - 10https://gerrit.wikimedia.org/r/1218756 (https://phabricator.wikimedia.org/T412799) (owner: 10Clément Goubert)
[16:10:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:10:24] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[16:10:47] <wikibugs>	 (03Merged) 10jenkins-bot: team-sre/mw-cron: Improve dashboard and description [alerts] - 10https://gerrit.wikimedia.org/r/1218756 (https://phabricator.wikimedia.org/T412799) (owner: 10Clément Goubert)
[16:12:01] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "tested starting the new service on vrts2002 - looks fine" [puppet] - 10https://gerrit.wikimedia.org/r/1219147 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[16:15:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:16:50] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11469513 (10Clement_Goubert)
[16:16:58] <wikibugs>	 (03CR) 10Scott French: [C:03+1] mediawiki::periodic_job: Add mesh_check_skip [puppet] - 10https://gerrit.wikimedia.org/r/1219161 (https://phabricator.wikimedia.org/T412818) (owner: 10Clément Goubert)
[16:18:07] <wikibugs>	 (03CR) 10Scott French: [C:03+1] campaignevents: Skip mesh check in aggregateanswers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1219162 (https://phabricator.wikimedia.org/T412818) (owner: 10Clément Goubert)
[16:18:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11469518 (10Clement_Goubert) Updated racking plan to: - Row A: 0 - **Row B: 2** - **Row C: 3** - **Row D: 6** - **Row E: 1** - **Row F: 1**  This would still leave us with A...
[16:24:49] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Power Supply Redundancy alert on db2247 - https://phabricator.wikimedia.org/T412935#11469569 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated cable. alert has cleared.
[16:25:24] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[16:27:32] <wikibugs>	 (03PS1) 10Dzahn: mx/spamassassin: allow overriding sa daemon package name in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975)
[16:29:36] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+1] mx/spamassassin: allow overriding sa daemon package name in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975) (owner: 10Dzahn)
[16:30:32] <icinga-wm>	 PROBLEM - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp7009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[16:30:32] <icinga-wm>	 PROBLEM - HAProxy HTTPS upload.wikimedia.org ECDSA on cp7009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[16:30:38] <icinga-wm>	 PROBLEM - haproxy process on cp7009 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[16:31:28] <topranks>	 ^^ I think this may be due to fabfur testing
[16:31:33] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.reboot-single for host restbase1031.eqiad.wmnet
[16:34:18] <wikibugs>	 (03CR) 10Dzahn: [V:04-1] "you get the idea. would have to fix this though: https://puppet-compiler.wmflabs.org/output/1219180/7835/lists1004.wikimedia.org/change.li" [puppet] - 10https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975) (owner: 10Dzahn)
[16:34:34] <icinga-wm>	 RECOVERY - HAProxy HTTPS upload.wikimedia.org ECDSA on cp7009 is OK: SSL OK - Certificate upload.wikimedia.org contains all required SANs:Certificate upload.wikimedia.org (ECDSA) valid until 2026-01-13 14:24:42 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/HTTPS
[16:34:34] <icinga-wm>	 RECOVERY - HAProxy HTTPS measure-eqiad.wikimedia.org ECDSA on cp7009 is OK: SSL OK - Certificate measure-eqiad.wikimedia.org contains all required SANs:Certificate measure-eqiad.wikimedia.org (ECDSA) valid until 2026-02-04 04:29:30 +0000 (expires in 48 days) https://wikitech.wikimedia.org/wiki/HTTPS
[16:34:36] <icinga-wm>	 RECOVERY - haproxy process on cp7009 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[16:35:24] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service restbase1031-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:35:59] <wikibugs>	 (03PS2) 10Dzahn: mx/spamassassin: allow overriding sa daemon package name in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975)
[16:38:08] <wikibugs>	 (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1219180/7836/" [puppet] - 10https://gerrit.wikimedia.org/r/1219180 (https://phabricator.wikimedia.org/T412975) (owner: 10Dzahn)
[16:38:11] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1031.eqiad.wmnet
[16:39:11] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:40:24] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:44:11] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[16:53:21] <wikibugs>	 (03PS1) 10Fabfur: P:cache::haproxy: TOS for video files [puppet] - 10https://gerrit.wikimedia.org/r/1219182 (https://phabricator.wikimedia.org/T412785)
[16:54:50] <wikibugs>	 10ops-codfw, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T412983 (10phaultfinder) 03NEW
[16:56:32] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] php8.3: rebuild to pick up new PHP packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1219169 (owner: 10Scott French)
[17:00:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:01:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:01:39] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] P:cache::haproxy: TOS for video files [puppet] - 10https://gerrit.wikimedia.org/r/1219182 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur)
[17:01:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[17:02:04] <topranks>	 that is all we need :D
[17:02:17] <fabfur>	 :|
[17:03:59] <fabfur>	 !log enabling puppet and repooling cp7009 (T412785)
[17:04:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:04] <stashbot>	 T412785: Enable QoS for upload video files - https://phabricator.wikimedia.org/T412785
[17:04:39] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] P:cache::haproxy: TOS for video files [puppet] - 10https://gerrit.wikimedia.org/r/1219182 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur)
[17:05:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:06:07] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11469757 (10Papaul) Ticket 05304338 has been submitted with Nokia
[17:06:22] <swfrench-wmf>	 jouncebot: nowandnext
[17:06:22] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 53 minute(s)
[17:06:22] <jouncebot>	 In 0 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1800)
[17:07:44] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11469771 (10cmooney) Hi @VRiley-WMF just to be aware please try to spread these as much as is practical evenly across the racks in each row.  The "row-wide" view is sort of...
[17:08:37] <logmsgbot>	 !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp7009.*
[17:09:28] <wikibugs>	 (03PS1) 10Fabfur: hiera: enable video tos on cp7009 [puppet] - 10https://gerrit.wikimedia.org/r/1219185 (https://phabricator.wikimedia.org/T412785)
[17:09:33] <swfrench-wmf>	 FYI, during the upcoming infra window, I'll be releasing some changes that will incur a full mediawiki image rebuild and deployment. depending on how quiet things are by ~ 17:20 UTC, I might get that (long) process started on the early side.
[17:10:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:11:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:11:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:11:39] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[17:12:41] <swfrench-wmf>	 !log reprepro include php8.3_8.3.28-1+wmf11u2 in component/php83
[17:12:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:10] <wikibugs>	 (03CR) 10JavierMonton: [C:03+1] Add javiermonton to kafka-jumbo-access group [puppet] - 10https://gerrit.wikimedia.org/r/1218337 (https://phabricator.wikimedia.org/T411774) (owner: 10Muehlenhoff)
[17:16:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:16:19] <wikibugs>	 (03CR) 10Scott French: [V:03+2] "`" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1219169 (owner: 10Scott French)
[17:16:38] <wikibugs>	 (03CR) 10Scott French: [V:03+2] "Thanks for the review!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1219169 (owner: 10Scott French)
[17:16:49] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1219185 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur)
[17:17:18] <wikibugs>	 (03CR) 10Scott French: [V:03+2 C:03+2] php8.3: rebuild to pick up new PHP packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1219169 (owner: 10Scott French)
[17:18:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:20:05] <wikibugs>	 (03PS1) 10Fabfur: hiera: enable video tos on upload@magru [puppet] - 10https://gerrit.wikimedia.org/r/1219186 (https://phabricator.wikimedia.org/T412785)
[17:22:21] <wikibugs>	 (03PS1) 10Fabfur: hiera: enable video tos on cache upload [puppet] - 10https://gerrit.wikimedia.org/r/1219187 (https://phabricator.wikimedia.org/T412785)
[17:23:39] <wikibugs>	 (03PS1) 10Daniel Kinzler: smokepy: send http requests in parallel [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219188
[17:24:15] <swfrench-wmf>	 as noted previously, I am going to get this build / deploy process started shortly
[17:24:22] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2028.codfw.wmnet with OS trixie
[17:24:30] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11469876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host es2028.codfw.wmnet with OS trixie execu...
[17:27:00] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: Rebuild deployment to pick up new production image
[17:28:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:28:18] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie
[17:28:22] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.move-vlan for host es2028
[17:28:22] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host es2028
[17:28:29] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11469907 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host es2028.codfw.wmnet with OS trixie
[17:32:07] <wikibugs>	 (03PS1) 10Krinkle: scap: Remove unused mwmaint config, obsolete wikitech/php7 comments [puppet] - 10https://gerrit.wikimedia.org/r/1219189 (https://phabricator.wikimedia.org/T397017)
[17:33:17] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:33:30] <logmsgbot>	 !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lswtest-d8-eqiad,lswtest-d8-eqiad IPv6 with reason: upgrading sr-linux on lswtest-d8-eqiad
[17:33:38] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11469916 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4ac5ae06-34f5-425c-b0df-bc77a3758cd3) set by cmooney@cumin1003 for 2:00:0...
[17:34:00] <logmsgbot>	 !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1006.eqiad.wmnet with reason: upgrading connected switch
[17:36:32] <wikibugs>	 (03PS1) 10Krinkle: scap: Add php_l10n build in Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/1219190 (https://phabricator.wikimedia.org/T99740)
[17:39:36] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+1] scap: Remove unused mwmaint config, obsolete wikitech/php7 comments [puppet] - 10https://gerrit.wikimedia.org/r/1219189 (https://phabricator.wikimedia.org/T397017) (owner: 10Krinkle)
[17:44:41] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] scap: Remove unused mwmaint config, obsolete wikitech/php7 comments [puppet] - 10https://gerrit.wikimedia.org/r/1219189 (https://phabricator.wikimedia.org/T397017) (owner: 10Krinkle)
[17:45:52] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] Pass link_wait_timeout tab-separated [puppet] - 10https://gerrit.wikimedia.org/r/1219174 (owner: 10Muehlenhoff)
[17:45:56] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Pass link_wait_timeout tab-separated [puppet] - 10https://gerrit.wikimedia.org/r/1219174 (owner: 10Muehlenhoff)
[17:46:42] <logmsgbot>	 !log cmooney@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es2028.codfw.wmnet with OS trixie
[17:46:50] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11469985 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host es2028.codfw.wmnet with OS trixie execu...
[17:48:59] <wikibugs>	 (03PS1) 10Cathal Mooney: lswtest-d8-eqiad: define srlinux_version var as v25.10.1 [homer/public] - 10https://gerrit.wikimedia.org/r/1219192 (https://phabricator.wikimedia.org/T412733)
[17:50:59] <logmsgbot>	 !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ssw1-d[1,8]-eqiad.mgmt with reason: upgrading sr-linux on lswtest-d8-eqiad
[17:51:15] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11469994 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ec73e489-e95a-4824-ad67-a99943eae0e7) set by cmoone...
[17:51:29] <logmsgbot>	 !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ssw1-d[1,8]-eqiad with reason: upgrading sr-linux on lswtest-d8-eqiad
[17:51:43] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11470001 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=98bc0d0a-c3e1-4862-b66a-e386322de608) set by cmoone...
[17:51:46] <topranks>	 !log upgrading OS on lswtest-d8-eqiad T412733
[17:51:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:49] <stashbot>	 T412733: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733
[17:53:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:54:24] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie
[17:54:55] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11470010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host es2028.codfw.wmnet with OS trixie
[17:55:30] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery)
[17:55:38] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11470015 (10Marostegui)
[17:58:19] <wikibugs>	 (03CR) 10Papaul: [C:03+1] lswtest-d8-eqiad: define srlinux_version var as v25.10.1 [homer/public] - 10https://gerrit.wikimedia.org/r/1219192 (https://phabricator.wikimedia.org/T412733) (owner: 10Cathal Mooney)
[18:00:04] <jouncebot>	 swfrench-wmf: #bothumor I � Unicode. All rise for MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1800).
[18:01:30] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] lswtest-d8-eqiad: define srlinux_version var as v25.10.1 [homer/public] - 10https://gerrit.wikimedia.org/r/1219192 (https://phabricator.wikimedia.org/T412733) (owner: 10Cathal Mooney)
[18:02:50] <wikibugs>	 (03Merged) 10jenkins-bot: lswtest-d8-eqiad: define srlinux_version var as v25.10.1 [homer/public] - 10https://gerrit.wikimedia.org/r/1219192 (https://phabricator.wikimedia.org/T412733) (owner: 10Cathal Mooney)
[18:05:42] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] mediawiki::periodic_job: Add mesh_check_skip [puppet] - 10https://gerrit.wikimedia.org/r/1219161 (https://phabricator.wikimedia.org/T412818) (owner: 10Clément Goubert)
[18:12:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11470078 (10VRiley-WMF)
[18:13:22] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11470083 (10VRiley-WMF)
[18:14:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.36% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:15:18] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11470088 (cmooney) >>! In T412733#11467826, @ayounsi wrote: > My guess is that SR-Linux < 25 doesn't have stats for mgmt0 (eit...
[18:23:55] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11470106 (cmooney) @papaul lswtest-d8-eqiad is upgraded to v25.10.1 now for you.  {F71107154 width=500}
[18:24:15] <swfrench-wmf>	 mediawiki rebuild / deployment still chugging along
[18:32:39] <logmsgbot>	 !log cmooney@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es2028.codfw.wmnet with OS trixie
[18:32:50] <wikibugs>	 SRE, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11470138 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host es2028.codfw.wmnet with OS trixie execu...
[18:34:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:38:21] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11470156 (Jclark-ctr) @RKemper I replaced the battery and that error has cleared. It still shows an error for Drive Slot 1. I’ve opened an RMA for the drive since it was pur...
[18:39:47] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:42:48] <logmsgbot>	 !log swfrench@deploy2002 Finished scap sync-world: Rebuild deployment to pick up new production image (duration: 78m 01s)
[18:43:30] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Yubikey-SSH-FIDO for cdobbins - https://phabricator.wikimedia.org/T412755#11470166 (CDobbins) Sorry, I was (blindly) following the instructions on wikitech and didn't stop to think. I'll take care of this myself!
[18:43:39] <wikibugs>	 SRE, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11470167 (cmooney) Unfortunately it wasn't just a quirk to do with the tabs v. spaces in the preseed file.  I tried again and the same happens,...
[18:46:57] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply
[18:47:15] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[18:48:10] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply
[18:48:24] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[18:48:45] <swfrench-wmf>	 alright, I'm done with mediawiki deployments for this window. as expected this took quite a while :)
[18:54:47] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11470232 (Jhancock.wm) if we use 1G copper, we don't need to order anything. I can probably get it pre-ran tomorrow. Then papaul or I can conne...
[18:55:16] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[19:00:05] <jouncebot>	 dancy and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1900)
[19:00:13] <dancy>	 o/
[19:00:38] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.46.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1219202 (https://phabricator.wikimedia.org/T408277)
[19:00:41] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by dancy@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1219202 (https://phabricator.wikimedia.org/T408277) (owner: TrainBranchBot)
[19:01:37] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.46.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1219202 (https://phabricator.wikimedia.org/T408277) (owner: TrainBranchBot)
[19:09:08] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Yubikey-SSH-FIDO for cdobbins - https://phabricator.wikimedia.org/T412755#11470297 (Marostegui) Thank you - if you need help with the verification out band, let me know!
[19:11:40] <logmsgbot>	 !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.7  refs T408277
[19:11:44] <stashbot>	 T408277: 1.46.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T408277
[19:13:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 15.16% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:23:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:26:34] <jinxer-wm>	 FIRING: DiskSpace: Disk space serpens:9100:/ 3.327% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:27:00] <wikibugs>	 SRE, Infrastructure-Foundations, netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11470334 (Papaul) We are seeing the same error on lswtest-d8 in eqiad  ` in-error-packets 2466 `
[19:33:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between lswtest-d8-eqiad and ssw1-d1-eqiad (10.64.128.17) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[19:34:34] <icinga-wm>	 PROBLEM - Host lswtest-d8-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[19:34:57] <wikibugs>	 (Abandoned) CDobbins: icinga: add cdobbins [puppet] - https://gerrit.wikimedia.org/r/1014589 (owner: CDobbins)
[19:36:51] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - lswtest-d8-eqiad:ethernet-1/56 (Core: ssw1-d1-eqiad:ethernet-1/17 {#temp1848392398}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=lswtest-d8-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[19:36:54] <wikibugs>	 (CR) CDobbins: [V:+2] admin: add fido-based ssh access for cdobbins [puppet] - https://gerrit.wikimedia.org/r/1218360 (https://phabricator.wikimedia.org/T412755) (owner: CDobbins)
[19:36:58] <wikibugs>	 ops-eqiad, DC-Ops: Inbound errors on interface lswtest-d8-eqiad:mgmt0 () - https://phabricator.wikimedia.org/T413004 (phaultfinder) NEW
[19:37:00] <icinga-wm>	 PROBLEM - Host lswtest-d8-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[19:37:14] <wikibugs>	 (CR) CDobbins: [V:+2 C:+2] admin: add fido-based ssh access for cdobbins [puppet] - https://gerrit.wikimedia.org/r/1218360 (https://phabricator.wikimedia.org/T412755) (owner: CDobbins)
[19:37:49] <wikibugs>	 ops-drmrs: Alert for device asw1-b12-drmrs.mgmt.drmrs.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T413005 (phaultfinder) NEW
[19:42:04] <wikibugs>	 (PS1) Eric Gardner: Delay StickyHeaders section click instrumentation for slow loads [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1219209 (https://phabricator.wikimedia.org/T412857)
[19:42:34] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1219209 (https://phabricator.wikimedia.org/T412857) (owner: Eric Gardner)
[19:43:03] <dancy>	 jouncebot nowandnext
[19:43:03] <jouncebot>	 For the next 1 hour(s) and 16 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1900)
[19:43:03] <jouncebot>	 In 1 hour(s) and 16 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T2100)
[19:45:09] <dancy>	 The train looks good so I'm okay with folks using the rest of the window for backports.  
[19:45:24] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:46:07] <dancy>	 (^ EricGardner)
[19:47:22] <wikibugs>	 SRE-Access-Requests: Add yubikey SSH key for 'denisse' - https://phabricator.wikimedia.org/T413006 (andrea.denisse) NEW
[19:48:29] <wikibugs>	 SRE-Access-Requests: Add yubikey SSH key for 'denisse' - https://phabricator.wikimedia.org/T413006#11470411 (andrea.denisse) Open→In progress
[19:51:51] <jinxer-wm>	 FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - lswtest-d8-eqiad:ethernet-1/56 (Core: ssw1-d1-eqiad:ethernet-1/17 {#temp1848392398}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[19:53:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between lswtest-d8-eqiad and ssw1-d1-eqiad (10.64.128.17) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[19:55:59] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations, Release-Engineering-Team, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008 (CDanis) NEW
[19:56:19] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations, Release-Engineering-Team, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11470444 (CDanis) This is at least High and possibly UBN!
[20:04:33] <wikibugs>	 (Abandoned) Daniel Kinzler: api-gateway chart: add values-rest-staging.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/1211656 (owner: Daniel Kinzler)
[20:05:05] <wikibugs>	 (Abandoned) Daniel Kinzler: rest-gateway: add prefix to all user IDs [deployment-charts] - https://gerrit.wikimedia.org/r/1212239 (owner: Daniel Kinzler)
[20:06:14] <wikibugs>	 (Abandoned) Daniel Kinzler: rest-gateway: assign ratelimit class by network range [deployment-charts] - https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) (owner: Daniel Kinzler)
[20:08:03] <wikibugs>	 SRE-Access-Requests: Add FIDO-backed SSH key for aklapper - https://phabricator.wikimedia.org/T413009 (Aklapper) NEW
[20:08:17] <wikibugs>	 (PS1) Aklapper: admin: add fido backed ssh key for aklapper [puppet] - https://gerrit.wikimedia.org/r/1219213 (https://phabricator.wikimedia.org/T413009)
[20:10:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:10:24] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[20:15:15] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:17:52] <wikibugs>	 (PS1) Andrea Denisse: admin: Add yubikey SSH key for denisse. [puppet] - https://gerrit.wikimedia.org/r/1219211 (https://phabricator.wikimedia.org/T413006)
[20:19:11] <denisse>	 Hi, can this patch be merged?? CDobbins: admin: add fido-based ssh access for cdobbins (476b0919fe)
[20:19:40] <denisse>	 ChrisDobbins901_ ^
[20:20:18] <ChrisDobbins901_>	 yes. I thought I merged it 😳 
[20:20:47] <denisse>	 It was merged on Gerrit but not on the Puppet  host, no worries, I'll merge it. :)
[20:21:01] <ChrisDobbins901_>	 thank you 🤦🏽
[20:23:20] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T412983#11470512 (Jhancock.wm) Open→Resolved a:Jhancock.wm removing Phase, Active Power values until T401937 is resolved.
[20:28:35] <wikibugs>	 (PS1) C. Scott Ananian: ParserOutputAccess: don't use PoolCounter recursively [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1219216 (https://phabricator.wikimedia.org/T412959)
[20:28:45] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1219216 (https://phabricator.wikimedia.org/T412959) (owner: C. Scott Ananian)
[20:29:10] <wikibugs>	 (PS1) C. Scott Ananian: ParserOutputAccess: don't use PoolCounter recursively [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219217 (https://phabricator.wikimedia.org/T412959)
[20:30:03] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219217 (https://phabricator.wikimedia.org/T412959) (owner: C. Scott Ananian)
[20:30:13] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T410589)', diff saved to https://phabricator.wikimedia.org/P86725 and previous config saved to /var/cache/conftool/dbconfig/20251217-203012-ladsgroup.json
[20:30:17] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[20:30:25] <wikibugs>	 (PS1) Scott French: shellbox: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1219215
[20:32:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:34:17] <wikibugs>	 ops-codfw, ops-eqiad, SRE, DC-Ops: Offline Script not completing - https://phabricator.wikimedia.org/T411551#11470555 (Jhancock.wm) i had a decomm ticket that passed without issues. T412783
[20:35:37] <wikibugs>	 (CR) RLazarus: [C:+1] shellbox: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1219215 (owner: Scott French)
[20:39:54] <wikibugs>	 (CR) Scott French: "Thanks, Reuven!" [deployment-charts] - https://gerrit.wikimedia.org/r/1219215 (owner: Scott French)
[20:39:58] <wikibugs>	 (CR) Scott French: [C:+2] shellbox: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1219215 (owner: Scott French)
[20:40:26] <wikibugs>	 (CR) Scott French: [C:+1] mw-videoscaler: Update to Envoy 1.35.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1217609 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)
[20:42:24] <wikibugs>	 (Merged) jenkins-bot: shellbox: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1219215 (owner: Scott French)
[20:44:46] <wikibugs>	 (CR) RLazarus: [C:+2] mw-videoscaler: Update to Envoy 1.35.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1217609 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)
[20:45:21] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P86726 and previous config saved to /var/cache/conftool/dbconfig/20251217-204520-ladsgroup.json
[20:45:24] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[20:46:30] <wikibugs>	 (Merged) jenkins-bot: mw-videoscaler: Update to Envoy 1.35.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1217609 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)
[20:47:52] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply
[20:48:20] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[20:48:51] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[20:49:07] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[20:49:38] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply
[20:49:41] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[20:49:52] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
[20:49:55] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[20:50:21] <wikibugs>	 (PS1) Neriah: Enable protection indicators for ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1219219
[20:50:23] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply
[20:50:41] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[20:50:59] <logmsgbot>	 !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply
[20:51:09] <logmsgbot>	 !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply
[20:51:12] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply
[20:51:33] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply
[20:52:05] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply
[20:52:30] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T2100).
[21:00:05] <jouncebot>	 cscott, Pppery, and EricGardner: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:22] * swfrench-wmf has some pending shellbox updates, but will hold off until the backport window wraps up
[21:00:29] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P86727 and previous config saved to /var/cache/conftool/dbconfig/20251217-210029-ladsgroup.json
[21:00:35] <cscott>	 o/
[21:00:44] <rzl>	 I'm through with what I was doing, also :)
[21:00:59] <cscott>	 my backports should go out before the mediawiki-config patch
[21:01:02] <rzl>	 (for now... *ominous chord* *thunderclap* *maniacal laughter*)
[21:01:12] <EricGardner>	 I'm here and can deploy my patches (a simple backport and a config patch) when others are done
[21:01:22] <cscott>	 i can spiderpig my patches as well.
[21:01:37] <EricGardner>	 This is the config patch (it's already merged so I could not add it to the schedule): https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1217799
[21:02:03] <EricGardner>	 Neither of my patches should produce any user-facing changes
[21:02:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:02:48] <cscott>	 EricGardner: if it's merged already, will it go out at the new scap, or only at the next scap of mediawiki-config?
[21:03:16] <EricGardner>	 I'm not totally clear on that. The exact timing does not really matter, this is more of a housekeeping change.
[21:03:29] <EricGardner>	 We are just removing some reference to a dead project.
[21:03:36] <cscott>	 i can get started then, and then your config change will probably go out with my config change.
[21:04:07] <EricGardner>	 That would be great if you want to include it
[21:04:37] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1219216 (https://phabricator.wikimedia.org/T412959) (owner: C. Scott Ananian)
[21:04:38] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219217 (https://phabricator.wikimedia.org/T412959) (owner: C. Scott Ananian)
[21:08:59] <wikibugs>	 (Merged) jenkins-bot: ParserOutputAccess: don't use PoolCounter recursively [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1219216 (https://phabricator.wikimedia.org/T412959) (owner: C. Scott Ananian)
[21:09:03] <wikibugs>	 (Merged) jenkins-bot: ParserOutputAccess: don't use PoolCounter recursively [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1219217 (https://phabricator.wikimedia.org/T412959) (owner: C. Scott Ananian)
[21:09:38] <logmsgbot>	 !log cscott@deploy2002 Started scap sync-world: Backport for [[gerrit:1219216|ParserOutputAccess: don't use PoolCounter recursively (T412959)]], [[gerrit:1219217|ParserOutputAccess: don't use PoolCounter recursively (T412959)]]
[21:09:42] <stashbot>	 T412959: Logstash poolcounter warnings "Usage error: You may only aquire a single non-nowait lock" on wikis with post-processing cache enabled - https://phabricator.wikimedia.org/T412959
[21:11:50] <logmsgbot>	 !log cscott@deploy2002 cscott: Backport for [[gerrit:1219216|ParserOutputAccess: don't use PoolCounter recursively (T412959)]], [[gerrit:1219217|ParserOutputAccess: don't use PoolCounter recursively (T412959)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:13:20] <wikibugs>	 ops-codfw, ops-eqiad, SRE, DC-Ops: Offline Script not completing - https://phabricator.wikimedia.org/T411551#11470629 (Papaul) Open→Resolved @Jhancock.wm thank you for the update. WE can resolve this task for now if it does happen again we can reopen.
[21:14:23] <logmsgbot>	 !log cscott@deploy2002 cscott: Continuing with sync
[21:15:38] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T410589)', diff saved to https://phabricator.wikimedia.org/P86728 and previous config saved to /var/cache/conftool/dbconfig/20251217-211537-ladsgroup.json
[21:15:42] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[21:15:54] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2239.codfw.wmnet with reason: Maintenance
[21:18:28] <logmsgbot>	 !log cscott@deploy2002 Finished scap sync-world: Backport for [[gerrit:1219216|ParserOutputAccess: don't use PoolCounter recursively (T412959)]], [[gerrit:1219217|ParserOutputAccess: don't use PoolCounter recursively (T412959)]] (duration: 08m 50s)
[21:18:32] <stashbot>	 T412959: Logstash poolcounter warnings "Usage error: You may only aquire a single non-nowait lock" on wikis with post-processing cache enabled - https://phabricator.wikimedia.org/T412959
[21:19:14] <wikibugs>	 (PS1) Daniel Kinzler: rest-gateway: move values-minikube.minikube to service definition [deployment-charts] - https://gerrit.wikimedia.org/r/1219222
[21:19:21] <cscott>	 EricGardner: ok, my mediawiki-core patches are done.  the config patch is next.  do you want to do your core patch before the config, or does it not matter?
[21:20:08] <EricGardner>	 It doesn't matter for my change
[21:20:31] <cscott>	 ok, i'm going to do the config patches now then.
[21:20:49] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin)
[21:20:58] <wikibugs>	 (CR) CI reject: [V:-1] Enable post-processing cache for all Parsoid-rendered wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin)
[21:21:17] <wikibugs>	 (CR) CI reject: [V:-1] rest-gateway: move values-minikube.minikube to service definition [deployment-charts] - https://gerrit.wikimedia.org/r/1219222 (owner: Daniel Kinzler)
[21:21:37] <wikibugs>	 (PS7) C. Scott Ananian: Enable post-processing cache for all Parsoid-rendered wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin)
[21:22:21] <wikibugs>	 (CR) TrainBranchBot: "Approved by cscott@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin)
[21:23:23] <wikibugs>	 (Merged) jenkins-bot: Enable post-processing cache for all Parsoid-rendered wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin)
[21:23:56] <logmsgbot>	 !log cscott@deploy2002 Started scap sync-world: Backport for [[gerrit:1218793|Enable post-processing cache for all Parsoid-rendered wikis (T348255)]], [[gerrit:1217799|Decommission Article Summaries (T411558)]]
[21:24:01] <stashbot>	 T348255: Parser cache infrastructure for OutputTransform - https://phabricator.wikimedia.org/T348255
[21:24:02] <stashbot>	 T411558: ArticleSummaries: Decommission the extension (code changes) - https://phabricator.wikimedia.org/T411558
[21:25:44] <wikibugs>	 (PS2) Daniel Kinzler: rest-gateway: move values-minikube.minikube to service definition [deployment-charts] - https://gerrit.wikimedia.org/r/1219222
[21:26:10] <logmsgbot>	 !log cscott@deploy2002 ksarabia, ihurbain, cscott: Backport for [[gerrit:1218793|Enable post-processing cache for all Parsoid-rendered wikis (T348255)]], [[gerrit:1217799|Decommission Article Summaries (T411558)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:27:56] <wikibugs>	 (CR) CI reject: [V:-1] rest-gateway: move values-minikube.minikube to service definition [deployment-charts] - https://gerrit.wikimedia.org/r/1219222 (owner: Daniel Kinzler)
[21:32:05] <logmsgbot>	 !log cscott@deploy2002 ksarabia, ihurbain, cscott: Continuing with sync
[21:32:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11470692 (VRiley-WMF)
[21:36:09] <logmsgbot>	 !log cscott@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218793|Enable post-processing cache for all Parsoid-rendered wikis (T348255)]], [[gerrit:1217799|Decommission Article Summaries (T411558)]] (duration: 12m 13s)
[21:36:14] <stashbot>	 T348255: Parser cache infrastructure for OutputTransform - https://phabricator.wikimedia.org/T348255
[21:36:15] <stashbot>	 T411558: ArticleSummaries: Decommission the extension (code changes) - https://phabricator.wikimedia.org/T411558
[21:37:00] <cscott>	 EricGardner: ok, i'm done.  do you want to do your last patch yourself?
[21:37:11] <EricGardner>	 Sure, I can do that now
[21:37:12] <cscott>	 also, i'm not sure who is deploying pppery's patch
[21:37:42] <cscott>	 urbanecm: are you deploying pppery's patch?
[21:39:51] <EricGardner>	 I will start with my WikimediaEvents patch now
[21:40:08] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations, serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11470751 (thcipriani)
[21:40:19] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by egardner@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1219209 (https://phabricator.wikimedia.org/T412857) (owner: Eric Gardner)
[21:47:52] <wikibugs>	 (Merged) jenkins-bot: Delay StickyHeaders section click instrumentation for slow loads [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1219209 (https://phabricator.wikimedia.org/T412857) (owner: Eric Gardner)
[21:48:25] <logmsgbot>	 !log egardner@deploy2002 Started scap sync-world: Backport for [[gerrit:1219209|Delay StickyHeaders section click instrumentation for slow loads (T412857)]]
[21:48:29] <stashbot>	 T412857: Sticky Headers: Distinguish automatic vs user-initiated section toggles - https://phabricator.wikimedia.org/T412857
[21:50:36] <logmsgbot>	 !log egardner@deploy2002 egardner: Backport for [[gerrit:1219209|Delay StickyHeaders section click instrumentation for slow loads (T412857)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:52:10] <logmsgbot>	 !log egardner@deploy2002 egardner: Continuing with sync
[21:53:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:56:12] <logmsgbot>	 !log egardner@deploy2002 Finished scap sync-world: Backport for [[gerrit:1219209|Delay StickyHeaders section click instrumentation for slow loads (T412857)]] (duration: 07m 47s)
[21:56:16] <stashbot>	 T412857: Sticky Headers: Distinguish automatic vs user-initiated section toggles - https://phabricator.wikimedia.org/T412857
[21:57:24] <cscott>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, cjming: there's a volunteer patch on the schedule from pppery but I don't know who is supposed to deploy it.
[22:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T2200)
[22:07:18] <rzl>	 I have some envoy updates to roll out, but happy to wait if that last patch is still going to go out :)
[22:09:43] <dancy>	 cscott, rzl:  I recommend leaving the change undeployed and moving on.  
[22:09:50] <wikibugs>	 SRE-Access-Requests: FIDO ssh key for ariel - https://phabricator.wikimedia.org/T413019 (ArielGlenn) NEW
[22:10:07] <wikibugs>	 SRE-Access-Requests: Add FIDO ssh key(s) for ariel - https://phabricator.wikimedia.org/T413019#11470889 (ArielGlenn)
[22:10:28] <swfrench-wmf>	 rzl: any objections if I sneak in some shellbox updates before you start?
[22:10:38] <swfrench-wmf>	 lest you pick them up :)
[22:10:44] <rzl>	 nope, fire away
[22:10:58] <swfrench-wmf>	 ack, starting momentarily
[22:11:01] <rzl>	 you're also welcome to leave em for me, you'd just have to wait until I get all the way to S :P
[22:11:56] <wikibugs>	 (PS1) ArielGlenn: Add the first of two yubikey FIDO-compliant ssh keys for ariel [puppet] - https://gerrit.wikimedia.org/r/1219230 (https://phabricator.wikimedia.org/T413019)
[22:12:18] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[22:12:32] <wikibugs>	 (CR) CI reject: [V:-1] Add the first of two yubikey FIDO-compliant ssh keys for ariel [puppet] - https://gerrit.wikimedia.org/r/1219230 (https://phabricator.wikimedia.org/T413019) (owner: ArielGlenn)
[22:12:53] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[22:12:59] <swfrench-wmf>	 rzl: thanks for offering! this probably warrants a wee bit more supervision than I'd want to burden you with, though.
[22:13:24] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[22:13:38] <rzl>	 👍
[22:14:03] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[22:14:34] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[22:14:53] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[22:15:24] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply
[22:15:45] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[22:16:16] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply
[22:16:43] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply
[22:16:54] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1218775 (https://phabricator.wikimedia.org/T412455) (owner: LorenMora)
[22:17:14] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[22:17:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:17:33] <cscott>	 dancy: yep sounds good to me
[22:17:57] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[22:18:32] <wikibugs>	 (PS2) ArielGlenn: Add the first of two yubikey FIDO-compliant ssh keys for ariel [puppet] - https://gerrit.wikimedia.org/r/1219230 (https://phabricator.wikimedia.org/T413019)
[22:19:25] <swfrench-wmf>	 rzl: I'll let that soak for 10m or so, then update codfw, then all yours
[22:19:30] <rzl>	 sgtm
[22:30:00] <swfrench-wmf>	 service metrics and logstash look good. off to codfw we go.
[22:30:18] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply
[22:30:56] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[22:31:27] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[22:36:09] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1219230 (https://phabricator.wikimedia.org/T413019) (owner: ArielGlenn)
[22:41:58] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[22:42:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:43:15] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[22:43:50] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[22:45:04] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[22:45:18] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[22:45:50] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[22:46:06] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[22:46:37] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply
[22:46:59] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply
[22:47:31] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[22:48:07] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[22:52:49] <swfrench-wmf>	 rzl: all yours. thanks for your patience!
[22:52:59] <rzl>	 thanks!
[22:53:04] <jhathaway>	 !log upload new version of corto
[22:53:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:55:17] <rzl>	 rolling out envoy 1.35.7 to eqiad services
[22:55:20] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/apertium: apply
[22:55:56] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply
[22:56:33] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply
[22:57:05] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply
[22:58:02] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:58:08] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:59:26] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/commons-impact-analytics: apply
[22:59:44] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/commons-impact-analytics: apply
[23:00:04] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T2300)
[23:03:24] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply
[23:03:43] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply
[23:03:50] <rzl>	 (if anyone has plans to use the Web Team window today, I'm happy to pause for as long as you need!)
[23:03:57] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[23:04:17] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[23:04:30] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply
[23:04:47] <wikibugs>	 (PS1) Bearloga: EventStreamConfig: enrich stream with more headers [mediawiki-config] - https://gerrit.wikimedia.org/r/1219234 (https://phabricator.wikimedia.org/T396562)
[23:04:48] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply
[23:04:59] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/echostore: apply
[23:05:57] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/echostore: apply
[23:06:16] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply
[23:06:32] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply
[23:06:48] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply
[23:07:14] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply
[23:07:33] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[23:08:11] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[23:08:25] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[23:08:47] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[23:08:55] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply
[23:09:11] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply
[23:09:21] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply
[23:09:47] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply
[23:10:05] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[23:10:35] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[23:10:53] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply
[23:11:41] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply
[23:11:57] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply
[23:12:14] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply
[23:12:29] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/image-suggestion: apply
[23:12:50] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply
[23:13:09] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[23:13:34] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[23:13:45] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply
[23:14:58] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply
[23:17:47] <wikibugs>	 (CR) CDanis: [C:+1] "thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1219234 (https://phabricator.wikimedia.org/T396562) (owner: Bearloga)
[23:18:06] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[23:19:29] <wikibugs>	 (CR) ArielGlenn: [C:+2] Add the first of two yubikey FIDO-compliant ssh keys for ariel [puppet] - https://gerrit.wikimedia.org/r/1219230 (https://phabricator.wikimedia.org/T413019) (owner: ArielGlenn)
[23:26:49] <jinxer-wm>	 FIRING: DiskSpace: Disk space serpens:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[23:30:44] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[23:30:48] <rzl>	 this is going to time out soonish, same thing that happened last time I tried to deploy this serv-- yeah
[23:31:01] <rzl>	 moving on for now, I'll come back around to it
[23:31:11] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply
[23:31:27] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply
[23:31:42] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[23:33:31] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[23:34:09] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[23:34:52] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[23:35:22] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[23:35:28] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[23:35:47] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply
[23:36:08] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply
[23:41:09] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply
[23:42:03] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[23:42:14] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply
[23:42:54] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply
[23:43:08] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply
[23:43:14] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply
[23:43:21] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: apply
[23:43:47] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: apply
[23:44:08] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply
[23:44:26] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply
[23:45:01] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply
[23:45:24] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:45:34] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply
[23:45:54] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/termbox: apply
[23:46:40] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply
[23:47:24] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[23:47:47] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[23:48:14] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/toolhub: apply
[23:48:31] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply
[23:49:26] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[23:49:47] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[23:50:33] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[23:51:02] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[23:51:18] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[23:52:03] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[23:52:06] <jinxer-wm>	 FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - lswtest-d8-eqiad:ethernet-1/56 (Core: ssw1-d1-eqiad:ethernet-1/17 {#temp1848392398}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[23:52:12] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply
[23:52:38] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply
[23:53:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between lswtest-d8-eqiad and ssw1-d1-eqiad (10.64.128.17) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:54:11] <rzl>	 letting that rest a moment for extremely responsible operations reasons (i.e. I want a snack) and then I'll roll the same thing in codfw