[00:36:54] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:56:27] <urbanecm>	 jouncebot: nowandnext
[00:56:27] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 3 minute(s)
[00:56:27] <jouncebot>	 In 2 hour(s) and 3 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0300)
[00:56:58] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/890173 (https://phabricator.wikimedia.org/T330015) (owner: Urbanecm)
[00:57:33] <wikibugs>	 (Merged) jenkins-bot: cswikibooks: Enable visualeditor for all users [mediawiki-config] - https://gerrit.wikimedia.org/r/890173 (https://phabricator.wikimedia.org/T330015) (owner: Urbanecm)
[00:57:50] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:890173|cswikibooks: Enable visualeditor for all users (T330015)]]
[00:57:54] <stashbot>	 T330015: Enable VisualEditor by default on cs.wikibooks.org - https://phabricator.wikimedia.org/T330015
[00:59:25] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:890173|cswikibooks: Enable visualeditor for all users (T330015)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[01:06:38] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:890173|cswikibooks: Enable visualeditor for all users (T330015)]] (duration: 08m 47s)
[01:06:42] <stashbot>	 T330015: Enable VisualEditor by default on cs.wikibooks.org - https://phabricator.wikimedia.org/T330015
[01:07:34] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T329595 (phaultfinder)
[01:48:46] <wikibugs>	 (CR) Legoktm: [C: +1] Remove wgLinterSubmitterWhitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/890496 (https://phabricator.wikimedia.org/T329992) (owner: Arlolra)
[01:49:02] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[01:53:03] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[01:58:22] <wikibugs>	 (PS1) Legoktm: varnish: Set `X-Content-Type-Options: nosniff` on upload requests [puppet] - https://gerrit.wikimedia.org/r/890512 (https://phabricator.wikimedia.org/T309787)
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:21:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:02] <wikibugs>	 (PS1) Andrew Bogott: systemd timers: Add the 'After' requirement to the timer module [puppet] - https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022)
[02:30:11] <wikibugs>	 (CR) Andrew Bogott: "PCC results:" [puppet] - https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022) (owner: Andrew Bogott)
[03:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0300)
[03:07:50] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.40.0-wmf.24 [core] (wmf/1.40.0-wmf.24) - https://gerrit.wikimedia.org/r/890352 (https://phabricator.wikimedia.org/T325587)
[03:07:56] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.40.0-wmf.24 [core] (wmf/1.40.0-wmf.24) - https://gerrit.wikimedia.org/r/890352 (https://phabricator.wikimedia.org/T325587) (owner: TrainBranchBot)
[03:23:00] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.40.0-wmf.24 [core] (wmf/1.40.0-wmf.24) - https://gerrit.wikimedia.org/r/890352 (https://phabricator.wikimedia.org/T325587) (owner: TrainBranchBot)
[04:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0400)
[04:05:54] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:03:18] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[05:05:02] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[05:53:46] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[05:57:48] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[06:09:20] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:10:32] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0700)
[07:00:04] <jouncebot>	 kormat, marostegui, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0700).
[07:45:04] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:45:10] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:46:44] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49565 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:46:52] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.249 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:49:34] <XioNoX>	 !log Staging the new Junos version on the codfw row B switches - T327991
[07:49:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:39] <stashbot>	 T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991
[07:54:09] <wikibugs>	 (PS2) KartikMistry: Section Translation: Fix language code for Cantonese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/890482 (https://phabricator.wikimedia.org/T304865)
[07:57:30] <jinxer-wm>	 (Emergency syslog message) firing: Alert for device asw-b-codfw.mgmt.codfw.wmnet - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[07:58:15] <XioNoX>	 unexpected but seems link impactless ^
[07:59:28] <wikibugs>	 (PS3) Muehlenhoff: sre.ganeti.makevm: Stop printing the dhcp config snippet [cookbooks] - https://gerrit.wikimedia.org/r/889999 (https://phabricator.wikimedia.org/T306661)
[08:00:04] <jouncebot>	 Amir1 and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0800).
[08:00:04] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:45] * kart_ is here
[08:00:56] <kart_>	 I'll go ahead with self-deploy..
[08:02:30] <jinxer-wm>	 (Emergency syslog message) resolved: Device asw-b-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[08:03:37] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/890482 (https://phabricator.wikimedia.org/T304865) (owner: KartikMistry)
[08:04:17] <wikibugs>	 (Merged) jenkins-bot: Section Translation: Fix language code for Cantonese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/890482 (https://phabricator.wikimedia.org/T304865) (owner: KartikMistry)
[08:04:48] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:890482|Section Translation: Fix language code for Cantonese Wikipedia (T304865)]]
[08:04:52] <stashbot>	 T304865: Enable Content and Section Translation for Cantonese Wikipedia - https://phabricator.wikimedia.org/T304865
[08:04:59] <wikibugs>	 (CR) Alexandros Kosiaris: admin_ng: Update wikikube-codfw settings to k8s 1.23 (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/890392 (https://phabricator.wikimedia.org/T326617) (owner: JMeybohm)
[08:05:04] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] admin_ng: Update wikikube-codfw settings to k8s 1.23 [deployment-charts] - https://gerrit.wikimedia.org/r/890392 (https://phabricator.wikimedia.org/T326617) (owner: JMeybohm)
[08:07:46] <wikibugs>	 (CR) Muehlenhoff: [C: +2] sre.ganeti.makevm: Stop printing the dhcp config snippet [cookbooks] - https://gerrit.wikimedia.org/r/889999 (https://phabricator.wikimedia.org/T306661) (owner: Muehlenhoff)
[08:09:22] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] "+1, but 1 minor comment regarding the order of things." [puppet] - https://gerrit.wikimedia.org/r/890390 (https://phabricator.wikimedia.org/T326617) (owner: JMeybohm)
[08:09:33] <logmsgbot>	 !log kartik@deploy1002 kartik: Backport for [[gerrit:890482|Section Translation: Fix language code for Cantonese Wikipedia (T304865)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[08:21:24] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:890482|Section Translation: Fix language code for Cantonese Wikipedia (T304865)]] (duration: 16m 36s)
[08:21:29] <stashbot>	 T304865: Enable Content and Section Translation for Cantonese Wikipedia - https://phabricator.wikimedia.org/T304865
[08:23:09] <wikibugs>	 (CR) Slyngshede: P:idm configure production IDM (5 comments) [puppet] - https://gerrit.wikimedia.org/r/889753 (owner: Slyngshede)
[08:23:16] <wikibugs>	 (PS33) Slyngshede: P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753
[08:23:25] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.40.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/890772 (https://phabricator.wikimedia.org/T325587)
[08:23:27] <wikibugs>	 (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.40.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/890772 (https://phabricator.wikimedia.org/T325587) (owner: TrainBranchBot)
[08:24:00] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.40.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/890772 (https://phabricator.wikimedia.org/T325587) (owner: TrainBranchBot)
[08:24:22] <logmsgbot>	 !log hashar@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.24  refs T325587
[08:24:26] <stashbot>	 T325587: 1.40.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T325587
[08:29:41] <wikibugs>	 (CR) Slyngshede: [C: +2] P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753 (owner: Slyngshede)
[08:36:20] <wikibugs>	 (CR) Phedenskog: [C: +1] "I prepared a test we shoot asap when this is merged, and then I'll clean it up when you tell me you are done Nicholas." [mediawiki-config] - https://gerrit.wikimedia.org/r/890509 (https://phabricator.wikimedia.org/T326147) (owner: Nray)
[08:43:32] <hashar>	 I kind of forgot about the backport window :(
[08:49:59] <moritzm>	 !log installing clamav security updates
[08:50:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:15] <wikibugs>	 (CR) Klausman: [C: +2] ml-services: outlink model upgrade debian and python [deployment-charts] - https://gerrit.wikimedia.org/r/890471 (https://phabricator.wikimedia.org/T328438) (owner: Ilias Sarantopoulos)
[08:56:53] <wikibugs>	 (PS4) Nicolas Fraison: netboot: create dedicated partman recipe for presto workers [puppet] - https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361)
[08:57:19] <wikibugs>	 (CR) CI reject: [V: -1] netboot: create dedicated partman recipe for presto workers [puppet] - https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361) (owner: Nicolas Fraison)
[08:58:02] <wikibugs>	 (PS5) Nicolas Fraison: netboot: create dedicated partman recipe for presto workers [puppet] - https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361)
[08:58:28] <wikibugs>	 (CR) CI reject: [V: -1] netboot: create dedicated partman recipe for presto workers [puppet] - https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361) (owner: Nicolas Fraison)
[09:00:05] <jouncebot>	 jayme: That opportune time is upon us again. Time for a Kubernetes upgrade wikikube codfw deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0900).
[09:00:05] <jouncebot>	 hashar and dduvall: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0900).
[09:01:07] <wikibugs>	 (Merged) jenkins-bot: ml-services: outlink model upgrade debian and python [deployment-charts] - https://gerrit.wikimedia.org/r/890471 (https://phabricator.wikimedia.org/T328438) (owner: Ilias Sarantopoulos)
[09:01:30] <icinga-wm>	 PROBLEM - Check that envoy is running on idm2001 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is inactive https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[09:06:19] <wikibugs>	 (PS1) Muehlenhoff: Restore two old entries (and mark as absented) [puppet] - https://gerrit.wikimedia.org/r/890774
[09:08:03] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Restore two old entries (and mark as absented) [puppet] - https://gerrit.wikimedia.org/r/890774 (owner: Muehlenhoff)
[09:08:15] <wikibugs>	 (PS1) Elukey: knative-serving: add network policies for domain-mapping [deployment-charts] - https://gerrit.wikimedia.org/r/890775 (https://phabricator.wikimedia.org/T327767)
[09:08:27] <wikibugs>	 (CR) Muehlenhoff: admin: remove users mmarble/marble (1 comment) [puppet] - https://gerrit.wikimedia.org/r/890497 (owner: Jbond)
[09:10:20] <logmsgbot>	 !log hashar@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.24  refs T325587 (duration: 45m 58s)
[09:10:28] <stashbot>	 T325587: 1.40.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T325587
[09:12:08] <vgutierrez>	 !log update thirdparty/haproxy26 to version 2.6.9 for bullseye and buster (apt.wm.o)
[09:12:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:39] <logmsgbot>	 !log hashar@deploy1002 Pruned MediaWiki: 1.40.0-wmf.22 (duration: 02m 16s)
[09:13:37] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.discovery.datacenter depool all active/active services in codfw: maintenance
[09:14:17] <vgutierrez>	 !log testing HAProxy 2.6.9 in cp4052 and cp4044
[09:14:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:44] <icinga-wm>	 PROBLEM - Check systemd state on idm2001 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:16:46] <wikibugs>	 (CR) JMeybohm: [V: +1] wikikube: Update cluster settings for k8s 1.23 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/890390 (https://phabricator.wikimedia.org/T326617) (owner: JMeybohm)
[09:16:52] <wikibugs>	 (PS6) Nicolas Fraison: netboot: create dedicated partman recipe for presto workers [puppet] - https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361)
[09:18:46] <wikibugs>	 (CR) Elukey: [C: +2] knative-serving: add network policies for domain-mapping [deployment-charts] - https://gerrit.wikimedia.org/r/890775 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[09:20:26] <icinga-wm>	 PROBLEM - Check systemd state on idm1001 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:20:53] <wikibugs>	 (CR) Klausman: [C: +1] knative-serving: add network policies for domain-mapping [deployment-charts] - https://gerrit.wikimedia.org/r/890775 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[09:22:51] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops, Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (Volans) @Papaul ideally we should find what's the characteristic that determines the change (iDRAC version?, BIOS version?, Dell GEN?) and automatically detect that ins...
[09:24:10] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe2002.codfw.wmnet
[09:24:27] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=prometheus2005.codfw.wmnet
[09:24:31] <wikibugs>	 (PS7) Nicolas Fraison: netboot: create dedicated partman recipe for presto workers [puppet] - https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361)
[09:24:58] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on idm1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service Slyngshede Setup https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:26:02] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on idm2001 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service Slyngshede Setup https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:26:02] <icinga-wm>	 ACKNOWLEDGEMENT - Check that envoy is running on idm2001 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed Slyngshede Setup https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[09:26:36] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: spicerack.mysql_legacy errors on get_core_masters_heartbeats when checking x2 - https://phabricator.wikimedia.org/T329533 (Clement_Goubert) Open→Resolved a:Clement_Goubert dry-run looks good, resolving
[09:26:48] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (Clement_Goubert)
[09:27:40] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Nice work!" [puppet] - https://gerrit.wikimedia.org/r/879421 (owner: Majavah)
[09:31:00] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all active/active services in codfw: maintenance
[09:32:42] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39750/console" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/890489 (https://phabricator.wikimedia.org/T304716) (owner: Majavah)
[09:34:14] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Just a nit inline, LGTM though! nicely done" [puppet] - https://gerrit.wikimedia.org/r/890489 (https://phabricator.wikimedia.org/T304716) (owner: Majavah)
[09:34:27] <wikibugs>	 (CR) Nicolas Fraison: [C: +2] presto: add last 5 nodes to prod cluster [puppet] - https://gerrit.wikimedia.org/r/889995 (https://phabricator.wikimedia.org/T329525) (owner: Nicolas Fraison)
[09:35:59] <wikibugs>	 (CR) Jelto: "comment in line" [puppet] - https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022) (owner: Andrew Bogott)
[09:36:41] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:38:30] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:39:36] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:40:05] <wikibugs>	 SRE, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (JMeybohm) Global depool of a/a services from codfw is done.
[09:40:23] <wikibugs>	 (CR) Jelto: [V: +1 C: +2] gitlab: allow rsync between replicas [puppet] - https://gerrit.wikimedia.org/r/890434 (https://phabricator.wikimedia.org/T329930) (owner: Jelto)
[09:44:27] <wikibugs>	 (PS1) EoghanGaffney: Change the active gitlab replica host to be the eqiad instance [puppet] - https://gerrit.wikimedia.org/r/890779 (https://phabricator.wikimedia.org/T329930)
[09:44:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[09:46:49] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[09:48:53] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:48:54] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.discovery.service-route depool 2 services in codfw: T329664
[09:48:59] <stashbot>	 T329664: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664
[09:49:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[09:53:58] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) depool 2 services in codfw: T329664
[09:54:00] <wikibugs>	 (CR) Elukey: [C: +1] "Looks good to me for a test. Adding Moritz since this may eventually be refactored into a standard recipe." [puppet] - https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361) (owner: Nicolas Fraison)
[09:54:01] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[09:54:02] <stashbot>	 T329664: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664
[09:58:04] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[09:58:44] <wikibugs>	 SRE, Znuny: Puppet template for /etc/clamav/clamd.conf needs to be updated - https://phabricator.wikimedia.org/T330129 (MoritzMuehlenhoff)
[09:59:01] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[09:59:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:01:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:02:52] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:04:00] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:04:28] <wikibugs>	 (PS1) Elukey: knative-serving: fix some bugs in the chart [deployment-charts] - https://gerrit.wikimedia.org/r/890780 (https://phabricator.wikimedia.org/T327767)
[10:08:49] <jinxer-wm>	 (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning
[10:09:09] <dcausse>	 expected ^ will silence them
[10:11:32] <wikibugs>	 (CR) Klausman: [C: +1] knative-serving: fix some bugs in the chart [deployment-charts] - https://gerrit.wikimedia.org/r/890780 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[10:11:49] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[10:13:01] <wikibugs>	 (PS44) Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[10:13:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[10:13:53] <wikibugs>	 (PS2) Majavah: alerts: Allow customizing the git repository info [puppet] - https://gerrit.wikimedia.org/r/890489 (https://phabricator.wikimedia.org/T304716)
[10:13:55] <wikibugs>	 (PS3) Majavah: P:toolforge::prometheus: deploy alert rules from GitLab [puppet] - https://gerrit.wikimedia.org/r/890490 (https://phabricator.wikimedia.org/T284860)
[10:14:07] <wikibugs>	 (CR) Majavah: alerts: Allow customizing the git repository info (1 comment) [puppet] - https://gerrit.wikimedia.org/r/890489 (https://phabricator.wikimedia.org/T304716) (owner: Majavah)
[10:15:03] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm, can be merged during switchover tomorrow" [puppet] - https://gerrit.wikimedia.org/r/890779 (https://phabricator.wikimedia.org/T329930) (owner: EoghanGaffney)
[10:16:22] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2165 to s8 master [puppet] - https://gerrit.wikimedia.org/r/890353 (https://phabricator.wikimedia.org/T330134)
[10:16:41] <wikibugs>	 (CR) Stevemunene: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39751/console" [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[10:18:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[10:20:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:22:12] <wikibugs>	 (PS45) Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[10:23:22] <wikibugs>	 (CR) Elukey: [C: +2] knative-serving: fix some bugs in the chart [deployment-charts] - https://gerrit.wikimedia.org/r/890780 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[10:24:09] <wikibugs>	 (CR) CI reject: [V: -1] Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[10:24:18] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:25:23] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Decide which cookbooks using icinga_hosts.wait_for_optimal() should use skip_acked=True - https://phabricator.wikimedia.org/T330136 (Volans) p:Triage→Medium
[10:26:18] <wikibugs>	 (PS46) Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[10:26:51] <wikibugs>	 (CR) Hashar: [C: -1] systemd timers: Add the 'After' requirement to the timer module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022) (owner: Andrew Bogott)
[10:28:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT clusterissuers) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:29:51] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39752/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[10:29:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:29:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s8 T330134
[10:30:02] <stashbot>	 T330134: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T330134
[10:30:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s8 T330134
[10:30:52] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:30:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db2165 with weight 0 T330134', diff saved to https://phabricator.wikimedia.org/P44696 and previous config saved to /var/cache/conftool/dbconfig/20230221-103053-ladsgroup.json
[10:31:56] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:32:23] <wikibugs>	 (03PS1) 10Clément Goubert: sre.discovery.datacenter: Logging improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/890782
[10:33:53] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:33:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT clusterissuers) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:34:53] <wikibugs>	 (03PS47) 10Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[10:36:50] <wikibugs>	 (03PS1) 10Elukey: knative-serving: extend delay for domain mapping webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/890783 (https://phabricator.wikimedia.org/T327767)
[10:37:13] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:37:30] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:38:33] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39753/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[10:39:00] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/890782 (owner: 10Clément Goubert)
[10:39:10] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[10:40:10] <wikibugs>	 (03PS1) 10EoghanGaffney: Update DNS to switch gitlab-replica [dns] - 10https://gerrit.wikimedia.org/r/890785 (https://phabricator.wikimedia.org/T329930)
[10:43:50] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc-gp1003.eqiad.wmnet with OS bullseye
[10:44:35] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] knative-serving: extend delay for domain mapping webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/890783 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[10:44:44] <wikibugs>	 (03PS2) 10Clément Goubert: sre.discovery.datacenter: Logging improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/890782
[10:44:51] <wikibugs>	 (03CR) 10Clément Goubert: sre.discovery.datacenter: Logging improvements (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/890782 (owner: 10Clément Goubert)
[10:46:24] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 23 hosts with reason: Reinitialize wikikube codfw with k8s 1.23
[10:46:41] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 23 hosts with reason: Reinitialize wikikube codfw with k8s 1.23
[10:48:11] <Amir1>	 jouncebot: nowandnext
[10:48:11] <jouncebot>	 For the next 5 hour(s) and 11 minute(s): Kubernetes upgrade wikikube codfw (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0900)
[10:48:11] <jouncebot>	 For the next 0 hour(s) and 11 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0900)
[10:48:11] <jouncebot>	 In 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1100)
[10:49:21] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[10:49:38] <wikibugs>	 (03CR) 10Nicolas Fraison: netboot: create dedicated partman recipe for presto workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison)
[10:49:40] <wikibugs>	 (03CR) 10Nicolas Fraison: [C: 03+2] netboot: create dedicated partman recipe for presto workers [puppet] - 10https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison)
[10:49:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:50:57] <logmsgbot>	 !log nfraison@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1001.eqiad.wmnet with OS bullseye
[10:51:25] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm, can be merged during switchover tomorrow" [dns] - 10https://gerrit.wikimedia.org/r/890785 (https://phabricator.wikimedia.org/T329930) (owner: 10EoghanGaffney)
[10:53:22] <wikibugs>	 (03PS2) 10Ladsgroup: mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/890353 (https://phabricator.wikimedia.org/T330134) (owner: 10Gerrit maintenance bot)
[10:53:26] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/890353 (https://phabricator.wikimedia.org/T330134) (owner: 10Gerrit maintenance bot)
[10:54:35] <Amir1>	 !log Starting s8 codfw failover from db2161 to db2165 - T330134
[10:54:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:38] <stashbot>	 T330134: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T330134
[10:54:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:55:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db2165 to s8 primary T330134', diff saved to https://phabricator.wikimedia.org/P44697 and previous config saved to /var/cache/conftool/dbconfig/20230221-105503-ladsgroup.json
[10:55:30] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: T329664
[10:55:34] <stashbot>	 T329664: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664
[10:55:34] <wikibugs>	 (03PS3) 10Clément Goubert: sre.discovery.datacenter: Logging improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/890782
[10:56:14] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp1003.eqiad.wmnet with reason: host reimage
[10:56:32] <wikibugs>	 (03PS4) 10Clément Goubert: sre.discovery.datacenter: Logging improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/890782
[10:57:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2161 T330134', diff saved to https://phabricator.wikimedia.org/P44698 and previous config saved to /var/cache/conftool/dbconfig/20230221-105714-ladsgroup.json
[10:57:42] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] wikikube: Update cluster settings for k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/890390 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm)
[10:58:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P44699 and previous config saved to /var/cache/conftool/dbconfig/20230221-105823-root.json
[10:59:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:59:42] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.ganeti.reimage for host kubetcd2004.codfw.wmnet with OS bullseye
[10:59:51] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubetcd2005.codfw.wmnet with OS bullseye
[11:00:04] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubetcd2006.codfw.wmnet with OS bullseye
[11:00:04] <jouncebot>	 jayme: How many deployers does it take to do Kubernetes upgrade wikikube codfw deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0900).
[11:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1100)
[11:01:06] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp1003.eqiad.wmnet with reason: host reimage
[11:01:06] <jayme>	 Alerts arising from kubernetes / wikikube in codfw is me reimaging (jynus, godog)
[11:01:21] <jynus>	 yeah, I've seen them already
[11:01:29] <jynus>	 we were aware
[11:01:34] <wikibugs>	 (03PS16) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870)
[11:01:39] <jayme>	 ok, cool
[11:01:43] <jynus>	 but please ping if maintenance finishes/codfw is repooled
[11:01:57] <jynus>	 to be more on top of it in case something is bad when it shouldn't
[11:02:26] <jayme>	 sure. but repool won't happen before switch maintenance is done
[11:04:29] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] knative-serving: extend delay for domain mapping webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/890783 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[11:05:51] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/890782 (owner: 10Clément Goubert)
[11:06:33] <wikibugs>	 (03PS1) 10Ladsgroup: Move userrights related configs from IS.php to core-Permissions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890789 (https://phabricator.wikimedia.org/T308932)
[11:06:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job kubetcd in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:07:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Move userrights related configs from IS.php to core-Permissions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890789 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup)
[11:07:14] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[11:08:15] <wikibugs>	 (03PS2) 10Ladsgroup: Move userrights related configs from IS.php to core-Permissions.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890789 (https://phabricator.wikimedia.org/T308932)
[11:09:51] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[11:10:35] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10jbond)
[11:11:24] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubetcd2006.codfw.wmnet with reason: host reimage
[11:11:28] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubetcd2005.codfw.wmnet with reason: host reimage
[11:12:40] <vgutierrez>	 !log rolling upgrade to HAproxy 2.6.9 on ulsfo
[11:12:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P44700 and previous config saved to /var/cache/conftool/dbconfig/20230221-111328-root.json
[11:13:45] <wikibugs>	 (03PS1) 10Jbond: puppetmaster: offline 2003 for switch upgrade [puppet] - 10https://gerrit.wikimedia.org/r/890792 (https://phabricator.wikimedia.org/T327991)
[11:13:53] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubetcd2006.codfw.wmnet with reason: host reimage
[11:14:52] <wikibugs>	 (03PS1) 10Elukey: sre.k8s.upgrade-cluster: add final ask_confirmation before the end [cookbooks] - 10https://gerrit.wikimedia.org/r/890794 (https://phabricator.wikimedia.org/T327767)
[11:15:32] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] sre.k8s.upgrade-cluster: add final ask_confirmation before the end [cookbooks] - 10https://gerrit.wikimedia.org/r/890794 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[11:15:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:16:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster: offline 2003 for switch upgrade [puppet] - 10https://gerrit.wikimedia.org/r/890792 (https://phabricator.wikimedia.org/T327991) (owner: 10Jbond)
[11:16:22] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubetcd2005.codfw.wmnet with reason: host reimage
[11:16:31] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubetcd2004.codfw.wmnet with reason: host reimage
[11:16:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.k8s.upgrade-cluster: add final ask_confirmation before the end [cookbooks] - 10https://gerrit.wikimedia.org/r/890794 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[11:16:42] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 10 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10jbond)
[11:16:56] <wikibugs>	 (03PS1) 10Elukey: ml-services: set proper EventGate stream values [deployment-charts] - 10https://gerrit.wikimedia.org/r/890795 (https://phabricator.wikimedia.org/T327767)
[11:16:57] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp1003.eqiad.wmnet with OS bullseye
[11:17:17] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (DIFF 1 NOOP 21): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39754/console" [puppet] - 10https://gerrit.wikimedia.org/r/890792 (https://phabricator.wikimedia.org/T327991) (owner: 10Jbond)
[11:18:13] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 10 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10fgiunchedi)
[11:18:36] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to deployment for Sergio Gimeno - https://phabricator.wikimedia.org/T330070 (10SLyngshede-WMF)
[11:19:09] <wikibugs>	 (03PS2) 10Elukey: sre.k8s.upgrade-cluster: add final ask_confirmation before the end [cookbooks] - 10https://gerrit.wikimedia.org/r/890794 (https://phabricator.wikimedia.org/T327767)
[11:19:39] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubetcd2004.codfw.wmnet with reason: host reimage
[11:21:47] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.k8s.upgrade-cluster: add final ask_confirmation before the end [cookbooks] - 10https://gerrit.wikimedia.org/r/890794 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[11:23:30] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[11:23:50] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[11:24:23] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[11:24:41] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: T329664
[11:24:45] <stashbot>	 T329664: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664
[11:25:09] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[11:25:15] <jayme>	 I've aborted the cookbook on purpose
[11:25:25] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[11:25:25] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: T329664
[11:25:41] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[11:25:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:26:00] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[11:26:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] alerts: Allow customizing the git repository info [puppet] - 10https://gerrit.wikimedia.org/r/890489 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah)
[11:26:22] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host kubetcd2006.codfw.wmnet with OS bullseye
[11:26:29] <wikibugs>	 (03PS1) 10Slyngshede: data.yaml add sgimeno to deployment group. [puppet] - 10https://gerrit.wikimedia.org/r/890797 (https://phabricator.wikimedia.org/T330070)
[11:26:59] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Sergio Gimeno - https://phabricator.wikimedia.org/T330070 (10SLyngshede-WMF) Approval to deployment group is required from @thcipriani according to the data.yaml info.
[11:27:41] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubetcd2005.codfw.wmnet with OS bullseye
[11:28:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P44701 and previous config saved to /var/cache/conftool/dbconfig/20230221-112833-root.json
[11:28:53] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] ml-services: set proper EventGate stream values [deployment-charts] - 10https://gerrit.wikimedia.org/r/890795 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[11:30:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/890432 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff)
[11:30:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/890439 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff)
[11:30:57] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:30:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (6) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:31:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/homer] - 10https://gerrit.wikimedia.org/r/890394 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi)
[11:32:08] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubetcd2004.codfw.wmnet with OS bullseye
[11:32:37] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[11:32:42] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: set proper EventGate stream values [deployment-charts] - 10https://gerrit.wikimedia.org/r/890795 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[11:33:33] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:34:03] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:34:05] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[11:34:35] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubemaster2001.codfw.wmnet with OS bullseye
[11:35:16] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[11:38:49] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:39:00] <wikibugs>	 (03CR) 10JMeybohm: admin_ng: Update wikikube-codfw settings to k8s 1.23 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/890392 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm)
[11:39:47] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:40:18] <jayme>	 AS64602 BGP errors is me as well - T329664
[11:40:18] <stashbot>	 T329664: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664
[11:40:58] <wikibugs>	 10SRE, 10Znuny, 10serviceops-collab: Puppet template for /etc/clamav/clamd.conf needs to be updated - https://phabricator.wikimedia.org/T330129 (10MoritzMuehlenhoff)
[11:41:21] <wikibugs>	 (03PS1) 10Muehlenhoff: clamd.conf: Remove some config entries [puppet] - 10https://gerrit.wikimedia.org/r/890799 (https://phabricator.wikimedia.org/T330129)
[11:43:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P44702 and previous config saved to /var/cache/conftool/dbconfig/20230221-114338-root.json
[11:45:47] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[11:46:34] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubemaster2001.codfw.wmnet with reason: host reimage
[11:47:29] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[11:49:05] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubemaster2001.codfw.wmnet with reason: host reimage
[11:49:38] <wikibugs>	 (03PS1) 10Jbond: systemd::timer::job: update email subject and body [puppet] - 10https://gerrit.wikimedia.org/r/890800 (https://phabricator.wikimedia.org/T330120)
[11:49:45] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] sre.switchdc.services: import service exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) (owner: 10Clément Goubert)
[11:50:46] <wikibugs>	 (03PS15) 10Clément Goubert: sre.switchdc.services: import service exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193)
[11:51:06] <wikibugs>	 (03PS3) 10Clément Goubert: sre.switchdc.mediawiki: Add mw-on-k8s services [cookbooks] - 10https://gerrit.wikimedia.org/r/889801 (https://phabricator.wikimedia.org/T327924)
[11:51:18] <wikibugs>	 (03PS48) 10Stevemunene: Update airflow conf compatibility with airflow 2.5.0 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[11:51:29] <wikibugs>	 (03PS1) 10Slyngshede: C:idm::deployment use Redis password [puppet] - 10https://gerrit.wikimedia.org/r/890801 (https://phabricator.wikimedia.org/T320797)
[11:51:46] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "change no longer required, see comments" [puppet] - 10https://gerrit.wikimedia.org/r/890497 (owner: 10Jbond)
[11:51:48] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] sre.discovery.datacenter: Add 'all' to status [cookbooks] - 10https://gerrit.wikimedia.org/r/889539 (owner: 10Clément Goubert)
[11:51:57] <wikibugs>	 (03Abandoned) 10Jbond: admin: remove users mmarble/marble [puppet] - 10https://gerrit.wikimedia.org/r/890497 (owner: 10Jbond)
[11:52:06] <wikibugs>	 (03PS5) 10Jbond: admin: add a test to prevent duplicates in users/ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/879421 (owner: 10Majavah)
[11:52:16] <logmsgbot>	 !log nfraison@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1001.eqiad.wmnet with reason: host reimage
[11:52:39] <wikibugs>	 (03PS9) 10Clément Goubert: sre.discovery.datacenter: Add 'all' to status [cookbooks] - 10https://gerrit.wikimedia.org/r/889539
[11:55:22] <logmsgbot>	 !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1001.eqiad.wmnet with reason: host reimage
[11:56:10] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39755/console" [puppet] - 10https://gerrit.wikimedia.org/r/890801 (https://phabricator.wikimedia.org/T320797) (owner: 10Slyngshede)
[11:56:59] <wikibugs>	 (03CR) 10Volans: "FYI all those records have a 1H TTL, depending on how quicker you want the failover to happen and depending if you plan to run the sre.dns" [dns] - 10https://gerrit.wikimedia.org/r/890785 (https://phabricator.wikimedia.org/T329930) (owner: 10EoghanGaffney)
[11:57:02] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39756/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[11:57:22] <Amir1>	 jayme: can I do a couple of mw deploys or you prefer I'd do later?
[11:57:32] <Amir1>	 (nothing urgent)
[11:57:59] <jayme>	 Amir1: You can try now if you like. Deploying to wikikube codfw will fail, though
[11:58:27] <jayme>	 probably with very strange errors as the first control-plane is about to come up again
[11:58:28] <Amir1>	 hmm, I'd wait then, it's not anything important, ping me once done please
[11:58:34] <jayme>	 ack
[11:58:43] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: codfw: Add new WikiKube IP space [homer/public] - 10https://gerrit.wikimedia.org/r/890802 (https://phabricator.wikimedia.org/T326617)
[12:00:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (once the group owner approval is in)" [puppet] - 10https://gerrit.wikimedia.org/r/890797 (https://phabricator.wikimedia.org/T330070) (owner: 10Slyngshede)
[12:00:15] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: WikiKube codfw: Remove the old IP space [homer/public] - 10https://gerrit.wikimedia.org/r/890803 (https://phabricator.wikimedia.org/T326617)
[12:00:54] <wikibugs>	 (03PS2) 10Slyngshede: C:idm::deployment use Redis password [puppet] - 10https://gerrit.wikimedia.org/r/890801 (https://phabricator.wikimedia.org/T320797)
[12:02:21] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[12:02:23] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39757/console" [puppet] - 10https://gerrit.wikimedia.org/r/890801 (https://phabricator.wikimedia.org/T320797) (owner: 10Slyngshede)
[12:02:51] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:02:56] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: WikiKube eqiad: Add the new larger IP space [homer/public] - 10https://gerrit.wikimedia.org/r/890804 (https://phabricator.wikimedia.org/T326617)
[12:02:58] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: WikiKube eqiad: Remove the old IP space [homer/public] - 10https://gerrit.wikimedia.org/r/890805 (https://phabricator.wikimedia.org/T326617)
[12:03:07] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:idm::deployment use Redis password [puppet] - 10https://gerrit.wikimedia.org/r/890801 (https://phabricator.wikimedia.org/T320797) (owner: 10Slyngshede)
[12:03:25] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[12:04:25] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] codfw: Add new WikiKube IP space [homer/public] - 10https://gerrit.wikimedia.org/r/890802 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris)
[12:04:34] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "Its hard to tell from the task the error this is attempting to fix, however it doesn't seem like the correct way forward.  perhaps the iss" [puppet] - 10https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022) (owner: 10Andrew Bogott)
[12:04:59] <wikibugs>	 (03Merged) 10jenkins-bot: codfw: Add new WikiKube IP space [homer/public] - 10https://gerrit.wikimedia.org/r/890802 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris)
[12:05:27] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:05:52] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good (one typo inline)" [puppet] - https://gerrit.wikimedia.org/r/890800 (https://phabricator.wikimedia.org/T330120) (owner: Jbond)
[12:05:54] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubemaster2001.codfw.wmnet with OS bullseye
[12:05:56] <wikibugs>	 (PS1) Slyngshede: C:idm::deployment missing comma [puppet] - https://gerrit.wikimedia.org/r/890806
[12:06:00] <wikibugs>	 (CR) Jbond: [C: +2] admin: add a test to prevent duplicates in users/ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/879421 (owner: Majavah)
[12:06:09] <wikibugs>	 (CR) Clément Goubert: sre.discovery.datacenter: Add 'all' to status [cookbooks] - https://gerrit.wikimedia.org/r/889539 (owner: Clément Goubert)
[12:06:13] <wikibugs>	 (CR) Clément Goubert: [C: +2] sre.discovery.datacenter: Add 'all' to status [cookbooks] - https://gerrit.wikimedia.org/r/889539 (owner: Clément Goubert)
[12:06:35] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubemaster2002.codfw.wmnet with OS bullseye
[12:06:35] <wikibugs>	 (CR) Slyngshede: [C: +2] C:idm::deployment missing comma [puppet] - https://gerrit.wikimedia.org/r/890806 (owner: Slyngshede)
[12:08:02] <wikibugs>	 (Merged) jenkins-bot: sre.discovery.datacenter: Add 'all' to status [cookbooks] - https://gerrit.wikimedia.org/r/889539 (owner: Clément Goubert)
[12:08:31] <wikibugs>	 (PS16) Clément Goubert: sre.switchdc.services: import service exclusions [cookbooks] - https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193)
[12:08:47] <wikibugs>	 (PS2) Jbond: systemd::timer::job: update email subject and body [puppet] - https://gerrit.wikimedia.org/r/890800 (https://phabricator.wikimedia.org/T330120)
[12:09:02] <wikibugs>	 (CR) Jbond: [C: +2] "thanks" [puppet] - https://gerrit.wikimedia.org/r/890800 (https://phabricator.wikimedia.org/T330120) (owner: Jbond)
[12:09:06] <wikibugs>	 (CR) Ayounsi: WikiKube codfw: Remove the old IP space (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/890803 (https://phabricator.wikimedia.org/T326617) (owner: Alexandros Kosiaris)
[12:09:08] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] systemd::timer::job: update email subject and body [puppet] - https://gerrit.wikimedia.org/r/890800 (https://phabricator.wikimedia.org/T330120) (owner: Jbond)
[12:09:51] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:10:07] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /robots.txt (robots.txt check) timed out before a response was received: / (spec from root) timed out before a response was received: / (doc from root) timed out before a response was received: / (root with wrong query param) timed out before a response was received: /v1/page/{language}/{title} (Fetch enwiki protected page) timed out before a response was received: /v1/
[12:10:07] <icinga-wm>	 nguage}/{title}/{revision} (Fetch enwiki protected page) timed out before a response was received: /v1/dictionary/{word}/{from}/{to} (Fetch dictionary meaning with a given provider) timed out before a response was received: /v1/dictionary/{word}/{from}/{to} (Fetch dictionary meaning without specifying a provider) timed out before a response was received: /v1/dictionary/{word}/{from}/{to}/{provider} (Fetch dictionary meaning with a given p
[12:10:07] <icinga-wm>	  timed out before a response was received: /v1/dictionary/{word}/{from}/{to}/{provider} (Fetch dictionary meaning without specifying a provider) timed out before a response was received: /v1/mt/{from}/{to} (Machine translate an HTML fragment using TestClient.) timed out before a response was received: /v1/mt/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient.) timed out before a response was received: /v1/list/pai
[12:10:07] <icinga-wm>	 /{to} (Get the tools between two language pairs) timed out before a response was received: /v1/list/languagepairs (Get all the language pairs) timed out before a response was received: /v1/list/{tool} (Get the MT tool between two language pairs) timed out before a response was received: /v1/list/{tool}/{from}/{to} (Get the MT tool between two language pairs) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlangua
[12:10:07] <icinga-wm>	 le} (Translate enwiki protected page) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title}/{revision} (Translate enwiki protected page) timed out before a response was received: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/translate/{from}/{to}/{provider} (Machine translate an H
[12:10:08] <icinga-wm>	 ment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/title/{title}/{from}/{to} (Suggest a target title for the given source title and language pairs) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Su
[12:10:08] <icinga-wm>	 urce sections to translate) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /_info/name (retrieve service name) timed out before a response was received: /_info/version (retrieve service version) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[12:10:18] <jinxer-wm>	 (ProbeDown) firing: (4) Service eventgate-main:4492 has failed probes (http_eventgate-main_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:10:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-api-ext_4447: Servers kubernetes2007.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2020.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2018.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2021.codfw.wmnet, kubernetes2022.codf
[12:10:25] <icinga-wm>	  kubernetes2008.codfw.wmnet are marked down but pooled: mw-web_4450: Servers kubernetes2010.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2018.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2021.codfw.wmnet, kubernetes2008.codfw.wmnet are marked down but pooled: eventgate-analytics-external_4692: Servers kubernetes2010.codfw.wmnet, kubernetes2013.
[12:10:25] <icinga-wm>	 net, kubernetes2020.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2022.codfw.wmnet, kubernetes20 https://wikitech.wikimedia.org/wiki/PyBal
[12:10:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-api-ext_4447: Servers kubernetes2010.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2021.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2008.codfw.wmnet are marked down but pooled: mw-web_4450: Servers kubernetes2010.codfw.wmnet,
[12:10:43] <icinga-wm>	 tes2013.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2021.codfw.wmnet, kubernetes2022.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2011.codfw.wmnet are marked down but pooled: eventgate-analytics-external_4692: Servers kubernetes2007.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2020.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2006.
[12:10:43] <icinga-wm>	 net, kubernetes2012.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2018.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes20 https://wikitech.wikimedia.org/wiki/PyBal
[12:10:45] <jynus>	 no impact, right?
[12:10:53] <icinga-wm>	 PROBLEM - Maps HTTPS on maps1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:11:00] <jynus>	 we just got some paging issues
[12:11:06] <claime>	 jynus: shouldn't, it's depooled
[12:11:09] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7fc1c0221b70, Connection to wikifeeds.svc.codfw.wmnet timed out. (connect timeout=15)): /?spec https://wikitech.wikimedia.org/wiki/Wikifeeds
[12:11:10] <claime>	 jayme: ^
[12:11:12] <jynus>	 ok, acking
[12:11:13] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:11:14] <godog>	 here too, checking
[12:11:29] <jynus>	 godog: help me double check on monitoring 0 impact
[12:11:45] <godog>	 for sure jynus 
[12:11:55] <akosiaris>	 it should be 0 impact indeed, DC has been depooled
[12:11:57] <wikibugs>	 (CR) Clément Goubert: [C: +2] sre.switchdc.services: import service exclusions [cookbooks] - https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) (owner: Clément Goubert)
[12:12:07] <godog>	 yeah can confirm, I'm not seeing any impact so far
[12:12:08] <jynus>	 I acked on splunk
[12:12:09] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:12:18] <jinxer-wm>	 (ProbeDown) firing: (15) Service citoid:4003 has failed probes (http_citoid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:12:32] <wikibugs>	 (CR) TheDJ: [C: +1] varnish: Set `X-Content-Type-Options: nosniff` on upload requests [puppet] - https://gerrit.wikimedia.org/r/890512 (https://phabricator.wikimedia.org/T309787) (owner: Legoktm)
[12:12:35] <akosiaris>	 !log add 10.194.128.0/18 to kubernetes-ipv4 prefix-list for codfw. T326617
[12:12:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:39] <stashbot>	 T326617: Decide on new Pod and Sevice IPv4 ranges for wikikube clusters - https://phabricator.wikimedia.org/T326617
[12:12:43] <jynus>	 indeed traffic is flatlined at 0
[12:13:07] <godog>	 ok to ack/silence the alerts that are paging ?
[12:13:23] <icinga-wm>	 PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: / (spec from root) timed out before a response was received: / (mass-energy equivalence (json)) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{format}/ (mass-energy equivalence (mml)) timed out before a response was received: /{format}/ (mass-energy equivalence (texvcinfo)) timed out before a response w
[12:13:23] <icinga-wm>	 ved: /{format}/ (Invalid command (texvcinfo)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mathoid
[12:13:31] <icinga-wm>	 PROBLEM - eventgate-logging-external LVS codfw on eventgate-logging-external.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7fa304688cf8, Connection to eventgate-logging-external.svc.codfw.wmnet timed out. (connect timeout=15)): /?spec https://wikitec
[12:13:31] <icinga-wm>	 dia.org/wiki/Event_Platform/EventGate
[12:13:31] <akosiaris>	 godog: yes
[12:13:37] <vgutierrez>	 godog: at least ack them or they are gonna page everybody
[12:13:48] <wikibugs>	 (Merged) jenkins-bot: sre.switchdc.services: import service exclusions [cookbooks] - https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) (owner: Clément Goubert)
[12:13:57] <godog>	 akosiaris: ack (ha ha) thank you
[12:14:13] <godog>	 vgutierrez: yeah the VO page has been ack'd already by jynus, I was referring to alerts.w.o
[12:14:16] <akosiaris>	 lol
[12:14:25] <vgutierrez>	 I think I need a beer already after that joke
[12:14:42] <wikibugs>	 (PS4) Clément Goubert: sre.switchdc.mediawiki: Add mw-on-k8s services [cookbooks] - https://gerrit.wikimedia.org/r/889801 (https://phabricator.wikimedia.org/T327924)
[12:14:47] <icinga-wm>	 PROBLEM - eventgate-analytics-external LVS codfw on eventgate-analytics-external.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7ff350b51c50, Connection to eventgate-analytics-external.svc.codfw.wmnet timed out. (connect timeout=15)): /?spec https://w
[12:14:47] <icinga-wm>	 wikimedia.org/wiki/Event_Platform/EventGate
[12:15:03] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:15:08] <godog>	 haha you are welcome
[12:15:11] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: / (spec from root) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain}/v1/data/css/mobile/base (Get base CSS) timed out before a response was received: /{domain}/v1/data/css/mobile/pagelib (Get CSS bundle from wikimedia-page-library) timed out before a response was received: /{domain}/v1/data/c
[12:15:11] <icinga-wm>	 e/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/data/i18n/pcs (Get i18n strings for the Page Content Service) timed out before a response was received: /{domain}/v1/data/javascript/mobile/pagelib (Get javascript bundle for page library) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /
[12:15:11] <icinga-wm>	 /v1/page/description/{title} (Get description for test page) is CRITICAL: Test Get description for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve
[12:15:11] <icinga-wm>	 ge via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[12:15:18] <jinxer-wm>	 (ProbeDown) firing: (7) Service eventgate-main:4492 has failed probes (http_eventgate-main_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:15:57] <jynus>	 I will keep an eye on phab too, in case some deployer finds issues or something
[12:16:05] <claime>	 vgutierrez: Well it's 5PM in Bhutan
[12:16:31] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubemaster2002.codfw.wmnet with reason: host reimage
[12:16:43] <jayme>	 sorry for the noise :/
[12:16:50] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:17:02] <wikibugs>	 (CR) Clément Goubert: [C: +2] sre.switchdc.mediawiki: Add mw-on-k8s services [cookbooks] - https://gerrit.wikimedia.org/r/889801 (https://phabricator.wikimedia.org/T327924) (owner: Clément Goubert)
[12:17:07] <godog>	 jayme: no worries, that's what we're here for
[12:17:27] <jinxer-wm>	 (ProbeDown) firing: (16) Service citoid:4003 has failed probes (http_citoid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:17:39] <jayme>	 not sure if I should silence all probedown alerts from codfw as services are depooled anyways. wdyt godog?
[12:18:10] <godog>	 jayme: mmhh interesting, I'm not sure offhand
[12:18:23] <godog>	 happy to discuss later though, I was in the middle of lunch
[12:18:27] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 554962312 and 181 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:18:37] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/homer] - https://gerrit.wikimedia.org/r/890394 (https://phabricator.wikimedia.org/T277438) (owner: Ayounsi)
[12:18:46] <jayme>	 unfortunately there is no clever way to filter kubernetes services I guess..
[12:19:00] <godog>	 indeed
[12:19:06] <godog>	 ok going back to lunch, ttyl
[12:19:10] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubemaster2002.codfw.wmnet with reason: host reimage
[12:19:11] <jayme>	 oh well, yeah. Get back to lunch then :) not urgent obviously
[12:19:17] <wikibugs>	 (Merged) jenkins-bot: sre.switchdc.mediawiki: Add mw-on-k8s services [cookbooks] - https://gerrit.wikimedia.org/r/889801 (https://phabricator.wikimedia.org/T327924) (owner: Clément Goubert)
[12:19:33] <icinga-wm>	 PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: / (spec from root) timed out before a response was received: / (mass-energy equivalence (json)) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{format}/ (mass-energy equivalence (complete)) timed out before a response was received: /{format}/ (mass-energy equivalence (svg)) timed out before a response wa
[12:19:33] <icinga-wm>	 ed: /{format}/ (mass-energy equivalence (mml)) timed out before a response was received: /{format}/ (mass-energy equivalence (texvcinfo)) timed out before a response was received: /{format}/ (Invalid command (texvcinfo)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mathoid
[12:19:44] <jynus>	 incident autoresolved
[12:20:05] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 716064 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:20:36] <jynus>	 although there are still some failed probes
[12:21:45] <jinxer-wm>	 (JobUnavailable) firing: (14) Reduced availability for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:21:49] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: / (spec from root) timed out before a response was received: /{domain}/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured ar
[12:21:49] <icinga-wm>	 r April 29, 2016) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve featured article info for unsupported site (with aggregated=true)) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most 
[12:21:49] <icinga-wm>	 icles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve most-read articles for date with no data (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/news (get In the News
[12:21:49] <icinga-wm>	 ) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[12:21:59] <icinga-wm>	 PROBLEM - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The following units failed: reload-acme-chief-backend.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:22:23] <wikibugs>	 (PS1) Slyngshede: C:idm::deployment missing comma in Ferm rule [puppet] - https://gerrit.wikimedia.org/r/890807
[12:23:11] <icinga-wm>	 PROBLEM - eventgate-analytics LVS codfw on eventgate-analytics.svc.codfw.wmnet is CRITICAL: /robots.txt (robots.txt check) timed out before a response was received: / (spec from root) timed out before a response was received: / (doc from root) timed out before a response was received: / (root with wrong query param) timed out before a response was received https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate
[12:23:17] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:23:21] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: / (spec from root) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain}/v1/data/css/mobile/base (Get base CSS) timed out before a response was received: /{domain}/v1/data/css/mobile/pagelib (Get CSS bundle from wikimedia-page-library) timed out before a response was received: /{domain}/v1/data/i
[12:23:21] <icinga-wm>	 (Get i18n strings for the Page Content Service) timed out before a response was received: /{domain}/v1/data/javascript/mobile/pagelib (Get javascript bundle for page library) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/description/{title} (Get description for test page) timed out before a response was
[12:23:21] <icinga-wm>	 d: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/talk/{title} (Get st
[12:23:21] <icinga-wm>	  talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[12:23:23] <icinga-wm>	 PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: / (spec from root) timed out before a response was received: /_info/name (retrieve service name) timed out before a response was received: /_info/home (redirect to the home page) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{t
[12:23:23] <icinga-wm>	 ormat}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond bad request for an unsupported format) timed out before a response was received https://wikitech.wikimedia.or
[12:23:23] <icinga-wm>	 roton
[12:23:32] <vgutierrez>	 hmmm acme-chief is screaming?
[12:23:37] * vgutierrez checking
[12:25:39] <icinga-wm>	 PROBLEM - Check unit status of reload-acme-chief-backend on acmechief1001 is CRITICAL: CRITICAL: Status of the systemd unit reload-acme-chief-backend https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:26:27] <icinga-wm>	 PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: / (spec from root) timed out before a response was received: / (mass-energy equivalence (json)) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{format}/ (mass-energy equivalence (complete)) timed out before a response was received: /{format}/ (mass-energy equivalence (svg)) timed out before a response wa
[12:26:27] <icinga-wm>	 ed: /{format}/ (mass-energy equivalence (mml)) timed out before a response was received: /{format}/ (mass-energy equivalence (texvcinfo)) timed out before a response was received: /{format}/ (Invalid command (texvcinfo)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mathoid
[12:26:45] <jinxer-wm>	 (JobUnavailable) firing: (14) Reduced availability for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:27:09] <jynus>	 vgutierrez: let me know if I can help in any way
[12:27:19] <vgutierrez>	 jynus: -sre
[12:28:15] <icinga-wm>	 PROBLEM - eventgate-main LVS codfw on eventgate-main.svc.codfw.wmnet is CRITICAL: /robots.txt (robots.txt check) timed out before a response was received: / (root with no query params) timed out before a response was received: / (spec from root) timed out before a response was received: / (doc from root) timed out before a response was received: / (root with wrong query param) timed out before a response was received https://wikitech.wiki
[12:28:15] <icinga-wm>	 g/wiki/Event_Platform/EventGate
[12:29:39] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:29:57] <icinga-wm>	 PROBLEM - eventgate-logging-external LVS codfw on eventgate-logging-external.svc.codfw.wmnet is CRITICAL: /robots.txt (robots.txt check) timed out before a response was received: / (root with no query params) timed out before a response was received: / (spec from root) timed out before a response was received: / (doc from root) timed out before a response was received: / (root with wrong query param) timed out before a response was receiv
[12:29:57] <icinga-wm>	 ://wikitech.wikimedia.org/wiki/Event_Platform/EventGate
[12:30:06] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39758/console" [puppet] - https://gerrit.wikimedia.org/r/890807 (owner: Slyngshede)
[12:30:51] <icinga-wm>	 RECOVERY - Check systemd state on acmechief1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:42] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] C:idm::deployment missing comma in Ferm rule [puppet] - https://gerrit.wikimedia.org/r/890807 (owner: Slyngshede)
[12:31:45] <jinxer-wm>	 (JobUnavailable) firing: (13) Reduced availability for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:34:30] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubemaster2002.codfw.wmnet with OS bullseye
[12:34:32] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: T329664
[12:34:37] <stashbot>	 T329664: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664
[12:35:42] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: T329664
[12:36:27] <icinga-wm>	 RECOVERY - Check unit status of reload-acme-chief-backend on acmechief1001 is OK: OK: Status of the systemd unit reload-acme-chief-backend https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:36:42] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2009.codfw.wmnet with OS bullseye
[12:36:45] <jinxer-wm>	 (JobUnavailable) firing: (14) Reduced availability for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:36:57] <jinxer-wm>	 (ProbeDown) firing: (16) Service citoid:4003 has failed probes (http_citoid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:37:35] <icinga-wm>	 PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f64db51fba8, Connection to mathoid.svc.codfw.wmnet timed out. (connect timeout=15)): /?spec https://wikitech.wikimedia.org/wiki/Mathoid
[12:38:45] <icinga-wm>	 PROBLEM - Sessionstore codfw on sessionstore.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f480741ccc0, Connection to sessionstore.svc.codfw.wmnet timed out. (connect timeout=15)): /openapi https://www.mediawiki.org/wiki/Kask
[12:39:59] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubernetes2005.codfw.wmnet with OS bullseye
[12:40:00] <wikibugs>	 (CR) Muehlenhoff: C:idm::deployment missing comma in Ferm rule (1 comment) [puppet] - https://gerrit.wikimedia.org/r/890807 (owner: Slyngshede)
[12:40:03] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubernetes2006.codfw.wmnet with OS bullseye
[12:40:07] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubernetes2016.codfw.wmnet with OS bullseye
[12:40:10] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubernetes2015.codfw.wmnet with OS bullseye
[12:41:37] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7fd8f3e5aba8, Connection to termbox.svc.codfw.wmnet timed out. (connect timeout=15)): /?spec https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[12:41:40] <logmsgbot>	 !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1001.eqiad.wmnet with OS bullseye
[12:41:45] <jinxer-wm>	 (JobUnavailable) firing: (13) Reduced availability for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:41:57] <jinxer-wm>	 (ProbeDown) firing: (16) Service citoid:4003 has failed probes (http_citoid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:43:36] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2010.codfw.wmnet with OS bullseye
[12:43:43] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2020.codfw.wmnet with OS bullseye
[12:43:52] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2023.codfw.wmnet with OS bullseye
[12:44:31] <icinga-wm>	 PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: / (spec from root) timed out before a response was received: / (mass-energy equivalence (json)) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{format}/ (mass-energy equivalence (complete)) timed out before a response was received: /{format}/ (mass-energy equivalence (svg)) timed out before a response wa
[12:44:31] <icinga-wm>	 ed: /{format}/ (mass-energy equivalence (mml)) timed out before a response was received: /{format}/ (mass-energy equivalence (texvcinfo)) timed out before a response was received: /{format}/ (Invalid command (texvcinfo)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mathoid
[12:46:45] <jinxer-wm>	 (JobUnavailable) firing: (14) Reduced availability for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:47:19] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[12:49:06] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2007.codfw.wmnet with OS bullseye
[12:50:37] <wikibugs>	 (03PS2) 10Jbond: sre.hosts.reboot-single: Simplify icinga logic [cookbooks] - 10https://gerrit.wikimedia.org/r/868430 (https://phabricator.wikimedia.org/T325153)
[12:50:45] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2008.codfw.wmnet with OS bullseye
[12:50:46] <logmsgbot>	 !log nfraison@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1002.eqiad.wmnet with OS bullseye
[12:50:48] <wikibugs>	 (03CR) 10Jbond: sre.hosts.reboot-single: Simplify icinga logic (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/868430 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond)
[12:50:49] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[12:51:15] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2013.codfw.wmnet with OS bullseye
[12:51:28] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2005.codfw.wmnet with reason: host reimage
[12:51:34] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2016.codfw.wmnet with reason: host reimage
[12:51:39] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2015.codfw.wmnet with reason: host reimage
[12:51:39] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2006.codfw.wmnet with reason: host reimage
[12:51:45] <jinxer-wm>	 (JobUnavailable) firing: (14) Reduced availability for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:52:03] <icinga-wm>	 PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7ff7c7c32b70, Connection to mathoid.svc.codfw.wmnet timed out. (connect timeout=15)): /?spec https://wikitech.wikimedia.org/wiki/Mathoid
[12:52:34] <wikibugs>	 (03CR) 10Muehlenhoff: netboot: create dedicated partman recipe for presto workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison)
[12:53:19] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2011.codfw.wmnet with OS bullseye
[12:54:15] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2005.codfw.wmnet with reason: host reimage
[12:54:24] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2012.codfw.wmnet with OS bullseye
[12:54:33] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f93a3f66cc0, Connection to termbox.svc.codfw.wmnet timed out. (connect timeout=15)): /?spec https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[12:55:35] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2014.codfw.wmnet with OS bullseye
[12:56:41] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2006.codfw.wmnet with reason: host reimage
[12:56:42] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2016.codfw.wmnet with reason: host reimage
[12:56:45] <jinxer-wm>	 (JobUnavailable) firing: (14) Reduced availability for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:57:32] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2022.codfw.wmnet with OS bullseye
[12:57:59] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2024.codfw.wmnet with OS bullseye
[12:58:08] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2020.codfw.wmnet with reason: host reimage
[12:58:57] <wikibugs>	 (03PS1) 10Nicolas Fraison: presto.coordinator: reduce max heap size of coordinator [puppet] - 10https://gerrit.wikimedia.org/r/890810
[12:58:59] <wikibugs>	 (03PS8) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153)
[12:59:05] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2023.codfw.wmnet with reason: host reimage
[12:59:06] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2015.codfw.wmnet with reason: host reimage
[12:59:07] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2010.codfw.wmnet with reason: host reimage
[12:59:15] <wikibugs>	 (03CR) 10Jbond: "updated" [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond)
[13:01:23] <logmsgbot>	 !log nfraison@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1002.eqiad.wmnet with reason: host reimage
[13:01:45] <jinxer-wm>	 (JobUnavailable) firing: (15) Reduced availability for job swagger_check_citoid_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:01:57] <jinxer-wm>	 (ProbeDown) firing: (17) Service citoid:4003 has failed probes (http_citoid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:02:08] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2023.codfw.wmnet with reason: host reimage
[13:03:29] <wikibugs>	 (03PS1) 10Ayounsi: Use port 2222 for management router ssh [homer/public] - 10https://gerrit.wikimedia.org/r/890811 (https://phabricator.wikimedia.org/T277438)
[13:04:32] <logmsgbot>	 !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1002.eqiad.wmnet with reason: host reimage
[13:04:39] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2007.codfw.wmnet with reason: host reimage
[13:06:12] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2008.codfw.wmnet with reason: host reimage
[13:06:20] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2020.codfw.wmnet with reason: host reimage
[13:06:31] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2010.codfw.wmnet with reason: host reimage
[13:06:36] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2013.codfw.wmnet with reason: host reimage
[13:06:45] <jinxer-wm>	 (JobUnavailable) firing: (16) Reduced availability for job dragonfly_dfdaemon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:06:54] <wikibugs>	 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (10JMeybohm) This might be caused by {T330048}
[13:08:32] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubernetes2005.codfw.wmnet with OS bullseye
[13:09:12] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2011.codfw.wmnet with reason: host reimage
[13:09:21] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2007.codfw.wmnet with reason: host reimage
[13:09:30] <wikibugs>	 (03PS2) 10Ayounsi: Management routers: move ssh port to 2222 [homer/public] - 10https://gerrit.wikimedia.org/r/890811 (https://phabricator.wikimedia.org/T277438)
[13:10:09] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2012.codfw.wmnet with reason: host reimage
[13:10:14] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad average message consume rate in last 30m on alert1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[13:10:40] <icinga-wm>	 PROBLEM - confd service on kubernetes2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.135: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:10:48] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:11:00] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2014.codfw.wmnet with reason: host reimage
[13:11:42] <icinga-wm>	 PROBLEM - dhclient process on kubernetes2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.135: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[13:11:45] <jinxer-wm>	 (JobUnavailable) firing: (16) Reduced availability for job dragonfly_dfdaemon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:11:45] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2013.codfw.wmnet with reason: host reimage
[13:12:09] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubernetes2016.codfw.wmnet with OS bullseye
[13:12:28] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2022.codfw.wmnet with reason: host reimage
[13:12:31] <wikibugs>	 (03Abandoned) 10Ayounsi: Management routers: move ssh port to 2222 [homer/public] - 10https://gerrit.wikimedia.org/r/785274 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi)
[13:12:38] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad average message produce rate in last 30m on alert1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[13:13:02] <icinga-wm>	 PROBLEM - puppet last run on kubernetes2020 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.135: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:13:17] <wikibugs>	 (03CR) 10Ayounsi: "Note that this is the end result, the change will be done manually on the devices and verified afterwards." [homer/public] - 10https://gerrit.wikimedia.org/r/890811 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi)
[13:13:22] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2009.codfw.wmnet with OS bullseye
[13:13:25] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2024.codfw.wmnet with reason: host reimage
[13:14:15] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2012.codfw.wmnet with reason: host reimage
[13:14:19] <icinga-wm>	 RECOVERY - confd service on kubernetes2020 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:14:45] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2009.codfw.wmnet with OS bullseye
[13:15:00] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubernetes2006.codfw.wmnet with OS bullseye
[13:16:12] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2008.codfw.wmnet with reason: host reimage
[13:16:41] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2022.codfw.wmnet with reason: host reimage
[13:16:42] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host kubernetes2015.codfw.wmnet with OS bullseye
[13:16:45] <jinxer-wm>	 (JobUnavailable) firing: (16) Reduced availability for job dragonfly_dfdaemon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:17:03] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2023.codfw.wmnet with OS bullseye
[13:18:36] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2024.codfw.wmnet with reason: host reimage
[13:18:48] <icinga-wm>	 PROBLEM - Host kubernetes2020 is DOWN: PING CRITICAL - Packet loss = 100%
[13:19:12] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2011.codfw.wmnet with reason: host reimage
[13:19:20] <icinga-wm>	 RECOVERY - Host kubernetes2020 is UP: PING OK - Packet loss = 0%, RTA = 31.72 ms
[13:20:30] <icinga-wm>	 PROBLEM - DPKG on kubernetes2011 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.109: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[13:21:00] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes2014.codfw.wmnet with reason: host reimage
[13:21:14] <icinga-wm>	 RECOVERY - dhclient process on kubernetes2020 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[13:21:14] <icinga-wm>	 RECOVERY - puppet last run on kubernetes2020 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:21:30] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2020.codfw.wmnet with OS bullseye
[13:21:45] <jinxer-wm>	 (JobUnavailable) firing: (16) Reduced availability for job dragonfly_dfdaemon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:22:07] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on kubernetes2012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.110. Check system logs on 10.192.32.110 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T330150 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[13:22:11] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on kubernetes2012 - https://phabricator.wikimedia.org/T330150 (10ops-monitoring-bot)
[13:22:46] <jynus>	 false positive ^ could be useful to tune the script
[13:23:09] <jynus>	 I will add that ticket to infra team
[13:23:34] <icinga-wm>	 PROBLEM - Check for large files in client bucket on kubernetes2008 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.197: Connection reset by peer https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file
[13:24:17] <logmsgbot>	 !log nfraison@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1002.eqiad.wmnet with OS bullseye
[13:24:42] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations: RAID check opened a ticket for kubernetes2012 while it was being reimaged - https://phabricator.wikimedia.org/T330150 (10jcrespo)
[13:24:54] <icinga-wm>	 RECOVERY - Check for large files in client bucket on kubernetes2008 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file
[13:25:22] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2010.codfw.wmnet with OS bullseye
[13:25:34] <wikibugs>	 (03PS1) 10Slyngshede: C:idm::deployment escape password and use 127.0.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/890814
[13:25:48] <icinga-wm>	 PROBLEM - dhclient process on kubernetes2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.29: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[13:26:14] <icinga-wm>	 PROBLEM - confd service on kubernetes2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.29: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:26:33] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2007.codfw.wmnet with OS bullseye
[13:26:33] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2024.codfw.wmnet with OS bullseye
[13:26:38] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39759/console" [puppet] - 10https://gerrit.wikimedia.org/r/890814 (owner: 10Slyngshede)
[13:26:45] <jinxer-wm>	 (JobUnavailable) firing: (16) Reduced availability for job dragonfly_dfdaemon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:27:05] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:idm::deployment escape password and use 127.0.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/890814 (owner: 10Slyngshede)
[13:27:21] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] improvement: display update-known-host-productions zsh hint on macOS [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/821770 (owner: 10Clément Goubert)
[13:27:30] <icinga-wm>	 PROBLEM - puppet last run on kubernetes2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.29: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:27:33] <wikibugs>	 (03PS1) 10Muehlenhoff: Set ferm access for redis [puppet] - 10https://gerrit.wikimedia.org/r/890815 (https://phabricator.wikimedia.org/T320797)
[13:27:59] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/890815 (https://phabricator.wikimedia.org/T320797) (owner: 10Muehlenhoff)
[13:28:02] <icinga-wm>	 PROBLEM - MD RAID on kubernetes2014 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.29: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[13:28:11] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations: RAID check opened a ticket for kubernetes2012 while it was being reimaged - https://phabricator.wikimedia.org/T330150 (10jcrespo) This looks like it could be avoided with some extra check, maybe? I added @jbond and @Volans as I think they were involved in th...
[13:28:31] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2013.codfw.wmnet with OS bullseye
[13:29:17] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10observability: RAID check opened a ticket for kubernetes2012 while it was being reimaged - https://phabricator.wikimedia.org/T330150 (10jcrespo)
[13:29:18] <icinga-wm>	 RECOVERY - confd service on kubernetes2014 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:30:21] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2009.codfw.wmnet with reason: host reimage
[13:31:29] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] Set ferm access for redis [puppet] - 10https://gerrit.wikimedia.org/r/890815 (https://phabricator.wikimedia.org/T320797) (owner: 10Muehlenhoff)
[13:31:40] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/890815 (https://phabricator.wikimedia.org/T320797) (owner: 10Muehlenhoff)
[13:31:44] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] Set ferm access for redis [puppet] - 10https://gerrit.wikimedia.org/r/890815 (https://phabricator.wikimedia.org/T320797) (owner: 10Muehlenhoff)
[13:31:53] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2012.codfw.wmnet with OS bullseye
[13:31:57] <jinxer-wm>	 (ProbeDown) firing: (17) Service citoid:4003 has failed probes (http_citoid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:32:24] <godog>	 I've silenced the jobunavailable for swagger checks in codfw
[13:32:49] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/868430 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond)
[13:33:11] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2008.codfw.wmnet with OS bullseye
[13:33:25] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2009.codfw.wmnet with reason: host reimage
[13:33:52] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10observability: RAID check opened a ticket for kubernetes2012 while it was being reimaged - https://phabricator.wikimedia.org/T330150 (10jcrespo) Adding @JMeybohm in case it was just a fluke (reimage taking more time than usual).
[13:34:12] <icinga-wm>	 RECOVERY - Check systemd state on idm1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:34:34] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2022.codfw.wmnet with OS bullseye
[13:35:14] <icinga-wm>	 PROBLEM - Host kubernetes2014 is DOWN: PING CRITICAL - Packet loss = 100%
[13:35:44] <icinga-wm>	 RECOVERY - Host kubernetes2014 is UP: PING OK - Packet loss = 0%, RTA = 33.30 ms
[13:36:12] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10observability: RAID check opened a ticket for kubernetes2012 while it was being reimaged - https://phabricator.wikimedia.org/T330150 (10JMeybohm) /cc @elukey this is one of "yours" :)
[13:36:48] <icinga-wm>	 RECOVERY - DPKG on kubernetes2011 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[13:36:59] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2011.codfw.wmnet with OS bullseye
[13:37:14] <wikibugs>	 (03CR) 10Volans: "couple of comments inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond)
[13:37:40] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10elukey) In the meantime we have created two cookbooks: * sre.k8s.upgrade-cluster.py * sre.k8s.wipe-cluster.py
[13:37:44] <icinga-wm>	 RECOVERY - puppet last run on kubernetes2014 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:38:10] <icinga-wm>	 RECOVERY - MD RAID on kubernetes2014 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[13:38:38] <icinga-wm>	 RECOVERY - dhclient process on kubernetes2014 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[13:38:56] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2014.codfw.wmnet with OS bullseye
[13:38:58] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 215 hosts with reason: codfw row B upgrade
[13:39:26] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] admin_ng: Update wikikube-codfw settings to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/890392 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm)
[13:41:17] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 215 hosts with reason: codfw row B upgrade
[13:41:31] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=aec8ddda-9ad5-4b7f-8bca-c273e036a282) set by ayounsi@cumin1001 for 2:00:00 on 215 host(s...
[13:41:57] <jinxer-wm>	 (ProbeDown) firing: (17) Service citoid:4003 has failed probes (http_citoid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:42:30] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix condition for including haveged [puppet] - 10https://gerrit.wikimedia.org/r/890816
[13:44:36] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: Update wikikube-codfw settings to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/890392 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm)
[13:45:22] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10fgiunchedi) Following up for silences, especially the ones paging in production (`ProbeDown`).  * ProbeDown: the most e...
[13:46:33] <wikibugs>	 (03CR) 10Nicolas Fraison: [C: 03+1] "LGTM, would be better to have also a review from @Ottomata" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[13:48:28] <wikibugs>	 10SRE, 10Data-Engineering-Planning, 10Observability-Alerting, 10Traffic, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Reduce/eliminate false positives for VarnishKafkaNoMessages alert - https://phabricator.wikimedia.org/T324522 (10JArguello-WMF) 05Open→03Resolved
[13:48:34] <godog>	 !log stop kafka on kafka-logging[2002,2004].codfw.wmnet - T327991
[13:48:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:39] <stashbot>	 T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991
[13:49:35] <wikibugs>	 (03PS1) 10Ladsgroup: auto_schema: Add support for live schema changes on replicas too [software] - 10https://gerrit.wikimedia.org/r/890820
[13:49:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] auto_schema: Add support for live schema changes on replicas too [software] - 10https://gerrit.wikimedia.org/r/890820 (owner: 10Ladsgroup)
[13:49:48] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'.
[13:49:51] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[13:49:55] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'.
[13:49:58] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[13:50:02] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'.
[13:50:18] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[13:50:22] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'.
[13:50:29] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[13:50:39] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'.
[13:51:28] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[13:51:50] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2009.codfw.wmnet with OS bullseye
[13:52:00] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[13:53:32] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[13:54:10] <gehel>	 !log depooling elastic[2041-2044,2057-2058,2063-2064,2070,2077-2080].codfw.wmnet for switch maintenance - T327991
[13:54:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:14] <stashbot>	 T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991
[13:54:14] <vgutierrez>	 !log depool doh2002 - T327991
[13:54:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:54] <gehel>	 !log depooling wcqs2001.codfw.wmnet for switch maintenance - T327991
[13:54:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:22] <gehel>	 !log depooling wdqs[2005,2007,2010].codfw.wmnet for switch maintenance - T327991
[13:55:24] <wikibugs>	 SRE, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Vgutierrez)
[13:55:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:33] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[13:55:49] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[13:55:57] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'.
[13:56:00] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[13:56:34] <gehel>	 ryankemper, inflatador: see the 3 depools above ^^^ (and please check that they have been repooled at some point)
[13:56:57] <jinxer-wm>	 (ProbeDown) firing: (17) Service citoid:4003 has failed probes (http_citoid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:57:58] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:58:17] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'.
[13:58:29] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[13:58:32] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'.
[13:58:35] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[13:58:40] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'.
[13:58:42] <wikibugs>	 (PS1) Vgutierrez: admin_state: depool codfw [dns] - https://gerrit.wikimedia.org/r/890822
[13:58:47] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[13:58:56] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:59:05] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[13:59:06] <wikibugs>	 (CR) Ayounsi: [C: +1] admin_state: depool codfw [dns] - https://gerrit.wikimedia.org/r/890822 (owner: Vgutierrez)
[13:59:11] <wikibugs>	 (CR) Jcrespo: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/890822 (owner: Vgutierrez)
[13:59:22] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'.
[13:59:28] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[13:59:33] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'.
[13:59:35] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[13:59:45] <wikibugs>	 (CR) Vgutierrez: [C: +2] admin_state: depool codfw [dns] - https://gerrit.wikimedia.org/r/890822 (owner: Vgutierrez)
[13:59:48] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[13:59:57] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[14:00:05] <jouncebot>	 jayme: Your horoscope predicts another unfortunate Kubernetes upgrade wikikube codfw deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0900).
[14:00:05] <jouncebot>	 Deploy window codfw row B switches upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1400)
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1400).
[14:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[14:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1400)
[14:00:05] <vgutierrez>	 !log depooling codfw - T327991
[14:00:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:10] <stashbot>	 T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991
[14:00:27] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[14:00:28] <Amir1>	 my ssh is really slow, I need to downtime s2 in codfw
[14:00:40] <TheresNoTime>	 (ack no patches for deploy)
[14:00:57] <Lucas_WMDE>	 sheesh, four simultaneously active windows in the deployments calendar
[14:01:00] <Lucas_WMDE>	 jouncebot: now
[14:01:00] <jouncebot>	 For the next 1 hour(s) and 59 minute(s): Kubernetes upgrade wikikube codfw (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T0900)
[14:01:00] <jouncebot>	 For the next 1 hour(s) and 59 minute(s): codfw row B switches upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1400)
[14:01:00] <jouncebot>	 For the next 0 hour(s) and 59 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1400)
[14:01:00] <jouncebot>	 For the next 0 hour(s) and 59 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1400)
[14:01:00] <jynus>	 TheresNoTime: it should be cancelled anyway, as there is going to be hw maintenance 
[14:01:08] * Lucas_WMDE not deploying
[14:02:30] <claime>	 jynus: Do you know how to do what Amir1 needs?
[14:02:32] <Amir1>	 sudo cookbook sre.hosts.downtime --hours 1 -r "codfw maint (T327991)" 'A:db-section-s2'
[14:02:40] <claime>	 ok I'll do it
[14:02:44] <Amir1>	 someone needs to do this
[14:02:45] <Amir1>	 thanks
[14:02:48] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: codfw maint (T327991)
[14:02:51] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[14:03:01] <claime>	 Amir1: running
[14:03:07] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: codfw maint (T327991)
[14:03:13] <Amir1>	 thanks
[14:03:36] <claime>	 np <3
[14:04:35] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:06:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db[2134,2160].codfw.wmnet,db[1117,1159].eqiad.wmnet with reason: codfw maint (T327991)
[14:06:40] <stashbot>	 T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991
[14:06:50] <Amir1>	 finally logged in to cumin
[14:06:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2134,2160].codfw.wmnet,db[1117,1159].eqiad.wmnet with reason: codfw maint (T327991)
[14:06:57] <jinxer-wm>	 (ProbeDown) firing: (17) Service citoid:4003 has failed probes (http_citoid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:07:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 27 hosts with reason: codfw maint (T327991)
[14:07:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 27 hosts with reason: codfw maint (T327991)
[14:07:59] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2186.mgmt.codfw.wmnet with reboot policy FORCED
[14:11:25] <wikibugs>	 (CR) Ladsgroup: "recheck" [software] - https://gerrit.wikimedia.org/r/890820 (owner: Ladsgroup)
[14:11:57] <jinxer-wm>	 (ProbeDown) firing: (17) Service citoid:4003 has failed probes (http_citoid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:12:57] <wikibugs>	 (PS1) Elukey: Add kubernetes202[3,4] to the wikikube-codfw cluster [puppet] - https://gerrit.wikimedia.org/r/890824 (https://phabricator.wikimedia.org/T329664)
[14:13:50] <godog>	 I've silenced the probedown for known-down k8s services ^
[14:15:23] <wikibugs>	 (PS1) Jbond: sre.hardware.upgrade-firmware: switch to using functools.cache [cookbooks] - https://gerrit.wikimedia.org/r/890825
[14:15:25] <wikibugs>	 (PS1) Jbond: sre.hardware.upgrade-firmware: update to use properties [cookbooks] - https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722)
[14:15:27] <wikibugs>	 (PS1) Jbond: sre.hardware.upgrade-firmware: switch to using _upload_session [cookbooks] - https://gerrit.wikimedia.org/r/890827
[14:15:29] <wikibugs>	 (PS1) Jbond: sre.hardware.upgrade-firmware: use upload_file if supported [cookbooks] - https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593)
[14:15:35] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:15:51] <wikibugs>	 (PS1) Slyngshede: C:idm::deployment Redis binding and ports [puppet] - https://gerrit.wikimedia.org/r/890829
[14:15:58] <wikibugs>	 (PS2) Elukey: Add kubernetes202[3,4] to the wikikube-codfw cluster [puppet] - https://gerrit.wikimedia.org/r/890824 (https://phabricator.wikimedia.org/T329664)
[14:16:24] <XioNoX>	 log asw-b-codfw> request system reboot all-members - T327991
[14:16:31] <stashbot>	 T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991
[14:17:09] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: switch to using functools.cache [cookbooks] - https://gerrit.wikimedia.org/r/890825 (owner: Jbond)
[14:17:11] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39760/console" [puppet] - https://gerrit.wikimedia.org/r/890829 (owner: Slyngshede)
[14:17:20] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: update to use properties [cookbooks] - https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722) (owner: Jbond)
[14:17:22] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: use upload_file if supported [cookbooks] - https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593) (owner: Jbond)
[14:17:24] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: switch to using _upload_session [cookbooks] - https://gerrit.wikimedia.org/r/890827 (owner: Jbond)
[14:17:37] <XioNoX>	 going down in  < 1 min
[14:17:56] <claime>	 ack
[14:19:25] <wikibugs>	 (CR) Andrew Bogott: systemd timers: Add the 'After' requirement to the timer module (3 comments) [puppet] - https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022) (owner: Andrew Bogott)
[14:20:11] <wikibugs>	 (PS2) Jbond: sre.hardware.upgrade-firmware: update to use properties [cookbooks] - https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722)
[14:20:13] <wikibugs>	 (PS2) Jbond: sre.hardware.upgrade-firmware: switch to using _upload_session [cookbooks] - https://gerrit.wikimedia.org/r/890827
[14:20:15] <wikibugs>	 (PS2) Jbond: sre.hardware.upgrade-firmware: use upload_file if supported [cookbooks] - https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593)
[14:20:46] <icinga-wm>	 PROBLEM - Host asw-b-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:20:47] <wikibugs>	 (PS5) Clément Goubert: sre.discovery.datacenter: Logging improvements [cookbooks] - https://gerrit.wikimedia.org/r/890782
[14:21:11] <jynus>	 looking good
[14:21:26] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops, Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (Papaul) @Volans thanks one thing i know for sure is that the R650 and R750 are Dell 15th generation and the other like  R440 and R740 are 14th generation. I will check...
[14:21:30] <jinxer-wm>	 (virtual-chassis crash) firing: Alert for device asw-b-codfw.mgmt.codfw.wmnet - virtual-chassis crash   - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash
[14:21:57] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:22:02] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: switch to using _upload_session [cookbooks] - https://gerrit.wikimedia.org/r/890827 (owner: Jbond)
[14:22:04] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: update to use properties [cookbooks] - https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722) (owner: Jbond)
[14:22:11] <jynus>	 Amir1: not a big deal, but es replicas complaining, may need manual restart later
[14:22:26] <icinga-wm>	 PROBLEM - VRRP status on cr1-codfw is CRITICAL: VRRP CRITICAL - 5 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[14:22:35] <jynus>	 (restart of replication, not service or host)
[14:22:38] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-logging2003 is CRITICAL: 79 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2003
[14:22:51] <wikibugs>	 (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: use upload_file if supported [cookbooks] - https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593) (owner: Jbond)
[14:23:00] <Amir1>	 jynus: oh thanks
[14:23:06] <wikibugs>	 (PS1) Elukey: role::kubernetes::{master,worker}: add kubernetes202[34] [puppet] - https://gerrit.wikimedia.org/r/890832 (https://phabricator.wikimedia.org/T329664)
[14:23:12] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 129, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:23:17] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:23:18] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:23:22] <icinga-wm>	 PROBLEM - MariaDB Replica IO: es4 on es2022 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2021.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es2021.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:23:26] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:23:32] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-logging2001 is CRITICAL: 79 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2001
[14:23:37] <Amir1>	 I should downtime it
[14:23:47] <jynus>	 I wonder if that will page
[14:23:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2006.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2006.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:24:00] <jynus>	 in any case, not a big deal
[14:24:00] <claime>	 Amir1: same command as earlier but with section-es4 ?
[14:24:17] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:24:20] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 1083 threshold =0.2 breach: cluster_name: production-search-codfw, status: yellow, timed_out: False, number_of_nodes: 36, number_of_data_nodes: 36, active_primary_shards: 1664, active_shards: 3889, relocating_shards: 0, initializing_shards: 142, unassigned_shards: 941, delayed_unassigned_shards: 0, number
[14:24:20] <icinga-wm>	 ing_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 78.21802091713596 https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:24:21] <wikibugs>	 (PS1) Elukey: conftool: add kubernetes202[3,4] to kubesvc [puppet] - https://gerrit.wikimedia.org/r/890833 (https://phabricator.wikimedia.org/T329664)
[14:24:21] <Amir1>	 yeah but I restart my pc :D
[14:24:26] <claime>	 lol k
[14:24:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: codfw maint (T327991)
[14:24:43] <Amir1>	 done &
[14:24:44] <stashbot>	 T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991
[14:24:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: codfw maint (T327991)
[14:24:45] <jynus>	 claime: ah, it doesn't page because it is not primary, so should be ok
[14:24:53] <claime>	 a'ight
[14:24:55] <jynus>	 although not sure if that makes sense anymore
[14:24:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:25:03] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlstaging&var-instance=ml-staging-ctrl2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:25:16] <icinga-wm>	 RECOVERY - eventgate-logging-external LVS codfw on eventgate-logging-external.svc.codfw.wmnet is OK: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by NewConnectionError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f910182eb00: Failed to establish a new connection: [Errno 113] No route to host): /robots.txt https://wikitech.wikimedia.org/wiki/Ev
[14:25:16] <icinga-wm>	 form/EventGate
[14:25:31] <wikibugs>	 (CR) Ladsgroup: "recheck" [software] - https://gerrit.wikimedia.org/r/890820 (owner: Ladsgroup)
[14:25:33] <wikibugs>	 (PS1) Elukey: Add kubernetes202[3,4] to its k8s_neighbors list [homer/public] - https://gerrit.wikimedia.org/r/890834 (https://phabricator.wikimedia.org/T329664)
[14:25:50] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 1083 threshold =0.2 breach: cluster_name: production-search-codfw, status: yellow, timed_out: False, number_of_nodes: 36, number_of_data_nodes: 36, active_primary_shards: 1664, active_shards: 3889, relocating_shards: 0, initializing_shards: 142, unassigned_shards: 941, delayed_unassigned_shards: 0
[14:25:50] <icinga-wm>	 _of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 78.21802091713596 Brian_King T327991 https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:25:55] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/890829 (owner: Slyngshede)
[14:25:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:26:45] <jinxer-wm>	 (JobUnavailable) firing: (15) Reduced availability for job benthos in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:26:54] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:26:57] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:26:58] <wikibugs>	 (CR) Clément Goubert: [C: +2] sre.discovery.datacenter: Logging improvements [cookbooks] - https://gerrit.wikimedia.org/r/890782 (owner: Clément Goubert)
[14:27:01] <jynus>	 dbproxies will also need a reload, they failed over
[14:27:17] <wikibugs>	 (PS2) Jbond: sre.hardware.upgrade-firmware: switch to using functools.cache [cookbooks] - https://gerrit.wikimedia.org/r/890825
[14:27:22] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on an-airflow1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-kerberos@search.service,wmf_auto_restart_airflow-webserver@search.service Brian_King T327970 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:27:22] <icinga-wm>	 ACKNOWLEDGEMENT - Checks that the airflow database for airflow search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow db check did not succeed Brian_King T327970 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[14:27:23] <icinga-wm>	 ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed Brian_King T327970 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[14:27:40] <icinga-wm>	 PROBLEM - Sessionstore codfw on sessionstore.svc.codfw.wmnet is CRITICAL: /sessions/v1/{key} (Get value for key) timed out before a response was received https://www.mediawiki.org/wiki/Kask
[14:27:47] <Amir1>	 I'll take care of it
[14:28:06] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9243 on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-codfw: cluster_name: production-search-codfw, status: yellow, timed_out: False, number_of_nodes: 36, number_of_data_nodes: 36, active_primary_shards: 1664, active_shards: 4030, relocating_shards: 0, initializing_shards: 140, unassigned_shards: 802, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, n
[14:28:06] <icinga-wm>	 _in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 81.05390185036202 https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:28:08] <icinga-wm>	 PROBLEM - configured eth on lvs2010 is CRITICAL: ens3f0np0 reporting no carrier. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[14:28:49] <wikibugs>	 (Merged) jenkins-bot: sre.discovery.datacenter: Logging improvements [cookbooks] - https://gerrit.wikimedia.org/r/890782 (owner: Clément Goubert)
[14:28:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:29:09] <moritzm>	 !log installing NSS security updates
[14:29:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:36] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_codfw_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:29:54] <jinxer-wm>	 (EtcdReplicationDown) firing: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown
[14:29:58] <icinga-wm>	 RECOVERY - Host asw-b-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.81 ms
[14:30:00] <icinga-wm>	 RECOVERY - VRRP status on cr1-codfw is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[14:30:05] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:30:35] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: (2) Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration  - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:30:50] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:30:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:30:59] <wikibugs>	 (PS3) Jbond: sre.hardware.upgrade-firmware: update to use properties [cookbooks] - https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722)
[14:31:04] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:31:05] <wikibugs>	 (PS3) Jbond: sre.hardware.upgrade-firmware: switch to using _upload_session [cookbooks] - https://gerrit.wikimedia.org/r/890827
[14:31:10] <wikibugs>	 (PS3) Jbond: sre.hardware.upgrade-firmware: use upload_file if supported [cookbooks] - https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593)
[14:31:23] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] C:idm::deployment Redis binding and ports [puppet] - https://gerrit.wikimedia.org/r/890829 (owner: Slyngshede)
[14:31:26] <icinga-wm>	 RECOVERY - Sessionstore codfw on sessionstore.svc.codfw.wmnet is OK: All endpoints are healthy https://www.mediawiki.org/wiki/Kask
[14:31:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:31:32] <wikibugs>	 (CR) JMeybohm: [C: +1] Add kubernetes202[3,4] to the wikikube-codfw cluster [puppet] - https://gerrit.wikimedia.org/r/890824 (https://phabricator.wikimedia.org/T329664) (owner: Elukey)
[14:31:45] <jinxer-wm>	 (JobUnavailable) firing: (116) Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:32:31] <brett>	 here
[14:32:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: update to use properties [cookbooks] - 10https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722) (owner: 10Jbond)
[14:32:49] <godog>	 brett: see -sre, all good
[14:32:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: switch to using _upload_session [cookbooks] - 10https://gerrit.wikimedia.org/r/890827 (owner: 10Jbond)
[14:32:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: use upload_file if supported [cookbooks] - 10https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593) (owner: 10Jbond)
[14:32:58] <icinga-wm>	 RECOVERY - MariaDB Replica IO: es4 on es2022 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:33:17] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:33:17] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:33:19] <brett>	 ah, thanks godog
[14:33:28] <godog>	 yw
[14:33:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: (10) ml-serve-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:34:01] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] role::kubernetes::{master,worker}: add kubernetes202[34] [puppet] - 10https://gerrit.wikimedia.org/r/890832 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey)
[14:34:04] <wikibugs>	 (03CR) 10Andrew Bogott: "For additional context, this is the patch that "doesn't work".  If there's an obvious flaw in that I would definitely prefer to fix that! " [puppet] - 10https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022) (owner: 10Andrew Bogott)
[14:34:08] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01925 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[14:34:12] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] conftool: add kubernetes202[3,4] to kubesvc [puppet] - 10https://gerrit.wikimedia.org/r/890833 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey)
[14:34:17] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:34:54] <jinxer-wm>	 (EtcdReplicationDown) resolved: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown
[14:34:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:34:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:35:00] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:35:19] <jinxer-wm>	 (ProbeDown) resolved: (36) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:35:35] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: (2) Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration  - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:35:38] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:35:49] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "IPs match" [homer/public] - 10https://gerrit.wikimedia.org/r/890834 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey)
[14:36:01] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=prometheus2005.codfw.wmnet
[14:36:10] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=thanos-fe2002.codfw.wmnet
[14:36:13] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:36:30] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:36:30] <jinxer-wm>	 (virtual-chassis crash) resolved: Device asw-b-codfw.mgmt.codfw.wmnet recovered from virtual-chassis crash   - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash
[14:36:42] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:36:45] <jinxer-wm>	 (JobUnavailable) resolved: (116) Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:16] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10ayounsi) Upgrade went smoothly, less than 15min hard downtime here too.
[14:37:54] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10fgiunchedi)
[14:38:00] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_codfw_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:38:00] <icinga-wm>	 RECOVERY - configured eth on lvs2010 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[14:38:31] <jinxer-wm>	 (Device rebooted) firing: Alert for device asw-b-codfw.mgmt.codfw.wmnet - Device rebooted   - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[14:38:40] <Amir1>	 jynus: I'm not seeing any host with broken replication/lagged/not-responding in https://orchestrator.wikimedia.org/web/clusters. Besides reloading haproxy, is there anything I should do?
[14:39:05] <jynus>	 no, they caught up on replication fine this time, I think
[14:39:17] <jynus>	 I think last time one host didn't fully recover automatically
[14:40:01] <jynus>	 on backup side, I need to restart es5 backup
[14:40:11] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Add kubernetes202[3,4] to its k8s_neighbors list (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/890834 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey)
[14:42:29] <wikibugs>	 (03PS2) 10Elukey: Add kubernetes202[3,4] to its k8s_neighbors list [homer/public] - 10https://gerrit.wikimedia.org/r/890834 (https://phabricator.wikimedia.org/T329664)
[14:42:44] <wikibugs>	 (03CR) 10Elukey: Add kubernetes202[3,4] to its k8s_neighbors list (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/890834 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey)
[14:43:07] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add kubernetes202[3,4] to the wikikube-codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/890824 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey)
[14:43:19] <wikibugs>	 10ops-codfw, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T329595 (10Ladsgroup)
[14:43:36] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0009872 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[14:43:52] <wikibugs>	 (03PS1) 10Jbond: Revert "puppetmaster: offline 2003 for switch upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/890846
[14:44:07] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "puppetmaster: offline 2003 for switch upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/890846 (owner: 10Jbond)
[14:44:32] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add kubernetes202[3,4] to its k8s_neighbors list [homer/public] - 10https://gerrit.wikimedia.org/r/890834 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey)
[14:44:47] <wikibugs>	 (03PS2) 10Elukey: role::kubernetes::{master,worker}: add kubernetes202[34] [puppet] - 10https://gerrit.wikimedia.org/r/890832 (https://phabricator.wikimedia.org/T329664)
[14:45:17] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10jbond)
[14:45:28] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:45:30] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:45:30] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:45:37] <wikibugs>	 10ops-codfw, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T329595 (10Ladsgroup) It's a backup source so it doesn't even need depooling, just downtime, maybe gracefully shutting down mysql but that's not even that important. cc @jcrespo
[14:47:18] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-logging2003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2003
[14:47:18] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-logging2001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2001
[14:50:09] <wikibugs>	 (03PS1) 10Elukey: admin_ng: fix Istio gateways configured in Knative for ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/890837 (https://phabricator.wikimedia.org/T327767)
[14:50:35] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:50:52] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi)
[14:51:13] <wikibugs>	 (03PS1) 10JMeybohm: restbase: Update kubernetes ip ranges [puppet] - 10https://gerrit.wikimedia.org/r/890838 (https://phabricator.wikimedia.org/T326617)
[14:51:24] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10fgiunchedi)
[14:53:40] <wikibugs>	 10ops-codfw, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T329595 (10jcrespo) @Papaul you can ping me directly, as Manuel is off these days for backup sources. If only network is going to be down for this server for a small amount of time, just go ahead at any time. If it is going to be...
[14:54:03] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] restbase: Update kubernetes ip ranges [puppet] - 10https://gerrit.wikimedia.org/r/890838 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm)
[14:54:50] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] restbase: Update kubernetes ip ranges [puppet] - 10https://gerrit.wikimedia.org/r/890838 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm)
[14:57:38] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[14:58:31] <jinxer-wm>	 (Device rebooted) resolved: Alert for device asw-b-codfw.mgmt.codfw.wmnet - Device rebooted   - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[15:00:40] <icinga-wm>	 RECOVERY - Maps HTTPS on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1342 bytes in 0.223 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[15:00:40] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.225 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[15:00:40] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.228 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[15:00:44] <wikibugs>	 10ops-codfw, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T329595 (10jcrespo) I checked db2099 and saw some stats about packet errors when using the link to saturation:  {F36864017}  But compared to, eg db2098, those seem expected when running at full bandwidth:  {F36864019}
[15:01:40] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2009 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.244 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[15:01:40] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.243 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[15:01:40] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.292 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[15:02:26] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[15:03:46] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2005 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.217 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[15:04:48] <icinga-wm>	 PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by NewConnectionError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f5537e8db70: Failed to establish a new connection: [Errno 111] Connection refused): /?spec https://wikitech.wikimedia.org/wiki/Mathoid
[15:05:58] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10jcrespo) I restarted es5 codfw backup job, the only backup-related thingy affected by the downtime.
[15:06:28] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::kubernetes::{master,worker}: add kubernetes202[34] [puppet] - 10https://gerrit.wikimedia.org/r/890832 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey)
[15:06:32] <wikibugs>	 (03PS2) 10Elukey: conftool: add kubernetes202[3,4] to kubesvc [puppet] - 10https://gerrit.wikimedia.org/r/890833 (https://phabricator.wikimedia.org/T329664)
[15:07:33] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi)
[15:08:20] <icinga-wm>	 PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: /{format}/ (mass-energy equivalence (mml)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mathoid
[15:10:16] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[15:10:30] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] conftool: add kubernetes202[3,4] to kubesvc [puppet] - 10https://gerrit.wikimedia.org/r/890833 (https://phabricator.wikimedia.org/T329664) (owner: 10Elukey)
[15:11:12] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[15:12:00] <icinga-wm>	 RECOVERY - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mathoid
[15:13:02] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[15:13:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2101.codfw.wmnet with reason: Maintenance
[15:13:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2101.codfw.wmnet with reason: Maintenance
[15:13:35] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder)
[15:16:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2111.codfw.wmnet with reason: Maintenance
[15:16:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2111.codfw.wmnet with reason: Maintenance
[15:16:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2123.codfw.wmnet with reason: Maintenance
[15:16:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2123.codfw.wmnet with reason: Maintenance
[15:16:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2128.codfw.wmnet with reason: Maintenance
[15:17:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2128.codfw.wmnet with reason: Maintenance
[15:17:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2094.codfw.wmnet with reason: Maintenance
[15:17:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2094.codfw.wmnet with reason: Maintenance
[15:17:26] <wikibugs>	 (03PS1) 10Ayounsi: Revert "admin_state: depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/890847
[15:17:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2137.codfw.wmnet with reason: Maintenance
[15:17:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2137.codfw.wmnet with reason: Maintenance
[15:18:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2157.codfw.wmnet with reason: Maintenance
[15:18:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2157.codfw.wmnet with reason: Maintenance
[15:18:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2171.codfw.wmnet with reason: Maintenance
[15:19:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2171.codfw.wmnet with reason: Maintenance
[15:19:22] <wikibugs>	 10SRE, 10Infrastructure Security, 10observability, 10SRE Observability (FY2022/2023-Q3): Grafana: CVE-2022-39324 CVE-2022-23552 - https://phabricator.wikimedia.org/T328405 (10lmata)
[15:19:28] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] Revert "admin_state: depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/890847 (owner: 10Ayounsi)
[15:19:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2178.codfw.wmnet with reason: Maintenance
[15:19:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2178.codfw.wmnet with reason: Maintenance
[15:20:42] <icinga-wm>	 RECOVERY - eventgate-analytics-external LVS codfw on eventgate-analytics-external.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate
[15:22:24] <Amir1>	 is codfw still fully depooled (for mw I mean)? 
[15:22:43] <Amir1>	 If yes, it'd make some of my schema changes much simpler
[15:23:02] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10akosiaris)
[15:23:12] <wikibugs>	 (03PS1) 10SBassett: Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/890848
[15:23:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/890848 (owner: 10SBassett)
[15:23:24] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10akosiaris)
[15:23:48] <icinga-wm>	 RECOVERY - eventgate-analytics LVS codfw on eventgate-analytics.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate
[15:23:52] <claime>	 Amir1: yes, it's DNS depooled
[15:24:02] <icinga-wm>	 RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[15:24:02] <Amir1>	 awesome
[15:24:11] <Amir1>	 time to clean up some drifts
[15:24:25] <claime>	 Amir1: just tell us if we need to hold on repooling
[15:25:07] <wikibugs>	 (03PS2) 10Elukey: admin_ng: fix Istio gateways configured in Knative for ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/890837 (https://phabricator.wikimedia.org/T327767)
[15:25:07] <Amir1>	 thanks. I need an hour at most
[15:25:30] <wikibugs>	 (03PS1) 10Andrew Bogott: profile::cloudceph::client::rbd_backy: Fix reduce() syntax [puppet] - 10https://gerrit.wikimedia.org/r/890840
[15:26:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2113.codfw.wmnet with reason: Maintenance
[15:26:52] <icinga-wm>	 RECOVERY - eventgate-main LVS codfw on eventgate-main.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate
[15:26:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2113.codfw.wmnet with reason: Maintenance
[15:27:22] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:28:20] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:29:42] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.k8s.upgrade-cluster (exit_code=0) Upgrade K8s version: T329664
[15:29:46] <stashbot>	 T329664: Update wikikube codfw to k8s 1.23 - https://phabricator.wikimedia.org/T329664
[15:32:44] <wikibugs>	 (03PS1) 10Clément Goubert: sre.discovery.datacenter: fix documentation [cookbooks] - 10https://gerrit.wikimedia.org/r/890842
[15:33:08] <wikibugs>	 (03PS1) 10Jbond: systemd::timer: update services to onshot and set RemainAfterExit [puppet] - 10https://gerrit.wikimedia.org/r/890843
[15:34:40] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] systemd timers: Add the 'After' requirement to the timer module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022) (owner: 10Andrew Bogott)
[15:35:21] <Amir1>	 claime: I'm done for the offline maint
[15:35:57] <claime>	 Amir1: ack
[15:36:22] <wikibugs>	 (03PS2) 10Jbond: systemd::timer: update services to onshot and set RemainAfterExit [puppet] - 10https://gerrit.wikimedia.org/r/890843
[15:37:02] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] sre.discovery.datacenter: fix documentation [cookbooks] - 10https://gerrit.wikimedia.org/r/890842 (owner: 10Clément Goubert)
[15:37:07] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] sre.discovery.datacenter: fix documentation [cookbooks] - 10https://gerrit.wikimedia.org/r/890842 (owner: 10Clément Goubert)
[15:38:33] <wikibugs>	 (03PS2) 10Andrew Bogott: profile::cloudceph::client::rbd_backy: Fix reduce() syntax [puppet] - 10https://gerrit.wikimedia.org/r/890840
[15:38:47] <wikibugs>	 (03Merged) 10jenkins-bot: sre.discovery.datacenter: fix documentation [cookbooks] - 10https://gerrit.wikimedia.org/r/890842 (owner: 10Clément Goubert)
[15:38:55] <wikibugs>	 (03PS2) 10SBassett: Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/890848
[15:39:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/890848 (owner: 10SBassett)
[15:39:49] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:41:14] <wikibugs>	 (03PS3) 10SBassett: Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/890848
[15:47:21] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:48:23] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Bad power supply on cr1-codfw(PEM 0) - https://phabricator.wikimedia.org/T329943 (10Papaul) 05Open→03Resolved Replaced PEM0 everything looks good now . {F36864090}
[15:48:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:49:01] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/890825 (owner: 10Jbond)
[15:49:12] <wikibugs>	 10SRE, 10ops-eqiad, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q3): Decommission netmon1002 - https://phabricator.wikimedia.org/T322321 (10lmata)
[15:50:07] <wikibugs>	 (03PS1) 10Clément Goubert: sre.discovery.datacenter: fix status action args [cookbooks] - 10https://gerrit.wikimedia.org/r/890844
[15:50:32] <wikibugs>	 10SRE, 10Maps, 10Observability-Metrics, 10observability, and 3 others: SLO dashboards with N latency targets - https://phabricator.wikimedia.org/T320749 (10lmata)
[15:51:10] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Jelto)
[15:52:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.discovery.datacenter: fix status action args [cookbooks] - 10https://gerrit.wikimedia.org/r/890844 (owner: 10Clément Goubert)
[15:53:23] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad average message consume rate in last 30m on alert1001 is OK: (C)0 le (W)3 le 153.5 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[15:54:38] <wikibugs>	 (03PS3) 10Elukey: admin_ng: fix Istio gateways configured in Knative for ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/890837 (https://phabricator.wikimedia.org/T327767)
[15:54:57] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad average message produce rate in last 30m on alert1001 is OK: (C)0 le (W)3 le 149.4 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[15:55:37] <wikibugs>	 (03PS1) 10David Caro: [cloudceph.client.rbd_backy] Fix wrong reduce call [puppet] - 10https://gerrit.wikimedia.org/r/890845
[15:57:15] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39763/console" [puppet] - 10https://gerrit.wikimedia.org/r/890845 (owner: 10David Caro)
[15:58:01] <wikibugs>	 10SRE, 10Traffic, 10IPv6: Start a pure IPv6 web site for wikimedia services - https://phabricator.wikimedia.org/T330020 (10BCornwall) a:03BCornwall
[15:58:19] <icinga-wm>	 PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service,send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:58:24] <wikibugs>	 (03PS2) 10Clément Goubert: sre.discovery.datacenter: fix status action args [cookbooks] - 10https://gerrit.wikimedia.org/r/890844
[15:59:58] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: fix Istio gateways configured in Knative for ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/890837 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[16:00:03] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: switch to using functools.cache [cookbooks] - 10https://gerrit.wikimedia.org/r/890825 (owner: 10Jbond)
[16:00:09] <wikibugs>	 (03PS3) 10Jbond: sre.hardware.upgrade-firmware: switch to using functools.cache [cookbooks] - 10https://gerrit.wikimedia.org/r/890825
[16:00:21] <wikibugs>	 (03PS4) 10Jbond: sre.hardware.upgrade-firmware: update to use properties [cookbooks] - 10https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722)
[16:00:23] <wikibugs>	 (03PS4) 10Jbond: sre.hardware.upgrade-firmware: switch to using _upload_session [cookbooks] - 10https://gerrit.wikimedia.org/r/890827
[16:00:25] <wikibugs>	 (03PS4) 10Jbond: sre.hardware.upgrade-firmware: use upload_file if supported [cookbooks] - 10https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593)
[16:00:33] <wikibugs>	 10SRE, 10Diff-blog, 10Technical Blog, 10Traffic, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10BCornwall) 05Open→03Resolved Perfect! Thanks so much for your magic hands and making this a reality, @Sbenchagra.
[16:00:37] <wikibugs>	 (03CR) 10Atieno: [C: 04-1] "Maybe we should add a unit test to check that the dpi is now dynamic but defaults to 150" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/853402 (https://phabricator.wikimedia.org/T325771) (owner: 10Vlad.shapik)
[16:00:52] <wikibugs>	 (03CR) 10Jbond: "this and the others in the chain should be ready for review now" [cookbooks] - 10https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722) (owner: 10Jbond)
[16:01:16] <wikibugs>	 (03CR) 10Atieno: [C: 04-1] Add the ability to specify the default DPI value for PDF files (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/853402 (https://phabricator.wikimedia.org/T325771) (owner: 10Vlad.shapik)
[16:01:31] <icinga-wm>	 RECOVERY - Host ps1-a2-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms
[16:02:27] <icinga-wm>	 RECOVERY - Host asw-a-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms
[16:02:30] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[16:02:59] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[16:03:35] <wikibugs>	 (03Abandoned) 10Andrew Bogott: profile::cloudceph::client::rbd_backy: Fix reduce() syntax [puppet] - 10https://gerrit.wikimedia.org/r/890840 (owner: 10Andrew Bogott)
[16:05:00] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[16:05:56] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[16:06:35] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] Add static "Cleopatra" page to facilitate synthetic testing of 885362 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890509 (https://phabricator.wikimedia.org/T326147) (owner: 10Nray)
[16:06:57] <moritzm>	 !log imported libxml2 2.9.4+dfsg1-7+deb10u5+icu67+wmf1 to component/icu67 for buster-wikimedia T329491
[16:07:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:01] <stashbot>	 T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491
[16:07:02] <papaul>	 XioNoX: yes
[16:07:11] <icinga-wm>	 RECOVERY - Host ps1-a3-codfw is UP: PING OK - Packet loss = 0%, RTA = 35.23 ms
[16:07:26] <XioNoX>	 papaul: any idea what could have caused that?
[16:07:33] <papaul>	 should be coming back up it was 190 now 180
[16:08:46] <papaul>	 XioNoX: i think it was last week pdu maintenance since we took the main mgmt switch went down that is the only thing i can think of right now
[16:09:07] <XioNoX>	 ok
[16:09:24] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] "Manuel is out, tested with job_cmd and it works like a charm" [software] - 10https://gerrit.wikimedia.org/r/890820 (owner: 10Ladsgroup)
[16:09:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] auto_schema: Add support for live schema changes on replicas too [software] - 10https://gerrit.wikimedia.org/r/890820 (owner: 10Ladsgroup)
[16:10:27] <papaul>	 !log rebooting mgmt switch in rack a5
[16:10:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:31] <icinga-wm>	 RECOVERY - Host ps1-a5-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms
[16:12:47] <Krinkle>	 Amir1: had a little flashback to the pre-Redis jobqueue at WMF. https://codesearch.wmcloud.org/core/?q=job_cmd
[16:13:10] <papaul>	 !log rebooting mgmt switch in rack a7
[16:13:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:17] <wikibugs>	 (03PS1) 10Jbond: redfish: ensure versions are parsed as packging.version.Version instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/890870
[16:13:54] <wikibugs>	 10SRE, 10serviceops: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (10MoritzMuehlenhoff) PHP build depends on libxml2, which itself also uses ICU by default. I have patched it to build without ICU for the component/icu67 component, it falls back to iconv internally.
[16:14:13] <Amir1>	 Krinkle: thankfully we don't use mysql as a hammer for everything anymore. The tables are in production but somehow the schema drifted and since they were empty, it didn't make sense to make them depool etc. so I wrote this code :D
[16:14:30] <Krinkle>	 oh it actually is that schema?
[16:14:41] <Krinkle>	 I thought it was something unrelated that used that same field name
[16:14:46] <Krinkle>	 even better :D
[16:15:12] <Krinkle>	 I see, you use it as a test case.
[16:15:14] <Krinkle>	 Nice
[16:15:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance
[16:15:31] <logmsgbot>	 !log nfraison@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1003.eqiad.wmnet with OS bullseye
[16:15:33] <icinga-wm>	 RECOVERY - Host ps1-a7-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.20 ms
[16:15:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance
[16:15:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T328255)', diff saved to https://phabricator.wikimedia.org/P44704 and previous config saved to /var/cache/conftool/dbconfig/20230221-161552-ladsgroup.json
[16:15:57] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[16:16:13] <Amir1>	 (part of T328255 before the switchover)
[16:16:38] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] "please" [software] - 10https://gerrit.wikimedia.org/r/890820 (owner: 10Ladsgroup)
[16:17:27] <papaul>	 !log rebooting mgmt switch in rack b1
[16:17:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:11] <wikibugs>	 (03Merged) 10jenkins-bot: auto_schema: Add support for live schema changes on replicas too [software] - 10https://gerrit.wikimedia.org/r/890820 (owner: 10Ladsgroup)
[16:19:18] <Amir1>	 you see, being nice to jenkins works
[16:19:41] <icinga-wm>	 RECOVERY - Host ps1-b1-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.81 ms
[16:20:17] <papaul>	 !log rebooting mgmt switch in rack b3
[16:20:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm FYI we no longer have any stretch vms in production so this is just around for any slacking cloud stretch hosts" [puppet] - 10https://gerrit.wikimedia.org/r/890816 (owner: 10Muehlenhoff)
[16:21:01] <wikibugs>	 (03PS3) 10Jbond: sre.hosts.reboot-single: Simplify icinga logic [cookbooks] - 10https://gerrit.wikimedia.org/r/868430 (https://phabricator.wikimedia.org/T325153)
[16:21:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.hosts.reboot-single: Simplify icinga logic [cookbooks] - 10https://gerrit.wikimedia.org/r/868430 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond)
[16:22:13] <icinga-wm>	 RECOVERY - Host ps1-b3-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms
[16:22:24] <wikibugs>	 10SRE, 10Diff-blog, 10Technical Blog, 10Traffic, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10Sbenchagra) You are welcome! I am curious @BCornwall, why did it take more than two years for this task to be completed?
[16:22:31] <papaul>	 !log rebooting mgmt switch in rack c3
[16:22:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:31] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.reboot-single: Simplify icinga logic [cookbooks] - 10https://gerrit.wikimedia.org/r/868430 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond)
[16:24:39] <Lucas_WMDE>	 Amir1: https://bash.toolforge.org/quip/wyzKdIYBtR_B8fLxgTWk
[16:24:42] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[16:26:39] <wikibugs>	 (03CR) 10Muehlenhoff: Fix condition for including haveged (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890816 (owner: 10Muehlenhoff)
[16:27:13] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: asw-a-codfw management interface unreachable - https://phabricator.wikimedia.org/T330048 (10Papaul) 05Open→03Resolved Rebooting the mgmt switch fixed the issue
[16:27:38] <Amir1>	 :D
[16:35:54] <wikibugs>	 (03PS9) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153)
[16:36:14] <wikibugs>	 (03CR) 10Jbond: "thanks see responses inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond)
[16:40:05] <icinga-wm>	 RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.34 ms
[16:40:29] <hashar>	 looks like I forgot to promote group0 wikis
[16:41:00] <hashar>	 I am doing it now
[16:41:24] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890878 (https://phabricator.wikimedia.org/T325587)
[16:41:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890878 (https://phabricator.wikimedia.org/T325587) (owner: 10TrainBranchBot)
[16:42:01] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890878 (https://phabricator.wikimedia.org/T325587) (owner: 10TrainBranchBot)
[16:45:02] <wikibugs>	 10SRE, 10ops-eqiad, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q3): Decommission netmon1002 - https://phabricator.wikimedia.org/T322321 (10andrea.denisse)
[16:45:03] <wikibugs>	 (03PS6) 10Vgutierrez: varnish: Limit ESI processing to text/html pages [puppet] - 10https://gerrit.wikimedia.org/r/889530 (https://phabricator.wikimedia.org/T308799)
[16:47:14] <wikibugs>	 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10RLazarus) We decided we'll put these into service after the upcoming DC switchover, so we'll make a plan at the March 6 serviceops meeting.
[16:47:32] <wikibugs>	 (03PS3) 10Clément Goubert: sre.discovery.datacenter: fix status action args [cookbooks] - 10https://gerrit.wikimedia.org/r/890844
[16:48:06] <logmsgbot>	 !log nfraison@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1003.eqiad.wmnet with reason: host reimage
[16:48:53] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] sre.discovery.datacenter: fix status action args [cookbooks] - 10https://gerrit.wikimedia.org/r/890844 (owner: 10Clément Goubert)
[16:49:57] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.discovery.datacenter pool all active/active services in codfw: T327991 - None
[16:50:00] <wikibugs>	 10ops-codfw, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T329595 (10Papaul) 05Open→03Resolved a:03Papaul it has been an hour now no more errors on the interface  ``     Input errors:     Errors: 0, Drops: 0, Framing errors: 0, Runts: 0, Bucket drops: 0,
[16:50:03] <stashbot>	 T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991
[16:50:34] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] sre.discovery.datacenter: fix status action args [cookbooks] - 10https://gerrit.wikimedia.org/r/890844 (owner: 10Clément Goubert)
[16:51:07] <logmsgbot>	 !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1003.eqiad.wmnet with reason: host reimage
[16:52:35] <wikibugs>	 (03Merged) 10jenkins-bot: sre.discovery.datacenter: fix status action args [cookbooks] - 10https://gerrit.wikimedia.org/r/890844 (owner: 10Clément Goubert)
[16:53:01] <wikibugs>	 10ops-codfw, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T329595 (10jcrespo) Thank you!
[16:53:19] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722) (owner: 10Jbond)
[16:53:54] <hashar>	 scap is still going on
[16:54:16] <wikibugs>	 (03CR) 10Vgutierrez: "tests are happy:" [puppet] - 10https://gerrit.wikimedia.org/r/889530 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez)
[16:55:25] <wikibugs>	 (03CR) 10Volans: "question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/890827 (owner: 10Jbond)
[16:56:54] <wikibugs>	 (03PS7) 10Vgutierrez: varnish: Limit ESI processing to text/html pages [puppet] - 10https://gerrit.wikimedia.org/r/889530 (https://phabricator.wikimedia.org/T308799)
[16:57:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceede
[16:57:29] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.discovery.datacenter status all services in codfw: None - None
[16:57:31] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in codfw: None - None
[16:59:10] <wikibugs>	 (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/890357
[16:59:16] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wdqs,name=codfw
[16:59:23] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wdqs-internal,name=codfw
[16:59:31] <logmsgbot>	 !log bking@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wcqs,name=codfw
[17:00:04] <jouncebot>	 jbond and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1700).
[17:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:04] <jouncebot>	 cwhite: May I have your attention please! Grafana 9 Upgrade. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1700)
[17:00:05] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2186.mgmt.codfw.wmnet with reboot policy FORCED
[17:00:18] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] "This looks right to me!" [puppet] - 10https://gerrit.wikimedia.org/r/889530 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez)
[17:00:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2186.mgmt.codfw.wmnet with reboot policy FORCED
[17:01:09] <hashar>	 somehow `scap` is blocked on deploying to codfw Kubernetes namespaces `mw-api-int` and  `mw-web` :-\
[17:01:30] <wikibugs>	 (03PS1) 10David Caro: profile.cloudceph: Add some tests [puppet] - 10https://gerrit.wikimedia.org/r/890881
[17:01:35] <claime>	 probably because it's trying to schedule too many replicas
[17:02:13] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/890870 (owner: 10Jbond)
[17:02:14] <hashar>	 at least the helm3 upgrade command has a 600s timeout
[17:02:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExcee
[17:02:19] <hashar>	 so I guess that will eventually fail
[17:02:57] <claime>	 yeah
[17:03:24] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] [cloudceph.client.rbd_backy] Fix wrong reduce call [puppet] - 10https://gerrit.wikimedia.org/r/890845 (owner: 10David Caro)
[17:03:24] <claime>	 So what happened was we had to scale back mw-* deployments during the upgrade because some nodes couldn't be reimaged, because a mgmt switch is dead
[17:03:39] <claime>	 That was done manually and not reflected in the helmfiles for the services
[17:03:58] <claime>	 scap uses the helmfiles, and is currently trying to schedule way too many pods
[17:04:12] <hashar>	 can we update the helmfiles?
[17:04:16] <claime>	 I'll go make a CR 
[17:04:18] <claime>	 yes
[17:04:23] <claime>	 but they won't get picked up rn
[17:04:34] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in codfw: T327991 - None
[17:04:36] <hashar>	 isn't there a timer updating them every minute or so?
[17:04:38] <stashbot>	 T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991
[17:04:53] <claime>	 hashar: Yes, but I don't think helmfile picks up changes *during* an applt
[17:04:54] <claime>	 apply
[17:05:00] <hashar>	 ah yeah
[17:05:09] <hashar>	 well I can always cancel `scap` and start again
[17:05:19] * hashar orders another demi
[17:05:40] <akosiaris>	 ok, done. codfw wikikube cluster repooled
[17:05:40] <hashar>	 17:05:24 K8s deployment to stage production failed: K8s deployment had the following errors:
[17:05:40] <hashar>	  codfw: Deployment of mw-web-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
[17:06:21] <hashar>	 claime: maybe akosiaris' change above is related? it says it is repooling stuff
[17:06:28] <hashar>	 so maybe there is no need to mess with the helmfiles
[17:07:11] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=kubernetes2023.codfw.wmnet
[17:07:21] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=kubernetes2024.codfw.wmnet
[17:07:58] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/weight=10; selector: name=kubernetes2024.codfw.wmnet
[17:07:59] <claime>	 hashar: we do, because we lost capacity
[17:08:02] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/weight=10; selector: name=kubernetes2023.codfw.wmnet
[17:08:30] <jinxer-wm>	 (DatasourceNoData) firing: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[17:09:00] <wikibugs>	 (03CR) 10Physikerwelt: "Please remember that this needs to be deployed with the related restbase change" [deployment-charts] - 10https://gerrit.wikimedia.org/r/890357 (owner: 10PipelineBot)
[17:09:12] <logmsgbot>	 !log nfraison@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1003.eqiad.wmnet with OS bullseye
[17:09:46] <wikibugs>	 (03Abandoned) 10BCornwall: varnish: Check upload.wm.o for analytics cookies [puppet] - 10https://gerrit.wikimedia.org/r/889846 (https://phabricator.wikimedia.org/T262996) (owner: 10BCornwall)
[17:10:16] <wikibugs>	 (03CR) 10Physikerwelt: [C: 04-1] "Abandon, please. Outdated." [deployment-charts] - 10https://gerrit.wikimedia.org/r/890350 (owner: 10PipelineBot)
[17:10:44] <wikibugs>	 (03PS1) 10Clément Goubert: mw-on-k8s: reduce codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/890882 (https://phabricator.wikimedia.org/T330048)
[17:10:46] <wikibugs>	 (03CR) 10Physikerwelt: [C: 04-1] "Abandon, please. Outdated." [deployment-charts] - 10https://gerrit.wikimedia.org/r/885482 (owner: 10PipelineBot)
[17:11:01] <wikibugs>	 (03CR) 10Physikerwelt: [C: 04-1] "Abandon, please. Outdated." [deployment-charts] - 10https://gerrit.wikimedia.org/r/872978 (owner: 10PipelineBot)
[17:13:21] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] mw-on-k8s: reduce codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/890882 (https://phabricator.wikimedia.org/T330048) (owner: 10Clément Goubert)
[17:13:33] <wikibugs>	 (03Abandoned) 10Hashar: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/890350 (owner: 10PipelineBot)
[17:13:36] <wikibugs>	 (03Abandoned) 10Hashar: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/872978 (owner: 10PipelineBot)
[17:13:38] <wikibugs>	 (03Abandoned) 10Hashar: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/885482 (owner: 10PipelineBot)
[17:14:21] <hashar>	 I have manually  killed the helm3 upgrade command
[17:14:30] <claime>	 hashar: ok
[17:14:45] <hashar>	 that was for `mw-api-ext-deploy-codfw.config`
[17:14:50] <claime>	 hashar: I'm waiting on CI for the scale, then I'll go ahead and apply that manually
[17:14:52] <hashar>	 a couple others failed earlier
[17:15:00] <claime>	 mw-debug and mw-web
[17:15:06] <hashar>	 sorry for the mess
[17:15:06] <hashar>	 :D
[17:15:09] <claime>	 Not your fault
[17:15:23] <claime>	 We should have immediately transcribed the manual action in code
[17:15:28] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] varnish: Limit ESI processing to text/html pages [puppet] - 10https://gerrit.wikimedia.org/r/889530 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez)
[17:15:38] <hashar>	 fun thing, if you `kill` the `helm3 upgrade`  nothing happens, I had to `kill -9` it
[17:16:14] <wikibugs>	 (03PS1) 10Elukey: admin_ng: fix knative settings for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/890883 (https://phabricator.wikimedia.org/T327767)
[17:16:22] <claime>	 yes...
[17:16:24] <hashar>	 fpm restarting
[17:16:33] <claime>	 Because if you kill it it tries to rollback iirc
[17:16:58] <hashar>	 ahh
[17:17:29] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.24  refs T325587
[17:17:34] <stashbot>	 T325587: 1.40.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T325587
[17:17:35] <hashar>	 pfiou
[17:17:42] <wikibugs>	 (03PS3) 10JHathaway: Add jaeger-{builder,query,collector} [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553)
[17:17:48] <wikibugs>	 (03PS1) 10Cwhite: profile: Re-enable grafana db sync post 9.x upgrade [puppet] - 10https://gerrit.wikimedia.org/r/890849 (https://phabricator.wikimedia.org/T317887)
[17:18:29] <wikibugs>	 (03CR) 10Cwhite: [V: 03+2 C: 03+2] Upgrade plugins [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/886861 (https://phabricator.wikimedia.org/T317887) (owner: 10Cwhite)
[17:18:31] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: reduce codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/890882 (https://phabricator.wikimedia.org/T330048) (owner: 10Clément Goubert)
[17:18:52] <hashar>	 group0 looks good so far
[17:19:10] <wikibugs>	 10SRE, 10Diff-blog, 10Technical Blog, 10Traffic, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10BCornwall) Good question. I fear I'm not equipped to give an authoritative answer, but generally low priority combined with ownership doubts (who o...
[17:20:02] <wikibugs>	 (03PS1) 10Andrea Denisse: rsyslog: Remove centrallog1001 as TLS rsyslog destination [puppet] - 10https://gerrit.wikimedia.org/r/890884 (https://phabricator.wikimedia.org/T329131)
[17:20:06] <wikibugs>	 (03PS5) 10Jbond: sre.hardware.upgrade-firmware: update to use properties [cookbooks] - 10https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722)
[17:20:23] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] CI runner: skip helm library charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/888275 (owner: 10JHathaway)
[17:21:10] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] CI runner: skip helm library charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/888275 (owner: 10JHathaway)
[17:22:54] <wikibugs>	 10SRE, 10Diff-blog, 10Technical Blog, 10Traffic, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10Sbenchagra) Thank you @BCornwall! Same, please flag any tickets that need my attention. Three months ago, I started managing the [[ https://wikimed...
[17:23:40] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder)
[17:23:50] <wikibugs>	 (03Merged) 10jenkins-bot: mw-on-k8s: reduce codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/890882 (https://phabricator.wikimedia.org/T330048) (owner: 10Clément Goubert)
[17:24:02] <hashar>	 group 0 looks good, I am calling it a day
[17:25:17] <wikibugs>	 (03CR) 10Jbond: sre.hardware.upgrade-firmware: switch to using _upload_session (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/890827 (owner: 10Jbond)
[17:25:38] <wikibugs>	 (03PS5) 10Jbond: sre.hardware.upgrade-firmware: switch to using _upload_session [cookbooks] - 10https://gerrit.wikimedia.org/r/890827 (https://phabricator.wikimedia.org/T328593)
[17:25:41] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[17:25:45] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[17:25:56] <wikibugs>	 (03PS5) 10Jbond: sre.hardware.upgrade-firmware: use upload_file if supported [cookbooks] - 10https://gerrit.wikimedia.org/r/890828 (https://phabricator.wikimedia.org/T328593)
[17:25:59] <cwhite>	 !log Grafana 9x upgrade in production complete T317887
[17:26:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:03] <stashbot>	 T317887: Upgrade to Grafana 9 - https://phabricator.wikimedia.org/T317887
[17:26:23] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: update to use properties [cookbooks] - 10https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722) (owner: 10Jbond)
[17:27:39] <jinxer-wm>	 (HelmReleaseBadStatus) firing: (3) Helm release mw-api-ext/main on k8s@codfw in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[17:27:43] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: sync
[17:27:48] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: sync
[17:28:17] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: update to use properties [cookbooks] - 10https://gerrit.wikimedia.org/r/890826 (https://phabricator.wikimedia.org/T329722) (owner: 10Jbond)
[17:28:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] redfish: ensure versions are parsed as packging.version.Version instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/890870 (owner: 10Jbond)
[17:29:22] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/890476 (owner: 10Jbond)
[17:29:52] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/890477 (owner: 10Jbond)
[17:29:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (4) kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[17:30:18] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[17:30:36] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[17:31:04] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[17:31:09] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[17:31:16] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[17:31:25] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[17:31:35] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[17:31:45] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[17:32:05] <wikibugs>	 (03Merged) 10jenkins-bot: redfish: ensure versions are parsed as packging.version.Version instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/890870 (owner: 10Jbond)
[17:32:39] <jinxer-wm>	 (HelmReleaseBadStatus) firing: (3) Helm release mw-api-ext/main on k8s@codfw in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[17:33:08] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2186.mgmt.codfw.wmnet with reboot policy FORCED
[17:33:13] <wikibugs>	 (03PS4) 10JHathaway: Add jaeger-es-index-cleaner [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553)
[17:33:35] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2186']
[17:33:43] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['db2186']
[17:33:52] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2186']
[17:34:15] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] Add jaeger-{builder,query,collector} [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway)
[17:34:17] <wikibugs>	 (03CR) 10JHathaway: [V: 03+2 C: 03+2] Add jaeger-{builder,query,collector} [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway)
[17:34:55] <wikibugs>	 (03CR) 10JHathaway: [V: 03+2 C: 03+2] "I think most of the concerns have been addressed, so going ahead with the merge" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway)
[17:35:13] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] Add jaeger-es-index-cleaner [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway)
[17:35:17] <wikibugs>	 (03CR) 10JHathaway: [V: 03+2 C: 03+2] Add jaeger-es-index-cleaner [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway)
[17:36:23] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] Add jaeger chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/888276 (owner: 10JHathaway)
[17:36:30] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[17:36:33] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[17:37:26] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: sync
[17:37:32] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: sync
[17:39:04] <wikibugs>	 (CR) Elukey: [C: +2] admin_ng: fix knative settings for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/890883 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[17:41:05] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[17:41:55] <wikibugs>	 (Merged) jenkins-bot: Add jaeger chart [deployment-charts] - https://gerrit.wikimedia.org/r/888276 (owner: JHathaway)
[17:42:12] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[17:42:39] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: (3) Helm release mw-api-ext/main on k8s@codfw in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[17:44:33] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2186']
[17:45:26] <wikibugs>	 (CR) Nray: "Thank you! Planning to deploy this today during the UTC late backport window" [mediawiki-config] - https://gerrit.wikimedia.org/r/890509 (https://phabricator.wikimedia.org/T326147) (owner: Nray)
[17:45:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance
[17:45:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance
[17:47:09] <wikibugs>	 (CR) Dzahn: "ah, right, others already use underscores too, thanks, I'll do that" [puppet] - https://gerrit.wikimedia.org/r/890014 (owner: Dzahn)
[17:48:30] <jinxer-wm>	 (DatasourceNoData) resolved: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[17:50:01] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2186.codfw.wmnet with OS bullseye
[17:50:07] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops, Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2186.codfw.wmnet with OS bullseye
[17:50:51] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[17:51:00] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[17:52:01] <jinxer-wm>	 (DatasourceNoData) firing: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[17:52:29] <wikibugs>	 (PS2) Dzahn: site: differentiate between both serviceops teams for insetup roles [puppet] - https://gerrit.wikimedia.org/r/890014
[17:52:58] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[17:53:07] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[17:53:39] <wikibugs>	 (CR) Ssingh: [C: +2] Revert "admin_state: depool codfw" [dns] - https://gerrit.wikimedia.org/r/890847 (owner: Ayounsi)
[17:53:43] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[17:54:01] <sukhe>	 !log run authdns-update for Gerrit: 890847. repooling codfw
[17:54:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:12] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[17:55:56] <wikibugs>	 (PS4) JHathaway: Add jaeger to aux-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/888761
[17:57:48] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.discovery.service-route depool restbase-async in eqiad: T327991
[17:57:49] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.wipe-cache restbase-async.discovery.wmnet on all recursors
[17:57:52] <stashbot>	 T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991
[17:57:53] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase-async.discovery.wmnet on all recursors
[17:58:54] <TheresNoTime>	 jouncebot: nowandnext
[17:58:54] <jouncebot>	 For the next 0 hour(s) and 1 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1700)
[17:58:54] <jouncebot>	 For the next 0 hour(s) and 1 minute(s): Grafana 9 Upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1700)
[17:58:54] <jouncebot>	 In 0 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1800)
[17:59:49] <wikibugs>	 SRE, DNS, Traffic, Chinese-Sites: Let all requests from mainland China will be processed to codfw/esams/drmrs - https://phabricator.wikimedia.org/T330024 (BCornwall) Open→Declined Hi, @I. Thank you for reporting and for your detailed descriptions.  The team's limited capacity prevents the...
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1800)
[18:00:09] <wikibugs>	 SRE, Traffic, IPv6: Start a pure IPv6 web site for wikimedia services - https://phabricator.wikimedia.org/T330020 (BCornwall) Open→Declined p:Triage→Lowest Hi, @I. Thank you for reporting and for your detailed descriptions.  The team's limited capacity prevents the maintenance work re...
[18:02:18] <wikibugs>	 (PS1) Hnowlan: api-gateway: add REST gateway LUA CSP handler [deployment-charts] - https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T329049)
[18:02:20] <wikibugs>	 (CR) CI reject: [V: -1] Add jaeger to aux-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/888761 (owner: JHathaway)
[18:02:52] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool restbase-async in eqiad: T327991
[18:02:56] <stashbot>	 T327991: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991
[18:03:11] <wikibugs>	 (PS2) Hnowlan: api-gateway: add REST gateway Lua CSP handler [deployment-charts] - https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321)
[18:03:14] <wikibugs>	 (CR) CI reject: [V: -1] api-gateway: add REST gateway Lua CSP handler [deployment-charts] - https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: Hnowlan)
[18:04:07] <wikibugs>	 (CR) CI reject: [V: -1] api-gateway: add REST gateway Lua CSP handler [deployment-charts] - https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: Hnowlan)
[18:06:37] <wikibugs>	 (CR) Dzahn: [C: +1] gitlab: allow rsync between replicas [puppet] - https://gerrit.wikimedia.org/r/890434 (https://phabricator.wikimedia.org/T329930) (owner: Jelto)
[18:08:55] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/890843 (owner: Jbond)
[18:09:37] <wikibugs>	 (CR) Andrew Bogott: systemd timers: Add the 'After' requirement to the timer module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022) (owner: Andrew Bogott)
[18:09:50] <wikibugs>	 (Abandoned) Andrew Bogott: systemd timers: Add the 'After' requirement to the timer module [puppet] - https://gerrit.wikimedia.org/r/890513 (https://phabricator.wikimedia.org/T330022) (owner: Andrew Bogott)
[18:10:58] <wikibugs>	 (PS3) Hnowlan: api-gateway: add REST gateway Lua CSP handler [deployment-charts] - https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321)
[18:11:25] <wikibugs>	 SRE, MediaWiki-File-management, Traffic, Patch-For-Review, Technical-Debt: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (BCornwall) p:Triage→Lowest a:BCornwall
[18:18:35] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T330218 (phaultfinder)
[18:21:54] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.38.0" for 564 hosts
[18:23:23] <wikibugs>	 (CR) Dzahn: "+1 removed based on "no consensus comment" on ticket, which is from 2016 though" [puppet] - https://gerrit.wikimedia.org/r/527912 (https://phabricator.wikimedia.org/T124657) (owner: Fomafix)
[18:23:55] <wikibugs>	 (PS3) BCornwall: Remove FLoC headers [puppet] - https://gerrit.wikimedia.org/r/889892 (https://phabricator.wikimedia.org/T312823)
[18:26:10] <wikibugs>	 SRE, MediaWiki-File-management, Traffic, Patch-For-Review, Technical-Debt: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (BCornwall) a:BCornwall→Legoktm
[18:26:26] <wikibugs>	 SRE, MediaWiki-File-management, Traffic, Patch-For-Review, Technical-Debt: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (BCornwall) Open→In progress
[18:30:42] <wikibugs>	 (CR) Krinkle: "Might be worth covering by a test. We're going to seriously rely on this, i.e. avoid "oh this seems old, lets remove it" or accidental bre" [puppet] - https://gerrit.wikimedia.org/r/890512 (https://phabricator.wikimedia.org/T309787) (owner: Legoktm)
[18:36:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:37:09] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[18:38:18] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2186.codfw.wmnet with OS bullseye
[18:38:21] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:38:23] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops, Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2186.codfw.wmnet with OS bullseye executed with errors: - db2186 (**FAIL...
[18:38:36] <wikibugs>	 (CR) BCornwall: [V: +2 C: +2] "PCC is happy again (with the same caveat of the one unrelated fail):" [puppet] - https://gerrit.wikimedia.org/r/889892 (https://phabricator.wikimedia.org/T312823) (owner: BCornwall)
[18:42:06] <wikibugs>	 (CR) Cwhite: [C: +2] profile: Re-enable grafana db sync post 9.x upgrade [puppet] - https://gerrit.wikimedia.org/r/890849 (https://phabricator.wikimedia.org/T317887) (owner: Cwhite)
[18:43:15] <wikibugs>	 (PS1) Slyngshede: idm.wikimedia.org CNAME to idm1001.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/890891
[18:44:36] <wikibugs>	 (CR) Slyngshede: "Required to complete production setup, and enable OIDC." [dns] - https://gerrit.wikimedia.org/r/890891 (owner: Slyngshede)
[18:46:29] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01382 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[18:52:29] <sukhe>	 er
[18:53:06] <mutante>	 sukhe: seems like all cp* hosts, maybe more
[18:53:18] <sukhe>	 mutante: yeah definitely more than one thing happening here!
[18:53:45] <mutante>	 it's reload-vcl-failed-frontend
[18:54:08] <sukhe>	 brett: ^
[18:54:13] <sukhe>	 seems like https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/19b9dcb8264cb3fc57e5160d54856ea30d9f897e is failing
[18:54:17] <sukhe>	 https://puppetboard.wikimedia.org/report/cp5031.eqsin.wmnet/3ba20e8655ef419095dc627d4bd28bef357fb054
[18:54:20] <sukhe>	 looking
[18:54:48] <mutante>	 returns: ('/etc/varnish/wikimedia_upload-frontend.vcl' Line 1043
[18:55:06] <mutante>	 Undefined sub https_deliver_permissionspolicy
[18:55:52] <sukhe>	 yep
[18:56:00] <sukhe>	 brett: modules/varnish/templates/wikimedia-frontend.vcl.erb, line 1124
[18:56:03] <sukhe>	 1124     call https_deliver_permissionspolicy;
[18:56:11] <sukhe>	 we should probably remove this?
[18:56:55] <sukhe>	 cwhite: apologies but https://puppetboard.wikimedia.org/report/grafana1002.eqiad.wmnet/68b1c08326f83f7501a26875e12b44cbe144cbcb ?
[18:56:58] <sukhe>	 is this expected?
[18:57:35] <sukhe>	 so yeah, there is the cp hosts failure which is the bigger issue
[18:57:37] <sukhe>	 let me patch it
[18:59:24] <wikibugs>	 (PS1) Ssingh: varnish: update wikimedia-frontend.vcl.erb for 19b9dcb8264 [puppet] - https://gerrit.wikimedia.org/r/890895 (https://phabricator.wikimedia.org/T312823)
[19:00:05] <jouncebot>	 hashar and dduvall: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T1900).
[19:01:24] <sukhe>	 mutante: thanks for the debug above
[19:01:25] <wikibugs>	 SRE, API Platform, MediaWiki-Core-HTTP-Cache, MediaWiki-REST-API, and 4 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (JArguello-WMF)
[19:01:30] <sukhe>	 will wait for brett to review the patch once
[19:03:18] <cwhite>	 sukhe: that failure on grafana1002 has been there for 10 hours.  Possibly some manual action with the grafana-grizzly repo?
[19:04:12] <sukhe>	 yep, just flagged it in case it was missed
[19:05:24] <mutante>	 sukhe: thanks for being on top of it :))
[19:05:28] <wikibugs>	 (CR) Ssingh: [C: +2] varnish: update wikimedia-frontend.vcl.erb for 19b9dcb8264 [puppet] - https://gerrit.wikimedia.org/r/890895 (https://phabricator.wikimedia.org/T312823) (owner: Ssingh)
[19:06:56] <sukhe>	 cwhite: sorry for the confusion, that wasn't your change! the grafana thing was just a coincidence :)
[19:08:36] <sukhe>	 ok, the cp errors should clear up, forcing a puppet run on all cp
[19:14:14] <sukhe>	 ok some other failed runs here
[19:14:15] <sukhe>	 https://puppetboard.wikimedia.org/nodes?status=failed
[19:14:19] <sukhe>	 but the cp ones have cleared up
[19:17:03] <wikibugs>	 (PS2) Andrea Denisse: rsyslog: Remove centrallog1001 as TLS rsyslog destination [puppet] - https://gerrit.wikimedia.org/r/890884 (https://phabricator.wikimedia.org/T328803)
[19:36:25] <wikibugs>	 (CR) MSantos: [C: +1] api-gateway: add rest gateway configuration [deployment-charts] - https://gerrit.wikimedia.org/r/890012 (https://phabricator.wikimedia.org/T329049) (owner: Hnowlan)
[19:38:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (Andrew)
[19:40:07] <wikibugs>	 SRE, MediaWiki-File-management, Traffic, Patch-For-Review, Technical-Debt: Remove IEContentAnalyzer - https://phabricator.wikimedia.org/T309787 (BCornwall) @bblack, @Vgutierrez: Is it reasonable to put this header into Varnish itself as per https://gerrit.wikimedia.org/r/c/890512? Seems sound...
[19:42:36] <wikibugs>	 (PS1) Ladsgroup: mariadb: Update grants to use wikiuser@10.% only [puppet] - https://gerrit.wikimedia.org/r/890900 (https://phabricator.wikimedia.org/T330185)
[20:00:20] <wikibugs>	 (PS5) JHathaway: Add jaeger to aux-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/888761
[20:04:10] <wikibugs>	 SRE, Diff-blog, Technical Blog, Traffic, HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (Dzahn) @Sbenchagra and @BCornwall Thank you soooo much for resolving this. It's great to see long-standing tickets closed.  @Sbenchagra regarding w...
[20:04:20] <wikibugs>	 SRE, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (colewhite)
[20:05:16] <wikibugs>	 (CR) CI reject: [V: -1] Add jaeger to aux-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/888761 (owner: JHathaway)
[20:10:45] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2186']
[20:10:55] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db2186']
[20:13:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2108 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P44705 and previous config saved to /var/cache/conftool/dbconfig/20230221-201308-root.json
[20:26:21] <wikibugs>	 (PS1) Dzahn: doc: fix hostname used in http::blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/890903 (https://phabricator.wikimedia.org/T327973)
[20:28:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2108 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P44706 and previous config saved to /var/cache/conftool/dbconfig/20230221-202813-root.json
[20:32:01] <jinxer-wm>	 (DatasourceNoData) resolved: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[20:35:00] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2186.codfw.wmnet with OS bullseye
[20:35:07] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops, Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2186.codfw.wmnet with OS bullseye
[20:43:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2108 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P44707 and previous config saved to /var/cache/conftool/dbconfig/20230221-204317-root.json
[20:44:37] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@5edcd7b]: Test deployment of search airflow dags
[20:45:45] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@5edcd7b]: Test deployment of search airflow dags (duration: 01m 08s)
[20:53:01] <wikibugs>	 (PS2) Dzahn: doc: fix hostname used in http::blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/890903 (https://phabricator.wikimedia.org/T327973)
[20:54:59] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2186.codfw.wmnet with reason: host reimage
[20:57:01] <jinxer-wm>	 (DatasourceNoData) firing: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[20:58:02] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2186.codfw.wmnet with reason: host reimage
[20:58:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2108 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P44708 and previous config saved to /var/cache/conftool/dbconfig/20230221-205822-root.json
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230221T2100)
[21:00:04] <jouncebot>	 RoanKattouw, nray, and arlolra: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:10] <TheresNoTime>	 I can deploy, but I note RoanKattouw you have some patches? Did you want to self-serve and do the others in the queue?
[21:00:17] <RoanKattouw>	 Yes I'd be happy to
[21:00:26] <RoanKattouw>	 I have been remiss doing deployments lately
[21:00:32] <TheresNoTime>	 okay ^^ all yours!
[21:00:42] <nray>	 o/
[21:03:46] <RoanKattouw>	 nray: I will start with your patch
[21:04:08] <wikibugs>	 (PS9) Jbond: redfish: add update commands using the patch method [software/spicerack] - https://gerrit.wikimedia.org/r/857783
[21:04:24] <wikibugs>	 (CR) Jbond: "updated, not sure what to do about the remaining pylint issues" [software/spicerack] - https://gerrit.wikimedia.org/r/857783 (owner: Jbond)
[21:07:13] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/890509 (https://phabricator.wikimedia.org/T326147) (owner: Nray)
[21:07:51] <wikibugs>	 (CR) CI reject: [V: -1] redfish: add update commands using the patch method [software/spicerack] - https://gerrit.wikimedia.org/r/857783 (owner: Jbond)
[21:07:59] <wikibugs>	 (Merged) jenkins-bot: Add static "Cleopatra" page to facilitate synthetic testing of 885362 [mediawiki-config] - https://gerrit.wikimedia.org/r/890509 (https://phabricator.wikimedia.org/T326147) (owner: Nray)
[21:08:29] <logmsgbot>	 !log catrope@deploy1002 Started scap: Backport for [[gerrit:890509|Add static "Cleopatra" page to facilitate synthetic testing of 885362 (T326147 T293303)]]
[21:08:36] <stashbot>	 T326147: Stop fragmenting ParserCache entries for mobile frontend  - https://phabricator.wikimedia.org/T326147
[21:08:36] <stashbot>	 T293303: Mobile Wikipedia displays blurry thumbnails on hi-res devices - https://phabricator.wikimedia.org/T293303
[21:10:23] <logmsgbot>	 !log catrope@deploy1002 catrope and nray: Backport for [[gerrit:890509|Add static "Cleopatra" page to facilitate synthetic testing of 885362 (T326147 T293303)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[21:10:36] <RoanKattouw>	 nray: Your patch is on the debug servers, please test
[21:11:57] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2186.codfw.wmnet with OS bullseye
[21:12:03] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops, Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2186.codfw.wmnet with OS bullseye executed with errors: - db2186 (**FAIL...
[21:12:05] <nray>	 RoanKattouw: thank you
[21:12:45] <wikibugs>	 (CR) BCornwall: [C: +1] Revert "admin: Add kelhurd to analytics-privatedata-users" [puppet] - https://gerrit.wikimedia.org/r/890848 (owner: SBassett)
[21:12:50] <nray>	 @RoanKattouw looks good!
[21:14:48] <RoanKattouw>	 arlolra: Are you here for your backport deployment (Remove wgLinterSubmitterWhitelist)?
[21:15:14] <arlolra>	 yes
[21:15:29] <arlolra>	 This just cleans up the var, it is unused
[21:16:07] <RoanKattouw>	 Is it unused in production currently? The change removing action=record-lint in the Linter extension was only just merged a few hours ago, but maybe I'm missing something
[21:16:52] <arlolra>	 https://github.com/wikimedia/mediawiki/blob/master/includes/parser/Parsoid/Config/DataAccess.php#L434-L451
[21:17:16] <arlolra>	 Parsoid calls the hook directly, the api action stopped with Parsoid/JS
[21:17:25] <arlolra>	 so, yes, unused in production
[21:17:32] <RoanKattouw>	 OK
[21:18:07] <RoanKattouw>	 I'll deploy it as soon as Nick's patch finishes deploying
[21:18:34] <arlolra>	 thanks
[21:18:35] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (phaultfinder)
[21:18:58] <logmsgbot>	 !log catrope@deploy1002 Finished scap: Backport for [[gerrit:890509|Add static "Cleopatra" page to facilitate synthetic testing of 885362 (T326147 T293303)]] (duration: 10m 28s)
[21:19:05] <stashbot>	 T326147: Stop fragmenting ParserCache entries for mobile frontend  - https://phabricator.wikimedia.org/T326147
[21:19:06] <stashbot>	 T293303: Mobile Wikipedia displays blurry thumbnails on hi-res devices - https://phabricator.wikimedia.org/T293303
[21:19:16] <wikibugs>	 (PS2) Catrope: Remove wgLinterSubmitterWhitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/890496 (https://phabricator.wikimedia.org/T329992) (owner: Arlolra)
[21:19:22] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/890496 (https://phabricator.wikimedia.org/T329992) (owner: Arlolra)
[21:20:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[21:21:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[21:21:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T328255)', diff saved to https://phabricator.wikimedia.org/P44709 and previous config saved to /var/cache/conftool/dbconfig/20230221-212123-ladsgroup.json
[21:21:28] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[21:22:57] <nray>	 thank you for your help @RoanKattouw !
[21:23:36] <wikibugs>	 (PS1) Ebernhardson: Deploy analytics-refinery to search airflow instance [puppet] - https://gerrit.wikimedia.org/r/890906 (https://phabricator.wikimedia.org/T329870)
[21:24:02] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (Jclark-ctr)
[21:24:32] <logmsgbot>	 !log catrope@deploy1002 Started scap: Backport for [[gerrit:890496|Remove wgLinterSubmitterWhitelist (T329992)]]
[21:24:36] <stashbot>	 T329992: Remove Linter API action=record-lint - https://phabricator.wikimedia.org/T329992
[21:25:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T328255)', diff saved to https://phabricator.wikimedia.org/P44710 and previous config saved to /var/cache/conftool/dbconfig/20230221-212503-ladsgroup.json
[21:25:08] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.00592 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[21:26:13] <logmsgbot>	 !log catrope@deploy1002 arlolra and catrope: Backport for [[gerrit:890496|Remove wgLinterSubmitterWhitelist (T329992)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[21:29:26] <wikibugs>	 (PS2) Aklapper: redirects.dat: Provide acme-chief/TLS SNI list support in compile_redirects() [puppet] - https://gerrit.wikimedia.org/r/514477 (https://phabricator.wikimedia.org/T225096) (owner: Vgutierrez)
[21:29:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (4) kubernetes2017.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[21:30:37] <arlolra>	 RoanKattouw: I think you can continue
[21:30:44] <RoanKattouw>	 Yes, I hit continue
[21:35:08] <logmsgbot>	 !log catrope@deploy1002 Finished scap: Backport for [[gerrit:890496|Remove wgLinterSubmitterWhitelist (T329992)]] (duration: 10m 36s)
[21:35:13] <stashbot>	 T329992: Remove Linter API action=record-lint - https://phabricator.wikimedia.org/T329992
[21:36:08] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/884361 (https://phabricator.wikimedia.org/T315621) (owner: Catrope)
[21:36:19] <wikibugs>	 (PS3) Catrope: Add VueTest to extension-list, add config var [mediawiki-config] - https://gerrit.wikimedia.org/r/884361 (https://phabricator.wikimedia.org/T315621)
[21:36:23] <wikibugs>	 (CR) TrainBranchBot: "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/884361 (https://phabricator.wikimedia.org/T315621) (owner: Catrope)
[21:36:38] <wikibugs>	 (CR) Catrope: [C: +2] Add VueTest to extension-list, add config var [mediawiki-config] - https://gerrit.wikimedia.org/r/884361 (https://phabricator.wikimedia.org/T315621) (owner: Catrope)
[21:36:55] <arlolra>	 RoanKattouw: can confirm that linting is still working https://en.wikipedia.org/wiki/Special:LintErrors/obsolete-tag?namespace=&titlecategorysearch=User%3AArlolra%2Fsandbox&exactmatch=1
[21:36:59] <arlolra>	 Thanks!
[21:37:01] <RoanKattouw>	 Great!
[21:37:01] <jinxer-wm>	 (DatasourceNoData) resolved: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[21:37:26] <wikibugs>	 (Merged) jenkins-bot: Add VueTest to extension-list, add config var [mediawiki-config] - https://gerrit.wikimedia.org/r/884361 (https://phabricator.wikimedia.org/T315621) (owner: Catrope)
[21:37:50] <logmsgbot>	 !log catrope@deploy1002 Started scap: Backport for [[gerrit:884361|Add VueTest to extension-list, add config var (T315621)]]
[21:37:54] <stashbot>	 T315621: Install VueTest extension in beta labs - https://phabricator.wikimedia.org/T315621
[21:38:26] <wikibugs>	 (PS2) Aklapper: add sretools.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/815376 (https://phabricator.wikimedia.org/T313355) (owner: CDanis)
[21:39:23] <wikibugs>	 SRE, Security-Team, Traffic-Icebox, WMF-General-or-Unknown, and 2 others: Add restrictive CSP to upload.wikimedia.org - https://phabricator.wikimedia.org/T117618 (Aklapper)
[21:40:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P44711 and previous config saved to /var/cache/conftool/dbconfig/20230221-214009-ladsgroup.json
[21:44:17] <RoanKattouw>	 ger
[21:45:46] <wikibugs>	 (PS1) Zabe: Add mobile domain for ombuds.wm.o [dns] - https://gerrit.wikimedia.org/r/890908
[21:46:17] <wikibugs>	 SRE, noc.wikimedia.org, serviceops: make noc.wikimedia.org active/active (was: improve mw maintenance server switch over and discovery names) - https://phabricator.wikimedia.org/T265936 (Aklapper)
[21:47:01] <jinxer-wm>	 (DatasourceNoData) firing: (2) <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[21:48:51] <mutante>	 ^ somehow cxserver but perfect example for "alert because no data"
[21:54:25] <wikibugs>	 SRE, Traffic: Wikidough: Support EDNS(0) Padding: RFC 7830 and RFC 8467 - https://phabricator.wikimedia.org/T274431 (Aklapper)
[21:55:14] <wikibugs>	 SRE, Traffic: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (Aklapper)
[21:55:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P44712 and previous config saved to /var/cache/conftool/dbconfig/20230221-215515-ladsgroup.json
[22:01:16] <logmsgbot>	 !log catrope@deploy1002 catrope: Backport for [[gerrit:884361|Add VueTest to extension-list, add config var (T315621)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[22:01:20] <stashbot>	 T315621: Install VueTest extension in beta labs - https://phabricator.wikimedia.org/T315621
[22:02:51] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2001.codfw.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001
[22:08:35] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (phaultfinder)
[22:09:00] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2001.codfw.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001
[22:10:08] <wikibugs>	 SRE: Make bpfcc-tools available fleet-wide - https://phabricator.wikimedia.org/T261193 (Aklapper)
[22:10:15] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2002.codfw.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001
[22:10:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T328255)', diff saved to https://phabricator.wikimedia.org/P44713 and previous config saved to /var/cache/conftool/dbconfig/20230221-221021-ladsgroup.json
[22:10:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance
[22:10:26] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[22:10:28] <wikibugs>	 SRE, Patch-Needs-Improvement, User-jbond: Restrict GIDs for system users to 499 as the upper boundary - https://phabricator.wikimedia.org/T235162 (Aklapper)
[22:10:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance
[22:10:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T328255)', diff saved to https://phabricator.wikimedia.org/P44714 and previous config saved to /var/cache/conftool/dbconfig/20230221-221042-ladsgroup.json
[22:14:25] <wikibugs>	 SRE, Infrastructure-Foundations, netops, IPv6, User-jbond: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (Aklapper)
[22:14:39] <wikibugs>	 Puppet, Infrastructure-Foundations, Patch-Needs-Improvement, User-jbond: Refactor puppet-merge - https://phabricator.wikimedia.org/T254249 (Aklapper)
[22:14:58] <logmsgbot>	 !log catrope@deploy1002 Finished scap: Backport for [[gerrit:884361|Add VueTest to extension-list, add config var (T315621)]] (duration: 37m 07s)
[22:15:02] <stashbot>	 T315621: Install VueTest extension in beta labs - https://phabricator.wikimedia.org/T315621
[22:15:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T328255)', diff saved to https://phabricator.wikimedia.org/P44715 and previous config saved to /var/cache/conftool/dbconfig/20230221-221529-ladsgroup.json
[22:15:34] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[22:15:41] <wikibugs>	 (PS5) Aklapper: api-gateway: add script for generating beta config [deployment-charts] - https://gerrit.wikimedia.org/r/722411 (https://phabricator.wikimedia.org/T254917) (owner: Daniel Kinzler)
[22:15:54] <wikibugs>	 (PS12) Aklapper: jobrunner: Standard mediawiki webserver configuration [puppet] - https://gerrit.wikimedia.org/r/576913 (https://phabricator.wikimedia.org/T246389) (owner: Hnowlan)
[22:16:26] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2002.codfw.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001
[22:16:32] <wikibugs>	 (CR) CI reject: [V: -1] api-gateway: add script for generating beta config [deployment-charts] - https://gerrit.wikimedia.org/r/722411 (https://phabricator.wikimedia.org/T254917) (owner: Daniel Kinzler)
[22:16:46] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[22:16:47] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2003.codfw.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001
[22:18:00] <wikibugs>	 (PS3) Catrope: Enable VueTest in labs only [mediawiki-config] - https://gerrit.wikimedia.org/r/884364 (https://phabricator.wikimedia.org/T315621)
[22:18:06] <wikibugs>	 (CR) Catrope: [C: +2] Enable VueTest in labs only [mediawiki-config] - https://gerrit.wikimedia.org/r/884364 (https://phabricator.wikimedia.org/T315621) (owner: Catrope)
[22:18:44] <wikibugs>	 (Merged) jenkins-bot: Enable VueTest in labs only [mediawiki-config] - https://gerrit.wikimedia.org/r/884364 (https://phabricator.wikimedia.org/T315621) (owner: Catrope)
[22:22:58] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2003.codfw.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001
[22:24:21] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore1001.eqiad.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001
[22:25:03] <wikibugs>	 (PS6) Aklapper: [WIP] Start migrating pybal to python3 [debs/pybal] - https://gerrit.wikimedia.org/r/644041 (https://phabricator.wikimedia.org/T200319) (owner: Ladsgroup)
[22:25:40] <wikibugs>	 SRE, DNS, Traffic-Icebox, Mobile, Patch-Needs-Improvement: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (Aklapper)
[22:26:12] <wikibugs>	 (CR) CI reject: [V: -1] [WIP] Start migrating pybal to python3 [debs/pybal] - https://gerrit.wikimedia.org/r/644041 (https://phabricator.wikimedia.org/T200319) (owner: Ladsgroup)
[22:30:31] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore1001.eqiad.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001
[22:30:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P44716 and previous config saved to /var/cache/conftool/dbconfig/20230221-223036-ladsgroup.json
[22:31:27] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore1002.eqiad.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001
[22:36:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:37:37] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore1002.eqiad.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001
[22:37:55] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore1003.eqiad.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001
[22:38:40] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (phaultfinder)
[22:43:33] <tzatziki>	 !log removing 15 files for legal compliance
[22:43:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:44:06] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore1003.eqiad.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001
[22:45:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P44717 and previous config saved to /var/cache/conftool/dbconfig/20230221-224542-ladsgroup.json
[22:49:17] <wikibugs>	 SRE, Privacy Engineering, Traffic: Remove obsolete "Permissions-Policy: interest-cohort" header - https://phabricator.wikimedia.org/T312823 (BCornwall) In progress→Resolved Thanks @ssingh for that followup patch ._.
[22:51:40] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/output/890903/39770/" [puppet] - https://gerrit.wikimedia.org/r/890903 (https://phabricator.wikimedia.org/T327973) (owner: Dzahn)
[22:58:38] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (phaultfinder)
[23:00:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T328255)', diff saved to https://phabricator.wikimedia.org/P44718 and previous config saved to /var/cache/conftool/dbconfig/20230221-230048-ladsgroup.json
[23:00:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance
[23:00:53] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[23:01:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance
[23:01:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T328255)', diff saved to https://phabricator.wikimedia.org/P44719 and previous config saved to /var/cache/conftool/dbconfig/20230221-230109-ladsgroup.json
[23:04:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T328255)', diff saved to https://phabricator.wikimedia.org/P44720 and previous config saved to /var/cache/conftool/dbconfig/20230221-230454-ladsgroup.json
[23:09:17] <tzatziki>	 !log removing 5 files for legal compliance
[23:09:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:14:44] <wikibugs>	 SRE, Traffic, User-MoritzMuehlenhoff: Unexpected auditd service restart failure - https://phabricator.wikimedia.org/T287266 (BCornwall) AFAICT we aren't packaging auditd ourselves. It might be easiest to just notify a trigger to re-start the stupid service after install since it looks like Debian isn...
[23:15:22] <wikibugs>	 SRE, Traffic, Patch-For-Review: haproxy: work on systemd unit hardening (cp hosts) - https://phabricator.wikimedia.org/T323944 (BCornwall) Open→Stalled
[23:17:09] <wikibugs>	 (PS1) Dzahn: ci::firewall: allow monitoring hosts to check httpd on contint* [puppet] - https://gerrit.wikimedia.org/r/890919 (https://phabricator.wikimedia.org/T327972)
[23:17:40] <wikibugs>	 (PS2) Dzahn: ci::firewall: allow monitoring hosts to check http on contint* [puppet] - https://gerrit.wikimedia.org/r/890919 (https://phabricator.wikimedia.org/T327972)
[23:20:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P44721 and previous config saved to /var/cache/conftool/dbconfig/20230221-232000-ladsgroup.json
[23:20:15] <wikibugs>	 (CR) Dzahn: [V: -1] "sigh, the old discussion about DNS names or IPs in firewall rules strikes again.   parameter 'monitoring_hosts' index 1 expects a match fo" [puppet] - https://gerrit.wikimedia.org/r/890919 (https://phabricator.wikimedia.org/T327972) (owner: Dzahn)
[23:23:15] <wikibugs>	 (PS3) Dzahn: ci::firewall: allow monitoring hosts to check http on contint* [puppet] - https://gerrit.wikimedia.org/r/890919 (https://phabricator.wikimedia.org/T327972)
[23:25:30] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/output/890919/39772/" [puppet] - https://gerrit.wikimedia.org/r/890919 (https://phabricator.wikimedia.org/T327972) (owner: Dzahn)
[23:25:58] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "this is to fix ": dial tcp 208.80.153.15:443: connect: connection refused"" errors in logstash" [puppet] - https://gerrit.wikimedia.org/r/890919 (https://phabricator.wikimedia.org/T327972) (owner: Dzahn)
[23:35:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P44722 and previous config saved to /var/cache/conftool/dbconfig/20230221-233506-ladsgroup.json
[23:43:34] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (phaultfinder)
[23:45:28] <wikibugs>	 SRE: syslog::centralserver: TLS cert only valid for centrallog1002 but centrallog1001 is checked - https://phabricator.wikimedia.org/T330244 (Dzahn)
[23:45:35] <wikibugs>	 SRE: syslog::centralserver: TLS cert only valid for centrallog1002 but centrallog1001 is checked - https://phabricator.wikimedia.org/T330244 (Dzahn) {F36864393}
[23:45:54] <wikibugs>	 SRE, observability: syslog::centralserver: TLS cert only valid for centrallog1002 but centrallog1001 is checked - https://phabricator.wikimedia.org/T330244 (Dzahn)
[23:46:57] <jinxer-wm>	 (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:50:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T328255)', diff saved to https://phabricator.wikimedia.org/P44723 and previous config saved to /var/cache/conftool/dbconfig/20230221-235012-ladsgroup.json
[23:50:18] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[23:51:57] <jinxer-wm>	 (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:55:20] <wikibugs>	 (PS1) Dzahn: ci::firewall: allow http monitoring from prometheus hosts [puppet] - https://gerrit.wikimedia.org/r/890920 (https://phabricator.wikimedia.org/T327972)
[23:55:23] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[23:58:44] <wikibugs>	 (CR) Dzahn: [V: -1] "oh my.. back to DNS names, opposite from previous comment:" [puppet] - https://gerrit.wikimedia.org/r/890920 (https://phabricator.wikimedia.org/T327972) (owner: Dzahn)