[00:05:38] <icinga-wm>	 PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:10] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:17] <wikibugs>	 (03Merged) 10jenkins-bot: Add namespace translations for Angika [extensions/Gadgets] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901652 (https://phabricator.wikimedia.org/T332118) (owner: 10Zabe)
[00:10:27] <wikibugs>	 (03Merged) 10jenkins-bot: Add namespace translations for Angika [extensions/Scribunto] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901653 (https://phabricator.wikimedia.org/T332118) (owner: 10Zabe)
[00:10:53] <wikibugs>	 (03Merged) 10jenkins-bot: Add namespaces, linktrail and digit transform table for Angika [core] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901651 (https://phabricator.wikimedia.org/T332118) (owner: 10Zabe)
[00:11:18] <logmsgbot>	 !log zabe@deploy2002 Started scap: Backport for [[gerrit:901652|Add namespace translations for Angika (T332118)]], [[gerrit:901653|Add namespace translations for Angika (T332118)]], [[gerrit:901651|Add namespaces, linktrail and digit transform table for Angika (T332118)]]
[00:11:24] <stashbot>	 T332118: Add namespace translations in Angika - https://phabricator.wikimedia.org/T332118
[00:18:38] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:07] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[00:26:07] <jinxer-wm>	 (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[00:26:40] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Initial configuration for anpwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901727 (https://phabricator.wikimedia.org/T332115) (owner: 10Zabe)
[00:27:24] <wikibugs>	 (03Merged) 10jenkins-bot: Initial configuration for anpwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901727 (https://phabricator.wikimedia.org/T332115) (owner: 10Zabe)
[00:29:15] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:901652|Add namespace translations for Angika (T332118)]], [[gerrit:901653|Add namespace translations for Angika (T332118)]], [[gerrit:901651|Add namespaces, linktrail and digit transform table for Angika (T332118)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[00:29:21] <stashbot>	 T332118: Add namespace translations in Angika - https://phabricator.wikimedia.org/T332118
[00:38:19] <logmsgbot>	 !log zabe@deploy2002 Finished scap: Backport for [[gerrit:901652|Add namespace translations for Angika (T332118)]], [[gerrit:901653|Add namespace translations for Angika (T332118)]], [[gerrit:901651|Add namespaces, linktrail and digit transform table for Angika (T332118)]] (duration: 27m 00s)
[00:38:24] <stashbot>	 T332118: Add namespace translations in Angika - https://phabricator.wikimedia.org/T332118
[00:39:30] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:40:02] <zabe>	 !log create Wikipedia Angika (anpwiki) # T332115
[00:40:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:40:07] <stashbot>	 T332115: Create Wikipedia Angika - https://phabricator.wikimedia.org/T332115
[00:40:24] <logmsgbot>	 !log zabe@deploy2002 Started scap: T332115
[00:47:20] <logmsgbot>	 !log zabe@deploy2002 Finished scap: T332115 (duration: 06m 56s)
[00:47:26] <stashbot>	 T332115: Create Wikipedia Angika - https://phabricator.wikimedia.org/T332115
[00:48:58] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:41] <wikibugs>	 (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901635
[00:49:43] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901635 (owner: 10Zabe)
[00:50:25] <wikibugs>	 (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901635 (owner: 10Zabe)
[00:50:49] <logmsgbot>	 !log zabe@deploy2002 Started scap: update interwiki cache
[00:57:51] <logmsgbot>	 !log zabe@deploy2002 Finished scap: update interwiki cache (duration: 07m 02s)
[00:58:04] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:59:32] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:08:00] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:19:22] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:19:37] <wikibugs>	 (03CR) 10Samwilson: Remove WikiEditor's Realtime Preview config vars (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901553 (https://phabricator.wikimedia.org/T327515) (owner: 10Samwilson)
[01:38:16] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:47:44] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:53:00] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:53:58] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:36] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:18:06] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:26:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:02] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:48:32] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:09:26] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:13:02] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+1] ml-services: add autoscaling settings for enwiki drafttopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/901671 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey)
[03:14:10] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+1] services: tweak lift wing endpoints to allow wikidata-specific endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/901675 (owner: 10Elukey)
[03:18:54] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:19:08] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] Temporarily disable xenon/excimer for switch maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901322 (https://phabricator.wikimedia.org/T330165) (owner: 10Tim Starling)
[03:19:40] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:21:42] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:37:48] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:42:34] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:16] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:49:10] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:50:08] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:57:40] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:08:10] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:11:07] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[04:14:54] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:17:42] <icinga-wm>	 RECOVERY - orchestrator process on dborch1002 is OK: PROCS OK: 1 process with regex args orchestrator http https://wikitech.wikimedia.org/wiki/Orchestrator
[04:19:34] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:26:07] <jinxer-wm>	 (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[04:28:50] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:07] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[04:38:36] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:41:30] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:46:07] <jinxer-wm>	 (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[04:48:06] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:09:02] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:18:34] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:39:26] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:46:23] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] etcd: add alert for high traffic volumes [alerts] - 10https://gerrit.wikimedia.org/r/901622 (https://phabricator.wikimedia.org/T322400) (owner: 10Giuseppe Lavagetto)
[05:48:12] <wikibugs>	 (03Merged) 10jenkins-bot: etcd: add alert for high traffic volumes [alerts] - 10https://gerrit.wikimedia.org/r/901622 (https://phabricator.wikimedia.org/T322400) (owner: 10Giuseppe Lavagetto)
[05:48:58] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T0600)
[06:08:00] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:19:26] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:10] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mesh.configuration: add support for custom error pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983)
[06:32:12] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: tegola-vector-tiles: update to mesh 1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/901767 (https://phabricator.wikimedia.org/T287983)
[06:32:14] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: modules: re-add base.kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/901768
[06:32:16] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: charts: upgrade to mesh 1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983)
[06:33:11] <wikibugs>	 (03PS1) 10Marostegui: ferm.pp: Add dborch1002 to the firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/901770 (https://phabricator.wikimedia.org/T298959)
[06:33:56] <wikibugs>	 (03PS2) 10Marostegui: ferm.pp: Add dborch1002 to the firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/901770 (https://phabricator.wikimedia.org/T298959)
[06:34:34] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1110 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:37:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] modules: re-add base.kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/901768 (owner: 10Giuseppe Lavagetto)
[06:37:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mesh.configuration: add support for custom error pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto)
[06:37:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] charts: upgrade to mesh 1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto)
[06:38:24] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1110 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:38:34] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:39:51] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: modules: re-add base.kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/901768
[06:39:53] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: charts: upgrade to mesh 1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983)
[06:45:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] charts: upgrade to mesh 1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto)
[06:45:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] modules: re-add base.kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/901768 (owner: 10Giuseppe Lavagetto)
[06:49:20] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:49:27] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: mesh.configuration: add support for custom error pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983)
[06:49:30] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: modules: re-add base.kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/901768
[06:49:32] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: charts: upgrade to mesh 1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983)
[06:49:58] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:51:01] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10Marostegui) @Jclark-ctr could you take a look at db1121's mgmt cable?
[06:53:14] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:54:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] charts: upgrade to mesh 1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto)
[06:56:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mesh.configuration: add support for custom error pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto)
[06:56:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] modules: re-add base.kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/901768 (owner: 10Giuseppe Lavagetto)
[06:58:52] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T0700)
[07:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:04:36] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:08:32] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:10:54] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:18:32] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:22:51] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 141082
[07:23:25] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 141082
[07:27:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10ayounsi) Let's use a new task for the new racks and keep this one for the spines. Speaking of spines we might want to hold on cabling the ne...
[07:35:03] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] services: tweak lift wing endpoints to allow wikidata-specific endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/901675 (owner: 10Elukey)
[07:36:07] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: add autoscaling settings for enwiki drafttopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/901671 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey)
[07:39:26] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:40:52] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:42:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add htriedman to analytics-platform-eng-admins [puppet] - 10https://gerrit.wikimedia.org/r/901614 (https://phabricator.wikimedia.org/T331647) (owner: 10Muehlenhoff)
[07:48:58] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:52:22] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff @Htriedman Your access has been enabled (it will take up to 30 minutes to have the change reach all servers), please re...
[07:53:45] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: mesh.configuration: add support for custom error pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983)
[07:53:47] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: modules: re-add base.kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/901768
[07:53:49] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: charts: upgrade to mesh 1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983)
[08:06:13] <wikibugs>	 (03PS6) 10Muehlenhoff: Make Python2 removal on Bullseye configurable [puppet] - 10https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363)
[08:06:16] <wikibugs>	 (03CR) 10Muehlenhoff: Make Python2 removal on Bullseye configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363) (owner: 10Muehlenhoff)
[08:07:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/901630 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[08:08:59] <wikibugs>	 (03CR) 10Filippo Giunchedi: "This LGTM, things left to do:" [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth)
[08:09:56] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:13:07] <wikibugs>	 (03PS6) 10Ayounsi: Create and deploy per-CDN-site DNS domains [dns] - 10https://gerrit.wikimedia.org/r/899214 (https://phabricator.wikimedia.org/T332025) (owner: 10Jameel Kaisar)
[08:14:14] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363) (owner: 10Muehlenhoff)
[08:17:56] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync
[08:18:20] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync
[08:18:56] <wikibugs>	 (03PS16) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[08:19:26] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:20:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Make Python2 removal on Bullseye configurable [puppet] - 10https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363) (owner: 10Muehlenhoff)
[08:20:20] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: sync
[08:20:36] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Create and deploy per-CDN-site DNS domains [dns] - 10https://gerrit.wikimedia.org/r/899214 (https://phabricator.wikimedia.org/T332025) (owner: 10Jameel Kaisar)
[08:20:39] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync
[08:21:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[08:23:30] <wikibugs>	 (03PS17) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[08:24:33] <XioNoX>	 !log deploy measure-$site.wikimedia.org CNAMES
[08:24:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:06] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[08:25:25] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[08:25:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[08:25:45] <wikibugs>	 (03CR) 10JMeybohm: mesh.configuration: add support for custom error pages (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto)
[08:27:13] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] modules: re-add base.kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/901768 (owner: 10Giuseppe Lavagetto)
[08:27:31] <vgutierrez>	 hmm I misclicked that one
[08:27:39] <wikibugs>	 (03CR) 10Vgutierrez: modules: re-add base.kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/901768 (owner: 10Giuseppe Lavagetto)
[08:27:57] <wikibugs>	 (03PS18) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[08:29:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[08:30:13] <XioNoX>	 vgutierrez: it was https://gerrit.wikimedia.org/r/c/operations/dns/+/899214
[08:30:22] <wikibugs>	 (03PS9) 10Muehlenhoff: Allow hive on bullseye to install and use the correct packages [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[08:30:26] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] services: add the first lift wing stream to change-prop [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey)
[08:30:28] <vgutierrez>	 XioNoX: :?
[08:31:00] <vgutierrez>	 XioNoX: I was referring to my +2 in https://gerrit.wikimedia.org/r/901768 
[08:31:02] <XioNoX>	 vgutierrez: I thought you were talking about my last log
[08:31:07] <XioNoX>	 nevermind :)
[08:31:22] <wikibugs>	 (03PS19) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[08:33:23] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[08:35:36] <wikibugs>	 (03PS20) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[08:37:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[08:38:30] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:39:45] <wikibugs>	 (03PS21) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[08:41:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[08:43:24] <wikibugs>	 (03PS22) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[08:45:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[08:45:39] <wikibugs>	 (03PS1) 10Elukey: sre.hosts.reboot-single.py: replace "pool" with "depool" [cookbooks] - 10https://gerrit.wikimedia.org/r/902009
[08:47:55] <wikibugs>	 (03PS23) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[08:48:00] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:48:30] <wikibugs>	 (03PS2) 10Elukey: sre.hosts.reboot-single.py: replace "pool" with "depool" [cookbooks] - 10https://gerrit.wikimedia.org/r/902009 (https://phabricator.wikimedia.org/T325153)
[08:49:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[08:49:58] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/902009 (https://phabricator.wikimedia.org/T325153) (owner: 10Elukey)
[08:51:11] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.hosts.reboot-single.py: replace "pool" with "depool" [cookbooks] - 10https://gerrit.wikimedia.org/r/902009 (https://phabricator.wikimedia.org/T325153) (owner: 10Elukey)
[08:51:44] <wikibugs>	 (03PS24) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[08:52:09] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.ganeti.reimage for host an-test-client1002.eqiad.wmnet with OS bullseye
[08:53:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[08:56:17] <wikibugs>	 (03PS25) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[08:58:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[08:58:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on pybal-test2003.codfw.wmnet with reason: Some tests with pybal/Bullseye
[08:58:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on pybal-test2003.codfw.wmnet with reason: Some tests with pybal/Bullseye
[08:58:53] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=dcc641f3-257f-4a0d-875d-85c9d542b7f8) set by jmm@cumin2002 for 3 days, 0:00:00 on 1 host(s) and their services with r...
[08:59:47] <wikibugs>	 (03PS26) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[09:00:05] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic-Icebox, 10Sustainability (Incident Followup): Investigate varnishd child crashes when multiple nodes get depooled/pooled concurrently - https://phabricator.wikimedia.org/T154801 (10ayounsi)
[09:01:07] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[09:01:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[09:01:42] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on kafka-main1004.eqiad.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware
[09:01:55] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on kafka-main1004.eqiad.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware
[09:02:56] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main1004.eqiad.wmnet
[09:04:07] <wikibugs>	 (03PS1) 10Muehlenhoff: Add a component/pybal and respective build hook [puppet] - 10https://gerrit.wikimedia.org/r/902011
[09:04:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add a component/pybal and respective build hook [puppet] - 10https://gerrit.wikimedia.org/r/902011 (owner: 10Muehlenhoff)
[09:06:07] <jinxer-wm>	 (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[09:09:00] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:09:12] <wikibugs>	 (03PS27) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[09:09:37] <wikibugs>	 (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+2] hadoop: Authorize access from dse k8s pods to hdfs and hive-metastore prod [puppet] - 10https://gerrit.wikimedia.org/r/901562 (https://phabricator.wikimedia.org/T331859) (owner: 10Nicolas Fraison)
[09:10:37] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Znuny, 10serviceops-collab, 10Sustainability (Incident Followup): enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (10ayounsi)
[09:10:54] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts kafka-main1004.eqiad.wmnet
[09:11:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[09:11:19] <wikibugs>	 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Gerrit, 10serviceops-collab, and 3 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) The incident report is at https://wikitech.wikimedia.org/wiki/Incidents/2022-11-17_Gerrit_3.5_upgrade  The #wiki...
[09:11:22] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main1004.eqiad.wmnet
[09:11:53] <wikibugs>	 (03PS2) 10Muehlenhoff: Add a component/pybal and respective build hook [puppet] - 10https://gerrit.wikimedia.org/r/902011
[09:12:19] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: cookbooks: sre.hosts.reboot-single update to support disabled puppet - https://phabricator.wikimedia.org/T325153 (10elukey) 05Open→03Resolved Fixed :)
[09:12:39] <wikibugs>	 (03PS28) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[09:12:49] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-main1004.eqiad.wmnet
[09:12:51] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host kafka-main1004.eqiad.wmnet
[09:14:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[09:14:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[09:15:58] <wikibugs>	 (03PS1) 10Elukey: sre.hosts.reboot-single: set self.depool in any case [cookbooks] - 10https://gerrit.wikimedia.org/r/902013 (https://phabricator.wikimedia.org/T325153)
[09:16:08] <wikibugs>	 (03PS29) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[09:16:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add a component/pybal and respective build hook [puppet] - 10https://gerrit.wikimedia.org/r/902011 (owner: 10Muehlenhoff)
[09:18:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[09:18:26] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] changeprop-jobqueue: reduce concurrency of video transcoding [deployment-charts] - 10https://gerrit.wikimedia.org/r/901602 (https://phabricator.wikimedia.org/T278945) (owner: 10Giuseppe Lavagetto)
[09:18:32] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:18:38] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10Sustainability (Incident Followup): Use/adopt search cluster ES management cookbooks for logging ES too - https://phabricator.wikimedia.org/T255864 (10ayounsi)
[09:20:31] <wikibugs>	 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Gerrit, 10serviceops-collab, and 3 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10jcrespo) > I guess we can drop the SRE-OnFire tag?  Hashar: alternatively, this could be closed, as per title scope and...
[09:20:46] <wikibugs>	 (03PS30) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[09:21:34] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts kafka-main1004.eqiad.wmnet
[09:21:44] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/902013 (https://phabricator.wikimedia.org/T325153) (owner: 10Elukey)
[09:22:11] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.hosts.reboot-single: set self.depool in any case [cookbooks] - 10https://gerrit.wikimedia.org/r/902013 (https://phabricator.wikimedia.org/T325153) (owner: 10Elukey)
[09:22:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[09:23:23] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop-jobqueue: reduce concurrency of video transcoding [deployment-charts] - 10https://gerrit.wikimedia.org/r/901602 (https://phabricator.wikimedia.org/T278945) (owner: 10Giuseppe Lavagetto)
[09:23:37] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10Sustainability (Incident Followup): Use/adopt search cluster ES management cookbooks for logging ES too - https://phabricator.wikimedia.org/T255864 (10ayounsi) > Note: Once OpenSearch compatibili...
[09:23:56] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main1004.eqiad.wmnet
[09:26:23] <wikibugs>	 (03PS31) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[09:27:17] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-main1004.eqiad.wmnet
[09:27:20] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host kafka-main1004.eqiad.wmnet
[09:28:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[09:28:42] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10Jgiannelos) Can I also get access to superset? I can login and everything but, I need some more permissions to access the same data sources for example I have acce...
[09:28:54] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10Jgiannelos) 05Resolved→03Open
[09:30:15] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Allow hive on bullseye to install and use the correct packages [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[09:30:29] <wikibugs>	 (03CR) 10Btullis: Allow hive on bullseye to install and use the correct packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[09:32:27] <wikibugs>	 (03PS1) 10Elukey: sre.hosts.reboot-single: fix corner case when puppet is disabled [cookbooks] - 10https://gerrit.wikimedia.org/r/902015 (https://phabricator.wikimedia.org/T325153)
[09:33:31] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10DBA, 10Sustainability (Incident Followup): Automatically compare a few tables per section between hosts and DC - https://phabricator.wikimedia.org/T207253 (10ayounsi)
[09:35:12] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/902015 (https://phabricator.wikimedia.org/T325153) (owner: 10Elukey)
[09:36:56] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts kafka-main1004.eqiad.wmnet
[09:38:15] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-main1004.eqiad.wmnet with OS bullseye
[09:39:30] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:42:23] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Data-Persistence, 10Sustainability (Incident Followup), 10Wikimedia-Slow-DB-Query: Optimize SpecialAllPages::showChunk for large wikis - https://phabricator.wikimedia.org/T160983 (10ayounsi)
[09:45:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[09:47:12] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10DBA, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Introduce alerting to monitor mediawiki databases QPS rate of change - https://phabricator.wikimedia.org/T281833 (10ayounsi)
[09:49:00] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:50:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[09:54:12] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1004.eqiad.wmnet with reason: host reimage
[09:56:37] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1004.eqiad.wmnet with reason: host reimage
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T1000)
[10:06:07] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[10:07:30] <logmsgbot>	 !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host an-test-client1002.eqiad.wmnet with OS bullseye
[10:07:47] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10dr0ptp4kt) >>! In T332063#8717190, @Jgiannelos wrote: > Can I also get access to superset? I can login and everything but, I need some more permissions to access t...
[10:08:06] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:08:44] <wikibugs>	 (03PS1) 10Filippo Giunchedi: monitoring: cosmetic-only changes to check_dpkg [puppet] - 10https://gerrit.wikimedia.org/r/902019 (https://phabricator.wikimedia.org/T332764)
[10:08:46] <wikibugs>	 (03PS1) 10Filippo Giunchedi: monitoring: write node-exporter dpkg_success metric [puppet] - 10https://gerrit.wikimedia.org/r/902020 (https://phabricator.wikimedia.org/T332764)
[10:08:48] <wikibugs>	 (03PS1) 10Filippo Giunchedi: monitoring: simplify check_dpkg [puppet] - 10https://gerrit.wikimedia.org/r/902021 (https://phabricator.wikimedia.org/T332764)
[10:09:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[10:11:07] <jinxer-wm>	 (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[10:15:07] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "This diff looks like it's going to break the way flink did. As you probably fixed that with mesh.config 1.1.1 I'd suggest to abandon this " [deployment-charts] - 10https://gerrit.wikimedia.org/r/901767 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto)
[10:16:02] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1004.eqiad.wmnet with OS bullseye
[10:16:38] <wikibugs>	 (03CR) 10JMeybohm: modules: re-add base.kubernetes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/901768 (owner: 10Giuseppe Lavagetto)
[10:19:23] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "This LGTM. As said on IRC I think I8f0ffd3f4f3730a353d9ac78d5c1c65e70fe538d fixed the issue I saw when trying to update the mesh.configura" [deployment-charts] - 10https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto)
[10:19:34] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:22:59] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main2005.codfw.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware
[10:23:13] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main2005.codfw.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware
[10:24:28] <wikibugs>	 (03PS32) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[10:26:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[10:26:28] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10MatthewVernon) I //think// the issue is that the deleted container has different permissions: ` root@ms-fe1009:/home/mvernon# swift stat wikipedia-mediawik...
[10:26:59] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2005.codfw.wmnet
[10:27:52] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts kafka-main2005.codfw.wmnet
[10:28:37] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic, 10envoy, 10serviceops, and 2 others: Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (10Joe) a:03Joe
[10:29:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm can merge Monday after sprint week" [puppet] - 10https://gerrit.wikimedia.org/r/896318 (owner: 10Majavah)
[10:29:07] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2005.codfw.wmnet
[10:29:49] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40272/console" [puppet] - 10https://gerrit.wikimedia.org/r/896318 (owner: 10Majavah)
[10:30:01] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts kafka-main2005.codfw.wmnet
[10:30:06] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/901770 (https://phabricator.wikimedia.org/T298959) (owner: 10Marostegui)
[10:30:48] <wikibugs>	 (03PS33) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[10:32:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[10:32:57] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] ferm.pp: Add dborch1002 to the firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/901770 (https://phabricator.wikimedia.org/T298959) (owner: 10Marostegui)
[10:33:31] <wikibugs>	 (03Abandoned) 10Jbond: sre.hosts.reboot-single: args.depool not args.pool [cookbooks] - 10https://gerrit.wikimedia.org/r/900405 (owner: 10Jbond)
[10:33:45] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10jcrespo) How to add mw:backup to local-deleted containers? (I assumed this is handled on wiki creation- I will check that on my own), but how to do if for...
[10:34:02] <elukey>	 !log `racadm racreset` for kafka-main2005 - http idrac not available (ssh on works fine)
[10:34:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[10:35:46] <wikibugs>	 (03PS1) 10Jbond: Revert "sre.hosts.reboot-single: set self.depool in any case" [cookbooks] - 10https://gerrit.wikimedia.org/r/902026
[10:36:05] <wikibugs>	 (03PS2) 10Jbond: Revert "sre.hosts.reboot-single: set self.depool in any case" [cookbooks] - 10https://gerrit.wikimedia.org/r/902026
[10:36:06] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2005.codfw.wmnet
[10:36:10] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts kafka-main2005.codfw.wmnet
[10:36:30] <wikibugs>	 (03PS34) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[10:37:30] <wikibugs>	 (03PS3) 10Jbond: Revert "sre.hosts.reboot-single: set self.depool in any case" [cookbooks] - 10https://gerrit.wikimedia.org/r/902026
[10:38:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[10:38:36] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] changeprop: allow setting strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/901246 (owner: 10Hnowlan)
[10:38:40] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:39:41] <wikibugs>	 (03PS6) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328)
[10:39:54] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01663 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[10:40:06] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10MoritzMuehlenhoff) During Sprint week I tried to evaluate a setup where we keep Pybal on Python 2 (as shipped in Bullseye) and build the Twisted packages (which no longer ship Py2 packages in Bullseye) (plus the...
[10:40:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli)
[10:41:00] <volans>	 marostegui: I think your ferm patch might not be liked by the dbs
[10:41:03] <volans>	 see ^^^
[10:41:08] <marostegui>	 yeah
[10:41:10] <marostegui>	 reverting
[10:41:16] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2005.codfw.wmnet
[10:41:20] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts kafka-main2005.codfw.wmnet
[10:41:24] <wikibugs>	 (03PS1) 10Marostegui: Revert "ferm.pp: Add dborch1002 to the firewall rules" [puppet] - 10https://gerrit.wikimedia.org/r/902027
[10:41:26] <volans>	 "," expected
[10:41:27] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "-1: this is not the intended behaviour.  see" [cookbooks] - 10https://gerrit.wikimedia.org/r/902015 (https://phabricator.wikimedia.org/T325153) (owner: 10Elukey)
[10:41:48] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2005.codfw.wmnet
[10:41:57] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "ferm.pp: Add dborch1002 to the firewall rules" [puppet] - 10https://gerrit.wikimedia.org/r/902027 (owner: 10Marostegui)
[10:41:59] <wikibugs>	 (03PS4) 10Jbond: Revert "sre.hosts.reboot-single: set self.depool in any case" [cookbooks] - 10https://gerrit.wikimedia.org/r/902026 (https://phabricator.wikimedia.org/T325153)
[10:42:20] <taavi>	 the syntax in that is wrong, replace `) (` with ` `
[10:42:25] <wikibugs>	 (03PS35) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[10:42:28] <volans>	 marostegui: https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed if you need it
[10:43:05] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop: allow setting strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/901246 (owner: 10Hnowlan)
[10:43:13] <marostegui>	 volans: thanks, doing!
[10:44:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[10:44:24] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10MatthewVernon) Yeah, that's a good question - I think there are about 21675 deleted containers. I think there's no automation for container management (is...
[10:44:49] <wikibugs>	 (03PS1) 10Marostegui: ferm.pp: Add dborch1002 [puppet] - 10https://gerrit.wikimedia.org/r/902046 (https://phabricator.wikimedia.org/T298959)
[10:46:09] <wikibugs>	 (03PS36) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[10:47:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[10:47:58] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:48:38] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10jcrespo) > I think there's no automation for container management  Don't worry too much about details/implementation, as that is something I can solve- my...
[10:48:49] <wikibugs>	 (03PS37) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[10:49:16] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0009785 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[10:49:28] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts kafka-main2005.codfw.wmnet
[10:49:51] <wikibugs>	 (03CR) 10Jbond: "lgtm optional nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/902021 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[10:50:21] <wikibugs>	 (03PS1) 10Hnowlan: changeprop-jobqueue: change deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/902048
[10:50:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[10:51:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/902019 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[10:51:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/902046 (https://phabricator.wikimedia.org/T298959) (owner: 10Marostegui)
[10:51:56] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] ferm.pp: Add dborch1002 [puppet] - 10https://gerrit.wikimedia.org/r/902046 (https://phabricator.wikimedia.org/T298959) (owner: 10Marostegui)
[10:52:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/902020 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[10:52:47] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] monitoring: simplify check_dpkg [puppet] - 10https://gerrit.wikimedia.org/r/902021 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[10:52:50] <wikibugs>	 (03PS38) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[10:53:10] <wikibugs>	 (03PS1) 10Muehlenhoff: rt: Remove some old migration cruft [puppet] - 10https://gerrit.wikimedia.org/r/902049
[10:54:03] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/889628 (owner: 10Majavah)
[10:54:36] <wikibugs>	 (03Abandoned) 10Elukey: sre.hosts.reboot-single: fix corner case when puppet is disabled [cookbooks] - 10https://gerrit.wikimedia.org/r/902015 (https://phabricator.wikimedia.org/T325153) (owner: 10Elukey)
[10:54:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/890391 (owner: 10Majavah)
[10:55:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[10:55:24] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Revert "sre.hosts.reboot-single: set self.depool in any case" [cookbooks] - 10https://gerrit.wikimedia.org/r/902026 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond)
[10:56:08] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Revert "sre.hosts.reboot-single: set self.depool in any case" [cookbooks] - 10https://gerrit.wikimedia.org/r/902026 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond)
[10:56:35] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10MatthewVernon) I wonder (but this is not a settled position) whether using an account ACL is the more elegant solution, as we do that once and it'll work f...
[10:56:42] <wikibugs>	 (03PS39) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[10:57:03] <wikibugs>	 (03PS1) 10Marostegui: common.yaml: Add dborch1002 to MYSQL_ROOT_CLIENTS [puppet] - 10https://gerrit.wikimedia.org/r/902050 (https://phabricator.wikimedia.org/T298959)
[10:58:08] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "sre.hosts.reboot-single: set self.depool in any case" [cookbooks] - 10https://gerrit.wikimedia.org/r/902026 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond)
[10:58:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[10:59:04] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2005.codfw.wmnet
[10:59:07] <logmsgbot>	 !log elukey@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts kafka-main2005.codfw.wmnet
[10:59:23] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2005.codfw.wmnet
[10:59:51] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-main2005.codfw.wmnet
[10:59:54] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host kafka-main2005.codfw.wmnet
[11:00:08] <wikibugs>	 (03PS40) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[11:02:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[11:02:46] <jbond>	 !log upgraded prometheus-ipmi-exporter on buster and bullseye
[11:02:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:37] <wikibugs>	 (03PS41) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[11:05:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[11:06:03] <wikibugs>	 (03PS2) 10Marostegui: common.yaml: Add dborch1002 to MYSQL_ROOT_CLIENTS [puppet] - 10https://gerrit.wikimedia.org/r/902050 (https://phabricator.wikimedia.org/T298959)
[11:06:39] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Set haproxy->varnish connection limits on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/902051 (https://phabricator.wikimedia.org/T310609)
[11:06:49] <wikibugs>	 (03PS1) 10EoghanGaffney: Alert on sessionstore scheduling on non-dedicated k8s hosts [alerts] - 10https://gerrit.wikimedia.org/r/902052 (https://phabricator.wikimedia.org/T325139)
[11:08:09] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts kafka-main2005.codfw.wmnet
[11:08:42] <icinga-wm>	 PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:09:05] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2005.codfw.wmnet
[11:09:53] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-main2005.codfw.wmnet
[11:10:31] <wikibugs>	 (03PS42) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[11:12:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[11:12:40] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Aklapper) @FNavas-foundation: Hi, thanks for caring and no worries - basically see my comment T331482#8703089 what would be nice to do here (and feel free to elaborate...
[11:13:48] <wikibugs>	 (03PS5) 10Jbond: prometheus::ipmi_exporter: update config to use inbuilt sudo option [puppet] - 10https://gerrit.wikimedia.org/r/901680 (https://phabricator.wikimedia.org/T253810)
[11:13:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/902050 (https://phabricator.wikimedia.org/T298959) (owner: 10Marostegui)
[11:14:19] <wikibugs>	 (03PS43) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[11:14:30] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] common.yaml: Add dborch1002 to MYSQL_ROOT_CLIENTS [puppet] - 10https://gerrit.wikimedia.org/r/902050 (https://phabricator.wikimedia.org/T298959) (owner: 10Marostegui)
[11:14:45] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main2005.codfw.wmnet
[11:14:47] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts kafka-main2005.codfw.wmnet
[11:15:10] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2005.codfw.wmnet
[11:15:47] <icinga-wm>	 PROBLEM - Kafka Broker Server #page on kafka-main2005 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[11:15:50] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-main2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[11:16:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[11:16:11] <volans>	 elukey: ^^^
[11:16:25] <volans>	 wasn't silenced?
[11:16:26] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main2005.codfw.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware
[11:16:40] <elukey>	 volans: yeah my bad, it was one hour, but the whole thing took more
[11:16:50] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main2005.codfw.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware
[11:16:57] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10MoritzMuehlenhoff) @Ottomata @odimitrijevic This needs your approval for analytics-privatedata-users
[11:17:00] <elukey>	 sorry folks
[11:17:13] <slyngs>	 No problem, we'll ignore the alert :-)
[11:17:20] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-main2005.codfw.wmnet
[11:18:06] <icinga-wm>	 RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:18:24] <wikibugs>	 (03PS44) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[11:19:39] <wikibugs>	 (03PS1) 10Marostegui: mariadb/ferm.pp: Add dborch1002 to the firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/902055 (https://phabricator.wikimedia.org/T298959)
[11:20:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[11:20:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool needs to be rebooted T323961', diff saved to https://phabricator.wikimedia.org/P45910 and previous config saved to /var/cache/conftool/dbconfig/20230322-112031-root.json
[11:20:41] <stashbot>	 T323961: ManagementSSHDown - https://phabricator.wikimedia.org/T323961
[11:21:00] * volans got paged
[11:21:08] <moritzm>	 Luca is doing maintenance for that host
[11:21:20] <moritzm>	 firmware update
[11:21:22] <_joe_>	 !incidents
[11:21:23] <sirenbot>	 3482 (ACKED)  kafka-main2005/Kafka Broker Server (paged)
[11:21:27] <wikibugs>	 (03PS45) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[11:21:27] <volans>	 already acked
[11:21:30] <volans>	 in -sre
[11:22:06] <claime>	 ack
[11:22:08] <volans>	 people oncall, please make sure to ack the pages when you get them and are known
[11:22:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/902055 (https://phabricator.wikimedia.org/T298959) (owner: 10Marostegui)
[11:22:26] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb/ferm.pp: Add dborch1002 to the firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/902055 (https://phabricator.wikimedia.org/T298959) (owner: 10Marostegui)
[11:23:17] <slyngs>	 volans: Sorry, didn't think to ack it
[11:23:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[11:24:11] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: envoyproxy::tls_terminator: allow returning an HTML error page [puppet] - 10https://gerrit.wikimedia.org/r/902058 (https://phabricator.wikimedia.org/T287983)
[11:24:35] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main2005.codfw.wmnet
[11:24:37] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts kafka-main2005.codfw.wmnet
[11:24:56] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-main2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[11:25:09] <icinga-wm>	 PROBLEM - Kafka Broker Server #page on kafka-main2005 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[11:25:21] <wikibugs>	 (03PS46) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[11:25:26] <elukey>	 again? Srsly? There is downtime
[11:25:37] <elukey>	 anyway, kafka is up now
[11:25:43] <elukey>	 sorry for the extra alerts
[11:25:44] <icinga-wm>	 RECOVERY - orchestrator TCP port on dborch1002 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 3000 https://wikitech.wikimedia.org/wiki/Orchestrator
[11:26:16] <marostegui>	 ^ me testing
[11:26:21] <marostegui>	 I am going to disable notifications for that host
[11:26:50] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-main2005 is OK: SSL OK - Certificate kafka_main-codfw_broker valid until 2023-05-01 16:32:37 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[11:27:03] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: envoyproxy::tls_terminator: allow returning an HTML error page [puppet] - 10https://gerrit.wikimedia.org/r/902058 (https://phabricator.wikimedia.org/T287983)
[11:27:05] <icinga-wm>	 RECOVERY - Kafka Broker Server #page on kafka-main2005 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[11:27:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[11:27:28] <wikibugs>	 (03PS7) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328)
[11:27:35] <elukey>	 kafka already recovered, all good
[11:28:06] <wikibugs>	 (03PS1) 10Jbond: ipmisled: Send ipmisled logs to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/902060 (https://phabricator.wikimedia.org/T302639)
[11:28:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli)
[11:28:47] <wikibugs>	 (03PS47) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[11:29:23] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: envoyproxy::tls_terminator: allow returning an HTML error page [puppet] - 10https://gerrit.wikimedia.org/r/902058 (https://phabricator.wikimedia.org/T287983)
[11:29:59] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "I have no context on the config file, but the addition LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/902060 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond)
[11:30:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[11:30:26] <wikibugs>	 (03PS8) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328)
[11:30:26] <icinga-wm>	 PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:30:26] <marostegui>	 !log Poweroff db1121 (lag will show on wikireplicas for s4 section) T323961
[11:30:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:31] <stashbot>	 T323961: ManagementSSHDown - https://phabricator.wikimedia.org/T323961
[11:30:35] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40275/console" [puppet] - 10https://gerrit.wikimedia.org/r/902058 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto)
[11:30:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[11:31:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli)
[11:33:13] <wikibugs>	 (03PS1) 10Hnowlan: thumbor: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/902061 (https://phabricator.wikimedia.org/T331995)
[11:36:04] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ml-staging-ctrl2002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:42:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/901536 (https://phabricator.wikimedia.org/T330120) (owner: 10Hashar)
[11:43:53] <wikibugs>	 (03PS48) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[11:44:20] <wikibugs>	 (03PS6) 10Hnowlan: api-gateway: add REST gateway Lua CSP handler [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321)
[11:44:22] <wikibugs>	 (03PS2) 10Hnowlan: rest-gateway: add helmfile, enable mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074)
[11:47:02] <wikibugs>	 (03PS9) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328)
[11:48:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli)
[11:52:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[11:53:16] <wikibugs>	 (03PS49) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[11:53:36] <wikibugs>	 (03PS10) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328)
[11:53:40] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[11:53:43] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[11:55:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[11:55:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli)
[11:56:42] <wikibugs>	 (03PS50) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[11:57:38] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10Marostegui) db1121 is now off and ready for you @Jclark-ctr
[11:58:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[12:00:19] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[12:00:22] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[12:00:36] <wikibugs>	 (03PS1) 10MVernon: Provision the revised Swift dashboard [puppet] - 10https://gerrit.wikimedia.org/r/902064 (https://phabricator.wikimedia.org/T328872)
[12:02:01] <wikibugs>	 (03PS51) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[12:02:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Provision the revised Swift dashboard [puppet] - 10https://gerrit.wikimedia.org/r/902064 (https://phabricator.wikimedia.org/T328872) (owner: 10MVernon)
[12:03:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[12:03:52] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[12:03:54] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[12:05:20] <wikibugs>	 (03PS1) 10Stevemunene: Add an-test-client1002 dummy keytab [labs/private] - 10https://gerrit.wikimedia.org/r/902065 (https://phabricator.wikimedia.org/T329363)
[12:05:46] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[12:05:48] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[12:06:22] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] envoyproxy::tls_terminator: allow returning an HTML error page [puppet] - 10https://gerrit.wikimedia.org/r/902058 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto)
[12:06:24] <wikibugs>	 (03PS11) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328)
[12:06:47] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: admin_ng: Merge eqiad,codfw namespaces quotes in main [deployment-charts] - 10https://gerrit.wikimedia.org/r/902066
[12:06:49] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: changeprop-jobqueue: Double resource quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/902067
[12:07:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli)
[12:15:35] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: openstack::nutcracker: Remove redis support [puppet] - 10https://gerrit.wikimedia.org/r/902074
[12:15:53] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10SRE Observability, 10Patch-For-Review: How should we monitor for faulty memory modules? - https://phabricator.wikimedia.org/T302639 (10jbond) We have now added a [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/901674/ | cha...
[12:17:17] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] thumbor: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/902061 (https://phabricator.wikimedia.org/T331995) (owner: 10Hnowlan)
[12:17:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] openstack::nutcracker: Remove redis support [puppet] - 10https://gerrit.wikimedia.org/r/902074 (owner: 10Alexandros Kosiaris)
[12:19:18] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: openstack::nutcracker: Remove redis support [puppet] - 10https://gerrit.wikimedia.org/r/902074
[12:19:32] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[12:19:35] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[12:21:05] <wikibugs>	 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Sustainability (Incident Followup): 2023-01-10 eqsin network outage - https://phabricator.wikimedia.org/T328354 (10ayounsi) 05Open→03Invalid Closing this task as there are no direct actionable.
[12:21:51] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40276/console" [puppet] - 10https://gerrit.wikimedia.org/r/902074 (owner: 10Alexandros Kosiaris)
[12:22:06] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Alert on sessionstore scheduling on non-dedicated k8s hosts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/902052 (https://phabricator.wikimedia.org/T325139) (owner: 10EoghanGaffney)
[12:22:45] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] hiera: Set haproxy->varnish connection limits on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/902051 (https://phabricator.wikimedia.org/T310609) (owner: 10Vgutierrez)
[12:26:21] <wikibugs>	 (03Merged) 10jenkins-bot: thumbor: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/902061 (https://phabricator.wikimedia.org/T331995) (owner: 10Hnowlan)
[12:27:22] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply
[12:27:26] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[12:27:29] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+1] "Note that the hosts PCC lists don't even have redis running and listening on the ports that nutcracker expects to find them." [puppet] - 10https://gerrit.wikimedia.org/r/902074 (owner: 10Alexandros Kosiaris)
[12:32:32] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[12:33:47] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: modules: re-add base.kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/901768
[12:33:49] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: mesh.configuration: add support for custom error pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983)
[12:33:51] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: charts: upgrade to mesh 1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983)
[12:38:14] <wikibugs>	 (03PS52) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[12:40:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[12:41:16] <wikibugs>	 (03PS12) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328)
[12:41:33] <wikibugs>	 (03PS53) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[12:42:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli)
[12:43:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[12:44:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] ipmisled: Send ipmisled logs to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/902060 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond)
[12:44:29] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[12:44:40] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10Jclark-ctr) Cable was replaced yesterday with no luck. Today performed flea power drain on db1121
[12:45:22] <wikibugs>	 (03PS2) 10EoghanGaffney: Alert on sessionstore scheduling on non-dedicated k8s hosts [alerts] - 10https://gerrit.wikimedia.org/r/902052 (https://phabricator.wikimedia.org/T325139)
[12:45:46] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::ipmi_exporter: update config to use inbuilt sudo option [puppet] - 10https://gerrit.wikimedia.org/r/901680 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond)
[12:45:52] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Provision the revised Swift dashboard [puppet] - 10https://gerrit.wikimedia.org/r/902064 (https://phabricator.wikimedia.org/T328872) (owner: 10MVernon)
[12:46:05] <wikibugs>	 (03CR) 10EoghanGaffney: Alert on sessionstore scheduling on non-dedicated k8s hosts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/902052 (https://phabricator.wikimedia.org/T325139) (owner: 10EoghanGaffney)
[12:46:25] <wikibugs>	 (03CR) 10MVernon: [V: 03+2 C: 03+2] Provision the revised Swift dashboard [puppet] - 10https://gerrit.wikimedia.org/r/902064 (https://phabricator.wikimedia.org/T328872) (owner: 10MVernon)
[12:47:13] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10Marostegui) db1121's mgmt is reachable now
[12:47:53] <wikibugs>	 (03PS54) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[12:49:09] <wikibugs>	 (03CR) 10Herron: [C: 03+1] application_servers/kafka: Remove IDs [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901719 (https://phabricator.wikimedia.org/T331656) (owner: 10BCornwall)
[12:49:16] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic-Icebox, 10Sustainability (Incident Followup): LVS should handle losing a NIC on eqiad and codfw - https://phabricator.wikimedia.org/T286924 (10ayounsi) Another approach is to put them in a distinct namespace (one without a default route) see {T114979}
[12:49:29] <wikibugs>	 (03CR) 10Herron: [C: 03+1] ipmisled: Send ipmisled logs to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/902060 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond)
[12:49:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[12:51:00] <wikibugs>	 (03PS1) 10Majavah: labstore: drop wmde-templates-alpha volumes [puppet] - 10https://gerrit.wikimedia.org/r/902076 (https://phabricator.wikimedia.org/T332773)
[12:51:36] <wikibugs>	 (03PS2) 10Filippo Giunchedi: monitoring: simplify check_dpkg [puppet] - 10https://gerrit.wikimedia.org/r/902021 (https://phabricator.wikimedia.org/T332764)
[12:51:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: monitoring: simplify check_dpkg (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902021 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[12:51:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] monitoring: cosmetic-only changes to check_dpkg [puppet] - 10https://gerrit.wikimedia.org/r/902019 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[12:52:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] monitoring: write node-exporter dpkg_success metric [puppet] - 10https://gerrit.wikimedia.org/r/902020 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[12:52:22] <wikibugs>	 (03PS13) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328)
[12:52:24] <wikibugs>	 (03PS2) 10Filippo Giunchedi: monitoring: write node-exporter dpkg_success metric [puppet] - 10https://gerrit.wikimedia.org/r/902020 (https://phabricator.wikimedia.org/T332764)
[12:52:38] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[12:53:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli)
[12:53:40] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[12:53:45] <wikibugs>	 (03PS55) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[12:55:26] <wikibugs>	 (03PS1) 10Muehlenhoff: Move udpmixircecho to Python 3 for Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702)
[12:55:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[12:55:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Move udpmixircecho to Python 3 for Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff)
[12:56:21] <wikibugs>	 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10serviceops, 10Wikimedia-Incident: Add etcdmirror connection retry on etcd-tls-proxy unavailability - https://phabricator.wikimedia.org/T317535 (10Volans) a:03Volans
[12:56:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] monitoring: simplify check_dpkg [puppet] - 10https://gerrit.wikimedia.org/r/902021 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[12:56:44] <wikibugs>	 (03PS3) 10Filippo Giunchedi: monitoring: simplify check_dpkg [puppet] - 10https://gerrit.wikimedia.org/r/902021 (https://phabricator.wikimedia.org/T332764)
[12:58:25] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: mesh.configuration: add support for custom error pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983)
[12:58:27] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: charts: upgrade to mesh 1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983)
[12:58:29] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Revert "Revert: Remove the .Values.kubernetesApi hack" [deployment-charts] - 10https://gerrit.wikimedia.org/r/902078
[13:00:03] <wikibugs>	 (03PS56) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:06] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[13:00:55] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[13:01:00] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[13:01:53] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[13:03:13] <wikibugs>	 (03PS2) 10Muehlenhoff: Move udpmixircecho to Python 3 for Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702)
[13:03:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Move udpmixircecho to Python 3 for Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff)
[13:04:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P45912 and previous config saved to /var/cache/conftool/dbconfig/20230322-130359-root.json
[13:04:43] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[13:04:49] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+2] Add an-test-client1002 dummy keytab [labs/private] - 10https://gerrit.wikimedia.org/r/902065 (https://phabricator.wikimedia.org/T329363) (owner: 10Stevemunene)
[13:04:59] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[13:05:32] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:05:49] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+2 C: 03+2] Add an-test-client1002 dummy keytab [labs/private] - 10https://gerrit.wikimedia.org/r/902065 (https://phabricator.wikimedia.org/T329363) (owner: 10Stevemunene)
[13:05:57] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[13:06:26] <wikibugs>	 (03PS3) 10Muehlenhoff: Move udpmixircecho to Python 3 for Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702)
[13:06:29] <wikibugs>	 (03PS57) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[13:06:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Move udpmixircecho to Python 3 for Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff)
[13:08:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[13:09:17] <wikibugs>	 (03PS4) 10Muehlenhoff: Move udpmixircecho to Python 3 for Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702)
[13:09:48] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10ayounsi)
[13:11:17] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "See comment, mesh.configuration 1.1.0 also introduced a strange looking "if and" construct with only one argument which we could clean up " [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto)
[13:13:04] <icinga-wm>	 RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:13:55] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff)
[13:14:00] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "The changes to `charts/flink/app` don't belong here but into the following CR" [deployment-charts] - 10https://gerrit.wikimedia.org/r/902078 (owner: 10Giuseppe Lavagetto)
[13:14:18] <logmsgbot>	 !log xcollazo@deploy2002 Started deploy [airflow-dags/platform_eng@a83464d]: Deploying latest country_project_page DAG
[13:14:30] <logmsgbot>	 !log xcollazo@deploy2002 Finished deploy [airflow-dags/platform_eng@a83464d]: Deploying latest country_project_page DAG (duration: 00m 12s)
[13:17:36] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10Jclark-ctr) @Andrew  any update on being able to reboot labstore1004
[13:18:26] <wikibugs>	 (03PS58) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[13:19:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P45913 and previous config saved to /var/cache/conftool/dbconfig/20230322-131904-root.json
[13:20:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[13:22:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10Jclark-ctr) pdu's have been connected to msw in rack and scs in f8.   temp sensors are installed
[13:23:51] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10Papaul) @Jclark-ctr thanks i will start setting them up.
[13:24:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10Jclark-ctr)
[13:25:04] <wikibugs>	 (03PS14) 10Filippo Giunchedi: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli)
[13:25:46] <wikibugs>	 (03PS1) 10Ssingh: logstash: add pybal ECS filter and tests [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565)
[13:27:31] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10phaultfinder)
[13:28:54] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Avoid sub-optimal routing from CR routers to EVPN destinations - https://phabricator.wikimedia.org/T332781 (10cmooney) p:05Triage→03Medium
[13:30:15] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Alert on sessionstore scheduling on non-dedicated k8s hosts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/902052 (https://phabricator.wikimedia.org/T325139) (owner: 10EoghanGaffney)
[13:31:29] <wikibugs>	 (03Merged) 10jenkins-bot: Alert on sessionstore scheduling on non-dedicated k8s hosts [alerts] - 10https://gerrit.wikimedia.org/r/902052 (https://phabricator.wikimedia.org/T325139) (owner: 10EoghanGaffney)
[13:32:04] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:34:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45914 and previous config saved to /var/cache/conftool/dbconfig/20230322-133409-root.json
[13:35:04] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1014.mgmt.eqiad.wmnet with reboot policy FORCED
[13:35:05] <wikibugs>	 (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/901636
[13:35:16] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/900336 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron)
[13:36:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Export routes generated from ARP/ND in EVPN - https://phabricator.wikimedia.org/T329369 (10cmooney) Just a note on this task, related to T332781  If we do have stretched L2 segments across multiple LEAFs, we may wish to also export the /32...
[13:37:01] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/900337 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron)
[13:37:31] <wikibugs>	 (03PS1) 10Cathal Mooney: Set BGP MED based on OSPF cost for EVPN originated routes [homer/public] - 10https://gerrit.wikimedia.org/r/902084 (https://phabricator.wikimedia.org/T332781)
[13:37:48] <icinga-wm>	 RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:38:34] <wikibugs>	 (03PS6) 10Jbond: prometheus::ipmi_exporter: update config to use inbuilt sudo option [puppet] - 10https://gerrit.wikimedia.org/r/901680 (https://phabricator.wikimedia.org/T253810)
[13:39:27] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] labstore: drop wmde-templates-alpha volumes [puppet] - 10https://gerrit.wikimedia.org/r/902076 (https://phabricator.wikimedia.org/T332773) (owner: 10Majavah)
[13:39:32] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM,thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/901630 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[13:40:22] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40277/console" [puppet] - 10https://gerrit.wikimedia.org/r/901680 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond)
[13:41:04] <wikibugs>	 (03PS59) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[13:41:18] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] prometheus::ipmi_exporter: update config to use inbuilt sudo option [puppet] - 10https://gerrit.wikimedia.org/r/901680 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond)
[13:42:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[13:44:42] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40278/console" [puppet] - 10https://gerrit.wikimedia.org/r/902051 (https://phabricator.wikimedia.org/T310609) (owner: 10Vgutierrez)
[13:45:12] <wikibugs>	 (03PS60) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[13:45:24] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Set haproxy->varnish connection limits on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/902051 (https://phabricator.wikimedia.org/T310609) (owner: 10Vgutierrez)
[13:46:58] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01125 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[13:47:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[13:47:26] <volans>	 jbond: ipmi sudo wrapper failures ^^^
[13:47:49] <marostegui>	 jbond: I think puppet broke
[13:47:52] <marostegui>	 Oh, volans was faster
[13:48:17] <wikibugs>	 (03PS61) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[13:49:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45915 and previous config saved to /var/cache/conftool/dbconfig/20230322-134913-root.json
[13:50:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[13:51:36] <wikibugs>	 (03PS62) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[13:53:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[13:57:14] <wikibugs>	 (03PS63) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[13:57:25] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic-Icebox, 10Sustainability (Incident Followup): LVS should handle losing a NIC on eqiad and codfw - https://phabricator.wikimedia.org/T286924 (10cmooney) >>! In T286924#8717679, @ayounsi wrote: > Another approach is to put them in a distinct namespace (one...
[13:58:23] <wikibugs>	 (03PS15) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328)
[13:59:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[13:59:36] <wikibugs>	 (03PS1) 10Muehlenhoff: * Rebuild for bullseye T332584 T332589 * Move to Java 11 * Remove adduser dependency for anything but druid-common, the rest don't need it * Remove versioned druid-common dependency, we're way past 0.10 for a while * Move to debhelper 13 (which absorbed dh-systemd) [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/902092
[13:59:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli)
[14:00:34] <wikibugs>	 (03PS16) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328)
[14:01:30] <wikibugs>	 (03PS64) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[14:02:13] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.ganeti.reimage for host an-test-client1002.eqiad.wmnet with OS bullseye
[14:03:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[14:04:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P45916 and previous config saved to /var/cache/conftool/dbconfig/20230322-140418-root.json
[14:06:47] <wikibugs>	 (03PS65) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[14:08:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[14:09:30] <wikibugs>	 (03PS66) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[14:11:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede)
[14:11:39] <dcausse>	 jouncebot: nowandnext
[14:11:39] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 48 minute(s)
[14:11:39] <jouncebot>	 In 2 hour(s) and 48 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T1700)
[14:11:53] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: use correct release and app [deployment-charts] - 10https://gerrit.wikimedia.org/r/901240 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking)
[14:12:55] <wikibugs>	 (03PS67) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544
[14:13:02] <sukhe>	 !log disable Puppet on A:wikidough to roll out dnsdist.conf change
[14:13:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:38] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0009785 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[14:16:37] <wikibugs>	 (03Merged) 10jenkins-bot: rdf-streaming-updater: use correct release and app [deployment-charts] - 10https://gerrit.wikimedia.org/r/901240 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking)
[14:17:13] <sukhe>	 !log enable Puppet on A:wikidough to roll out dnsdist.conf change
[14:17:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:55] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-client1002.eqiad.wmnet with reason: host reimage
[14:18:52] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff)
[14:19:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45917 and previous config saved to /var/cache/conftool/dbconfig/20230322-141923-root.json
[14:21:29] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-client1002.eqiad.wmnet with reason: host reimage
[14:24:08] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:24:32] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:29:04] <wikibugs>	 (03PS1) 10Muehlenhoff: Build for Bullseye and update Debian packaging [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/902097
[14:29:59] <wikibugs>	 (03PS2) 10Muehlenhoff: Build for Bullseye and update Debian packaging [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/902097
[14:38:33] <wikibugs>	 (03CR) 10Hashar: "Great thank you Bartosz for the confirmation." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899520 (https://phabricator.wikimedia.org/T332121) (owner: 10Hashar)
[14:43:07] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] "Feel free to merge and deploy as you see fit. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899520 (https://phabricator.wikimedia.org/T332121) (owner: 10Hashar)
[14:43:35] <wikibugs>	 (03PS3) 10Muehlenhoff: Build for Bullseye and update Debian packaging [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/902097
[14:47:45] <wikibugs>	 (03PS4) 10Muehlenhoff: Build for Bullseye and update Debian packaging [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/902097
[14:48:49] <wikibugs>	 (03PS1) 10Jbond: team-sre/hardware: Add alert for sel events [alerts] - 10https://gerrit.wikimedia.org/r/902103 (https://phabricator.wikimedia.org/T253810)
[14:49:13] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10ChangeProp, 10serviceops, 10Kubernetes, 10Sustainability (Incident Followup): Raise an alarm on container restarts/OOMs in kubernetes - https://phabricator.wikimedia.org/T256256 (10akosiaris) a:03akosiaris
[14:50:48] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] changeprop-jobqueue: change deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/902048 (owner: 10Hnowlan)
[14:51:17] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: change deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/902048 (owner: 10Hnowlan)
[14:53:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] ipmisled: Send ipmisled logs to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/902060 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond)
[14:53:19] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:54:08] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:57:00] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop-jobqueue: change deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/902048 (owner: 10Hnowlan)
[14:57:33] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[14:57:48] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[14:58:20] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[14:59:10] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[14:59:21] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[14:59:37] <wikibugs>	 (03CR) 10Jbond: team-sre/hardware: Add alert for sel events (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/902103 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond)
[15:00:12] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:04:11] <wikibugs>	 (03PS3) 10Volans: es_exporter: add NEL metrics by country [puppet] - 10https://gerrit.wikimedia.org/r/901220 (https://phabricator.wikimedia.org/T328941)
[15:04:13] <wikibugs>	 (03PS1) 10Volans: superset: add static html for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/902107 (https://phabricator.wikimedia.org/T310009)
[15:07:29] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic, 10conftool, 10Patch-For-Review, 10Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009 (10Volans)
[15:07:51] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/902107 (https://phabricator.wikimedia.org/T310009) (owner: 10Volans)
[15:08:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:08:49] <wikibugs>	 (03PS1) 10Effie Mouzeli: maps: remove OSM Synchronisation Lag alerts [puppet] - 10https://gerrit.wikimedia.org/r/902109 (https://phabricator.wikimedia.org/T285328)
[15:12:07] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic, 10conftool, 10Patch-For-Review, 10Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009 (10Volans) I've sent a small improvement proposal in the above patch, let me know what...
[15:13:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:13:45] <hnowlan>	 !log removing cassandra packages from maps hosts
[15:13:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:43] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Relax nodeAffinity of sessionstore pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/901572 (https://phabricator.wikimedia.org/T325139) (owner: 10EoghanGaffney)
[15:15:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli)
[15:16:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "\o/ LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/902109 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli)
[15:17:44] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli)
[15:17:50] <logmsgbot>	 !log eoghan@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[15:17:53] <logmsgbot>	 !log eoghan@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[15:18:12] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] maps: remove OSM Synchronisation Lag alerts [puppet] - 10https://gerrit.wikimedia.org/r/902109 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli)
[15:18:56] <wikibugs>	 (03Merged) 10jenkins-bot: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli)
[15:20:29] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Mail, 10Observability-Metrics, 10Sustainability (Incident Followup): Add exim queue size to grafana graph - https://phabricator.wikimedia.org/T275867 (10akosiaris) Let me note that we also have an alert on `exim_queue_length` per...
[15:21:11] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10Sustainability (Incident Followup): Use/adopt search cluster ES management cookbooks for logging ES too - https://phabricator.wikimedia.org/T255864 (10colewhite) >>! In T255864#8717171, @ayounsi...
[15:22:04] <wikibugs>	 (03PS1) 10Jbond: P:monitoring: drop check for filesystem_avail_bigger_than_size [puppet] - 10https://gerrit.wikimedia.org/r/902110 (https://phabricator.wikimedia.org/T302687)
[15:22:48] <hnowlan>	 !log removing java packages from maps hosts
[15:22:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Avoid sub-optimal routing from CR routers to EVPN destinations - https://phabricator.wikimedia.org/T332781 (10cmooney)
[15:23:28] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2004.codfw.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware
[15:23:32] <wikibugs>	 (03CR) 10BCornwall: [V: 03+2 C: 03+2] application_servers/kafka: Remove IDs [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901719 (https://phabricator.wikimedia.org/T331656) (owner: 10BCornwall)
[15:23:41] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2004.codfw.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware
[15:23:56] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Avoid sub-optimal routing from CR routers to EVPN destinations - https://phabricator.wikimedia.org/T332781 (10cmooney)
[15:24:54] <wikibugs>	 (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/902111
[15:25:21] <wikibugs>	 (03PS2) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/902111
[15:25:53] <logmsgbot>	 !log eoghan@deploy2002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply
[15:25:54] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2004.codfw.wmnet
[15:25:55] <logmsgbot>	 !log eoghan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply
[15:26:51] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts kafka-main2004.codfw.wmnet
[15:27:44] <elukey>	 !log `racadm racreset` for kafka-main2004 (no http idrac available for the cookbook, ssh one available)
[15:27:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:45] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.ganeti.reimage for host dborch1001.wikimedia.org with OS bullseye
[15:30:52] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2004.codfw.wmnet
[15:30:57] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts kafka-main2004.codfw.wmnet
[15:31:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Build for Bullseye and update Debian packaging [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/902097 (owner: 10Muehlenhoff)
[15:31:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/902111 (owner: 10Muehlenhoff)
[15:31:56] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2004.codfw.wmnet
[15:32:24] <wikibugs>	 (03Abandoned) 10Muehlenhoff: * Rebuild for bullseye T332584 T332589 * Move to Java 11 * Remove adduser dependency for anything but druid-common, the rest don't need it * Remove versioned druid-common dependency, we're way past 0.10 for a while * Move to debhelper 13 (which absorbed dh-systemd) [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/902092 (owner: 10Muehlenhoff)
[15:35:21] <wikibugs>	 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Gerrit, 10serviceops-collab, and 3 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) > Hashar: alternatively, this could be closed, as per title scope and the mentioned work could be filed on a sep...
[15:36:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[15:37:02] <wikibugs>	 (03PS1) 10EoghanGaffney: Fix preferredDuringScheduling[...] change for sessionstore [deployment-charts] - 10https://gerrit.wikimedia.org/r/902114 (https://phabricator.wikimedia.org/T325139)
[15:37:40] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10MoritzMuehlenhoff) >>! In T332063#8717190, @Jgiannelos wrote: > Can I also get access to superset? I can login and everything but, I need some more permissions to...
[15:39:35] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts kafka-main2004.codfw.wmnet
[15:39:50] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2004.codfw.wmnet
[15:40:27] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-main2004.codfw.wmnet
[15:41:38] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dborch1001.wikimedia.org with reason: host reimage
[15:44:09] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dborch1001.wikimedia.org with reason: host reimage
[15:44:43] <icinga-wm>	 RECOVERY - Check systemd state on elastic1099 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:46:29] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main2004.codfw.wmnet
[15:46:30] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts kafka-main2004.codfw.wmnet
[15:46:49] <icinga-wm>	 PROBLEM - Host kafka-main2004 is DOWN: PING CRITICAL - Packet loss = 100%
[15:46:49] <icinga-wm>	 RECOVERY - Host kafka-main2004 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms
[15:46:52] <icinga-wm>	 PROBLEM - Kafka Broker Server #page on kafka-main2004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[15:46:54] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2004.codfw.wmnet
[15:47:33] <rzl>	 good morning 👋
[15:47:34] <elukey>	 folks I am sorry for the page but I have downtimed the node for 2 hours
[15:47:43] <elukey>	 not really sure why it paged now
[15:47:45] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-main2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[15:47:57] <rzl>	 got it, thanks <3 need anything?
[15:48:08] <elukey>	 nono regular maintenance, I am upgrading bios etc..
[15:48:14] <rzl>	 👍
[15:48:22] <wikibugs>	 (03Abandoned) 10Aklapper: Phabricator: Disable setting lowest priority on tasks [puppet] - 10https://gerrit.wikimedia.org/r/699493 (https://phabricator.wikimedia.org/T228759) (owner: 10Aklapper)
[15:48:25] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-main2004.codfw.wmnet
[15:48:41] <godog>	 I like to think it is because we're being punished for paging on ps | grep
[15:50:57] <icinga-wm>	 RECOVERY - Check systemd state on cp1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:53:21] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:53:36] <moritzm>	 !log uploaded druid 0.19.wmf0-2 to bullseye-wikimedia T332584 T332589
[15:53:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:43] <stashbot>	 T332584: Upgrade an-test-druid1001 to bullseye - https://phabricator.wikimedia.org/T332584
[15:53:43] <stashbot>	 T332589: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589
[15:56:11] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main2004.codfw.wmnet
[15:56:13] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts kafka-main2004.codfw.wmnet
[15:56:24] <icinga-wm>	 PROBLEM - Kafka Broker Server #page on kafka-main2004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[15:56:33] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Fix preferredDuringScheduling[...] change for sessionstore [deployment-charts] - 10https://gerrit.wikimedia.org/r/902114 (https://phabricator.wikimedia.org/T325139) (owner: 10EoghanGaffney)
[15:56:44] <vgutierrez>	 elukey: :_)
[15:56:57] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-main2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[15:57:17] <elukey>	 vgutierrez: I know, not really sure what to do, I downtimed for two hours, and now it pages
[15:57:44] <elukey>	 kafka is up now, there is probably something that escapes the downtime logic, or I missed something
[15:58:07] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host dborch1001.wikimedia.org with OS bullseye
[15:58:18] <icinga-wm>	 RECOVERY - Kafka Broker Server #page on kafka-main2004 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[15:58:30] <rzl>	 fwiw, that second #p.age didn't actually come through victorops
[15:58:49] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-main2004 is OK: SSL OK - Certificate kafka_main-codfw_broker valid until 2023-05-01 16:32:37 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[15:58:55] <rzl>	 or rather, it's there, but it seems to be a second email under the same incident
[16:00:23] <jynus>	 rzl: interesting, let me check
[16:00:31] <rzl>	 all to say, I'm not sure if it's actually a new alert or just a re-notification of the previous one for some reason 🤷 I wouldn't sweat it too much, especially given that'll be moved to alertmanager anyhow
[16:01:07] <rzl>	 jynus: if you like! elukey is the one working on it though, I'm not really looking :)
[16:01:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[16:01:19] <jynus>	 no, I mean the victorops stuff
[16:01:29] <wikibugs>	 (03PS1) 10JHathaway: repackage for bullseye [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/902116
[16:01:54] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] repackage for bullseye [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/902116 (owner: 10JHathaway)
[16:01:56] <wikibugs>	 (03CR) 10JHathaway: [V: 03+2 C: 03+2] repackage for bullseye [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/902116 (owner: 10JHathaway)
[16:01:57] <jynus>	 trying to understand the events
[16:04:18] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: add mmkubernetes ECS early-stage filter [puppet] - 10https://gerrit.wikimedia.org/r/901630 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[16:04:20] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar)
[16:05:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: team-sre/hardware: Add alert for sel events (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/902103 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond)
[16:11:04] <wikibugs>	 (03PS3) 10Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793)
[16:12:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney)
[16:15:12] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Fix preferredDuringScheduling[...] change for sessionstore [deployment-charts] - 10https://gerrit.wikimedia.org/r/902114 (https://phabricator.wikimedia.org/T325139) (owner: 10EoghanGaffney)
[16:18:37] <logmsgbot>	 !log eoghan@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[16:18:40] <logmsgbot>	 !log eoghan@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[16:19:45] <logmsgbot>	 !log eoghan@deploy2002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply
[16:20:33] <wikibugs>	 (03PS2) 10Clément Goubert: cpufrequtils: Force reload init script on change [puppet] - 10https://gerrit.wikimedia.org/r/900645
[16:23:27] <elukey>	 jynus: sorry I was in a meeting, so both times I have
[16:23:35] <elukey>	 1) downtimed with the cookbook
[16:23:44] <elukey>	 2) stopped kafka etc.. on the node + puppet disabled
[16:24:01] <elukey>	 3) ran the firmware upgrade cookbooks
[16:24:15] <elukey>	 this morning I've set 1 hour of downtime and it expired, my bad
[16:24:29] <elukey>	 but today it was two hours, and I am sure I was within the right time window
[16:24:29] <logmsgbot>	 !log eoghan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply
[16:24:42] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Many thanks indeed." [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/902097 (owner: 10Muehlenhoff)
[16:25:00] <elukey>	 it may be all those icinga/nagios ps|etc. based alerts that just need to be removed
[16:25:08] <elukey>	 (as godo*g mentioned earlier on)
[16:25:38] <wikibugs>	 (03PS1) 10JMeybohm: kubernetes: Remove old kubernetes metric from alerts [alerts] - 10https://gerrit.wikimedia.org/r/902120 (https://phabricator.wikimedia.org/T322919)
[16:26:55] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:27:07] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:27:33] <wikibugs>	 (03PS2) 10JMeybohm: kubernetes: Remove old kubernetes metric from alerts [alerts] - 10https://gerrit.wikimedia.org/r/902120 (https://phabricator.wikimedia.org/T322919)
[16:27:47] <wikibugs>	 (03PS1) 10Marostegui: Revert "mariadb/ferm.pp: Add dborch1002 to the firewall rules" [puppet] - 10https://gerrit.wikimedia.org/r/902043
[16:27:54] <wikibugs>	 (03PS1) 10Marostegui: Revert "common.yaml: Add dborch1002 to MYSQL_ROOT_CLIENTS" [puppet] - 10https://gerrit.wikimedia.org/r/902044
[16:28:01] <wikibugs>	 (03PS1) 10Marostegui: Revert "ferm.pp: Add dborch1002" [puppet] - 10https://gerrit.wikimedia.org/r/902045
[16:28:03] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:28:16] <marostegui>	 jhathaway: I am going to revert the patches used for dborch1002 testing for now
[16:28:34] <jhathaway>	 marostegui: sounds good
[16:28:49] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "common.yaml: Add dborch1002 to MYSQL_ROOT_CLIENTS" [puppet] - 10https://gerrit.wikimedia.org/r/902044 (owner: 10Marostegui)
[16:28:53] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "ferm.pp: Add dborch1002" [puppet] - 10https://gerrit.wikimedia.org/r/902045 (owner: 10Marostegui)
[16:29:29] <wikibugs>	 (03PS2) 10Marostegui: Revert "mariadb/ferm.pp: Add dborch1002 to the firewall rules" [puppet] - 10https://gerrit.wikimedia.org/r/902043
[16:29:57] <wikibugs>	 (03PS4) 10Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793)
[16:30:46] <marostegui>	 jhathaway: how are things looking on your side?
[16:31:03] <wikibugs>	 (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/902043 (owner: 10Marostegui)
[16:31:06] <jhathaway>	 okay, just need to figure out how to make dh_golang happy
[16:31:17] <marostegui>	 haha
[16:31:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney)
[16:31:24] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "mariadb/ferm.pp: Add dborch1002 to the firewall rules" [puppet] - 10https://gerrit.wikimedia.org/r/902043 (owner: 10Marostegui)
[16:33:04] <wikibugs>	 (03PS2) 10Cathal Mooney: Enable OSPF check by default for l3 switch mgmt interfaces [puppet] - 10https://gerrit.wikimedia.org/r/900431 (https://phabricator.wikimedia.org/T315053)
[16:35:40] <vgutierrez>	 !log rolling downgrade to HAProxy 2.6.9 in text@esams - T332796
[16:35:43] <jhathaway>	 marostegui: new version is installed, if you would like to take a gander
[16:35:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:45] <stashbot>	 T332796: HAProxy 2.6.10 crashing in the text cluster - https://phabricator.wikimedia.org/T332796
[16:36:36] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Enable OSPF check by default for l3 switch mgmt interfaces [puppet] - 10https://gerrit.wikimedia.org/r/900431 (https://phabricator.wikimedia.org/T315053) (owner: 10Cathal Mooney)
[16:36:40] <marostegui>	 jhathaway: checking
[16:36:57] <wikibugs>	 (03CR) 10Jbond: "thanks" [alerts] - 10https://gerrit.wikimedia.org/r/902103 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond)
[16:37:15] <logmsgbot>	 !log eoghan@deploy2002 helmfile [codfw] START helmfile.d/services/sessionstore: apply
[16:37:27] <marostegui>	 jhathaway: looking good! \o/
[16:37:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:monitoring: drop check for filesystem_avail_bigger_than_size [puppet] - 10https://gerrit.wikimedia.org/r/902110 (https://phabricator.wikimedia.org/T302687) (owner: 10Jbond)
[16:37:37] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:37:44] <jhathaway>	 marostegui: great
[16:37:46] <logmsgbot>	 !log eoghan@deploy2002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply
[16:38:09] <marostegui>	 jhathaway: you can destroy dborch1002
[16:38:21] <jhathaway>	 marostegui: great, will do
[16:38:23] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:38:30] <marostegui>	 jhathaway: And close the task too \o/
[16:38:35] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:38:44] <jhathaway>	 marostegui: thanks for the help
[16:39:29] <marostegui>	 jhathaway: no, thank you for taking on that task!
[16:39:55] <marostegui>	 jhathaway: If you could comment on the last issue, for future reference
[16:40:01] <marostegui>	 Before closing the task, that'd be great
[16:40:08] <jhathaway>	 will do
[16:40:18] <marostegui>	 thanks
[16:40:58] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Revert "Revert: Remove the .Values.kubernetesApi hack" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/902078 (owner: 10Giuseppe Lavagetto)
[16:42:22] <wikibugs>	 (03PS5) 10Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793)
[16:42:32] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[16:42:47] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[16:42:53] <stashbot>	 elukey@deploy2002: Failed to log message to wiki. Somebody should check the error logs.
[16:43:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney)
[16:43:40] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] envoyproxy::tls_terminator: allow returning an HTML error page [puppet] - 10https://gerrit.wikimedia.org/r/902058 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto)
[16:44:55] <wikibugs>	 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech: Wikidata seems to still be utilizing insecure HTTP URIs - https://phabricator.wikimedia.org/T331356 (10MisterSynergy) Some remarks: * We should consider these canonical HTTP URIs to be //names// in the first place, which are unique worldwide and issued by the Wiki...
[16:45:07] <logmsgbot>	 !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@6cbc3bc]: (no justification provided)
[16:45:19] <logmsgbot>	 !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@6cbc3bc]: (no justification provided) (duration: 00m 12s)
[16:47:40] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Orchestrator, and 2 others: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (10jhathaway) @jbond this should be fixed, following the Puppet 7 upgrade. Do we have any way of noting post puppet 7 followup t...
[16:49:40] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[16:49:56] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[16:50:33] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Orchestrator, and 2 others: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (10jbond) @jhathaway not currently but we could request a new tag or possibly a milestone.   @Aklapper are you able to offer any...
[16:51:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:51:12] <wikibugs>	 (03PS6) 10Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793)
[16:51:46] <wikibugs>	 (03PS1) 10JHathaway: update changelog [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/902122
[16:52:12] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] update changelog [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/902122 (owner: 10JHathaway)
[16:52:20] <wikibugs>	 (03CR) 10JHathaway: [V: 03+2 C: 03+2] update changelog [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/902122 (owner: 10JHathaway)
[16:52:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney)
[16:54:37] <wikibugs>	 (03PS7) 10Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793)
[16:54:45] <icinga-wm>	 RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:55:20] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] doc: upgrade php from 7.3 to 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar)
[16:55:31] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Is this needed? I think that the service (which has status parameter) of" [puppet] - 10https://gerrit.wikimedia.org/r/900645 (owner: 10Clément Goubert)
[16:55:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney)
[16:56:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:57:48] <wikibugs>	 (03PS1) 10Elukey: services: stop changeprop's lift wing test [deployment-charts] - 10https://gerrit.wikimedia.org/r/902123 (https://phabricator.wikimedia.org/T328576)
[16:57:57] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10serviceops, 10Sustainability (Incident Followup): Relax nodeAffinity of sessionstore - https://phabricator.wikimedia.org/T325139 (10eoghan) We've deployed the change to relax the `nodeAffinity` setting, tomorrow morning we'll drain one of the nodes to test that t...
[16:59:15] <wikibugs>	 10SRE, 10Traffic, 10Upstream: HAProxy 2.6.10 crashing in the text cluster - https://phabricator.wikimedia.org/T332796 (10Vgutierrez) Gathering data on esams after downgrading: ` vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp-text_esams' 'apt-cache policy haproxy|grep Installed' 8 hosts will be targeted: cp[3050...
[16:59:23] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] services: stop changeprop's lift wing test [deployment-charts] - 10https://gerrit.wikimedia.org/r/902123 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey)
[16:59:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T1700)
[17:00:08] <wikibugs>	 (03PS1) 10JHathaway: add .gitreview [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/902124
[17:00:22] <wikibugs>	 (03PS8) 10Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793)
[17:00:24] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] add .gitreview [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/902124 (owner: 10JHathaway)
[17:00:26] <wikibugs>	 (03CR) 10JHathaway: [V: 03+2 C: 03+2] add .gitreview [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/902124 (owner: 10JHathaway)
[17:01:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney)
[17:02:06] <wikibugs>	 (03PS1) 10JHathaway: Revert "dborch: allow dborch1002 to issue an ssl cert" [puppet] - 10https://gerrit.wikimedia.org/r/902125
[17:02:26] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] services: stop changeprop's lift wing test [deployment-charts] - 10https://gerrit.wikimedia.org/r/902123 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey)
[17:02:28] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] Revert "dborch: allow dborch1002 to issue an ssl cert" [puppet] - 10https://gerrit.wikimedia.org/r/902125 (owner: 10JHathaway)
[17:03:02] <wikibugs>	 (03CR) 10Clément Goubert: cpufrequtils: Force reload init script on change (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900645 (owner: 10Clément Goubert)
[17:04:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:04:57] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[17:05:15] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[17:05:34] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.decommission for hosts dborch1002.wikimedia.org
[17:06:02] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki::errorpage: allow avoiding percentage sizes. [puppet] - 10https://gerrit.wikimedia.org/r/902126
[17:06:26] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[17:06:39] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[17:06:47] <hashar>	 I am doing a noop pull of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/899520
[17:06:58] <wikibugs>	 (03PS1) 10JHathaway: Revert "Add a dborch vm for testing the bullseye upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/902127 (https://phabricator.wikimedia.org/T289657)
[17:07:11] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] "Thanks. I will pull it on the deployment server." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899520 (https://phabricator.wikimedia.org/T332121) (owner: 10Hashar)
[17:07:14] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] Revert "Add a dborch vm for testing the bullseye upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/902127 (https://phabricator.wikimedia.org/T289657) (owner: 10JHathaway)
[17:07:28] <hashar>	 it adds a script to composer.json and thus has no effect on production
[17:08:01] <wikibugs>	 (03Merged) 10jenkins-bot: build: add local typos check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899520 (https://phabricator.wikimedia.org/T332121) (owner: 10Hashar)
[17:08:17] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40282/console" [puppet] - 10https://gerrit.wikimedia.org/r/902126 (owner: 10Giuseppe Lavagetto)
[17:08:21] <wikibugs>	 (03PS2) 10JHathaway: Revert "Add a dborch vm for testing the bullseye upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/902127 (https://phabricator.wikimedia.org/T298959)
[17:09:37] <hashar>	 and of course something is breaking :/
[17:09:43] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox
[17:09:46] <hashar>	 failed to register layer: devmapper: Thin Pool has 94446 free data blocks which is less than minimum required 163840 free data blocks. Create more free space in thin pool or use dm.min_free_space option to change behavior
[17:10:51] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10netops: Enable OSPF Icinga check for EVPN based switches - https://phabricator.wikimedia.org/T315053 (10cmooney) 05Open→03Resolved I've merged the patch and the EVPN switches are now being checked by Icinga, all looks healthy.
[17:12:09] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dborch1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jhathaway@cumin1001"
[17:14:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:15:07] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ml-staging-ctrl2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:15:25] <wikibugs>	 (03CR) 10AOkoth: eventgate: add EventgateLoggingExternalErrors alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth)
[17:15:30] <logmsgbot>	 !log hashar@deploy2002 Synchronized composer.json: build: add local typos check to composer.json # T332121 (duration: 06m 44s)
[17:15:36] <stashbot>	 T332121: Migrate CI job operations-mw-config-typos-docker job to be inside operations/mediawiki-config - https://phabricator.wikimedia.org/T332121
[17:17:10] <wikibugs>	 (03PS2) 10Ssingh: logstash: add pybal ECS filter and tests [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565)
[17:17:27] <wikibugs>	 (03CR) 10Ssingh: "The tests are failing, see the comment inline:" [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) (owner: 10Ssingh)
[17:18:12] <wikibugs>	 (03CR) 10Ssingh: logstash: add pybal ECS filter and tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) (owner: 10Ssingh)
[17:18:42] <wikibugs>	 (03PS3) 10Ssingh: logstash: add pybal ECS filter and tests [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565)
[17:19:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:19:47] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki::errorpage: allow avoiding percentage sizes. [puppet] - 10https://gerrit.wikimedia.org/r/902126
[17:20:31] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:20:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] logstash: add pybal ECS filter and tests [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) (owner: 10Ssingh)
[17:20:37] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10serviceops, 10Sustainability (Incident Followup): Expand upon Kask/Sessionstore documentation - https://phabricator.wikimedia.org/T320398 (10akosiaris) >>! In T320398#8711722, @Eevans wrote: > TL;DR Is there someone(s) —who isn't as close to this as I am— who has...
[17:20:59] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40283/console" [puppet] - 10https://gerrit.wikimedia.org/r/902126 (owner: 10Giuseppe Lavagetto)
[17:22:51] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: mediawiki::errorpage: allow avoiding percentage sizes. [puppet] - 10https://gerrit.wikimedia.org/r/902126
[17:23:41] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] thumbor: bump workers, reduce CPU, increase queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/900388 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan)
[17:24:02] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40284/console" [puppet] - 10https://gerrit.wikimedia.org/r/902126 (owner: 10Giuseppe Lavagetto)
[17:27:40] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::errorpage: allow avoiding percentage sizes. [puppet] - 10https://gerrit.wikimedia.org/r/902126 (owner: 10Giuseppe Lavagetto)
[17:29:24] <wikibugs>	 (03PS6) 10AOkoth: eventgate: add EventgateLoggingExternalErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009)
[17:30:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] eventgate: add EventgateLoggingExternalErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth)
[17:36:59] <wikibugs>	 (03PS1) 10Sergio Gimeno: GrowthExperiments: disable add a link backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902131 (https://phabricator.wikimedia.org/T304551)
[17:38:08] <_joe_>	 !log stopping apache on mwdebug1001 to test the new envoy error page
[17:38:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:46] <wikibugs>	 (03PS7) 10AOkoth: eventgate: add EventgateLoggingExternalErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009)
[17:42:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] eventgate: add EventgateLoggingExternalErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth)
[17:43:01] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] miscweb: switch security.wm.org microsite to miscweb2003 [puppet] - 10https://gerrit.wikimedia.org/r/901320 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn)
[17:45:22] <wikibugs>	 10SRE, 10MediaWiki-extensions-OAuth, 10Performance-Team, 10Datacenter-Switchover, 10Patch-For-Review: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10MusikAnimal) >>! In T332650#8716712, @Tgr wrote: >>>...
[17:45:32] <wikibugs>	 (03PS8) 10AOkoth: eventgate: add EventgateLoggingExternalErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009)
[17:46:43] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] eventgate: add EventgateLoggingExternalErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth)
[17:48:18] <wikibugs>	 (03CR) 10SBassett: api-gateway: add REST gateway Lua CSP handler (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan)
[17:53:08] <wikibugs>	 (03PS2) 10Hnowlan: thumbor: bump workers, reduce CPU, increase queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/900388 (https://phabricator.wikimedia.org/T328033)
[17:53:40] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dborch1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jhathaway@cumin1001"
[17:53:40] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:53:40] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dborch1002.wikimedia.org
[17:53:50] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jhathaway@cumin1001 for hosts: `dborch1002.wikimedia.org` - dborch1002....
[17:54:48] <wikibugs>	 10SRE, 10MediaWiki-extensions-OAuth, 10Performance-Team, 10Datacenter-Switchover, 10Patch-For-Review: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10MusikAnimal)
[17:54:52] <wikibugs>	 (03PS1) 10Dduvall: buildkitd: Isolate build container user/process/network namespaces [puppet] - 10https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804)
[17:57:35] <wikibugs>	 (03PS2) 10Dduvall: buildkitd: Isolate build container user/process/network namespaces [puppet] - 10https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804)
[18:00:05] <jouncebot>	 dancy and brennen: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T1800).
[18:00:05] <jouncebot>	 dancy and brennen: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T1800).
[18:02:48] <wikibugs>	 (03PS1) 10Jbond: sre.hardware.sel: add simple cookbook for querying the SEL [cookbooks] - 10https://gerrit.wikimedia.org/r/902135
[18:04:23] <wikibugs>	 10SRE, 10MediaWiki-extensions-OAuth, 10Performance-Team, 10Datacenter-Switchover, 10Patch-For-Review: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10Tgr) >>! In T332650#8718814, @MusikAnimal wrote: > I...
[18:05:20] <dancy>	 Alright... let's see what happens.
[18:07:36] <claime>	 dancy: tell me if I need to clean up space
[18:07:42] <claime>	 I can't fix the actual problem tonight
[18:08:24] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902136 (https://phabricator.wikimedia.org/T330207)
[18:08:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902136 (https://phabricator.wikimedia.org/T330207) (owner: 10TrainBranchBot)
[18:08:28] <dancy>	 claime: OK!
[18:08:48] <wikibugs>	 (03CR) 10Hnowlan: api-gateway: add REST gateway Lua CSP handler (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan)
[18:09:11] <wikibugs>	 (03CR) 10Hnowlan: thumbor: bump workers, reduce CPU, increase queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/900388 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan)
[18:09:14] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902136 (https://phabricator.wikimedia.org/T330207) (owner: 10TrainBranchBot)
[18:09:16] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] thumbor: bump workers, reduce CPU, increase queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/900388 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan)
[18:11:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Dzahn)
[18:12:38] <mutante>	 !log rsyncing /srv/org/wikimedia/sitemaps files for https://sitemaps.wikimedia.org from old to new machines. most other things are auto-deployed by puppet or puppet running initial scap or automatic rsync.. this is not. rsync -av /srv/org/wikimedia/sitemaps/ rsync://miscweb2003.codfw.wmnet/miscapps-srv/org/wikimedia/sitemaps/ T331896 - but also see T332101
[18:12:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:46] <stashbot>	 T332101: determine whether https://sitemaps.wikimedia.org still serves a purpose - https://phabricator.wikimedia.org/T332101
[18:12:46] <stashbot>	 T331896: upgrade miscweb VMs to bullseye - https://phabricator.wikimedia.org/T331896
[18:14:27] <wikibugs>	 (03Merged) 10jenkins-bot: thumbor: bump workers, reduce CPU, increase queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/900388 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan)
[18:16:06] <logmsgbot>	 !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.1  refs T330207
[18:16:12] <stashbot>	 T330207: 1.41.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T330207
[18:16:32] <claime>	 dancy: I'll clean up space pre-emptively
[18:17:00] <dancy>	 So I saw those same messages that hashar reported.
[18:17:14] <claime>	 Then not pre-emptively :D
[18:18:16] <claime>	 basically it just won't schedule mw containers on these hosts
[18:18:20] <claime>	 Because it can't pull the images
[18:18:34] <dancy>	 are the new pods otherwise deployed?
[18:19:13] <wikibugs>	 (03PS4) 10Cwhite: logstash: add pybal ECS filter and tests [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) (owner: 10Ssingh)
[18:19:17] <claime>	 Yeah, just not to these hosts
[18:19:29] <claime>	 There are no mw pods deployed to them that I can see
[18:19:31] <dancy>	 okay great.. k8s working the way it's supposed to
[18:20:18] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] miscweb: switch sitemaps, transparency and tr-archives to miscweb2003 [puppet] - 10https://gerrit.wikimedia.org/r/901321 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn)
[18:20:41] <dancy>	 Thanks for being around claime.
[18:21:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] logstash: add pybal ECS filter and tests [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) (owner: 10Ssingh)
[18:21:08] <wikibugs>	 (03CR) 10Cwhite: "Tests still failing because the legacy expected output doesn't quite match yet, but the ECS one does!" [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) (owner: 10Ssingh)
[18:21:12] <wikibugs>	 (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/901639
[18:29:03] <wikibugs>	 (03CR) 10Herron: "Something to consider also re: SNR is some logs represent good events like 'PS Redundancy    | Power Supply                | Fully Redunda" [alerts] - 10https://gerrit.wikimedia.org/r/902103 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond)
[18:36:43] <wikibugs>	 (03PS1) 10Dzahn: miscweb: move transparency httpd site templates out of role/apache [puppet] - 10https://gerrit.wikimedia.org/r/902140
[18:39:03] <wikibugs>	 (03PS1) 10Dzahn: miscweb: move simplestatic.erb out of role/templates/apache/sites/ [puppet] - 10https://gerrit.wikimedia.org/r/902141
[18:41:55] <wikibugs>	 (03PS1) 10Dzahn: miscweb: move os_reports httpd template to profile/microsites/ [puppet] - 10https://gerrit.wikimedia.org/r/902142
[18:43:33] <wikibugs>	 (03PS2) 10Dzahn: miscweb: move os_reports httpd template to profile/microsites/ [puppet] - 10https://gerrit.wikimedia.org/r/902142
[18:44:16] <wikibugs>	 (03CR) 10Dzahn: [C: 04-2] "template goes to ./templates/ not manifests ..fixing later" [puppet] - 10https://gerrit.wikimedia.org/r/902141 (owner: 10Dzahn)
[18:46:33] <wikibugs>	 (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/901640
[18:46:35] <wikibugs>	 (03PS1) 10Dzahn: miscweb: add custom and error log for os-reports.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/902144
[18:47:29] <wikibugs>	 (03PS1) 10Nray: Enable pinning for anon main menu when page tools is enabled [skins/Vector] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/902150 (https://phabricator.wikimedia.org/T331657)
[18:49:54] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] buildkitd: Isolate build container user/process/network namespaces [puppet] - 10https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: 10Dduvall)
[18:51:36] <wikibugs>	 (03PS2) 10Nray: Enable page tools for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900748 (https://phabricator.wikimedia.org/T331052)
[18:53:06] <wikibugs>	 (03PS1) 10Dzahn: miscweb: add custom and error log for transparency and archives [puppet] - 10https://gerrit.wikimedia.org/r/902166
[18:58:22] <wikibugs>	 (03CR) 10SBassett: api-gateway: add REST gateway Lua CSP handler (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan)
[19:02:56] <wikibugs>	 (03PS1) 10Dzahn: miscweb: switch research.wikimedia.org to bullseye backend [puppet] - 10https://gerrit.wikimedia.org/r/902167 (https://phabricator.wikimedia.org/T331896)
[19:03:56] <wikibugs>	 (03PS1) 10Dzahn: miscweb: switch wikiworkshop.org to bullseye backend [puppet] - 10https://gerrit.wikimedia.org/r/902169 (https://phabricator.wikimedia.org/T331896)
[19:04:54] <wikibugs>	 (03PS1) 10Dzahn: miscweb: switch design.wikimedia.org to bullseye backend [puppet] - 10https://gerrit.wikimedia.org/r/902170 (https://phabricator.wikimedia.org/T331896)
[19:05:39] <wikibugs>	 (03PS1) 10Dzahn: miscweb: switch os-reports.wikimedia.org to bullseye backend [puppet] - 10https://gerrit.wikimedia.org/r/902172 (https://phabricator.wikimedia.org/T331896)
[19:06:56] <wikibugs>	 (03PS1) 10Dzahn: miscweb: switch static-codereview to bullseye backend [puppet] - 10https://gerrit.wikimedia.org/r/902174 (https://phabricator.wikimedia.org/T331896)
[19:08:29] <wikibugs>	 (03PS1) 10Dzahn: delete webserver-misc-static.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/902175
[19:15:31] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+1] Enable pinning for anon main menu when page tools is enabled [skins/Vector] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/902150 (https://phabricator.wikimedia.org/T331657) (owner: 10Nray)
[19:28:16] <wikibugs>	 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech: Wikidata seems to still be utilizing insecure HTTP URIs - https://phabricator.wikimedia.org/T331356 (10Ennomeijers) +1!
[19:48:43] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM %request for doc1002 - https://phabricator.wikimedia.org/T332812 (10andrea.denisse)
[19:48:56] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM %request for doc1002 - https://phabricator.wikimedia.org/T332812 (10andrea.denisse) a:03andrea.denisse
[19:49:27] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for doc1002 - https://phabricator.wikimedia.org/T332812 (10RhinosF1)
[19:58:33] <wikibugs>	 (03PS5) 10Ssingh: logstash: add pybal ECS filter and tests [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565)
[19:59:00] <wikibugs>	 (03PS2) 10Samtar: Document running persistRevisionThreadItems.php for wgExtraSignatureNamespaces changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901723 (https://phabricator.wikimedia.org/T332745) (owner: 10Bartosz Dziewoński)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T2000).
[20:00:05] <jouncebot>	 MatmaRex and nray: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:13] * TheresNoTime can deploy
[20:00:17] <MatmaRex>	 hi
[20:00:24] <MatmaRex>	 my changes are no-ops or labs only
[20:00:40] <TheresNoTime>	 I'll set them going now then :)
[20:00:46] <dancy>	 The best change is no change
[20:00:55] <dancy>	 also, never upgrade.
[20:01:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901723 (https://phabricator.wikimedia.org/T332745) (owner: 10Bartosz Dziewoński)
[20:01:02] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901724 (owner: 10Bartosz Dziewoński)
[20:01:05] <nray>	 o/
[20:01:19] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host doc1003.wikimedia.org
[20:01:20] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[20:01:33] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "Start for deploy" [skins/Vector] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/902150 (https://phabricator.wikimedia.org/T331657) (owner: 10Nray)
[20:01:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for doc1003 - https://phabricator.wikimedia.org/T332812 (10andrea.denisse)
[20:01:49] <wikibugs>	 (03Merged) 10jenkins-bot: Document running persistRevisionThreadItems.php for wgExtraSignatureNamespaces changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901723 (https://phabricator.wikimedia.org/T332745) (owner: 10Bartosz Dziewoński)
[20:01:52] <wikibugs>	 (03Merged) 10jenkins-bot: Clean up DiscussionTools labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901724 (owner: 10Bartosz Dziewoński)
[20:02:09] <taavi>	 o/
[20:02:24] <logmsgbot>	 !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@822dfed]: bump discolytics to 0.9.0
[20:02:32] <taavi>	 TheresNoTime: ping when done pls? I might have a patch of my own by then
[20:02:34] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:901723|Document running persistRevisionThreadItems.php for wgExtraSignatureNamespaces changes (T332745)]], [[gerrit:901724|Clean up DiscussionTools labs config]]
[20:02:39] <stashbot>	 T332745: Allow running persistRevisionThreadItems.php per-namespace and document that this should be done after changing wgExtraSignatureNamespaces - https://phabricator.wikimedia.org/T332745
[20:02:44] <TheresNoTime>	 taavi: will do 
[20:02:45] <logmsgbot>	 !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@822dfed]: bump discolytics to 0.9.0 (duration: 00m 21s)
[20:03:07] <wikibugs>	 (03PS1) 10JHathaway: lists: new server to test bookworm functionality [puppet] - 10https://gerrit.wikimedia.org/r/902182 (https://phabricator.wikimedia.org/T331706)
[20:04:08] <logmsgbot>	 !log samtar@deploy2002 samtar and matmarex: Backport for [[gerrit:901723|Document running persistRevisionThreadItems.php for wgExtraSignatureNamespaces changes (T332745)]], [[gerrit:901724|Clean up DiscussionTools labs config]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[20:04:16] <TheresNoTime>	 (syncing)
[20:04:32] <MatmaRex>	 thanks
[20:05:12] <MatmaRex>	 also - any suggestions for where else i should document/announce the persistRevisionThreadItems.php and wgExtraSignatureNamespaces thing?
[20:05:28] <logmsgbot>	 !log denisse@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[20:05:28] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache doc1003.wikimedia.org on all recursors
[20:05:31] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doc1003.wikimedia.org on all recursors
[20:05:33] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] lists: new server to test bookworm functionality [puppet] - 10https://gerrit.wikimedia.org/r/902182 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway)
[20:05:34] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[20:06:46] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:06:46] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache doc1003.wikimedia.org on all recursors
[20:06:49] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doc1003.wikimedia.org on all recursors
[20:06:54] <logmsgbot>	 !log denisse@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host doc1003.wikimedia.org
[20:07:15] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.ganeti.makevm for new host lists1003.wikimedia.org
[20:07:16] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox
[20:07:38] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host doc1003.eqiad.wmnet
[20:07:40] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[20:07:47] <kostajh>	 TheresNoTime: I'm adding another config patch to the window, is that alright?
[20:07:59] <TheresNoTime>	 kostajh: sure :)
[20:08:41] <wikibugs>	 10SRE: failed to register layer: devmapper during scap deploy - https://phabricator.wikimedia.org/T332818 (10TheresNoTime)
[20:09:56] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:901723|Document running persistRevisionThreadItems.php for wgExtraSignatureNamespaces changes (T332745)]], [[gerrit:901724|Clean up DiscussionTools labs config]] (duration: 07m 22s)
[20:10:00] <wikibugs>	 10SRE: failed to register layer: devmapper during scap deploy - https://phabricator.wikimedia.org/T332818 (10TheresNoTime)
[20:10:01] <stashbot>	 T332745: Allow running persistRevisionThreadItems.php per-namespace and document that this should be done after changing wgExtraSignatureNamespaces - https://phabricator.wikimedia.org/T332745
[20:10:03] <kostajh>	 (added)
[20:10:06] <wikibugs>	 10SRE: failed to register layer: devmapper during scap deploy - https://phabricator.wikimedia.org/T332818 (10dancy)
[20:10:39] <dancy>	 Deployers: You can ignore the `failed to register layer: devmapper: ` error that happens during deployment.
[20:10:46] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doc1003.eqiad.wmnet - denisse@cumin1001"
[20:10:54] <TheresNoTime>	 dancy: ack, thank you
[20:11:01] <dancy>	 officially https://phabricator.wikimedia.org/T332803
[20:11:25] <wikibugs>	 (03PS3) 10Samtar: GrowthExperiments: Enable Leveling Up features on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901144 (https://phabricator.wikimedia.org/T330358) (owner: 10Kosta Harlan)
[20:11:42] <TheresNoTime>	 kostajh: going to do your 901144 next, while I wait for nray's other patch to merge
[20:11:50] <kostajh>	 ok
[20:11:52] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doc1003.eqiad.wmnet - denisse@cumin1001"
[20:11:52] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:11:52] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache doc1003.eqiad.wmnet on all recursors
[20:11:55] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doc1003.eqiad.wmnet on all recursors
[20:12:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901144 (https://phabricator.wikimedia.org/T330358) (owner: 10Kosta Harlan)
[20:12:58] <wikibugs>	 (03Merged) 10jenkins-bot: GrowthExperiments: Enable Leveling Up features on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901144 (https://phabricator.wikimedia.org/T330358) (owner: 10Kosta Harlan)
[20:13:20] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:901144|GrowthExperiments: Enable Leveling Up features on pilot wikis (T330358 T317813)]]
[20:13:27] <stashbot>	 T317813: [EPIC] Positive Reinforcement: Leveling Up  - https://phabricator.wikimedia.org/T317813
[20:13:28] <stashbot>	 T330358: Leveling Up: Start experiment for Leveling up on Growth Pilot wikis (ar, bn, cs, es) - https://phabricator.wikimedia.org/T330358
[20:14:01] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for doc1003 - https://phabricator.wikimedia.org/T332812 (10andrea.denisse)
[20:15:01] <logmsgbot>	 !log samtar@deploy2002 kharlan and samtar: Backport for [[gerrit:901144|GrowthExperiments: Enable Leveling Up features on pilot wikis (T330358 T317813)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[20:15:05] <TheresNoTime>	 kostajh: live on mwdebug
[20:15:07] <logmsgbot>	 !log jhathaway@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[20:15:07] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.dns.wipe-cache lists1003.wikimedia.org on all recursors
[20:15:10] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) lists1003.wikimedia.org on all recursors
[20:15:17] <kostajh>	 TheresNoTime: thanks, I'll need a minute or two to verify
[20:15:28] <wikibugs>	 (03PS3) 10Samtar: Enable page tools for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900748 (https://phabricator.wikimedia.org/T331052) (owner: 10Nray)
[20:17:22] <wikibugs>	 (03Merged) 10jenkins-bot: Enable pinning for anon main menu when page tools is enabled [skins/Vector] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/902150 (https://phabricator.wikimedia.org/T331657) (owner: 10Nray)
[20:17:48] <kostajh>	 TheresNoTime: lgtm
[20:17:54] <TheresNoTime>	 syncing
[20:20:04] <wikibugs>	 (03PS10) 10Alex Paskulin: Assign the API portal to the Wikimedia group for CentralNotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) (owner: 10Ejegg)
[20:23:18] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:901144|GrowthExperiments: Enable Leveling Up features on pilot wikis (T330358 T317813)]] (duration: 09m 57s)
[20:23:25] <stashbot>	 T317813: [EPIC] Positive Reinforcement: Leveling Up  - https://phabricator.wikimedia.org/T317813
[20:23:25] <stashbot>	 T330358: Leveling Up: Start experiment for Leveling up on Growth Pilot wikis (ar, bn, cs, es) - https://phabricator.wikimedia.org/T330358
[20:23:28] <TheresNoTime>	 kostajh: live :)
[20:23:32] <TheresNoTime>	 nray: ready?
[20:23:34] <kostajh>	 TheresNoTime: thanks!
[20:24:08] <nray>	 TheresNoTime: Yes, is on the debug servers?
[20:24:13] <nray>	 is it*
[20:24:19] <TheresNoTime>	 nray: not yet
[20:24:51] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:902150|Enable pinning for anon main menu when page tools is enabled (T331657)]]
[20:24:56] <stashbot>	 T331657: Enable pinning for anonymous users when page tools is enabled - https://phabricator.wikimedia.org/T331657
[20:25:05] <icinga-wm>	 PROBLEM - Host kubernetes1023 is DOWN: PING CRITICAL - Packet loss = 100%
[20:25:48] <akosiaris>	 !log reboot kubernetes1023 for a test
[20:25:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:54] <TheresNoTime>	 akosiaris: not saying that's related ^, but just as it happened, scap backport got stuck on `20:25:28 docker_pull_k8s:  96% (in-flight: 1; ok: 29; fail: 2; left: 0)`
[20:27:03] <icinga-wm>	 RECOVERY - Host kubernetes1023 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[20:28:29] <logmsgbot>	 !log samtar@deploy2002 samtar and nray: Backport for [[gerrit:902150|Enable pinning for anon main menu when page tools is enabled (T331657)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[20:28:38] <TheresNoTime>	 nray: 902150 is live on mwdebug now
[20:28:50] <nray>	 TheresNoTime: Thank you, I will check now
[20:29:24] <akosiaris>	 TheresNoTime: did it proceed eventually ? 
[20:29:30] <TheresNoTime>	 yeah :)
[20:29:39] <akosiaris>	 ok, good to know. Thanks for the notice
[20:29:48] <akosiaris>	 and yeah, it's probably related
[20:29:56] <akosiaris>	 but also self-healed apparently
[20:30:24] <TheresNoTime>	 well the stage failed on 3 nodes instead of 2, so guessing it just timed out?
[20:31:04] <nray>	 TheresNoTime: You can proceed with that one
[20:31:13] <TheresNoTime>	 syncing :)
[20:31:58] <akosiaris>	 well, it's drained now, rebooting once more, this time around we shouldn't see anything 
[20:32:03] <akosiaris>	 !log reboot kubernetes1023 for a test once more
[20:32:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:42] <akosiaris>	 !log reboot kubernetes1023 for a test once more, ⚓ T332803
[20:32:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:47] <stashbot>	 T332803: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2  - https://phabricator.wikimedia.org/T332803
[20:32:58] <akosiaris>	 that's gonna be properly added now
[20:34:03] <icinga-wm>	 PROBLEM - Host kubernetes1023 is DOWN: PING CRITICAL - Packet loss = 100%
[20:35:25] <icinga-wm>	 RECOVERY - Host kubernetes1023 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[20:36:13] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] es_exporter: add NEL metrics by country [puppet] - 10https://gerrit.wikimedia.org/r/901220 (https://phabricator.wikimedia.org/T328941) (owner: 10Volans)
[20:36:39] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:902150|Enable pinning for anon main menu when page tools is enabled (T331657)]] (duration: 11m 47s)
[20:36:44] <stashbot>	 T331657: Enable pinning for anonymous users when page tools is enabled - https://phabricator.wikimedia.org/T331657
[20:36:52] <TheresNoTime>	 live, and moving on to 900748
[20:37:18] <nray>	 TheresNoTime: thank you!
[20:37:28] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900748 (https://phabricator.wikimedia.org/T331052) (owner: 10Nray)
[20:37:40] <akosiaris>	 !log uncordon reboot kubernetes1023. It was drained previously for ⚓ T332803
[20:37:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:27] <wikibugs>	 (03Merged) 10jenkins-bot: Enable page tools for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900748 (https://phabricator.wikimedia.org/T331052) (owner: 10Nray)
[20:39:10] <wikibugs>	 (03PS1) 10Majavah: Set OATHAuthMultipleDevicesMigrationStage in IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902187
[20:39:12] <wikibugs>	 (03PS1) 10Majavah: Remove OATHAuthMultipleDevicesMigrationStage from CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902188
[20:39:14] <wikibugs>	 (03PS1) 10Majavah: [beta] Write both for OATHAuthMultipleDevicesMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902189 (https://phabricator.wikimedia.org/T242031)
[20:41:43] <TheresNoTime>	 nray: live on mwdebug1002
[20:41:50] <TheresNoTime>	 (had to do this one manually)
[20:41:55] <nray>	 TheresNoTime: Thank you, checking now
[20:44:19] <nray>	 TheresNoTime: Looks good, you can proceed!
[20:44:27] <TheresNoTime>	 syncing
[20:49:41] <TheresNoTime>	 (err, if it announces that it's ready on mwdebug, ignore it :p)
[20:54:50] <logmsgbot>	 !log samtar@deploy2002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:900748|Enable page tools for anonymous users (T331052)]] (duration: 10m 10s)
[20:54:55] <stashbot>	 T331052: Enable page tools for anonymous users - https://phabricator.wikimedia.org/T331052
[20:55:04] <TheresNoTime>	 nray: got there eventually, should be live now :)
[20:55:09] <TheresNoTime>	 taavi: all yours
[20:55:13] <taavi>	 thanks!
[20:55:16] <nray>	 TheresNoTime: Thanks for your help!
[20:55:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902187 (owner: 10Majavah)
[20:55:50] <TheresNoTime>	 (one of those k8s steps takes a while to time-out and fail fwiw)
[20:55:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:56:10] <taavi>	 :(
[20:56:15] <taavi>	 the deployment process is already slow as is
[20:56:49] <wikibugs>	 (PS2) Majavah: Set OATHAuthMultipleDevicesMigrationStage in IS [mediawiki-config] - https://gerrit.wikimedia.org/r/902187
[20:56:52] <wikibugs>	 (CR) Majavah: [C: +2] Set OATHAuthMultipleDevicesMigrationStage in IS [mediawiki-config] - https://gerrit.wikimedia.org/r/902187 (owner: Majavah)
[20:57:46] <wikibugs>	 (Merged) jenkins-bot: Set OATHAuthMultipleDevicesMigrationStage in IS [mediawiki-config] - https://gerrit.wikimedia.org/r/902187 (owner: Majavah)
[20:58:12] <logmsgbot>	 !log taavi@deploy2002 Started scap: Backport for [[gerrit:902187|Set OATHAuthMultipleDevicesMigrationStage in IS]]
[20:58:37] <wikibugs>	 (CR) Cwhite: "one nit inline, but otherwise LGTM" [puppet] - https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) (owner: Ssingh)
[20:59:41] <logmsgbot>	 !log taavi@deploy2002 taavi: Backport for [[gerrit:902187|Set OATHAuthMultipleDevicesMigrationStage in IS]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[21:00:29] <wikibugs>	 (PS2) Majavah: Remove OATHAuthMultipleDevicesMigrationStage from CS [mediawiki-config] - https://gerrit.wikimedia.org/r/902188
[21:00:35] <wikibugs>	 (PS2) Majavah: [beta] Write both for OATHAuthMultipleDevicesMigrationStage [mediawiki-config] - https://gerrit.wikimedia.org/r/902189 (https://phabricator.wikimedia.org/T242031)
[21:00:53] <wikibugs>	 (CR) Majavah: [C: +2] Remove OATHAuthMultipleDevicesMigrationStage from CS [mediawiki-config] - https://gerrit.wikimedia.org/r/902188 (owner: Majavah)
[21:00:58] <wikibugs>	 (CR) Majavah: [C: +2] [beta] Write both for OATHAuthMultipleDevicesMigrationStage [mediawiki-config] - https://gerrit.wikimedia.org/r/902189 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[21:00:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:01:42] <wikibugs>	 (Merged) jenkins-bot: Remove OATHAuthMultipleDevicesMigrationStage from CS [mediawiki-config] - https://gerrit.wikimedia.org/r/902188 (owner: Majavah)
[21:01:45] <wikibugs>	 (Merged) jenkins-bot: [beta] Write both for OATHAuthMultipleDevicesMigrationStage [mediawiki-config] - https://gerrit.wikimedia.org/r/902189 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[21:05:30] <logmsgbot>	 !log taavi@deploy2002 Finished scap: Backport for [[gerrit:902187|Set OATHAuthMultipleDevicesMigrationStage in IS]] (duration: 07m 17s)
[21:05:34] <wikibugs>	 (PS1) QChris: Add .gitreview [debs/cqlsh4] - https://gerrit.wikimedia.org/r/902194
[21:05:36] <wikibugs>	 (CR) QChris: [V: +2 C: +2] Add .gitreview [debs/cqlsh4] - https://gerrit.wikimedia.org/r/902194 (owner: QChris)
[21:06:26] <logmsgbot>	 !log taavi@deploy2002 Started scap: Backport for [[gerrit:902188|Remove OATHAuthMultipleDevicesMigrationStage from CS]], [[gerrit:902189|[beta] Write both for OATHAuthMultipleDevicesMigrationStage (T242031)]]
[21:06:31] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[21:08:09] <logmsgbot>	 !log taavi@deploy2002 taavi: Backport for [[gerrit:902188|Remove OATHAuthMultipleDevicesMigrationStage from CS]], [[gerrit:902189|[beta] Write both for OATHAuthMultipleDevicesMigrationStage (T242031)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[21:08:20] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doc1003.eqiad.wmnet
[21:13:13] <wikibugs>	 SRE, vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (Peachey88)
[21:13:55] <logmsgbot>	 !log taavi@deploy2002 Finished scap: Backport for [[gerrit:902188|Remove OATHAuthMultipleDevicesMigrationStage from CS]], [[gerrit:902189|[beta] Write both for OATHAuthMultipleDevicesMigrationStage (T242031)]] (duration: 07m 29s)
[21:14:01] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[21:14:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:15:27] <wikibugs>	 (PS1) Jdlrobson: Enable web based viewing of ReadingLists on mediawiki.org and metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093)
[21:15:46] <taavi>	 !log UTC late backports complete
[21:15:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:36] <taavi>	 jouncebot: nowandnext
[21:16:36] <jouncebot>	 No deployments scheduled for the next 8 hour(s) and 43 minute(s)
[21:16:36] <jouncebot>	 In 8 hour(s) and 43 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T0600)
[21:16:37] <jouncebot>	 In 8 hour(s) and 43 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T0600)
[21:16:41] <taavi>	 sorry, I got one more
[21:17:27] <wikibugs>	 (PS1) Majavah: [beta] Read new for OATHAuthMultipleDevicesMigrationStage [mediawiki-config] - https://gerrit.wikimedia.org/r/902198 (https://phabricator.wikimedia.org/T242031)
[21:17:38] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/902198 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[21:18:23] <wikibugs>	 (Merged) jenkins-bot: [beta] Read new for OATHAuthMultipleDevicesMigrationStage [mediawiki-config] - https://gerrit.wikimedia.org/r/902198 (https://phabricator.wikimedia.org/T242031) (owner: Majavah)
[21:19:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:41:07] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[21:42:22] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Orchestrator, and 2 others: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (Aklapper) >>! In T273637#8718626, @jbond wrote: > are you able to offer any advice on this, thanks?  See "[Request a project]...
[21:45:01] <wikibugs>	 (PS1) Cwhite: logstash: add grafana-server ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/901642 (https://phabricator.wikimedia.org/T234565)
[21:46:07] <jinxer-wm>	 (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[22:29:11] <TheresNoTime>	 NovemLinguae: if you're around to test, I could look at backporting 902153
[22:31:14] <wikibugs>	 (PS1) Samtar: Revert "Remove 50% opacity from notification badges when they are all read" [extensions/Echo] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/902154 (https://phabricator.wikimedia.org/T331502)
[22:32:12] <wikibugs>	 (PS1) Samtar: Revert "Remove 50% opacity from notification badges when they are all read" [extensions/Echo] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/902155 (https://phabricator.wikimedia.org/T331502)
[22:33:34] <NovemLinguae>	 i was thinking it was pretty minor. wasn't thinking it was backport worthy. but i'm around to test if you disagree
[22:34:55] <TheresNoTime>	 A skin issue which is pretty minor, that's a first /s
[22:35:07] <NovemLinguae>	 lol :)
[22:38:30] <TheresNoTime>	 hm, well I don't disagree that it's a minor regression — may as well let it ride the train then :) sorry for the ping!
[22:40:28] <NovemLinguae>	 nah no worries, ping me anytime. i appreciate the backport offer
[22:54:15] <wikibugs>	 (CR) Tim Starling: Temporarily disable xenon/excimer for switch maintenance (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/901322 (https://phabricator.wikimedia.org/T330165) (owner: Tim Starling)
[23:01:38] <wikibugs>	 (PS1) Samtar: core-Permissions: Add `ipblock-exempt` to `bot` group [mediawiki-config] - https://gerrit.wikimedia.org/r/902207 (https://phabricator.wikimedia.org/T332759)
[23:02:16] <wikibugs>	 (CR) CI reject: [V: -1] core-Permissions: Add `ipblock-exempt` to `bot` group [mediawiki-config] - https://gerrit.wikimedia.org/r/902207 (https://phabricator.wikimedia.org/T332759) (owner: Samtar)
[23:03:50] <wikibugs>	 (PS2) Samtar: core-Permissions: Add `ipblock-exempt` to `bot` group [mediawiki-config] - https://gerrit.wikimedia.org/r/902207 (https://phabricator.wikimedia.org/T332759)
[23:04:21] <wikibugs>	 (PS1) Zabe: wikimaniawiki: Add namespace for 2024 wikimania [mediawiki-config] - https://gerrit.wikimedia.org/r/902208 (https://phabricator.wikimedia.org/T332782)
[23:06:09] <wikibugs>	 (PS2) Krinkle: Temporarily disable xenon/excimer for mwlog1002 switch maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/901322 (https://phabricator.wikimedia.org/T331882) (owner: Tim Starling)
[23:06:25] <icinga-wm>	 RECOVERY - Check systemd state on mw1372 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:17:51] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:20:13] <zabe>	 jouncebot: nowandnext
[23:20:13] <jouncebot>	 No deployments scheduled for the next 6 hour(s) and 39 minute(s)
[23:20:14] <jouncebot>	 In 6 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T0600)
[23:20:14] <jouncebot>	 In 6 hour(s) and 39 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T0600)
[23:20:18] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by zabe@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/902208 (https://phabricator.wikimedia.org/T332782) (owner: Zabe)
[23:21:03] <wikibugs>	 (Merged) jenkins-bot: wikimaniawiki: Add namespace for 2024 wikimania [mediawiki-config] - https://gerrit.wikimedia.org/r/902208 (https://phabricator.wikimedia.org/T332782) (owner: Zabe)
[23:21:24] <logmsgbot>	 !log zabe@deploy2002 Started scap: Backport for [[gerrit:902208|wikimaniawiki: Add namespace for 2024 wikimania (T332782)]]
[23:21:31] <stashbot>	 T332782: Create 2024 namespace for wikimaniawiki - https://phabricator.wikimedia.org/T332782
[23:22:58] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:902208|wikimaniawiki: Add namespace for 2024 wikimania (T332782)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[23:24:30] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host doc2002.codfw.wmnet
[23:24:31] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[23:24:32] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host lists1003.wikimedia.org
[23:26:07] <jinxer-wm>	 (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[23:31:07] <jinxer-wm>	 (RedisMemoryFull) firing: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[23:31:28] <logmsgbot>	 !log zabe@deploy2002 Finished scap: Backport for [[gerrit:902208|wikimaniawiki: Add namespace for 2024 wikimania (T332782)]] (duration: 10m 03s)
[23:31:34] <stashbot>	 T332782: Create 2024 namespace for wikimaniawiki - https://phabricator.wikimedia.org/T332782
[23:32:14] <zabe>	 !log zabe@mwmaint2002:~$ mwscript namespaceDupes.php wikimaniawiki --fix # T332782
[23:32:15] <wikibugs>	 (PS1) Andrea Denisse: doc: Add the doc1003 node definition [puppet] - https://gerrit.wikimedia.org/r/902209 (https://phabricator.wikimedia.org/T332812)
[23:32:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:56] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doc2002.codfw.wmnet - denisse@cumin1001"
[23:33:26] <wikibugs>	 (CR) Tim Starling: Temporarily disable xenon/excimer for mwlog1002 switch maintenance (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/901322 (https://phabricator.wikimedia.org/T331882) (owner: Tim Starling)
[23:33:58] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doc2002.codfw.wmnet - denisse@cumin1001"
[23:33:58] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:33:58] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache doc2002.codfw.wmnet on all recursors
[23:34:01] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doc2002.codfw.wmnet on all recursors
[23:34:24] <wikibugs>	 (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40285/console" [puppet] - https://gerrit.wikimedia.org/r/902209 (https://phabricator.wikimedia.org/T332812) (owner: Andrea Denisse)
[23:35:26] <wikibugs>	 (CR) Andrea Denisse: [V: +1 C: +2] doc: Add the doc1003 node definition [puppet] - https://gerrit.wikimedia.org/r/902209 (https://phabricator.wikimedia.org/T332812) (owner: Andrea Denisse)
[23:35:50] <wikibugs>	 (CR) Tim Starling: Temporarily disable xenon/excimer for mwlog1002 switch maintenance (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/901322 (https://phabricator.wikimedia.org/T331882) (owner: Tim Starling)
[23:36:07] <jinxer-wm>	 (RedisMemoryFull) resolved: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc  - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[23:38:47] <wikibugs>	 (PS1) Superpes15: [dkwikimedia] Fixing current logo with an HD version [mediawiki-config] - https://gerrit.wikimedia.org/r/902211 (https://phabricator.wikimedia.org/T332784)
[23:46:49] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host doc1003.eqiad.wmnet with OS bullseye
[23:46:57] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests, Patch-For-Review: Site: 1 VM request for doc1003 - https://phabricator.wikimedia.org/T332812 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host doc1003.eqiad.wmnet with OS bullseye
[23:52:58] <wikibugs>	 (PS6) Ssingh: logstash: add pybal ECS filter and tests [puppet] - https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565)
[23:56:27] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on doc1003.eqiad.wmnet with reason: host reimage
[23:59:41] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doc1003.eqiad.wmnet with reason: host reimage