[00:05:38] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:10] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:17] (Merged) jenkins-bot: Add namespace translations for Angika [extensions/Gadgets] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/901652 (https://phabricator.wikimedia.org/T332118) (owner: Zabe)
[00:10:27] (Merged) jenkins-bot: Add namespace translations for Angika [extensions/Scribunto] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/901653 (https://phabricator.wikimedia.org/T332118) (owner: Zabe)
[00:10:53] (Merged) jenkins-bot: Add namespaces, linktrail and digit transform table for Angika [core] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/901651 (https://phabricator.wikimedia.org/T332118) (owner: Zabe)
[00:11:18] !log zabe@deploy2002 Started scap: Backport for [[gerrit:901652|Add namespace translations for Angika (T332118)]], [[gerrit:901653|Add namespace translations for Angika (T332118)]], [[gerrit:901651|Add namespaces, linktrail and digit transform table for Angika (T332118)]]
[00:11:24] T332118: Add namespace translations in Angika - https://phabricator.wikimedia.org/T332118
[00:18:38] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:07] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[00:26:07] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[00:26:40] (CR) Zabe: [C: +2] Initial configuration for anpwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/901727 (https://phabricator.wikimedia.org/T332115) (owner: Zabe)
[00:27:24] (Merged) jenkins-bot: Initial configuration for anpwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/901727 (https://phabricator.wikimedia.org/T332115) (owner: Zabe)
[00:29:15] !log zabe@deploy2002 zabe: Backport for [[gerrit:901652|Add namespace translations for Angika (T332118)]], [[gerrit:901653|Add namespace translations for Angika (T332118)]], [[gerrit:901651|Add namespaces, linktrail and digit transform table for Angika (T332118)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[00:29:21] T332118: Add namespace translations in Angika - https://phabricator.wikimedia.org/T332118
[00:38:19] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:901652|Add namespace translations for Angika (T332118)]], [[gerrit:901653|Add namespace translations for Angika (T332118)]], [[gerrit:901651|Add namespaces, linktrail and digit transform table for Angika (T332118)]] (duration: 27m 00s)
[00:38:24] T332118: Add namespace translations in Angika - https://phabricator.wikimedia.org/T332118
[00:39:30] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:40:02] !log create Wikipedia Angika (anpwiki) # T332115
[00:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:40:07] T332115: Create Wikipedia Angika - https://phabricator.wikimedia.org/T332115
[00:40:24] !log zabe@deploy2002 Started scap: T332115
[00:47:20] !log zabe@deploy2002 Finished scap: T332115 (duration: 06m 56s)
[00:47:26] T332115: Create Wikipedia Angika - https://phabricator.wikimedia.org/T332115
[00:48:58] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:41] (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/901635
[00:49:43] (CR) Zabe: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/901635 (owner: Zabe)
[00:50:25] (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/901635 (owner: Zabe)
[00:50:49] !log zabe@deploy2002 Started scap: update interwiki cache
[00:57:51] !log zabe@deploy2002 Finished scap: update interwiki cache (duration: 07m 02s)
[00:58:04] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:59:32] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:08:00] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:19:22] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:19:37] (CR) Samwilson: Remove WikiEditor's Realtime Preview config vars (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/901553
(https://phabricator.wikimedia.org/T327515) (owner: Samwilson)
[01:38:16] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:47:44] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:53:00] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:53:58] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:36] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:18:06] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:26:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:02] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:48:32] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:09:26] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:13:02] (CR) Kevin Bazira: [C: +1] ml-services: add autoscaling settings for enwiki drafttopic [deployment-charts] - https://gerrit.wikimedia.org/r/901671 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[03:14:10] (CR) Kevin Bazira: [C: +1] services: tweak lift wing endpoints to allow wikidata-specific endpoints [deployment-charts] - https://gerrit.wikimedia.org/r/901675 (owner: Elukey)
[03:18:54] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:19:08] (CR) Krinkle: [C: +1] Temporarily disable xenon/excimer for switch maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/901322 (https://phabricator.wikimedia.org/T330165) (owner: Tim Starling)
[03:19:40] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:21:42] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:37:48] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:42:34] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:16] PROBLEM - Check systemd
state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:49:10] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:50:08] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:57:40] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:08:10] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:11:07] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[04:14:54] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:17:42] RECOVERY - orchestrator process on dborch1002 is OK: PROCS OK: 1 process with regex args orchestrator http https://wikitech.wikimedia.org/wiki/Orchestrator
[04:19:34] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:26:07] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[04:28:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:07] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[04:38:36] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:41:30] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:46:07] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[04:48:06] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:09:02] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:18:34] RECOVERY -
Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:39:26] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:46:23] (CR) Giuseppe Lavagetto: [C: +2] etcd: add alert for high traffic volumes [alerts] - https://gerrit.wikimedia.org/r/901622 (https://phabricator.wikimedia.org/T322400) (owner: Giuseppe Lavagetto)
[05:48:12] (Merged) jenkins-bot: etcd: add alert for high traffic volumes [alerts] - https://gerrit.wikimedia.org/r/901622 (https://phabricator.wikimedia.org/T322400) (owner: Giuseppe Lavagetto)
[05:48:58] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T0600)
[06:08:00] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:19:26] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:10] (PS2) Giuseppe Lavagetto: mesh.configuration: add support for custom error pages [deployment-charts] - https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983)
[06:32:12] (PS1) Giuseppe Lavagetto: tegola-vector-tiles: update to mesh 1.1 [deployment-charts] - https://gerrit.wikimedia.org/r/901767 (https://phabricator.wikimedia.org/T287983)
[06:32:14] (PS1) Giuseppe Lavagetto: modules: re-add base.kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/901768
[06:32:16] (PS1) Giuseppe Lavagetto: charts: upgrade to mesh 1.1 [deployment-charts] - https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983)
[06:33:11] (PS1) Marostegui: ferm.pp: Add dborch1002 to the firewall rules [puppet] - https://gerrit.wikimedia.org/r/901770 (https://phabricator.wikimedia.org/T298959)
[06:33:56] (PS2) Marostegui: ferm.pp: Add dborch1002 to the firewall rules [puppet] - https://gerrit.wikimedia.org/r/901770 (https://phabricator.wikimedia.org/T298959)
[06:34:34] PROBLEM - Check systemd state on an-worker1110 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:37:09] (CR) CI reject: [V: -1] modules: re-add base.kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/901768 (owner: Giuseppe Lavagetto)
[06:37:19] (CR) CI reject: [V: -1] mesh.configuration: add support for custom error pages [deployment-charts] - https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) (owner: Giuseppe Lavagetto)
[06:37:22] (CR) CI reject: [V: -1] charts: upgrade to mesh 1.1 [deployment-charts] - https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983) (owner: Giuseppe Lavagetto)
[06:38:24] RECOVERY - Check systemd state on an-worker1110 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:38:34] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:39:51] (PS2) Giuseppe Lavagetto: modules: re-add base.kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/901768
[06:39:53] (PS2) Giuseppe Lavagetto: charts: upgrade to mesh 1.1 [deployment-charts] - https://gerrit.wikimedia.org/r/901769
(https://phabricator.wikimedia.org/T287983)
[06:45:19] (CR) CI reject: [V: -1] charts: upgrade to mesh 1.1 [deployment-charts] - https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983) (owner: Giuseppe Lavagetto)
[06:45:21] (CR) CI reject: [V: -1] modules: re-add base.kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/901768 (owner: Giuseppe Lavagetto)
[06:49:20] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:49:27] (PS3) Giuseppe Lavagetto: mesh.configuration: add support for custom error pages [deployment-charts] - https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983)
[06:49:30] (PS3) Giuseppe Lavagetto: modules: re-add base.kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/901768
[06:49:32] (PS3) Giuseppe Lavagetto: charts: upgrade to mesh 1.1 [deployment-charts] - https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983)
[06:49:58] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:51:01] ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (Marostegui) @Jclark-ctr could you take a look at db1121's mgmt cable?
[06:53:14] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:54:29] (CR) CI reject: [V: -1] charts: upgrade to mesh 1.1 [deployment-charts] - https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983) (owner: Giuseppe Lavagetto)
[06:56:16] (CR) CI reject: [V: -1] mesh.configuration: add support for custom error pages [deployment-charts] - https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) (owner: Giuseppe Lavagetto)
[06:56:18] (CR) CI reject: [V: -1] modules: re-add base.kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/901768 (owner: Giuseppe Lavagetto)
[06:58:52] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:04] Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T0700)
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:04:36] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:08:32] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:10:54] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:18:32] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:22:51] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 141082
[07:23:25] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 141082
[07:27:03] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (ayounsi) Let's use a new task for the new racks and keep this one for the spines. Speaking of spines we might want to hold on cabling the ne...
[07:35:03] (CR) Elukey: [C: +2] services: tweak lift wing endpoints to allow wikidata-specific endpoints [deployment-charts] - https://gerrit.wikimedia.org/r/901675 (owner: Elukey)
[07:36:07] (CR) Elukey: [C: +2] ml-services: add autoscaling settings for enwiki drafttopic [deployment-charts] - https://gerrit.wikimedia.org/r/901671 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[07:39:26] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:40:52] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:42:41] (CR) Muehlenhoff: [C: +2] Add htriedman to analytics-platform-eng-admins [puppet] - https://gerrit.wikimedia.org/r/901614 (https://phabricator.wikimedia.org/T331647) (owner: Muehlenhoff)
[07:48:58] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:52:22] SRE, SRE-Access-Requests, Patch-For-Review: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @Htriedman Your access has been enabled (it will take up to 30 minutes to have the change reach all servers), please re...
[07:53:45] (PS4) Giuseppe Lavagetto: mesh.configuration: add support for custom error pages [deployment-charts] - https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983)
[07:53:47] (PS4) Giuseppe Lavagetto: modules: re-add base.kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/901768
[07:53:49] (PS4) Giuseppe Lavagetto: charts: upgrade to mesh 1.1 [deployment-charts] - https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983)
[08:06:13] (PS6) Muehlenhoff: Make Python2 removal on Bullseye configurable [puppet] - https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363)
[08:06:16] (CR) Muehlenhoff: Make Python2 removal on Bullseye configurable (1 comment) [puppet] - https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363) (owner: Muehlenhoff)
[08:07:54] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/901630 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[08:08:59] (CR) Filippo Giunchedi: "This LGTM, things left to do:" [alerts] - https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: AOkoth)
[08:09:56] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:13:07] (PS6) Ayounsi: Create and deploy per-CDN-site DNS domains [dns] - https://gerrit.wikimedia.org/r/899214 (https://phabricator.wikimedia.org/T332025) (owner: Jameel Kaisar)
[08:14:14] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363) (owner: Muehlenhoff)
[08:17:56] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync
[08:18:20] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync
[08:18:56] (PS16) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[08:19:26] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:20:06] (CR) Muehlenhoff: [C: +2] Make Python2 removal on Bullseye configurable [puppet] - https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363) (owner: Muehlenhoff)
[08:20:20] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: sync
[08:20:36] (CR) Ayounsi: [C: +2] Create and deploy per-CDN-site DNS domains [dns] - https://gerrit.wikimedia.org/r/899214 (https://phabricator.wikimedia.org/T332025) (owner: Jameel Kaisar)
[08:20:39] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync
[08:21:16] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[08:23:30] (PS17) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[08:24:33] !log deploy measure-$site.wikimedia.org CNAMES
[08:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:06] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[08:25:25] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[08:25:26] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[08:25:45] (CR) JMeybohm: mesh.configuration: add support for custom error pages (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) (owner: Giuseppe Lavagetto)
[08:27:13] (CR) Vgutierrez: [C: +2] modules: re-add base.kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/901768 (owner: Giuseppe Lavagetto)
[08:27:31] hmm I misclicked that one
[08:27:39] (CR) Vgutierrez: modules: re-add base.kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/901768 (owner: Giuseppe Lavagetto)
[08:27:57] (PS18) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[08:29:57] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[08:30:13] vgutierrez: it was https://gerrit.wikimedia.org/r/c/operations/dns/+/899214
[08:30:22] (PS9) Muehlenhoff: Allow hive on bullseye to install and use the correct packages [puppet] - https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) (owner: Btullis)
[08:30:26] (CR) Elukey: [C: +2] services: add the first lift wing stream to change-prop [deployment-charts] - https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[08:30:28] XioNoX: :?
[08:31:00] XioNoX: I was referring to my +2 in https://gerrit.wikimedia.org/r/901768
[08:31:02] vgutierrez: I thought you were talking about my last log
[08:31:07] nevermind :)
[08:31:22] (PS19) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[08:33:23] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[08:35:36] (PS20) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[08:37:30] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[08:38:30] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:39:45] (PS21) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[08:41:39] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[08:43:24] (PS22) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[08:45:15] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[08:45:39] (PS1) Elukey: sre.hosts.reboot-single.py: replace "pool" with "depool" [cookbooks] - https://gerrit.wikimedia.org/r/902009
[08:47:55] (PS23) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[08:48:00] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:48:30] (PS2) Elukey: sre.hosts.reboot-single.py: replace "pool" with "depool" [cookbooks] - https://gerrit.wikimedia.org/r/902009 (https://phabricator.wikimedia.org/T325153)
[08:49:50] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[08:49:58] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/902009 (https://phabricator.wikimedia.org/T325153) (owner: Elukey)
[08:51:11] (CR) Elukey: [C: +2] sre.hosts.reboot-single.py: replace "pool" with "depool" [cookbooks] - https://gerrit.wikimedia.org/r/902009 (https://phabricator.wikimedia.org/T325153) (owner: Elukey)
[08:51:44] (PS24) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[08:52:09] !log stevemunene@cumin1001 START - Cookbook sre.ganeti.reimage for host an-test-client1002.eqiad.wmnet with OS bullseye
[08:53:39] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[08:56:17] (PS25) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[08:58:13] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[08:58:32] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on pybal-test2003.codfw.wmnet with reason: Some tests with pybal/Bullseye
[08:58:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on pybal-test2003.codfw.wmnet with reason: Some tests with pybal/Bullseye
[08:58:53] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=dcc641f3-257f-4a0d-875d-85c9d542b7f8) set by jmm@cumin2002 for 3 days, 0:00:00 on 1 host(s) and their services with r...
[08:59:47] (03PS26) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [09:00:05] 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic-Icebox, 10Sustainability (Incident Followup): Investigate varnishd child crashes when multiple nodes get depooled/pooled concurrently - https://phabricator.wikimedia.org/T154801 (10ayounsi) [09:01:07] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [09:01:40] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [09:01:42] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on kafka-main1004.eqiad.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware [09:01:55] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on kafka-main1004.eqiad.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware [09:02:56] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main1004.eqiad.wmnet [09:04:07] (03PS1) 10Muehlenhoff: Add a component/pybal and respective build hook [puppet] - 10https://gerrit.wikimedia.org/r/902011 [09:04:30] (03CR) 10CI reject: [V: 04-1] Add a component/pybal and respective build hook [puppet] - 10https://gerrit.wikimedia.org/r/902011 (owner: 10Muehlenhoff) [09:06:07] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [09:09:00] 
PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:12] (03PS27) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [09:09:37] (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+2] hadoop: Authorize access from dse k8s pods to hdfs and hive-metastore prod [puppet] - 10https://gerrit.wikimedia.org/r/901562 (https://phabricator.wikimedia.org/T331859) (owner: 10Nicolas Fraison) [09:10:37] 10SRE-Sprint-Week-Sustainability-March2023, 10Znuny, 10serviceops-collab, 10Sustainability (Incident Followup): enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (10ayounsi) [09:10:54] !log elukey@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts kafka-main1004.eqiad.wmnet [09:11:13] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [09:11:19] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Gerrit, 10serviceops-collab, and 3 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) The incident report is at https://wikitech.wikimedia.org/wiki/Incidents/2022-11-17_Gerrit_3.5_upgrade The #wiki... 
[09:11:22] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main1004.eqiad.wmnet [09:11:53] (03PS2) 10Muehlenhoff: Add a component/pybal and respective build hook [puppet] - 10https://gerrit.wikimedia.org/r/902011 [09:12:19] 10SRE-tools, 10Infrastructure-Foundations: cookbooks: sre.hosts.reboot-single update to support disabled puppet - https://phabricator.wikimedia.org/T325153 (10elukey) 05Open→03Resolved Fixed :) [09:12:39] (03PS28) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [09:12:49] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-main1004.eqiad.wmnet [09:12:51] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host kafka-main1004.eqiad.wmnet [09:14:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [09:14:32] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [09:15:58] (03PS1) 10Elukey: sre.hosts.reboot-single: set self.depool in any case [cookbooks] - 10https://gerrit.wikimedia.org/r/902013 (https://phabricator.wikimedia.org/T325153) [09:16:08] (03PS29) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [09:16:27] (03CR) 10Muehlenhoff: [C: 03+2] Add a component/pybal and respective build hook [puppet] - 10https://gerrit.wikimedia.org/r/902011 (owner: 10Muehlenhoff) [09:18:06] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [09:18:26] (03CR) 10Giuseppe Lavagetto: [C: 03+2] changeprop-jobqueue: 
reduce concurrency of video transcoding [deployment-charts] - 10https://gerrit.wikimedia.org/r/901602 (https://phabricator.wikimedia.org/T278945) (owner: 10Giuseppe Lavagetto) [09:18:32] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:18:38] 10SRE-Sprint-Week-Sustainability-March2023, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10Sustainability (Incident Followup): Use/adopt search cluster ES management cookbooks for logging ES too - https://phabricator.wikimedia.org/T255864 (10ayounsi) [09:20:31] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Gerrit, 10serviceops-collab, and 3 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10jcrespo) > I guess we can drop the SRE-OnFire tag? Hashar: alternatively, this could be closed, as per title scope and... [09:20:46] (03PS30) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [09:21:34] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts kafka-main1004.eqiad.wmnet [09:21:44] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/902013 (https://phabricator.wikimedia.org/T325153) (owner: 10Elukey) [09:22:11] (03CR) 10Elukey: [C: 03+2] sre.hosts.reboot-single: set self.depool in any case [cookbooks] - 10https://gerrit.wikimedia.org/r/902013 (https://phabricator.wikimedia.org/T325153) (owner: 10Elukey) [09:22:41] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [09:23:23] (03Merged) 10jenkins-bot: changeprop-jobqueue: reduce concurrency of video transcoding [deployment-charts] - 10https://gerrit.wikimedia.org/r/901602 (https://phabricator.wikimedia.org/T278945) (owner: 10Giuseppe Lavagetto) [09:23:37] 
10SRE-Sprint-Week-Sustainability-March2023, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10Sustainability (Incident Followup): Use/adopt search cluster ES management cookbooks for logging ES too - https://phabricator.wikimedia.org/T255864 (10ayounsi) > Note: Once OpenSearch compatibili... [09:23:56] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main1004.eqiad.wmnet [09:26:23] (03PS31) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [09:27:17] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-main1004.eqiad.wmnet [09:27:20] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host kafka-main1004.eqiad.wmnet [09:28:16] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [09:28:42] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10Jgiannelos) Can I also get access to superset? I can login and everything but, I need some more permissions to access the same data sources for example I have acce... 
[09:28:54] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10Jgiannelos) 05Resolved→03Open [09:30:15] (03CR) 10Btullis: [C: 03+2] Allow hive on bullseye to install and use the correct packages [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [09:30:29] (03CR) 10Btullis: Allow hive on bullseye to install and use the correct packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [09:32:27] (03PS1) 10Elukey: sre.hosts.reboot-single: fix corner case when puppet is disabled [cookbooks] - 10https://gerrit.wikimedia.org/r/902015 (https://phabricator.wikimedia.org/T325153) [09:33:31] 10SRE-Sprint-Week-Sustainability-March2023, 10DBA, 10Sustainability (Incident Followup): Automatically compare a few tables per section between hosts and DC - https://phabricator.wikimedia.org/T207253 (10ayounsi) [09:35:12] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/902015 (https://phabricator.wikimedia.org/T325153) (owner: 10Elukey) [09:36:56] !log elukey@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts kafka-main1004.eqiad.wmnet [09:38:15] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-main1004.eqiad.wmnet with OS bullseye [09:39:30] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:42:23] 10SRE-Sprint-Week-Sustainability-March2023, 10Data-Persistence, 10Sustainability (Incident Followup), 10Wikimedia-Slow-DB-Query: Optimize SpecialAllPages::showChunk for large wikis - https://phabricator.wikimedia.org/T160983 (10ayounsi) [09:45:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: 
WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [09:47:12] 10SRE-Sprint-Week-Sustainability-March2023, 10DBA, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Introduce alerting to monitor mediawiki databases QPS rate of change - https://phabricator.wikimedia.org/T281833 (10ayounsi) [09:49:00] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:50:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [09:54:12] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1004.eqiad.wmnet with reason: host reimage [09:56:37] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1004.eqiad.wmnet with reason: host reimage [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T1000) [10:06:07] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [10:07:30] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host an-test-client1002.eqiad.wmnet with OS bullseye [10:07:47] 10SRE, 
10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10dr0ptp4kt) >>! In T332063#8717190, @Jgiannelos wrote: > Can I also get access to superset? I can login and everything but, I need some more permissions to access t... [10:08:06] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:44] (03PS1) 10Filippo Giunchedi: monitoring: cosmetic-only changes to check_dpkg [puppet] - 10https://gerrit.wikimedia.org/r/902019 (https://phabricator.wikimedia.org/T332764) [10:08:46] (03PS1) 10Filippo Giunchedi: monitoring: write node-exporter dpkg_success metric [puppet] - 10https://gerrit.wikimedia.org/r/902020 (https://phabricator.wikimedia.org/T332764) [10:08:48] (03PS1) 10Filippo Giunchedi: monitoring: simplify check_dpkg [puppet] - 10https://gerrit.wikimedia.org/r/902021 (https://phabricator.wikimedia.org/T332764) [10:09:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:11:07] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [10:15:07] (03CR) 10JMeybohm: [C: 04-1] "This diff looks like it's going to break the way flink did. 
As you probably fixed that with mesh.config 1.1.1 I'd suggest to abandon this " [deployment-charts] - 10https://gerrit.wikimedia.org/r/901767 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto) [10:16:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1004.eqiad.wmnet with OS bullseye [10:16:38] (03CR) 10JMeybohm: modules: re-add base.kubernetes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/901768 (owner: 10Giuseppe Lavagetto) [10:19:23] (03CR) 10JMeybohm: [C: 03+1] "This LGTM. As said on IRC I think I8f0ffd3f4f3730a353d9ac78d5c1c65e70fe538d fixed the issue I saw when trying to update the mesh.configura" [deployment-charts] - 10https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto) [10:19:34] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:59] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main2005.codfw.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware [10:23:13] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main2005.codfw.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware [10:24:28] (03PS32) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [10:26:20] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:26:28] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10MatthewVernon) I //think// the issue is that the deleted container has different permissions: ` root@ms-fe1009:/home/mvernon# swift stat wikipedia-mediawik... 
[10:26:59] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2005.codfw.wmnet [10:27:52] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts kafka-main2005.codfw.wmnet [10:28:37] 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic, 10envoy, 10serviceops, and 2 others: Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (10Joe) a:03Joe [10:29:01] (03CR) 10Jbond: [C: 03+1] "lgtm can merge Monday after sprint week" [puppet] - 10https://gerrit.wikimedia.org/r/896318 (owner: 10Majavah) [10:29:07] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2005.codfw.wmnet [10:29:49] (03CR) 10Jbond: [V: 03+1 C: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40272/console" [puppet] - 10https://gerrit.wikimedia.org/r/896318 (owner: 10Majavah) [10:30:01] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts kafka-main2005.codfw.wmnet [10:30:06] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/901770 (https://phabricator.wikimedia.org/T298959) (owner: 10Marostegui) [10:30:48] (03PS33) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [10:32:41] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:32:57] (03CR) 10Marostegui: [C: 03+2] ferm.pp: Add dborch1002 to the firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/901770 (https://phabricator.wikimedia.org/T298959) (owner: 10Marostegui) [10:33:31] (03Abandoned) 10Jbond: sre.hosts.reboot-single: args.depool not args.pool [cookbooks] - 10https://gerrit.wikimedia.org/r/900405 (owner: 
10Jbond) [10:33:45] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10jcrespo) How to add mw:backup to local-deleted containers? (I assumed this is handled on wiki creation- I will check that on my own), but how to do if for... [10:34:02] !log `racadm racreset` for kafka-main2005 - http idrac not available (ssh on works fine) [10:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:35:46] (03PS1) 10Jbond: Revert "sre.hosts.reboot-single: set self.depool in any case" [cookbooks] - 10https://gerrit.wikimedia.org/r/902026 [10:36:05] (03PS2) 10Jbond: Revert "sre.hosts.reboot-single: set self.depool in any case" [cookbooks] - 10https://gerrit.wikimedia.org/r/902026 [10:36:06] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2005.codfw.wmnet [10:36:10] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts kafka-main2005.codfw.wmnet [10:36:30] (03PS34) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [10:37:30] (03PS3) 10Jbond: Revert "sre.hosts.reboot-single: set self.depool in any case" [cookbooks] - 10https://gerrit.wikimedia.org/r/902026 [10:38:22] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:38:36] (03CR) 10Hnowlan: [C: 03+2] changeprop: allow setting strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/901246 
(owner: 10Hnowlan) [10:38:40] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:39:41] (03PS6) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) [10:39:54] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01663 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:40:06] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10MoritzMuehlenhoff) During Sprint week I tried to evaluate a setup where we keep Pybal on Python 2 (as shipped in Bullseye) and build the Twisted packages (which no longer ship Py2 packages in Bullseye) (plus the... [10:40:58] (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [10:41:00] marostegui: I think your ferm patch might not be liked by the dbs [10:41:03] see ^^^ [10:41:08] yeah [10:41:10] reverting [10:41:16] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2005.codfw.wmnet [10:41:20] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts kafka-main2005.codfw.wmnet [10:41:24] (03PS1) 10Marostegui: Revert "ferm.pp: Add dborch1002 to the firewall rules" [puppet] - 10https://gerrit.wikimedia.org/r/902027 [10:41:26] "," expected [10:41:27] (03CR) 10Jbond: [C: 04-1] "-1: this is not the intended behaviuour. 
see" [cookbooks] - 10https://gerrit.wikimedia.org/r/902015 (https://phabricator.wikimedia.org/T325153) (owner: 10Elukey) [10:41:48] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2005.codfw.wmnet [10:41:57] (03CR) 10Marostegui: [C: 03+2] Revert "ferm.pp: Add dborch1002 to the firewall rules" [puppet] - 10https://gerrit.wikimedia.org/r/902027 (owner: 10Marostegui) [10:41:59] (03PS4) 10Jbond: Revert "sre.hosts.reboot-single: set self.depool in any case" [cookbooks] - 10https://gerrit.wikimedia.org/r/902026 (https://phabricator.wikimedia.org/T325153) [10:42:20] the syntax in that is wrong, replace `) (` with ` ` [10:42:25] (03PS35) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [10:42:28] marostegui: https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed if you need it [10:43:05] (03Merged) 10jenkins-bot: changeprop: allow setting strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/901246 (owner: 10Hnowlan) [10:43:13] volans: thanks, doing! [10:44:17] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:44:24] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10MatthewVernon) Yeah, that's a good question - I think there are about 21675 deleted containers. I think there's no automation for container management (is... 
[10:44:49] (03PS1) 10Marostegui: ferm.pp: Add dborch1002 [puppet] - 10https://gerrit.wikimedia.org/r/902046 (https://phabricator.wikimedia.org/T298959) [10:46:09] (03PS36) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [10:47:57] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:47:58] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:38] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10jcrespo) > I think there's no automation for container management Don't worry too much about details/implementation, as that is something I can solve- my... [10:48:49] (03PS37) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [10:49:16] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0009785 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:49:28] !log elukey@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts kafka-main2005.codfw.wmnet [10:49:51] (03CR) 10Jbond: "lgtm optional nit inlline" [puppet] - 10https://gerrit.wikimedia.org/r/902021 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [10:50:21] (03PS1) 10Hnowlan: changeprop-jobqueue: change deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/902048 [10:50:58] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:51:30] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/902019 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [10:51:41] 
(03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/902046 (https://phabricator.wikimedia.org/T298959) (owner: 10Marostegui) [10:51:56] (03CR) 10Marostegui: [C: 03+2] ferm.pp: Add dborch1002 [puppet] - 10https://gerrit.wikimedia.org/r/902046 (https://phabricator.wikimedia.org/T298959) (owner: 10Marostegui) [10:52:39] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/902020 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [10:52:47] (03CR) 10Jbond: [C: 03+1] monitoring: simplify check_dpkg [puppet] - 10https://gerrit.wikimedia.org/r/902021 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [10:52:50] (03PS38) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [10:53:10] (03PS1) 10Muehlenhoff: rt: Remove some old migration cruft [puppet] - 10https://gerrit.wikimedia.org/r/902049 [10:54:03] (03CR) 10Jbond: [C: 03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/889628 (owner: 10Majavah) [10:54:36] (03Abandoned) 10Elukey: sre.hosts.reboot-single: fix corner case when puppet is disabled [cookbooks] - 10https://gerrit.wikimedia.org/r/902015 (https://phabricator.wikimedia.org/T325153) (owner: 10Elukey) [10:54:50] (03CR) 10Jbond: [C: 03+2] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/890391 (owner: 10Majavah) [10:55:02] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:55:24] (03CR) 10Elukey: [C: 03+1] Revert "sre.hosts.reboot-single: set self.depool in any case" [cookbooks] - 10https://gerrit.wikimedia.org/r/902026 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [10:56:08] (03CR) 10Jbond: [C: 03+2] Revert "sre.hosts.reboot-single: set self.depool in any case" [cookbooks] - 10https://gerrit.wikimedia.org/r/902026 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [10:56:35] 10SRE, 10SRE-swift-storage, 
10Data-Persistence-Backup: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (10MatthewVernon) I wonder (but this is not a settled position) whether using an account ACL is the more elegant solution, as we do that once and it'll work f... [10:56:42] (03PS39) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [10:57:03] (03PS1) 10Marostegui: common.yaml: Add dborch1002 to MYSQL_ROOT_CLIENTS [puppet] - 10https://gerrit.wikimedia.org/r/902050 (https://phabricator.wikimedia.org/T298959) [10:58:08] (03Merged) 10jenkins-bot: Revert "sre.hosts.reboot-single: set self.depool in any case" [cookbooks] - 10https://gerrit.wikimedia.org/r/902026 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [10:58:35] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:59:04] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2005.codfw.wmnet [10:59:07] !log elukey@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts kafka-main2005.codfw.wmnet [10:59:23] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2005.codfw.wmnet [10:59:51] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-main2005.codfw.wmnet [10:59:54] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host kafka-main2005.codfw.wmnet [11:00:08] (03PS40) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [11:02:00] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [11:02:46] !log upgrader prometheus-ipmi-exporter on buster and bullseye [11:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:37] 
(03PS41) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [11:05:30] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [11:06:03] (03PS2) 10Marostegui: common.yaml: Add dborch1002 to MYSQL_ROOT_CLIENTS [puppet] - 10https://gerrit.wikimedia.org/r/902050 (https://phabricator.wikimedia.org/T298959) [11:06:39] (03PS1) 10Vgutierrez: hiera: Set haproxy->varnish connection limits on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/902051 (https://phabricator.wikimedia.org/T310609) [11:06:49] (03PS1) 10EoghanGaffney: Alert on sessionstore scheduling on non-dedicated k8s hosts [alerts] - 10https://gerrit.wikimedia.org/r/902052 (https://phabricator.wikimedia.org/T325139) [11:08:09] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts kafka-main2005.codfw.wmnet [11:08:42] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:05] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2005.codfw.wmnet [11:09:53] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-main2005.codfw.wmnet [11:10:31] (03PS42) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [11:12:22] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [11:12:40] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Aklapper) @FNavas-foundation: Hi, thanks for caring and no worries - basivcally see my comment T331482#8703089 what would be nice to do here (and feel free to elaborate... 
[11:13:48] (03PS5) 10Jbond: prometheus::ipmi_exporter: update config to use inbuilt sudo option [puppet] - 10https://gerrit.wikimedia.org/r/901680 (https://phabricator.wikimedia.org/T253810) [11:13:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/902050 (https://phabricator.wikimedia.org/T298959) (owner: 10Marostegui) [11:14:19] (03PS43) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [11:14:30] (03CR) 10Marostegui: [C: 03+2] common.yaml: Add dborch1002 to MYSQL_ROOT_CLIENTS [puppet] - 10https://gerrit.wikimedia.org/r/902050 (https://phabricator.wikimedia.org/T298959) (owner: 10Marostegui) [11:14:45] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main2005.codfw.wmnet [11:14:47] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts kafka-main2005.codfw.wmnet [11:15:10] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2005.codfw.wmnet [11:15:47] PROBLEM - Kafka Broker Server #page on kafka-main2005 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [11:15:50] PROBLEM - Kafka broker TLS certificate validity on kafka-main2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [11:16:08] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [11:16:11] elukey: ^^^ [11:16:25] wasn't silenced? 
[11:16:26] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main2005.codfw.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware [11:16:40] volans: yeah my bad, it was one hour, but the whole thing took more [11:16:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main2005.codfw.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware [11:16:57] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10MoritzMuehlenhoff) @Ottomata @odimitrijevic This needs your approval for analytics-privatedata-users [11:17:00] sorry folks [11:17:13] No problem, we'll ignore the alert :-) [11:17:20] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-main2005.codfw.wmnet [11:18:06] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:24] (03PS44) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [11:19:39] (03PS1) 10Marostegui: mariadb/ferm.pp: Add dborch1002 to the firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/902055 (https://phabricator.wikimedia.org/T298959) [11:20:15] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [11:20:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool needs to be rebooted T323961', diff saved to https://phabricator.wikimedia.org/P45910 and previous config saved to /var/cache/conftool/dbconfig/20230322-112031-root.json [11:20:41] T323961: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 [11:21:00] * volans got paged [11:21:08] Luca is doing maintenance for that host [11:21:20] firmware update [11:21:22] <_joe_> !incidents [11:21:23] 3482 (ACKED) kafka-main2005/Kafka Broker Server (paged) 
[11:21:27] (03PS45) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [11:21:27] already acked [11:21:30] in -sre [11:22:06] ack [11:22:08] people oncall, please make sure to ack the pages when you get them and are known [11:22:19] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/902055 (https://phabricator.wikimedia.org/T298959) (owner: 10Marostegui) [11:22:26] (03CR) 10Marostegui: [C: 03+2] mariadb/ferm.pp: Add dborch1002 to the firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/902055 (https://phabricator.wikimedia.org/T298959) (owner: 10Marostegui) [11:23:17] volans: Sorry, didn't think to ack it [11:23:20] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [11:24:11] (03PS1) 10Giuseppe Lavagetto: envoyproxy::tls_terminator: allow returning an HTML error page [puppet] - 10https://gerrit.wikimedia.org/r/902058 (https://phabricator.wikimedia.org/T287983) [11:24:35] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main2005.codfw.wmnet [11:24:37] !log elukey@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts kafka-main2005.codfw.wmnet [11:24:56] PROBLEM - Kafka broker TLS certificate validity on kafka-main2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [11:25:09] PROBLEM - Kafka Broker Server #page on kafka-main2005 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [11:25:21] (03PS46) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [11:25:26] again? Srsly? 
There is down time [11:25:37] anyway, kafka is up now [11:25:43] sorry for the extra alerts [11:25:44] RECOVERY - orchestrator TCP port on dborch1002 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 3000 https://wikitech.wikimedia.org/wiki/Orchestrator [11:26:16] ^ me testing [11:26:21] I am going to disable notifications for that host [11:26:50] RECOVERY - Kafka broker TLS certificate validity on kafka-main2005 is OK: SSL OK - Certificate kafka_main-codfw_broker valid until 2023-05-01 16:32:37 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [11:27:03] (03PS2) 10Giuseppe Lavagetto: envoyproxy::tls_terminator: allow returning an HTML error page [puppet] - 10https://gerrit.wikimedia.org/r/902058 (https://phabricator.wikimedia.org/T287983) [11:27:05] RECOVERY - Kafka Broker Server #page on kafka-main2005 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [11:27:18] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [11:27:28] (03PS7) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) [11:27:35] kafka already recovered, all good [11:28:06] (03PS1) 10Jbond: ipmisled: Send ipmisled logs to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/902060 (https://phabricator.wikimedia.org/T302639) [11:28:42] (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [11:28:47] (03PS47) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [11:29:23] (03PS3) 10Giuseppe Lavagetto: envoyproxy::tls_terminator: allow returning an HTML error page [puppet] - 
10https://gerrit.wikimedia.org/r/902058 (https://phabricator.wikimedia.org/T287983) [11:29:59] (03CR) 10Volans: [C: 03+1] "I have no context on the config file, but the addition LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/902060 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [11:30:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [11:30:26] (03PS8) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) [11:30:26] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:26] !log Poweroff db1121 (lag will show on wikireplicas for s4 section) T323961 [11:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:31] T323961: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 [11:30:35] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40275/console" [puppet] - 10https://gerrit.wikimedia.org/r/902058 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto) [11:30:56] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [11:31:42] (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [11:33:13] (03PS1) 10Hnowlan: thumbor: bump image [deployment-charts] - 
10https://gerrit.wikimedia.org/r/902061 (https://phabricator.wikimedia.org/T331995) [11:36:04] PROBLEM - Check whether ferm is active by checking the default input chain on ml-staging-ctrl2002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:42:12] (03CR) 10Jbond: [C: 03+2] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/901536 (https://phabricator.wikimedia.org/T330120) (owner: 10Hashar) [11:43:53] (03PS48) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [11:44:20] (03PS6) 10Hnowlan: api-gateway: add REST gateway Lua CSP handler [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) [11:44:22] (03PS2) 10Hnowlan: rest-gateway: add helmfile, enable mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) [11:47:02] (03PS9) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) [11:48:14] (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [11:52:03] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [11:53:16] (03PS49) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [11:53:36] (03PS10) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) [11:53:40] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [11:53:43] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [11:55:06] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [11:55:22] (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [11:56:42] (03PS50) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [11:57:38] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10Marostegui) db1121 is now off and ready for you @Jclark-ctr [11:58:32] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [12:00:19] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:00:22] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:00:36] (03PS1) 10MVernon: Provision the revised Swift dashboard [puppet] - 10https://gerrit.wikimedia.org/r/902064 (https://phabricator.wikimedia.org/T328872) [12:02:01] (03PS51) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [12:02:29] (03CR) 10CI reject: [V: 04-1] Provision the revised Swift dashboard [puppet] - 10https://gerrit.wikimedia.org/r/902064 (https://phabricator.wikimedia.org/T328872) (owner: 10MVernon) [12:03:51] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [12:03:52] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:03:54] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:05:20] (03PS1) 
10Stevemunene: Add an-test-client1002 dummy keytab [labs/private] - 10https://gerrit.wikimedia.org/r/902065 (https://phabricator.wikimedia.org/T329363) [12:05:46] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:05:48] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:06:22] (03CR) 10Vgutierrez: [C: 03+1] envoyproxy::tls_terminator: allow returning an HTML error page [puppet] - 10https://gerrit.wikimedia.org/r/902058 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto) [12:06:24] (03PS11) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) [12:06:47] (03PS1) 10Alexandros Kosiaris: admin_ng: Merge eqiad,codfw namespaces quotes in main [deployment-charts] - 10https://gerrit.wikimedia.org/r/902066 [12:06:49] (03PS1) 10Alexandros Kosiaris: changeprop-jobqueue: Double resource quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/902067 [12:07:37] (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [12:15:35] (03PS1) 10Alexandros Kosiaris: openstack::nutcracker: Remove redis support [puppet] - 10https://gerrit.wikimedia.org/r/902074 [12:15:53] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10SRE Observability, 10Patch-For-Review: How should we monitor for faulty memory modules? - https://phabricator.wikimedia.org/T302639 (10jbond) We have now added a [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/901674/ | cha... 
[12:17:17] (03CR) 10Hnowlan: [C: 03+2] thumbor: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/902061 (https://phabricator.wikimedia.org/T331995) (owner: 10Hnowlan) [12:17:28] (03CR) 10CI reject: [V: 04-1] openstack::nutcracker: Remove redis support [puppet] - 10https://gerrit.wikimedia.org/r/902074 (owner: 10Alexandros Kosiaris) [12:19:18] (03PS2) 10Alexandros Kosiaris: openstack::nutcracker: Remove redis support [puppet] - 10https://gerrit.wikimedia.org/r/902074 [12:19:32] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:19:35] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:21:05] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Sustainability (Incident Followup): 2023-01-10 eqsin network outage - https://phabricator.wikimedia.org/T328354 (10ayounsi) 05Open→03Invalid Closing this task as there are no direct actionable. 
[12:21:51] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40276/console" [puppet] - 10https://gerrit.wikimedia.org/r/902074 (owner: 10Alexandros Kosiaris) [12:22:06] (03CR) 10Giuseppe Lavagetto: Alert on sessionstore scheduling on non-dedicated k8s hosts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/902052 (https://phabricator.wikimedia.org/T325139) (owner: 10EoghanGaffney) [12:22:45] (03CR) 10Giuseppe Lavagetto: [C: 03+1] hiera: Set haproxy->varnish connection limits on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/902051 (https://phabricator.wikimedia.org/T310609) (owner: 10Vgutierrez) [12:26:21] (03Merged) 10jenkins-bot: thumbor: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/902061 (https://phabricator.wikimedia.org/T331995) (owner: 10Hnowlan) [12:27:22] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [12:27:26] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [12:27:29] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+1] "Note that the hosts PCC lists don't even have redis running and listening on the ports that nutcracker expects to find them." 
[puppet] - 10https://gerrit.wikimedia.org/r/902074 (owner: 10Alexandros Kosiaris) [12:32:32] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [12:33:47] (03PS5) 10Giuseppe Lavagetto: modules: re-add base.kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/901768 [12:33:49] (03PS5) 10Giuseppe Lavagetto: mesh.configuration: add support for custom error pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) [12:33:51] (03PS5) 10Giuseppe Lavagetto: charts: upgrade to mesh 1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983) [12:38:14] (03PS52) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [12:40:09] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [12:41:16] (03PS12) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) [12:41:33] (03PS53) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [12:42:29] (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [12:43:27] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [12:44:27] (03CR) 10Filippo Giunchedi: [C: 03+1] ipmisled: Send ipmisled logs to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/902060 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [12:44:29] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [12:44:40] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10Jclark-ctr) Cable was replaced yesterday with no luck. 
today performed flea power drain on db1121 [12:45:22] (03PS2) 10EoghanGaffney: Alert on sessionstore scheduling on non-dedicated k8s hosts [alerts] - 10https://gerrit.wikimedia.org/r/902052 (https://phabricator.wikimedia.org/T325139) [12:45:46] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::ipmi_exporter: update config to use inbuilt sudo option [puppet] - 10https://gerrit.wikimedia.org/r/901680 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond) [12:45:52] (03CR) 10JMeybohm: [C: 03+1] Provision the revised Swift dashboard [puppet] - 10https://gerrit.wikimedia.org/r/902064 (https://phabricator.wikimedia.org/T328872) (owner: 10MVernon) [12:46:05] (03CR) 10EoghanGaffney: Alert on sessionstore scheduling on non-dedicated k8s hosts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/902052 (https://phabricator.wikimedia.org/T325139) (owner: 10EoghanGaffney) [12:46:25] (03CR) 10MVernon: [V: 03+2 C: 03+2] Provision the revised Swift dashboard [puppet] - 10https://gerrit.wikimedia.org/r/902064 (https://phabricator.wikimedia.org/T328872) (owner: 10MVernon) [12:47:13] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10Marostegui) db1121's mgmt is reachable now [12:47:53] (03PS54) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [12:49:09] (03CR) 10Herron: [C: 03+1] application_servers/kafka: Remove IDs [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901719 (https://phabricator.wikimedia.org/T331656) (owner: 10BCornwall) [12:49:16] 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic-Icebox, 10Sustainability (Incident Followup): LVS should handle losing a NIC on eqiad and codfw - https://phabricator.wikimedia.org/T286924 (10ayounsi) Another approach is to put them in a distinct namespace (one without a default route) see {T114979} [12:49:29] (03CR) 10Herron: [C: 03+1] ipmisled: Send ipmisled logs to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/902060 
(https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [12:49:46] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [12:51:00] (03PS1) 10Majavah: labstore: drop wmde-templates-alpha volumes [puppet] - 10https://gerrit.wikimedia.org/r/902076 (https://phabricator.wikimedia.org/T332773) [12:51:36] (03PS2) 10Filippo Giunchedi: monitoring: simplify check_dpkg [puppet] - 10https://gerrit.wikimedia.org/r/902021 (https://phabricator.wikimedia.org/T332764) [12:51:41] (03CR) 10Filippo Giunchedi: monitoring: simplify check_dpkg (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902021 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [12:51:53] (03CR) 10Filippo Giunchedi: [C: 03+2] monitoring: cosmetic-only changes to check_dpkg [puppet] - 10https://gerrit.wikimedia.org/r/902019 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [12:52:02] (03CR) 10Filippo Giunchedi: [C: 03+2] monitoring: write node-exporter dpkg_success metric [puppet] - 10https://gerrit.wikimedia.org/r/902020 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [12:52:22] (03PS13) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) [12:52:24] (03PS2) 10Filippo Giunchedi: monitoring: write node-exporter dpkg_success metric [puppet] - 10https://gerrit.wikimedia.org/r/902020 (https://phabricator.wikimedia.org/T332764) [12:52:38] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [12:53:32] (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [12:53:40] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [12:53:45] (03PS55) 10Slyngshede: Squid logformat to ECS [puppet] - 
10https://gerrit.wikimedia.org/r/901544 [12:55:26] (03PS1) 10Muehlenhoff: Move udpmixircecho to Python 3 for Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702) [12:55:38] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [12:55:48] (03CR) 10CI reject: [V: 04-1] Move udpmixircecho to Python 3 for Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [12:56:21] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10serviceops, 10Wikimedia-Incident: Add etcdmirror connection retry on etcd-tls-proxy unavailability - https://phabricator.wikimedia.org/T317535 (10Volans) a:03Volans [12:56:38] (03CR) 10Filippo Giunchedi: [C: 03+2] monitoring: simplify check_dpkg [puppet] - 10https://gerrit.wikimedia.org/r/902021 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [12:56:44] (03PS3) 10Filippo Giunchedi: monitoring: simplify check_dpkg [puppet] - 10https://gerrit.wikimedia.org/r/902021 (https://phabricator.wikimedia.org/T332764) [12:58:25] (03PS6) 10Giuseppe Lavagetto: mesh.configuration: add support for custom error pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) [12:58:27] (03PS6) 10Giuseppe Lavagetto: charts: upgrade to mesh 1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983) [12:58:29] (03PS1) 10Giuseppe Lavagetto: Revert "Revert: Remove the .Values.kubernetesApi hack" [deployment-charts] - 10https://gerrit.wikimedia.org/r/902078 [13:00:03] (03PS56) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:00:06] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [13:00:55] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [13:01:00] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [13:01:53] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [13:03:13] (03PS2) 10Muehlenhoff: Move udpmixircecho to Python 3 for Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702) [13:03:36] (03CR) 10CI reject: [V: 04-1] Move udpmixircecho to Python 3 for Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [13:04:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P45912 and previous config saved to /var/cache/conftool/dbconfig/20230322-130359-root.json [13:04:43] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [13:04:49] (03CR) 10Stevemunene: [C: 03+2] Add an-test-client1002 dummy keytab [labs/private] - 10https://gerrit.wikimedia.org/r/902065 (https://phabricator.wikimedia.org/T329363) (owner: 10Stevemunene) [13:04:59] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [13:05:32] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:05:49] (03CR) 10Stevemunene: [V: 03+2 C: 03+2] Add an-test-client1002 dummy keytab [labs/private] - 10https://gerrit.wikimedia.org/r/902065 
(https://phabricator.wikimedia.org/T329363) (owner: 10Stevemunene) [13:05:57] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [13:06:26] (03PS3) 10Muehlenhoff: Move udpmixircecho to Python 3 for Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702) [13:06:29] (03PS57) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [13:06:48] (03CR) 10CI reject: [V: 04-1] Move udpmixircecho to Python 3 for Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [13:08:24] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [13:09:17] (03PS4) 10Muehlenhoff: Move udpmixircecho to Python 3 for Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702) [13:09:48] 10SRE-Sprint-Week-Sustainability-March2023, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10ayounsi) [13:11:17] (03CR) 10JMeybohm: [C: 04-1] "See comment, mesh.configuration 1.1.0 also introduced a strange looking "if and" construct with only one argument which we could clean up " [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto) [13:13:04] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:13:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [13:14:00] (03CR) 10JMeybohm: [C: 04-1] "The changes to `charts/flink/app` don't belong here but into the following CR" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/902078 (owner: 10Giuseppe Lavagetto) [13:14:18] !log xcollazo@deploy2002 Started deploy [airflow-dags/platform_eng@a83464d]: Deplying latest country_project_page DAG [13:14:30] !log xcollazo@deploy2002 Finished deploy [airflow-dags/platform_eng@a83464d]: Deplying latest country_project_page DAG (duration: 00m 12s) [13:17:36] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10Jclark-ctr) @Andrew any update on being able to reboot labstore1004 [13:18:26] (03PS58) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [13:19:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P45913 and previous config saved to /var/cache/conftool/dbconfig/20230322-131904-root.json [13:20:16] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [13:22:30] 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10Jclark-ctr) pdu's have been connected to msw in rack and scs in f8. temp sensors are installed [13:23:51] 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10Papaul) @Jclark-ctr thanks i will start setting them up. 
[13:24:11] 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10Jclark-ctr) [13:25:04] (03PS14) 10Filippo Giunchedi: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [13:25:46] (03PS1) 10Ssingh: logstash: add pybal ECS filter and tests [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) [13:27:31] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10phaultfinder) [13:28:54] 10SRE, 10Infrastructure-Foundations, 10netops: Avoid sub-optimal routing from CR routers to EVPN destinations - https://phabricator.wikimedia.org/T332781 (10cmooney) p:05Triage→03Medium [13:30:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] Alert on sessionstore scheduling on non-dedicated k8s hosts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/902052 (https://phabricator.wikimedia.org/T325139) (owner: 10EoghanGaffney) [13:31:29] (03Merged) 10jenkins-bot: Alert on sessionstore scheduling on non-dedicated k8s hosts [alerts] - 10https://gerrit.wikimedia.org/r/902052 (https://phabricator.wikimedia.org/T325139) (owner: 10EoghanGaffney) [13:32:04] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:34:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45914 and previous config saved to /var/cache/conftool/dbconfig/20230322-133409-root.json [13:35:04] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1014.mgmt.eqiad.wmnet with reboot policy FORCED [13:35:05] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/901636 [13:35:16] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/900336 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [13:36:02] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Export routes generated from ARP/ND in EVPN - https://phabricator.wikimedia.org/T329369 (10cmooney) Just a note on this task, related to T332781 If we do have stretched L2 segments across multiple LEAFs, we may wish to also export the /32... [13:37:01] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/900337 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [13:37:31] (03PS1) 10Cathal Mooney: Set BGP MED based on OSPF cost for EVPN originated routes [homer/public] - 10https://gerrit.wikimedia.org/r/902084 (https://phabricator.wikimedia.org/T332781) [13:37:48] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:38:34] (03PS6) 10Jbond: prometheus::ipmi_exporter: update config to use inbuilt sudo option [puppet] - 10https://gerrit.wikimedia.org/r/901680 (https://phabricator.wikimedia.org/T253810) [13:39:27] (03CR) 10David Caro: [C: 03+2] labstore: drop wmde-templates-alpha volumes [puppet] - 10https://gerrit.wikimedia.org/r/902076 (https://phabricator.wikimedia.org/T332773) (owner: 10Majavah) [13:39:32] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM,thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/901630 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [13:40:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40277/console" [puppet] - 10https://gerrit.wikimedia.org/r/901680 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond) [13:41:04] (03PS59) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [13:41:18] (03CR) 10Jbond: [V: 03+1 C: 03+2] prometheus::ipmi_exporter: update config to use inbuilt sudo option [puppet] - 10https://gerrit.wikimedia.org/r/901680 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond) [13:42:59] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [13:44:42] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40278/console" [puppet] - 10https://gerrit.wikimedia.org/r/902051 (https://phabricator.wikimedia.org/T310609) (owner: 10Vgutierrez) [13:45:12] (03PS60) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [13:45:24] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Set haproxy->varnish connection limits on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/902051 (https://phabricator.wikimedia.org/T310609) (owner: 10Vgutierrez) [13:46:58] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01125 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:47:07] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [13:47:26] jbond: ipmi sudo wrapper failures ^^^ [13:47:49] jbond: I think puppet broke [13:47:52] Oh, volans was faster [13:48:17] (03PS61) 10Slyngshede: Squid logformat to ECS [puppet] - 
10https://gerrit.wikimedia.org/r/901544 [13:49:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45915 and previous config saved to /var/cache/conftool/dbconfig/20230322-134913-root.json [13:50:05] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [13:51:36] (03PS62) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [13:53:26] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [13:57:14] (03PS63) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [13:57:25] 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic-Icebox, 10Sustainability (Incident Followup): LVS should handle losing a NIC on eqiad and codfw - https://phabricator.wikimedia.org/T286924 (10cmooney) >>! In T286924#8717679, @ayounsi wrote: > Another approach is to put them in a distinct namespace (one... 
[13:58:23] (03PS15) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) [13:59:09] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [13:59:36] (03PS1) 10Muehlenhoff: * Rebuild for bullseye T332584 T332589 * Move to Java 11 * Remove adduser dependency for anything but druid-common, the rest don't need it * Remove versioned druid-common dependency, we're way past 0.10 for a while * Move to debhelper 13 (which absorbed dh-systemd) [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/902092 [13:59:38] (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [14:00:34] (03PS16) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) [14:01:30] (03PS64) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [14:02:13] !log stevemunene@cumin1001 START - Cookbook sre.ganeti.reimage for host an-test-client1002.eqiad.wmnet with OS bullseye [14:03:22] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [14:04:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P45916 and previous config saved to /var/cache/conftool/dbconfig/20230322-140418-root.json [14:06:47] (03PS65) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [14:08:36] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [14:09:30] (03PS66) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [14:11:21] 
(03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [14:11:39] jouncebot: nowandnext [14:11:39] No deployments scheduled for the next 2 hour(s) and 48 minute(s) [14:11:39] In 2 hour(s) and 48 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T1700) [14:11:53] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: use correct release and app [deployment-charts] - 10https://gerrit.wikimedia.org/r/901240 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking) [14:12:55] (03PS67) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [14:13:02] !log disable Puppet on A:wikidough to roll out dnsdist.conf change [14:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:38] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0009785 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:16:37] (03Merged) 10jenkins-bot: rdf-streaming-updater: use correct release and app [deployment-charts] - 10https://gerrit.wikimedia.org/r/901240 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking) [14:17:13] !log enable Puppet on A:wikidough to roll out dnsdist.conf change [14:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:55] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-client1002.eqiad.wmnet with reason: host reimage [14:18:52] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [14:19:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45917 and previous config saved to 
/var/cache/conftool/dbconfig/20230322-141923-root.json [14:21:29] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-client1002.eqiad.wmnet with reason: host reimage [14:24:08] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:24:32] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:29:04] (03PS1) 10Muehlenhoff: Build for Bullseye and update Debian packaging [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/902097 [14:29:59] (03PS2) 10Muehlenhoff: Build for Bullseye and update Debian packaging [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/902097 [14:38:33] (03CR) 10Hashar: "Great thank you Bartosz for the confirmation." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899520 (https://phabricator.wikimedia.org/T332121) (owner: 10Hashar) [14:43:07] (03CR) 10Jforrester: [C: 03+1] "Feel free to merge and deploy as you see fit. 
:-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899520 (https://phabricator.wikimedia.org/T332121) (owner: 10Hashar) [14:43:35] (03PS3) 10Muehlenhoff: Build for Bullseye and update Debian packaging [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/902097 [14:47:45] (03PS4) 10Muehlenhoff: Build for Bullseye and update Debian packaging [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/902097 [14:48:49] (03PS1) 10Jbond: team-sre/hardware: Add alert for sel events [alerts] - 10https://gerrit.wikimedia.org/r/902103 (https://phabricator.wikimedia.org/T253810) [14:49:13] 10SRE-Sprint-Week-Sustainability-March2023, 10ChangeProp, 10serviceops, 10Kubernetes, 10Sustainability (Incident Followup): Raise an alarm on container restarts/OOMs in kubernetes - https://phabricator.wikimedia.org/T256256 (10akosiaris) a:03akosiaris [14:50:48] (03CR) 10Giuseppe Lavagetto: [C: 03+1] changeprop-jobqueue: change deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/902048 (owner: 10Hnowlan) [14:51:17] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: change deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/902048 (owner: 10Hnowlan) [14:53:17] (03CR) 10Jbond: [C: 03+2] ipmisled: Send ipmisled logs to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/902060 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [14:53:19] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:54:08] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:57:00] (03Merged) 10jenkins-bot: changeprop-jobqueue: change deployment strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/902048 (owner: 10Hnowlan) [14:57:33] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [14:57:48] !log hnowlan@deploy2002 helmfile [staging] DONE 
helmfile.d/services/changeprop-jobqueue: apply [14:58:20] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [14:59:10] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [14:59:21] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [14:59:37] (03CR) 10Jbond: team-sre/hardware: Add alert for sel events (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/902103 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond) [15:00:12] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [15:04:11] (03PS3) 10Volans: es_exporter: add NEL metrics by country [puppet] - 10https://gerrit.wikimedia.org/r/901220 (https://phabricator.wikimedia.org/T328941) [15:04:13] (03PS1) 10Volans: superset: add static html for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/902107 (https://phabricator.wikimedia.org/T310009) [15:07:29] 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic, 10conftool, 10Patch-For-Review, 10Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009 (10Volans) [15:07:51] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/902107 (https://phabricator.wikimedia.org/T310009) (owner: 10Volans) [15:08:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:08:49] (03PS1) 10Effie Mouzeli: maps: remove OSM Synchronisation Lag alerts [puppet] - 10https://gerrit.wikimedia.org/r/902109 (https://phabricator.wikimedia.org/T285328) [15:12:07] 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic, 10conftool, 
10Patch-For-Review, 10Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009 (10Volans) I've sent a small improvement proposal in the above patch, let me know what... [15:13:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:13:45] !log removing cassandra packages from maps hosts [15:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:43] (03CR) 10EoghanGaffney: [C: 03+2] Relax nodeAffinity of sessionstore pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/901572 (https://phabricator.wikimedia.org/T325139) (owner: 10EoghanGaffney) [15:15:55] (03CR) 10Filippo Giunchedi: [C: 03+1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [15:16:13] (03CR) 10Filippo Giunchedi: [C: 03+1] "\o/ LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/902109 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [15:17:44] (03CR) 10Effie Mouzeli: [C: 03+2] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [15:17:50] !log eoghan@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply [15:17:53] !log eoghan@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [15:18:12] (03CR) 10Effie Mouzeli: [C: 03+2] maps: remove OSM Synchronisation Lag alerts [puppet] - 10https://gerrit.wikimedia.org/r/902109 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [15:18:56] (03Merged) 10jenkins-bot: maps: add alerting for OSM 
sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [15:20:29] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Mail, 10Observability-Metrics, 10Sustainability (Incident Followup): Add exim queue size to grafana graph - https://phabricator.wikimedia.org/T275867 (10akosiaris) Let me note that we also have an alert on `exim_queue_length` per... [15:21:11] 10SRE-Sprint-Week-Sustainability-March2023, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10Sustainability (Incident Followup): Use/adopt search cluster ES management cookbooks for logging ES too - https://phabricator.wikimedia.org/T255864 (10colewhite) >>! In T255864#8717171, @ayounsi... [15:22:04] (03PS1) 10Jbond: P:monitoring: drop check for filesystem_avail_bigger_than_size [puppet] - 10https://gerrit.wikimedia.org/r/902110 (https://phabricator.wikimedia.org/T302687) [15:22:48] !log removing java packages from maps hosts [15:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:59] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Avoid sub-optimal routing from CR routers to EVPN destinations - https://phabricator.wikimedia.org/T332781 (10cmooney) [15:23:28] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2004.codfw.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware [15:23:32] (03CR) 10BCornwall: [V: 03+2 C: 03+2] application_servers/kafka: Remove IDs [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901719 (https://phabricator.wikimedia.org/T331656) (owner: 10BCornwall) [15:23:41] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2004.codfw.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware [15:23:56] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Avoid sub-optimal routing from CR routers to EVPN destinations - 
https://phabricator.wikimedia.org/T332781 (10cmooney) [15:24:54] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/902111 [15:25:21] (03PS2) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/902111 [15:25:53] !log eoghan@deploy2002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [15:25:54] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2004.codfw.wmnet [15:25:55] !log eoghan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [15:26:51] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts kafka-main2004.codfw.wmnet [15:27:44] !log `racadm racreset` for kafka-main2004 (no http idrac available for the cookbook, ssh one available) [15:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:45] !log jhathaway@cumin1001 START - Cookbook sre.ganeti.reimage for host dborch1001.wikimedia.org with OS bullseye [15:30:52] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2004.codfw.wmnet [15:30:57] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts kafka-main2004.codfw.wmnet [15:31:21] (03CR) 10Muehlenhoff: [C: 03+2] Build for Bullseye and update Debian packaging [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/902097 (owner: 10Muehlenhoff) [15:31:41] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/902111 (owner: 10Muehlenhoff) [15:31:56] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2004.codfw.wmnet [15:32:24] (03Abandoned) 10Muehlenhoff: * Rebuild for bullseye T332584 T332589 * Move to Java 11 * Remove adduser dependency for anything but druid-common, the rest don't need it * Remove versioned 
druid-common dependency, we're way past 0.10 for a while * Move to debhelper 13 (which absorbed dh-systemd) [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/902092 (owner: 10Muehlenhoff) [15:35:21] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Gerrit, 10serviceops-collab, and 3 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) > Hashar: alternatively, this could be closed, as per title scope and the mentioned work could be filed on a sep... [15:36:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:37:02] (03PS1) 10EoghanGaffney: Fix preferredDuringScheduling[...] change for sessionstore [deployment-charts] - 10https://gerrit.wikimedia.org/r/902114 (https://phabricator.wikimedia.org/T325139) [15:37:40] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10MoritzMuehlenhoff) >>! In T332063#8717190, @Jgiannelos wrote: > Can I also get access to superset? I can login and everything but, I need some more permissions to... 
[15:39:35] !log elukey@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts kafka-main2004.codfw.wmnet [15:39:50] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2004.codfw.wmnet [15:40:27] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-main2004.codfw.wmnet [15:41:38] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dborch1001.wikimedia.org with reason: host reimage [15:44:09] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dborch1001.wikimedia.org with reason: host reimage [15:44:43] RECOVERY - Check systemd state on elastic1099 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:29] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main2004.codfw.wmnet [15:46:30] !log elukey@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts kafka-main2004.codfw.wmnet [15:46:49] PROBLEM - Host kafka-main2004 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:49] RECOVERY - Host kafka-main2004 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [15:46:52] PROBLEM - Kafka Broker Server #page on kafka-main2004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [15:46:54] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main2004.codfw.wmnet [15:47:33] good morning 👋 [15:47:34] folks I am sorry for the page but I have downtimed the node for 2 hours [15:47:43] not really sure why it paged now [15:47:45] PROBLEM - Kafka broker TLS certificate validity on kafka-main2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection 
refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [15:47:57] got it, thanks <3 need anything? [15:48:08] nono regular maintenance, I am upgrading bios etc.. [15:48:14] 👍 [15:48:22] (03Abandoned) 10Aklapper: Phabricator: Disable setting lowest priority on tasks [puppet] - 10https://gerrit.wikimedia.org/r/699493 (https://phabricator.wikimedia.org/T228759) (owner: 10Aklapper) [15:48:25] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-main2004.codfw.wmnet [15:48:41] I like to think it is because we're being punished for paging on ps | grep [15:50:57] RECOVERY - Check systemd state on cp1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:21] RECOVERY - Check systemd state on ml-serve1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:36] !log uploaded druid 0.19.wmf0-2 to bullseye-wikimedia T332584 T332589 [15:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:43] T332584: Upgrade an-test-druid1001 to bullseye - https://phabricator.wikimedia.org/T332584 [15:53:43] T332589: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 [15:56:11] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-main2004.codfw.wmnet [15:56:13] !log elukey@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts kafka-main2004.codfw.wmnet [15:56:24] PROBLEM - Kafka Broker Server #page on kafka-main2004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [15:56:33] (03CR) 10Alexandros Kosiaris: [C: 03+1] Fix preferredDuringScheduling[...] 
change for sessionstore [deployment-charts] - 10https://gerrit.wikimedia.org/r/902114 (https://phabricator.wikimedia.org/T325139) (owner: 10EoghanGaffney) [15:56:44] elukey: :_) [15:56:57] PROBLEM - Kafka broker TLS certificate validity on kafka-main2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [15:57:17] vgutierrez: I know, not really sure what to do, I downtimed for two hours, and now it pages [15:57:44] kafka is up now, there is probably something that escapes the downtime logic, or I missed something [15:58:07] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host dborch1001.wikimedia.org with OS bullseye [15:58:18] RECOVERY - Kafka Broker Server #page on kafka-main2004 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [15:58:30] fwiw, that second #p.age didn't actually come though victorops [15:58:49] RECOVERY - Kafka broker TLS certificate validity on kafka-main2004 is OK: SSL OK - Certificate kafka_main-codfw_broker valid until 2023-05-01 16:32:37 +0000 (expires in 40 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [15:58:55] or rather, it's there, but it seems to be a second email under the same incident [16:00:23] rzl: interesting, let me check [16:00:31] all to say, I'm not sure if it's actually a new alert or just a re-notification of the previous one for some reason 🤷 I wouldn't sweat it too much, especially given that'll be moved to alertmanager anyhow [16:01:07] jynus: if you like! 
elukey is the one working on it though, I'm not really looking :) [16:01:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [16:01:19] no, I mean the victorops stuff [16:01:29] (03PS1) 10JHathaway: repackage for bullseye [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/902116 [16:01:54] (03CR) 10JHathaway: [C: 03+2] repackage for bullseye [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/902116 (owner: 10JHathaway) [16:01:56] (03CR) 10JHathaway: [V: 03+2 C: 03+2] repackage for bullseye [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/902116 (owner: 10JHathaway) [16:01:57] trying to understand the events [16:04:18] (03CR) 10Cwhite: [C: 03+2] logstash: add mmkubernetes ECS early-stage filter [puppet] - 10https://gerrit.wikimedia.org/r/901630 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [16:04:20] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [16:05:39] (03CR) 10Filippo Giunchedi: team-sre/hardware: Add alert for sel events (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/902103 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond) [16:11:04] (03PS3) 10Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) [16:12:39] (03CR) 10CI reject: [V: 04-1] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney) [16:15:12] (03CR) 10EoghanGaffney: [C: 03+2] Fix preferredDuringScheduling[...] change for sessionstore [deployment-charts] - 10https://gerrit.wikimedia.org/r/902114 (https://phabricator.wikimedia.org/T325139) (owner: 10EoghanGaffney) [16:18:37] !log eoghan@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply [16:18:40] !log eoghan@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [16:19:45] !log eoghan@deploy2002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [16:20:33] (03PS2) 10Clément Goubert: cpufrequtils: Force reload init script on change [puppet] - 10https://gerrit.wikimedia.org/r/900645 [16:23:27] jynus: sorry I was in a meeting, so I have both times [16:23:35] 1) downtimed with the cookbook [16:23:44] 2) stopped kafka etc.. on the node + puppet disabled [16:24:01] 3) run the firmware upgrade cookbooks [16:24:15] this morning I've set 1 hour of downtime and it expired, my bad [16:24:29] but today it was two hours, and I am sure I was into the right time window [16:24:29] !log eoghan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [16:24:42] (03CR) 10Btullis: [C: 03+1] "Many thanks indeed." 
[debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/902097 (owner: 10Muehlenhoff) [16:25:00] it may be all those icinga/nagios ps|etcc based alerts that just need to be removed [16:25:08] (as godo*g mentioned earlier on) [16:25:38] (03PS1) 10JMeybohm: kubernetes: Remove old kubernetes metric from alerts [alerts] - 10https://gerrit.wikimedia.org/r/902120 (https://phabricator.wikimedia.org/T322919) [16:26:55] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:27:07] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:27:33] (03PS2) 10JMeybohm: kubernetes: Remove old kubernetes metric from alerts [alerts] - 10https://gerrit.wikimedia.org/r/902120 (https://phabricator.wikimedia.org/T322919) [16:27:47] (03PS1) 10Marostegui: Revert "mariadb/ferm.pp: Add dborch1002 to the firewall rules" [puppet] - 10https://gerrit.wikimedia.org/r/902043 [16:27:54] (03PS1) 10Marostegui: Revert "common.yaml: Add dborch1002 to MYSQL_ROOT_CLIENTS" [puppet] - 10https://gerrit.wikimedia.org/r/902044 [16:28:01] (03PS1) 10Marostegui: Revert "ferm.pp: Add dborch1002" [puppet] - 10https://gerrit.wikimedia.org/r/902045 [16:28:03] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:28:16] jhathaway: I am going to revert the patches used for dborch1002 testing for now [16:28:34] marostegui: sounds good [16:28:49] (03CR) 10Marostegui: [C: 03+2] Revert "common.yaml: Add dborch1002 to MYSQL_ROOT_CLIENTS" [puppet] - 10https://gerrit.wikimedia.org/r/902044 (owner: 
10Marostegui) [16:28:53] (03CR) 10Marostegui: [C: 03+2] Revert "ferm.pp: Add dborch1002" [puppet] - 10https://gerrit.wikimedia.org/r/902045 (owner: 10Marostegui) [16:29:29] (03PS2) 10Marostegui: Revert "mariadb/ferm.pp: Add dborch1002 to the firewall rules" [puppet] - 10https://gerrit.wikimedia.org/r/902043 [16:29:57] (03PS4) 10Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) [16:30:46] jhathaway: how are things looking on your side? [16:31:03] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/902043 (owner: 10Marostegui) [16:31:06] okay, just need to figure out how to make dh_golang happy [16:31:17] haha [16:31:17] (03CR) 10CI reject: [V: 04-1] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney) [16:31:24] (03CR) 10Marostegui: [C: 03+2] Revert "mariadb/ferm.pp: Add dborch1002 to the firewall rules" [puppet] - 10https://gerrit.wikimedia.org/r/902043 (owner: 10Marostegui) [16:33:04] (03PS2) 10Cathal Mooney: Enable OSPF check by default for l3 switch mgmt interfaces [puppet] - 10https://gerrit.wikimedia.org/r/900431 (https://phabricator.wikimedia.org/T315053) [16:35:40] !log rolling downgrade to HAProxy 2.6.9 in text@esams - T332796 [16:35:43] marostegui: new version is installed, if you would like to take a gander [16:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:45] T332796: HAProxy 2.6.10 crashing in the text cluster - https://phabricator.wikimedia.org/T332796 [16:36:36] (03CR) 10Cathal Mooney: [C: 03+2] Enable OSPF check by default for l3 switch mgmt interfaces [puppet] - 10https://gerrit.wikimedia.org/r/900431 (https://phabricator.wikimedia.org/T315053) (owner: 10Cathal Mooney) [16:36:40] jhathaway: checking [16:36:57] (03CR) 10Jbond: "thanks" [alerts] - 
10https://gerrit.wikimedia.org/r/902103 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond) [16:37:15] !log eoghan@deploy2002 helmfile [codfw] START helmfile.d/services/sessionstore: apply [16:37:27] jhathaway: looking good! \o/ [16:37:28] (03CR) 10Jbond: [C: 03+2] P:monitoring: drop check for filesystem_avail_bigger_than_size [puppet] - 10https://gerrit.wikimedia.org/r/902110 (https://phabricator.wikimedia.org/T302687) (owner: 10Jbond) [16:37:37] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:37:44] marostegui: great [16:37:46] !log eoghan@deploy2002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [16:38:09] jhathaway: you can destroy dborch1002 [16:38:21] marostegui: great, will do [16:38:23] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:38:30] jhathaway: And close the task too \o/ [16:38:35] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:38:44] marostegui: thanks for the help [16:39:29] jhathaway: no, thank you for taking on that task! 
[16:39:55] jhathaway: If you could comment on the last issue, for future references [16:40:01] Before closing the task, that'd be great [16:40:08] will do [16:40:18] thanks [16:40:58] (03CR) 10Giuseppe Lavagetto: Revert "Revert: Remove the .Values.kubernetesApi hack" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/902078 (owner: 10Giuseppe Lavagetto) [16:42:22] (03PS5) 10Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) [16:42:32] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync [16:42:47] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [16:42:53] elukey@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [16:43:33] (03CR) 10CI reject: [V: 04-1] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney) [16:43:40] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] envoyproxy::tls_terminator: allow returning an HTML error page [puppet] - 10https://gerrit.wikimedia.org/r/902058 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto) [16:44:55] 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech: Wikidata seems to still be utilizing insecure HTTP URIs - https://phabricator.wikimedia.org/T331356 (10MisterSynergy) Some remarks: * We should consider these canonical HTTP URIs to be //names// in the first place, which are unique worldwide and issued by the Wiki... 
[16:45:07] !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@6cbc3bc]: (no justification provided) [16:45:19] !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@6cbc3bc]: (no justification provided) (duration: 00m 12s) [16:47:40] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Orchestrator, and 2 others: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (10jhathaway) @jbond this should be fixed, following the Puppet 7 upgrade. Do we have any way of noting post puppet 7 followup t... [16:49:40] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [16:49:56] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [16:50:33] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Orchestrator, and 2 others: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (10jbond) @jhathaway not currently but we could request a new tag or possibly a milestone. @Aklapper are you able to offer any... 
[16:51:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:51:12] (03PS6) 10Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) [16:51:46] (03PS1) 10JHathaway: update changelog [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/902122 [16:52:12] (03CR) 10JHathaway: [C: 03+2] update changelog [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/902122 (owner: 10JHathaway) [16:52:20] (03CR) 10JHathaway: [V: 03+2 C: 03+2] update changelog [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/902122 (owner: 10JHathaway) [16:52:26] (03CR) 10CI reject: [V: 04-1] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney) [16:54:37] (03PS7) 10Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) [16:54:45] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:55:20] (03CR) 10Jforrester: [C: 03+1] doc: upgrade php from 7.3 to 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [16:55:31] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Is this needed? 
I think that the service (which has status parameter) of" [puppet] - 10https://gerrit.wikimedia.org/r/900645 (owner: 10Clément Goubert) [16:55:49] (03CR) 10CI reject: [V: 04-1] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney) [16:56:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:57:48] (03PS1) 10Elukey: services: stop changeprop's lift wing test [deployment-charts] - 10https://gerrit.wikimedia.org/r/902123 (https://phabricator.wikimedia.org/T328576) [16:57:57] 10SRE-Sprint-Week-Sustainability-March2023, 10serviceops, 10Sustainability (Incident Followup): Relax nodeAffinity of sessionstore - https://phabricator.wikimedia.org/T325139 (10eoghan) We've deployed the change to relax the `nodeAffinity` setting, tomorrow morning we'll drain one of the nodes to test that t... [16:59:15] 10SRE, 10Traffic, 10Upstream: HAProxy 2.6.10 crashing in the text cluster - https://phabricator.wikimedia.org/T332796 (10Vgutierrez) Gathering data on esams after downgrading: ` vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp-text_esams' 'apt-cache policy haproxy|grep Installed' 8 hosts will be targeted: cp[3050... 
[16:59:23] (03CR) 10Hnowlan: [C: 03+1] services: stop changeprop's lift wing test [deployment-charts] - 10https://gerrit.wikimedia.org/r/902123 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [16:59:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T1700) [17:00:08] (03PS1) 10JHathaway: add .gitreview [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/902124 [17:00:22] (03PS8) 10Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) [17:00:24] (03CR) 10JHathaway: [C: 03+2] add .gitreview [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/902124 (owner: 10JHathaway) [17:00:26] (03CR) 10JHathaway: [V: 03+2 C: 03+2] add .gitreview [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/902124 (owner: 10JHathaway) [17:01:33] (03CR) 10CI reject: [V: 04-1] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney) [17:02:06] (03PS1) 10JHathaway: Revert "dborch: allow dborch1002 to issue an ssl cert" [puppet] - 10https://gerrit.wikimedia.org/r/902125 [17:02:26] (03CR) 10Elukey: [C: 03+2] services: stop changeprop's lift wing test [deployment-charts] - 10https://gerrit.wikimedia.org/r/902123 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [17:02:28] (03CR) 10JHathaway: [C: 03+2] Revert "dborch: allow dborch1002 to issue an ssl cert" [puppet] - 10https://gerrit.wikimedia.org/r/902125 (owner: 10JHathaway) [17:03:02] 
(03CR) 10Clément Goubert: cpufrequtils: Force reload init script on change (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900645 (owner: 10Clément Goubert) [17:04:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:04:57] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [17:05:15] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [17:05:34] !log jhathaway@cumin1001 START - Cookbook sre.hosts.decommission for hosts dborch1002.wikimedia.org [17:06:02] (03PS1) 10Giuseppe Lavagetto: mediawiki::errorpage: allow avoiding percentage sizes. [puppet] - 10https://gerrit.wikimedia.org/r/902126 [17:06:26] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync [17:06:39] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [17:06:47] I am doing a noop pull of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/899520 [17:06:58] (03PS1) 10JHathaway: Revert "Add a dborch vm for testing the bullseye upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/902127 (https://phabricator.wikimedia.org/T289657) [17:07:11] (03CR) 10Hashar: [C: 03+2] "Thanks. I will pull it on the deployment server." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/899520 (https://phabricator.wikimedia.org/T332121) (owner: 10Hashar) [17:07:14] (03CR) 10JHathaway: [C: 03+2] Revert "Add a dborch vm for testing the bullseye upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/902127 (https://phabricator.wikimedia.org/T289657) (owner: 10JHathaway) [17:07:28] it has a script to composer.json and thus have no effect to production [17:08:01] (03Merged) 10jenkins-bot: build: add local typos check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899520 (https://phabricator.wikimedia.org/T332121) (owner: 10Hashar) [17:08:17] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40282/console" [puppet] - 10https://gerrit.wikimedia.org/r/902126 (owner: 10Giuseppe Lavagetto) [17:08:21] (03PS2) 10JHathaway: Revert "Add a dborch vm for testing the bullseye upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/902127 (https://phabricator.wikimedia.org/T298959) [17:09:37] and of course something is breaking :/ [17:09:43] !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox [17:09:46] failed to register layer: devmapper: Thin Pool has 94446 free data blocks which is less than minimum required 163840 free data blocks. Create more free space in thin pool or use dm.min_free_space option to change behavior [17:10:51] 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10netops: Enable OSPF Icinga check for EVPN based switches - https://phabricator.wikimedia.org/T315053 (10cmooney) 05Open→03Resolved I've merged the patch and the EVPN switches are now being checked by Icinga, all looks healthy. 
[17:12:09] !log jhathaway@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dborch1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jhathaway@cumin1001" [17:14:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:15:07] RECOVERY - Check whether ferm is active by checking the default input chain on ml-staging-ctrl2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:15:25] (03CR) 10AOkoth: eventgate: add EventgateLoggingExternalErrors alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth) [17:15:30] !log hashar@deploy2002 Synchronized composer.json: build: add local typos check to composer.json # T332121 (duration: 06m 44s) [17:15:36] T332121: Migrate CI job operations-mw-config-typos-docker job to be inside operations/mediawiki-config - https://phabricator.wikimedia.org/T332121 [17:17:10] (03PS2) 10Ssingh: logstash: add pybal ECS filter and tests [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) [17:17:27] (03CR) 10Ssingh: "The tests are failing, see the comment inline:" [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) (owner: 10Ssingh) [17:18:12] (03CR) 10Ssingh: logstash: add pybal ECS filter and tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) (owner: 10Ssingh) [17:18:42] (03PS3) 10Ssingh: logstash: add pybal ECS filter and tests [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) [17:19:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - 
https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:19:47] (03PS2) 10Giuseppe Lavagetto: mediawiki::errorpage: allow avoiding percentage sizes. [puppet] - 10https://gerrit.wikimedia.org/r/902126 [17:20:31] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:20:35] (03CR) 10CI reject: [V: 04-1] logstash: add pybal ECS filter and tests [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) (owner: 10Ssingh) [17:20:37] 10SRE-Sprint-Week-Sustainability-March2023, 10serviceops, 10Sustainability (Incident Followup): Expand upon Kask/Sessionstore documentation - https://phabricator.wikimedia.org/T320398 (10akosiaris) >>! In T320398#8711722, @Eevans wrote: > TL;DR Is there someone(s) —who isn't as close to this as I am— who has... [17:20:59] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40283/console" [puppet] - 10https://gerrit.wikimedia.org/r/902126 (owner: 10Giuseppe Lavagetto) [17:22:51] (03PS3) 10Giuseppe Lavagetto: mediawiki::errorpage: allow avoiding percentage sizes. [puppet] - 10https://gerrit.wikimedia.org/r/902126 [17:23:41] (03CR) 10Hnowlan: [C: 03+2] thumbor: bump workers, reduce CPU, increase queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/900388 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan) [17:24:02] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40284/console" [puppet] - 10https://gerrit.wikimedia.org/r/902126 (owner: 10Giuseppe Lavagetto) [17:27:40] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::errorpage: allow avoiding percentage sizes. 
[puppet] - 10https://gerrit.wikimedia.org/r/902126 (owner: 10Giuseppe Lavagetto) [17:29:24] (03PS6) 10AOkoth: eventgate: add EventgateLoggingExternalErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) [17:30:34] (03CR) 10CI reject: [V: 04-1] eventgate: add EventgateLoggingExternalErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth) [17:36:59] (03PS1) 10Sergio Gimeno: GrowthExperiments: disable add a link backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902131 (https://phabricator.wikimedia.org/T304551) [17:38:08] <_joe_> !log stopping apache on mwdebug1001 to test the new envoy error page [17:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:46] (03PS7) 10AOkoth: eventgate: add EventgateLoggingExternalErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) [17:42:59] (03CR) 10CI reject: [V: 04-1] eventgate: add EventgateLoggingExternalErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth) [17:43:01] (03CR) 10Dzahn: [C: 03+2] miscweb: switch security.wm.org microsite to miscweb2003 [puppet] - 10https://gerrit.wikimedia.org/r/901320 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn) [17:45:22] 10SRE, 10MediaWiki-extensions-OAuth, 10Performance-Team, 10Datacenter-Switchover, 10Patch-For-Review: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10MusikAnimal) >>! In T332650#8716712, @Tgr wrote: >>>... 
[17:45:32] (03PS8) 10AOkoth: eventgate: add EventgateLoggingExternalErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) [17:46:43] (03CR) 10CI reject: [V: 04-1] eventgate: add EventgateLoggingExternalErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth) [17:48:18] (03CR) 10SBassett: api-gateway: add REST gateway Lua CSP handler (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [17:53:08] (03PS2) 10Hnowlan: thumbor: bump workers, reduce CPU, increase queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/900388 (https://phabricator.wikimedia.org/T328033) [17:53:40] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dborch1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jhathaway@cumin1001" [17:53:40] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:53:40] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dborch1002.wikimedia.org [17:53:50] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jhathaway@cumin1001 for hosts: `dborch1002.wikimedia.org` - dborch1002.... 
[17:54:48] 10SRE, 10MediaWiki-extensions-OAuth, 10Performance-Team, 10Datacenter-Switchover, 10Patch-For-Review: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10MusikAnimal) [17:54:52] (03PS1) 10Dduvall: buildkitd: Isolate build container user/process/network namespaces [puppet] - 10https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) [17:57:35] (03PS2) 10Dduvall: buildkitd: Isolate build container user/process/network namespaces [puppet] - 10https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) [18:00:05] dancy and brennen: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T1800). [18:00:05] dancy and brennen: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T1800). [18:02:48] (03PS1) 10Jbond: sre.hardware.sel: add simple cookbook for querying the SEL [cookbooks] - 10https://gerrit.wikimedia.org/r/902135 [18:04:23] 10SRE, 10MediaWiki-extensions-OAuth, 10Performance-Team, 10Datacenter-Switchover, 10Patch-For-Review: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10Tgr) >>! In T332650#8718814, @MusikAnimal wrote: > I... [18:05:20] Alright... let's see what happens. 
[18:07:36] dancy: tell me if I need to clean up space [18:07:42] I can't fix the actual problem tonight [18:08:24] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902136 (https://phabricator.wikimedia.org/T330207) [18:08:26] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902136 (https://phabricator.wikimedia.org/T330207) (owner: 10TrainBranchBot) [18:08:28] claime: OK! [18:08:48] (03CR) 10Hnowlan: api-gateway: add REST gateway Lua CSP handler (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [18:09:11] (03CR) 10Hnowlan: thumbor: bump workers, reduce CPU, increase queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/900388 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan) [18:09:14] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902136 (https://phabricator.wikimedia.org/T330207) (owner: 10TrainBranchBot) [18:09:16] (03CR) 10Hnowlan: [C: 03+2] thumbor: bump workers, reduce CPU, increase queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/900388 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan) [18:11:31] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Dzahn) [18:12:38] !log rsyncing /srv/org/wikimedia/sitemaps files for https://sitemaps.wikimedia.org from old to new machines. most other things are auto-deployed by puppet or puppet running intial scap or automatic rsync.. this is not. 
rsync -av /srv/org/wikimedia/sitemaps/ rsync://miscweb2003.codfw.wmnet/miscapps-srv/org/wikimedia/sitemaps/ T331896 - but also see T332101 [18:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:46] T332101: determine whether https://sitemaps.wikimedia.org still serves a purpose - https://phabricator.wikimedia.org/T332101 [18:12:46] T331896: upgrade miscweb VMs to bullseye - https://phabricator.wikimedia.org/T331896 [18:14:27] (03Merged) 10jenkins-bot: thumbor: bump workers, reduce CPU, increase queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/900388 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan) [18:16:06] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.1 refs T330207 [18:16:12] T330207: 1.41.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T330207 [18:16:32] dancy: I'll clean up space pre-emptively [18:17:00] So I saw those same messages that hashar reported. [18:17:14] Then not pre-emptively :D [18:18:16] basically it just won't schedule mw containers on these hosts [18:18:20] Because it can't pull the images [18:18:34] are the new pods otherwise deployed? [18:19:13] (03PS4) 10Cwhite: logstash: add pybal ECS filter and tests [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) (owner: 10Ssingh) [18:19:17] Yeah, just not to these hosts [18:19:29] There are no mw pods deployed to them that I can see [18:19:31] okay great.. k8s working the way its supposed to [18:20:18] (03CR) 10Dzahn: [C: 03+2] miscweb: switch sitemaps, transparency and tr-archives to miscweb2003 [puppet] - 10https://gerrit.wikimedia.org/r/901321 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn) [18:20:41] Thanks for being around claime. 
[18:21:03] (03CR) 10CI reject: [V: 04-1] logstash: add pybal ECS filter and tests [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) (owner: 10Ssingh) [18:21:08] (03CR) 10Cwhite: "Tests still failing because the legacy expected output doesn't quite match yet, but the ECS one does!" [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) (owner: 10Ssingh) [18:21:12] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/901639 [18:29:03] (03CR) 10Herron: "Something to consider also re: SNR is some logs represent good events like 'PS Redundancy | Power Supply | Fully Redunda" [alerts] - 10https://gerrit.wikimedia.org/r/902103 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond) [18:36:43] (03PS1) 10Dzahn: miscweb: move transparency httpd site templates out of role/apache [puppet] - 10https://gerrit.wikimedia.org/r/902140 [18:39:03] (03PS1) 10Dzahn: miscweb: move simplestatic.erb out of role/templates/apache/sites/ [puppet] - 10https://gerrit.wikimedia.org/r/902141 [18:41:55] (03PS1) 10Dzahn: miscweb: move os_reports httpd template to profile/microsites/ [puppet] - 10https://gerrit.wikimedia.org/r/902142 [18:43:33] (03PS2) 10Dzahn: miscweb: move os_reports httpd template to profile/microsites/ [puppet] - 10https://gerrit.wikimedia.org/r/902142 [18:44:16] (03CR) 10Dzahn: [C: 04-2] "template goes to ./templates/ not manifests ..fixing later" [puppet] - 10https://gerrit.wikimedia.org/r/902141 (owner: 10Dzahn) [18:46:33] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/901640 [18:46:35] (03PS1) 10Dzahn: miscweb: add custom and error log for os-reports.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/902144 [18:47:29] (03PS1) 10Nray: Enable pinning for anon main menu when page tools is enabled [skins/Vector] (wmf/1.40.0-wmf.27) - 
10https://gerrit.wikimedia.org/r/902150 (https://phabricator.wikimedia.org/T331657) [18:49:54] (03CR) 10Ahmon Dancy: [C: 03+1] buildkitd: Isolate build container user/process/network namespaces [puppet] - 10https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: 10Dduvall) [18:51:36] (03PS2) 10Nray: Enable page tools for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900748 (https://phabricator.wikimedia.org/T331052) [18:53:06] (03PS1) 10Dzahn: miscweb: add custom and error log for transparency and archives [puppet] - 10https://gerrit.wikimedia.org/r/902166 [18:58:22] (03CR) 10SBassett: api-gateway: add REST gateway Lua CSP handler (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan) [19:02:56] (03PS1) 10Dzahn: miscweb: switch research.wikimedia.org to bullseye backend [puppet] - 10https://gerrit.wikimedia.org/r/902167 (https://phabricator.wikimedia.org/T331896) [19:03:56] (03PS1) 10Dzahn: miscweb: switch wikiworkshop.org to bullseye backend [puppet] - 10https://gerrit.wikimedia.org/r/902169 (https://phabricator.wikimedia.org/T331896) [19:04:54] (03PS1) 10Dzahn: miscweb: switch design.wikimedia.org to bullseye backend [puppet] - 10https://gerrit.wikimedia.org/r/902170 (https://phabricator.wikimedia.org/T331896) [19:05:39] (03PS1) 10Dzahn: miscweb: switch os-reports.wikimedia.org to bullseye backend [puppet] - 10https://gerrit.wikimedia.org/r/902172 (https://phabricator.wikimedia.org/T331896) [19:06:56] (03PS1) 10Dzahn: miscweb: switch static-codereview to bullseye backend [puppet] - 10https://gerrit.wikimedia.org/r/902174 (https://phabricator.wikimedia.org/T331896) [19:08:29] (03PS1) 10Dzahn: delete webserver-misc-static.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/902175 [19:15:31] (03CR) 10Jdlrobson: [C: 03+1] Enable pinning for anon main menu when page tools is enabled [skins/Vector] (wmf/1.40.0-wmf.27) - 
10https://gerrit.wikimedia.org/r/902150 (https://phabricator.wikimedia.org/T331657) (owner: 10Nray) [19:28:16] 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech: Wikidata seems to still be utilizing insecure HTTP URIs - https://phabricator.wikimedia.org/T331356 (10Ennomeijers) +1! [19:48:43] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM %request for doc1002 - https://phabricator.wikimedia.org/T332812 (10andrea.denisse) [19:48:56] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM %request for doc1002 - https://phabricator.wikimedia.org/T332812 (10andrea.denisse) a:03andrea.denisse [19:49:27] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for doc1002 - https://phabricator.wikimedia.org/T332812 (10RhinosF1) [19:58:33] (03PS5) 10Ssingh: logstash: add pybal ECS filter and tests [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) [19:59:00] (03PS2) 10Samtar: Document running persistRevisionThreadItems.php for wgExtraSignatureNamespaces changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901723 (https://phabricator.wikimedia.org/T332745) (owner: 10Bartosz Dziewoński) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T2000). [20:00:05] MatmaRex and nray: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] * TheresNoTime can deploy [20:00:17] hi [20:00:24] my changes are no-ops or labs only [20:00:40] I'll set them going now then :) [20:00:46] The best change is no change [20:00:55] also, never upgrade. 
[20:01:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901723 (https://phabricator.wikimedia.org/T332745) (owner: 10Bartosz Dziewoński) [20:01:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901724 (owner: 10Bartosz Dziewoński) [20:01:05] o/ [20:01:19] !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host doc1003.wikimedia.org [20:01:20] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [20:01:33] (03CR) 10Samtar: [C: 03+2] "Start for deploy" [skins/Vector] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/902150 (https://phabricator.wikimedia.org/T331657) (owner: 10Nray) [20:01:46] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for doc1003 - https://phabricator.wikimedia.org/T332812 (10andrea.denisse) [20:01:49] (03Merged) 10jenkins-bot: Document running persistRevisionThreadItems.php for wgExtraSignatureNamespaces changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901723 (https://phabricator.wikimedia.org/T332745) (owner: 10Bartosz Dziewoński) [20:01:52] (03Merged) 10jenkins-bot: Clean up DiscussionTools labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901724 (owner: 10Bartosz Dziewoński) [20:02:09] o/ [20:02:24] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@822dfed]: bump discolytics to 0.9.0 [20:02:32] TheresNoTime: ping when when done pls? 
I might have a patch of my own by then [20:02:34] !log samtar@deploy2002 Started scap: Backport for [[gerrit:901723|Document running persistRevisionThreadItems.php for wgExtraSignatureNamespaces changes (T332745)]], [[gerrit:901724|Clean up DiscussionTools labs config]] [20:02:39] T332745: Allow running persistRevisionThreadItems.php per-namespace and document that this should be done after changing wgExtraSignatureNamespaces - https://phabricator.wikimedia.org/T332745 [20:02:44] taavi: will do [20:02:45] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@822dfed]: bump discolytics to 0.9.0 (duration: 00m 21s) [20:03:07] (03PS1) 10JHathaway: lists: new server to test bookworm functionality [puppet] - 10https://gerrit.wikimedia.org/r/902182 (https://phabricator.wikimedia.org/T331706) [20:04:08] !log samtar@deploy2002 samtar and matmarex: Backport for [[gerrit:901723|Document running persistRevisionThreadItems.php for wgExtraSignatureNamespaces changes (T332745)]], [[gerrit:901724|Clean up DiscussionTools labs config]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [20:04:16] (syncing) [20:04:32] thanks [20:05:12] also - any suggestions for where else i should document/announce the persistRevisionThreadItems.php and wgExtraSignatureNamespaces thing? 
[20:05:28] !log denisse@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:05:28] !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache doc1003.wikimedia.org on all recursors [20:05:31] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doc1003.wikimedia.org on all recursors [20:05:33] (03CR) 10JHathaway: [C: 03+2] lists: new server to test bookworm functionality [puppet] - 10https://gerrit.wikimedia.org/r/902182 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [20:05:34] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [20:06:46] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:06:46] !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache doc1003.wikimedia.org on all recursors [20:06:49] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doc1003.wikimedia.org on all recursors [20:06:54] !log denisse@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host doc1003.wikimedia.org [20:07:15] !log jhathaway@cumin1001 START - Cookbook sre.ganeti.makevm for new host lists1003.wikimedia.org [20:07:16] !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox [20:07:38] !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host doc1003.eqiad.wmnet [20:07:40] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [20:07:47] TheresNoTime: I'm adding another config patch to the window, is that alright? 
[20:07:59] kostajh: sure :) [20:08:41] 10SRE: failed to register layer: devmapper during scap deploy - https://phabricator.wikimedia.org/T332818 (10TheresNoTime) [20:09:56] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:901723|Document running persistRevisionThreadItems.php for wgExtraSignatureNamespaces changes (T332745)]], [[gerrit:901724|Clean up DiscussionTools labs config]] (duration: 07m 22s) [20:10:00] 10SRE: failed to register layer: devmapper during scap deploy - https://phabricator.wikimedia.org/T332818 (10TheresNoTime) [20:10:01] T332745: Allow running persistRevisionThreadItems.php per-namespace and document that this should be done after changing wgExtraSignatureNamespaces - https://phabricator.wikimedia.org/T332745 [20:10:03] (added) [20:10:06] 10SRE: failed to register layer: devmapper during scap deploy - https://phabricator.wikimedia.org/T332818 (10dancy) [20:10:39] Deployers: You can ignore the `failed to register layer: devmapper: ` error that happens during deployment. 
[20:10:46] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doc1003.eqiad.wmnet - denisse@cumin1001" [20:10:54] dancy: ack, thank you [20:11:01] officially https://phabricator.wikimedia.org/T332803 [20:11:25] (03PS3) 10Samtar: GrowthExperiments: Enable Leveling Up features on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901144 (https://phabricator.wikimedia.org/T330358) (owner: 10Kosta Harlan) [20:11:42] kostajh: going to do your 901144 next, while I wait for nray's other patch to merge [20:11:50] ok [20:11:52] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doc1003.eqiad.wmnet - denisse@cumin1001" [20:11:52] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:11:52] !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache doc1003.eqiad.wmnet on all recursors [20:11:55] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doc1003.eqiad.wmnet on all recursors [20:12:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901144 (https://phabricator.wikimedia.org/T330358) (owner: 10Kosta Harlan) [20:12:58] (03Merged) 10jenkins-bot: GrowthExperiments: Enable Leveling Up features on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901144 (https://phabricator.wikimedia.org/T330358) (owner: 10Kosta Harlan) [20:13:20] !log samtar@deploy2002 Started scap: Backport for [[gerrit:901144|GrowthExperiments: Enable Leveling Up features on pilot wikis (T330358 T317813)]] [20:13:27] T317813: [EPIC] Positive Reinforcement: Leveling Up - https://phabricator.wikimedia.org/T317813 [20:13:28] T330358: Leveling Up: Start experiment for Leveling up on Growth Pilot wikis (ar, bn, 
cs, es) - https://phabricator.wikimedia.org/T330358 [20:14:01] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for doc1003 - https://phabricator.wikimedia.org/T332812 (10andrea.denisse) [20:15:01] !log samtar@deploy2002 kharlan and samtar: Backport for [[gerrit:901144|GrowthExperiments: Enable Leveling Up features on pilot wikis (T330358 T317813)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:15:05] kostajh: live on mwdebug [20:15:07] !log jhathaway@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:15:07] !log jhathaway@cumin1001 START - Cookbook sre.dns.wipe-cache lists1003.wikimedia.org on all recursors [20:15:10] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) lists1003.wikimedia.org on all recursors [20:15:17] TheresNoTime: thanks, I'll need a minute or two to verify [20:15:28] (03PS3) 10Samtar: Enable page tools for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900748 (https://phabricator.wikimedia.org/T331052) (owner: 10Nray) [20:17:22] (03Merged) 10jenkins-bot: Enable pinning for anon main menu when page tools is enabled [skins/Vector] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/902150 (https://phabricator.wikimedia.org/T331657) (owner: 10Nray) [20:17:48] TheresNoTime: lgtm [20:17:54] syncing [20:20:04] (03PS10) 10Alex Paskulin: Assign the API portal to the Wikimedia group for CentralNotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) (owner: 10Ejegg) [20:23:18] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:901144|GrowthExperiments: Enable Leveling Up features on pilot wikis (T330358 T317813)]] (duration: 09m 57s) [20:23:25] T317813: [EPIC] Positive Reinforcement: Leveling Up - https://phabricator.wikimedia.org/T317813 [20:23:25] T330358: Leveling Up: Start experiment for Leveling up on Growth 
Pilot wikis (ar, bn, cs, es) - https://phabricator.wikimedia.org/T330358 [20:23:28] kostajh: live :) [20:23:32] nray: ready? [20:23:34] TheresNoTime: thanks! [20:24:08] TheresNoTime: Yes, is on the debug servers? [20:24:13] is it* [20:24:19] nray: not yet [20:24:51] !log samtar@deploy2002 Started scap: Backport for [[gerrit:902150|Enable pinning for anon main menu when page tools is enabled (T331657)]] [20:24:56] T331657: Enable pinning for anonymous users when page tools is enabled - https://phabricator.wikimedia.org/T331657 [20:25:05] PROBLEM - Host kubernetes1023 is DOWN: PING CRITICAL - Packet loss = 100% [20:25:48] !log reboot kubernetes1023 for a test [20:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:54] akosiaris: not saying thats related ^, but just as it happened, scap backport got stuck on `20:25:28 docker_pull_k8s: 96% (in-flight: 1; ok: 29; fail: 2; left: 0)` [20:27:03] RECOVERY - Host kubernetes1023 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [20:28:29] !log samtar@deploy2002 samtar and nray: Backport for [[gerrit:902150|Enable pinning for anon main menu when page tools is enabled (T331657)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:28:38] nray: 902150 is live on mwdebug now [20:28:50] TheresNoTime: Thank you, I will check now [20:29:24] TheresNoTime: did it proceed eventually ? [20:29:30] yeah :) [20:29:39] ok, good to know. Thanks for the notice [20:29:48] and yeah, it's probably related [20:29:56] but also self-healed apparently [20:30:24] well the stage failed on 3 nodes instead of 2, so guessing it just timed out? 
[20:31:04] TheresNoTime: You can proceed with that one [20:31:13] syncing :) [20:31:58] well, it's drained now, rebooting once more, this time around we shouldn't see anything [20:32:03] !log reboot kubernetes1023 for a test once more [20:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:42] !log reboot kubernetes1023 for a test once more, ⚓ T332803 [20:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:47] T332803: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803 [20:32:58] that's gonna be properly added now [20:34:03] PROBLEM - Host kubernetes1023 is DOWN: PING CRITICAL - Packet loss = 100% [20:35:25] RECOVERY - Host kubernetes1023 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:36:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] es_exporter: add NEL metrics by country [puppet] - 10https://gerrit.wikimedia.org/r/901220 (https://phabricator.wikimedia.org/T328941) (owner: 10Volans) [20:36:39] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:902150|Enable pinning for anon main menu when page tools is enabled (T331657)]] (duration: 11m 47s) [20:36:44] T331657: Enable pinning for anonymous users when page tools is enabled - https://phabricator.wikimedia.org/T331657 [20:36:52] live, and moving on to 900748 [20:37:18] TheresNoTime: thank you! [20:37:28] (03CR) 10Samtar: [C: 03+2] "deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900748 (https://phabricator.wikimedia.org/T331052) (owner: 10Nray) [20:37:40] !log uncordon reboot kubernetes1023. 
It was drained previously for ⚓ T332803 [20:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:27] (03Merged) 10jenkins-bot: Enable page tools for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900748 (https://phabricator.wikimedia.org/T331052) (owner: 10Nray) [20:39:10] (03PS1) 10Majavah: Set OATHAuthMultipleDevicesMigrationStage in IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902187 [20:39:12] (03PS1) 10Majavah: Remove OATHAuthMultipleDevicesMigrationStage from CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902188 [20:39:14] (03PS1) 10Majavah: [beta] Write both for OATHAuthMultipleDevicesMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902189 (https://phabricator.wikimedia.org/T242031) [20:41:43] nray: live on mwdebug1002 [20:41:50] (had to do this one manually) [20:41:55] TheresNoTime: Thank you, checking now [20:44:19] TheresNoTime: Looks good, you can proceed! [20:44:27] syncing [20:49:41] (err, if it announces that its ready on mwdebug, ignore it :p) [20:54:50] !log samtar@deploy2002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:900748|Enable page tools for anonymous users (T331052)]] (duration: 10m 10s) [20:54:55] T331052: Enable page tools for anonymous users - https://phabricator.wikimedia.org/T331052 [20:55:04] nray: got there eventually, should be live now :) [20:55:09] taavi: all yours [20:55:13] thanks! [20:55:16] TheresNoTime: Thanks for your help! 
[20:55:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902187 (owner: 10Majavah) [20:55:50] (one of those k8s steps takes a while to time-out and fail fwiw) [20:55:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:56:10] :( [20:56:15] the deployment process is already slow as is [20:56:49] (03PS2) 10Majavah: Set OATHAuthMultipleDevicesMigrationStage in IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902187 [20:56:52] (03CR) 10Majavah: [C: 03+2] Set OATHAuthMultipleDevicesMigrationStage in IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902187 (owner: 10Majavah) [20:57:46] (03Merged) 10jenkins-bot: Set OATHAuthMultipleDevicesMigrationStage in IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902187 (owner: 10Majavah) [20:58:12] !log taavi@deploy2002 Started scap: Backport for [[gerrit:902187|Set OATHAuthMultipleDevicesMigrationStage in IS]] [20:58:37] (03CR) 10Cwhite: "one nit inline, but otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) (owner: 10Ssingh) [20:59:41] !log taavi@deploy2002 taavi: Backport for [[gerrit:902187|Set OATHAuthMultipleDevicesMigrationStage in IS]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [21:00:29] (03PS2) 10Majavah: Remove OATHAuthMultipleDevicesMigrationStage from CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902188 [21:00:35] (03PS2) 10Majavah: [beta] Write both for OATHAuthMultipleDevicesMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902189 (https://phabricator.wikimedia.org/T242031) [21:00:53] (03CR) 10Majavah: 
[C: 03+2] Remove OATHAuthMultipleDevicesMigrationStage from CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902188 (owner: 10Majavah) [21:00:58] (03CR) 10Majavah: [C: 03+2] [beta] Write both for OATHAuthMultipleDevicesMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902189 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [21:00:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:01:42] (03Merged) 10jenkins-bot: Remove OATHAuthMultipleDevicesMigrationStage from CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902188 (owner: 10Majavah) [21:01:45] (03Merged) 10jenkins-bot: [beta] Write both for OATHAuthMultipleDevicesMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902189 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [21:05:30] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:902187|Set OATHAuthMultipleDevicesMigrationStage in IS]] (duration: 07m 17s) [21:05:34] (03PS1) 10QChris: Add .gitreview [debs/cqlsh4] - 10https://gerrit.wikimedia.org/r/902194 [21:05:36] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [debs/cqlsh4] - 10https://gerrit.wikimedia.org/r/902194 (owner: 10QChris) [21:06:26] !log taavi@deploy2002 Started scap: Backport for [[gerrit:902188|Remove OATHAuthMultipleDevicesMigrationStage from CS]], [[gerrit:902189|[beta] Write both for OATHAuthMultipleDevicesMigrationStage (T242031)]] [21:06:31] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [21:08:09] !log taavi@deploy2002 taavi: Backport for [[gerrit:902188|Remove OATHAuthMultipleDevicesMigrationStage from CS]], [[gerrit:902189|[beta] Write both for OATHAuthMultipleDevicesMigrationStage (T242031)]] synced to the testservers: 
mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [21:08:20] !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doc1003.eqiad.wmnet [21:13:13] 10SRE, 10vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (10Peachey88) [21:13:55] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:902188|Remove OATHAuthMultipleDevicesMigrationStage from CS]], [[gerrit:902189|[beta] Write both for OATHAuthMultipleDevicesMigrationStage (T242031)]] (duration: 07m 29s) [21:14:01] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [21:14:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:15:27] (03PS1) 10Jdlrobson: Enable web based viewing of ReadingLists on mediawiki.org and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093) [21:15:46] !log UTC late backports complete [21:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:36] jouncebot: nowandnext [21:16:36] No deployments scheduled for the next 8 hour(s) and 43 minute(s) [21:16:36] In 8 hour(s) and 43 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T0600) [21:16:37] In 8 hour(s) and 43 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T0600) [21:16:41] sorry, I got one more [21:17:27] (03PS1) 10Majavah: [beta] Read new for OATHAuthMultipleDevicesMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902198 (https://phabricator.wikimedia.org/T242031) [21:17:38] (03CR) 10TrainBranchBot: [C: 03+2]
"Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902198 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [21:18:23] (03Merged) 10jenkins-bot: [beta] Read new for OATHAuthMultipleDevicesMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902198 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [21:19:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:41:07] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:42:22] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Orchestrator, and 2 others: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (10Aklapper) >>! In T273637#8718626, @jbond wrote: > are you able to offer any advice on this, thanks? See "[Request a project]... 
[21:45:01] (03PS1) 10Cwhite: logstash: add grafana-server ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/901642 (https://phabricator.wikimedia.org/T234565) [21:46:07] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:29:11] NovemLinguae: if you're around to test, I could look at backporting 902153 [22:31:14] (03PS1) 10Samtar: Revert "Remove 50% opacity from notification badges when they are all read" [extensions/Echo] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/902154 (https://phabricator.wikimedia.org/T331502) [22:32:12] (03PS1) 10Samtar: Revert "Remove 50% opacity from notification badges when they are all read" [extensions/Echo] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/902155 (https://phabricator.wikimedia.org/T331502) [22:33:34] i was thinking it was pretty minor. wasn't thinking it was backport worthy. but i'm around to test if you disagree [22:34:55] A skin issue which is pretty minor, that's a first /s [22:35:07] lol :) [22:38:30] hm, well I don't disagree that it's a minor regression — may as well let it ride the train then :) sorry for the ping! [22:40:28] nah no worries, ping me anytime. 
i appreciate the backport offer [22:54:15] (03CR) 10Tim Starling: Temporarily disable xenon/excimer for switch maintenance (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901322 (https://phabricator.wikimedia.org/T330165) (owner: 10Tim Starling) [23:01:38] (03PS1) 10Samtar: core-Permissions: Add `ipblock-exempt` to `bot` group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902207 (https://phabricator.wikimedia.org/T332759) [23:02:16] (03CR) 10CI reject: [V: 04-1] core-Permissions: Add `ipblock-exempt` to `bot` group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902207 (https://phabricator.wikimedia.org/T332759) (owner: 10Samtar) [23:03:50] (03PS2) 10Samtar: core-Permissions: Add `ipblock-exempt` to `bot` group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902207 (https://phabricator.wikimedia.org/T332759) [23:04:21] (03PS1) 10Zabe: wikimaniawiki: Add namespace for 2024 wikimania [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902208 (https://phabricator.wikimedia.org/T332782) [23:06:09] (03PS2) 10Krinkle: Temporarily disable xenon/excimer for mwlog1002 switch maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901322 (https://phabricator.wikimedia.org/T331882) (owner: 10Tim Starling) [23:06:25] RECOVERY - Check systemd state on mw1372 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:17:51] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:20:13] jouncebot: nowandnext [23:20:13] No deployments scheduled for the next 6 hour(s) and 39 minute(s) [23:20:14] In 6 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T0600) [23:20:14] In 6 hour(s) and 39 minute(s): Primary database switchover
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T0600) [23:20:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902208 (https://phabricator.wikimedia.org/T332782) (owner: 10Zabe) [23:21:03] (03Merged) 10jenkins-bot: wikimaniawiki: Add namespace for 2024 wikimania [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902208 (https://phabricator.wikimedia.org/T332782) (owner: 10Zabe) [23:21:24] !log zabe@deploy2002 Started scap: Backport for [[gerrit:902208|wikimaniawiki: Add namespace for 2024 wikimania (T332782)]] [23:21:31] T332782: Create 2024 namespace for wikimaniawiki - https://phabricator.wikimedia.org/T332782 [23:22:58] !log zabe@deploy2002 zabe: Backport for [[gerrit:902208|wikimaniawiki: Add namespace for 2024 wikimania (T332782)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [23:24:30] !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host doc2002.codfw.wmnet [23:24:31] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [23:24:32] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host lists1003.wikimedia.org [23:26:07] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:31:07] (RedisMemoryFull) firing: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:31:28] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:902208|wikimaniawiki: Add namespace for 2024 wikimania (T332782)]] (duration: 10m 03s) 
[23:31:34] T332782: Create 2024 namespace for wikimaniawiki - https://phabricator.wikimedia.org/T332782 [23:32:14] !log zabe@mwmaint2002:~$ mwscript namespaceDupes.php wikimaniawiki --fix # T332782 [23:32:15] (03PS1) 10Andrea Denisse: doc: Add the doc1003 node definition [puppet] - 10https://gerrit.wikimedia.org/r/902209 (https://phabricator.wikimedia.org/T332812) [23:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:56] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doc2002.codfw.wmnet - denisse@cumin1001" [23:33:26] (03CR) 10Tim Starling: Temporarily disable xenon/excimer for mwlog1002 switch maintenance (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901322 (https://phabricator.wikimedia.org/T331882) (owner: 10Tim Starling) [23:33:58] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doc2002.codfw.wmnet - denisse@cumin1001" [23:33:58] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:33:58] !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache doc2002.codfw.wmnet on all recursors [23:34:01] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doc2002.codfw.wmnet on all recursors [23:34:24] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40285/console" [puppet] - 10https://gerrit.wikimedia.org/r/902209 (https://phabricator.wikimedia.org/T332812) (owner: 10Andrea Denisse) [23:35:26] (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] doc: Add the doc1003 node definition [puppet] - 10https://gerrit.wikimedia.org/r/902209 (https://phabricator.wikimedia.org/T332812) (owner: 10Andrea Denisse) [23:35:50] (03CR) 10Tim Starling: Temporarily disable 
xenon/excimer for mwlog1002 switch maintenance (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901322 (https://phabricator.wikimedia.org/T331882) (owner: 10Tim Starling) [23:36:07] (RedisMemoryFull) resolved: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:38:47] (03PS1) 10Superpes15: [dkwikimedia] Fixing current logo with an HD version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902211 (https://phabricator.wikimedia.org/T332784) [23:46:49] !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host doc1003.eqiad.wmnet with OS bullseye [23:46:57] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10Patch-For-Review: Site: 1 VM request for doc1003 - https://phabricator.wikimedia.org/T332812 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host doc1003.eqiad.wmnet with OS bullseye [23:52:58] (03PS6) 10Ssingh: logstash: add pybal ECS filter and tests [puppet] - 10https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) [23:56:27] !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on doc1003.eqiad.wmnet with reason: host reimage [23:59:41] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doc1003.eqiad.wmnet with reason: host reimage