[00:01:07] (RedisMemoryFull) firing: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[00:06:07] (RedisMemoryFull) resolved: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[00:08:02] (PS1) Superpes15: [ptwikinews] Enable wgMinervaEnableSiteNotice [mediawiki-config] - https://gerrit.wikimedia.org/r/902216 (https://phabricator.wikimedia.org/T332813)
[00:10:33] !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host doc1003.eqiad.wmnet with OS bullseye
[00:10:38] SRE, Infrastructure-Foundations, vm-requests: Site: 1 VM request for doc1003 - https://phabricator.wikimedia.org/T332812 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host doc1003.eqiad.wmnet with OS bullseye completed: - doc1003 (**PASS**) - Removed fr...
[00:19:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:24:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:27:51] !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doc2002.codfw.wmnet
[00:55:29] (PS2) Dzahn: miscweb: move transparency httpd site templates out of role/apache [puppet] - https://gerrit.wikimedia.org/r/902140 (https://phabricator.wikimedia.org/T331896)
[00:55:38] (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/output/902140/40286/" [puppet] - https://gerrit.wikimedia.org/r/902140 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[00:56:32] SRE, Infrastructure-Foundations, vm-requests: Site: 1 VM request for doc1003 - https://phabricator.wikimedia.org/T332812 (andrea.denisse) Open→Resolved
[00:57:05] !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host doc2002.codfw.wmnet with OS bullseye
[00:57:07] !log denisse@cumin1001 END (ERROR) - Cookbook sre.ganeti.reimage (exit_code=97) for host doc2002.codfw.wmnet with OS bullseye
[00:57:11] SRE, vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye
[00:57:13] SRE, vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (ops-monitoring-bot) Cookbook
cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye executed with errors: - doc2002 (**FAIL**) - **The reimage failed, see the c...
[00:57:42] !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host doc2002.codfw.wmnet with OS bullseye
[00:57:48] SRE, vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye
[01:03:48] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "doc1003 - denisse@cumin1001 - T332812"
[01:03:54] T332812: Site: 1 VM request for doc1003 - https://phabricator.wikimedia.org/T332812
[01:04:36] (PS3) Dzahn: miscweb: move os_reports httpd template to profile/microsites/ [puppet] - https://gerrit.wikimedia.org/r/902142 (https://phabricator.wikimedia.org/T331896)
[01:05:15] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "doc1003 - denisse@cumin1001 - T332812"
[01:06:16] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/output/902142/40287/" [puppet] - https://gerrit.wikimedia.org/r/902142 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[01:09:29] SRE, Infrastructure-Foundations, vm-requests: Site: 1 VM request for doc1003 - https://phabricator.wikimedia.org/T332812 (andrea.denisse)
[01:09:55] (PS2) Dzahn: miscweb: add custom and error log for os-reports.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/902144 (https://phabricator.wikimedia.org/T331896)
[01:10:09] (PS3) Dzahn: miscweb: add custom and error log for os-reports.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/902144 (https://phabricator.wikimedia.org/T331896)
[01:11:03] (CR) Dzahn: [C: +2] miscweb: add custom and error log for os-reports.wikimedia.org [puppet] -
https://gerrit.wikimedia.org/r/902144 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[01:19:38] (CR) Dzahn: [C: +2] miscweb: switch os-reports.wikimedia.org to bullseye backend [puppet] - https://gerrit.wikimedia.org/r/902172 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[01:20:00] (PS1) Andrea Denisse: doc: Add role::doc to doc1003 [puppet] - https://gerrit.wikimedia.org/r/902222 (https://phabricator.wikimedia.org/T319477)
[01:21:09] (CR) Dzahn: [C: +2] miscweb: add custom and error log for transparency and archives [puppet] - https://gerrit.wikimedia.org/r/902166 (owner: Dzahn)
[01:21:18] (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40288/console" [puppet] - https://gerrit.wikimedia.org/r/902222 (https://phabricator.wikimedia.org/T319477) (owner: Andrea Denisse)
[01:22:05] (PS2) Dzahn: miscweb: add custom and error log for transparency and archives [puppet] - https://gerrit.wikimedia.org/r/902166 (https://phabricator.wikimedia.org/T331896)
[01:22:51] (CR) Dzahn: miscweb: add custom and error log for transparency and archives [puppet] - https://gerrit.wikimedia.org/r/902166 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[01:23:19] (CR) Dzahn: [C: +2] "avoid that everything ends up in "other_vhosts" logs" [puppet] - https://gerrit.wikimedia.org/r/902166 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[01:23:32] (PS3) Dzahn: miscweb: add custom and error log for transparency and archives [puppet] - https://gerrit.wikimedia.org/r/902166 (https://phabricator.wikimedia.org/T331896)
[01:33:45] (PS1) Dzahn: miscweb: add monitor for research.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/902224
[01:34:19] (PS2) Dzahn: miscweb: switch research.wikimedia.org to bullseye backend [puppet] - https://gerrit.wikimedia.org/r/902167
(https://phabricator.wikimedia.org/T331896)
[01:34:41] (CR) Dzahn: [C: +2] miscweb: switch research.wikimedia.org to bullseye backend [puppet] - https://gerrit.wikimedia.org/r/902167 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[01:39:54] (PS1) Dzahn: miscweb: add monitor for wikiworkshop.org [puppet] - https://gerrit.wikimedia.org/r/902225
[01:42:14] (CR) Dzahn: [C: +2] miscweb: switch wikiworkshop.org to bullseye backend [puppet] - https://gerrit.wikimedia.org/r/902169 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[01:48:15] (CR) Dzahn: [C: +2] miscweb: switch design.wikimedia.org to bullseye backend [puppet] - https://gerrit.wikimedia.org/r/902170 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[01:54:12] (PS1) Dzahn: miscweb/static-codereview: add prometheus monitor, remove icinga monitor [puppet] - https://gerrit.wikimedia.org/r/902226
[02:00:00] !log rsyncing ~4GB files for static-codereview.wikimedia.org from old to newer VMs for T331896 - no automatic sync / deploy for these
[02:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:00:11] T331896: upgrade miscweb VMs to bullseye - https://phabricator.wikimedia.org/T331896
[02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:22] !log denisse@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host doc2002.codfw.wmnet with OS bullseye
[02:07:27] SRE, vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye executed with errors: - doc2002 (**FAIL**) - Removed from Puppet and
PuppetD...
[02:07:36] (PS1) Dzahn: miscweb/static-codereview: also monitor if files were synced [puppet] - https://gerrit.wikimedia.org/r/902228
[02:08:42] (PS2) Dzahn: miscweb/static-codereview/httpbb: also test if files were synced [puppet] - https://gerrit.wikimedia.org/r/902228
[02:09:51] (PS3) Dzahn: miscweb/static-codereview/httpbb: also test if files were synced [puppet] - https://gerrit.wikimedia.org/r/902228 (https://phabricator.wikimedia.org/T331896)
[02:10:54] (CR) Dzahn: [C: +2] miscweb/static-codereview/httpbb: also test if files were synced [puppet] - https://gerrit.wikimedia.org/r/902228 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[02:14:01] (PS4) Dzahn: miscweb/static-codereview/httpbb: also test if files were synced [puppet] - https://gerrit.wikimedia.org/r/902228 (https://phabricator.wikimedia.org/T331896)
[02:14:58] (CR) Dzahn: miscweb/static-codereview/httpbb: also test if files were synced [puppet] - https://gerrit.wikimedia.org/r/902228 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[02:19:49] (CR) Dzahn: [C: +2] miscweb: switch static-codereview to bullseye backend [puppet] - https://gerrit.wikimedia.org/r/902174 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[02:26:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:42] (PS1) Dzahn: Revert "maintenance: temp allow rsyncing home dir to miscweb" [puppet] - https://gerrit.wikimedia.org/r/902156
[02:39:53] (CR) CI reject: [V: -1] Revert "maintenance: temp allow rsyncing home dir to miscweb" [puppet] - https://gerrit.wikimedia.org/r/902156 (owner: Dzahn)
[02:42:01] (PS2) Dzahn: miscweb: add monitor for research.wikimedia.org [puppet] -
https://gerrit.wikimedia.org/r/902224
[02:42:03] (PS2) Dzahn: miscweb: add monitor for wikiworkshop.org [puppet] - https://gerrit.wikimedia.org/r/902225
[02:42:05] (PS2) Dzahn: miscweb/static-codereview: add prometheus monitor, remove icinga monitor [puppet] - https://gerrit.wikimedia.org/r/902226
[02:42:07] (PS1) Dzahn: decom miscweb2002 [puppet] - https://gerrit.wikimedia.org/r/902229 (https://phabricator.wikimedia.org/T331896)
[03:24:52] (PS7) Aaron Schulz: Add per-action component-level profiling in statsd using excimer [mediawiki-config] - https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968)
[03:46:10] (PS4) Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: Fomafix)
[03:46:13] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:51:51] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:59:27] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:03:15] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:11:07] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc -
https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[04:16:07] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[04:16:37] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:21:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:22:23] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:25:58] !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host doc2002.codfw.wmnet with OS bullseye
[04:26:05] SRE, vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye
[04:28:01] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:35:37] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:41:17]
RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:34:51] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host an-test-client1002.eqiad.wmnet with OS bullseye
[05:37:24] !log denisse@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host doc2002.codfw.wmnet with OS bullseye
[05:37:29] SRE, vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye executed with errors: - doc2002 (**FAIL**) - Downtimed on Icinga/Alertmanage...
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T0600)
[06:00:05] kormat, marostegui, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T0600).
[06:07:15] (PS1) Marostegui: Revert "es2029: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/902157
[06:07:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P45923 and previous config saved to /var/cache/conftool/dbconfig/20230323-060750-root.json
[06:07:57] (CR) Marostegui: [C: +2] Revert "es2029: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/902157 (owner: Marostegui)
[06:08:51] SRE, ops-codfw, DBA: Unexplained reboot of es2029.codfw.wmnet - https://phabricator.wikimedia.org/T332603 (Marostegui) Open→Resolved I am finally repooling this today automatically. Thanks for checking the host Papaul!
[06:22:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P45924 and previous config saved to /var/cache/conftool/dbconfig/20230323-062255-root.json
[06:38:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45925 and previous config saved to /var/cache/conftool/dbconfig/20230323-063800-root.json
[06:53:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45926 and previous config saved to /var/cache/conftool/dbconfig/20230323-065306-root.json
[07:00:04] Amir1, apergos, and jnuche: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T0700).
[07:02:20] morning!
[07:02:56] we have a training request but I don't know which slot it's for, the morning or afternoon
[07:03:04] and in any case, no patches are scheduled for deployment
[07:04:35] SRE-Sprint-Week-Sustainability-March2023, WMF-JobQueue, Wikimedia-Video, serviceops-radar, Sustainability (Incident Followup): Videoscalers fail health checks while CPU is maxed - https://phabricator.wikimedia.org/T306860 (Joe) I wouldn't consider this task done, but we took all the actions t...
[07:04:55] I don't have an irc handle for the trainee so I can't contact them
[07:05:32] I'll camp in the google meet just in case they show up and I can tell them what's going on
[07:08:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P45927 and previous config saved to /var/cache/conftool/dbconfig/20230323-070811-root.json
[07:11:12] SRE-Sprint-Week-Sustainability-March2023, WMF-JobQueue, Wikimedia-Video, serviceops-radar, Sustainability (Incident Followup): Videoscalers fail health checks while CPU is maxed - https://phabricator.wikimedia.org/T306860 (Joe) a:Joe→None
[07:23:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2029 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45928 and previous config saved to /var/cache/conftool/dbconfig/20230323-072315-root.json
[07:32:43] (PS68) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[07:34:38] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[07:36:34] (PS69) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[07:37:48] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kubernetes1024.eqiad.wmnet with reason: Restart docker with overlay
[07:38:01] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubernetes1024.eqiad.wmnet with reason: Restart docker with overlay
[07:38:28] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[07:38:43] SRE-Sprint-Week-Sustainability-March2023, TimedMediaHandler-Transcode, WMF-JobQueue, serviceops, Sustainability (Incident Followup): Add rate limiting to the jobqueue vidoscalers to prevent overloads -
https://phabricator.wikimedia.org/T278945 (Joe) Open→Resolved
[07:40:25] (PS70) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[07:42:17] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[07:42:43] !log clean up docker on kubernetes1024 (cordon + stop kubelet + docker + clean /var/lib/docker/*) and reboot to enable overlay2 - T332803
[07:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:48] T332803: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803
[07:43:27] (PS1) Elukey: changeprop: improve liftwing streams configurability [deployment-charts] - https://gerrit.wikimedia.org/r/902237 (https://phabricator.wikimedia.org/T328576)
[07:44:31] (PS1) Giuseppe Lavagetto: appserver: send back a proper error page from envoy [puppet] - https://gerrit.wikimedia.org/r/902238 (https://phabricator.wikimedia.org/T287983)
[07:45:20] (PS71) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[07:45:58] (KubernetesCalicoDown) firing: kubernetes1024.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=kubernetes1024.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:46:07] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[07:47:11] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] -
https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[07:47:46] (PS1) Superpes15: [trwikiquote] Removing the temporary logo (already reverted) [mediawiki-config] - https://gerrit.wikimedia.org/r/902239 (https://phabricator.wikimedia.org/T329399)
[07:48:33] (CR) Giuseppe Lavagetto: [C: +2] appserver: send back a proper error page from envoy [puppet] - https://gerrit.wikimedia.org/r/902238 (https://phabricator.wikimedia.org/T287983) (owner: Giuseppe Lavagetto)
[07:49:31] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kubernetes2024.codfw.wmnet with reason: Restart docker with overlay
[07:49:44] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubernetes2024.codfw.wmnet with reason: Restart docker with overlay
[07:49:51] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kubernetes2023.codfw.wmnet with reason: Restart docker with overlay
[07:50:04] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubernetes2023.codfw.wmnet with reason: Restart docker with overlay
[07:50:58] (KubernetesCalicoDown) resolved: kubernetes1024.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=kubernetes1024.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:51:07] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[07:54:35] !log clean up docker and reboot kubernetes2023 to enable overlay2 - T332803
[07:54:40] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:40] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:54:41] T332803: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803
[07:55:27] (PS72) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[07:57:16] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[07:57:58] (KubernetesCalicoDown) firing: kubernetes2023.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2023.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:58:20] (PS73) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[08:00:11] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[08:01:30] (PS2) Elukey: changeprop: improve liftwing streams configurability [deployment-charts] - https://gerrit.wikimedia.org/r/902237 (https://phabricator.wikimedia.org/T328576)
[08:02:58] (KubernetesCalicoDown) resolved: (2) kubernetes1024.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:04:26] !log rolling rollback to HAProxy 2.6.9 in cache text cluster - T332796
[08:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:32] T332796: HAProxy 2.6.10 crashing in the text cluster - https://phabricator.wikimedia.org/T332796
[08:07:03]
(CR) Muehlenhoff: [C: +2] Move udpmixircecho to Python 3 for Bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702) (owner: Muehlenhoff)
[08:08:25] (PS74) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[08:08:44] !log fetch haproxy 2.6.11 in apt.wm.o thirdparty/haproxy26 for bullseye & buster
[08:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:16] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[08:11:26] !log testing HAProxy 2.6.11 in cp4044 - T332796
[08:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:32] T332796: HAProxy 2.6.10 crashing in the text cluster - https://phabricator.wikimedia.org/T332796
[08:12:05] (PS75) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[08:12:12] SRE, Traffic, Upstream: HAProxy 2.6.10 crashing in the text cluster - https://phabricator.wikimedia.org/T332796 (Vgutierrez) cp4044 was missing the haproxy 2.6.9 in /var/cache/apt/archives and 2.6.11 has been released, so I'm testing it there right now
[08:13:59] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[08:20:59] !log clean up docker and reboot kubernetes2024 to enable overlay2 - T332803
[08:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:04] T332803: kubernetes[2023-2024].codfw.wmnet,kubernetes[1023-1024].eqiad.wmnet are using devicemapper instead of overlay2 - https://phabricator.wikimedia.org/T332803
[08:22:50] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:22:54] PROBLEM - PyBal
backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes2009.codfw.wmnet, kubernetes2020.codfw.wmnet, kubernetes2023.codfw.wmnet, kubernetes2021.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:24:46] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:37:04] (PS1) Muehlenhoff: Build for Bullseye [software/druid_exporter] (debian) - https://gerrit.wikimedia.org/r/902302 (https://phabricator.wikimedia.org/T332589)
[08:46:50] (PS1) Vgutierrez: cache::haproxy,prometheus: Track unexpected restarts [puppet] - https://gerrit.wikimedia.org/r/902303 (https://phabricator.wikimedia.org/T332796)
[08:48:21] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40289/console" [puppet] - https://gerrit.wikimedia.org/r/902303 (https://phabricator.wikimedia.org/T332796) (owner: Vgutierrez)
[08:48:51] (CR) CI reject: [V: -1] cache::haproxy,prometheus: Track unexpected restarts [puppet] - https://gerrit.wikimedia.org/r/902303 (https://phabricator.wikimedia.org/T332796) (owner: Vgutierrez)
[08:49:09] (PS2) Muehlenhoff: Build for Bullseye [software/druid_exporter] (debian) - https://gerrit.wikimedia.org/r/902302 (https://phabricator.wikimedia.org/T332589)
[08:49:35] (PS76) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[08:51:30] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[08:52:58] (PS77) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[08:53:21] (PS2) Vgutierrez: cache::haproxy,prometheus: Track unexpected restarts [puppet] - https://gerrit.wikimedia.org/r/902303 (https://phabricator.wikimedia.org/T332796)
[08:54:48] (CR)
Vgutierrez: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40290/console" [puppet] - https://gerrit.wikimedia.org/r/902303 (https://phabricator.wikimedia.org/T332796) (owner: Vgutierrez)
[08:55:04] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[09:15:00] (CR) Muehlenhoff: [V: +2 C: +2] Build for Bullseye [software/druid_exporter] (debian) - https://gerrit.wikimedia.org/r/902302 (https://phabricator.wikimedia.org/T332589) (owner: Muehlenhoff)
[09:22:50] SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, Patch-For-Review, Sustainability (Incident Followup): followups to unactionable NELHigh pages due to Telecom Italia outage, 2023-02-05 - https://phabricator.wikimedia.org/T328941 (Volans) In the meanwhile I've also added 2 graphs in the [[ htt...
[09:23:27] (PS78) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[09:24:51] (PS3) Aklapper: Set wmgUseGraphWithJsonNamespace = false for mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/888708 (https://phabricator.wikimedia.org/T124748)
[09:25:22] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[09:26:43] (CR) Aklapper: "Please abandon (I do not have rights).
This now needs to be proposed against https://gitlab.wikimedia.org/repos/phabricator/deployment/-/b" [puppet] - https://gerrit.wikimedia.org/r/718418 (https://phabricator.wikimedia.org/T158177) (owner: DannyS712)
[09:31:20] (CR) Filippo Giunchedi: "Good news and bad news:" [puppet] - https://gerrit.wikimedia.org/r/902303 (https://phabricator.wikimedia.org/T332796) (owner: Vgutierrez)
[09:36:27] (PS79) Slyngshede: Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544
[09:38:01] (PS1) Elukey: services: add the first lift wing stream in changeprop [deployment-charts] - https://gerrit.wikimedia.org/r/902307 (https://phabricator.wikimedia.org/T328576)
[09:38:23] (CR) CI reject: [V: -1] Squid logformat to ECS [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[09:38:39] SRE, Infrastructure-Foundations, vm-requests: eqiad: 1 VM request for hadoop-test-client - https://phabricator.wikimedia.org/T332656 (Stevemunene) Vm is created and active closing the task.
[09:40:32] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM request for hadoop-test-client - https://phabricator.wikimedia.org/T332656 (10Stevemunene) 05Open→03Resolved [09:41:43] (03PS80) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [09:43:33] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [09:47:17] !log uploaded prometheus-druid-exporter 0.8-2 for bullseye-wikimedia T332584 T332589 [09:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:24] T332584: Upgrade an-test-druid1001 to bullseye - https://phabricator.wikimedia.org/T332584 [09:47:24] T332589: Upgrade the druid-public cluster to bullseye - https://phabricator.wikimedia.org/T332589 [09:49:11] (03CR) 10Filippo Giunchedi: eventgate: add EventgateLoggingExternalErrors alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth) [09:52:46] (03CR) 10Volans: [C: 03+2] es_exporter: add NEL metrics by country [puppet] - 10https://gerrit.wikimedia.org/r/901220 (https://phabricator.wikimedia.org/T328941) (owner: 10Volans) [09:53:33] (03PS1) 10Cathal Mooney: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/alerts [alerts] - 10https://gerrit.wikimedia.org/r/902308 [09:53:51] (03PS1) 10Muehlenhoff: Add irc2002 [puppet] - 10https://gerrit.wikimedia.org/r/902309 (https://phabricator.wikimedia.org/T331702) [09:54:52] (03CR) 10Aklapper: "Please abandon this patch, as T207502 is resolved." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/469042 (https://phabricator.wikimedia.org/T207502) (owner: 10Varnent) [09:55:00] (03Abandoned) 10Cathal Mooney: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/alerts [alerts] - 10https://gerrit.wikimedia.org/r/902308 (owner: 10Cathal Mooney) [09:55:26] (03PS9) 10Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) [09:56:27] (03PS1) 10Vgutierrez: prometheus: Track systemd service restarts on >= bullseye [puppet] - 10https://gerrit.wikimedia.org/r/902310 (https://phabricator.wikimedia.org/T332796) [09:56:38] (03CR) 10CI reject: [V: 04-1] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney) [09:57:03] (03PS10) 10Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) [09:57:39] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main2005.codfw.wmnet with reason: stop kafka and reimage [09:57:52] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main2005.codfw.wmnet with reason: stop kafka and reimage [09:58:19] (03CR) 10CI reject: [V: 04-1] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney) [09:59:37] (03CR) 10Filippo Giunchedi: "Logic LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/902310 (https://phabricator.wikimedia.org/T332796) (owner: 10Vgutierrez) [10:00:04] mvolz: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T1000). 
[10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T1000) [10:01:25] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-main2005.codfw.wmnet with OS bullseye [10:01:52] (03PS2) 10Vgutierrez: prometheus: Track systemd service restarts on >= bullseye [puppet] - 10https://gerrit.wikimedia.org/r/902310 (https://phabricator.wikimedia.org/T332796) [10:02:07] (03PS11) 10Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) [10:03:16] (03CR) 10CI reject: [V: 04-1] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney) [10:03:48] (03CR) 10JMeybohm: [C: 03+1] Revert "Revert: Remove the .Values.kubernetesApi hack" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/902078 (owner: 10Giuseppe Lavagetto) [10:04:02] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Patch-For-Review, 10Sustainability (Incident Followup): followups to unactionable NELHigh pages due to Telecom Italia outage, 2023-02-05 - https://phabricator.wikimedia.org/T328941 (10Volans) [10:04:27] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Patch-For-Review, 10Sustainability (Incident Followup): followups to unactionable NELHigh pages due to Telecom Italia outage, 2023-02-05 - https://phabricator.wikimedia.org/T328941 (10Volans) Metrics are now being ingested by prometheus: {F36923902} [10:06:01] 10SRE, 10SRE Program Management, 10Logos: SRE needs a logo - https://phabricator.wikimedia.org/T312067 (10jcrespo) Experimenting with AI: (I believe those are public domain), but needs an artist to make them good: {F36923906} {F36923905} [10:07:33] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/902310 (https://phabricator.wikimedia.org/T332796) (owner: 10Vgutierrez) [10:07:51] (03CR) 10Vgutierrez: [C: 03+2] prometheus: Track systemd service restarts on >= bullseye [puppet] - 10https://gerrit.wikimedia.org/r/902310 (https://phabricator.wikimedia.org/T332796) (owner: 10Vgutierrez) [10:08:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host irc2002.wikimedia.org [10:08:09] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:08:43] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10SRE Observability, 10Patch-For-Review: How should we monitor for faulty memory modules? - https://phabricator.wikimedia.org/T302639 (10jbond) [10:08:45] (03Abandoned) 10Vgutierrez: cache::haproxy,prometheus: Track unexpected restarts [puppet] - 10https://gerrit.wikimedia.org/r/902303 (https://phabricator.wikimedia.org/T332796) (owner: 10Vgutierrez) [10:10:44] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM irc2002.wikimedia.org - jmm@cumin2002" [10:12:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:13:05] this is expected --^ [10:15:45] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2005.codfw.wmnet with reason: host reimage [10:18:24] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2005.codfw.wmnet with reason: host reimage [10:20:22] (03PS2) 10Sergio Gimeno: GrowthExperiments: disable add a link backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902131 
(https://phabricator.wikimedia.org/T304551) [10:20:27] (03PS81) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [10:21:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM irc2002.wikimedia.org - jmm@cumin2002" [10:21:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:21:18] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache irc2002.wikimedia.org on all recursors [10:21:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) irc2002.wikimedia.org on all recursors [10:22:21] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:25:33] (03PS1) 10Vgutierrez: traffic: Add HAProxyRestarted alert [alerts] - 10https://gerrit.wikimedia.org/r/902312 (https://phabricator.wikimedia.org/T332796) [10:26:20] (03CR) 10Muehlenhoff: [C: 03+2] Add irc2002 [puppet] - 10https://gerrit.wikimedia.org/r/902309 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [10:26:43] (03CR) 10CI reject: [V: 04-1] traffic: Add HAProxyRestarted alert [alerts] - 10https://gerrit.wikimedia.org/r/902312 (https://phabricator.wikimedia.org/T332796) (owner: 10Vgutierrez) [10:26:52] I love you too CI [10:32:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:36:07] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - 
https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [10:37:31] (03PS12) 10Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) [10:37:35] (03PS2) 10Vgutierrez: traffic: Add HAProxyRestarted alert [alerts] - 10https://gerrit.wikimedia.org/r/902312 (https://phabricator.wikimedia.org/T332796) [10:38:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2005.codfw.wmnet with OS bullseye [10:38:43] (03CR) 10CI reject: [V: 04-1] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney) [10:39:12] (03CR) 10Filippo Giunchedi: [C: 03+1] traffic: Add HAProxyRestarted alert [alerts] - 10https://gerrit.wikimedia.org/r/902312 (https://phabricator.wikimedia.org/T332796) (owner: 10Vgutierrez) [10:39:21] (03CR) 10CI reject: [V: 04-1] traffic: Add HAProxyRestarted alert [alerts] - 10https://gerrit.wikimedia.org/r/902312 (https://phabricator.wikimedia.org/T332796) (owner: 10Vgutierrez) [10:40:52] (03PS13) 10Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) [10:41:19] (03PS3) 10Vgutierrez: traffic: Add HAProxyRestarted alert [alerts] - 10https://gerrit.wikimedia.org/r/902312 (https://phabricator.wikimedia.org/T332796) [10:41:36] (03PS1) 10Volans: gitignore: ignore .tox/ [alerts] - 10https://gerrit.wikimedia.org/r/902315 [10:41:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host irc2002.wikimedia.org [10:41:38] (03PS1) 10Volans: NEL: add alert by country [alerts] - 10https://gerrit.wikimedia.org/r/902316 
(https://phabricator.wikimedia.org/T328941) [10:42:04] (03CR) 10CI reject: [V: 04-1] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney) [10:43:28] (03PS82) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [10:44:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host irc2002.wikimedia.org with OS bullseye [10:44:13] 10SRE, 10SRE-Unowned: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host irc2002.wikimedia.org with OS bullseye [10:45:05] (03PS14) 10Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) [10:45:18] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:46:02] (03CR) 10Vgutierrez: [C: 03+2] traffic: Add HAProxyRestarted alert [alerts] - 10https://gerrit.wikimedia.org/r/902312 (https://phabricator.wikimedia.org/T332796) (owner: 10Vgutierrez) [10:46:07] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [10:46:52] (03PS83) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [10:47:30] 10SRE, 10LDAP-Access-Requests: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10Peter) [10:48:03] (03CR) 10Filippo Giunchedi: [C: 03+1] NEL: add alert by country [alerts] - 10https://gerrit.wikimedia.org/r/902316 (https://phabricator.wikimedia.org/T328941) 
(owner: 10Volans) [10:48:41] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:50:50] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney) [10:51:41] (03PS1) 10JMeybohm: k8s: Force docker storage-driver to overlay2 [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) [10:52:15] (03PS84) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [10:52:59] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Observability-Alerting: Migrate Foundations Prometheus alerts to AlertManager - https://phabricator.wikimedia.org/T294564 (10jbond) [10:53:33] (03CR) 10Elukey: "Should we force overlay2 in the docker class as well? To avoid device mapper as default.. I mean for non-k8s projects :)" [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm) [10:53:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:54:07] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:54:32] 10SRE, 10Domains, 10Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (10Aklapper) > I suspect it's used in *some* channels to refer to the *English* Wikipedia, but people could just...stop doing that? I don't see a good reason to potentially end up with lots of `LANGUAGECODEwp.TLD` sty... 
[10:56:31] (03CR) 10JMeybohm: k8s: Force docker storage-driver to overlay2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm) [10:56:40] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40292/console" [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm) [10:56:52] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on irc2002.wikimedia.org with reason: host reimage [10:58:45] (03CR) 10Filippo Giunchedi: "Apologies for the drive-by comment, but I ran into the same issue (devicemapper being the default, rather than overlay2) on alert hosts to" [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm) [10:58:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:00:01] (03CR) 10Filippo Giunchedi: [C: 03+1] gitignore: ignore .tox/ [alerts] - 10https://gerrit.wikimedia.org/r/902315 (owner: 10Volans) [11:01:18] (03CR) 10Volans: [C: 03+2] gitignore: ignore .tox/ [alerts] - 10https://gerrit.wikimedia.org/r/902315 (owner: 10Volans) [11:01:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on irc2002.wikimedia.org with reason: host reimage [11:01:55] (03CR) 10Elukey: k8s: Force docker storage-driver to overlay2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm) [11:02:28] (03Merged) 10jenkins-bot: gitignore: ignore .tox/ [alerts] - 10https://gerrit.wikimedia.org/r/902315 (owner: 10Volans) [11:02:30] (03CR) 10JMeybohm: [V: 03+1] k8s: 
Force docker storage-driver to overlay2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm) [11:02:49] (03CR) 10Elukey: [C: 03+1] "Anyway, the k8s part looks ok :)" [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm) [11:04:53] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [11:05:00] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [11:05:10] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:06:47] (03CR) 10Cathal Mooney: [C: 03+2] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney) [11:06:57] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main2004.codfw.wmnet with reason: stop kafka and reimage [11:07:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main2004.codfw.wmnet with reason: stop kafka and reimage [11:07:57] (03Merged) 10jenkins-bot: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney) [11:08:42] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-main2004.codfw.wmnet with OS bullseye [11:09:21] (03PS1) 10Vgutierrez: traffic: Fix HAProxyRestarted dashboard URL [alerts] - 10https://gerrit.wikimedia.org/r/902319 (https://phabricator.wikimedia.org/T332796) [11:10:18] (03PS1) 10Cathal Mooney: Remove Icinga prometheus check for irc-echo messages [puppet] - 10https://gerrit.wikimedia.org/r/902320 (https://phabricator.wikimedia.org/T327793) [11:10:31] (03CR) 10CI reject: [V: 04-1] traffic: Fix HAProxyRestarted dashboard URL [alerts] - 10https://gerrit.wikimedia.org/r/902319 
(https://phabricator.wikimedia.org/T332796) (owner: 10Vgutierrez) [11:11:11] (03PS1) 10Kamila Součková: tests: fix usage of deprecated `platform.linux_distribution()` [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/902321 [11:11:49] (03PS85) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [11:13:12] (03PS2) 10Kamila Součková: tests: fix usage of deprecated `platform.linux_distribution()` [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/902321 (https://phabricator.wikimedia.org/T332548) [11:13:51] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [11:15:18] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:15:38] (03PS3) 10Kamila Součková: fix usage of deprecated `platform.linux_distribution()` [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/902321 (https://phabricator.wikimedia.org/T332548) [11:15:46] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:15:54] (03PS86) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [11:15:57] (03PS4) 10Kamila Součková: Fix usage of deprecated `platform.linux_distribution()` [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/902321 (https://phabricator.wikimedia.org/T332548) [11:16:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host irc2002.wikimedia.org with OS bullseye [11:16:11] 10SRE, 10SRE-Unowned: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host irc2002.wikimedia.org with OS bullseye completed: - irc2002 (**PASS**) - Removed from Puppet and PuppetDB if p... 
[11:17:44] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [11:17:46] (03CR) 10Hnowlan: Fix usage of deprecated `platform.linux_distribution()` (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/902321 (https://phabricator.wikimedia.org/T332548) (owner: 10Kamila Součková) [11:19:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [11:21:53] (03CR) 10Filippo Giunchedi: [C: 03+1] "\o/ LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/902320 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney) [11:25:04] (03PS1) 10Jbond: team-sre/puppet-agent: Add widespread puppet failure alert [alerts] - 10https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) [11:26:20] (03CR) 10CI reject: [V: 04-1] team-sre/puppet-agent: Add widespread puppet failure alert [alerts] - 10https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) (owner: 10Jbond) [11:26:47] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:27:23] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2004.codfw.wmnet with reason: host reimage [11:30:57] (03PS2) 10Jbond: team-sre/puppet-agent: Add widespread puppet failure alert [alerts] - 10https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) [11:32:04] (03CR) 10CI reject: [V: 04-1] team-sre/puppet-agent: Add widespread puppet failure alert [alerts] - 10https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) (owner: 10Jbond) [11:32:43] (03PS1) 10MVernon: Update mail 
dashboard to use a log scale (workflow testing) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/902324 [11:32:46] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2004.codfw.wmnet with reason: host reimage [11:34:06] (03CR) 10MVernon: [C: 04-2] "Workflow test, not for merging." [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/902324 (owner: 10MVernon) [11:34:49] (03PS3) 10Jbond: team-sre/puppet-agent: Add widespread puppet failure alert [alerts] - 10https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) [11:36:01] (03CR) 10CI reject: [V: 04-1] team-sre/puppet-agent: Add widespread puppet failure alert [alerts] - 10https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) (owner: 10Jbond) [11:36:25] !log btullis@cumin1001 START - Cookbook sre.ganeti.reimage for host an-test-druid1001.eqiad.wmnet with OS bullseye [11:36:30] (03PS4) 10Jbond: team-sre/puppet-agent: Add widespread puppet failure alert [alerts] - 10https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) [11:37:42] (03CR) 10CI reject: [V: 04-1] team-sre/puppet-agent: Add widespread puppet failure alert [alerts] - 10https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) (owner: 10Jbond) [11:39:26] (03PS1) 10Muehlenhoff: Assign mw_rc_irc role to irc2002 [puppet] - 10https://gerrit.wikimedia.org/r/902325 (https://phabricator.wikimedia.org/T331702) [11:43:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:43:49] (03CR) 10Muehlenhoff: [C: 03+2] Assign mw_rc_irc role to irc2002 [puppet] - 10https://gerrit.wikimedia.org/r/902325 
(https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [11:44:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [11:47:00] (03PS3) 10Elukey: changeprop: improve liftwing streams configurability [deployment-charts] - 10https://gerrit.wikimedia.org/r/902237 (https://phabricator.wikimedia.org/T328576) [11:47:02] (03PS2) 10Elukey: services: add the first lift wing stream in changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/902307 (https://phabricator.wikimedia.org/T328576) [11:47:17] !log rolling rollback to HAProxy 2.6.9 in cache upload cluster - T332796 [11:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:23] T332796: HAProxy 2.6.10 crashing in the text cluster - https://phabricator.wikimedia.org/T332796 [11:48:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:48:20] (03PS5) 10Jbond: team-sre/puppet-agent: Add widespread puppet failure alert [alerts] - 10https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) [11:51:38] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-druid1001.eqiad.wmnet with reason: host reimage [11:52:20] (03CR) 10JMeybohm: [C: 03+2] kubernetes: Remove old kubernetes metric from alerts [alerts] - 10https://gerrit.wikimedia.org/r/902120 (https://phabricator.wikimedia.org/T322919) (owner: 10JMeybohm) [11:52:27] !log 
elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2004.codfw.wmnet with OS bullseye [11:53:31] (03Merged) 10jenkins-bot: kubernetes: Remove old kubernetes metric from alerts [alerts] - 10https://gerrit.wikimedia.org/r/902120 (https://phabricator.wikimedia.org/T322919) (owner: 10JMeybohm) [11:54:36] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-druid1001.eqiad.wmnet with reason: host reimage [11:54:50] (03PS1) 10Superpes15: [ckbwiki] Add Draft and Draft_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902347 (https://phabricator.wikimedia.org/T332470) [11:56:53] (03CR) 10Hnowlan: [C: 03+1] changeprop: improve liftwing streams configurability [deployment-charts] - 10https://gerrit.wikimedia.org/r/902237 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [11:57:34] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:57:48] (03CR) 10Samtar: [C: 03+1] [ckbwiki] Add Draft and Draft_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902347 (https://phabricator.wikimedia.org/T332470) (owner: 10Superpes15) [11:58:03] 10SRE, 10Traffic, 10Patch-For-Review, 10Upstream: HAProxy 2.6.10 crashing in the text cluster - https://phabricator.wikimedia.org/T332796 (10Vgutierrez) p:05High→03Medium https://github.com/haproxy/haproxy/commit/407210a34d781f8249504557c371c170cb34f93e introduced in HAProxy 2.6.10 has been identified... 
[11:58:31] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:58:31] (03CR) 10Hnowlan: [C: 03+1] services: add the first lift wing stream in changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/902307 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [11:59:12] (03PS6) 10Jbond: team-sre/puppet-agent: Add widespread puppet failure alert [alerts] - 10https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) [11:59:14] (03PS1) 10Jbond: team-sre/puppet-agent: Add widespread puppet failure (no resources) alert [alerts] - 10https://gerrit.wikimedia.org/r/902348 (https://phabricator.wikimedia.org/T294564) [12:00:37] (03PS1) 10Muehlenhoff: Fix typo in systemctl command [puppet] - 10https://gerrit.wikimedia.org/r/902349 (https://phabricator.wikimedia.org/T331702) [12:04:18] (03PS2) 10Vgutierrez: traffic: Fix HAProxyRestarted dashboard URL [alerts] - 10https://gerrit.wikimedia.org/r/902319 (https://phabricator.wikimedia.org/T332796) [12:04:25] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [12:04:40] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [12:05:47] (03CR) 10CI reject: [V: 04-1] traffic: Fix HAProxyRestarted dashboard URL [alerts] - 10https://gerrit.wikimedia.org/r/902319 (https://phabricator.wikimedia.org/T332796) (owner: 10Vgutierrez) [12:05:56] .... [12:07:03] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/902349 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [12:07:40] (03PS3) 10Vgutierrez: traffic: Fix HAProxyRestarted dashboard URL [alerts] - 10https://gerrit.wikimedia.org/r/902319 (https://phabricator.wikimedia.org/T332796) [12:08:32] PROBLEM - ircecho bot process on irc2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [12:09:00] ^^ moritzm expected? 
[12:09:33] it's being setup currently, not in service yet [12:09:37] ack [12:09:43] fixing it up for bullseye compat currently [12:10:08] (03CR) 10Vgutierrez: [C: 03+2] traffic: Fix HAProxyRestarted dashboard URL [alerts] - 10https://gerrit.wikimedia.org/r/902319 (https://phabricator.wikimedia.org/T332796) (owner: 10Vgutierrez) [12:14:45] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host an-test-druid1001.eqiad.wmnet with OS bullseye [12:16:32] (03CR) 10Muehlenhoff: [C: 03+2] Fix typo in systemctl command [puppet] - 10https://gerrit.wikimedia.org/r/902349 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [12:36:07] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [12:37:09] (03PS1) 10Muehlenhoff: Fix NRPE check when running under Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/902362 (https://phabricator.wikimedia.org/T331702) [12:40:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/902362 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [12:41:07] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [12:42:12] (03CR) 10Kamila Součková: Fix usage of deprecated `platform.linux_distribution()` (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/902321 (https://phabricator.wikimedia.org/T332548) (owner: 10Kamila Součková) [12:46:13] (03CR) 10Hnowlan: Fix 
usage of deprecated `platform.linux_distribution()` (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/902321 (https://phabricator.wikimedia.org/T332548) (owner: 10Kamila Součková) [12:47:57] 10SRE, 10Traffic, 10Upstream: HAProxy 2.6.10 crashing in the text cluster - https://phabricator.wikimedia.org/T332796 (10Vgutierrez) 05Open→03Resolved alerting is in place to avoid this kind of issue upon HAProxy updates in the future and everything is down to 2.6.9: ` vgutierrez@cumin1001:~$ sudo -i cum... [12:51:04] (03PS7) 10Ayounsi: Add validator classes for some objects [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) [12:53:46] (03PS87) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [12:54:07] (03PS1) 10Cathal Mooney: Move event logging checks from Icinga to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007) [12:55:18] (03PS2) 10Cathal Mooney: Move event logging checks from Icinga to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007) [12:55:20] (03CR) 10CI reject: [V: 04-1] Move event logging checks from Icinga to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007) (owner: 10Cathal Mooney) [12:55:36] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [12:56:28] (03CR) 10CI reject: [V: 04-1] Move event logging checks from Icinga to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007) (owner: 10Cathal Mooney) [12:58:32] (03PS3) 10Cathal Mooney: Move event logging checks from Icinga to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007) [12:58:39] (03PS88) 10Slyngshede: Squid logformat to ECS 
[puppet] - 10https://gerrit.wikimedia.org/r/901544 [12:59:41] (03CR) 10CI reject: [V: 04-1] Move event logging checks from Icinga to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007) (owner: 10Cathal Mooney) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T1300). [13:00:05] Superpes and Sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:18] hello [13:00:21] Hi :) [13:00:24] I can deploy [13:00:28] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [13:01:37] Superpes: going to do 902211 and 902216 together first [13:01:54] (03PS4) 10Cathal Mooney: Move event logging checks from Icinga to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007) [13:01:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902211 (https://phabricator.wikimedia.org/T332784) (owner: 10Superpes15) [13:01:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902216 (https://phabricator.wikimedia.org/T332813) (owner: 10Superpes15) [13:02:00] Perfect! :) TheresNoTime thanks! 
[13:02:45] (03Merged) 10jenkins-bot: [dkwikimedia] Fixing current logo with an HD version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902211 (https://phabricator.wikimedia.org/T332784) (owner: 10Superpes15) [13:02:48] (03Merged) 10jenkins-bot: [ptwikinews] Enable wgMinervaEnableSiteNotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902216 (https://phabricator.wikimedia.org/T332813) (owner: 10Superpes15) [13:03:08] (03CR) 10CI reject: [V: 04-1] Move event logging checks from Icinga to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007) (owner: 10Cathal Mooney) [13:03:17] !log samtar@deploy2002 Started scap: Backport for [[gerrit:902211|[dkwikimedia] Fixing current logo with an HD version (T332784)]], [[gerrit:902216|[ptwikinews] Enable wgMinervaEnableSiteNotice (T332813)]] [13:03:24] T332813: enable Sitenotice mobile in ptwikinews - https://phabricator.wikimedia.org/T332813 [13:03:25] T332784: dkwikimedia logo is too big - https://phabricator.wikimedia.org/T332784 [13:03:59] (03PS89) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [13:04:24] (03PS2) 10Samtar: [trwikiquote] Removing the temporary logo (already reverted) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902239 (https://phabricator.wikimedia.org/T329399) (owner: 10Superpes15) [13:04:28] (03PS2) 10Samtar: [ckbwiki] Add Draft and Draft_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902347 (https://phabricator.wikimedia.org/T332470) (owner: 10Superpes15) [13:05:53] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [13:08:21] !log samtar@deploy2002 samtar and superpes: Backport for [[gerrit:902211|[dkwikimedia] Fixing current logo with an HD version (T332784)]], [[gerrit:902216|[ptwikinews] Enable wgMinervaEnableSiteNotice (T332813)]] synced to the testservers: mwdebug1001.eqiad.wmnet, 
mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [13:08:24] Superpes: those two are live on mwdebug [13:08:28] T332813: enable Sitenotice mobile in ptwikinews - https://phabricator.wikimedia.org/T332813 [13:08:29] T332784: dkwikimedia logo is too big - https://phabricator.wikimedia.org/T332784 [13:09:01] TheresNoTime Dkwikimedia's one is fine! I can't check ptwikinews, because they don't have a local sitenotice rn, but I suppose it should be fine! [13:09:09] syncing [13:09:27] Thanks :) [13:13:26] (03PS90) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [13:14:44] 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10jbond) p:05Triage→03Medium [13:15:05] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:902211|[dkwikimedia] Fixing current logo with an HD version (T332784)]], [[gerrit:902216|[ptwikinews] Enable wgMinervaEnableSiteNotice (T332813)]] (duration: 11m 47s) [13:15:12] T332813: enable Sitenotice mobile in ptwikinews - https://phabricator.wikimedia.org/T332813 [13:15:13] T332784: dkwikimedia logo is too big - https://phabricator.wikimedia.org/T332784 [13:15:26] Superpes: moving on to the other two together now [13:15:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902239 (https://phabricator.wikimedia.org/T329399) (owner: 10Superpes15) [13:15:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902347 (https://phabricator.wikimedia.org/T332470) (owner: 10Superpes15) [13:15:51] Ok :D [13:16:30] (03Merged) 10jenkins-bot: [trwikiquote] Removing the temporary logo (already reverted) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902239 (https://phabricator.wikimedia.org/T329399) (owner: 10Superpes15) 
[13:16:33] (03Merged) 10jenkins-bot: [ckbwiki] Add Draft and Draft_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902347 (https://phabricator.wikimedia.org/T332470) (owner: 10Superpes15) [13:16:55] !log samtar@deploy2002 Started scap: Backport for [[gerrit:902239|[trwikiquote] Removing the temporary logo (already reverted) (T329399)]], [[gerrit:902347|[ckbwiki] Add Draft and Draft_talk namespaces (T332470)]] [13:17:01] T332470: Request for the Draft namespace on the Ckb wikipedia - https://phabricator.wikimedia.org/T332470 [13:17:02] T329399: Temporary logo change for trwikiquote - https://phabricator.wikimedia.org/T329399 [13:17:53] (03PS1) 10Jforrester: Disable DoubleWiki extension everywhere, at least for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902388 (https://phabricator.wikimedia.org/T332850) [13:18:50] !log samtar@deploy2002 samtar and superpes: Backport for [[gerrit:902239|[trwikiquote] Removing the temporary logo (already reverted) (T329399)]], [[gerrit:902347|[ckbwiki] Add Draft and Draft_talk namespaces (T332470)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:18:55] Superpes: live on mwdebug :) [13:19:59] TheresNoTime Trwikiquote's one doesn't need to be checked (logos are already reverted) - ckbwiki's task looks fine! :) [13:20:18] syncing [13:20:40] Thanks! 
[13:24:02] (03PS3) 10Samtar: GrowthExperiments: disable add a link backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902131 (https://phabricator.wikimedia.org/T304551) (owner: 10Sergio Gimeno) [13:25:34] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:902239|[trwikiquote] Removing the temporary logo (already reverted) (T329399)]], [[gerrit:902347|[ckbwiki] Add Draft and Draft_talk namespaces (T332470)]] (duration: 08m 39s) [13:25:41] T332470: Request for the Draft namespace on the Ckb wikipedia - https://phabricator.wikimedia.org/T332470 [13:25:42] T329399: Temporary logo change for trwikiquote - https://phabricator.wikimedia.org/T329399 [13:25:43] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/901640 (owner: 10PipelineBot) [13:25:46] (03CR) 10Kosta Harlan: [C: 03+1] GrowthExperiments: disable add a link backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902131 (https://phabricator.wikimedia.org/T304551) (owner: 10Sergio Gimeno) [13:25:56] Superpes: live, just about to run `maintenance/namespaceDupes.php` [13:26:20] !log `[samtar@mwmaint2002 ~]$ mwscript maintenance/namespaceDupes.php --wiki ckbwiki --fix` T332470 [13:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:37] sergi0: ready? [13:26:50] yes [13:26:55] TheresNoTime Yep I supposed it :) Thanks for your help! 
[13:27:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902131 (https://phabricator.wikimedia.org/T304551) (owner: 10Sergio Gimeno) [13:27:56] (03Merged) 10jenkins-bot: GrowthExperiments: disable add a link backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902131 (https://phabricator.wikimedia.org/T304551) (owner: 10Sergio Gimeno) [13:28:16] !log samtar@deploy2002 Started scap: Backport for [[gerrit:902131|GrowthExperiments: disable add a link backend (T304551)]] [13:28:21] T304551: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 [13:29:13] * Lucas_WMDE here now [13:29:37] 2slow [13:29:49] !log samtar@deploy2002 samtar and sgimeno: Backport for [[gerrit:902131|GrowthExperiments: disable add a link backend (T304551)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:30:03] sergi0: live on mwdebug [13:30:26] !log joal@deploy2002 Started deploy [analytics/refinery@f4113ac]: Hotfix analytics deploy (virtualpageview oozie job) [analytics/refinery@f4113ac] [13:30:36] TheresNoTime: the change is disabling the trigger of a maintenance script which runs in a periodic for the mentioned wikis. Dont think we need to test it, the flag is reliable but we could run the refreshLinkRecommendation [13:30:47] ah! 
will sync [13:31:07] Perfect :) [13:31:15] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/901640 (owner: 10PipelineBot) [13:32:47] (03PS3) 10Jforrester: core-Permissions: [dewiki] Add `ipblock-exempt` to `bot` group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902207 (https://phabricator.wikimedia.org/T332759) (owner: 10Samtar) [13:35:49] (03PS1) 10Nicolas Fraison: spark: provide CRUD rights on secret for spark-deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/902391 [13:36:22] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:902131|GrowthExperiments: disable add a link backend (T304551)]] (duration: 08m 05s) [13:36:24] live, and I now have one patch to deploy [13:36:27] T304551: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 [13:36:28] (03PS2) 10Nicolas Fraison: spark: provide CRUD rights on secret for spark-deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/902391 [13:36:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902207 (https://phabricator.wikimedia.org/T332759) (owner: 10Samtar) [13:37:24] (03Merged) 10jenkins-bot: core-Permissions: [dewiki] Add `ipblock-exempt` to `bot` group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902207 (https://phabricator.wikimedia.org/T332759) (owner: 10Samtar) [13:37:49] !log samtar@deploy2002 Started scap: Backport for [[gerrit:902207|core-Permissions: [dewiki] Add `ipblock-exempt` to `bot` group (T332759)]] [13:37:54] T332759: Add right 'ipblock-exempt' for the usergroup Bots on dewiki - https://phabricator.wikimedia.org/T332759 [13:38:26] TheresNoTime: Thank you for the assistance! 
[13:39:22] !log samtar@deploy2002 samtar: Backport for [[gerrit:902207|core-Permissions: [dewiki] Add `ipblock-exempt` to `bot` group (T332759)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:39:24] testing [13:39:46] syncing [13:40:57] (03CR) 10Herron: "Thanks, out of curiosity were you able to generate a grr preview and grr diff for this patch?" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/902324 (owner: 10MVernon) [13:41:53] (03PS1) 10Volans: tox.ini: use sphinx-build instead of setup.py [software/spicerack] - 10https://gerrit.wikimedia.org/r/902395 [13:43:32] !log joal@deploy2002 Finished deploy [analytics/refinery@f4113ac]: Hotfix analytics deploy (virtualpageview oozie job) [analytics/refinery@f4113ac] (duration: 13m 06s) [13:44:15] !log joal@deploy2002 Started deploy [analytics/refinery@f4113ac] (thin): Hotfix analytics deploy (virtualpageview oozie job) THIN [analytics/refinery@f4113ac] [13:44:23] !log joal@deploy2002 Finished deploy [analytics/refinery@f4113ac] (thin): Hotfix analytics deploy (virtualpageview oozie job) THIN [analytics/refinery@f4113ac] (duration: 00m 08s) [13:44:46] !log joal@deploy2002 Started deploy [analytics/refinery@f4113ac] (hadoop-test): Hotfix analytics deploy (virtualpageview oozie job) TEST [analytics/refinery@f4113ac] [13:44:58] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:45:00] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:45:35] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:902207|core-Permissions: [dewiki] Add `ipblock-exempt` to `bot` group (T332759)]] (duration: 07m 46s) [13:45:41] T332759: Add right 'ipblock-exempt' for the usergroup Bots on dewiki - https://phabricator.wikimedia.org/T332759 [13:46:03] !log close UTC afternoon backport window 
[13:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:15] !log joal@deploy2002 Finished deploy [analytics/refinery@f4113ac] (hadoop-test): Hotfix analytics deploy (virtualpageview oozie job) TEST [analytics/refinery@f4113ac] (duration: 01m 28s) [13:46:15] (03PS1) 10EoghanGaffney: Disable the package installed systemd timer for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/902396 [13:46:26] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:46:28] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:49:52] PROBLEM - Check systemd state on mw2336 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:09] (03CR) 10JHathaway: [C: 03+1] "looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/902362 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [13:53:15] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host cephosd1001.eqiad.wmnet [13:54:40] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:54:43] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:55:28] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reimage for host pybal-test2003.codfw.wmnet with OS bullseye [13:55:34] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by sukhe@cumin2002 for host pybal-test2003.codfw.wmnet with OS bullseye [13:57:29] (03PS5) 10Kamila Součková: Remove code for supporting old Debian (<= 9.0) [software/thumbor-plugins] - 
10https://gerrit.wikimedia.org/r/902321 (https://phabricator.wikimedia.org/T332548) [13:58:31] (03CR) 10Kamila Součková: Remove code for supporting old Debian (<= 9.0) (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/902321 (https://phabricator.wikimedia.org/T332548) (owner: 10Kamila Součková) [14:01:48] (03PS6) 10Kamila Součková: Remove code for supporting old Debian (<= 9.0) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/902321 (https://phabricator.wikimedia.org/T332548) [14:02:03] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1001.eqiad.wmnet [14:03:49] !log joal@deploy2002 Started deploy [analytics/refinery@2520d3d]: Hotfix analytics deploy 2nd (virtualpageview oozie job) [analytics/refinery@2520d3d] [14:06:14] (03CR) 10Jbond: [C: 03+1] Add validator classes for some objects [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [14:06:34] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pybal-test2003.codfw.wmnet with reason: host reimage [14:09:00] !log joal@deploy2002 Finished deploy [analytics/refinery@2520d3d]: Hotfix analytics deploy 2nd (virtualpageview oozie job) [analytics/refinery@2520d3d] (duration: 05m 10s) [14:09:17] !log joal@deploy2002 Started deploy [analytics/refinery@2520d3d] (thin): Hotfix analytics deploy (virtualpageview oozie job) 2nd THIN [analytics/refinery@2520d3d] [14:09:26] !log joal@deploy2002 Finished deploy [analytics/refinery@2520d3d] (thin): Hotfix analytics deploy (virtualpageview oozie job) 2nd THIN [analytics/refinery@2520d3d] (duration: 00m 09s) [14:09:34] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1 for forcing k8s, but wow, that's a lot of duplication going on. 
And all for k8s clusters, where there is more or less 0 chance we will " [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm) [14:09:38] !log joal@deploy2002 Started deploy [analytics/refinery@2520d3d] (hadoop-test): Hotfix analytics deploy (virtualpageview oozie job) 2nd TEST [analytics/refinery@2520d3d] [14:09:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pybal-test2003.codfw.wmnet with reason: host reimage [14:10:03] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [14:10:05] (03CR) 10Jbond: [C: 03+1] tox.ini: use sphinx-build instead of setup.py [software/spicerack] - 10https://gerrit.wikimedia.org/r/902395 (owner: 10Volans) [14:10:12] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host cephosd1002.eqiad.wmnet [14:10:42] (03CR) 10Muehlenhoff: [C: 03+2] Fix NRPE check when running under Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/902362 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [14:11:06] (03CR) 10Jbond: [C: 03+1] Remove Icinga prometheus check for irc-echo messages [puppet] - 10https://gerrit.wikimedia.org/r/902320 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney) [14:11:08] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [14:11:11] !log joal@deploy2002 Finished deploy [analytics/refinery@2520d3d] (hadoop-test): Hotfix analytics deploy (virtualpageview oozie job) 2nd TEST [analytics/refinery@2520d3d] (duration: 01m 32s) [14:12:46] RECOVERY - ircecho bot process on irc2002 is OK: PROCS OK: 1 process with command name python3, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [14:13:43] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [14:13:50] !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox [14:15:02] !log jmm@cumin2002 
START - Cookbook sre.ganeti.makevm for new host irc1002.wikimedia.org [14:15:02] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:15:03] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:15:03] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [14:15:23] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [14:15:42] (03CR) 10Elukey: [C: 03+2] changeprop: improve liftwing streams configurability [deployment-charts] - 10https://gerrit.wikimedia.org/r/902237 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [14:15:53] (03CR) 10Elukey: [C: 03+2] services: add the first lift wing stream in changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/902307 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [14:16:05] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [14:16:57] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM irc1002.wikimedia.org - jmm@cumin2002" [14:19:08] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1002.eqiad.wmnet [14:21:38] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host pybal-test2003.codfw.wmnet with OS bullseye [14:21:45] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by sukhe@cumin2002 for host pybal-test2003.codfw.wmnet with OS bullseye completed: - pybal-test2003 (**PASS**) - Downtimed on Icinga/Alertmanage... 
[14:21:54] (03PS1) 10Elukey: services: add option for lift wing config in changeprop staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/902400 (https://phabricator.wikimedia.org/T328576) [14:22:23] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [14:22:26] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1154 - https://phabricator.wikimedia.org/T332649 (10Cmjohnson) 05Open→03Resolved @marostegui the disk has been replaced, I did not add it back to the raid configuration. Please do so at your convenience. [14:22:39] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [14:23:15] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence-Backup: db1150 crashed: DIMM_A8 memory issues - https://phabricator.wikimedia.org/T332708 (10Cmjohnson) @marostegui @jynus I apologize for the delay for this DIMM, Dell had a question that needed responding to and it's delaying the shipment. It should go out to... [14:24:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM irc1002.wikimedia.org - jmm@cumin2002" [14:24:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:24:03] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache irc1002.wikimedia.org on all recursors [14:24:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) irc1002.wikimedia.org on all recursors [14:24:08] (03PS1) 10Herron: Grizzly: update GrafanaDashboard objects to hidden field [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/902401 (https://phabricator.wikimedia.org/T332893) [14:24:52] (03PS2) 10Herron: Grizzly: update grafanaDashboards objects to hidden field [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/902401 (https://phabricator.wikimedia.org/T332893) [14:26:42] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync [14:26:57] !log 
elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [14:27:32] (03PS1) 10Nicolas Fraison: spark: add hadoop conf configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/902402 [14:29:09] !log jhathaway@cumin1001 START - Cookbook sre.ganeti.reimage for host lists1003.wikimedia.org with OS bullseye [14:29:16] 10SRE, 10Wikimedia-Mailing-lists: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jhathaway@cumin1001 for host lists1003.wikimedia.org with OS bullseye [14:29:23] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1154 - https://phabricator.wikimedia.org/T332649 (10Marostegui) RAID being rebuilt: ` root@db1154:~# megacli -PDRbld -ShowProg -physdrv[32:9] -aALL Rebuild Progress on Device at Enclosure 32, Slot 9 Completed 4% in 7 Minutes. Exit Code: 0x00 root@db1154:~# ` [14:31:59] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) Reimaged `pybal-test2003` to bullseye, added `component/pybal` and everything appears to be fine with the installation. ` pybal: Installed: 1.15.10+deb11u1 Candidate: 1.15.10+deb11u1 Version table:... 
[14:33:36] (03PS1) 10Abijeet Patro: MessageWebImporter: Use translation instead of language code on import [extensions/Translate] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/902370 (https://phabricator.wikimedia.org/T323430) [14:35:57] (03CR) 10JMeybohm: [C: 03+1] "Well, that was to easy 😊" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/902401 (https://phabricator.wikimedia.org/T332893) (owner: 10Herron) [14:36:04] PROBLEM - eventgate-main validation error rate too high on alert1001 is CRITICAL: 4.286 gt 0.5 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-main&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos [14:37:11] this is me --^ [14:37:53] ack [14:38:21] (03CR) 10Volans: [C: 03+2] tox.ini: use sphinx-build instead of setup.py [software/spicerack] - 10https://gerrit.wikimedia.org/r/902395 (owner: 10Volans) [14:40:56] (03PS1) 10Elukey: services: disable lift wing stream on changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/902406 (https://phabricator.wikimedia.org/T328576) [14:41:31] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on lists1003.wikimedia.org with reason: host reimage [14:42:24] (03Merged) 10jenkins-bot: tox.ini: use sphinx-build instead of setup.py [software/spicerack] - 10https://gerrit.wikimedia.org/r/902395 (owner: 10Volans) [14:42:27] (03PS2) 10Elukey: services: disable lift wing stream on changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/902406 (https://phabricator.wikimedia.org/T328576) [14:43:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host irc1002.wikimedia.org [14:44:26] (03PS3) 10Elukey: services: disable lift wing stream on changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/902406 (https://phabricator.wikimedia.org/T328576) [14:44:37] (03CR) 10Herron: [V: 03+2 C: 03+2] "Ha, thx 
for the quick review!" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/902401 (https://phabricator.wikimedia.org/T332893) (owner: 10Herron) [14:45:08] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lists1003.wikimedia.org with reason: host reimage [14:45:18] (03PS1) 10Muehlenhoff: Add irc1002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/902407 [14:49:47] (03CR) 10Elukey: [C: 03+2] services: disable lift wing stream on changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/902406 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [14:50:12] (03PS1) 10Nicolas Fraison: spark: authorize communication between executors on blockManager port [deployment-charts] - 10https://gerrit.wikimedia.org/r/902409 [14:50:19] (03CR) 10ArielGlenn: [C: 03+1] "If y'all are fine with it, it's fine with me." [puppet] - 10https://gerrit.wikimedia.org/r/879274 (owner: 10Majavah) [14:50:38] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host cephosd1003.eqiad.wmnet [14:51:41] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync [14:51:49] (03PS1) 10Jbond: swift: convert monitoring::check_prometheus checkes [alerts] - 10https://gerrit.wikimedia.org/r/902410 (https://phabricator.wikimedia.org/T312765) [14:51:56] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [14:53:09] (03CR) 10CI reject: [V: 04-1] swift: convert monitoring::check_prometheus checkes [alerts] - 10https://gerrit.wikimedia.org/r/902410 (https://phabricator.wikimedia.org/T312765) (owner: 10Jbond) [14:53:42] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [14:53:58] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [14:56:32] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host lists1003.wikimedia.org with OS bullseye [14:56:37] 10SRE, 10Wikimedia-Mailing-lists: 
Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jhathaway@cumin1001 for host lists1003.wikimedia.org with OS bullseye completed: - lists1003 (**PASS**) - Removed fr... [14:59:37] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1003.eqiad.wmnet [14:59:47] (03CR) 10Btullis: "I think you still need to bump the chart version, right?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/902409 (owner: 10Nicolas Fraison) [15:00:43] (03CR) 10Btullis: "Can you say in the commit message why it's required please?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/902409 (owner: 10Nicolas Fraison) [15:02:27] (03CR) 10Muehlenhoff: [C: 03+2] Add irc1002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/902407 (owner: 10Muehlenhoff) [15:03:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host irc1002.wikimedia.org with OS bullseye [15:03:46] 10SRE, 10SRE-Unowned: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host irc1002.wikimedia.org with OS bullseye [15:04:17] (03CR) 10Btullis: "I think you still need to bump the chart version." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/902402 (owner: 10Nicolas Fraison) [15:06:47] (03CR) 10Hnowlan: [C: 03+1] services: add option for lift wing config in changeprop staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/902400 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [15:07:52] RECOVERY - eventgate-main validation error rate too high on alert1001 is OK: (C)0.5 gt (W)0 gt 0 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-main&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos [15:08:15] (03CR) 10Btullis: [C: 03+1] "Looks good to me, but I would like to get a +1 from someone from serviceops, if possible." [deployment-charts] - 10https://gerrit.wikimedia.org/r/902391 (owner: 10Nicolas Fraison) [15:08:57] (03PS9) 10AOkoth: eventgate: add EventgateLoggingExternalErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) [15:09:33] (03CR) 10AOkoth: eventgate: add EventgateLoggingExternalErrors alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth) [15:10:07] (03CR) 10CI reject: [V: 04-1] eventgate: add EventgateLoggingExternalErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth) [15:12:08] !log testing haproxy_2.6.11-1~bpo11+wmf2_amd64.deb in text@ulsfo - T332796 [15:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:13] T332796: HAProxy 2.6.10 crashing in the text cluster - https://phabricator.wikimedia.org/T332796 [15:12:30] (03Abandoned) 10Jbond: swift: convert monitoring::check_prometheus checkes [alerts] - 10https://gerrit.wikimedia.org/r/902410 (https://phabricator.wikimedia.org/T312765) (owner: 10Jbond) [15:16:39] !log jmm@cumin2002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on irc1002.wikimedia.org with reason: host reimage [15:19:26] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10colewhite) [15:20:49] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10colewhite) [15:21:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on irc1002.wikimedia.org with reason: host reimage [15:23:44] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor Migration, and 2 others: Pooling thumbor-k8s causes spikes in swift 500 errors - https://phabricator.wikimedia.org/T328033 (10hnowlan) p:05Triage→03High [15:24:45] (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:52] (03CR) 10Btullis: spark: provide CRUD rights on secret for spark-deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/902391 (owner: 10Nicolas Fraison) [15:26:12] (03PS10) 10Filippo Giunchedi: eventgate: add EventgateLoggingExternalErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth) [15:26:54] (03PS7) 10Giuseppe Lavagetto: mesh.configuration: add support for custom error pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) [15:26:56] (03PS2) 10Giuseppe Lavagetto: Revert "Revert: Remove the .Values.kubernetesApi hack" [deployment-charts] - 10https://gerrit.wikimedia.org/r/902078 [15:26:58] (03PS7) 10Giuseppe Lavagetto: charts: upgrade to mesh 1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/901769 
(https://phabricator.wikimedia.org/T287983)
[15:29:45] (JobUnavailable) resolved: Reduced availability for job udpmxircecho in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:33:08] (CR) Btullis: "Could you link to the phab task please nfrainson, with a Bug: in the footer?" [deployment-charts] - https://gerrit.wikimedia.org/r/902391 (owner: Nicolas Fraison)
[15:34:33] (CR) Nicolas Fraison: spark: add hadoop conf configmap (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/902402 (owner: Nicolas Fraison)
[15:36:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host irc1002.wikimedia.org with OS bullseye
[15:36:13] SRE, SRE-Unowned: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host irc1002.wikimedia.org with OS bullseye completed: - irc1002 (**PASS**) - Removed from Puppet and PuppetDB if p...
[15:37:03] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[15:37:06] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[15:40:29] (PS5) Cathal Mooney: Move event logging checks from Icinga to alertmanager [alerts] - https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007)
[15:41:34] (CR) Filippo Giunchedi: "Just a typo inline, rest LGTM" [alerts] - https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) (owner: Jbond)
[15:41:46] (CR) Filippo Giunchedi: "Overall LGTM, see inline for nit/typo" [alerts] - https://gerrit.wikimedia.org/r/902348 (https://phabricator.wikimedia.org/T294564) (owner: Jbond)
[15:42:31] (PS2) Dzahn: phabricator/aphlict: Disable the package installed systemd timer for logrotate [puppet] - https://gerrit.wikimedia.org/r/902396 (owner: EoghanGaffney)
[15:42:41] (CR) Arturo Borrero Gonzalez: [C: +2] hieradata: add wmcs-roots to clouddumps servers [puppet] - https://gerrit.wikimedia.org/r/879274 (owner: Majavah)
[15:44:34] (CR) Dzahn: [C: +1] "yea, confirmed. makes sense, thanks!"
[puppet] - https://gerrit.wikimedia.org/r/902396 (owner: EoghanGaffney)
[15:46:24] (PS1) Muehlenhoff: Add sre-admins as an NDA-relevant group [puppet] - https://gerrit.wikimedia.org/r/902423
[15:47:02] (PS3) Nicolas Fraison: spark: provide CRUD rights on secret for spark-deploy user [deployment-charts] - https://gerrit.wikimedia.org/r/902391 (https://phabricator.wikimedia.org/T332908)
[15:47:17] (PS2) Dzahn: Revert "maintenance: temp allow rsyncing home dir to miscweb" [puppet] - https://gerrit.wikimedia.org/r/902156
[15:47:48] (CR) Nicolas Fraison: spark: provide CRUD rights on secret for spark-deploy user (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/902391 (https://phabricator.wikimedia.org/T332908) (owner: Nicolas Fraison)
[15:47:51] (CR) Dzahn: "cleaning up mediawiki::maintenance profile" [puppet] - https://gerrit.wikimedia.org/r/902156 (owner: Dzahn)
[15:47:57] (CR) Cwhite: [C: +2] "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/902081 (https://phabricator.wikimedia.org/T234565) (owner: Ssingh)
[15:50:53] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[15:52:05] (CR) Dzahn: "removes all this stuff, rsyncd, timer (even though auto-sync isn't enabled), etc: https://puppet-compiler.wmflabs.org/output/902156/40293/" [puppet] - https://gerrit.wikimedia.org/r/902156 (owner: Dzahn)
[15:54:09] (CR) Dzahn: "this was here to originally upload files for static-codereview.wikimedia.org - now they are just synced between the miscweb* servers and m" [puppet] - https://gerrit.wikimedia.org/r/902156 (owner: Dzahn)
[15:55:10] (CR) Filippo Giunchedi: "Just two notes/nits inline, LGTM!"
[alerts] - https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007) (owner: Cathal Mooney)
[15:55:59] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[15:56:01] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[15:57:59] (PS3) Dzahn: miscweb: add monitor for research.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/902224 (https://phabricator.wikimedia.org/T329587)
[15:58:20] (PS3) Dzahn: miscweb: add monitor for wikiworkshop.org [puppet] - https://gerrit.wikimedia.org/r/902225 (https://phabricator.wikimedia.org/T329587)
[15:58:51] (PS3) Dzahn: miscweb/static-codereview: add prometheus monitor, remove icinga monitor [puppet] - https://gerrit.wikimedia.org/r/902226 (https://phabricator.wikimedia.org/T329587)
[16:00:05] jbond and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T1600).
[16:00:05] tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:42] o/ [16:01:04] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [16:01:05] (03CR) 10Elukey: [C: 03+2] services: add option for lift wing config in changeprop staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/902400 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [16:03:08] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync [16:03:23] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [16:04:11] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main2002.codfw.wmnet with reason: stop kafka and reimage [16:04:25] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main2002.codfw.wmnet with reason: stop kafka and reimage [16:06:34] 10SRE, 10serviceops: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (10jhathaway) a:03jhathaway [16:06:45] 10SRE, 10Wikimedia-Mailing-lists: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706 (10jhathaway) a:03jhathaway [16:07:01] (03CR) 10JMeybohm: [C: 03+1] mesh.configuration: add support for custom error pages (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto) [16:07:23] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-main2002.codfw.wmnet with OS bullseye [16:07:23] (03PS2) 10Nicolas Fraison: spark: authorize communication between executors on blockManager port [deployment-charts] - 10https://gerrit.wikimedia.org/r/902409 (https://phabricator.wikimedia.org/T331859) [16:07:25] (03PS2) 10Nicolas Fraison: spark: add hadoop conf configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/902402 (https://phabricator.wikimedia.org/T332909) [16:08:46] (03PS3) 10Nicolas Fraison: spark: add hadoop conf configmap [deployment-charts] - 
10https://gerrit.wikimedia.org/r/902402 (https://phabricator.wikimedia.org/T332909) [16:09:29] (03CR) 10JMeybohm: [C: 03+1] Revert "Revert: Remove the .Values.kubernetesApi hack" [deployment-charts] - 10https://gerrit.wikimedia.org/r/902078 (owner: 10Giuseppe Lavagetto) [16:11:49] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence-Backup: db1150 crashed: DIMM_A8 memory issues - https://phabricator.wikimedia.org/T332708 (10jcrespo) >>! In T332708#8721371, @Cmjohnson wrote: > @marostegui @jynus I apologize for the delay for this DIMM, Dell had a question that needed responding to and it's... [16:14:23] (03PS1) 10Jbond: alertmanager: also pages to sre for data-engineering, releng and search [puppet] - 10https://gerrit.wikimedia.org/r/902431 (https://phabricator.wikimedia.org/T332709) [16:15:49] (03CR) 10JMeybohm: [C: 03+1] charts: upgrade to mesh 1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto) [16:16:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [16:17:01] !log elukey@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-main2002.codfw.wmnet with OS bullseye [16:18:06] jbond, rzl: could one of you help with a puppet deploy? 
[16:20:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:22:29] (03PS4) 10Nicolas Fraison: spark: add hadoop conf configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/902402 (https://phabricator.wikimedia.org/T332909) [16:26:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [16:28:56] ....or anyone else [16:30:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:36:02] (03CR) 10Cathal Mooney: [C: 03+2] Remove Icinga prometheus check for irc-echo messages [puppet] - 10https://gerrit.wikimedia.org/r/902320 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney) [16:36:18] <_joe_> tgr_: I assume that the puppet deployment window today was skipped because of SRE sprint week [16:36:47] <_joe_> but I think I can merge your change, I know that code well enough - just let me check if anyone from the traffic team is available [16:36:58] thx [16:38:08] (03PS2) 10Jbond: alertmanager: also pages to sre for data-engineering, releng and search [puppet] - 10https://gerrit.wikimedia.org/r/902431 (https://phabricator.wikimedia.org/T332709) [16:38:10] (03PS1) 10Jbond: alertmanager: ensure 
the default page action is to sre-page [puppet] - 10https://gerrit.wikimedia.org/r/902434 [16:41:18] (03CR) 10Filippo Giunchedi: [C: 03+1] miscweb/static-codereview: add prometheus monitor, remove icinga monitor [puppet] - 10https://gerrit.wikimedia.org/r/902226 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [16:42:14] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/902434 (owner: 10Jbond) [16:43:33] tgr_: merging shortly [16:43:56] thank you! [16:44:05] (03PS2) 10Jbond: team-sre/puppet-agent: Add widespread puppet failure (no resources) alert [alerts] - 10https://gerrit.wikimedia.org/r/902348 (https://phabricator.wikimedia.org/T294564) [16:44:07] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:44:53] (03CR) 10Jbond: "updated thanks" [alerts] - 10https://gerrit.wikimedia.org/r/902348 (https://phabricator.wikimedia.org/T294564) (owner: 10Jbond) [16:45:37] !log disable Puppet in A:cp to test and then merge CR 901333 [16:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:40] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Discovery-Search (Current work): Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (10Gehel) [16:47:52] (03CR) 10Ssingh: [C: 03+2] multi-dc: Use primary for OAuth for both URL forms [puppet] - 10https://gerrit.wikimedia.org/r/901333 (https://phabricator.wikimedia.org/T313578) (owner: 10Gergő Tisza) [16:48:19] (03CR) 10Jbond: [C: 03+2] alertmanager: ensure the default page action is to sre-page [puppet] - 10https://gerrit.wikimedia.org/r/902434 (owner: 10Jbond) [16:48:29] (03PS3) 10Jbond: alertmanager: also pages to sre for data-engineering, releng and search [puppet] - 10https://gerrit.wikimedia.org/r/902431 
(https://phabricator.wikimedia.org/T332709) [16:48:31] (03PS1) 10Jbond: alertmanager: also pages to sre for data-engineering [puppet] - 10https://gerrit.wikimedia.org/r/902435 (https://phabricator.wikimedia.org/T332709) [16:50:40] jbond: you have a merge running? [16:50:48] it's stalling I think should thought I should check :) [16:51:08] sukhe: you happy for me to merge yours i guess? [16:51:11] yes please [16:51:20] going now :) [16:51:29] (03CR) 10Jbond: alertmanager: also pages to sre for data-engineering, releng and search (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902431 (https://phabricator.wikimedia.org/T332709) (owner: 10Jbond) [16:51:43] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:51:53] thanks! [16:53:57] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/902423 (owner: 10Muehlenhoff) [16:54:49] tgr_: looks good, rolling it out to A:cp-text [16:54:56] (03PS1) 10EoghanGaffney: Remove prometheus=k8s from matcher in sessionstore alert [alerts] - 10https://gerrit.wikimedia.org/r/902438 [16:54:59] will let you know here when it's done :) [16:55:44] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove prometheus=k8s from matcher in sessionstore alert [alerts] - 10https://gerrit.wikimedia.org/r/902438 (owner: 10EoghanGaffney) [16:56:08] thanks sukhe! It's not immediately testable, but we should see the errors at https://logstash.wikimedia.org/goto/a94bdc105d8854535ae5772806968d61 go down in an hour or so if it works. [16:56:25] (03CR) 10Dzahn: [C: 03+2] miscweb/static-codereview: add prometheus monitor, remove icinga monitor [puppet] - 10https://gerrit.wikimedia.org/r/902226 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [16:56:27] noted! 
[16:56:30] (And if that was indeed the reason for the errors, which I am not 100% sure about, but it seems likely.) [16:56:32] (03PS4) 10Dzahn: miscweb/static-codereview: add prometheus monitor, remove icinga monitor [puppet] - 10https://gerrit.wikimedia.org/r/902226 (https://phabricator.wikimedia.org/T329587) [16:57:17] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:35] (03CR) 10Giuseppe Lavagetto: mesh.configuration: add support for custom error pages (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto) [16:57:43] (03CR) 10EoghanGaffney: [C: 03+2] Remove prometheus=k8s from matcher in sessionstore alert [alerts] - 10https://gerrit.wikimedia.org/r/902438 (owner: 10EoghanGaffney) [16:59:02] (03Merged) 10jenkins-bot: Remove prometheus=k8s from matcher in sessionstore alert [alerts] - 10https://gerrit.wikimedia.org/r/902438 (owner: 10EoghanGaffney) [16:59:43] !log rolling out CR 901333 to A:cp-text T313578 [16:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:48] T313578: Make OAuth work in Multi-DC active/active mode - https://phabricator.wikimedia.org/T313578 [16:59:50] (03CR) 10EoghanGaffney: [C: 03+1] miscweb: add monitor for research.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/902224 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [17:00:04] bd808: It is that lovely time of the day again! You are hereby commanded to deploy Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T1700). 
[17:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T1700) [17:00:05] (03CR) 10EoghanGaffney: [C: 03+1] miscweb: add monitor for wikiworkshop.org [puppet] - 10https://gerrit.wikimedia.org/r/902225 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [17:00:14] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/902333 [17:00:40] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:01] (03CR) 10Filippo Giunchedi: alertmanager: also pages to sre for data-engineering, releng and search (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902431 (https://phabricator.wikimedia.org/T332709) (owner: 10Jbond) [17:04:30] (03CR) 10Jbond: [C: 04-1] "see inline and pcc" [puppet] - 10https://gerrit.wikimedia.org/r/902396 (owner: 10EoghanGaffney) [17:05:16] (03CR) 10Clément Goubert: [C: 03+1] Revert "maintenance: temp allow rsyncing home dir to miscweb" [puppet] - 10https://gerrit.wikimedia.org/r/902156 (owner: 10Dzahn) [17:06:06] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:28] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:56] RECOVERY - Check systemd state on mw2336 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:26] (03CR) 10Dzahn: [C: 03+2] miscweb: add monitor for wikiworkshop.org [puppet] - 10https://gerrit.wikimedia.org/r/902225 
(https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [17:10:32] (03PS4) 10Dzahn: miscweb: add monitor for wikiworkshop.org [puppet] - 10https://gerrit.wikimedia.org/r/902225 (https://phabricator.wikimedia.org/T329587) [17:12:16] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:44] PROBLEM - Check systemd state on aphlict1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:08] (03PS4) 10Dzahn: miscweb: add monitor for research.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/902224 (https://phabricator.wikimedia.org/T329587) [17:13:32] RECOVERY - MegaRAID on db1154 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:13:38] eoghan: ^ re: aphlict alert. probably just needs "systemctl reset-failed" now to clear it up, since you remove duplicate logrotate [17:13:48] eoghan: thanks for reviews [17:15:03] mutante: I haven't merged the logrotate thing yet [17:16:07] eoghan: oh, heh, for some reason logrotate timer fails already [17:16:14] looks [17:17:21] mutante: Mar 23 17:00:54 aphlict1001 systemd[1]: logrotate.timer: Unit to trigger vanished. [17:17:22] Weird [17:18:06] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:06] !log aphlict1001 - systemctl reset-failed; systemctl start logrotate ; systemctl start logrotate.timer [17:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:12] eoghan: eh, yea [17:18:22] it seemed soooo related to the patch .. 
so I just assumed [17:18:28] must be the duplicate service [17:18:32] RECOVERY - Check systemd state on aphlict1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:32] or what it's fixing [17:18:37] ^:) [17:18:55] I know, right? Weird coincidence. [17:19:03] [aphlict1001:~] $ systemctl list-units --state=failed [17:19:03] 0 loaded units listed. Pass --all to see loaded but inactive units, too. [17:19:06] cleaned up [17:19:11] in case it happens again we will see [17:19:11] ty [17:19:15] np,yw [17:20:03] maybe this happens when the 2 logrotates mess with each other [17:20:10] (03PS1) 10Jgiannelos: wikifeeds: Bump to latest image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/902443 [17:20:11] that would mean your change will fix i [17:20:34] could make sense then about "trigger vanished" [17:21:40] (03CR) 10Dzahn: [C: 03+2] miscweb: add monitor for research.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/902224 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [17:22:52] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:31] (03CR) 10Hnowlan: [C: 03+1] Remove code for supporting old Debian (<= 9.0) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/902321 (https://phabricator.wikimedia.org/T332548) (owner: 10Kamila Součková) [17:23:59] (03CR) 10Dzahn: "Kind of would like to see what it's actually doing by compiling it.. 
the issue with that is just that the compiler never knows about new h" [puppet] - 10https://gerrit.wikimedia.org/r/902222 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [17:26:02] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: Bump to latest image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/902443 (owner: 10Jgiannelos) [17:26:16] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:27:22] (03CR) 10Dzahn: "So I see in the doc profile it has parameters like these:" [puppet] - 10https://gerrit.wikimedia.org/r/902222 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [17:28:47] (03CR) 10Dzahn: "also I noticed this is a scap target:" [puppet] - 10https://gerrit.wikimedia.org/r/902222 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [17:30:31] (03PS1) 10Kamila Součková: admin: add user kamila [puppet] - 10https://gerrit.wikimedia.org/r/902444 [17:31:04] (03Merged) 10jenkins-bot: wikifeeds: Bump to latest image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/902443 (owner: 10Jgiannelos) [17:31:58] (03CR) 10Dzahn: "re: puppet compiling, I see now you already did but for the other 2 hosts and it shows they are not affected, which is good." [puppet] - 10https://gerrit.wikimedia.org/r/902222 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [17:32:45] (03CR) 10Dzahn: "ci-docroot:" [puppet] - 10https://gerrit.wikimedia.org/r/902222 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [17:35:18] (03CR) 10Dzahn: [C: 03+1] "if you want to merge as is, go ahead, just expect errors with scap and follow-up with the lists in Hiera. ping me on IRC if you want." 
[puppet] - 10https://gerrit.wikimedia.org/r/902222 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [17:36:52] (03CR) 10Dzahn: [C: 03+2] rt: Remove some old migration cruft [puppet] - 10https://gerrit.wikimedia.org/r/902049 (owner: 10Muehlenhoff) [17:37:34] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [17:38:02] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [17:38:15] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [17:38:27] !log moscovium - systemctl stop rsync [17:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:00] (03CR) 10Dzahn: [C: 03+2] "17:38 < mutante> !log moscovium - systemctl stop rsync" [puppet] - 10https://gerrit.wikimedia.org/r/902049 (owner: 10Muehlenhoff) [17:39:11] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [17:39:26] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [17:39:58] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [17:43:23] (03PS1) 10Giuseppe Lavagetto: mediawiki::errorpage: rationalize usage [puppet] - 10https://gerrit.wikimedia.org/r/902446 [17:43:25] (03PS1) 10Giuseppe Lavagetto: mediawiki::tlsproxy::yaml_defs: add error page to envoy [puppet] - 10https://gerrit.wikimedia.org/r/902447 (https://phabricator.wikimedia.org/T287983) [17:43:27] (03PS1) 10Giuseppe Lavagetto: profile::tlsproxy::envoy: fix margins in errorpage [puppet] - 10https://gerrit.wikimedia.org/r/902448 [17:43:41] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:45] (03CR) 10CI reject: [V: 04-1] mediawiki::errorpage: rationalize usage [puppet] - 10https://gerrit.wikimedia.org/r/902446 (owner: 10Giuseppe Lavagetto) [17:45:51] (03CR) 10CI reject: 
[V: -1] mediawiki::tlsproxy::yaml_defs: add error page to envoy [puppet] - https://gerrit.wikimedia.org/r/902447 (https://phabricator.wikimedia.org/T287983) (owner: Giuseppe Lavagetto)
[17:46:03] (CR) CI reject: [V: -1] profile::tlsproxy::envoy: fix margins in errorpage [puppet] - https://gerrit.wikimedia.org/r/902448 (owner: Giuseppe Lavagetto)
[17:46:30] (PS1) Volans: run_cookook: fix/improve calls to other cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/902449
[17:48:55] (CR) CI reject: [V: -1] run_cookook: fix/improve calls to other cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/902449 (owner: Volans)
[17:51:42] SRE, SRE-Access-Requests: Requesting access to production for kamila - https://phabricator.wikimedia.org/T332921 (hnowlan)
[17:52:01] (CR) JHathaway: [C: +1] "looks good to me!" [cookbooks] - https://gerrit.wikimedia.org/r/902449 (owner: Volans)
[17:52:29] (PS2) Hnowlan: admin: add user kamila [puppet] - https://gerrit.wikimedia.org/r/902444 (https://phabricator.wikimedia.org/T332921) (owner: Kamila Součková)
[17:54:44] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to production for kamila - https://phabricator.wikimedia.org/T332921 (FJoseph-WMF) Approved.
[17:54:46] (03CR) 10Btullis: osd: Add osd on new ceph cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [17:55:22] (03CR) 10Dzahn: "not working per https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*miscweb.*%22%7D&g0.tab=1&g0.stacked=0&g0.range" [puppet] - 10https://gerrit.wikimedia.org/r/902226 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [17:55:26] (03PS6) 10Cathal Mooney: Move event logging checks from Icinga to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007) [18:00:04] dancy and brennen: (Dis)respected human, time to deploy MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T1800). Please do the needful. [18:00:43] (03PS2) 10Volans: run_cookook: fix/improve calls to other cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/902449 [18:02:19] o/ [18:03:03] dancy's currently experiencing some connectivity issues; will roll train forward shortly. [18:06:55] (03PS1) 10TrainBranchBot: all wikis to 1.41.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902452 (https://phabricator.wikimedia.org/T330207) [18:06:57] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.41.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902452 (https://phabricator.wikimedia.org/T330207) (owner: 10TrainBranchBot) [18:07:42] (03Merged) 10jenkins-bot: all wikis to 1.41.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902452 (https://phabricator.wikimedia.org/T330207) (owner: 10TrainBranchBot) [18:15:20] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: all wikis to 1.41.0-wmf.1 refs T330207 [18:15:26] T330207: 1.41.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T330207 [18:16:04] tgr_: 901333 rollout completed [18:16:35] Thanks a lot sukhe! 
Does that also include the ATS restart? [18:16:55] yep [18:17:06] Cool! I'll watch the logs. [18:17:08] restarted ATS too and that's why it took time [18:17:12] ok! [18:20:41] (03PS7) 10Cathal Mooney: Move event logging checks from Icinga to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007) [18:23:27] (03CR) 10Cathal Mooney: "Thanks for the feedback Filippo! Hope this now looks ok, I had one question about the last alert inline, relating to the time window if w" [alerts] - 10https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007) (owner: 10Cathal Mooney) [18:28:43] brennen: Thanks for the backup! I have recovered. [18:30:12] dancy: cool cool. things are quiet so far, assuming typing this sentence hasn't jinxed anything. [18:30:16] (03PS1) 10Cathal Mooney: Remove Eventlogging prometheus-based Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/902454 (https://phabricator.wikimedia.org/T309007) [18:36:36] (03PS1) 10Urbanecm: [Growth] eswiki: Enable mentorship for 50% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902455 (https://phabricator.wikimedia.org/T332737) [18:46:31] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10serviceops, 10Wikimedia-Incident: Add etcdmirror connection retry on etcd-tls-proxy unavailability - https://phabricator.wikimedia.org/T317535 (10Volans) a:05Volans→03None [18:46:55] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Gerrit, 10serviceops-collab, and 3 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10Volans) a:05Volans→03None [18:48:27] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Patch-For-Review, 10Sustainability (Incident Followup): followups to unactionable NELHigh pages due to Telecom Italia outage, 2023-02-05 - https://phabricator.wikimedia.org/T328941 (10Volans) a:05Volans→03None Removing myself from assignee as... 
[18:50:59] (03CR) 10Dzahn: "fails ->> https://phabricator.wikimedia.org/T332919" [puppet] - 10https://gerrit.wikimedia.org/r/902225 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [18:51:16] (03CR) 10Dzahn: [C: 03+2] "https://phabricator.wikimedia.org/T332919" [puppet] - 10https://gerrit.wikimedia.org/r/902226 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [18:52:30] (03CR) 10Dzahn: [C: 03+2] "works per https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*miscweb.*%22%7D&g0.tab=1&g0.stacked=0&g0.range_input" [puppet] - 10https://gerrit.wikimedia.org/r/902224 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [18:55:33] (03PS1) 10Dzahn: miscweb: wikiworkshop, change monitor to check redirect target [puppet] - 10https://gerrit.wikimedia.org/r/902456 (https://phabricator.wikimedia.org/T329587) [18:56:13] (03CR) 10Dzahn: [C: 03+2] miscweb: wikiworkshop, change monitor to check redirect target [puppet] - 10https://gerrit.wikimedia.org/r/902456 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [19:01:33] (03PS1) 10Jbond: team-sre/hardware: Add disk space [alerts] - 10https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) [19:01:59] 10SRE, 10ops-eqiad: Q1 '19:(Need by: 2020-06-30) replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (10Jclark-ctr) ports from old SCS Name Cable Connection port1 1637 ps1-a1-eqiad (WMF5197) console0 port2 1638 ps1-a2-eqiad (WMF5221) console0 port3 1643 ps1-a3-eqiad (WMF5192) console0 port... 
[19:02:10] (03PS1) 10Dzahn: miscweb/static_codreview: fix string expected by http blackbox monitor [puppet] - 10https://gerrit.wikimedia.org/r/902458 (https://phabricator.wikimedia.org/T329587) [19:02:41] (03CR) 10Dzahn: [C: 03+2] miscweb/static_codreview: fix string expected by http blackbox monitor [puppet] - 10https://gerrit.wikimedia.org/r/902458 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [19:02:45] (03CR) 10CI reject: [V: 04-1] team-sre/hardware: Add disk space [alerts] - 10https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [19:14:33] !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts doc2002 [19:17:43] (03CR) 10Dzahn: [C: 03+2] "fixed per https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*miscweb.*%22%7D&g0.tab=1&g0.stacked=0&g0.range_input" [puppet] - 10https://gerrit.wikimedia.org/r/902458 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [19:18:22] (03CR) 10Dzahn: [C: 03+2] "fixed per https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*miscweb.*%22%7D&g0.tab=1&g0.stacked=0&g0.range_input" [puppet] - 10https://gerrit.wikimedia.org/r/902456 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [19:18:26] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [19:20:48] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doc2002 decommissioned, removing all IPs except the asset tag one - denisse@cumin1001" [19:28:11] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doc2002 decommissioned, removing all IPs except the asset tag one - denisse@cumin1001" [19:28:11] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:28:13] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.decommission 
(exit_code=0) for hosts doc2002 [19:28:21] 10SRE, 10vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by denisse@cumin1001 for hosts: `doc2002` - doc2002 (**WARN**) - //Host not found on Icinga, unable to downtime it// - Found Ganeti VM - VM shu... [19:31:41] !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host doc2002.codfw.wmnet [19:31:43] !log denisse@cumin1001 START - Cookbook sre.dns.netbox [19:35:14] (03PS2) 10Andrea Denisse: doc: Add role::doc to doc1003 [puppet] - 10https://gerrit.wikimedia.org/r/902222 (https://phabricator.wikimedia.org/T319477) [19:35:49] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doc2002.codfw.wmnet - denisse@cumin1001" [19:36:49] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doc2002.codfw.wmnet - denisse@cumin1001" [19:36:49] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:36:49] !log denisse@cumin1001 START - Cookbook sre.dns.wipe-cache doc2002.codfw.wmnet on all recursors [19:36:52] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doc2002.codfw.wmnet on all recursors [19:38:18] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40298/console" [puppet] - 10https://gerrit.wikimedia.org/r/902222 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [19:40:57] (03PS1) 10Volans: remote: add results to RemoteExecutionError [software/spicerack] - 10https://gerrit.wikimedia.org/r/902460 [19:53:57] (03PS1) 10Krinkle: objectcache: Disable cool-off bounce feature [core] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/902376 
(https://phabricator.wikimedia.org/T203786) [20:00:05] brennen and TheresNoTime: That opportune time is upon us again. Time for a UTC late backport and config training deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230323T2000). [20:00:05] Nikerabbit: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] here [20:01:23] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Please update bootp helper on pfw3-eqiad to point to frpm1002 for fundraising subnets - https://phabricator.wikimedia.org/T332939 (10Dwisehaupt) [20:01:43] o/ I can deploy I guess [20:02:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [extensions/Translate] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/902370 (https://phabricator.wikimedia.org/T323430) (owner: 10Abijeet Patro) [20:03:25] taavi: I see I'm not the only one up late ;) [20:07:37] lol [20:10:37] (03CR) 10CI reject: [V: 04-1] objectcache: Disable cool-off bounce feature [core] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/902376 (https://phabricator.wikimedia.org/T203786) (owner: 10Krinkle) [20:13:02] (03PS2) 10Krinkle: objectcache: Disable cool-off bounce feature [core] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/902376 (https://phabricator.wikimedia.org/T203786) [20:16:33] (03PS1) 10JHathaway: lists: Change role of lists1003 [puppet] - 10https://gerrit.wikimedia.org/r/902472 (https://phabricator.wikimedia.org/T331706) [20:21:38] (03CR) 10JHathaway: [C: 03+2] lists: Change role of lists1003 [puppet] - 10https://gerrit.wikimedia.org/r/902472 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [20:22:41] (03Merged) 10jenkins-bot: MessageWebImporter: Use translation instead of language code on import [extensions/Translate] (wmf/1.41.0-wmf.1) - 
10https://gerrit.wikimedia.org/r/902370 (https://phabricator.wikimedia.org/T323430) (owner: 10Abijeet Patro) [20:22:59] !log taavi@deploy2002 Started scap: Backport for [[gerrit:902370|MessageWebImporter: Use translation instead of language code on import (T323430)]] [20:23:05] T323430: Bug with importing fuzzy translations from .po files - https://phabricator.wikimedia.org/T323430 [20:24:35] !log taavi@deploy2002 abi and taavi: Backport for [[gerrit:902370|MessageWebImporter: Use translation instead of language code on import (T323430)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:24:38] Nikerabbit: pulled to mwdebug servers, please test if possible [20:24:58] taavi: does it matter which debug server I choose from the list? [20:25:27] just pick the first one [20:26:33] okay, testing [20:26:49] scap these days pushes the patches to all servers, so you just want one in the writeable DC and the list is sorted so those come first [20:27:38] fix confirmed: https://meta.wikimedia.org/wiki/Translations:User:APatro_(WMF)/Test_Translation_Imports/1/fi has proper content instead of language code [20:28:01] thanks, syncing [20:33:13] !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doc2002.codfw.wmnet [20:33:27] 10SRE, 10LDAP-Access-Requests: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10Dzahn) > The username of your existing account on wikitech.wikimedia.org: confirmed. that's uidNumber: 43615 with a "-ctr@wikimedia" email address. > Do you currently have shell access (Yes/No)? Doe... 
[20:33:56] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:902370|MessageWebImporter: Use translation instead of language code on import (T323430)]] (duration: 10m 56s) [20:34:01] ok, done [20:34:02] T323430: Bug with importing fuzzy translations from .po files - https://phabricator.wikimedia.org/T323430 [20:34:52] !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host doc2002.codfw.wmnet with OS bullseye [20:34:57] 10SRE, 10vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye [20:35:38] !log denisse@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host doc2002.codfw.wmnet with OS bullseye [20:35:43] 10SRE, 10vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye executed with errors: - doc2002 (**FAIL**) - **The reimage failed, see the c... 
[20:35:59] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40299/console" [puppet] - 10https://gerrit.wikimedia.org/r/902222 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [20:37:39] taavi: thanks, kiitos [20:42:03] !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host doc2002.codfw.wmnet with OS bullseye [20:42:08] 10SRE, 10vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye [20:42:10] !log denisse@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host doc2002.codfw.wmnet with OS bullseye [20:42:14] 10SRE, 10vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye executed with errors: - doc2002 (**FAIL**) - **The reimage failed, see the c... 
[20:47:59] (03PS1) 10JHathaway: bookworm: use default mtail pkg [puppet] - 10https://gerrit.wikimedia.org/r/902479 (https://phabricator.wikimedia.org/T331706) [20:50:54] (03PS1) 10JHathaway: lists: allow lists1003 to grab a cert [puppet] - 10https://gerrit.wikimedia.org/r/902481 (https://phabricator.wikimedia.org/T331706) [20:51:55] (03CR) 10JHathaway: [C: 03+2] lists: allow lists1003 to grab a cert [puppet] - 10https://gerrit.wikimedia.org/r/902481 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [20:52:09] (03CR) 10Aaron Schulz: Add per-action component-level profiling in statsd using excimer (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) (owner: 10Aaron Schulz) [20:57:40] (03CR) 10Cwhite: [C: 04-2] "This filter does not match the kind of logs found here: https://logstash.wikimedia.org/app/discover#/?_g=(filters:!((query:(match_phrase:(" [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [21:00:24] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/902479 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [21:03:01] PROBLEM - Check systemd state on lists1003 is CRITICAL: CRITICAL - degraded: The following units failed: mailman3.service,mtail.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:08:25] PROBLEM - Exim SMTP on lists1003 is CRITICAL: connect to address 208.80.154.5 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim [21:10:03] PROBLEM - HTTPS on lists1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed error:1408F10B:SSL routines:ssl3_get_record:wrong version number https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:14:44] ^ ignoring that because I know jhathaway just created that [21:15:02] oops sorry, i'll downtime those, thanks mutante [21:15:15] np:) thx [21:15:28] always happens when a new role is 
applied [21:16:39] PROBLEM - mailman archives on lists1003 is CRITICAL: CRITICAL - Cannot make SSL connection. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:16:39] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40300/console" [puppet] - 10https://gerrit.wikimedia.org/r/902222 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [21:17:28] (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] doc: Add role::doc to doc1003 [puppet] - 10https://gerrit.wikimedia.org/r/902222 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [21:22:03] (03PS1) 10Andrea Denisse: doc: Add the doc2002 node definition [puppet] - 10https://gerrit.wikimedia.org/r/902489 (https://phabricator.wikimedia.org/T332819) [21:23:17] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40301/console" [puppet] - 10https://gerrit.wikimedia.org/r/902489 (https://phabricator.wikimedia.org/T332819) (owner: 10Andrea Denisse) [21:24:33] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "doc2002 - denisse@cumin1001 - T332819" [21:24:39] T332819: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 [21:25:22] (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] doc: Add the doc2002 node definition [puppet] - 10https://gerrit.wikimedia.org/r/902489 (https://phabricator.wikimedia.org/T332819) (owner: 10Andrea Denisse) [21:25:39] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "doc2002 - denisse@cumin1001 - T332819" [21:26:07] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [21:26:09] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: 
apply [21:30:33] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [21:30:35] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [21:31:55] !log denisse@cumin1001 START - Cookbook sre.ganeti.reimage for host doc2002.codfw.wmnet with OS bullseye [21:32:00] 10SRE, 10vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye [21:33:24] (03PS3) 10EoghanGaffney: Disable the package installed systemd timer for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/902396 [21:34:35] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1003.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:35:53] (03PS1) 10JHathaway: bookworm: Update spamassassin daemon name [puppet] - 10https://gerrit.wikimedia.org/r/902496 (https://phabricator.wikimedia.org/T331706) [21:36:18] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/902496 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [21:43:22] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:44:33] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - 
https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:45:27] !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on doc2002.codfw.wmnet with reason: host reimage [21:48:46] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doc2002.codfw.wmnet with reason: host reimage [21:51:52] (03CR) 10Krinkle: Add per-action component-level profiling in statsd using excimer (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) (owner: 10Aaron Schulz) [21:53:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on moscovium.eqiad.wmnet with reason: dist-upgrade [21:54:06] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on moscovium.eqiad.wmnet with reason: dist-upgrade [21:54:07] (03PS1) 10JHathaway: apache2: Use systemd provider [puppet] - 10https://gerrit.wikimedia.org/r/902501 (https://phabricator.wikimedia.org/T331706) [21:57:30] !log moscovium - apt-get upgrade (rt.wikimedia.org going into maintenance) T327068 [21:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:35] T327068: Bullseye upgrade for remaining Collab hosts - https://phabricator.wikimedia.org/T327068 [21:57:39] (03PS1) 10Bking: elasticsearch: [WIP] Add node ban logic [cookbooks] - 10https://gerrit.wikimedia.org/r/902502 (https://phabricator.wikimedia.org/T331303) [22:00:03] (03CR) 10CI reject: [V: 04-1] elasticsearch: [WIP] Add node ban logic [cookbooks] - 10https://gerrit.wikimedia.org/r/902502 (https://phabricator.wikimedia.org/T331303) (owner: 10Bking) [22:00:49] !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host doc2002.codfw.wmnet with OS bullseye [22:00:53] !log moscovium - apt-get full-upgrade ; apt autoremove ; replace buster with bullseye in sources.list ; repeat apt-get upgrade/full-upgrade etc. 
(https://wiki.debian.org/DebianUpgrade) T327068 [22:00:54] 10SRE, 10vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye completed: - doc2002 (**PASS**) - Removed from Puppet and PuppetDB if presen... [22:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:43] 10SRE, 10vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (10andrea.denisse) 05Open→03Resolved [22:09:28] !log moscovium - when doing an in-place upgrade from buster to bullseye and you replace the string in sources.list, you also need to replace "bullseye-updates" with "bullseye-security" in the security.debian.org lines - that this is needed is called a bug at https://shagain.club/index.php/archives/641/ - T327068 [22:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:34] T327068: Bullseye upgrade for remaining Collab hosts - https://phabricator.wikimedia.org/T327068 [22:11:12] (03PS1) 10Andrea Denisse: doc: Add role::doc to doc2002 [puppet] - 10https://gerrit.wikimedia.org/r/902505 (https://phabricator.wikimedia.org/T319477) [22:16:54] (03PS2) 10JHathaway: apache2: Use systemd provider [puppet] - 10https://gerrit.wikimedia.org/r/902501 (https://phabricator.wikimedia.org/T331706) [22:17:09] (03CR) 10Dzahn: [C: 04-1] "You got "wikimedia.org" in the new names." 
[puppet] - 10https://gerrit.wikimedia.org/r/902505 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [22:17:14] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/902501 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [22:17:48] (03CR) 10Dzahn: [C: 04-1] "imho it's fine to just list them all and not do the (eqiad|codfw) and [12] regex, but up to you" [puppet] - 10https://gerrit.wikimedia.org/r/902505 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [22:18:22] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:19:33] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:20:29] !log moscovium performing apt-get full-upgrade T332952 [22:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:34] T332952: upgrade moscovium (RT) to bullseye - https://phabricator.wikimedia.org/T332952 [22:20:54] ACKNOWLEDGEMENT - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1003.eqiad.wmnet.service Andrea Denisse Migrating doc hosts to Bullseye T319477 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:23:32] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:30:10] (03CR) 10Aaron Schulz: Add 
per-action component-level profiling in statsd using excimer (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) (owner: 10Aaron Schulz) [22:30:32] !log moscovium - rebooting to finalize distro release upgrade - T332952 [22:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:38] T332952: upgrade moscovium (RT) to bullseye - https://phabricator.wikimedia.org/T332952 [22:30:56] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [22:30:59] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [22:52:40] (03CR) 10EoghanGaffney: Disable the package installed systemd timer for logrotate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902396 (owner: 10EoghanGaffney) [22:54:13] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [22:54:16] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [22:56:15] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [22:56:17] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [22:58:20] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [22:58:22] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [23:00:42] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Dzahn) [23:14:00] (03PS1) 10Dzahn: planet: the 
HTTPS_PROXY itself is accessed by http [puppet] - 10https://gerrit.wikimedia.org/r/902513 [23:16:49] (03PS2) 10Dzahn: planet: the HTTPS_PROXY itself is accessed via http [puppet] - 10https://gerrit.wikimedia.org/r/902513 [23:17:11] (03CR) 10CI reject: [V: 04-1] planet: the HTTPS_PROXY itself is accessed via http [puppet] - 10https://gerrit.wikimedia.org/r/902513 (owner: 10Dzahn) [23:26:08] (03PS1) 10Dzahn: planet: update the feed URLs getting 3xx [puppet] - 10https://gerrit.wikimedia.org/r/902515 [23:37:38] (03PS1) 10Cwhite: logstash: add thanos-query ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902334 (https://phabricator.wikimedia.org/T234565) [23:52:08] (03PS1) 10Papaul: Add new PDU's in row E-F[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/902519 (https://phabricator.wikimedia.org/T290899) [23:53:22] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:55:58] (03CR) 10Papaul: [C: 03+2] Add new PDU's in row E-F[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/902519 (https://phabricator.wikimedia.org/T290899) (owner: 10Papaul) [23:58:22] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
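Note on the 22:09:28 !log entry for moscovium (sources.list during the buster to bullseye in-place upgrade): a minimal sketch of that edit, assuming a plain one-line-per-entry sources.list. The function name `upgrade_sources` is hypothetical and is not part of any cookbook or tool mentioned in this log. It swaps the release name, then, only on security.debian.org lines, rewrites the suite to "bullseye-security". Both "bullseye-updates" (the wording used in the log entry) and "bullseye/updates" (what a plain string swap of "buster/updates" produces) are handled. Lines for other mirrors keep "bullseye-updates" unchanged.

```python
import re


def upgrade_sources(text: str, old: str = "buster", new: str = "bullseye") -> str:
    """Return sources.list content rewritten from `old` to `new`.

    Hypothetical helper for illustration only; it edits a string and
    never touches /etc/apt/sources.list itself.
    """
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        # Leave blank lines and comments untouched.
        if not stripped or stripped.startswith("#"):
            out.append(line)
            continue
        # Plain release-name swap, e.g. "buster" -> "bullseye".
        line = re.sub(rf"\b{re.escape(old)}\b", new, line)
        # Security archive lines need the "<new>-security" suite name.
        if "security.debian.org" in line:
            line = re.sub(
                rf"\b{re.escape(new)}[/-]updates\b", f"{new}-security", line
            )
        out.append(line)
    # Preserve a trailing newline if the input had one.
    return "\n".join(out) + ("\n" if text.endswith("\n") else "")


if __name__ == "__main__":
    sample = (
        "deb http://deb.debian.org/debian buster main\n"
        "deb http://security.debian.org/debian-security buster/updates main\n"
        "deb http://deb.debian.org/debian buster-updates main\n"
    )
    print(upgrade_sources(sample), end="")
```

Working on a string rather than the file keeps the sketch easy to check by eye before anything is written back; the apt-get upgrade / full-upgrade steps from the log entries are deliberately left out.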