[00:00:03] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5027 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:07] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3076 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:09] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5031 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:13] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3067 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:17] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3069 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:19] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3070 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:19] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3074 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:19] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6004 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:21] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3068 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) 
https://wikitech.wikimedia.org/wiki/HTTPS [00:00:27] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5026 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:27] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5029 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:31] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6005 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:33] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5022 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:35] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6010 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:37] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3080 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:41] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5030 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:41] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5028 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:41] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3078 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 
23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:42] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6009 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:45] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3066 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:45] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6008 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:45] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6013 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:45] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6006 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:47] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3073 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:47] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5023 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:55] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6001 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:55] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6015 is CRITICAL: SSL CRITICAL - Certificate 
*.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:57] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3075 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:00:59] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6002 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:01] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3081 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:01] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5024 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:05] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5021 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:07] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6014 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:07] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6007 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:09] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3071 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:11] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6016 is 
CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:11] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6012 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:15] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5017 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:21] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3077 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:23] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6003 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:25] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5019 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:25] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5018 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:25] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5025 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:25] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5032 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:25] PROBLEM - HAProxy 
HTTPS wikipedia.org RSA on cp5020 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:29] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3072 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:29] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3079 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:01:29] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6011 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS [00:13:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 48.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:16:21] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:17:37] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 6.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:18:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 38.89% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:23:17] PROBLEM - mailman archives 
on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:23:21] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:23:47] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:30:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:39:04] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/968970 [00:39:10] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/968970 (owner: 10TrainBranchBot) [00:40:35] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:48:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.860 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:49:29] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50715 bytes in 7.556 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:58:04] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/968970 (owner: 10TrainBranchBot) [01:03:50] (03PS41) 10Andrea Denisse: prometheus: Add a default rsyslog destination for all sites [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) [01:06:56] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [01:31:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:01:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:38:42] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:03:42] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:16:50] (03PS1) 10Andrew Bogott: wmcs_backup_volumes: reduce backup lifespan [puppet] - 10https://gerrit.wikimedia.org/r/969226 [03:33:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:45:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:46:14] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:28:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:06:56] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [05:52:49] 10SRE, 10ops-ulsfo: ripe-atlas-ulsfo down - https://phabricator.wikimedia.org/T325549 (10andrea.denisse) Hello, I see an active [[ https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ripe-atlas-ulsfo+IPv6&service=IPv6+ping+to+ulsfo | alert ]] on Icinga regarding this task. Can I mark the alert... 
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231027T0600) [06:08:44] (03PS1) 10Marostegui: install_server: Do not reimage db1230 [puppet] - 10https://gerrit.wikimedia.org/r/969230 [06:10:46] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1230 [puppet] - 10https://gerrit.wikimedia.org/r/969230 (owner: 10Marostegui) [06:12:51] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2003.codfw.wmnet with OS bullseye [06:12:58] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2003.codfw.wmnet with OS bullseye... [06:44:33] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:45:49] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50714 bytes in 0.131 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:59:57] (03CR) 10Muehlenhoff: [C: 03+2] nftables::service: Fix file name variable [puppet] - 10https://gerrit.wikimedia.org/r/969140 (owner: 10Muehlenhoff) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231027T0700) [07:00:36] 10SRE, 10ops-ulsfo: ripe-atlas-ulsfo down - https://phabricator.wikimedia.org/T325549 (10ayounsi) sure, yeah. 
[07:01:05] (03PS2) 10Muehlenhoff: Switch idp_test to nftables [puppet] - 10https://gerrit.wikimedia.org/r/969138 [07:09:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969138 (owner: 10Muehlenhoff) [07:22:20] (03CR) 10Majavah: [C: 03+2] site: Re-image cloudmetrics hosts as insetup [puppet] - 10https://gerrit.wikimedia.org/r/968277 (https://phabricator.wikimedia.org/T336774) (owner: 10Majavah) [07:24:28] !log taavi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudmetrics1003.eqiad.wmnet with OS bookworm [07:32:51] !log ayounsi@cumin1001 START - Cookbook sre.hosts.provision for host sretest1004.mgmt.eqiad.wmnet with reboot policy FORCED [07:36:25] !log taavi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudmetrics1003.eqiad.wmnet with reason: host reimage [07:39:35] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudmetrics1003.eqiad.wmnet with reason: host reimage [07:48:08] !log taavi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudmetrics1004.eqiad.wmnet with reason: cloudmetrics1003 reimage [07:48:33] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudmetrics1004.eqiad.wmnet with reason: cloudmetrics1003 reimage [07:50:59] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:54:02] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1004.mgmt.eqiad.wmnet with reboot policy FORCED [07:54:59] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudmetrics1003.eqiad.wmnet with OS bookworm [07:55:39] !log taavi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudmetrics1004.eqiad.wmnet with OS bookworm [07:58:15] 10SRE, 10ops-eqiad, 
10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10dcaro) [08:00:20] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10dcaro) [08:07:21] !log taavi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudmetrics1004.eqiad.wmnet with reason: host reimage [08:10:24] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudmetrics1004.eqiad.wmnet with reason: host reimage [08:20:53] 10SRE, 10Traffic: Cannot upload on Commons or even here - https://phabricator.wikimedia.org/T349671 (10LSobanski) [08:25:47] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudmetrics1004.eqiad.wmnet with OS bookworm [08:29:42] (03PS1) 10Giuseppe Lavagetto: Review access change [docker-images/production-images] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/969164 [08:30:58] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:41:19] !log downgrading dh-python on build2001 to the version which is in Bullseye. 
Before, 5.20230130~bpo11+1 was installed from bullseye-backports, but that version has dropped the python2 sequence we still need for some Buster builds [08:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:22] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:48:05] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [08:49:49] !log uploaded libxml2 2.9.4+dfsg1-7+deb10u6+icu67+wmf1 to component/icu67 for buster-wikimedia (rebase of the ICU compat patches on top of the latest buster security update for libxml2) T345561 [08:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:55] T345561: Upgrade the MediaWiki servers to ICU 67 - https://phabricator.wikimedia.org/T345561 [08:51:37] (03CR) 10Filippo Giunchedi: [C: 03+1] arclamp: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/969137 (owner: 10Muehlenhoff) [08:53:47] (03CR) 10Filippo Giunchedi: "This LGTM, however we'll need to do this in two passed due to exported resources usage:" [puppet] - 10https://gerrit.wikimedia.org/r/969143 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [09:01:07] (03CR) 10Filippo Giunchedi: systemd: Add a way to provide a default team (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969177 (owner: 10Jbond) [09:02:51] !log deployment-prep app servers are now using ICU67/Unicode 13 [09:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:39] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, I was under the impression that require_packages would already ensure that the packages are installed before running the manifest co" [puppet] - 10https://gerrit.wikimedia.org/r/969201 (owner: 10Majavah) [09:06:57] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - 
https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:07:18] (03CR) 10Filippo Giunchedi: [C: 03+1] hieradata: drop prometheus access for cloudmetrics1003/4 [puppet] - 10https://gerrit.wikimedia.org/r/968278 (https://phabricator.wikimedia.org/T336854) (owner: 10Majavah) [09:07:55] (03CR) 10Filippo Giunchedi: [C: 03+1] P:alertmanager: drop cloudmetrics hosts [puppet] - 10https://gerrit.wikimedia.org/r/968279 (https://phabricator.wikimedia.org/T336854) (owner: 10Majavah) [09:08:26] (03CR) 10Majavah: [C: 03+2] prometheus: ipmi_exporter: add dependency on package (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969201 (owner: 10Majavah) [09:08:38] (03CR) 10Filippo Giunchedi: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/968280 (https://phabricator.wikimedia.org/T336854) (owner: 10Majavah) [09:09:05] (03CR) 10Filippo Giunchedi: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/968281 (owner: 10Majavah) [09:09:15] (03CR) 10Filippo Giunchedi: [C: 03+1] O:wmcs::monitoring: drop role [puppet] - 10https://gerrit.wikimedia.org/r/968282 (https://phabricator.wikimedia.org/T336774) (owner: 10Majavah) [09:10:29] (03PS1) 10Majavah: openstack: nova: add a dependency on libvirt-clients [puppet] - 10https://gerrit.wikimedia.org/r/969299 [09:10:56] (03PS2) 10Majavah: hieradata: drop prometheus access for cloudmetrics1003/4 [puppet] - 10https://gerrit.wikimedia.org/r/968278 (https://phabricator.wikimedia.org/T336854) [09:10:58] (03PS2) 10Majavah: P:alertmanager: drop cloudmetrics hosts [puppet] - 10https://gerrit.wikimedia.org/r/968279 (https://phabricator.wikimedia.org/T336854) [09:11:00] (03PS2) 10Majavah: P:wmcs::prometheus: drop profile [puppet] - 10https://gerrit.wikimedia.org/r/968280 (https://phabricator.wikimedia.org/T336854) [09:11:02] (03PS2) 10Majavah: P:wmcs: drop graphite manifests [puppet] - 10https://gerrit.wikimedia.org/r/968281 [09:11:04] (03PS2) 10Majavah: 
O:wmcs::monitoring: drop role [puppet] - 10https://gerrit.wikimedia.org/r/968282 (https://phabricator.wikimedia.org/T336774) [09:11:35] (03CR) 10Majavah: [C: 03+2] hieradata: drop prometheus access for cloudmetrics1003/4 [puppet] - 10https://gerrit.wikimedia.org/r/968278 (https://phabricator.wikimedia.org/T336854) (owner: 10Majavah) [09:14:32] (03CR) 10Majavah: [C: 03+2] P:alertmanager: drop cloudmetrics hosts [puppet] - 10https://gerrit.wikimedia.org/r/968279 (https://phabricator.wikimedia.org/T336854) (owner: 10Majavah) [09:14:41] (03CR) 10Majavah: [C: 03+2] P:wmcs::prometheus: drop profile [puppet] - 10https://gerrit.wikimedia.org/r/968280 (https://phabricator.wikimedia.org/T336854) (owner: 10Majavah) [09:14:54] (03CR) 10Majavah: [C: 03+2] P:wmcs: drop graphite manifests [puppet] - 10https://gerrit.wikimedia.org/r/968281 (owner: 10Majavah) [09:15:04] (03CR) 10Majavah: [C: 03+2] O:wmcs::monitoring: drop role [puppet] - 10https://gerrit.wikimedia.org/r/968282 (https://phabricator.wikimedia.org/T336774) (owner: 10Majavah) [09:19:37] !log btullis@cumin1001 Added views for new wiki: tlywiki T345169 [09:19:37] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [09:19:42] T345169: Prepare and check storage layer for tlywiki - https://phabricator.wikimedia.org/T345169 [09:21:39] 10SRE, 10ops-eqiad: Add test server to rack E8 - https://phabricator.wikimedia.org/T349168 (10ayounsi) Thanks and on which switch port is it? For the management side, I can't get the provision cookbook to run, the iDRAC doesn't seem to be querying for an IP over DHCP. The [[ https://netbox.wikimedia.org/extra... [09:22:37] (03CR) 10Filippo Giunchedi: systemd: Add a way to provide a default team (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969177 (owner: 10Jbond) [09:23:24] (03CR) 10WMDE-Fisch: "Note: This is good to go now. 1.42.0-wmf.2 is deployed and the feature flags are not used anymore." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/966520 (https://phabricator.wikimedia.org/T332785) (owner: 10WMDE-Fisch) [09:25:10] (03CR) 10Muehlenhoff: [C: 03+2] arclamp: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/969137 (owner: 10Muehlenhoff) [09:26:30] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:28:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:29:36] neat, I think we're okay to remove 'check systemd state' from icinga now? cc slyngs jbond [09:32:58] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:33:04] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [09:34:27] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1101.eqiad.wmnet with OS bullseye [09:34:33] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1101.eqiad.wmnet with OS bullseye [09:38:49] (03PS1) 10Majavah: team-wmcs: openstack: update trove/magnum haproxy svc names [alerts] - 10https://gerrit.wikimedia.org/r/969302 (https://phabricator.wikimedia.org/T349801) [09:41:04] RECOVERY - Check systemd state on wdqs1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:43:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1024:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:43:53] (03PS1) 10Giuseppe Lavagetto: Add weekly-update script [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/969303 (https://phabricator.wikimedia.org/T344478) [09:45:36] (03PS1) 10Muehlenhoff: Update role contact for thanos frontend [puppet] - 10https://gerrit.wikimedia.org/r/969305 [09:45:55] (03PS2) 10Muehlenhoff: Update role contact for thanos frontend [puppet] - 10https://gerrit.wikimedia.org/r/969305 [09:49:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, related task is https://phabricator.wikimedia.org/T341488" [puppet] - 10https://gerrit.wikimedia.org/r/969305 (owner: 10Muehlenhoff) [09:49:52] (03PS3) 10Muehlenhoff: Update role contact for thanos frontend [puppet] - 10https://gerrit.wikimedia.org/r/969305 (https://phabricator.wikimedia.org/T341488) [09:53:20] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/969302 (https://phabricator.wikimedia.org/T349801) (owner: 10Majavah) [09:53:35] (03CR) 10Majavah: [C: 03+2] team-wmcs: openstack: update trove/magnum haproxy svc names [alerts] - 10https://gerrit.wikimedia.org/r/969302 (https://phabricator.wikimedia.org/T349801) (owner: 10Majavah) [09:54:51] (03Merged) 10jenkins-bot: team-wmcs: openstack: update trove/magnum haproxy svc names [alerts] - 10https://gerrit.wikimedia.org/r/969302 (https://phabricator.wikimedia.org/T349801) (owner: 10Majavah) [09:57:23] hmm, looks like I've broken puppet on the main prometheus hosts. 
looking [09:58:57] (03PS1) 10Majavah: P:openstack: fix openstack_exporter host hiera key name [puppet] - 10https://gerrit.wikimedia.org/r/969307 [09:59:02] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1101.eqiad.wmnet with OS bullseye [09:59:07] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1101.eqiad.wmnet with OS bullseye executed with errors: - cp1101 (**FAIL**) - Downtimed on Icinga/... [09:59:38] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1101.eqiad.wmnet with OS bullseye [09:59:44] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1101.eqiad.wmnet with OS bullseye [10:03:34] (03CR) 10Majavah: [C: 03+2] "self-merging trivial patch to unbreak puppet on prometheus* hosts" [puppet] - 10https://gerrit.wikimedia.org/r/969307 (owner: 10Majavah) [10:06:12] taavi: ack, thanks [10:06:28] (fixed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/969307/) [10:06:38] (03CR) 10Muehlenhoff: [C: 03+2] Update role contact for thanos frontend [puppet] - 10https://gerrit.wikimedia.org/r/969305 (https://phabricator.wikimedia.org/T341488) (owner: 10Muehlenhoff) [10:07:24] (03PS1) 10Majavah: Remove cloudmetrics Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/969309 (https://phabricator.wikimedia.org/T336774) [10:07:32] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10Bawolff) [10:07:38] 10SRE-swift-storage, 10Epic: [Epic] Determine a strategy to store files between 5 and 100 GB - https://phabricator.wikimedia.org/T191802 (10Bawolff) [10:08:30] (03CR) 10Effie Mouzeli: [C: 03+2] 
tegola-vector-tiles: Re-enable the envoy admin listener on tcp port [deployment-charts] - 10https://gerrit.wikimedia.org/r/969141 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:09:00] (03CR) 10Effie Mouzeli: [C: 03+2] Update mobileapps to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967405 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:09:03] PROBLEM - Disk space on an-web1001 is CRITICAL: DISK CRITICAL - free space: /srv 46656 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-web1001&var-datasource=eqiad+prometheus/ops [10:09:20] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/969138 (owner: 10Muehlenhoff) [10:09:28] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10Bawolff) As a note on current status, as part of T191805, mediawiki will now accept files with swift up to 5GB. $wgMaxUploadSize is 4gb, so this only affects fil... 
[10:09:40] (03Merged) 10jenkins-bot: tegola-vector-tiles: Re-enable the envoy admin listener on tcp port [deployment-charts] - 10https://gerrit.wikimedia.org/r/969141 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:09:58] (03Merged) 10jenkins-bot: Update mobileapps to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967405 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:10:07] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/969309 (https://phabricator.wikimedia.org/T336774) (owner: 10Majavah) [10:10:34] (03CR) 10Majavah: [C: 03+2] Remove cloudmetrics Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/969309 (https://phabricator.wikimedia.org/T336774) (owner: 10Majavah) [10:13:45] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [10:14:23] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [10:14:24] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [10:14:50] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1101.eqiad.wmnet with reason: host reimage [10:17:00] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [10:17:50] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1101.eqiad.wmnet with reason: host reimage [10:18:00] (03CR) 10Hnowlan: [C: 03+1] Add weekly-update script [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/969303 (https://phabricator.wikimedia.org/T344478) (owner: 10Giuseppe Lavagetto) [10:18:23] RECOVERY - ensure kvm processes are running on cloudvirt-wdqs1001 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [10:20:02] !log taavi@cumin1001 START - Cookbook sre.hosts.remove-downtime for 
cloudvirt-wdqs1001.eqiad.wmnet [10:20:03] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cloudvirt-wdqs1001.eqiad.wmnet [10:35:01] (03PS1) 10Jbond: team-sre/systemd: update systemd checks to make use of systemd_unit_owner [alerts] - 10https://gerrit.wikimedia.org/r/969312 (https://phabricator.wikimedia.org/T349176) [10:35:56] (03CR) 10Jbond: [V: 03+1] systemd: Add a way to provide a default team (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969177 (owner: 10Jbond) [10:36:04] (03Abandoned) 10Jbond: systemd: Add a way to provide a default team [puppet] - 10https://gerrit.wikimedia.org/r/969177 (owner: 10Jbond) [10:36:12] (03CR) 10CI reject: [V: 04-1] team-sre/systemd: update systemd checks to make use of systemd_unit_owner [alerts] - 10https://gerrit.wikimedia.org/r/969312 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [10:36:32] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1101.eqiad.wmnet with OS bullseye [10:36:42] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1101.eqiad.wmnet with OS bullseye completed: - cp1101 (**PASS**) - Removed from Puppet and PuppetD... 
[10:39:02] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [10:40:12] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1102.eqiad.wmnet with OS bullseye [10:40:22] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1102.eqiad.wmnet with OS bullseye [10:42:54] (03PS2) 10Jbond: team-sre/systemd: update systemd checks to make use of systemd_unit_owner [alerts] - 10https://gerrit.wikimedia.org/r/969312 (https://phabricator.wikimedia.org/T349176) [10:43:27] (03CR) 10Jbond: "ready for review" [alerts] - 10https://gerrit.wikimedia.org/r/969312 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [10:44:59] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [10:45:15] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:45:32] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:47:15] (03PS4) 10Cathal Mooney: Interface automation: skip import of existing int IPs and VIPs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/962614 (https://phabricator.wikimedia.org/T295774) [10:48:17] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [10:48:46] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [10:48:47] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [10:56:46] (03CR) 10EoghanGaffney: "This change is ready for review." 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/968288 (owner: 10EoghanGaffney) [10:57:13] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10jbond) [10:57:38] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10jbond) [10:59:30] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10jbond) [10:59:45] (03PS7) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [10:59:53] 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10jbond) p:05Triage→03Medium [11:00:12] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [11:00:24] (03CR) 10CI reject: [V: 04-1] ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [11:01:07] !log jbond@cumin2002 START - Cookbook sre.ganeti.resource-report [11:01:07] !log jbond@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [11:02:02] 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10jbond) from the following Group A seems like the best ` +-------+-------+-----------+----------+-----------+---------+-----------+ | Group | Nodes | Instances | MFree | MF... 
[11:05:58] (PS8) Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: Kosta Harlan)
[11:06:37] (CR) CI reject: [V: -1] ipoid: Update cronjob definition [deployment-charts] - https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: Kosta Harlan)
[11:08:25] !log volans@cumin2002 START - Cookbook sre.ganeti.resource-report
[11:08:26] !log volans@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0)
[11:09:49] RECOVERY - Disk space on an-web1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-web1001&var-datasource=eqiad+prometheus/ops
[11:10:03] SRE, Infrastructure-Foundations, vm-requests: codfw: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (MoritzMuehlenhoff) Looks good, A sounds indeed best.
[11:12:02] SRE, Infrastructure-Foundations, vm-requests: codfw: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (jbond) Open→In progress
[11:12:13] (PS9) Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: Kosta Harlan)
[11:12:56] (CR) CI reject: [V: -1] ipoid: Update cronjob definition [deployment-charts] - https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: Kosta Harlan)
[11:15:05] (PS1) Jbond: netboot: Add acmechief[12]002 [puppet] - https://gerrit.wikimedia.org/r/969314 (https://phabricator.wikimedia.org/T349890)
[11:15:38] (CR) Jbond: [C: +2] netboot: Add acmechief[12]002 [puppet] - https://gerrit.wikimedia.org/r/969314 (https://phabricator.wikimedia.org/T349890) (owner: Jbond)
[11:17:24] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1102.eqiad.wmnet with OS bullseye
[11:17:30] SRE, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1102.eqiad.wmnet with OS bullseye executed with errors: - cp1102 (**FAIL**) - Downtimed on Icinga/...
[11:18:21] (PS10) Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: Kosta Harlan)
[11:18:27] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1102.eqiad.wmnet with OS bullseye
[11:18:34] SRE, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1102.eqiad.wmnet with OS bullseye
[11:19:00] SRE, ops-codfw, Infrastructure-Foundations, netops: Migrate mr1-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348164 (cmooney) As discussed with @papaul we may try to connect this to lsw1-a2-codfw instead, so that we can remove the requirement for a leaf switch in...
[11:19:15] (CR) CI reject: [V: -1] ipoid: Update cronjob definition [deployment-charts] - https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: Kosta Harlan)
[11:21:02] SRE, Infrastructure-Foundations, serviceops, User-MoritzMuehlenhoff: Annotate images in our registry with OS (and OS version) - https://phabricator.wikimedia.org/T335337 (JMeybohm)
[11:26:11] !log jbond@cumin1001 START - Cookbook sre.ganeti.makevm for new host acmechief2002.codfw.wmnet
[11:26:12] !log jbond@cumin1001 START - Cookbook sre.dns.netbox
[11:28:11] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM acmechief2002.codfw.wmnet - jbond@cumin1001"
[11:29:02] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM acmechief2002.codfw.wmnet - jbond@cumin1001"
[11:29:02] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:29:02] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache acmechief2002.codfw.wmnet on all recursors [11:29:06] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) acmechief2002.codfw.wmnet on all recursors [11:29:30] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM acmechief2002.codfw.wmnet - jbond@cumin1001" [11:30:21] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM acmechief2002.codfw.wmnet - jbond@cumin1001" [11:31:30] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host acmechief2002.codfw.wmnet with OS bookworm [11:31:31] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1102.eqiad.wmnet with reason: host reimage [11:31:36] 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host acmechief2002.codfw.wmnet with OS bookworm [11:34:42] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1102.eqiad.wmnet with reason: host reimage [11:37:05] 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10ssingh) >>! In T349890#9287016, @ops-monitoring-bot wrote: > Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host acmechief2002.codfw.wmnet with OS boo... [11:38:09] (03CR) 10Ilias Sarantopoulos: "This change is ready for review." 
(032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [11:44:51] (03PS3) 10Filippo Giunchedi: team-sre/systemd: update systemd checks to make use of systemd_unit_owner [alerts] - 10https://gerrit.wikimedia.org/r/969312 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [11:45:35] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/969312 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [11:46:59] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/968658 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [11:51:15] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:52:02] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1102.eqiad.wmnet with OS bullseye [11:52:06] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1102.eqiad.wmnet with OS bullseye completed: - cp1102 (**PASS**) - Removed from Puppet and PuppetD... 
[11:54:31] (03CR) 10Kamila Součková: [C: 03+1] Add weekly-update script (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/969303 (https://phabricator.wikimedia.org/T344478) (owner: 10Giuseppe Lavagetto) [11:55:13] (03PS1) 10Ayounsi: [POC] Split interface_automation into multiple files [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969319 [11:56:02] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [12:01:58] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (10cmooney) Discussed with @papaul and we will do this work on Thursday at 11.30am CDT / 16:30 UCT. Shouldn't be any inter... [12:06:22] (03PS1) 10Jbond: site.pp: Add acmechief2002 [puppet] - 10https://gerrit.wikimedia.org/r/969320 (https://phabricator.wikimedia.org/T349890) [12:06:58] (03CR) 10Jbond: [C: 03+2] site.pp: Add acmechief2002 [puppet] - 10https://gerrit.wikimedia.org/r/969320 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond) [12:13:53] (03CR) 10JMeybohm: Add weekly-update script (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/969303 (https://phabricator.wikimedia.org/T344478) (owner: 10Giuseppe Lavagetto) [12:14:08] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on acmechief2002.codfw.wmnet with reason: host reimage [12:14:39] (03CR) 10Elukey: [C: 03+1] team-ml: add alert for memory spike in inf services (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [12:14:51] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Enable icu67 component on mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/968658 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [12:17:16] (03CR) 10Jbond: [C: 03+2] 
team-sre/systemd: update systemd checks to make use of systemd_unit_owner (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/969312 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [12:17:18] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on acmechief2002.codfw.wmnet with reason: host reimage [12:19:18] (03PS1) 10Muehlenhoff: Provide a script to determine whether a given Puppet node can be swithed to nft [puppet] - 10https://gerrit.wikimedia.org/r/969324 [12:19:46] (03CR) 10CI reject: [V: 04-1] Provide a script to determine whether a given Puppet node can be swithed to nft [puppet] - 10https://gerrit.wikimedia.org/r/969324 (owner: 10Muehlenhoff) [12:21:52] (03PS2) 10Muehlenhoff: Provide a script to determine whether a given Puppet node can be swithed to nft [puppet] - 10https://gerrit.wikimedia.org/r/969324 [12:24:03] (03CR) 10CI reject: [V: 04-1] Provide a script to determine whether a given Puppet node can be swithed to nft [puppet] - 10https://gerrit.wikimedia.org/r/969324 (owner: 10Muehlenhoff) [12:24:58] (03PS3) 10Muehlenhoff: Provide a script to determine whether a given Puppet node can be swithed to nft [puppet] - 10https://gerrit.wikimedia.org/r/969324 [12:31:27] RECOVERY - Check systemd state on sretest2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:34] (03PS4) 10Muehlenhoff: Provide a script to determine whether a given Puppet node can be swithed to nft [puppet] - 10https://gerrit.wikimedia.org/r/969324 [12:37:03] (03CR) 10Muehlenhoff: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/969324 (owner: 10Muehlenhoff) [12:41:45] !log updated mwdebug1001 to icu67 - T345561 [12:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:50] T345561: Upgrade the MediaWiki servers to ICU 67 - https://phabricator.wikimedia.org/T345561 [12:54:00] !log cmooney@cumin1001 START - 
Cookbook sre.dns.netbox [12:55:21] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:57:22] (03PS1) 10Filippo Giunchedi: team-sre: move SystemdUnitCrashLoop to systemd_unit_owner [alerts] - 10https://gerrit.wikimedia.org/r/969326 (https://phabricator.wikimedia.org/T293970) [13:00:01] (03PS1) 10Muehlenhoff: Switch arclamp to nftables [puppet] - 10https://gerrit.wikimedia.org/r/969328 [13:00:45] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bullseye [13:00:52] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye [13:04:51] (03CR) 10Muehlenhoff: "A few more comments, looks good otherwise." [puppet] - 10https://gerrit.wikimedia.org/r/968288 (owner: 10EoghanGaffney) [13:05:07] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969328 (owner: 10Muehlenhoff) [13:06:57] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [13:07:06] (03CR) 10Jbond: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/969326 (https://phabricator.wikimedia.org/T293970) (owner: 10Filippo Giunchedi) [13:11:10] (03CR) 10Filippo Giunchedi: [C: 03+2] team-sre: move SystemdUnitCrashLoop to systemd_unit_owner [alerts] - 10https://gerrit.wikimedia.org/r/969326 (https://phabricator.wikimedia.org/T293970) (owner: 10Filippo Giunchedi) [13:14:24] (03PS1) 10Muehlenhoff: Switch netboxdb to nftables [puppet] - 10https://gerrit.wikimedia.org/r/969331 [13:14:58] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, + other o11y folks as heads-up" [puppet] - 10https://gerrit.wikimedia.org/r/969328 (owner: 
10Muehlenhoff) [13:16:54] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969331 (owner: 10Muehlenhoff) [13:18:34] (03CR) 10Ilias Sarantopoulos: team-ml: add alert for memory spike in inf services (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [13:27:44] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host acmechief2002.codfw.wmnet with OS bookworm [13:27:45] !log jbond@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host acmechief2002.codfw.wmnet [13:28:07] 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host acmechief2002.codfw.wmnet with OS bookworm completed: - acmechief2002 (**WARN**)... [13:31:10] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-host for host acmechief2002.codfw.wmnet [13:33:21] (03PS1) 10Jbond: acmechief2002: move to pupet7 [puppet] - 10https://gerrit.wikimedia.org/r/969335 (https://phabricator.wikimedia.org/T349890) [13:33:23] (03PS1) 10Jbond: acmechief: add new acmechief server [puppet] - 10https://gerrit.wikimedia.org/r/969336 (https://phabricator.wikimedia.org/T349890) [13:33:42] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:34:03] (03CR) 10Jbond: [C: 03+2] acmechief2002: move to pupet7 [puppet] - 10https://gerrit.wikimedia.org/r/969335 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond) [13:35:39] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change sretest2004 DNS - cmooney@cumin1001" [13:36:46] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Update reimage cookbooks to work with puppet7 - 
https://phabricator.wikimedia.org/T348319 (10jbond) [13:36:51] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change sretest2004 DNS - cmooney@cumin1001" [13:36:51] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:37:07] !log cmooney@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2004.codfw.wmnet with OS bullseye [13:37:12] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye... [13:38:14] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bullseye [13:38:15] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host acmechief2002.codfw.wmnet [13:38:20] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye [13:38:29] (03CR) 10Klausman: team-ml: add alert for memory spike in inf services (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [13:38:35] PROBLEM - Ensure that passive node gets the certificates from the active node as expected on acmechief2002 is CRITICAL: FILE_AGE CRITICAL: File not found - /var/lib/acme-chief/certs/.rsync.status https://wikitech.wikimedia.org/wiki/Acme-chief [13:38:35] PROBLEM - Check systemd state on acmechief2002 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:39:38] (03CR) 10Klausman: team-ml: add alert for memory spike in inf services (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [13:44:42] 10SRE, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10Data-Platform-SRE, and 2 others: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10Ottomata) [13:53:26] 10SRE, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10Data-Platform-SRE, and 4 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (10Ottomata) [13:53:47] 10SRE, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10Data-Platform-SRE, and 4 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (10Ottomata) [13:56:10] (03PS1) 10Jbond: idp: update junper to stag uri [puppet] - 10https://gerrit.wikimedia.org/r/969337 [13:56:26] (03CR) 10Jbond: [C: 03+2] idp: update junper to stag uri [puppet] - 10https://gerrit.wikimedia.org/r/969337 (owner: 10Jbond) [13:56:30] (03CR) 10Jbond: [V: 03+2 C: 03+2] idp: update junper to stag uri [puppet] - 10https://gerrit.wikimedia.org/r/969337 (owner: 10Jbond) [13:57:01] (03PS1) 10Muehlenhoff: package_builder: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/969338 [13:59:01] (03CR) 10JMeybohm: [C: 03+2] Update developer-portal to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958479 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [13:59:53] (03Merged) 10jenkins-bot: Update developer-portal to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958479 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:02:45] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - 
https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) [14:02:50] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [14:03:16] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [14:03:36] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [14:04:11] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [14:04:18] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [14:04:45] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [14:07:31] (03CR) 10Herron: "This change is ready for review." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807) (owner: 10Herron) [14:09:08] (03PS6) 10JMeybohm: eventgate: Update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/959181 (https://phabricator.wikimedia.org/T345244) (owner: 10Clément Goubert) [14:09:40] (03CR) 10JMeybohm: eventgate: Update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/959181 (https://phabricator.wikimedia.org/T345244) (owner: 10Clément Goubert) [14:12:17] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host sretest1004.mgmt.eqiad.wmnet with reboot policy FORCED [14:12:19] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1004.mgmt.eqiad.wmnet with reboot policy FORCED [14:13:19] (03CR) 10JMeybohm: [C: 03+2] mw-debug: Revert envoy draining tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/968959 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [14:14:07] (03Merged) 10jenkins-bot: mw-debug: Revert envoy draining tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/968959 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [14:15:41] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 
10https://gerrit.wikimedia.org/r/969331 (owner: 10Muehlenhoff) [14:17:32] (03PS1) 10Btullis: Enable the TagManager plugin functionality on Matomo [puppet] - 10https://gerrit.wikimedia.org/r/969341 (https://phabricator.wikimedia.org/T349910) [14:18:07] (03PS2) 10Btullis: Enable the TagManager plugin functionality on Matomo [puppet] - 10https://gerrit.wikimedia.org/r/969341 (https://phabricator.wikimedia.org/T349910) [14:18:37] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:19:07] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:19:42] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/221/con" [puppet] - 10https://gerrit.wikimedia.org/r/969341 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) [14:19:53] !log announcing internal core routes to esams asw's to test policy T344547 [14:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:57] T344547: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 [14:21:41] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [14:27:42] (03PS1) 10JMeybohm: Update flink-session-cluster to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/969343 (https://phabricator.wikimedia.org/T300033) [14:30:10] (03CR) 10Vgutierrez: acmechief: add new acmechief server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969336 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond) [14:30:12] (03PS2) 10Jbond: acmechief: add new acmechief server [puppet] - 10https://gerrit.wikimedia.org/r/969336 (https://phabricator.wikimedia.org/T349890) [14:30:43] (03CR) 10Jbond: acmechief: add new acmechief 
server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969336 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond) [14:34:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/969336 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond) [14:38:42] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:21] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage [14:41:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/969336 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond) [14:42:21] (03CR) 10Vgutierrez: [C: 03+1] acmechief: add new acmechief server [puppet] - 10https://gerrit.wikimedia.org/r/969336 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond) [14:42:27] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage [14:43:03] (03CR) 10Jbond: [V: 03+1 C: 03+2] acmechief: add new acmechief server [puppet] - 10https://gerrit.wikimedia.org/r/969336 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond) [14:47:38] (03PS1) 10Muehlenhoff: Configure ACLs for reprepro upload queue (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) [14:47:48] (03PS2) 10Muehlenhoff: Configure ACLs for reprepro upload queue (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/969344 
(https://phabricator.wikimedia.org/T115349) [14:48:12] (03CR) 10CI reject: [V: 04-1] Configure ACLs for reprepro upload queue (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff) [14:50:17] (03CR) 10Ahmon Dancy: [V: 03+2 C: 03+2] Review access change [docker-images/production-images] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/969164 (owner: 10Giuseppe Lavagetto) [14:50:51] (03PS1) 10JMeybohm: Update datahub to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/969345 (https://phabricator.wikimedia.org/T300033) [14:52:13] (03PS1) 10JMeybohm: Update benthos to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/969366 (https://phabricator.wikimedia.org/T300033) [14:53:05] (03CR) 10JMeybohm: "As benthos does not use the service mesh, this should be more or less a noop" [deployment-charts] - 10https://gerrit.wikimedia.org/r/969366 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:53:42] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:56:04] RECOVERY - Check systemd state on acmechief2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:57:50] (03PS1) 10Cathal Mooney: Change core router config to export internal routes to Switches [homer/public] - 10https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547) [14:59:59] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmooney@cumin1001" [15:03:06] RECOVERY - Ensure that passive node gets the certificates from the active node as expected on 
acmechief2002 is OK: FILE_AGE OK: /var/lib/acme-chief/certs/.rsync.status is 185 seconds old and 0 bytes https://wikitech.wikimedia.org/wiki/Acme-chief [15:03:30] ^^ jbond acmechief2002 already has the TLS material :) [15:04:46] PROBLEM - SSH on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:04:48] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:05:07] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:05:38] PROBLEM - thanos.wikimedia.org requires authentication on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:05:54] PROBLEM - thanos.wikimedia.org tls expiry on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:06:07] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:06:14] PROBLEM - thanos.wikimedia.org tls expiry on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:06:15] grafana not working for me ^ [15:06:16] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:06:29] ssh, did it crash? [15:06:34] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:07:50] ping looks ok, so it is not that [15:08:30] hopefully someone on call can help me debug [15:08:42] ssh looks down indeed [15:08:42] (JobUnavailable) firing: (6) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:56] PROBLEM - thanos.wikimedia.org requires authentication on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:09:04] will try mgmt [15:10:28] uff, lot of time for login to respond, my guess would be overload [15:10:40] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): find solution fo acmechief in puppet7 - https://phabricator.wikimedia.org/T349915 (10jbond) [15:10:48] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10Patch-For-Review: codfw: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10jbond) [15:10:54] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/969328 (owner: 10Muehlenhoff) [15:11:18] sadly I lack visibility [15:11:51] (03PS1) 10Jbond: idp_test: switch to puppet7 acmechief host [puppet] - 10https://gerrit.wikimedia.org/r/969368 (https://phabricator.wikimedia.org/T349890) [15:12:00] (03CR) 10Majavah: [C: 03+2] openstack: encapi: don't try to hold a single connection open [puppet] - 10https://gerrit.wikimedia.org/r/968331 (https://phabricator.wikimedia.org/T349195) (owner: 10Majavah) [15:13:26] godog: suggestions on how to proceed? [15:13:46] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (10cmooney) Above patch reflects my thinking on the best approach for this. I've taken the approach that we should announce all our internal... [15:13:56] jynus: checking [15:14:19] didn't oncall get paged? [15:14:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/224/console" [puppet] - 10https://gerrit.wikimedia.org/r/969368 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond) [15:14:26] unsure [15:14:44] godog: yes just acked [15:15:01] herron: ack [15:15:06] problem is we are traveling blind at the moment [15:15:45] someone already rebooting titan1001? 
[15:15:54] I was asking godog whether to do so [15:15:55] not afaik [15:16:05] ok, it's hanging on ssh, I'll go ahead and reboot [15:16:11] but if it is caused externally, not sure it is wise [15:16:25] the host is up, just seems very loaded [15:16:38] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10Patch-For-Review: codfw: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10jbond) 05In progress→03Resolved a:03jbond [15:16:40] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): find solution fo acmechief in puppet7 - https://phabricator.wikimedia.org/T349915 (10jbond) [15:16:47] (03CR) 10Jbond: [V: 03+1 C: 03+2] idp_test: switch to puppet7 acmechief host [puppet] - 10https://gerrit.wikimedia.org/r/969368 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond) [15:16:54] yeah reboot sounds good, we can also make codfw active for thanos [15:17:01] but to be fair, I cannot think of what the alternative is [15:17:16] so +1 to do it and see [15:18:31] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): find solution for acmechief in puppet7 - https://phabricator.wikimedia.org/T349915 (10Aklapper) [15:20:08] godog: other parts of the monitoring stack look good (e.g. prometheus?) [15:20:57] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): find solution for acmechief in puppet7 - https://phabricator.wikimedia.org/T349915 (10jbond) Acmechief2002 has been installed with the version available in https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_reques...
[15:20:58] mmm, it seems it affected titan1001 and titan1002, so that would confirm a load issue [15:21:05] (03PS2) 10Cathal Mooney: Change core router config to export internal routes to Switches [homer/public] - 10https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547) [15:21:14] !log power cycled titan1001 [15:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:00] yeah basically running queries through thanos eqiad is affected, other parts are fine e.g. prometheus itself and alertmanager [15:22:09] good [15:22:28] PROBLEM - Host titan1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:03] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): find solution for acmechief in puppet7 - https://phabricator.wikimedia.org/T349915 (10jbond) [15:23:21] that's actually nice redundancy [15:23:32] (03PS3) 10Cathal Mooney: Change core router config to export internal routes to Switches [homer/public] - 10https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547) [15:23:55] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): update acme chief access - https://phabricator.wikimedia.org/T349620 (10jbond) [15:24:18] it should be up now, saw it reboot and finish loading linux [15:24:20] alright userspace is back up finally [15:24:20] RECOVERY - Host titan1001 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [15:24:20] ftr, we got an alert for Wikidata, but I assume that’s also due to the titan/thanos issue and y’all are on it [15:24:32] RECOVERY - thanos.wikimedia.org tls expiry on titan1001 is OK: OK - Certificate thanos-query.discovery.wmnet will expire on Mon 20 Nov 2023 08:22:00 PM GMT +0000.
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:24:34] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:25:01] let's see what is the behaviour of titan1002 without touching it [15:25:07] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:25:20] RECOVERY - thanos.wikimedia.org requires authentication on titan1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:25:26] Lucas_WMDE: quite possibly yes [15:25:44] as it may be useful for debugging, if 1001 can take the work [15:25:45] yeah https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts?orgId=1&refresh=5m&viewPanel=28 is showing data again [15:25:53] thanks for the powercycle :) [15:25:58] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:26:07] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:27:33] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10Volans) 05Open→03Resolved a:03Volans Resolving for now, feel free to re-open in case it happens again.
[15:27:54] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:28:32] 10SRE, 10Maps: Allow Wikimedia Maps usage on QGIS - https://phabricator.wikimedia.org/T349917 (10Francois.peru) [15:28:42] (JobUnavailable) firing: (5) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:30:15] in case you see something relevant: https://phabricator.wikimedia.org/P53060 [15:31:27] I think overload is almost 100% confirmed [15:31:30] RECOVERY - thanos.wikimedia.org tls expiry on titan1002 is OK: OK - Certificate thanos-query.discovery.wmnet will expire on Mon 20 Nov 2023 08:56:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:31:44] RECOVERY - SSH on titan1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:31:50] RECOVERY - thanos.wikimedia.org requires authentication on titan1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:32:05] I would agree with that [15:32:19] 1002 recovered itself, so that may have more interesting logs [15:33:42] (JobUnavailable) resolved: (5) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:19] thanks herron [15:34:31] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10dcaro) [15:35:32] (03PS4) 10Cathal Mooney: Change core router config to export 
internal routes to Switches [homer/public] - 10https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547) [15:38:14] (03PS5) 10Cathal Mooney: Change core router config to export internal routes to Switches [homer/public] - 10https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547) [15:38:36] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmooney@cumin1001" [15:38:37] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2004.codfw.wmnet with OS bullseye [15:38:44] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye... [15:39:24] (03PS1) 10Muehlenhoff: pybaltest: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/969370 [15:40:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969370 (owner: 10Muehlenhoff) [15:41:51] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (10cmooney) FWIW in my original config for this I had terms to match routes redistributed into BGP locally and announced in IBGP, or between c... 
[15:44:31] (03PS1) 10Jbond: site.pp: rename site.pp so that it is loaded first [puppet] - 10https://gerrit.wikimedia.org/r/969373 (https://phabricator.wikimedia.org/T349918) [15:44:46] (03CR) 10CI reject: [V: 04-1] site.pp: rename site.pp so that it is loaded first [puppet] - 10https://gerrit.wikimedia.org/r/969373 (https://phabricator.wikimedia.org/T349918) (owner: 10Jbond) [15:47:33] (03PS2) 10Jbond: site.pp: rename site.pp so that it is loaded first [puppet] - 10https://gerrit.wikimedia.org/r/969373 (https://phabricator.wikimedia.org/T349918) [15:48:29] (03CR) 10Cathal Mooney: Change core router config to export internal routes to Switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547) (owner: 10Cathal Mooney) [15:50:40] (03CR) 10Ssingh: [C: 03+1] pybaltest: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/969370 (owner: 10Muehlenhoff) [15:51:15] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:52:18] (03CR) 10Vgutierrez: [C: 03+1] pybaltest: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/969370 (owner: 10Muehlenhoff) [15:55:16] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): find solution for acmechief in puppet7 - https://phabricator.wikimedia.org/T349915 (10jbond) 05Open→03In progress p:05Triage→03Medium [15:55:18] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [15:55:20] (03PS1) 10Jbond: realm: use puppet7 acmechief when on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969375 (https://phabricator.wikimedia.org/T349915) [15:56:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/226/console" [puppet] - 10https://gerrit.wikimedia.org/r/969375 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [16:00:33] (03PS2) 10Jbond: realm: use puppet7 acmechief when on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969375 (https://phabricator.wikimedia.org/T349915) [16:00:39] (03CR) 10Jbond: realm: use puppet7 acmechief when on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969375 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [16:01:44] (03PS3) 10Jbond: realm: use puppet7 acmechief when on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969375 (https://phabricator.wikimedia.org/T349915) [16:03:09] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/227/console" [puppet] - 10https://gerrit.wikimedia.org/r/969375 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [16:04:14] (03CR) 10Bking: "This change is ready for review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [16:05:02] (03CR) 10Bking: rdf-streaming-updater: update staging values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [16:14:16] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/969175 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [16:16:14] (03CR) 10Jbond: [C: 03+1] "lgtm just some missed clean up" [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [16:18:18] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/969338 (owner: 10Muehlenhoff) [16:19:21] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate mr1-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348164 (10Papaul) @cmooney for the cross rack link it does make sense to use copper with 1000BaseT sine we have those already on site. On the other hand sin... [16:25:27] (03PS5) 10EoghanGaffney: [apt-staging] Add apt-staging host for CI pipeline [puppet] - 10https://gerrit.wikimedia.org/r/968288 [16:26:32] (03CR) 10Btullis: [C: 04-1] "I like the approch, but I think we may need to take advice on whether we should use the discovery intermediate, or create our own." 
[puppet] - 10https://gerrit.wikimedia.org/r/968612 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [16:27:04] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:27:04] (03CR) 10EoghanGaffney: [apt-staging] Add apt-staging host for CI pipeline (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/968288 (owner: 10EoghanGaffney) [16:32:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:41:02] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 185.15.59.129, interfaces up: 63, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:41:08] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:41:16] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:41:36] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 15, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:43:58] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:48:50] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti expansion - https://phabricator.wikimedia.org/T349925 (10RobH) [16:48:54] 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install 2 ganeti node expansion - https://phabricator.wikimedia.org/T349926 (10RobH) [16:49:22] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti expansion - https://phabricator.wikimedia.org/T349925 (10RobH) [16:49:29] 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install 2 ganeti node expansion - https://phabricator.wikimedia.org/T349926 (10RobH) [16:50:12] (03CR) 10Jbond: "functionally lgtm have added some comments for improvements" [puppet] - 10https://gerrit.wikimedia.org/r/969324 (owner: 10Muehlenhoff) [16:51:26] 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install 2 ganeti node expansion - https://phabricator.wikimedia.org/T349926 (10RobH) a:03MoritzMuehlenhoff @MoritzMuehlenhoff, The parent purchasing task for 2 nodes in codfw has been escalated to order without racking details. Would you... [16:51:38] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti expansion - https://phabricator.wikimedia.org/T349925 (10RobH) @MoritzMuehlenhoff, The parent purchasing task for 4 nodes in eqiad has been escalated to order without racking details. Would you please provide racking details... 
[16:51:42] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti expansion - https://phabricator.wikimedia.org/T349925 (10RobH) a:03MoritzMuehlenhoff [16:57:53] (03CR) 10Jbond: Enable the management of the skein certificate via Puppet (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/968612 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [17:02:11] (03PS22) 10Bking: rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [17:11:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 5.395432074523921s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:30:09] ACKNOWLEDGEMENT - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 616 probes of 616 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map Andrea Denisse Resolved https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:42:57] (03CR) 10Milimetric: [C: 03+1] wikireplicas: add user_is_temp column to user view [puppet] - 10https://gerrit.wikimedia.org/r/958543 (https://phabricator.wikimedia.org/T346679) (owner: 10Milimetric) [17:52:09] 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10RobH) [17:52:26] 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10RobH) [18:06:10] (03CR) 10CDanis: [C: 03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807) (owner: 10Herron) [18:07:36] (03CR) 10CDanis: [C: 03+1] "seems sensible enough to me!" [alerts] - 10https://gerrit.wikimedia.org/r/967143 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi) [18:16:53] (03CR) 10Andrew Bogott: [C: 03+2] wmcs_backup_volumes: reduce backup lifespan [puppet] - 10https://gerrit.wikimedia.org/r/969226 (owner: 10Andrew Bogott) [18:17:34] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:20:06] (03PS23) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [18:21:57] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [18:23:04] 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10RobH) [18:23:27] 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10RobH) [18:30:42] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:57] (03CR) 10Dwisehaupt: "Thanks. Cleaned that up. @jhathaway would you care to have a look over this?" [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [18:52:39] 10SRE, 10Maps: Allow Wikimedia Maps usage on QGIS - https://phabricator.wikimedia.org/T349917 (10Aklapper) 05Open→03Declined Hi @Francois.peru, thanks for taking the time to report this. 
The field "Wikimedia Affiliate supporting project" above is not filled out, so for now I am going to decline this ticket... [18:54:34] (03PS24) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [18:59:10] (03CR) 10Dwisehaupt: "Last patch is just a minor update to use the proper gitlab repo now that it has been set up. No other changes." [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [19:37:31] (03PS1) 10Bking: search-loader: use default system python [puppet] - 10https://gerrit.wikimedia.org/r/969386 (https://phabricator.wikimedia.org/T346039) [19:40:00] (03CR) 10CI reject: [V: 04-1] search-loader: use default system python [puppet] - 10https://gerrit.wikimedia.org/r/969386 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [19:47:38] (03CR) 10Bking: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/969386 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [19:51:15] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:51:21] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969386 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [20:06:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 4.460643168670146s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:15:54] (03CR) 10Ryan Kemper: [C: 03+1] search-loader: use default system python 
[puppet] - 10https://gerrit.wikimedia.org/r/969386 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [20:27:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 5.240180559958171s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:41:36] (03CR) 10Aqu: "Awesome, thanks a lot!" [puppet] - 10https://gerrit.wikimedia.org/r/969143 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [21:03:34] (03PS6) 10Aqu: [WIP] Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [21:11:20] (03CR) 10Aqu: "Thanks for your reviews. Feel free to take the lead on it next week." [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [21:19:21] (03CR) 10Btullis: "Thanks so much for the feedback" [puppet] - 10https://gerrit.wikimedia.org/r/969143 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [21:22:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 4.0145303287018566s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:22:46] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 3.0168584249270927s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:25:53] (03PS1) 10Gergő Tisza: [beta] Use dedicated redis server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969387 (https://phabricator.wikimedia.org/T340908) [21:34:15] deploying a beta-only patch [21:35:59] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969387 (https://phabricator.wikimedia.org/T340908) (owner: 10Gergő Tisza) [21:36:43] (03Merged) 10jenkins-bot: [beta] Use dedicated redis server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969387 (https://phabricator.wikimedia.org/T340908) (owner: 10Gergő Tisza) [21:55:39] (03PS1) 10Gergő Tisza: [beta] Fix Redis configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969393 (https://phabricator.wikimedia.org/T340908) [21:57:17] and one more [21:57:42] (03CR) 10Gergő Tisza: [C: 03+2] [beta] Fix Redis configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969393 (https://phabricator.wikimedia.org/T340908) (owner: 10Gergő Tisza) [21:58:24] (03Merged) 10jenkins-bot: [beta] Fix Redis configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969393 (https://phabricator.wikimedia.org/T340908) (owner: 10Gergő Tisza) [22:06:04] (03PS2) 10Dwisehaupt: Add dummy secrets for community_civicrm [labs/private] - 10https://gerrit.wikimedia.org/r/967519 (https://phabricator.wikimedia.org/T343486) [22:08:11] (03CR) 10Dwisehaupt: "jgreen, if you can verify these, we can go ahead and merge this up." 
[labs/private] - 10https://gerrit.wikimedia.org/r/967519 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:21:57] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [22:41:04] (03CR) 10Cwhite: [C: 04-1] "We will also need a separate CR defining "uri_host" in a new w3creportingapi-1.0.0 template revision to deploy prior to this one." [puppet] - 10https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807) (owner: 10Herron) [22:47:04] !log reprepro -C main include bullseye-wikimedia k8s-controller-sidecars_1.0.2-1_source.changes [22:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:58] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:53:09] (03PS1) 10Gergő Tisza: [beta] Disable statsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969397 (https://phabricator.wikimedia.org/T349944) [22:53:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969397 (https://phabricator.wikimedia.org/T349944) (owner: 10Gergő Tisza) [22:54:28] (03Merged) 10jenkins-bot: [beta] Disable statsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969397 (https://phabricator.wikimedia.org/T349944) (owner: 10Gergő Tisza) [23:01:38] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:51:15] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure