[00:00:03] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5027 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:07] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3076 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:09] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5031 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:13] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3067 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:17] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3069 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:19] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3070 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:19] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3074 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:19] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6004 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:21] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3068 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:27] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5026 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:27] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5029 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:31] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6005 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:33] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5022 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:35] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6010 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:37] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3080 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:41] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5030 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:41] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5028 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:41] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3078 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:42] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6009 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:45] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3066 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:45] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6008 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:45] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6013 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:45] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6006 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:47] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3073 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:47] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5023 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:55] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6001 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:55] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6015 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:57] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3075 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:59] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6002 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:01] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3081 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:01] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5024 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:05] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5021 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:07] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6014 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:07] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6007 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:09] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3071 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:11] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6016 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:11] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6012 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:15] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5017 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:21] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3077 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:23] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6003 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:25] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5019 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:25] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5018 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:25] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5025 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:25] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5032 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:25] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5020 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:29] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3072 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:29] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3079 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:29] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6011 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2023-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:13:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 48.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:16:21] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:17:37] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 6.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:18:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 38.89% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:23:17] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:23:21] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:23:47] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:30:11] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:39:04] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/968970
[00:39:10] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/968970 (owner: TrainBranchBot)
[00:40:35] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:48:15] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.860 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:49:29] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50715 bytes in 7.556 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:58:04] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/968970 (owner: TrainBranchBot)
[01:03:50] <wikibugs>	 (PS41) Andrea Denisse: prometheus: Add a default rsyslog destination for all sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[01:06:56] <jinxer-wm>	 (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[01:31:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:01:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:38:42] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:03:42] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:16:50] <wikibugs>	 (PS1) Andrew Bogott: wmcs_backup_volumes: reduce backup lifespan [puppet] - https://gerrit.wikimedia.org/r/969226
[03:33:55] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:45:31] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:14] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:28:59] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:06:56] <jinxer-wm>	 (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[05:52:49] <wikibugs>	 SRE, ops-ulsfo: ripe-atlas-ulsfo down - https://phabricator.wikimedia.org/T325549 (andrea.denisse) Hello, I see an active [[ https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ripe-atlas-ulsfo+IPv6&service=IPv6+ping+to+ulsfo | alert ]] on Icinga regarding this task. Can I mark the alert...
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231027T0600)
[06:08:44] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1230 [puppet] - https://gerrit.wikimedia.org/r/969230
[06:10:46] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1230 [puppet] - https://gerrit.wikimedia.org/r/969230 (owner: Marostegui)
[06:12:51] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2003.codfw.wmnet with OS bullseye
[06:12:58] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2003.codfw.wmnet with OS bullseye...
[06:44:33] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:45:49] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50714 bytes in 0.131 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:59:57] <wikibugs>	 (CR) Muehlenhoff: [C: +2] nftables::service: Fix file name variable [puppet] - https://gerrit.wikimedia.org/r/969140 (owner: Muehlenhoff)
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231027T0700)
[07:00:36] <wikibugs>	 SRE, ops-ulsfo: ripe-atlas-ulsfo down - https://phabricator.wikimedia.org/T325549 (ayounsi) sure, yeah.
[07:01:05] <wikibugs>	 (PS2) Muehlenhoff: Switch idp_test to nftables [puppet] - https://gerrit.wikimedia.org/r/969138
[07:09:05] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/969138 (owner: Muehlenhoff)
[07:22:20] <wikibugs>	 (CR) Majavah: [C: +2] site: Re-image cloudmetrics hosts as insetup [puppet] - https://gerrit.wikimedia.org/r/968277 (https://phabricator.wikimedia.org/T336774) (owner: Majavah)
[07:24:28] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudmetrics1003.eqiad.wmnet with OS bookworm
[07:32:51] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.provision for host sretest1004.mgmt.eqiad.wmnet with reboot policy FORCED
[07:36:25] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudmetrics1003.eqiad.wmnet with reason: host reimage
[07:39:35] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudmetrics1003.eqiad.wmnet with reason: host reimage
[07:48:08] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudmetrics1004.eqiad.wmnet with reason: cloudmetrics1003 reimage
[07:48:33] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudmetrics1004.eqiad.wmnet with reason: cloudmetrics1003 reimage
[07:50:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:54:02] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1004.mgmt.eqiad.wmnet with reboot policy FORCED
[07:54:59] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudmetrics1003.eqiad.wmnet with OS bookworm
[07:55:39] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudmetrics1004.eqiad.wmnet with OS bookworm
[07:58:15] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (dcaro)
[08:00:20] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (dcaro)
[08:07:21] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudmetrics1004.eqiad.wmnet with reason: host reimage
[08:10:24] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudmetrics1004.eqiad.wmnet with reason: host reimage
[08:20:53] <wikibugs>	 SRE, Traffic: Cannot upload on Commons or even here - https://phabricator.wikimedia.org/T349671 (LSobanski)
[08:25:47] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudmetrics1004.eqiad.wmnet with OS bookworm
[08:29:42] <wikibugs>	 (PS1) Giuseppe Lavagetto: Review access change [docker-images/production-images] (refs/meta/config) - https://gerrit.wikimedia.org/r/969164
[08:30:58] <icinga-wm>	 PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:41:19] <moritzm>	 !log downgrading dh-python on build2001 to the version which is in Bullseye. Before, 5.20230130~bpo11+1 was installed from bullseye-backports, but that version has dropped the python2 sequence we still need for some Buster builds
[08:41:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:22] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:48:05] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki
[08:49:49] <moritzm>	 !log uploaded libxml2 2.9.4+dfsg1-7+deb10u6+icu67+wmf1 to component/icu67 for buster-wikimedia (rebase of the ICU compat patches on top of the latest buster security update for libxml2) T345561
[08:49:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:55] <stashbot>	 T345561: Upgrade the MediaWiki servers to ICU 67 - https://phabricator.wikimedia.org/T345561
[08:51:37] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] arclamp: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/969137 (owner: Muehlenhoff)
[08:53:47] <wikibugs>	 (CR) Filippo Giunchedi: "This LGTM, however we'll need to do this in two passed due to exported resources usage:" [puppet] - https://gerrit.wikimedia.org/r/969143 (https://phabricator.wikimedia.org/T343232) (owner: Btullis)
[09:01:07] <wikibugs>	 (CR) Filippo Giunchedi: systemd: Add a way to provide a default team (1 comment) [puppet] - https://gerrit.wikimedia.org/r/969177 (owner: Jbond)
[09:02:51] <moritzm>	 !log deployment-prep app servers are now using ICU67/Unicode 13
[09:02:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:39] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, I was under the impression that require_packages would already ensure that the packages are installed before running the manifest co" [puppet] - https://gerrit.wikimedia.org/r/969201 (owner: Majavah)
[09:06:57] <jinxer-wm>	 (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:07:18] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] hieradata: drop prometheus access for cloudmetrics1003/4 [puppet] - https://gerrit.wikimedia.org/r/968278 (https://phabricator.wikimedia.org/T336854) (owner: Majavah)
[09:07:55] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] P:alertmanager: drop cloudmetrics hosts [puppet] - https://gerrit.wikimedia.org/r/968279 (https://phabricator.wikimedia.org/T336854) (owner: Majavah)
[09:08:26] <wikibugs>	 (CR) Majavah: [C: +2] prometheus: ipmi_exporter: add dependency on package (1 comment) [puppet] - https://gerrit.wikimedia.org/r/969201 (owner: Majavah)
[09:08:38] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "\o/" [puppet] - https://gerrit.wikimedia.org/r/968280 (https://phabricator.wikimedia.org/T336854) (owner: Majavah)
[09:09:05] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "\o/" [puppet] - https://gerrit.wikimedia.org/r/968281 (owner: Majavah)
[09:09:15] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] O:wmcs::monitoring: drop role [puppet] - https://gerrit.wikimedia.org/r/968282 (https://phabricator.wikimedia.org/T336774) (owner: Majavah)
[09:10:29] <wikibugs>	 (PS1) Majavah: openstack: nova: add a dependency on libvirt-clients [puppet] - https://gerrit.wikimedia.org/r/969299
[09:10:56] <wikibugs>	 (PS2) Majavah: hieradata: drop prometheus access for cloudmetrics1003/4 [puppet] - https://gerrit.wikimedia.org/r/968278 (https://phabricator.wikimedia.org/T336854)
[09:10:58] <wikibugs>	 (PS2) Majavah: P:alertmanager: drop cloudmetrics hosts [puppet] - https://gerrit.wikimedia.org/r/968279 (https://phabricator.wikimedia.org/T336854)
[09:11:00] <wikibugs>	 (PS2) Majavah: P:wmcs::prometheus: drop profile [puppet] - https://gerrit.wikimedia.org/r/968280 (https://phabricator.wikimedia.org/T336854)
[09:11:02] <wikibugs>	 (PS2) Majavah: P:wmcs: drop graphite manifests [puppet] - https://gerrit.wikimedia.org/r/968281
[09:11:04] <wikibugs>	 (PS2) Majavah: O:wmcs::monitoring: drop role [puppet] - https://gerrit.wikimedia.org/r/968282 (https://phabricator.wikimedia.org/T336774)
[09:11:35] <wikibugs>	 (CR) Majavah: [C: +2] hieradata: drop prometheus access for cloudmetrics1003/4 [puppet] - https://gerrit.wikimedia.org/r/968278 (https://phabricator.wikimedia.org/T336854) (owner: Majavah)
[09:14:32] <wikibugs>	 (CR) Majavah: [C: +2] P:alertmanager: drop cloudmetrics hosts [puppet] - https://gerrit.wikimedia.org/r/968279 (https://phabricator.wikimedia.org/T336854) (owner: Majavah)
[09:14:41] <wikibugs>	 (CR) Majavah: [C: +2] P:wmcs::prometheus: drop profile [puppet] - https://gerrit.wikimedia.org/r/968280 (https://phabricator.wikimedia.org/T336854) (owner: Majavah)
[09:14:54] <wikibugs>	 (CR) Majavah: [C: +2] P:wmcs: drop graphite manifests [puppet] - https://gerrit.wikimedia.org/r/968281 (owner: Majavah)
[09:15:04] <wikibugs>	 (CR) Majavah: [C: +2] O:wmcs::monitoring: drop role [puppet] - https://gerrit.wikimedia.org/r/968282 (https://phabricator.wikimedia.org/T336774) (owner: Majavah)
[09:19:37] <logmsgbot>	 !log btullis@cumin1001 Added views for new wiki: tlywiki T345169
[09:19:37] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0)
[09:19:42] <stashbot>	 T345169: Prepare and check storage layer for tlywiki - https://phabricator.wikimedia.org/T345169
[09:21:39] <wikibugs>	 SRE, ops-eqiad: Add test server to rack E8 - https://phabricator.wikimedia.org/T349168 (ayounsi) Thanks and on which switch port is it?  For the management side, I can't get the provision cookbook to run, the iDRAC doesn't seem to be querying for an IP over DHCP. The [[ https://netbox.wikimedia.org/extra...
[09:22:37] <wikibugs>	 (CR) Filippo Giunchedi: systemd: Add a way to provide a default team (1 comment) [puppet] - https://gerrit.wikimedia.org/r/969177 (owner: Jbond)
[09:23:24] <wikibugs>	 (CR) WMDE-Fisch: "Note: This is good to go now. 1.42.0-wmf.2 is deployed and the feature flags are not used anymore." [mediawiki-config] - https://gerrit.wikimedia.org/r/966520 (https://phabricator.wikimedia.org/T332785) (owner: WMDE-Fisch)
[09:25:10] <wikibugs>	 (CR) Muehlenhoff: [C: +2] arclamp: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/969137 (owner: Muehlenhoff)
[09:26:30] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:28:42] <jinxer-wm>	 (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:29:36] <godog>	 neat, I think we're okay to remove 'check systemd state' from icinga now? cc slyngs jbond 
[09:32:58] <icinga-wm>	 RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:33:04] <wikibugs>	 SRE, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur)
[09:34:27] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1101.eqiad.wmnet with OS bullseye
[09:34:33] <wikibugs>	 SRE, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1101.eqiad.wmnet with OS bullseye
[09:38:49] <wikibugs>	 (PS1) Majavah: team-wmcs: openstack: update trove/magnum haproxy svc names [alerts] - https://gerrit.wikimedia.org/r/969302 (https://phabricator.wikimedia.org/T349801)
[09:41:04] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:43:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:43:53] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add weekly-update script [docker-images/production-images] - https://gerrit.wikimedia.org/r/969303 (https://phabricator.wikimedia.org/T344478)
[09:45:36] <wikibugs>	 (PS1) Muehlenhoff: Update role contact for thanos frontend [puppet] - https://gerrit.wikimedia.org/r/969305
[09:45:55] <wikibugs>	 (PS2) Muehlenhoff: Update role contact for thanos frontend [puppet] - https://gerrit.wikimedia.org/r/969305
[09:49:03] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, related task is https://phabricator.wikimedia.org/T341488" [puppet] - https://gerrit.wikimedia.org/r/969305 (owner: Muehlenhoff)
[09:49:52] <wikibugs>	 (PS3) Muehlenhoff: Update role contact for thanos frontend [puppet] - https://gerrit.wikimedia.org/r/969305 (https://phabricator.wikimedia.org/T341488)
[09:53:20] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [alerts] - https://gerrit.wikimedia.org/r/969302 (https://phabricator.wikimedia.org/T349801) (owner: Majavah)
[09:53:35] <wikibugs>	 (CR) Majavah: [C: +2] team-wmcs: openstack: update trove/magnum haproxy svc names [alerts] - https://gerrit.wikimedia.org/r/969302 (https://phabricator.wikimedia.org/T349801) (owner: Majavah)
[09:54:51] <wikibugs>	 (Merged) jenkins-bot: team-wmcs: openstack: update trove/magnum haproxy svc names [alerts] - https://gerrit.wikimedia.org/r/969302 (https://phabricator.wikimedia.org/T349801) (owner: Majavah)
[09:57:23] <taavi>	 hmm, looks like I've broken puppet on the main prometheus hosts. looking
[09:58:57] <wikibugs>	 (03PS1) 10Majavah: P:openstack: fix openstack_exporter host hiera key name [puppet] - 10https://gerrit.wikimedia.org/r/969307
[09:59:02] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1101.eqiad.wmnet with OS bullseye
[09:59:07] <wikibugs>	 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1101.eqiad.wmnet with OS bullseye executed with errors: - cp1101 (**FAIL**)   - Downtimed on Icinga/...
[09:59:38] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1101.eqiad.wmnet with OS bullseye
[09:59:44] <wikibugs>	 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1101.eqiad.wmnet with OS bullseye
[10:03:34] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] "self-merging trivial patch to unbreak puppet on prometheus* hosts" [puppet] - 10https://gerrit.wikimedia.org/r/969307 (owner: 10Majavah)
[10:06:12] <godog>	 taavi: ack, thanks
[10:06:28] <taavi>	 (fixed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/969307/)
[10:06:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Update role contact for thanos frontend [puppet] - 10https://gerrit.wikimedia.org/r/969305 (https://phabricator.wikimedia.org/T341488) (owner: 10Muehlenhoff)
[10:07:24] <wikibugs>	 (03PS1) 10Majavah: Remove cloudmetrics Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/969309 (https://phabricator.wikimedia.org/T336774)
[10:07:32] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10Bawolff)
[10:07:38] <wikibugs>	 10SRE-swift-storage, 10Epic: [Epic] Determine a strategy to store files between 5 and 100 GB - https://phabricator.wikimedia.org/T191802 (10Bawolff)
[10:08:30] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: Re-enable the envoy admin listener on tcp port [deployment-charts] - 10https://gerrit.wikimedia.org/r/969141 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[10:09:00] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] Update mobileapps to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967405 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[10:09:03] <icinga-wm>	 PROBLEM - Disk space on an-web1001 is CRITICAL: DISK CRITICAL - free space: /srv 46656 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-web1001&var-datasource=eqiad+prometheus/ops
[10:09:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/969138 (owner: 10Muehlenhoff)
[10:09:28] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10Bawolff) As a note on current status, as part of T191805, mediawiki will now accept files with swift up to 5GB. $wgMaxUploadSize is 4gb, so this only affects fil...
[10:09:40] <wikibugs>	 (03Merged) 10jenkins-bot: tegola-vector-tiles: Re-enable the envoy admin listener on tcp port [deployment-charts] - 10https://gerrit.wikimedia.org/r/969141 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[10:09:58] <wikibugs>	 (03Merged) 10jenkins-bot: Update mobileapps to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967405 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[10:10:07] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/969309 (https://phabricator.wikimedia.org/T336774) (owner: 10Majavah)
[10:10:34] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] Remove cloudmetrics Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/969309 (https://phabricator.wikimedia.org/T336774) (owner: 10Majavah)
[10:13:45] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply
[10:14:23] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply
[10:14:24] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply
[10:14:50] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1101.eqiad.wmnet with reason: host reimage
[10:17:00] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply
[10:17:50] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1101.eqiad.wmnet with reason: host reimage
[10:18:00] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] Add weekly-update script [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/969303 (https://phabricator.wikimedia.org/T344478) (owner: 10Giuseppe Lavagetto)
[10:18:23] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt-wdqs1001 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[10:20:02] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.hosts.remove-downtime for cloudvirt-wdqs1001.eqiad.wmnet
[10:20:03] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cloudvirt-wdqs1001.eqiad.wmnet
[10:35:01] <wikibugs>	 (03PS1) 10Jbond: team-sre/systemd: update systemd checks to make use of systemd_unit_owner [alerts] - 10https://gerrit.wikimedia.org/r/969312 (https://phabricator.wikimedia.org/T349176)
[10:35:56] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] systemd: Add a way to provide a default team (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969177 (owner: 10Jbond)
[10:36:04] <wikibugs>	 (03Abandoned) 10Jbond: systemd: Add a way to provide a default team [puppet] - 10https://gerrit.wikimedia.org/r/969177 (owner: 10Jbond)
[10:36:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] team-sre/systemd: update systemd checks to make use of systemd_unit_owner [alerts] - 10https://gerrit.wikimedia.org/r/969312 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond)
[10:36:32] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1101.eqiad.wmnet with OS bullseye
[10:36:42] <wikibugs>	 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1101.eqiad.wmnet with OS bullseye completed: - cp1101 (**PASS**)   - Removed from Puppet and PuppetD...
[10:39:02] <wikibugs>	 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur)
[10:40:12] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1102.eqiad.wmnet with OS bullseye
[10:40:22] <wikibugs>	 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1102.eqiad.wmnet with OS bullseye
[10:42:54] <wikibugs>	 (03PS2) 10Jbond: team-sre/systemd: update systemd checks to make use of systemd_unit_owner [alerts] - 10https://gerrit.wikimedia.org/r/969312 (https://phabricator.wikimedia.org/T349176)
[10:43:27] <wikibugs>	 (03CR) 10Jbond: "ready for review" [alerts] - 10https://gerrit.wikimedia.org/r/969312 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond)
[10:44:59] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[10:45:15] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[10:45:32] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[10:47:15] <wikibugs>	 (03PS4) 10Cathal Mooney: Interface automation: skip import of existing int IPs and VIPs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/962614 (https://phabricator.wikimedia.org/T295774)
[10:48:17] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[10:48:46] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[10:48:47] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[10:56:46] <wikibugs>	 (03CR) 10EoghanGaffney: "This change is ready for review." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/968288 (owner: 10EoghanGaffney)
[10:57:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10jbond)
[10:57:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10jbond)
[10:59:30] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10jbond)
[10:59:45] <wikibugs>	 (03PS7) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[10:59:53] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10jbond) p:05Triage→03Medium
[11:00:12] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[11:00:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[11:01:07] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.ganeti.resource-report
[11:01:07] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0)
[11:02:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10jbond) from the following Group A seems like the best  ` +-------+-------+-----------+----------+-----------+---------+-----------+ | Group | Nodes | Instances |  MFree   | MF...
[11:05:58] <wikibugs>	 (03PS8) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[11:06:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[11:08:25] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.ganeti.resource-report
[11:08:26] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0)
[11:09:49] <icinga-wm>	 RECOVERY - Disk space on an-web1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-web1001&var-datasource=eqiad+prometheus/ops
[11:10:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10MoritzMuehlenhoff) Looks good, A sounds indeed best.
[11:12:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10jbond) 05Open→03In progress
[11:12:13] <wikibugs>	 (03PS9) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[11:12:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[11:15:05] <wikibugs>	 (03PS1) 10Jbond: netboot: Add acmechief[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/969314 (https://phabricator.wikimedia.org/T349890)
[11:15:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] netboot: Add acmechief[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/969314 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond)
[11:17:24] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1102.eqiad.wmnet with OS bullseye
[11:17:30] <wikibugs>	 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1102.eqiad.wmnet with OS bullseye executed with errors: - cp1102 (**FAIL**)   - Downtimed on Icinga/...
[11:18:21] <wikibugs>	 (03PS10) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[11:18:27] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1102.eqiad.wmnet with OS bullseye
[11:18:34] <wikibugs>	 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1102.eqiad.wmnet with OS bullseye
[11:19:00] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate mr1-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348164 (10cmooney) As discussed with @papaul we may try to connect this to lsw1-a2-codfw instead, so that we can remove the requirement for a leaf switch in...
[11:19:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[11:21:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10serviceops, 10User-MoritzMuehlenhoff: Annotate images in our registry with OS (and OS version) - https://phabricator.wikimedia.org/T335337 (10JMeybohm)
[11:26:11] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.ganeti.makevm for new host acmechief2002.codfw.wmnet
[11:26:12] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.dns.netbox
[11:28:11] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM acmechief2002.codfw.wmnet - jbond@cumin1001"
[11:29:02] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM acmechief2002.codfw.wmnet - jbond@cumin1001"
[11:29:02] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:29:02] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache acmechief2002.codfw.wmnet on all recursors
[11:29:06] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) acmechief2002.codfw.wmnet on all recursors
[11:29:30] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM acmechief2002.codfw.wmnet - jbond@cumin1001"
[11:30:21] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM acmechief2002.codfw.wmnet - jbond@cumin1001"
[11:31:30] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host acmechief2002.codfw.wmnet with OS bookworm
[11:31:31] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1102.eqiad.wmnet with reason: host reimage
[11:31:36] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host acmechief2002.codfw.wmnet with OS bookworm
[11:34:42] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1102.eqiad.wmnet with reason: host reimage
[11:37:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10ssingh) >>! In T349890#9287016, @ops-monitoring-bot wrote: > Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host acmechief2002.codfw.wmnet with OS boo...
[11:38:09] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: "This change is ready for review." (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos)
[11:44:51] <wikibugs>	 (03PS3) 10Filippo Giunchedi: team-sre/systemd: update systemd checks to make use of systemd_unit_owner [alerts] - 10https://gerrit.wikimedia.org/r/969312 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond)
[11:45:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/969312 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond)
[11:46:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/968658 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm)
[11:51:15] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:52:02] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1102.eqiad.wmnet with OS bullseye
[11:52:06] <wikibugs>	 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1102.eqiad.wmnet with OS bullseye completed: - cp1102 (**PASS**)   - Removed from Puppet and PuppetD...
[11:54:31] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+1] Add weekly-update script (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/969303 (https://phabricator.wikimedia.org/T344478) (owner: 10Giuseppe Lavagetto)
[11:55:13] <wikibugs>	 (03PS1) 10Ayounsi: [POC] Split interface_automation into multiple files [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/969319
[11:56:02] <wikibugs>	 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur)
[12:01:58] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (10cmooney) Discussed with @papaul and we will do this work on Thursday at 11.30am CDT / 16:30 UTC.  Shouldn't be any inter...
[12:06:22] <wikibugs>	 (03PS1) 10Jbond: site.pp: Add acmechief2002 [puppet] - 10https://gerrit.wikimedia.org/r/969320 (https://phabricator.wikimedia.org/T349890)
[12:06:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] site.pp: Add acmechief2002 [puppet] - 10https://gerrit.wikimedia.org/r/969320 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond)
[12:13:53] <wikibugs>	 (03CR) 10JMeybohm: Add weekly-update script (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/969303 (https://phabricator.wikimedia.org/T344478) (owner: 10Giuseppe Lavagetto)
[12:14:08] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on acmechief2002.codfw.wmnet with reason: host reimage
[12:14:39] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] team-ml: add alert for memory spike in inf services (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos)
[12:14:51] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Enable icu67 component on mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/968658 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm)
[12:17:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] team-sre/systemd: update systemd checks to make use of systemd_unit_owner (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/969312 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond)
[12:17:18] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on acmechief2002.codfw.wmnet with reason: host reimage
[12:19:18] <wikibugs>	 (03PS1) 10Muehlenhoff: Provide a script to determine whether a given Puppet node can be swithed to nft [puppet] - 10https://gerrit.wikimedia.org/r/969324
[12:19:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Provide a script to determine whether a given Puppet node can be swithed to nft [puppet] - 10https://gerrit.wikimedia.org/r/969324 (owner: 10Muehlenhoff)
[12:21:52] <wikibugs>	 (03PS2) 10Muehlenhoff: Provide a script to determine whether a given Puppet node can be swithed to nft [puppet] - 10https://gerrit.wikimedia.org/r/969324
[12:24:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Provide a script to determine whether a given Puppet node can be swithed to nft [puppet] - 10https://gerrit.wikimedia.org/r/969324 (owner: 10Muehlenhoff)
[12:24:58] <wikibugs>	 (03PS3) 10Muehlenhoff: Provide a script to determine whether a given Puppet node can be swithed to nft [puppet] - 10https://gerrit.wikimedia.org/r/969324
[12:31:27] <icinga-wm>	 RECOVERY - Check systemd state on sretest2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:33:34] <wikibugs>	 (03PS4) 10Muehlenhoff: Provide a script to determine whether a given Puppet node can be swithed to nft [puppet] - 10https://gerrit.wikimedia.org/r/969324
[12:37:03] <wikibugs>	 (03CR) 10Muehlenhoff: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/969324 (owner: 10Muehlenhoff)
[12:41:45] <jayme>	 !log updated mwdebug1001 to icu67 - T345561
[12:41:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:50] <stashbot>	 T345561: Upgrade the MediaWiki servers to ICU 67 - https://phabricator.wikimedia.org/T345561
[12:54:00] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[12:55:21] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:57:22] <wikibugs>	 (03PS1) 10Filippo Giunchedi: team-sre: move SystemdUnitCrashLoop to systemd_unit_owner [alerts] - 10https://gerrit.wikimedia.org/r/969326 (https://phabricator.wikimedia.org/T293970)
[13:00:01] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch arclamp to nftables [puppet] - 10https://gerrit.wikimedia.org/r/969328
[13:00:45] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bullseye
[13:00:52] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye
[13:04:51] <wikibugs>	 (03CR) 10Muehlenhoff: "A few more comments, looks good otherwise." [puppet] - 10https://gerrit.wikimedia.org/r/968288 (owner: 10EoghanGaffney)
[13:05:07] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969328 (owner: 10Muehlenhoff)
[13:06:57] <jinxer-wm>	 (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[13:07:06] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/969326 (https://phabricator.wikimedia.org/T293970) (owner: 10Filippo Giunchedi)
[13:11:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] team-sre: move SystemdUnitCrashLoop to systemd_unit_owner [alerts] - 10https://gerrit.wikimedia.org/r/969326 (https://phabricator.wikimedia.org/T293970) (owner: 10Filippo Giunchedi)
[13:14:24] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch netboxdb to nftables [puppet] - 10https://gerrit.wikimedia.org/r/969331
[13:14:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, + other o11y folks as heads-up" [puppet] - 10https://gerrit.wikimedia.org/r/969328 (owner: 10Muehlenhoff)
[13:16:54] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969331 (owner: 10Muehlenhoff)
[13:18:34] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: team-ml: add alert for memory spike in inf services (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos)
[13:27:44] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host acmechief2002.codfw.wmnet with OS bookworm
[13:27:45] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host acmechief2002.codfw.wmnet
[13:28:07] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host acmechief2002.codfw.wmnet with OS bookworm completed: - acmechief2002 (**WARN**)...
[13:31:10] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-host for host acmechief2002.codfw.wmnet
[13:33:21] <wikibugs>	 (03PS1) 10Jbond: acmechief2002: move to pupet7 [puppet] - 10https://gerrit.wikimedia.org/r/969335 (https://phabricator.wikimedia.org/T349890)
[13:33:23] <wikibugs>	 (03PS1) 10Jbond: acmechief: add new acmechief server [puppet] - 10https://gerrit.wikimedia.org/r/969336 (https://phabricator.wikimedia.org/T349890)
[13:33:42] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[13:34:03] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] acmechief2002: move to pupet7 [puppet] - 10https://gerrit.wikimedia.org/r/969335 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond)
[13:35:39] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change sretest2004 DNS - cmooney@cumin1001"
[13:36:46] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Update reimage cookbooks to work with puppet7 - https://phabricator.wikimedia.org/T348319 (10jbond)
[13:36:51] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change sretest2004 DNS - cmooney@cumin1001"
[13:36:51] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:37:07] <logmsgbot>	 !log cmooney@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2004.codfw.wmnet with OS bullseye
[13:37:12] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye...
[13:38:14] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bullseye
[13:38:15] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host acmechief2002.codfw.wmnet
[13:38:20] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye
[13:38:29] <wikibugs>	 (03CR) 10Klausman: team-ml: add alert for memory spike in inf services (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos)
[13:38:35] <icinga-wm>	 PROBLEM - Ensure that passive node gets the certificates from the active node as expected on acmechief2002 is CRITICAL: FILE_AGE CRITICAL: File not found - /var/lib/acme-chief/certs/.rsync.status https://wikitech.wikimedia.org/wiki/Acme-chief
[13:38:35] <icinga-wm>	 PROBLEM - Check systemd state on acmechief2002 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:39:38] <wikibugs>	 (03CR) 10Klausman: team-ml: add alert for memory spike in inf services (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos)
[13:44:42] <wikibugs>	 10SRE, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10Data-Platform-SRE, and 2 others: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10Ottomata)
[13:53:26] <wikibugs>	 10SRE, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10Data-Platform-SRE, and 4 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (10Ottomata)
[13:53:47] <wikibugs>	 10SRE, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10Data-Platform-SRE, and 4 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (10Ottomata)
[13:56:10] <wikibugs>	 (03PS1) 10Jbond: idp: update junper to stag uri [puppet] - 10https://gerrit.wikimedia.org/r/969337
[13:56:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] idp: update junper to stag uri [puppet] - 10https://gerrit.wikimedia.org/r/969337 (owner: 10Jbond)
[13:56:30] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] idp: update junper to stag uri [puppet] - 10https://gerrit.wikimedia.org/r/969337 (owner: 10Jbond)
[13:57:01] <wikibugs>	 (03PS1) 10Muehlenhoff: package_builder: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/969338
[13:59:01] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Update developer-portal to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958479 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[13:59:53] <wikibugs>	 (03Merged) 10jenkins-bot: Update developer-portal to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958479 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[14:02:45] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr)
[14:02:50] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[14:03:16] <logmsgbot>	 !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[14:03:36] <logmsgbot>	 !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[14:04:11] <logmsgbot>	 !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[14:04:18] <logmsgbot>	 !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[14:04:45] <logmsgbot>	 !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[14:07:31] <wikibugs>	 (03CR) 10Herron: "This change is ready for review." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807) (owner: 10Herron)
[14:09:08] <wikibugs>	 (03PS6) 10JMeybohm: eventgate: Update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/959181 (https://phabricator.wikimedia.org/T345244) (owner: 10Clément Goubert)
[14:09:40] <wikibugs>	 (03CR) 10JMeybohm: eventgate: Update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/959181 (https://phabricator.wikimedia.org/T345244) (owner: 10Clément Goubert)
[14:12:17] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host sretest1004.mgmt.eqiad.wmnet with reboot policy FORCED
[14:12:19] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1004.mgmt.eqiad.wmnet with reboot policy FORCED
[14:13:19] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] mw-debug: Revert envoy draining tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/968959 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert)
[14:14:07] <wikibugs>	 (03Merged) 10jenkins-bot: mw-debug: Revert envoy draining tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/968959 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert)
[14:15:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/969331 (owner: 10Muehlenhoff)
[14:17:32] <wikibugs>	 (03PS1) 10Btullis: Enable the TagManager plugin functionality on Matomo [puppet] - 10https://gerrit.wikimedia.org/r/969341 (https://phabricator.wikimedia.org/T349910)
[14:18:07] <wikibugs>	 (03PS2) 10Btullis: Enable the TagManager plugin functionality on Matomo [puppet] - 10https://gerrit.wikimedia.org/r/969341 (https://phabricator.wikimedia.org/T349910)
[14:18:37] <logmsgbot>	 !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[14:19:07] <logmsgbot>	 !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[14:19:42] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/221/con" [puppet] - 10https://gerrit.wikimedia.org/r/969341 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis)
[14:19:53] <topranks>	 !log announcing internal core routes to esams asw's to test policy T344547
[14:19:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:57] <stashbot>	 T344547: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547
[14:21:41] <jinxer-wm>	 (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[14:27:42] <wikibugs>	 (03PS1) 10JMeybohm: Update flink-session-cluster to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/969343 (https://phabricator.wikimedia.org/T300033)
[14:30:10] <wikibugs>	 (03CR) 10Vgutierrez: acmechief: add new acmechief server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969336 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond)
[14:30:12] <wikibugs>	 (03PS2) 10Jbond: acmechief: add new acmechief server [puppet] - 10https://gerrit.wikimedia.org/r/969336 (https://phabricator.wikimedia.org/T349890)
[14:30:43] <wikibugs>	 (03CR) 10Jbond: acmechief: add new acmechief server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969336 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond)
[14:34:58] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/969336 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond)
[14:38:42] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:21] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage
[14:41:05] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/969336 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond)
[14:42:21] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] acmechief: add new acmechief server [puppet] - 10https://gerrit.wikimedia.org/r/969336 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond)
[14:42:27] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage
[14:43:03] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] acmechief: add new acmechief server [puppet] - 10https://gerrit.wikimedia.org/r/969336 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond)
[14:47:38] <wikibugs>	 (03PS1) 10Muehlenhoff: Configure ACLs for reprepro upload queue (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349)
[14:47:48] <wikibugs>	 (03PS2) 10Muehlenhoff: Configure ACLs for reprepro upload queue (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349)
[14:48:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Configure ACLs for reprepro upload queue (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/969344 (https://phabricator.wikimedia.org/T115349) (owner: 10Muehlenhoff)
[14:50:17] <wikibugs>	 (03CR) 10Ahmon Dancy: [V: 03+2 C: 03+2] Review access change [docker-images/production-images] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/969164 (owner: 10Giuseppe Lavagetto)
[14:50:51] <wikibugs>	 (03PS1) 10JMeybohm: Update datahub to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/969345 (https://phabricator.wikimedia.org/T300033)
[14:52:13] <wikibugs>	 (03PS1) 10JMeybohm: Update benthos to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/969366 (https://phabricator.wikimedia.org/T300033)
[14:53:05] <wikibugs>	 (03CR) 10JMeybohm: "As benthos does not use the service mesh, this should be more or less a noop" [deployment-charts] - 10https://gerrit.wikimedia.org/r/969366 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[14:53:42] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:56:04] <icinga-wm>	 RECOVERY - Check systemd state on acmechief2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:57:50] <wikibugs>	 (03PS1) 10Cathal Mooney: Change core router config to export internal routes to Switches [homer/public] - 10https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547)
[14:59:59] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmooney@cumin1001"
[15:03:06] <icinga-wm>	 RECOVERY - Ensure that passive node gets the certificates from the active node as expected on acmechief2002 is OK: FILE_AGE OK: /var/lib/acme-chief/certs/.rsync.status is 185 seconds old and 0 bytes https://wikitech.wikimedia.org/wiki/Acme-chief
[15:03:30] <vgutierrez>	 ^^ jbond acmechief2002 already has the TLS material :)
[15:04:46] <icinga-wm>	 PROBLEM - SSH on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:04:48] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:05:07] <jinxer-wm>	 (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:05:38] <icinga-wm>	 PROBLEM - thanos.wikimedia.org requires authentication on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[15:05:54] <icinga-wm>	 PROBLEM - thanos.wikimedia.org tls expiry on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[15:06:07] <jinxer-wm>	 (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:06:14] <icinga-wm>	 PROBLEM - thanos.wikimedia.org tls expiry on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[15:06:15] <jynus>	 grafana not working for me ^
[15:06:16] <icinga-wm>	 PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:06:29] <jynus>	 ssh, did it crash?
[15:06:34] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:07:50] <jynus>	 ping looks ok, so it is not that
[15:08:30] <jynus>	 hopefully someone on call can help me debug
[15:08:42] <jynus>	 ssh looks down indeed
[15:08:42] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:08:56] <icinga-wm>	 PROBLEM - thanos.wikimedia.org requires authentication on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[15:09:04] <jynus>	 will try mgmt
[15:10:28] <jynus>	 uff, lot of time for login to respond, my guess would be overload
[15:10:40] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): find solution fo acmechief in puppet7 - https://phabricator.wikimedia.org/T349915 (10jbond)
[15:10:48] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10Patch-For-Review: codfw: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10jbond)
[15:10:54] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/969328 (owner: 10Muehlenhoff)
[15:11:18] <jynus>	 sadly I lack visibility
[15:11:51] <wikibugs>	 (03PS1) 10Jbond: idp_test: switch to puppet7 acmechief host [puppet] - 10https://gerrit.wikimedia.org/r/969368 (https://phabricator.wikimedia.org/T349890)
[15:12:00] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] openstack: encapi: don't try to hold a single connection open [puppet] - 10https://gerrit.wikimedia.org/r/968331 (https://phabricator.wikimedia.org/T349195) (owner: 10Majavah)
[15:13:26] <jynus>	 godog: suggestions on how to proceed?
[15:13:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (10cmooney) Above patch reflects my thinking on the best approach for this.  I've taken the approach that we should announce all our internal...
[15:13:56] <godog>	 jynus: checking
[15:14:19] <godog>	 didn't oncall get paged?
[15:14:24] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/224/console" [puppet] - 10https://gerrit.wikimedia.org/r/969368 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond)
[15:14:26] <jynus>	 unsure
[15:14:44] <herron>	 godog: yes just acked
[15:15:01] <godog>	 herron: ack
[15:15:06] <jynus>	 problem is we are traveling blind at the moment
[15:15:45] <herron>	 someone already rebooting titan1001?
[15:15:54] <jynus>	 I was asking godog whether to do so
[15:15:55] <godog>	 not afaik
[15:16:05] <herron>	 ok, it's hanging on ssh, I'll go ahead and reboot
[15:16:11] <jynus>	 but if it is caused externally, not sure it is wise
[15:16:25] <jynus>	 the host is up, just seems very loaded
[15:16:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10Patch-For-Review: codfw: 1 VM request for acme-chief - https://phabricator.wikimedia.org/T349890 (10jbond) 05In progress→03Resolved a:03jbond
[15:16:40] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): find solution fo acmechief in puppet7 - https://phabricator.wikimedia.org/T349915 (10jbond)
[15:16:47] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] idp_test: switch to puppet7 acmechief host [puppet] - 10https://gerrit.wikimedia.org/r/969368 (https://phabricator.wikimedia.org/T349890) (owner: 10Jbond)
[15:16:54] <godog>	 yeah reboot sounds good, we can also make codfw active for thanos
[15:17:01] <jynus>	 but to be fair, I cannot think of what's the alternative
[15:17:16] <jynus>	 so +1 to do it and see
[15:18:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): find solution for acmechief in puppet7 - https://phabricator.wikimedia.org/T349915 (10Aklapper)
[15:20:08] <jynus>	 godog: other parts of the monitoring stack look good (e.g. prometheus?)
[15:20:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): find solution for acmechief in puppet7 - https://phabricator.wikimedia.org/T349915 (10jbond) Acmechief2002 has been installed with a the version available in https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_reques...
[15:20:58] <jynus>	 mmm, it seems it affected titan1001 and titan1002, so that would confirm a load issue
[15:21:05] <wikibugs>	 (03PS2) 10Cathal Mooney: Change core router config to export internal routes to Switches [homer/public] - 10https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547)
[15:21:14] <herron>	 !log power cycled titan1001
[15:21:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:00] <godog>	 yeah basically running queries through thanos eqiad is affected, other parts are fine e.g. prometheus itself and alertmanager
[15:22:09] <jynus>	 good
[15:22:28] <icinga-wm>	 PROBLEM - Host titan1001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:23:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): find solution for acmechief in puppet7 - https://phabricator.wikimedia.org/T349915 (10jbond)
[15:23:21] <jynus>	 that's actually nice redundancy
[15:23:32] <wikibugs>	 (03PS3) 10Cathal Mooney: Change core router config to export internal routes to Switches [homer/public] - 10https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547)
[15:23:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): update acme chief access - https://phabricator.wikimedia.org/T349620 (10jbond)
[15:24:18] <jynus>	 it should be up now, saw it reboot and finish loading linux
[15:24:20] <herron>	 alright userspace is back up finally
[15:24:20] <icinga-wm>	 RECOVERY - Host titan1001 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[15:24:20] <Lucas_WMDE>	 ftr, we got an alert for Wikidata, but I assume that’s also due to the titan/thanos issue and y’all are on it
[15:24:32] <icinga-wm>	 RECOVERY - thanos.wikimedia.org tls expiry on titan1001 is OK: OK - Certificate thanos-query.discovery.wmnet will expire on Mon 20 Nov 2023 08:22:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[15:24:34] <icinga-wm>	 RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:25:01] <jynus>	 let's see what is the behaviour of titan1002 without touching it
[15:25:07] <jinxer-wm>	 (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:25:20] <icinga-wm>	 RECOVERY - thanos.wikimedia.org requires authentication on titan1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[15:25:26] <godog>	 Lucas_WMDE: quite possibly yes
[15:25:44] <jynus>	 as it may be useful for debugging, if 1001 can take the work
[15:25:45] <Lucas_WMDE>	 yeah https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts?orgId=1&refresh=5m&viewPanel=28 is showing data again
[15:25:53] <Lucas_WMDE>	 thanks for the powercycle :)
[15:25:58] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:26:07] <jinxer-wm>	 (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:27:33] <wikibugs>	 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10Volans) 05Open→03Resolved a:03Volans Resolving for now, feel free to re-open in case it happens again.
[15:27:54] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:28:32] <wikibugs>	 10SRE, 10Maps: Allow Wikimedia Maps usage on QGIS - https://phabricator.wikimedia.org/T349917 (10Francois.peru)
[15:28:42] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:30:15] <jynus>	 in case you see something relevant: https://phabricator.wikimedia.org/P53060
[15:31:27] <jynus>	 I think overload is almost 100% confirmed
[15:31:30] <icinga-wm>	 RECOVERY - thanos.wikimedia.org tls expiry on titan1002 is OK: OK - Certificate thanos-query.discovery.wmnet will expire on Mon 20 Nov 2023 08:56:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[15:31:44] <icinga-wm>	 RECOVERY - SSH on titan1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:31:50] <icinga-wm>	 RECOVERY - thanos.wikimedia.org requires authentication on titan1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[15:32:05] <godog>	 I would agree with that
[15:32:19] <jynus>	 1002 recovered itself, so that may have more interesting logs
[15:33:42] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:34:19] <jynus>	 thanks herron
[15:34:31] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10dcaro)
[15:35:32] <wikibugs>	 (03PS4) 10Cathal Mooney: Change core router config to export internal routes to Switches [homer/public] - 10https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547)
[15:38:14] <wikibugs>	 (03PS5) 10Cathal Mooney: Change core router config to export internal routes to Switches [homer/public] - 10https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547)
[15:38:36] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmooney@cumin1001"
[15:38:37] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2004.codfw.wmnet with OS bullseye
[15:38:44] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye...
[15:39:24] <wikibugs>	 (03PS1) 10Muehlenhoff: pybaltest: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/969370
[15:40:25] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969370 (owner: 10Muehlenhoff)
[15:41:51] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (10cmooney) FWIW in my original config for this I had terms to match routes redistributed into BGP locally and announced in IBGP, or between c...
[15:44:31] <wikibugs>	 (03PS1) 10Jbond: site.pp: rename site.pp so that it is loaded first [puppet] - 10https://gerrit.wikimedia.org/r/969373 (https://phabricator.wikimedia.org/T349918)
[15:44:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] site.pp: rename site.pp so that it is loaded first [puppet] - 10https://gerrit.wikimedia.org/r/969373 (https://phabricator.wikimedia.org/T349918) (owner: 10Jbond)
[15:47:33] <wikibugs>	 (03PS2) 10Jbond: site.pp: rename site.pp so that it is loaded first [puppet] - 10https://gerrit.wikimedia.org/r/969373 (https://phabricator.wikimedia.org/T349918)
[15:48:29] <wikibugs>	 (03CR) 10Cathal Mooney: Change core router config to export internal routes to Switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547) (owner: 10Cathal Mooney)
[15:50:40] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] pybaltest: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/969370 (owner: 10Muehlenhoff)
[15:51:15] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:52:18] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] pybaltest: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/969370 (owner: 10Muehlenhoff)
[15:55:16] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): find solution for acmechief in puppet7 - https://phabricator.wikimedia.org/T349915 (10jbond) 05Open→03In progress p:05Triage→03Medium
[15:55:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond)
[15:55:20] <wikibugs>	 (03PS1) 10Jbond: realm: use puppet7 acmechief when on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969375 (https://phabricator.wikimedia.org/T349915)
[15:56:53] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/226/console" [puppet] - 10https://gerrit.wikimedia.org/r/969375 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond)
[16:00:33] <wikibugs>	 (03PS2) 10Jbond: realm: use puppet7 acmechief when on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969375 (https://phabricator.wikimedia.org/T349915)
[16:00:39] <wikibugs>	 (03CR) 10Jbond: realm: use puppet7 acmechief when on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969375 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond)
[16:01:44] <wikibugs>	 (03PS3) 10Jbond: realm: use puppet7 acmechief when on puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/969375 (https://phabricator.wikimedia.org/T349915)
[16:03:09] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/227/console" [puppet] - 10https://gerrit.wikimedia.org/r/969375 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond)
[16:04:14] <wikibugs>	 (03CR) 10Bking: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking)
[16:05:02] <wikibugs>	 (03CR) 10Bking: rdf-streaming-updater: update staging values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking)
[16:14:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/969175 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney)
[16:16:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm just some missed clean up" [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt)
[16:18:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/969338 (owner: 10Muehlenhoff)
[16:19:21] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate mr1-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348164 (10Papaul) @cmooney for the cross rack link it does make sense to use copper with 1000BaseT sine we have those already on site.  On the other hand sin...
[16:25:27] <wikibugs>	 (03PS5) 10EoghanGaffney: [apt-staging] Add apt-staging host for CI pipeline [puppet] - 10https://gerrit.wikimedia.org/r/968288
[16:26:32] <wikibugs>	 (03CR) 10Btullis: [C: 04-1] "I like the approch, but I think we may need to take advice on whether we should use the discovery intermediate, or create our own." [puppet] - 10https://gerrit.wikimedia.org/r/968612 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol)
[16:27:04] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:27:04] <wikibugs>	 (03CR) 10EoghanGaffney: [apt-staging] Add apt-staging host for CI pipeline (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/968288 (owner: 10EoghanGaffney)
[16:32:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:41:02] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 185.15.59.129, interfaces up: 63, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:41:08] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:41:16] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:41:36] <icinga-wm>	 RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 15, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:43:58] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:48:50] <wikibugs>	 ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti expansion - https://phabricator.wikimedia.org/T349925 (RobH)
[16:48:54] <wikibugs>	 ops-codfw, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install 2 ganeti node expansion - https://phabricator.wikimedia.org/T349926 (RobH)
[16:49:22] <wikibugs>	 ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti expansion - https://phabricator.wikimedia.org/T349925 (RobH)
[16:49:29] <wikibugs>	 ops-codfw, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install 2 ganeti node expansion - https://phabricator.wikimedia.org/T349926 (RobH)
[16:50:12] <wikibugs>	 (CR) Jbond: "functionally lgtm have added some comments for improvements" [puppet] - https://gerrit.wikimedia.org/r/969324 (owner: Muehlenhoff)
[16:51:26] <wikibugs>	 ops-codfw, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install 2 ganeti node expansion - https://phabricator.wikimedia.org/T349926 (RobH) a:MoritzMuehlenhoff @MoritzMuehlenhoff,  The parent purchasing task for 2 nodes in codfw has been escalated to order without racking details.  Would you...
[16:51:38] <wikibugs>	 ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti expansion - https://phabricator.wikimedia.org/T349925 (RobH) @MoritzMuehlenhoff,  The parent purchasing task for 4 nodes in eqiad has been escalated to order without racking details.  Would you please provide racking details...
[16:51:42] <wikibugs>	 ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti expansion - https://phabricator.wikimedia.org/T349925 (RobH) a:MoritzMuehlenhoff
[16:57:53] <wikibugs>	 (CR) Jbond: Enable the management of the skein certificate via Puppet (5 comments) [puppet] - https://gerrit.wikimedia.org/r/968612 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol)
[17:02:11] <wikibugs>	 (PS22) Bking: rdf-streaming-updater: update staging values [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095)
[17:11:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 5.395432074523921s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:30:09] <icinga-wm>	 ACKNOWLEDGEMENT - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 616 probes of 616 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map Andrea Denisse Resolved https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:42:57] <wikibugs>	 (CR) Milimetric: [C: +1] wikireplicas: add user_is_temp column to user view [puppet] - https://gerrit.wikimedia.org/r/958543 (https://phabricator.wikimedia.org/T346679) (owner: Milimetric)
[17:52:09] <wikibugs>	 ops-codfw, DC-Ops, Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (RobH)
[17:52:26] <wikibugs>	 ops-codfw, DC-Ops, Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (RobH)
[18:06:10] <wikibugs>	 (CR) CDanis: [C: +1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807) (owner: Herron)
[18:07:36] <wikibugs>	 (CR) CDanis: [C: +1] "seems sensible enough to me!" [alerts] - https://gerrit.wikimedia.org/r/967143 (https://phabricator.wikimedia.org/T345712) (owner: Filippo Giunchedi)
[18:16:53] <wikibugs>	 (CR) Andrew Bogott: [C: +2] wmcs_backup_volumes: reduce backup lifespan [puppet] - https://gerrit.wikimedia.org/r/969226 (owner: Andrew Bogott)
[18:17:34] <icinga-wm>	 PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:20:06] <wikibugs>	 (PS23) Dwisehaupt: Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[18:21:57] <jinxer-wm>	 (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[18:23:04] <wikibugs>	 ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (RobH)
[18:23:27] <wikibugs>	 ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (RobH)
[18:30:42] <icinga-wm>	 RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:30:57] <wikibugs>	 (CR) Dwisehaupt: "Thanks. Cleaned that up. @jhathaway would you care to have a look over this?" [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[18:52:39] <wikibugs>	 SRE, Maps: Allow Wikimedia Maps usage on QGIS - https://phabricator.wikimedia.org/T349917 (Aklapper) Open→Declined Hi @Francois.peru, thanks for taking the time to report this. The field "Wikimedia Affiliate supporting project" above is not filled out, so for now I am going to decline this ticket...
[18:54:34] <wikibugs>	 (PS24) Dwisehaupt: Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[18:59:10] <wikibugs>	 (CR) Dwisehaupt: "Last patch is just a minor update to use the proper gitlab repo now that it has been set up. No other changes." [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[19:37:31] <wikibugs>	 (PS1) Bking: search-loader: use default system python [puppet] - https://gerrit.wikimedia.org/r/969386 (https://phabricator.wikimedia.org/T346039)
[19:40:00] <wikibugs>	 (CR) CI reject: [V: -1] search-loader: use default system python [puppet] - https://gerrit.wikimedia.org/r/969386 (https://phabricator.wikimedia.org/T346039) (owner: Bking)
[19:47:38] <wikibugs>	 (CR) Bking: "recheck" [puppet] - https://gerrit.wikimedia.org/r/969386 (https://phabricator.wikimedia.org/T346039) (owner: Bking)
[19:51:15] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:51:21] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/969386 (https://phabricator.wikimedia.org/T346039) (owner: Bking)
[20:06:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 4.460643168670146s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:15:54] <wikibugs>	 (CR) Ryan Kemper: [C: +1] search-loader: use default system python [puppet] - https://gerrit.wikimedia.org/r/969386 (https://phabricator.wikimedia.org/T346039) (owner: Bking)
[20:27:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 5.240180559958171s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:41:36] <wikibugs>	 (CR) Aqu: "Awesome, thanks a lot!" [puppet] - https://gerrit.wikimedia.org/r/969143 (https://phabricator.wikimedia.org/T343232) (owner: Btullis)
[21:03:34] <wikibugs>	 (PS6) Aqu: [WIP] Send metrics from Airflow analytics test [puppet] - https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532)
[21:11:20] <wikibugs>	 (CR) Aqu: "Thanks for your reviews. Feel free to take the lead on it next week." [puppet] - https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: Aqu)
[21:19:21] <wikibugs>	 (CR) Btullis: "Thanks so much for the feedback" [puppet] - https://gerrit.wikimedia.org/r/969143 (https://phabricator.wikimedia.org/T343232) (owner: Btullis)
[21:22:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 4.0145303287018566s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:22:46] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 3.0168584249270927s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:25:53] <wikibugs>	 (PS1) Gergő Tisza: [beta] Use dedicated redis server [mediawiki-config] - https://gerrit.wikimedia.org/r/969387 (https://phabricator.wikimedia.org/T340908)
[21:34:15] <tgr>	 deploying a beta-only patch
[21:35:59] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/969387 (https://phabricator.wikimedia.org/T340908) (owner: Gergő Tisza)
[21:36:43] <wikibugs>	 (Merged) jenkins-bot: [beta] Use dedicated redis server [mediawiki-config] - https://gerrit.wikimedia.org/r/969387 (https://phabricator.wikimedia.org/T340908) (owner: Gergő Tisza)
[21:55:39] <wikibugs>	 (PS1) Gergő Tisza: [beta] Fix Redis configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/969393 (https://phabricator.wikimedia.org/T340908)
[21:57:17] <tgr>	 and one more
[21:57:42] <wikibugs>	 (CR) Gergő Tisza: [C: +2] [beta] Fix Redis configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/969393 (https://phabricator.wikimedia.org/T340908) (owner: Gergő Tisza)
[21:58:24] <wikibugs>	 (Merged) jenkins-bot: [beta] Fix Redis configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/969393 (https://phabricator.wikimedia.org/T340908) (owner: Gergő Tisza)
[22:06:04] <wikibugs>	 (PS2) Dwisehaupt: Add dummy secrets for community_civicrm [labs/private] - https://gerrit.wikimedia.org/r/967519 (https://phabricator.wikimedia.org/T343486)
[22:08:11] <wikibugs>	 (CR) Dwisehaupt: "jgreen, if you can verify these, we can go ahead and merge this up." [labs/private] - https://gerrit.wikimedia.org/r/967519 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[22:21:57] <jinxer-wm>	 (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[22:41:04] <wikibugs>	 (CR) Cwhite: [C: -1] "We will also need a separate CR defining "uri_host" in a new w3creportingapi-1.0.0 template revision to deploy prior to this one." [puppet] - https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807) (owner: Herron)
[22:47:04] <rzl>	 !log reprepro -C main include bullseye-wikimedia k8s-controller-sidecars_1.0.2-1_source.changes
[22:47:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:49:58] <icinga-wm>	 PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:53:09] <wikibugs>	 (PS1) Gergő Tisza: [beta] Disable statsd [mediawiki-config] - https://gerrit.wikimedia.org/r/969397 (https://phabricator.wikimedia.org/T349944)
[22:53:45] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/969397 (https://phabricator.wikimedia.org/T349944) (owner: Gergő Tisza)
[22:54:28] <wikibugs>	 (Merged) jenkins-bot: [beta] Disable statsd [mediawiki-config] - https://gerrit.wikimedia.org/r/969397 (https://phabricator.wikimedia.org/T349944) (owner: Gergő Tisza)
[23:01:38] <icinga-wm>	 RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:51:15] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure