[00:00:07] RECOVERY - Check systemd state on elastic1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:21] !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4005.ulsfo.wmnet with reason: host reimage [00:01:03] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ganeti4005.ulsfo.wmnet with reason: host reimage [00:02:35] PROBLEM - Check whether ferm is active by checking the default input chain on elastic1067 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:03:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:03:59] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS buster [00:04:32] !log robh@cumin2002 START - Cookbook sre.dns.netbox [00:05:44] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:06:15] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs4008 [00:06:31] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs4008 [00:06:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T321312)', diff saved to https://phabricator.wikimedia.org/P35783 and previous config saved to /var/cache/conftool/dbconfig/20221021-000636-ladsgroup.json [00:06:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [00:06:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [00:07:21] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4008.ulsfo.wmnet with OS bullseye [00:07:29] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye [00:09:58] (03CR) 10Ryan Kemper: [C: 03+2] elastic: run puppet after reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/845068 (owner: 10Ryan Kemper) [00:11:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [00:11:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [00:11:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P35784 and previous config saved to /var/cache/conftool/dbconfig/20221021-001117-ladsgroup.json [00:11:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1190 (T321312)', diff saved to https://phabricator.wikimedia.org/P35785 and previous config saved to /var/cache/conftool/dbconfig/20221021-001123-ladsgroup.json [00:14:25] RECOVERY - Check whether ferm is active by checking the default input chain on elastic1066 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:16:01] (03PS1) 10Ssingh: cp4049: update 
site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/845075 (https://phabricator.wikimedia.org/T317244) [00:16:53] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4005.ulsfo.wmnet with OS bullseye [00:17:01] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti4005.ulsfo.wmnet with OS bullseye completed: - ganeti4... [00:17:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T321312)', diff saved to https://phabricator.wikimedia.org/P35786 and previous config saved to /var/cache/conftool/dbconfig/20221021-001740-ladsgroup.json [00:18:22] RECOVERY - Check whether ferm is active by checking the default input chain on elastic1060 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:18:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:18:59] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST gateways) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:20:07] RECOVERY - Check whether ferm is active by checking the default input chain on elastic1064 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:20:07] RECOVERY - Check whether ferm is active by checking the default input chain on elastic1063 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:20:28] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4047.ulsfo.wmnet with OS buster [00:23:07] RECOVERY - Check systemd state on elastic1067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:23:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:25:33] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS buster [00:26:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T321312)', diff saved to https://phabricator.wikimedia.org/P35787 and previous config saved to /var/cache/conftool/dbconfig/20221021-002624-ladsgroup.json [00:27:17] !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage [00:27:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P35788 and previous config saved to /var/cache/conftool/dbconfig/20221021-002739-ladsgroup.json [00:27:55] RECOVERY - Check systemd state on elastic1065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:21] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage [00:32:49] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P35789 and previous config saved to /var/cache/conftool/dbconfig/20221021-003247-ladsgroup.json [00:32:49] RECOVERY - Check whether ferm is active by checking the default input chain on elastic1067 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:34:36] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T321310 [00:38:39] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4047.ulsfo.wmnet with OS buster [00:39:08] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T321310 [00:42:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P35790 and previous config saved to /var/cache/conftool/dbconfig/20221021-004246-ladsgroup.json [00:45:39] (03PS1) 10Ryan Kemper: elastic: run puppet in correct place [cookbooks] - 10https://gerrit.wikimedia.org/r/845086 [00:47:45] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt2001-dev.codfw.wmnet with OS bullseye [00:47:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P35791 and previous config saved to /var/cache/conftool/dbconfig/20221021-004754-ladsgroup.json [00:48:00] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4008.ulsfo.wmnet with OS bullseye [00:48:07] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye completed: - lvs4008 (*... 
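The paired START / END (PASS|FAIL) entries above are written to the log automatically by the cookbook runner on the cumin hosts, together with the exit code of each run. As a minimal sketch only - the cookbook names and the OS/task values are taken from the log, but the flag spellings are recalled from memory and may not match the current cookbooks - a reimage like the lvs4008 one is typically kicked off along these lines:

    # on a cumin host; cookbook names match the START/END entries above,
    # but treat the exact flags as an assumption, not a verified interface
    sudo cookbook sre.hosts.reimage --os bullseye -t T317247 lvs4008

    # the 2:00:00 downtime seen in the log is set by a companion cookbook, e.g.
    sudo cookbook sre.hosts.downtime --hours 2 -r "host reimage" 'lvs4008.ulsfo.wmnet'

A non-zero exit code shows up as END (FAIL) or END (ERROR), which is why the cp4047 reimage is retried several times in this log.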
[00:57:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P35792 and previous config saved to /var/cache/conftool/dbconfig/20221021-005752-ladsgroup.json [00:59:28] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10RobH) [01:01:56] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS buster [01:03:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T321312)', diff saved to https://phabricator.wikimedia.org/P35793 and previous config saved to /var/cache/conftool/dbconfig/20221021-010301-ladsgroup.json [01:03:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [01:03:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [01:03:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1199 (T321312)', diff saved to https://phabricator.wikimedia.org/P35794 and previous config saved to /var/cache/conftool/dbconfig/20221021-010325-ladsgroup.json [01:09:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T321312)', diff saved to https://phabricator.wikimedia.org/P35795 and previous config saved to /var/cache/conftool/dbconfig/20221021-010944-ladsgroup.json [01:13:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P35796 and previous config saved to /var/cache/conftool/dbconfig/20221021-011259-ladsgroup.json [01:13:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [01:13:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [01:13:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T321312)', diff saved to https://phabricator.wikimedia.org/P35797 and previous config saved to /var/cache/conftool/dbconfig/20221021-011324-ladsgroup.json [01:14:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T321312)', diff saved to https://phabricator.wikimedia.org/P35798 and previous config saved to /var/cache/conftool/dbconfig/20221021-011452-ladsgroup.json [01:16:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [01:19:26] (03CR) 10Ryan Kemper: [C: 03+2] elastic: run puppet in correct place [cookbooks] - 10https://gerrit.wikimedia.org/r/845086 (owner: 10Ryan Kemper) [01:22:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T321312)', diff saved to https://phabricator.wikimedia.org/P35799 and previous config saved to /var/cache/conftool/dbconfig/20221021-012213-ladsgroup.json [01:22:37] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T321310 [01:24:15] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4047.ulsfo.wmnet with OS buster 
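The long runs of dbctl commit entries above and below record a depool / maintain / repool cycle for one database instance at a time. Below is a minimal sketch of what one such cycle looks like from the operator's side, assuming the usual dbctl subcommands: the instance name, task id and commit messages are taken from the log, while the depool/pool/commit flag names are an assumption from memory.

    # take the replica out of rotation and commit the change
    dbctl instance db1199 depool
    dbctl config commit -m 'Depooling db1199 (T321312)'

    # ... run the maintenance / schema change on db1199 ...

    # bring it back gradually; the repeated "Repooling after maintenance" commits
    # for the same instance presumably correspond to stepwise percentage increases
    dbctl instance db1199 pool -p 25
    dbctl config commit -m 'Repooling after maintenance db1199 (T321312)'
    dbctl instance db1199 pool -p 75
    dbctl config commit -m 'Repooling after maintenance db1199'
    dbctl instance db1199 pool -p 100
    dbctl config commit -m 'Repooling after maintenance db1199'

Each commit is what produces the "diff saved to ... previous config saved to /var/cache/conftool/dbconfig/..." lines recorded here.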
[01:24:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P35800 and previous config saved to /var/cache/conftool/dbconfig/20221021-012450-ladsgroup.json [01:35:04] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:37:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P35801 and previous config saved to /var/cache/conftool/dbconfig/20221021-013720-ladsgroup.json [01:37:45] (JobUnavailable) firing: (6) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:39:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P35802 and previous config saved to /var/cache/conftool/dbconfig/20221021-013957-ladsgroup.json [01:42:36] PROBLEM - Check systemd state on elastic1083 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:43:02] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [2000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [01:44:38] RECOVERY - Check systemd state on elastic1083 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:47:08] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is OK: OK: Less than 20.00% above the threshold [1200.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [01:47:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P35803 and previous config saved to /var/cache/conftool/dbconfig/20221021-015226-ladsgroup.json [01:52:28] PROBLEM - Check systemd state on elastic1068 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:58] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T321310 [01:54:32] RECOVERY - Check systemd state on elastic1068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:55:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T321312)', diff saved to https://phabricator.wikimedia.org/P35804 and previous config saved to /var/cache/conftool/dbconfig/20221021-015503-ladsgroup.json [01:55:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:55:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [02:07:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T321312)', diff saved to https://phabricator.wikimedia.org/P35805 and previous config saved to /var/cache/conftool/dbconfig/20221021-020733-ladsgroup.json [02:07:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T321312)', diff saved to https://phabricator.wikimedia.org/P35806 and previous config saved to /var/cache/conftool/dbconfig/20221021-021250-ladsgroup.json [02:27:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P35807 and previous config saved to /var/cache/conftool/dbconfig/20221021-022757-ladsgroup.json [02:29:28] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2085-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [02:36:02] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:43:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P35808 and previous config saved to /var/cache/conftool/dbconfig/20221021-024303-ladsgroup.json [02:58:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db2138:3314 (T321312)', diff saved to https://phabricator.wikimedia.org/P35809 and previous config saved to /var/cache/conftool/dbconfig/20221021-025809-ladsgroup.json [02:58:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [02:58:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [02:58:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T321312)', diff saved to https://phabricator.wikimedia.org/P35810 and previous config saved to /var/cache/conftool/dbconfig/20221021-025836-ladsgroup.json [03:05:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T321312)', diff saved to https://phabricator.wikimedia.org/P35811 and previous config saved to /var/cache/conftool/dbconfig/20221021-030531-ladsgroup.json [03:20:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P35812 and previous config saved to /var/cache/conftool/dbconfig/20221021-032037-ladsgroup.json [03:35:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P35813 and previous config saved to /var/cache/conftool/dbconfig/20221021-033544-ladsgroup.json [03:50:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T321312)', diff saved to https://phabricator.wikimedia.org/P35814 and previous config saved to /var/cache/conftool/dbconfig/20221021-035050-ladsgroup.json [03:50:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [03:51:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [03:51:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [03:51:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [03:51:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T321312)', diff saved to https://phabricator.wikimedia.org/P35815 and previous config saved to /var/cache/conftool/dbconfig/20221021-035120-ladsgroup.json [03:56:18] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:57:48] RECOVERY - Host parse1001.mgmt is UP: PING WARNING - Packet loss = 71%, RTA = 1.64 ms [03:58:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T321312)', diff saved to https://phabricator.wikimedia.org/P35816 and previous config saved to /var/cache/conftool/dbconfig/20221021-035848-ladsgroup.json [04:13:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P35817 and previous config saved to /var/cache/conftool/dbconfig/20221021-041354-ladsgroup.json [04:19:13] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST gateways) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:23:06] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:28:08] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP 
CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:29:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P35818 and previous config saved to /var/cache/conftool/dbconfig/20221021-042901-ladsgroup.json [04:37:12] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:44:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T321312)', diff saved to https://phabricator.wikimedia.org/P35819 and previous config saved to /var/cache/conftool/dbconfig/20221021-044407-ladsgroup.json [04:44:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [04:44:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [04:44:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T321312)', diff saved to https://phabricator.wikimedia.org/P35820 and previous config saved to /var/cache/conftool/dbconfig/20221021-044433-ladsgroup.json [04:48:04] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.74 ms [04:48:53] (03PS1) 10Giuseppe Lavagetto: Fix broken links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845277 [04:50:02] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 4 (graphite1005, ...), No backups: 2 (graphite1005, ...), Fresh: 118 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:50:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T321312)', diff saved to https://phabricator.wikimedia.org/P35821 and previous config saved to /var/cache/conftool/dbconfig/20221021-045051-ladsgroup.json [04:54:23] (03CR) 10Giuseppe Lavagetto: "For the record:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843995 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [05:02:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:05:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P35822 and previous config saved to /var/cache/conftool/dbconfig/20221021-050558-ladsgroup.json [05:07:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:16:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [05:21:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P35823 and previous config saved to /var/cache/conftool/dbconfig/20221021-052104-ladsgroup.json [05:36:11] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T321312)', diff saved to https://phabricator.wikimedia.org/P35824 and previous config saved to /var/cache/conftool/dbconfig/20221021-053611-ladsgroup.json [05:36:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [05:36:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [05:36:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T321312)', diff saved to https://phabricator.wikimedia.org/P35825 and previous config saved to /var/cache/conftool/dbconfig/20221021-053636-ladsgroup.json [05:38:10] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:42:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T321312)', diff saved to https://phabricator.wikimedia.org/P35826 and previous config saved to /var/cache/conftool/dbconfig/20221021-054258-ladsgroup.json [05:58:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P35827 and previous config saved to /var/cache/conftool/dbconfig/20221021-055804-ladsgroup.json [06:13:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P35828 and previous config saved to /var/cache/conftool/dbconfig/20221021-061311-ladsgroup.json [06:28:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T321312)', diff saved to https://phabricator.wikimedia.org/P35829 and previous config saved to /var/cache/conftool/dbconfig/20221021-062817-ladsgroup.json [06:28:20] (03PS4) 10ArielGlenn: Create temporary dir for html dumps [puppet] - 10https://gerrit.wikimedia.org/r/844984 (https://phabricator.wikimedia.org/T319269) (owner: 10Hokwelum) [06:29:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2085-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [06:30:30] (03CR) 10ArielGlenn: [C: 03+2] Create temporary dir for html dumps [puppet] - 10https://gerrit.wikimedia.org/r/844984 (https://phabricator.wikimedia.org/T319269) (owner: 10Hokwelum) [06:53:06] (03PS18) 10Slyngshede: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [06:55:48] (03CR) 10Slyngshede: role::idm Basic deployment of IDM (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [06:58:49] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 36692 [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221021T0700) [07:00:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 36692 [07:05:16] (03PS3) 10Sohom Datta: Enable source links on Translation ns on enwikisource and thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843593 (https://phabricator.wikimedia.org/T53980) [07:09:36] (03CR) 10Sohom Datta: "Planning on deploying this on 24th Oct" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843593 (https://phabricator.wikimedia.org/T53980) (owner: 10Sohom Datta) [07:10:10] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845277 (owner: 10Giuseppe Lavagetto) [07:12:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845277 (owner: 10Giuseppe Lavagetto) [07:12:58] (03Merged) 10jenkins-bot: Fix broken links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845277 (owner: 10Giuseppe Lavagetto) [07:13:14] !log oblivian@deploy1002 Started scap: Backport for [[gerrit:845277|Fix broken links]] [07:13:34] !log oblivian@deploy1002 oblivian and oblivian: Backport for [[gerrit:845277|Fix broken links]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [07:20:25] !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:845277|Fix broken links]] (duration: 07m 11s) [07:28:32] (03CR) 10JMeybohm: New organization of templates (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 (owner: 10Giuseppe Lavagetto) [07:37:19] !log start of rolling restart of backup hosts [07:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:45] (JobUnavailable) firing: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:52:54] (03PS1) 10Jcrespo: bacula: Identify the 'backup' role as data persistence owned [puppet] - 10https://gerrit.wikimedia.org/r/845407 (https://phabricator.wikimedia.org/T321310) [07:55:30] (03CR) 10David Caro: alerts.downtime_host: attempt to match alert hostnames with : (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott) [07:56:11] the bacula job is me, as bacula was briefly unavailable and it is a single job, should be back soon [07:58:22] mmm, it doesn't come back, I will check why it is still failing after the host came back [07:59:51] the exporter needed a restart, it had failed too many times - I think I have to add a hook so it restarts after bacula is started, like we did for mysql and its exporter [08:00:22] (not super important as it is well monitored if it fails) [08:00:45] (JobUnavailable) resolved: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:04:09] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/845030 (https://phabricator.wikimedia.org/T315486) (owner: 10Btullis) [08:07:39] (03CR) 10Jcrespo: [C: 03+2] bacula: Identify the 'backup' role as data persistence owned [puppet] -
10https://gerrit.wikimedia.org/r/845407 (https://phabricator.wikimedia.org/T321310) (owner: 10Jcrespo) [08:19:05] (03PS4) 10Filippo Giunchedi: analytics: move kerberos::systemd_timer and deps to send_mail param [puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) [08:19:13] (03CR) 10Filippo Giunchedi: analytics: move kerberos::systemd_timer and deps to send_mail param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [08:19:13] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST gateways) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:20:04] (03CR) 10Filippo Giunchedi: analytics: move kerberos::systemd_timer and deps to send_mail param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [08:27:37] (03PS1) 10Elukey: sre.k8s.reboot-nodes: add support for ml-staging and fix dse-k8s [cookbooks] - 10https://gerrit.wikimedia.org/r/845430 (https://phabricator.wikimedia.org/T321310) [08:29:28] (03PS1) 10Elukey: cumin: add alias for ml-staging worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/845434 [08:30:50] (03PS2) 10Elukey: cumin: add alias for ml-staging worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/845434 [08:31:24] (03CR) 10CI reject: [V: 04-1] sre.k8s.reboot-nodes: add support for ml-staging and fix dse-k8s [cookbooks] - 10https://gerrit.wikimedia.org/r/845430 (https://phabricator.wikimedia.org/T321310) (owner: 10Elukey) [08:33:52] (03PS2) 10Elukey: sre.k8s.reboot-nodes: add support for ml-staging and fix dse-k8s [cookbooks] - 10https://gerrit.wikimedia.org/r/845430 (https://phabricator.wikimedia.org/T321310) [08:34:34] (03CR) 10Elukey: [C: 03+2] cumin: add alias for ml-staging worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/845434 (owner: 10Elukey) [08:40:06] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:40:21] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [08:40:44] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:40:56] !log elukey@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw [08:40:58] (03CR) 10Klausman: [C: 03+1] sre.k8s.reboot-nodes: add support for ml-staging and fix dse-k8s [cookbooks] - 10https://gerrit.wikimedia.org/r/845430 (https://phabricator.wikimedia.org/T321310) (owner: 10Elukey) [08:46:58] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:47:59] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2001.codfw.wmnet [08:50:32] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:51:38] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:51:58] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:52:41] !log finished rolling restart of backup hosts [08:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:02] !log start of rolling restart of dbprov hosts [08:53:23] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2001.codfw.wmnet [08:53:31] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2002.codfw.wmnet [08:53:44] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 157, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:54:04] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:58:54] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2002.codfw.wmnet [08:58:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST gateways) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:59:45] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab2003.wikimedia.org [09:03:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST gateways) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:04:31] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2002.codfw.wmnet [09:05:39] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1002.eqiad.wmnet [09:06:52] (03PS1) 10Hashar: .gitmodules: translations migrated to Gerrit [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/845459 (https://phabricator.wikimedia.org/T321350) [09:06:54] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2003.wikimedia.org [09:07:00] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:07:32] (03CR) 10Hashar: "I have moved the translations from Phabricator to Gerrit T321350#8334913" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/845459 (https://phabricator.wikimedia.org/T321350) (owner: 10Hashar) [09:08:02] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:08:58] (03PS1) 10Btullis: Grant analytics-admins the right to run commands as the yarn user [puppet] - 10https://gerrit.wikimedia.org/r/845462 (https://phabricator.wikimedia.org/T321378) [09:08:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST gateways) on k8s-mlserve@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:09:24] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:09:54] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2002.codfw.wmnet [09:10:04] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:10:15] !log finished rolling restart of dbprov hosts [09:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:28] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:10:42] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org [09:10:48] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:11:02] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2001.codfw.wmnet [09:12:10] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:12:34] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10ayounsi) > Long term wise, I can take the action item and check with Julianne on how long they need to keep track of these devices. Good idea! I think also how they need to... [09:13:34] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:13:58] (KubernetesCalicoDown) firing: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:13:58] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (LIST gateways) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:14:40] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 157, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:14:58] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1002.eqiad.wmnet [09:15:00] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:15:38] (03CR) 10Btullis: [C: 03+1] "Looks good to me. 
I'll abandon my similar change for dse-k8s only in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/845028" [cookbooks] - 10https://gerrit.wikimedia.org/r/845430 (https://phabricator.wikimedia.org/T321310) (owner: 10Elukey) [09:16:18] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1001.eqiad.wmnet [09:16:34] (03CR) 10Majavah: OpenStack HAProxy: support frontend ferm rules into haproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845063 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [09:16:48] PROBLEM - Check systemd state on ml-serve-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:17:26] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:18:05] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1003.wikimedia.org [09:18:18] (03CR) 10Elukey: [C: 03+2] sre.k8s.reboot-nodes: add support for ml-staging and fix dse-k8s [cookbooks] - 10https://gerrit.wikimedia.org/r/845430 (https://phabricator.wikimedia.org/T321310) (owner: 10Elukey) [09:18:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:18:30] (03Abandoned) 10Btullis: Fix the sre.k8s.reboot-nodes cookbook for dse-k8s [cookbooks] - 10https://gerrit.wikimedia.org/r/845028 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [09:18:58] (KubernetesCalicoDown) resolved: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:20:50] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org [09:21:39] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1001.eqiad.wmnet [09:21:54] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2001.codfw.wmnet [09:22:47] !log elukey@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-staging-worker [09:23:46] PROBLEM - Check systemd state on ml-serve-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:03] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org [09:27:30] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:27:56] RECOVERY - Check systemd state on ml-serve-ctrl2001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:29:26] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:29:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:34:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:43:28] !log klausman@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad [09:43:28] (KubernetesCalicoDown) firing: (2) ml-serve2003.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:44:33] (03CR) 10Elukey: [C: 03+1] Grant analytics-admins the right to run commands as the yarn user [puppet] - 10https://gerrit.wikimedia.org/r/845462 (https://phabricator.wikimedia.org/T321378) (owner: 10Btullis) [09:46:07] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org [09:48:28] (KubernetesCalicoDown) resolved: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:48:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:49:02] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:50:34] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:50:36] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:51:34] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:53:34] (03CR) 10Majavah: [C: 03+1] p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: 10David Caro) [09:53:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:54:03] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org [09:54:57] !log btullis@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker [09:55:28] !log btullis@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:dse-k8s-worker [09:56:56] !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw [10:00:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [10:00:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [10:01:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance [10:01:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance [10:01:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T321312)', diff saved to https://phabricator.wikimedia.org/P35830 and previous config saved to /var/cache/conftool/dbconfig/20221021-100137-ladsgroup.json [10:03:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35831 and previous config saved to /var/cache/conftool/dbconfig/20221021-100305-ladsgroup.json [10:03:35] !log elukey@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-staging-worker [10:04:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [10:04:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [10:05:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_webrequest_partitions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:12] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:06:14] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:07:47] !log btullis@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker [10:07:48] !log btullis@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:dse-k8s-worker [10:07:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [10:08:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [10:08:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T321312)', diff saved to https://phabricator.wikimedia.org/P35832 and previous config saved to 
/var/cache/conftool/dbconfig/20221021-100813-ladsgroup.json [10:10:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T321312)', diff saved to https://phabricator.wikimedia.org/P35833 and previous config saved to /var/cache/conftool/dbconfig/20221021-101009-ladsgroup.json [10:10:18] (03CR) 10Btullis: [C: 03+2] Grant analytics-admins the right to run commands as the yarn user [puppet] - 10https://gerrit.wikimedia.org/r/845462 (https://phabricator.wikimedia.org/T321378) (owner: 10Btullis) [10:13:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T321312)', diff saved to https://phabricator.wikimedia.org/P35834 and previous config saved to /var/cache/conftool/dbconfig/20221021-101336-ladsgroup.json [10:13:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:18:42] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:18:57] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-codfw [10:18:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:19:48] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:24:33] !log restart of ms-backup hosts [10:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P35835 and previous config saved to /var/cache/conftool/dbconfig/20221021-102516-ladsgroup.json [10:25:58] !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-eqiad [10:28:04] 10SRE, 10ops-eqiad: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T321315 (10jcrespo) [10:28:16] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10jcrespo) [10:28:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P35836 and previous config saved to /var/cache/conftool/dbconfig/20221021-102842-ladsgroup.json [10:29:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2085-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [10:30:12] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10jcrespo) Disk finished rebuilding: ` 
/usr/local/lib/nagios/plugins/get-raid-status-perccli communication 0 OK | controller: 0 OK | physical_disk: 0 OK | virtual_disk: 0 OK | bbu: 0 OK | enc... [10:32:21] 10SRE-tools, 10Icinga, 10Infrastructure-Foundations: get-raid-status-perccli should allow for commands to return non-zero exit code - https://phabricator.wikimedia.org/T320998 (10jcrespo) [10:33:14] (03PS1) 10FNegri: Add Tekton deb repository [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) [10:34:24] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:35:22] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:37:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:40:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P35837 and previous config saved to /var/cache/conftool/dbconfig/20221021-104022-ladsgroup.json [10:42:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:43:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P35838 and previous config saved to /var/cache/conftool/dbconfig/20221021-104349-ladsgroup.json [10:44:36] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:47:17] (03CR) 10Jbond: "LGTM i think its probably also worth adding bookworm, this is useful for backporting where you want to grab the bookworm version and rebui" [puppet] - 10https://gerrit.wikimedia.org/r/845036 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [10:47:20] (03CR) 10Jbond: [C: 03+1] package_builder: add deb-src for buster [puppet] - 10https://gerrit.wikimedia.org/r/845036 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [10:49:27] !log elukey@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-codfw [10:49:35] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-eqiad [10:55:21] (03CR) 10Jbond: [C: 03+1] "LGTM but probably also worth asking in service ops just in case" [puppet] - 10https://gerrit.wikimedia.org/r/845027 (owner: 10JHathaway) [10:55:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T321312)', diff saved to https://phabricator.wikimedia.org/P35839 and previous config saved to /var/cache/conftool/dbconfig/20221021-105529-ladsgroup.json [10:57:38] (03PS2) 10Daniel Kinzler: Enable parsoid cache warming on testwiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/843955 (https://phabricator.wikimedia.org/T320535) [10:58:06] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster2001.codfw.wmnet [10:58:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35840 and previous config saved to /var/cache/conftool/dbconfig/20221021-105845-ladsgroup.json [10:58:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T321312)', diff saved to https://phabricator.wikimedia.org/P35841 and previous config saved to /var/cache/conftool/dbconfig/20221021-105855-ladsgroup.json [10:59:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [10:59:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [10:59:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T321312)', diff saved to https://phabricator.wikimedia.org/P35842 and previous config saved to /var/cache/conftool/dbconfig/20221021-105921-ladsgroup.json [11:05:05] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster2001.codfw.wmnet [11:05:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:05:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T321312)', diff saved to https://phabricator.wikimedia.org/P35843 and previous config saved to /var/cache/conftool/dbconfig/20221021-110544-ladsgroup.json [11:06:22] (03CR) 10MVernon: [C: 03+2] swift: drain ms-be2050 [puppet] - 10https://gerrit.wikimedia.org/r/845502 (https://phabricator.wikimedia.org/T308677) (owner: 10MVernon) [11:06:53] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster1001.eqiad.wmnet [11:09:20] (03PS1) 10Jbond: sre.hosts.reboot-cluster: fix argument parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/845508 [11:10:18] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:10:48] (03PS2) 10Jbond: sre.hosts.reboot-cluster: fix argument parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/845508 [11:11:12] (03CR) 10Clément Goubert: [C: 03+1] sre.hosts.reboot-cluster: fix argument parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/845508 (owner: 10Jbond) [11:12:26] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:13:41] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster1001.eqiad.wmnet [11:13:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P35844 and previous config saved to 
/var/cache/conftool/dbconfig/20221021-111352-ladsgroup.json [11:13:58] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:14:58] (03CR) 10CI reject: [V: 04-1] sre.hosts.reboot-cluster: fix argument parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/845508 (owner: 10Jbond) [11:16:48] (03PS3) 10Jbond: sre.hosts.reboot-cluster: fix argument parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/845508 [11:18:08] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:18:44] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:20:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P35845 and previous config saved to /var/cache/conftool/dbconfig/20221021-112050-ladsgroup.json [11:20:56] (03CR) 10Clément Goubert: [C: 03+1] sre.hosts.reboot-cluster: fix argument parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/845508 (owner: 10Jbond) [11:21:18] (03CR) 10Jbond: [C: 03+2] sre.hosts.reboot-cluster: fix argument parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/845508 (owner: 10Jbond) [11:22:02] 10SRE, 10ops-eqiad: msw-c5-eqiad offline - https://phabricator.wikimedia.org/T321311 (10Jclark-ctr) Updated Netbox with cable [11:22:42] !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-codfw [11:27:23] !log rolling reboot of eqiad swift frontends re October reboots [11:27:25] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-cluster [11:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P35846 and previous config saved to /var/cache/conftool/dbconfig/20221021-112859-ladsgroup.json [11:35:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P35847 and previous config saved to /var/cache/conftool/dbconfig/20221021-113556-ladsgroup.json [11:35:58] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:36:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:44:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35848 and previous config saved to /var/cache/conftool/dbconfig/20221021-114405-ladsgroup.json [11:44:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [11:44:10] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:44:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [11:44:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P35849 and previous config saved to /var/cache/conftool/dbconfig/20221021-114429-ladsgroup.json [11:44:40] PROBLEM - SSH on db1121.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:45:19] (03PS1) 10Jbond: sre.SREBatchRunner: add max failed argument [cookbooks] - 10https://gerrit.wikimedia.org/r/845515 [11:45:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35850 and previous config saved to /var/cache/conftool/dbconfig/20221021-114553-ladsgroup.json [11:47:00] !log klausman@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-eqiad [11:47:36] (03PS1) 10Matthias Mullie: Add mediawiki.searchpreview schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845518 (https://phabricator.wikimedia.org/T321069) [11:47:57] (03CR) 10Matthias Mullie: [C: 04-2] "DNM; schema still in the works" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845518 (https://phabricator.wikimedia.org/T321069) (owner: 10Matthias Mullie) [11:48:46] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:52] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [11:48:55] (03CR) 10CI reject: [V: 04-1] sre.SREBatchRunner: add max failed argument [cookbooks] - 10https://gerrit.wikimedia.org/r/845515 (owner: 10Jbond) [11:51:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T321312)', diff saved to https://phabricator.wikimedia.org/P35851 and previous config saved to /var/cache/conftool/dbconfig/20221021-115103-ladsgroup.json [11:51:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [11:51:20] PROBLEM - Check systemd state on kubernetes2006 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [11:51:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2161 (T321312)', diff saved to https://phabricator.wikimedia.org/P35852 and previous config saved to /var/cache/conftool/dbconfig/20221021-115128-ladsgroup.json [11:51:46] !log rolling reboot of codfw swift frontends re October reboots [11:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:11] (03PS1) 10Jbond: O:insetup: drop role contact I/F [puppet] - 10https://gerrit.wikimedia.org/r/845519 [11:52:13] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-cluster [11:52:24] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:52:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P35853 and previous config saved to /var/cache/conftool/dbconfig/20221021-115255-ladsgroup.json [11:53:16] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 4.491 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:54:42] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:55:43] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "we likely need an update to aptrepo/files/distributions-wikimedia as well." [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [11:57:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T321312)', diff saved to https://phabricator.wikimedia.org/P35854 and previous config saved to /var/cache/conftool/dbconfig/20221021-115742-ladsgroup.json [11:59:10] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:03:51] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes2006.codfw.wmnet [12:06:00] RECOVERY - Check systemd state on kubernetes2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:48] (03CR) 10FNegri: Add Tekton deb repository (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [12:08:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P35855 and previous config saved to /var/cache/conftool/dbconfig/20221021-120802-ladsgroup.json [12:09:05] (03CR) 10Jbond: "should discuss this more when you are back" [puppet] - 10https://gerrit.wikimedia.org/r/845519 (owner: 10Jbond) [12:09:11] (03CR) 10Jbond: [C: 03+2] O:insetup: drop role contact I/F [puppet] - 10https://gerrit.wikimedia.org/r/845519 (owner: 10Jbond) [12:09:17] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2006.codfw.wmnet [12:09:46] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:09:56] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:11:10] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [12:11:50] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 157, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:11:58] (03PS2) 10FNegri: Add Tekton deb repository [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) [12:12:02] RECOVERY - BGP status on cr1-codfw is OK: BGP 
OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:12:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P35856 and previous config saved to /var/cache/conftool/dbconfig/20221021-121249-ladsgroup.json [12:12:54] (03CR) 10FNegri: Add Tekton deb repository (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [12:13:56] (03CR) 10FNegri: Add Tekton deb repository (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [12:23:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P35857 and previous config saved to /var/cache/conftool/dbconfig/20221021-122308-ladsgroup.json [12:23:10] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=kubernetes,service=kubesvc,name=kubernetes2006.codfw.wmnet [12:25:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1013:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [12:27:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P35858 and previous config saved to /var/cache/conftool/dbconfig/20221021-122755-ladsgroup.json [12:30:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1013:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [12:32:12] (03PS1) 10Filippo Giunchedi: prometheus: move service_catalog_targets under ::targets [puppet] - 10https://gerrit.wikimedia.org/r/845528 (https://phabricator.wikimedia.org/T310266) [12:32:14] (03PS1) 10Filippo Giunchedi: prometheus: probe SSH on mgmt network [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) [12:33:21] (03CR) 10CI reject: [V: 04-1] prometheus: probe SSH on mgmt network [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [12:34:36] (03CR) 10CI reject: [V: 04-1] prometheus: move service_catalog_targets under ::targets [puppet] - 10https://gerrit.wikimedia.org/r/845528 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [12:35:01] !log rebooted kubernetes2006.codfw.wmnet manually - root cause T273026 [12:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:07] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [12:37:31] (03PS2) 10Filippo Giunchedi: prometheus: probe SSH on mgmt network [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) [12:38:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T321312)', diff saved to 
https://phabricator.wikimedia.org/P35859 and previous config saved to /var/cache/conftool/dbconfig/20221021-123815-ladsgroup.json [12:38:15] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/845528 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [12:41:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35860 and previous config saved to /var/cache/conftool/dbconfig/20221021-124132-ladsgroup.json [12:41:59] !log restarting blazegraph on wdqs1013 (BlazegraphFreeAllocatorsDecreasingRapidly) [12:42:04] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes2023.codfw.wmnet [12:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T321312)', diff saved to https://phabricator.wikimedia.org/P35861 and previous config saved to /var/cache/conftool/dbconfig/20221021-124302-ladsgroup.json [12:43:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [12:43:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [12:43:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T321312)', diff saved to https://phabricator.wikimedia.org/P35862 and previous config saved to /var/cache/conftool/dbconfig/20221021-124327-ladsgroup.json [12:44:02] (03PS2) 10Filippo Giunchedi: prometheus: move service_catalog_targets under ::targets [puppet] - 10https://gerrit.wikimedia.org/r/845528 (https://phabricator.wikimedia.org/T310266) [12:44:04] (03PS3) 10Filippo Giunchedi: prometheus: probe SSH on mgmt network [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) [12:45:44] RECOVERY - SSH on db1121.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:46:43] (03CR) 10CI reject: [V: 04-1] prometheus: move service_catalog_targets under ::targets [puppet] - 10https://gerrit.wikimedia.org/r/845528 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [12:47:45] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2023.codfw.wmnet [12:48:19] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes2024.codfw.wmnet [12:49:00] (03CR) 10Filippo Giunchedi: "I'm having troubles understanding why CI fails on seemingly unrelated tests:" [puppet] - 10https://gerrit.wikimedia.org/r/845528 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [12:49:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T321312)', diff saved to https://phabricator.wikimedia.org/P35863 and previous config saved to /var/cache/conftool/dbconfig/20221021-124950-ladsgroup.json [12:55:33] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2024.codfw.wmnet [12:56:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P35864 and previous config saved to /var/cache/conftool/dbconfig/20221021-125639-ladsgroup.json [13:00:23] !log cgoubert@cumin1001 END (FAIL) - Cookbook 
sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-worker-codfw [13:04:51] (03CR) 10Arturo Borrero Gonzalez: Add Tekton deb repository (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [13:04:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P35865 and previous config saved to /var/cache/conftool/dbconfig/20221021-130456-ladsgroup.json [13:07:02] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS buster [13:09:51] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2040.codfw.wmnet [13:11:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P35866 and previous config saved to /var/cache/conftool/dbconfig/20221021-131145-ladsgroup.json [13:13:01] (03PS3) 10Ssingh: package_builder: add deb-src for buster and bookworm [puppet] - 10https://gerrit.wikimedia.org/r/845036 (https://phabricator.wikimedia.org/T321309) [13:13:48] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37672/console" [puppet] - 10https://gerrit.wikimedia.org/r/845036 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:14:14] (03CR) 10Ssingh: package_builder: add deb-src for buster and bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845036 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:14:34] (03CR) 10Ssingh: [C: 03+2] aptrepo: add thirdparty/haproxy24 for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/844983 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:15:06] jbond: ok to merge your change? [13:15:11] jbond: O:insetup: drop role contact I/F (5fd9cafaf8) [13:16:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [13:16:51] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2040.codfw.wmnet [13:17:12] !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad [13:18:08] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2041.codfw.wmnet [13:18:39] jbond: given it seems like a trivial change, I will go ahead and merge [13:19:22] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:19:32] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4047.ulsfo.wmnet with OS buster [13:20:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P35867 and previous config saved to /var/cache/conftool/dbconfig/20221021-132003-ladsgroup.json [13:20:26] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:22:20] (03PS1) 10Elukey: conftool-data: update dse-k8s node list [puppet] - 10https://gerrit.wikimedia.org/r/845544 [13:23:38] (03CR) 10Btullis: [C: 03+1] "Nice. 
Thanks ever so much." [puppet] - 10https://gerrit.wikimedia.org/r/845544 (owner: 10Elukey) [13:24:24] (03CR) 10Elukey: [C: 03+2] conftool-data: update dse-k8s node list [puppet] - 10https://gerrit.wikimedia.org/r/845544 (owner: 10Elukey) [13:25:46] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2041.codfw.wmnet [13:26:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35868 and previous config saved to /var/cache/conftool/dbconfig/20221021-132652-ladsgroup.json [13:26:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance [13:27:06] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2042.codfw.wmnet [13:27:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance [13:27:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1104 (T321312)', diff saved to https://phabricator.wikimedia.org/P35869 and previous config saved to /var/cache/conftool/dbconfig/20221021-132716-ladsgroup.json [13:27:56] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1041.eqiad.wmnet [13:28:23] !log btullis@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker [13:31:44] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply updates - bking@cumin2002 - T321310 [13:32:15] !log kubernetes1005:~$ sudo systemctl reset-failed ifup@ens13.service T273026 [13:32:17] (03CR) 10Jbond: [C: 03+1] "thanks lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/845036 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:32:21] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:21] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:32:22] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [13:32:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T321312)', diff saved to https://phabricator.wikimedia.org/P35870 and previous config saved to /var/cache/conftool/dbconfig/20221021-133231-ladsgroup.json [13:33:47] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2042.codfw.wmnet [13:34:01] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=kubernetes,service=kubesvc,name=kubernetes1005.eqiad.wmnet [13:34:46] (03CR) 10Ssingh: [C: 03+2] package_builder: add deb-src for buster and bookworm [puppet] - 10https://gerrit.wikimedia.org/r/845036 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:34:47] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1041.eqiad.wmnet [13:35:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T321312)', diff saved 
to https://phabricator.wikimedia.org/P35871 and previous config saved to /var/cache/conftool/dbconfig/20221021-133509-ladsgroup.json [13:35:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [13:35:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [13:35:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T321312)', diff saved to https://phabricator.wikimedia.org/P35872 and previous config saved to /var/cache/conftool/dbconfig/20221021-133534-ladsgroup.json [13:38:55] PROBLEM - Check systemd state on cloudelastic1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:07] RECOVERY - Check systemd state on cloudelastic1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:05] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:41:06] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:41:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T321312)', diff saved to https://phabricator.wikimedia.org/P35873 and previous config saved to /var/cache/conftool/dbconfig/20221021-134153-ladsgroup.json [13:44:02] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2043.codfw.wmnet [13:45:40] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1042.eqiad.wmnet [13:47:11] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:47:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P35874 and previous config saved to /var/cache/conftool/dbconfig/20221021-134737-ladsgroup.json [13:48:01] (03PS1) 10Ssingh: cp4047: temporarily remove references [puppet] - 10https://gerrit.wikimedia.org/r/845546 [13:48:07] PROBLEM - Check systemd state on cloudelastic1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:45] RECOVERY - Check systemd state on cloudelastic1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:25] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:52:04] (03CR) 10Ssingh: [C: 03+2] cp4047: temporarily remove references [puppet] - 10https://gerrit.wikimedia.org/r/845546 (owner: 10Ssingh) [13:53:11] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - 
AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:57:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P35875 and previous config saved to /var/cache/conftool/dbconfig/20221021-135659-ladsgroup.json [14:00:03] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be1042.eqiad.wmnet [14:00:07] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2043.codfw.wmnet [14:00:19] PROBLEM - Check systemd state on ms-be1042 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:43] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:02:43] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:02:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P35876 and previous config saved to /var/cache/conftool/dbconfig/20221021-140245-ladsgroup.json [14:03:11] PROBLEM - Check systemd state on ms-be1063 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:23] (03PS1) 10Ssingh: cp4049: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/845550 (https://phabricator.wikimedia.org/T317244) [14:06:29] RECOVERY - Check systemd state on ms-be1042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:22] (03CR) 10Ssingh: [C: 03+2] cp4049: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/845550 (https://phabricator.wikimedia.org/T317244) (owner: 10Ssingh) [14:07:37] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2044.codfw.wmnet [14:09:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST metrics) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:11:25] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1063 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:11:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10Cmjohnson) The dns has been updated but I am not getting any mgmt connection, I need to check to make sure the mgmt cables are conne... 
[14:11:38] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1040.eqiad.wmnet [14:12:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P35877 and previous config saved to /var/cache/conftool/dbconfig/20221021-141206-ladsgroup.json [14:12:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4049.ulsfo.wmnet with OS buster [14:13:01] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:01] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:11] PROBLEM - Check systemd state on cloudelastic1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:25] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v1/list/languagepairs (Get all the language pairs) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [14:15:13] RECOVERY - Check systemd state on cloudelastic1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:15] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:16:21] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [14:16:25] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:17:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T321312)', diff saved to https://phabricator.wikimedia.org/P35878 and previous config saved to /var/cache/conftool/dbconfig/20221021-141752-ladsgroup.json [14:17:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1111.eqiad.wmnet with reason: Maintenance [14:18:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1111.eqiad.wmnet with reason: Maintenance [14:18:14] !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS buster [14:18:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T321312)', diff saved to https://phabricator.wikimedia.org/P35879 and previous config saved to /var/cache/conftool/dbconfig/20221021-141815-ladsgroup.json [14:18:17] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1040.eqiad.wmnet [14:21:01] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2044.codfw.wmnet [14:21:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:21:29] !log pool new host cp4037: T317244 [14:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:21:34] T317244: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 [14:22:00] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply updates - bking@cumin2002 - T321310 [14:22:04] !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp4037.ulsfo.wmnet,service=ats-be [14:22:04] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp4037.ulsfo.wmnet,service=ats-tls [14:22:05] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp4037.ulsfo.wmnet,service=varnish-fe [14:22:05] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet,service=ats-be [14:22:05] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet,service=ats-tls [14:22:06] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet,service=varnish-fe [14:22:20] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1043.eqiad.wmnet [14:22:25] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2045.codfw.wmnet [14:23:00] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10ssingh) [14:23:02] !log bblack@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4047.ulsfo.wmnet with OS buster [14:23:21] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:23:21] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:24:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:25:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T321312)', diff saved to https://phabricator.wikimedia.org/P35880 and previous config saved to /var/cache/conftool/dbconfig/20221021-142521-ladsgroup.json [14:25:58] (KubernetesCalicoDown) firing: dse-k8s-worker1006.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1006.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:26:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:27:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T321312)', diff saved to https://phabricator.wikimedia.org/P35881 and previous config saved to /var/cache/conftool/dbconfig/20221021-142712-ladsgroup.json [14:27:19] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [14:27:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [14:27:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [14:27:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [14:27:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T321312)', diff saved to https://phabricator.wikimedia.org/P35882 and previous config saved to /var/cache/conftool/dbconfig/20221021-142742-ladsgroup.json [14:29:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2085-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [14:29:58] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:30:15] (03PS1) 10Btullis: Add dummy passwords for the airflow database users [labs/private] - 10https://gerrit.wikimedia.org/r/845559 (https://phabricator.wikimedia.org/T319440) [14:30:17] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1068 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:30:21] PROBLEM - Check systemd state on ms-be1068 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:26] (03PS1) 10Btullis: Add a simple mechanism for creating postgresql users and databases [puppet] - 10https://gerrit.wikimedia.org/r/845560 (https://phabricator.wikimedia.org/T319440) [14:30:58] (KubernetesCalicoDown) resolved: dse-k8s-worker1006.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1006.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:33:06] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:34:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T321312)', diff saved to https://phabricator.wikimedia.org/P35883 and previous config saved to /var/cache/conftool/dbconfig/20221021-143400-ladsgroup.json [14:34:11] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37673/console" [puppet] - 10https://gerrit.wikimedia.org/r/845560 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [14:34:21] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be2045.codfw.wmnet [14:34:25] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - 
AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:36:54] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add dummy passwords for the airflow database users [labs/private] - 10https://gerrit.wikimedia.org/r/845559 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [14:37:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4049.ulsfo.wmnet with reason: host reimage [14:37:54] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37674/console" [puppet] - 10https://gerrit.wikimedia.org/r/845560 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [14:39:52] (03CR) 10JMeybohm: [C: 04-1] coredns: upgrade to 1.8.7 (036 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844499 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey) [14:40:17] (03PS3) 10FNegri: Add Tekton deb repository [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) [14:40:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P35884 and previous config saved to /var/cache/conftool/dbconfig/20221021-144028-ladsgroup.json [14:40:42] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2046.codfw.wmnet [14:41:08] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:41:14] (03PS1) 10KartikMistry: Enable Section Translation in Hawaiian, Pashto and Xhosa WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845573 (https://phabricator.wikimedia.org/T317289) [14:41:21] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4049.ulsfo.wmnet with reason: host reimage [14:41:30] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:41:58] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1063 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:43:56] RECOVERY - Check systemd state on ms-be1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:22] PROBLEM - Host ms-be1043 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:41] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply updates - bking@cumin2002 - T321310 [14:44:58] RECOVERY - Host ms-be1043 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [14:44:58] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:46:49] !log bblack@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp4047 [14:47:46] !log bblack@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4047 [14:47:49] !log mvernon@cumin1001 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host ms-be2046.codfw.wmnet [14:48:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:48:40] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2047.codfw.wmnet [14:49:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P35885 and previous config saved to /var/cache/conftool/dbconfig/20221021-144907-ladsgroup.json [14:49:11] !log btullis@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:dse-k8s-worker [14:49:32] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:49:54] PROBLEM - Host ms-be1043 is DOWN: PING CRITICAL - Packet loss = 100% [14:50:18] RECOVERY - Host ms-be1043 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [14:50:28] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:51:26] PROBLEM - Check systemd state on ms-be1043 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:36] PROBLEM - Check systemd state on elastic1092 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:36] PROBLEM - Check systemd state on elastic1090 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:00] RECOVERY - Check systemd state on ms-be1043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:54:14] RECOVERY - Check systemd state on elastic1092 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:14] RECOVERY - Check systemd state on elastic1090 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:57] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2047.codfw.wmnet [14:55:28] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1043.eqiad.wmnet [14:55:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P35886 and previous 
config saved to /var/cache/conftool/dbconfig/20221021-145534-ladsgroup.json [14:57:10] RECOVERY - Check systemd state on ms-be1068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:24] RECOVERY - Check systemd state on elastic1089 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:11] (03PS1) 10Ssingh: Revert "cp4047: temporarily remove references" [puppet] - 10https://gerrit.wikimedia.org/r/845586 [14:59:16] PROBLEM - Check systemd state on elastic1091 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:24] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:01:02] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:01:12] RECOVERY - Check systemd state on elastic1091 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:12] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1068 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:04:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P35887 and previous config saved to /var/cache/conftool/dbconfig/20221021-150413-ladsgroup.json [15:04:33] (ProbeDown) firing: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:04:57] (03PS12) 10Giuseppe Lavagetto: New organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 [15:05:30] (03CR) 10Ssingh: [C: 03+2] Revert "cp4047: temporarily remove references" [puppet] - 10https://gerrit.wikimedia.org/r/845586 (owner: 10Ssingh) [15:05:56] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4049.ulsfo.wmnet with OS buster [15:06:55] 10SRE, 10SRE-Access-Requests: MediaWiki deployment shell access request for SDunlap - https://phabricator.wikimedia.org/T321355 (10herron) p:05Triage→03Medium [15:07:18] PROBLEM - Check systemd state on elastic1101 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:09:16] RECOVERY - Check systemd state on elastic1101 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:09:46] (ProbeDown) resolved: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:10:22] PROBLEM - BGP status on cr1-eqiad is 
CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:10:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T321312)', diff saved to https://phabricator.wikimedia.org/P35888 and previous config saved to /var/cache/conftool/dbconfig/20221021-151040-ladsgroup.json [15:10:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1114.eqiad.wmnet with reason: Maintenance [15:10:56] (03PS3) 10Elukey: coredns: upgrade to 1.8.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844499 (https://phabricator.wikimedia.org/T321159) [15:10:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1114.eqiad.wmnet with reason: Maintenance [15:11:02] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:11:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T321312)', diff saved to https://phabricator.wikimedia.org/P35889 and previous config saved to /var/cache/conftool/dbconfig/20221021-151104-ladsgroup.json [15:11:55] (03CR) 10Elukey: "Sorry the code review was more WIP than ready, I wanted to ask if it was an ok-direction and left some horrors here and there :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844499 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey) [15:11:57] (03PS1) 10Herron: admin: add kindrobot to deployers [puppet] - 10https://gerrit.wikimedia.org/r/845591 (https://phabricator.wikimedia.org/T321355) [15:12:12] Bsadowski1: ok to merge yours? [15:12:15] sorry ^ [15:12:22] btullis: ^ [15:13:17] Which one? [15:13:26] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: restart gitlab-runner gracefully [puppet] - 10https://gerrit.wikimedia.org/r/844985 (owner: 10Jelto) [15:13:46] labs/private [15:13:48] I skipped it [15:13:52] +profile::analytics::postgresql::replication_password: dummydummy [15:13:52] +profile::analytics::postgresql::users: [15:15:00] (03CR) 10Dzahn: "Has the approvals and looks alright. Just the key used here does not appear on the ticket." [puppet] - 10https://gerrit.wikimedia.org/r/845591 (https://phabricator.wikimedia.org/T321355) (owner: 10Herron) [15:15:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS buster [15:16:28] sukhe: Sorry about that. 
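The ladsgroup entries throughout this window follow the usual database maintenance pattern: schedule downtime, depool the replica with dbctl, run the maintenance, then repool it (often in steps). A minimal sketch of that flow from a cumin host, assuming current dbctl subcommand names; the host, task number and commit messages below are illustrative, not copied from these log entries:

    # Depool the replica; the config change is committed to etcd for all DCs
    dbctl instance db1190 depool
    dbctl config commit -m "Depooling db1190 (T321312)"

    # ... run the schema change / maintenance on the host ...

    # Repool afterwards; in practice this is often done in several weight steps
    dbctl instance db1190 pool
    dbctl config commit -m "Repooling after maintenance db1190 (T321312)"

Each "dbctl commit (dc=all)" line in the log corresponds to one such commit, with the diff and previous config archived to Phabricator and /var/cache/conftool/dbconfig respectively.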
[15:16:41] btullis: no problem at all :) I realized later it was labs/private and should not have pinged you [15:17:19] ^ I merged that, was pm'ing btu.llis too :) [15:17:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T321312)', diff saved to https://phabricator.wikimedia.org/P35890 and previous config saved to /var/cache/conftool/dbconfig/20221021-151727-ladsgroup.json [15:17:54] I forgot it needed merging on puppetmaster, because it works straight away in pcc [15:19:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T321312)', diff saved to https://phabricator.wikimedia.org/P35892 and previous config saved to /var/cache/conftool/dbconfig/20221021-151920-ladsgroup.json [15:19:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [15:19:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [15:19:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T321312)', diff saved to https://phabricator.wikimedia.org/P35893 and previous config saved to /var/cache/conftool/dbconfig/20221021-151945-ladsgroup.json [15:20:44] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:21:24] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:23:12] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2001-dev'] [15:24:58] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:26:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T321312)', diff saved to https://phabricator.wikimedia.org/P35894 and previous config saved to /var/cache/conftool/dbconfig/20221021-152603-ladsgroup.json [15:26:41] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10ssingh) [15:29:58] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:30:30] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:32:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P35895 and previous config saved to /var/cache/conftool/dbconfig/20221021-153234-ladsgroup.json [15:33:08] PROBLEM - Check systemd state on elastic1072 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service,wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:58] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:36:42] (03PS13) 10Giuseppe Lavagetto: New organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 [15:37:08] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt2001-dev'] [15:39:58] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:40:05] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2001-dev.codfw.wmnet with OS bullseye [15:40:10] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:41:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P35896 and previous config saved to /var/cache/conftool/dbconfig/20221021-154110-ladsgroup.json [15:41:15] (03PS4) 10FNegri: Add Tekton deb repository [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) [15:41:40] PROBLEM - Check systemd state on elastic1080 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:41:44] (03CR) 10FNegri: Add Tekton deb repository (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [15:41:45] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4047.ulsfo.wmnet with reason: host reimage [15:42:56] RECOVERY - Check systemd state on elastic1080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:19] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4047.ulsfo.wmnet with reason: host reimage [15:46:10] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-worker-eqiad [15:47:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P35897 and previous config saved to /var/cache/conftool/dbconfig/20221021-154740-ladsgroup.json [15:51:47] PROBLEM - Check systemd state on elastic1082 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:05] RECOVERY - Check systemd state on elastic1082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:17] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P35898 and previous config saved to /var/cache/conftool/dbconfig/20221021-155616-ladsgroup.json [15:57:07] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2001-dev.codfw.wmnet with reason: host reimage [15:57:27] (03Abandoned) 10Dzahn: scap/dsh: remove parsoid service, replaced by parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/825753 (https://phabricator.wikimedia.org/T241207) (owner: 10Dzahn) [15:57:44] (03PS1) 10Jdlrobson: Document check for broken symbolic links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845620 (https://phabricator.wikimedia.org/T319223) [15:58:05] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2048.codfw.wmnet [15:58:17] (03CR) 10JHathaway: "Daniel & Giuseppe, if you all could just confirm this is safe to remove, that would be great, you both touched this data structure long ag" [puppet] - 10https://gerrit.wikimedia.org/r/845027 (owner: 10JHathaway) [15:58:17] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1044.eqiad.wmnet [15:58:23] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2045 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:59:12] (03PS3) 10Jdlrobson: Promote several Wikipedias to desktop improvements group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845060 (https://phabricator.wikimedia.org/T319012) [15:59:43] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2002-dev'] [15:59:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:00:47] PROBLEM - Check systemd state on elastic1085 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:49] PROBLEM - Check systemd state on elastic1055 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:05] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2001-dev.codfw.wmnet with reason: host reimage [16:02:10] RECOVERY - Check systemd state on elastic1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:14] RECOVERY - Check systemd state on elastic1055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T321312)', diff saved to https://phabricator.wikimedia.org/P35899 and previous config saved to /var/cache/conftool/dbconfig/20221021-160246-ladsgroup.json [16:02:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1116.eqiad.wmnet with reason: Maintenance [16:03:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db1116.eqiad.wmnet with reason: Maintenance [16:07:15] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1044.eqiad.wmnet [16:08:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance [16:08:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance [16:08:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T321312)', diff saved to https://phabricator.wikimedia.org/P35900 and previous config saved to /var/cache/conftool/dbconfig/20221021-160858-ladsgroup.json [16:09:41] (03PS1) 10Majavah: openstack: encapi: reformat with black [puppet] - 10https://gerrit.wikimedia.org/r/845623 [16:09:54] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2048.codfw.wmnet [16:11:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T321312)', diff saved to https://phabricator.wikimedia.org/P35901 and previous config saved to /var/cache/conftool/dbconfig/20221021-161123-ladsgroup.json [16:11:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [16:11:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [16:11:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T321312)', diff saved to https://phabricator.wikimedia.org/P35902 and previous config saved to /var/cache/conftool/dbconfig/20221021-161150-ladsgroup.json [16:12:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4047.ulsfo.wmnet with OS buster [16:13:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35903 and previous config saved to /var/cache/conftool/dbconfig/20221021-161315-ladsgroup.json [16:13:44] (03PS2) 10Majavah: openstack: encapi: reformat with black [puppet] - 10https://gerrit.wikimedia.org/r/845623 [16:14:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T321312)', diff saved to https://phabricator.wikimedia.org/P35904 and previous config saved to /var/cache/conftool/dbconfig/20221021-161411-ladsgroup.json [16:14:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt2002-dev'] [16:15:16] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:22] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2003-dev'] [16:15:41] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt2003-dev'] [16:16:11] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2003-dev'] [16:20:09] !log jhathaway@cumin1001 START - Cookbook sre.ganeti.makevm for new host aux-k8s-ctrl1001.eqiad.wmnet [16:20:10] !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox [16:20:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T321312)', diff saved to 
https://phabricator.wikimedia.org/P35905 and previous config saved to /var/cache/conftool/dbconfig/20221021-162032-ladsgroup.json [16:22:22] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:22:22] !log jhathaway@cumin1001 START - Cookbook sre.dns.wipe-cache aux-k8s-ctrl1001.eqiad.wmnet on all recursors [16:22:25] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-ctrl1001.eqiad.wmnet on all recursors [16:23:19] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt2003-dev'] [16:23:44] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2003-dev'] [16:23:44] !log jhathaway@cumin1001 START - Cookbook sre.ganeti.makevm for new host aux-k8s-ctrl1002.eqiad.wmnet [16:23:45] !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox [16:24:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt2003-dev'] [16:27:20] PROBLEM - Check systemd state on elastic1064 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:27:24] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2001-dev.codfw.wmnet with OS bullseye [16:27:47] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:27:48] !log jhathaway@cumin1001 START - Cookbook sre.dns.wipe-cache aux-k8s-ctrl1002.eqiad.wmnet on all recursors [16:27:51] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-ctrl1002.eqiad.wmnet on all recursors [16:29:10] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2003-dev'] [16:29:14] RECOVERY - Check systemd state on elastic1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:14] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2045 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:29:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P35906 and previous config saved to /var/cache/conftool/dbconfig/20221021-162917-ladsgroup.json [16:29:29] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt2003-dev'] [16:31:11] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2003-dev'] [16:35:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P35907 and previous config saved to /var/cache/conftool/dbconfig/20221021-163538-ladsgroup.json [16:42:47] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2002-dev.codfw.wmnet with OS bullseye [16:44:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt2003-dev'] [16:44:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P35908 and 
previous config saved to /var/cache/conftool/dbconfig/20221021-164424-ladsgroup.json [16:45:55] PROBLEM - Check systemd state on elastic1063 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:46:02] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-ctrl1001.eqiad.wmnet [16:46:26] (03CR) 10Cwhite: "Proposal to drop the request-cookie field from varnish in logstash." [puppet] - 10https://gerrit.wikimedia.org/r/844556 (https://phabricator.wikimedia.org/T321241) (owner: 10Cwhite) [16:46:37] PROBLEM - SSH on db1121.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:47:33] RECOVERY - Check systemd state on elastic1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P35909 and previous config saved to /var/cache/conftool/dbconfig/20221021-165045-ladsgroup.json [16:55:17] (03CR) 10Herron: [C: 03+1] logstash: add sanitize filter [puppet] - 10https://gerrit.wikimedia.org/r/844556 (https://phabricator.wikimedia.org/T321241) (owner: 10Cwhite) [16:57:07] (03CR) 10Herron: admin: add kindrobot to deployers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845591 (https://phabricator.wikimedia.org/T321355) (owner: 10Herron) [16:57:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10odimitrijevic) @Cmjohnson Thank you! 
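The recurring "Check systemd state" alerts on the elastic hosts all name the same prometheus-wmf-elasticsearch-exporter units, which fail briefly while a node is rebooted by the rolling-operation cookbook and then recover on their own. For reference, a degraded state like this can be inspected by hand on an affected host with standard systemd tooling (the unit name below is taken from the alerts above; nothing here is specific to the cookbook):

    # List the failed units behind the "degraded" system state
    systemctl --failed

    # Inspect and, if needed, restart the exporter the alert names
    systemctl status prometheus-wmf-elasticsearch-exporter-9200.service
    journalctl -u prometheus-wmf-elasticsearch-exporter-9200.service -n 50
    sudo systemctl restart prometheus-wmf-elasticsearch-exporter-9200.service

    # Clear the failed marker once the unit is healthy again
    sudo systemctl reset-failed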
[16:59:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T321312)', diff saved to https://phabricator.wikimedia.org/P35910 and previous config saved to /var/cache/conftool/dbconfig/20221021-165930-ladsgroup.json [16:59:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [16:59:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [16:59:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:00:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:00:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T321312)', diff saved to https://phabricator.wikimedia.org/P35911 and previous config saved to /var/cache/conftool/dbconfig/20221021-170011-ladsgroup.json [17:00:36] (03PS8) 10Cparle: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) [17:03:05] (03CR) 10CI reject: [V: 04-1] Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [17:03:56] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2002-dev.codfw.wmnet with reason: host reimage [17:04:55] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics for devnull - https://phabricator.wikimedia.org/T318104 (10herron) 05Stalled→03Invalid Transitioning to invalid pending sponsor. Once sponsor details have been worked out please update the task and reopen. Thanks! 
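The START/END pairs for Cookbook sre.hosts.downtime that bracket each maintenance window correspond to a single cookbook invocation that schedules monitoring downtime for the listed hosts. A rough sketch of such an invocation from a cumin host, assuming duration and reason options along the lines shown in the log; the exact flag names may differ between cookbook versions and are not taken from these entries:

    # Downtime a replica for one day before depooling it (flag names illustrative)
    sudo cookbook sre.hosts.downtime --days 1 -r "Maintenance" 'db1167.eqiad.wmnet'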
[17:05:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T321312)', diff saved to https://phabricator.wikimedia.org/P35912 and previous config saved to /var/cache/conftool/dbconfig/20221021-170551-ladsgroup.json [17:06:05] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (10herron) 05In progress→03Stalled [17:07:11] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] .gitmodules: translations migrated to Gerrit [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/845459 (https://phabricator.wikimedia.org/T321350) (owner: 10Hashar) [17:07:16] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2002-dev.codfw.wmnet with reason: host reimage [17:09:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35913 and previous config saved to /var/cache/conftool/dbconfig/20221021-170908-ladsgroup.json [17:09:31] (03PS8) 10Vlad.shapik: WIP: Provide additional tests to cover errors and exceptions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/841183 (https://phabricator.wikimedia.org/T318406) [17:09:33] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-ctrl1002.eqiad.wmnet [17:10:26] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2003-dev.codfw.wmnet with OS bullseye [17:12:15] PROBLEM - Check systemd state on elastic1074 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:49] PROBLEM - Check systemd state on elastic1075 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:15] RECOVERY - Check systemd state on elastic1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:59] RECOVERY - Check systemd state on elastic1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:18:45] (03PS3) 10Jdlrobson: Unset some bad logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845035 [17:18:52] (03PS4) 10Jdlrobson: Unset some bad logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845035 [17:20:02] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply updates - bking@cumin2002 - T321310 [17:20:03] PROBLEM - Check systemd state on elastic1098 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:20:41] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) 
for ElasticSearch cluster search_codfw: apply updates - bking@cumin2002 - T321310 [17:21:13] PROBLEM - Check systemd state on elastic1102 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:24:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P35914 and previous config saved to /var/cache/conftool/dbconfig/20221021-172414-ladsgroup.json [17:26:43] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2003-dev.codfw.wmnet with reason: host reimage [17:29:41] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2002-dev.codfw.wmnet with OS bullseye [17:30:52] PROBLEM - Check systemd state on elastic1100 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:02] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2003-dev.codfw.wmnet with reason: host reimage [17:36:57] RECOVERY - Check systemd state on elastic1098 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P35915 and previous config saved to /var/cache/conftool/dbconfig/20221021-173921-ladsgroup.json [17:46:59] (03PS1) 10BBlack: Clean up outdated commentary on requestctl [puppet] - 10https://gerrit.wikimedia.org/r/845648 (https://phabricator.wikimedia.org/T288106) [17:47:01] (03PS1) 10BBlack: Split confd file definitions [puppet] - 10https://gerrit.wikimedia.org/r/845649 (https://phabricator.wikimedia.org/T288106) [17:47:03] (03PS1) 10BBlack: single_backend mode for production varnishes [puppet] - 10https://gerrit.wikimedia.org/r/845650 (https://phabricator.wikimedia.org/T288106) [17:47:05] (03PS1) 10BBlack: cp4045: enable single_backend mode [puppet] - 10https://gerrit.wikimedia.org/r/845651 [17:47:57] (03CR) 10CI reject: [V: 04-1] Split confd file definitions [puppet] - 10https://gerrit.wikimedia.org/r/845649 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [17:48:24] (03CR) 10CI reject: [V: 04-1] single_backend mode for production varnishes [puppet] - 10https://gerrit.wikimedia.org/r/845650 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [17:51:05] (03CR) 10BBlack: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/845649 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [17:52:19] (03PS9) 10Vlad.shapik: Provide additional tests to cover errors caused by wrong engine commands [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/841183 (https://phabricator.wikimedia.org/T318406) [17:53:25] RECOVERY - Check systemd state on elastic1100 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:53:51] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2003-dev.codfw.wmnet with OS bullseye [17:54:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 
(T321312)', diff saved to https://phabricator.wikimedia.org/P35916 and previous config saved to /var/cache/conftool/dbconfig/20221021-175427-ladsgroup.json [17:54:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [17:54:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [17:54:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P35917 and previous config saved to /var/cache/conftool/dbconfig/20221021-175453-ladsgroup.json [17:56:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35918 and previous config saved to /var/cache/conftool/dbconfig/20221021-175615-ladsgroup.json [17:58:19] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [17:59:27] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic2085-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [18:00:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T321312)', diff saved to https://phabricator.wikimedia.org/P35919 and previous config saved to /var/cache/conftool/dbconfig/20221021-180028-ladsgroup.json [18:01:11] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for GFontenelle - https://phabricator.wikimedia.org/T321218 (10GFontenelle_WMF) Thank you, @herron and @Aklapper! 
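The Citoid LVS alerts above come from the service checker probing the service hostname: one check fetches the spec from the root path, the other hits /api ("Zotero and citoid alive"). When such an alert fires, the probe can be reproduced by hand; a minimal sketch, with the port and query parameters below being placeholder assumptions rather than values taken from this log:

    # Port is a placeholder: substitute the real citoid service port
    CITOID_PORT=1970

    # Same endpoint as the "spec from root" check
    curl -sv --max-time 10 "http://citoid.svc.codfw.wmnet:${CITOID_PORT}/" -o /dev/null

    # And the /api liveness check
    curl -sv --max-time 10 "http://citoid.svc.codfw.wmnet:${CITOID_PORT}/api?format=mediawiki&search=https://example.org" -o /dev/null

Both alerts here resolved within a couple of minutes, consistent with a transient timeout rather than a down service.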
[18:02:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P35920 and previous config saved to /var/cache/conftool/dbconfig/20221021-180228-ladsgroup.json [18:03:37] (03PS2) 10BBlack: Split confd file definitions [puppet] - 10https://gerrit.wikimedia.org/r/845649 (https://phabricator.wikimedia.org/T288106) [18:03:39] (03PS2) 10BBlack: single_backend mode for production varnishes [puppet] - 10https://gerrit.wikimedia.org/r/845650 (https://phabricator.wikimedia.org/T288106) [18:03:41] (03PS2) 10BBlack: cp4045: enable single_backend mode [puppet] - 10https://gerrit.wikimedia.org/r/845651 [18:04:24] (03CR) 10CI reject: [V: 04-1] Split confd file definitions [puppet] - 10https://gerrit.wikimedia.org/r/845649 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [18:04:48] (03CR) 10CI reject: [V: 04-1] single_backend mode for production varnishes [puppet] - 10https://gerrit.wikimedia.org/r/845650 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [18:09:30] (03PS3) 10BBlack: Split confd file definitions [puppet] - 10https://gerrit.wikimedia.org/r/845649 (https://phabricator.wikimedia.org/T288106) [18:09:32] (03PS3) 10BBlack: single_backend mode for production varnishes [puppet] - 10https://gerrit.wikimedia.org/r/845650 (https://phabricator.wikimedia.org/T288106) [18:09:34] (03PS3) 10BBlack: cp4045: enable single_backend mode [puppet] - 10https://gerrit.wikimedia.org/r/845651 [18:10:19] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [18:11:43] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [18:13:45] RECOVERY - Check systemd state on elastic1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:15:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P35921 and previous config saved to /var/cache/conftool/dbconfig/20221021-181534-ladsgroup.json [18:15:41] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mirror1001.wikimedia.org with reason: New Kernel [18:15:51] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [18:15:55] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mirror1001.wikimedia.org with reason: New Kernel [18:16:25] PROBLEM - Check systemd state on elastic2072 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P35922 and previous config saved to /var/cache/conftool/dbconfig/20221021-181734-ladsgroup.json [18:17:59] RECOVERY - Check systemd state on elastic2072 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:19:19] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the 
unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [18:19:28] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx2001.wikimedia.org with reason: New Kernel [18:19:42] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx2001.wikimedia.org with reason: New Kernel [18:20:53] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [18:21:03] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudservices2005-dev.wikimedia.org [18:22:19] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2005-dev.wikimedia.org [18:24:04] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: New Kernel [18:24:18] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: New Kernel [18:30:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P35923 and previous config saved to /var/cache/conftool/dbconfig/20221021-183041-ladsgroup.json [18:32:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P35924 and previous config saved to /var/cache/conftool/dbconfig/20221021-183241-ladsgroup.json [18:33:09] (03PS4) 10BBlack: single_backend mode for production varnishes [puppet] - 10https://gerrit.wikimedia.org/r/845650 (https://phabricator.wikimedia.org/T288106) [18:33:11] (03PS4) 10BBlack: cp4045: enable single_backend mode [puppet] - 10https://gerrit.wikimedia.org/r/845651 [18:33:11] PROBLEM - Check systemd state on elastic2045 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:33:31] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudservices2005-dev.wikimedia.org [18:35:21] RECOVERY - Check systemd state on elastic2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:37:20] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudservices2004-dev.wikimedia.org [18:38:47] !log pool new host cp4047: T317244 [18:38:50] !log pool new host cp4049: T317244 [18:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:52] T317244: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 [18:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:06] !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp4047.ulsfo.wmnet,service=ats-be [18:39:06] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp4047.ulsfo.wmnet,service=ats-tls [18:39:06] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp4047.ulsfo.wmnet,service=varnish-fe [18:39:07] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4047.ulsfo.wmnet,service=ats-be [18:39:07] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4047.ulsfo.wmnet,service=ats-tls [18:39:07] !log sukhe@puppetmaster1001 conftool action : 
set/pooled=yes; selector: name=cp4047.ulsfo.wmnet,service=varnish-fe [18:40:23] !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp4049.ulsfo.wmnet,service=ats-be [18:40:23] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp4049.ulsfo.wmnet,service=ats-tls [18:40:24] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp4049.ulsfo.wmnet,service=varnish-fe [18:40:24] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4049.ulsfo.wmnet,service=ats-be [18:40:24] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4049.ulsfo.wmnet,service=ats-tls [18:40:25] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4049.ulsfo.wmnet,service=varnish-fe [18:41:42] (03CR) 10CI reject: [V: 04-1] Modify jupyterhub config files to point to conda-analytics instead of anaconda-wmf. [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [18:45:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T321312)', diff saved to https://phabricator.wikimedia.org/P35925 and previous config saved to /var/cache/conftool/dbconfig/20221021-184547-ladsgroup.json [18:45:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [18:46:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [18:46:22] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol2005-dev.wikimedia.org [18:46:41] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2004-dev.wikimedia.org [18:47:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P35926 and previous config saved to /var/cache/conftool/dbconfig/20221021-184747-ladsgroup.json [18:48:01] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2075-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [18:48:11] PROBLEM - Check systemd state on elastic2075 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:21] PROBLEM - Check systemd state on elastic2039 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:31] PROBLEM - Check systemd state on elastic2038 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:49:41] (03PS4) 10Xcollazo: Modify jupyterhub config files to point to conda-analytics instead of anaconda-wmf. 
[puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) [18:49:47] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudservices2004-dev.wikimedia.org [18:50:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35927 and previous config saved to /var/cache/conftool/dbconfig/20221021-185003-ladsgroup.json [18:50:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [18:50:13] ^ there seems to be repeated elastic systemd unit failures. are these known? [18:50:17] RECOVERY - Check systemd state on elastic2075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [18:50:27] RECOVERY - Check systemd state on elastic2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:29] seems like they are flapping [18:50:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T321312)', diff saved to https://phabricator.wikimedia.org/P35928 and previous config saved to /var/cache/conftool/dbconfig/20221021-185032-ladsgroup.json [18:50:35] RECOVERY - Check systemd state on elastic2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:51:42] (03PS5) 10Xcollazo: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) [18:52:58] (03CR) 10Xcollazo: "Ok this one is ready for reviews." [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [18:53:01] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic2075-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [18:56:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T321312)', diff saved to https://phabricator.wikimedia.org/P35929 and previous config saved to /var/cache/conftool/dbconfig/20221021-185651-ladsgroup.json [18:57:21] PROBLEM - Check systemd state on elastic2062 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:58:49] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol2004-dev.wikimedia.org [18:59:25] RECOVERY - Check systemd state on elastic2062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:59:41] 10SRE, 10Growth-Team, 10Notifications, 10Wikimedia-production-error: Failed to fetch API response from {wiki}. 
Error code {code} - https://phabricator.wikimedia.org/T321409 (10Sgs) [19:05:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P35930 and previous config saved to /var/cache/conftool/dbconfig/20221021-190509-ladsgroup.json [19:10:39] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2001-dev.wikimedia.org [19:11:51] (03CR) 10Ottomata: "Looking good! some comments inline." [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [19:11:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P35931 and previous config saved to /var/cache/conftool/dbconfig/20221021-191157-ladsgroup.json [19:14:31] PROBLEM - Check systemd state on elastic2043 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:42] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup1001-dev.eqiad.wmnet [19:16:31] RECOVERY - Check systemd state on elastic2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:24] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): wmf-puppet-dashboard updates [19:18:36] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6] (dev): wmf-puppet-dashboard updates (duration: 01m 12s) [19:19:36] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1001-dev.eqiad.wmnet [19:19:43] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: wmf-puppet-dashboard updates [19:20:17] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup1002-dev.eqiad.wmnet [19:20:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P35932 and previous config saved to /var/cache/conftool/dbconfig/20221021-192016-ladsgroup.json [19:21:56] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol2001-dev.wikimedia.org [19:22:38] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: wmf-puppet-dashboard updates (duration: 02m 55s) [19:24:10] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1002-dev.eqiad.wmnet [19:27:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P35933 and previous config saved to /var/cache/conftool/dbconfig/20221021-192704-ladsgroup.json [19:32:03] PROBLEM - Check systemd state on elastic2057 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:03] RECOVERY - Check systemd state on elastic2057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35934 and previous 
config saved to /var/cache/conftool/dbconfig/20221021-193524-ladsgroup.json [19:35:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [19:35:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [19:35:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T321312)', diff saved to https://phabricator.wikimedia.org/P35935 and previous config saved to /var/cache/conftool/dbconfig/20221021-193550-ladsgroup.json [19:40:01] PROBLEM - Check systemd state on elastic2058 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:41:21] RECOVERY - Check systemd state on elastic2058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:42:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T321312)', diff saved to https://phabricator.wikimedia.org/P35936 and previous config saved to /var/cache/conftool/dbconfig/20221021-194201-ladsgroup.json [19:42:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T321312)', diff saved to https://phabricator.wikimedia.org/P35937 and previous config saved to /var/cache/conftool/dbconfig/20221021-194210-ladsgroup.json [19:42:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [19:42:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [19:42:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T321312)', diff saved to https://phabricator.wikimedia.org/P35938 and previous config saved to /var/cache/conftool/dbconfig/20221021-194234-ladsgroup.json [19:44:29] 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10jbond) wanted to update the task and first say sorry for the delayed progress on this task. To p... 
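The T308677 update at 19:44:29 concerns SSDs whose /dev/sdX names change between boots, which can point the installer at the wrong drive and destroy an existing swift filesystem. Persistent identifiers avoid that ambiguity; a brief sketch using standard util-linux tooling, not anything specific to that task:

    # Kernel names (sda, sdb, ...) may be assigned in a different order each boot
    lsblk -o NAME,MODEL,SERIAL,WWN,SIZE,MOUNTPOINT

    # Persistent symlinks keyed on model/serial, WWN or PCI path stay stable
    ls -l /dev/disk/by-id/ /dev/disk/by-path/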
[19:45:32] (03PS1) 10Stef Dunlap: Fixup development tooling for wider compatibility [deployment-charts] - 10https://gerrit.wikimedia.org/r/845680 [19:48:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T321312)', diff saved to https://phabricator.wikimedia.org/P35939 and previous config saved to /var/cache/conftool/dbconfig/20221021-194847-ladsgroup.json [19:57:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P35940 and previous config saved to /var/cache/conftool/dbconfig/20221021-195708-ladsgroup.json [19:57:25] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [20:00:13] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:01:27] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:02:36] (03PS1) 10BCornwall: prometheus: Add ats header/body size total metrics [puppet] - 10https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304) [20:03:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P35941 and previous config saved to /var/cache/conftool/dbconfig/20221021-200353-ladsgroup.json [20:04:35] PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:43] RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:12:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P35942 and previous config saved to /var/cache/conftool/dbconfig/20221021-201214-ladsgroup.json [20:12:15] PROBLEM - Check systemd state on elastic2076 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:13:51] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [20:14:21] RECOVERY - Check systemd state on elastic2076 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:15:55] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:19:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P35943 and previous config saved to /var/cache/conftool/dbconfig/20221021-201900-ladsgroup.json [20:20:02] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: apply updates - bking@cumin2002 - T321310 [20:27:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T321312)', diff saved to https://phabricator.wikimedia.org/P35944 and previous config saved to /var/cache/conftool/dbconfig/20221021-202721-ladsgroup.json [20:30:23] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [20:34:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T321312)', diff saved to https://phabricator.wikimedia.org/P35945 and previous config saved to /var/cache/conftool/dbconfig/20221021-203406-ladsgroup.json [20:34:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [20:34:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [20:34:29] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [20:34:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T321312)', diff saved to https://phabricator.wikimedia.org/P35946 and previous config saved to /var/cache/conftool/dbconfig/20221021-203430-ladsgroup.json [20:36:31] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:40:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T321312)', diff saved to https://phabricator.wikimedia.org/P35947 and previous config saved to /var/cache/conftool/dbconfig/20221021-204045-ladsgroup.json [20:40:59] 10SRE, 10Discovery-Search (Current work): Provide compatible elasticsearch-oss (7.x) and wmf-elasticsearch-search-plugins for buster on WMF APT repo - https://phabricator.wikimedia.org/T318820 (10bking) Looks like we (as in Search Platform SREs) need to cut a new package for `wmf-elasticsearch-search-plugins`... [20:44:53] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [20:46:57] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:51:33] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:55:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P35948 and previous config saved to /var/cache/conftool/dbconfig/20221021-205551-ladsgroup.json [20:56:48] (03PS5) 10BBlack: single_backend mode for production varnishes [puppet] - 10https://gerrit.wikimedia.org/r/845650 (https://phabricator.wikimedia.org/T288106) [20:56:50] (03PS5) 10BBlack: cp4045: enable single_backend mode [puppet] - 10https://gerrit.wikimedia.org/r/845651 [20:56:52] (03PS1) 10BBlack: Remove confd_experiment_fqdn support [puppet] - 10https://gerrit.wikimedia.org/r/845713 (https://phabricator.wikimedia.org/T288106) [20:57:15] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [21:00:21] (03PS2) 10BCornwall: prometheus: Add ats header/body size total metrics [puppet] - 10https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304) [21:00:46] (03CR) 10Dzahn: [C: 03+1] "oh, duh, I did not see that. all looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/845591 (https://phabricator.wikimedia.org/T321355) (owner: 10Herron) [21:01:23] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [21:02:15] (03CR) 10Dzahn: "eh, yea, I am not sure I can confirm it's 100% safe but _from what I can tell_ nothing _seems_ to use the ID number. A few things do still" [puppet] - 10https://gerrit.wikimedia.org/r/845027 (owner: 10JHathaway) [21:02:38] (03PS1) 10Cwhite: beta-logs: add new hosts [puppet] - 10https://gerrit.wikimedia.org/r/844563 (https://phabricator.wikimedia.org/T321410) [21:08:10] (03CR) 10Herron: [C: 03+2] admin: add kindrobot to deployers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845591 (https://phabricator.wikimedia.org/T321355) (owner: 10Herron) [21:10:03] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: MediaWiki deployment shell access request for SDunlap - https://phabricator.wikimedia.org/T321355 (10herron) 05Open→03Resolved a:03herron The requested access has been granted and will fully propagate within 30 minutes. I'll transition this to resolved... [21:10:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P35949 and previous config saved to /var/cache/conftool/dbconfig/20221021-211058-ladsgroup.json [21:16:32] 10SRE, 10Znuny, 10serviceops-collab: Move VTRS db passwords to a different hiera location - https://phabricator.wikimedia.org/T303272 (10Dzahn) I did a `mysql -h m2-master.eqiad.wmnet -u otrs -p otrs` from otrs1001 and could confirm that the password at `hieradata/common/profile/vrts.yaml:profile::vrts::data... [21:16:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:26:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T321312)', diff saved to https://phabricator.wikimedia.org/P35950 and previous config saved to /var/cache/conftool/dbconfig/20221021-212604-ladsgroup.json [21:26:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [21:26:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [21:26:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1192 (T321312)', diff saved to https://phabricator.wikimedia.org/P35951 and previous config saved to /var/cache/conftool/dbconfig/20221021-212629-ladsgroup.json [21:32:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T321312)', diff saved to https://phabricator.wikimedia.org/P35952 and previous config saved to /var/cache/conftool/dbconfig/20221021-213242-ladsgroup.json [21:36:37] PROBLEM - Host wcqs1001 is DOWN: PING CRITICAL - Packet loss = 100% [21:37:07] RECOVERY - Host wcqs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [21:41:53] PROBLEM - Host wcqs1002 is DOWN: PING CRITICAL - Packet loss = 100% [21:44:33] RECOVERY - Host wcqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [21:47:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P35953 and previous config saved to /var/cache/conftool/dbconfig/20221021-214749-ladsgroup.json [22:02:56] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P35954 and previous config saved to /var/cache/conftool/dbconfig/20221021-220256-ladsgroup.json [22:18:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T321312)', diff saved to https://phabricator.wikimedia.org/P35955 and previous config saved to /var/cache/conftool/dbconfig/20221021-221802-ladsgroup.json [22:18:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [22:18:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [22:18:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1193 (T321312)', diff saved to https://phabricator.wikimedia.org/P35956 and previous config saved to /var/cache/conftool/dbconfig/20221021-221826-ladsgroup.json [22:19:49] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10serviceops-collab: ensure httpd error logs from "misc apps" (krypton) end up in logstash - https://phabricator.wikimedia.org/T216090 (10Dzahn) a:03Dzahn [22:24:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T321312)', diff saved to https://phabricator.wikimedia.org/P35957 and previous config saved to /var/cache/conftool/dbconfig/20221021-222442-ladsgroup.json [22:39:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P35958 and previous config saved to /var/cache/conftool/dbconfig/20221021-223948-ladsgroup.json [22:51:45] RECOVERY - SSH on db1121.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:53:31] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:54:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P35959 and previous config saved to /var/cache/conftool/dbconfig/20221021-225455-ladsgroup.json [23:10:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T321312)', diff saved to https://phabricator.wikimedia.org/P35960 and previous config saved to /var/cache/conftool/dbconfig/20221021-231001-ladsgroup.json [23:10:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [23:10:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [23:10:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1203 (T321312)', diff saved to https://phabricator.wikimedia.org/P35961 and previous config saved to /var/cache/conftool/dbconfig/20221021-231026-ladsgroup.json [23:17:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T321312)', diff saved to https://phabricator.wikimedia.org/P35962 and previous config saved to /var/cache/conftool/dbconfig/20221021-231741-ladsgroup.json [23:32:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P35963 and previous config saved to 
/var/cache/conftool/dbconfig/20221021-233247-ladsgroup.json [23:33:27] PROBLEM - Check systemd state on db2137 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@s5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:47:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P35964 and previous config saved to /var/cache/conftool/dbconfig/20221021-234754-ladsgroup.json