[00:00:07] RECOVERY - Check systemd state on elastic1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:21] !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4005.ulsfo.wmnet with reason: host reimage [00:01:03] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ganeti4005.ulsfo.wmnet with reason: host reimage [00:02:35] PROBLEM - Check whether ferm is active by checking the default input chain on elastic1067 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:03:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:03:59] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS buster [00:04:32] !log robh@cumin2002 START - Cookbook sre.dns.netbox [00:05:44] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:06:15] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs4008 [00:06:31] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs4008 [00:06:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T321312)', diff saved to https://phabricator.wikimedia.org/P35783 and previous config saved to /var/cache/conftool/dbconfig/20221021-000636-ladsgroup.json [00:06:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [00:06:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [00:07:21] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4008.ulsfo.wmnet with OS bullseye [00:07:29] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye [00:09:58] (03CR) 10Ryan Kemper: [C: 03+2] elastic: run puppet after reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/845068 (owner: 10Ryan Kemper) [00:11:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [00:11:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [00:11:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P35784 and previous config saved to /var/cache/conftool/dbconfig/20221021-001117-ladsgroup.json [00:11:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1190 (T321312)', diff saved to https://phabricator.wikimedia.org/P35785 and previous config saved to /var/cache/conftool/dbconfig/20221021-001123-ladsgroup.json [00:14:25] RECOVERY - Check whether ferm is active by checking the default input chain on elastic1066 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:16:01] (03PS1) 10Ssingh: cp4049: update 
site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/845075 (https://phabricator.wikimedia.org/T317244) [00:16:53] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4005.ulsfo.wmnet with OS bullseye [00:17:01] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti4005.ulsfo.wmnet with OS bullseye completed: - ganeti4... [00:17:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T321312)', diff saved to https://phabricator.wikimedia.org/P35786 and previous config saved to /var/cache/conftool/dbconfig/20221021-001740-ladsgroup.json [00:18:22] RECOVERY - Check whether ferm is active by checking the default input chain on elastic1060 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:18:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:18:59] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST gateways) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:20:07] RECOVERY - Check whether ferm is active by checking the default input chain on elastic1064 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:20:07] RECOVERY - Check whether ferm is active by checking the default input chain on elastic1063 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:20:28] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4047.ulsfo.wmnet with OS buster [00:23:07] RECOVERY - Check systemd state on elastic1067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:23:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:25:33] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS buster [00:26:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T321312)', diff saved to https://phabricator.wikimedia.org/P35787 and previous config saved to /var/cache/conftool/dbconfig/20221021-002624-ladsgroup.json [00:27:17] !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage [00:27:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P35788 and previous config saved to /var/cache/conftool/dbconfig/20221021-002739-ladsgroup.json [00:27:55] RECOVERY - Check systemd state on elastic1065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:21] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage [00:32:49] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P35789 and previous config saved to /var/cache/conftool/dbconfig/20221021-003247-ladsgroup.json [00:32:49] RECOVERY - Check whether ferm is active by checking the default input chain on elastic1067 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:34:36] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T321310 [00:38:39] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4047.ulsfo.wmnet with OS buster [00:39:08] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T321310 [00:42:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P35790 and previous config saved to /var/cache/conftool/dbconfig/20221021-004246-ladsgroup.json [00:45:39] (03PS1) 10Ryan Kemper: elastic: run puppet in correct place [cookbooks] - 10https://gerrit.wikimedia.org/r/845086 [00:47:45] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt2001-dev.codfw.wmnet with OS bullseye [00:47:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P35791 and previous config saved to /var/cache/conftool/dbconfig/20221021-004754-ladsgroup.json [00:48:00] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4008.ulsfo.wmnet with OS bullseye [00:48:07] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host lvs4008.ulsfo.wmnet with OS bullseye completed: - lvs4008 (*... 
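The paired START / END (PASS|FAIL) entries above are written to the log automatically by the cookbook runner on the cumin hosts, together with the exit code of each run. As a minimal sketch only - the cookbook names and the OS/task values are taken from the log, but the flag spellings are recalled from memory and may not match the current cookbooks - a reimage like the lvs4008 one is typically kicked off along these lines:

    # on a cumin host; cookbook names match the START/END entries above,
    # but treat the exact flags as an assumption, not a verified interface
    sudo cookbook sre.hosts.reimage --os bullseye -t T317247 lvs4008

    # the 2:00:00 downtime seen in the log is set by a companion cookbook, e.g.
    sudo cookbook sre.hosts.downtime --hours 2 -r "host reimage" 'lvs4008.ulsfo.wmnet'

A non-zero exit code shows up as END (FAIL) or END (ERROR), which is why the cp4047 reimage is retried several times in this log.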
[00:57:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P35792 and previous config saved to /var/cache/conftool/dbconfig/20221021-005752-ladsgroup.json [00:59:28] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10RobH) [01:01:56] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS buster [01:03:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T321312)', diff saved to https://phabricator.wikimedia.org/P35793 and previous config saved to /var/cache/conftool/dbconfig/20221021-010301-ladsgroup.json [01:03:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [01:03:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [01:03:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1199 (T321312)', diff saved to https://phabricator.wikimedia.org/P35794 and previous config saved to /var/cache/conftool/dbconfig/20221021-010325-ladsgroup.json [01:09:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T321312)', diff saved to https://phabricator.wikimedia.org/P35795 and previous config saved to /var/cache/conftool/dbconfig/20221021-010944-ladsgroup.json [01:13:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T321312)', diff saved to https://phabricator.wikimedia.org/P35796 and previous config saved to /var/cache/conftool/dbconfig/20221021-011259-ladsgroup.json [01:13:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [01:13:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [01:13:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T321312)', diff saved to https://phabricator.wikimedia.org/P35797 and previous config saved to /var/cache/conftool/dbconfig/20221021-011324-ladsgroup.json [01:14:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T321312)', diff saved to https://phabricator.wikimedia.org/P35798 and previous config saved to /var/cache/conftool/dbconfig/20221021-011452-ladsgroup.json [01:16:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [01:19:26] (03CR) 10Ryan Kemper: [C: 03+2] elastic: run puppet in correct place [cookbooks] - 10https://gerrit.wikimedia.org/r/845086 (owner: 10Ryan Kemper) [01:22:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T321312)', diff saved to https://phabricator.wikimedia.org/P35799 and previous config saved to /var/cache/conftool/dbconfig/20221021-012213-ladsgroup.json [01:22:37] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T321310 [01:24:15] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4047.ulsfo.wmnet with OS buster 
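The long runs of dbctl commit entries above and below record a depool / maintain / repool cycle for one database instance at a time. Below is a minimal sketch of what one such cycle looks like from the operator's side, assuming the usual dbctl subcommands: the instance name, task id and commit messages are taken from the log, while the depool/pool/commit flag names are an assumption from memory.

    # take the replica out of rotation and commit the change
    dbctl instance db1199 depool
    dbctl config commit -m 'Depooling db1199 (T321312)'

    # ... run the maintenance / schema change on db1199 ...

    # bring it back gradually; the repeated "Repooling after maintenance" commits
    # for the same instance presumably correspond to stepwise percentage increases
    dbctl instance db1199 pool -p 25
    dbctl config commit -m 'Repooling after maintenance db1199 (T321312)'
    dbctl instance db1199 pool -p 75
    dbctl config commit -m 'Repooling after maintenance db1199'
    dbctl instance db1199 pool -p 100
    dbctl config commit -m 'Repooling after maintenance db1199'

Each commit is what produces the "diff saved to ... previous config saved to /var/cache/conftool/dbconfig/..." lines recorded here.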
[01:24:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P35800 and previous config saved to /var/cache/conftool/dbconfig/20221021-012450-ladsgroup.json [01:35:04] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:37:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P35801 and previous config saved to /var/cache/conftool/dbconfig/20221021-013720-ladsgroup.json [01:37:45] (JobUnavailable) firing: (6) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:39:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P35802 and previous config saved to /var/cache/conftool/dbconfig/20221021-013957-ladsgroup.json [01:42:36] PROBLEM - Check systemd state on elastic1083 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:43:02] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [2000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [01:44:38] RECOVERY - Check systemd state on elastic1083 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:47:08] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is OK: OK: Less than 20.00% above the threshold [1200.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [01:47:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P35803 and previous config saved to /var/cache/conftool/dbconfig/20221021-015226-ladsgroup.json [01:52:28] PROBLEM - Check systemd state on elastic1068 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:58] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T321310 [01:54:32] RECOVERY - Check systemd state on elastic1068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:55:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T321312)', diff saved to https://phabricator.wikimedia.org/P35804 and previous config saved to /var/cache/conftool/dbconfig/20221021-015503-ladsgroup.json [01:55:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:55:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [02:07:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T321312)', diff saved to https://phabricator.wikimedia.org/P35805 and previous config saved to /var/cache/conftool/dbconfig/20221021-020733-ladsgroup.json [02:07:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T321312)', diff saved to https://phabricator.wikimedia.org/P35806 and previous config saved to /var/cache/conftool/dbconfig/20221021-021250-ladsgroup.json [02:27:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P35807 and previous config saved to /var/cache/conftool/dbconfig/20221021-022757-ladsgroup.json [02:29:28] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2085-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [02:36:02] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:43:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P35808 and previous config saved to /var/cache/conftool/dbconfig/20221021-024303-ladsgroup.json [02:58:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db2138:3314 (T321312)', diff saved to https://phabricator.wikimedia.org/P35809 and previous config saved to /var/cache/conftool/dbconfig/20221021-025809-ladsgroup.json [02:58:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [02:58:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [02:58:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T321312)', diff saved to https://phabricator.wikimedia.org/P35810 and previous config saved to /var/cache/conftool/dbconfig/20221021-025836-ladsgroup.json [03:05:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T321312)', diff saved to https://phabricator.wikimedia.org/P35811 and previous config saved to /var/cache/conftool/dbconfig/20221021-030531-ladsgroup.json [03:20:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P35812 and previous config saved to /var/cache/conftool/dbconfig/20221021-032037-ladsgroup.json [03:35:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P35813 and previous config saved to /var/cache/conftool/dbconfig/20221021-033544-ladsgroup.json [03:50:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T321312)', diff saved to https://phabricator.wikimedia.org/P35814 and previous config saved to /var/cache/conftool/dbconfig/20221021-035050-ladsgroup.json [03:50:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [03:51:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [03:51:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [03:51:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [03:51:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T321312)', diff saved to https://phabricator.wikimedia.org/P35815 and previous config saved to /var/cache/conftool/dbconfig/20221021-035120-ladsgroup.json [03:56:18] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:57:48] RECOVERY - Host parse1001.mgmt is UP: PING WARNING - Packet loss = 71%, RTA = 1.64 ms [03:58:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T321312)', diff saved to https://phabricator.wikimedia.org/P35816 and previous config saved to /var/cache/conftool/dbconfig/20221021-035848-ladsgroup.json [04:13:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P35817 and previous config saved to /var/cache/conftool/dbconfig/20221021-041354-ladsgroup.json [04:19:13] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST gateways) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:23:06] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:28:08] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP 
CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:29:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P35818 and previous config saved to /var/cache/conftool/dbconfig/20221021-042901-ladsgroup.json [04:37:12] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:44:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T321312)', diff saved to https://phabricator.wikimedia.org/P35819 and previous config saved to /var/cache/conftool/dbconfig/20221021-044407-ladsgroup.json [04:44:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [04:44:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [04:44:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T321312)', diff saved to https://phabricator.wikimedia.org/P35820 and previous config saved to /var/cache/conftool/dbconfig/20221021-044433-ladsgroup.json [04:48:04] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.74 ms [04:48:53] (03PS1) 10Giuseppe Lavagetto: Fix broken links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845277 [04:50:02] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 4 (graphite1005, ...), No backups: 2 (graphite1005, ...), Fresh: 118 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:50:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T321312)', diff saved to https://phabricator.wikimedia.org/P35821 and previous config saved to /var/cache/conftool/dbconfig/20221021-045051-ladsgroup.json [04:54:23] (03CR) 10Giuseppe Lavagetto: "For the record:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843995 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [05:02:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:05:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P35822 and previous config saved to /var/cache/conftool/dbconfig/20221021-050558-ladsgroup.json [05:07:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:16:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [05:21:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P35823 and previous config saved to /var/cache/conftool/dbconfig/20221021-052104-ladsgroup.json [05:36:11] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T321312)', diff saved to https://phabricator.wikimedia.org/P35824 and previous config saved to /var/cache/conftool/dbconfig/20221021-053611-ladsgroup.json [05:36:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [05:36:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [05:36:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T321312)', diff saved to https://phabricator.wikimedia.org/P35825 and previous config saved to /var/cache/conftool/dbconfig/20221021-053636-ladsgroup.json [05:38:10] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:42:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T321312)', diff saved to https://phabricator.wikimedia.org/P35826 and previous config saved to /var/cache/conftool/dbconfig/20221021-054258-ladsgroup.json [05:58:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P35827 and previous config saved to /var/cache/conftool/dbconfig/20221021-055804-ladsgroup.json [06:13:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P35828 and previous config saved to /var/cache/conftool/dbconfig/20221021-061311-ladsgroup.json [06:28:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T321312)', diff saved to https://phabricator.wikimedia.org/P35829 and previous config saved to /var/cache/conftool/dbconfig/20221021-062817-ladsgroup.json [06:28:20] (03PS4) 10ArielGlenn: Create temporary dir for html dumps [puppet] - 10https://gerrit.wikimedia.org/r/844984 (https://phabricator.wikimedia.org/T319269) (owner: 10Hokwelum) [06:29:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2085-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [06:30:30] (03CR) 10ArielGlenn: [C: 03+2] Create temporary dir for html dumps [puppet] - 10https://gerrit.wikimedia.org/r/844984 (https://phabricator.wikimedia.org/T319269) (owner: 10Hokwelum) [06:53:06] (03PS18) 10Slyngshede: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [06:55:48] (03CR) 10Slyngshede: role::idm Basic deployment of IDM (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [06:58:49] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 36692 [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221021T0700) [07:00:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 36692 [07:05:16] (03PS3) 10Sohom Datta: Enable source links on Translation ns on enwikisource and thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843593 (https://phabricator.wikimedia.org/T53980) [07:09:36] (03CR) 10Sohom Datta: "Planning on deploying this on 24th Oct" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843593 (https://phabricator.wikimedia.org/T53980) (owner: 10Sohom Datta) [07:10:10] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845277 (owner: 10Giuseppe Lavagetto) [07:12:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845277 (owner: 10Giuseppe Lavagetto) [07:12:58] (03Merged) 10jenkins-bot: Fix broken links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845277 (owner: 10Giuseppe Lavagetto) [07:13:14] !log oblivian@deploy1002 Started scap: Backport for [[gerrit:845277|Fix broken links]] [07:13:34] !log oblivian@deploy1002 oblivian and oblivian: Backport for [[gerrit:845277|Fix broken links]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [07:20:25] !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:845277|Fix broken links]] (duration: 07m 11s) [07:28:32] (03CR) 10JMeybohm: New organization of templates (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 (owner: 10Giuseppe Lavagetto) [07:37:19] !log start of rolling restart of backup hosts [07:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:45] (JobUnavailable) firing: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:52:54] (03PS1) 10Jcrespo: bacula: Identify the 'backup' role as data persistence owned [puppet] - 10https://gerrit.wikimedia.org/r/845407 (https://phabricator.wikimedia.org/T321310) [07:55:30] (03CR) 10David Caro: alerts.downtime_host: attempt to match alert hostnames with : (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott) [07:56:11] the bacula job is me, as bacula was briefly unavailable and it is a single job, should be back soon [07:58:22] mmm, it doesn't come back, I will check why it is still failing after the host came back [07:59:51] the exporter needed a restart, it had failed too many times - I think I have to add a hook so it restarts after bacula is started, like we did for mysql and its exporter [08:00:22] (not super important as it is well monitored if it fails) [08:00:45] (JobUnavailable) resolved: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:04:09] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/845030 (https://phabricator.wikimedia.org/T315486) (owner: 10Btullis) [08:07:39] (03CR) 10Jcrespo: [C: 03+2] bacula: Identify the 'backup' role as data persistence owned [puppet] -
10https://gerrit.wikimedia.org/r/845407 (https://phabricator.wikimedia.org/T321310) (owner: 10Jcrespo) [08:19:05] (03PS4) 10Filippo Giunchedi: analytics: move kerberos::systemd_timer and deps to send_mail param [puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) [08:19:13] (03CR) 10Filippo Giunchedi: analytics: move kerberos::systemd_timer and deps to send_mail param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [08:19:13] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST gateways) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:20:04] (03CR) 10Filippo Giunchedi: analytics: move kerberos::systemd_timer and deps to send_mail param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [08:27:37] (03PS1) 10Elukey: sre.k8s.reboot-nodes: add support for ml-staging and fix dse-k8s [cookbooks] - 10https://gerrit.wikimedia.org/r/845430 (https://phabricator.wikimedia.org/T321310) [08:29:28] (03PS1) 10Elukey: cumin: add alias for ml-staging worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/845434 [08:30:50] (03PS2) 10Elukey: cumin: add alias for ml-staging worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/845434 [08:31:24] (03CR) 10CI reject: [V: 04-1] sre.k8s.reboot-nodes: add support for ml-staging and fix dse-k8s [cookbooks] - 10https://gerrit.wikimedia.org/r/845430 (https://phabricator.wikimedia.org/T321310) (owner: 10Elukey) [08:33:52] (03PS2) 10Elukey: sre.k8s.reboot-nodes: add support for ml-staging and fix dse-k8s [cookbooks] - 10https://gerrit.wikimedia.org/r/845430 (https://phabricator.wikimedia.org/T321310) [08:34:34] (03CR) 10Elukey: [C: 03+2] cumin: add alias for ml-staging worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/845434 (owner: 10Elukey) [08:40:06] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:40:21] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [08:40:44] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:40:56] !log elukey@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw [08:40:58] (03CR) 10Klausman: [C: 03+1] sre.k8s.reboot-nodes: add support for ml-staging and fix dse-k8s [cookbooks] - 10https://gerrit.wikimedia.org/r/845430 (https://phabricator.wikimedia.org/T321310) (owner: 10Elukey) [08:46:58] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:47:59] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2001.codfw.wmnet [08:50:32] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:51:38] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:51:58] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:52:41] !log finished rolling restart of backup hosts [08:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:02] !log start of rolling restart of dbprov hosts [08:53:23] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2001.codfw.wmnet [08:53:31] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2002.codfw.wmnet [08:53:44] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 157, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:54:04] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:58:54] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2002.codfw.wmnet [08:58:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST gateways) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:59:45] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab2003.wikimedia.org [09:03:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST gateways) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:04:31] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2002.codfw.wmnet [09:05:39] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1002.eqiad.wmnet [09:06:52] (03PS1) 10Hashar: .gitmodules: translations migrated to Gerrit [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/845459 (https://phabricator.wikimedia.org/T321350) [09:06:54] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2003.wikimedia.org [09:07:00] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:07:32] (03CR) 10Hashar: "I have moved the translations from Phabricator to Gerrit T321350#8334913" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/845459 (https://phabricator.wikimedia.org/T321350) (owner: 10Hashar) [09:08:02] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:08:58] (03PS1) 10Btullis: Grant analytics-admins the right to run commands as the yarn user [puppet] - 10https://gerrit.wikimedia.org/r/845462 (https://phabricator.wikimedia.org/T321378) [09:08:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST gateways) on k8s-mlserve@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:09:24] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:09:54] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2002.codfw.wmnet [09:10:04] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:10:15] !log finished rolling restart of dbprov hosts [09:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:28] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:10:42] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org [09:10:48] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:11:02] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2001.codfw.wmnet [09:12:10] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:12:34] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10ayounsi) > Long term wise, I can take the action item and check with Julianne on how long they need to keep track of these devices. Good idea! I think also how they need to... [09:13:34] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:13:58] (KubernetesCalicoDown) firing: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:13:58] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (LIST gateways) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:14:40] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 157, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:14:58] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1002.eqiad.wmnet [09:15:00] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:15:38] (03CR) 10Btullis: [C: 03+1] "Looks good to me. 
I'll abandon my similar change for dse-k8s only in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/845028" [cookbooks] - 10https://gerrit.wikimedia.org/r/845430 (https://phabricator.wikimedia.org/T321310) (owner: 10Elukey) [09:16:18] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1001.eqiad.wmnet [09:16:34] (03CR) 10Majavah: OpenStack HAProxy: support frontend ferm rules into haproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845063 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [09:16:48] PROBLEM - Check systemd state on ml-serve-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:17:26] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:18:05] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1003.wikimedia.org [09:18:18] (03CR) 10Elukey: [C: 03+2] sre.k8s.reboot-nodes: add support for ml-staging and fix dse-k8s [cookbooks] - 10https://gerrit.wikimedia.org/r/845430 (https://phabricator.wikimedia.org/T321310) (owner: 10Elukey) [09:18:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:18:30] (03Abandoned) 10Btullis: Fix the sre.k8s.reboot-nodes cookbook for dse-k8s [cookbooks] - 10https://gerrit.wikimedia.org/r/845028 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [09:18:58] (KubernetesCalicoDown) resolved: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:20:50] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org [09:21:39] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1001.eqiad.wmnet [09:21:54] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2001.codfw.wmnet [09:22:47] !log elukey@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-staging-worker [09:23:46] PROBLEM - Check systemd state on ml-serve-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:03] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org [09:27:30] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:27:56] RECOVERY - Check systemd state on ml-serve-ctrl2001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:29:26] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:29:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:34:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:43:28] !log klausman@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad [09:43:28] (KubernetesCalicoDown) firing: (2) ml-serve2003.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:44:33] (03CR) 10Elukey: [C: 03+1] Grant analytics-admins the right to run commands as the yarn user [puppet] - 10https://gerrit.wikimedia.org/r/845462 (https://phabricator.wikimedia.org/T321378) (owner: 10Btullis) [09:46:07] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org [09:48:28] (KubernetesCalicoDown) resolved: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:48:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:49:02] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:50:34] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:50:36] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:51:34] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:53:34] (03CR) 10Majavah: [C: 03+1] p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: 10David Caro) [09:53:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:54:03] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org [09:54:57] !log btullis@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker [09:55:28] !log btullis@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:dse-k8s-worker [09:56:56] !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw [10:00:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [10:00:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [10:01:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance [10:01:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance [10:01:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T321312)', diff saved to https://phabricator.wikimedia.org/P35830 and previous config saved to /var/cache/conftool/dbconfig/20221021-100137-ladsgroup.json [10:03:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35831 and previous config saved to /var/cache/conftool/dbconfig/20221021-100305-ladsgroup.json [10:03:35] !log elukey@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-staging-worker [10:04:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [10:04:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [10:05:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_webrequest_partitions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:12] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:06:14] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:07:47] !log btullis@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker [10:07:48] !log btullis@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:dse-k8s-worker [10:07:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [10:08:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [10:08:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T321312)', diff saved to https://phabricator.wikimedia.org/P35832 and previous config saved to 
/var/cache/conftool/dbconfig/20221021-100813-ladsgroup.json [10:10:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T321312)', diff saved to https://phabricator.wikimedia.org/P35833 and previous config saved to /var/cache/conftool/dbconfig/20221021-101009-ladsgroup.json [10:10:18] (03CR) 10Btullis: [C: 03+2] Grant analytics-admins the right to run commands as the yarn user [puppet] - 10https://gerrit.wikimedia.org/r/845462 (https://phabricator.wikimedia.org/T321378) (owner: 10Btullis) [10:13:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T321312)', diff saved to https://phabricator.wikimedia.org/P35834 and previous config saved to /var/cache/conftool/dbconfig/20221021-101336-ladsgroup.json [10:13:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:18:42] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:18:57] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-codfw [10:18:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:19:48] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:24:33] !log restart of ms-backup hosts [10:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P35835 and previous config saved to /var/cache/conftool/dbconfig/20221021-102516-ladsgroup.json [10:25:58] !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-eqiad [10:28:04] 10SRE, 10ops-eqiad: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T321315 (10jcrespo) [10:28:16] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10jcrespo) [10:28:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P35836 and previous config saved to /var/cache/conftool/dbconfig/20221021-102842-ladsgroup.json [10:29:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2085-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [10:30:12] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10jcrespo) Disk finished rebuilding: ` 
/usr/local/lib/nagios/plugins/get-raid-status-perccli communication 0 OK | controller: 0 OK | physical_disk: 0 OK | virtual_disk: 0 OK | bbu: 0 OK | enc... [10:32:21] 10SRE-tools, 10Icinga, 10Infrastructure-Foundations: get-raid-status-perccli should allow for commands to return non-zero exit code - https://phabricator.wikimedia.org/T320998 (10jcrespo) [10:33:14] (03PS1) 10FNegri: Add Tekton deb repository [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) [10:34:24] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:35:22] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:37:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:40:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P35837 and previous config saved to /var/cache/conftool/dbconfig/20221021-104022-ladsgroup.json [10:42:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:43:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P35838 and previous config saved to /var/cache/conftool/dbconfig/20221021-104349-ladsgroup.json [10:44:36] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:47:17] (03CR) 10Jbond: "LGTM i think its probably also worth adding bookworm, this is useful for backporting where you want to grab the bookworm version and rebui" [puppet] - 10https://gerrit.wikimedia.org/r/845036 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [10:47:20] (03CR) 10Jbond: [C: 03+1] package_builder: add deb-src for buster [puppet] - 10https://gerrit.wikimedia.org/r/845036 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [10:49:27] !log elukey@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-codfw [10:49:35] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-eqiad [10:55:21] (03CR) 10Jbond: [C: 03+1] "LGTM but probably also worth asking in service ops just in case" [puppet] - 10https://gerrit.wikimedia.org/r/845027 (owner: 10JHathaway) [10:55:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T321312)', diff saved to https://phabricator.wikimedia.org/P35839 and previous config saved to /var/cache/conftool/dbconfig/20221021-105529-ladsgroup.json [10:57:38] (03PS2) 10Daniel Kinzler: Enable parsoid cache warming on testwiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/843955 (https://phabricator.wikimedia.org/T320535) [10:58:06] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster2001.codfw.wmnet [10:58:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35840 and previous config saved to /var/cache/conftool/dbconfig/20221021-105845-ladsgroup.json [10:58:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T321312)', diff saved to https://phabricator.wikimedia.org/P35841 and previous config saved to /var/cache/conftool/dbconfig/20221021-105855-ladsgroup.json [10:59:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [10:59:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [10:59:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T321312)', diff saved to https://phabricator.wikimedia.org/P35842 and previous config saved to /var/cache/conftool/dbconfig/20221021-105921-ladsgroup.json [11:05:05] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster2001.codfw.wmnet [11:05:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:05:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T321312)', diff saved to https://phabricator.wikimedia.org/P35843 and previous config saved to /var/cache/conftool/dbconfig/20221021-110544-ladsgroup.json [11:06:22] (03CR) 10MVernon: [C: 03+2] swift: drain ms-be2050 [puppet] - 10https://gerrit.wikimedia.org/r/845502 (https://phabricator.wikimedia.org/T308677) (owner: 10MVernon) [11:06:53] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster1001.eqiad.wmnet [11:09:20] (03PS1) 10Jbond: sre.hosts.reboot-cluster: fix argument parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/845508 [11:10:18] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:10:48] (03PS2) 10Jbond: sre.hosts.reboot-cluster: fix argument parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/845508 [11:11:12] (03CR) 10Clément Goubert: [C: 03+1] sre.hosts.reboot-cluster: fix argument parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/845508 (owner: 10Jbond) [11:12:26] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:13:41] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster1001.eqiad.wmnet [11:13:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P35844 and previous config saved to 
/var/cache/conftool/dbconfig/20221021-111352-ladsgroup.json [11:13:58] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:14:58] (03CR) 10CI reject: [V: 04-1] sre.hosts.reboot-cluster: fix argument parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/845508 (owner: 10Jbond) [11:16:48] (03PS3) 10Jbond: sre.hosts.reboot-cluster: fix argument parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/845508 [11:18:08] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:18:44] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:20:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P35845 and previous config saved to /var/cache/conftool/dbconfig/20221021-112050-ladsgroup.json [11:20:56] (03CR) 10Clément Goubert: [C: 03+1] sre.hosts.reboot-cluster: fix argument parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/845508 (owner: 10Jbond) [11:21:18] (03CR) 10Jbond: [C: 03+2] sre.hosts.reboot-cluster: fix argument parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/845508 (owner: 10Jbond) [11:22:02] 10SRE, 10ops-eqiad: msw-c5-eqiad offline - https://phabricator.wikimedia.org/T321311 (10Jclark-ctr) Updated Netbox with cable [11:22:42] !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-codfw [11:27:23] !log rolling reboot of eqiad swift frontends re October reboots [11:27:25] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-cluster [11:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P35846 and previous config saved to /var/cache/conftool/dbconfig/20221021-112859-ladsgroup.json [11:35:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P35847 and previous config saved to /var/cache/conftool/dbconfig/20221021-113556-ladsgroup.json [11:35:58] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:36:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:44:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35848 and previous config saved to /var/cache/conftool/dbconfig/20221021-114405-ladsgroup.json [11:44:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [11:44:10] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:44:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [11:44:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P35849 and previous config saved to /var/cache/conftool/dbconfig/20221021-114429-ladsgroup.json [11:44:40] PROBLEM - SSH on db1121.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:45:19] (03PS1) 10Jbond: sre.SREBatchRunner: add max failed argument [cookbooks] - 10https://gerrit.wikimedia.org/r/845515 [11:45:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35850 and previous config saved to /var/cache/conftool/dbconfig/20221021-114553-ladsgroup.json [11:47:00] !log klausman@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-eqiad [11:47:36] (03PS1) 10Matthias Mullie: Add mediawiki.searchpreview schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845518 (https://phabricator.wikimedia.org/T321069) [11:47:57] (03CR) 10Matthias Mullie: [C: 04-2] "DNM; schema still in the works" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845518 (https://phabricator.wikimedia.org/T321069) (owner: 10Matthias Mullie) [11:48:46] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:52] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [11:48:55] (03CR) 10CI reject: [V: 04-1] sre.SREBatchRunner: add max failed argument [cookbooks] - 10https://gerrit.wikimedia.org/r/845515 (owner: 10Jbond) [11:51:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T321312)', diff saved to https://phabricator.wikimedia.org/P35851 and previous config saved to /var/cache/conftool/dbconfig/20221021-115103-ladsgroup.json [11:51:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [11:51:20] PROBLEM - Check systemd state on kubernetes2006 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [11:51:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2161 (T321312)', diff saved to https://phabricator.wikimedia.org/P35852 and previous config saved to /var/cache/conftool/dbconfig/20221021-115128-ladsgroup.json [11:51:46] !log rolling reboot of codfw swift frontends re October reboots [11:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:11] (03PS1) 10Jbond: O:insetup: drop role contact I/F [puppet] - 10https://gerrit.wikimedia.org/r/845519 [11:52:13] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-cluster [11:52:24] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:52:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P35853 and previous config saved to /var/cache/conftool/dbconfig/20221021-115255-ladsgroup.json [11:53:16] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 4.491 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:54:42] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:55:43] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "we likely need an update to aptrepo/files/distributions-wikimedia as well." [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [11:57:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T321312)', diff saved to https://phabricator.wikimedia.org/P35854 and previous config saved to /var/cache/conftool/dbconfig/20221021-115742-ladsgroup.json [11:59:10] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:03:51] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes2006.codfw.wmnet [12:06:00] RECOVERY - Check systemd state on kubernetes2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:48] (03CR) 10FNegri: Add Tekton deb repository (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [12:08:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P35855 and previous config saved to /var/cache/conftool/dbconfig/20221021-120802-ladsgroup.json [12:09:05] (03CR) 10Jbond: "should discuss this more when you are back" [puppet] - 10https://gerrit.wikimedia.org/r/845519 (owner: 10Jbond) [12:09:11] (03CR) 10Jbond: [C: 03+2] O:insetup: drop role contact I/F [puppet] - 10https://gerrit.wikimedia.org/r/845519 (owner: 10Jbond) [12:09:17] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2006.codfw.wmnet [12:09:46] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:09:56] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:11:10] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [12:11:50] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 157, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:11:58] (03PS2) 10FNegri: Add Tekton deb repository [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) [12:12:02] RECOVERY - BGP status on cr1-codfw is OK: BGP 
OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:12:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P35856 and previous config saved to /var/cache/conftool/dbconfig/20221021-121249-ladsgroup.json [12:12:54] (03CR) 10FNegri: Add Tekton deb repository (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [12:13:56] (03CR) 10FNegri: Add Tekton deb repository (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [12:23:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P35857 and previous config saved to /var/cache/conftool/dbconfig/20221021-122308-ladsgroup.json [12:23:10] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=kubernetes,service=kubesvc,name=kubernetes2006.codfw.wmnet [12:25:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1013:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [12:27:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P35858 and previous config saved to /var/cache/conftool/dbconfig/20221021-122755-ladsgroup.json [12:30:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1013:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [12:32:12] (03PS1) 10Filippo Giunchedi: prometheus: move service_catalog_targets under ::targets [puppet] - 10https://gerrit.wikimedia.org/r/845528 (https://phabricator.wikimedia.org/T310266) [12:32:14] (03PS1) 10Filippo Giunchedi: prometheus: probe SSH on mgmt network [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) [12:33:21] (03CR) 10CI reject: [V: 04-1] prometheus: probe SSH on mgmt network [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [12:34:36] (03CR) 10CI reject: [V: 04-1] prometheus: move service_catalog_targets under ::targets [puppet] - 10https://gerrit.wikimedia.org/r/845528 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [12:35:01] !log rebooted kubernetes2006.codfw.wmnet manually - root cause T273026 [12:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:07] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [12:37:31] (03PS2) 10Filippo Giunchedi: prometheus: probe SSH on mgmt network [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) [12:38:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T321312)', diff saved to 
https://phabricator.wikimedia.org/P35859 and previous config saved to /var/cache/conftool/dbconfig/20221021-123815-ladsgroup.json [12:38:15] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/845528 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [12:41:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35860 and previous config saved to /var/cache/conftool/dbconfig/20221021-124132-ladsgroup.json [12:41:59] !log restarting blazegraph on wdqs1013 (BlazegraphFreeAllocatorsDecreasingRapidly) [12:42:04] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes2023.codfw.wmnet [12:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T321312)', diff saved to https://phabricator.wikimedia.org/P35861 and previous config saved to /var/cache/conftool/dbconfig/20221021-124302-ladsgroup.json [12:43:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [12:43:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [12:43:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T321312)', diff saved to https://phabricator.wikimedia.org/P35862 and previous config saved to /var/cache/conftool/dbconfig/20221021-124327-ladsgroup.json [12:44:02] (03PS2) 10Filippo Giunchedi: prometheus: move service_catalog_targets under ::targets [puppet] - 10https://gerrit.wikimedia.org/r/845528 (https://phabricator.wikimedia.org/T310266) [12:44:04] (03PS3) 10Filippo Giunchedi: prometheus: probe SSH on mgmt network [puppet] - 10https://gerrit.wikimedia.org/r/845529 (https://phabricator.wikimedia.org/T310266) [12:45:44] RECOVERY - SSH on db1121.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:46:43] (03CR) 10CI reject: [V: 04-1] prometheus: move service_catalog_targets under ::targets [puppet] - 10https://gerrit.wikimedia.org/r/845528 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [12:47:45] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2023.codfw.wmnet [12:48:19] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes2024.codfw.wmnet [12:49:00] (03CR) 10Filippo Giunchedi: "I'm having troubles understanding why CI fails on seemingly unrelated tests:" [puppet] - 10https://gerrit.wikimedia.org/r/845528 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [12:49:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T321312)', diff saved to https://phabricator.wikimedia.org/P35863 and previous config saved to /var/cache/conftool/dbconfig/20221021-124950-ladsgroup.json [12:55:33] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2024.codfw.wmnet [12:56:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P35864 and previous config saved to /var/cache/conftool/dbconfig/20221021-125639-ladsgroup.json [13:00:23] !log cgoubert@cumin1001 END (FAIL) - Cookbook 
sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-worker-codfw [13:04:51] (03CR) 10Arturo Borrero Gonzalez: Add Tekton deb repository (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [13:04:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P35865 and previous config saved to /var/cache/conftool/dbconfig/20221021-130456-ladsgroup.json [13:07:02] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS buster [13:09:51] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2040.codfw.wmnet [13:11:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P35866 and previous config saved to /var/cache/conftool/dbconfig/20221021-131145-ladsgroup.json [13:13:01] (03PS3) 10Ssingh: package_builder: add deb-src for buster and bookworm [puppet] - 10https://gerrit.wikimedia.org/r/845036 (https://phabricator.wikimedia.org/T321309) [13:13:48] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37672/console" [puppet] - 10https://gerrit.wikimedia.org/r/845036 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:14:14] (03CR) 10Ssingh: package_builder: add deb-src for buster and bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845036 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:14:34] (03CR) 10Ssingh: [C: 03+2] aptrepo: add thirdparty/haproxy24 for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/844983 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:15:06] jbond: ok to merge your change? [13:15:11] jbond: O:insetup: drop role contact I/F (5fd9cafaf8) [13:16:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [13:16:51] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2040.codfw.wmnet [13:17:12] !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad [13:18:08] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2041.codfw.wmnet [13:18:39] jbond: given it seems like a trivial change, I will go ahead and merge [13:19:22] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:19:32] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4047.ulsfo.wmnet with OS buster [13:20:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P35867 and previous config saved to /var/cache/conftool/dbconfig/20221021-132003-ladsgroup.json [13:20:26] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:22:20] (03PS1) 10Elukey: conftool-data: update dse-k8s node list [puppet] - 10https://gerrit.wikimedia.org/r/845544 [13:23:38] (03CR) 10Btullis: [C: 03+1] "Nice. 
Thanks ever so much." [puppet] - 10https://gerrit.wikimedia.org/r/845544 (owner: 10Elukey) [13:24:24] (03CR) 10Elukey: [C: 03+2] conftool-data: update dse-k8s node list [puppet] - 10https://gerrit.wikimedia.org/r/845544 (owner: 10Elukey) [13:25:46] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2041.codfw.wmnet [13:26:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35868 and previous config saved to /var/cache/conftool/dbconfig/20221021-132652-ladsgroup.json [13:26:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance [13:27:06] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2042.codfw.wmnet [13:27:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance [13:27:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1104 (T321312)', diff saved to https://phabricator.wikimedia.org/P35869 and previous config saved to /var/cache/conftool/dbconfig/20221021-132716-ladsgroup.json [13:27:56] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1041.eqiad.wmnet [13:28:23] !log btullis@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker [13:31:44] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply updates - bking@cumin2002 - T321310 [13:32:15] !log kubernetes1005:~$ sudo systemctl reset-failed ifup@ens13.service T273026 [13:32:17] (03CR) 10Jbond: [C: 03+1] "thanks lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/845036 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:32:21] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:21] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:32:22] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [13:32:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T321312)', diff saved to https://phabricator.wikimedia.org/P35870 and previous config saved to /var/cache/conftool/dbconfig/20221021-133231-ladsgroup.json [13:33:47] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2042.codfw.wmnet [13:34:01] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=kubernetes,service=kubesvc,name=kubernetes1005.eqiad.wmnet [13:34:46] (03CR) 10Ssingh: [C: 03+2] package_builder: add deb-src for buster and bookworm [puppet] - 10https://gerrit.wikimedia.org/r/845036 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:34:47] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1041.eqiad.wmnet [13:35:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T321312)', diff saved 
to https://phabricator.wikimedia.org/P35871 and previous config saved to /var/cache/conftool/dbconfig/20221021-133509-ladsgroup.json [13:35:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [13:35:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [13:35:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T321312)', diff saved to https://phabricator.wikimedia.org/P35872 and previous config saved to /var/cache/conftool/dbconfig/20221021-133534-ladsgroup.json [13:38:55] PROBLEM - Check systemd state on cloudelastic1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:07] RECOVERY - Check systemd state on cloudelastic1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:05] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:41:06] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:41:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T321312)', diff saved to https://phabricator.wikimedia.org/P35873 and previous config saved to /var/cache/conftool/dbconfig/20221021-134153-ladsgroup.json [13:44:02] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2043.codfw.wmnet [13:45:40] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1042.eqiad.wmnet [13:47:11] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:47:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P35874 and previous config saved to /var/cache/conftool/dbconfig/20221021-134737-ladsgroup.json [13:48:01] (03PS1) 10Ssingh: cp4047: temporarily remove references [puppet] - 10https://gerrit.wikimedia.org/r/845546 [13:48:07] PROBLEM - Check systemd state on cloudelastic1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:45] RECOVERY - Check systemd state on cloudelastic1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:25] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:52:04] (03CR) 10Ssingh: [C: 03+2] cp4047: temporarily remove references [puppet] - 10https://gerrit.wikimedia.org/r/845546 (owner: 10Ssingh) [13:53:11] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - 
AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:57:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P35875 and previous config saved to /var/cache/conftool/dbconfig/20221021-135659-ladsgroup.json [14:00:03] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be1042.eqiad.wmnet [14:00:07] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2043.codfw.wmnet [14:00:19] PROBLEM - Check systemd state on ms-be1042 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:43] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:02:43] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:02:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P35876 and previous config saved to /var/cache/conftool/dbconfig/20221021-140245-ladsgroup.json [14:03:11] PROBLEM - Check systemd state on ms-be1063 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:23] (03PS1) 10Ssingh: cp4049: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/845550 (https://phabricator.wikimedia.org/T317244) [14:06:29] RECOVERY - Check systemd state on ms-be1042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:22] (03CR) 10Ssingh: [C: 03+2] cp4049: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/845550 (https://phabricator.wikimedia.org/T317244) (owner: 10Ssingh) [14:07:37] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2044.codfw.wmnet [14:09:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST metrics) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:11:25] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1063 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:11:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10Cmjohnson) The dns has been updated but I am not getting any mgmt connection, I need to check to make sure the mgmt cables are conne... 
[14:11:38] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1040.eqiad.wmnet [14:12:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P35877 and previous config saved to /var/cache/conftool/dbconfig/20221021-141206-ladsgroup.json [14:12:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4049.ulsfo.wmnet with OS buster [14:13:01] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:01] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:11] PROBLEM - Check systemd state on cloudelastic1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:25] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v1/list/languagepairs (Get all the language pairs) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [14:15:13] RECOVERY - Check systemd state on cloudelastic1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:15] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:16:21] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [14:16:25] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:17:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T321312)', diff saved to https://phabricator.wikimedia.org/P35878 and previous config saved to /var/cache/conftool/dbconfig/20221021-141752-ladsgroup.json [14:17:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1111.eqiad.wmnet with reason: Maintenance [14:18:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1111.eqiad.wmnet with reason: Maintenance [14:18:14] !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS buster [14:18:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T321312)', diff saved to https://phabricator.wikimedia.org/P35879 and previous config saved to /var/cache/conftool/dbconfig/20221021-141815-ladsgroup.json [14:18:17] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1040.eqiad.wmnet [14:21:01] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2044.codfw.wmnet [14:21:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:21:29] !log pool new host cp4037: T317244 [14:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:21:34] T317244: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 [14:22:00] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply updates - bking@cumin2002 - T321310 [14:22:04] !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp4037.ulsfo.wmnet,service=ats-be [14:22:04] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp4037.ulsfo.wmnet,service=ats-tls [14:22:05] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp4037.ulsfo.wmnet,service=varnish-fe [14:22:05] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet,service=ats-be [14:22:05] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet,service=ats-tls [14:22:06] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet,service=varnish-fe [14:22:20] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1043.eqiad.wmnet [14:22:25] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2045.codfw.wmnet [14:23:00] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10ssingh) [14:23:02] !log bblack@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4047.ulsfo.wmnet with OS buster [14:23:21] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:23:21] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:24:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:25:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T321312)', diff saved to https://phabricator.wikimedia.org/P35880 and previous config saved to /var/cache/conftool/dbconfig/20221021-142521-ladsgroup.json [14:25:58] (KubernetesCalicoDown) firing: dse-k8s-worker1006.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1006.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:26:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:27:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T321312)', diff saved to https://phabricator.wikimedia.org/P35881 and previous config saved to /var/cache/conftool/dbconfig/20221021-142712-ladsgroup.json [14:27:19] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [14:27:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [14:27:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [14:27:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [14:27:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T321312)', diff saved to https://phabricator.wikimedia.org/P35882 and previous config saved to /var/cache/conftool/dbconfig/20221021-142742-ladsgroup.json [14:29:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2085-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [14:29:58] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:30:15] (03PS1) 10Btullis: Add dummy passwords for the airflow database users [labs/private] - 10https://gerrit.wikimedia.org/r/845559 (https://phabricator.wikimedia.org/T319440) [14:30:17] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1068 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:30:21] PROBLEM - Check systemd state on ms-be1068 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:26] (03PS1) 10Btullis: Add a simple mechanism for creating postgresql users and databases [puppet] - 10https://gerrit.wikimedia.org/r/845560 (https://phabricator.wikimedia.org/T319440) [14:30:58] (KubernetesCalicoDown) resolved: dse-k8s-worker1006.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1006.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:33:06] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:34:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T321312)', diff saved to https://phabricator.wikimedia.org/P35883 and previous config saved to /var/cache/conftool/dbconfig/20221021-143400-ladsgroup.json [14:34:11] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37673/console" [puppet] - 10https://gerrit.wikimedia.org/r/845560 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [14:34:21] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be2045.codfw.wmnet [14:34:25] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - 
AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:36:54] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add dummy passwords for the airflow database users [labs/private] - 10https://gerrit.wikimedia.org/r/845559 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [14:37:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4049.ulsfo.wmnet with reason: host reimage [14:37:54] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37674/console" [puppet] - 10https://gerrit.wikimedia.org/r/845560 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [14:39:52] (03CR) 10JMeybohm: [C: 04-1] coredns: upgrade to 1.8.7 (036 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844499 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey) [14:40:17] (03PS3) 10FNegri: Add Tekton deb repository [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) [14:40:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P35884 and previous config saved to /var/cache/conftool/dbconfig/20221021-144028-ladsgroup.json [14:40:42] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2046.codfw.wmnet [14:41:08] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:41:14] (03PS1) 10KartikMistry: Enable Section Translation in Hawaiian, Pashto and Xhosa WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845573 (https://phabricator.wikimedia.org/T317289) [14:41:21] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4049.ulsfo.wmnet with reason: host reimage [14:41:30] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:41:58] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1063 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:43:56] RECOVERY - Check systemd state on ms-be1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:22] PROBLEM - Host ms-be1043 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:41] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply updates - bking@cumin2002 - T321310 [14:44:58] RECOVERY - Host ms-be1043 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [14:44:58] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:46:49] !log bblack@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp4047 [14:47:46] !log bblack@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4047 [14:47:49] !log mvernon@cumin1001 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host ms-be2046.codfw.wmnet [14:48:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:48:40] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2047.codfw.wmnet [14:49:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P35885 and previous config saved to /var/cache/conftool/dbconfig/20221021-144907-ladsgroup.json [14:49:11] !log btullis@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:dse-k8s-worker [14:49:32] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:49:54] PROBLEM - Host ms-be1043 is DOWN: PING CRITICAL - Packet loss = 100% [14:50:18] RECOVERY - Host ms-be1043 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [14:50:28] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:51:26] PROBLEM - Check systemd state on ms-be1043 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:36] PROBLEM - Check systemd state on elastic1092 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:36] PROBLEM - Check systemd state on elastic1090 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:00] RECOVERY - Check systemd state on ms-be1043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:54:14] RECOVERY - Check systemd state on elastic1092 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:14] RECOVERY - Check systemd state on elastic1090 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:57] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2047.codfw.wmnet [14:55:28] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1043.eqiad.wmnet [14:55:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P35886 and previous 
config saved to /var/cache/conftool/dbconfig/20221021-145534-ladsgroup.json [14:57:10] RECOVERY - Check systemd state on ms-be1068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:24] RECOVERY - Check systemd state on elastic1089 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:11] (03PS1) 10Ssingh: Revert "cp4047: temporarily remove references" [puppet] - 10https://gerrit.wikimedia.org/r/845586 [14:59:16] PROBLEM - Check systemd state on elastic1091 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:24] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:01:02] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:01:12] RECOVERY - Check systemd state on elastic1091 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:12] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1068 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:04:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P35887 and previous config saved to /var/cache/conftool/dbconfig/20221021-150413-ladsgroup.json [15:04:33] (ProbeDown) firing: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:04:57] (03PS12) 10Giuseppe Lavagetto: New organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 [15:05:30] (03CR) 10Ssingh: [C: 03+2] Revert "cp4047: temporarily remove references" [puppet] - 10https://gerrit.wikimedia.org/r/845586 (owner: 10Ssingh) [15:05:56] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4049.ulsfo.wmnet with OS buster [15:06:55] 10SRE, 10SRE-Access-Requests: MediaWiki deployment shell access request for SDunlap - https://phabricator.wikimedia.org/T321355 (10herron) p:05Triage→03Medium [15:07:18] PROBLEM - Check systemd state on elastic1101 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:09:16] RECOVERY - Check systemd state on elastic1101 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:09:46] (ProbeDown) resolved: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:10:22] PROBLEM - BGP status on cr1-eqiad is 
CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:10:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T321312)', diff saved to https://phabricator.wikimedia.org/P35888 and previous config saved to /var/cache/conftool/dbconfig/20221021-151040-ladsgroup.json [15:10:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1114.eqiad.wmnet with reason: Maintenance [15:10:56] (03PS3) 10Elukey: coredns: upgrade to 1.8.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844499 (https://phabricator.wikimedia.org/T321159) [15:10:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1114.eqiad.wmnet with reason: Maintenance [15:11:02] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:11:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T321312)', diff saved to https://phabricator.wikimedia.org/P35889 and previous config saved to /var/cache/conftool/dbconfig/20221021-151104-ladsgroup.json [15:11:55] (03CR) 10Elukey: "Sorry the code review was more WIP than ready, I wanted to ask if it was an ok-direction and left some horrors here and there :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844499 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey) [15:11:57] (03PS1) 10Herron: admin: add kindrobot to deployers [puppet] - 10https://gerrit.wikimedia.org/r/845591 (https://phabricator.wikimedia.org/T321355) [15:12:12] Bsadowski1: ok to merge yours? [15:12:15] sorry ^ [15:12:22] btullis: ^ [15:13:17] Which one? [15:13:26] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: restart gitlab-runner gracefully [puppet] - 10https://gerrit.wikimedia.org/r/844985 (owner: 10Jelto) [15:13:46] labs/private [15:13:48] I skipped it [15:13:52] +profile::analytics::postgresql::replication_password: dummydummy [15:13:52] +profile::analytics::postgresql::users: [15:15:00] (03CR) 10Dzahn: "Has the approvals and looks alright. Just the key used here does not appear on the ticket." [puppet] - 10https://gerrit.wikimedia.org/r/845591 (https://phabricator.wikimedia.org/T321355) (owner: 10Herron) [15:15:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS buster [15:16:28] sukhe: Sorry about that. 
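The ladsgroup entries throughout this window follow the usual database maintenance pattern: schedule downtime, depool the replica with dbctl, run the maintenance, then repool it (often in steps). A minimal sketch of that flow from a cumin host, assuming current dbctl subcommand names; the host, task number and commit messages below are illustrative, not copied from these log entries:

    # Depool the replica; the config change is committed to etcd for all DCs
    dbctl instance db1190 depool
    dbctl config commit -m "Depooling db1190 (T321312)"

    # ... run the schema change / maintenance on the host ...

    # Repool afterwards; in practice this is often done in several weight steps
    dbctl instance db1190 pool
    dbctl config commit -m "Repooling after maintenance db1190 (T321312)"

Each "dbctl commit (dc=all)" line in the log corresponds to one such commit, with the diff and previous config archived to Phabricator and /var/cache/conftool/dbconfig respectively.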
[15:16:41] btullis: no problem at all :) I realized later it was labs/private and should not have pinged you [15:17:19] ^ I merged that, was pm'ing btu.llis too :) [15:17:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T321312)', diff saved to https://phabricator.wikimedia.org/P35890 and previous config saved to /var/cache/conftool/dbconfig/20221021-151727-ladsgroup.json [15:17:54] I forgot it needed merging on puppetmaster, because it works straight away in pcc [15:19:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T321312)', diff saved to https://phabricator.wikimedia.org/P35892 and previous config saved to /var/cache/conftool/dbconfig/20221021-151920-ladsgroup.json [15:19:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [15:19:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [15:19:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T321312)', diff saved to https://phabricator.wikimedia.org/P35893 and previous config saved to /var/cache/conftool/dbconfig/20221021-151945-ladsgroup.json [15:20:44] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:21:24] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:23:12] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2001-dev'] [15:24:58] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:26:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T321312)', diff saved to https://phabricator.wikimedia.org/P35894 and previous config saved to /var/cache/conftool/dbconfig/20221021-152603-ladsgroup.json [15:26:41] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10ssingh) [15:29:58] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:30:30] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:32:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P35895 and previous config saved to /var/cache/conftool/dbconfig/20221021-153234-ladsgroup.json [15:33:08] PROBLEM - Check systemd state on elastic1072 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service,wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:58] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:36:42] (03PS13) 10Giuseppe Lavagetto: New organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 [15:37:08] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt2001-dev'] [15:39:58] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:40:05] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2001-dev.codfw.wmnet with OS bullseye [15:40:10] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:41:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P35896 and previous config saved to /var/cache/conftool/dbconfig/20221021-154110-ladsgroup.json [15:41:15] (03PS4) 10FNegri: Add Tekton deb repository [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) [15:41:40] PROBLEM - Check systemd state on elastic1080 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:41:44] (03CR) 10FNegri: Add Tekton deb repository (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845483 (https://phabricator.wikimedia.org/T317143) (owner: 10FNegri) [15:41:45] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4047.ulsfo.wmnet with reason: host reimage [15:42:56] RECOVERY - Check systemd state on elastic1080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:19] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4047.ulsfo.wmnet with reason: host reimage [15:46:10] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-worker-eqiad [15:47:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P35897 and previous config saved to /var/cache/conftool/dbconfig/20221021-154740-ladsgroup.json [15:51:47] PROBLEM - Check systemd state on elastic1082 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:05] RECOVERY - Check systemd state on elastic1082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:17] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P35898 and previous config saved to /var/cache/conftool/dbconfig/20221021-155616-ladsgroup.json [15:57:07] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2001-dev.codfw.wmnet with reason: host reimage [15:57:27] (03Abandoned) 10Dzahn: scap/dsh: remove parsoid service, replaced by parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/825753 (https://phabricator.wikimedia.org/T241207) (owner: 10Dzahn) [15:57:44] (03PS1) 10Jdlrobson: Document check for broken symbolic links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845620 (https://phabricator.wikimedia.org/T319223) [15:58:05] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2048.codfw.wmnet [15:58:17] (03CR) 10JHathaway: "Daniel & Giuseppe, if you all could just confirm this is safe to remove, that would be great, you both touched this data structure long ag" [puppet] - 10https://gerrit.wikimedia.org/r/845027 (owner: 10JHathaway) [15:58:17] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1044.eqiad.wmnet [15:58:23] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2045 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:59:12] (03PS3) 10Jdlrobson: Promote several Wikipedias to desktop improvements group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845060 (https://phabricator.wikimedia.org/T319012) [15:59:43] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2002-dev'] [15:59:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:00:47] PROBLEM - Check systemd state on elastic1085 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:49] PROBLEM - Check systemd state on elastic1055 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:05] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2001-dev.codfw.wmnet with reason: host reimage [16:02:10] RECOVERY - Check systemd state on elastic1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:14] RECOVERY - Check systemd state on elastic1055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T321312)', diff saved to https://phabricator.wikimedia.org/P35899 and previous config saved to /var/cache/conftool/dbconfig/20221021-160246-ladsgroup.json [16:02:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1116.eqiad.wmnet with reason: Maintenance [16:03:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db1116.eqiad.wmnet with reason: Maintenance [16:07:15] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1044.eqiad.wmnet [16:08:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance [16:08:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance [16:08:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T321312)', diff saved to https://phabricator.wikimedia.org/P35900 and previous config saved to /var/cache/conftool/dbconfig/20221021-160858-ladsgroup.json [16:09:41] (03PS1) 10Majavah: openstack: encapi: reformat with black [puppet] - 10https://gerrit.wikimedia.org/r/845623 [16:09:54] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2048.codfw.wmnet [16:11:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T321312)', diff saved to https://phabricator.wikimedia.org/P35901 and previous config saved to /var/cache/conftool/dbconfig/20221021-161123-ladsgroup.json [16:11:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [16:11:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [16:11:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T321312)', diff saved to https://phabricator.wikimedia.org/P35902 and previous config saved to /var/cache/conftool/dbconfig/20221021-161150-ladsgroup.json [16:12:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4047.ulsfo.wmnet with OS buster [16:13:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35903 and previous config saved to /var/cache/conftool/dbconfig/20221021-161315-ladsgroup.json [16:13:44] (03PS2) 10Majavah: openstack: encapi: reformat with black [puppet] - 10https://gerrit.wikimedia.org/r/845623 [16:14:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T321312)', diff saved to https://phabricator.wikimedia.org/P35904 and previous config saved to /var/cache/conftool/dbconfig/20221021-161411-ladsgroup.json [16:14:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt2002-dev'] [16:15:16] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:22] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2003-dev'] [16:15:41] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt2003-dev'] [16:16:11] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2003-dev'] [16:20:09] !log jhathaway@cumin1001 START - Cookbook sre.ganeti.makevm for new host aux-k8s-ctrl1001.eqiad.wmnet [16:20:10] !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox [16:20:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T321312)', diff saved to 
https://phabricator.wikimedia.org/P35905 and previous config saved to /var/cache/conftool/dbconfig/20221021-162032-ladsgroup.json [16:22:22] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:22:22] !log jhathaway@cumin1001 START - Cookbook sre.dns.wipe-cache aux-k8s-ctrl1001.eqiad.wmnet on all recursors [16:22:25] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-ctrl1001.eqiad.wmnet on all recursors [16:23:19] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt2003-dev'] [16:23:44] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2003-dev'] [16:23:44] !log jhathaway@cumin1001 START - Cookbook sre.ganeti.makevm for new host aux-k8s-ctrl1002.eqiad.wmnet [16:23:45] !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox [16:24:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt2003-dev'] [16:27:20] PROBLEM - Check systemd state on elastic1064 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:27:24] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2001-dev.codfw.wmnet with OS bullseye [16:27:47] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:27:48] !log jhathaway@cumin1001 START - Cookbook sre.dns.wipe-cache aux-k8s-ctrl1002.eqiad.wmnet on all recursors [16:27:51] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-ctrl1002.eqiad.wmnet on all recursors [16:29:10] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2003-dev'] [16:29:14] RECOVERY - Check systemd state on elastic1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:14] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2045 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:29:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P35906 and previous config saved to /var/cache/conftool/dbconfig/20221021-162917-ladsgroup.json [16:29:29] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt2003-dev'] [16:31:11] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2003-dev'] [16:35:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P35907 and previous config saved to /var/cache/conftool/dbconfig/20221021-163538-ladsgroup.json [16:42:47] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2002-dev.codfw.wmnet with OS bullseye [16:44:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt2003-dev'] [16:44:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P35908 and 
previous config saved to /var/cache/conftool/dbconfig/20221021-164424-ladsgroup.json [16:45:55] PROBLEM - Check systemd state on elastic1063 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:46:02] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-ctrl1001.eqiad.wmnet [16:46:26] (03CR) 10Cwhite: "Proposal to drop the request-cookie field from varnish in logstash." [puppet] - 10https://gerrit.wikimedia.org/r/844556 (https://phabricator.wikimedia.org/T321241) (owner: 10Cwhite) [16:46:37] PROBLEM - SSH on db1121.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:47:33] RECOVERY - Check systemd state on elastic1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P35909 and previous config saved to /var/cache/conftool/dbconfig/20221021-165045-ladsgroup.json [16:55:17] (03CR) 10Herron: [C: 03+1] logstash: add sanitize filter [puppet] - 10https://gerrit.wikimedia.org/r/844556 (https://phabricator.wikimedia.org/T321241) (owner: 10Cwhite) [16:57:07] (03CR) 10Herron: admin: add kindrobot to deployers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845591 (https://phabricator.wikimedia.org/T321355) (owner: 10Herron) [16:57:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10odimitrijevic) @Cmjohnson Thank you! 
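The recurring "Check systemd state" alerts on the elastic hosts all name the same prometheus-wmf-elasticsearch-exporter units, which fail briefly while a node is rebooted by the rolling-operation cookbook and then recover on their own. For reference, a degraded state like this can be inspected by hand on an affected host with standard systemd tooling (the unit name below is taken from the alerts above; nothing here is specific to the cookbook):

    # List the failed units behind the "degraded" system state
    systemctl --failed

    # Inspect and, if needed, restart the exporter the alert names
    systemctl status prometheus-wmf-elasticsearch-exporter-9200.service
    journalctl -u prometheus-wmf-elasticsearch-exporter-9200.service -n 50
    sudo systemctl restart prometheus-wmf-elasticsearch-exporter-9200.service

    # Clear the failed marker once the unit is healthy again
    sudo systemctl reset-failed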
[16:59:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T321312)', diff saved to https://phabricator.wikimedia.org/P35910 and previous config saved to /var/cache/conftool/dbconfig/20221021-165930-ladsgroup.json [16:59:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [16:59:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [16:59:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:00:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:00:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T321312)', diff saved to https://phabricator.wikimedia.org/P35911 and previous config saved to /var/cache/conftool/dbconfig/20221021-170011-ladsgroup.json [17:00:36] (03PS8) 10Cparle: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) [17:03:05] (03CR) 10CI reject: [V: 04-1] Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [17:03:56] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2002-dev.codfw.wmnet with reason: host reimage [17:04:55] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics for devnull - https://phabricator.wikimedia.org/T318104 (10herron) 05Stalled→03Invalid Transitioning to invalid pending sponsor. Once sponsor details have been worked out please update the task and reopen. Thanks! 
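The START/END pairs for Cookbook sre.hosts.downtime that bracket each maintenance window correspond to a single cookbook invocation that schedules monitoring downtime for the listed hosts. A rough sketch of such an invocation from a cumin host, assuming duration and reason options along the lines shown in the log; the exact flag names may differ between cookbook versions and are not taken from these entries:

    # Downtime a replica for one day before depooling it (flag names illustrative)
    sudo cookbook sre.hosts.downtime --days 1 -r "Maintenance" 'db1167.eqiad.wmnet'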
[17:05:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T321312)', diff saved to https://phabricator.wikimedia.org/P35912 and previous config saved to /var/cache/conftool/dbconfig/20221021-170551-ladsgroup.json [17:06:05] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (10herron) 05In progress→03Stalled [17:07:11] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] .gitmodules: translations migrated to Gerrit [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/845459 (https://phabricator.wikimedia.org/T321350) (owner: 10Hashar) [17:07:16] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2002-dev.codfw.wmnet with reason: host reimage [17:09:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35913 and previous config saved to /var/cache/conftool/dbconfig/20221021-170908-ladsgroup.json [17:09:31] (03PS8) 10Vlad.shapik: WIP: Provide additional tests to cover errors and exceptions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/841183 (https://phabricator.wikimedia.org/T318406) [17:09:33] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-ctrl1002.eqiad.wmnet [17:10:26] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2003-dev.codfw.wmnet with OS bullseye [17:12:15] PROBLEM - Check systemd state on elastic1074 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:49] PROBLEM - Check systemd state on elastic1075 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:15] RECOVERY - Check systemd state on elastic1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:59] RECOVERY - Check systemd state on elastic1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:18:45] (03PS3) 10Jdlrobson: Unset some bad logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845035 [17:18:52] (03PS4) 10Jdlrobson: Unset some bad logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845035 [17:20:02] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply updates - bking@cumin2002 - T321310 [17:20:03] PROBLEM - Check systemd state on elastic1098 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:20:41] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) 
for ElasticSearch cluster search_codfw: apply updates - bking@cumin2002 - T321310 [17:21:13] PROBLEM - Check systemd state on elastic1102 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:24:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P35914 and previous config saved to /var/cache/conftool/dbconfig/20221021-172414-ladsgroup.json [17:26:43] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2003-dev.codfw.wmnet with reason: host reimage [17:29:41] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2002-dev.codfw.wmnet with OS bullseye [17:30:52] PROBLEM - Check systemd state on elastic1100 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:02] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2003-dev.codfw.wmnet with reason: host reimage [17:36:57] RECOVERY - Check systemd state on elastic1098 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P35915 and previous config saved to /var/cache/conftool/dbconfig/20221021-173921-ladsgroup.json [17:46:59] (03PS1) 10BBlack: Clean up outdated commentary on requestctl [puppet] - 10https://gerrit.wikimedia.org/r/845648 (https://phabricator.wikimedia.org/T288106) [17:47:01] (03PS1) 10BBlack: Split confd file definitions [puppet] - 10https://gerrit.wikimedia.org/r/845649 (https://phabricator.wikimedia.org/T288106) [17:47:03] (03PS1) 10BBlack: single_backend mode for production varnishes [puppet] - 10https://gerrit.wikimedia.org/r/845650 (https://phabricator.wikimedia.org/T288106) [17:47:05] (03PS1) 10BBlack: cp4045: enable single_backend mode [puppet] - 10https://gerrit.wikimedia.org/r/845651 [17:47:57] (03CR) 10CI reject: [V: 04-1] Split confd file definitions [puppet] - 10https://gerrit.wikimedia.org/r/845649 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [17:48:24] (03CR) 10CI reject: [V: 04-1] single_backend mode for production varnishes [puppet] - 10https://gerrit.wikimedia.org/r/845650 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [17:51:05] (03CR) 10BBlack: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/845649 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [17:52:19] (03PS9) 10Vlad.shapik: Provide additional tests to cover errors caused by wrong engine commands [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/841183 (https://phabricator.wikimedia.org/T318406) [17:53:25] RECOVERY - Check systemd state on elastic1100 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:53:51] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2003-dev.codfw.wmnet with OS bullseye [17:54:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 
(T321312)', diff saved to https://phabricator.wikimedia.org/P35916 and previous config saved to /var/cache/conftool/dbconfig/20221021-175427-ladsgroup.json [17:54:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [17:54:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [17:54:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P35917 and previous config saved to /var/cache/conftool/dbconfig/20221021-175453-ladsgroup.json [17:56:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35918 and previous config saved to /var/cache/conftool/dbconfig/20221021-175615-ladsgroup.json [17:58:19] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [17:59:27] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic2085-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [18:00:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T321312)', diff saved to https://phabricator.wikimedia.org/P35919 and previous config saved to /var/cache/conftool/dbconfig/20221021-180028-ladsgroup.json [18:01:11] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for GFontenelle - https://phabricator.wikimedia.org/T321218 (10GFontenelle_WMF) Thank you, @herron and @Aklapper! 
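The Citoid LVS alerts above come from the service checker probing the service hostname: one check fetches the spec from the root path, the other hits /api ("Zotero and citoid alive"). When such an alert fires, the probe can be reproduced by hand; a minimal sketch, with the port and query parameters below being placeholder assumptions rather than values taken from this log:

    # Port is a placeholder: substitute the real citoid service port
    CITOID_PORT=1970

    # Same endpoint as the "spec from root" check
    curl -sv --max-time 10 "http://citoid.svc.codfw.wmnet:${CITOID_PORT}/" -o /dev/null

    # And the /api liveness check
    curl -sv --max-time 10 "http://citoid.svc.codfw.wmnet:${CITOID_PORT}/api?format=mediawiki&search=https://example.org" -o /dev/null

Both alerts here resolved within a couple of minutes, consistent with a transient timeout rather than a down service.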
[18:02:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P35920 and previous config saved to /var/cache/conftool/dbconfig/20221021-180228-ladsgroup.json [18:03:37] (03PS2) 10BBlack: Split confd file definitions [puppet] - 10https://gerrit.wikimedia.org/r/845649 (https://phabricator.wikimedia.org/T288106) [18:03:39] (03PS2) 10BBlack: single_backend mode for production varnishes [puppet] - 10https://gerrit.wikimedia.org/r/845650 (https://phabricator.wikimedia.org/T288106) [18:03:41] (03PS2) 10BBlack: cp4045: enable single_backend mode [puppet] - 10https://gerrit.wikimedia.org/r/845651 [18:04:24] (03CR) 10CI reject: [V: 04-1] Split confd file definitions [puppet] - 10https://gerrit.wikimedia.org/r/845649 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [18:04:48] (03CR) 10CI reject: [V: 04-1] single_backend mode for production varnishes [puppet] - 10https://gerrit.wikimedia.org/r/845650 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [18:09:30] (03PS3) 10BBlack: Split confd file definitions [puppet] - 10https://gerrit.wikimedia.org/r/845649 (https://phabricator.wikimedia.org/T288106) [18:09:32] (03PS3) 10BBlack: single_backend mode for production varnishes [puppet] - 10https://gerrit.wikimedia.org/r/845650 (https://phabricator.wikimedia.org/T288106) [18:09:34] (03PS3) 10BBlack: cp4045: enable single_backend mode [puppet] - 10https://gerrit.wikimedia.org/r/845651 [18:10:19] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [18:11:43] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [18:13:45] RECOVERY - Check systemd state on elastic1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:15:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P35921 and previous config saved to /var/cache/conftool/dbconfig/20221021-181534-ladsgroup.json [18:15:41] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mirror1001.wikimedia.org with reason: New Kernel [18:15:51] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [18:15:55] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mirror1001.wikimedia.org with reason: New Kernel [18:16:25] PROBLEM - Check systemd state on elastic2072 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P35922 and previous config saved to /var/cache/conftool/dbconfig/20221021-181734-ladsgroup.json [18:17:59] RECOVERY - Check systemd state on elastic2072 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:19:19] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the 
unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [18:19:28] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx2001.wikimedia.org with reason: New Kernel [18:19:42] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx2001.wikimedia.org with reason: New Kernel [18:20:53] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [18:21:03] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudservices2005-dev.wikimedia.org [18:22:19] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2005-dev.wikimedia.org [18:24:04] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: New Kernel [18:24:18] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: New Kernel [18:30:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P35923 and previous config saved to /var/cache/conftool/dbconfig/20221021-183041-ladsgroup.json [18:32:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P35924 and previous config saved to /var/cache/conftool/dbconfig/20221021-183241-ladsgroup.json [18:33:09] (03PS4) 10BBlack: single_backend mode for production varnishes [puppet] - 10https://gerrit.wikimedia.org/r/845650 (https://phabricator.wikimedia.org/T288106) [18:33:11] (03PS4) 10BBlack: cp4045: enable single_backend mode [puppet] - 10https://gerrit.wikimedia.org/r/845651 [18:33:11] PROBLEM - Check systemd state on elastic2045 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:33:31] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudservices2005-dev.wikimedia.org [18:35:21] RECOVERY - Check systemd state on elastic2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:37:20] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudservices2004-dev.wikimedia.org [18:38:47] !log pool new host cp4047: T317244 [18:38:50] !log pool new host cp4049: T317244 [18:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:52] T317244: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 [18:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:06] !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp4047.ulsfo.wmnet,service=ats-be [18:39:06] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp4047.ulsfo.wmnet,service=ats-tls [18:39:06] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp4047.ulsfo.wmnet,service=varnish-fe [18:39:07] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4047.ulsfo.wmnet,service=ats-be [18:39:07] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4047.ulsfo.wmnet,service=ats-tls [18:39:07] !log sukhe@puppetmaster1001 conftool action : 
set/pooled=yes; selector: name=cp4047.ulsfo.wmnet,service=varnish-fe [18:40:23] !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp4049.ulsfo.wmnet,service=ats-be [18:40:23] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp4049.ulsfo.wmnet,service=ats-tls [18:40:24] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp4049.ulsfo.wmnet,service=varnish-fe [18:40:24] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4049.ulsfo.wmnet,service=ats-be [18:40:24] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4049.ulsfo.wmnet,service=ats-tls [18:40:25] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4049.ulsfo.wmnet,service=varnish-fe [18:41:42] (03CR) 10CI reject: [V: 04-1] Modify jupyterhub config files to point to conda-analytics instead of anaconda-wmf. [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [18:45:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T321312)', diff saved to https://phabricator.wikimedia.org/P35925 and previous config saved to /var/cache/conftool/dbconfig/20221021-184547-ladsgroup.json [18:45:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [18:46:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [18:46:22] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol2005-dev.wikimedia.org [18:46:41] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2004-dev.wikimedia.org [18:47:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P35926 and previous config saved to /var/cache/conftool/dbconfig/20221021-184747-ladsgroup.json [18:48:01] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2075-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [18:48:11] PROBLEM - Check systemd state on elastic2075 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:21] PROBLEM - Check systemd state on elastic2039 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:31] PROBLEM - Check systemd state on elastic2038 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:49:41] (03PS4) 10Xcollazo: Modify jupyterhub config files to point to conda-analytics instead of anaconda-wmf. 
[puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) [18:49:47] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudservices2004-dev.wikimedia.org [18:50:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35927 and previous config saved to /var/cache/conftool/dbconfig/20221021-185003-ladsgroup.json [18:50:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [18:50:13] ^ there seems to be repeated elastic systemd unit failures. are these known? [18:50:17] RECOVERY - Check systemd state on elastic2075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [18:50:27] RECOVERY - Check systemd state on elastic2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:29] seems like they are flapping [18:50:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T321312)', diff saved to https://phabricator.wikimedia.org/P35928 and previous config saved to /var/cache/conftool/dbconfig/20221021-185032-ladsgroup.json [18:50:35] RECOVERY - Check systemd state on elastic2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:51:42] (03PS5) 10Xcollazo: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) [18:52:58] (03CR) 10Xcollazo: "Ok this one is ready for reviews." [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [18:53:01] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic2075-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [18:56:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T321312)', diff saved to https://phabricator.wikimedia.org/P35929 and previous config saved to /var/cache/conftool/dbconfig/20221021-185651-ladsgroup.json [18:57:21] PROBLEM - Check systemd state on elastic2062 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:58:49] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol2004-dev.wikimedia.org [18:59:25] RECOVERY - Check systemd state on elastic2062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:59:41] 10SRE, 10Growth-Team, 10Notifications, 10Wikimedia-production-error: Failed to fetch API response from {wiki}. 
Error code {code} - https://phabricator.wikimedia.org/T321409 (10Sgs) [19:05:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P35930 and previous config saved to /var/cache/conftool/dbconfig/20221021-190509-ladsgroup.json [19:10:39] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2001-dev.wikimedia.org [19:11:51] (03CR) 10Ottomata: "Looking good! some comments inline." [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [19:11:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P35931 and previous config saved to /var/cache/conftool/dbconfig/20221021-191157-ladsgroup.json [19:14:31] PROBLEM - Check systemd state on elastic2043 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:42] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup1001-dev.eqiad.wmnet [19:16:31] RECOVERY - Check systemd state on elastic2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:24] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): wmf-puppet-dashboard updates [19:18:36] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6] (dev): wmf-puppet-dashboard updates (duration: 01m 12s) [19:19:36] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1001-dev.eqiad.wmnet [19:19:43] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: wmf-puppet-dashboard updates [19:20:17] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup1002-dev.eqiad.wmnet [19:20:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P35932 and previous config saved to /var/cache/conftool/dbconfig/20221021-192016-ladsgroup.json [19:21:56] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol2001-dev.wikimedia.org [19:22:38] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: wmf-puppet-dashboard updates (duration: 02m 55s) [19:24:10] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1002-dev.eqiad.wmnet [19:27:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P35933 and previous config saved to /var/cache/conftool/dbconfig/20221021-192704-ladsgroup.json [19:32:03] PROBLEM - Check systemd state on elastic2057 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:03] RECOVERY - Check systemd state on elastic2057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T321312)', diff saved to https://phabricator.wikimedia.org/P35934 and previous 
config saved to /var/cache/conftool/dbconfig/20221021-193524-ladsgroup.json [19:35:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [19:35:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [19:35:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T321312)', diff saved to https://phabricator.wikimedia.org/P35935 and previous config saved to /var/cache/conftool/dbconfig/20221021-193550-ladsgroup.json [19:40:01] PROBLEM - Check systemd state on elastic2058 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:41:21] RECOVERY - Check systemd state on elastic2058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:42:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T321312)', diff saved to https://phabricator.wikimedia.org/P35936 and previous config saved to /var/cache/conftool/dbconfig/20221021-194201-ladsgroup.json [19:42:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T321312)', diff saved to https://phabricator.wikimedia.org/P35937 and previous config saved to /var/cache/conftool/dbconfig/20221021-194210-ladsgroup.json [19:42:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [19:42:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [19:42:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T321312)', diff saved to https://phabricator.wikimedia.org/P35938 and previous config saved to /var/cache/conftool/dbconfig/20221021-194234-ladsgroup.json [19:44:29] 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10jbond) wanted to update the task and first say sorry for the delayed progress on this task. To p... 
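The T308677 update at 19:44:29 concerns SSDs whose /dev/sdX names change between boots, which can point the installer at the wrong drive and destroy an existing swift filesystem. Persistent identifiers avoid that ambiguity; a brief sketch using standard util-linux tooling, not anything specific to that task:

    # Kernel names (sda, sdb, ...) may be assigned in a different order each boot
    lsblk -o NAME,MODEL,SERIAL,WWN,SIZE,MOUNTPOINT

    # Persistent symlinks keyed on model/serial, WWN or PCI path stay stable
    ls -l /dev/disk/by-id/ /dev/disk/by-path/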
[19:45:32] (03PS1) 10Stef Dunlap: Fixup development tooling for wider compatibility [deployment-charts] - 10https://gerrit.wikimedia.org/r/845680 [19:48:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T321312)', diff saved to https://phabricator.wikimedia.org/P35939 and previous config saved to /var/cache/conftool/dbconfig/20221021-194847-ladsgroup.json [19:57:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P35940 and previous config saved to /var/cache/conftool/dbconfig/20221021-195708-ladsgroup.json [19:57:25] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [20:00:13] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:01:27] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:02:36] (03PS1) 10BCornwall: prometheus: Add ats header/body size total metrics [puppet] - 10https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304) [20:03:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P35941 and previous config saved to /var/cache/conftool/dbconfig/20221021-200353-ladsgroup.json [20:04:35] PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:43] RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:12:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P35942 and previous config saved to /var/cache/conftool/dbconfig/20221021-201214-ladsgroup.json [20:12:15] PROBLEM - Check systemd state on elastic2076 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:13:51] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [20:14:21] RECOVERY - Check systemd state on elastic2076 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:15:55] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:19:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P35943 and previous config saved to /var/cache/conftool/dbconfig/20221021-201900-ladsgroup.json [20:20:02] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: apply updates - bking@cumin2002 - T321310 [20:27:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T321312)', diff saved to https://phabricator.wikimedia.org/P35944 and previous config saved to /var/cache/conftool/dbconfig/20221021-202721-ladsgroup.json [20:30:23] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [20:34:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T321312)', diff saved to https://phabricator.wikimedia.org/P35945 and previous config saved to /var/cache/conftool/dbconfig/20221021-203406-ladsgroup.json [20:34:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [20:34:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [20:34:29] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [20:34:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T321312)', diff saved to https://phabricator.wikimedia.org/P35946 and previous config saved to /var/cache/conftool/dbconfig/20221021-203430-ladsgroup.json [20:36:31] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:40:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T321312)', diff saved to https://phabricator.wikimedia.org/P35947 and previous config saved to /var/cache/conftool/dbconfig/20221021-204045-ladsgroup.json [20:40:59] 10SRE, 10Discovery-Search (Current work): Provide compatible elasticsearch-oss (7.x) and wmf-elasticsearch-search-plugins for buster on WMF APT repo - https://phabricator.wikimedia.org/T318820 (10bking) Looks like we (as in Search Platform SREs) need to cut a new package for `wmf-elasticsearch-search-plugins`... [20:44:53] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [20:46:57] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:51:33] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:55:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P35948 and previous config saved to /var/cache/conftool/dbconfig/20221021-205551-ladsgroup.json [20:56:48] (03PS5) 10BBlack: single_backend mode for production varnishes [puppet] - 10https://gerrit.wikimedia.org/r/845650 (https://phabricator.wikimedia.org/T288106) [20:56:50] (03PS5) 10BBlack: cp4045: enable single_backend mode [puppet] - 10https://gerrit.wikimedia.org/r/845651 [20:56:52] (03PS1) 10BBlack: Remove confd_experiment_fqdn support [puppet] - 10https://gerrit.wikimedia.org/r/845713 (https://phabricator.wikimedia.org/T288106) [20:57:15] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [21:00:21] (03PS2) 10BCornwall: prometheus: Add ats header/body size total metrics [puppet] - 10https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304) [21:00:46] (03CR) 10Dzahn: [C: 03+1] "oh, duh, I did not see that. all looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/845591 (https://phabricator.wikimedia.org/T321355) (owner: 10Herron) [21:01:23] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [21:02:15] (03CR) 10Dzahn: "eh, yea, I am not sure I can confirm it's 100% safe but _from what I can tell_ nothing _seems_ to use the ID number. A few things do still" [puppet] - 10https://gerrit.wikimedia.org/r/845027 (owner: 10JHathaway) [21:02:38] (03PS1) 10Cwhite: beta-logs: add new hosts [puppet] - 10https://gerrit.wikimedia.org/r/844563 (https://phabricator.wikimedia.org/T321410) [21:08:10] (03CR) 10Herron: [C: 03+2] admin: add kindrobot to deployers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845591 (https://phabricator.wikimedia.org/T321355) (owner: 10Herron) [21:10:03] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: MediaWiki deployment shell access request for SDunlap - https://phabricator.wikimedia.org/T321355 (10herron) 05Open→03Resolved a:03herron The requested access has been granted and will fully propagate within 30 minutes. I'll transition this to resolved... [21:10:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P35949 and previous config saved to /var/cache/conftool/dbconfig/20221021-211058-ladsgroup.json [21:16:32] 10SRE, 10Znuny, 10serviceops-collab: Move VTRS db passwords to a different hiera location - https://phabricator.wikimedia.org/T303272 (10Dzahn) I did a `mysql -h m2-master.eqiad.wmnet -u otrs -p otrs` from otrs1001 and could confirm that the password at `hieradata/common/profile/vrts.yaml:profile::vrts::data... [21:16:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:26:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T321312)', diff saved to https://phabricator.wikimedia.org/P35950 and previous config saved to /var/cache/conftool/dbconfig/20221021-212604-ladsgroup.json [21:26:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [21:26:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [21:26:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1192 (T321312)', diff saved to https://phabricator.wikimedia.org/P35951 and previous config saved to /var/cache/conftool/dbconfig/20221021-212629-ladsgroup.json [21:32:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T321312)', diff saved to https://phabricator.wikimedia.org/P35952 and previous config saved to /var/cache/conftool/dbconfig/20221021-213242-ladsgroup.json [21:36:37] PROBLEM - Host wcqs1001 is DOWN: PING CRITICAL - Packet loss = 100% [21:37:07] RECOVERY - Host wcqs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [21:41:53] PROBLEM - Host wcqs1002 is DOWN: PING CRITICAL - Packet loss = 100% [21:44:33] RECOVERY - Host wcqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [21:47:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P35953 and previous config saved to /var/cache/conftool/dbconfig/20221021-214749-ladsgroup.json [22:02:56] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P35954 and previous config saved to /var/cache/conftool/dbconfig/20221021-220256-ladsgroup.json [22:18:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T321312)', diff saved to https://phabricator.wikimedia.org/P35955 and previous config saved to /var/cache/conftool/dbconfig/20221021-221802-ladsgroup.json [22:18:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [22:18:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [22:18:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1193 (T321312)', diff saved to https://phabricator.wikimedia.org/P35956 and previous config saved to /var/cache/conftool/dbconfig/20221021-221826-ladsgroup.json [22:19:49] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10serviceops-collab: ensure httpd error logs from "misc apps" (krypton) end up in logstash - https://phabricator.wikimedia.org/T216090 (10Dzahn) a:03Dzahn [22:24:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T321312)', diff saved to https://phabricator.wikimedia.org/P35957 and previous config saved to /var/cache/conftool/dbconfig/20221021-222442-ladsgroup.json [22:39:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P35958 and previous config saved to /var/cache/conftool/dbconfig/20221021-223948-ladsgroup.json [22:51:45] RECOVERY - SSH on db1121.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:53:31] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:54:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P35959 and previous config saved to /var/cache/conftool/dbconfig/20221021-225455-ladsgroup.json [23:10:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T321312)', diff saved to https://phabricator.wikimedia.org/P35960 and previous config saved to /var/cache/conftool/dbconfig/20221021-231001-ladsgroup.json [23:10:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [23:10:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [23:10:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1203 (T321312)', diff saved to https://phabricator.wikimedia.org/P35961 and previous config saved to /var/cache/conftool/dbconfig/20221021-231026-ladsgroup.json [23:17:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T321312)', diff saved to https://phabricator.wikimedia.org/P35962 and previous config saved to /var/cache/conftool/dbconfig/20221021-231741-ladsgroup.json [23:32:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P35963 and previous config saved to 
/var/cache/conftool/dbconfig/20221021-233247-ladsgroup.json [23:33:27] PROBLEM - Check systemd state on db2137 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@s5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:47:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P35964 and previous config saved to /var/cache/conftool/dbconfig/20221021-234754-ladsgroup.json