[00:00:36] (03PS1) 10Dzahn: ci::master: add parameter to enable/disable monitoring of jenkins/httpd [puppet] - 10https://gerrit.wikimedia.org/r/904374 (https://phabricator.wikimedia.org/T324659) [00:01:38] (03CR) 10Dzahn: [C: 03+2] "fixing monitoring alerts via https://gerrit.wikimedia.org/r/c/operations/puppet/+/904374" [puppet] - 10https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [00:02:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1225.mgmt.eqiad.wmnet with reboot policy FORCED [00:04:19] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/904374/40435/" [puppet] - 10https://gerrit.wikimedia.org/r/904374 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [00:07:53] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [00:08:02] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [00:09:54] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072'] [00:10:02] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072'] [00:10:31] u/win 14 [00:10:51] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072'] [00:11:05] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072'] [00:12:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1225.mgmt.eqiad.wmnet with reboot policy FORCED [00:13:21] (03PS1) 10Dzahn: microsites: do not use TLS when monitoring commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904377 (https://phabricator.wikimedia.org/T327976) [00:13:35] !log pt1979@cumin2002 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1207'] [00:13:46] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1207'] [00:18:07] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1207'] [00:18:12] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1207'] [00:18:43] (03CR) 10Dzahn: "opening https://phabricator.wikimedia.org/T333510 to clean this up for real" [puppet] - 10https://gerrit.wikimedia.org/r/904377 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [00:18:44] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1207'] [00:18:59] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1207'] [00:20:33] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072'] [00:20:45] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072'] [00:21:52] (03PS2) 10Dzahn: microsites: do not use TLS when monitoring commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904377 (https://phabricator.wikimedia.org/T327976) [00:24:53] (03PS3) 10Dzahn: microsites: do not use TLS when monitoring commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904377 (https://phabricator.wikimedia.org/T327976) [00:26:31] PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2023-03-21 00:00:07 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:27:36] (03CR) 10Dzahn: "Can we talk about this setup please? It is a special case that does things differently from everything else on miscweb. 
This does not work" [puppet] - 10https://gerrit.wikimedia.org/r/719502 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [00:27:36] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1207'] [00:27:51] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1207'] [00:30:19] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:44] (03CR) 10Dzahn: "please reach out to serviceops-collab team when adding new sites to miscweb in the future so that we can assist you with the certs, monito" [puppet] - 10https://gerrit.wikimedia.org/r/719502 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [00:31:04] (03CR) 10Dzahn: [C: 03+2] microsites: do not use TLS when monitoring commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904377 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [00:35:02] (03PS1) 10Dzahn: microsites: for transparency.wikimedia.org except HTTP 302, not 301 [puppet] - 10https://gerrit.wikimedia.org/r/904378 (https://phabricator.wikimedia.org/T327976) [00:35:28] (03PS2) 10Dzahn: microsites: for transparency.wikimedia.org expect HTTP 302, not 301 [puppet] - 10https://gerrit.wikimedia.org/r/904378 (https://phabricator.wikimedia.org/T327976) [00:35:43] (03CR) 10Dzahn: [C: 03+2] microsites: for transparency.wikimedia.org expect HTTP 302, not 301 [puppet] - 10https://gerrit.wikimedia.org/r/904378 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [00:37:50] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T333328 (10wiki_willy) a:03Papaul [00:43:18] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul) @jbond all the server @Jclark-ctr and I worked on are failing with the 
error below. ` START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1207'] Managem... [00:45:32] (ProbeDown) firing: (8) Service miscweb1002:443 has failed probes (http_commons_query_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:48:26] (03PS1) 10Ssingh: pybal: port check_pybal_ipvs_diff.py to urllib2 [puppet] - 10https://gerrit.wikimedia.org/r/904381 (https://phabricator.wikimedia.org/T321309) [00:49:07] (03CR) 10CI reject: [V: 04-1] pybal: port check_pybal_ipvs_diff.py to urllib2 [puppet] - 10https://gerrit.wikimedia.org/r/904381 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [00:51:58] (03PS2) 10Ssingh: pybal: port check_pybal_ipvs_diff.py to urllib2 [puppet] - 10https://gerrit.wikimedia.org/r/904381 (https://phabricator.wikimedia.org/T321309) [00:52:29] (03PS1) 10Dzahn: microsites: commons-query.wm.org only works on port 80/http [puppet] - 10https://gerrit.wikimedia.org/r/904382 (https://phabricator.wikimedia.org/T333510) [00:53:41] (03PS2) 10Dzahn: microsites: commons-query.wm.org only works on port 80/http [puppet] - 10https://gerrit.wikimedia.org/r/904382 (https://phabricator.wikimedia.org/T333510) [00:55:47] (03CR) 10Dzahn: [C: 03+2] microsites: commons-query.wm.org only works on port 80/http [puppet] - 10https://gerrit.wikimedia.org/r/904382 (https://phabricator.wikimedia.org/T333510) (owner: 10Dzahn) [00:55:53] (03PS3) 10Dzahn: microsites: commons-query.wm.org only works on port 80/http [puppet] - 10https://gerrit.wikimedia.org/r/904382 (https://phabricator.wikimedia.org/T333510) [00:57:00] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40436/console" [puppet] - 10https://gerrit.wikimedia.org/r/904381 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [00:57:21] RECOVERY - dump of es5 in eqiad 
on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2023-03-28 17:02:45 (4227 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:01:47] (03PS3) 10Ssingh: pybal: port check_pybal_ipvs_diff.py to urllib2 [puppet] - 10https://gerrit.wikimedia.org/r/904381 (https://phabricator.wikimedia.org/T321309) [01:05:32] (ProbeDown) firing: (4) Service miscweb1002:443 has failed probes (http_commons_query_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:05:47] (ProbeDown) firing: (4) Service miscweb1002:443 has failed probes (http_commons_query_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:06:02] (03CR) 10Dzahn: [C: 03+2] "https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*contint.*%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max" [puppet] - 10https://gerrit.wikimedia.org/r/904374 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [01:09:55] (03PS1) 10Dzahn: microsites: do not expect 301 for commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904384 (https://phabricator.wikimedia.org/T333507) [01:10:17] (03CR) 10CI reject: [V: 04-1] microsites: do not expect 301 for commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904384 (https://phabricator.wikimedia.org/T333507) (owner: 10Dzahn) [01:10:32] (ProbeDown) firing: (6) Service miscweb1002:443 has failed probes (http_commons_query_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:11:15] (03PS2) 10Dzahn: microsites: do not expect 301 for commons-query.wikimedia.org [puppet] - 
10https://gerrit.wikimedia.org/r/904384 (https://phabricator.wikimedia.org/T333507) [01:11:32] (03CR) 10Dzahn: [C: 03+2] microsites: do not expect 301 for commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904384 (https://phabricator.wikimedia.org/T333507) (owner: 10Dzahn) [01:13:10] (03PS3) 10Dzahn: microsites: do not expect 301 for commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904384 (https://phabricator.wikimedia.org/T333507) [01:13:16] (03CR) 10Dzahn: [V: 03+2] microsites: do not expect 301 for commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904384 (https://phabricator.wikimedia.org/T333507) (owner: 10Dzahn) [01:15:32] (ProbeDown) firing: (6) Service miscweb1002:443 has failed probes (http_commons_query_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:20:32] (ProbeDown) resolved: (7) Service miscweb1002:443 has failed probes (http_commons_query_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:25:58] (03CR) 10Dzahn: [V: 03+2 C: 03+2] "works per https://thanos.wikimedia.org/graph?g0.deduplicate=1&g0.expr=probe_success%7Binstance%3D~%22.*miscweb.*%22%7D&g0.max_source_resol" [puppet] - 10https://gerrit.wikimedia.org/r/904273 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [01:36:41] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Dzahn) The production role for ci::master is now applied on contint2002. Some minor follow-ups were needed: - run puppet mult... 
[01:37:23] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Dzahn) 05Open→03In progress [01:38:37] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [01:46:59] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul) @Jclark-ctr when you are back on site can you please check the network mgmt cable for db1209 and db1210. Thanks [01:48:37] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [01:53:33] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul) [01:53:34] (JobUnavailable) firing: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:34] (JobUnavailable) firing: (2) Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:23:42] (JobUnavailable) firing: (2) Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:33:37] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [02:38:37] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [04:38:37] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [04:43:37] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [05:13:37] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [05:18:37] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [05:38:37] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [05:43:37] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T0600) [06:00:05] kormat, marostegui, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T0600). 
[06:10:37] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [06:15:37] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [06:23:34] (JobUnavailable) firing: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:33:37] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [06:38:37] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [06:51:58] (03PS11) 10Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/904102 [06:52:32] (03CR) 10CI reject: [V: 04-1] C:httpd move htcacheclean to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [06:53:43] (03PS12) 10Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/904102 [06:55:42] (03CR) 10CI reject: [V: 04-1] C:httpd move htcacheclean to httpd class [puppet] - 
10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [06:58:48] (03PS13) 10Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/904102 [07:00:05] Amir1, apergos, and jnuche: May I have your attention please! UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T0700) [07:00:44] ah, let me check [07:01:10] no trainees signed up for today [07:01:39] aaaand no patches scheduled in the window either, a nice quiet morning for everyone [07:01:44] so see you all next time! [07:02:25] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40439/console" [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [07:05:58] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40441/console" [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [07:11:28] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40442/console" [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [07:14:21] (03CR) 10Slyngshede: [V: 03+1] C:httpd move htcacheclean to httpd class (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [07:15:00] (03PS14) 10Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/904102 [07:38:37] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [07:40:00] (03PS1) 10Jameel Kaisar: Add CORS headers to http endpoints of measure-dc domains [puppet] - 
10https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028) [07:43:37] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [07:46:22] (03CR) 10David Caro: maintain-dbusers: run isort and black and use pep563 types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902815 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [07:46:56] (03CR) 10David Caro: maintain-dbusers: refactor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [07:48:12] (03PS8) 10David Caro: maintain-dbusers: refactor [puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663) [07:48:37] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [07:49:32] (03PS9) 10David Caro: maintain-dbusers: refactor [puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663) [07:50:13] (03PS10) 10David Caro: maintain-dbusers: refactor [puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663) [07:50:42] (03CR) 10Ayounsi: "Thanks, some comments and I think we should be able to solve usecase #3" [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond) [07:53:36] (03PS1) 10Alexandros Kosiaris: thumbor: Switch all summaries to histograms [deployment-charts] - 10https://gerrit.wikimedia.org/r/904452 (https://phabricator.wikimedia.org/T333445) [07:56:17] (03CR) 10David Caro: "I'm a bit confused about what this diff is showing :S, I'll try to rebase and resend, it seems to show a "Base" version that is not the on" 
[puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [07:59:23] (03CR) 10Ayounsi: Add CORS headers to http endpoints of measure-dc domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [08:07:14] (03CR) 10Elukey: [C: 03+1] k8s: Remove unused token hiera keys [labs/private] - 10https://gerrit.wikimedia.org/r/904179 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [08:08:14] (03CR) 10Elukey: [C: 03+1] k8s: Remove references to unused token hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/904181 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [08:08:30] 10SRE, 10Data-Engineering, 10SRE Observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (10ayounsi) > @ayounsi - are you able to confirm that dropped packets are no longer a problem for this host from the logstash firewall dashboards? I confirm. 
[08:09:00] (03CR) 10Filippo Giunchedi: [C: 03+2] wmnet: prep statsd/graphite records for easier write failover [dns] - 10https://gerrit.wikimedia.org/r/904185 (https://phabricator.wikimedia.org/T239862) (owner: 10Filippo Giunchedi) [08:10:26] (03CR) 10Vgutierrez: "considering it's key to work properly please add a CORS check to 22-measure.vtc" [puppet] - 10https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [08:15:54] (03CR) 10Jameel Kaisar: Add CORS headers to http endpoints of measure-dc domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [08:17:33] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: remove hardcoded statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/904186 (https://phabricator.wikimedia.org/T239862) (owner: 10Filippo Giunchedi) [08:18:36] (03PS1) 10Elukey: role::kafka::jumbo::broker: upgrade all brokers to PKI [puppet] - 10https://gerrit.wikimedia.org/r/904455 (https://phabricator.wikimedia.org/T296064) [08:19:13] (03PS3) 10Alexandros Kosiaris: openstack::nutcracker: Remove redis support [puppet] - 10https://gerrit.wikimedia.org/r/902074 [08:19:15] (03PS1) 10Alexandros Kosiaris: thumbor: Switch all summaries to histograms [puppet] - 10https://gerrit.wikimedia.org/r/904456 (https://phabricator.wikimedia.org/T333445) [08:19:39] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM if it works. 
Reading about the 'as-path unique-count' it doesn't count confed AS's, or multiples of the same AS, so I'm wondering wo" [homer/public] - 10https://gerrit.wikimedia.org/r/904150 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [08:19:51] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40443/console" [puppet] - 10https://gerrit.wikimedia.org/r/904455 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [08:20:04] (03PS2) 10Filippo Giunchedi: profile: remove hardcoded statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/904186 (https://phabricator.wikimedia.org/T239862) [08:20:06] (03CR) 10Alexandros Kosiaris: "@andrewbogott, reviews welcome!" [puppet] - 10https://gerrit.wikimedia.org/r/902074 (owner: 10Alexandros Kosiaris) [08:22:18] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: remove hardcoded statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/904186 (https://phabricator.wikimedia.org/T239862) (owner: 10Filippo Giunchedi) [08:25:46] (03PS2) 10Alexandros Kosiaris: thumbor: Switch all summaries to histograms [puppet] - 10https://gerrit.wikimedia.org/r/904456 (https://phabricator.wikimedia.org/T333445) [08:25:51] (03PS2) 10Jameel Kaisar: Add CORS headers to http endpoints of measure-dc domains [puppet] - 10https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028) [08:26:58] (03CR) 10Elukey: [C: 03+1] "Tested on superset-next (with/without CAS auth) and it works nicely :)" [puppet] - 10https://gerrit.wikimedia.org/r/902107 (https://phabricator.wikimedia.org/T310009) (owner: 10Volans) [08:27:54] (03CR) 10Jameel Kaisar: Add CORS headers to http endpoints of measure-dc domains (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [08:29:05] (03PS2) 10Elukey: role::kafka::main: deploy PKI migration settings [puppet] - 
10https://gerrit.wikimedia.org/r/901551 (https://phabricator.wikimedia.org/T319372) [08:30:23] (03CR) 10Filippo Giunchedi: "PCC fails (float vs integer) https://puppet-compiler.wmflabs.org/output/904456/40446/thumbor1001.eqiad.wmnet/change.thumbor1001.eqiad.wmne" [puppet] - 10https://gerrit.wikimedia.org/r/904456 (https://phabricator.wikimedia.org/T333445) (owner: 10Alexandros Kosiaris) [08:34:14] (03CR) 10Btullis: [C: 03+1] "Great! Thanks elukey." [puppet] - 10https://gerrit.wikimedia.org/r/904455 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [08:34:37] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [08:35:15] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] k8s: Remove unused token hiera keys [labs/private] - 10https://gerrit.wikimedia.org/r/904179 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [08:36:35] (03PS3) 10Alexandros Kosiaris: thumbor: Switch all summaries to histograms [puppet] - 10https://gerrit.wikimedia.org/r/904456 (https://phabricator.wikimedia.org/T333445) [08:37:26] (03CR) 10Alexandros Kosiaris: [C: 03+1] role::kafka::main: deploy PKI migration settings [puppet] - 10https://gerrit.wikimedia.org/r/901551 (https://phabricator.wikimedia.org/T319372) (owner: 10Elukey) [08:38:11] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40447/console" [puppet] - 10https://gerrit.wikimedia.org/r/904181 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [08:38:18] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:38:19] (03CR) 10Elukey: [C: 03+2] role::kafka::main: deploy PKI migration settings [puppet] - 10https://gerrit.wikimedia.org/r/901551 (https://phabricator.wikimedia.org/T319372) (owner: 10Elukey) [08:38:49] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40448/console" [puppet] - 10https://gerrit.wikimedia.org/r/904456 (https://phabricator.wikimedia.org/T333445) (owner: 10Alexandros Kosiaris) [08:39:22] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s: Remove references to unused token hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/904181 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [08:39:37] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [08:43:17] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:43:37] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [08:44:19] (03CR) 10Vgutierrez: [C: 03+1] "looking good, tests are happy:" 
[puppet] - 10https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [08:45:34] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/904456 (https://phabricator.wikimedia.org/T333445) (owner: 10Alexandros Kosiaris) [08:47:07] (03CR) 10Alexandros Kosiaris: WIP: Add new self hosted machinetranslation service (MinT) (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [08:47:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [08:47:40] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/902107 (https://phabricator.wikimedia.org/T310009) (owner: 10Volans) [08:48:37] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [08:49:57] (03PS3) 10Jameel Kaisar: Add CORS headers to http endpoints of measure-dc domains [puppet] - 10https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028) [08:51:16] (03CR) 10Jameel Kaisar: Add CORS headers to http endpoints of measure-dc domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [08:53:02] (03CR) 10Alexandros Kosiaris: [C: 04-1] WIP: Add new self hosted machinetranslation service (MinT) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [08:54:33] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm 
daemons. [08:55:36] !log move kafka main clusters to new truststore (PKI+Puppet root CA certs) - T319372 [08:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:41] T319372: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372 [08:58:34] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: Serve an HTTP response for measurement domains directly from Varnish - https://phabricator.wikimedia.org/T332028 (10JameelKaisar) a:03JameelKaisar [08:58:55] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Thumbor-k8s performance improvements - https://phabricator.wikimedia.org/T333445 (10akosiaris) I 've upload a couple of changes to switch summaries to histograms in both environments. That way we will be able to have aggregatable data acros... [09:04:46] !log silence LogstashIndexingFailures during investigation T180051 [09:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:52] T180051: Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051 [09:05:45] (03PS1) 10Clément Goubert: jobrunners: Raise memory_limit to match parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904463 (https://phabricator.wikimedia.org/T333528) [09:06:26] (03CR) 10CI reject: [V: 04-1] jobrunners: Raise memory_limit to match parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904463 (https://phabricator.wikimedia.org/T333528) (owner: 10Clément Goubert) [09:09:00] !log Merging mw-on-k8s ATS lua routing script - T331318 [09:09:03] (03CR) 10Clément Goubert: [C: 03+2] trafficserver: make routing to mw on k8s more manageable [puppet] - 10https://gerrit.wikimedia.org/r/900704 (https://phabricator.wikimedia.org/T331318) (owner: 10Giuseppe Lavagetto) [09:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:05] T331318: Find a sensible way to direct traffic to 
mw-on-k8s - https://phabricator.wikimedia.org/T331318 [09:12:07] (03PS1) 10DCausse: [DNM] flink-app: always include /etc/envoy/ssl/ca.crt [deployment-charts] - 10https://gerrit.wikimedia.org/r/904464 (https://phabricator.wikimedia.org/T328675) [09:12:38] !log puppet disabled for A:cp-text - T331318 [09:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:03] (03CR) 10CI reject: [V: 04-1] [DNM] flink-app: always include /etc/envoy/ssl/ca.crt [deployment-charts] - 10https://gerrit.wikimedia.org/r/904464 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [09:15:55] !log puppet disabled for A:cp-upload - T331318 [09:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:01] T331318: Find a sensible way to direct traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318 [09:16:48] !log Running puppet on cp2028.codfw.wmnet (cp-upload noop test) - T331318 [09:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:27] (03PS3) 10Ayounsi: Add policy to export prefixes to k8s nodes [homer/public] - 10https://gerrit.wikimedia.org/r/904150 (https://phabricator.wikimedia.org/T328523) [09:19:29] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/904150 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [09:20:41] (03PS2) 10DCausse: [DNM] flink-app: always include /etc/envoy/ssl/ca.crt [deployment-charts] - 10https://gerrit.wikimedia.org/r/904464 (https://phabricator.wikimedia.org/T328675) [09:23:53] !log Re-enabling puppet for A:cp-upload - T331318 [09:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:59] T331318: Find a sensible way to direct traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318 [09:24:45] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:25:38] 10SRE, 10Data-Engineering, 10SRE Observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (10fgiunchedi) >>! In T238794#8738885, @BTullis wrote: > @fgiunchedi - is this just a matter of removing some old config now? Or is there another reason why we're not seeing traff... 
[09:27:03] (03CR) 10Ayounsi: [C: 03+2] Add policy to export prefixes to k8s nodes [homer/public] - 10https://gerrit.wikimedia.org/r/904150 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [09:27:38] (03Merged) 10jenkins-bot: Add policy to export prefixes to k8s nodes [homer/public] - 10https://gerrit.wikimedia.org/r/904150 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [09:28:02] !log joal@deploy2002 Started deploy [analytics/refinery@359f4bd]: Regular analytics weekly train (2nd) [analytics/refinery@359f4bd] [09:28:43] (03CR) 10DCausse: [DNM] flink-app: always include /etc/envoy/ssl/ca.crt (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/904464 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [09:32:20] (03CR) 10DCausse: "not meant to be merged, just to illustrate what I think might be the cause of:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/904464 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [09:32:27] (03CR) 10DCausse: [C: 04-2] [DNM] flink-app: always include /etc/envoy/ssl/ca.crt [deployment-charts] - 10https://gerrit.wikimedia.org/r/904464 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [09:33:54] !log joal@deploy2002 Finished deploy [analytics/refinery@359f4bd]: Regular analytics weekly train (2nd) [analytics/refinery@359f4bd] (duration: 05m 53s) [09:34:37] !log joal@deploy2002 Started deploy [analytics/refinery@359f4bd] (thin): Regular analytics weekly train (2nd) THIN [analytics/refinery@359f4bd] [09:34:45] !log joal@deploy2002 Finished deploy [analytics/refinery@359f4bd] (thin): Regular analytics weekly train (2nd) THIN [analytics/refinery@359f4bd] (duration: 00m 08s) [09:35:03] !log Re-enabling puppet for cp4037 - T331318 [09:35:06] !log joal@deploy2002 Started deploy [analytics/refinery@359f4bd] (hadoop-test): Regular analytics weekly train (2nd) TEST [analytics/refinery@359f4bd] [09:35:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:10] T331318: Find a sensible way to direct traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318 [09:36:34] !log joal@deploy2002 Finished deploy [analytics/refinery@359f4bd] (hadoop-test): Regular analytics weekly train (2nd) TEST [analytics/refinery@359f4bd] (duration: 01m 28s) [09:37:13] (03CR) 10Jaime Nuche: "Is there value in updating the failing gerrit-git-fat-pull job to test lfs? From what I gather maybe it's not worth it we can just remove " [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904239 (https://phabricator.wikimedia.org/T333465) (owner: 10Hashar) [09:37:56] (03CR) 10Jbond: [C: 03+1] "thanks lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede) [09:38:29] (03PS1) 10Ayounsi: Move the as-path-regex out of the policy [homer/public] - 10https://gerrit.wikimedia.org/r/904486 (https://phabricator.wikimedia.org/T328523) [09:39:14] (03CR) 10Ayounsi: [C: 03+2] Move the as-path-regex out of the policy [homer/public] - 10https://gerrit.wikimedia.org/r/904486 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [09:39:48] (03Merged) 10jenkins-bot: Move the as-path-regex out of the policy [homer/public] - 10https://gerrit.wikimedia.org/r/904486 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [09:43:05] (03PS1) 10Btullis: Bump datahub version to 0.10.0 and re-enable standalone consumers [deployment-charts] - 10https://gerrit.wikimedia.org/r/904487 (https://phabricator.wikimedia.org/T329514) [09:43:48] (03CR) 10Hnowlan: [C: 03+1] thumbor: Switch all summaries to histograms [deployment-charts] - 10https://gerrit.wikimedia.org/r/904452 (https://phabricator.wikimedia.org/T333445) (owner: 10Alexandros Kosiaris) [09:44:24] !log Re-enabling puppet for cp-text_ulsfo - T331318 [09:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:30] T331318: Find a sensible way to direct traffic to 
mw-on-k8s - https://phabricator.wikimedia.org/T331318 [09:47:56] !log joal@deploy2002 Started deploy [airflow-dags/analytics@b7b41ae]: Regular analytics weekly train (2nd) [airflow-dags/analytics@b7b41ae] [09:48:07] !log joal@deploy2002 Finished deploy [airflow-dags/analytics@b7b41ae]: Regular analytics weekly train (2nd) [airflow-dags/analytics@b7b41ae] (duration: 00m 11s) [09:48:28] (03PS5) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [09:49:08] (03PS2) 10Clément Goubert: jobrunners: Raise memory_limit to match parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904463 (https://phabricator.wikimedia.org/T333528) [09:49:19] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10oleksandr_tsyba_WMDE) Thank you, @Ladsgroup! 🙏 //*in case of emergency // ` git rebase -i HEAD~1 drop 20bd3f71404912f60f45c1a84f0ed7d76386d6a5 git push --force` ` [09:50:07] RECOVERY - Check systemd state on phab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:53:27] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [09:53:30] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [09:54:20] (03PS1) 10JMeybohm: k8s rsyslog: Use client cert instead of token [puppet] - 10https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268) [09:54:42] (03CR) 10CI reject: [V: 04-1] k8s rsyslog: Use client cert instead of token [puppet] - 10https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [09:55:36] (03CR) 10LSobanski: [C: 03+1] alertmanager: create 
receiver for both sre-collab and releng combined [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [09:55:42] (03PS2) 10JMeybohm: k8s rsyslog: Use client cert instead of token [puppet] - 10https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268) [09:56:21] (03PS3) 10JMeybohm: k8s rsyslog: Use client cert instead of token [puppet] - 10https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268) [09:58:19] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [09:58:22] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:00:05] mvolz: That opportune time is upon us again. Time for a Services – Citoid / Zotero deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1000). [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1000) [10:02:08] (03PS4) 10Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669) [10:04:21] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [10:04:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P45985 and previous config saved to /var/cache/conftool/dbconfig/20230330-100457-ladsgroup.json [10:08:28] (03PS9) 10Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) [10:08:30] 
(03PS5) 10Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669) [10:09:09] (03PS1) 10Vgutierrez: purged: Don't specify the kafka compression codec [puppet] - 10https://gerrit.wikimedia.org/r/904490 (https://phabricator.wikimedia.org/T332669) [10:10:12] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1181 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/904261 (https://phabricator.wikimedia.org/T333538) [10:10:14] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40451/console" [puppet] - 10https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [10:10:40] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [10:11:50] (03PS2) 10Vgutierrez: purged: Don't specify the kafka compression codec [puppet] - 10https://gerrit.wikimedia.org/r/904490 (https://phabricator.wikimedia.org/T332669) [10:12:34] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:12:37] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:12:59] (03PS10) 10Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) [10:13:01] (03PS6) 10Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669) [10:13:20] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40452/console" [puppet] - 10https://gerrit.wikimedia.org/r/904490 (https://phabricator.wikimedia.org/T332669) (owner: 10Vgutierrez) [10:15:09] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [10:16:21] (03CR) 10Jbond: "updated i have also updated the paste and included the devices with no primary ipv4 which is quit e a lot. wonder if this is expected?" [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond) [10:18:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 28 hosts with reason: Primary switchover s7 T333538 [10:18:43] T333538: Switchover s7 master (db1136 -> db1181) - https://phabricator.wikimedia.org/T333538 [10:19:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s7 T333538 [10:20:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P45987 and previous config saved to /var/cache/conftool/dbconfig/20230330-102002-ladsgroup.json [10:20:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1181 with weight 0 T333538', diff saved to https://phabricator.wikimedia.org/P45988 and previous config saved to /var/cache/conftool/dbconfig/20230330-102012-ladsgroup.json [10:23:42] (JobUnavailable) firing: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:27:44] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1075.mgmt.eqiad.wmnet with 
reboot policy FORCED [10:28:48] (03PS1) 10Hnowlan: Add service records for rest-gateway [dns] - 10https://gerrit.wikimedia.org/r/904493 (https://phabricator.wikimedia.org/T329074) [10:28:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:29:29] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED [10:33:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:35:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P45989 and previous config saved to /var/cache/conftool/dbconfig/20230330-103506-ladsgroup.json [10:35:42] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm daemons. 
[10:35:50] (03PS3) 10Hnowlan: service, k8s: Add service definitions for rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/891510 (https://phabricator.wikimedia.org/T329049) [10:42:11] (03CR) 10JMeybohm: [C: 03+1] spark: provide CRUD rights on secret for spark-deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/902391 (https://phabricator.wikimedia.org/T332908) (owner: 10Nicolas Fraison) [10:44:22] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED [10:45:57] !log Starting s7 eqiad failover from db1136 to db1181 - T333538 [10:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:02] T333538: Switchover s7 master (db1136 -> db1181) - https://phabricator.wikimedia.org/T333538 [10:46:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1181 to s7 primary T333538', diff saved to https://phabricator.wikimedia.org/P45992 and previous config saved to /var/cache/conftool/dbconfig/20230330-104617-ladsgroup.json [10:46:39] (03CR) 10Ladsgroup: [C: 03+2] mariadb: Promote db1181 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/904261 (https://phabricator.wikimedia.org/T333538) (owner: 10Gerrit maintenance bot) [10:49:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1136 T333538', diff saved to https://phabricator.wikimedia.org/P45993 and previous config saved to /var/cache/conftool/dbconfig/20230330-104928-ladsgroup.json [10:50:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P45994 and previous config saved to /var/cache/conftool/dbconfig/20230330-105011-ladsgroup.json [10:51:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [10:51:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
db1136.eqiad.wmnet with reason: Maintenance [10:53:43] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:55:49] jouncebot: nowandnext [10:55:49] For the next 0 hour(s) and 4 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1000) [10:55:50] For the next 0 hour(s) and 4 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1000) [10:55:50] In 2 hour(s) and 4 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1300) [10:55:50] In 2 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1300) [10:56:18] (03CR) 10Stevemunene: [C: 03+1] role::kafka::jumbo::broker: upgrade all brokers to PKI [puppet] - 10https://gerrit.wikimedia.org/r/904455 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [10:57:52] (03CR) 10JMeybohm: [C: 03+1] Add service records for rest-gateway [dns] - 10https://gerrit.wikimedia.org/r/904493 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan) [10:58:02] (03CR) 10JMeybohm: [C: 03+1] service, k8s: Add service definitions for rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/891510 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [10:58:28] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:58:39] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED [10:58:52] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host 
ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED [10:59:01] (03CR) 10Ladsgroup: [C: 03+2] Revert "Revert "mwscript: Switch to use run.php"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893552 (https://phabricator.wikimedia.org/T326800) (owner: 10Ladsgroup) [10:59:48] (03Merged) 10jenkins-bot: Revert "Revert "mwscript: Switch to use run.php"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893552 (https://phabricator.wikimedia.org/T326800) (owner: 10Ladsgroup) [11:00:41] (03PS1) 10EoghanGaffney: Adds flag to start after unmask, starts logrotate [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869) [11:02:29] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:893552|Revert "Revert "mwscript: Switch to use run.php"" (T326800)]] [11:02:34] T326800: Make Wikimedia mwscript use run.php to run maintenance scripts - https://phabricator.wikimedia.org/T326800 [11:02:40] 10SRE-Sprint-Week-Sustainability-March2023, 10serviceops, 10Sustainability (Incident Followup): Expand upon Kask/Sessionstore documentation - https://phabricator.wikimedia.org/T320398 (10hnowlan) >>! In T320398#8718719, @akosiaris wrote: > * Some links to important graphs to look at and correlate when in an... 
[11:03:52] !log Re-enabling puppet for cp-text - T331318 [11:03:56] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:893552|Revert "Revert "mwscript: Switch to use run.php"" (T326800)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [11:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:58] T331318: Find a sensible way to direct traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318 [11:04:46] (03CR) 10Btullis: [C: 03+2] Bump datahub version to 0.10.0 and re-enable standalone consumers [deployment-charts] - 10https://gerrit.wikimedia.org/r/904487 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [11:04:55] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:20] (03CR) 10Hnowlan: [C: 03+2] Add service records for rest-gateway [dns] - 10https://gerrit.wikimedia.org/r/904493 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan) [11:06:19] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:07:12] that's me ^ fixing in a second [11:08:32] !log hnowlan@cumin1001 START - Cookbook sre.dns.netbox [11:09:43] (03Merged) 10jenkins-bot: Bump datahub version to 0.10.0 and re-enable standalone consumers [deployment-charts] - 10https://gerrit.wikimedia.org/r/904487 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [11:10:28] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:893552|Revert "Revert "mwscript: Switch to use run.php"" (T326800)]] (duration: 07m 59s) [11:10:34] T326800: Make Wikimedia mwscript use run.php to run maintenance scripts - https://phabricator.wikimedia.org/T326800 
[11:10:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:11:11] !log hnowlan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add service records for rest-gateway - hnowlan@cumin1001" [11:11:17] (03PS1) 10JMeybohm: WIP: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268) [11:12:11] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add service records for rest-gateway - hnowlan@cumin1001" [11:12:11] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:13:30] (03PS1) 10Jaime Nuche: scap: block Scap execution on inactive deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) [11:15:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:17:04] (03Abandoned) 10Hnowlan: restbase-dev: create new codfw cluster, replace old eqiad cluster [puppet] - 10https://gerrit.wikimedia.org/r/766082 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [11:17:29] (03CR) 10Hnowlan: [C: 03+2] service, k8s: Add service definitions for rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/891510 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [11:22:35] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero 
uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:24:09] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:24:21] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:24:47] (03PS2) 10Jaime Nuche: scap: block Scap execution on inactive deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) [11:25:31] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:25:45] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.964 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:27:08] (03PS7) 10Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669) [11:27:10] (03PS11) 10Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) [11:28:54] (03PS1) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 [11:29:06] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [11:29:15] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond) [11:31:08] (03CR) 10CI reject: [V: 04-1] 
sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede) [11:31:27] (03CR) 10Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/904502/40456/" [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche) [11:36:48] (03PS1) 10Hnowlan: kubernetes: add dummy tokens for rest-gateway [labs/private] - 10https://gerrit.wikimedia.org/r/904511 (https://phabricator.wikimedia.org/T329049) [11:44:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1136.eqiad.wmnet with reason: Maintenance [11:44:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1136.eqiad.wmnet with reason: Maintenance [11:47:48] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [11:49:54] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns an-worker1149-56 - jclark@cumin1001" [11:50:50] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns an-worker1149-56 - jclark@cumin1001" [11:50:50] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:51:34] 10SRE, 10Infrastructure-Foundations: Bug in bridge-utils breaks IPv6 on interface if its not part of a bridge but vlan sub-int of it is - https://phabricator.wikimedia.org/T320429 (10jbond) just noting that ganeti also seems to hit this issue also reported in T233906 [11:55:56] (03PS2) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 [11:56:00] (03PS1) 10Ladsgroup: Set externallinks to WRITE BOTH everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904512 (https://phabricator.wikimedia.org/T321662) [11:57:25] jouncebot: nowandnext [11:57:25] No deployments 
scheduled for the next 1 hour(s) and 2 minute(s) [11:57:25] In 1 hour(s) and 2 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1300) [11:57:25] In 1 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1300) [11:57:29] !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:57:32] coooool [11:57:58] (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede) [11:58:45] (03CR) 10Ladsgroup: [C: 03+2] Set externallinks to WRITE BOTH everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904512 (https://phabricator.wikimedia.org/T321662) (owner: 10Ladsgroup) [11:59:29] (03Merged) 10jenkins-bot: Set externallinks to WRITE BOTH everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904512 (https://phabricator.wikimedia.org/T321662) (owner: 10Ladsgroup) [12:00:23] (03PS1) 10Jcrespo: mediabackups: Add static console port for easier remote management [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) [12:00:29] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:904512|Set externallinks to WRITE BOTH everywhere (T321662)]] [12:00:31] (03PS3) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 [12:00:36] T321662: Enable write both for externallinks in beta and production - https://phabricator.wikimedia.org/T321662 [12:00:45] (03CR) 10CI reject: [V: 04-1] mediabackups: Add static console port for easier remote management [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) (owner: 10Jcrespo) [12:02:12] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:904512|Set externallinks to WRITE BOTH everywhere (T321662)]] synced to the 
testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [12:02:13] the scap is erroring for helm [12:02:16] Error: Kubernetes cluster unreachable: Get "https://kubemaster.svc.eqiad.wmnet:6443/version": dial tcp 10.2.2.8:6443: connect: connection refused [12:02:29] (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede) [12:02:31] claime: ^ :D [12:03:26] (03PS2) 10Jcrespo: mediabackups: Add static console port for easier remote management [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) [12:06:22] (03PS4) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 [12:07:43] (03CR) 10Kamila Součková: [C: 03+1] "LGTM." [labs/private] - 10https://gerrit.wikimedia.org/r/904511 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [12:08:39] !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [12:10:24] tested on mwdebug on s4, s8 and s7 and everything worked fine [12:12:07] (03PS12) 10Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) [12:12:54] (03PS1) 10Btullis: Run the datahub consumers in the GMS context [deployment-charts] - 10https://gerrit.wikimedia.org/r/904517 (https://phabricator.wikimedia.org/T329514) [12:14:12] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond) [12:15:28] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:904512|Set externallinks to WRITE BOTH everywhere (T321662)]] (duration: 14m 58s) [12:15:34] T321662: Enable write both for externallinks in beta and 
production - https://phabricator.wikimedia.org/T321662 [12:16:45] 10SRE, 10SRE-Access-Requests, 10API Platform: Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (10Atieno) a:03Atieno [12:17:17] 10SRE, 10SRE-Access-Requests, 10API Platform (Sprint 06): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (10Atieno) [12:17:18] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED [12:17:24] (03CR) 10Jcrespo: "Ready to go: https://puppet-compiler.wmflabs.org/output/904514/40457/backup1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) (owner: 10Jcrespo) [12:17:33] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED [12:17:59] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED [12:18:02] 10SRE, 10SRE-Access-Requests, 10API Platform (Sprint 06): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (10Atieno) [12:18:41] (03PS1) 10Arturo Borrero Gonzalez: profile::bird::anycast: add template parameter [puppet] - 10https://gerrit.wikimedia.org/r/904518 (https://phabricator.wikimedia.org/T324992) [12:19:03] (03CR) 10CI reject: [V: 04-1] profile::bird::anycast: add template parameter [puppet] - 10https://gerrit.wikimedia.org/r/904518 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [12:19:14] (03CR) 10Btullis: [C: 03+2] Run the datahub consumers in the GMS context [deployment-charts] - 10https://gerrit.wikimedia.org/r/904517 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [12:23:13] (03CR) 10Ayounsi: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera 
export (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond) [12:24:06] (03Merged) 10jenkins-bot: Run the datahub consumers in the GMS context [deployment-charts] - 10https://gerrit.wikimedia.org/r/904517 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [12:25:37] (03PS1) 10Majavah: labstore: add dumps access for dump-references-processor [puppet] - 10https://gerrit.wikimedia.org/r/904519 [12:25:47] (03PS2) 10Majavah: labstore: add dumps access for dump-references-processor [puppet] - 10https://gerrit.wikimedia.org/r/904519 [12:26:06] (03PS3) 10Majavah: labstore: add dumps access for dump-references-processor [puppet] - 10https://gerrit.wikimedia.org/r/904519 (https://phabricator.wikimedia.org/T333549) [12:26:10] (03CR) 10CI reject: [V: 04-1] labstore: add dumps access for dump-references-processor [puppet] - 10https://gerrit.wikimedia.org/r/904519 (https://phabricator.wikimedia.org/T333549) (owner: 10Majavah) [12:26:20] !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/datahub: apply on main [12:27:06] (03Abandoned) 10DCausse: [DNM] flink-app: always include /etc/envoy/ssl/ca.crt [deployment-charts] - 10https://gerrit.wikimedia.org/r/904464 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [12:27:07] !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [12:28:13] (03PS6) 10David Caro: maintain-dbusers: run isort and black and use pep563 types [puppet] - 10https://gerrit.wikimedia.org/r/902815 (https://phabricator.wikimedia.org/T303663) [12:28:15] (03PS6) 10David Caro: maintain-dbusers: only-users match tool users with or without prefix [puppet] - 10https://gerrit.wikimedia.org/r/902817 (https://phabricator.wikimedia.org/T332789) [12:28:17] (03PS11) 10David Caro: maintain-dbusers: refactor [puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663) [12:28:19] (03PS6) 10David 
Caro: maintain-dbusers: allow filtering by account type for maintain [puppet] - 10https://gerrit.wikimedia.org/r/902818 (https://phabricator.wikimedia.org/T332954) [12:28:21] (03PS10) 10David Caro: maintain-dbusers: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) [12:29:07] (03CR) 10David Caro: [C: 03+2] labstore: add dumps access for dump-references-processor [puppet] - 10https://gerrit.wikimedia.org/r/904519 (https://phabricator.wikimedia.org/T333549) (owner: 10Majavah) [12:31:10] 10SRE, 10SRE-Access-Requests, 10API Platform: Requesting access to analytics-privatedata-users for sfaci - https://phabricator.wikimedia.org/T333456 (10Ladsgroup) [12:31:45] (03PS1) 10Ottomata: mediawiki-page-content-change-enrichment - allow egress to api-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/904520 [12:31:58] !log joal@deploy2002 Started deploy [airflow-dags/analytics@a6500cf]: Regular analytics weekly train (2nd) HOTFIX [airflow-dags/analytics@a6500cf] [12:32:09] !log joal@deploy2002 Finished deploy [airflow-dags/analytics@a6500cf]: Regular analytics weekly train (2nd) HOTFIX [airflow-dags/analytics@a6500cf] (duration: 00m 11s) [12:32:44] (03CR) 10EoghanGaffney: [C: 03+1] "It would be nice if we had something that did more than just check for an open port. Possibly something to follow up with observability fo" [puppet] - 10https://gerrit.wikimedia.org/r/903805 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [12:33:40] (03CR) 10EoghanGaffney: [C: 03+1] "Same as the feedback in https://gerrit.wikimedia.org/r/c/operations/puppet/+/903805, it would be nice to have something that checks what t" [puppet] - 10https://gerrit.wikimedia.org/r/903826 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [12:36:08] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons. 
[12:37:20] (03CR) 10Gmodena: [C: 03+1] "LGTM." [deployment-charts] - 10https://gerrit.wikimedia.org/r/904520 (owner: 10Ottomata) [12:37:23] (03PS1) 10EoghanGaffney: Removes unnecessary krb:present line [puppet] - 10https://gerrit.wikimedia.org/r/904522 [12:39:51] (03CR) 10Elukey: [C: 03+1] purged: Don't specify the kafka compression codec [puppet] - 10https://gerrit.wikimedia.org/r/904490 (https://phabricator.wikimedia.org/T332669) (owner: 10Vgutierrez) [12:39:56] (03CR) 10Andrew Bogott: "Your arguments are convincing :) For science, I've stopped nutcracker on the cloudweb hosts to confirm that there are no consequences... w" [puppet] - 10https://gerrit.wikimedia.org/r/902074 (owner: 10Alexandros Kosiaris) [12:40:34] (03CR) 10Ottomata: [C: 03+2] mediawiki-page-content-change-enrichment - allow egress to api-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/904520 (owner: 10Ottomata) [12:41:48] thx elukey [12:43:23] PROBLEM - nutcracker process on cloudweb1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [12:43:59] (03CR) 10BBlack: [C: 03+1] "Looks about right to me, assuming it runs successfully :)" [puppet] - 10https://gerrit.wikimedia.org/r/904381 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [12:44:07] PROBLEM - nutcracker port on cloudweb1004 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker [12:44:07] PROBLEM - nutcracker process on cloudweb1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [12:44:27] PROBLEM - nutcracker port on cloudweb1003 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker [12:45:27] (03PS1) 10Filippo Giunchedi: alertmanager: route data-persistence warnings to -feed [puppet] - 
10https://gerrit.wikimedia.org/r/904525 [12:45:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:46:25] (03Merged) 10jenkins-bot: mediawiki-page-content-change-enrichment - allow egress to api-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/904520 (owner: 10Ottomata) [12:46:47] 10SRE-tools, 10Infrastructure-Foundations: sre.hosts.provision cookbook: check for both default and wmf password - https://phabricator.wikimedia.org/T333554 (10ayounsi) p:05Triage→03Low [12:47:45] 10SRE, 10SRE-Access-Requests, 10API Platform: Requesting access to analytics-privatedata-users for sfaci - https://phabricator.wikimedia.org/T333456 (10Ottomata) Approved. I'd guess that "AQS troubleshooting" would require kerberos as well. [12:48:14] (03CR) 10MVernon: [C: 03+1] "LGTM, thanks, but I think it'd be good to have a +1 from at least one other data-persistence team member before merging." [puppet] - 10https://gerrit.wikimedia.org/r/904525 (owner: 10Filippo Giunchedi) [12:50:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:51:07] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Avoid sub-optimal routing from CR routers to EVPN destinations - https://phabricator.wikimedia.org/T332781 (10ayounsi) How does this compare to taking iBGP down between LEAF1 to SPINE2 if the link goes down? [12:53:37] Amir1: Do you have more output, like what deployment failed ? 
[12:53:41] RECOVERY - Check systemd state on kubernetes2022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:54:25] claime: https://phabricator.wikimedia.org/P45995 [12:54:41] thabks [12:54:45] s/b/n/ [12:55:40] So it failed temporarily and only for mw-debug eqiad [12:55:43] Hmm [12:56:34] (03CR) 10Ottomata: [C: 03+1] "😊" [puppet] - 10https://gerrit.wikimedia.org/r/904455 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1300) [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:00:06] (03PS13) 10Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) [13:02:15] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond) [13:02:39] Amir1: It coincides almost perfectly with puppet runs on both kubemaster1001 and kubemaster1002 where it refreshed kube-apiserver [13:03:44] it's fine for me, I can make it run it again [13:03:46] just to be sure [13:04:11] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:904512|Set externallinks to WRITE BOTH everywhere (T321662)]] [13:04:33] Trigger: Thu 2023-03-30 13:29:00 UTC; 24min left [13:04:37] Trigger: Thu 2023-03-30 13:30:00 UTC; 26min left [13:04:49] They're a bit too close for two servers that do the exact same thing [13:04:50] 10SRE, 10SRE-Access-Requests, 10API Platform: Requesting access 
to analytics-privatedata-users for sfaci - https://phabricator.wikimedia.org/T333456 (10Ladsgroup) [13:05:34] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:904512|Set externallinks to WRITE BOTH everywhere (T321662)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [13:06:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P45996 and previous config saved to /var/cache/conftool/dbconfig/20230330-130625-ladsgroup.json [13:10:07] (03CR) 10Slyngshede: [C: 03+1] "LGTM. I can't find the documentation for the logfmt function, but I'd assume it parses logfmt formatted logs, in which case the rest make " [puppet] - 10https://gerrit.wikimedia.org/r/902334 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [13:10:36] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@47f3a61]: (no justification provided) [13:10:40] (03PS14) 10Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) [13:10:59] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:904512|Set externallinks to WRITE BOTH everywhere (T321662)]] (duration: 06m 47s) [13:10:59] (03PS1) 10JMeybohm: envoy: Add the wmf-certificates package [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/904527 (https://phabricator.wikimedia.org/T333551) [13:11:20] (03CR) 10Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond) [13:11:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s 
- https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:12:11] (03PS2) 10Volans: superset: add static html for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/902107 (https://phabricator.wikimedia.org/T310009) [13:12:31] (03CR) 10Elukey: [C: 03+1] envoy: Add the wmf-certificates package [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/904527 (https://phabricator.wikimedia.org/T333551) (owner: 10JMeybohm) [13:13:38] Amir1: I guess it deployed correctly [13:13:40] cgoubert@deploy2002:/srv/deployment-charts/helmfile.d/services/mw-debug$ helmfile -e eqiad status 2>/dev/null | grep DEPLOYED [13:13:42] LAST DEPLOYED: Thu Mar 30 13:05:07 2023 [13:14:01] So yeah, having both kubemasters run puppet with a 1 minute difference is bad. [13:14:43] as long as it's transient, I don't mind [13:14:56] Well I do :D [13:15:07] (03CR) 10Ayounsi: [C: 03+2] "Tests are still happy." [puppet] - 10https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [13:15:58] that's fqdn_rand... [13:16:42] (03CR) 10JMeybohm: "This raises multiple warnings:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/904520 (owner: 10Ottomata) [13:16:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:17:38] claime: T161145 [13:17:57] volans: I know. [13:18:03] It's a pain in the ass sometimes [13:18:13] (03CR) 10Raymond Ndibe: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/900642 (owner: 10David Caro) [13:18:18] Although I'd never seen the variable.fqdn_rand syntax before [13:18:24] modules/profile/manifests/puppet/agent.pp: $timer_interval = "*:${interval.fqdn_rand}/${interval}:00" [13:18:31] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2022 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:18:37] (03CR) 10Volans: [C: 03+2] superset: add static html for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/902107 (https://phabricator.wikimedia.org/T310009) (owner: 10Volans) [13:18:52] claime: that's puppet's builtin [13:18:55] IIRC [13:19:01] in the past we used a different thing [13:19:21] volans: Yeah, I just wonder what it uses the variable for actually [13:19:33] I guess it's MAX ? [13:20:21] I'm used to seeing it as fqdn_rand(MAX), not MAX.fqdn_rand [13:20:32] 301 j.bond :D [13:21:07] 007 j.bond >_> [13:21:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P45997 and previous config saved to /var/cache/conftool/dbconfig/20230330-132130-ladsgroup.json [13:23:05] (03CR) 10JMeybohm: mediawiki-page-content-change-enrichment - allow egress to api-ro (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/904520 (owner: 10Ottomata) [13:25:29] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] envoy: Add the wmf-certificates package [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/904527 (https://phabricator.wikimedia.org/T333551) (owner: 10JMeybohm) [13:25:56] (03PS1) 10Ayounsi: Remove redundant or outdated prefixes from aggregate_networks -> labs [puppet] - 10https://gerrit.wikimedia.org/r/904529 (https://phabricator.wikimedia.org/T329669) [13:26:03] (03PS1) 10Volans: superset: fix typo in file path [puppet] - 10https://gerrit.wikimedia.org/r/904530 [13:28:18] (03PS7) 10Hnowlan: 
api-gateway: add REST gateway Lua CSP handler [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) [13:28:26] (03PS3) 10Hnowlan: rest-gateway: add helmfile, enable mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) [13:28:34] (03PS5) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 [13:29:43] (03CR) 10Volans: [C: 03+2] superset: fix typo in file path [puppet] - 10https://gerrit.wikimedia.org/r/904530 (owner: 10Volans) [13:31:40] (03CR) 10Slyngshede: "First draft, and not yet tested. Just checking that we agree on the direction of the implementation." [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede) [13:32:38] (03PS1) 10Hnowlan: admin: move kamila to ops [puppet] - 10https://gerrit.wikimedia.org/r/904532 [13:32:41] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@47f3a61]: (no justification provided) (duration: 22m 04s) [13:32:46] (03CR) 10Ayounsi: [C: 03+1] "Awesome!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond) [13:34:47] (03PS1) 10Ssingh: pybal: don't install python3-requests on bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/904533 (https://phabricator.wikimedia.org/T321309) [13:36:07] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40459/console" [puppet] - 10https://gerrit.wikimedia.org/r/904533 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:36:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P45998 and previous config saved to /var/cache/conftool/dbconfig/20230330-133635-ladsgroup.json [13:37:43] FYI, stashbot is currently having issues and !logs are not being processed [13:38:31] (cc nemo-yiannis, Amir1 from the past few !logs) [13:39:12] :( [13:39:20] !log disable Puppet on A:lvs to test 904533 [13:39:37] oh ok, just saw the stashbot thing :) [13:41:24] (03CR) 10Ssingh: [V: 03+1 C: 03+2] pybal: don't install python3-requests on bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/904533 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:41:26] (03CR) 10Jcrespo: [C: 03+1] alertmanager: route data-persistence warnings to -feed [puppet] - 10https://gerrit.wikimedia.org/r/904525 (owner: 10Filippo Giunchedi) [13:42:20] (03CR) 10Volans: "Nice start! I think it can be simplified a bit without too much extra logic, see comments inline, feel free to ping me." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede) [13:44:36] !log enable Puppet on A:lvs to test 904533 [13:46:40] (03CR) 10Alexandros Kosiaris: [C: 03+2] thumbor: Switch all summaries to histograms [deployment-charts] - 10https://gerrit.wikimedia.org/r/904452 (https://phabricator.wikimedia.org/T333445) (owner: 10Alexandros Kosiaris) [13:46:48] (03CR) 10Ssingh: [C: 03+2] pybal: port check_pybal_ipvs_diff.py to urllib2 [puppet] - 10https://gerrit.wikimedia.org/r/904381 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:47:22] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] thumbor: Switch all summaries to histograms [puppet] - 10https://gerrit.wikimedia.org/r/904456 (https://phabricator.wikimedia.org/T333445) (owner: 10Alexandros Kosiaris) [13:49:19] 10SRE, 10SRE-Access-Requests, 10API Platform (Sprint 06): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (10FJoseph-WMF) Approved [13:49:36] (03PS1) 10Giuseppe Lavagetto: admin: add Kavitha to the approver for the ops group [puppet] - 10https://gerrit.wikimedia.org/r/904535 [13:50:28] (03CR) 10Clément Goubert: [C: 03+1] "She is indeed our manager." 
[puppet] - 10https://gerrit.wikimedia.org/r/904535 (owner: 10Giuseppe Lavagetto) [13:50:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: add Kavitha to the approver for the ops group [puppet] - 10https://gerrit.wikimedia.org/r/904535 (owner: 10Giuseppe Lavagetto) [13:51:19] (03Merged) 10jenkins-bot: thumbor: Switch all summaries to histograms [deployment-charts] - 10https://gerrit.wikimedia.org/r/904452 (https://phabricator.wikimedia.org/T333445) (owner: 10Alexandros Kosiaris) [13:51:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P45999 and previous config saved to /var/cache/conftool/dbconfig/20230330-135140-ladsgroup.json [13:52:43] RECOVERY - nutcracker process on cloudweb1003 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [13:53:03] RECOVERY - nutcracker port on cloudweb1003 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 https://wikitech.wikimedia.org/wiki/Nutcracker [13:53:47] RECOVERY - nutcracker process on cloudweb1004 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [13:54:31] RECOVERY - nutcracker port on cloudweb1004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 https://wikitech.wikimedia.org/wiki/Nutcracker [13:58:10] (03PS1) 10JMeybohm: modules.mesh.configuration: Add version 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904537 (https://phabricator.wikimedia.org/T333551) [13:58:12] (03PS1) 10JMeybohm: mesh.configuration: Use wmf-certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/904538 (https://phabricator.wikimedia.org/T333551) [13:58:30] (03CR) 10Andrew Bogott: [C: 04-1] "We are, it turns out, using nutcracker for Horizon session state. 
See T333561" [puppet] - 10https://gerrit.wikimedia.org/r/902074 (owner: 10Alexandros Kosiaris) [13:59:05] (03CR) 10David Caro: [C: 03+2] harbor: use external url for the proxies [puppet] - 10https://gerrit.wikimedia.org/r/900642 (owner: 10David Caro) [13:59:18] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10Jhancock.wm) @Papaul the firmware has been updated. [14:00:07] (03PS1) 10Jbond: P:puppet::agent: allow to add a seed to the time the agent runs [puppet] - 10https://gerrit.wikimedia.org/r/904539 [14:00:11] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/904525 (owner: 10Filippo Giunchedi) [14:01:13] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10Papaul) @Jhancock.wm thanks [14:01:16] (03CR) 10Jelto: "I'm not sure if this is a typical use case to start a unit/timer after unmasking it. One host (phab1004) did not require a additional star" [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869) (owner: 10EoghanGaffney) [14:01:48] (03CR) 10Elukey: [C: 03+1] "\o/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/904538 (https://phabricator.wikimedia.org/T333551) (owner: 10JMeybohm) [14:01:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40461/console" [puppet] - 10https://gerrit.wikimedia.org/r/904539 (owner: 10Jbond) [14:03:38] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10Papaul) Now that all the firmware are up to date I will recommend the re-image of the server. 
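The fqdn_rand exchange above (two kubemasters whose puppet timers fire about a minute apart, and a patch adding a seed to the agent run time) can be sketched as follows. This is a minimal illustration, not Puppet's actual fqdn_rand implementation: the SHA-256 hashing, the function names `timer_offset` and `timer_spec`, the 30-minute interval, and the full hostnames are all assumptions made for the example. It only shows the general idea that a per-host offset derived from the hostname is stable across runs, and that mixing in a seed gives a different but equally stable offset.

```python
import hashlib


def timer_offset(fqdn: str, interval: int, seed: str = "") -> int:
    """Return a deterministic minute offset in [0, interval) for a host.

    Hypothetical stand-in for a hostname-derived random value; the real
    Puppet function uses its own algorithm.
    """
    digest = hashlib.sha256(f"{fqdn}:{seed}".encode()).hexdigest()
    return int(digest, 16) % interval


def timer_spec(fqdn: str, interval: int, seed: str = "") -> str:
    """Build a calendar string shaped like the one quoted from agent.pp."""
    return f"*:{timer_offset(fqdn, interval, seed):02d}/{interval}:00"


if __name__ == "__main__":
    # Hostnames and interval are illustrative assumptions.
    for host in ("kubemaster1001.eqiad.wmnet", "kubemaster1002.eqiad.wmnet"):
        print(host, timer_spec(host, 30), timer_spec(host, 30, seed="alt"))
```

With only `interval` possible slots, two hosts can land on the same or adjacent minutes by chance; a per-host seed is one way to move one of them without changing the scheme for everyone else.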
[14:03:49] (03PS1) 10Cwhite: logstash: collapse eventgate response_body field [puppet] - 10https://gerrit.wikimedia.org/r/904264 (https://phabricator.wikimedia.org/T180051) [14:05:49] (03CR) 10CI reject: [V: 04-1] logstash: collapse eventgate response_body field [puppet] - 10https://gerrit.wikimedia.org/r/904264 (https://phabricator.wikimedia.org/T180051) (owner: 10Cwhite) [14:06:12] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: sync [14:07:15] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:puppet::agent: allow to add a seed to the time the agent runs [puppet] - 10https://gerrit.wikimedia.org/r/904539 (owner: 10Jbond) [14:08:57] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [14:10:05] BGP alerts expected in ulsfo [14:10:32] (03PS2) 10Hnowlan: admin: move kamila to ops [puppet] - 10https://gerrit.wikimedia.org/r/904532 (https://phabricator.wikimedia.org/T333565) [14:10:58] (03CR) 10EoghanGaffney: Adds flag to start after unmask, starts logrotate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869) (owner: 10EoghanGaffney) [14:11:30] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4010.ulsfo.wmnet with OS bullseye [14:11:37] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS bullseye [14:11:54] (03PS3) 10Clément Goubert: P:kubernetes::master: profile::puppet::agent::timer_seed [puppet] - 10https://gerrit.wikimedia.org/r/904536 [14:12:23] (03PS1) 10Bking: flink-app: temp fix for envoy proxy usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/904542 (https://phabricator.wikimedia.org/T333551) [14:12:27] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [14:13:08] (03CR) 10DCausse: [C: 03+1] flink-app: temp fix for envoy proxy 
usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/904542 (https://phabricator.wikimedia.org/T333551) (owner: 10Bking) [14:13:14] (03CR) 10Raymond Ndibe: maintain-dbusers: only-users match tool users with or without prefix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902817 (https://phabricator.wikimedia.org/T332789) (owner: 10David Caro) [14:14:05] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:15:07] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:16:17] (03PS1) 10Jbond: redfish: update log entries location [software/spicerack] - 10https://gerrit.wikimedia.org/r/904543 [14:16:21] (03CR) 10Bking: [C: 03+2] flink-app: temp fix for envoy proxy usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/904542 (https://phabricator.wikimedia.org/T333551) (owner: 10Bking) [14:17:21] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons. [14:18:21] (03PS1) 10Ayounsi: Kubestage: don't set next-hop self on exported prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/904544 (https://phabricator.wikimedia.org/T328523) [14:18:45] (03CR) 10Hashar: "We should be able to mark the Phabricator task with #release-engineering-team as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [14:19:40] (03PS2) 10Cwhite: logstash: remove eventgate response_body field [puppet] - 10https://gerrit.wikimedia.org/r/904264 (https://phabricator.wikimedia.org/T180051) [14:20:17] (03CR) 10CI reject: [V: 04-1] redfish: update log entries location [software/spicerack] - 10https://gerrit.wikimedia.org/r/904543 (owner: 10Jbond) [14:20:29] (03CR) 10Jelto: Adds flag to start after unmask, starts logrotate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869) (owner: 10EoghanGaffney) [14:21:08] (03Merged) 10jenkins-bot: flink-app: temp fix for envoy proxy usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/904542 (https://phabricator.wikimedia.org/T333551) (owner: 10Bking) [14:22:23] (03Abandoned) 10David Caro: maintain-dbusers: only-users match tool users with or without prefix [puppet] - 10https://gerrit.wikimedia.org/r/902817 (https://phabricator.wikimedia.org/T332789) (owner: 10David Caro) [14:22:33] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [14:22:41] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:22:43] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:23:00] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:23:03] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:23:42] (JobUnavailable) firing: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:25:04] (03CR) 10Clément Goubert: [V: 03+1] 
"PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40463/console" [puppet] - 10https://gerrit.wikimedia.org/r/904536 (owner: 10Clément Goubert) [14:25:53] (03PS1) 10Bking: flink-app: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/904546 (https://phabricator.wikimedia.org/T333551) [14:26:35] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/904264 (https://phabricator.wikimedia.org/T180051) (owner: 10Cwhite) [14:26:53] (03CR) 10DCausse: [C: 03+1] flink-app: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/904546 (https://phabricator.wikimedia.org/T333551) (owner: 10Bking) [14:26:55] (03CR) 10Vgutierrez: [C: 03+2] varnish: Bypass ATS for esitest requests [puppet] - 10https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [14:27:01] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage [14:28:18] (03CR) 10Bking: [C: 03+2] flink-app: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/904546 (https://phabricator.wikimedia.org/T333551) (owner: 10Bking) [14:30:22] (03PS4) 10Clément Goubert: kubemaster*.eqiad: Add puppet::agent::timer_seed [puppet] - 10https://gerrit.wikimedia.org/r/904536 [14:30:42] (03PS6) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [14:31:31] (03CR) 10CI reject: [V: 04-1] Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [14:31:33] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage [14:32:18] 
(03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40464/console" [puppet] - 10https://gerrit.wikimedia.org/r/904536 (owner: 10Clément Goubert) [14:32:59] (03CR) 10Ottomata: [C: 03+2] mediawiki-page-content-change-enrichment - allow egress to api-ro (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/904520 (owner: 10Ottomata) [14:34:33] (03Merged) 10jenkins-bot: flink-app: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/904546 (https://phabricator.wikimedia.org/T333551) (owner: 10Bking) [14:35:36] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:35:39] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:36:41] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:36:44] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:38:44] (03PS1) 10Volans: superset: requestctl-generator error handling [puppet] - 10https://gerrit.wikimedia.org/r/904550 [14:39:01] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:39:02] (03CR) 10Hashar: Migrate from git fat to git lfs (031 comment) [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904239 (https://phabricator.wikimedia.org/T333465) (owner: 10Hashar) [14:39:08] (03PS5) 10Clément Goubert: kubemaster*.eqiad: Add puppet::agent::timer_seed [puppet] - 10https://gerrit.wikimedia.org/r/904536 [14:39:09] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:39:37] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: 
apply [14:39:40] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:40:09] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40465/console" [puppet] - 10https://gerrit.wikimedia.org/r/904536 (owner: 10Clément Goubert) [14:40:25] (03PS7) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [14:40:36] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:41:02] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:43:30] (03PS6) 10Clément Goubert: kubemaster*.eqiad: Add puppet::agent::timer_seed [puppet] - 10https://gerrit.wikimedia.org/r/904536 [14:43:37] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:43:44] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:44:12] (03PS2) 10EoghanGaffney: Adds flag to start after unmask, starts logrotate [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869) [14:44:17] stashbot is apparently back btw, I didn’t notice ^^ [14:44:17] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. 
[14:44:31] (03CR) 10EoghanGaffney: Adds flag to start after unmask, starts logrotate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869) (owner: 10EoghanGaffney) [14:44:35] (03CR) 10CI reject: [V: 04-1] Adds flag to start after unmask, starts logrotate [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869) (owner: 10EoghanGaffney) [14:44:41] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40466/console" [puppet] - 10https://gerrit.wikimedia.org/r/904536 (owner: 10Clément Goubert) [14:45:13] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:46:17] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:46:45] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4010.ulsfo.wmnet with OS bullseye [14:46:52] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS bullseye completed: - lvs4010 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled... 
[14:47:29] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:47:31] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:49:02] (03PS1) 10DCausse: rdf-streaming-updater: temp fix, pin envoy image version to 1.18.3-2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904553 (https://phabricator.wikimedia.org/T328675) [14:49:14] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:49:23] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:50:15] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: temp fix, pin envoy image version to 1.18.3-2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904553 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [14:51:00] Any deployer around for a urgent security fix? [14:51:18] See https://phabricator.wikimedia.org/T333569 [14:51:20] meeting now, can do in twenty minutes or so [14:52:06] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [14:52:53] If a deployer is able to start on this, please ping me and I'll come back. 
[14:53:02] o/ Dreamy_Jazz [14:53:50] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [14:55:03] (03Merged) 10jenkins-bot: rdf-streaming-updater: temp fix, pin envoy image version to 1.18.3-2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904553 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [14:56:52] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:57:01] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:57:06] (03PS2) 10Hashar: Migrate from git fat to git lfs [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904239 (https://phabricator.wikimedia.org/T333465) [14:57:28] (03PS3) 10EoghanGaffney: Adds flag to start after unmask, starts logrotate [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869) [14:57:51] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Avoid sub-optimal routing from CR routers to EVPN destinations - https://phabricator.wikimedia.org/T332781 (10cmooney) >>! In T332781#8741660, @ayounsi wrote: > How does this compare to taking iBGP down between LEAF1 to SPINE2 if the link g... 
[14:57:59] (03CR) 10Hashar: Migrate from git fat to git lfs (031 comment) [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904239 (https://phabricator.wikimedia.org/T333465) (owner: 10Hashar) [14:58:52] (03CR) 10Alexandros Kosiaris: [C: 03+1] openstack::nutcracker: Remove redis support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902074 (owner: 10Alexandros Kosiaris) [14:59:55] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:59:56] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:01:27] heads up, I’m deploying a security fix [15:01:56] (03PS2) 10Jbond: redfish: update log entries location [software/spicerack] - 10https://gerrit.wikimedia.org/r/904543 (https://phabricator.wikimedia.org/T326661) [15:03:07] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/904536 (owner: 10Clément Goubert) [15:03:57] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] kubemaster*.eqiad: Add puppet::agent::timer_seed [puppet] - 10https://gerrit.wikimedia.org/r/904536 (owner: 10Clément Goubert) [15:04:20] (03PS3) 10Jbond: redfish: update log entries location [software/spicerack] - 10https://gerrit.wikimedia.org/r/904543 (https://phabricator.wikimedia.org/T326661) [15:05:16] (03PS2) 10JMeybohm: mesh.configuration: Use wmf-certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/904538 (https://phabricator.wikimedia.org/T333551) [15:05:19] Amir1: kubemasters should now run puppet with more splay, so you shouldn´t run into the issue anymore [15:05:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:06:00] (03PS1) 
10Vgutierrez: varnish: Set backend_hit = esitest for HfP requests [puppet] - 10https://gerrit.wikimedia.org/r/904556 (https://phabricator.wikimedia.org/T308799) [15:06:33] (03CR) 10JMeybohm: mediawiki-page-content-change-enrichment - allow egress to api-ro (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/904520 (owner: 10Ottomata) [15:07:14] (03PS2) 10Vgutierrez: varnish: Set backend_hint = esitest for HfP requests [puppet] - 10https://gerrit.wikimedia.org/r/904556 (https://phabricator.wikimedia.org/T308799) [15:08:20] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/904543 (https://phabricator.wikimedia.org/T326661) (owner: 10Jbond) [15:08:37] !log lucaswerkmeister-wmde: Deployed security patch for T333569 [15:08:53] (03CR) 10Jbond: [C: 03+2] redfish: update log entries location [software/spicerack] - 10https://gerrit.wikimedia.org/r/904543 (https://phabricator.wikimedia.org/T326661) (owner: 10Jbond) [15:09:26] (03PS8) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [15:10:55] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: wikidatardf-truthy-dumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:12:46] (03Merged) 10jenkins-bot: redfish: update log entries location [software/spicerack] - 10https://gerrit.wikimedia.org/r/904543 (https://phabricator.wikimedia.org/T326661) (owner: 10Jbond) [15:13:28] (03CR) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags (031 comment) [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [15:14:51] !log lucaswerkmeister-wmde: Deployed security patch for T333569 [15:15:25] (03PS1) 10JMeybohm: Update default tls terminator/mesh envoy version to 1.18.3-2 [puppet] - 10https://gerrit.wikimedia.org/r/904557 (https://phabricator.wikimedia.org/T333551) [15:16:07] (03PS1) 10Volans: CHANGELOG: add changelogs for release v6.4.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/904558 [15:16:37] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v6.4.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/904558 (owner: 10Volans) [15:18:07] (03PS1) 10Arturo Borrero Gonzalez: alertmanager: disable phabricator task creation for WMCS alerts [puppet] - 10https://gerrit.wikimedia.org/r/904559 (https://phabricator.wikimedia.org/T333315) [15:18:12] (03PS9) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [15:19:28] (03CR) 10Vgutierrez: [C: 03+2] varnish: Set backend_hint = esitest for HfP requests [puppet] - 10https://gerrit.wikimedia.org/r/904556 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [15:19:30] (03PS10) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [15:20:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:21:26] (03PS1) 10Volans: Upstream release v6.4.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/904561 [15:21:38] (03CR) 10Volans: [C: 03+2] Upstream release v6.4.1 
[software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/904561 (owner: 10Volans) [15:25:45] (03Merged) 10jenkins-bot: Upstream release v6.4.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/904561 (owner: 10Volans) [15:26:20] (03CR) 10Jbond: [C: 03+1] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/904522 (owner: 10EoghanGaffney) [15:26:34] (03CR) 10Clément Goubert: [C: 03+1] noc: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/903801 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [15:26:56] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] "Approved by manager on phab" [puppet] - 10https://gerrit.wikimedia.org/r/904532 (https://phabricator.wikimedia.org/T333565) (owner: 10Hnowlan) [15:27:30] (03PS1) 10Andrew Bogott: Toolforge: move to new VM-hosted NFS server [puppet] - 10https://gerrit.wikimedia.org/r/904562 (https://phabricator.wikimedia.org/T333477) [15:28:07] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/904532 (https://phabricator.wikimedia.org/T333565) (owner: 10Hnowlan) [15:28:13] (03CR) 10Andrew Bogott: [C: 04-1] "This needs to be merged during a pre-defined window on Monday morning" [puppet] - 10https://gerrit.wikimedia.org/r/904562 (https://phabricator.wikimedia.org/T333477) (owner: 10Andrew Bogott) [15:28:30] (03PS1) 10Lucas Werkmeister (WMDE): admin: add .gitconfig for lucaswerkmeister-wmde [puppet] - 10https://gerrit.wikimedia.org/r/904563 [15:30:05] brennen and mutante: #bothumor I ♥ Unicode. All rise for Phabricator update window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1530).
[15:30:13] o/ [15:30:49] !log uploaded spicerack_6.4.1 to apt.wikimedia.org bullseye-wikimedia [15:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:01] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:32:05] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phab2002.codfw.wmnet with reason: maintenance [15:32:18] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab2002.codfw.wmnet with reason: maintenance [15:32:28] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1004.eqiad.wmnet with reason: maintenance [15:32:32] !log cmooney@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2002-dev [15:32:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1004.eqiad.wmnet with reason: maintenance [15:32:57] !log cmooney@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2002-dev [15:33:16] !log cmooney@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt2002-dev [15:33:36] !log phabricator maintenance / deploy window starting [15:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:16] !log cmooney@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt2002-dev [15:34:24] !log brennen@deploy2002 Started deploy [phabricator/deployment@9f0866e]: test deploy to phab2002 for T333516 [15:34:29] T333516: Phabricator deployment 2023-03-30 - https://phabricator.wikimedia.org/T333516 [15:34:35] !log upgraded spicerack to v6.4.1 on the cumin hosts [15:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:47] (03CR) 10Jbond: [C: 
04-1] "this affects more then just the systemd timer so we need to think about it more carfully. i also think that we should instead try to noti" [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869) (owner: 10EoghanGaffney) [15:34:54] !log brennen@deploy2002 Finished deploy [phabricator/deployment@9f0866e]: test deploy to phab2002 for T333516 (duration: 00m 30s) [15:35:16] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) Sounds good to me. This is what we need to do with cloudcontrol2004-dev: * figure out how to... [15:35:34] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) Third batch |Host|U space|Existing port|New port| |cloudcephosd2001-dev|3|asw-b1-codfw ge-1/0/... 
[15:35:44] (03CR) 10David Caro: alertmanager: disable phabricator task creation for WMCS alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904559 (https://phabricator.wikimedia.org/T333315) (owner: 10Arturo Borrero Gonzalez) [15:35:51] !log brennen@deploy2002 Started deploy [phabricator/deployment@9f0866e]: deploy to phab1004 for T333516 [15:35:55] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1207'] [15:36:13] !log cmooney@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt2001-dev [15:36:33] !log brennen@deploy2002 Finished deploy [phabricator/deployment@9f0866e]: deploy to phab1004 for T333516 (duration: 00m 42s) [15:36:42] !log cmooney@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt2001-dev [15:39:25] (03CR) 10Ottomata: [C: 03+2] mediawiki-page-content-change-enrichment - allow egress to api-ro (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/904520 (owner: 10Ottomata) [15:40:40] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1208'] [15:40:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/904562 (https://phabricator.wikimedia.org/T333477) (owner: 10Andrew Bogott) [15:43:10] (JobUnavailable) resolved: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:44:01] !log phabricator maintenance window / deployment ended (T329974) [15:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:11] (03PS2) 10Arturo Borrero Gonzalez: alertmanager: update phabricator project for WMCS alerts [puppet] - 10https://gerrit.wikimedia.org/r/904559 (https://phabricator.wikimedia.org/T333315) [15:44:13] T329974: Show "other assignee" avatar on tasks in workboard - https://phabricator.wikimedia.org/T329974 [15:50:15] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10Jclark-ctr) @jbond i have batteries for all of these can this be done tomorrow? If possible can you shut down server and I can preform repair 9am est tomorrow? [15:53:17] (03CR) 10CDanis: [C: 03+2] admin: add .gitconfig for lucaswerkmeister-wmde [puppet] - 10https://gerrit.wikimedia.org/r/904563 (owner: 10Lucas Werkmeister (WMDE)) [15:53:19] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10jbond) @Jclark-ctr you will need to contacts someone in analytics (possibly @BTullis) and data persistence (maybe @MatthewVernon) [15:54:30] (03CR) 10Arturo Borrero Gonzalez: alertmanager: update phabricator project for WMCS alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904559 (https://phabricator.wikimedia.org/T333315) (owner: 10Arturo Borrero Gonzalez) [16:00:04] jbond and rzl: May I have your attention please! 
Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1600) [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:00:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1207'] [16:00:50] (03CR) 10Dzahn: [C: 03+2] phabricator: replace Icinga with Prometheus for SMTP monitoring [puppet] - 10https://gerrit.wikimedia.org/r/903826 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [16:01:11] (03CR) 10David Caro: [C: 03+1] "LGTM :crossingfingers:" [puppet] - 10https://gerrit.wikimedia.org/r/904559 (https://phabricator.wikimedia.org/T333315) (owner: 10Arturo Borrero Gonzalez) [16:01:16] !log cmooney@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2001-dev [16:01:32] !log cmooney@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2001-dev [16:03:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1208'] [16:03:56] (03CR) 10Dzahn: [C: 03+2] "true! well, there is "query_response", but not in wide usage yet" [puppet] - 10https://gerrit.wikimedia.org/r/903826 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [16:05:23] 10SRE, 10observability, 10Upstream: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551 (10lmata) @BTullis, @jcrespo coming in late to this thread to update you that we've scheduled to tackle {T108027} next quarter (q4), which I think would address this issue. Feel free to reach... 
[16:05:38] (03CR) 10Dzahn: [C: 03+2] "let me confirm it works like this and then try to add a query_response check later" [puppet] - 10https://gerrit.wikimedia.org/r/903826 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [16:06:36] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] kubernetes: add dummy tokens for rest-gateway [labs/private] - 10https://gerrit.wikimedia.org/r/904511 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [16:06:47] (03CR) 10Filippo Giunchedi: [C: 03+1] alertmanager: update phabricator project for WMCS alerts [puppet] - 10https://gerrit.wikimedia.org/r/904559 (https://phabricator.wikimedia.org/T333315) (owner: 10Arturo Borrero Gonzalez) [16:06:54] (03CR) 10Ahmon Dancy: [C: 03+1] scap: block Scap execution on inactive deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche) [16:08:31] (03CR) 10Dzahn: alertmanager: create receiver for both sre-collab and releng combined (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [16:09:33] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1209'] [16:09:41] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['db1209'] [16:09:45] (03PS1) 10Btullis: Correct the datahub elasticsearch index prefix for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/904571 (https://phabricator.wikimedia.org/T333580) [16:09:53] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1211'] [16:10:51] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1212'] [16:18:04] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul) [16:18:30] (Emergency syslog message) firing: 
Alert for device asw-b-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:19:26] (03CR) 10Dzahn: alertmanager: create receiver for both sre-collab and releng combined (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [16:19:39] !log cmooney@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2003-dev [16:19:55] !log cmooney@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2003-dev [16:20:08] !log cmooney@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephmon2004-dev [16:20:16] !log cmooney@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephmon2004-dev [16:20:23] (03CR) 10JMeybohm: [C: 03+2] mesh.configuration: Use wmf-certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/904538 (https://phabricator.wikimedia.org/T333551) (owner: 10JMeybohm) [16:20:23] !log cmooney@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt2003-dev [16:20:27] (03CR) 10JMeybohm: [C: 03+2] modules.mesh.configuration: Add version 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904537 (https://phabricator.wikimedia.org/T333551) (owner: 10JMeybohm) [16:21:25] !log cmooney@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt2003-dev [16:23:30] (Emergency syslog message) resolved: Device asw-b-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:25:18] (03Merged) 10jenkins-bot: modules.mesh.configuration: Add version 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904537 (https://phabricator.wikimedia.org/T333551) (owner: 10JMeybohm) [16:25:56] (03PS1) 10Papaul: Add 
new db nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/904574 (https://phabricator.wikimedia.org/T326661) [16:26:23] (03Merged) 10jenkins-bot: mesh.configuration: Use wmf-certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/904538 (https://phabricator.wikimedia.org/T333551) (owner: 10JMeybohm) [16:29:00] (03CR) 10Cwhite: [C: 03+2] logstash: remove eventgate response_body field (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904264 (https://phabricator.wikimedia.org/T180051) (owner: 10Cwhite) [16:31:29] (03PS1) 10Hashar: Extract and deploy upstream plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904575 [16:32:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1211'] [16:36:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1212'] [16:42:55] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1213'] [16:43:03] (03CR) 10Hashar: "In the child change https://gerrit.wikimedia.org/r/c/operations/software/gerrit/+/904575 I am adding some more plugins tracked by git lfs." [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904239 (https://phabricator.wikimedia.org/T333465) (owner: 10Hashar) [16:43:21] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) All remaining (non public-vlan) hosts have been moved and look good to me (reachable, MAC addr... [16:44:51] (03CR) 10Brennen Bearnes: "+1 on overall idea." [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [16:46:21] (03CR) 10Hashar: "The parent change migrates Gerrit deployment from git-fat to git-lfs. 
Jaime and I successfully used it for the Gitlab jenkins-deploy repo." [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904575 (owner: 10Hashar) [16:46:55] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1214'] [16:46:58] (03CR) 10Hashar: [C: 03+2] "With git-lfs, I have proposed to add the bundled Gerrit plugins in the deployment repository again https://gerrit.wikimedia.org/r/c/operat" [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/699035 (https://phabricator.wikimedia.org/T278990) (owner: 10Ahmon Dancy) [16:47:14] 10SRE, 10Traffic-Icebox, 10Upstream: OCSP Stapling for Intermediates - https://phabricator.wikimedia.org/T148134 (10BCornwall) 05Stalled→03Invalid And another 3 years have passed. Since OCSP is in a bit of a zombie state and its future support in Firefox is questionable (see Mozilla's crlite project), it... [16:52:10] (03PS1) 10BryanDavis: developer-portal: Bump container to 2023-03-27-111537-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/904578 [16:54:20] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:55:35] (03PS1) 10Cwhite: logstash: restore logstash index patch level [puppet] - 10https://gerrit.wikimedia.org/r/904265 (https://phabricator.wikimedia.org/T180051) [16:56:44] 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson [16:56:53] (03CR) 10Papaul: [C: 03+2] Add new db nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/904574 (https://phabricator.wikimedia.org/T326661) (owner: 10Papaul) [16:58:17] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2023-03-27-111537-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/904578 (owner: 
10BryanDavis) [16:58:48] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul) [17:00:05] bd808: How many deployers does it take to do Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1700) [17:00:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye [17:00:34] jouncebot: I think just 1, but that's not a great punchline. ;) [17:00:40] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye [17:01:03] * bd808 will be deploying an updated developer portal today [17:01:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1207.eqiad.wmnet with OS bullseye [17:01:57] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1207.eqiad.wmnet with OS bullseye [17:03:40] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2023-03-27-111537-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/904578 (owner: 10BryanDavis) [17:04:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1213'] [17:04:38] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1215'] [17:04:42] !log bd808@deploy2002 helmfile [staging]
START helmfile.d/services/developer-portal: apply [17:05:48] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:06:01] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:06:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1214'] [17:07:11] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:07:23] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:08:32] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:09:03] (03CR) 10Dzahn: [C: 03+2] "works per https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*phab.*%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h" [puppet] - 10https://gerrit.wikimedia.org/r/903826 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [17:09:06] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1216'] [17:10:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host gerrit1003.mgmt.eqiad.wmnet with reboot policy FORCED [17:10:42] BGP alerts in ulsfo expected [17:10:55] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host gerrit1003.mgmt.eqiad.wmnet with reboot policy FORCED [17:14:19] PROBLEM - PyBal backends health check on lvs4008 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [17:14:19] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:14:31] PROBLEM - pybal on lvs4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:15:21] ^ expected 
[17:15:45] PROBLEM - PyBal connections to etcd on lvs4008 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [17:16:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1207.eqiad.wmnet with reason: host reimage [17:16:47] (03PS2) 10Btullis: Correct the datahub elasticsearch index prefix for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/904571 (https://phabricator.wikimedia.org/T333580) [17:19:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1207.eqiad.wmnet with reason: host reimage [17:19:55] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:20:04] ^ expected [17:20:18] there isn't a good way to silence these alerts [17:20:41] partly I also don't want to today (so that we can know if something breaks) but yeah, in general as well :) [17:21:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1208.eqiad.wmnet with OS bullseye [17:21:36] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1208.eqiad.wmnet with OS bullseye [17:22:26] 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10BTullis) @Cmjohnson I can't think of any reason why six disks should have failed. I think they're all single volume RAID 0 logical volumes, aren't they? We've power cycled it a few times with... 
[17:25:27] (03PS1) 10Cmjohnson: updating site.pp and netboot with new gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/904586 (https://phabricator.wikimedia.org/T326366) [17:25:39] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:26:24] (03CR) 10Cwhite: [C: 03+2] logstash: add k8s statsd-exporter ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/901631 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [17:26:40] (03CR) 10Cmjohnson: [C: 03+2] updating site.pp and netboot with new gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/904586 (https://phabricator.wikimedia.org/T326366) (owner: 10Cmjohnson) [17:27:27] !log ebysans@deploy2002 Started deploy [airflow-dags/analytics@8b242c2]: (no justification provided) [17:27:38] !log ebysans@deploy2002 Finished deploy [airflow-dags/analytics@8b242c2]: (no justification provided) (duration: 00m 11s) [17:28:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1215'] [17:28:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1216'] [17:28:38] !log killed Oozie mediawiki-history-check_denormalize job and started Airflow mediawiki_history_check_denormalize dag. 
[17:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:31] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:29:32] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['gerrit1003'] [17:29:45] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['gerrit1003'] [17:30:17] !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['gerrit1003'] [17:30:25] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['gerrit1003'] [17:32:16] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1217'] [17:32:31] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1218'] [17:34:29] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:35:34] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul) [17:36:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1208.eqiad.wmnet with reason: host reimage [17:36:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:36:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1207.eqiad.wmnet with OS bullseye [17:36:23] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1207.eqiad.wmnet with OS bullseye completed: - db1207 (**PASS**) - Removed from Puppet an... [17:36:44] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host gerrit1003.wikimedia.org with OS bullseye [17:36:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1211.eqiad.wmnet with OS bullseye [17:36:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host gerrit1003.wikimedia.org with OS bullseye [17:36:56] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1211.eqiad.wmnet with OS bullseye [17:39:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1208.eqiad.wmnet with reason: host reimage [17:49:28] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host gerrit1003.wikimedia.org with OS bullseye [17:49:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host gerrit1003.wikimedia.org with OS bullseye executed with errors: - gerrit1003 (*... 
[17:51:20] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1211.eqiad.wmnet with reason: host reimage [17:54:01] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:54:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1211.eqiad.wmnet with reason: host reimage [17:55:17] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host gerrit1003.wikimedia.org with OS bullseye [17:55:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host gerrit1003.wikimedia.org with OS bullseye [17:56:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (10Cmjohnson) [17:57:27] 10SRE, 10Traffic-Icebox: cache_upload varnish-fe exhausting transient memory - https://phabricator.wikimedia.org/T249809 (10BCornwall) 05Stalled→03Resolved a:03BCornwall I haven't been able to see any indication that this has been an issue for the entirety of our metrics. @Ema's great work likely has fix... [18:00:05] dduvall and dancy: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1800). [18:05:42] 10SRE, 10Traffic, 10HTTPS, 10Tracking-Neverending: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681 (10BCornwall) [18:05:55] 10SRE, 10Traffic, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10BCornwall) 05Stalled→03Declined I'm going to decline this as it's not possible. I will follow it up with T333591 which tracks moving the domain. 
[18:06:20] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:06:23] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:08:45] (03CR) 10Dzahn: [C: 03+2] noc: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/903801 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [18:09:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10Cmjohnson) There doesn't seem to be a raid controller {F36934521} [18:09:48] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:12:05] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:12:08] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:14:28] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host thanos-fe1004.eqiad.wmnet with OS bullseye [18:15:30] (03CR) 10BCornwall: [C: 03+1] grizzly: adapt managed dashboards to 0.2 metadata approach [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/903776 (https://phabricator.wikimedia.org/T332895) (owner: 10Herron) [18:21:46] (03PS2) 10Dzahn: miscweb: move simplestatic.erb out of role/templates/apache/sites/ [puppet] - 10https://gerrit.wikimedia.org/r/902141 [18:22:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:22:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
db1211.eqiad.wmnet with OS bullseye [18:22:26] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:22:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1208.eqiad.wmnet with OS bullseye [18:22:28] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1211.eqiad.wmnet with OS bullseye completed: - db1211 (**PASS**) - Removed from Puppet an... [18:22:33] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1208.eqiad.wmnet with OS bullseye completed: - db1208 (**WARN**) - Removed from Puppet an... 
[18:22:44] (03CR) 10Dzahn: [C: 03+1] "used only by https://openstack-browser.toolforge.org/puppetclass/role::simplestatic afaict" [puppet] - 10https://gerrit.wikimedia.org/r/902141 (owner: 10Dzahn) [18:22:59] !log ebysans@deploy2002 Started deploy [airflow-dags/analytics@5355ead]: (no justification provided) [18:23:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1212.eqiad.wmnet with OS bullseye [18:23:11] !log ebysans@deploy2002 Finished deploy [airflow-dags/analytics@5355ead]: (no justification provided) (duration: 00m 12s) [18:23:17] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1212.eqiad.wmnet with OS bullseye [18:23:33] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:23:36] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:24:02] (03PS1) 10Mforns: modules::profile::manifests::airflow.pp: add plugins_folder path [puppet] - 10https://gerrit.wikimedia.org/r/904609 (https://phabricator.wikimedia.org/T324485) [18:24:23] (03CR) 10Dzahn: [C: 03+1] "unknown why compiling on cloud VPS not working: Hosts that were skipped (fail fast)" [puppet] - 10https://gerrit.wikimedia.org/r/902141 (owner: 10Dzahn) [18:25:39] (03PS1) 10TrainBranchBot: all wikis to 1.41.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904611 (https://phabricator.wikimedia.org/T330208) [18:25:41] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.41.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904611 (https://phabricator.wikimedia.org/T330208) (owner: 10TrainBranchBot) [18:26:26] (03Merged) 10jenkins-bot: all wikis to 1.41.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904611 
(https://phabricator.wikimedia.org/T330208) (owner: 10TrainBranchBot) [18:26:29] (03CR) 10Dzahn: [C: 03+2] miscweb: move simplestatic.erb out of role/templates/apache/sites/ [puppet] - 10https://gerrit.wikimedia.org/r/902141 (owner: 10Dzahn) [18:27:51] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:27:53] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:30:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1213.eqiad.wmnet with OS bullseye [18:30:36] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1213.eqiad.wmnet with OS bullseye [18:31:18] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:31:19] (03PS1) 10Dzahn: simplestatic: fix path to erb template [puppet] - 10https://gerrit.wikimedia.org/r/904613 [18:31:20] !log Killed Oozie mediawiki-wikitext-history-coord and mediawiki-wikitext-current-coord [18:31:21] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:31:38] (03CR) 10Dzahn: [C: 03+2] "duh, follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/904613/" [puppet] - 10https://gerrit.wikimedia.org/r/902141 (owner: 10Dzahn) [18:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1217'] [18:31:46] (03CR) 10Dzahn: [C: 03+2] simplestatic: fix path to erb template [puppet] - 10https://gerrit.wikimedia.org/r/904613 (owner: 10Dzahn) [18:32:12] !log pt1979@cumin2002 END (PASS) - 
Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1218'] [18:32:16] !log started Airflow mediawiki wikitext dags after killing oozie jobs as part of Migration task [18:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:12] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: all wikis to 1.41.0-wmf.2 refs T330208 [18:33:18] T330208: 1.41.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T330208 [18:33:58] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:34:38] (03CR) 10Dzahn: [C: 03+2] "confirmed working / noop on dashiki-02.dashiki.eqiad.wmflabs now" [puppet] - 10https://gerrit.wikimedia.org/r/904613 (owner: 10Dzahn) [18:34:48] (03CR) 10Dzahn: [C: 03+2] "confirmed working / noop on dashiki-02.dashiki.eqiad.wmflabs now after follow-up" [puppet] - 10https://gerrit.wikimedia.org/r/902141 (owner: 10Dzahn) [18:37:16] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [18:37:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1212.eqiad.wmnet with reason: host reimage [18:38:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:40:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1212.eqiad.wmnet with reason: host reimage [18:41:04] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:41:06] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE
helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:41:56] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:41:59] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:43:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:44:58] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:44:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1213.eqiad.wmnet with reason: host reimage [18:45:01] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:45:44] (03PS2) 10Dzahn: decom miscweb2002 [puppet] - 10https://gerrit.wikimedia.org/r/902229 (https://phabricator.wikimedia.org/T331896) [18:45:48] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:45:51] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:46:22] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:46:26] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:46:45] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:46:56] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:47:00] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:48:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1213.eqiad.wmnet with reason: host reimage [18:48:48] (03PS1) 10Herron: dns: repoint alert host services to alert2001 [dns] - 10https://gerrit.wikimedia.org/r/904614 (https://phabricator.wikimedia.org/T333478) [18:48:56] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:49:00] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:52:56] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:52:59] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:54:09] (03CR) 10Herron: "to be merged after the related puppet patch during planned failover window" [dns] - 10https://gerrit.wikimedia.org/r/904614 (https://phabricator.wikimedia.org/T333478) (owner: 10Herron) [18:54:50] (03CR) 10Herron: "jftr I512758d23fe0682e5ce302d15b838c8b836dc4f3" [puppet] - 10https://gerrit.wikimedia.org/r/899629 (https://phabricator.wikimedia.org/T331882) (owner: 10Herron) [18:55:02] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:55:23] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:55:26] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [18:57:08] !log 
pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:57:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1212.eqiad.wmnet with OS bullseye [18:57:14] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1212.eqiad.wmnet with OS bullseye completed: - db1212 (**PASS**) - Removed from Puppet an... [18:57:32] (03PS1) 10Dzahn: gitlab_runner: run clear-docker-cache every hour [puppet] - 10https://gerrit.wikimedia.org/r/904616 (https://phabricator.wikimedia.org/T333586) [18:58:42] (03PS1) 10Stevemunene: Jupyterhub-conda exclude /mnt from accessible paths [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) [18:59:40] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1219'] [19:00:26] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1220'] [19:02:15] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:02:20] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40468/console" [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: 10Stevemunene) [19:04:21] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:04:24] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:05:31] !log 
gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:05:34] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:06:30] (03PS6) 10Ryan Kemper: [WIP] wdqs: test new metric option [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/900430 (https://phabricator.wikimedia.org/T328306) [19:08:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:08:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1213.eqiad.wmnet with OS bullseye [19:08:47] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:08:50] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:08:52] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1213.eqiad.wmnet with OS bullseye completed: - db1213 (**PASS**) - Removed from Puppet an... 
[19:09:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1215.eqiad.wmnet with OS bullseye [19:09:12] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1215.eqiad.wmnet with OS bullseye [19:09:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1214.eqiad.wmnet with OS bullseye [19:09:32] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1214.eqiad.wmnet with OS bullseye [19:10:36] (03PS7) 10Ryan Kemper: [WIP] wdqs: test new metric option [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/900430 (https://phabricator.wikimedia.org/T328306) [19:11:13] (03CR) 10Ottomata: [C: 03+1] modules::profile::manifests::airflow.pp: add plugins_folder path [puppet] - 10https://gerrit.wikimedia.org/r/904609 (https://phabricator.wikimedia.org/T324485) (owner: 10Mforns) [19:11:25] (03PS1) 10Dzahn: miscweb: remove miscweb2002 from rsync dest hosts [puppet] - 10https://gerrit.wikimedia.org/r/904619 (https://phabricator.wikimedia.org/T331896) [19:11:46] (03CR) 10Dzahn: [C: 03+2] miscweb: remove miscweb2002 from rsync dest hosts [puppet] - 10https://gerrit.wikimedia.org/r/904619 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn) [19:11:53] (03PS8) 10Ryan Kemper: [WIP] wdqs: test new metric option [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/900430 (https://phabricator.wikimedia.org/T328306) [19:13:32] (03PS1) 10Ssingh: hiera: set bpg-med to 101 for lvs4008 (100 for lvs4010) [puppet] - 10https://gerrit.wikimedia.org/r/904620 (https://phabricator.wikimedia.org/T321309) [19:14:00] bblack: /win 14 [19:14:02] er [19:14:10] my 14 is probably not yours :) [19:14:19] haha [19:14:35] (03CR) 10BBlack: 
[C: 03+1] hiera: set bpg-med to 101 for lvs4008 (100 for lvs4010) [puppet] - 10https://gerrit.wikimedia.org/r/904620 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [19:14:39] ^ thanks! [19:14:41] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40469/console" [puppet] - 10https://gerrit.wikimedia.org/r/904620 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [19:14:57] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:14:58] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:15:13] (03CR) 10Dzahn: [C: 03+2] "the only change is on miscweb2002 itself, not the others machines. removes ferm rule but not rsync itself.." [puppet] - 10https://gerrit.wikimedia.org/r/904619 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn) [19:15:15] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: set bpg-med to 101 for lvs4008 (100 for lvs4010) [puppet] - 10https://gerrit.wikimedia.org/r/904620 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [19:15:42] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:15:43] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:16:07] (03PS3) 10Dzahn: decom miscweb2002 [puppet] - 10https://gerrit.wikimedia.org/r/902229 (https://phabricator.wikimedia.org/T331896) [19:16:16] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gerrit1003.wikimedia.org with OS bullseye [19:16:20] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:16:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: 
Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host gerrit1003.wikimedia.org with OS bullseye executed with errors: - gerrit1003 (*... [19:16:22] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:16:39] (03PS4) 10Dzahn: miscweb/site: remove miscweb2002 from site [puppet] - 10https://gerrit.wikimedia.org/r/902229 (https://phabricator.wikimedia.org/T331896) [19:17:23] RECOVERY - PyBal backends health check on lvs4008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:18:17] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:18:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1219'] [19:18:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1220'] [19:18:26] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:18:29] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:18:53] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:18:56] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:19:01] RECOVERY - pybal on lvs4008 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:19:03] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:20:33] RECOVERY - PyBal connections to etcd on lvs4008 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [19:22:28] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1221'] [19:22:43] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1222'] [19:23:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1215.eqiad.wmnet with reason: host reimage [19:24:04] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul) [19:24:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1214.eqiad.wmnet with reason: host reimage [19:26:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1215.eqiad.wmnet with reason: host reimage [19:29:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1214.eqiad.wmnet with reason: host reimage [19:35:35] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:40:57] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:42:37] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:45:48] 10SRE, 10SRE-Access-Requests, 10API Platform (Sprint 06): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (10Ladsgroup) [19:46:38] (03PS1) 10Nray: Remove inline script from United States static 
page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904621 (https://phabricator.wikimedia.org/T331681) [19:46:59] (03PS2) 10Nray: Remove inline script from United States static page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904621 (https://phabricator.wikimedia.org/T331681) [19:54:29] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:54:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1215.eqiad.wmnet with OS bullseye [19:54:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:54:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1214.eqiad.wmnet with OS bullseye [19:54:37] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1215.eqiad.wmnet with OS bullseye completed: - db1215 (**WARN**) - Removed from Puppet an... [19:54:40] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1214.eqiad.wmnet with OS bullseye completed: - db1214 (**PASS**) - Removed from Puppet an... 
[19:54:58] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1216.eqiad.wmnet with OS bullseye [19:55:04] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1216.eqiad.wmnet with OS bullseye [19:55:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1217.eqiad.wmnet with OS bullseye [19:55:18] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1217.eqiad.wmnet with OS bullseye [20:02:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1222'] [20:02:37] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1221'] [20:06:35] jouncebot: now [20:06:35] For the next 0 hour(s) and 53 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T2000) [20:06:42] huh, but no ping [20:07:16] anyway nray you around for backport and config window? [20:07:26] yes, I'm here! [20:07:40] cool, I guess I can be your deployer :) [20:07:45] thank you! 
[20:09:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1216.eqiad.wmnet with reason: host reimage [20:09:40] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1217.eqiad.wmnet with reason: host reimage [20:10:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904621 (https://phabricator.wikimedia.org/T331681) (owner: 10Nray) [20:11:00] (03Merged) 10jenkins-bot: Remove inline script from United States static page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904621 (https://phabricator.wikimedia.org/T331681) (owner: 10Nray) [20:11:12] !log thcipriani@deploy2002 Started scap: Backport for [[gerrit:904621|Remove inline script from United States static page (T331681)]] [20:11:17] T331681: Make a proposal for supporting the disabling of multiple features in client preferences - https://phabricator.wikimedia.org/T331681 [20:12:15] (03CR) 10CDanis: alerting_host: failover icinga and alertmanger from eqiad to codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899629 (https://phabricator.wikimedia.org/T331882) (owner: 10Herron) [20:12:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1216.eqiad.wmnet with reason: host reimage [20:12:31] !log thcipriani@deploy2002 nray and thcipriani: Backport for [[gerrit:904621|Remove inline script from United States static page (T331681)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:12:43] ^ nray should be live on mwdebug machines, check please [20:12:53] thcipriani: thank you, checking now [20:14:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1217.eqiad.wmnet with reason: host reimage [20:14:44] thcipriani: looks good, you can proceed [20:14:58] okie doke, doing so [20:20:44] !log 
pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1223'] [20:20:54] !log thcipriani@deploy2002 Finished scap: Backport for [[gerrit:904621|Remove inline script from United States static page (T331681)]] (duration: 09m 42s) [20:20:57] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1224'] [20:20:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:21:00] T331681: Make a proposal for supporting the disabling of multiple features in client preferences - https://phabricator.wikimedia.org/T331681 [20:21:10] nray: live everywhere [20:21:19] thcipriani: thanks so much for your help! [20:21:26] sure thing :) [20:24:50] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:25:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:27:02] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:27:39] 10SRE, 10Traffic: Performance implications of buffer sizes in Apache Traffic Server intercept plugins - https://phabricator.wikimedia.org/T287847 (10BCornwall) [20:27:45] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:30:53] !log pt1979@cumin2002 END 
(PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:30:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1216.eqiad.wmnet with OS bullseye [20:30:58] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:30:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1217.eqiad.wmnet with OS bullseye [20:31:00] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1216.eqiad.wmnet with OS bullseye completed: - db1216 (**PASS**) - Removed from Puppet an... [20:31:08] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1217.eqiad.wmnet with OS bullseye completed: - db1217 (**WARN**) - Removed from Puppet an... 
[20:33:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1218.eqiad.wmnet with OS bullseye [20:33:11] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1218.eqiad.wmnet with OS bullseye [20:33:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1219.eqiad.wmnet with OS bullseye [20:33:28] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1219.eqiad.wmnet with OS bullseye [20:35:31] (03PS2) 10Jdlrobson: Disable Vector js/css sharing on pl.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904284 (https://phabricator.wikimedia.org/T332809) [20:36:05] thcipriani: is it too late to add something to the window? 
[20:39:48] 10SRE, 10Traffic, 10Patch-For-Review: increase of network errors on alert1001 after certspotter has been enabled - https://phabricator.wikimedia.org/T303593 (10BCornwall) [20:40:03] 10SRE, 10Traffic, 10Patch-For-Review: increase of network errors on alert1001 after certspotter has been enabled - https://phabricator.wikimedia.org/T303593 (10BCornwall) p:05Medium→03Triage [20:41:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1224'] [20:42:40] 10SRE, 10Traffic, 10observability, 10Upstream: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (10BCornwall) [20:42:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1223'] [20:42:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:44:36] (03PS2) 10Andrew Bogott: Toolforge: move to new VM-hosted NFS server [puppet] - 10https://gerrit.wikimedia.org/r/904562 (https://phabricator.wikimedia.org/T333477) [20:44:38] (03PS1) 10Andrew Bogott: nfs traffic shaping: label IPs circa 2017 [puppet] - 10https://gerrit.wikimedia.org/r/904623 [20:44:40] (03PS1) 10Andrew Bogott: nfs traffic-shaping: replace labstore100[67] with clouddumps100[12] [puppet] - 10https://gerrit.wikimedia.org/r/904624 [20:44:42] (03PS1) 10Andrew Bogott: nfs traffic shaping: remove refs to labstore100[12] [puppet] - 10https://gerrit.wikimedia.org/r/904625 [20:44:44] (03PS1) 10Andrew Bogott: nfs traffic_shaping: replace labstore1003 rules with rules for scratch.svc [puppet] - 10https://gerrit.wikimedia.org/r/904626 [20:44:46] (03PS1) 10Andrew Bogott: nfs traffic_shaping: replace labstore1004 rules with rules for tools-nfs.svc [puppet] - 10https://gerrit.wikimedia.org/r/904627 
(https://phabricator.wikimedia.org/T333477) [20:46:27] (03CR) 10Andrew Bogott: [C: 04-1] "this should only be merged after we switch to the new nfs server" [puppet] - 10https://gerrit.wikimedia.org/r/904627 (https://phabricator.wikimedia.org/T333477) (owner: 10Andrew Bogott) [20:46:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:47:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1218.eqiad.wmnet with reason: host reimage [20:47:57] (03CR) 10Andrew Bogott: Toolforge: move to new VM-hosted NFS server [puppet] - 10https://gerrit.wikimedia.org/r/904562 (https://phabricator.wikimedia.org/T333477) (owner: 10Andrew Bogott) [20:47:58] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1219.eqiad.wmnet with reason: host reimage [20:48:18] 10SRE-tools, 10DNS, 10Infrastructure-Foundations, 10Traffic, 10netbox: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (10BCornwall) [20:51:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1218.eqiad.wmnet with reason: host reimage [20:53:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1219.eqiad.wmnet with reason: host reimage [20:58:41] 10SRE, 10Traffic, 10Patch-For-Review: Update certspotter - https://phabricator.wikimedia.org/T204993 (10BCornwall) p:05Medium→03Triage [20:59:53] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1225'] [21:00:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1209.mgmt.eqiad.wmnet with reboot policy FORCED [21:05:55] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" 
[21:06:25] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:11:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1209.mgmt.eqiad.wmnet with reboot policy FORCED [21:12:56] (03PS1) 10Ladsgroup: admin: Add sfaci ssh key and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/904629 (https://phabricator.wikimedia.org/T333456) [21:13:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1210.mgmt.eqiad.wmnet with reboot policy FORCED [21:13:36] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:13:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1219.eqiad.wmnet with OS bullseye [21:13:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:13:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1218.eqiad.wmnet with OS bullseye [21:13:43] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1219.eqiad.wmnet with OS bullseye completed: - db1219 (**WARN**) - Removed from Puppet an... [21:13:47] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1218.eqiad.wmnet with OS bullseye completed: - db1218 (**PASS**) - Removed from Puppet an... 
[21:14:43] 10SRE, 10Traffic, 10PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (10BCornwall) 05Open→03In progress p:05Medium→03High a:05BBlack→03BCornwall Since there wasn't any feedback on this, I guess I'll claim this ticket since I'm actively trying to fix this. I'll ask me... [21:22:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1225'] [21:24:42] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1209'] [21:24:52] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10wiki_willy) [21:25:02] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10wiki_willy) [21:27:26] (03PS1) 10Andrew Bogott: labstore1004: park in an 'insetup' role until we're ready to decom [puppet] - 10https://gerrit.wikimedia.org/r/904630 (https://phabricator.wikimedia.org/T333477) [21:30:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1210.mgmt.eqiad.wmnet with reboot policy FORCED [21:33:50] 10SRE-tools, 10DNS, 10Infrastructure-Foundations, 10Traffic, 10netbox: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (10Volans) Correct, and we've already the first validators in netbox-next that will be released to prod shortly so this can b... 
[21:35:06] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1210'] [21:35:35] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1209'] [21:42:17] (03PS1) 10Bartosz Dziewoński: Enable visual enhancements on pages using __NEWSECTIONLINK__ on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904631 (https://phabricator.wikimedia.org/T333570) [21:48:54] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:52:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:57:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:06:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1210'] [22:07:17] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1209'] [22:15:14] (03CR) 10Dzahn: [C: 03+2] vrts: replace Icinga with Prometheus for SMTP monitoring [puppet] - 10https://gerrit.wikimedia.org/r/903805 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [22:15:53] (03CR) 10Dzahn: [C: 03+2] "I made a follow-up ticket for adding actual "send/expect" patterns to all TCP checks. Thanks for reviews!" 
[puppet] - 10https://gerrit.wikimedia.org/r/903805 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [22:16:10] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db1209'] [22:16:39] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['gerrit1003'] [22:17:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['gerrit1003'] [22:18:56] (03CR) 10Dzahn: [C: 03+2] "also checked that on new machine vrts2001 we already have port 25 with exim listening" [puppet] - 10https://gerrit.wikimedia.org/r/903805 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [22:20:39] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1209'] [22:20:53] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1209'] [22:21:17] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1209'] [22:27:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1209'] [22:33:08] (03PS1) 10Cwhite: logstash: replace grafana ecs fields [puppet] - 10https://gerrit.wikimedia.org/r/904590 [22:35:22] (03CR) 10Dzahn: [C: 03+2] "works per https://thanos.wikimedia.org/graph?g0.deduplicate=1&g0.expr=probe_success%7Binstance%3D~%22.*otrs.*%22%7D&g0.max_source_resoluti" [puppet] - 10https://gerrit.wikimedia.org/r/903805 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [22:40:28] (03CR) 10Cwhite: [C: 03+2] logstash: replace grafana ecs fields [puppet] - 10https://gerrit.wikimedia.org/r/904590 (owner: 10Cwhite) [22:47:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1220.eqiad.wmnet with OS bullseye [22:47:23] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install 
db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1220.eqiad.wmnet with OS bullseye [22:50:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1221.eqiad.wmnet with OS bullseye [22:51:01] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1221.eqiad.wmnet with OS bullseye [22:51:11] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul) [22:52:18] 10SRE, 10ops-codfw, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab2003.wikimedia.org (B5) - https://phabricator.wikimedia.org/T333304 (10Papaul) @Jelto the 2 disks are in place in gitlab2003 [22:52:56] 10SRE, 10ops-codfw, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab2003.wikimedia.org (B5) - https://phabricator.wikimedia.org/T333304 (10Papaul) a:03Jelto [22:57:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:58:49] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1149.mgmt.eqiad.wmnet with reboot policy FORCED [22:59:43] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1149.mgmt.eqiad.wmnet with reboot policy FORCED [23:01:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1220.eqiad.wmnet with reason: host reimage [23:02:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes 
(tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:02:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1209.eqiad.wmnet with OS bullseye [23:02:51] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1209.eqiad.wmnet with OS bullseye [23:03:51] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10SRE Observability (FY2022/2023-Q4): Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (10lmata) a:05lmata→03herron We've made this item and subsequent follow-up an OKR for Q4, handing it off to @herron [23:04:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1220.eqiad.wmnet with reason: host reimage [23:05:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1221.eqiad.wmnet with reason: host reimage [23:07:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr) @BTullis what HW raid to not in task [23:08:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1221.eqiad.wmnet with reason: host reimage [23:09:38] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1149.mgmt.eqiad.wmnet with reboot policy FORCED [23:11:10] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T333328 (10Papaul) 05Open→03Resolved This was fixed by @Jhancock.wm [23:13:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1210.eqiad.wmnet with OS bullseye 
[23:13:37] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1210.eqiad.wmnet with OS bullseye [23:15:04] PROBLEM - Check systemd state on graphite1005 is CRITICAL: CRITICAL - degraded: The following units failed: statsd-proxy-socat-6to4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:16:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1209.eqiad.wmnet with reason: host reimage [23:18:33] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1149.mgmt.eqiad.wmnet with reboot policy FORCED [23:19:25] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:19:30] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1150.mgmt.eqiad.wmnet with reboot policy FORCED [23:20:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1209.eqiad.wmnet with reason: host reimage [23:23:05] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:24:16] RECOVERY - Check systemd state on graphite1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:26:54] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:26:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1221.eqiad.wmnet with OS bullseye [23:26:56] !log pt1979@cumin2002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:26:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1220.eqiad.wmnet with OS bullseye [23:27:01] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1221.eqiad.wmnet with OS bullseye completed: - db1221 (**WARN**) - Removed from Puppet an... [23:27:04] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1220.eqiad.wmnet with OS bullseye completed: - db1220 (**PASS**) - Removed from Puppet an... [23:29:21] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1150.mgmt.eqiad.wmnet with reboot policy FORCED [23:30:00] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1151.mgmt.eqiad.wmnet with reboot policy FORCED [23:31:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1222.eqiad.wmnet with OS bullseye [23:31:10] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1222.eqiad.wmnet with OS bullseye [23:32:10] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1003.eqiad.wmnet.service,rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:33:17] (03PS1) 10Cwhite: logstash: normalize_level add grafana error level alias [puppet] - 10https://gerrit.wikimedia.org/r/904591 [23:34:39] !log 
pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1210.eqiad.wmnet with reason: host reimage [23:35:52] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:37:14] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1151.mgmt.eqiad.wmnet with reboot policy FORCED [23:37:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1210.eqiad.wmnet with reason: host reimage [23:38:05] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1152.mgmt.eqiad.wmnet with reboot policy FORCED [23:39:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:39:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1209.eqiad.wmnet with OS bullseye [23:39:12] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1209.eqiad.wmnet with OS bullseye completed: - db1209 (**PASS**) - Removed from Puppet an... 
[23:40:04] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul) [23:41:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1223.eqiad.wmnet with OS bullseye [23:41:46] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1223.eqiad.wmnet with OS bullseye [23:44:45] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1152.mgmt.eqiad.wmnet with reboot policy FORCED [23:45:07] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1153.mgmt.eqiad.wmnet with reboot policy FORCED [23:45:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1222.eqiad.wmnet with reason: host reimage [23:48:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1222.eqiad.wmnet with reason: host reimage [23:51:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1224.eqiad.wmnet with OS bullseye [23:51:18] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1224.eqiad.wmnet with OS bullseye [23:51:20] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1153.mgmt.eqiad.wmnet with reboot policy FORCED [23:51:43] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1154.mgmt.eqiad.wmnet with reboot policy FORCED [23:55:14] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: esams 1 VM request for prometheus3002 - https://phabricator.wikimedia.org/T333627 (10andrea.denisse) [23:55:30] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: esams 
1 VM request for prometheus3002 - https://phabricator.wikimedia.org/T333627 (10andrea.denisse) a:03andrea.denisse [23:56:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1223.eqiad.wmnet with reason: host reimage [23:58:59] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:59:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1154.mgmt.eqiad.wmnet with reboot policy FORCED [23:59:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1223.eqiad.wmnet with reason: host reimage [23:59:18] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1155.mgmt.eqiad.wmnet with reboot policy FORCED [23:59:53] !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus3002.esams.wmnet [23:59:54] !log denisse@cumin1001 START - Cookbook sre.dns.netbox