[00:00:36] <wikibugs>	 (03PS1) 10Dzahn: ci::master: add parameter to enable/disable monitoring of jenkins/httpd [puppet] - 10https://gerrit.wikimedia.org/r/904374 (https://phabricator.wikimedia.org/T324659)
[00:01:38] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "fixing monitoring alerts via https://gerrit.wikimedia.org/r/c/operations/puppet/+/904374" [puppet] - 10https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn)
[00:02:51] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1225.mgmt.eqiad.wmnet with reboot policy FORCED
[00:04:19] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/904374/40435/" [puppet] - 10https://gerrit.wikimedia.org/r/904374 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn)
[00:07:53] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072.eqiad.wmnet']
[00:08:02] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072.eqiad.wmnet']
[00:09:54] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072']
[00:10:02] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072']
[00:10:51] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072']
[00:11:05] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072']
[00:12:32] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1225.mgmt.eqiad.wmnet with reboot policy FORCED
[00:13:21] <wikibugs>	 (03PS1) 10Dzahn: microsites: do not use TLS when monitoring commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904377 (https://phabricator.wikimedia.org/T327976)
[00:13:35] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1207']
[00:13:46] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1207']
[00:18:07] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1207']
[00:18:12] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1207']
[00:18:43] <wikibugs>	 (03CR) 10Dzahn: "opening https://phabricator.wikimedia.org/T333510 to clean this up for real" [puppet] - 10https://gerrit.wikimedia.org/r/904377 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn)
[00:18:44] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1207']
[00:18:59] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1207']
[00:20:33] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072']
[00:20:45] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072']
[00:21:52] <wikibugs>	 (03PS2) 10Dzahn: microsites: do not use TLS when monitoring commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904377 (https://phabricator.wikimedia.org/T327976)
[00:24:53] <wikibugs>	 (03PS3) 10Dzahn: microsites: do not use TLS when monitoring commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904377 (https://phabricator.wikimedia.org/T327976)
[00:26:31] <icinga-wm>	 PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2023-03-21 00:00:07 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:27:36] <wikibugs>	 (03CR) 10Dzahn: "Can we talk about this setup please? It is a special case that does things differently from everything else on miscweb. This does not work" [puppet] - 10https://gerrit.wikimedia.org/r/719502 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson)
[00:27:36] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1207']
[00:27:51] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1207']
[00:30:19] <icinga-wm>	 RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:44] <wikibugs>	 (03CR) 10Dzahn: "please reach out to serviceops-collab team when adding new sites to miscweb in the future so that we can assist you with the certs, monito" [puppet] - 10https://gerrit.wikimedia.org/r/719502 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson)
[00:31:04] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] microsites: do not use TLS when monitoring commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904377 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn)
[00:35:02] <wikibugs>	 (03PS1) 10Dzahn: microsites: for transparency.wikimedia.org except HTTP 302, not 301 [puppet] - 10https://gerrit.wikimedia.org/r/904378 (https://phabricator.wikimedia.org/T327976)
[00:35:28] <wikibugs>	 (03PS2) 10Dzahn: microsites: for transparency.wikimedia.org expect HTTP 302, not 301 [puppet] - 10https://gerrit.wikimedia.org/r/904378 (https://phabricator.wikimedia.org/T327976)
[00:35:43] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] microsites: for transparency.wikimedia.org expect HTTP 302, not 301 [puppet] - 10https://gerrit.wikimedia.org/r/904378 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn)
[00:37:50] <wikibugs>	 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T333328 (10wiki_willy) a:03Papaul
[00:43:18] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul) @jbond all the servers @Jclark-ctr and I worked on are failing with the error below.  ` START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1207'] Managem...
[00:45:32] <jinxer-wm>	 (ProbeDown) firing: (8) Service miscweb1002:443 has failed probes (http_commons_query_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:48:26] <wikibugs>	 (03PS1) 10Ssingh: pybal: port check_pybal_ipvs_diff.py to urllib2 [puppet] - 10https://gerrit.wikimedia.org/r/904381 (https://phabricator.wikimedia.org/T321309)
[00:49:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] pybal: port check_pybal_ipvs_diff.py to urllib2 [puppet] - 10https://gerrit.wikimedia.org/r/904381 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[00:51:58] <wikibugs>	 (03PS2) 10Ssingh: pybal: port check_pybal_ipvs_diff.py to urllib2 [puppet] - 10https://gerrit.wikimedia.org/r/904381 (https://phabricator.wikimedia.org/T321309)
[00:52:29] <wikibugs>	 (03PS1) 10Dzahn: microsites: commons-query.wm.org only works on port 80/http [puppet] - 10https://gerrit.wikimedia.org/r/904382 (https://phabricator.wikimedia.org/T333510)
[00:53:41] <wikibugs>	 (03PS2) 10Dzahn: microsites: commons-query.wm.org only works on port 80/http [puppet] - 10https://gerrit.wikimedia.org/r/904382 (https://phabricator.wikimedia.org/T333510)
[00:55:47] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] microsites: commons-query.wm.org only works on port 80/http [puppet] - 10https://gerrit.wikimedia.org/r/904382 (https://phabricator.wikimedia.org/T333510) (owner: 10Dzahn)
[00:55:53] <wikibugs>	 (03PS3) 10Dzahn: microsites: commons-query.wm.org only works on port 80/http [puppet] - 10https://gerrit.wikimedia.org/r/904382 (https://phabricator.wikimedia.org/T333510)
[00:57:00] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40436/console" [puppet] - 10https://gerrit.wikimedia.org/r/904381 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[00:57:21] <icinga-wm>	 RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2023-03-28 17:02:45 (4227 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:01:47] <wikibugs>	 (03PS3) 10Ssingh: pybal: port check_pybal_ipvs_diff.py to urllib2 [puppet] - 10https://gerrit.wikimedia.org/r/904381 (https://phabricator.wikimedia.org/T321309)
[01:05:32] <jinxer-wm>	 (ProbeDown) firing: (4) Service miscweb1002:443 has failed probes (http_commons_query_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:05:47] <jinxer-wm>	 (ProbeDown) firing: (4) Service miscweb1002:443 has failed probes (http_commons_query_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:06:02] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*contint.*%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max" [puppet] - 10https://gerrit.wikimedia.org/r/904374 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn)
[01:09:55] <wikibugs>	 (03PS1) 10Dzahn: microsites: do not expect 301 for commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904384 (https://phabricator.wikimedia.org/T333507)
[01:10:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] microsites: do not expect 301 for commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904384 (https://phabricator.wikimedia.org/T333507) (owner: 10Dzahn)
[01:10:32] <jinxer-wm>	 (ProbeDown) firing: (6) Service miscweb1002:443 has failed probes (http_commons_query_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:11:15] <wikibugs>	 (03PS2) 10Dzahn: microsites: do not expect 301 for commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904384 (https://phabricator.wikimedia.org/T333507)
[01:11:32] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] microsites: do not expect 301 for commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904384 (https://phabricator.wikimedia.org/T333507) (owner: 10Dzahn)
[01:13:10] <wikibugs>	 (03PS3) 10Dzahn: microsites: do not expect 301 for commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904384 (https://phabricator.wikimedia.org/T333507)
[01:13:16] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2] microsites: do not expect 301 for commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/904384 (https://phabricator.wikimedia.org/T333507) (owner: 10Dzahn)
[01:15:32] <jinxer-wm>	 (ProbeDown) firing: (6) Service miscweb1002:443 has failed probes (http_commons_query_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:20:32] <jinxer-wm>	 (ProbeDown) resolved: (7) Service miscweb1002:443 has failed probes (http_commons_query_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:25:58] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2 C: 03+2] "works per https://thanos.wikimedia.org/graph?g0.deduplicate=1&g0.expr=probe_success%7Binstance%3D~%22.*miscweb.*%22%7D&g0.max_source_resol" [puppet] - 10https://gerrit.wikimedia.org/r/904273 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn)
[01:36:41] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Dzahn) The production role for ci::master is now applied on contint2002.  Some minor follow-ups were needed:  - run puppet mult...
[01:37:23] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Dzahn) 05Open→03In progress
[01:38:37] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[01:46:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul) @Jclark-ctr when you are back on site can you please check the network mgmt cable for db1209 and db1210. Thanks
[01:48:37] <jinxer-wm>	 (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[01:53:33] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul)
[01:53:34] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:34] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:23:42] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:33:37] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[02:38:37] <jinxer-wm>	 (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[04:38:37] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[04:43:37] <jinxer-wm>	 (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[05:13:37] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[05:18:37] <jinxer-wm>	 (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[05:38:37] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[05:43:37] <jinxer-wm>	 (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T0600).
[06:10:37] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[06:15:37] <jinxer-wm>	 (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[06:23:34] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:33:37] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[06:38:37] <jinxer-wm>	 (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[06:51:58] <wikibugs>	 (03PS11) 10Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/904102
[06:52:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:httpd move htcacheclean to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede)
[06:53:43] <wikibugs>	 (03PS12) 10Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/904102
[06:55:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:httpd move htcacheclean to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede)
[06:58:48] <wikibugs>	 (03PS13) 10Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/904102
[07:00:05] <jouncebot>	 Amir1, apergos, and jnuche: May I have your attention please! UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T0700)
[07:00:44] <apergos>	 ah, let me check
[07:01:10] <apergos>	 no trainees signed up for today
[07:01:39] <apergos>	 aaaand no patches scheduled in the window either, a nice quiet morning for everyone
[07:01:44] <apergos>	 so see you all next time!
[07:02:25] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40439/console" [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede)
[07:05:58] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40441/console" [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede)
[07:11:28] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40442/console" [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede)
[07:14:21] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] C:httpd move htcacheclean to httpd class (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede)
[07:15:00] <wikibugs>	 (03PS14) 10Slyngshede: C:httpd move htcacheclean to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/904102
[07:38:37] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[07:40:00] <wikibugs>	 (03PS1) 10Jameel Kaisar: Add CORS headers to http endpoints of measure-dc domains [puppet] - 10https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028)
[07:43:37] <jinxer-wm>	 (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[07:46:22] <wikibugs>	 (03CR) 10David Caro: maintain-dbusers: run isort and black and use pep563 types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902815 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro)
[07:46:56] <wikibugs>	 (03CR) 10David Caro: maintain-dbusers: refactor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro)
[07:48:12] <wikibugs>	 (03PS8) 10David Caro: maintain-dbusers: refactor [puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663)
[07:48:37] <jinxer-wm>	 (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[07:49:32] <wikibugs>	 (03PS9) 10David Caro: maintain-dbusers: refactor [puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663)
[07:50:13] <wikibugs>	 (03PS10) 10David Caro: maintain-dbusers: refactor [puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663)
[07:50:42] <wikibugs>	 (03CR) 10Ayounsi: "Thanks, some comments and I think we should be able to solve usecase #3" [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond)
[07:53:36] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: thumbor: Switch all summaries to histograms [deployment-charts] - 10https://gerrit.wikimedia.org/r/904452 (https://phabricator.wikimedia.org/T333445)
[07:56:17] <wikibugs>	 (03CR) 10David Caro: "I'm a bit confused about what this diff is showing :S, I'll try to rebase and resend, it seems to show a "Base" version that is not the on" [puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro)
[07:59:23] <wikibugs>	 (03CR) 10Ayounsi: Add CORS headers to http endpoints of measure-dc domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar)
[08:07:14] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] k8s: Remove unused token hiera keys [labs/private] - 10https://gerrit.wikimedia.org/r/904179 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[08:08:14] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] k8s: Remove references to unused token hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/904181 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[08:08:30] <wikibugs>	 10SRE, 10Data-Engineering, 10SRE Observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (10ayounsi) > @ayounsi - are you able to confirm that dropped packets are no longer a problem for this host from the logstash firewall dashboards? I confirm.
[08:09:00] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] wmnet: prep statsd/graphite records for easier write failover [dns] - 10https://gerrit.wikimedia.org/r/904185 (https://phabricator.wikimedia.org/T239862) (owner: 10Filippo Giunchedi)
[08:10:26] <wikibugs>	 (03CR) 10Vgutierrez: "considering it's key to work properly please add a CORS check to 22-measure.vtc" [puppet] - 10https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar)
[08:15:54] <wikibugs>	 (03CR) 10Jameel Kaisar: Add CORS headers to http endpoints of measure-dc domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar)
[08:17:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] profile: remove hardcoded statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/904186 (https://phabricator.wikimedia.org/T239862) (owner: 10Filippo Giunchedi)
[08:18:36] <wikibugs>	 (03PS1) 10Elukey: role::kafka::jumbo::broker: upgrade all brokers to PKI [puppet] - 10https://gerrit.wikimedia.org/r/904455 (https://phabricator.wikimedia.org/T296064)
[08:19:13] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: openstack::nutcracker: Remove redis support [puppet] - 10https://gerrit.wikimedia.org/r/902074
[08:19:15] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: thumbor: Switch all summaries to histograms [puppet] - 10https://gerrit.wikimedia.org/r/904456 (https://phabricator.wikimedia.org/T333445)
[08:19:39] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM if it works.  Reading about the 'as-path unique-count' it doesn't count confed AS's, or multiples of the same AS, so I'm wondering wo" [homer/public] - 10https://gerrit.wikimedia.org/r/904150 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi)
[08:19:51] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40443/console" [puppet] - 10https://gerrit.wikimedia.org/r/904455 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey)
[08:20:04] <wikibugs>	 (03PS2) 10Filippo Giunchedi: profile: remove hardcoded statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/904186 (https://phabricator.wikimedia.org/T239862)
[08:20:06] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "@andrewbogott, reviews welcome!" [puppet] - 10https://gerrit.wikimedia.org/r/902074 (owner: 10Alexandros Kosiaris)
[08:22:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] profile: remove hardcoded statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/904186 (https://phabricator.wikimedia.org/T239862) (owner: 10Filippo Giunchedi)
[08:25:46] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: thumbor: Switch all summaries to histograms [puppet] - 10https://gerrit.wikimedia.org/r/904456 (https://phabricator.wikimedia.org/T333445)
[08:25:51] <wikibugs>	 (03PS2) 10Jameel Kaisar: Add CORS headers to http endpoints of measure-dc domains [puppet] - 10https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028)
[08:26:58] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "Tested on superset-next (with/without CAS auth) and it works nicely :)" [puppet] - 10https://gerrit.wikimedia.org/r/902107 (https://phabricator.wikimedia.org/T310009) (owner: 10Volans)
[08:27:54] <wikibugs>	 (03CR) 10Jameel Kaisar: Add CORS headers to http endpoints of measure-dc domains (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar)
[08:29:05] <wikibugs>	 (03PS2) 10Elukey: role::kafka::main: deploy PKI migration settings [puppet] - 10https://gerrit.wikimedia.org/r/901551 (https://phabricator.wikimedia.org/T319372)
[08:30:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: "PCC fails (float vs integer) https://puppet-compiler.wmflabs.org/output/904456/40446/thumbor1001.eqiad.wmnet/change.thumbor1001.eqiad.wmne" [puppet] - 10https://gerrit.wikimedia.org/r/904456 (https://phabricator.wikimedia.org/T333445) (owner: 10Alexandros Kosiaris)
[08:34:14] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Great! Thanks elukey." [puppet] - 10https://gerrit.wikimedia.org/r/904455 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey)
[08:34:37] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[08:35:15] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] k8s: Remove unused token hiera keys [labs/private] - 10https://gerrit.wikimedia.org/r/904179 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[08:36:35] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: thumbor: Switch all summaries to histograms [puppet] - 10https://gerrit.wikimedia.org/r/904456 (https://phabricator.wikimedia.org/T333445)
[08:37:26] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] role::kafka::main: deploy PKI migration settings [puppet] - 10https://gerrit.wikimedia.org/r/901551 (https://phabricator.wikimedia.org/T319372) (owner: 10Elukey)
[08:38:11] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40447/console" [puppet] - 10https://gerrit.wikimedia.org/r/904181 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[08:38:18] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:38:19] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::kafka::main: deploy PKI migration settings [puppet] - 10https://gerrit.wikimedia.org/r/901551 (https://phabricator.wikimedia.org/T319372) (owner: 10Elukey)
[08:38:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40448/console" [puppet] - 10https://gerrit.wikimedia.org/r/904456 (https://phabricator.wikimedia.org/T333445) (owner: 10Alexandros Kosiaris)
[08:39:22] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s: Remove references to unused token hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/904181 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[08:39:37] <jinxer-wm>	 (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[08:43:17] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:43:37] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[08:44:19] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "looking good, tests are happy:" [puppet] - 10https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar)
[08:45:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/904456 (https://phabricator.wikimedia.org/T333445) (owner: 10Alexandros Kosiaris)
[08:47:07] <wikibugs>	 (03CR) 10Alexandros Kosiaris: WIP: Add new self hosted machinetranslation service (MinT) (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry)
[08:47:13] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry)
[08:47:40] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/902107 (https://phabricator.wikimedia.org/T310009) (owner: 10Volans)
[08:48:37] <jinxer-wm>	 (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[08:49:57] <wikibugs>	 (03PS3) 10Jameel Kaisar: Add CORS headers to http endpoints of measure-dc domains [puppet] - 10https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028)
[08:51:16] <wikibugs>	 (03CR) 10Jameel Kaisar: Add CORS headers to http endpoints of measure-dc domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar)
[08:53:02] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] WIP: Add new self hosted machinetranslation service (MinT) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry)
[08:54:33] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm daemons.
[08:55:36] <elukey>	 !log move kafka main clusters to new truststore (PKI+Puppet root CA certs) - T319372
[08:55:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:41] <stashbot>	 T319372: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372
[08:58:34] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: Serve an HTTP response for measurement domains directly from Varnish - https://phabricator.wikimedia.org/T332028 (10JameelKaisar) a:03JameelKaisar
[08:58:55] <wikibugs>	 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Thumbor-k8s performance improvements - https://phabricator.wikimedia.org/T333445 (10akosiaris) I've uploaded a couple of changes to switch summaries to histograms in both environments. That way we will be able to have aggregatable data acros...
[09:04:46] <godog>	 !log silence LogstashIndexingFailures during investigation T180051
[09:04:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:52] <stashbot>	 T180051: Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051
[09:05:45] <wikibugs>	 (03PS1) 10Clément Goubert: jobrunners: Raise memory_limit to match parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904463 (https://phabricator.wikimedia.org/T333528)
[09:06:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] jobrunners: Raise memory_limit to match parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904463 (https://phabricator.wikimedia.org/T333528) (owner: 10Clément Goubert)
[09:09:00] <claime>	 !log Merging mw-on-k8s ATS lua routing script - T331318
[09:09:03] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] trafficserver: make routing to mw on k8s more manageable [puppet] - 10https://gerrit.wikimedia.org/r/900704 (https://phabricator.wikimedia.org/T331318) (owner: 10Giuseppe Lavagetto)
[09:09:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:05] <stashbot>	 T331318: Find a sensible way to direct traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318
[09:12:07] <wikibugs>	 (03PS1) 10DCausse: [DNM] flink-app: always include /etc/envoy/ssl/ca.crt [deployment-charts] - 10https://gerrit.wikimedia.org/r/904464 (https://phabricator.wikimedia.org/T328675)
[09:12:38] <claime>	 !log puppet disabled for A:cp-text - T331318
[09:12:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [DNM] flink-app: always include /etc/envoy/ssl/ca.crt [deployment-charts] - 10https://gerrit.wikimedia.org/r/904464 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[09:15:55] <claime>	 !log puppet disabled for A:cp-upload - T331318
[09:16:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:01] <stashbot>	 T331318: Find a sensible way to direct traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318
[09:16:48] <claime>	 !log Running puppet on cp2028.codfw.wmnet (cp-upload noop test) - T331318
[09:16:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:27] <wikibugs>	 (03PS3) 10Ayounsi: Add policy to export prefixes to k8s nodes [homer/public] - 10https://gerrit.wikimedia.org/r/904150 (https://phabricator.wikimedia.org/T328523)
[09:19:29] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/904150 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi)
[09:20:41] <wikibugs>	 (03PS2) 10DCausse: [DNM] flink-app: always include /etc/envoy/ssl/ca.crt [deployment-charts] - 10https://gerrit.wikimedia.org/r/904464 (https://phabricator.wikimedia.org/T328675)
[09:23:53] <claime>	 !log Re-enabling puppet for A:cp-upload - T331318
[09:23:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:59] <stashbot>	 T331318: Find a sensible way to direct traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318
[09:24:45] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:25:38] <wikibugs>	 10SRE, 10Data-Engineering, 10SRE Observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (10fgiunchedi) >>! In T238794#8738885, @BTullis wrote:  > @fgiunchedi - is this just a matter of removing some old config now? Or is there another reason why we're not seeing traff...
[09:27:03] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add policy to export prefixes to k8s nodes [homer/public] - 10https://gerrit.wikimedia.org/r/904150 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi)
[09:27:38] <wikibugs>	 (03Merged) 10jenkins-bot: Add policy to export prefixes to k8s nodes [homer/public] - 10https://gerrit.wikimedia.org/r/904150 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi)
[09:28:02] <logmsgbot>	 !log joal@deploy2002 Started deploy [analytics/refinery@359f4bd]: Regular analytics weekly train (2nd) [analytics/refinery@359f4bd]
[09:28:43] <wikibugs>	 (03CR) 10DCausse: [DNM] flink-app: always include /etc/envoy/ssl/ca.crt (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/904464 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[09:32:20] <wikibugs>	 (03CR) 10DCausse: "not meant to be merged, just to illustrate what I think might be the cause of:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/904464 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[09:32:27] <wikibugs>	 (03CR) 10DCausse: [C: 04-2] [DNM] flink-app: always include /etc/envoy/ssl/ca.crt [deployment-charts] - 10https://gerrit.wikimedia.org/r/904464 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[09:33:54] <logmsgbot>	 !log joal@deploy2002 Finished deploy [analytics/refinery@359f4bd]: Regular analytics weekly train (2nd) [analytics/refinery@359f4bd] (duration: 05m 53s)
[09:34:37] <logmsgbot>	 !log joal@deploy2002 Started deploy [analytics/refinery@359f4bd] (thin): Regular analytics weekly train (2nd) THIN [analytics/refinery@359f4bd]
[09:34:45] <logmsgbot>	 !log joal@deploy2002 Finished deploy [analytics/refinery@359f4bd] (thin): Regular analytics weekly train (2nd) THIN [analytics/refinery@359f4bd] (duration: 00m 08s)
[09:35:03] <claime>	 !log Re-enabling puppet for cp4037 - T331318
[09:35:06] <logmsgbot>	 !log joal@deploy2002 Started deploy [analytics/refinery@359f4bd] (hadoop-test): Regular analytics weekly train (2nd) TEST [analytics/refinery@359f4bd]
[09:35:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:10] <stashbot>	 T331318: Find a sensible way to direct traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318
[09:36:34] <logmsgbot>	 !log joal@deploy2002 Finished deploy [analytics/refinery@359f4bd] (hadoop-test): Regular analytics weekly train (2nd) TEST [analytics/refinery@359f4bd] (duration: 01m 28s)
[09:37:13] <wikibugs>	 (03CR) 10Jaime Nuche: "Is there value in updating the failing gerrit-git-fat-pull job to test lfs? From what I gather maybe it's not worth it we can just remove " [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904239 (https://phabricator.wikimedia.org/T333465) (owner: 10Hashar)
[09:37:56] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "thanks lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/904102 (owner: 10Slyngshede)
[09:38:29] <wikibugs>	 (03PS1) 10Ayounsi: Move the as-path-regex out of the policy [homer/public] - 10https://gerrit.wikimedia.org/r/904486 (https://phabricator.wikimedia.org/T328523)
[09:39:14] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Move the as-path-regex out of the policy [homer/public] - 10https://gerrit.wikimedia.org/r/904486 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi)
[09:39:48] <wikibugs>	 (03Merged) 10jenkins-bot: Move the as-path-regex out of the policy [homer/public] - 10https://gerrit.wikimedia.org/r/904486 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi)
[09:43:05] <wikibugs>	 (03PS1) 10Btullis: Bump datahub version to 0.10.0 and re-enable standalone consumers [deployment-charts] - 10https://gerrit.wikimedia.org/r/904487 (https://phabricator.wikimedia.org/T329514)
[09:43:48] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] thumbor: Switch all summaries to histograms [deployment-charts] - 10https://gerrit.wikimedia.org/r/904452 (https://phabricator.wikimedia.org/T333445) (owner: 10Alexandros Kosiaris)
[09:44:24] <claime>	 !log Re-enabling puppet for cp-text_ulsfo - T331318
[09:44:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:30] <stashbot>	 T331318: Find a sensible way to direct traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318
[09:47:56] <logmsgbot>	 !log joal@deploy2002 Started deploy [airflow-dags/analytics@b7b41ae]: Regular analytics weekly train (2nd) [airflow-dags/analytics@b7b41ae]
[09:48:07] <logmsgbot>	 !log joal@deploy2002 Finished deploy [airflow-dags/analytics@b7b41ae]: Regular analytics weekly train (2nd) [airflow-dags/analytics@b7b41ae] (duration: 00m 11s)
[09:48:28] <wikibugs>	 (03PS5) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832)
[09:49:08] <wikibugs>	 (03PS2) 10Clément Goubert: jobrunners: Raise memory_limit to match parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904463 (https://phabricator.wikimedia.org/T333528)
[09:49:19] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10oleksandr_tsyba_WMDE) Thank you, @Ladsgroup! 🙏  //*in case of emergency // ` git rebase -i HEAD~1 drop 20bd3f71404912f60f45c1a84f0ed7d76386d6a5 git push --force` `
[09:50:07] <icinga-wm>	 RECOVERY - Check systemd state on phab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:53:27] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[09:53:30] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[09:54:20] <wikibugs>	 (03PS1) 10JMeybohm: k8s rsyslog: Use client cert instead of token [puppet] - 10https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268)
[09:54:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] k8s rsyslog: Use client cert instead of token [puppet] - 10https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm)
[09:55:36] <wikibugs>	 (03CR) 10LSobanski: [C: 03+1] alertmanager: create receiver for both sre-collab and releng combined [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[09:55:42] <wikibugs>	 (03PS2) 10JMeybohm: k8s rsyslog: Use client cert instead of token [puppet] - 10https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268)
[09:56:21] <wikibugs>	 (03PS3) 10JMeybohm: k8s rsyslog: Use client cert instead of token [puppet] - 10https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268)
[09:58:19] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[09:58:22] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[10:00:05] <jouncebot>	 mvolz: That opportune time is upon us again. Time for a Services – Citoid / Zotero deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1000).
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1000)
[10:02:08] <wikibugs>	 (03PS4) 10Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669)
[10:04:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond)
[10:04:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P45985 and previous config saved to /var/cache/conftool/dbconfig/20230330-100457-ladsgroup.json
[10:08:28] <wikibugs>	 (03PS9) 10Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272)
[10:08:30] <wikibugs>	 (03PS5) 10Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669)
[10:09:09] <wikibugs>	 (03PS1) 10Vgutierrez: purged: Don't specify the kafka compression codec [puppet] - 10https://gerrit.wikimedia.org/r/904490 (https://phabricator.wikimedia.org/T332669)
[10:10:12] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1181 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/904261 (https://phabricator.wikimedia.org/T333538)
[10:10:14] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40451/console" [puppet] - 10https://gerrit.wikimedia.org/r/904489 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm)
[10:10:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond)
[10:11:50] <wikibugs>	 (03PS2) 10Vgutierrez: purged: Don't specify the kafka compression codec [puppet] - 10https://gerrit.wikimedia.org/r/904490 (https://phabricator.wikimedia.org/T332669)
[10:12:34] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[10:12:37] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[10:12:59] <wikibugs>	 (03PS10) 10Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272)
[10:13:01] <wikibugs>	 (03PS6) 10Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669)
[10:13:20] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40452/console" [puppet] - 10https://gerrit.wikimedia.org/r/904490 (https://phabricator.wikimedia.org/T332669) (owner: 10Vgutierrez)
[10:15:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond)
[10:16:21] <wikibugs>	 (03CR) 10Jbond: "updated i have also updated the paste and included the devices with no primary ipv4 which is quite a lot.  wonder if this is expected?" [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond)
[10:18:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 28 hosts with reason: Primary switchover s7 T333538
[10:18:43] <stashbot>	 T333538: Switchover s7 master (db1136 -> db1181) - https://phabricator.wikimedia.org/T333538
[10:19:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s7 T333538
[10:20:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P45987 and previous config saved to /var/cache/conftool/dbconfig/20230330-102002-ladsgroup.json
[10:20:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1181 with weight 0 T333538', diff saved to https://phabricator.wikimedia.org/P45988 and previous config saved to /var/cache/conftool/dbconfig/20230330-102012-ladsgroup.json
[10:23:42] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:27:44] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED
[10:28:48] <wikibugs>	 (03PS1) 10Hnowlan: Add service records for rest-gateway [dns] - 10https://gerrit.wikimedia.org/r/904493 (https://phabricator.wikimedia.org/T329074)
[10:28:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:29:29] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED
[10:33:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:35:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P45989 and previous config saved to /var/cache/conftool/dbconfig/20230330-103506-ladsgroup.json
[10:35:42] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm daemons.
[10:35:50] <wikibugs>	 (03PS3) 10Hnowlan: service, k8s: Add service definitions for rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/891510 (https://phabricator.wikimedia.org/T329049)
[10:42:11] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] spark: provide CRUD rights on secret for spark-deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/902391 (https://phabricator.wikimedia.org/T332908) (owner: 10Nicolas Fraison)
[10:44:22] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED
[10:45:57] <Amir1>	 !log Starting s7 eqiad failover from db1136 to db1181 - T333538
[10:46:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:02] <stashbot>	 T333538: Switchover s7 master (db1136 -> db1181) - https://phabricator.wikimedia.org/T333538
[10:46:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1181 to s7 primary T333538', diff saved to https://phabricator.wikimedia.org/P45992 and previous config saved to /var/cache/conftool/dbconfig/20230330-104617-ladsgroup.json
[10:46:39] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] mariadb: Promote db1181 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/904261 (https://phabricator.wikimedia.org/T333538) (owner: 10Gerrit maintenance bot)
[10:49:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1136 T333538', diff saved to https://phabricator.wikimedia.org/P45993 and previous config saved to /var/cache/conftool/dbconfig/20230330-104928-ladsgroup.json
[10:50:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P45994 and previous config saved to /var/cache/conftool/dbconfig/20230330-105011-ladsgroup.json
[10:51:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[10:51:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[10:53:43] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:55:49] <Amir1>	 jouncebot: nowandnext
[10:55:49] <jouncebot>	 For the next 0 hour(s) and 4 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1000)
[10:55:50] <jouncebot>	 For the next 0 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1000)
[10:55:50] <jouncebot>	 In 2 hour(s) and 4 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1300)
[10:55:50] <jouncebot>	 In 2 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1300)
[10:56:18] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+1] role::kafka::jumbo::broker: upgrade all brokers to PKI [puppet] - 10https://gerrit.wikimedia.org/r/904455 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey)
[10:57:52] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Add service records for rest-gateway [dns] - 10https://gerrit.wikimedia.org/r/904493 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan)
[10:58:02] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] service, k8s: Add service definitions for rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/891510 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan)
[10:58:28] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:58:39] <logmsgbot>	 !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1075.mgmt.eqiad.wmnet with reboot policy FORCED
[10:58:52] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED
[10:59:01] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Revert "Revert "mwscript: Switch to use run.php"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893552 (https://phabricator.wikimedia.org/T326800) (owner: 10Ladsgroup)
[10:59:48] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "mwscript: Switch to use run.php"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893552 (https://phabricator.wikimedia.org/T326800) (owner: 10Ladsgroup)
[11:00:41] <wikibugs>	 (03PS1) 10EoghanGaffney: Adds flag to start after unmask, starts logrotate [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869)
[11:02:29] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:893552|Revert "Revert "mwscript: Switch to use run.php"" (T326800)]]
[11:02:34] <stashbot>	 T326800: Make Wikimedia mwscript use run.php to run maintenance scripts - https://phabricator.wikimedia.org/T326800
[11:02:40] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10serviceops, 10Sustainability (Incident Followup): Expand upon Kask/Sessionstore documentation - https://phabricator.wikimedia.org/T320398 (10hnowlan) >>! In T320398#8718719, @akosiaris wrote: > * Some links to important graphs to look at and correlate when in an...
[11:03:52] <claime>	 !log Re-enabling puppet for cp-text - T331318
[11:03:56] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:893552|Revert "Revert "mwscript: Switch to use run.php"" (T326800)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[11:03:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:58] <stashbot>	 T331318: Find a sensible way to direct traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318
[11:04:46] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Bump datahub version to 0.10.0 and re-enable standalone consumers [deployment-charts] - 10https://gerrit.wikimedia.org/r/904487 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[11:04:55] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:05:20] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] Add service records for rest-gateway [dns] - 10https://gerrit.wikimedia.org/r/904493 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan)
[11:06:19] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[11:07:12] <hnowlan>	 that's me ^ fixing in a second
[11:08:32] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.dns.netbox
[11:09:43] <wikibugs>	 (03Merged) 10jenkins-bot: Bump datahub version to 0.10.0 and re-enable standalone consumers [deployment-charts] - 10https://gerrit.wikimedia.org/r/904487 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[11:10:28] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:893552|Revert "Revert "mwscript: Switch to use run.php"" (T326800)]] (duration: 07m 59s)
[11:10:34] <stashbot>	 T326800: Make Wikimedia mwscript use run.php to run maintenance scripts - https://phabricator.wikimedia.org/T326800
[11:10:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:11:11] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add service records for rest-gateway - hnowlan@cumin1001"
[11:11:17] <wikibugs>	 (03PS1) 10JMeybohm: WIP: deployment_server: Create k8s configs with pki certs [puppet] - 10https://gerrit.wikimedia.org/r/904500 (https://phabricator.wikimedia.org/T325268)
[11:12:11] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add service records for rest-gateway - hnowlan@cumin1001"
[11:12:11] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:13:30] <wikibugs>	 (03PS1) 10Jaime Nuche: scap: block Scap execution on inactive deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756)
[11:15:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:17:04] <wikibugs>	 (Abandoned) Hnowlan: restbase-dev: create new codfw cluster, replace old eqiad cluster [puppet] - https://gerrit.wikimedia.org/r/766082 (https://phabricator.wikimedia.org/T295375) (owner: Hnowlan)
[11:17:29] <wikibugs>	 (CR) Hnowlan: [C: +2] service, k8s: Add service definitions for rest-gateway [puppet] - https://gerrit.wikimedia.org/r/891510 (https://phabricator.wikimedia.org/T329049) (owner: Hnowlan)
[11:22:35] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[11:24:09] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:24:21] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:24:47] <wikibugs>	 (PS2) Jaime Nuche: scap: block Scap execution on inactive deployment hosts [puppet] - https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756)
[11:25:31] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:25:45] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.964 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:27:08] <wikibugs>	 (PS7) Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669)
[11:27:10] <wikibugs>	 (PS11) Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272)
[11:28:54] <wikibugs>	 (PS1) Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/904510
[11:29:06] <wikibugs>	 (CR) CI reject: [V: -1] sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - https://gerrit.wikimedia.org/r/904158 (https://phabricator.wikimedia.org/T329669) (owner: Jbond)
[11:29:15] <wikibugs>	 (CR) CI reject: [V: -1] sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: Jbond)
[11:31:08] <wikibugs>	 (CR) CI reject: [V: -1] sre.hosts.reimage: merge reimage cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/904510 (owner: Slyngshede)
[11:31:27] <wikibugs>	 (CR) Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/904502/40456/" [puppet] - https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: Jaime Nuche)
[11:36:48] <wikibugs>	 (PS1) Hnowlan: kubernetes: add dummy tokens for rest-gateway [labs/private] - https://gerrit.wikimedia.org/r/904511 (https://phabricator.wikimedia.org/T329049)
[11:44:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[11:44:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[11:47:48] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[11:49:54] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns an-worker1149-56 - jclark@cumin1001"
[11:50:50] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns an-worker1149-56 - jclark@cumin1001"
[11:50:50] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:51:34] <wikibugs>	 SRE, Infrastructure-Foundations: Bug in bridge-utils breaks IPv6 on interface if its not part of a bridge but vlan sub-int of it is - https://phabricator.wikimedia.org/T320429 (jbond) just noting that ganeti also seems to hit this issue also reported in T233906
[11:55:56] <wikibugs>	 (PS2) Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/904510
[11:56:00] <wikibugs>	 (PS1) Ladsgroup: Set externallinks to WRITE BOTH everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/904512 (https://phabricator.wikimedia.org/T321662)
[11:57:25] <Amir1>	 jouncebot: nowandnext
[11:57:25] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 2 minute(s)
[11:57:25] <jouncebot>	 In 1 hour(s) and 2 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1300)
[11:57:25] <jouncebot>	 In 1 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1300)
[11:57:29] <logmsgbot>	 !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[11:57:32] <Amir1>	 coooool
[11:57:58] <wikibugs>	 (CR) CI reject: [V: -1] sre.hosts.reimage: merge reimage cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/904510 (owner: Slyngshede)
[11:58:45] <wikibugs>	 (CR) Ladsgroup: [C: +2] Set externallinks to WRITE BOTH everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/904512 (https://phabricator.wikimedia.org/T321662) (owner: Ladsgroup)
[11:59:29] <wikibugs>	 (Merged) jenkins-bot: Set externallinks to WRITE BOTH everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/904512 (https://phabricator.wikimedia.org/T321662) (owner: Ladsgroup)
[12:00:23] <wikibugs>	 (PS1) Jcrespo: mediabackups: Add static console port for easier remote management [puppet] - https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602)
[12:00:29] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:904512|Set externallinks to WRITE BOTH everywhere (T321662)]]
[12:00:31] <wikibugs>	 (PS3) Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/904510
[12:00:36] <stashbot>	 T321662: Enable write both for externallinks in beta and production - https://phabricator.wikimedia.org/T321662
[12:00:45] <wikibugs>	 (CR) CI reject: [V: -1] mediabackups: Add static console port for easier remote management [puppet] - https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) (owner: Jcrespo)
[12:02:12] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:904512|Set externallinks to WRITE BOTH everywhere (T321662)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[12:02:13] <Amir1>	 the scap is erroring for helm
[12:02:16] <Amir1>	 Error: Kubernetes cluster unreachable: Get "https://kubemaster.svc.eqiad.wmnet:6443/version": dial tcp 10.2.2.8:6443: connect: connection refused
[12:02:29] <wikibugs>	 (CR) CI reject: [V: -1] sre.hosts.reimage: merge reimage cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/904510 (owner: Slyngshede)
[12:02:31] <Amir1>	 claime: ^ :D
[12:03:26] <wikibugs>	 (PS2) Jcrespo: mediabackups: Add static console port for easier remote management [puppet] - https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602)
[12:06:22] <wikibugs>	 (PS4) Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/904510
[12:07:43] <wikibugs>	 (CR) Kamila Součková: [C: +1] "LGTM." [labs/private] - https://gerrit.wikimedia.org/r/904511 (https://phabricator.wikimedia.org/T329049) (owner: Hnowlan)
[12:08:39] <logmsgbot>	 !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[12:10:24] <Amir1>	 tested on mwdebug on s4, s8 and s7 and everything worked fine
[12:12:07] <wikibugs>	 (PS12) Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272)
[12:12:54] <wikibugs>	 (PS1) Btullis: Run the datahub consumers in the GMS context [deployment-charts] - https://gerrit.wikimedia.org/r/904517 (https://phabricator.wikimedia.org/T329514)
[12:14:12] <wikibugs>	 (CR) CI reject: [V: -1] sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: Jbond)
[12:15:28] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:904512|Set externallinks to WRITE BOTH everywhere (T321662)]] (duration: 14m 58s)
[12:15:34] <stashbot>	 T321662: Enable write both for externallinks in beta and production - https://phabricator.wikimedia.org/T321662
[12:16:45] <wikibugs>	 SRE, SRE-Access-Requests, API Platform: Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (Atieno) a:Atieno
[12:17:17] <wikibugs>	 SRE, SRE-Access-Requests, API Platform (Sprint 06): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (Atieno)
[12:17:18] <logmsgbot>	 !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED
[12:17:24] <wikibugs>	 (CR) Jcrespo: "Ready to go: https://puppet-compiler.wmflabs.org/output/904514/40457/backup1004.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/904514 (https://phabricator.wikimedia.org/T306602) (owner: Jcrespo)
[12:17:33] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED
[12:17:59] <logmsgbot>	 !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1074.mgmt.eqiad.wmnet with reboot policy FORCED
[12:18:02] <wikibugs>	 SRE, SRE-Access-Requests, API Platform (Sprint 06): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (Atieno)
[12:18:41] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: profile::bird::anycast: add template parameter [puppet] - https://gerrit.wikimedia.org/r/904518 (https://phabricator.wikimedia.org/T324992)
[12:19:03] <wikibugs>	 (CR) CI reject: [V: -1] profile::bird::anycast: add template parameter [puppet] - https://gerrit.wikimedia.org/r/904518 (https://phabricator.wikimedia.org/T324992) (owner: Arturo Borrero Gonzalez)
[12:19:14] <wikibugs>	 (CR) Btullis: [C: +2] Run the datahub consumers in the GMS context [deployment-charts] - https://gerrit.wikimedia.org/r/904517 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[12:23:13] <wikibugs>	 (CR) Ayounsi: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: Jbond)
[12:24:06] <wikibugs>	 (Merged) jenkins-bot: Run the datahub consumers in the GMS context [deployment-charts] - https://gerrit.wikimedia.org/r/904517 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[12:25:37] <wikibugs>	 (PS1) Majavah: labstore: add dumps access for dump-references-processor [puppet] - https://gerrit.wikimedia.org/r/904519
[12:25:47] <wikibugs>	 (PS2) Majavah: labstore: add dumps access for dump-references-processor [puppet] - https://gerrit.wikimedia.org/r/904519
[12:26:06] <wikibugs>	 (PS3) Majavah: labstore: add dumps access for dump-references-processor [puppet] - https://gerrit.wikimedia.org/r/904519 (https://phabricator.wikimedia.org/T333549)
[12:26:10] <wikibugs>	 (CR) CI reject: [V: -1] labstore: add dumps access for dump-references-processor [puppet] - https://gerrit.wikimedia.org/r/904519 (https://phabricator.wikimedia.org/T333549) (owner: Majavah)
[12:26:20] <logmsgbot>	 !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[12:27:06] <wikibugs>	 (Abandoned) DCausse: [DNM] flink-app: always include /etc/envoy/ssl/ca.crt [deployment-charts] - https://gerrit.wikimedia.org/r/904464 (https://phabricator.wikimedia.org/T328675) (owner: DCausse)
[12:27:07] <logmsgbot>	 !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[12:28:13] <wikibugs>	 (PS6) David Caro: maintain-dbusers: run isort and black and use pep563 types [puppet] - https://gerrit.wikimedia.org/r/902815 (https://phabricator.wikimedia.org/T303663)
[12:28:15] <wikibugs>	 (PS6) David Caro: maintain-dbusers: only-users match tool users with or without prefix [puppet] - https://gerrit.wikimedia.org/r/902817 (https://phabricator.wikimedia.org/T332789)
[12:28:17] <wikibugs>	 (PS11) David Caro: maintain-dbusers: refactor [puppet] - https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663)
[12:28:19] <wikibugs>	 (PS6) David Caro: maintain-dbusers: allow filtering by account type for maintain [puppet] - https://gerrit.wikimedia.org/r/902818 (https://phabricator.wikimedia.org/T332954)
[12:28:21] <wikibugs>	 (PS10) David Caro: maintain-dbusers: add prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955)
[12:29:07] <wikibugs>	 (CR) David Caro: [C: +2] labstore: add dumps access for dump-references-processor [puppet] - https://gerrit.wikimedia.org/r/904519 (https://phabricator.wikimedia.org/T333549) (owner: Majavah)
[12:31:10] <wikibugs>	 SRE, SRE-Access-Requests, API Platform: Requesting access to analytics-privatedata-users for sfaci - https://phabricator.wikimedia.org/T333456 (Ladsgroup)
[12:31:45] <wikibugs>	 (PS1) Ottomata: mediawiki-page-content-change-enrichment - allow egress to api-ro [deployment-charts] - https://gerrit.wikimedia.org/r/904520
[12:31:58] <logmsgbot>	 !log joal@deploy2002 Started deploy [airflow-dags/analytics@a6500cf]: Regular analytics weekly train (2nd) HOTFIX [airflow-dags/analytics@a6500cf]
[12:32:09] <logmsgbot>	 !log joal@deploy2002 Finished deploy [airflow-dags/analytics@a6500cf]: Regular analytics weekly train (2nd) HOTFIX [airflow-dags/analytics@a6500cf] (duration: 00m 11s)
[12:32:44] <wikibugs>	 (CR) EoghanGaffney: [C: +1] "It would be nice if we had something that did more than just check for an open port. Possibly something to follow up with observability fo" [puppet] - https://gerrit.wikimedia.org/r/903805 (https://phabricator.wikimedia.org/T331901) (owner: Dzahn)
[12:33:40] <wikibugs>	 (CR) EoghanGaffney: [C: +1] "Same as the feedback in https://gerrit.wikimedia.org/r/c/operations/puppet/+/903805, it would be nice to have something that checks what t" [puppet] - https://gerrit.wikimedia.org/r/903826 (https://phabricator.wikimedia.org/T331901) (owner: Dzahn)
[12:36:08] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons.
[12:37:20] <wikibugs>	 (CR) Gmodena: [C: +1] "LGTM." [deployment-charts] - https://gerrit.wikimedia.org/r/904520 (owner: Ottomata)
[12:37:23] <wikibugs>	 (PS1) EoghanGaffney: Removes unnecessary krb:present line [puppet] - https://gerrit.wikimedia.org/r/904522
[12:39:51] <wikibugs>	 (CR) Elukey: [C: +1] purged: Don't specify the kafka compression codec [puppet] - https://gerrit.wikimedia.org/r/904490 (https://phabricator.wikimedia.org/T332669) (owner: Vgutierrez)
[12:39:56] <wikibugs>	 (CR) Andrew Bogott: "Your arguments are convincing :) For science, I've stopped nutcracker on the cloudweb hosts to confirm that there are no consequences... w" [puppet] - https://gerrit.wikimedia.org/r/902074 (owner: Alexandros Kosiaris)
[12:40:34] <wikibugs>	 (CR) Ottomata: [C: +2] mediawiki-page-content-change-enrichment - allow egress to api-ro [deployment-charts] - https://gerrit.wikimedia.org/r/904520 (owner: Ottomata)
[12:41:48] <vgutierrez>	 thx elukey 
[12:43:23] <icinga-wm>	 PROBLEM - nutcracker process on cloudweb1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker
[12:43:59] <wikibugs>	 (CR) BBlack: [C: +1] "Looks about right to me, assuming it runs successfully :)" [puppet] - https://gerrit.wikimedia.org/r/904381 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[12:44:07] <icinga-wm>	 PROBLEM - nutcracker port on cloudweb1004 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[12:44:07] <icinga-wm>	 PROBLEM - nutcracker process on cloudweb1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker
[12:44:27] <icinga-wm>	 PROBLEM - nutcracker port on cloudweb1003 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[12:45:27] <wikibugs>	 (PS1) Filippo Giunchedi: alertmanager: route data-persistence warnings to -feed [puppet] - https://gerrit.wikimedia.org/r/904525
[12:45:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:46:25] <wikibugs>	 (Merged) jenkins-bot: mediawiki-page-content-change-enrichment - allow egress to api-ro [deployment-charts] - https://gerrit.wikimedia.org/r/904520 (owner: Ottomata)
[12:46:47] <wikibugs>	 SRE-tools, Infrastructure-Foundations: sre.hosts.provision cookbook: check for both default and wmf password - https://phabricator.wikimedia.org/T333554 (ayounsi) p:Triage→Low
[12:47:45] <wikibugs>	 SRE, SRE-Access-Requests, API Platform: Requesting access to analytics-privatedata-users for sfaci - https://phabricator.wikimedia.org/T333456 (Ottomata) Approved.  I'd guess that "AQS troubleshooting" would require kerberos as well.
[12:48:14] <wikibugs>	 (CR) MVernon: [C: +1] "LGTM, thanks, but I think it'd be good to have a +1 from at least one other data-persistence team member before merging." [puppet] - https://gerrit.wikimedia.org/r/904525 (owner: Filippo Giunchedi)
[12:50:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:51:07] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Avoid sub-optimal routing from CR routers to EVPN destinations - https://phabricator.wikimedia.org/T332781 (ayounsi) How does this compare to taking iBGP down between LEAF1 to SPINE2 if the link goes down?
[12:53:37] <claime>	 Amir1: Do you have more output, like which deployment failed?
[12:53:41] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:54:25] <Amir1>	 claime: https://phabricator.wikimedia.org/P45995
[12:54:41] <claime>	 thanks
[12:55:40] <claime>	 So it failed temporarily and only for mw-debug eqiad
[12:55:43] <claime>	 Hmm
[12:56:34] <wikibugs>	 (CR) Ottomata: [C: +1] "😊" [puppet] - https://gerrit.wikimedia.org/r/904455 (https://phabricator.wikimedia.org/T296064) (owner: Elukey)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1300)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1300)
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:06] <wikibugs>	 (PS13) Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272)
[13:02:15] <wikibugs>	 (CR) CI reject: [V: -1] sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: Jbond)
[13:02:39] <claime>	 Amir1: It coincides almost perfectly with puppet runs on both kubemaster1001 and kubemaster1002 where it refreshed kube-apiserver
[13:03:44] <Amir1>	 it's fine for me, I can make it run it again
[13:03:46] <Amir1>	 just to be sure
[13:04:11] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:904512|Set externallinks to WRITE BOTH everywhere (T321662)]]
[13:04:33] <claime>	     Trigger: Thu 2023-03-30 13:29:00 UTC; 24min left
[13:04:37] <claime>	     Trigger: Thu 2023-03-30 13:30:00 UTC; 26min left
[13:04:49] <claime>	 They're a bit too close for two servers that do the exact same thing
[13:04:50] <wikibugs>	 SRE, SRE-Access-Requests, API Platform: Requesting access to analytics-privatedata-users for sfaci - https://phabricator.wikimedia.org/T333456 (Ladsgroup)
[13:05:34] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:904512|Set externallinks to WRITE BOTH everywhere (T321662)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[13:06:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P45996 and previous config saved to /var/cache/conftool/dbconfig/20230330-130625-ladsgroup.json
[13:10:07] <wikibugs>	 (CR) Slyngshede: [C: +1] "LGTM. I can't find the documentation for the logfmt function, but I'd assume it parses logfmt formatted logs, in which case the rest make " [puppet] - https://gerrit.wikimedia.org/r/902334 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[13:10:36] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [restbase/deploy@47f3a61]: (no justification provided)
[13:10:40] <wikibugs>	 (PS14) Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272)
[13:10:59] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:904512|Set externallinks to WRITE BOTH everywhere (T321662)]] (duration: 06m 47s)
[13:10:59] <wikibugs>	 (PS1) JMeybohm: envoy: Add the wmf-certificates package [docker-images/production-images] - https://gerrit.wikimedia.org/r/904527 (https://phabricator.wikimedia.org/T333551)
[13:11:20] <wikibugs>	 (CR) Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: Jbond)
[13:11:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:12:11] <wikibugs>	 (PS2) Volans: superset: add static html for requestctl [puppet] - https://gerrit.wikimedia.org/r/902107 (https://phabricator.wikimedia.org/T310009)
[13:12:31] <wikibugs>	 (CR) Elukey: [C: +1] envoy: Add the wmf-certificates package [docker-images/production-images] - https://gerrit.wikimedia.org/r/904527 (https://phabricator.wikimedia.org/T333551) (owner: JMeybohm)
[13:13:38] <claime>	 Amir1: I guess it deployed correctly
[13:13:40] <claime>	 cgoubert@deploy2002:/srv/deployment-charts/helmfile.d/services/mw-debug$ helmfile -e eqiad status 2>/dev/null  | grep DEPLOYED
[13:13:42] <claime>	 LAST DEPLOYED: Thu Mar 30 13:05:07 2023
[13:14:01] <claime>	 So yeah, having both kubemasters run puppet with a 1 minute difference is bad.
[13:14:43] <Amir1>	 as long as it's transient, I don't mind 
[13:14:56] <claime>	 Well I do :D
[13:15:07] <wikibugs>	 (CR) Ayounsi: [C: +2] "Tests are still happy." [puppet] - https://gerrit.wikimedia.org/r/904450 (https://phabricator.wikimedia.org/T332028) (owner: Jameel Kaisar)
[13:15:58] <volans>	 that's fqdn_rand...
[13:16:42] <wikibugs>	 (CR) JMeybohm: "This raises multiple warnings:" [deployment-charts] - https://gerrit.wikimedia.org/r/904520 (owner: Ottomata)
[13:16:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:17:38] <volans>	 claime: T161145
[13:17:57] <claime>	 volans: I know.
[13:18:03] <claime>	 It's a pain in the ass sometimes
[13:18:13] <wikibugs>	 (CR) Raymond Ndibe: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/900642 (owner: David Caro)
[13:18:18] <claime>	 Although I'd never seen the variable.fqdn_rand syntax before
[13:18:24] <claime>	 modules/profile/manifests/puppet/agent.pp:    $timer_interval = "*:${interval.fqdn_rand}/${interval}:00"
[13:18:31] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2022 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:18:37] <wikibugs>	 (CR) Volans: [C: +2] superset: add static html for requestctl [puppet] - https://gerrit.wikimedia.org/r/902107 (https://phabricator.wikimedia.org/T310009) (owner: Volans)
[13:18:52] <volans>	 claime: that's puppet's builtin
[13:18:55] <volans>	 IIRC
[13:19:01] <volans>	 in the past we used a different thing
[13:19:21] <claime>	 volans: Yeah, I just wonder what it uses the variable for actually
[13:19:33] <claime>	 I guess it's MAX ?
[13:20:21] <claime>	 I'm used to seeing it as fqdn_rand(MAX), not MAX.fqdn_rand
[13:20:32] <volans>	 301 j.bond :D
[13:21:07] <claime>	 007 j.bond >_>
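Note on the fqdn_rand exchange above: in the Puppet language, `$interval.fqdn_rand` is the chained (dot) call form of `fqdn_rand($interval)`, so the variable is indeed the MAX argument, as claime guessed. The sketch below is only an illustration of why two hosts can end up one minute apart. It does not reproduce Puppet's actual fqdn_rand algorithm: `stable_offset` is a hypothetical stand-in that hashes the hostname, the 30-minute interval is inferred from the trigger times quoted above rather than stated in the log, and the `.eqiad.wmnet` suffix on the kubemaster hostnames is assumed.

```python
import hashlib
from fractions import Fraction


def stable_offset(fqdn: str, interval: int) -> int:
    """Deterministic per-host offset in [0, interval).

    Hypothetical stand-in: NOT Puppet's fqdn_rand implementation, just a
    hash of the hostname reduced modulo the interval.
    """
    digest = hashlib.md5(fqdn.encode("utf-8")).hexdigest()
    return int(digest, 16) % interval


def prob_within(interval: int, gap: int) -> Fraction:
    """Exact probability that two independent uniform offsets in
    [0, interval) are at most `gap` minutes apart, wrapping around."""
    close = 0
    for a in range(interval):
        for b in range(interval):
            d = abs(a - b)
            if min(d, interval - d) <= gap:
                close += 1
    return Fraction(close, interval * interval)


if __name__ == "__main__":
    interval = 30  # assumed from the 13:29 / 13:30 triggers, not confirmed
    for host in ("kubemaster1001.eqiad.wmnet", "kubemaster1002.eqiad.wmnet"):
        print(host, stable_offset(host, interval))
    # Chance that two hosts land within one minute of each other.
    print(prob_within(interval, 1))
```

Under these assumptions `prob_within(30, 1)` is 1/10, so any scheme that spreads hosts pseudo-randomly over a 30-minute window will put a given pair of hosts within a minute of each other about one time in ten; avoiding that for a specific pair needs explicit coordination rather than hashing alone.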
[13:21:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P45997 and previous config saved to /var/cache/conftool/dbconfig/20230330-132130-ladsgroup.json
[13:23:05] <wikibugs>	 (CR) JMeybohm: mediawiki-page-content-change-enrichment - allow egress to api-ro (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/904520 (owner: Ottomata)
[13:25:29] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] envoy: Add the wmf-certificates package [docker-images/production-images] - https://gerrit.wikimedia.org/r/904527 (https://phabricator.wikimedia.org/T333551) (owner: JMeybohm)
[13:25:56] <wikibugs>	 (PS1) Ayounsi: Remove redundant or outdated prefixes from aggregate_networks -> labs [puppet] - https://gerrit.wikimedia.org/r/904529 (https://phabricator.wikimedia.org/T329669)
[13:26:03] <wikibugs>	 (PS1) Volans: superset: fix typo in file path [puppet] - https://gerrit.wikimedia.org/r/904530
[13:28:18] <wikibugs>	 (PS7) Hnowlan: api-gateway: add REST gateway Lua CSP handler [deployment-charts] - https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321)
[13:28:26] <wikibugs>	 (PS3) Hnowlan: rest-gateway: add helmfile, enable mobileapps [deployment-charts] - https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074)
[13:28:34] <wikibugs>	 (PS5) Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/904510
[13:29:43] <wikibugs>	 (CR) Volans: [C: +2] superset: fix typo in file path [puppet] - https://gerrit.wikimedia.org/r/904530 (owner: Volans)
[13:31:40] <wikibugs>	 (CR) Slyngshede: "First draft, and not yet tested. Just checking that we agree on the direction of the implementation." [cookbooks] - https://gerrit.wikimedia.org/r/904510 (owner: Slyngshede)
[13:32:38] <wikibugs>	 (PS1) Hnowlan: admin: move kamila to ops [puppet] - https://gerrit.wikimedia.org/r/904532
[13:32:41] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@47f3a61]: (no justification provided) (duration: 22m 04s)
[13:32:46] <wikibugs>	 (CR) Ayounsi: [C: +1] "Awesome!" [cookbooks] - https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: Jbond)
[13:34:47] <wikibugs>	 (PS1) Ssingh: pybal: don't install python3-requests on bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/904533 (https://phabricator.wikimedia.org/T321309)
[13:36:07] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40459/console" [puppet] - https://gerrit.wikimedia.org/r/904533 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[13:36:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P45998 and previous config saved to /var/cache/conftool/dbconfig/20230330-133635-ladsgroup.json
[13:37:43] <Lucas_WMDE>	 FYI, stashbot is currently having issues and !logs are not being processed
[13:38:31] <Lucas_WMDE>	 (cc nemo-yiannis, Amir1 from the past few !logs)
[13:39:12] <Amir1>	 :(
[13:39:20] <sukhe>	 !log disable Puppet on A:lvs to test 904533
[13:39:37] <sukhe>	 oh ok, just saw the stashbot thing :)
[13:41:24] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] pybal: don't install python3-requests on bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/904533 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[13:41:26] <wikibugs>	 (CR) Jcrespo: [C: +1] alertmanager: route data-persistence warnings to -feed [puppet] - https://gerrit.wikimedia.org/r/904525 (owner: Filippo Giunchedi)
[13:42:20] <wikibugs>	 (CR) Volans: "Nice start! I think it can be simplified a bit without too much extra logic, see comments inline, feel free to ping me." [cookbooks] - https://gerrit.wikimedia.org/r/904510 (owner: Slyngshede)
[13:44:36] <sukhe>	 !log enable Puppet on A:lvs to test 904533
[13:46:40] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] thumbor: Switch all summaries to histograms [deployment-charts] - https://gerrit.wikimedia.org/r/904452 (https://phabricator.wikimedia.org/T333445) (owner: Alexandros Kosiaris)
[13:46:48] <wikibugs>	 (CR) Ssingh: [C: +2] pybal: port check_pybal_ipvs_diff.py to urllib2 [puppet] - https://gerrit.wikimedia.org/r/904381 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[13:47:22] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +1 C: +2] thumbor: Switch all summaries to histograms [puppet] - https://gerrit.wikimedia.org/r/904456 (https://phabricator.wikimedia.org/T333445) (owner: Alexandros Kosiaris)
[13:49:19] <wikibugs>	 SRE, SRE-Access-Requests, API Platform (Sprint 06): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (FJoseph-WMF) Approved
[13:49:36] <wikibugs>	 (PS1) Giuseppe Lavagetto: admin: add Kavitha to the approver for the ops group [puppet] - https://gerrit.wikimedia.org/r/904535
[13:50:28] <wikibugs>	 (CR) Clément Goubert: [C: +1] "She is indeed our manager." [puppet] - https://gerrit.wikimedia.org/r/904535 (owner: Giuseppe Lavagetto)
[13:50:44] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] admin: add Kavitha to the approver for the ops group [puppet] - https://gerrit.wikimedia.org/r/904535 (owner: Giuseppe Lavagetto)
[13:51:19] <wikibugs>	 (Merged) jenkins-bot: thumbor: Switch all summaries to histograms [deployment-charts] - https://gerrit.wikimedia.org/r/904452 (https://phabricator.wikimedia.org/T333445) (owner: Alexandros Kosiaris)
[13:51:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P45999 and previous config saved to /var/cache/conftool/dbconfig/20230330-135140-ladsgroup.json
[13:52:43] <icinga-wm>	 RECOVERY - nutcracker process on cloudweb1003 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker
[13:53:03] <icinga-wm>	 RECOVERY - nutcracker port on cloudweb1003 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 https://wikitech.wikimedia.org/wiki/Nutcracker
[13:53:47] <icinga-wm>	 RECOVERY - nutcracker process on cloudweb1004 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker
[13:54:31] <icinga-wm>	 RECOVERY - nutcracker port on cloudweb1004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 https://wikitech.wikimedia.org/wiki/Nutcracker
[13:58:10] <wikibugs>	 (03PS1) 10JMeybohm: modules.mesh.configuration: Add version 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904537 (https://phabricator.wikimedia.org/T333551)
[13:58:12] <wikibugs>	 (03PS1) 10JMeybohm: mesh.configuration: Use wmf-certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/904538 (https://phabricator.wikimedia.org/T333551)
[13:58:30] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 04-1] "We are, it turns out, using nutcracker for Horizon session state.  See T333561" [puppet] - 10https://gerrit.wikimedia.org/r/902074 (owner: 10Alexandros Kosiaris)
[13:59:05] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] harbor: use external url for the proxies [puppet] - 10https://gerrit.wikimedia.org/r/900642 (owner: 10David Caro)
[13:59:18] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10Jhancock.wm) @Papaul the firmware has been updated.
[14:00:07] <wikibugs>	 (03PS1) 10Jbond: P:puppet::agent: allow to add a seed to the time the agent runs [puppet] - 10https://gerrit.wikimedia.org/r/904539
[14:00:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/904525 (owner: 10Filippo Giunchedi)
[14:01:13] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10Papaul) @Jhancock.wm thanks
[14:01:16] <wikibugs>	 (03CR) 10Jelto: "I'm not sure if this is a typical use case to start a unit/timer after unmasking it. One host (phab1004) did not require an additional star" [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869) (owner: 10EoghanGaffney)
[14:01:48] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "\o/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/904538 (https://phabricator.wikimedia.org/T333551) (owner: 10JMeybohm)
[14:01:50] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40461/console" [puppet] - 10https://gerrit.wikimedia.org/r/904539 (owner: 10Jbond)
[14:03:38] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10Papaul) Now that all the firmware are up to date I will recommend the re-image of the server.
[14:03:49] <wikibugs>	 (03PS1) 10Cwhite: logstash: collapse eventgate response_body field [puppet] - 10https://gerrit.wikimedia.org/r/904264 (https://phabricator.wikimedia.org/T180051)
[14:05:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] logstash: collapse eventgate response_body field [puppet] - 10https://gerrit.wikimedia.org/r/904264 (https://phabricator.wikimedia.org/T180051) (owner: 10Cwhite)
[14:06:12] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: sync
[14:07:15] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:puppet::agent: allow to add a seed to the time the agent runs [puppet] - 10https://gerrit.wikimedia.org/r/904539 (owner: 10Jbond)
[14:08:57] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync
[14:10:05] <sukhe>	 BGP alerts expected in ulsfo
[14:10:32] <wikibugs>	 (03PS2) 10Hnowlan: admin: move kamila to ops [puppet] - 10https://gerrit.wikimedia.org/r/904532 (https://phabricator.wikimedia.org/T333565)
[14:10:58] <wikibugs>	 (03CR) 10EoghanGaffney: Adds flag to start after unmask, starts logrotate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869) (owner: 10EoghanGaffney)
[14:11:30] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4010.ulsfo.wmnet with OS bullseye
[14:11:37] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS bullseye
[14:11:54] <wikibugs>	 (03PS3) 10Clément Goubert: P:kubernetes::master: profile::puppet::agent::timer_seed [puppet] - 10https://gerrit.wikimedia.org/r/904536
[14:12:23] <wikibugs>	 (03PS1) 10Bking: flink-app: temp fix for envoy proxy usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/904542 (https://phabricator.wikimedia.org/T333551)
[14:12:27] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[14:13:08] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] flink-app: temp fix for envoy proxy usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/904542 (https://phabricator.wikimedia.org/T333551) (owner: 10Bking)
[14:13:14] <wikibugs>	 (03CR) 10Raymond Ndibe: maintain-dbusers: only-users match tool users with or without prefix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902817 (https://phabricator.wikimedia.org/T332789) (owner: 10David Caro)
[14:14:05] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:15:07] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:16:17] <wikibugs>	 (03PS1) 10Jbond: redfish: update log entries location [software/spicerack] - 10https://gerrit.wikimedia.org/r/904543
[14:16:21] <wikibugs>	 (03CR) 10Bking: [C: 03+2] flink-app: temp fix for envoy proxy usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/904542 (https://phabricator.wikimedia.org/T333551) (owner: 10Bking)
[14:17:21] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons.
[14:18:21] <wikibugs>	 (03PS1) 10Ayounsi: Kubestage: don't set next-hop self on exported prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/904544 (https://phabricator.wikimedia.org/T328523)
[14:18:45] <wikibugs>	 (03CR) 10Hashar: "We should be able to mark the Phabricator task with #release-engineering-team as well." [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[14:19:40] <wikibugs>	 (03PS2) 10Cwhite: logstash: remove eventgate response_body field [puppet] - 10https://gerrit.wikimedia.org/r/904264 (https://phabricator.wikimedia.org/T180051)
[14:20:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] redfish: update log entries location [software/spicerack] - 10https://gerrit.wikimedia.org/r/904543 (owner: 10Jbond)
[14:20:29] <wikibugs>	 (03CR) 10Jelto: Adds flag to start after unmask, starts logrotate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869) (owner: 10EoghanGaffney)
[14:21:08] <wikibugs>	 (03Merged) 10jenkins-bot: flink-app: temp fix for envoy proxy usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/904542 (https://phabricator.wikimedia.org/T333551) (owner: 10Bking)
[14:22:23] <wikibugs>	 (03Abandoned) 10David Caro: maintain-dbusers: only-users match tool users with or without prefix [puppet] - 10https://gerrit.wikimedia.org/r/902817 (https://phabricator.wikimedia.org/T332789) (owner: 10David Caro)
[14:22:33] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[14:22:41] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:22:43] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:23:00] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:23:03] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:23:42] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:25:04] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40463/console" [puppet] - 10https://gerrit.wikimedia.org/r/904536 (owner: 10Clément Goubert)
[14:25:53] <wikibugs>	 (03PS1) 10Bking: flink-app: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/904546 (https://phabricator.wikimedia.org/T333551)
[14:26:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/904264 (https://phabricator.wikimedia.org/T180051) (owner: 10Cwhite)
[14:26:53] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] flink-app: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/904546 (https://phabricator.wikimedia.org/T333551) (owner: 10Bking)
[14:26:55] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] varnish: Bypass ATS for esitest requests [puppet] - 10https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez)
[14:27:01] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage
[14:28:18] <wikibugs>	 (03CR) 10Bking: [C: 03+2] flink-app: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/904546 (https://phabricator.wikimedia.org/T333551) (owner: 10Bking)
[14:30:22] <wikibugs>	 (03PS4) 10Clément Goubert: kubemaster*.eqiad: Add puppet::agent::timer_seed [puppet] - 10https://gerrit.wikimedia.org/r/904536
[14:30:42] <wikibugs>	 (03PS6) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832)
[14:31:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney)
[14:31:33] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4010.ulsfo.wmnet with reason: host reimage
[14:32:18] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40464/console" [puppet] - 10https://gerrit.wikimedia.org/r/904536 (owner: 10Clément Goubert)
[14:32:59] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] mediawiki-page-content-change-enrichment - allow egress to api-ro (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/904520 (owner: 10Ottomata)
[14:34:33] <wikibugs>	 (03Merged) 10jenkins-bot: flink-app: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/904546 (https://phabricator.wikimedia.org/T333551) (owner: 10Bking)
[14:35:36] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:35:39] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:36:41] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[14:36:44] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[14:38:44] <wikibugs>	 (03PS1) 10Volans: superset: requestctl-generator error handling [puppet] - 10https://gerrit.wikimedia.org/r/904550
[14:39:01] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:39:02] <wikibugs>	 (03CR) 10Hashar: Migrate from git fat to git lfs (031 comment) [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904239 (https://phabricator.wikimedia.org/T333465) (owner: 10Hashar)
[14:39:08] <wikibugs>	 (03PS5) 10Clément Goubert: kubemaster*.eqiad: Add puppet::agent::timer_seed [puppet] - 10https://gerrit.wikimedia.org/r/904536
[14:39:09] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:39:37] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[14:39:40] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[14:40:09] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40465/console" [puppet] - 10https://gerrit.wikimedia.org/r/904536 (owner: 10Clément Goubert)
[14:40:25] <wikibugs>	 (03PS7) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832)
[14:40:36] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:41:02] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:43:30] <wikibugs>	 (03PS6) 10Clément Goubert: kubemaster*.eqiad: Add puppet::agent::timer_seed [puppet] - 10https://gerrit.wikimedia.org/r/904536
[14:43:37] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:43:44] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:44:12] <wikibugs>	 (03PS2) 10EoghanGaffney: Adds flag to start after unmask, starts logrotate [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869)
[14:44:17] <Lucas_WMDE>	 stashbot is apparently back btw, I didn’t notice ^^
[14:44:17] <stashbot>	 See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[14:44:31] <wikibugs>	 (03CR) 10EoghanGaffney: Adds flag to start after unmask, starts logrotate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869) (owner: 10EoghanGaffney)
[14:44:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Adds flag to start after unmask, starts logrotate [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869) (owner: 10EoghanGaffney)
[14:44:41] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40466/console" [puppet] - 10https://gerrit.wikimedia.org/r/904536 (owner: 10Clément Goubert)
[14:45:13] <icinga-wm>	 RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:46:17] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:46:45] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4010.ulsfo.wmnet with OS bullseye
[14:46:52] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS bullseye completed: - lvs4010 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled...
[14:47:29] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[14:47:31] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[14:49:02] <wikibugs>	 (03PS1) 10DCausse: rdf-streaming-updater: temp fix, pin envoy image version to 1.18.3-2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904553 (https://phabricator.wikimedia.org/T328675)
[14:49:14] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:49:23] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:50:15] <wikibugs>	 (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: temp fix, pin envoy image version to 1.18.3-2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904553 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[14:51:00] <Dreamy_Jazz>	 Any deployer around for an urgent security fix?
[14:51:18] <Dreamy_Jazz>	 See https://phabricator.wikimedia.org/T333569
[14:51:20] <Amir1>	 meeting now, can do in twenty minutes or so
[14:52:06] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[14:52:53] <Dreamy_Jazz>	 If a deployer is able to start on this, please ping me and I'll come back.
[14:53:02] <Lucas_WMDE>	 o/ Dreamy_Jazz 
[14:53:50] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[14:55:03] <wikibugs>	 (03Merged) 10jenkins-bot: rdf-streaming-updater: temp fix, pin envoy image version to 1.18.3-2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904553 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[14:56:52] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:57:01] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:57:06] <wikibugs>	 (03PS2) 10Hashar: Migrate from git fat to git lfs [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904239 (https://phabricator.wikimedia.org/T333465)
[14:57:28] <wikibugs>	 (03PS3) 10EoghanGaffney: Adds flag to start after unmask, starts logrotate [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869)
[14:57:51] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Avoid sub-optimal routing from CR routers to EVPN destinations - https://phabricator.wikimedia.org/T332781 (10cmooney) >>! In T332781#8741660, @ayounsi wrote: > How does this compare to taking iBGP down between LEAF1 to SPINE2 if the link g...
[14:57:59] <wikibugs>	 (03CR) 10Hashar: Migrate from git fat to git lfs (031 comment) [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904239 (https://phabricator.wikimedia.org/T333465) (owner: 10Hashar)
[14:58:52] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] openstack::nutcracker: Remove redis support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902074 (owner: 10Alexandros Kosiaris)
[14:59:55] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[14:59:56] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[15:01:27] <Lucas_WMDE>	 heads up, I’m deploying a security fix
[15:01:56] <wikibugs>	 (03PS2) 10Jbond: redfish: update log entries location [software/spicerack] - 10https://gerrit.wikimedia.org/r/904543 (https://phabricator.wikimedia.org/T326661)
[15:03:07] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/904536 (owner: 10Clément Goubert)
[15:03:57] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] kubemaster*.eqiad: Add puppet::agent::timer_seed [puppet] - 10https://gerrit.wikimedia.org/r/904536 (owner: 10Clément Goubert)
[15:04:20] <wikibugs>	 (03PS3) 10Jbond: redfish: update log entries location [software/spicerack] - 10https://gerrit.wikimedia.org/r/904543 (https://phabricator.wikimedia.org/T326661)
[15:05:16] <wikibugs>	 (03PS2) 10JMeybohm: mesh.configuration: Use wmf-certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/904538 (https://phabricator.wikimedia.org/T333551)
[15:05:19] <claime>	 Amir1: kubemasters should now run puppet with more splay, so you shouldn't run into the issue anymore
[15:05:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:06:00] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Set backend_hit = esitest for HfP requests [puppet] - 10https://gerrit.wikimedia.org/r/904556 (https://phabricator.wikimedia.org/T308799)
[15:06:33] <wikibugs>	 (03CR) 10JMeybohm: mediawiki-page-content-change-enrichment - allow egress to api-ro (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/904520 (owner: 10Ottomata)
[15:07:14] <wikibugs>	 (03PS2) 10Vgutierrez: varnish: Set backend_hint = esitest for HfP requests [puppet] - 10https://gerrit.wikimedia.org/r/904556 (https://phabricator.wikimedia.org/T308799)
[15:08:20] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/904543 (https://phabricator.wikimedia.org/T326661) (owner: 10Jbond)
[15:08:37] <logmsgbot>	 !log lucaswerkmeister-wmde: Deployed security patch for T333569
[15:08:53] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] redfish: update log entries location [software/spicerack] - 10https://gerrit.wikimedia.org/r/904543 (https://phabricator.wikimedia.org/T326661) (owner: 10Jbond)
[15:09:26] <wikibugs>	 (03PS8) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832)
[15:10:55] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: wikidatardf-truthy-dumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:10:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:12:46] <wikibugs>	 (03Merged) 10jenkins-bot: redfish: update log entries location [software/spicerack] - 10https://gerrit.wikimedia.org/r/904543 (https://phabricator.wikimedia.org/T326661) (owner: 10Jbond)
[15:13:28] <wikibugs>	 (03CR) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney)
[15:14:51] <logmsgbot>	 !log lucaswerkmeister-wmde: Deployed security patch for T333569
[15:15:25] <wikibugs>	 (03PS1) 10JMeybohm: Update default tls terminator/mesh envoy version to 1.18.3-2 [puppet] - 10https://gerrit.wikimedia.org/r/904557 (https://phabricator.wikimedia.org/T333551)
[15:16:07] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v6.4.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/904558
[15:16:37] <wikibugs>	 (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v6.4.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/904558 (owner: 10Volans)
[15:18:07] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: alertmanager: disable phabricator task creation for WMCS alerts [puppet] - 10https://gerrit.wikimedia.org/r/904559 (https://phabricator.wikimedia.org/T333315)
[15:18:12] <wikibugs>	 (03PS9) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832)
[15:19:28] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] varnish: Set backend_hint = esitest for HfP requests [puppet] - 10https://gerrit.wikimedia.org/r/904556 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez)
[15:19:30] <wikibugs>	 (03PS10) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832)
[15:20:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:21:26] <wikibugs>	 (03PS1) 10Volans: Upstream release v6.4.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/904561
[15:21:38] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Upstream release v6.4.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/904561 (owner: 10Volans)
[15:25:45] <wikibugs>	 (03Merged) 10jenkins-bot: Upstream release v6.4.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/904561 (owner: 10Volans)
[15:26:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/904522 (owner: 10EoghanGaffney)
[15:26:34] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] noc: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/903801 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn)
[15:26:56] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+2 C: 03+2] "Approved by manager on phab" [puppet] - 10https://gerrit.wikimedia.org/r/904532 (https://phabricator.wikimedia.org/T333565) (owner: 10Hnowlan)
[15:27:30] <wikibugs>	 (03PS1) 10Andrew Bogott: Toolforge: move to new VM-hosted NFS server [puppet] - 10https://gerrit.wikimedia.org/r/904562 (https://phabricator.wikimedia.org/T333477)
[15:28:07] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/904532 (https://phabricator.wikimedia.org/T333565) (owner: 10Hnowlan)
[15:28:13] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 04-1] "This needs to be merged during a pre-defined window on Monday morning" [puppet] - 10https://gerrit.wikimedia.org/r/904562 (https://phabricator.wikimedia.org/T333477) (owner: 10Andrew Bogott)
[15:28:30] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): admin: add .gitconfig for lucaswerkmeister-wmde [puppet] - 10https://gerrit.wikimedia.org/r/904563
[15:30:05] <jouncebot>	 brennen and mutante: #bothumor I � Unicode. All rise for Phabricator update window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1530).
[15:30:13] <brennen>	 o/
[15:30:49] <volans>	 !log uploaded spicerack_6.4.1 to apt.wikimedia.org bullseye-wikimedia
[15:30:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:01] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[15:32:05] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phab2002.codfw.wmnet with reason: maintenance
[15:32:18] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab2002.codfw.wmnet with reason: maintenance
[15:32:28] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1004.eqiad.wmnet with reason: maintenance
[15:32:32] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2002-dev
[15:32:41] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1004.eqiad.wmnet with reason: maintenance
[15:32:57] <logmsgbot>	 !log cmooney@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2002-dev
[15:33:16] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt2002-dev
[15:33:36] <mutante>	 !log phabricator maintenance / deploy window starting
[15:33:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:16] <logmsgbot>	 !log cmooney@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt2002-dev
[15:34:24] <logmsgbot>	 !log brennen@deploy2002 Started deploy [phabricator/deployment@9f0866e]: test deploy to phab2002 for T333516
[15:34:29] <stashbot>	 T333516: Phabricator deployment 2023-03-30 - https://phabricator.wikimedia.org/T333516
[15:34:35] <volans>	 !log upgraded spicerack to v6.4.1 on the cumin hosts
[15:34:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:47] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "this affects more than just the systemd timer so we need to think about it more carefully. I also think that we should instead try to noti" [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869) (owner: 10EoghanGaffney)
[15:34:54] <logmsgbot>	 !log brennen@deploy2002 Finished deploy [phabricator/deployment@9f0866e]: test deploy to phab2002 for T333516 (duration: 00m 30s)
[15:35:16] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) Sounds good to me. This is what we need to do with cloudcontrol2004-dev:  * figure out how to...
[15:35:34] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) Third batch  |Host|U space|Existing port|New port| |cloudcephosd2001-dev|3|asw-b1-codfw ge-1/0/...
[15:35:44] <wikibugs>	 (03CR) 10David Caro: alertmanager: disable phabricator task creation for WMCS alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904559 (https://phabricator.wikimedia.org/T333315) (owner: 10Arturo Borrero Gonzalez)
[15:35:51] <logmsgbot>	 !log brennen@deploy2002 Started deploy [phabricator/deployment@9f0866e]: deploy to phab1004 for T333516
[15:35:55] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1207']
[15:36:13] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt2001-dev
[15:36:33] <logmsgbot>	 !log brennen@deploy2002 Finished deploy [phabricator/deployment@9f0866e]: deploy to phab1004 for T333516 (duration: 00m 42s)
[15:36:42] <logmsgbot>	 !log cmooney@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt2001-dev
[15:39:25] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] mediawiki-page-content-change-enrichment - allow egress to api-ro (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/904520 (owner: 10Ottomata)
[15:40:40] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1208']
[15:40:57] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/904562 (https://phabricator.wikimedia.org/T333477) (owner: 10Andrew Bogott)
[15:43:10] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:44:01] <mutante>	 !log phabricator maintenance window / deployment ended (T329974)
[15:44:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:11] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: alertmanager: update phabricator project for WMCS alerts [puppet] - 10https://gerrit.wikimedia.org/r/904559 (https://phabricator.wikimedia.org/T333315)
[15:44:13] <stashbot>	 T329974: Show "other assignee" avatar on tasks in workboard - https://phabricator.wikimedia.org/T329974
[15:50:15] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10Jclark-ctr) @jbond I have batteries for all of these; can this be done tomorrow?  If possible, can you shut down the server so I can perform the repair at 9am EST tomorrow?
[15:53:17] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] admin: add .gitconfig for lucaswerkmeister-wmde [puppet] - 10https://gerrit.wikimedia.org/r/904563 (owner: 10Lucas Werkmeister (WMDE))
[15:53:19] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10jbond) @Jclark-ctr you will need to contact someone in analytics (possibly @BTullis) and data persistence (maybe @MatthewVernon)
[15:54:30] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: alertmanager: update phabricator project for WMCS alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904559 (https://phabricator.wikimedia.org/T333315) (owner: 10Arturo Borrero Gonzalez)
[16:00:04] <jouncebot>	 jbond and rzl: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1600)
[16:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:43] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1207']
[16:00:50] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator: replace Icinga with Prometheus for SMTP monitoring [puppet] - 10https://gerrit.wikimedia.org/r/903826 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn)
[16:01:11] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM :crossingfingers:" [puppet] - 10https://gerrit.wikimedia.org/r/904559 (https://phabricator.wikimedia.org/T333315) (owner: 10Arturo Borrero Gonzalez)
[16:01:16] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2001-dev
[16:01:32] <logmsgbot>	 !log cmooney@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2001-dev
[16:03:23] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1208']
[16:03:56] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "true! well, there is "query_response", but not in wide usage yet" [puppet] - 10https://gerrit.wikimedia.org/r/903826 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn)
[16:05:23] <wikibugs>	 10SRE, 10observability, 10Upstream: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551 (10lmata) @BTullis, @jcrespo   coming in late to this thread to update you that we've scheduled to tackle {T108027} next quarter (q4), which I think would address this issue. Feel free to reach...
[16:05:38] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "let me confirm it works like this and then try to add a query_response check later" [puppet] - 10https://gerrit.wikimedia.org/r/903826 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn)
[16:06:36] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+2 C: 03+2] kubernetes: add dummy tokens for rest-gateway [labs/private] - 10https://gerrit.wikimedia.org/r/904511 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan)
[16:06:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] alertmanager: update phabricator project for WMCS alerts [puppet] - 10https://gerrit.wikimedia.org/r/904559 (https://phabricator.wikimedia.org/T333315) (owner: 10Arturo Borrero Gonzalez)
[16:06:54] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] scap: block Scap execution on inactive deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: 10Jaime Nuche)
[16:08:31] <wikibugs>	 (03CR) 10Dzahn: alertmanager: create receiver for both sre-collab and releng combined (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[16:09:33] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1209']
[16:09:41] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['db1209']
[16:09:45] <wikibugs>	 (03PS1) 10Btullis: Correct the datahub elasticsearch index prefix for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/904571 (https://phabricator.wikimedia.org/T333580)
[16:09:53] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1211']
[16:10:51] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1212']
[16:18:04] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul)
[16:18:30] <jinxer-wm>	 (Emergency syslog message) firing: Alert for device asw-b-codfw.mgmt.codfw.wmnet - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[16:19:26] <wikibugs>	 (03CR) 10Dzahn: alertmanager: create receiver for both sre-collab and releng combined (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[16:19:39] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2003-dev
[16:19:55] <logmsgbot>	 !log cmooney@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2003-dev
[16:20:08] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephmon2004-dev
[16:20:16] <logmsgbot>	 !log cmooney@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephmon2004-dev
[16:20:23] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] mesh.configuration: Use wmf-certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/904538 (https://phabricator.wikimedia.org/T333551) (owner: 10JMeybohm)
[16:20:23] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt2003-dev
[16:20:27] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] modules.mesh.configuration: Add version 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904537 (https://phabricator.wikimedia.org/T333551) (owner: 10JMeybohm)
[16:21:25] <logmsgbot>	 !log cmooney@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt2003-dev
[16:23:30] <jinxer-wm>	 (Emergency syslog message) resolved: Device asw-b-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[16:25:18] <wikibugs>	 (03Merged) 10jenkins-bot: modules.mesh.configuration: Add version 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/904537 (https://phabricator.wikimedia.org/T333551) (owner: 10JMeybohm)
[16:25:56] <wikibugs>	 (03PS1) 10Papaul: Add new db nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/904574 (https://phabricator.wikimedia.org/T326661)
[16:26:23] <wikibugs>	 (03Merged) 10jenkins-bot: mesh.configuration: Use wmf-certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/904538 (https://phabricator.wikimedia.org/T333551) (owner: 10JMeybohm)
[16:29:00] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: remove eventgate response_body field (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904264 (https://phabricator.wikimedia.org/T180051) (owner: 10Cwhite)
[16:31:29] <wikibugs>	 (03PS1) 10Hashar: Extract and deploy upstream plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904575
[16:32:31] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1211']
[16:36:19] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1212']
[16:42:55] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1213']
[16:43:03] <wikibugs>	 (03CR) 10Hashar: "In the child change https://gerrit.wikimedia.org/r/c/operations/software/gerrit/+/904575 I am adding some more plugins tracked by git lfs." [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904239 (https://phabricator.wikimedia.org/T333465) (owner: 10Hashar)
[16:43:21] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) All remaining (non public-vlan) hosts have been moved and look good to me (reachable, MAC addr...
[16:44:51] <wikibugs>	 (03CR) 10Brennen Bearnes: "+1 on overall idea." [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[16:46:21] <wikibugs>	 (03CR) 10Hashar: "The parent change migrates Gerrit deployment from git-fat to git-lfs. Jaime and I successfully used it for the Gitlab jenkins-deploy repo." [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/904575 (owner: 10Hashar)
[16:46:55] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1214']
[16:46:58] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] "With git-lfs, I have proposed to add the bundled Gerrit plugins in the deployment repository again https://gerrit.wikimedia.org/r/c/operat" [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/699035 (https://phabricator.wikimedia.org/T278990) (owner: 10Ahmon Dancy)
[16:47:14] <wikibugs>	 10SRE, 10Traffic-Icebox, 10Upstream: OCSP Stapling for Intermediates - https://phabricator.wikimedia.org/T148134 (10BCornwall) 05Stalled→03Invalid And another 3 years have passed. Since OCSP is in a bit of a zombie state and its future support in Firefox is questionable (see Mozilla's crlite project), it...
[16:52:10] <wikibugs>	 (03PS1) 10BryanDavis: developer-portal: Bump container to 2023-03-27-111537-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/904578
[16:54:20] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[16:55:35] <wikibugs>	 (03PS1) 10Cwhite: logstash: restore logstash index patch level [puppet] - 10https://gerrit.wikimedia.org/r/904265 (https://phabricator.wikimedia.org/T180051)
[16:56:44] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson
[16:56:53] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] Add new db nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/904574 (https://phabricator.wikimedia.org/T326661) (owner: 10Papaul)
[16:58:17] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2023-03-27-111537-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/904578 (owner: 10BryanDavis)
[16:58:48] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul)
[17:00:05] <jouncebot>	 bd808: How many deployers does it take to do Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1700).
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1700)
[17:00:33] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye
[17:00:34] <bd808>	 jouncebot: I think just 1, but that's not a great punchline. ;)
[17:00:40] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye
[17:01:03] * bd808 will be deploying an updated developer portal today
[17:01:47] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1207.eqiad.wmnet with OS bullseye
[17:01:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1207.eqiad.wmnet with OS bullseye
[17:03:40] <wikibugs>	 (03Merged) 10jenkins-bot: developer-portal: Bump container to 2023-03-27-111537-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/904578 (owner: 10BryanDavis)
[17:04:05] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1213']
[17:04:38] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1215']
[17:04:42] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[17:05:48] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[17:06:01] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[17:06:15] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1214']
[17:07:11] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[17:07:23] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[17:08:32] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[17:09:03] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "works per https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*phab.*%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h" [puppet] - 10https://gerrit.wikimedia.org/r/903826 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn)
[17:09:06] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1216']
[17:10:00] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host gerrit1003.mgmt.eqiad.wmnet with reboot policy FORCED
[17:10:42] <sukhe>	 BGP alerts in ulsfo expected
[17:10:55] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host gerrit1003.mgmt.eqiad.wmnet with reboot policy FORCED
[17:14:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs4008 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[17:14:19] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:14:31] <icinga-wm>	 PROBLEM - pybal on lvs4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[17:15:21] <sukhe>	 ^ expected
[17:15:45] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs4008 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[17:16:10] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1207.eqiad.wmnet with reason: host reimage
[17:16:47] <wikibugs>	 (03PS2) 10Btullis: Correct the datahub elasticsearch index prefix for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/904571 (https://phabricator.wikimedia.org/T333580)
[17:19:23] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1207.eqiad.wmnet with reason: host reimage
[17:19:55] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:20:04] <sukhe>	 ^ expected
[17:20:18] <sukhe>	 there isn't a good way to silence these alerts
[17:20:41] <sukhe>	 partly I also don't want to today (so that we can know if something breaks) but yeah, in general as well :)
[17:21:28] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1208.eqiad.wmnet with OS bullseye
[17:21:36] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1208.eqiad.wmnet with OS bullseye
[17:22:26] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10BTullis) @Cmjohnson I can't think of any reason why six disks should have failed. I think they're all single volume RAID 0 logical volumes, aren't they? We've power cycled it a few times with...
[17:25:27] <wikibugs>	 (03PS1) 10Cmjohnson: updating site.pp and netboot with new gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/904586 (https://phabricator.wikimedia.org/T326366)
[17:25:39] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:26:24] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: add k8s statsd-exporter ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/901631 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[17:26:40] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] updating site.pp and netboot with new gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/904586 (https://phabricator.wikimedia.org/T326366) (owner: 10Cmjohnson)
[17:27:27] <logmsgbot>	 !log ebysans@deploy2002 Started deploy [airflow-dags/analytics@8b242c2]: (no justification provided)
[17:27:38] <logmsgbot>	 !log ebysans@deploy2002 Finished deploy [airflow-dags/analytics@8b242c2]: (no justification provided) (duration: 00m 11s)
[17:28:02] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1215']
[17:28:26] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1216']
[17:28:38] <SandraEbele>	 !log killed Oozie mediawiki-history-check_denormalize job and started Airflow mediawiki_history_check_denormalize dag.
[17:28:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:31] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:29:32] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['gerrit1003']
[17:29:45] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['gerrit1003']
[17:30:17] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['gerrit1003']
[17:30:25] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['gerrit1003']
[17:32:16] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1217']
[17:32:31] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1218']
[17:34:29] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[17:35:34] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul)
[17:36:01] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1208.eqiad.wmnet with reason: host reimage
[17:36:16] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[17:36:17] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1207.eqiad.wmnet with OS bullseye
[17:36:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1207.eqiad.wmnet with OS bullseye completed: - db1207 (**PASS**)   - Removed from Puppet an...
[17:36:44] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host gerrit1003.wikimedia.org with OS bullseye
[17:36:49] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1211.eqiad.wmnet with OS bullseye
[17:36:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host gerrit1003.wikimedia.org with OS bullseye
[17:36:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1211.eqiad.wmnet with OS bullseye
[17:39:12] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1208.eqiad.wmnet with reason: host reimage
[17:49:28] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host gerrit1003.wikimedia.org with OS bullseye
[17:49:33] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host gerrit1003.wikimedia.org with OS bullseye executed with errors: - gerrit1003 (*...
[17:51:20] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1211.eqiad.wmnet with reason: host reimage
[17:54:01] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[17:54:42] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1211.eqiad.wmnet with reason: host reimage
[17:55:17] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host gerrit1003.wikimedia.org with OS bullseye
[17:55:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host gerrit1003.wikimedia.org with OS bullseye
[17:56:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (10Cmjohnson)
[17:57:27] <wikibugs>	 10SRE, 10Traffic-Icebox: cache_upload varnish-fe exhausting transient memory - https://phabricator.wikimedia.org/T249809 (10BCornwall) 05Stalled→03Resolved a:03BCornwall I haven't been able to see any indication that this has been an issue for the entirety of our metrics. @Ema's great work likely has fix...
[18:00:05] <jouncebot>	 dduvall and dancy: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T1800).
[18:05:42] <wikibugs>	 10SRE, 10Traffic, 10HTTPS, 10Tracking-Neverending: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681 (10BCornwall)
[18:05:55] <wikibugs>	 10SRE, 10Traffic, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10BCornwall) 05Stalled→03Declined I'm going to decline this as it's not possible. I will follow it up with T333591 which tracks moving the domain.
[18:06:20] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:06:23] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:08:45] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] noc: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/903801 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn)
[18:09:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10Cmjohnson) There doesn't seem to be a raid controller {F36934521}
[18:09:48] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[18:12:05] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:12:08] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:14:28] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host thanos-fe1004.eqiad.wmnet with OS bullseye
[18:15:30] <wikibugs>	 (03CR) 10BCornwall: [C: 03+1] grizzly: adapt managed dashboards to 0.2 metadata approach [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/903776 (https://phabricator.wikimedia.org/T332895) (owner: 10Herron)
[18:21:46] <wikibugs>	 (03PS2) 10Dzahn: miscweb: move simplestatic.erb out of role/templates/apache/sites/ [puppet] - 10https://gerrit.wikimedia.org/r/902141
[18:22:20] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[18:22:21] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1211.eqiad.wmnet with OS bullseye
[18:22:26] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[18:22:27] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1208.eqiad.wmnet with OS bullseye
[18:22:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1211.eqiad.wmnet with OS bullseye completed: - db1211 (**PASS**)   - Removed from Puppet an...
[18:22:33] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1208.eqiad.wmnet with OS bullseye completed: - db1208 (**WARN**)   - Removed from Puppet an...
[18:22:44] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "used only by https://openstack-browser.toolforge.org/puppetclass/role::simplestatic afaict" [puppet] - 10https://gerrit.wikimedia.org/r/902141 (owner: 10Dzahn)
[18:22:59] <logmsgbot>	 !log ebysans@deploy2002 Started deploy [airflow-dags/analytics@5355ead]: (no justification provided)
[18:23:09] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1212.eqiad.wmnet with OS bullseye
[18:23:11] <logmsgbot>	 !log ebysans@deploy2002 Finished deploy [airflow-dags/analytics@5355ead]: (no justification provided) (duration: 00m 12s)
[18:23:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1212.eqiad.wmnet with OS bullseye
[18:23:33] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:23:36] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:24:02] <wikibugs>	 (03PS1) 10Mforns: modules::profile::manifests::airflow.pp: add plugins_folder path [puppet] - 10https://gerrit.wikimedia.org/r/904609 (https://phabricator.wikimedia.org/T324485)
[18:24:23] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "unknown why compiling on cloud VPS not working: Hosts that were skipped (fail fast)" [puppet] - 10https://gerrit.wikimedia.org/r/902141 (owner: 10Dzahn)
[18:25:39] <wikibugs>	 (03PS1) 10TrainBranchBot: all wikis to 1.41.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904611 (https://phabricator.wikimedia.org/T330208)
[18:25:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.41.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904611 (https://phabricator.wikimedia.org/T330208) (owner: 10TrainBranchBot)
[18:26:26] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.41.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904611 (https://phabricator.wikimedia.org/T330208) (owner: 10TrainBranchBot)
[18:26:29] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] miscweb: move simplestatic.erb out of role/templates/apache/sites/ [puppet] - 10https://gerrit.wikimedia.org/r/902141 (owner: 10Dzahn)
[18:27:51] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:27:53] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:30:29] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1213.eqiad.wmnet with OS bullseye
[18:30:36] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1213.eqiad.wmnet with OS bullseye
[18:31:18] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:31:19] <wikibugs>	 (03PS1) 10Dzahn: simplestatic: fix path to erb template [puppet] - 10https://gerrit.wikimedia.org/r/904613
[18:31:20] <SandraEbele>	 !log Killed Oozie mediawiki-wikitext-history-coord and mediawiki-wikitext-current-coord
[18:31:21] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:31:38] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "duh, follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/904613/" [puppet] - 10https://gerrit.wikimedia.org/r/902141 (owner: 10Dzahn)
[18:31:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:40] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1217']
[18:31:46] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] simplestatic: fix path to erb template [puppet] - 10https://gerrit.wikimedia.org/r/904613 (owner: 10Dzahn)
[18:32:12] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1218']
[18:32:16] <SandraEbele>	 !log started Airflow mediawiki wikitext dags after killing oozie jobs as part of Migration task
[18:32:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:12] <logmsgbot>	 !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: all wikis to 1.41.0-wmf.2  refs T330208
[18:33:18] <stashbot>	 T330208: 1.41.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T330208
[18:33:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:34:38] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "confirmed working / noop on dashiki-02.dashiki.eqiad.wmflabs now" [puppet] - 10https://gerrit.wikimedia.org/r/904613 (owner: 10Dzahn)
[18:34:48] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "confirmed working / noop on dashiki-02.dashiki.eqiad.wmflabs now after follow-up" [puppet] - 10https://gerrit.wikimedia.org/r/902141 (owner: 10Dzahn)
[18:37:16] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh)
[18:37:41] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1212.eqiad.wmnet with reason: host reimage
[18:38:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:40:55] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1212.eqiad.wmnet with reason: host reimage
[18:41:04] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:41:06] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:41:56] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:41:59] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:43:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:44:58] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:44:59] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1213.eqiad.wmnet with reason: host reimage
[18:45:01] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:45:44] <wikibugs>	 (03PS2) 10Dzahn: decom miscweb2002 [puppet] - 10https://gerrit.wikimedia.org/r/902229 (https://phabricator.wikimedia.org/T331896)
[18:45:48] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:45:51] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:46:22] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:46:26] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:46:45] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:46:56] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:47:00] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:48:11] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1213.eqiad.wmnet with reason: host reimage
[18:48:48] <wikibugs>	 (03PS1) 10Herron: dns: repoint alert host services to alert2001 [dns] - 10https://gerrit.wikimedia.org/r/904614 (https://phabricator.wikimedia.org/T333478)
[18:48:56] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:49:00] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:52:56] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:52:59] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:54:09] <wikibugs>	 (03CR) 10Herron: "to be merged after the related puppet patch during planned failover window" [dns] - 10https://gerrit.wikimedia.org/r/904614 (https://phabricator.wikimedia.org/T333478) (owner: 10Herron)
[18:54:50] <wikibugs>	 (03CR) 10Herron: "jftr I512758d23fe0682e5ce302d15b838c8b836dc4f3" [puppet] - 10https://gerrit.wikimedia.org/r/899629 (https://phabricator.wikimedia.org/T331882) (owner: 10Herron)
[18:55:02] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[18:55:23] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:55:26] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[18:57:08] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[18:57:09] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1212.eqiad.wmnet with OS bullseye
[18:57:14] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1212.eqiad.wmnet with OS bullseye completed: - db1212 (**PASS**)   - Removed from Puppet an...
[18:57:32] <wikibugs>	 (03PS1) 10Dzahn: gitlab_runner: run clear-docker-cache every hour [puppet] - 10https://gerrit.wikimedia.org/r/904616 (https://phabricator.wikimedia.org/T333586)
[18:58:42] <wikibugs>	 (03PS1) 10Stevemunene: Jupyterhub-conda exclude /mnt from accessible paths [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511)
[18:59:40] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1219']
[19:00:26] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1220']
[19:02:15] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[19:02:20] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40468/console" [puppet] - 10https://gerrit.wikimedia.org/r/904617 (https://phabricator.wikimedia.org/T333511) (owner: 10Stevemunene)
[19:04:21] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[19:04:24] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[19:05:31] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[19:05:34] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[19:06:30] <wikibugs>	 (03PS6) 10Ryan Kemper: [WIP] wdqs: test new metric option [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/900430 (https://phabricator.wikimedia.org/T328306)
[19:08:44] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[19:08:45] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1213.eqiad.wmnet with OS bullseye
[19:08:47] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[19:08:50] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[19:08:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1213.eqiad.wmnet with OS bullseye completed: - db1213 (**PASS**)   - Removed from Puppet an...
[19:09:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1215.eqiad.wmnet with OS bullseye
[19:09:12] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1215.eqiad.wmnet with OS bullseye
[19:09:26] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1214.eqiad.wmnet with OS bullseye
[19:09:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1214.eqiad.wmnet with OS bullseye
[19:10:36] <wikibugs>	 (03PS7) 10Ryan Kemper: [WIP] wdqs: test new metric option [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/900430 (https://phabricator.wikimedia.org/T328306)
[19:11:13] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] modules::profile::manifests::airflow.pp: add plugins_folder path [puppet] - 10https://gerrit.wikimedia.org/r/904609 (https://phabricator.wikimedia.org/T324485) (owner: 10Mforns)
[19:11:25] <wikibugs>	 (03PS1) 10Dzahn: miscweb: remove miscweb2002 from rsync dest hosts [puppet] - 10https://gerrit.wikimedia.org/r/904619 (https://phabricator.wikimedia.org/T331896)
[19:11:46] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] miscweb: remove miscweb2002 from rsync dest hosts [puppet] - 10https://gerrit.wikimedia.org/r/904619 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn)
[19:11:53] <wikibugs>	 (03PS8) 10Ryan Kemper: [WIP] wdqs: test new metric option [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/900430 (https://phabricator.wikimedia.org/T328306)
[19:13:32] <wikibugs>	 (03PS1) 10Ssingh: hiera: set bpg-med to 101 for lvs4008 (100 for lvs4010) [puppet] - 10https://gerrit.wikimedia.org/r/904620 (https://phabricator.wikimedia.org/T321309)
[19:14:00] <sukhe>	 bblack: /win 14
[19:14:02] <sukhe>	 er
[19:14:10] <bblack>	 my 14 is probably not yours :)
[19:14:19] <sukhe>	 haha
[19:14:35] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] hiera: set bpg-med to 101 for lvs4008 (100 for lvs4010) [puppet] - 10https://gerrit.wikimedia.org/r/904620 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[19:14:39] <sukhe>	 ^ thanks!
[19:14:41] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40469/console" [puppet] - 10https://gerrit.wikimedia.org/r/904620 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[19:14:57] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[19:14:58] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[19:15:13] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "the only change is on miscweb2002 itself, not the other machines. removes ferm rule but not rsync itself.." [puppet] - 10https://gerrit.wikimedia.org/r/904619 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn)
[19:15:15] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: set bpg-med to 101 for lvs4008 (100 for lvs4010) [puppet] - 10https://gerrit.wikimedia.org/r/904620 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[19:15:42] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[19:15:43] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[19:16:07] <wikibugs>	 (03PS3) 10Dzahn: decom miscweb2002 [puppet] - 10https://gerrit.wikimedia.org/r/902229 (https://phabricator.wikimedia.org/T331896)
[19:16:16] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gerrit1003.wikimedia.org with OS bullseye
[19:16:20] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[19:16:21] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host gerrit1003.wikimedia.org with OS bullseye executed with errors: - gerrit1003 (*...
[19:16:22] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[19:16:39] <wikibugs>	 (03PS4) 10Dzahn: miscweb/site: remove miscweb2002 from site [puppet] - 10https://gerrit.wikimedia.org/r/902229 (https://phabricator.wikimedia.org/T331896)
[19:17:23] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs4008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:18:17] <icinga-wm>	 RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:18:18] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1219']
[19:18:18] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1220']
[19:18:26] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[19:18:29] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[19:18:53] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[19:18:56] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[19:19:01] <icinga-wm>	 RECOVERY - pybal on lvs4008 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[19:19:03] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:20:33] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs4008 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[19:22:28] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1221']
[19:22:43] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1222']
[19:23:28] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1215.eqiad.wmnet with reason: host reimage
[19:24:04] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul)
[19:24:12] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1214.eqiad.wmnet with reason: host reimage
[19:26:42] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1215.eqiad.wmnet with reason: host reimage
[19:29:22] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1214.eqiad.wmnet with reason: host reimage
[19:35:35] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[19:40:57] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[19:42:37] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[19:45:48] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10API Platform (Sprint 06): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (10Ladsgroup)
[19:46:38] <wikibugs>	 (03PS1) 10Nray: Remove inline script from United States static page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904621 (https://phabricator.wikimedia.org/T331681)
[19:46:59] <wikibugs>	 (03PS2) 10Nray: Remove inline script from United States static page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904621 (https://phabricator.wikimedia.org/T331681)
[19:54:29] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[19:54:30] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1215.eqiad.wmnet with OS bullseye
[19:54:33] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[19:54:34] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1214.eqiad.wmnet with OS bullseye
[19:54:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1215.eqiad.wmnet with OS bullseye completed: - db1215 (**WARN**)   - Removed from Puppet an...
[19:54:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1214.eqiad.wmnet with OS bullseye completed: - db1214 (**PASS**)   - Removed from Puppet an...
[19:54:58] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1216.eqiad.wmnet with OS bullseye
[19:55:04] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1216.eqiad.wmnet with OS bullseye
[19:55:11] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1217.eqiad.wmnet with OS bullseye
[19:55:18] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1217.eqiad.wmnet with OS bullseye
[20:02:32] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1222']
[20:02:37] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1221']
[20:06:35] <thcipriani>	 jouncebot: now
[20:06:35] <jouncebot>	 For the next 0 hour(s) and 53 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230330T2000)
[20:06:42] <thcipriani>	 huh, but no ping
[20:07:16] <thcipriani>	 anyway nray you around for backport and config window?
[20:07:26] <nray>	 yes, I'm here!
[20:07:40] <thcipriani>	 cool, I guess I can be your deployer :)
[20:07:45] <nray>	 thank you!
[20:09:19] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1216.eqiad.wmnet with reason: host reimage
[20:09:40] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1217.eqiad.wmnet with reason: host reimage
[20:10:14] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904621 (https://phabricator.wikimedia.org/T331681) (owner: 10Nray)
[20:11:00] <wikibugs>	 (03Merged) 10jenkins-bot: Remove inline script from United States static page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904621 (https://phabricator.wikimedia.org/T331681) (owner: 10Nray)
[20:11:12] <logmsgbot>	 !log thcipriani@deploy2002 Started scap: Backport for [[gerrit:904621|Remove inline script from United States static page (T331681)]]
[20:11:17] <stashbot>	 T331681: Make a proposal for supporting the disabling of multiple features in client preferences - https://phabricator.wikimedia.org/T331681
[20:12:15] <wikibugs>	 (03CR) 10CDanis: alerting_host: failover icinga and alertmanger from eqiad to codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899629 (https://phabricator.wikimedia.org/T331882) (owner: 10Herron)
[20:12:30] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1216.eqiad.wmnet with reason: host reimage
[20:12:31] <logmsgbot>	 !log thcipriani@deploy2002 nray and thcipriani: Backport for [[gerrit:904621|Remove inline script from United States static page (T331681)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[20:12:43] <thcipriani>	 ^ nray should be live on mwdebug machines, check please
[20:12:53] <nray>	 thcipriani: thank you, checking now
[20:14:38] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1217.eqiad.wmnet with reason: host reimage
[20:14:44] <nray>	 thcipriani: looks good, you can proceed
[20:14:58] <thcipriani>	 okie doke, doing so
[20:20:44] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1223']
[20:20:54] <logmsgbot>	 !log thcipriani@deploy2002 Finished scap: Backport for [[gerrit:904621|Remove inline script from United States static page (T331681)]] (duration: 09m 42s)
[20:20:57] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1224']
[20:20:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:21:00] <stashbot>	 T331681: Make a proposal for supporting the disabling of multiple features in client preferences - https://phabricator.wikimedia.org/T331681
[20:21:10] <thcipriani>	 nray: live everywhere
[20:21:19] <nray>	 thcipriani: thanks so much for your help!
[20:21:26] <thcipriani>	 sure thing :)
[20:24:50] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[20:25:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:27:02] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[20:27:39] <wikibugs>	 10SRE, 10Traffic: Performance implications of buffer sizes in Apache Traffic Server intercept plugins - https://phabricator.wikimedia.org/T287847 (10BCornwall)
[20:27:45] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[20:30:53] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[20:30:54] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1216.eqiad.wmnet with OS bullseye
[20:30:58] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[20:30:58] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1217.eqiad.wmnet with OS bullseye
[20:31:00] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1216.eqiad.wmnet with OS bullseye completed: - db1216 (**PASS**)   - Removed from Puppet an...
[20:31:08] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1217.eqiad.wmnet with OS bullseye completed: - db1217 (**WARN**)   - Removed from Puppet an...
[20:33:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1218.eqiad.wmnet with OS bullseye
[20:33:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1218.eqiad.wmnet with OS bullseye
[20:33:21] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1219.eqiad.wmnet with OS bullseye
[20:33:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1219.eqiad.wmnet with OS bullseye
[20:35:31] <wikibugs>	 (03PS2) 10Jdlrobson: Disable Vector js/css sharing on pl.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904284 (https://phabricator.wikimedia.org/T332809)
[20:36:05] <Jdlrobson>	 thcipriani: is it too late to add something to the window?
[20:39:48] <wikibugs>	 SRE, Traffic, Patch-For-Review: increase of network errors on alert1001 after certspotter has been enabled - https://phabricator.wikimedia.org/T303593 (BCornwall)
[20:40:03] <wikibugs>	 SRE, Traffic, Patch-For-Review: increase of network errors on alert1001 after certspotter has been enabled - https://phabricator.wikimedia.org/T303593 (BCornwall) p:Medium→Triage
[20:41:59] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1224']
[20:42:40] <wikibugs>	 SRE, Traffic, observability, Upstream: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (BCornwall)
[20:42:48] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1223']
[20:42:56] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:44:36] <wikibugs>	 (PS2) Andrew Bogott: Toolforge: move to new VM-hosted NFS server [puppet] - https://gerrit.wikimedia.org/r/904562 (https://phabricator.wikimedia.org/T333477)
[20:44:38] <wikibugs>	 (PS1) Andrew Bogott: nfs traffic shaping: label IPs circa 2017 [puppet] - https://gerrit.wikimedia.org/r/904623
[20:44:40] <wikibugs>	 (PS1) Andrew Bogott: nfs traffic-shaping: replace labstore100[67] with clouddumps100[12] [puppet] - https://gerrit.wikimedia.org/r/904624
[20:44:42] <wikibugs>	 (PS1) Andrew Bogott: nfs traffic shaping: remove refs to labstore100[12] [puppet] - https://gerrit.wikimedia.org/r/904625
[20:44:44] <wikibugs>	 (PS1) Andrew Bogott: nfs traffic_shaping: replace labstore1003 rules with rules for scratch.svc [puppet] - https://gerrit.wikimedia.org/r/904626
[20:44:46] <wikibugs>	 (PS1) Andrew Bogott: nfs traffic_shaping: replace labstore1004 rules with rules for tools-nfs.svc [puppet] - https://gerrit.wikimedia.org/r/904627 (https://phabricator.wikimedia.org/T333477)
[20:46:27] <wikibugs>	 (CR) Andrew Bogott: [C: -1] "this should only be merged after we switch to the new nfs server" [puppet] - https://gerrit.wikimedia.org/r/904627 (https://phabricator.wikimedia.org/T333477) (owner: Andrew Bogott)
[20:46:38] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:47:36] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1218.eqiad.wmnet with reason: host reimage
[20:47:57] <wikibugs>	 (CR) Andrew Bogott: Toolforge: move to new VM-hosted NFS server [puppet] - https://gerrit.wikimedia.org/r/904562 (https://phabricator.wikimedia.org/T333477) (owner: Andrew Bogott)
[20:47:58] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1219.eqiad.wmnet with reason: host reimage
[20:48:18] <wikibugs>	 SRE-tools, DNS, Infrastructure-Foundations, Traffic, netbox: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (BCornwall)
[20:51:06] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1218.eqiad.wmnet with reason: host reimage
[20:53:10] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1219.eqiad.wmnet with reason: host reimage
[20:58:41] <wikibugs>	 SRE, Traffic, Patch-For-Review: Update certspotter - https://phabricator.wikimedia.org/T204993 (BCornwall) p:Medium→Triage
[20:59:53] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1225']
[21:00:29] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1209.mgmt.eqiad.wmnet with reboot policy FORCED
[21:05:55] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[21:06:25] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[21:11:00] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1209.mgmt.eqiad.wmnet with reboot policy FORCED
[21:12:56] <wikibugs>	 (PS1) Ladsgroup: admin: Add sfaci ssh key and analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/904629 (https://phabricator.wikimedia.org/T333456)
[21:13:24] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1210.mgmt.eqiad.wmnet with reboot policy FORCED
[21:13:36] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[21:13:37] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1219.eqiad.wmnet with OS bullseye
[21:13:40] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[21:13:41] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1218.eqiad.wmnet with OS bullseye
[21:13:43] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1219.eqiad.wmnet with OS bullseye completed: - db1219 (**WARN**)   - Removed from Puppet an...
[21:13:47] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1218.eqiad.wmnet with OS bullseye completed: - db1218 (**PASS**)   - Removed from Puppet an...
[21:14:43] <wikibugs>	 SRE, Traffic, PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (BCornwall) Open→In progress p:Medium→High a:BBlack→BCornwall Since there wasn't any feedback on this, I guess I'll claim this ticket since I'm actively trying to fix this. I'll ask me...
[21:22:05] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1225']
[21:24:42] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1209']
[21:24:52] <wikibugs>	 SRE, ops-knams, DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (wiki_willy)
[21:25:02] <wikibugs>	 SRE, ops-knams, DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (wiki_willy)
[21:27:26] <wikibugs>	 (PS1) Andrew Bogott: labstore1004: park in an 'insetup' role until we're ready to decom [puppet] - https://gerrit.wikimedia.org/r/904630 (https://phabricator.wikimedia.org/T333477)
[21:30:01] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1210.mgmt.eqiad.wmnet with reboot policy FORCED
[21:33:50] <wikibugs>	 SRE-tools, DNS, Infrastructure-Foundations, Traffic, netbox: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (Volans) Correct, and we've already the first validators in netbox-next that will be released to prod shortly so this can b...
[21:35:06] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1210']
[21:35:35] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1209']
[21:42:17] <wikibugs>	 (PS1) Bartosz Dziewoński: Enable visual enhancements on pages using __NEWSECTIONLINK__ on huwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/904631 (https://phabricator.wikimedia.org/T333570)
[21:48:54] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[21:52:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:57:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:06:18] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1210']
[22:07:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1209']
[22:15:14] <wikibugs>	 (CR) Dzahn: [C: +2] vrts: replace Icinga with Prometheus for SMTP monitoring [puppet] - https://gerrit.wikimedia.org/r/903805 (https://phabricator.wikimedia.org/T331901) (owner: Dzahn)
[22:15:53] <wikibugs>	 (CR) Dzahn: [C: +2] "I made a follow-up ticket for adding actual "send/expect" patterns to all TCP checks. Thanks for reviews!" [puppet] - https://gerrit.wikimedia.org/r/903805 (https://phabricator.wikimedia.org/T331901) (owner: Dzahn)
[22:16:10] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db1209']
[22:16:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['gerrit1003']
[22:17:23] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['gerrit1003']
[22:18:56] <wikibugs>	 (CR) Dzahn: [C: +2] "also checked that on new machine vrts2001 we already have port 25 with exim listening" [puppet] - https://gerrit.wikimedia.org/r/903805 (https://phabricator.wikimedia.org/T331901) (owner: Dzahn)
[22:20:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1209']
[22:20:53] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1209']
[22:21:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1209']
[22:27:33] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1209']
[22:33:08] <wikibugs>	 (PS1) Cwhite: logstash: replace grafana ecs fields [puppet] - https://gerrit.wikimedia.org/r/904590
[22:35:22] <wikibugs>	 (CR) Dzahn: [C: +2] "works per https://thanos.wikimedia.org/graph?g0.deduplicate=1&g0.expr=probe_success%7Binstance%3D~%22.*otrs.*%22%7D&g0.max_source_resoluti" [puppet] - https://gerrit.wikimedia.org/r/903805 (https://phabricator.wikimedia.org/T331901) (owner: Dzahn)
[22:40:28] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: replace grafana ecs fields [puppet] - https://gerrit.wikimedia.org/r/904590 (owner: Cwhite)
[22:47:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1220.eqiad.wmnet with OS bullseye
[22:47:23] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1220.eqiad.wmnet with OS bullseye
[22:50:54] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1221.eqiad.wmnet with OS bullseye
[22:51:01] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1221.eqiad.wmnet with OS bullseye
[22:51:11] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (Papaul)
[22:52:18] <wikibugs>	 SRE, ops-codfw, serviceops-collab, GitLab (Infrastructure): Install additional SSDs on gitlab2003.wikimedia.org (B5) - https://phabricator.wikimedia.org/T333304 (Papaul) @Jelto  the 2 disks are in place in gitlab2003
[22:52:56] <wikibugs>	 SRE, ops-codfw, serviceops-collab, GitLab (Infrastructure): Install additional SSDs on gitlab2003.wikimedia.org (B5) - https://phabricator.wikimedia.org/T333304 (Papaul) a:Jelto
[22:57:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:58:49] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1149.mgmt.eqiad.wmnet with reboot policy FORCED
[22:59:43] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1149.mgmt.eqiad.wmnet with reboot policy FORCED
[23:01:41] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1220.eqiad.wmnet with reason: host reimage
[23:02:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:02:44] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1209.eqiad.wmnet with OS bullseye
[23:02:51] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1209.eqiad.wmnet with OS bullseye
[23:03:51] <wikibugs>	 SRE, Observability-Logging, Wikimedia-Logstash, SRE Observability (FY2022/2023-Q4): Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (lmata) a:lmata→herron We've made this item and subsequent follow-up an OKR for Q4, handing it off to @herron
[23:04:48] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1220.eqiad.wmnet with reason: host reimage
[23:05:25] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1221.eqiad.wmnet with reason: host reimage
[23:07:20] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (Jclark-ctr) @BTullis what HW raid to  not in task
[23:08:37] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1221.eqiad.wmnet with reason: host reimage
[23:09:38] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1149.mgmt.eqiad.wmnet with reboot policy FORCED
[23:11:10] <wikibugs>	 ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T333328 (Papaul) Open→Resolved This was fixed by @Jhancock.wm
[23:13:30] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1210.eqiad.wmnet with OS bullseye
[23:13:37] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1210.eqiad.wmnet with OS bullseye
[23:15:04] <icinga-wm>	 PROBLEM - Check systemd state on graphite1005 is CRITICAL: CRITICAL - degraded: The following units failed: statsd-proxy-socat-6to4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:16:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1209.eqiad.wmnet with reason: host reimage
[23:18:33] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1149.mgmt.eqiad.wmnet with reboot policy FORCED
[23:19:25] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[23:19:30] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1150.mgmt.eqiad.wmnet with reboot policy FORCED
[23:20:08] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1209.eqiad.wmnet with reason: host reimage
[23:23:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[23:24:16] <icinga-wm>	 RECOVERY - Check systemd state on graphite1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:26:54] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[23:26:55] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1221.eqiad.wmnet with OS bullseye
[23:26:56] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[23:26:57] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1220.eqiad.wmnet with OS bullseye
[23:27:01] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1221.eqiad.wmnet with OS bullseye completed: - db1221 (**WARN**)   - Removed from Puppet an...
[23:27:04] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1220.eqiad.wmnet with OS bullseye completed: - db1220 (**PASS**)   - Removed from Puppet an...
[23:29:21] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1150.mgmt.eqiad.wmnet with reboot policy FORCED
[23:30:00] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1151.mgmt.eqiad.wmnet with reboot policy FORCED
[23:31:02] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1222.eqiad.wmnet with OS bullseye
[23:31:10] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1222.eqiad.wmnet with OS bullseye
[23:32:10] <icinga-wm>	 PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1003.eqiad.wmnet.service,rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:33:17] <wikibugs>	 (PS1) Cwhite: logstash: normalize_level add grafana error level alias [puppet] - https://gerrit.wikimedia.org/r/904591
[23:34:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1210.eqiad.wmnet with reason: host reimage
[23:35:52] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[23:37:14] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1151.mgmt.eqiad.wmnet with reboot policy FORCED
[23:37:50] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1210.eqiad.wmnet with reason: host reimage
[23:38:05] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1152.mgmt.eqiad.wmnet with reboot policy FORCED
[23:39:05] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[23:39:06] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1209.eqiad.wmnet with OS bullseye
[23:39:12] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1209.eqiad.wmnet with OS bullseye completed: - db1209 (**PASS**)   - Removed from Puppet an...
[23:40:04] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (Papaul)
[23:41:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1223.eqiad.wmnet with OS bullseye
[23:41:46] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1223.eqiad.wmnet with OS bullseye
[23:44:45] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1152.mgmt.eqiad.wmnet with reboot policy FORCED
[23:45:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1153.mgmt.eqiad.wmnet with reboot policy FORCED
[23:45:44] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1222.eqiad.wmnet with reason: host reimage
[23:48:59] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1222.eqiad.wmnet with reason: host reimage
[23:51:12] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1224.eqiad.wmnet with OS bullseye
[23:51:18] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1224.eqiad.wmnet with OS bullseye
[23:51:20] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1153.mgmt.eqiad.wmnet with reboot policy FORCED
[23:51:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1154.mgmt.eqiad.wmnet with reboot policy FORCED
[23:55:14] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: Site: esams 1 VM request for prometheus3002 - https://phabricator.wikimedia.org/T333627 (andrea.denisse)
[23:55:30] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: Site: esams 1 VM request for prometheus3002 - https://phabricator.wikimedia.org/T333627 (andrea.denisse) a:andrea.denisse
[23:56:01] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1223.eqiad.wmnet with reason: host reimage
[23:58:59] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[23:59:01] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1154.mgmt.eqiad.wmnet with reboot policy FORCED
[23:59:13] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1223.eqiad.wmnet with reason: host reimage
[23:59:18] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1155.mgmt.eqiad.wmnet with reboot policy FORCED
[23:59:53] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus3002.esams.wmnet
[23:59:54] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox