[00:07:12] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:07:30] <icinga-wm>	 PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:21:32] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:21:52] <icinga-wm>	 RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:38:58] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/965584
[00:39:04] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/965584 (owner: TrainBranchBot)
[00:39:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[00:49:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[00:53:57] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/965584 (owner: TrainBranchBot)
[00:55:16] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:00:12] <icinga-wm>	 PROBLEM - Host mr1-drmrs.oob is DOWN: PING CRITICAL - Packet loss = 100%
[02:00:12] <icinga-wm>	 PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[02:05:04] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[02:05:40] <icinga-wm>	 RECOVERY - Host mr1-drmrs.oob is UP: PING OK - Packet loss = 0%, RTA = 86.98 ms
[02:05:40] <icinga-wm>	 RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 87.24 ms
[02:14:24] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:16:22] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[02:18:40] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:38:36] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:03:36] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:45:14] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:45:48] <icinga-wm>	 PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 45 probes of 774 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:51:14] <icinga-wm>	 RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 20 probes of 774 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:52:32] <icinga-wm>	 PROBLEM - Check systemd state on arclamp1001 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:55:14] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:55:17] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:01:02] <icinga-wm>	 RECOVERY - Check systemd state on arclamp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:35:34] <icinga-wm>	 PROBLEM - Check systemd state on doc2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-host-data-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:48:04] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:51:14] <wikibugs>	 (PS1) Gergő Tisza: Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852)
[05:52:12] <wikibugs>	 (CR) CI reject: [V: -1] Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: Gergő Tisza)
[05:54:10] <wikibugs>	 (PS2) Gergő Tisza: Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852)
[05:57:54] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2160.codfw.wmnet with OS bookworm
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T0600)
[06:00:25] <wikibugs>	 (PS1) Ayounsi: Turnilo: wmf_netflow: change forwarded 1/0 to yes/no [puppet] - https://gerrit.wikimedia.org/r/966800 (https://phabricator.wikimedia.org/T331707)
[06:04:17] <jinxer-wm>	 (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:08:20] <wikibugs>	 (PS2) Stevemunene: Switch druid1006 zookeeper node with druid1011 [puppet] - https://gerrit.wikimedia.org/r/965501 (https://phabricator.wikimedia.org/T336042)
[06:08:22] <wikibugs>	 (PS2) Stevemunene: Switch druid1005 zookeeper node with druid1010 [puppet] - https://gerrit.wikimedia.org/r/965500 (https://phabricator.wikimedia.org/T336042)
[06:08:24] <wikibugs>	 (PS2) Stevemunene: Switch druid1004 zookeeper node with druid1009 [puppet] - https://gerrit.wikimedia.org/r/965499 (https://phabricator.wikimedia.org/T336042)
[06:08:26] <logmsgbot>	 !log marostegui@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db2160.codfw.wmnet with OS bookworm
[06:08:26] <wikibugs>	 (PS4) Stevemunene: druid: Add druid druid10[09-11] to druid_public_broker VIP [puppet] - https://gerrit.wikimedia.org/r/962250 (https://phabricator.wikimedia.org/T336042)
[06:09:17] <jinxer-wm>	 (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:14:58] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:16:01] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2160.codfw.wmnet with OS bookworm
[06:17:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[06:22:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[06:27:24] <wikibugs>	 (CR) Majavah: [C: +2] nginx: make /etc/nginx depend on the package [puppet] - https://gerrit.wikimedia.org/r/966549 (owner: Majavah)
[06:32:22] <icinga-wm>	 RECOVERY - Check systemd state on doc2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:34:41] <wikibugs>	 (CR) Ayounsi: "1 small comment, +1 otherwise" [homer/public] - https://gerrit.wikimedia.org/r/966581 (https://phabricator.wikimedia.org/T312635) (owner: Cathal Mooney)
[06:36:29] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2160.codfw.wmnet with reason: host reimage
[06:37:04] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Automate L3 Switch to Core Router BGP peerings (and remove OSPF on drmrs switches) - https://phabricator.wikimedia.org/T349125 (ayounsi) Isn't OSPF required there to benefit from the end to end link cost calculations (eg. draining a transport link)?
[06:38:16] <XioNoX>	 !log push pfw policies - T349101
[06:38:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:41] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2160.codfw.wmnet with reason: host reimage
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and taavi: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T0700). nyaa~
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:27] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2160.codfw.wmnet with OS bookworm
[07:00:47] * taavi blames TheresNoTime for that message
[07:00:56] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics@5dcce3b]: Add missing MR in yesterday weekly train [airflow-dags@5dcce3bd]
[07:02:58] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:03:10] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:03:36] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:04:48] <logmsgbot>	 !log aqu@deploy2002 deploy aborted: Add missing MR in yesterday weekly train [airflow-dags@5dcce3bd] (duration: 03m 52s)
[07:05:11] <wikibugs>	 (PS1) Majavah: P:wmcs::metricsinfra: add meta monitoring app skeleton [puppet] - https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053)
[07:05:13] <wikibugs>	 (PS1) Majavah: P:wmcs::metriscinfra: haproxy: add route for meta monitor service [puppet] - https://gerrit.wikimedia.org/r/966805 (https://phabricator.wikimedia.org/T288053)
[07:05:13] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@be05071]: (no justification provided)
[07:05:19] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@be05071]: (no justification provided) (duration: 00m 06s)
[07:05:26] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:05:56] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics@5dcce3b]: Add missing MR in yesterday weekly train (run 2) [airflow-dags@5dcce3bd]
[07:06:03] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@5dcce3b]: Add missing MR in yesterday weekly train (run 2) [airflow-dags@5dcce3bd] (duration: 00m 07s)
[07:06:36] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:06:46] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:06:58] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:07:28] <wikibugs>	 (PS2) Volans: dhcp: adapt to new Spicerack's dhcp() API [cookbooks] - https://gerrit.wikimedia.org/r/966490 (https://phabricator.wikimedia.org/T341973)
[07:07:38] <wikibugs>	 (CR) Marostegui: [C: +2] production-m5.sql.erb: Remove testreduce grants [puppet] - https://gerrit.wikimedia.org/r/966327 (https://phabricator.wikimedia.org/T345831) (owner: Marostegui)
[07:07:40] <wikibugs>	 (PS1) Marostegui: db2132: Migrate to 10.6 [puppet] - https://gerrit.wikimedia.org/r/966806 (https://phabricator.wikimedia.org/T349090)
[07:08:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[07:08:14] <wikibugs>	 (CR) Marostegui: [C: +2] db2132: Migrate to 10.6 [puppet] - https://gerrit.wikimedia.org/r/966806 (https://phabricator.wikimedia.org/T349090) (owner: Marostegui)
[07:09:02] <wikibugs>	 (CR) CI reject: [V: -1] P:wmcs::metricsinfra: add meta monitoring app skeleton [puppet] - https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) (owner: Majavah)
[07:09:53] <wikibugs>	 (PS2) Majavah: P:wmcs::metricsinfra: add meta monitoring app skeleton [puppet] - https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053)
[07:09:55] <wikibugs>	 (PS2) Majavah: P:wmcs::metriscinfra: haproxy: add route for meta monitor service [puppet] - https://gerrit.wikimedia.org/r/966805 (https://phabricator.wikimedia.org/T288053)
[07:13:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[07:13:39] <wikibugs>	 (CR) CI reject: [V: -1] P:wmcs::metricsinfra: add meta monitoring app skeleton [puppet] - https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) (owner: Majavah)
[07:14:03] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] puppet-agent-fail: enable check for all clusters. [alerts] - https://gerrit.wikimedia.org/r/966554 (owner: Slyngshede)
[07:14:50] <wikibugs>	 (PS3) Majavah: P:wmcs::metricsinfra: add meta monitoring app skeleton [puppet] - https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053)
[07:14:52] <wikibugs>	 (PS3) Majavah: P:wmcs::metriscinfra: haproxy: add route for meta monitor service [puppet] - https://gerrit.wikimedia.org/r/966805 (https://phabricator.wikimedia.org/T288053)
[07:15:38] <wikibugs>	 (PS1) Kevin Bazira: ml-services: update the recommendation-api-ng image [deployment-charts] - https://gerrit.wikimedia.org/r/965585 (https://phabricator.wikimedia.org/T348607)
[07:18:37] <wikibugs>	 (CR) Slyngshede: [C: +2] puppet-agent-fail: enable check for all clusters. [alerts] - https://gerrit.wikimedia.org/r/966554 (owner: Slyngshede)
[07:20:19] <wikibugs>	 (Merged) jenkins-bot: puppet-agent-fail: enable check for all clusters. [alerts] - https://gerrit.wikimedia.org/r/966554 (owner: Slyngshede)
[07:20:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2132.codfw.wmnet with OS bookworm
[07:24:35] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] otel-coll: bump resource limits [deployment-charts] - https://gerrit.wikimedia.org/r/966514 (https://phabricator.wikimedia.org/T345712) (owner: Filippo Giunchedi)
[07:26:04] <wikibugs>	 (CR) Elukey: [C: +1] ml-services: update the recommendation-api-ng image [deployment-charts] - https://gerrit.wikimedia.org/r/965585 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[07:27:12] <logmsgbot>	 !log filippo@deploy2002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply
[07:27:25] <logmsgbot>	 !log filippo@deploy2002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply
[07:28:09] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply
[07:28:17] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply
[07:28:26] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply
[07:28:32] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply
[07:31:09] <wikibugs>	 (CR) Arnaudb: [V: +1 C: +1] "hieradata/hosts/pc1015.yaml has a duplicated line at line 7 otherwise lgtm!" [puppet] - https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) (owner: Marostegui)
[07:34:52] <wikibugs>	 (PS1) Volans: sre.hosts.reimage: fix --new with puppet 7 support [cookbooks] - https://gerrit.wikimedia.org/r/966809
[07:37:46] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2132.codfw.wmnet with reason: host reimage
[07:37:54] <volans>	 !log temporarily disabled puppet on the A:cumin hosts to deploy and test spicerack v8.0.0
[07:37:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:52] <wikibugs>	 (CR) Volans: [C: +2] "Self-merging to test spicerack 8.0.0 on cumin2002, puppet is disabled on cumin1001. I'll be happy to do any post-merge fix." [cookbooks] - https://gerrit.wikimedia.org/r/966490 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[07:40:55] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2132.codfw.wmnet with reason: host reimage
[07:42:12] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management: Some or all of the undeletion failed - https://phabricator.wikimedia.org/T348937 (MatthewVernon) We did switch DCs recently, the impact of which is more load on thumbor (and during the switchover we discovered there was a shortage of thumbor pods...
[07:43:28] <wikibugs>	 (Merged) jenkins-bot: dhcp: adapt to new Spicerack's dhcp() API [cookbooks] - https://gerrit.wikimedia.org/r/966490 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[07:46:43] <logmsgbot>	 !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2132.codfw.wmnet with OS bookworm
[07:47:10] <wikibugs>	 (CR) Kevin Bazira: [C: +2] ml-services: update the recommendation-api-ng image [deployment-charts] - https://gerrit.wikimedia.org/r/965585 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[07:47:47] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[07:47:58] <wikibugs>	 (Merged) jenkins-bot: ml-services: update the recommendation-api-ng image [deployment-charts] - https://gerrit.wikimedia.org/r/965585 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[07:54:58] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[08:00:04] <jouncebot>	 brennen and hashar: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T0800).
[08:00:40] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM overall, see inline" [puppet] - https://gerrit.wikimedia.org/r/966553 (https://phabricator.wikimedia.org/T329398) (owner: Brouberol)
[08:02:29] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add A/PTR for lsw1-e8/ssw links - ayounsi@cumin1001"
[08:02:41] <wikibugs>	 (PS1) Jgiannelos: tegola: Configure logger to use json output [deployment-charts] - https://gerrit.wikimedia.org/r/966810
[08:03:16] <wikibugs>	 SRE, DBA, MW-1.41-notes (1.41.0-wmf.30; 2023-10-10): Error connecting to db2109 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection refused - https://phabricator.wikimedia.org/T348419 (Marostegui)
[08:03:38] <wikibugs>	 (PS2) Jgiannelos: tegola: Configure logger to use json output [deployment-charts] - https://gerrit.wikimedia.org/r/966810 (https://phabricator.wikimedia.org/T347717)
[08:03:45] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add A/PTR for lsw1-e8/ssw links - ayounsi@cumin1001"
[08:03:45] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:06:04] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.dhcp for host sretest1001.eqiad.wmnet
[08:07:22] <wikibugs>	 (Abandoned) Hashar: gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) (owner: Paladox)
[08:08:24] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest1001.eqiad.wmnet
[08:10:04] <wikibugs>	 (CR) Hashar: logging: reorder wmgMonologProcessors entries (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/966529 (https://phabricator.wikimedia.org/T349086) (owner: Hashar)
[08:11:54] <wikibugs>	 (CR) David Caro: "Can you elaborate on what is this and how will it work?" [puppet] - https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) (owner: Majavah)
[08:12:41] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Bring Juniper switches in eqiad racks E5-7 and F5-7 online and ready for servers - https://phabricator.wikimedia.org/T334230 (ayounsi)
[08:12:47] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (ayounsi)
[08:12:55] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Cabling for Eqiad racks E5-8 and F5-8 - https://phabricator.wikimedia.org/T334231 (ayounsi) Resolved→Open a:cmooney→Jclark-ctr I can't get the links to the Dell switches up, only looking at lsw1-e8 for now it seems li...
[08:14:44] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.dhcp for host sretest1001.eqiad.wmnet
[08:15:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[08:15:37] <wikibugs>	 ops-eqiad: Add test server to row E8 - https://phabricator.wikimedia.org/T349168 (ayounsi)
[08:16:03] <wikibugs>	 ops-eqiad: Add test server to row E8 - https://phabricator.wikimedia.org/T349168 (ayounsi)
[08:16:05] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (ayounsi)
[08:18:22] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest1001.eqiad.wmnet
[08:20:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[08:20:13] <wikibugs>	 ops-eqiad: Add test server to rack E8 - https://phabricator.wikimedia.org/T349168 (ayounsi)
[08:23:08] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmde, wmf for Ricki_Jay - https://phabricator.wikimedia.org/T349170 (RickiJay-WMDE)
[08:23:34] <wikibugs>	 (PS1) WMDE-Fisch: Revert "Revert "Workaround to center search terms label"" [extensions/AdvancedSearch] (wmf/1.42.0-wmf.1) - https://gerrit.wikimedia.org/r/966610 (https://phabricator.wikimedia.org/T252346)
[08:27:45] <wikibugs>	 (CR) JMeybohm: [C: +2] wikifunctions: Use ClusterIP services for evaluators [deployment-charts] - https://gerrit.wikimedia.org/r/965718 (https://phabricator.wikimedia.org/T343388) (owner: JMeybohm)
[08:28:34] <icinga-wm>	 PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:28:47] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Use ClusterIP services for evaluators [deployment-charts] - https://gerrit.wikimedia.org/r/965718 (https://phabricator.wikimedia.org/T343388) (owner: JMeybohm)
[08:30:07] <wikibugs>	 (PS1) Volans: locking: fix path for Spicerack modules locks [software/spicerack] - https://gerrit.wikimedia.org/r/966812 (https://phabricator.wikimedia.org/T341973)
[08:31:10] <wikibugs>	 (PS1) Kosta Harlan: ipoid: Update cronjob definition [deployment-charts] - https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861)
[08:32:33] <wikibugs>	 (CR) Kosta Harlan: ipoid: Update cronjob definition (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: Kosta Harlan)
[08:40:17] <wikibugs>	 (CR) Volans: [C: +2] "Self-merging to fix bug found while testing v8.0.0" [software/spicerack] - https://gerrit.wikimedia.org/r/966812 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[08:40:58] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[08:44:11] <wikibugs>	 (PS1) Jcrespo: dbbackups: Start backing up only clusters 28 and 29 from ES [puppet] - https://gerrit.wikimedia.org/r/966814 (https://phabricator.wikimedia.org/T342685)
[08:47:07] <wikibugs>	 (Merged) jenkins-bot: locking: fix path for Spicerack modules locks [software/spicerack] - https://gerrit.wikimedia.org/r/966812 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[08:50:51] <wikibugs>	 (CR) Jcrespo: "So this is the way I suggested to implement this- with a static configuration, as once we know this is working the first time, it should w" [puppet] - https://gerrit.wikimedia.org/r/966814 (https://phabricator.wikimedia.org/T342685) (owner: Jcrespo)
[08:51:03] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[08:53:19] <wikibugs>	 (CR) Jcrespo: "One remaining question is the name of the backups- right now because we could only do full backups, the backups were called "es4"/"es5"- t" [puppet] - https://gerrit.wikimedia.org/r/966814 (https://phabricator.wikimedia.org/T342685) (owner: Jcrespo)
[08:53:36] <wikibugs>	 (PS1) JMeybohm: wikifunctions: Make app and mesh port different [deployment-charts] - https://gerrit.wikimedia.org/r/966816 (https://phabricator.wikimedia.org/T343388)
[08:54:07] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm
[08:55:17] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:58:34] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/34/console" [puppet] - https://gerrit.wikimedia.org/r/966568 (https://phabricator.wikimedia.org/T344884) (owner: Brennen Bearnes)
[09:00:48] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics@c17c91c]: Fix following yesterday weekly train deploy [airflow-dags@c17c91ce]
[09:01:14] <wikibugs>	 (PS1) Volans: CHANGELOG: add changelogs for release v8.0.1 [software/spicerack] - https://gerrit.wikimedia.org/r/966817
[09:01:58] <logmsgbot>	 !log aqu@deploy2002 deploy aborted: Fix following yesterday weekly train deploy [airflow-dags@c17c91ce] (duration: 01m 10s)
[09:02:07] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics@c17c91c]: Fix following yesterday weekly train deploy - Second try [airflow-dags@c17c91ce]
[09:02:13] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@c17c91c]: Fix following yesterday weekly train deploy - Second try [airflow-dags@c17c91ce] (duration: 00m 06s)
[09:02:22] <wikibugs>	 (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v8.0.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/966817 (owner: 10Volans)
[09:03:33] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] "lgtm, I'll test the change on phab2002 and then on phab1004" [puppet] - 10https://gerrit.wikimedia.org/r/966568 (https://phabricator.wikimedia.org/T344884) (owner: 10Brennen Bearnes)
[09:04:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:05:28] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1004.eqiad.wmnet
[09:06:57] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage
[09:07:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:08:32] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thanos: allow thanos-rule to serve /rule [puppet] - 10https://gerrit.wikimedia.org/r/966818 (https://phabricator.wikimedia.org/T349102)
[09:08:34] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thanos: reverse-proxy /rule to rule-hosts [puppet] - 10https://gerrit.wikimedia.org/r/966819 (https://phabricator.wikimedia.org/T349102)
[09:08:49] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] wikifunctions: Make app and mesh port different [deployment-charts] - 10https://gerrit.wikimedia.org/r/966816 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm)
[09:09:18] <wikibugs>	 (03CR) 10Jbond: Don't require dummy 'team' label for multi-owner alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/956794 (owner: 10Filippo Giunchedi)
[09:09:52] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Make app and mesh port different [deployment-charts] - 10https://gerrit.wikimedia.org/r/966816 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm)
[09:09:59] <jinxer-wm>	 (PuppetFailure) firing: (3) Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:10:01] <wikibugs>	 (03PS1) 10Volans: Upstream release v8.0.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/966820
[09:10:04] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage
[09:10:14] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Upstream release v8.0.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/966820 (owner: 10Volans)
[09:11:12] <wikibugs>	 (03CR) 10Marostegui: mariadb: Productionize pc1016, pc2016 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) (owner: 10Marostegui)
[09:11:14] <wikibugs>	 (03PS7) 10Marostegui: mariadb: Productionize pc1016, pc2016 [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408)
[09:11:24] <wikibugs>	 (03CR) 10Jbond: "post merge comment" [alerts] - 10https://gerrit.wikimedia.org/r/966554 (owner: 10Slyngshede)
[09:12:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: P:wmcs::metricsinfra: add meta monitoring app skeleton (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) (owner: 10Majavah)
[09:13:05] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1004.eqiad.wmnet
[09:13:12] <jinxer-wm>	 (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:13:17] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1006.eqiad.wmnet
[09:14:06] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1007.eqiad.wmnet
[09:16:14] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] "Allow me to merge this as is for the time being so I can test and generate backups right away, confirming this works; but please note that" [puppet] - 10https://gerrit.wikimedia.org/r/966814 (https://phabricator.wikimedia.org/T342685) (owner: 10Jcrespo)
[09:16:23] <wikibugs>	 (03Merged) 10jenkins-bot: Upstream release v8.0.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/966820 (owner: 10Volans)
[09:16:49] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on stat1009.eqiad.wmnet with reason: Moving /home to /srv/home on stat1009 and rebooting
[09:17:03] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on stat1009.eqiad.wmnet with reason: Moving /home to /srv/home on stat1009 and rebooting
[09:17:13] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[09:17:27] <godog>	 jbond: I'm not sure I understand the comments on https://gerrit.wikimedia.org/r/c/operations/alerts/+/956794 as that change didn't change any semantics
[09:17:33] <icinga-wm>	 RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:17:40] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[09:18:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:18:19] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "I think this makes sense. Adding John." [puppet] - 10https://gerrit.wikimedia.org/r/960063 (https://phabricator.wikimedia.org/T345152) (owner: 10Hashar)
[09:19:00] <godog>	 or at least that was my intention
[09:19:24] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Makes sense to me, adding John" [puppet] - 10https://gerrit.wikimedia.org/r/960062 (https://phabricator.wikimedia.org/T345152) (owner: 10Hashar)
[09:19:49] <godog>	 i.e. the alerts still have per-team 'team' label, it is part of the expression via group_left
[09:19:55] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[09:20:03] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/960064 (https://phabricator.wikimedia.org/T345152) (owner: 10Hashar)
[09:20:07] <wikibugs>	 (03CR) 10Majavah: P:wmcs::metricsinfra: add meta monitoring app skeleton (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) (owner: 10Majavah)
[09:20:31] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/966809 (owner: 10Volans)
[09:20:51] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[09:20:53] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1006.eqiad.wmnet
[09:21:09] <wikibugs>	 (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/966187 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[09:21:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "lgtm, merging" [puppet] - 10https://gerrit.wikimedia.org/r/960063 (https://phabricator.wikimedia.org/T345152) (owner: 10Hashar)
[09:21:56] <jynus>	 !log starting new backup of es1022, es1025 (new clusters only)
[09:21:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:11] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[09:22:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] envoyproxy: remove skip_install from tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/960062 (https://phabricator.wikimedia.org/T345152) (owner: 10Hashar)
[09:22:31] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Remove minversion=1.6 from tox.ini files [puppet] - 10https://gerrit.wikimedia.org/r/960064 (https://phabricator.wikimedia.org/T345152) (owner: 10Hashar)
[09:22:53] <hashar>	 jbond: thank you :)
[09:23:05] <logmsgbot>	 !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm
[09:23:08] <jynus>	 !log aborting backup of es1022, es1025 (there was already another backup running)
[09:23:09] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[09:23:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:16] <hashar>	 I was talking to volans about those changes and he reminded me that SRE foundation is the go-to team for anything related to Puppet + CI :)
[09:24:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "Please note that the semantics of these alerts didn't change, i.e. there's still a per-team label attached to the alerts via group_left()." [alerts] - 10https://gerrit.wikimedia.org/r/956794 (owner: 10Filippo Giunchedi)
[09:25:56] <volans>	 !log uploaded spicerack_8.0.1 to apt.wikimedia.org bullseye-wikimedia
[09:25:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:59] <wikibugs>	 (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/966188 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[09:27:05] <wikibugs>	 (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/966189 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[09:27:11] <wikibugs>	 (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/966190 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[09:27:18] <wikibugs>	 (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/966191 (owner: 10Volans)
[09:28:27] <wikibugs>	 (03PS5) 10Volans: svc records: add missing comments for reserved IPs [dns] - 10https://gerrit.wikimedia.org/r/965119
[09:30:25] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: fix --new with puppet 7 support [cookbooks] - 10https://gerrit.wikimedia.org/r/966809 (owner: 10Volans)
[09:30:50] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.hosts.dhcp: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966187 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[09:31:04] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.hosts.provision: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966188 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[09:31:21] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966189 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[09:31:52] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.hosts.reboot-single: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966190 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[09:32:34] <wikibugs>	 (03CR) 10Volans: [C: 03+2] tox.ini: remove optimization for tox <4 [cookbooks] - 10https://gerrit.wikimedia.org/r/966191 (owner: 10Volans)
[09:32:51] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.reimage: fix --new with puppet 7 support [cookbooks] - 10https://gerrit.wikimedia.org/r/966809 (owner: 10Volans)
[09:33:10] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.dhcp: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966187 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[09:33:27] <wikibugs>	 (03PS2) 10Ladsgroup: Set s6 and s8 to write both for pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966592 (https://phabricator.wikimedia.org/T345732)
[09:33:36] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.provision: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966188 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[09:33:50] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.reimage: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966189 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[09:34:09] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.reboot-single: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966190 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[09:34:28] <wikibugs>	 (03CR) 10Jbond: "thanks filippo" [alerts] - 10https://gerrit.wikimedia.org/r/956794 (owner: 10Filippo Giunchedi)
[09:36:18] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] svc records: add missing comments for reserved IPs [dns] - 10https://gerrit.wikimedia.org/r/965119 (owner: 10Volans)
[09:37:07] <wikibugs>	 (03Merged) 10jenkins-bot: tox.ini: remove optimization for tox <4 [cookbooks] - 10https://gerrit.wikimedia.org/r/966191 (owner: 10Volans)
[09:37:58] <wikibugs>	 (03PS1) 10Slyngshede: puppet_agent_failed: label alert with appropriate team. [alerts] - 10https://gerrit.wikimedia.org/r/966821
[09:39:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppet_agent_failed: label alert with appropriate team. [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede)
[09:39:18] <wikibugs>	 (03PS1) 10Hashar: debug_host: rename mangecode to managecode (typo) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966822
[09:47:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[09:47:36] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.dhcp for host sretest1001.eqiad.wmnet
[09:48:04] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest1001.eqiad.wmnet
[09:49:29] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "ship it, we can fine tune later, unless Chris have concerns" [alerts] - 10https://gerrit.wikimedia.org/r/902316 (https://phabricator.wikimedia.org/T328941) (owner: 10Volans)
[09:50:41] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmde, wmf for Ricki_Jay - https://phabricator.wikimedia.org/T349170 (10RickiJay-WMDE) 05Open→03Resolved a:03RickiJay-WMDE
[09:52:05] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on stat1009.eqiad.wmnet with reason: Extending downtime for stat1009
[09:52:07] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on stat1009.eqiad.wmnet with reason: Extending downtime for stat1009
[09:52:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[09:54:14] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye
[09:58:44] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] role::redis::misc::{master,slave}: remove ORES configs [puppet] - 10https://gerrit.wikimedia.org/r/965124 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1000)
[10:03:25] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[10:03:42] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] "This is grand!" [puppet] - 10https://gerrit.wikimedia.org/r/965124 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey)
[10:04:35] <Amir1>	 jouncebot: nowandnext
[10:04:35] <jouncebot>	 For the next 0 hour(s) and 55 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1000)
[10:04:35] <jouncebot>	 In 2 hour(s) and 55 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1300)
[10:07:23] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm
[10:09:07] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[10:09:16] <wikibugs>	 (03PS1) 10Jbond: sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966823
[10:09:18] <wikibugs>	 (03PS1) 10Jbond: sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824
[10:13:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 (owner: 10Jbond)
[10:13:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 (owner: 10Jbond)
[10:16:59] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage
[10:17:04] <wikibugs>	 (03PS1) 10Kevin Bazira: ml-services: add listeners for cxserver and eventgate-analytics to rec-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/966827 (https://phabricator.wikimedia.org/T348607)
[10:18:53] <wikibugs>	 (03CR) 10David Caro: P:wmcs::metricsinfra: add meta monitoring app skeleton (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) (owner: 10Majavah)
[10:19:22] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] ml-services: add listeners for cxserver and eventgate-analytics to rec-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/966827 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira)
[10:19:42] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage
[10:21:10] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+2] "Thanks for the review Luca :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/966827 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira)
[10:22:01] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: add listeners for cxserver and eventgate-analytics to rec-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/966827 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira)
[10:26:33] <volans>	 btullis: FYI the reboot cookbook is taking forever because it is waiting for a successful puppet run, but puppet is failing on stat1007, see https://puppetboard.wikimedia.org/report/stat1007.eqiad.wmnet/59b3decc9bbd3ac0bb06dbe8c55aecc4ed36a924
[10:27:21] <btullis>	 volans: Thanks. I'm on it already with stevemunene - Should be resolved shortly.
[10:27:36] <wikibugs>	 (03PS2) 10Slyngshede: puppet_agent_failed: label alert with appropriate team. [alerts] - 10https://gerrit.wikimedia.org/r/966821
[10:28:26] <volans>	 ack
[10:28:28] <volans>	 great
[10:28:49] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[10:28:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppet_agent_failed: label alert with appropriate team. [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede)
[10:30:57] <wikibugs>	 (03CR) 10Slyngshede: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede)
[10:31:58] <stevemunene>	 should be ok now volans btullis
[10:32:41] <logmsgbot>	 !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm
[10:33:18] <wikibugs>	 (03PS1) 10Jbond: late_command: drop signed-by config [puppet] - 10https://gerrit.wikimedia.org/r/966825
[10:35:02] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1007.eqiad.wmnet
[10:35:49] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/966825 (owner: 10Jbond)
[10:35:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] late_command: drop signed-by config [puppet] - 10https://gerrit.wikimedia.org/r/966825 (owner: 10Jbond)
[10:37:13] <logmsgbot>	 !log volans@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1001.eqiad.wmnet with OS bullseye
[10:40:18] <volans>	 !log re-enabled puppet on the cumin hosts. installed spicerack 8.0.1 on the cumin hosts
[10:40:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:24] <wikibugs>	 (03CR) 10Effie Mouzeli: ipoid: Update cronjob definition (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[10:50:21] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] tegola: Configure logger to use json output [deployment-charts] - 10https://gerrit.wikimedia.org/r/966810 (https://phabricator.wikimedia.org/T347717) (owner: 10Jgiannelos)
[10:50:59] <wikibugs>	 (03PS2) 10Kosta Harlan: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861)
[10:51:04] <wikibugs>	 (03CR) 10Kosta Harlan: ipoid: Update cronjob definition (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[10:51:06] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] tegola: Configure logger to use json output [deployment-charts] - 10https://gerrit.wikimedia.org/r/966810 (https://phabricator.wikimedia.org/T347717) (owner: 10Jgiannelos)
[10:51:54] <wikibugs>	 (03Merged) 10jenkins-bot: tegola: Configure logger to use json output [deployment-charts] - 10https://gerrit.wikimedia.org/r/966810 (https://phabricator.wikimedia.org/T347717) (owner: 10Jgiannelos)
[10:52:45] <icinga-wm>	 PROBLEM - SSH on stat1009 is CRITICAL: connect to address 10.64.21.17 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:56:52] <wikibugs>	 (03Abandoned) 10Effie Mouzeli: ipoid: Remove APP_CONFIG env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/935720 (owner: 10Alexandros Kosiaris)
[10:58:44] <wikibugs>	 (03PS1) 10Hashar: debug_presentation: script to render HTML templates [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846
[10:59:10] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Set s6 and s8 to write both for pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966592 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup)
[10:59:33] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966592 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup)
[10:59:39] <wikibugs>	 (03PS3) 10Effie Mouzeli: osm: remove imposm-deploy-import [puppet] - 10https://gerrit.wikimedia.org/r/862281
[10:59:55] <wikibugs>	 (03Merged) 10jenkins-bot: Set s6 and s8 to write both for pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966592 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup)
[11:00:01] <wikibugs>	 (03PS1) 10Hashar: tox: remove envdir optimizations [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966847 (https://phabricator.wikimedia.org/T348434)
[11:00:59] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:966592|Set s6 and s8 to write both for pagelinks migration (T345732)]]
[11:01:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] debug_presentation: script to render HTML templates [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 (owner: 10Hashar)
[11:01:11] <stashbot>	 T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732
[11:01:41] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm
[11:02:21] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:966592|Set s6 and s8 to write both for pagelinks migration (T345732)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:03:00] <wikibugs>	 (03PS1) 10Hnowlan: editor-analytics: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/966848 (https://phabricator.wikimedia.org/T336415)
[11:03:38] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:04:39] <wikibugs>	 (03PS1) 10Jgiannelos: tegola: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/966849
[11:05:35] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[11:06:08] <wikibugs>	 (03PS5) 10Cathal Mooney: Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312)
[11:07:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) (owner: 10Cathal Mooney)
[11:08:23] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye
[11:10:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966822 (owner: 10Hashar)
[11:11:02] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] tegola: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/966849 (owner: 10Jgiannelos)
[11:11:09] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:966592|Set s6 and s8 to write both for pagelinks migration (T345732)]] (duration: 10m 10s)
[11:11:13] <stashbot>	 T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732
[11:11:28] <wikibugs>	 (03PS1) 10Hnowlan: trafficserver: route editor-analytics service [puppet] - 10https://gerrit.wikimedia.org/r/966851 (https://phabricator.wikimedia.org/T336415)
[11:12:00] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage
[11:12:09] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] tegola: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/966849 (owner: 10Jgiannelos)
[11:12:59] <wikibugs>	 (03Merged) 10jenkins-bot: tegola: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/966849 (owner: 10Jgiannelos)
[11:13:10] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] editor-analytics: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/966848 (https://phabricator.wikimedia.org/T336415) (owner: 10Hnowlan)
[11:13:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "seems fine to me, just need to run tox -e py3-format" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 (owner: 10Hashar)
[11:14:05] <wikibugs>	 (03Merged) 10jenkins-bot: editor-analytics: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/966848 (https://phabricator.wikimedia.org/T336415) (owner: 10Hnowlan)
[11:14:47] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage
[11:15:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966847 (https://phabricator.wikimedia.org/T348434) (owner: 10Hashar)
[11:16:01] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply
[11:16:03] <wikibugs>	 (03PS1) 10Btullis: Set the class for each of the spark shuffle services [puppet] - 10https://gerrit.wikimedia.org/r/966853 (https://phabricator.wikimedia.org/T344910)
[11:16:15] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply
[11:18:41] <wikibugs>	 (03PS2) 10Jbond: sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966823
[11:18:43] <wikibugs>	 (03PS2) 10Jbond: sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824
[11:19:21] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:20:58] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply
[11:21:13] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply
[11:21:40] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 8 NOOP 20): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/966853 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[11:22:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 (owner: 10Jbond)
[11:23:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 (owner: 10Jbond)
[11:23:54] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply
[11:24:27] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply
[11:27:16] <wikibugs>	 (03PS2) 10Hashar: debug_presentation: script to render HTML templates [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846
[11:27:22] <wikibugs>	 (03CR) 10Hashar: debug_presentation: script to render HTML templates (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 (owner: 10Hashar)
[11:28:57] <wikibugs>	 (03PS3) 10Jbond: sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966823
[11:28:59] <wikibugs>	 (03PS3) 10Jbond: sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824
[11:29:02] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply
[11:29:14] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply
[11:29:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:31:16] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[11:32:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 (owner: 10Jbond)
[11:32:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 (owner: 10Jbond)
[11:33:26] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "I need to set `html.change_id` and `html.job_id`" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 (owner: 10Hashar)
[11:33:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] debug_presentation: script to render HTML templates [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 (owner: 10Hashar)
[11:34:22] <wikibugs>	 (03PS1) 10Volans: locking: delete the key on etcd if no locks remain [software/spicerack] - 10https://gerrit.wikimedia.org/r/966854 (https://phabricator.wikimedia.org/T341973)
[11:34:23] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[11:35:33] <wikibugs>	 (03PS3) 10Hashar: debug_presentation: script to render HTML templates [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846
[11:35:41] <wikibugs>	 (03PS6) 10Cathal Mooney: Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312)
[11:36:36] <wikibugs>	 (03CR) 10Hashar: "The rendering is missing diffs, catalogues and always flag a compilation failure cause the PCC files are missing. But that is a first pass" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 (owner: 10Hashar)
[11:36:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) (owner: 10Cathal Mooney)
[11:38:51] <wikibugs>	 (03PS4) 10Jbond: sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966823
[11:38:53] <wikibugs>	 (03PS4) 10Jbond: sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824
[11:39:35] <wikibugs>	 (03CR) 10Slyngshede: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede)
[11:43:23] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 (owner: 10Jbond)
[11:43:27] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm
[11:44:15] <icinga-wm>	 RECOVERY - SSH on stat1009 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:44:34] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1009.eqiad.wmnet
[11:44:53] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966854 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[11:48:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[11:50:13] <wikibugs>	 (03PS5) 10KartikMistry: Update cxserver to 2023-10-12-080927-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T344982)
[11:51:43] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1009.eqiad.wmnet
[11:55:00] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "thanks for the follow up but see inline the > 0 will exclude the resources == 0 check" [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede)
[11:58:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[12:15:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[12:17:26] <wikibugs>	 (03CR) 10Volans: [C: 03+2] locking: delete the key on etcd if no locks remain [software/spicerack] - 10https://gerrit.wikimedia.org/r/966854 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[12:17:28] <arnaudb>	 !log repool db2161 and db1126
[12:17:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:11] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P53001 and previous config saved to /var/cache/conftool/dbconfig/20231018-121811-arnaudb.json
[12:18:29] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P53002 and previous config saved to /var/cache/conftool/dbconfig/20231018-121828-arnaudb.json
[12:20:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[12:24:48] <wikibugs>	 (03PS3) 10Slyngshede: puppet_agent_failed: label alert with appropriate team. [alerts] - 10https://gerrit.wikimedia.org/r/966821
[12:25:01] <wikibugs>	 (03Merged) 10jenkins-bot: locking: delete the key on etcd if no locks remain [software/spicerack] - 10https://gerrit.wikimedia.org/r/966854 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[12:26:22] <wikibugs>	 (03CR) 10Slyngshede: puppet_agent_failed: label alert with appropriate team. (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede)
[12:31:29] * kart_ deploying cxserver..
[12:31:39] <wikibugs>	 (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-10-12-080927-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T344982) (owner: 10KartikMistry)
[12:32:33] <wikibugs>	 (03Merged) 10jenkins-bot: Update cxserver to 2023-10-12-080927-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T344982) (owner: 10KartikMistry)
[12:33:16] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P53003 and previous config saved to /var/cache/conftool/dbconfig/20231018-123315-arnaudb.json
[12:33:34] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P53004 and previous config saved to /var/cache/conftool/dbconfig/20231018-123333-arnaudb.json
[12:36:56] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) (owner: 10Marostegui)
[12:37:44] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[12:37:54] <wikibugs>	 (03PS2) 10Anzx: dewiktionary: add tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966569 (https://phabricator.wikimedia.org/T348978)
[12:38:08] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[12:38:08] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "Could you do a preliminary check, mostly regarding my solution for only rolling this out on sretest. I'd like to test is a little better b" [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede)
[12:38:44] <wikibugs>	 (03PS2) 10Anzx: knwiktionary: update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966574 (https://phabricator.wikimedia.org/T349036)
[12:39:47] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2109.codfw.wmnet with reason: db2109 downtime while repooling
[12:40:01] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2109.codfw.wmnet with reason: db2109 downtime while repooling
[12:42:03] <wikibugs>	 (03CR) 10Joal: [C: 03+1] Turnilo: wmf_netflow: change forwarded 1/0 to yes/no [puppet] - 10https://gerrit.wikimedia.org/r/966800 (https://phabricator.wikimedia.org/T331707) (owner: 10Ayounsi)
[12:42:21] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye
[12:43:53] <wikibugs>	 (03PS5) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851)
[12:44:23] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[12:44:47] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[12:44:58] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[12:48:09] <wikibugs>	 10SRE, 10SRE-tools, 10DNS, 10Infrastructure-Foundations, and 2 others: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (10ayounsi)
[12:48:21] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P53005 and previous config saved to /var/cache/conftool/dbconfig/20231018-124820-arnaudb.json
[12:48:29] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.0.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/966858
[12:48:38] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P53006 and previous config saved to /var/cache/conftool/dbconfig/20231018-124838-arnaudb.json
[12:48:42] <wikibugs>	 (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v8.0.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/966858 (owner: 10Volans)
[12:49:14] <wikibugs>	 10SRE, 10ops-codfw, 10User-dcaro, 10cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches, ceph) - https://phabricator.wikimedia.org/T346661 (10Papaul) @nskaggs you are correct, even 1 additional rack isn't possible at this time. Sorry about that.
[12:51:36] <kart_>	 Keeping watch on eqiad graphs..
[12:51:42] <jbond>	 !log upload puppet_7.23.0-1~debu11u1 (bullseye backport 
[12:51:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:04] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye
[12:53:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede)
[12:54:28] <wikibugs>	 (03CR) 10Volans: "LGTM, suggested a modification inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 (owner: 10Jbond)
[12:55:17] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[12:56:02] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v8.0.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/966858 (owner: 10Volans)
[12:56:30] <wikibugs>	 (03CR) 10Volans: sre.puppet: move get_puppet_version to sre.puppet (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 (owner: 10Jbond)
[12:57:10] <wikibugs>	 10SRE, 10SRE-tools, 10DNS, 10Infrastructure-Foundations, and 2 others: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (10ayounsi) Let's move all the A/AAAA SVC records to Netbox.  And keep the CNAMEs in the DNS repo if we can't get rid of them. Then have follow up tasks t...
[12:58:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[12:59:08] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[12:59:33] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[12:59:44] <wikibugs>	 (03PS1) 10Volans: Upstream release v8.0.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/966859
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1300).
[13:00:05] <jouncebot>	 kimberly_sarabia, WMDE-Fisch, and aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:06] <wikibugs>	 (03PS4) 10Slyngshede: puppet_agent_failed: label alert with appropriate team. [alerts] - 10https://gerrit.wikimedia.org/r/966821
[13:00:28] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Upstream release v8.0.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/966859 (owner: 10Volans)
[13:00:35] * TheresNoTime can't deploy right now
[13:00:47] <kart_>	 OK. That's not working for cxserver; reverting patch..
[13:00:49] <wikibugs>	 (03CR) 10Slyngshede: puppet_agent_failed: label alert with appropriate team. (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede)
[13:00:49] <WMDE-Fisch>	 o/
[13:01:16] <wikibugs>	 (03PS1) 10KartikMistry: Revert "Update cxserver to 2023-10-12-080927-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/966611
[13:01:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppet_agent_failed: label alert with appropriate team. [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede)
[13:01:23] <wikibugs>	 (03PS3) 10Anzx: dewiktionary: add tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966569 (https://phabricator.wikimedia.org/T348978)
[13:01:56] <WMDE-Fisch>	 ( I can't deploy myself though )
[13:02:21] <wikibugs>	 (03CR) 10KartikMistry: [C: 03+2] Revert "Update cxserver to 2023-10-12-080927-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/966611 (owner: 10KartikMistry)
[13:02:23] <wikibugs>	 (03CR) 10Slyngshede: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede)
[13:02:51] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "Giving myself a +2 because the spark failure is affecting every job on the hadoop-test cluster." [puppet] - 10https://gerrit.wikimedia.org/r/966853 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[13:02:53] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Set the class for each of the spark shuffle services [puppet] - 10https://gerrit.wikimedia.org/r/966853 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[13:03:02] <kimberly_sarabia>	 hello
[13:03:10] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Update cxserver to 2023-10-12-080927-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/966611 (owner: 10KartikMistry)
[13:03:26] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P53007 and previous config saved to /var/cache/conftool/dbconfig/20231018-130325-arnaudb.json
[13:03:43] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P53008 and previous config saved to /var/cache/conftool/dbconfig/20231018-130343-arnaudb.json
[13:04:23] <wikibugs>	 (03CR) 10Jforrester: "Neat!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza)
[13:04:27] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[13:04:33] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Revert "wmfusercontent: add TXT record for cert validation" [dns] - 10https://gerrit.wikimedia.org/r/966243 (owner: 10Ssingh)
[13:04:43] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[13:04:58] <sukhe>	 !log running authdns-update for CR 966243
[13:05:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:07] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[13:05:30] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[13:06:06] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[13:06:29] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[13:06:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: P:wmcs::metricsinfra: add meta monitoring app skeleton (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) (owner: 10Majavah)
[13:06:56] <wikibugs>	 (03Merged) 10jenkins-bot: Upstream release v8.0.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/966859 (owner: 10Volans)
[13:07:34] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[13:07:38] <WMDE-Fisch>	 So nobody able to deploy? :-/
[13:07:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:08:20] <kart_>	 hnowlan: around?
[13:09:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice!" [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede)
[13:09:59] <jinxer-wm>	 (PuppetFailure) firing: (3) Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:10:35] <kart_>	 I deployed and then reverted the cxserver patch, but the revert is not taking effect. What could be the reason? Patch reverted was: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/965022
[13:10:48] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[13:11:32] <kimberly_sarabia>	 Anyone around to deploy?
[13:12:59] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:13:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[13:14:07] <wikibugs>	 (03PS2) 10Cathal Mooney: Add homer automation for management router bgp [homer/public] - 10https://gerrit.wikimedia.org/r/966581 (https://phabricator.wikimedia.org/T312635)
[13:14:23] <volans>	 !log uploaded spicerack_8.0.2 to apt.wikimedia.org bullseye-wikimedia
[13:14:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:34] <wikibugs>	 (03CR) 10Cathal Mooney: Add homer automation for management router bgp (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/966581 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney)
[13:14:52] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Turnilo: wmf_netflow: change forwarded 1/0 to yes/no [puppet] - 10https://gerrit.wikimedia.org/r/966800 (https://phabricator.wikimedia.org/T331707) (owner: 10Ayounsi)
[13:15:42] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Add homer automation for management router bgp [homer/public] - 10https://gerrit.wikimedia.org/r/966581 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney)
[13:16:04] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/966573 (https://phabricator.wikimedia.org/T348335) (owner: 10Ssingh)
[13:16:16] <wikibugs>	 (03PS1) 10Majavah: kubeadm: drop default [puppet] - 10https://gerrit.wikimedia.org/r/966863
[13:16:18] <wikibugs>	 (03PS1) 10Majavah: kubeadm: drop version upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/966864 (https://phabricator.wikimedia.org/T343869)
[13:16:20] <wikibugs>	 (03PS1) 10Majavah: aptrepo: drop k8s 1.22 components [puppet] - 10https://gerrit.wikimedia.org/r/966865 (https://phabricator.wikimedia.org/T298005)
[13:17:25] <wikibugs>	 (03PS1) 10KartikMistry: cxserver: Pin chart to 0.2.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/966866
[13:17:26] <kart_>	 Looks like I have to pin chart to 0.2.2
[13:17:35] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/36/console" [puppet] - 10https://gerrit.wikimedia.org/r/966863 (owner: 10Majavah)
[13:17:43] <wikibugs>	 (03PS1) 10Kosta Harlan: [WIP] ipoid: Set PROXY_HOST and PROXY_PORT [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171)
[13:18:03] <hnowlan>	 kart_: hey, here 
[13:18:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cxserver: Pin chart to 0.2.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/966866 (owner: 10KartikMistry)
[13:19:27] <hnowlan>	 kart_: you could also bump to 0.2.4, might be easier 
[13:19:33] <wikibugs>	 (03PS2) 10Kosta Harlan: [WIP] ipoid: Set PROXY_HOST and PROXY_PORT [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171)
[13:19:35] <hnowlan>	 if it's an emergency I can do a rollback in helm 
[13:19:47] <wikibugs>	 (03PS2) 10KartikMistry: cxserver: Pin chart to 0.2.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/966866
[13:20:11] <wikibugs>	 (03PS3) 10Kosta Harlan: [WIP] ipoid: Set PROXY_HOST and PROXY_PORT [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171)
[13:20:17] <wikibugs>	 (03PS4) 10Kosta Harlan: [WIP] ipoid: Set PROXY_HOST and PROXY_PORT [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171)
[13:20:26] <kart_>	 hnowlan: can you please do emergency revert?
[13:20:35] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "thanks lgtm" [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede)
[13:20:39] <wikibugs>	 (03CR) 10Kosta Harlan: [WIP] ipoid: Set PROXY_HOST and PROXY_PORT (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: 10Kosta Harlan)
[13:21:37] <hnowlan>	 kart_: ack, will do 
[13:21:58] <kart_>	 hnowlan: I just broke cxserver again :/
[13:22:44] <wikibugs>	 (03CR) 10Herron: [C: 03+1] thanos: allow thanos-rule to serve /rule [puppet] - 10https://gerrit.wikimedia.org/r/966818 (https://phabricator.wikimedia.org/T349102) (owner: 10Filippo Giunchedi)
[13:23:14] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.dhcp for host sretest1001.eqiad.wmnet
[13:23:35] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest1001.eqiad.wmnet
[13:23:48] <hnowlan>	 kart_: there was a deploy done at 12:44 and 13:05 today - do you want me to roll back to 12:44 or to the previous one on Wed Oct 11 11:57:42 2023 which uses the 0.2.2 chart? 
[13:24:52] <kart_>	 hnowlan: one with 0.2.2 chart 
[13:25:11] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] osm: remove imposm-deploy-import [puppet] - 10https://gerrit.wikimedia.org/r/862281 (owner: 10Effie Mouzeli)
[13:25:32] <wikibugs>	 (03CR) 10Herron: [C: 03+1] thanos: reverse-proxy /rule to rule-hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966819 (https://phabricator.wikimedia.org/T349102) (owner: 10Filippo Giunchedi)
[13:25:57] <hnowlan>	 kart_: rollback done 
[13:27:08] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Update evaluators to WASM images [deployment-charts] - 10https://gerrit.wikimedia.org/r/966868 (https://phabricator.wikimedia.org/T343829)
[13:27:17] <kart_>	 hnowlan: Thanks a lot!
[13:27:33] <kart_>	 hnowlan: I should have rollback access, right? 
[13:28:43] <hnowlan>	 kart_: for emergency rollback you'd need sudo on the deploy hosts 
[13:28:51] <wikibugs>	 (03Merged) 10jenkins-bot: Add homer automation for management router bgp [homer/public] - 10https://gerrit.wikimedia.org/r/966581 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney)
[13:28:59] <kart_>	 hnowlan: Can you revisit the patch, https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/965022 -- what's wrong with it? Also, do you know if any other services have done similar work?
[13:29:05] <kart_>	 hnowlan: noted.
[13:29:11] <hnowlan>	 generally we'd encourage using helmfile for rollbacks where it isn't a critical situation 
[13:29:21] <hnowlan>	 I actually didn't look at the service, what failed? 
[13:30:02] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10fgiunchedi)
[13:30:06] <wikibugs>	 10Puppet, 10SRE, 10Observability-Alerting, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10fgiunchedi) 05Open→03Declined
[13:30:08] <kart_>	 hnowlan: Page loading from RestBase (for example, https://cxserver.wikimedia.org/v2/page/es/it/Mariana_BO). So, cxserver won't load the page and thus fails.
[13:31:13] <kart_>	 Configuration seems wrong for sure.
[13:33:17] <hnowlan>	 hard to know without more logging in the service really. is there debug logging we could turn on in staging? 
[13:33:46] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "see comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede)
[13:34:47] <kart_>	 I'll come up with some ideas later but I need to step out for dinner now. Let's talk on the patch once I submit a new patch, if that's fine?
[13:34:52] <hnowlan>	 generally I would test a change like this in staging before rolling to eqiad/codfw by doing something along the lines of `curl -vk  https://staging.svc.eqiad.wmnet:4002/v2/page/es/it/Mariana_BO` 
[13:35:04] <wikibugs>	 (03CR) 10Xcollazo: [C: 03+1] Set the class for each of the spark shuffle services [puppet] - 10https://gerrit.wikimedia.org/r/966853 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[13:35:19] <hnowlan>	 assuming that path/port combo are correct and `Page es:Mariana_BO could not be found.` is the error you were seeing 
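
The staging check hnowlan describes could be scripted roughly as below. This is only a sketch: the host, port and path are copied from the message above, and hnowlan himself flags them as unverified. The helper names `page_url` and `looks_broken` are hypothetical and not part of cxserver or any Wikimedia tooling.

```python
from urllib.parse import quote

# Base URL as quoted in the log; the path/port combination is an assumption.
STAGING_BASE = "https://staging.svc.eqiad.wmnet:4002"
# Fragment of the error text quoted in the log.
ERROR_MARKER = "could not be found"


def page_url(source_lang, target_lang, title, base=STAGING_BASE):
    """Build the cxserver /v2/page URL for one title (hypothetical helper)."""
    return f"{base}/v2/page/{source_lang}/{target_lang}/{quote(title, safe='')}"


def looks_broken(status, body):
    """Treat a non-200 status or the quoted error text as a failed check."""
    return status != 200 or ERROR_MARKER in body
```

Fetching `page_url("es", "it", "Mariana_BO")` from a host that can reach staging and passing the status and body to `looks_broken` would flag this failure before the change is rolled to eqiad/codfw.
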
[13:36:09] <kart_>	 Yes
[13:36:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] debug_presentation: script to render HTML templates (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 (owner: 10Hashar)
[13:38:25] <wikibugs>	 10SRE, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10Data-Platform-SRE, and 4 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (10dcausse)
[13:38:33] <wikibugs>	 (03PS1) 10Majavah: openstack: encapi: don't try to close the connection [puppet] - 10https://gerrit.wikimedia.org/r/966871 (https://phabricator.wikimedia.org/T349195)
[13:38:53] <hnowlan>	 kart_: the URI path you're using to access the mwapi rest.php is incorrect - you need to specify the host header in your request and remove it from the URI (for example - `curl -H "Host: es.wikipedia.org" localhost:6500/w/rest.php/v1/page/Mariana_BO`) 
[13:38:56] <wikibugs>	 (03CR) 10Jbond: sre.puppet: move get_puppet_version to sre.puppet (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 (owner: 10Jbond)
[13:39:13] <wikibugs>	 (03PS1) 10Cathal Mooney: Remove superflous brackets from bgp templates [homer/public] - 10https://gerrit.wikimedia.org/r/966872
[13:39:16] <hnowlan>	 that's probably what breaks. it'd be nice to have some kind of logging to make that more visible within the application though 
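
The request shape hnowlan suggests, with the wiki selected by the Host header rather than by the URI, might look like this in Python. It is a sketch under assumptions: the `http://` scheme is a guess (the curl example gives none), `localhost:6500` and the `/w/rest.php/v1/page/` path are taken from his message, and `rest_page_request` is a hypothetical helper name.

```python
import urllib.request


def rest_page_request(wiki_host, title, proxy="http://localhost:6500"):
    """Build a request for the local proxy with the wiki named in the Host header.

    The wiki hostname is deliberately kept out of the URL path.
    """
    url = f"{proxy}/w/rest.php/v1/page/{title}"
    return urllib.request.Request(url, headers={"Host": wiki_host})
```

`rest_page_request("es.wikipedia.org", "Mariana_BO")` returns a request object that can be passed to `urllib.request.urlopen` from a host where the local proxy is listening.
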
[13:39:58] <kart_>	 Noted.
[13:40:00] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Remove superflous brackets from bgp templates [homer/public] - 10https://gerrit.wikimedia.org/r/966872 (owner: 10Cathal Mooney)
[13:40:38] <wikibugs>	 (03Merged) 10jenkins-bot: Remove superflous brackets from bgp templates [homer/public] - 10https://gerrit.wikimedia.org/r/966872 (owner: 10Cathal Mooney)
[13:41:48] <kart_>	 hnowlan: Can you please also comment on the patch? IRC logs are easier to lose :)
[13:41:58] <hnowlan>	 ack
[13:42:15] <wikibugs>	 (03PS7) 10Slyngshede: C:prometheus::ethtool_export Add ethtool exporter. [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312)
[13:43:08] <wikibugs>	 (03CR) 10Hnowlan: Update cxserver to 2023-10-12-080927-production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T344982) (owner: 10KartikMistry)
[13:44:17] <kart_>	 Thanks!
[13:44:53] <wikibugs>	 10SRE-swift-storage, 10serviceops, 10Patch-For-Review: thanos-be hosts filing up root filesystem with logs - https://phabricator.wikimedia.org/T297959 (10fgiunchedi)
[13:45:27] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] haproxy: enable healthcheck-dedicated backend (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur)
[13:50:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/966865 (https://phabricator.wikimedia.org/T298005) (owner: 10Majavah)
[13:50:51] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] install_server: create aqs reuse partition reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/965767 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans)
[13:51:07] <wikibugs>	 (03PS4) 10Eevans: install_server: create aqs partition reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/965767 (https://phabricator.wikimedia.org/T347738)
[13:51:39] <wikibugs>	 (03PS5) 10Eevans: install_server: create aqs partition reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/965767 (https://phabricator.wikimedia.org/T347738)
[13:52:57] <wikibugs>	 (03PS5) 10Jbond: sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966823
[13:52:59] <wikibugs>	 (03PS5) 10Jbond: sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824
[13:53:19] <wikibugs>	 (03CR) 10Jbond: "updated thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 (owner: 10Jbond)
[13:53:36] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] install_server: create aqs partition reuse recipe (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/965767 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans)
[13:54:30] <wikibugs>	 (03PS1) 10Volans: spicerack: enable distributed locking in production [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973)
[13:56:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] spicerack: enable distributed locking in production [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[13:57:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 (owner: 10Jbond)
[13:58:25] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye
[13:59:05] <wikibugs>	 (03Abandoned) 10KartikMistry: cxserver: Pin chart to 0.2.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/966866 (owner: 10KartikMistry)
[14:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1400)
[14:01:18] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] svc records: add missing comments for reserved IPs [dns] - 10https://gerrit.wikimedia.org/r/965119 (owner: 10Volans)
[14:02:15] <wikibugs>	 (03PS2) 10Volans: spicerack: enable distributed locking in production [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973)
[14:03:11] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1108.eqiad.wmnet with OS bullseye
[14:03:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye
[14:03:27] <icinga-wm>	 PROBLEM - thanos.wikimedia.org tls expiry on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[14:03:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:03:48] <wikibugs>	 (03PS6) 10Jbond: sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824
[14:03:56] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 (owner: 10Jbond)
[14:04:13] <icinga-wm>	 PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:04:21] <icinga-wm>	 PROBLEM - thanos.wikimedia.org requires authentication on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[14:04:37] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:07:04] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:07:14] <wikibugs>	 (03PS1) 10Slavina Stefanova: harbor: upgrade from 2.5 to 2.9 [puppet] - 10https://gerrit.wikimedia.org/r/966874 (https://phabricator.wikimedia.org/T346241)
[14:08:27] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1058 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:08:36] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:10:13] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "headers and content seem similar between the two endpoints" [puppet] - 10https://gerrit.wikimedia.org/r/966851 (https://phabricator.wikimedia.org/T336415) (owner: 10Hnowlan)
[14:10:22] <wikibugs>	 10SRE, 10Fundraising-Backlog, 10SRE Observability: Simplify and fix icinga fr-tech user configuration - https://phabricator.wikimedia.org/T348559 (10lmata) Will radar for now; please let us know if you'd like us to engage somehow.
[14:12:11] <wikibugs>	 (03PS2) 10Effie Mouzeli: P:memcached::memkeys: move templates under profile/ [puppet] - 10https://gerrit.wikimedia.org/r/955701 (owner: 10Majavah)
[14:12:58] <wikibugs>	 (03Abandoned) 10Effie Mouzeli: Revert "tegola-vector-tiles: use tegola image with debug enabled on codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/951850 (owner: 10Effie Mouzeli)
[14:13:38] <wikibugs>	 (03CR) 10Jbond: "ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 (owner: 10Jbond)
[14:18:55] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1114']
[14:20:07] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[14:20:25] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1108.eqiad.wmnet with reason: host reimage
[14:21:25] <wikibugs>	 (03PS1) 10Jbond: sre.puppet.renew-cert: don't disable puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966877
[14:22:48] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 (owner: 10Jbond)
[14:23:29] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/966877 (owner: 10Jbond)
[14:23:31] <wikibugs>	 (03CR) 10David Caro: "LGTM, to test this in toolsbeta, you have to ssh to the puppetmaster there:" [puppet] - 10https://gerrit.wikimedia.org/r/966874 (https://phabricator.wikimedia.org/T346241) (owner: 10Slavina Stefanova)
[14:23:35] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1108.eqiad.wmnet with reason: host reimage
[14:25:21] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1114']
[14:25:26] <wikibugs>	 (03PS3) 10Volans: spicerack: enable distributed locking in production [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973)
[14:25:42] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] trafficserver: route editor-analytics service [puppet] - 10https://gerrit.wikimedia.org/r/966851 (https://phabricator.wikimedia.org/T336415) (owner: 10Hnowlan)
[14:25:44] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[14:25:50] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) @dcaro    so i submitted the logs and here is Dells Response.    The only errors showing in the System Event Log (SEL) ar...
[14:26:12] <wikibugs>	 (03PS6) 10Volans: svc records: add missing comments for reserved IPs [dns] - 10https://gerrit.wikimedia.org/r/965119
[14:27:43] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm, nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[14:27:48] <wikibugs>	 (03CR) 10Volans: [C: 03+2] svc records: add missing comments for reserved IPs (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/965119 (owner: 10Volans)
[14:27:50] <wikibugs>	 (03PS1) 10Hashar: Use macros for links to Gerrit and Jenkins [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966878
[14:27:52] <wikibugs>	 (03PS1) 10Hashar: Add style to HTML output [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966879
[14:28:19] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 (owner: 10Jbond)
[14:28:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 (owner: 10Jbond)
[14:28:23] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.puppet.renew-cert: don't disable puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966877 (owner: 10Jbond)
[14:28:40] <wikibugs>	 (03CR) 10Hashar: "Screenshot: https://phabricator.wikimedia.org/F38608194" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966879 (owner: 10Hashar)
[14:29:07] <wikibugs>	 (03CR) 10Hashar: "That is not really needed, but I felt we could avoid repetition :)" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966878 (owner: 10Hashar)
[14:31:46] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1114.eqiad.wmnet with OS bullseye
[14:31:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye
[14:32:24] <wikibugs>	 (03PS4) 10Volans: spicerack: enable distributed locking in production [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973)
[14:32:38] <wikibugs>	 (03Merged) 10jenkins-bot: sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 (owner: 10Jbond)
[14:32:50] <wikibugs>	 (03CR) 10Volans: "addressed comments" [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[14:32:54] <wikibugs>	 (03Merged) 10jenkins-bot: sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 (owner: 10Jbond)
[14:32:56] <wikibugs>	 (03Merged) 10jenkins-bot: sre.puppet.renew-cert: don't disable puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966877 (owner: 10Jbond)
[14:33:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[14:34:17] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[14:34:23] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10dcaro) just in case you need it: > What OS are you running on the server?  Debian Bullseye (11): 5.10.0-19-amd64 #1 SMP Debian 5.10.1...
[14:34:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] spicerack: enable distributed locking in production [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[14:36:20] <wikibugs>	 10SRE, 10Fundraising-Backlog, 10SRE Observability, 10fundraising-tech-ops: Simplify and fix icinga fr-tech user configuration - https://phabricator.wikimedia.org/T348559 (10Jgreen)
[14:36:22] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1051 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[14:38:36] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:40:11] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[14:41:13] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966878 (owner: 10Hashar)
[14:42:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM, can you also add a changelog entry for all these changes, either as a new CR or included in this one" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966879 (owner: 10Hashar)
[14:44:28] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1010.eqiad.wmnet with OS bullseye
[14:46:22] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1114.eqiad.wmnet with reason: host reimage
[14:49:24] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1114.eqiad.wmnet with reason: host reimage
[14:51:55] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[14:51:56] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1108.eqiad.wmnet with OS bullseye
[14:52:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye completed: - cp1108 (**PASS**)   - Removed from Puppet...
[14:53:37] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:54:12] <wikibugs>	 (03PS1) 10Herron: graphite-web: switch logrotate to copytruncate [puppet] - 10https://gerrit.wikimedia.org/r/966881
[14:54:21] <wikibugs>	 (03PS2) 10Herron: graphite-web: switch logrotate to copytruncate [puppet] - 10https://gerrit.wikimedia.org/r/966881
[14:56:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10Jclark-ctr)
[14:56:12] <wikibugs>	 (03PS1) 10Elukey: install_server: fix reuse-parts-test.cfg [puppet] - 10https://gerrit.wikimedia.org/r/966882
[14:56:24] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1110']
[14:56:32] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1111']
[14:56:49] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1111']
[14:57:01] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1111']
[14:57:06] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1111']
[14:57:09] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:57:12] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] install_server: fix reuse-parts-test.cfg [puppet] - 10https://gerrit.wikimedia.org/r/966882 (owner: 10Elukey)
[14:57:20] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.dhcp for host sretest1001.eqiad.wmnet
[14:57:33] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest1001.eqiad.wmnet
[14:57:50] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1111
[14:58:06] <elukey>	 !log powercycle titan1001 (no mgmt console / tty available, no host metrics, no ssh)
[14:58:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:20] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] install_server: fix reuse-parts-test.cfg [puppet] - 10https://gerrit.wikimedia.org/r/966882 (owner: 10Elukey)
[14:58:55] <wikibugs>	 (03CR) 10Herron: "please lmk what if any tasks should be attached" [puppet] - 10https://gerrit.wikimedia.org/r/966881 (owner: 10Herron)
[14:58:58] <wikibugs>	 (03PS2) 10Hashar: tox: remove envdir optimizations [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966847 (https://phabricator.wikimedia.org/T348434)
[14:59:00] <wikibugs>	 (03PS2) 10Hashar: debug_host: rename mangecode to managecode (typo) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966822
[14:59:01] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1111
[14:59:02] <wikibugs>	 (03PS4) 10Hashar: debug_presentation: script to render HTML templates [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846
[14:59:05] <wikibugs>	 (03PS2) 10Hashar: Use macros for links to Gerrit and Jenkins [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966878
[14:59:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1111']
[14:59:07] <wikibugs>	 (03PS2) 10Hashar: Add style to HTML output [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966879
[14:59:10] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1111']
[14:59:16] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1111']
[14:59:31] <logmsgbot>	 !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1010.eqiad.wmnet with OS bullseye
[14:59:38] <wikibugs>	 (03CR) 10Hashar: "Rebased in order to add a CHANGELOG entry and avoid conflicting with another series of patches. They are now all in a single series." [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966847 (https://phabricator.wikimedia.org/T348434) (owner: 10Hashar)
[14:59:39] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1111']
[15:00:03] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[15:00:49] <wikibugs>	 (03CR) 10Hashar: "I have amended the whole series to have each change add an entry in CHANGELOG. Based on this last change:" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966879 (owner: 10Hashar)
[15:01:25] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:01:33] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1111']
[15:01:35] <icinga-wm>	 RECOVERY - thanos.wikimedia.org tls expiry on titan1001 is OK: OK - Certificate thanos-query.discovery.wmnet will expire on Fri 03 Nov 2023 08:51:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[15:01:36] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1111']
[15:01:49] <icinga-wm>	 RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:01:53] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:02:07] <icinga-wm>	 RECOVERY - thanos.wikimedia.org requires authentication on titan1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[15:02:09] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1111']
[15:02:17] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1010.eqiad.wmnet with OS bullseye
[15:02:45] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1111']
[15:03:09] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1111.eqiad.wmnet with OS bullseye
[15:03:15] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye
[15:03:27] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:03:37] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:03:56] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1110']
[15:04:55] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1110.eqiad.wmnet with OS bullseye
[15:05:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye
[15:06:02] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:06:32] <wikibugs>	 (03CR) 10Effie Mouzeli: "This will not be needed as we have defined proxies in values.yaml for eqiad and codfw. Cronjobs should inherit those vars. I will get back" [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: 10Kosta Harlan)
[15:07:11] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:07:11] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1114.eqiad.wmnet with OS bullseye
[15:07:18] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye completed: - cp1114 (**PASS**)   - Removed from Puppet...
[15:08:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10Jclark-ctr)
[15:08:09] <wikibugs>	 (03PS1) 10Ebernhardson: Switch articletopic over to outlink topic prediction [deployment-charts] - 10https://gerrit.wikimedia.org/r/966883
[15:09:01] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1107']
[15:09:34] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1107']
[15:10:05] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1107.eqiad.wmnet with OS bullseye
[15:10:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye
[15:12:46] <logmsgbot>	 !log dancy@deploy2002 Started deploy [releng/jenkins-deploy@2cf7af2] (releasing): (no justification provided)
[15:13:03] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner1003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:13:31] <logmsgbot>	 !log dancy@deploy2002 Finished deploy [releng/jenkins-deploy@2cf7af2] (releasing): (no justification provided) (duration: 00m 44s)
[15:14:03] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner1002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:57] <wikibugs>	 (03PS2) 10Ebernhardson: cirrus updater: Switch articletopic over to outlink topic prediction [deployment-charts] - 10https://gerrit.wikimedia.org/r/966883
[15:14:59] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus updater: disable jemalloc and increase task manager mem limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/966884
[15:15:38] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] cirrus updater: disable jemalloc and increase task manager mem limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/966884 (owner: 10Ebernhardson)
[15:15:43] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:16:43] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:17:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[15:18:54] <wikibugs>	 (03PS1) 10Hnowlan: trafficserver: route all requests for /api/rest_v1/metrics to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/966885 (https://phabricator.wikimedia.org/T336385)
[15:19:31] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1106']
[15:20:18] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1111.eqiad.wmnet with reason: host reimage
[15:21:12] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Switch articletopic over to outlink topic prediction [deployment-charts] - 10https://gerrit.wikimedia.org/r/966883 (owner: 10Ebernhardson)
[15:21:37] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: disable jemalloc and increase task manager mem limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/966884 (owner: 10Ebernhardson)
[15:21:43] <wikibugs>	 (03PS1) 10Volans: documentation: expand distributed locking docs [software/spicerack] - 10https://gerrit.wikimedia.org/r/966886 (https://phabricator.wikimedia.org/T341973)
[15:22:06] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus updater: Switch articletopic over to outlink topic prediction [deployment-charts] - 10https://gerrit.wikimedia.org/r/966883 (owner: 10Ebernhardson)
[15:22:28] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus updater: disable jemalloc and increase task manager mem limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/966884 (owner: 10Ebernhardson)
[15:22:43] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] slo_definitions: Switch to using varnish_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/965842 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall)
[15:23:08] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] debug_presentation: script to render HTML templates (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 (owner: 10Hashar)
[15:23:28] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1111.eqiad.wmnet with reason: host reimage
[15:23:35] <wikibugs>	 (03PS5) 10BCornwall: slo_definitions: Switch to using varnish_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/965842 (https://phabricator.wikimedia.org/T341606)
[15:25:47] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1106']
[15:26:16] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1106.eqiad.wmnet with OS bullseye
[15:26:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1106.eqiad.wmnet with OS bullseye
[15:27:49] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+2] slo_definitions: Switch to using varnish_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/965842 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall)
[15:27:51] <wikibugs>	 (03CR) 10BCornwall: [V: 03+2 C: 03+2] slo_definitions: Switch to using varnish_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/965842 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall)
[15:28:17] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1107.eqiad.wmnet with reason: host reimage
[15:28:21] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[15:28:34] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:28:50] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1105']
[15:29:14] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1105']
[15:29:42] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1105.eqiad.wmnet with OS bullseye
[15:29:49] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1105.eqiad.wmnet with OS bullseye
[15:32:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[15:32:41] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1107.eqiad.wmnet with reason: host reimage
[15:33:05] <wikibugs>	 (03PS1) 10Jdlrobson: Fix Typo in OS Dark Mode field [extensions/WikimediaEvents] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966615 (https://phabricator.wikimedia.org/T346106)
[15:33:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10nskaggs) Can someone provide an update on what's happening with these machines? Were they indeed sent back? Do we have replacement hardware?
[15:36:39] <wikibugs>	 (03PS1) 10Vgutierrez: ssl: Add digicert-2023 unified public certificates [puppet] - 10https://gerrit.wikimedia.org/r/966887 (https://phabricator.wikimedia.org/T341119)
[15:40:13] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:40:53] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1106.eqiad.wmnet with reason: host reimage
[15:41:52] <kimberly_sarabia>	 brennen: Just left a comment on https://phabricator.wikimedia.org/T348354 that this patch https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/966615 needs to go out with the next train to avoid a spike in EventGate schema validation errors
[15:43:09] <wikibugs>	 (03PS2) 10Hnowlan: Add script for automating joining a single node to the cluster [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/829807 (https://phabricator.wikimedia.org/T309619)
[15:43:34] <inflatador>	 !log bking@deploy2002 destroy dse-k8s-services instance of rdf-streaming-updater T349095
[15:43:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:38] <stashbot>	 T349095: Migrate staging rdf-streaming-updater to flink operator - https://phabricator.wikimedia.org/T349095
[15:44:05] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1106.eqiad.wmnet with reason: host reimage
[15:44:22] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.193 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:44:49] <wikibugs>	 (03Abandoned) 10Hnowlan: service::deploy::gitclone: don't append deploy to repo [puppet] - 10https://gerrit.wikimedia.org/r/677620 (owner: 10Hnowlan)
[15:45:28] <wikibugs>	 (03PS5) 10Hnowlan: wip: upgrade container and dependencies for bullseye [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881)
[15:45:48] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Cabling for Eqiad racks E5-8 and F5-8 - https://phabricator.wikimedia.org/T334231 (10Jclark-ctr) Unsure if port is turned off or if fs dell optics are not compatible. I put loopback on optic in dell switch and link did not come up
[15:46:06] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:46:06] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10taavi) @jclark-ctr these need a single NIC connected to the `cloud-hosts` as the primary VLAN, and `cloud-instances` and `cloud-private` VLANs trunked (we...
[15:46:07] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1111.eqiad.wmnet with OS bullseye
[15:46:16] <wikibugs>	 (03PS1) 10Btullis: Partial fix for multiple spark shufflers [puppet] - 10https://gerrit.wikimedia.org/r/966889 (https://phabricator.wikimedia.org/T344910)
[15:46:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye completed: - cp1111 (**PASS**)   - Removed from Puppet...
[15:46:48] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1105.eqiad.wmnet with reason: host reimage
[15:47:25] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1010.eqiad.wmnet with OS bullseye
[15:47:49] <wikibugs>	 (03CR) 10Slavina Stefanova: harbor: upgrade from 2.5 to 2.9 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966874 (https://phabricator.wikimedia.org/T346241) (owner: 10Slavina Stefanova)
[15:48:31] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10Jclark-ctr)
[15:49:30] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1104']
[15:49:49] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1105.eqiad.wmnet with reason: host reimage
[15:49:52] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:49:57] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1104']
[15:50:07] <wikibugs>	 (03PS2) 10Vgutierrez: base,ssl: Add digicert-2023 unified public certs and RSA intermediate [puppet] - 10https://gerrit.wikimedia.org/r/966887 (https://phabricator.wikimedia.org/T341119)
[15:50:15] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1010.eqiad.wmnet with OS bullseye
[15:50:27] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye
[15:50:33] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye
[15:50:51] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[15:50:53] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1107.eqiad.wmnet with OS bullseye
[15:51:00] <brennen>	 kimberly_sarabia: having a look
[15:51:00] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye completed: - cp1107 (**WARN**)   - Downtimed on Icinga/...
[15:51:34] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10Jclark-ctr)
[15:51:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1102']
[15:51:45] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 20 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/966889 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[15:51:48] <wikibugs>	 (03PS1) 10Gmodena: mw-page-content-change-enrich: version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/966890
[15:51:54] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1102']
[15:52:05] <wikibugs>	 (03CR) 10Kosta Harlan: [WIP] ipoid: Set PROXY_HOST and PROXY_PORT (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: 10Kosta Harlan)
[15:52:13] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 04-2] [WIP] ipoid: Set PROXY_HOST and PROXY_PORT [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: 10Kosta Harlan)
[15:52:54] <wikibugs>	 (03CR) 10Vgutierrez: base,ssl: Add digicert-2023 unified public certs and RSA intermediate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966887 (https://phabricator.wikimedia.org/T341119) (owner: 10Vgutierrez)
[15:52:59] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[15:53:09] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:53:22] <wikibugs>	 (03PS2) 10Gmodena: mw-page-content-change-enrich: version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/966890 (https://phabricator.wikimedia.org/T345805)
[15:53:59] <brennen>	 kimberly_sarabia:  right on, i'll do a backport before train moves forward.
[15:54:11] <kimberly_sarabia>	 brennen: thank you!
[15:54:14] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] base,ssl: Add digicert-2023 unified public certs and RSA intermediate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966887 (https://phabricator.wikimedia.org/T341119) (owner: 10Vgutierrez)
[15:54:20] <brennen>	 sure thing.
[15:55:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wip: upgrade container and dependencies for bullseye [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881) (owner: 10Hnowlan)
[15:55:16] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] base,ssl: Add digicert-2023 unified public certs and RSA intermediate [puppet] - 10https://gerrit.wikimedia.org/r/966887 (https://phabricator.wikimedia.org/T341119) (owner: 10Vgutierrez)
[15:55:54] <wikibugs>	 (03CR) 10TChin: [C: 03+1] mw-page-content-change-enrich: version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/966890 (https://phabricator.wikimedia.org/T345805) (owner: 10Gmodena)
[15:57:26] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1110.eqiad.wmnet with OS bullseye
[15:57:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye executed with errors: - cp1110 (**FAIL**)   - Removed f...
[15:57:55] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: deploy nllb in llm namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163)
[15:58:07] <wikibugs>	 (03PS1) 10Btullis: Change the first spark shuffler service to use the default port [puppet] - 10https://gerrit.wikimedia.org/r/966892 (https://phabricator.wikimedia.org/T344910)
[15:59:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[15:59:55] <wikibugs>	 (03PS1) 10Vgutierrez: hieradata: Deploy digicert-2023 unified cert [puppet] - 10https://gerrit.wikimedia.org/r/966893 (https://phabricator.wikimedia.org/T341119)
[16:00:32] <wikibugs>	 (03PS5) 10Volans: spicerack: enable distributed locking in production [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973)
[16:00:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[16:01:32] <wikibugs>	 (03PS2) 10Vgutierrez: hieradata: Deploy digicert-2023 unified cert [puppet] - 10https://gerrit.wikimedia.org/r/966893 (https://phabricator.wikimedia.org/T341119)
[16:02:24] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[16:02:24] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1106.eqiad.wmnet with OS bullseye
[16:02:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1106.eqiad.wmnet with OS bullseye completed: - cp1106 (**PASS**)   - Removed from Puppet...
[16:02:39] <wikibugs>	 (03CR) 10Elukey: ml-services: deploy nllb in llm namespace (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163) (owner: 10Ilias Sarantopoulos)
[16:02:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] spicerack: enable distributed locking in production [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[16:04:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[16:04:50] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1102']
[16:05:09] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 20 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/966892 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[16:05:14] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1102']
[16:05:17] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1103.eqiad.wmnet with reason: host reimage
[16:05:41] <wikibugs>	 (03PS1) 10Vgutierrez: ssl: Add dummy digicert-2023 unified keys [labs/private] - 10https://gerrit.wikimedia.org/r/966894 (https://phabricator.wikimedia.org/T341119)
[16:06:25] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1102
[16:06:26] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[16:06:28] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] ssl: Add dummy digicert-2023 unified keys [labs/private] - 10https://gerrit.wikimedia.org/r/966894 (https://phabricator.wikimedia.org/T341119) (owner: 10Vgutierrez)
[16:07:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10Jclark-ctr)
[16:07:27] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[16:07:28] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1105.eqiad.wmnet with OS bullseye
[16:07:29] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1102
[16:07:33] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1105.eqiad.wmnet with OS bullseye completed: - cp1105 (**PASS**)   - Removed from Puppet...
[16:07:45] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1102']
[16:07:57] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1102']
[16:08:03] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1010.eqiad.wmnet with reason: host reimage
[16:08:06] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/43/cons" [puppet] - 10https://gerrit.wikimedia.org/r/966893 (https://phabricator.wikimedia.org/T341119) (owner: 10Vgutierrez)
[16:08:23] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1103.eqiad.wmnet with reason: host reimage
[16:08:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10Jclark-ctr)
[16:08:33] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 04-2] [WIP] ipoid: Set PROXY_HOST and PROXY_PORT (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: 10Kosta Harlan)
[16:08:42] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1110']
[16:09:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[16:10:12] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1110']
[16:10:16] <wikibugs>	 (03CR) 10Xcollazo: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/966889 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[16:11:02] <wikibugs>	 (03PS2) 10Kosta Harlan: labs: Enable ReportIncident on all beta wikis except loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965100 (https://phabricator.wikimedia.org/T346018)
[16:11:08] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1010.eqiad.wmnet with reason: host reimage
[16:11:25] <James_F>	 jouncebot: nowandnext
[16:11:25] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 48 minute(s)
[16:11:25] <jouncebot>	 In 0 hour(s) and 48 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1700)
[16:11:28] <wikibugs>	 (03CR) 10Xcollazo: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/966892 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[16:11:38] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Partial fix for multiple spark shufflers [puppet] - 10https://gerrit.wikimedia.org/r/966889 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[16:11:48] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Change the first spark shuffler service to use the default port [puppet] - 10https://gerrit.wikimedia.org/r/966892 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[16:11:53] <wikibugs>	 (03PS2) 10Jforrester: wikifunctions: Update evaluators to WASM images [deployment-charts] - 10https://gerrit.wikimedia.org/r/966868 (https://phabricator.wikimedia.org/T343829)
[16:13:06] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1102
[16:13:21] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] wikifunctions: Update evaluators to WASM images [deployment-charts] - 10https://gerrit.wikimedia.org/r/966868 (https://phabricator.wikimedia.org/T343829) (owner: 10Jforrester)
[16:14:09] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Update evaluators to WASM images [deployment-charts] - 10https://gerrit.wikimedia.org/r/966868 (https://phabricator.wikimedia.org/T343829) (owner: 10Jforrester)
[16:14:23] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1102
[16:14:29] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1110']
[16:14:34] <logmsgbot>	 !log jclark@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp1110']
[16:14:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1102']
[16:14:45] <wikibugs>	 (03PS6) 10Jbond: spicerack: enable distributed locking in production [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[16:15:03] <wikibugs>	 (03CR) 10Jbond: "hopefully that fixes it" [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[16:15:29] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[16:15:47] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[16:16:37] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[16:17:01] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[16:17:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] spicerack: enable distributed locking in production [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[16:17:34] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt cp1102 - jclark@cumin1001"
[16:17:51] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[16:17:53] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[16:18:23] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt cp1102 - jclark@cumin1001"
[16:18:24] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:18:32] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1102']
[16:18:42] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[16:18:45] <wikibugs>	 (03PS3) 10Vgutierrez: hieradata: Deploy digicert-2023 unified cert [puppet] - 10https://gerrit.wikimedia.org/r/966893 (https://phabricator.wikimedia.org/T341119)
[16:18:54] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1110']
[16:19:15] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1110']
[16:19:47] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1110.eqiad.wmnet with OS bullseye
[16:19:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye
[16:20:21] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1101']
[16:20:23] <wikibugs>	 (03PS3) 10Hashar: debug_host: rename mangecode to managecode (typo) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966822
[16:20:25] <wikibugs>	 (03PS5) 10Hashar: debug_presentation: script to render HTML templates [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846
[16:20:27] <wikibugs>	 (03PS3) 10Hashar: Use macros for links to Gerrit and Jenkins [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966878
[16:20:29] <wikibugs>	 (03PS3) 10Hashar: Add style to HTML output [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966879
[16:20:31] <wikibugs>	 (03PS1) 10Hashar: tox: add commands to allowlist_externals [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966895
[16:20:33] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] hieradata: Deploy digicert-2023 unified cert [puppet] - 10https://gerrit.wikimedia.org/r/966893 (https://phabricator.wikimedia.org/T341119) (owner: 10Vgutierrez)
[16:20:40] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1101']
[16:20:49] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1100']
[16:21:51] <wikibugs>	 (03CR) 10Hashar: debug_presentation: script to render HTML templates (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 (owner: 10Hashar)
[16:21:59] <wikibugs>	 (03CR) 10TChin: [C: 03+2] mw-page-content-change-enrich: version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/966890 (https://phabricator.wikimedia.org/T345805) (owner: 10Gmodena)
[16:22:27] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1100']
[16:22:47] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1100.eqiad.wmnet with OS bullseye
[16:22:54] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1100.eqiad.wmnet with OS bullseye
[16:23:18] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1101.eqiad.wmnet with OS bullseye
[16:23:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1101.eqiad.wmnet with OS bullseye
[16:23:28] <wikibugs>	 (03Merged) 10jenkins-bot: mw-page-content-change-enrich: version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/966890 (https://phabricator.wikimedia.org/T345805) (owner: 10Gmodena)
[16:24:26] <wikibugs>	 (03PS7) 10Jbond: spicerack: enable distributed locking in production [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[16:24:34] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1102']
[16:24:50] <wikibugs>	 (03PS1) 10Jforrester: Revert "wikifunctions: Update evaluators to WASM images" [deployment-charts] - 10https://gerrit.wikimedia.org/r/966616
[16:24:54] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[16:25:00] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] Revert "wikifunctions: Update evaluators to WASM images" [deployment-charts] - 10https://gerrit.wikimedia.org/r/966616 (owner: 10Jforrester)
[16:25:27] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1102.eqiad.wmnet with OS bullseye
[16:25:34] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1102.eqiad.wmnet with OS bullseye
[16:25:49] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "wikifunctions: Update evaluators to WASM images" [deployment-charts] - 10https://gerrit.wikimedia.org/r/966616 (owner: 10Jforrester)
[16:26:52] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[16:28:01] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[16:28:15] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[16:28:32] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[16:28:33] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1103.eqiad.wmnet with OS bullseye
[16:28:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye completed: - cp1103 (**PASS**)   - Removed from Puppet...
[16:29:05] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[16:29:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10Jclark-ctr)
[16:30:27] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[16:30:51] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[16:33:13] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1010.eqiad.wmnet with OS bullseye
[16:34:47] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: ml-services: deploy nllb in llm namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163) (owner: 10Ilias Sarantopoulos)
[16:37:45] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1102.eqiad.wmnet with reason: host reimage
[16:39:54] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1100.eqiad.wmnet with reason: host reimage
[16:40:23] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1101.eqiad.wmnet with reason: host reimage
[16:40:52] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1102.eqiad.wmnet with reason: host reimage
[16:43:39] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1101.eqiad.wmnet with reason: host reimage
[16:44:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[16:46:06] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1100.eqiad.wmnet with reason: host reimage
[16:49:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[16:51:30] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Temporarily add WASM JS service for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/966898 (https://phabricator.wikimedia.org/T343829)
[16:53:24] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] airflow-wmde: Add wmde service user to the Yarn production queue [puppet] - 10https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene)
[16:54:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[16:54:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[16:55:17] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:56:02] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[16:56:03] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1102.eqiad.wmnet with OS bullseye
[16:56:09] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1102.eqiad.wmnet with OS bullseye completed: - cp1102 (**PASS**)   - Removed from Puppet...
[16:57:51] <wikibugs>	 (03CR) 10Cory Massaro: [C: 03+2] wikifunctions: Temporarily add WASM JS service for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/966898 (https://phabricator.wikimedia.org/T343829) (owner: 10Jforrester)
[16:59:22] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Temporarily add WASM JS service for testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/966898 (https://phabricator.wikimedia.org/T343829) (owner: 10Jforrester)
[17:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1700)
[17:00:12] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[17:00:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10Jclark-ctr)
[17:01:15] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[17:01:16] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1101.eqiad.wmnet with OS bullseye
[17:01:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1101.eqiad.wmnet with OS bullseye completed: - cp1101 (**PASS**)   - Removed from Puppet...
[17:02:46] <jinxer-wm>	 (ProbeDown) firing: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:03:05] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[17:04:05] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[17:04:06] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1100.eqiad.wmnet with OS bullseye
[17:04:06] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10Jclark-ctr)
[17:04:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1100.eqiad.wmnet with OS bullseye completed: - cp1100 (**PASS**)   - Removed from Puppet...
[17:04:30] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1104']
[17:04:57] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1104']
[17:05:09] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1104.eqiad.wmnet with OS bullseye
[17:05:14] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1104.eqiad.wmnet with OS bullseye
[17:05:28] <icinga-wm>	 PROBLEM - Check systemd state on an-airflow1007 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-kerberos@wmde.service,wmf_auto_restart_airflow-scheduler@wmde.service,wmf_auto_restart_airflow-webserver@wmde.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:05:52] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[17:05:58] <icinga-wm>	 PROBLEM - Checks that the airflow database for airflow wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow db check did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[17:07:00] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Downtime as we setup the new WMDE Airflow instance
[17:07:25] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Downtime as we setup the new WMDE Airflow instance
[17:07:46] <jinxer-wm>	 (ProbeDown) resolved: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:09:59] <jinxer-wm>	 (PuppetFailure) firing: (3) Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:10:46] <jinxer-wm>	 (ProbeDown) firing: Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1003:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:12:05] <wikibugs>	 (03PS1) 10Bking: dse-k8s: remove rdf-streaming-updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/966902 (https://phabricator.wikimedia.org/T349095)
[17:12:17] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1110.eqiad.wmnet with OS bullseye
[17:12:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye executed with errors: - cp1110 (**FAIL**)   - Removed f...
[17:13:30] <XioNoX>	 !log restart turnilo to pick up UI change
[17:13:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:47] <jinxer-wm>	 (ProbeDown) resolved: Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1003:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:17:44] <brennen>	 jouncebot nowandnext
[17:17:45] <jouncebot>	 For the next 0 hour(s) and 42 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1700)
[17:17:45] <jouncebot>	 In 0 hour(s) and 42 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1800)
[17:17:45] <jouncebot>	 In 0 hour(s) and 42 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1800)
[17:19:40] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:20:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[17:22:19] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1104.eqiad.wmnet with reason: host reimage
[17:23:59] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1110']
[17:24:29] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1110']
[17:25:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[17:25:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Automate L3 Switch to Core Router BGP peerings (and remove OSPF on drmrs switches) - https://phabricator.wikimedia.org/T349125 (10cmooney) >>! In T349125#9260678, @ayounsi wrote: > Isn't OSPF required there to benefit from the end to end link cost calculations (...
[17:25:27] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1104.eqiad.wmnet with reason: host reimage
[17:26:37] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[17:26:51] <wikibugs>	 (03PS1) 10Cathal Mooney: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref [homer/public] - 10https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601)
[17:27:59] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:28:16] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1110
[17:29:33] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1110
[17:30:14] <wikibugs>	 (03PS1) 10Btullis: Disable the multiple spark shufflers on the test cluster temporarily [puppet] - 10https://gerrit.wikimedia.org/r/966905 (https://phabricator.wikimedia.org/T344910)
[17:30:17] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:30:56] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cp1110.mgmt.eqiad.wmnet with reboot policy FORCED
[17:33:52] <wikibugs>	 (03PS2) 10Btullis: Disable the multiple spark shufflers on the test cluster temporarily [puppet] - 10https://gerrit.wikimedia.org/r/966905 (https://phabricator.wikimedia.org/T344910)
[17:34:02] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1110.mgmt.eqiad.wmnet with reboot policy FORCED
[17:34:39] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1110.eqiad.wmnet with OS bullseye
[17:34:45] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye
[17:35:47] <wikibugs>	 (03PS3) 10Btullis: Disable the multiple spark shufflers on the test cluster temporarily [puppet] - 10https://gerrit.wikimedia.org/r/966905 (https://phabricator.wikimedia.org/T344910)
[17:38:21] <wikibugs>	 (03CR) 10Xcollazo: [C: 03+1] Disable the multiple spark shufflers on the test cluster temporarily [puppet] - 10https://gerrit.wikimedia.org/r/966905 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[17:40:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[17:40:32] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] Fix Typo in OS Dark Mode field [extensions/WikimediaEvents] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966615 (https://phabricator.wikimedia.org/T346106) (owner: 10Jdlrobson)
[17:41:04] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 20 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/966905 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[17:42:21] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[17:42:28] <wikibugs>	 (03Merged) 10jenkins-bot: Fix Typo in OS Dark Mode field [extensions/WikimediaEvents] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966615 (https://phabricator.wikimedia.org/T346106) (owner: 10Jdlrobson)
[17:43:15] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] wikimedia.org: update DNS records for Greenhouse [dns] - 10https://gerrit.wikimedia.org/r/966573 (https://phabricator.wikimedia.org/T348335) (owner: 10Ssingh)
[17:43:18] <wikibugs>	 (03PS2) 10Ssingh: wikimedia.org: update DNS records for Greenhouse [dns] - 10https://gerrit.wikimedia.org/r/966573 (https://phabricator.wikimedia.org/T348335)
[17:43:26] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[17:43:36] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[17:44:45] <sukhe>	 !log running authdns-update for CR 966573
[17:44:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:53] <logmsgbot>	 !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[17:47:11] <logmsgbot>	 !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[17:50:13] <wikibugs>	 (03PS1) 10Ryan Kemper: Revert "airflow-wmde: Place airflow1007 in airflow-wmde role" [puppet] - 10https://gerrit.wikimedia.org/r/966617
[17:50:16] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Revert "airflow-wmde: Place airflow1007 in airflow-wmde role" [puppet] - 10https://gerrit.wikimedia.org/r/966617 (owner: 10Ryan Kemper)
[17:50:44] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Disable the multiple spark shufflers on the test cluster temporarily [puppet] - 10https://gerrit.wikimedia.org/r/966905 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[17:51:45] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1110.eqiad.wmnet with reason: host reimage
[17:52:14] <logmsgbot>	 !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[17:52:20] <logmsgbot>	 !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[17:52:43] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:52:53] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:54:13] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:55:03] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1110.eqiad.wmnet with reason: host reimage
[17:55:13] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:55:44] <wikibugs>	 (03PS1) 10Herron: pyrra::filesystem::config: add pyrra filesystem operator config manager [puppet] - 10https://gerrit.wikimedia.org/r/966906 (https://phabricator.wikimedia.org/T302995)
[17:56:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] pyrra::filesystem::config: add pyrra filesystem operator config manager [puppet] - 10https://gerrit.wikimedia.org/r/966906 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[17:56:21] <wikibugs>	 (03PS1) 10Ryan Kemper: Revert "airflow-wmde: configure wmde airflow instance" [puppet] - 10https://gerrit.wikimedia.org/r/966618
[17:56:28] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Revert "airflow-wmde: configure wmde airflow instance" [puppet] - 10https://gerrit.wikimedia.org/r/966618 (owner: 10Ryan Kemper)
[17:56:52] <wikibugs>	 (03PS2) 10Herron: pyrra::filesystem::config: add pyrra filesystem operator config manager [puppet] - 10https://gerrit.wikimedia.org/r/966906 (https://phabricator.wikimedia.org/T302995)
[17:58:27] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:59:27] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:00:07] <jouncebot>	 brennen and hashar: Your horoscope predicts another unfortunate Train log triage with CPT deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1800).
[18:00:07] <jouncebot>	 brennen and hashar: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7+Utc-0 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1800).
[18:01:21] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50715 bytes in 5.640 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:01:27] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.283 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:01:29] <wikibugs>	 (03PS1) 10Ryan Kemper: Revert "airflow-wmde: Create scap deployment source for wmde" [puppet] - 10https://gerrit.wikimedia.org/r/966619
[18:01:36] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Revert "airflow-wmde: Create scap deployment source for wmde" [puppet] - 10https://gerrit.wikimedia.org/r/966619 (owner: 10Ryan Kemper)
[18:02:31] <wikibugs>	 (03PS1) 10Ryan Kemper: Revert "airflow-wmde: Add wmde service user to the Yarn production queue" [puppet] - 10https://gerrit.wikimedia.org/r/966620
[18:02:41] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Revert "airflow-wmde: Add wmde service user to the Yarn production queue" [puppet] - 10https://gerrit.wikimedia.org/r/966620 (owner: 10Ryan Kemper)
[18:02:55] <brennen>	 o/
[18:03:45] <logmsgbot>	 !log brennen@deploy2002 Started scap: Backport for [[gerrit:966615|Fix Typo in OS Dark Mode field (T346106)]]
[18:03:56] <stashbot>	 T346106: Interface customization baseline instrumentation  - https://phabricator.wikimedia.org/T346106
[18:03:58] <brennen>	 kimberly_sarabia: ^ anything to test here?
[18:05:08] <logmsgbot>	 !log brennen@deploy2002 brennen and jdlrobson: Backport for [[gerrit:966615|Fix Typo in OS Dark Mode field (T346106)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[18:06:02] <wikibugs>	 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: Update DNS records for Greenhouse - https://phabricator.wikimedia.org/T348335 (10ssingh) 05Open→03Resolved We have updated the DNS records for Greenhouse, confirmed email delivery including 'reply-to' and checklist on the Greenhouse web interface. Marking th...
[18:06:08] <wikibugs>	 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: Update DNS records for Greenhouse - https://phabricator.wikimedia.org/T348335 (10ssingh) For posterity: we are now using `gh-mail.wikimedia.org` for the Greenhouse mails.
[18:06:26] <wikibugs>	 (03PS1) 10BCornwall: hiera: remove dns6001 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/966907 (https://phabricator.wikimedia.org/T342154)
[18:06:50] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] hiera: remove dns6001 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/966907 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[18:06:58] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] hiera: remove dns6001 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/966907 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[18:09:59] <jinxer-wm>	 (PuppetFailure) firing: (3) Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:12:08] <logmsgbot>	 !log brennen@deploy2002 brennen and jdlrobson: Continuing with sync
[18:12:09] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[18:12:20] <brennen>	 (proceeding as this seems pretty low-risk.)
[18:14:17] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/966828
[18:15:09] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on dns6001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[18:15:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[18:15:37] <icinga-wm>	 PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:17:12] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns6001.wikimedia.org with OS bookworm
[18:17:24] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns6001.wikimedia.org with OS bookworm
[18:17:31] <logmsgbot>	 !log brennen@deploy2002 Finished scap: Backport for [[gerrit:966615|Fix Typo in OS Dark Mode field (T346106)]] (duration: 13m 46s)
[18:17:36] <stashbot>	 T346106: Interface customization baseline instrumentation  - https://phabricator.wikimedia.org/T346106
[18:18:27] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C: 03+1] [beta] Make temp user config SUL-friendly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965879 (https://phabricator.wikimedia.org/T342475) (owner: 10Gergő Tisza)
[18:18:43] <icinga-wm>	 PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:20:12] <brennen>	 !log train 1.42.0-wmf.1 (T348354): logs clean and no blockers, rolling to group1
[18:20:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:17] <stashbot>	 T348354: 1.42.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T348354
[18:20:21] <wikibugs>	 (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/966909
[18:20:32] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966910 (https://phabricator.wikimedia.org/T348354)
[18:20:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966910 (https://phabricator.wikimedia.org/T348354) (owner: 10TrainBranchBot)
[18:21:41] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966910 (https://phabricator.wikimedia.org/T348354) (owner: 10TrainBranchBot)
[18:22:20] <brett>	 BFD status alerts are caused by the reimaging of DNS hosts
[18:23:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10RobH) >>! In T324998#9262325, @nskaggs wrote: > Can someone provide an update on what's happening with these machines? Were they indeed sent back?...
[18:24:25] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:27:47] <icinga-wm>	 PROBLEM - Recursive DNS on 185.15.58.5 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[18:28:11] <logmsgbot>	 !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.1  refs T348354
[18:28:13] <kimberly_sarabia>	 brennen: sorry for the delay. everything LGTM on my end
[18:28:16] <stashbot>	 T348354: 1.42.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T348354
[18:29:00] <brennen>	 kimberly_sarabia: cool, thx.
[18:29:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:32:59] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus updater: Read codfw events in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/966912 (https://phabricator.wikimedia.org/T347075)
[18:33:52] <logmsgbot>	 !log brennen@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.1  refs T348354 (duration: 05m 40s)
[18:34:03] <stashbot>	 T348354: 1.42.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T348354
[18:34:19] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Read codfw events in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/966912 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson)
[18:35:07] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus updater: Read codfw events in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/966912 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson)
[18:35:39] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[18:35:42] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:36:38] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[18:36:48] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:41:49] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns6001.wikimedia.org with reason: host reimage
[18:45:10] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns6001.wikimedia.org with reason: host reimage
[18:48:59] <icinga-wm>	 PROBLEM - Recursive DNS on 185.15.58.5 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[18:50:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[18:55:48] <wikibugs>	 (03PS1) 10Eevans: cqlsh-instance (new) [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/966913
[18:56:45] <wikibugs>	 (03PS2) 10Eevans: cqlsh-instance (new) [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/966913
[18:58:37] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:59:19] <icinga-wm>	 RECOVERY - Recursive DNS on 185.15.58.5 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[19:00:09] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[19:00:10] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1104.eqiad.wmnet with OS bullseye
[19:00:12] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[19:00:13] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1110.eqiad.wmnet with OS bullseye
[19:00:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1104.eqiad.wmnet with OS bullseye completed: - cp1104 (**PASS**)   - Removed from Puppet...
[19:00:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye completed: - cp1110 (**WARN**)   - Removed from Puppet...
[19:00:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10Jclark-ctr)
[19:01:06] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10Jclark-ctr) 05Open→03Resolved
[19:01:09] <wikibugs>	 (03CR) 10Bking: [C: 03+2] flink-app chart: Add zookeeper to egress_enabled fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/963130 (owner: 10Ebernhardson)
[19:02:40] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bullseye
[19:02:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye
[19:03:37] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:03:53] <icinga-wm>	 PROBLEM - Check systemd state on puppetserver1002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:08:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[19:10:27] <icinga-wm>	 RECOVERY - Check systemd state on puppetserver1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:12:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:15:13] <icinga-wm>	 RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:15:55] <icinga-wm>	 RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:16:35] <icinga-wm>	 PROBLEM - Check systemd state on puppetserver1002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:16:48] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns6001.wikimedia.org with OS bookworm
[19:16:58] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns6001.wikimedia.org with OS bookworm completed: - dns6001 (**PASS**)   - Downtimed on Icinga/Al...
[19:17:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:19:54] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Remove unused $wgIncludeLegacyJavaScript [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966915
[19:20:25] <wikibugs>	 (03PS1) 10BCornwall: Revert "hiera: remove dns6001 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/966624
[19:22:17] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus-updater: Update deployed container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/966916
[19:23:25] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] cirrus-updater: Update deployed container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/966916 (owner: 10Ebernhardson)
[19:24:10] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus-updater: Update deployed container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/966916 (owner: 10Ebernhardson)
[19:25:01] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[19:25:12] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[19:25:30] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:25:57] <icinga-wm>	 PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: Last dump for es5 at codfw (es2025) taken on 2023-10-18 19:16:13 is 7 GiB, but the previous one was 4919 GiB, a change of -99.8 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[19:28:07] <wikibugs>	 (03PS1) 10Herron: pyrra: add prometheus external url [puppet] - 10https://gerrit.wikimedia.org/r/966917 (https://phabricator.wikimedia.org/T302995)
[19:28:17] <wikibugs>	 (03PS2) 10Herron: pyrra: add prometheus external url [puppet] - 10https://gerrit.wikimedia.org/r/966917 (https://phabricator.wikimedia.org/T302995)
[19:28:54] <wikibugs>	 (03PS1) 10BCornwall: mtail: Record bad requests for HAProxy SLI metrics [puppet] - 10https://gerrit.wikimedia.org/r/966918 (https://phabricator.wikimedia.org/T341606)
[19:28:56] <wikibugs>	 (03CR) 10Bartosz Dziewoński: "Wow, I had no idea this existed, and I hate it. It seems really difficult to review, other than just trusting that you know what you're do" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza)
[19:29:21] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] Revert "hiera: remove dns6001 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/966624 (owner: 10BCornwall)
[19:29:32] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] Revert "hiera: remove dns6001 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/966624 (owner: 10BCornwall)
[19:30:05] <icinga-wm>	 RECOVERY - Check systemd state on puppetserver1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:30:27] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Remove $wgApiFrameOptions override for enwiki and zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966919 (https://phabricator.wikimedia.org/T131183)
[19:30:39] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1229.eqiad.wmnet with OS bullseye
[19:30:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye executed with errors: - db1229 (**FAIL**)   -...
[19:31:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mtail: Record bad requests for HAProxy SLI metrics [puppet] - 10https://gerrit.wikimedia.org/r/966918 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall)
[19:32:32] <wikibugs>	 (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/49/cons" [puppet] - 10https://gerrit.wikimedia.org/r/966917 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[19:33:09] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bullseye
[19:33:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye
[19:33:23] <wikibugs>	 (03PS2) 10BCornwall: mtail: Record bad requests for HAProxy SLI metrics [puppet] - 10https://gerrit.wikimedia.org/r/966918 (https://phabricator.wikimedia.org/T341606)
[19:33:35] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1229.eqiad.wmnet with OS bullseye
[19:33:41] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye executed with errors: - db1229 (**FAIL**)   -...
[19:33:46] <wikibugs>	 (03CR) 10Herron: [V: 03+1 C: 03+2] pyrra: add prometheus external url [puppet] - 10https://gerrit.wikimedia.org/r/966917 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[19:34:11] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall)
[19:35:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Jclark-ctr) @Papaul  this is still failing    [25/50, retrying in 75.00s] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb.<locals>.poll_puppetdb' r...
[19:36:49] <wikibugs>	 (03PS1) 10Herron: pyrra::filesystem: correct config permissions [puppet] - 10https://gerrit.wikimedia.org/r/966920 (https://phabricator.wikimedia.org/T302995)
[19:37:49] <icinga-wm>	 PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: Last dump for es5 at eqiad (es1025) taken on 2023-10-18 19:11:19 is 7 GiB, but the previous one was 4919 GiB, a change of -99.9 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[19:38:04] <jynus>	 ^never an alert made me so happy!
[19:38:23] <wikibugs>	 (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/50/cons" [puppet] - 10https://gerrit.wikimedia.org/r/966920 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[19:38:28] <MatmaRex>	 :o
[19:38:32] <jynus>	 our backups are -99.9% faster
[19:38:39] <icinga-wm>	 PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: Last dump for es4 at codfw (es2022) taken on 2023-10-18 19:16:13 is 7 GiB, but the previous one was 4984 GiB, a change of -99.9 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[19:38:54] <wikibugs>	 (03PS1) 10Bking: dse-k8s: don't watch rdf-streaming-updater namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/966921 (https://phabricator.wikimedia.org/T349095)
[19:38:56] <wikibugs>	 (03PS2) 10Herron: pyrra::filesystem: correct config permissions [puppet] - 10https://gerrit.wikimedia.org/r/966920 (https://phabricator.wikimedia.org/T302995)
[19:39:06] <jynus>	 well, 99.9% faster, I guess
[19:39:14] <jynus>	 or -99.9% slower
[19:39:33] <wikibugs>	 (03PS2) 10Bking: dse-k8s: don't watch rdf-streaming-updater namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/966921 (https://phabricator.wikimedia.org/T349095)
[19:40:47] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bullseye
[19:40:54] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye
[19:41:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Volans) @Jclark-ctr: ` Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, No pup...
[19:41:29] <wikibugs>	 (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/51/cons" [puppet] - 10https://gerrit.wikimedia.org/r/966920 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[19:42:04] <wikibugs>	 (03CR) 10Herron: [V: 03+1 C: 03+2] pyrra::filesystem: correct config permissions [puppet] - 10https://gerrit.wikimedia.org/r/966920 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[19:43:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[19:43:35] <icinga-wm>	 PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: Last dump for es4 at eqiad (es1022) taken on 2023-10-18 19:11:19 is 7 GiB, but the previous one was 4984 GiB, a change of -99.9 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[19:45:09] <wikibugs>	 (03CR) 10Jforrester: "<3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966915 (owner: 10Bartosz Dziewoński)
[19:48:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[19:48:54] <icinga-wm>	 ACKNOWLEDGEMENT - dump of es4 in codfw on backupmon1001 is CRITICAL: Last dump for es4 at codfw (es2022) taken on 2023-10-18 19:16:13 is 7 GiB, but the previous one was 4984 GiB, a change of -99.9 % Jcrespo expected after cluster split - The acknowledgement expires at: 2023-10-25 19:48:29. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[19:48:54] <icinga-wm>	 ACKNOWLEDGEMENT - dump of es4 in eqiad on backupmon1001 is CRITICAL: Last dump for es4 at eqiad (es1022) taken on 2023-10-18 19:11:19 is 7 GiB, but the previous one was 4984 GiB, a change of -99.9 % Jcrespo expected after cluster split - The acknowledgement expires at: 2023-10-25 19:48:29. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[19:48:54] <icinga-wm>	 ACKNOWLEDGEMENT - dump of es5 in codfw on backupmon1001 is CRITICAL: Last dump for es5 at codfw (es2025) taken on 2023-10-18 19:16:13 is 7 GiB, but the previous one was 4919 GiB, a change of -99.8 % Jcrespo expected after cluster split - The acknowledgement expires at: 2023-10-25 19:48:29. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[19:48:54] <icinga-wm>	 ACKNOWLEDGEMENT - dump of es5 in eqiad on backupmon1001 is CRITICAL: Last dump for es5 at eqiad (es1025) taken on 2023-10-18 19:11:19 is 7 GiB, but the previous one was 4919 GiB, a change of -99.9 % Jcrespo expected after cluster split - The acknowledgement expires at: 2023-10-25 19:48:29. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[19:56:24] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Jclark-ctr) thanks @Volans
[19:56:52] <wikibugs>	 (03PS1) 10Jclark-ctr: add db1229 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/966923 (https://phabricator.wikimedia.org/T342176)
[19:58:02] <wikibugs>	 (03CR) 10Jclark-ctr: [C: 03+2] add db1229 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/966923 (https://phabricator.wikimedia.org/T342176) (owner: 10Jclark-ctr)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T2000).
[20:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:03:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[20:06:31] <wikibugs>	 (03PS3) 10Bking: dse-k8s: don't watch rdf-streaming-updater namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/966921 (https://phabricator.wikimedia.org/T349095)
[20:12:12] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[20:22:12] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[20:23:42] <wikibugs>	 (03PS1) 10Ebernhardson: flink-app: Configure internal metrics port [deployment-charts] - 10https://gerrit.wikimedia.org/r/966926
[20:24:24] <wikibugs>	 (03PS2) 10Ebernhardson: flink-app: Configure internal metrics port [deployment-charts] - 10https://gerrit.wikimedia.org/r/966926
[20:27:12] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[20:31:30] <wikibugs>	 (03PS3) 10Ebernhardson: flink-app: Configure internal metrics port [deployment-charts] - 10https://gerrit.wikimedia.org/r/966926
[20:32:20] <wikibugs>	 (03PS4) 10Ebernhardson: flink-app: Configure internal metrics port [deployment-charts] - 10https://gerrit.wikimedia.org/r/966926
[20:33:42] <wikibugs>	 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur)
[20:34:14] <wikibugs>	 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur)
[20:36:13] <wikibugs>	 (03PS3) 10Herron: pyrra::filesystem::config: add pyrra filesystem operator config manager [puppet] - 10https://gerrit.wikimedia.org/r/966906 (https://phabricator.wikimedia.org/T302995)
[20:37:03] <wikibugs>	 (03CR) 10Herron: "please see PCC on the related patch above this one" [puppet] - 10https://gerrit.wikimedia.org/r/966906 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[20:37:16] <wikibugs>	 (03CR) 10Herron: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/966909 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[20:42:40] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1229.eqiad.wmnet with OS bullseye
[20:42:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye executed with errors: - db1229 (**FAIL**)   -...
[20:43:04] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bullseye
[20:43:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye
[20:43:49] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1229.eqiad.wmnet with OS bullseye
[20:43:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye executed with errors: - db1229 (**FAIL**)   -...
[20:44:13] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bullseye
[20:44:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye
[20:46:06] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1229.eqiad.wmnet with OS bullseye
[20:46:14] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye executed with errors: - db1229 (**FAIL**)   -...
[20:46:28] <wikibugs>	 (03PS1) 10Cathal Mooney: Include two new temp codfw sretest hosts in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/966928 (https://phabricator.wikimedia.org/T345803)
[20:46:30] <wikibugs>	 (03PS5) 10Ebernhardson: flink-app: Configure internal metrics port [deployment-charts] - 10https://gerrit.wikimedia.org/r/966926
[20:46:40] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bullseye
[20:46:48] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye
[20:51:31] <wikibugs>	 (03CR) 10Bking: [C: 03+2] Include two new temp codfw sretest hosts in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/966928 (https://phabricator.wikimedia.org/T345803) (owner: 10Cathal Mooney)
[20:51:43] <wikibugs>	 (03PS1) 10BCornwall: mtail: Record bad requests for ATS SLI metrics [puppet] - 10https://gerrit.wikimedia.org/r/966930 (https://phabricator.wikimedia.org/T341606)
[20:51:52] <wikibugs>	 (03CR) 10Bking: [C: 03+1] Include two new temp codfw sretest hosts in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/966928 (https://phabricator.wikimedia.org/T345803) (owner: 10Cathal Mooney)
[20:52:45] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Include two new temp codfw sretest hosts in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/966928 (https://phabricator.wikimedia.org/T345803) (owner: 10Cathal Mooney)
[20:53:29] <wikibugs>	 (03PS2) 10BCornwall: mtail: Record bad requests for ATS SLI metrics [puppet] - 10https://gerrit.wikimedia.org/r/966930 (https://phabricator.wikimedia.org/T341606)
[20:54:35] <wikibugs>	 (03CR) 10Bking: [C: 03+1] flink-app: Configure internal metrics port [deployment-charts] - 10https://gerrit.wikimedia.org/r/966926 (owner: 10Ebernhardson)
[20:55:17] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[20:57:43] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus updater: Limit staging consumer to eqiad topics [deployment-charts] - 10https://gerrit.wikimedia.org/r/966931
[20:59:01] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1229.eqiad.wmnet with reason: host reimage
[20:59:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[21:00:04] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T2100)
[21:00:51] <wikibugs>	 (03PS1) 10Cathal Mooney: Specify partman receipe for sretest2003 & sretest2004 [puppet] - 10https://gerrit.wikimedia.org/r/966932 (https://phabricator.wikimedia.org/T345803)
[21:02:06] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Limit staging consumer to eqiad topics [deployment-charts] - 10https://gerrit.wikimedia.org/r/966931 (owner: 10Ebernhardson)
[21:02:10] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1229.eqiad.wmnet with reason: host reimage
[21:03:26] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus updater: Limit staging consumer to eqiad topics [deployment-charts] - 10https://gerrit.wikimedia.org/r/966931 (owner: 10Ebernhardson)
[21:04:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[21:04:21] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] flink-app: Configure internal metrics port [deployment-charts] - 10https://gerrit.wikimedia.org/r/966926 (owner: 10Ebernhardson)
[21:05:11] <wikibugs>	 (03Merged) 10jenkins-bot: flink-app: Configure internal metrics port [deployment-charts] - 10https://gerrit.wikimedia.org/r/966926 (owner: 10Ebernhardson)
[21:08:31] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[21:08:35] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/966909 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[21:08:54] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:08:54] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/966906 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[21:09:34] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/966881 (owner: 10Herron)
[21:10:46] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/966819 (https://phabricator.wikimedia.org/T349102) (owner: 10Filippo Giunchedi)
[21:11:05] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/966818 (https://phabricator.wikimedia.org/T349102) (owner: 10Filippo Giunchedi)
[21:15:38] <wikibugs>	 (03CR) 10Bking: [C: 03+1] Specify partman receipe for sretest2003 & sretest2004 [puppet] - 10https://gerrit.wikimedia.org/r/966932 (https://phabricator.wikimedia.org/T345803) (owner: 10Cathal Mooney)
[21:16:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[21:16:38] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[21:19:16] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Specify partman receipe for sretest2003 & sretest2004 [puppet] - 10https://gerrit.wikimedia.org/r/966932 (https://phabricator.wikimedia.org/T345803) (owner: 10Cathal Mooney)
[21:21:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[21:23:00] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[21:23:01] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1229.eqiad.wmnet with OS bullseye
[21:23:06] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye completed: - db1229 (**WARN**)   - Downtimed o...
[21:23:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Jclark-ctr)
[21:23:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Jclark-ctr) a:03Jclark-ctr
[21:23:45] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Jclark-ctr) 05Open→03Resolved
[21:35:09] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with reboot policy FORCED
[21:44:44] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2003.mgmt.codfw.wmnet with reboot policy FORCED
[21:52:07] <wikibugs>	 (03Abandoned) 10Krinkle: [BETA HACK] Allow external access from anywhere to parsoid port 80 for CI purposes [puppet] - 10https://gerrit.wikimedia.org/r/941477 (owner: 10Krinkle)
[21:54:31] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with reboot policy FORCED
[21:54:47] <wikibugs>	 (03CR) 10Krinkle: [BETA HACK] Attempt to secure Puppet DB better (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/941476 (owner: 10Krinkle)
[21:56:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[21:56:37] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[21:58:14] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2003.mgmt.codfw.wmnet with reboot policy FORCED
[22:01:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[22:02:12] <wikibugs>	 (CR) Gergő Tisza: [C: +1] Remove $wgApiFrameOptions override for enwiki and zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/966919 (https://phabricator.wikimedia.org/T131183) (owner: Bartosz Dziewoński)
[22:08:47] <icinga-wm>	 PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 41 probes of 779 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[22:14:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[22:14:11] <icinga-wm>	 RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 4 probes of 779 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[22:19:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[22:24:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[22:30:14] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:50:13] <icinga-wm>	 PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS13030/IPv6: Connect - Init7, AS13030/IPv4: Connect - Init7, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:56:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[23:01:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[23:03:37] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:06:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[23:38:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[23:43:10] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[23:48:45] <icinga-wm>	 RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 108, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status