[00:07:12] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:07:30] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:21:32] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:21:52] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:38:58] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/965584 [00:39:04] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/965584 (owner: 10TrainBranchBot) [00:39:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [00:49:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [00:53:57] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/965584 (owner: 10TrainBranchBot) [00:55:16] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:00:12] PROBLEM - Host mr1-drmrs.oob is DOWN: PING CRITICAL - Packet loss = 100% [02:00:12] PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [02:05:04] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [02:05:40] RECOVERY - Host mr1-drmrs.oob is UP: PING OK - Packet loss = 0%, RTA = 86.98 ms [02:05:40] RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 87.24 ms [02:14:24] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:16:22] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [02:18:40] 
PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:38:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:03:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:45:14] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:45:48] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 45 probes of 774 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:51:14] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 20 probes of 774 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:52:32] PROBLEM - Check systemd state on arclamp1001 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:55:14] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:55:17] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:01:02] RECOVERY - Check systemd state on arclamp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:35:34] PROBLEM - Check systemd state on doc2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-host-data-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:48:04] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:51:14] (03PS1) 10Gergő Tisza: Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) [05:52:12] (03CR) 10CI reject: [V: 04-1] Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [05:54:10] (03PS2) 10Gergő Tisza: Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) [05:57:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2160.codfw.wmnet with OS bookworm [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T0600) [06:00:25] (03PS1) 10Ayounsi: Turnilo: wmf_netflow: change forwarded 1/0 to yes/no [puppet] - 
10https://gerrit.wikimedia.org/r/966800 (https://phabricator.wikimedia.org/T331707) [06:04:17] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:08:20] (03PS2) 10Stevemunene: Switch druid1006 zookeeper node with druid1011 [puppet] - 10https://gerrit.wikimedia.org/r/965501 (https://phabricator.wikimedia.org/T336042) [06:08:22] (03PS2) 10Stevemunene: Switch druid1005 zookeeper node with druid1010 [puppet] - 10https://gerrit.wikimedia.org/r/965500 (https://phabricator.wikimedia.org/T336042) [06:08:24] (03PS2) 10Stevemunene: Switch druid1004 zookeeper node with druid1009 [puppet] - 10https://gerrit.wikimedia.org/r/965499 (https://phabricator.wikimedia.org/T336042) [06:08:26] !log marostegui@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db2160.codfw.wmnet with OS bookworm [06:08:26] (03PS4) 10Stevemunene: druid: Add druid druid10[09-11] to druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/962250 (https://phabricator.wikimedia.org/T336042) [06:09:17] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:14:58] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:16:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2160.codfw.wmnet with OS bookworm [06:17:10] (ThanosQueryInstantLatencyHigh) 
firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [06:22:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [06:27:24] (03CR) 10Majavah: [C: 03+2] nginx: make /etc/nginx depend on the package [puppet] - 10https://gerrit.wikimedia.org/r/966549 (owner: 10Majavah) [06:32:22] RECOVERY - Check systemd state on doc2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:41] (03CR) 10Ayounsi: "1 small comment, +1 otherwise" [homer/public] - 10https://gerrit.wikimedia.org/r/966581 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [06:36:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2160.codfw.wmnet with reason: host reimage [06:37:04] 10SRE, 10Infrastructure-Foundations, 10netops: Automate L3 Switch to Core Router BGP peerings (and remove OSPF on drmrs switches) - https://phabricator.wikimedia.org/T349125 (10ayounsi) Isn't OSPF required there to benefit from the end to end link cost calculations (eg. draining a transport link)? [06:38:16] !log push pfw policies - T349101 [06:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2160.codfw.wmnet with reason: host reimage [07:00:04] Amir1, Urbanecm, and taavi: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T0700). 
nyaa~ [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:00:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2160.codfw.wmnet with OS bookworm [07:00:47] * taavi blames TheresNoTime for that message [07:00:56] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@5dcce3b]: Add missing MR in yesterday weekly train [airflow-dags@5dcce3bd] [07:02:58] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:03:10] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:03:36] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:04:48] !log aqu@deploy2002 deploy aborted: Add missing MR in yesterday weekly train [airflow-dags@5dcce3bd] (duration: 03m 52s) [07:05:11] (03PS1) 10Majavah: P:wmcs::metricsinfra: add meta monitoring app skeleton [puppet] - 10https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) [07:05:13] (03PS1) 10Majavah: P:wmcs::metriscinfra: haproxy: add route for meta monitor service [puppet] - 10https://gerrit.wikimedia.org/r/966805 (https://phabricator.wikimedia.org/T288053) [07:05:13] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@be05071]: (no justification provided) [07:05:19] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@be05071]: (no justification provided) (duration: 00m 06s) [07:05:26] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:05:56] !log aqu@deploy2002 Started 
deploy [airflow-dags/analytics@5dcce3b]: Add missing MR in yesterday weekly train (run 2) [airflow-dags@5dcce3bd] [07:06:03] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@5dcce3b]: Add missing MR in yesterday weekly train (run 2) [airflow-dags@5dcce3bd] (duration: 00m 07s) [07:06:36] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:06:46] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:06:58] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:07:28] (03PS2) 10Volans: dhcp: adapt to new Spicerack's dhcp() API [cookbooks] - 10https://gerrit.wikimedia.org/r/966490 (https://phabricator.wikimedia.org/T341973) [07:07:38] (03CR) 10Marostegui: [C: 03+2] production-m5.sql.erb: Remove testreduce grants [puppet] - 10https://gerrit.wikimedia.org/r/966327 (https://phabricator.wikimedia.org/T345831) (owner: 10Marostegui) [07:07:40] (03PS1) 10Marostegui: db2132: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/966806 (https://phabricator.wikimedia.org/T349090) [07:08:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [07:08:14] (03CR) 10Marostegui: [C: 03+2] db2132: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/966806 (https://phabricator.wikimedia.org/T349090) (owner: 10Marostegui) [07:09:02] (03CR) 10CI reject: [V: 04-1] P:wmcs::metricsinfra: add meta monitoring app skeleton [puppet] - 10https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) (owner: 10Majavah) [07:09:53] (03PS2) 10Majavah: P:wmcs::metricsinfra: add meta monitoring app skeleton [puppet] - 10https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) [07:09:55] (03PS2) 10Majavah: P:wmcs::metriscinfra: haproxy: add route for meta monitor service [puppet] - 10https://gerrit.wikimedia.org/r/966805 (https://phabricator.wikimedia.org/T288053) [07:13:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [07:13:39] (03CR) 10CI reject: [V: 04-1] P:wmcs::metricsinfra: add meta monitoring app skeleton [puppet] - 10https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) (owner: 10Majavah) [07:14:03] (03CR) 10Filippo Giunchedi: [C: 03+1] puppet-agent-fail: enable check for all clusters. 
[alerts] - 10https://gerrit.wikimedia.org/r/966554 (owner: 10Slyngshede) [07:14:50] (03PS3) 10Majavah: P:wmcs::metricsinfra: add meta monitoring app skeleton [puppet] - 10https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) [07:14:52] (03PS3) 10Majavah: P:wmcs::metriscinfra: haproxy: add route for meta monitor service [puppet] - 10https://gerrit.wikimedia.org/r/966805 (https://phabricator.wikimedia.org/T288053) [07:15:38] (03PS1) 10Kevin Bazira: ml-services: update the recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/965585 (https://phabricator.wikimedia.org/T348607) [07:18:37] (03CR) 10Slyngshede: [C: 03+2] puppet-agent-fail: enable check for all clusters. [alerts] - 10https://gerrit.wikimedia.org/r/966554 (owner: 10Slyngshede) [07:20:19] (03Merged) 10jenkins-bot: puppet-agent-fail: enable check for all clusters. [alerts] - 10https://gerrit.wikimedia.org/r/966554 (owner: 10Slyngshede) [07:20:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2132.codfw.wmnet with OS bookworm [07:24:35] (03CR) 10Filippo Giunchedi: [C: 03+2] otel-coll: bump resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/966514 (https://phabricator.wikimedia.org/T345712) (owner: 10Filippo Giunchedi) [07:26:04] (03CR) 10Elukey: [C: 03+1] ml-services: update the recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/965585 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [07:27:12] !log filippo@deploy2002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [07:27:25] !log filippo@deploy2002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [07:28:09] !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply [07:28:17] !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply [07:28:26] !log filippo@deploy2002 helmfile 
[codfw] START helmfile.d/services/opentelemetry-collector: apply [07:28:32] !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply [07:31:09] (03CR) 10Arnaudb: [V: 03+1 C: 03+1] "hieradata/hosts/pc1015.yaml has a duplicated line at line 7 otherwise lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) (owner: 10Marostegui) [07:34:52] (03PS1) 10Volans: sre.hosts.reimage: fix --new with puppet 7 support [cookbooks] - 10https://gerrit.wikimedia.org/r/966809 [07:37:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2132.codfw.wmnet with reason: host reimage [07:37:54] !log temporarily disabled puppet on the A:cumin hosts to deploy and test spicerack v8.0.0 [07:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:52] (03CR) 10Volans: [C: 03+2] "Self-merging to test spicerack 8.0.0 on cumin2002, puppet is disabled on cumin1001. I'll be happy to do any post-merge fix." [cookbooks] - 10https://gerrit.wikimedia.org/r/966490 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [07:40:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2132.codfw.wmnet with reason: host reimage [07:42:12] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Some or all of the undeletion failed - https://phabricator.wikimedia.org/T348937 (10MatthewVernon) We did switch DCs recently, the impact of which is more load on thumbor (and during the switchover we discovered there was a shortage of thumbor pods... 
[07:43:28] (03Merged) 10jenkins-bot: dhcp: adapt to new Spicerack's dhcp() API [cookbooks] - 10https://gerrit.wikimedia.org/r/966490 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [07:46:43] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2132.codfw.wmnet with OS bookworm [07:47:10] (03CR) 10Kevin Bazira: [C: 03+2] ml-services: update the recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/965585 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [07:47:47] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [07:47:58] (03Merged) 10jenkins-bot: ml-services: update the recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/965585 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [07:54:58] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [08:00:04] brennen and hashar: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T0800). 
[08:00:40] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/966553 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [08:02:29] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add A/PTR for lsw1-e8/ssw links - ayounsi@cumin1001" [08:02:41] (03PS1) 10Jgiannelos: tegola: Configure logger to use json output [deployment-charts] - 10https://gerrit.wikimedia.org/r/966810 [08:03:16] 10SRE, 10DBA, 10MW-1.41-notes (1.41.0-wmf.30; 2023-10-10): Error connecting to db2109 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection refused - https://phabricator.wikimedia.org/T348419 (10Marostegui) [08:03:38] (03PS2) 10Jgiannelos: tegola: Configure logger to use json output [deployment-charts] - 10https://gerrit.wikimedia.org/r/966810 (https://phabricator.wikimedia.org/T347717) [08:03:45] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add A/PTR for lsw1-e8/ssw links - ayounsi@cumin1001" [08:03:45] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:06:04] !log volans@cumin2002 START - Cookbook sre.hosts.dhcp for host sretest1001.eqiad.wmnet [08:07:22] (03Abandoned) 10Hashar: gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - 10https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [08:08:24] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest1001.eqiad.wmnet [08:10:04] (03CR) 10Hashar: logging: reorder wmgMonologProcessors entries (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966529 (https://phabricator.wikimedia.org/T349086) (owner: 10Hashar) [08:11:54] (03CR) 10David Caro: "Can you elaborate on what is this and how will it work?" 
[puppet] - 10https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) (owner: 10Majavah) [08:12:41] 10SRE, 10Infrastructure-Foundations, 10netops: Bring Juniper switches in eqiad racks E5-7 and F5-7 online and ready for servers - https://phabricator.wikimedia.org/T334230 (10ayounsi) [08:12:47] 10SRE, 10Infrastructure-Foundations, 10netops: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (10ayounsi) [08:12:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Cabling for Eqiad racks E5-8 and F5-8 - https://phabricator.wikimedia.org/T334231 (10ayounsi) 05Resolved→03Open a:05cmooney→03Jclark-ctr I can't get the links to the Dell switches up, only looking at lsw1-e8 for now it seems li... [08:14:44] !log volans@cumin2002 START - Cookbook sre.hosts.dhcp for host sretest1001.eqiad.wmnet [08:15:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [08:15:37] 10ops-eqiad: Add test server to row E8 - https://phabricator.wikimedia.org/T349168 (10ayounsi) [08:16:03] 10ops-eqiad: Add test server to row E8 - https://phabricator.wikimedia.org/T349168 (10ayounsi) [08:16:05] 10SRE, 10Infrastructure-Foundations, 10netops: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (10ayounsi) [08:18:22] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest1001.eqiad.wmnet [08:20:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [08:20:13] 10ops-eqiad: Add test server to rack E8 - https://phabricator.wikimedia.org/T349168 (10ayounsi) [08:23:08] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde, wmf for Ricki_Jay - https://phabricator.wikimedia.org/T349170 (10RickiJay-WMDE) [08:23:34] (03PS1) 10WMDE-Fisch: Revert "Revert "Workaround to center search terms label"" [extensions/AdvancedSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/966610 (https://phabricator.wikimedia.org/T252346) [08:27:45] (03CR) 10JMeybohm: [C: 03+2] wikifunctions: Use ClusterIP services for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/965718 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm) [08:28:34] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:28:47] (03Merged) 10jenkins-bot: wikifunctions: Use ClusterIP services for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/965718 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm) [08:30:07] (03PS1) 10Volans: locking: fix path for Spicerack modules locks [software/spicerack] - 10https://gerrit.wikimedia.org/r/966812 (https://phabricator.wikimedia.org/T341973) [08:31:10] (03PS1) 10Kosta Harlan: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) [08:32:33] (03CR) 10Kosta Harlan: ipoid: Update cronjob definition (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [08:40:17] (03CR) 10Volans: [C: 03+2] "Self-merging to fix bug found while testing v8.0.0" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966812 
(https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [08:40:58] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [08:44:11] (03PS1) 10Jcrespo: dbbackups: Start backing up only clusters 28 and 29 from ES [puppet] - 10https://gerrit.wikimedia.org/r/966814 (https://phabricator.wikimedia.org/T342685) [08:47:07] (03Merged) 10jenkins-bot: locking: fix path for Spicerack modules locks [software/spicerack] - 10https://gerrit.wikimedia.org/r/966812 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [08:50:51] (03CR) 10Jcrespo: "So this is the way I suggested to implement this- with a static configuration, as once we know this is working the first time, it should w" [puppet] - 10https://gerrit.wikimedia.org/r/966814 (https://phabricator.wikimedia.org/T342685) (owner: 10Jcrespo) [08:51:03] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [08:53:19] (03CR) 10Jcrespo: "One remaining question is the name of the backups- right now because we could only do full backups, the backups were called "es4"/"es5"- t" [puppet] - 10https://gerrit.wikimedia.org/r/966814 (https://phabricator.wikimedia.org/T342685) (owner: 10Jcrespo) [08:53:36] (03PS1) 10JMeybohm: wikifunctions: Make app and mesh port different [deployment-charts] - 10https://gerrit.wikimedia.org/r/966816 (https://phabricator.wikimedia.org/T343388) [08:54:07] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm [08:55:17] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:58:34] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/34/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/966568 (https://phabricator.wikimedia.org/T344884) (owner: 10Brennen Bearnes) [09:00:48] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@c17c91c]: Fix following yesterday weekly train deploy [airflow-dags@c17c91ce] [09:01:14] (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.0.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/966817 [09:01:58] !log aqu@deploy2002 deploy aborted: Fix following yesterday weekly train deploy [airflow-dags@c17c91ce] (duration: 01m 10s) [09:02:07] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@c17c91c]: Fix following yesterday weekly train deploy - Second try [airflow-dags@c17c91ce] [09:02:13] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@c17c91c]: Fix following yesterday weekly train deploy - Second try [airflow-dags@c17c91ce] (duration: 00m 06s) [09:02:22] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v8.0.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/966817 (owner: 10Volans) [09:03:33] (03CR) 10Jelto: [V: 03+1 C: 03+2] "lgtm, I'll test the change on phab2002 and then on phab1004" [puppet] - 10https://gerrit.wikimedia.org/r/966568 (https://phabricator.wikimedia.org/T344884) (owner: 10Brennen Bearnes) [09:04:59] (PuppetFailure) firing: Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:05:28] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1004.eqiad.wmnet [09:06:57] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage [09:07:59] (PuppetFailure) firing: (2) Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:08:32] 
(03PS1) 10Filippo Giunchedi: thanos: allow thanos-rule to serve /rule [puppet] - 10https://gerrit.wikimedia.org/r/966818 (https://phabricator.wikimedia.org/T349102) [09:08:34] (03PS1) 10Filippo Giunchedi: thanos: reverse-proxy /rule to rule-hosts [puppet] - 10https://gerrit.wikimedia.org/r/966819 (https://phabricator.wikimedia.org/T349102) [09:08:49] (03CR) 10JMeybohm: [C: 03+2] wikifunctions: Make app and mesh port different [deployment-charts] - 10https://gerrit.wikimedia.org/r/966816 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm) [09:09:18] (03CR) 10Jbond: Don't require dummy 'team' label for multi-owner alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/956794 (owner: 10Filippo Giunchedi) [09:09:52] (03Merged) 10jenkins-bot: wikifunctions: Make app and mesh port different [deployment-charts] - 10https://gerrit.wikimedia.org/r/966816 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm) [09:09:59] (PuppetFailure) firing: (3) Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:10:01] (03PS1) 10Volans: Upstream release v8.0.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/966820 [09:10:04] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage [09:10:14] (03CR) 10Volans: [C: 03+2] Upstream release v8.0.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/966820 (owner: 10Volans) [09:11:12] (03CR) 10Marostegui: mariadb: Productionize pc1016, pc2016 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) (owner: 10Marostegui) [09:11:14] (03PS7) 10Marostegui: mariadb: Productionize pc1016, pc2016 [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) [09:11:24] (03CR) 
10Jbond: "post merge comment" [alerts] - 10https://gerrit.wikimedia.org/r/966554 (owner: 10Slyngshede) [09:12:55] (03CR) 10Filippo Giunchedi: P:wmcs::metricsinfra: add meta monitoring app skeleton (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) (owner: 10Majavah) [09:13:05] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1004.eqiad.wmnet [09:13:12] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:13:17] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1006.eqiad.wmnet [09:14:06] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1007.eqiad.wmnet [09:16:14] (03CR) 10Jcrespo: [C: 03+2] "Allow me to merge this as is for the time being so I can test and generate backups right away, confirming this works; but please note that" [puppet] - 10https://gerrit.wikimedia.org/r/966814 (https://phabricator.wikimedia.org/T342685) (owner: 10Jcrespo) [09:16:23] (03Merged) 10jenkins-bot: Upstream release v8.0.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/966820 (owner: 10Volans) [09:16:49] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on stat1009.eqiad.wmnet with reason: Moving /home to /srv/home on stat1009 and rebooting [09:17:03] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on stat1009.eqiad.wmnet with reason: Moving /home to /srv/home on stat1009 and rebooting [09:17:13] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [09:17:27] jbond: I'm not sure I understand the comments on https://gerrit.wikimedia.org/r/c/operations/alerts/+/956794 as that change didn't change any 
semantics [09:17:33] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:17:40] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [09:18:12] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:18:19] (03CR) 10Volans: [C: 03+1] "I think this makes sense. Adding John." [puppet] - 10https://gerrit.wikimedia.org/r/960063 (https://phabricator.wikimedia.org/T345152) (owner: 10Hashar) [09:19:00] or at least that was my intention [09:19:24] (03CR) 10Volans: [C: 03+1] "Makes sense to me, adding John" [puppet] - 10https://gerrit.wikimedia.org/r/960062 (https://phabricator.wikimedia.org/T345152) (owner: 10Hashar) [09:19:49] i.e. the alerts still have per-team 'team' label, it is part of the expression via group_left [09:19:55] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [09:20:03] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/960064 (https://phabricator.wikimedia.org/T345152) (owner: 10Hashar) [09:20:07] (03CR) 10Majavah: P:wmcs::metricsinfra: add meta monitoring app skeleton (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) (owner: 10Majavah) [09:20:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/966809 (owner: 10Volans) [09:20:51] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [09:20:53] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1006.eqiad.wmnet [09:21:09] (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/966187 
(https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [09:21:54] (03CR) 10Jbond: [C: 03+2] "lgtm, merging" [puppet] - 10https://gerrit.wikimedia.org/r/960063 (https://phabricator.wikimedia.org/T345152) (owner: 10Hashar) [09:21:56] !log starting new backup of es1022, es1025 (new clusters only) [09:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:11] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [09:22:12] (03CR) 10Jbond: [C: 03+2] envoyproxy: remove skip_install from tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/960062 (https://phabricator.wikimedia.org/T345152) (owner: 10Hashar) [09:22:31] (03CR) 10Jbond: [C: 03+2] Remove minversion=1.6 from tox.ini files [puppet] - 10https://gerrit.wikimedia.org/r/960064 (https://phabricator.wikimedia.org/T345152) (owner: 10Hashar) [09:22:53] jbond: thank you :) [09:23:05] !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm [09:23:08] !log aborting backup of es1022, es1025 (there was already another backup running) [09:23:09] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [09:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:16] I was talking to volans about those changes and he reminded me that SRE foundation is the go-to team for anything related to Puppet + CI :) [09:24:33] (03CR) 10Filippo Giunchedi: [C: 03+2] "Please note that the semantics of these alerts didn't change, i.e. there's still a per-team label attached to the alerts via group_left()." 
[alerts] - 10https://gerrit.wikimedia.org/r/956794 (owner: 10Filippo Giunchedi) [09:25:56] !log uploaded spicerack_8.0.1 to apt.wikimedia.org bullseye-wikimedia [09:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:59] (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/966188 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [09:27:05] (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/966189 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [09:27:11] (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/966190 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [09:27:18] (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/966191 (owner: 10Volans) [09:28:27] (03PS5) 10Volans: svc records: add missing comments for reserved IPs [dns] - 10https://gerrit.wikimedia.org/r/965119 [09:30:25] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: fix --new with puppet 7 support [cookbooks] - 10https://gerrit.wikimedia.org/r/966809 (owner: 10Volans) [09:30:50] (03CR) 10Volans: [C: 03+2] sre.hosts.dhcp: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966187 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [09:31:04] (03CR) 10Volans: [C: 03+2] sre.hosts.provision: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966188 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [09:31:21] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966189 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [09:31:52] (03CR) 10Volans: [C: 03+2] sre.hosts.reboot-single: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966190 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [09:32:34] (03CR) 10Volans: [C: 03+2] tox.ini: remove optimization for tox <4 [cookbooks] - 
10https://gerrit.wikimedia.org/r/966191 (owner: 10Volans) [09:32:51] (03Merged) 10jenkins-bot: sre.hosts.reimage: fix --new with puppet 7 support [cookbooks] - 10https://gerrit.wikimedia.org/r/966809 (owner: 10Volans) [09:33:10] (03Merged) 10jenkins-bot: sre.hosts.dhcp: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966187 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [09:33:27] (03PS2) 10Ladsgroup: Set s6 and s8 to write both for pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966592 (https://phabricator.wikimedia.org/T345732) [09:33:36] (03Merged) 10jenkins-bot: sre.hosts.provision: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966188 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [09:33:50] (03Merged) 10jenkins-bot: sre.hosts.reimage: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966189 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [09:34:09] (03Merged) 10jenkins-bot: sre.hosts.reboot-single: make the lock per-host [cookbooks] - 10https://gerrit.wikimedia.org/r/966190 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [09:34:28] (03CR) 10Jbond: "thanks filippo" [alerts] - 10https://gerrit.wikimedia.org/r/956794 (owner: 10Filippo Giunchedi) [09:36:18] (03CR) 10Ayounsi: [C: 03+1] svc records: add missing comments for reserved IPs [dns] - 10https://gerrit.wikimedia.org/r/965119 (owner: 10Volans) [09:37:07] (03Merged) 10jenkins-bot: tox.ini: remove optimization for tox <4 [cookbooks] - 10https://gerrit.wikimedia.org/r/966191 (owner: 10Volans) [09:37:58] (03PS1) 10Slyngshede: puppet_agent_failed: label alert with appropriate team. [alerts] - 10https://gerrit.wikimedia.org/r/966821 [09:39:11] (03CR) 10CI reject: [V: 04-1] puppet_agent_failed: label alert with appropriate team. 
[alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede) [09:39:18] (03PS1) 10Hashar: debug_host: rename mangecode to managecode (typo) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966822 [09:47:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [09:47:36] !log volans@cumin2002 START - Cookbook sre.hosts.dhcp for host sretest1001.eqiad.wmnet [09:48:04] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest1001.eqiad.wmnet [09:49:29] (03CR) 10Ayounsi: [C: 03+1] "ship it, we can fine tune later, unless Chris have concerns" [alerts] - 10https://gerrit.wikimedia.org/r/902316 (https://phabricator.wikimedia.org/T328941) (owner: 10Volans) [09:50:41] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde, wmf for Ricki_Jay - https://phabricator.wikimedia.org/T349170 (10RickiJay-WMDE) 05Open→03Resolved a:03RickiJay-WMDE [09:52:05] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on stat1009.eqiad.wmnet with reason: Extending downtime for stat1009 [09:52:07] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on stat1009.eqiad.wmnet with reason: Extending downtime for stat1009 [09:52:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [09:54:14] !log volans@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [09:58:44] (03CR) 10Effie Mouzeli: [C: 03+1] role::redis::misc::{master,slave}: remove ORES configs [puppet] - 10https://gerrit.wikimedia.org/r/965124 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1000) [10:03:25] !log volans@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [10:03:42] (03CR) 10Effie Mouzeli: [C: 03+2] "This is grand!" [puppet] - 10https://gerrit.wikimedia.org/r/965124 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [10:04:35] jouncebot: nowandnext [10:04:35] For the next 0 hour(s) and 55 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1000) [10:04:35] In 2 hour(s) and 55 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1300) [10:07:23] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm [10:09:07] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [10:09:16] (03PS1) 10Jbond: sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 [10:09:18] (03PS1) 10Jbond: sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 [10:13:01] (03CR) 10CI reject: [V: 04-1] sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 (owner: 10Jbond) 
[10:13:28] (03CR) 10CI reject: [V: 04-1] sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 (owner: 10Jbond) [10:16:59] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage [10:17:04] (03PS1) 10Kevin Bazira: ml-services: add listeners for cxserver and eventgate-analytics to rec-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/966827 (https://phabricator.wikimedia.org/T348607) [10:18:53] (03CR) 10David Caro: P:wmcs::metricsinfra: add meta monitoring app skeleton (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) (owner: 10Majavah) [10:19:22] (03CR) 10Elukey: [C: 03+1] ml-services: add listeners for cxserver and eventgate-analytics to rec-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/966827 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [10:19:42] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage [10:21:10] (03CR) 10Kevin Bazira: [C: 03+2] "Thanks for the review Luca :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/966827 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [10:22:01] (03Merged) 10jenkins-bot: ml-services: add listeners for cxserver and eventgate-analytics to rec-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/966827 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [10:26:33] btullis: FYI the reboot cookbook is taking forever because it is waiting for a successful puppet run, but puppet is failing on stat1007, see https://puppetboard.wikimedia.org/report/stat1007.eqiad.wmnet/59b3decc9bbd3ac0bb06dbe8c55aecc4ed36a924 [10:27:21] volans: Thanks. I'm on it already with stevemunene - Should be resolved shortly. 
[10:27:36] (03PS2) 10Slyngshede: puppet_agent_failed: label alert with appropriate team. [alerts] - 10https://gerrit.wikimedia.org/r/966821 [10:28:26] ack [10:28:28] great [10:28:49] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [10:28:49] (03CR) 10CI reject: [V: 04-1] puppet_agent_failed: label alert with appropriate team. [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede) [10:30:57] (03CR) 10Slyngshede: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede) [10:31:58] should be ok now volans btullis [10:32:41] !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm [10:33:18] (03PS1) 10Jbond: late_command: drop signed-by config [puppet] - 10https://gerrit.wikimedia.org/r/966825 [10:35:02] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1007.eqiad.wmnet [10:35:49] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/966825 (owner: 10Jbond) [10:35:51] (03CR) 10Jbond: [C: 03+2] late_command: drop signed-by config [puppet] - 10https://gerrit.wikimedia.org/r/966825 (owner: 10Jbond) [10:37:13] !log volans@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1001.eqiad.wmnet with OS bullseye [10:40:18] !log re-enabled puppet on the cumin hosts. 
installed spicerack 8.0.1 on the cumin hosts [10:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:24] (03CR) 10Effie Mouzeli: ipoid: Update cronjob definition (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [10:50:21] (03CR) 10Effie Mouzeli: [C: 03+1] tegola: Configure logger to use json output [deployment-charts] - 10https://gerrit.wikimedia.org/r/966810 (https://phabricator.wikimedia.org/T347717) (owner: 10Jgiannelos) [10:50:59] (03PS2) 10Kosta Harlan: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) [10:51:04] (03CR) 10Kosta Harlan: ipoid: Update cronjob definition (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [10:51:06] (03CR) 10Jgiannelos: [C: 03+2] tegola: Configure logger to use json output [deployment-charts] - 10https://gerrit.wikimedia.org/r/966810 (https://phabricator.wikimedia.org/T347717) (owner: 10Jgiannelos) [10:51:54] (03Merged) 10jenkins-bot: tegola: Configure logger to use json output [deployment-charts] - 10https://gerrit.wikimedia.org/r/966810 (https://phabricator.wikimedia.org/T347717) (owner: 10Jgiannelos) [10:52:45] PROBLEM - SSH on stat1009 is CRITICAL: connect to address 10.64.21.17 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:56:52] (03Abandoned) 10Effie Mouzeli: ipoid: Remove APP_CONFIG env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/935720 (owner: 10Alexandros Kosiaris) [10:58:44] (03PS1) 10Hashar: debug_presentation: script to render HTML templates [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 [10:59:10] (03CR) 10Ladsgroup: [C: 03+2] Set s6 and s8 to write both for pagelinks migration [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/966592 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [10:59:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966592 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [10:59:39] (03PS3) 10Effie Mouzeli: osm: remove imposm-deploy-import [puppet] - 10https://gerrit.wikimedia.org/r/862281 [10:59:55] (03Merged) 10jenkins-bot: Set s6 and s8 to write both for pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966592 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [11:00:01] (03PS1) 10Hashar: tox: remove envdir optimizations [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966847 (https://phabricator.wikimedia.org/T348434) [11:00:59] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:966592|Set s6 and s8 to write both for pagelinks migration (T345732)]] [11:01:10] (03CR) 10CI reject: [V: 04-1] debug_presentation: script to render HTML templates [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 (owner: 10Hashar) [11:01:11] T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732 [11:01:41] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm [11:02:21] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:966592|Set s6 and s8 to write both for pagelinks migration (T345732)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:03:00] (03PS1) 10Hnowlan: editor-analytics: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/966848 (https://phabricator.wikimedia.org/T336415) [11:03:38] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:04:39] (03PS1) 10Jgiannelos: tegola: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/966849 [11:05:35] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [11:06:08] (03PS5) 10Cathal Mooney: Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) [11:07:20] (03CR) 10CI reject: [V: 04-1] Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) (owner: 10Cathal Mooney) [11:08:23] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [11:10:09] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966822 (owner: 10Hashar) [11:11:02] (03CR) 10Effie Mouzeli: [C: 03+1] tegola: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/966849 (owner: 10Jgiannelos) [11:11:09] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:966592|Set s6 and s8 to write both for pagelinks migration (T345732)]] (duration: 10m 10s) [11:11:13] T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732 [11:11:28] (03PS1) 10Hnowlan: trafficserver: route editor-analytics service [puppet] - 10https://gerrit.wikimedia.org/r/966851 (https://phabricator.wikimedia.org/T336415) [11:12:00] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage [11:12:09] (03CR) 10Jgiannelos: [C: 03+2] tegola: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/966849 (owner: 10Jgiannelos) [11:12:59] (03Merged) 10jenkins-bot: tegola: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/966849 
(owner: 10Jgiannelos) [11:13:10] (03CR) 10Hnowlan: [C: 03+2] editor-analytics: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/966848 (https://phabricator.wikimedia.org/T336415) (owner: 10Hnowlan) [11:13:32] (03CR) 10Jbond: [C: 03+1] "seems fine to me, just need to run tox -e py3-format" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 (owner: 10Hashar) [11:14:05] (03Merged) 10jenkins-bot: editor-analytics: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/966848 (https://phabricator.wikimedia.org/T336415) (owner: 10Hnowlan) [11:14:47] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage [11:15:16] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966847 (https://phabricator.wikimedia.org/T348434) (owner: 10Hashar) [11:16:01] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [11:16:03] (03PS1) 10Btullis: Set the class for each of the spark shuffle services [puppet] - 10https://gerrit.wikimedia.org/r/966853 (https://phabricator.wikimedia.org/T344910) [11:16:15] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [11:18:41] (03PS2) 10Jbond: sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 [11:18:43] (03PS2) 10Jbond: sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 [11:19:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:20:58] !log hnowlan@deploy2002 helmfile [eqiad] START 
helmfile.d/services/editor-analytics: apply [11:21:13] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [11:21:40] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 8 NOOP 20): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/966853 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [11:22:47] (03CR) 10CI reject: [V: 04-1] sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 (owner: 10Jbond) [11:23:13] (03CR) 10CI reject: [V: 04-1] sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 (owner: 10Jbond) [11:23:54] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [11:24:27] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [11:27:16] (03PS2) 10Hashar: debug_presentation: script to render HTML templates [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 [11:27:22] (03CR) 10Hashar: debug_presentation: script to render HTML templates (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 (owner: 10Hashar) [11:28:57] (03PS3) 10Jbond: sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 [11:28:59] (03PS3) 10Jbond: sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 [11:29:02] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [11:29:14] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [11:29:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:31:16] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [11:32:50] (03CR) 10CI reject: [V: 04-1] sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 (owner: 10Jbond) [11:32:59] (03CR) 10CI reject: [V: 04-1] sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 (owner: 10Jbond) [11:33:26] (03CR) 10Hashar: [C: 04-1] "I need to set `html.change_id` and `html.job_id`" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 (owner: 10Hashar) [11:33:38] (03CR) 10Jbond: [C: 03+1] debug_presentation: script to render HTML templates [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 (owner: 10Hashar) [11:34:22] (03PS1) 10Volans: locking: delete the key on etcd if no locks remain [software/spicerack] - 10https://gerrit.wikimedia.org/r/966854 (https://phabricator.wikimedia.org/T341973) [11:34:23] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [11:35:33] (03PS3) 10Hashar: debug_presentation: script to render HTML templates [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 [11:35:41] (03PS6) 10Cathal Mooney: Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) [11:36:36] (03CR) 10Hashar: "The rendering is missing diffs, catalogues and always flag a compilation failure cause the PCC files are missing. 
But that is a first pass" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 (owner: 10Hashar) [11:36:52] (03CR) 10CI reject: [V: 04-1] Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) (owner: 10Cathal Mooney) [11:38:51] (03PS4) 10Jbond: sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 [11:38:53] (03PS4) 10Jbond: sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 [11:39:35] (03CR) 10Slyngshede: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede) [11:43:23] (03CR) 10CI reject: [V: 04-1] sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 (owner: 10Jbond) [11:43:27] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm [11:44:15] RECOVERY - SSH on stat1009 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:44:34] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1009.eqiad.wmnet [11:44:53] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966854 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [11:48:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [11:50:13] (03PS5) 10KartikMistry: Update cxserver to 2023-10-12-080927-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T344982) [11:51:43] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1009.eqiad.wmnet [11:55:00] (03CR) 10Jbond: [C: 04-1] "thanks for the follow up but see inline the > 0 will exclude the resources == 0 check" [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede) [11:58:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [12:15:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [12:17:26] (03CR) 10Volans: [C: 03+2] locking: delete the key on etcd if no locks remain [software/spicerack] - 10https://gerrit.wikimedia.org/r/966854 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [12:17:28] !log repool db2161 and db1126 [12:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P53001 and previous config saved to /var/cache/conftool/dbconfig/20231018-121811-arnaudb.json [12:18:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P53002 and previous config saved to /var/cache/conftool/dbconfig/20231018-121828-arnaudb.json [12:20:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [12:24:48] (03PS3) 10Slyngshede: puppet_agent_failed: label alert with appropriate team. [alerts] - 10https://gerrit.wikimedia.org/r/966821 [12:25:01] (03Merged) 10jenkins-bot: locking: delete the key on etcd if no locks remain [software/spicerack] - 10https://gerrit.wikimedia.org/r/966854 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [12:26:22] (03CR) 10Slyngshede: puppet_agent_failed: label alert with appropriate team. (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede) [12:31:29] * kart_ deploying cxserver.. 
[12:31:39] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-10-12-080927-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T344982) (owner: 10KartikMistry) [12:32:33] (03Merged) 10jenkins-bot: Update cxserver to 2023-10-12-080927-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T344982) (owner: 10KartikMistry) [12:33:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P53003 and previous config saved to /var/cache/conftool/dbconfig/20231018-123315-arnaudb.json [12:33:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P53004 and previous config saved to /var/cache/conftool/dbconfig/20231018-123333-arnaudb.json [12:36:56] (03CR) 10Arnaudb: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) (owner: 10Marostegui) [12:37:44] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:37:54] (03PS2) 10Anzx: dewiktionary: add tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966569 (https://phabricator.wikimedia.org/T348978) [12:38:08] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:38:08] (03CR) 10Slyngshede: [V: 03+1] "Could you do a preliminary check, mostly regarding my solution for only rolling this out on sretest. 
I'd like to test is a little better b" [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [12:38:44] (03PS2) 10Anzx: knwiktionary: update logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966574 (https://phabricator.wikimedia.org/T349036) [12:39:47] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2109.codfw.wmnet with reason: db2109 downtime while repooling [12:40:01] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2109.codfw.wmnet with reason: db2109 downtime while repooling [12:42:03] (03CR) 10Joal: [C: 03+1] Turnilo: wmf_netflow: change forwarded 1/0 to yes/no [puppet] - 10https://gerrit.wikimedia.org/r/966800 (https://phabricator.wikimedia.org/T331707) (owner: 10Ayounsi) [12:42:21] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye [12:43:53] (03PS5) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [12:44:23] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:44:47] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [12:44:58] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:48:09] 10SRE, 10SRE-tools, 10DNS, 10Infrastructure-Foundations, and 2 others: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (10ayounsi) [12:48:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P53005 and previous config saved to /var/cache/conftool/dbconfig/20231018-124820-arnaudb.json [12:48:29] (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.0.2 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/966858 [12:48:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P53006 and previous config saved to /var/cache/conftool/dbconfig/20231018-124838-arnaudb.json [12:48:42] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v8.0.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/966858 (owner: 10Volans) [12:49:14] 10SRE, 10ops-codfw, 10User-dcaro, 10cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches, ceph) - https://phabricator.wikimedia.org/T346661 (10Papaul) @nskaggs you are correct even 1 additional rack isnt't possible at this time. Sorry about that. [12:51:36] Keeping watch on eqiad graphs.. [12:51:42] !log upload puppet_7.23.0-1~debu11u1 (bullseye backport [12:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:04] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [12:53:29] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede) [12:54:28] (03CR) 10Volans: "LGTM, suggested a modification inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 (owner: 10Jbond) [12:55:17] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:56:02] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v8.0.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/966858 (owner: 10Volans) [12:56:30] (03CR) 10Volans: sre.puppet: move get_puppet_version to sre.puppet (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 (owner: 10Jbond) [12:57:10] 10SRE, 10SRE-tools, 
10DNS, 10Infrastructure-Foundations, and 2 others: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (10ayounsi) Let's move all the A/AAAA SVC records to Netbox. And keep the CNAMEs in the DNS repo if we can't get rid of them. Then have follow up tasks t... [12:58:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [12:59:08] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:59:33] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:59:44] (03PS1) 10Volans: Upstream release v8.0.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/966859 [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1300). [13:00:05] kimberly_sarabia, WMDE-Fisch, and aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:06] (03PS4) 10Slyngshede: puppet_agent_failed: label alert with appropriate team. [alerts] - 10https://gerrit.wikimedia.org/r/966821 [13:00:28] (03CR) 10Volans: [C: 03+2] Upstream release v8.0.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/966859 (owner: 10Volans) [13:00:35] * TheresNoTime can't deploy right now [13:00:47] OK. That's not working for cxserver; reverting patch.. [13:00:49] (03CR) 10Slyngshede: puppet_agent_failed: label alert with appropriate team. 
(032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede) [13:00:49] o/ [13:01:16] (03PS1) 10KartikMistry: Revert "Update cxserver to 2023-10-12-080927-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/966611 [13:01:19] (03CR) 10CI reject: [V: 04-1] puppet_agent_failed: label alert with appropriate team. [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede) [13:01:23] (03PS3) 10Anzx: dewiktionary: add tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966569 (https://phabricator.wikimedia.org/T348978) [13:01:56] ( I can't deploy myself though ) [13:02:21] (03CR) 10KartikMistry: [C: 03+2] Revert "Update cxserver to 2023-10-12-080927-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/966611 (owner: 10KartikMistry) [13:02:23] (03CR) 10Slyngshede: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede) [13:02:51] (03CR) 10Btullis: [V: 03+1] "Giving myself a +2 because the spark failure is affecting every job on the hadoop-test cluster." 
[puppet] - 10https://gerrit.wikimedia.org/r/966853 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:02:53] (03CR) 10Btullis: [V: 03+1 C: 03+2] Set the class for each of the spark shuffle services [puppet] - 10https://gerrit.wikimedia.org/r/966853 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:03:02] hello [13:03:10] (03Merged) 10jenkins-bot: Revert "Update cxserver to 2023-10-12-080927-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/966611 (owner: 10KartikMistry) [13:03:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2161 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P53007 and previous config saved to /var/cache/conftool/dbconfig/20231018-130325-arnaudb.json [13:03:43] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P53008 and previous config saved to /var/cache/conftool/dbconfig/20231018-130343-arnaudb.json [13:04:23] (03CR) 10Jforrester: "Neat!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [13:04:27] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [13:04:33] (03CR) 10Ssingh: [C: 03+2] Revert "wmfusercontent: add TXT record for cert validation" [dns] - 10https://gerrit.wikimedia.org/r/966243 (owner: 10Ssingh) [13:04:43] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [13:04:58] !log running authdns-update for CR 966243 [13:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:07] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [13:05:30] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [13:06:06] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [13:06:29] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [13:06:30] (03CR) 10Filippo Giunchedi: P:wmcs::metricsinfra: add meta monitoring app skeleton (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) (owner: 10Majavah) [13:06:56] (03Merged) 10jenkins-bot: Upstream release v8.0.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/966859 (owner: 10Volans) [13:07:34] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [13:07:38] So nobody able to deploy? :-/ [13:07:59] (PuppetFailure) firing: Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:08:20] hnowlan: around? [13:09:13] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice!" 
[alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede) [13:09:59] (PuppetFailure) firing: (3) Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:10:35] I deployed and reverted cxserver patch, but reverting change is not affecting. What can be reason? Patch reverted was: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/965022 [13:10:48] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [13:11:32] Anyone around to deploy? [13:12:59] (PuppetFailure) resolved: Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:13:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [13:14:07] (03PS2) 10Cathal Mooney: Add homer automation for management router bgp [homer/public] - 10https://gerrit.wikimedia.org/r/966581 (https://phabricator.wikimedia.org/T312635) [13:14:23] !log uploaded spicerack_8.0.2 to apt.wikimedia.org bullseye-wikimedia [13:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:34] (03CR) 10Cathal Mooney: Add homer automation for management router bgp (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/966581 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [13:14:52] (03CR) 10Ayounsi: [C: 03+2] Turnilo: wmf_netflow: change forwarded 1/0 to yes/no [puppet] - 10https://gerrit.wikimedia.org/r/966800 (https://phabricator.wikimedia.org/T331707) (owner: 10Ayounsi) [13:15:42] (03CR) 10Cathal Mooney: [C: 03+2] Add homer automation for management router bgp [homer/public] - 10https://gerrit.wikimedia.org/r/966581 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [13:16:04] (03CR) 10Fabfur: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/966573 (https://phabricator.wikimedia.org/T348335) (owner: 10Ssingh) [13:16:16] (03PS1) 10Majavah: kubeadm: drop default [puppet] - 10https://gerrit.wikimedia.org/r/966863 [13:16:18] (03PS1) 10Majavah: kubeadm: drop version upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/966864 (https://phabricator.wikimedia.org/T343869) [13:16:20] (03PS1) 10Majavah: aptrepo: drop k8s 1.22 components [puppet] - 10https://gerrit.wikimedia.org/r/966865 (https://phabricator.wikimedia.org/T298005) [13:17:25] (03PS1) 10KartikMistry: cxserver: Pin chart to 0.2.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/966866 [13:17:26] Looks like I have to pin chart to 0.2.2 [13:17:35] (03CR) 10Majavah: [V: 03+1] 
"PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/36/console" [puppet] - 10https://gerrit.wikimedia.org/r/966863 (owner: 10Majavah) [13:17:43] (03PS1) 10Kosta Harlan: [WIP] ipoid: Set PROXY_HOST and PROXY_PORT [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) [13:18:03] kart_: hey, here [13:18:13] (03CR) 10CI reject: [V: 04-1] cxserver: Pin chart to 0.2.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/966866 (owner: 10KartikMistry) [13:19:27] kart_: you could also bump to 0.2.4, might be easier [13:19:33] (03PS2) 10Kosta Harlan: [WIP] ipoid: Set PROXY_HOST and PROXY_PORT [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) [13:19:35] if it's an emergency I can do a rollback in helm [13:19:47] (03PS2) 10KartikMistry: cxserver: Pin chart to 0.2.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/966866 [13:20:11] (03PS3) 10Kosta Harlan: [WIP] ipoid: Set PROXY_HOST and PROXY_PORT [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) [13:20:17] (03PS4) 10Kosta Harlan: [WIP] ipoid: Set PROXY_HOST and PROXY_PORT [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) [13:20:26] hnowlan: can you please do emergency revert? 
[13:20:35] (03CR) 10Jbond: [C: 03+1] "thanks lgtm" [alerts] - 10https://gerrit.wikimedia.org/r/966821 (owner: 10Slyngshede) [13:20:39] (03CR) 10Kosta Harlan: [WIP] ipoid: Set PROXY_HOST and PROXY_PORT (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: 10Kosta Harlan) [13:21:37] kart_: ack, will do [13:21:58] hnowlan: I just broke cxserver again :/ [13:22:44] (03CR) 10Herron: [C: 03+1] thanos: allow thanos-rule to serve /rule [puppet] - 10https://gerrit.wikimedia.org/r/966818 (https://phabricator.wikimedia.org/T349102) (owner: 10Filippo Giunchedi) [13:23:14] !log volans@cumin2002 START - Cookbook sre.hosts.dhcp for host sretest1001.eqiad.wmnet [13:23:35] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest1001.eqiad.wmnet [13:23:48] kart_: there was a deploy done at 12:44 and 13:05 today - do you want me to roll back to 12:44 or to the previous one on Wed Oct 11 11:57:42 2023 which uses the 0.2.2 chart? [13:24:52] hnowlan: one with 0.2.2 chart [13:25:11] (03CR) 10Effie Mouzeli: [C: 03+2] osm: remove imposm-deploy-import [puppet] - 10https://gerrit.wikimedia.org/r/862281 (owner: 10Effie Mouzeli) [13:25:32] (03CR) 10Herron: [C: 03+1] thanos: reverse-proxy /rule to rule-hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966819 (https://phabricator.wikimedia.org/T349102) (owner: 10Filippo Giunchedi) [13:25:57] kart_: rollback done [13:27:08] (03PS1) 10Jforrester: wikifunctions: Update evaluators to WASM images [deployment-charts] - 10https://gerrit.wikimedia.org/r/966868 (https://phabricator.wikimedia.org/T343829) [13:27:17] hnowlan: Thanks a lot! [13:27:33] hnowlan: I should have rollback access, right? 
[13:28:43] kart_: for emergency rollback you'd need sudo on the deploy hosts [13:28:51] (03Merged) 10jenkins-bot: Add homer automation for management router bgp [homer/public] - 10https://gerrit.wikimedia.org/r/966581 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [13:28:59] hnowlan: Can you revisit patch, https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/965022 -- what's wrong with it? Also, do you know if any other services also done similar work? [13:29:05] hnowlan: noted. [13:29:11] generally we'd encourage using helmfile for rollbacks where it isn't a critical situation [13:29:21] I actually didn't look at the service, what failed? [13:30:02] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10fgiunchedi) [13:30:06] 10Puppet, 10SRE, 10Observability-Alerting, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10fgiunchedi) 05Open→03Declined [13:30:08] hnowlan: Page loading from RestBase (ie example, https://cxserver.wikimedia.org/v2/page/es/it/Mariana_BO) So, cxserver won't load page and thus fails. [13:31:13] Configuration seems wrong for sure. [13:33:17] hard to know without more logging in the service really. is there debug logging we could turn on in staging? [13:33:46] (03CR) 10Jbond: [C: 04-1] "see comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [13:34:47] I'll come up with some ideas later but I need to step out for dinner now. Let's talk on the patch once I submit new patch if that's fine? 
[13:34:52] generally I would test a change like this in staging before rolling to eqiad/codfw by doing something along the lines of `curl -vk https://staging.svc.eqiad.wmnet:4002/v2/page/es/it/Mariana_BO` [13:35:04] (03CR) 10Xcollazo: [C: 03+1] Set the class for each of the spark shuffle services [puppet] - 10https://gerrit.wikimedia.org/r/966853 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:35:19] assuming that path/port combo are correct and `Page es:Mariana_BO could not be found.` is the error you were seeing [13:36:09] Yes [13:36:24] (03CR) 10Jbond: [C: 03+1] debug_presentation: script to render HTML templates (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966846 (owner: 10Hashar) [13:38:25] 10SRE, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10Data-Platform-SRE, and 4 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (10dcausse) [13:38:33] (03PS1) 10Majavah: openstack: encapi: don't try to close the connection [puppet] - 10https://gerrit.wikimedia.org/r/966871 (https://phabricator.wikimedia.org/T349195) [13:38:53] kart_: the URI path you're using to access the mwapi rest.php is incorrect - you need to specify the host header in your request and remove it from the URI (for example - `curl -H "Host: es.wikipedia.org" localhost:6500/w/rest.php/v1/page/Mariana_BO`) [13:38:56] (03CR) 10Jbond: sre.puppet: move get_puppet_version to sre.puppet (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 (owner: 10Jbond) [13:39:13] (03PS1) 10Cathal Mooney: Remove superflous brackets from bgp templates [homer/public] - 10https://gerrit.wikimedia.org/r/966872 [13:39:16] that's probably what breaks. it'd be nice to have some kind of logging to make that more visible within the application though [13:39:58] Noted. 
[13:40:00] (03CR) 10Cathal Mooney: [C: 03+2] Remove superflous brackets from bgp templates [homer/public] - 10https://gerrit.wikimedia.org/r/966872 (owner: 10Cathal Mooney) [13:40:38] (03Merged) 10jenkins-bot: Remove superflous brackets from bgp templates [homer/public] - 10https://gerrit.wikimedia.org/r/966872 (owner: 10Cathal Mooney) [13:41:48] hnowlan: Can you please also comment on the patch? IRC logs are easier to lose :) [13:41:58] ack [13:42:15] (03PS7) 10Slyngshede: C:prometheus::ethtool_export Add ethtool exporter. [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) [13:43:08] (03CR) 10Hnowlan: Update cxserver to 2023-10-12-080927-production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T344982) (owner: 10KartikMistry) [13:44:17] Thanks! [13:44:53] 10SRE-swift-storage, 10serviceops, 10Patch-For-Review: thanos-be hosts filing up root filesystem with logs - https://phabricator.wikimedia.org/T297959 (10fgiunchedi) [13:45:27] (03CR) 10Vgutierrez: [C: 04-1] haproxy: enable healthcheck-dedicated backend (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [13:50:18] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/966865 (https://phabricator.wikimedia.org/T298005) (owner: 10Majavah) [13:50:51] (03CR) 10Btullis: [C: 03+1] install_server: create aqs reuse partition reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/965767 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans) [13:51:07] (03PS4) 10Eevans: install_server: create aqs partition reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/965767 (https://phabricator.wikimedia.org/T347738) [13:51:39] (03PS5) 10Eevans: install_server: create aqs partition reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/965767 (https://phabricator.wikimedia.org/T347738) [13:52:57] (03PS5) 10Jbond: 
sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 [13:52:59] (03PS5) 10Jbond: sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 [13:53:19] (03CR) 10Jbond: "updated thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 (owner: 10Jbond) [13:53:36] (03CR) 10Eevans: [C: 03+2] install_server: create aqs partition reuse recipe (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/965767 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans) [13:54:30] (03PS1) 10Volans: spicerack: enable distributed locking in production [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) [13:56:52] (03CR) 10CI reject: [V: 04-1] spicerack: enable distributed locking in production [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [13:57:46] (03CR) 10CI reject: [V: 04-1] sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 (owner: 10Jbond) [13:58:25] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye [13:59:05] (03Abandoned) 10KartikMistry: cxserver: Pin chart to 0.2.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/966866 (owner: 10KartikMistry) [14:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1400) [14:01:18] (03CR) 10Vgutierrez: [C: 03+1] svc records: add missing comments for reserved IPs [dns] - 10https://gerrit.wikimedia.org/r/965119 (owner: 10Volans) [14:02:15] (03PS2) 10Volans: spicerack: enable distributed locking in production [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) [14:03:11] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1108.eqiad.wmnet with OS 
bullseye [14:03:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye [14:03:27] PROBLEM - thanos.wikimedia.org tls expiry on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:03:39] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:03:48] (03PS6) 10Jbond: sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 [14:03:56] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 (owner: 10Jbond) [14:04:13] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:04:21] PROBLEM - thanos.wikimedia.org requires authentication on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:04:37] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:07:04] PROBLEM - nova-compute proc minimum on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:07:14] (03PS1) 10Slavina Stefanova: harbor: upgrade from 2.5 to 2.9 [puppet] - 10https://gerrit.wikimedia.org/r/966874 (https://phabricator.wikimedia.org/T346241) [14:08:27] RECOVERY - nova-compute proc minimum on cloudvirt1058 is OK: PROCS OK: 1 process with regex args 
^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:08:36] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:10:13] (03CR) 10Fabfur: [C: 03+1] "headers and content seems similar between the two endpoints" [puppet] - 10https://gerrit.wikimedia.org/r/966851 (https://phabricator.wikimedia.org/T336415) (owner: 10Hnowlan) [14:10:22] 10SRE, 10Fundraising-Backlog, 10SRE Observability: Simplify and fix icinga fr-tech user configuration - https://phabricator.wikimedia.org/T348559 (10lmata) Will radar for now; please let us know if you'd like us to engage somehow. [14:12:11] (03PS2) 10Effie Mouzeli: P:memcached::memkeys: move templates under profile/ [puppet] - 10https://gerrit.wikimedia.org/r/955701 (owner: 10Majavah) [14:12:58] (03Abandoned) 10Effie Mouzeli: Revert "tegola-vector-tiles: use tegola image with debug enabled on codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/951850 (owner: 10Effie Mouzeli) [14:13:38] (03CR) 10Jbond: "ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 (owner: 10Jbond) [14:18:55] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1114'] [14:20:07] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [14:20:25] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1108.eqiad.wmnet with reason: host reimage [14:21:25] (03PS1) 10Jbond: sre.puppet.renew-cert: don't disable puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966877 [14:22:48] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 (owner: 
10Jbond) [14:23:29] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/966877 (owner: 10Jbond) [14:23:31] (03CR) 10David Caro: "LGTM, to test this in toolsbeta, you have to ssh to the puppetmaster there:" [puppet] - 10https://gerrit.wikimedia.org/r/966874 (https://phabricator.wikimedia.org/T346241) (owner: 10Slavina Stefanova) [14:23:35] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1108.eqiad.wmnet with reason: host reimage [14:25:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1114'] [14:25:26] (03PS3) 10Volans: spicerack: enable distributed locking in production [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) [14:25:42] (03CR) 10Hnowlan: [C: 03+2] trafficserver: route editor-analytics service [puppet] - 10https://gerrit.wikimedia.org/r/966851 (https://phabricator.wikimedia.org/T336415) (owner: 10Hnowlan) [14:25:44] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [14:25:50] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) @dcaro so i submitted the logs and here is Dells Response. The only errors showing in the System Event Log (SEL) ar... 
[14:26:12] (03PS6) 10Volans: svc records: add missing comments for reserved IPs [dns] - 10https://gerrit.wikimedia.org/r/965119 [14:27:43] (03CR) 10Jbond: [C: 03+1] "lgtm, nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [14:27:48] (03CR) 10Volans: [C: 03+2] svc records: add missing comments for reserved IPs (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/965119 (owner: 10Volans) [14:27:50] (03PS1) 10Hashar: Use macros for links to Gerrit and Jenkins [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966878 [14:27:52] (03PS1) 10Hashar: Add style to HTML output [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966879 [14:28:19] (03CR) 10Jbond: [C: 03+2] sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966823 (owner: 10Jbond) [14:28:21] (03CR) 10Jbond: [C: 03+2] sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/966824 (owner: 10Jbond) [14:28:23] (03CR) 10Jbond: [C: 03+2] sre.puppet.renew-cert: don't disable puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/966877 (owner: 10Jbond) [14:28:40] (03CR) 10Hashar: "Screenshot: https://phabricator.wikimedia.org/F38608194" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966879 (owner: 10Hashar) [14:29:07] (03CR) 10Hashar: "That is not really needed, but I felt we could avoid repetition :)" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/966878 (owner: 10Hashar) [14:31:46] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1114.eqiad.wmnet with OS bullseye [14:31:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye [14:32:24] (03PS4) 10Volans: spicerack: enable 
distributed locking in production [puppet] - https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) [14:32:38] (Merged) jenkins-bot: sre.puppet: move get_puppet_version to sre.puppet [cookbooks] - https://gerrit.wikimedia.org/r/966823 (owner: Jbond) [14:32:50] (CR) Volans: "addressed comments" [puppet] - https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: Volans) [14:32:54] (Merged) jenkins-bot: sre.puppet.renew-cert: update to work with puppetserver [cookbooks] - https://gerrit.wikimedia.org/r/966824 (owner: Jbond) [14:32:56] (Merged) jenkins-bot: sre.puppet.renew-cert: don't disable puppet [cookbooks] - https://gerrit.wikimedia.org/r/966877 (owner: Jbond) [14:33:38] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: Volans) [14:34:17] (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: Volans) [14:34:23] SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (dcaro) just in case you need it: > What OS are you running on the server? Debian Bullseye (11): 5.10.0-19-amd64 #1 SMP Debian 5.10.1...
[14:34:49] (CR) CI reject: [V: -1] spicerack: enable distributed locking in production [puppet] - https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: Volans) [14:36:20] SRE, Fundraising-Backlog, SRE Observability, fundraising-tech-ops: Simplify and fix icinga fr-tech user configuration - https://phabricator.wikimedia.org/T348559 (Jgreen) [14:36:22] RECOVERY - ensure kvm processes are running on cloudvirt1051 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:38:36] (JobUnavailable) firing: (9) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:11] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [14:41:13] (CR) Jbond: [C: +1] "lgtm" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/966878 (owner: Hashar) [14:42:09] (CR) Jbond: [C: +1] "LGTM, can you also add a changelog entry for all theses changes either as a new CR or can be included in this one" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/966879 (owner: Hashar) [14:44:28] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1010.eqiad.wmnet with OS bullseye [14:46:22] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1114.eqiad.wmnet with reason: host reimage [14:49:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1114.eqiad.wmnet with reason: host reimage [14:51:55] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [14:51:56] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1108.eqiad.wmnet with OS bullseye [14:52:02] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye completed: - cp1108 (**PASS**) - Removed from Puppet... [14:53:37] (JobUnavailable) firing: (9) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:12] (PS1) Herron: graphite-web: switch logrotate to copytruncate [puppet] - https://gerrit.wikimedia.org/r/966881 [14:54:21] (PS2) Herron: graphite-web: switch logrotate to copytruncate [puppet] - https://gerrit.wikimedia.org/r/966881 [14:56:02] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (Jclark-ctr) [14:56:12] (PS1) Elukey: install_server: fix reuse-parts-test.cfg [puppet] - https://gerrit.wikimedia.org/r/966882 [14:56:24] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1110'] [14:56:32] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1111'] [14:56:49] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1111'] [14:57:01] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1111'] [14:57:06] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1111'] [14:57:09] RECOVERY - BFD status on 
cr1-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:57:12] (CR) Eevans: [C: +1] install_server: fix reuse-parts-test.cfg [puppet] - https://gerrit.wikimedia.org/r/966882 (owner: Elukey) [14:57:20] !log volans@cumin2002 START - Cookbook sre.hosts.dhcp for host sretest1001.eqiad.wmnet [14:57:33] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest1001.eqiad.wmnet [14:57:50] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1111 [14:58:06] !log powercycle titan1001 (no mgmt console / tty available, no host metrics, no ssh) [14:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:20] (CR) Elukey: [C: +2] install_server: fix reuse-parts-test.cfg [puppet] - https://gerrit.wikimedia.org/r/966882 (owner: Elukey) [14:58:55] (CR) Herron: "please lmk what if any tasks should be attached" [puppet] - https://gerrit.wikimedia.org/r/966881 (owner: Herron) [14:58:58] (PS2) Hashar: tox: remove envdir optimizations [software/puppet-compiler] - https://gerrit.wikimedia.org/r/966847 (https://phabricator.wikimedia.org/T348434) [14:59:00] (PS2) Hashar: debug_host: rename mangecode to managecode (typo) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/966822 [14:59:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1111 [14:59:02] (PS4) Hashar: debug_presentation: script to render HTML templates [software/puppet-compiler] - https://gerrit.wikimedia.org/r/966846 [14:59:05] (PS2) Hashar: Use macros for links to Gerrit and Jenkins [software/puppet-compiler] - https://gerrit.wikimedia.org/r/966878 [14:59:07] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1111'] [14:59:07] (PS2) Hashar: Add style to HTML output [software/puppet-compiler] - 
https://gerrit.wikimedia.org/r/966879 [14:59:10] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1111'] [14:59:16] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1111'] [14:59:31] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1010.eqiad.wmnet with OS bullseye [14:59:38] (CR) Hashar: "Rebased in order to add a CHANGELOG entry and avoid conflicting with another series of patches. They are now all in a single series." [software/puppet-compiler] - https://gerrit.wikimedia.org/r/966847 (https://phabricator.wikimedia.org/T348434) (owner: Hashar) [14:59:39] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1111'] [15:00:03] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [15:00:49] (CR) Hashar: "I have amended the whole series to have each change add an entry in CHANGELOG. Based on this last change:" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/966879 (owner: Hashar) [15:01:25] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:01:33] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1111'] [15:01:35] RECOVERY - thanos.wikimedia.org tls expiry on titan1001 is OK: OK - Certificate thanos-query.discovery.wmnet will expire on Fri 03 Nov 2023 08:51:00 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:01:36] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1111'] [15:01:49] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:01:53] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:02:07] RECOVERY - thanos.wikimedia.org requires authentication on titan1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:02:09] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1111'] [15:02:17] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1010.eqiad.wmnet with OS bullseye [15:02:45] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1111'] [15:03:09] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1111.eqiad.wmnet with OS bullseye [15:03:15] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye [15:03:27] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:03:37] (JobUnavailable) firing: (9) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:03:56] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware 
(exit_code=0) upgrade firmware for hosts ['cp1110'] [15:04:55] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1110.eqiad.wmnet with OS bullseye [15:05:02] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye [15:06:02] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:06:32] (CR) Effie Mouzeli: "This will not be needed as we have defined proxies in values.yaml for eqiad and codfw. Cronjobs should inherit those vars. I will get back" [deployment-charts] - https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: Kosta Harlan) [15:07:11] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:07:11] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1114.eqiad.wmnet with OS bullseye [15:07:18] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye completed: - cp1114 (**PASS**) - Removed from Puppet...
[15:08:03] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (Jclark-ctr) [15:08:09] (PS1) Ebernhardson: Switch articletopic over to outlink topic prediction [deployment-charts] - https://gerrit.wikimedia.org/r/966883 [15:09:01] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1107'] [15:09:34] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1107'] [15:10:05] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1107.eqiad.wmnet with OS bullseye [15:10:11] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye [15:12:46] !log dancy@deploy2002 Started deploy [releng/jenkins-deploy@2cf7af2] (releasing): (no justification provided) [15:13:03] PROBLEM - Check systemd state on gitlab-runner1003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:31] !log dancy@deploy2002 Finished deploy [releng/jenkins-deploy@2cf7af2] (releasing): (no justification provided) (duration: 00m 44s) [15:14:03] PROBLEM - Check systemd state on gitlab-runner1002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:57] (PS2) Ebernhardson: cirrus updater: Switch articletopic over to outlink topic prediction [deployment-charts] - https://gerrit.wikimedia.org/r/966883 [15:14:59] (PS1) Ebernhardson: cirrus updater: disable jemalloc and increase task manager mem limit [deployment-charts] - https://gerrit.wikimedia.org/r/966884 [15:15:38] (CR) DCausse: [C: +1] cirrus updater: 
disable jemalloc and increase task manager mem limit [deployment-charts] - https://gerrit.wikimedia.org/r/966884 (owner: Ebernhardson) [15:15:43] RECOVERY - Check systemd state on gitlab-runner1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:43] RECOVERY - Check systemd state on gitlab-runner1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:10] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [15:18:54] (PS1) Hnowlan: trafficserver: route all requests for /api/rest_v1/metrics to rest-gateway [puppet] - https://gerrit.wikimedia.org/r/966885 (https://phabricator.wikimedia.org/T336385) [15:19:31] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1106'] [15:20:18] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1111.eqiad.wmnet with reason: host reimage [15:21:12] (CR) Ebernhardson: [C: +2] cirrus updater: Switch articletopic over to outlink topic prediction [deployment-charts] - https://gerrit.wikimedia.org/r/966883 (owner: Ebernhardson) [15:21:37] (CR) Ebernhardson: [C: +2] cirrus updater: disable jemalloc and increase task manager mem limit [deployment-charts] - https://gerrit.wikimedia.org/r/966884 (owner: Ebernhardson) [15:21:43] (PS1) Volans: documentation: expand distributed locking docs [software/spicerack] - https://gerrit.wikimedia.org/r/966886 (https://phabricator.wikimedia.org/T341973) [15:22:06] (Merged) jenkins-bot: cirrus updater: Switch articletopic over to outlink topic prediction [deployment-charts] - 
https://gerrit.wikimedia.org/r/966883 (owner: Ebernhardson) [15:22:28] (Merged) jenkins-bot: cirrus updater: disable jemalloc and increase task manager mem limit [deployment-charts] - https://gerrit.wikimedia.org/r/966884 (owner: Ebernhardson) [15:22:43] (CR) BBlack: [C: +1] slo_definitions: Switch to using varnish_sli_bad [grafana-grizzly] - https://gerrit.wikimedia.org/r/965842 (https://phabricator.wikimedia.org/T341606) (owner: BCornwall) [15:23:08] (CR) Hashar: [C: -1] debug_presentation: script to render HTML templates (1 comment) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/966846 (owner: Hashar) [15:23:28] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1111.eqiad.wmnet with reason: host reimage [15:23:35] (PS5) BCornwall: slo_definitions: Switch to using varnish_sli_bad [grafana-grizzly] - https://gerrit.wikimedia.org/r/965842 (https://phabricator.wikimedia.org/T341606) [15:25:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1106'] [15:26:16] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1106.eqiad.wmnet with OS bullseye [15:26:23] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1106.eqiad.wmnet with OS bullseye [15:27:49] (CR) BCornwall: [V: +1 C: +2] slo_definitions: Switch to using varnish_sli_bad [grafana-grizzly] - https://gerrit.wikimedia.org/r/965842 (https://phabricator.wikimedia.org/T341606) (owner: BCornwall) [15:27:51] (CR) BCornwall: [V: +2 C: +2] slo_definitions: Switch to using varnish_sli_bad [grafana-grizzly] - https://gerrit.wikimedia.org/r/965842 (https://phabricator.wikimedia.org/T341606) (owner: BCornwall) [15:28:17] !log jclark@cumin1001 START - 
Cookbook sre.hosts.downtime for 2:00:00 on cp1107.eqiad.wmnet with reason: host reimage [15:28:21] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:28:34] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:28:50] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1105'] [15:29:14] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1105'] [15:29:42] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1105.eqiad.wmnet with OS bullseye [15:29:49] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1105.eqiad.wmnet with OS bullseye [15:32:10] (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [15:32:41] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1107.eqiad.wmnet with reason: host reimage [15:33:05] (PS1) Jdlrobson: Fix Typo in OS Dark Mode field [extensions/WikimediaEvents] (wmf/1.42.0-wmf.1) - https://gerrit.wikimedia.org/r/966615 (https://phabricator.wikimedia.org/T346106) [15:33:44] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (nskaggs) Can someone provide an update on what's happening with these machines? Where they indeed sent back? Do we have replacement hardware?
[15:36:39] (PS1) Vgutierrez: ssl: Add digicert-2023 unified public certificates [puppet] - https://gerrit.wikimedia.org/r/966887 (https://phabricator.wikimedia.org/T341119) [15:40:13] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:40:53] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1106.eqiad.wmnet with reason: host reimage [15:41:52] brennen: Just left a comment on https://phabricator.wikimedia.org/T348354 that this patch https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/966615 needs to go out with the next train to avoid a spike in EventGate schema validation errors [15:43:09] (PS2) Hnowlan: Add script for automating joining a single node to the cluster [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/829807 (https://phabricator.wikimedia.org/T309619) [15:43:34] !log bking@deploy2002 destroy dse-k8s-services instance of rdf-streaming-updater T349095 [15:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:38] T349095: Migrate staging rdf-streaming-updater to flink operator - https://phabricator.wikimedia.org/T349095 [15:44:05] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1106.eqiad.wmnet with reason: host reimage [15:44:22] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.193 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:44:49] (Abandoned) Hnowlan: service::deploy::gitclone: don't append deploy to repo [puppet] - https://gerrit.wikimedia.org/r/677620 (owner: Hnowlan) [15:45:28] (PS5) Hnowlan: wip: upgrade container and dependencies for bullseye [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881) [15:45:48] SRE, ops-eqiad, 
DC-Ops, Infrastructure-Foundations, netops: Cabling for Eqiad racks E5-8 and F5-8 - https://phabricator.wikimedia.org/T334231 (Jclark-ctr) Unsure if port is turned off or if fs dell optics are not compatible. I put loopback on optic in dell switch and link did not come up [15:46:06] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:46:06] SRE, ops-eqiad, Cloud-VPS, Data-Platform-SRE, cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (taavi) @jclark-ctr these need a single NIC connected to the `cloud-hosts` as the primary VLAN, and `cloud-instances` and `cloud-private` VLANs trunked (we... [15:46:07] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1111.eqiad.wmnet with OS bullseye [15:46:16] (PS1) Btullis: Partial fix for multiple spark shufflers [puppet] - https://gerrit.wikimedia.org/r/966889 (https://phabricator.wikimedia.org/T344910) [15:46:26] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye completed: - cp1111 (**PASS**) - Removed from Puppet...
[15:46:48] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1105.eqiad.wmnet with reason: host reimage [15:47:25] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1010.eqiad.wmnet with OS bullseye [15:47:49] (CR) Slavina Stefanova: harbor: upgrade from 2.5 to 2.9 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/966874 (https://phabricator.wikimedia.org/T346241) (owner: Slavina Stefanova) [15:48:31] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (Jclark-ctr) [15:49:30] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1104'] [15:49:49] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1105.eqiad.wmnet with reason: host reimage [15:49:52] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:49:57] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1104'] [15:50:07] (PS2) Vgutierrez: base,ssl: Add digicert-2023 unified public certs and RSA intermediate [puppet] - https://gerrit.wikimedia.org/r/966887 (https://phabricator.wikimedia.org/T341119) [15:50:15] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1010.eqiad.wmnet with OS bullseye [15:50:27] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye [15:50:33] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye [15:50:51] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) 
generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:50:53] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1107.eqiad.wmnet with OS bullseye [15:51:00] kimberly_sarabia: having a look [15:51:00] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye completed: - cp1107 (**WARN**) - Downtimed on Icinga/... [15:51:34] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (Jclark-ctr) [15:51:43] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1102'] [15:51:45] (CR) Btullis: [V: +1] "PCC SUCCESS (NOOP 20 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - https://gerrit.wikimedia.org/r/966889 (https://phabricator.wikimedia.org/T344910) (owner: Btullis) [15:51:48] (PS1) Gmodena: mw-page-content-change-enrich: version bump [deployment-charts] - https://gerrit.wikimedia.org/r/966890 [15:51:54] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1102'] [15:52:05] (CR) Kosta Harlan: [WIP] ipoid: Set PROXY_HOST and PROXY_PORT (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: Kosta Harlan) [15:52:13] (CR) Kosta Harlan: [C: -2] [WIP] ipoid: Set PROXY_HOST and PROXY_PORT [deployment-charts] - https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: Kosta Harlan) [15:52:54] (CR) Vgutierrez: base,ssl: Add digicert-2023 unified public certs and RSA intermediate (1 comment) [puppet] - 
https://gerrit.wikimedia.org/r/966887 (https://phabricator.wikimedia.org/T341119) (owner: Vgutierrez) [15:52:59] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:53:09] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:53:22] (PS2) Gmodena: mw-page-content-change-enrich: version bump [deployment-charts] - https://gerrit.wikimedia.org/r/966890 (https://phabricator.wikimedia.org/T345805) [15:53:59] kimberly_sarabia: right on, i'll do a backport before train moves forward. [15:54:11] brennen: thank you! [15:54:14] (CR) BBlack: [C: +1] base,ssl: Add digicert-2023 unified public certs and RSA intermediate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/966887 (https://phabricator.wikimedia.org/T341119) (owner: Vgutierrez) [15:54:20] sure thing. [15:55:06] (CR) CI reject: [V: -1] wip: upgrade container and dependencies for bullseye [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881) (owner: Hnowlan) [15:55:16] (CR) Vgutierrez: [C: +2] base,ssl: Add digicert-2023 unified public certs and RSA intermediate [puppet] - https://gerrit.wikimedia.org/r/966887 (https://phabricator.wikimedia.org/T341119) (owner: Vgutierrez) [15:55:54] (CR) TChin: [C: +1] mw-page-content-change-enrich: version bump [deployment-charts] - https://gerrit.wikimedia.org/r/966890 (https://phabricator.wikimedia.org/T345805) (owner: Gmodena) [15:57:26] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1110.eqiad.wmnet with OS bullseye [15:57:32] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye executed with errors: - cp1110 (**FAIL**) - 
Removed f... [15:57:55] (PS1) Ilias Sarantopoulos: ml-services: deploy nllb in llm namespace [deployment-charts] - https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163) [15:58:07] (PS1) Btullis: Change the first spark shuffler service to use the default port [puppet] - https://gerrit.wikimedia.org/r/966892 (https://phabricator.wikimedia.org/T344910) [15:59:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [15:59:55] (PS1) Vgutierrez: hieradata: Deploy digicert-2023 unified cert [puppet] - https://gerrit.wikimedia.org/r/966893 (https://phabricator.wikimedia.org/T341119) [16:00:32] (PS5) Volans: spicerack: enable distributed locking in production [puppet] - https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) [16:00:58] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:01:32] (PS2) Vgutierrez: hieradata: Deploy digicert-2023 unified cert [puppet] - https://gerrit.wikimedia.org/r/966893 (https://phabricator.wikimedia.org/T341119) [16:02:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:02:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1106.eqiad.wmnet with OS bullseye [16:02:30] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1106.eqiad.wmnet with OS bullseye 
completed: - cp1106 (**PASS**) - Removed from Puppet... [16:02:39] (CR) Elukey: ml-services: deploy nllb in llm namespace (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163) (owner: Ilias Sarantopoulos) [16:02:58] (CR) CI reject: [V: -1] spicerack: enable distributed locking in production [puppet] - https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: Volans) [16:04:10] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [16:04:50] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1102'] [16:05:09] (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 20 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - https://gerrit.wikimedia.org/r/966892 (https://phabricator.wikimedia.org/T344910) (owner: Btullis) [16:05:14] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1102'] [16:05:17] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1103.eqiad.wmnet with reason: host reimage [16:05:41] (PS1) Vgutierrez: ssl: Add dummy digicert-2023 unified keys [labs/private] - https://gerrit.wikimedia.org/r/966894 (https://phabricator.wikimedia.org/T341119) [16:06:25] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1102 [16:06:26] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:06:28] (CR) Vgutierrez: [V: +2 C: +2] ssl: Add dummy 
digicert-2023 unified keys [labs/private] - https://gerrit.wikimedia.org/r/966894 (https://phabricator.wikimedia.org/T341119) (owner: Vgutierrez) [16:07:05] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (Jclark-ctr) [16:07:27] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:07:28] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1105.eqiad.wmnet with OS bullseye [16:07:29] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1102 [16:07:33] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1105.eqiad.wmnet with OS bullseye completed: - cp1105 (**PASS**) - Removed from Puppet... 
[16:07:45] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1102'] [16:07:57] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1102'] [16:08:03] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1010.eqiad.wmnet with reason: host reimage [16:08:06] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/43/cons" [puppet] - https://gerrit.wikimedia.org/r/966893 (https://phabricator.wikimedia.org/T341119) (owner: Vgutierrez) [16:08:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1103.eqiad.wmnet with reason: host reimage [16:08:25] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (Jclark-ctr) [16:08:33] (CR) Kosta Harlan: [C: -2] [WIP] ipoid: Set PROXY_HOST and PROXY_PORT (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/966867 (https://phabricator.wikimedia.org/T349171) (owner: Kosta Harlan) [16:08:42] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1110'] [16:09:10] (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [16:10:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1110'] [16:10:16] (CR) Xcollazo: [C: +1] "LGTM." 
[puppet] - https://gerrit.wikimedia.org/r/966889 (https://phabricator.wikimedia.org/T344910) (owner: Btullis) [16:11:02] (PS2) Kosta Harlan: labs: Enable ReportIncident on all beta wikis except loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/965100 (https://phabricator.wikimedia.org/T346018) [16:11:08] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1010.eqiad.wmnet with reason: host reimage [16:11:25] jouncebot: nowandnext [16:11:25] No deployments scheduled for the next 0 hour(s) and 48 minute(s) [16:11:25] In 0 hour(s) and 48 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1700) [16:11:28] (CR) Xcollazo: [C: +1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/966892 (https://phabricator.wikimedia.org/T344910) (owner: Btullis) [16:11:38] (CR) Btullis: [V: +1 C: +2] Partial fix for multiple spark shufflers [puppet] - https://gerrit.wikimedia.org/r/966889 (https://phabricator.wikimedia.org/T344910) (owner: Btullis) [16:11:48] (CR) Btullis: [V: +1 C: +2] Change the first spark shuffler service to use the default port [puppet] - https://gerrit.wikimedia.org/r/966892 (https://phabricator.wikimedia.org/T344910) (owner: Btullis) [16:11:53] (PS2) Jforrester: wikifunctions: Update evaluators to WASM images [deployment-charts] - https://gerrit.wikimedia.org/r/966868 (https://phabricator.wikimedia.org/T343829) [16:13:06] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1102 [16:13:21] (CR) Jforrester: [C: +2] wikifunctions: Update evaluators to WASM images [deployment-charts] - https://gerrit.wikimedia.org/r/966868 (https://phabricator.wikimedia.org/T343829) (owner: Jforrester) [16:14:09] (Merged) jenkins-bot: wikifunctions: Update evaluators to WASM images [deployment-charts] - https://gerrit.wikimedia.org/r/966868 
(https://phabricator.wikimedia.org/T343829) (owner: Jforrester) [16:14:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1102 [16:14:29] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1110'] [16:14:34] !log jclark@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp1110'] [16:14:43] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1102'] [16:14:45] (PS6) Jbond: spicerack: enable distributed locking in production [puppet] - https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: Volans) [16:15:03] (CR) Jbond: "hopefully that fixes it" [puppet] - https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: Volans) [16:15:29] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [16:15:47] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:16:37] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:17:01] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [16:17:22] (CR) CI reject: [V: -1] spicerack: enable distributed locking in production [puppet] - https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: Volans) [16:17:34] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cp1102 - jclark@cumin1001" [16:17:51] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [16:17:53] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:18:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cp1102 - jclark@cumin1001" [16:18:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:18:32] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1102'] [16:18:42] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [16:18:45] (PS3) Vgutierrez: hieradata: Deploy digicert-2023 unified cert [puppet] - https://gerrit.wikimedia.org/r/966893 (https://phabricator.wikimedia.org/T341119) [16:18:54] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1110'] [16:19:15] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1110'] [16:19:47] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1110.eqiad.wmnet with OS bullseye [16:19:53] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye [16:20:21] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1101'] [16:20:23] (PS3) Hashar: debug_host: rename mangecode to managecode (typo) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/966822 [16:20:25] (PS5) Hashar: debug_presentation: script to render HTML templates [software/puppet-compiler] - https://gerrit.wikimedia.org/r/966846 [16:20:27] (PS3) Hashar: Use macros for links to Gerrit and Jenkins [software/puppet-compiler] - https://gerrit.wikimedia.org/r/966878 [16:20:29] (PS3) Hashar: Add style to HTML output [software/puppet-compiler] - https://gerrit.wikimedia.org/r/966879 [16:20:31] (PS1) Hashar: tox: add commands to allowlist_externals [software/puppet-compiler] - 
https://gerrit.wikimedia.org/r/966895 [16:20:33] (CR) BBlack: [C: +1] hieradata: Deploy digicert-2023 unified cert [puppet] - https://gerrit.wikimedia.org/r/966893 (https://phabricator.wikimedia.org/T341119) (owner: Vgutierrez) [16:20:40] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1101'] [16:20:49] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1100'] [16:21:51] (CR) Hashar: debug_presentation: script to render HTML templates (1 comment) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/966846 (owner: Hashar) [16:21:59] (CR) TChin: [C: +2] mw-page-content-change-enrich: version bump [deployment-charts] - https://gerrit.wikimedia.org/r/966890 (https://phabricator.wikimedia.org/T345805) (owner: Gmodena) [16:22:27] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1100'] [16:22:47] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1100.eqiad.wmnet with OS bullseye [16:22:54] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1100.eqiad.wmnet with OS bullseye [16:23:18] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1101.eqiad.wmnet with OS bullseye [16:23:23] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1101.eqiad.wmnet with OS bullseye [16:23:28] (Merged) jenkins-bot: mw-page-content-change-enrich: version bump [deployment-charts] - https://gerrit.wikimedia.org/r/966890 (https://phabricator.wikimedia.org/T345805) (owner: Gmodena) [16:24:26] 
(PS7) Jbond: spicerack: enable distributed locking in production [puppet] - https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: Volans) [16:24:34] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1102'] [16:24:50] (PS1) Jforrester: Revert "wikifunctions: Update evaluators to WASM images" [deployment-charts] - https://gerrit.wikimedia.org/r/966616 [16:24:54] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:25:00] (CR) Jforrester: [C: +2] Revert "wikifunctions: Update evaluators to WASM images" [deployment-charts] - https://gerrit.wikimedia.org/r/966616 (owner: Jforrester) [16:25:27] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1102.eqiad.wmnet with OS bullseye [16:25:34] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1102.eqiad.wmnet with OS bullseye [16:25:49] (Merged) jenkins-bot: Revert "wikifunctions: Update evaluators to WASM images" [deployment-charts] - https://gerrit.wikimedia.org/r/966616 (owner: Jforrester) [16:26:52] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:28:01] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [16:28:15] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [16:28:32] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:28:33] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cp1103.eqiad.wmnet with OS bullseye [16:28:39] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye completed: - cp1103 (**PASS**) - Removed from Puppet... [16:29:05] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [16:29:56] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (Jclark-ctr) [16:30:27] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:30:51] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:33:13] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1010.eqiad.wmnet with OS bullseye [16:34:47] (CR) Ilias Sarantopoulos: ml-services: deploy nllb in llm namespace (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/966891 (https://phabricator.wikimedia.org/T349163) (owner: Ilias Sarantopoulos) [16:37:45] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1102.eqiad.wmnet with reason: host reimage [16:39:54] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1100.eqiad.wmnet with reason: host reimage [16:40:23] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1101.eqiad.wmnet with reason: host reimage [16:40:52] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1102.eqiad.wmnet with reason: host reimage [16:43:39] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1101.eqiad.wmnet with reason: host reimage [16:44:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [16:46:06] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1100.eqiad.wmnet with reason: host reimage [16:49:10] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [16:51:30] (PS1) Jforrester: wikifunctions: Temporarily add WASM JS service for testing [deployment-charts] - https://gerrit.wikimedia.org/r/966898 (https://phabricator.wikimedia.org/T343829) [16:53:24] (CR) Ryan Kemper: [C: +2] airflow-wmde: Add wmde service user to the Yarn production queue [puppet] - https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) (owner: Stevemunene) [16:54:10] (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [16:54:58] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:55:17] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:56:02] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:56:03] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1102.eqiad.wmnet with OS bullseye [16:56:09] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1102.eqiad.wmnet with OS bullseye completed: - cp1102 (**PASS**) - Removed from Puppet... 
[16:57:51] (CR) Cory Massaro: [C: +2] wikifunctions: Temporarily add WASM JS service for testing [deployment-charts] - https://gerrit.wikimedia.org/r/966898 (https://phabricator.wikimedia.org/T343829) (owner: Jforrester) [16:59:22] (Merged) jenkins-bot: wikifunctions: Temporarily add WASM JS service for testing [deployment-charts] - https://gerrit.wikimedia.org/r/966898 (https://phabricator.wikimedia.org/T343829) (owner: Jforrester) [17:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1700) [17:00:12] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [17:00:53] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (Jclark-ctr) [17:01:15] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [17:01:16] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1101.eqiad.wmnet with OS bullseye [17:01:22] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1101.eqiad.wmnet with OS bullseye completed: - cp1101 (**PASS**) - Removed from Puppet... 
[17:02:46] (ProbeDown) firing: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:03:05] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [17:04:05] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [17:04:06] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1100.eqiad.wmnet with OS bullseye [17:04:06] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (Jclark-ctr) [17:04:11] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1100.eqiad.wmnet with OS bullseye completed: - cp1100 (**PASS**) - Removed from Puppet... 
[17:04:30] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1104'] [17:04:57] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1104'] [17:05:09] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1104.eqiad.wmnet with OS bullseye [17:05:14] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1104.eqiad.wmnet with OS bullseye [17:05:28] PROBLEM - Check systemd state on an-airflow1007 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-kerberos@wmde.service,wmf_auto_restart_airflow-scheduler@wmde.service,wmf_auto_restart_airflow-webserver@wmde.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:52] PROBLEM - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [17:05:58] PROBLEM - Checks that the airflow database for airflow wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow db check did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [17:07:00] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Downtime as we setup the new WMDE Airflow instance [17:07:25] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Downtime as we setup the new WMDE 
Airflow instance [17:07:46] (ProbeDown) resolved: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:09:59] (PuppetFailure) firing: (3) Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:10:46] (ProbeDown) firing: Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1003:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:12:05] (PS1) Bking: dse-k8s: remove rdf-streaming-updater service [deployment-charts] - https://gerrit.wikimedia.org/r/966902 (https://phabricator.wikimedia.org/T349095) [17:12:17] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1110.eqiad.wmnet with OS bullseye [17:12:22] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye executed with errors: - cp1110 (**FAIL**) - Removed f... 
[17:13:30] !log restart turnilo to pickup UI change [17:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:47] (ProbeDown) resolved: Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1003:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:17:44] jouncebot nowandnext [17:17:45] For the next 0 hour(s) and 42 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1700) [17:17:45] In 0 hour(s) and 42 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1800) [17:17:45] In 0 hour(s) and 42 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1800) [17:19:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:20:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [17:22:19] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1104.eqiad.wmnet with reason: host reimage [17:23:59] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1110'] [17:24:29] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1110'] [17:25:10] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [17:25:13] SRE, Infrastructure-Foundations, netops: Automate L3 Switch to Core Router BGP peerings (and remove OSPF on drmrs switches) - https://phabricator.wikimedia.org/T349125 (cmooney) >>! In T349125#9260678, @ayounsi wrote: > Isn't OSPF required there to benefit from the end to end link cost calculations (... [17:25:27] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1104.eqiad.wmnet with reason: host reimage [17:26:37] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [17:26:51] (PS1) Cathal Mooney: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref [homer/public] - https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) [17:27:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:28:16] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1110 [17:29:33] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1110 [17:30:14] (PS1) Btullis: Disable the multiple spark shufflers on the test cluster temporarily [puppet] - https://gerrit.wikimedia.org/r/966905 (https://phabricator.wikimedia.org/T344910) [17:30:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:30:56] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cp1110.mgmt.eqiad.wmnet with reboot policy FORCED [17:33:52] (PS2) Btullis: Disable the multiple spark shufflers on the test cluster temporarily [puppet] - https://gerrit.wikimedia.org/r/966905 (https://phabricator.wikimedia.org/T344910) [17:34:02] !log jclark@cumin1001 END 
(PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1110.mgmt.eqiad.wmnet with reboot policy FORCED [17:34:39] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1110.eqiad.wmnet with OS bullseye [17:34:45] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye [17:35:47] (PS3) Btullis: Disable the multiple spark shufflers on the test cluster temporarily [puppet] - https://gerrit.wikimedia.org/r/966905 (https://phabricator.wikimedia.org/T344910) [17:38:21] (CR) Xcollazo: [C: +1] Disable the multiple spark shufflers on the test cluster temporarily [puppet] - https://gerrit.wikimedia.org/r/966905 (https://phabricator.wikimedia.org/T344910) (owner: Btullis) [17:40:10] (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [17:40:32] (CR) Brennen Bearnes: [C: +2] Fix Typo in OS Dark Mode field [extensions/WikimediaEvents] (wmf/1.42.0-wmf.1) - https://gerrit.wikimedia.org/r/966615 (https://phabricator.wikimedia.org/T346106) (owner: Jdlrobson) [17:41:04] (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 20 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - https://gerrit.wikimedia.org/r/966905 (https://phabricator.wikimedia.org/T344910) (owner: Btullis) [17:42:21] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [17:42:28] (Merged) jenkins-bot: Fix Typo in OS Dark Mode field [extensions/WikimediaEvents] (wmf/1.42.0-wmf.1) - https://gerrit.wikimedia.org/r/966615 (https://phabricator.wikimedia.org/T346106) (owner: Jdlrobson) [17:43:15] (CR) Ssingh: [C: +2] wikimedia.org: update DNS records for Greenhouse [dns] - https://gerrit.wikimedia.org/r/966573 (https://phabricator.wikimedia.org/T348335) (owner: Ssingh) [17:43:18] (PS2) Ssingh: wikimedia.org: update DNS records for Greenhouse [dns] - https://gerrit.wikimedia.org/r/966573 (https://phabricator.wikimedia.org/T348335) [17:43:26] !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [17:43:36] !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [17:44:45] !log running authdns-update for CR 966573 [17:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:53] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [17:47:11] 
!log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [17:50:13] (PS1) Ryan Kemper: Revert "airflow-wmde: Place airflow1007 in airflow-wmde role" [puppet] - https://gerrit.wikimedia.org/r/966617 [17:50:16] (CR) Ryan Kemper: [V: +2 C: +2] Revert "airflow-wmde: Place airflow1007 in airflow-wmde role" [puppet] - https://gerrit.wikimedia.org/r/966617 (owner: Ryan Kemper) [17:50:44] (CR) Btullis: [V: +1 C: +2] Disable the multiple spark shufflers on the test cluster temporarily [puppet] - https://gerrit.wikimedia.org/r/966905 (https://phabricator.wikimedia.org/T344910) (owner: Btullis) [17:51:45] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1110.eqiad.wmnet with reason: host reimage [17:52:14] !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [17:52:20] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [17:52:43] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:52:53] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:54:13] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:55:03] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1110.eqiad.wmnet with reason: host reimage [17:55:13] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:55:44] (03PS1) 10Herron: pyrra::filesystem::config: add pyrra filesystem operator config manager [puppet] - 10https://gerrit.wikimedia.org/r/966906 (https://phabricator.wikimedia.org/T302995) [17:56:04] (03CR) 10CI reject: [V: 04-1] pyrra::filesystem::config: add pyrra filesystem operator config manager [puppet] - 10https://gerrit.wikimedia.org/r/966906 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:56:21] (03PS1) 10Ryan Kemper: Revert "airflow-wmde: configure wmde airflow instance" [puppet] - 10https://gerrit.wikimedia.org/r/966618 [17:56:28] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Revert "airflow-wmde: configure wmde airflow instance" [puppet] - 10https://gerrit.wikimedia.org/r/966618 (owner: 10Ryan Kemper) [17:56:52] (03PS2) 10Herron: pyrra::filesystem::config: add pyrra filesystem operator config manager [puppet] - 10https://gerrit.wikimedia.org/r/966906 (https://phabricator.wikimedia.org/T302995) [17:58:27] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:59:27] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:00:07] brennen and hashar: Your horoscope predicts another unfortunate Train log triage with CPT deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1800). [18:00:07] brennen and hashar: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7+Utc-0 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T1800). 
[18:01:21] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50715 bytes in 5.640 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:01:27] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.283 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:01:29] (03PS1) 10Ryan Kemper: Revert "airflow-wmde: Create scap deployment source for wmde" [puppet] - 10https://gerrit.wikimedia.org/r/966619 [18:01:36] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Revert "airflow-wmde: Create scap deployment source for wmde" [puppet] - 10https://gerrit.wikimedia.org/r/966619 (owner: 10Ryan Kemper) [18:02:31] (03PS1) 10Ryan Kemper: Revert "airflow-wmde: Add wmde service user to the Yarn production queue" [puppet] - 10https://gerrit.wikimedia.org/r/966620 [18:02:41] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Revert "airflow-wmde: Add wmde service user to the Yarn production queue" [puppet] - 10https://gerrit.wikimedia.org/r/966620 (owner: 10Ryan Kemper) [18:02:55] o/ [18:03:45] !log brennen@deploy2002 Started scap: Backport for [[gerrit:966615|Fix Typo in OS Dark Mode field (T346106)]] [18:03:56] T346106: Interface customization baseline instrumentation - https://phabricator.wikimedia.org/T346106 [18:03:58] kimberly_sarabia: ^ anything to test here? [18:05:08] !log brennen@deploy2002 brennen and jdlrobson: Backport for [[gerrit:966615|Fix Typo in OS Dark Mode field (T346106)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:06:02] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: Update DNS records for Greenhouse - https://phabricator.wikimedia.org/T348335 (10ssingh) 05Open→03Resolved We have updated the DNS records for Greenhouse, confirmed email delivery including 'reply-to' and checklist on the Greenhouse web interface. Marking th... 
[18:06:08] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: Update DNS records for Greenhouse - https://phabricator.wikimedia.org/T348335 (10ssingh) For posterity: we are now using `gh-mail.wikimedia.org` for the Greenhouse mails. [18:06:26] (03PS1) 10BCornwall: hiera: remove dns6001 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/966907 (https://phabricator.wikimedia.org/T342154) [18:06:50] (03CR) 10Ssingh: [C: 03+1] hiera: remove dns6001 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/966907 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [18:06:58] (03CR) 10BCornwall: [C: 03+2] hiera: remove dns6001 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/966907 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [18:09:59] (PuppetFailure) firing: (3) Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:12:08] !log brennen@deploy2002 brennen and jdlrobson: Continuing with sync [18:12:09] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [18:12:20] (proceeding as this seems pretty low-risk.) [18:14:17] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/966828 [18:15:09] PROBLEM - Bird Internet Routing Daemon on dns6001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [18:15:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [18:15:37] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:17:12] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns6001.wikimedia.org with OS bookworm [18:17:24] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns6001.wikimedia.org with OS bookworm [18:17:31] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:966615|Fix Typo in OS Dark Mode field (T346106)]] (duration: 13m 46s) [18:17:36] T346106: Interface customization baseline instrumentation - https://phabricator.wikimedia.org/T346106 [18:18:27] (03CR) 10Bartosz Dziewoński: [C: 03+1] [beta] Make temp user config SUL-friendly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965879 (https://phabricator.wikimedia.org/T342475) (owner: 10Gergő Tisza) [18:18:43] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:20:12] !log train 1.42.0-wmf.1 (T348354): logs clean and no blockers, rolling to group1 [18:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:17] T348354: 1.42.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T348354 [18:20:21] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/966909 [18:20:32] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966910 (https://phabricator.wikimedia.org/T348354) [18:20:36] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.1 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/966910 (https://phabricator.wikimedia.org/T348354) (owner: 10TrainBranchBot) [18:21:41] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966910 (https://phabricator.wikimedia.org/T348354) (owner: 10TrainBranchBot) [18:22:20] BFD status alerts are the reimaging of DNS hosts [18:23:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10RobH) >>! In T324998#9262325, @nskaggs wrote: > Can someone provide an update on what's happening with these machines? Where they indeed sent back?... [18:24:25] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:27:47] PROBLEM - Recursive DNS on 185.15.58.5 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [18:28:11] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.1 refs T348354 [18:28:13] brennen: sorry for the delay. everything LGTM on my end [18:28:16] T348354: 1.42.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T348354 [18:29:00] kimberly_sarabia: cool, thx. 
[18:29:59] (PuppetFailure) firing: (2) Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:32:59] (03PS1) 10Ebernhardson: cirrus updater: Read codfw events in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/966912 (https://phabricator.wikimedia.org/T347075) [18:33:52] !log brennen@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.1 refs T348354 (duration: 05m 40s) [18:34:03] T348354: 1.42.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T348354 [18:34:19] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Read codfw events in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/966912 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [18:35:07] (03Merged) 10jenkins-bot: cirrus updater: Read codfw events in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/966912 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [18:35:39] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [18:35:42] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:36:38] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [18:36:48] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:41:49] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns6001.wikimedia.org with reason: host reimage [18:45:10] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns6001.wikimedia.org with reason: host reimage [18:48:59] PROBLEM - Recursive DNS on 185.15.58.5 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [18:50:10] (ThanosQueryInstantLatencyHigh) 
resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [18:55:48] (03PS1) 10Eevans: cqlsh-instance (new) [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/966913 [18:56:45] (03PS2) 10Eevans: cqlsh-instance (new) [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/966913 [18:58:37] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:59:19] RECOVERY - Recursive DNS on 185.15.58.5 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [19:00:09] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [19:00:10] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1104.eqiad.wmnet with OS bullseye [19:00:12] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [19:00:13] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1110.eqiad.wmnet with OS bullseye [19:00:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1104.eqiad.wmnet with OS bullseye completed: - cp1104 (**PASS**) - Removed from Puppet... 
[19:00:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1110.eqiad.wmnet with OS bullseye completed: - cp1110 (**WARN**) - Removed from Puppet... [19:00:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10Jclark-ctr) [19:01:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10Jclark-ctr) 05Open→03Resolved [19:01:09] (03CR) 10Bking: [C: 03+2] flink-app chart: Add zookeeper to egress_enabled fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/963130 (owner: 10Ebernhardson) [19:02:40] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bullseye [19:02:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye [19:03:37] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:03:53] PROBLEM - Check systemd state on puppetserver1002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:08:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [19:10:27] RECOVERY - Check systemd state on puppetserver1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:15:13] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:15:55] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:16:35] PROBLEM - Check systemd state on puppetserver1002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:48] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns6001.wikimedia.org with OS bookworm [19:16:58] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns6001.wikimedia.org with OS bookworm completed: - dns6001 (**PASS**) - Downtimed on Icinga/Al... 
[19:17:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:19:54] (03PS1) 10Bartosz Dziewoński: Remove unused $wgIncludeLegacyJavaScript [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966915 [19:20:25] (03PS1) 10BCornwall: Revert "hiera: remove dns6001 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/966624 [19:22:17] (03PS1) 10Ebernhardson: cirrus-updater: Update deployed container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/966916 [19:23:25] (03CR) 10Ebernhardson: [C: 03+2] cirrus-updater: Update deployed container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/966916 (owner: 10Ebernhardson) [19:24:10] (03Merged) 10jenkins-bot: cirrus-updater: Update deployed container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/966916 (owner: 10Ebernhardson) [19:25:01] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/966873 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [19:25:12] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:25:30] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:25:57] PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: Last dump for es5 at codfw (es2025) taken on 2023-10-18 19:16:13 is 7 GiB, but the previous one was 4919 GiB, a change of -99.8 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [19:28:07] (03PS1) 10Herron: pyrra: add prometheus external url [puppet] - 10https://gerrit.wikimedia.org/r/966917 (https://phabricator.wikimedia.org/T302995) [19:28:17] (03PS2) 10Herron: pyrra: add prometheus 
external url [puppet] - 10https://gerrit.wikimedia.org/r/966917 (https://phabricator.wikimedia.org/T302995) [19:28:54] (03PS1) 10BCornwall: mtail: Record bad requests for HAProxy SLI metrics [puppet] - 10https://gerrit.wikimedia.org/r/966918 (https://phabricator.wikimedia.org/T341606) [19:28:56] (03CR) 10Bartosz Dziewoński: "Wow, I had no idea this existed, and I hate it. It seems really difficult to review, other than just trusting that you know what you're do" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [19:29:21] (03CR) 10Ssingh: [C: 03+1] Revert "hiera: remove dns6001 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/966624 (owner: 10BCornwall) [19:29:32] (03CR) 10BCornwall: [C: 03+2] Revert "hiera: remove dns6001 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/966624 (owner: 10BCornwall) [19:30:05] RECOVERY - Check systemd state on puppetserver1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:27] (03PS1) 10Bartosz Dziewoński: Remove $wgApiFrameOptions override for enwiki and zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966919 (https://phabricator.wikimedia.org/T131183) [19:30:39] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1229.eqiad.wmnet with OS bullseye [19:30:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye executed with errors: - db1229 (**FAIL**) -... 
[19:31:24] (03CR) 10CI reject: [V: 04-1] mtail: Record bad requests for HAProxy SLI metrics [puppet] - 10https://gerrit.wikimedia.org/r/966918 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [19:32:32] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/49/cons" [puppet] - 10https://gerrit.wikimedia.org/r/966917 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [19:33:09] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bullseye [19:33:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye [19:33:23] (03PS2) 10BCornwall: mtail: Record bad requests for HAProxy SLI metrics [puppet] - 10https://gerrit.wikimedia.org/r/966918 (https://phabricator.wikimedia.org/T341606) [19:33:35] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1229.eqiad.wmnet with OS bullseye [19:33:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye executed with errors: - db1229 (**FAIL**) -... 
[19:33:46] (03CR) 10Herron: [V: 03+1 C: 03+2] pyrra: add prometheus external url [puppet] - 10https://gerrit.wikimedia.org/r/966917 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [19:34:11] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [19:35:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Jclark-ctr) @Papaul this is still failing [25/50, retrying in 75.00s] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb' r... [19:36:49] (03PS1) 10Herron: pyrra::filesystem: correct config permissions [puppet] - 10https://gerrit.wikimedia.org/r/966920 (https://phabricator.wikimedia.org/T302995) [19:37:49] PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: Last dump for es5 at eqiad (es1025) taken on 2023-10-18 19:11:19 is 7 GiB, but the previous one was 4919 GiB, a change of -99.9 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [19:38:04] ^never an alert made me so happy! 
[19:38:23] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/50/cons" [puppet] - 10https://gerrit.wikimedia.org/r/966920 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [19:38:28] :o [19:38:32] our backups are -99.9% faster [19:38:39] PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: Last dump for es4 at codfw (es2022) taken on 2023-10-18 19:16:13 is 7 GiB, but the previous one was 4984 GiB, a change of -99.9 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [19:38:54] (03PS1) 10Bking: dse-k8s: don't watch rdf-streaming-updater namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/966921 (https://phabricator.wikimedia.org/T349095) [19:38:56] (03PS2) 10Herron: pyrra::filesystem: correct config permissions [puppet] - 10https://gerrit.wikimedia.org/r/966920 (https://phabricator.wikimedia.org/T302995) [19:39:06] well, 99.9% faster, I guess [19:39:14] or -99.9% slower [19:39:33] (03PS2) 10Bking: dse-k8s: don't watch rdf-streaming-updater namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/966921 (https://phabricator.wikimedia.org/T349095) [19:40:47] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bullseye [19:40:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye [19:41:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Volans) @Jclark-ctr: ` Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, No pup... 
[19:41:29] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/51/cons" [puppet] - 10https://gerrit.wikimedia.org/r/966920 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [19:42:04] (03CR) 10Herron: [V: 03+1 C: 03+2] pyrra::filesystem: correct config permissions [puppet] - 10https://gerrit.wikimedia.org/r/966920 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [19:43:10] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [19:43:35] PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: Last dump for es4 at eqiad (es1022) taken on 2023-10-18 19:11:19 is 7 GiB, but the previous one was 4984 GiB, a change of -99.9 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [19:45:09] (03CR) 10Jforrester: "<3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966915 (owner: 10Bartosz Dziewoński) [19:48:10] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [19:48:54] ACKNOWLEDGEMENT - dump of es4 in codfw on backupmon1001 is CRITICAL: Last dump for es4 at codfw (es2022) taken on 2023-10-18 19:16:13 is 7 GiB, but the previous one was 4984 GiB, a change of -99.9 % Jcrespo expected after cluster split - The acknowledgement expires at: 2023-10-25 19:48:29. 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [19:48:54] ACKNOWLEDGEMENT - dump of es4 in eqiad on backupmon1001 is CRITICAL: Last dump for es4 at eqiad (es1022) taken on 2023-10-18 19:11:19 is 7 GiB, but the previous one was 4984 GiB, a change of -99.9 % Jcrespo expected after cluster split - The acknowledgement expires at: 2023-10-25 19:48:29. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [19:48:54] ACKNOWLEDGEMENT - dump of es5 in codfw on backupmon1001 is CRITICAL: Last dump for es5 at codfw (es2025) taken on 2023-10-18 19:16:13 is 7 GiB, but the previous one was 4919 GiB, a change of -99.8 % Jcrespo expected after cluster split - The acknowledgement expires at: 2023-10-25 19:48:29. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [19:48:54] ACKNOWLEDGEMENT - dump of es5 in eqiad on backupmon1001 is CRITICAL: Last dump for es5 at eqiad (es1025) taken on 2023-10-18 19:11:19 is 7 GiB, but the previous one was 4919 GiB, a change of -99.9 % Jcrespo expected after cluster split - The acknowledgement expires at: 2023-10-25 19:48:29. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [19:56:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Jclark-ctr) thanks @Volans [19:56:52] (03PS1) 10Jclark-ctr: add db1229 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/966923 (https://phabricator.wikimedia.org/T342176) [19:58:02] (03CR) 10Jclark-ctr: [C: 03+2] add db1229 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/966923 (https://phabricator.wikimedia.org/T342176) (owner: 10Jclark-ctr) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T2000). 
[20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:03:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [20:06:31] (03PS3) 10Bking: dse-k8s: don't watch rdf-streaming-updater namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/966921 (https://phabricator.wikimedia.org/T349095) [20:12:12] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [20:22:12] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [20:23:42] (03PS1) 10Ebernhardson: flink-app: Configure internal metrics port [deployment-charts] - 10https://gerrit.wikimedia.org/r/966926 [20:24:24] (03PS2) 10Ebernhardson: flink-app: Configure internal metrics port [deployment-charts] - 10https://gerrit.wikimedia.org/r/966926 [20:27:12] (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [20:31:30] (03PS3) 10Ebernhardson: flink-app: Configure internal metrics port [deployment-charts] - 10https://gerrit.wikimedia.org/r/966926 [20:32:20] (03PS4) 10Ebernhardson: flink-app: Configure internal metrics port [deployment-charts] - 10https://gerrit.wikimedia.org/r/966926 [20:33:42] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [20:34:14] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [20:36:13] (03PS3) 10Herron: pyrra::filesystem::config: add pyrra filesystem operator config manager [puppet] - 10https://gerrit.wikimedia.org/r/966906 (https://phabricator.wikimedia.org/T302995) [20:37:03] (03CR) 10Herron: "please see PCC on the related patch above this one" [puppet] - 10https://gerrit.wikimedia.org/r/966906 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [20:37:16] (03CR) 10Herron: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/966909 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [20:42:40] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1229.eqiad.wmnet with OS bullseye [20:42:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye executed with errors: - db1229 (**FAIL**) -... 
[20:43:04] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bullseye [20:43:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye [20:43:49] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1229.eqiad.wmnet with OS bullseye [20:43:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye executed with errors: - db1229 (**FAIL**) -... [20:44:13] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bullseye [20:44:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye [20:46:06] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1229.eqiad.wmnet with OS bullseye [20:46:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye executed with errors: - db1229 (**FAIL**) -... 
[20:46:28] (PS1) Cathal Mooney: Include two new temp codfw sretest hosts in site.pp [puppet] - https://gerrit.wikimedia.org/r/966928 (https://phabricator.wikimedia.org/T345803) [20:46:30] (PS5) Ebernhardson: flink-app: Configure internal metrics port [deployment-charts] - https://gerrit.wikimedia.org/r/966926 [20:46:40] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bullseye [20:46:48] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye [20:51:31] (CR) Bking: [C: +2] Include two new temp codfw sretest hosts in site.pp [puppet] - https://gerrit.wikimedia.org/r/966928 (https://phabricator.wikimedia.org/T345803) (owner: Cathal Mooney) [20:51:43] (PS1) BCornwall: mtail: Record bad requests for ATS SLI metrics [puppet] - https://gerrit.wikimedia.org/r/966930 (https://phabricator.wikimedia.org/T341606) [20:51:52] (CR) Bking: [C: +1] Include two new temp codfw sretest hosts in site.pp [puppet] - https://gerrit.wikimedia.org/r/966928 (https://phabricator.wikimedia.org/T345803) (owner: Cathal Mooney) [20:52:45] (CR) Cathal Mooney: [C: +2] Include two new temp codfw sretest hosts in site.pp [puppet] - https://gerrit.wikimedia.org/r/966928 (https://phabricator.wikimedia.org/T345803) (owner: Cathal Mooney) [20:53:29] (PS2) BCornwall: mtail: Record bad requests for ATS SLI metrics [puppet] - https://gerrit.wikimedia.org/r/966930 (https://phabricator.wikimedia.org/T341606) [20:54:35] (CR) Bking: [C: +1] flink-app: Configure internal metrics port [deployment-charts] - https://gerrit.wikimedia.org/r/966926 (owner: Ebernhardson) [20:55:17] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:57:43] (PS1) Ebernhardson: cirrus updater: Limit staging consumer to eqiad topics [deployment-charts] - https://gerrit.wikimedia.org/r/966931 [20:59:01] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1229.eqiad.wmnet with reason: host reimage [20:59:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [21:00:04] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231018T2100) [21:00:51] (PS1) Cathal Mooney: Specify partman receipe for sretest2003 & sretest2004 [puppet] - https://gerrit.wikimedia.org/r/966932 (https://phabricator.wikimedia.org/T345803) [21:02:06] (CR) Ebernhardson: [C: +2] cirrus updater: Limit staging consumer to eqiad topics [deployment-charts] - https://gerrit.wikimedia.org/r/966931 (owner: Ebernhardson) [21:02:10] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1229.eqiad.wmnet with reason: host reimage [21:03:26] (Merged) jenkins-bot: cirrus updater: Limit staging consumer to eqiad topics [deployment-charts] - https://gerrit.wikimedia.org/r/966931 (owner: Ebernhardson) [21:04:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [21:04:21] (CR) Ebernhardson: [C: +2] flink-app: Configure internal metrics port [deployment-charts] - https://gerrit.wikimedia.org/r/966926 (owner: Ebernhardson) [21:05:11] (Merged) jenkins-bot: flink-app: Configure internal metrics port [deployment-charts] - https://gerrit.wikimedia.org/r/966926 (owner: Ebernhardson) [21:08:31] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [21:08:35] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/966909 (https://phabricator.wikimedia.org/T302995) (owner: Herron) [21:08:54] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:08:54] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/966906 (https://phabricator.wikimedia.org/T302995) (owner: Herron) [21:09:34] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/966881 (owner: Herron) [21:10:46] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/966819 (https://phabricator.wikimedia.org/T349102) (owner: Filippo Giunchedi) [21:11:05] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/966818 (https://phabricator.wikimedia.org/T349102) (owner: Filippo Giunchedi) [21:15:38] (CR) Bking: [C: +1] Specify partman receipe for sretest2003 & sretest2004 [puppet] - https://gerrit.wikimedia.org/r/966932 (https://phabricator.wikimedia.org/T345803) (owner: Cathal Mooney) [21:16:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [21:16:38] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [21:19:16] (CR) Cathal Mooney: [C: +2] Specify partman receipe for sretest2003 & sretest2004 [puppet] - https://gerrit.wikimedia.org/r/966932 (https://phabricator.wikimedia.org/T345803) (owner: Cathal Mooney) [21:21:10] (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [21:23:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [21:23:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1229.eqiad.wmnet with OS bullseye [21:23:06] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye completed: - db1229 (**WARN**) - Downtimed o...
[21:23:26] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (Jclark-ctr) [21:23:37] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (Jclark-ctr) a:Jclark-ctr [21:23:45] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (Jclark-ctr) Open→Resolved [21:35:09] !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with reboot policy FORCED [21:44:44] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2003.mgmt.codfw.wmnet with reboot policy FORCED [21:52:07] (Abandoned) Krinkle: [BETA HACK] Allow external access from anywhere to parsoid port 80 for CI purposes [puppet] - https://gerrit.wikimedia.org/r/941477 (owner: Krinkle) [21:54:31] !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with reboot policy FORCED [21:54:47] (CR) Krinkle: [BETA HACK] Attempt to secure Puppet DB better (1 comment) [puppet] - https://gerrit.wikimedia.org/r/941476 (owner: Krinkle) [21:56:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [21:56:37] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:58:14] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2003.mgmt.codfw.wmnet with reboot policy FORCED [22:01:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [22:02:12] (CR) Gergő Tisza: [C: +1] Remove $wgApiFrameOptions override for enwiki and zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/966919 (https://phabricator.wikimedia.org/T131183) (owner: Bartosz Dziewoński) [22:08:47] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 41 probes of 779 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:14:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [22:14:11] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 4 probes of 779 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:19:10] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [22:24:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [22:30:14] (PuppetFailure) firing: Puppet has failed on host - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [22:50:13] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS13030/IPv6: Connect - Init7, AS13030/IPv4: Connect - Init7, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:56:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [23:01:10] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [23:03:37] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:06:10] (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [23:38:10] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [23:43:10] (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [23:48:45] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 108, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status