[00:02:11] Okay firstly, I should state we (web team) appear to have messed up here and didn't realize a process existed (I'll talk to my team about what went wrong here later). For the remaining work I'll open a ticket and cc you to go through this process. You mentioned an approval and allow list but I don't see this mentioned in https://www.mediawiki.org/wiki/Beta_Features#Creating_your_own ? [00:02:22] Awesome. [00:02:23] Also who owns this process? You mentioned Greg but the page only mentions you. [00:02:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T348183)', diff saved to https://phabricator.wikimedia.org/P54220 and previous config saved to /var/cache/conftool/dbconfig/20231206-000236-arnaudb.json [00:02:51] Do you think it is feasible that we can address this during the course of this week? I need to report back to Olga. [00:02:51] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [00:02:55] (PS3) Alex Paskulin: rest-gateway: fix device analytics routing [deployment-charts] - https://gerrit.wikimedia.org/r/980471 (https://phabricator.wikimedia.org/T343268) [00:03:06] Oh, I guess I own it entirely then? Meh. I've been trying to get this owned by an actual product owner for years. Maybe I'll try again? Totally able to get this done this week, sure. [00:03:32] Okay thanks James - Please expect a Phabricator ping by the end of the day :) [00:03:45] Jdlrobson: Awesome. 
:-) Process is https://www.mediawiki.org/wiki/Beta_Features/Package#Release_requirements [00:07:24] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:07:58] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:08:18] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:14:48] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:26] (PS1) Jforrester: Beta Features: Move ULS Compact Links to only the wikis it's enabled on [mediawiki-config] - https://gerrit.wikimedia.org/r/980512 [00:15:38] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:16:12] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:16:46] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:17:43] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P54221 and previous config saved to /var/cache/conftool/dbconfig/20231206-001742-arnaudb.json [00:28:55] SRE, ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder) 
[00:32:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P54222 and previous config saved to /var/cache/conftool/dbconfig/20231206-003249-arnaudb.json [00:38:53] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/979965 [00:38:59] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/979965 (owner: TrainBranchBot) [00:47:30] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:47:56] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T348183)', diff saved to https://phabricator.wikimedia.org/P54223 and previous config saved to /var/cache/conftool/dbconfig/20231206-004756-arnaudb.json [00:47:59] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [00:48:01] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [00:48:14] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [00:48:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T348183)', diff saved to https://phabricator.wikimedia.org/P54224 and previous config saved to /var/cache/conftool/dbconfig/20231206-004820-arnaudb.json [00:57:35] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/979965 (owner: TrainBranchBot) [00:59:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T348183)', 
diff saved to https://phabricator.wikimedia.org/P54225 and previous config saved to /var/cache/conftool/dbconfig/20231206-005923-arnaudb.json [00:59:28] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [00:59:58] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [01:28:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cephosd2001.codfw.wmnet with OS bullseye [01:28:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cephosd2002.codfw.wmnet with OS bullseye [01:28:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cephosd2003.codfw.wmnet with OS bullseye [01:28:08] SRE, ops-codfw, DC-Ops, Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2001.codfw.wmnet with OS bullseye [01:28:11] SRE, ops-codfw, DC-Ops, Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye [01:28:17] SRE, ops-codfw, DC-Ops, Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2003.codfw.wmnet with OS bullseye [01:29:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P54227 and previous config saved to /var/cache/conftool/dbconfig/20231206-012936-arnaudb.json [01:29:40] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [01:31:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[01:32:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ceph2001.mgmt.codfw.wmnet with reboot policy FORCED [01:32:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ceph2002.mgmt.codfw.wmnet with reboot policy FORCED [01:32:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ceph2003.mgmt.codfw.wmnet with reboot policy FORCED [01:34:09] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ceph2001.mgmt.codfw.wmnet with reboot policy FORCED [01:34:14] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ceph2002.mgmt.codfw.wmnet with reboot policy FORCED [01:34:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ceph2003.mgmt.codfw.wmnet with reboot policy FORCED [01:40:37] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [01:42:34] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating ceph to cephosd to codfw - jhancock@cumin2002" [01:43:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating ceph to cephosd to codfw - jhancock@cumin2002" [01:43:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:44:43] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T348183)', diff saved to https://phabricator.wikimedia.org/P54228 and previous config saved to /var/cache/conftool/dbconfig/20231206-014443-arnaudb.json [01:44:45] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [01:44:47] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [01:45:00] !log 
arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [01:45:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T348183)', diff saved to https://phabricator.wikimedia.org/P54229 and previous config saved to /var/cache/conftool/dbconfig/20231206-014506-arnaudb.json [01:51:04] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [01:52:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:55:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T348183)', diff saved to https://phabricator.wikimedia.org/P54230 and previous config saved to /var/cache/conftool/dbconfig/20231206-015519-arnaudb.json [01:55:23] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [01:58:16] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [01:59:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:00:14] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [02:01:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 48.48% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:01:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:06:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 45.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:10:09] (PHPFPMTooBusy) firing: Not 
enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 48.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:10:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P54231 and previous config saved to /var/cache/conftool/dbconfig/20231206-021031-arnaudb.json [02:15:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 48.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:23:52] 10SRE, 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [02:25:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P54232 and previous config saved to /var/cache/conftool/dbconfig/20231206-022538-arnaudb.json [02:39:05] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T348183)', diff saved to https://phabricator.wikimedia.org/P54233 and previous config saved to /var/cache/conftool/dbconfig/20231206-024045-arnaudb.json [02:40:47] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [02:40:50] T348183: Apply schema 
change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [02:41:02] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [02:41:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T348183)', diff saved to https://phabricator.wikimedia.org/P54234 and previous config saved to /var/cache/conftool/dbconfig/20231206-024108-arnaudb.json [02:53:59] (PuppetFailure) firing: Puppet has failed on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:55:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T348183)', diff saved to https://phabricator.wikimedia.org/P54235 and previous config saved to /var/cache/conftool/dbconfig/20231206-025503-arnaudb.json [02:55:12] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [03:00:22] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:03:59] (PuppetFailure) resolved: Puppet has failed on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:07:16] RECOVERY - cassandra-c CQL 10.192.16.239:9042 on restbase2028 is OK: TCP OK - 0.032 second response time on 10.192.16.239 port 9042 https://phabricator.wikimedia.org/T93886 [03:09:05] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:10:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P54236 and previous config saved to /var/cache/conftool/dbconfig/20231206-031009-arnaudb.json [03:25:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P54237 and previous config saved to /var/cache/conftool/dbconfig/20231206-032516-arnaudb.json [03:40:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T348183)', diff saved to https://phabricator.wikimedia.org/P54238 and previous config saved to /var/cache/conftool/dbconfig/20231206-034022-arnaudb.json [03:40:25] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [03:40:28] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [03:40:39] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [03:40:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T348183)', diff saved to https://phabricator.wikimedia.org/P54239 and previous config saved to /var/cache/conftool/dbconfig/20231206-034045-arnaudb.json [03:51:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T348183)', diff saved to https://phabricator.wikimedia.org/P54240 and previous config saved to /var/cache/conftool/dbconfig/20231206-035119-arnaudb.json [03:51:23] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [03:53:51] 10SRE, 10ops-eqiad: Inbound interface errors - 
https://phabricator.wikimedia.org/T342502 (10phaultfinder) [04:06:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P54241 and previous config saved to /var/cache/conftool/dbconfig/20231206-040625-arnaudb.json [04:21:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P54242 and previous config saved to /var/cache/conftool/dbconfig/20231206-042132-arnaudb.json [04:33:41] (LVSHighRX) firing: Excessive RX traffic on lvs6001:9100 (enp175s0f0np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs6001 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX [04:36:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T348183)', diff saved to https://phabricator.wikimedia.org/P54243 and previous config saved to /var/cache/conftool/dbconfig/20231206-043638-arnaudb.json [04:36:42] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [04:36:43] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [04:36:57] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [04:36:58] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [04:37:12] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [04:37:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T348183)', diff saved to https://phabricator.wikimedia.org/P54244 and previous config saved to 
/var/cache/conftool/dbconfig/20231206-043718-arnaudb.json [04:37:37] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway) [04:38:41] (LVSHighRX) resolved: Excessive RX traffic on lvs6001:9100 (enp175s0f0np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs6001 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX [04:47:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T348183)', diff saved to https://phabricator.wikimedia.org/P54245 and previous config saved to /var/cache/conftool/dbconfig/20231206-044737-arnaudb.json [04:47:41] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [04:58:44] PROBLEM - Mediawiki CirrusSearch Saneitizer Weekly Fix Rate on alert1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [250000.0] https://wikitech.wikimedia.org/wiki/Search%23Saneitizer_%28background_repair_process%29 https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now [05:02:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P54246 and previous config saved to /var/cache/conftool/dbconfig/20231206-050243-arnaudb.json [05:07:36] RECOVERY - Mediawiki CirrusSearch Saneitizer Weekly Fix Rate on alert1001 is OK: OK: Less than 1.00% above the threshold [100000.0] https://wikitech.wikimedia.org/wiki/Search%23Saneitizer_%28background_repair_process%29 https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now [05:17:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P54247 and previous config saved to 
/var/cache/conftool/dbconfig/20231206-051750-arnaudb.json [05:19:04] !log denisse@deploy2002 Started deploy [librenms/librenms@f049593]: Upgrade T351616 [05:19:13] !log denisse@deploy2002 Finished deploy [librenms/librenms@f049593]: Upgrade T351616 (duration: 00m 09s) [05:23:50] PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: librenms-check-services.service,librenms-discovery-new.service,librenms-poll-billing.service,librenms-poller-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:25:18] RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:32:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T348183)', diff saved to https://phabricator.wikimedia.org/P54248 and previous config saved to /var/cache/conftool/dbconfig/20231206-053256-arnaudb.json [05:33:00] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [05:33:02] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [05:33:15] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [05:33:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T348183)', diff saved to https://phabricator.wikimedia.org/P54249 and previous config saved to /var/cache/conftool/dbconfig/20231206-053321-arnaudb.json [05:43:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T348183)', diff saved to https://phabricator.wikimedia.org/P54250 and previous config saved to /var/cache/conftool/dbconfig/20231206-054339-arnaudb.json [05:43:44] T348183: Apply schema change for changing img_size, oi_size, us_size, and 
fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [05:58:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P54251 and previous config saved to /var/cache/conftool/dbconfig/20231206-055846-arnaudb.json [06:13:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P54252 and previous config saved to /var/cache/conftool/dbconfig/20231206-061352-arnaudb.json [06:21:06] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:29:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T348183)', diff saved to https://phabricator.wikimedia.org/P54254 and previous config saved to /var/cache/conftool/dbconfig/20231206-062859-arnaudb.json [06:29:02] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [06:29:04] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [06:29:16] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [06:29:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T348183)', diff saved to https://phabricator.wikimedia.org/P54255 and previous config saved to /var/cache/conftool/dbconfig/20231206-062922-arnaudb.json [06:51:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:56:52] PROBLEM - Mediawiki CirrusSearch Saneitizer Weekly 
Fix Rate on alert1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [250000.0] https://wikitech.wikimedia.org/wiki/Search%23Saneitizer_%28background_repair_process%29 https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now [07:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231206T0700) [07:00:23] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:05:44] RECOVERY - Mediawiki CirrusSearch Saneitizer Weekly Fix Rate on alert1001 is OK: OK: Less than 1.00% above the threshold [100000.0] https://wikitech.wikimedia.org/wiki/Search%23Saneitizer_%28background_repair_process%29 https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now [07:07:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T348183)', diff saved to https://phabricator.wikimedia.org/P54256 and previous config saved to /var/cache/conftool/dbconfig/20231206-070749-arnaudb.json [07:07:54] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [07:09:06] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:17:55] (03PS9) 10Elukey: changeprop: refactor templating for Kafka producer/consumer settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) [07:19:32] 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group for ArthurTaylor - 
https://phabricator.wikimedia.org/T352653 (10ArthurTaylor) Signed from my side. [07:22:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P54257 and previous config saved to /var/cache/conftool/dbconfig/20231206-072256-arnaudb.json [07:38:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P54258 and previous config saved to /var/cache/conftool/dbconfig/20231206-073803-arnaudb.json [07:42:02] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:52:59] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:53:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T348183)', diff saved to https://phabricator.wikimedia.org/P54259 and previous config saved to /var/cache/conftool/dbconfig/20231206-075309-arnaudb.json [07:53:12] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2188.codfw.wmnet with reason: Maintenance [07:53:14] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [07:53:27] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2188.codfw.wmnet with reason: Maintenance [07:53:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2188 (T348183)', diff saved to https://phabricator.wikimedia.org/P54260 and previous config saved to /var/cache/conftool/dbconfig/20231206-075333-arnaudb.json [07:54:39] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4047.ulsfo.wmnet [07:58:48] 
(PS1) Muehlenhoff: Switch cp4047 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980776 (https://phabricator.wikimedia.org/T349619) [07:59:50] (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/979169 (https://phabricator.wikimedia.org/T348392) (owner: Dzahn) [08:00:05] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231206T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:04:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T348183)', diff saved to https://phabricator.wikimedia.org/P54261 and previous config saved to /var/cache/conftool/dbconfig/20231206-080409-arnaudb.json [08:04:16] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [08:06:58] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:08:52] (CR) Muehlenhoff: [C: +2] Switch cp4047 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980776 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [08:15:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4047.ulsfo.wmnet [08:19:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P54262 and previous config saved to /var/cache/conftool/dbconfig/20231206-081915-arnaudb.json [08:20:30] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/969128 (owner: Muehlenhoff) [08:27:28] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:27:55] (CR) Muehlenhoff: [C: +2] piwik::dataase: 
Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/969128 (owner: Muehlenhoff) [08:34:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P54263 and previous config saved to /var/cache/conftool/dbconfig/20231206-083422-arnaudb.json [08:37:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [08:39:09] (PS3) Ayounsi: Get the server's BGP peer info from Netbox [homer/public] - https://gerrit.wikimedia.org/r/979381 (https://phabricator.wikimedia.org/T306649) [08:39:45] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/979897 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff) [08:45:15] (CR) Filippo Giunchedi: [C: +2] rsyslog: update receiver config for version 8.2302 [puppet] - https://gerrit.wikimedia.org/r/980434 (https://phabricator.wikimedia.org/T351710) (owner: Filippo Giunchedi) [08:49:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T348183)', diff saved to https://phabricator.wikimedia.org/P54264 and previous config saved to /var/cache/conftool/dbconfig/20231206-084928-arnaudb.json [08:49:32] !log test rsyslog version from bullseye-backports on centrallog - T351710 [08:49:33] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [08:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:36] T351710: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 [08:59:14] PROBLEM - Mediawiki CirrusSearch Saneitizer Weekly Fix Rate on alert1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold 
[250000.0] https://wikitech.wikimedia.org/wiki/Search%23Saneitizer_%28background_repair_process%29 https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now
[09:05:08] RECOVERY - Mediawiki CirrusSearch Saneitizer Weekly Fix Rate on alert1001 is OK: OK: Less than 1.00% above the threshold [100000.0] https://wikitech.wikimedia.org/wiki/Search%23Saneitizer_%28background_repair_process%29 https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now
[09:08:57] (PS1) Ayounsi: Netbox: remove SECURE_PROXY_SSL_HEADER [puppet] - https://gerrit.wikimedia.org/r/980815 (https://phabricator.wikimedia.org/T336275)
[09:20:42] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/980815 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[09:22:47] (PS1) Muehlenhoff: Revert "Remove ganeti RAPI dummy certs" [labs/private] - https://gerrit.wikimedia.org/r/980816
[09:24:08] SRE, Data-Engineering, Data-Platform-SRE (23/24 Q2 1): Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (Gehel)
[09:25:22] (CR) Muehlenhoff: [V: +2 C: +2] Revert "Remove ganeti RAPI dummy certs" [labs/private] - https://gerrit.wikimedia.org/r/980816 (owner: Muehlenhoff)
[09:29:28] (PS3) Muehlenhoff: Switch idp_test to nftables [puppet] - https://gerrit.wikimedia.org/r/969138
[09:31:00] (CR) Muehlenhoff: [C: +2] Switch idp_test to nftables [puppet] - https://gerrit.wikimedia.org/r/969138 (owner: Muehlenhoff)
[09:31:34] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:31:35] (CR) Muehlenhoff: [C: +2] Switch netboxdb to nftables [puppet] - https://gerrit.wikimedia.org/r/969331 (owner: Muehlenhoff)
[09:32:42] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL
- degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:33:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:36:10] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:38:38] RECOVERY - Check systemd state on wdqs1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:38:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:38:58] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:39:13] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/979897 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff)
[09:40:47] (ConfdResourceFailed) firing: confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[09:44:28] (PS1) Filippo Giunchedi: docker_pkg: install convenience symlink [puppet] - https://gerrit.wikimedia.org/r/980817 (https://phabricator.wikimedia.org/T320555)
[09:47:14] (PS1) Filippo Giunchedi: New image: oauth2-proxy [docker-images/production-images] - https://gerrit.wikimedia.org/r/980818 (https://phabricator.wikimedia.org/T320555)
[09:47:26] (PS1)
AikoChou: ml-services: fix PYTHONPATH issue in revertrisk [deployment-charts] - https://gerrit.wikimedia.org/r/980819 (https://phabricator.wikimedia.org/T352181)
[09:55:47] (ConfdResourceFailed) firing: (2) confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[10:00:17] (CR) Brouberol: [C: +2] Add discovery records for the k8s-ingress-dse LVS service [dns] - https://gerrit.wikimedia.org/r/980404 (https://phabricator.wikimedia.org/T352639) (owner: Brouberol)
[10:02:34] (PS1) Brouberol: Revert "Add discovery records for the k8s-ingress-dse LVS service" [dns] - https://gerrit.wikimedia.org/r/980473
[10:03:35] (PS1) Effie Mouzeli: update README to include wikitech documentation [deployment-charts] - https://gerrit.wikimedia.org/r/980823
[10:04:09] (PS1) Ayounsi: Add ApereoSocialPipeline for now CAS auth [software/netbox-deploy] - https://gerrit.wikimedia.org/r/980824 (https://phabricator.wikimedia.org/T308002)
[10:08:01] (CR) Clément Goubert: [C: +1] Revert "Add discovery records for the k8s-ingress-dse LVS service" [dns] - https://gerrit.wikimedia.org/r/980473 (owner: Brouberol)
[10:08:13] (CR) Brouberol: [C: +2] Revert "Add discovery records for the k8s-ingress-dse LVS service" [dns] - https://gerrit.wikimedia.org/r/980473 (owner: Brouberol)
[10:12:19] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:12:55] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:16:05] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:16:09] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:16:15] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:17:29] RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:17:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:18:09] (CR) Hnowlan: [C: +1] "Happy to deploy this if you'd prefer!" [deployment-charts] - https://gerrit.wikimedia.org/r/980471 (https://phabricator.wikimedia.org/T343268) (owner: Alex Paskulin)
[10:19:05] RECOVERY - Check systemd state on wdqs1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:19:43] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:20:24] (CR) Muehlenhoff: [C: +2] ganeti: Remove non-PKI code for RAPI access [puppet] - https://gerrit.wikimedia.org/r/979897 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff)
[10:22:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:26:15] !log installing gtk+3.0 bug fix updates from Bookworm point release
[10:26:18] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:42] SRE, Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (MoritzMuehlenhoff)
[10:30:15] (CR) Slyngshede: [C: +1] "LGTM" [software/netbox-deploy] - https://gerrit.wikimedia.org/r/980824 (https://phabricator.wikimedia.org/T308002) (owner: Ayounsi)
[10:35:03] (PS1) Muehlenhoff: Revert "Revert "Remove ganeti RAPI dummy certs"" [labs/private] - https://gerrit.wikimedia.org/r/980825
[10:37:40] (CR) Muehlenhoff: [V: +2 C: +2] Revert "Revert "Remove ganeti RAPI dummy certs"" [labs/private] - https://gerrit.wikimedia.org/r/980825 (owner: Muehlenhoff)
[10:38:08] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4050.ulsfo.wmnet
[10:40:28] (PS1) Muehlenhoff: Switch cp4050 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980846 (https://phabricator.wikimedia.org/T349619)
[10:43:43] (CR) Muehlenhoff: [C: +2] Switch cp4050 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980846 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[10:43:45] (CR) JMeybohm: mcrouter: add chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[10:45:16] (CR) JMeybohm: [C: +1] update README to include wikitech documentation [deployment-charts] - https://gerrit.wikimedia.org/r/980823 (owner: Effie Mouzeli)
[10:47:02] (CR) JMeybohm: [C: -1] deployment_server: add mcrouter service 1 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/979339 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[10:50:05] (PS1) Effie Mouzeli: modules/app: update to job 1.1.0 (vanila) [deployment-charts] - https://gerrit.wikimedia.org/r/980847
[10:50:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4050.ulsfo.wmnet
[11:00:06]
Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231206T1100)
[11:00:23] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:05:08] (PS4) Effie Mouzeli: Remove cergen certificate support from mesh module [deployment-charts] - https://gerrit.wikimedia.org/r/979095 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[11:07:03] (PS24) Effie Mouzeli: mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[11:10:25] (PS25) Effie Mouzeli: mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[11:11:10] (CR) CI reject: [V: -1] mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[11:11:35] (PS2) Effie Mouzeli: deployment_server: add mcrouter service 1 [puppet] - https://gerrit.wikimedia.org/r/979339 (https://phabricator.wikimedia.org/T346690)
[11:11:47] (CR) Effie Mouzeli: deployment_server: add mcrouter service 1 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/979339 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[11:13:17] (PS26) Effie Mouzeli: mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[11:16:09] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4044.ulsfo.wmnet
[11:17:14] (PS1) Muehlenhoff: Switch cp4044 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980848 (https://phabricator.wikimedia.org/T349619)
[11:17:38] (CR) Effie Mouzeli: mcrouter: add chart (1 comment) [deployment-charts] -
https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[11:20:25] (CR) Muehlenhoff: [C: +2] Switch cp4044 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980848 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[11:21:04] (Abandoned) Effie Mouzeli: Update changeprop to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/947802 (https://phabricator.wikimedia.org/T300033) (owner: Effie Mouzeli)
[11:21:24] (CR) Effie Mouzeli: [C: +2] update README to include wikitech documentation [deployment-charts] - https://gerrit.wikimedia.org/r/980823 (owner: Effie Mouzeli)
[11:21:45] (CR) Sg912: [C: +1] geo-analytics: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/978529 (owner: Hnowlan)
[11:21:56] (Merged) jenkins-bot: update README to include wikitech documentation [deployment-charts] - https://gerrit.wikimedia.org/r/980823 (owner: Effie Mouzeli)
[11:23:18] (PS1) Hnowlan: changeprop-jobqueue: migrate one large and a few small jobs [deployment-charts] - https://gerrit.wikimedia.org/r/980849 (https://phabricator.wikimedia.org/T349796)
[11:26:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4044.ulsfo.wmnet
[11:33:19] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: druid::analytics::worker
[11:34:09] (CR) Effie Mouzeli: [C: +1] changeprop-jobqueue: migrate one large and a few small jobs [deployment-charts] - https://gerrit.wikimedia.org/r/980849 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[11:35:30] (PS1) Muehlenhoff: Switch druid::analytics::worker to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980850 (https://phabricator.wikimedia.org/T349619)
[11:37:36] SRE, Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (MoritzMuehlenhoff)
[11:37:53]
(CR) Hnowlan: [C: +2] changeprop-jobqueue: migrate one large and a few small jobs [deployment-charts] - https://gerrit.wikimedia.org/r/980849 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[11:38:40] (Merged) jenkins-bot: changeprop-jobqueue: migrate one large and a few small jobs [deployment-charts] - https://gerrit.wikimedia.org/r/980849 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[11:39:24] (CR) Muehlenhoff: [C: +2] Switch druid::analytics::worker to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980850 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[11:39:47] (CR) Volans: "Looks almost ready now! I've left a question and a couple of nits. It's just missing the tests." [software/spicerack] - https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) (owner: Ayounsi)
[11:40:26] (PS2) AikoChou: ml-services: update revert-risk images [deployment-charts] - https://gerrit.wikimedia.org/r/980819 (https://phabricator.wikimedia.org/T352181)
[11:40:47] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[11:41:13] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[11:42:29] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[11:43:02] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[11:44:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: druid::analytics::worker
[11:46:35] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:48:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 0% idle - https://bit.ly/wmf-fpmsat -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:48:17] !log rollback changeprop-jobqueue
[11:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:54] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:49:07] (PS2) Effie Mouzeli: modules/app: update to job 1.1.0 (vanila) [deployment-charts] - https://gerrit.wikimedia.org/r/980847
[11:49:09] (PS1) Effie Mouzeli: modules/app: update to job 1.1.0 [deployment-charts] - https://gerrit.wikimedia.org/r/980852
[11:50:19] (PS1) Hnowlan: changeprop-jobqueue: restore cirrussearchlinksupdate to metal [deployment-charts] - https://gerrit.wikimedia.org/r/980853 (https://phabricator.wikimedia.org/T349796)
[11:51:33] (CR) Clément Goubert: [C: +1] changeprop-jobqueue: restore cirrussearchlinksupdate to metal [deployment-charts] - https://gerrit.wikimedia.org/r/980853 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[11:51:39] (PS2) Effie Mouzeli: modules/app: update to job 1.1.0 [deployment-charts] - https://gerrit.wikimedia.org/r/980852
[11:53:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:53:30] (CR) Effie Mouzeli: [C: +1] changeprop-jobqueue: restore cirrussearchlinksupdate to metal [deployment-charts] - https://gerrit.wikimedia.org/r/980853 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[11:54:11] (CR) Ilias Sarantopoulos: [C: +1] ml-services: update revert-risk images [deployment-charts] - https://gerrit.wikimedia.org/r/980819 (https://phabricator.wikimedia.org/T352181) (owner: AikoChou)
[11:54:56] (PS3) Effie Mouzeli: modules/app: update to job 1.1.0 [deployment-charts] - https://gerrit.wikimedia.org/r/980852
[11:57:37] (CR) Muehlenhoff: Package for Debian (1 comment) [software/debmonitor] - https://gerrit.wikimedia.org/r/980397 (owner: Slyngshede)
[12:00:19] (CR) Hnowlan: [C: +2] changeprop-jobqueue: restore cirrussearchlinksupdate to metal [deployment-charts] - https://gerrit.wikimedia.org/r/980853 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[12:01:06] (Merged) jenkins-bot: changeprop-jobqueue: restore cirrussearchlinksupdate to metal [deployment-charts] - https://gerrit.wikimedia.org/r/980853 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[12:15:53] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1001.eqiad.wmnet with OS bookworm
[12:16:01] SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-be1001.eqiad.wmnet with OS bookworm
[12:29:27] (PS1) Hnowlan: changeprop-jobqueue: migrate all remaining small jobs, also cdnPurge [deployment-charts] - https://gerrit.wikimedia.org/r/980855 (https://phabricator.wikimedia.org/T349796)
[12:30:49] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be1001.eqiad.wmnet with reason: host reimage
[12:32:11] (PS1) Hnowlan: changeprop-jobqueue: migrate all low-traffic jobs [deployment-charts] - https://gerrit.wikimedia.org/r/980856 (https://phabricator.wikimedia.org/T349796)
[12:33:02] (PS3) Awight: Remove wgPopupsReferencePreviews now that it defaults to true [mediawiki-config] -
https://gerrit.wikimedia.org/r/978035 (https://phabricator.wikimedia.org/T351708)
[12:33:35] !log installing pam bugfix updates from Bookworm point release
[12:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:01] (PS5) JMeybohm: Remove cergen certificate support from mesh module [deployment-charts] - https://gerrit.wikimedia.org/r/979095 (https://phabricator.wikimedia.org/T300033)
[12:34:03] (PS2) JMeybohm: function-orchestrator: Update to latest mesh and ingress module [deployment-charts] - https://gerrit.wikimedia.org/r/980425 (https://phabricator.wikimedia.org/T300033)
[12:34:15] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be1001.eqiad.wmnet with reason: host reimage
[12:35:46] (CR) Clément Goubert: [C: +1] changeprop-jobqueue: migrate all remaining small jobs, also cdnPurge [deployment-charts] - https://gerrit.wikimedia.org/r/980855 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[12:40:02] (CR) Effie Mouzeli: [C: +1] Remove cergen certificate support from mesh module [deployment-charts] - https://gerrit.wikimedia.org/r/979095 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[12:40:08] SRE, Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (MoritzMuehlenhoff)
[12:41:56] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4041.ulsfo.wmnet
[12:43:26] (PS1) Muehlenhoff: Switch cp4041 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980858 (https://phabricator.wikimedia.org/T349619)
[12:45:16] (CR) Muehlenhoff: [C: +2] Switch cp4041 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980858 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[12:50:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4041.ulsfo.wmnet
[12:52:03] !log mvernon@cumin1001 START
- Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1001"
[12:52:09] (CR) JMeybohm: [C: -1] modules/app: update to job 1.1.0 (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/980852 (owner: Effie Mouzeli)
[12:52:35] (CR) JMeybohm: [C: -1] "Please also add an entry to CHANGELOG.md" [deployment-charts] - https://gerrit.wikimedia.org/r/980852 (owner: Effie Mouzeli)
[12:57:10] (CR) JMeybohm: [C: +2] Remove cergen certificate support from mesh module [deployment-charts] - https://gerrit.wikimedia.org/r/979095 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[12:57:12] (CR) JMeybohm: [C: +2] Add new mesh module versions [deployment-charts] - https://gerrit.wikimedia.org/r/979094 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[12:58:02] (Merged) jenkins-bot: Add new mesh module versions [deployment-charts] - https://gerrit.wikimedia.org/r/979094 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[12:58:04] (Merged) jenkins-bot: Remove cergen certificate support from mesh module [deployment-charts] - https://gerrit.wikimedia.org/r/979095 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[13:04:01] (CR) Effie Mouzeli: [C: +1] function-orchestrator: Update to latest mesh and ingress module [deployment-charts] - https://gerrit.wikimedia.org/r/980425 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[13:06:50] (CR) JMeybohm: [C: +2] function-orchestrator: Update to latest mesh and ingress module [deployment-charts] - https://gerrit.wikimedia.org/r/980425 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[13:07:44] (Merged) jenkins-bot: function-orchestrator: Update to latest mesh and ingress module [deployment-charts] - https://gerrit.wikimedia.org/r/980425 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[13:08:05] !log installing systemd bugfix updates from Bookworm point release
[13:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:09] (CR) JMeybohm: mcrouter: add chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[13:08:20] (PS1) KartikMistry: Provide python3-bookworm image [docker-images/production-images] - https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733)
[13:08:59] (PS4) Ayounsi: Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152)
[13:09:22] (PS1) Volans: doc: mention inclusion into Debian upstream [software/cumin] - https://gerrit.wikimedia.org/r/980861
[13:11:56] (CR) Volans: [C: +1] "I love the removal of all the hardcoded stuff! I'll leave to netops to ensure it's a noop and the logic works."
[homer/public] - https://gerrit.wikimedia.org/r/979381 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[13:12:18] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@bfd944e]: Add metrics configuration TEST [airflow-dags@bfd944e4]
[13:12:29] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@bfd944e]: Add metrics configuration TEST [airflow-dags@bfd944e4] (duration: 00m 11s)
[13:16:05] (CR) CI reject: [V: -1] doc: mention inclusion into Debian upstream [software/cumin] - https://gerrit.wikimedia.org/r/980861 (owner: Volans)
[13:20:51] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4052.ulsfo.wmnet
[13:21:54] (PS1) Muehlenhoff: Switch cp4052 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980863 (https://phabricator.wikimedia.org/T349619)
[13:23:38] (PS1) Jelto: add optional install_recommends to apt_install [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/980864 (https://phabricator.wikimedia.org/T352003)
[13:25:54] (CR) Volans: [V: +2 C: +2] "docs only self-merging. CI failing is the lack of a recent pyparsing release upstream with the merged fix for type hinting." [software/cumin] - https://gerrit.wikimedia.org/r/980861 (owner: Volans)
[13:26:34] (PS1) Klausman: api-gateway: Add entry for recommendation-api-ng on Lift Wing [deployment-charts] - https://gerrit.wikimedia.org/r/980865 (https://phabricator.wikimedia.org/T347263)
[13:33:53] (CR) Elukey: "We should probably try to figure out what is the best URI path that we want to use.
I think that having api.wikimedia.org/etc../api is not" [deployment-charts] - https://gerrit.wikimedia.org/r/980865 (https://phabricator.wikimedia.org/T347263) (owner: Klausman)
[13:36:44] (PS10) Elukey: changeprop: refactor templating for Kafka producer/consumer settings [deployment-charts] - https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950)
[13:37:38] !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1001"
[13:37:40] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be1001.eqiad.wmnet with OS bookworm
[13:37:43] (PS11) Elukey: changeprop: refactor templating for Kafka producer/consumer settings [deployment-charts] - https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950)
[13:37:47] SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-be1001.eqiad.wmnet with OS bookworm completed: - moss-be1001 (**PASS**) - Removed from Puppet and Pup...
[13:38:46] (PS1) Volans: doc: update .readthedocs.yml configuration [software/cumin] - https://gerrit.wikimedia.org/r/980866
[13:38:51] SRE, ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[13:41:55] (CR) Muehlenhoff: [C: +2] Switch cp4052 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980863 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[13:43:14] (CR) Hnowlan: [C: +2] changeprop-jobqueue: migrate all remaining small jobs, also cdnPurge [deployment-charts] - https://gerrit.wikimedia.org/r/980855 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[13:43:49] (PS27) Effie Mouzeli: mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[13:44:04] (Merged) jenkins-bot: changeprop-jobqueue: migrate all remaining small jobs, also cdnPurge [deployment-charts] - https://gerrit.wikimedia.org/r/980855 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[13:45:21] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[13:45:45] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[13:45:51] (CR) CI reject: [V: -1] doc: update .readthedocs.yml configuration [software/cumin] - https://gerrit.wikimedia.org/r/980866 (owner: Volans)
[13:46:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4052.ulsfo.wmnet
[13:48:17] (CR) JMeybohm: [C: +1] deployment_server: add mcrouter service 1 [puppet] - https://gerrit.wikimedia.org/r/979339 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[13:48:27] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[13:48:28] (CR) JMeybohm: [C: +1] Add namespace for mcrouter service 2 [deployment-charts] - https://gerrit.wikimedia.org/r/979340
(https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[13:48:30] (CR) Volans: [V: +2 C: +2] "self-merging, CI failure is due to missing release upstream of pyparsing" [software/cumin] - https://gerrit.wikimedia.org/r/980866 (owner: Volans)
[13:48:48] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[13:50:29] (CR) Klausman: api-gateway: Add entry for recommendation-api-ng on Lift Wing (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/980865 (https://phabricator.wikimedia.org/T347263) (owner: Klausman)
[13:52:59] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[13:53:52] (CR) Elukey: api-gateway: Add entry for recommendation-api-ng on Lift Wing (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/980865 (https://phabricator.wikimedia.org/T347263) (owner: Klausman)
[13:54:50] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[13:56:02] (ConfdResourceFailed) firing: (2) confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[13:56:31] (CR) Effie Mouzeli: mcrouter: add chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window .
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231206T1400).
[14:00:04] xSavitar: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:08] (PS1) Muehlenhoff: Enable requestctl-based block list for nftables on sretest1001 [puppet] - https://gerrit.wikimedia.org/r/980868 (https://phabricator.wikimedia.org/T348734)
[14:01:44] I can deploy
[14:01:50] xSavitar: available?
[14:01:50] ok!
[14:04:06] xSavitar: ping me when you're available for your deployment of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/976252/ :-) I'm around for the hour
[14:07:21] (PS1) Muehlenhoff: Extend MOU for aarora [puppet] - https://gerrit.wikimedia.org/r/980870
[14:07:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:08:13] (CR) Volans: "Much better! There is still an open question on the API and some safety net to add, see inline."
[software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [14:09:00] (03CR) 10Muehlenhoff: [C: 03+2] Extend MOU for aarora [puppet] - 10https://gerrit.wikimedia.org/r/980870 (owner: 10Muehlenhoff) [14:09:22] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980868 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff) [14:10:09] (03PS2) 10Hnowlan: changeprop-jobqueue: migrate all low-traffic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/980856 (https://phabricator.wikimedia.org/T349796) [14:11:13] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [14:12:38] PROBLEM - Mediawiki CirrusSearch Saneitizer Weekly Fix Rate on alert1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [250000.0] https://wikitech.wikimedia.org/wiki/Search%23Saneitizer_%28background_repair_process%29 https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now [14:16:01] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10observability, and 3 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (10brouberol) > While we ought to consider an upgrade for all 4 clusters, from what I understand Jumbo can be upgraded independently. Are there any concerns... [14:17:35] (03PS1) 10Ssingh: traffic: remove paging for LVSHighRX alert [alerts] - 10https://gerrit.wikimedia.org/r/980871 [14:18:24] (03CR) 10CDanis: [C: 03+1] traffic: remove paging for LVSHighRX alert [alerts] - 10https://gerrit.wikimedia.org/r/980871 (owner: 10Ssingh) [14:19:56] (03CR) 10Ssingh: [C: 03+2] traffic: remove paging for LVSHighRX alert [alerts] - 10https://gerrit.wikimedia.org/r/980871 (owner: 10Ssingh) [14:20:00] (03CR) 10CDanis: [C: 03+1] "looks good!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980818 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [14:21:03] !log repooling cp4052 after reimage (bookworm -> bullseye) due to possible impacting T352744 [14:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:07] T352744: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 [14:23:30] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] New image: oauth2-proxy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980818 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [14:23:50] (03CR) 10Brouberol: "Looks great, with a minor suggestion. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway) [14:23:56] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: dnsbox [14:26:03] (03PS1) 10Muehlenhoff: Switch dnsbox to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980872 (https://phabricator.wikimedia.org/T349619) [14:27:32] (03CR) 10Ssingh: [C: 03+1] Switch dnsbox to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980872 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:28:01] (03CR) 10Muehlenhoff: [C: 03+2] Switch dnsbox to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980872 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:30:34] (03CR) 10Ilias Sarantopoulos: api-gateway: Add entry for recommendation-api-ng on Lift Wing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980865 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [14:31:38] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10observability, and 3 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (10elukey) >>! 
In T300102#9386753, @brouberol wrote: > >> Specifically are there clients that publish to Kafka Jumbo directly or do all Kafka topics get mi... [14:32:02] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:33:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:35:23] (03CR) 10Brouberol: [C: 03+1] "I checked that the URL can be reached and that we do get a response" [puppet] - 10https://gerrit.wikimedia.org/r/980503 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [14:37:52] (03CR) 10Bking: [C: 03+2] trafficserver: revert to using hostname for wdqs ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/980503 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [14:38:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dnsbox [14:39:06] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:36] 10SRE, 10Data-Engineering, 10Data-Platform-SRE (23/24 Q2 Milestone 1): Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (10herron) Thank you for looking into this @brouberol. Yes I think you are right about the anonymous ACLs. I thi... [14:41:44] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [14:43:21] !log installing debian-archive-keyring updates from Bookworm point release [14:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:44] (03CR) 10Hnowlan: [C: 03+1] "lgtm config-wise!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/980865 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [14:44:12] (03CR) 10Xcollazo: "Hmm.. I wonder if we should deploy to test cluster only to make sure I didn't break the world." [puppet] - 10https://gerrit.wikimedia.org/r/980445 (https://phabricator.wikimedia.org/T349121) (owner: 10Btullis) [14:45:41] (03PS12) 10Elukey: changeprop: refactor templating for Kafka producer/consumer settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) [14:45:46] 10SRE-swift-storage, 10Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (10bking) FYI, Elasticsearch doesn't use envoy either. The Flink pipelines @Joe mentioned are all in k8s, and wdqs (consumer of flink) uses envoy, so I think we're OK there? [14:48:22] (03CR) 10Elukey: changeprop: refactor templating for Kafka producer/consumer settings (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [14:48:32] xSavitar: last call for your deployment :) [14:48:58] (03CR) 10AikoChou: [C: 03+2] ml-services: update revert-risk images [deployment-charts] - 10https://gerrit.wikimedia.org/r/980819 (https://phabricator.wikimedia.org/T352181) (owner: 10AikoChou) [14:49:47] (03Merged) 10jenkins-bot: ml-services: update revert-risk images [deployment-charts] - 10https://gerrit.wikimedia.org/r/980819 (https://phabricator.wikimedia.org/T352181) (owner: 10AikoChou) [14:50:43] (03CR) 10Kevin Bazira: api-gateway: Add entry for recommendation-api-ng on Lift Wing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980865 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [14:54:06] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:32] (03PS1) 10Majavah: P:toolforge: bastion: make CPU quota more reasonable [puppet] - 10https://gerrit.wikimedia.org/r/980877 (https://phabricator.wikimedia.org/T352832) [14:56:06] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/980877 (https://phabricator.wikimedia.org/T352832) (owner: 10Majavah) [14:57:47] (03CR) 10Majavah: [C: 03+2] P:toolforge: bastion: make CPU quota more reasonable [puppet] - 10https://gerrit.wikimedia.org/r/980877 (https://phabricator.wikimedia.org/T352832) (owner: 10Majavah) [14:59:21] (03PS2) 10Muehlenhoff: Enable requestctl-based block list for nftables on sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/980868 (https://phabricator.wikimedia.org/T348734) [15:00:04] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [15:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231206T1500) [15:00:23] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:02:09] !log installing mariadb bugfix updates from Bookworm point release (as packaged in Debian, unrelated to wmf-mariadb packages) [15:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:21] 10Puppet: Add humorous redirect for fox.wikimedia.org - https://phabricator.wikimedia.org/T352870 (10TheresNoTime) [15:04:13] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: restbase::production [15:06:22] (03PS2) 10Samtar: redirects: Add funnel for fox.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/980879 
(https://phabricator.wikimedia.org/T352870) [15:06:30] (03Abandoned) 10Eevans: install_server: configure for initial install of restbase20[28-35] [puppet] - 10https://gerrit.wikimedia.org/r/968732 (https://phabricator.wikimedia.org/T348474) (owner: 10Eevans) [15:06:39] (03PS1) 10Muehlenhoff: Switch restbase::production to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980881 (https://phabricator.wikimedia.org/T349619) [15:08:02] (03CR) 10Muehlenhoff: [C: 03+2] Switch restbase::production to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980881 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:08:26] (03PS2) 10Klausman: api-gateway: Add entry for recommendation-api-ng on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/980865 (https://phabricator.wikimedia.org/T347263) [15:10:32] 10Puppet, 10SRE, 10Patch-For-Review: Add humorous redirect for fox.wikimedia.org - https://phabricator.wikimedia.org/T352870 (10TheresNoTime) [15:13:06] (03PS2) 10Jforrester: Beta Features: Move ULS Compact Links to only the wikis it's enabled on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980512 [15:14:01] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['testhost2001'] [15:15:50] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980868 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff) [15:16:19] (03PS2) 10Jforrester: Beta Features: Allow Vector 2022 typography feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980517 (https://phabricator.wikimedia.org/T351339) (owner: 10Jdlrobson) [15:16:21] (03CR) 10Jforrester: Beta Features: Allow Vector 2022 typography feature (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980517 (https://phabricator.wikimedia.org/T351339) (owner: 10Jdlrobson) [15:16:23] (03PS1) 10Jforrester: Beta Features: Drop Popups, deployed everywhere for ages [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/980883 [15:17:11] (03PS1) 10Eevans: restbase: set production role and add config for restbase2029 [puppet] - 10https://gerrit.wikimedia.org/r/980884 (https://phabricator.wikimedia.org/T352468) [15:17:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980512 (owner: 10Jforrester) [15:17:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980883 (owner: 10Jforrester) [15:18:27] (03Merged) 10jenkins-bot: Beta Features: Move ULS Compact Links to only the wikis it's enabled on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980512 (owner: 10Jforrester) [15:18:29] (03Merged) 10jenkins-bot: Beta Features: Drop Popups, deployed everywhere for ages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980883 (owner: 10Jforrester) [15:19:05] Meh, always fun to be the first scap of the day. [15:19:07] !log jforrester@deploy2002 Started scap: Backport for [[gerrit:980512|Beta Features: Move ULS Compact Links to only the wikis it's enabled on]], [[gerrit:980883|Beta Features: Drop Popups, deployed everywhere for ages]] [15:19:32] (03CR) 10Jforrester: [C: 03+1] "Good to land whenever the team is happy to deploy from my POV." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/980517 (https://phabricator.wikimedia.org/T351339) (owner: 10Jdlrobson) [15:19:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['testhost2001'] [15:19:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cephosd2001.mgmt.codfw.wmnet with reboot policy FORCED [15:20:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cephosd2002.mgmt.codfw.wmnet with reboot policy FORCED [15:20:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cephosd2003.mgmt.codfw.wmnet with reboot policy FORCED [15:21:06] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:980512|Beta Features: Move ULS Compact Links to only the wikis it's enabled on]], [[gerrit:980883|Beta Features: Drop Popups, deployed everywhere for ages]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:21:26] (03PS1) 10Jgiannelos: tegola: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/980885 [15:23:38] !log jforrester@deploy2002 jforrester: Continuing with sync [15:23:42] !log depool cp4037 for reimage testing: T350179 [15:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:45] T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 [15:24:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: restbase::production [15:27:24] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cephosd2001.mgmt.codfw.wmnet with reboot policy FORCED [15:27:47] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cephosd2002.mgmt.codfw.wmnet with reboot policy FORCED [15:27:54] 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting 
- https://phabricator.wikimedia.org/T350179 (10ssingh) [15:27:57] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cephosd2003.mgmt.codfw.wmnet with reboot policy FORCED [15:28:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cephosd2001.mgmt.codfw.wmnet with reboot policy FORCED [15:28:02] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:28:06] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [15:28:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cephosd2002.mgmt.codfw.wmnet with reboot policy FORCED [15:28:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cephosd2003.mgmt.codfw.wmnet with reboot policy FORCED [15:28:31] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:28:45] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply [15:28:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cephosd2001.mgmt.codfw.wmnet with reboot policy FORCED [15:29:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cephosd2002.mgmt.codfw.wmnet with reboot policy FORCED [15:29:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cephosd2003.mgmt.codfw.wmnet with reboot policy FORCED [15:30:00] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:09] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [15:30:41] !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:980512|Beta Features: Move ULS Compact 
Links to only the wikis it's enabled on]], [[gerrit:980883|Beta Features: Drop Popups, deployed everywhere for ages]] (duration: 11m 33s) [15:31:51] (03PS3) 10Andrew Bogott: Keystone: add logic to manage credential keys [puppet] - 10https://gerrit.wikimedia.org/r/980465 [15:31:56] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:31:58] (03PS3) 10Andrew Bogott: Keystone: turn on credential_key managementin eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/980486 [15:32:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cephosd2001.codfw.wmnet with OS bullseye [15:32:52] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd2001.codfw.wmnet with OS bullseye [15:33:09] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2001.codfw.wmnet with OS bullseye [15:33:12] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: sessionstore [15:33:14] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cephosd2001.codfw.wmnet with OS bullseye executed with errors: - ce... 
[15:33:20] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980465 (owner: 10Andrew Bogott) [15:33:36] (03PS4) 10Andrew Bogott: Keystone: turn on credential_key management in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/980486 [15:33:46] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980486 (owner: 10Andrew Bogott) [15:34:28] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:34:44] (03PS1) 10Muehlenhoff: Switch sessionstore to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980887 (https://phabricator.wikimedia.org/T349619) [15:35:19] (03PS2) 10Kamila Součková: mobileapps: 60% to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/976222 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [15:35:39] (03PS1) 10Kamila Součková: mw-api-int: increase replicas by 33% [deployment-charts] - 10https://gerrit.wikimedia.org/r/980888 (https://phabricator.wikimedia.org/T350846) [15:35:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:36:03] (03CR) 10Muehlenhoff: [C: 03+2] Switch sessionstore to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980887 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:37:02] (03CR) 10Effie Mouzeli: [C: 03+1] changeprop-jobqueue: migrate all low-traffic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/980856 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [15:37:23] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:38:27] (03CR) 10Hnowlan: [C: 03+1] mw-api-int: increase replicas by 33% [deployment-charts] - 10https://gerrit.wikimedia.org/r/980888 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [15:38:37] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:38:46] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: 
apply [15:38:52] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply [15:39:18] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply [15:39:43] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:41:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: sessionstore [15:42:26] (03PS4) 10Andrew Bogott: Keystone: add logic to manage credential keys [puppet] - 10https://gerrit.wikimedia.org/r/980465 [15:42:28] (03PS5) 10Andrew Bogott: Keystone: turn on credential_key management in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/980486 [15:43:06] (03PS5) 10JHathaway: apt_repo: move hiera data into module, to allow for validation [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) [15:43:26] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: migrate all low-traffic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/980856 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [15:43:44] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980465 (owner: 10Andrew Bogott) [15:43:53] (03CR) 10JHathaway: apt_repo: move hiera data into module, to allow for validation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway) [15:44:19] (03CR) 10JHathaway: apt_repo: move hiera data into module, to allow for validation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway) [15:44:19] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [15:44:36] (03Merged) 10jenkins-bot: changeprop-jobqueue: migrate all low-traffic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/980856 
(https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [15:44:51] (03PS1) 10Peter Fischer: Bump services/cirrus-streaming-updater version [deployment-charts] - 10https://gerrit.wikimedia.org/r/980890 [15:45:22] (03CR) 10Peter Fischer: [C: 03+2] Bump services/cirrus-streaming-updater version [deployment-charts] - 10https://gerrit.wikimedia.org/r/980890 (owner: 10Peter Fischer) [15:46:13] (03Merged) 10jenkins-bot: Bump services/cirrus-streaming-updater version [deployment-charts] - 10https://gerrit.wikimedia.org/r/980890 (owner: 10Peter Fischer) [15:46:17] !log restarting Cassandra on aqs2001-{a,b,c} (testing puppet 7 migration) [15:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:47] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [15:47:17] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [15:47:59] (03PS1) 10JMeybohm: kubernetes: Remove cergen certs from kubernetes secrets [labs/private] - 10https://gerrit.wikimedia.org/r/980891 (https://phabricator.wikimedia.org/T300033) [15:48:23] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [15:48:44] (03PS1) 10Muehlenhoff: Add library hint for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/980892 [15:48:50] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [15:49:13] (03CR) 10CI reject: [V: 04-1] Add library hint for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/980892 (owner: 10Muehlenhoff) [15:50:48] (03PS1) 10Andrew Bogott: keystone fernet_keys: remove old absent section [puppet] - 10https://gerrit.wikimedia.org/r/980893 [15:51:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS bullseye [15:51:57] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:52:11] !log pfischer@deploy2002 helmfile 
[staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:52:27] (03PS2) 10Muehlenhoff: Add library hint for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/980892 [15:55:11] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/980892 (owner: 10Muehlenhoff) [15:56:26] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: sync [15:56:29] (03CR) 10Andrew Bogott: [C: 04-2] "I think this is totally wrong and that we can just have one set of creds that are long-lived." [puppet] - 10https://gerrit.wikimedia.org/r/980465 (owner: 10Andrew Bogott) [15:56:37] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: sync [15:59:00] (03CR) 10Eevans: [C: 03+2] restbase: set production role and add config for restbase2029 [puppet] - 10https://gerrit.wikimedia.org/r/980884 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans) [16:01:01] (03PS1) 10DLynch: Enable Edit Check on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980895 (https://phabricator.wikimedia.org/T352355) [16:04:18] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [16:05:37] !log milimetric@deploy2002 Started deploy [airflow-dags/platform_eng@db1cb48]: in order to run the querypage job [16:07:05] !log milimetric@deploy2002 Finished deploy [airflow-dags/platform_eng@db1cb48]: in order to run the querypage job (duration: 01m 28s) [16:13:57] (03PS1) 10Jdlrobson: Correct links to beta feature [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/980477 (https://phabricator.wikimedia.org/T352826) [16:14:47] (03PS3) 10Klausman: api-gateway: Add entry for recommendation-api-ng on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/980865 (https://phabricator.wikimedia.org/T347263) [16:15:39] (03CR) 10Klausman: api-gateway: Add entry for recommendation-api-ng on 
Lift Wing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980865 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [16:18:12] (03CR) 10AOkoth: [C: 03+2] vrts: add v6.4.5 required packages [puppet] - 10https://gerrit.wikimedia.org/r/975906 (https://phabricator.wikimedia.org/T349349) (owner: 10AOkoth) [16:20:57] 10Puppet, 10Instrument-ClientError, 10Patch-For-Review: Google Translate and other translate services triggering client error alert - https://phabricator.wikimedia.org/T351738 (10Jdlrobson) @colewhite could you please help me with this? [16:21:09] (03PS4) 10Jdlrobson: Filter errors originating in external tools [puppet] - 10https://gerrit.wikimedia.org/r/975836 (https://phabricator.wikimedia.org/T349935) [16:21:40] (03PS1) 10Andrew Bogott: Keystone: add fake cred keys [labs/private] - 10https://gerrit.wikimedia.org/r/980900 [16:23:12] (03CR) 10Jdlrobson: [C: 03+1] "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980883 (owner: 10Jforrester) [16:24:43] (03PS1) 10Jdlrobson: References previews is no longer a beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980901 (https://phabricator.wikimedia.org/T282999) [16:24:55] (03CR) 10Joal: Update the refinery version used by the refine jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980445 (https://phabricator.wikimedia.org/T349121) (owner: 10Btullis) [16:27:42] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:44] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10hnowlan) [16:29:00] 10SRE, 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [16:29:11] !log bootstrapping Cassandra/restbase2020-a — T352468 [16:29:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:16] T352468: Provision new RESTBase cluster nodes: restbase20[28-35] - https://phabricator.wikimedia.org/T352468 [16:29:18] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (10hnowlan) 05Open→03In progress a:03hnowlan [16:29:41] (03PS2) 10DLynch: DiscussionTools visual enhancements on pages with __NEWSECTIONLINK__ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971465 (https://phabricator.wikimedia.org/T331635) [16:31:18] PROBLEM - Check systemd state on restbase2029 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:20] (03PS2) 10Ssingh: P:dns::auth: add support for depooling recdns via confd [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) [16:32:18] RECOVERY - Check systemd state on restbase2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:18] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:33:18] 10Puppet, 10Wikidata, 10wmde-wikidata-tech, 10Technical-Debt, 10Wikidata Analytics (Kanban): Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072 (10Manuel) [16:34:10] (03PS5) 10Andrew Bogott: Keystone: add logic to manage credential keys [puppet] - 10https://gerrit.wikimedia.org/r/980465 [16:34:12] (03PS2) 10Andrew Bogott: keystone fernet_keys: remove old absent section [puppet] - 10https://gerrit.wikimedia.org/r/980893 [16:34:40] (03Abandoned) 10DLynch: Enable Edit Check on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980895 (https://phabricator.wikimedia.org/T352355) 
(owner: 10DLynch) [16:35:04] (03Abandoned) 10Andrew Bogott: Keystone: turn on credential_key management in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/980486 (owner: 10Andrew Bogott) [16:35:11] (03CR) 10CI reject: [V: 04-1] Keystone: add logic to manage credential keys [puppet] - 10https://gerrit.wikimedia.org/r/980465 (owner: 10Andrew Bogott) [16:35:15] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Keystone: add fake cred keys [labs/private] - 10https://gerrit.wikimedia.org/r/980900 (owner: 10Andrew Bogott) [16:35:19] (03CR) 10CI reject: [V: 04-1] keystone fernet_keys: remove old absent section [puppet] - 10https://gerrit.wikimedia.org/r/980893 (owner: 10Andrew Bogott) [16:37:02] (03CR) 10Hnowlan: [C: 03+2] geo-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/978529 (owner: 10Hnowlan) [16:37:22] (03PS11) 10Andrea Denisse: ircecho: Migrate the ircecho script from Python 2 to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) [16:37:30] (03PS6) 10Andrew Bogott: Keystone: add logic to manage credential keys [puppet] - 10https://gerrit.wikimedia.org/r/980465 [16:37:34] (03PS3) 10Andrew Bogott: keystone fernet_keys: remove old absent section [puppet] - 10https://gerrit.wikimedia.org/r/980893 [16:37:42] (03Abandoned) 10Hnowlan: mw-jobrunner: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/978500 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:37:44] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: Test IP-renumbering on a kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (10Clement_Goubert) [16:37:59] (03Merged) 10jenkins-bot: geo-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/978529 (owner: 10Hnowlan) [16:38:09] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 
(10Clement_Goubert) [16:38:27] (03CR) 10Andrea Denisse: ircecho: Migrate the ircecho script from Python 2 to Python 3 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [16:38:41] (03CR) 10Andrew Bogott: Keystone: add logic to manage credential keys [puppet] - 10https://gerrit.wikimedia.org/r/980465 (owner: 10Andrew Bogott) [16:39:21] (03CR) 10Andrea Denisse: ircecho: Migrate the ircecho script from Python 2 to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [16:39:38] (03PS7) 10Andrew Bogott: Keystone: add logic to manage credential keys [puppet] - 10https://gerrit.wikimedia.org/r/980465 [16:39:40] (03PS4) 10Andrew Bogott: keystone fernet_keys: remove old absent section [puppet] - 10https://gerrit.wikimedia.org/r/980893 [16:40:09] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply [16:40:35] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply [16:40:55] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [16:41:16] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (10Clement_Goubert) p:05Triage→03Medium a:03Clement_Goubert [16:41:23] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply [16:41:50] (03Abandoned) 10DLynch: DiscussionTools visual enhancements on pages with __NEWSECTIONLINK__ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971465 (https://phabricator.wikimedia.org/T331635) (owner: 10DLynch) [16:42:01] (03PS12) 10Andrea Denisse: ircecho: Migrate the ircecho script from Python 2 to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) [16:42:16] PROBLEM - cassandra-a CQL 
10.192.16.240:9042 on restbase2029 is CRITICAL: connect to address 10.192.16.240 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:42:31] (03PS2) 10Hnowlan: kubernetes::worker: add mw-jobrunner to pools [puppet] - 10https://gerrit.wikimedia.org/r/973824 (https://phabricator.wikimedia.org/T349796) [16:42:46] (03Abandoned) 10Hnowlan: kubernetes::worker: add mw-jobrunner to pools [puppet] - 10https://gerrit.wikimedia.org/r/973824 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:43:13] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980465 (owner: 10Andrew Bogott) [16:44:28] (03PS3) 10Ssingh: P:dns::auth: add support for depooling authdns via confd [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) [16:44:36] (03PS1) 10Elukey: python-webapp: update mesh and base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/980904 [16:46:00] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [16:46:36] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:48:12] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: add logic to manage credential keys [puppet] - 10https://gerrit.wikimedia.org/r/980465 (owner: 10Andrew Bogott) [16:48:22] (03CR) 10Andrew Bogott: Keystone: add logic to manage credential keys [puppet] - 10https://gerrit.wikimedia.org/r/980465 (owner: 10Andrew Bogott) [16:49:36] PROBLEM - cassandra-b CQL 10.192.16.241:9042 on restbase2029 is CRITICAL: connect to address 10.192.16.241 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:50:12] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "This is also part of I54f8dfb. Merging either one works." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980901 (https://phabricator.wikimedia.org/T282999) (owner: 10Jdlrobson) [16:51:34] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [16:52:02] PROBLEM - cassandra-b SSL 10.192.16.241:7000 on restbase2029 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [17:20:18] (03CR) 10BBlack: P:dns::auth: add support for depooling authdns via confd (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:29:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (10VRiley-WMF) a:03VRiley-WMF [17:30:33] (03PS1) 10Elukey: profile::cache::kafka::webrequest: add the Sec-Profile req header [puppet] - 10https://gerrit.wikimedia.org/r/980912 (https://phabricator.wikimedia.org/T346463) [17:31:02] (03CR) 10CI reject: [V: 04-1] profile::cache::kafka::webrequest: add the Sec-Profile req header [puppet] - 10https://gerrit.wikimedia.org/r/980912 (https://phabricator.wikimedia.org/T346463) 
(owner: 10Elukey) [17:33:41] (03PS13) 10Andrea Denisse: ircecho: Migrate the ircecho script from Python 2 to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) [17:34:05] (03PS2) 10Elukey: profile::cache::kafka::webrequest: change the JSON format [puppet] - 10https://gerrit.wikimedia.org/r/980912 (https://phabricator.wikimedia.org/T346463) [17:34:54] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [17:35:27] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/840/con" [puppet] - 10https://gerrit.wikimedia.org/r/980912 (https://phabricator.wikimedia.org/T346463) (owner: 10Elukey) [17:36:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS bullseye [17:39:31] (03PS1) 10Ryan Kemper: wdqs: open firewall rules for graph_split [puppet] - 10https://gerrit.wikimedia.org/r/980914 (https://phabricator.wikimedia.org/T350106) [17:40:35] (03CR) 10Ryan Kemper: "Not sure if the srange is set properly but wanted to get a patch up in advance of our meeting later today" [puppet] - 10https://gerrit.wikimedia.org/r/980914 (https://phabricator.wikimedia.org/T350106) (owner: 10Ryan Kemper) [17:45:21] (03CR) 10Ssingh: [V: 03+1] P:dns::auth: add support for depooling authdns via confd (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:47:00] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980914 (https://phabricator.wikimedia.org/T350106) (owner: 10Ryan Kemper) [17:51:01] 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10Papaul) @Volans @ssingh asked me to take a look at the issue to see 
what i can find. working on cp4037 Test1 when I start the reimage cookbook,... [17:56:03] (ConfdResourceFailed) firing: (2) confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:57:10] (03CR) 10Alex Paskulin: rest-gateway: fix device analytics routing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980471 (https://phabricator.wikimedia.org/T343268) (owner: 10Alex Paskulin) [18:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231206T1800) [18:02:52] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4037.ulsfo.wmnet with OS bullseye [18:09:17] (03PS1) 10Giuseppe Lavagetto: Add asyncio implementation [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/980918 (https://phabricator.wikimedia.org/T338297) [18:14:45] (03CR) 10CI reject: [V: 04-1] Add asyncio implementation [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/980918 (https://phabricator.wikimedia.org/T338297) (owner: 10Giuseppe Lavagetto) [18:18:29] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd2001.codfw.wmnet with OS bullseye [18:18:34] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cephosd2001.codfw.wmnet with OS bullseye executed with errors: - ceph... 
[18:22:57] (03CR) 10Muehlenhoff: wdqs: open firewall rules for graph_split (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/980914 (https://phabricator.wikimedia.org/T350106) (owner: 10Ryan Kemper) [18:23:44] 10Puppet, 10SRE, 10Patch-For-Review: Add humorous redirect for fox.wikimedia.org - https://phabricator.wikimedia.org/T352870 (10ClydeFranklin) I love fopkses! [18:27:02] (03PS1) 10Esanders: DiscussionTools: Rename config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980920 [18:30:39] (03PS1) 10Andrea Denisse: klaxon: Ensure the klaxon user has a home directory [puppet] - 10https://gerrit.wikimedia.org/r/980921 (https://phabricator.wikimedia.org/T333615) [18:31:08] (03CR) 10CI reject: [V: 04-1] klaxon: Ensure the klaxon user has a home directory [puppet] - 10https://gerrit.wikimedia.org/r/980921 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [18:33:24] (03CR) 10Brouberol: [C: 03+1] "Approved, with a small comment sugestion" [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway) [18:33:34] (03PS4) 10Ssingh: P:dns::auth: add support for depooling authdns via confd [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) [18:34:51] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [18:35:49] (03PS2) 10Andrea Denisse: klaxon: Ensure the klaxon user has a home directory [puppet] - 10https://gerrit.wikimedia.org/r/980921 (https://phabricator.wikimedia.org/T333615) [18:44:53] 10SRE, 10Cloud-VPS: enable lists.wikimedia.org or wikimedia.org addresses to receive dmarc reports for *.wmflabs.org - https://phabricator.wikimedia.org/T352902 (10jsn.sherman) [18:46:42] (03PS2) 10Btullis: Update the refinery 
version used by the refine test jobs [puppet] - 10https://gerrit.wikimedia.org/r/980445 (https://phabricator.wikimedia.org/T349121) [18:46:45] (03PS1) 10Btullis: Update the refinery version used by the refine production jobs [puppet] - 10https://gerrit.wikimedia.org/r/980923 (https://phabricator.wikimedia.org/T349121) [18:49:27] (03PS3) 10Btullis: Update the refinery version used by the refine test jobs [puppet] - 10https://gerrit.wikimedia.org/r/980445 (https://phabricator.wikimedia.org/T349121) [18:49:59] (03PS2) 10Btullis: Update the refinery version used by the refine production jobs [puppet] - 10https://gerrit.wikimedia.org/r/980923 (https://phabricator.wikimedia.org/T349121) [18:51:17] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/842/con" [puppet] - 10https://gerrit.wikimedia.org/r/980445 (https://phabricator.wikimedia.org/T349121) (owner: 10Btullis) [18:53:08] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/843/con" [puppet] - 10https://gerrit.wikimedia.org/r/980923 (https://phabricator.wikimedia.org/T349121) (owner: 10Btullis) [18:53:41] (03CR) 10Clément Goubert: [C: 04-1] "Thanks for the CR Kartik, as is the image does not build correctly on my system, but once we've clarified the situation around installing " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: 10KartikMistry) [18:53:50] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on wdqs1024.eqiad.wmnet with reason: T352878 [18:53:57] T352878: Troubleshoot recurring systemd unit failures for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 [18:54:07] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on wdqs1024.eqiad.wmnet with reason: 
T352878 [18:55:09] 10SRE, 10Cloud-VPS: enable lists.wikimedia.org or wikimedia.org email addresses to receive dmarc reports for *.wmflabs.org - https://phabricator.wikimedia.org/T352902 (10jsn.sherman) [18:55:21] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1024.eqiad.wmnet [18:56:42] (03PS1) 10Ayounsi: BGPPeers: add codfw racks A1 to B8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/980925 (https://phabricator.wikimedia.org/T352893) [19:00:05] Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231206T1900) [19:00:23] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:02:51] (03PS2) 10Ryan Kemper: wdqs: open firewall rules for graph_split [puppet] - 10https://gerrit.wikimedia.org/r/980914 (https://phabricator.wikimedia.org/T350106) [19:05:40] (03CR) 10Muehlenhoff: [C: 03+1] wdqs: open firewall rules for graph_split [puppet] - 10https://gerrit.wikimedia.org/r/980914 (https://phabricator.wikimedia.org/T350106) (owner: 10Ryan Kemper) [19:07:50] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host wdqs1024.eqiad.wmnet [19:19:10] (03PS1) 10Ayounsi: k8s topology labels: add row to rack transition [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) [19:19:40] (03CR) 10CI reject: [V: 04-1] k8s topology labels: add row to rack transition [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [19:19:44] (03PS2) 10Ayounsi: k8s topology labels: add row to rack transition [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) [19:22:21] (03PS6) 10JHathaway: apt_repo: move hiera data into module, to allow for validation [puppet] - 
10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) [19:22:33] (03CR) 10JHathaway: apt_repo: move hiera data into module, to allow for validation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway) [19:23:39] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: open firewall rules for graph_split [puppet] - 10https://gerrit.wikimedia.org/r/980914 (https://phabricator.wikimedia.org/T350106) (owner: 10Ryan Kemper) [19:36:12] (03PS1) 10BBlack: dnsrecursor: forward_zones for wikimedia.org, too [puppet] - 10https://gerrit.wikimedia.org/r/980929 (https://phabricator.wikimedia.org/T347054) [19:46:52] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Move lvs2012 primary uplink and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352909 (10cmooney) [19:47:18] (03PS1) 10Cathal Mooney: Add new codfw per-rack vlans to lvs2012 and move row A vlans [puppet] - 10https://gerrit.wikimedia.org/r/980931 (https://phabricator.wikimedia.org/T352909) [19:50:51] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, and 2 others: Move lvs2012 primary uplink and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352909 (10cmooney) [19:51:50] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-blazegraph.service,wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:56] PROBLEM - Blazegraph process -wdqs-categories- on wdqs1024 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:53:16] PROBLEM - Blazegraph Port for wdqs-categories on wdqs1024 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:54:36] 
10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, and 2 others: Move lvs2012 primary uplink and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352909 (10cmooney) [19:59:54] (03CR) 10DLynch: [C: 03+1] DiscussionTools: Rename config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980920 (owner: 10Esanders) [20:03:34] (03PS1) 10Dwisehaupt: Add check_coworker nagios check to frdev and civi [puppet] - 10https://gerrit.wikimedia.org/r/980934 (https://phabricator.wikimedia.org/T324611) [20:11:15] (03CR) 10Jgreen: [C: 03+2] Add check_coworker nagios check to frdev and civi [puppet] - 10https://gerrit.wikimedia.org/r/980934 (https://phabricator.wikimedia.org/T324611) (owner: 10Dwisehaupt) [20:13:31] (03PS1) 10Samtar: wikimedia.org: add fox. [dns] - 10https://gerrit.wikimedia.org/r/980935 (https://phabricator.wikimedia.org/T352870) [20:15:20] (03PS1) 10Milimetric: maintain-views: add note on linktarget sanitization [puppet] - 10https://gerrit.wikimedia.org/r/980936 (https://phabricator.wikimedia.org/T352879) [20:17:21] (03CR) 10Kimberly Sarabia: [C: 03+1] Correct links to beta feature [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/980477 (https://phabricator.wikimedia.org/T352826) (owner: 10Jdlrobson) [20:18:13] Is there an SRE around who can do a puppet-merge to deploy https://gerrit.wikimedia.org/r/980934? I think I am no longer able to do that step (as expected) although I seem to still have +2 for the puppet repo in gerrit. 
[20:19:18] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Move lvs2011 primary uplink and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352912 (10cmooney) [20:21:00] (03PS1) 10Jgreen: Revert "Add check_coworker nagios check to frdev and civi" [puppet] - 10https://gerrit.wikimedia.org/r/980481 [20:21:45] (03CR) 10CI reject: [V: 04-1] Revert "Add check_coworker nagios check to frdev and civi" [puppet] - 10https://gerrit.wikimedia.org/r/980481 (owner: 10Jgreen) [20:22:26] (03CR) 10Jgreen: [C: 03+1] Revert "Add check_coworker nagios check to frdev and civi" [puppet] - 10https://gerrit.wikimedia.org/r/980481 (owner: 10Jgreen) [20:25:14] (03PS2) 10Jgreen: Revert "Add check_coworker nagios check to frdev and civi" [puppet] - 10https://gerrit.wikimedia.org/r/980481 [20:26:30] ^^^ n/m. I reverted it instead. [20:27:15] (03PS2) 10Cathal Mooney: Add new codfw per-rack vlans to lvs2012 and move row A vlans [puppet] - 10https://gerrit.wikimedia.org/r/980931 (https://phabricator.wikimedia.org/T352909) [20:28:03] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] maintain-views: add note on linktarget sanitization [puppet] - 10https://gerrit.wikimedia.org/r/980936 (https://phabricator.wikimedia.org/T352879) (owner: 10Milimetric) [20:29:43] (03PS1) 10Milimetric: sqoop: move where we get the linktarget from [puppet] - 10https://gerrit.wikimedia.org/r/980939 (https://phabricator.wikimedia.org/T352879) [20:29:49] (03PS1) 10Cathal Mooney: Add new codfw per-rack vlans to lvs2011 and move row B vlans [puppet] - 10https://gerrit.wikimedia.org/r/980940 (https://phabricator.wikimedia.org/T352912) [20:30:24] (03CR) 10Ssingh: [C: 03+1] dnsrecursor: forward_zones for wikimedia.org, too [puppet] - 10https://gerrit.wikimedia.org/r/980929 (https://phabricator.wikimedia.org/T347054) (owner: 10BBlack) [20:30:57] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, and 2 others: Move lvs2011 primary uplink and connect to new row A/B vlans - 
https://phabricator.wikimedia.org/T352912 (10cmooney) [20:31:42] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, and 2 others: Move lvs2012 primary uplink and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352909 (10cmooney) [20:32:06] (03CR) 10Jgreen: [C: 03+1] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [20:45:15] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10Jclark-ctr) a:03Jclark-ctr [20:49:32] (03PS2) 10Ladsgroup: sqoop: move where we get the linktarget from [puppet] - 10https://gerrit.wikimedia.org/r/980939 (https://phabricator.wikimedia.org/T352879) (owner: 10Milimetric) [20:49:38] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] sqoop: move where we get the linktarget from [puppet] - 10https://gerrit.wikimedia.org/r/980939 (https://phabricator.wikimedia.org/T352879) (owner: 10Milimetric) [20:50:06] (03PS4) 10Bking: wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: 10Ryan Kemper) [20:50:35] (03CR) 10CI reject: [V: 04-1] wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: 10Ryan Kemper) [20:51:16] RECOVERY - cassandra-a CQL 10.192.16.240:9042 on restbase2029 is OK: TCP OK - 0.032 second response time on 10.192.16.240 port 9042 https://phabricator.wikimedia.org/T93886 [20:52:44] (03PS5) 10Bking: wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: 10Ryan Kemper) [20:53:52] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Papaul) @Jhancock.wm this is failing because site.pp and apt_repo.yaml 
still have the old host names [20:55:31] (03Abandoned) 10Dwisehaupt: Revert "Add check_coworker nagios check to frdev and civi" [puppet] - 10https://gerrit.wikimedia.org/r/980481 (owner: 10Jgreen) [20:56:33] 10SRE, 10Cloud-VPS: enable lists.wikimedia.org or wikimedia.org email addresses to receive dmarc reports for *.wmflabs.org - https://phabricator.wikimedia.org/T352902 (10herron) In cases where outbound mail delivery is important basic inbound mail handling should be configured for the (sub)domain and any from... [20:56:39] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: 10Ryan Kemper) [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231206T2100). nyaa~ [21:00:05] kemayo and kimberly_sarabia: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:09] o/ [21:01:49] Hello [21:03:30] TheresNoTime: RoanKattouw urbanecm are either of you available to deploy? [21:03:32] RECOVERY - cassandra-b service on restbase2029 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:03:57] if no one else is around, i can [21:04:20] Thanks! [21:04:24] I just have two config patches, and one of them is a no-op that I don't particularly need tested. [21:04:52] (03CR) 10Urbanecm: [C: 03+2] Correct links to beta feature [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/980477 (https://phabricator.wikimedia.org/T352826) (owner: 10Jdlrobson) [21:05:29] urbanecm: Could the Vector patch and the beta feature config patch go out together? 
[21:05:39] sure [21:05:52] (03PS2) 10Urbanecm: Enable DT visual enhancements on pages with __NEWSECTIONLINK__ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978531 (https://phabricator.wikimedia.org/T352232) (owner: 10Esanders) [21:06:06] (03CR) 10Urbanecm: [C: 03+2] Enable DT visual enhancements on pages with __NEWSECTIONLINK__ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978531 (https://phabricator.wikimedia.org/T352232) (owner: 10Esanders) [21:06:09] (03CR) 10Urbanecm: [C: 03+2] DiscussionTools: Rename config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980920 (owner: 10Esanders) [21:06:29] 👍 [21:07:06] (03Merged) 10jenkins-bot: Enable DT visual enhancements on pages with __NEWSECTIONLINK__ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978531 (https://phabricator.wikimedia.org/T352232) (owner: 10Esanders) [21:07:08] (03CR) 10CI reject: [V: 04-1] DiscussionTools: Rename config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980920 (owner: 10Esanders) [21:08:00] (03PS6) 10Bking: wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: 10Ryan Kemper) [21:08:25] Kemayo: there seems to be a rebase conflict. can you fix it please? 
[21:09:42] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: 10Ryan Kemper) [21:09:45] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:978531|Enable DT visual enhancements on pages with __NEWSECTIONLINK__ (T352232)]] [21:09:49] T352232: [Config] Phase 1 deployment of usability improvements on pages using _NEWSECTIONLINK_ - https://phabricator.wikimedia.org/T352232 [21:09:51] urbanecm: Sure, just a second [21:11:03] !log urbanecm@deploy2002 urbanecm and esanders: Backport for [[gerrit:978531|Enable DT visual enhancements on pages with __NEWSECTIONLINK__ (T352232)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:11:16] Kemayo: and please test ^^ at mwdebug [21:13:01] (03PS2) 10DLynch: DiscussionTools: Rename config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980920 (owner: 10Esanders) [21:13:13] urbanecm: Okay, rebased, now I'm testing [21:13:29] (03CR) 10Urbanecm: [C: 03+2] DiscussionTools: Rename config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980920 (owner: 10Esanders) [21:13:48] urbanecm: Testing looks good 👍🏻 [21:13:52] !log urbanecm@deploy2002 urbanecm and esanders: Continuing with sync [21:13:55] proceeding [21:13:56] (03CR) 10Dzahn: [V: 03+1 C: 03+2] planet: add ensure parameter allowing to disable update jobs [puppet] - 10https://gerrit.wikimedia.org/r/979169 (https://phabricator.wikimedia.org/T348392) (owner: 10Dzahn) [21:14:35] (03Merged) 10jenkins-bot: DiscussionTools: Rename config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980920 (owner: 10Esanders) [21:16:09] (03PS1) 10Jforrester: api: Only force backlink namespace index when there is one ns only [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/980483 (https://phabricator.wikimedia.org/T351237) [21:20:21] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "works as intended. 
noop on *002 servers, removed timers on *003 servers" [puppet] - 10https://gerrit.wikimedia.org/r/979169 (https://phabricator.wikimedia.org/T348392) (owner: 10Dzahn) [21:20:28] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:978531|Enable DT visual enhancements on pages with __NEWSECTIONLINK__ (T352232)]] (duration: 10m 43s) [21:20:41] T352232: [Config] Phase 1 deployment of usability improvements on pages using _NEWSECTIONLINK_ - https://phabricator.wikimedia.org/T352232 [21:21:11] (03PS7) 10Bking: wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: 10Ryan Kemper) [21:21:33] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:980920|DiscussionTools: Rename config]] [21:21:48] (03Merged) 10jenkins-bot: Correct links to beta feature [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/980477 (https://phabricator.wikimedia.org/T352826) (owner: 10Jdlrobson) [21:22:50] !log urbanecm@deploy2002 esanders and urbanecm: Backport for [[gerrit:980920|DiscussionTools: Rename config]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:22:51] (03PS3) 10Urbanecm: Beta Features: Allow Vector 2022 typography feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980517 (https://phabricator.wikimedia.org/T351339) (owner: 10Jdlrobson) [21:23:34] (03PS8) 10Bking: wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: 10Ryan Kemper) [21:23:41] Kemayo: second patch is at mwdebug, can you take a look? 
[21:24:51] urbanecm: Doesn't seem to have caused any problems [21:25:05] !log urbanecm@deploy2002 esanders and urbanecm: Continuing with sync [21:25:06] proceeding [21:25:29] (03CR) 10Urbanecm: [C: 03+2] Beta Features: Allow Vector 2022 typography feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980517 (https://phabricator.wikimedia.org/T351339) (owner: 10Jdlrobson) [21:27:09] (03Merged) 10jenkins-bot: Beta Features: Allow Vector 2022 typography feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980517 (https://phabricator.wikimedia.org/T351339) (owner: 10Jdlrobson) [21:28:44] RECOVERY - cassandra-b SSL 10.192.16.241:7000 on restbase2029 is OK: SSL OK - Certificate restbase2029-b valid until 2025-12-05 16:11:13 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:29:57] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:31:34] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:980920|DiscussionTools: Rename config]] (duration: 10m 01s) [21:32:16] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [21:32:27] (03CR) 10Xcollazo: [C: 03+1] "Thanks for the changes Btullis!" 
[puppet] - 10https://gerrit.wikimedia.org/r/980445 (https://phabricator.wikimedia.org/T349121) (owner: 10Btullis) [21:32:57] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:980477|Correct links to beta feature (T352826)]], [[gerrit:980517|Beta Features: Allow Vector 2022 typography feature (T351339)]] [21:33:07] Jdlrobson: deploying the config and backport now [21:33:09] T352826: Replace links in the Vector 2022 beta feature description - https://phabricator.wikimedia.org/T352826 [21:33:09] T351339: Deploy client preferences to production beta features - https://phabricator.wikimedia.org/T351339 [21:34:19] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt ms-be - jclark@cumin1001" [21:34:21] !log urbanecm@deploy2002 urbanecm and jdlrobson: Backport for [[gerrit:980477|Correct links to beta feature (T352826)]], [[gerrit:980517|Beta Features: Allow Vector 2022 typography feature (T351339)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:34:37] Jdlrobson: kimberly_sarabia: can you test your patches at mwdebug please? [21:34:45] yes one moment [21:35:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt ms-be - jclark@cumin1001" [21:35:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:35:40] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1082.mgmt.eqiad.wmnet with reboot policy FORCED [21:35:59] urbanecm: LGTM! 
[21:36:05] proceeding [21:36:07] !log urbanecm@deploy2002 urbanecm and jdlrobson: Continuing with sync [21:37:10] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:43:48] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:980477|Correct links to beta feature (T352826)]], [[gerrit:980517|Beta Features: Allow Vector 2022 typography feature (T351339)]] (duration: 10m 51s) [21:43:53] kimberly_sarabia: synced [21:43:53] T352826: Replace links in the Vector 2022 beta feature description - https://phabricator.wikimedia.org/T352826 [21:43:54] T351339: Deploy client preferences to production beta features - https://phabricator.wikimedia.org/T351339 [21:43:55] anything else? [21:44:48] Puppet, SRE, Patch-For-Review: Add humorous redirect for fox.wikimedia.org - https://phabricator.wikimedia.org/T352870 (Peachey88) Is there any particular reason we desire to enable additional tech-debt by having to maintain this for years to come? 
[21:45:22] PROBLEM - Check systemd state on mw1350 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:34] SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918 (cmooney) [21:45:59] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1082.mgmt.eqiad.wmnet with reboot policy FORCED [21:46:11] (PS1) Cathal Mooney: Move lvs2012 from private1-b-codfw to private1-b2-codfw vlan [puppet] - https://gerrit.wikimedia.org/r/980948 (https://phabricator.wikimedia.org/T352918) [21:47:18] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [21:47:18] (CR) CI reject: [V: -1] Move lvs2012 from private1-b-codfw to private1-b2-codfw vlan [puppet] - https://gerrit.wikimedia.org/r/980948 (https://phabricator.wikimedia.org/T352918) (owner: Cathal Mooney) [21:50:10] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt ms-be - jclark@cumin1001" [21:50:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt ms-be - jclark@cumin1001" [21:51:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:51:38] urbanecm: Thanks! 
[21:51:46] np [21:51:47] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1081.mgmt.eqiad.wmnet with reboot policy FORCED [21:52:18] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1080.mgmt.eqiad.wmnet with reboot policy FORCED [21:53:06] (PS2) Cathal Mooney: Move lvs2012 from private1-b-codfw to private1-b2-codfw vlan [puppet] - https://gerrit.wikimedia.org/r/980948 (https://phabricator.wikimedia.org/T352918) [21:53:58] (CR) CI reject: [V: -1] Move lvs2012 from private1-b-codfw to private1-b2-codfw vlan [puppet] - https://gerrit.wikimedia.org/r/980948 (https://phabricator.wikimedia.org/T352918) (owner: Cathal Mooney) [21:56:03] (ConfdResourceFailed) firing: (2) confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:56:04] (PS1) Dzahn: microsites/query_service: enable TLS when monitoring commons-query [puppet] - https://gerrit.wikimedia.org/r/980950 (https://phabricator.wikimedia.org/T333510) [21:56:30] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1082.mgmt.eqiad.wmnet with reboot policy FORCED [21:57:26] (PS2) Dzahn: microsites/query_service: enable TLS when monitoring commons-query [puppet] - https://gerrit.wikimedia.org/r/980950 (https://phabricator.wikimedia.org/T333510) [21:57:38] (PS3) Cathal Mooney: Move lvs2012 from private1-b-codfw to private1-b2-codfw vlan [puppet] - https://gerrit.wikimedia.org/r/980948 (https://phabricator.wikimedia.org/T352918) [21:57:56] (PS1) Jdlrobson: Enable Vector beta feature for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/980951 (https://phabricator.wikimedia.org/T351339) [21:58:23] (CR) CI reject: [V: -1] Move lvs2012 from private1-b-codfw to private1-b2-codfw vlan [puppet] - 
https://gerrit.wikimedia.org/r/980948 (https://phabricator.wikimedia.org/T352918) (owner: Cathal Mooney) [22:00:04] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231206T2200) [22:05:40] SRE, ops-codfw, Infrastructure-Foundations, Traffic, and 2 others: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918 (cmooney) [22:05:48] (CR) Bking: [C: +1] microsites/query_service: enable TLS when monitoring commons-query [puppet] - https://gerrit.wikimedia.org/r/980950 (https://phabricator.wikimedia.org/T333510) (owner: Dzahn) [22:06:51] (CR) Dzahn: [C: +2] microsites/query_service: enable TLS when monitoring commons-query [puppet] - https://gerrit.wikimedia.org/r/980950 (https://phabricator.wikimedia.org/T333510) (owner: Dzahn) [22:10:10] SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920 (cmooney) [22:11:52] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1081.mgmt.eqiad.wmnet with reboot policy FORCED [22:11:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1080.mgmt.eqiad.wmnet with reboot policy FORCED [22:14:29] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1082.mgmt.eqiad.wmnet with reboot policy FORCED [22:15:52] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: Ryan Kemper) [22:16:27] (PS1) Cathal Mooney: Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [puppet] - https://gerrit.wikimedia.org/r/980954 (https://phabricator.wikimedia.org/T352920) [22:17:18] (CR) CI reject: [V: -1] Move lvs2011 from private1-a-codfw to 
private1-a2-codfw vlan [puppet] - https://gerrit.wikimedia.org/r/980954 (https://phabricator.wikimedia.org/T352920) (owner: Cathal Mooney) [22:18:59] SRE, ops-codfw, Infrastructure-Foundations, Traffic, and 2 others: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920 (cmooney) [22:19:58] (PS2) Cathal Mooney: Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [puppet] - https://gerrit.wikimedia.org/r/980954 (https://phabricator.wikimedia.org/T352920) [22:20:47] (CR) CI reject: [V: -1] Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [puppet] - https://gerrit.wikimedia.org/r/980954 (https://phabricator.wikimedia.org/T352920) (owner: Cathal Mooney) [22:24:33] (PS9) Bking: wdqs: monitor ldf endpoint [puppet] - https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: Ryan Kemper) [22:27:04] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:27:07] (PS10) Bking: wdqs: monitor ldf endpoint [puppet] - https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: Ryan Kemper) [22:27:30] SRE, Cloud-VPS: enable lists.wikimedia.org or wikimedia.org email addresses to receive dmarc reports for *.wmflabs.org - https://phabricator.wikimedia.org/T352902 (jsn.sherman) For inbound mail delivery: What are our options that avoid exposing an unmaintained mail server to the Internet? Internal mail r... 
[22:29:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:29:58] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: Ryan Kemper) [22:42:10] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1080'] [22:42:17] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1081'] [22:42:25] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1082'] [22:42:45] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1080'] [22:43:11] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1081'] [22:43:19] !log jclark@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['ms-be1081'] [22:43:29] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1080'] [22:49:02] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be1081'] [22:49:03] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be1082'] [22:49:36] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be1080'] [22:50:29] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1080.eqiad.wmnet with OS bullseye [22:50:31] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1082.eqiad.wmnet with OS bullseye [22:50:33] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage 
for host ms-be1081.eqiad.wmnet with OS bullseye [22:50:36] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1080.eqiad.wmnet with OS bullseye [22:50:39] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1082.eqiad.wmnet with OS bullseye [22:50:41] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1081.eqiad.wmnet with OS bullseye [22:51:34] (PS11) Bking: wdqs: monitor ldf endpoint [puppet] - https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: Ryan Kemper) [22:52:41] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (Jclark-ctr) [22:53:30] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (Jclark-ctr) [22:54:05] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: Ryan Kemper) [22:56:42] (SystemdUnitFailed) firing: wdqs-categories.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:00:24] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:03:00] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 34 days, 0:00:00 on wdqs1024.eqiad.wmnet with reason: T352878 [23:03:06] T352878: Troubleshoot recurring systemd unit failures for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 [23:03:18] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 34 days, 0:00:00 on wdqs1024.eqiad.wmnet with reason: T352878 [23:07:36] (CR) Bking: "PCC failed due to lack of support for IPv6: DNS lookup failed for Resolv::DNS::Resource::IN::AAAA (file: /srv/jenkins/puppet-compiler/277" [puppet] - https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: Ryan Kemper) [23:14:24] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active - Telia, AS1299/IPv6: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:19:46] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1080.eqiad.wmnet with reason: host reimage [23:20:08] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1081.eqiad.wmnet with reason: host reimage [23:21:26] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:21:30] SRE, Cloud-VPS, DNS, Traffic: DNS name resolution failure with www.spacecom.mil from Cloud VPS - https://phabricator.wikimedia.org/T346471 (Dzahn) Can confirm this is still the case. From a random different cloud VPS instance: ` dig www.spacecom.mil @172.20.255.1` fails. (and 172.20.255.1 is... 
[23:23:03] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1080.eqiad.wmnet with reason: host reimage [23:24:00] SRE, Cloud-VPS, DNS, Traffic: DNS name resolution failure with www.spacecom.mil from Cloud VPS - https://phabricator.wikimedia.org/T346471 (Dzahn) It's not ALL of .mil either. For example "dig cybercoe.army.mil @172.20.255.1" works and also points to an Akamai edge. [23:25:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1081.eqiad.wmnet with reason: host reimage [23:32:55] SRE, Cloud-VPS, DNS, Traffic: DNS name resolution failure with www.spacecom.mil from Cloud VPS - https://phabricator.wikimedia.org/T346471 (Don-vip) Yes, my tool scans for free media at following .mil domains without problem: - www.afspc.af.mil - www.buckley.spaceforce.mil - www.jtf-spaced... [23:36:51] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:42:23] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [23:47:34] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [23:56:50] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1082.eqiad.wmnet with OS bullseye [23:56:57] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ms-be1082.eqiad.wmnet with OS bullseye executed with erro...