[00:00:18] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:03:35] (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchLinksUpdate lag is too high: 8h 53m 33s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[00:03:56] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:26:09] (PS4) Cwhite: elasticsearch: move to opensearch client [software/spicerack] - https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: David Caro)
[00:27:26] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1008.wikimedia.org with OS bullseye
[00:32:12] (CR) CI reject: [V: -1] elasticsearch: move to opensearch client [software/spicerack] - https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: David Caro)
[00:38:09] (CR) Cwhite: "Tests appear unhappy because aiohttp is missing?" [software/spicerack] - https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: David Caro)
[00:39:00] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/974631
[00:39:06] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/974631 (owner: TrainBranchBot)
[00:44:39] (CR) Cwhite: elasticsearch: move to opensearch client (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: David Caro)
[00:58:49] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/974631 (owner: TrainBranchBot)
[01:06:10] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:09:27] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:19] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:38:56] (JobUnavailable) firing: (8) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:56:36] PROBLEM - Host sretest2004 is DOWN: PING CRITICAL - Packet loss = 100%
[03:04:30] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:05:49] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[03:08:56] (JobUnavailable) firing: (8) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:19:16] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.codfw.wmnet with OS bookworm
[03:40:58] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2001-dev.codfw.wmnet with reason: host reimage
[03:44:01] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2001-dev.codfw.wmnet with reason: host reimage
[03:54:36] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:58:56] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:58:56] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:24:57] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.wikimedia.org with OS bullseye
[04:50:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T348183)', diff saved to https://phabricator.wikimedia.org/P53495 and previous config saved to /var/cache/conftool/dbconfig/20231116-045035-arnaudb.json
[04:50:40] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[04:57:52] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2001-dev.codfw.wmnet with OS bookworm
[05:05:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P53496 and previous config saved to /var/cache/conftool/dbconfig/20231116-050542-arnaudb.json
[05:09:51] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:10:26] (PS1) Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694)
[05:12:17] (PS2) Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694)
[05:13:56] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:15:21] (PS3) Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694)
[05:18:18] (PS4) Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694)
[05:20:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P53497 and previous config saved to /var/cache/conftool/dbconfig/20231116-052048-arnaudb.json
[05:21:25] (PS5) Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694)
[05:26:55] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:29:29] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T348183)', diff saved to https://phabricator.wikimedia.org/P53498 and previous config saved to /var/cache/conftool/dbconfig/20231116-053554-arnaudb.json
[05:35:57] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[05:36:00] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[05:36:10] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[05:36:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T348183)', diff saved to https://phabricator.wikimedia.org/P53499 and previous config saved to /var/cache/conftool/dbconfig/20231116-053616-arnaudb.json
[05:48:10] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1007.wikimedia.org with OS bullseye
[05:54:09] (CR) Ilias Sarantopoulos: team-ml: add alert for memory spike in inf services (1 comment) [alerts] - https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[05:58:21] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:58:56] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:59:28] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:07:17] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2004.codfw.wmnet with OS bullseye
[06:07:24] SRE, ops-codfw, Infrastructure-Foundations, netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye...
[06:07:25] RECOVERY - Host sretest2004 is UP: PING OK - Packet loss = 0%, RTA = 30.56 ms
[06:15:01] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:17:19] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:18:56] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:29:00] (PS1) Marostegui: db1164: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/974721 (https://phabricator.wikimedia.org/T351176)
[06:29:36] (CR) Marostegui: [C: +2] db1164: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/974721 (https://phabricator.wikimedia.org/T351176) (owner: Marostegui)
[06:29:40] (NELNotReported) firing: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported
[06:30:15] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:30:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2132,2160].codfw.wmnet,db[1119,1164,1217].eqiad.wmnet with reason: Switch
[06:30:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2132,2160].codfw.wmnet,db[1119,1164,1217].eqiad.wmnet with reason: Switch
[06:33:56] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:34:40] (NELNotReported) resolved: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported
[06:37:11] PROBLEM - Check systemd state on kafka-jumbo1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:41:24] (PS2) KartikMistry: testwiki: Enable the Unified Content Translation Dashboard [mediawiki-config] - https://gerrit.wikimedia.org/r/973170 (https://phabricator.wikimedia.org/T337915)
[06:44:05] RECOVERY - Check systemd state on kafka-jumbo1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:48:11] PROBLEM - Check systemd state on kafka-jumbo1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:33] (PS1) Marostegui: mariadb: Promote db1164 to m1 master [puppet] - https://gerrit.wikimedia.org/r/974722 (https://phabricator.wikimedia.org/T351176)
[06:55:05] RECOVERY - Check systemd state on kafka-jumbo1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:55:13] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:58:56] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:05:49] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:08:33] SRE, ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission dbproxy1017.eqiad.wmnet - https://phabricator.wikimedia.org/T348956 (Marostegui) This is ready for #dc-ops
[07:16:08] (PS1) Urbanecm: mediawiki: Add missing frequency param to the purge_temporary_accounts job [puppet] - https://gerrit.wikimedia.org/r/974726 (https://phabricator.wikimedia.org/T344695)
[07:18:26] (PS1) Marostegui: report_users: Remove dbproxy1011 address [software] - https://gerrit.wikimedia.org/r/974727 (https://phabricator.wikimedia.org/T202367)
[07:19:04] (CR) Marostegui: [C: +2] report_users: Remove dbproxy1011 address [software] - https://gerrit.wikimedia.org/r/974727 (https://phabricator.wikimedia.org/T202367) (owner: Marostegui)
[07:19:36] (Merged) jenkins-bot: report_users: Remove dbproxy1011 address [software] - https://gerrit.wikimedia.org/r/974727 (https://phabricator.wikimedia.org/T202367) (owner: Marostegui)
[07:22:18] (PS1) Urbanecm: IP Masking temp account expiry: Fix a typo [mediawiki-config] - https://gerrit.wikimedia.org/r/974728 (https://phabricator.wikimedia.org/T344695)
[07:22:27] jouncebot: nowandnext
[07:22:27] For the next 0 hour(s) and 37 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T0700)
[07:22:28] For the next 0 hour(s) and 7 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T0700)
[07:22:28] In 0 hour(s) and 37 minute(s): UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T0800)
[07:26:29] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:28:56] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:30:01] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: prometheus::pop
[07:31:44] (PS1) Muehlenhoff: Switch prometheus::pop to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974828 (https://phabricator.wikimedia.org/T349619)
[07:33:36] (CR) Muehlenhoff: [C: +2] Switch prometheus::pop to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974828 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:37:52] PROBLEM - Check systemd state on dbstore1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:40:02] RECOVERY - Check systemd state on dbstore1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:40:19] (CR) Arnaudb: [C: +1] mariadb: Promote db1164 to m1 master [puppet] - https://gerrit.wikimedia.org/r/974722 (https://phabricator.wikimedia.org/T351176) (owner: Marostegui)
[07:42:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: prometheus::pop
[07:43:24] PROBLEM - Check systemd state on dbstore1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:57] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[07:48:04] RECOVERY - Check systemd state on dbstore1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:49:49] (PS1) Marostegui: pc2014: Move it to pc2 [puppet] - https://gerrit.wikimedia.org/r/974868 (https://phabricator.wikimedia.org/T351285)
[07:50:43] (CR) Marostegui: [C: +2] pc2014: Move it to pc2 [puppet] - https://gerrit.wikimedia.org/r/974868 (https://phabricator.wikimedia.org/T351285) (owner: Marostegui)
[07:51:48] PROBLEM - Check systemd state on dbstore1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:53:04] RECOVERY - Check systemd state on dbstore1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:19] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host ncredir4001.ulsfo.wmnet
[07:55:41] (PS1) Muehlenhoff: Switch ncredir4001 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974870 (https://phabricator.wikimedia.org/T349619)
[07:56:42] (CR) Muehlenhoff: [C: +2] Switch ncredir4001 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974870 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:56:58] PROBLEM - Check systemd state on dbstore1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:58:56] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:03:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ncredir4001.ulsfo.wmnet
[08:07:49] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host prometheus2006.codfw.wmnet
[08:09:07] !log installing python-git security updates
[08:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:59] (PS1) Muehlenhoff: Switch prometheus2006 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974908 (https://phabricator.wikimedia.org/T349619)
[08:12:33] !log taavi@cumin1001 START - Cookbook sre.puppet.migrate-host for host cloudcumin2001.codfw.wmnet
[08:12:54] (PS1) Majavah: hieradata: migrate cloudcumin2001 to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974923
[08:13:32] (CR) Majavah: [C: +2] hieradata: migrate cloudcumin2001 to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974923 (owner: Majavah)
[08:14:31] (CR) Muehlenhoff: [C: +2] Switch prometheus2006 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974908 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:17:01] !log installing elfutils security updates
[08:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:09] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcumin2001.codfw.wmnet
[08:18:26] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 3/6 UP : OSPFv3: 3/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:18:46] PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:19:01] !log taavi@cumin1001 START - Cookbook sre.puppet.migrate-host for host clouddumps1001.wikimedia.org
[08:19:10] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:19:14] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:19:24] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:19:37] (PS1) Majavah: hieradata: migrate clouddumps1001 to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974924
[08:20:20] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:20:24] (PS1) Slyngshede: P:idm add fqdn for the host as an Apache server alias. [puppet] - https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343)
[08:20:30] (CR) Majavah: [C: +2] hieradata: migrate clouddumps1001 to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974924 (owner: Majavah)
[08:20:50] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:21:12] RECOVERY - BFD status on cr2-drmrs is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:21:36] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:21:38] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:21:42] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:21:52] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:21:57] (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/504/con" [puppet] - https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343) (owner: Slyngshede)
[08:22:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host prometheus2006.codfw.wmnet
[08:23:28] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[08:25:06] (CR) Slyngshede: [C: -1] "That, not actually enough, you could still use the IP." [puppet] - https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343) (owner: Slyngshede)
[08:25:38] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host clouddumps1001.wikimedia.org
[08:30:55] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: thanos::frontend
[08:31:47] (PS7) Hashar: gerrit: change scap user to gerrit-deploy [puppet] - https://gerrit.wikimedia.org/r/844998 (https://phabricator.wikimedia.org/T317412)
[08:31:54] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:31:59] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/844998 (https://phabricator.wikimedia.org/T317412) (owner: Hashar)
[08:32:26] (PS1) Muehlenhoff: Switch thanos::frontend to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974927 (https://phabricator.wikimedia.org/T349619)
[08:34:14] !log installing ruby-rails-html-sanitizer security updates
[08:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:22] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:34:42] (CR) Kosta Harlan: [C: +2] ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/974532 (https://phabricator.wikimedia.org/T351300) (owner: Tchanders)
[08:35:32] (Merged) jenkins-bot: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/974532 (https://phabricator.wikimedia.org/T351300) (owner: Tchanders)
[08:36:05] (CR) Muehlenhoff: [C: +2] Switch thanos::frontend to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974927 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:36:29] (PS1) Brouberol: Define a wmflib function to compute the last IP in a subnet [puppet] - https://gerrit.wikimedia.org/r/974928 (https://phabricator.wikimedia.org/T351059)
[08:36:50] (PS2) Slyngshede: P:idm add fqdn for the host as an Apache server alias. [puppet] - https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343)
[08:37:13] (CR) CI reject: [V: -1] Define a wmflib function to compute the last IP in a subnet [puppet] - https://gerrit.wikimedia.org/r/974928 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol)
[08:37:29] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[08:37:54] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[08:38:09] (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/505/con" [puppet] - https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343) (owner: Slyngshede)
[08:38:24] (CR) Brouberol: [V: +1] Automatically generate autoinstall subnet DHCP config files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol)
[08:38:57] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:39:34] (NELNotReported) firing: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported
[08:39:40] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:39:49] Hi, as we are still in train window time, can we deploy https://phabricator.wikimedia.org/T351048?
[08:40:21] (CR) Slyngshede: [V: +1] "While I'm not loving this, it does work. Obviously the IDM is differently than other Apache2 based application, so the correct permanent s" [puppet] - https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343) (owner: Slyngshede)
[08:41:43] (CR) Majavah: P:idm add fqdn for the host as an Apache server alias. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343) (owner: Slyngshede)
[08:42:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: thanos::frontend
[08:42:35] (CR) Slyngshede: [V: +1] P:idm add fqdn for the host as an Apache server alias. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343) (owner: Slyngshede)
[08:42:54] (CR) JMeybohm: [C: -1] "This should have a changelog entry" [docker-images/production-images] - https://gerrit.wikimedia.org/r/974662 (https://phabricator.wikimedia.org/T300033) (owner: Hnowlan)
[08:43:16] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:43:57] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:44:26] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:44:34] (NELNotReported) resolved: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported
[08:48:57] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:49:52] (CR) Arnaudb: mariadb: clone and upgrade mariadb (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/973424 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[08:50:38] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:57:46] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:58:04] PROBLEM - Check systemd state on ml-cache2003 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:59:42] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: kubernetes::worker
[09:00:07] jeena and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T0900).
[09:00:27] !log bounce prometheus instances on prometheus2006 to test p7 upgrade
[09:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:45] (NELNotReported) firing: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported
[09:02:16] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:03:11] (PS1) Muehlenhoff: Switch kubernetes::worker to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974931 (https://phabricator.wikimedia.org/T349619)
[09:03:57] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:05:02] (PS1) Filippo Giunchedi: Revert "profile::pyrra::filesystem: improve/fix lift wing pilot" [puppet] - https://gerrit.wikimedia.org/r/974241
[09:05:26] (CR) Muehlenhoff: [C: +2] Switch kubernetes::worker to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974931 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[09:06:45] (NELNotReported) resolved: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported
[09:07:56] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:08:28] (CR) Filippo Giunchedi: [C: +2] Revert "profile::pyrra::filesystem: improve/fix lift wing pilot" [puppet] - https://gerrit.wikimedia.org/r/974241 (owner: Filippo Giunchedi)
[09:08:46] (PS3) Slyngshede: P:idm add fqdn for the host as an Apache server alias. [puppet] - https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343)
[09:09:25] (PS1) Abijeet Patro: TranslatablePageMarker: Add patrol status for translatable page [extensions/Translate] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/974242 (https://phabricator.wikimedia.org/T351273)
[09:09:28] (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:09:56] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1238 (re)pooling @ 5%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53500 and previous config saved to /var/cache/conftool/dbconfig/20231116-090955-arnaudb.json
[09:10:00] (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/506/con" [puppet] - https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343) (owner: Slyngshede)
[09:10:04] RECOVERY - Check systemd state
on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:12:29] (03PS4) 10Slyngshede: P:idm Limit the domains Envoy will proxy. [puppet] - 10https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343) [09:13:57] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:14:10] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:14:22] (03PS1) 10Slyngshede: P:IDM Limit Envoy proxing for idm-test. [puppet] - 10https://gerrit.wikimedia.org/r/974933 (https://phabricator.wikimedia.org/T351343) [09:14:31] (NELNotReported) firing: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [09:14:40] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:16:25] (03PS1) 10Arnaudb: mariadb: re-enable notifications for db1238 [puppet] - 10https://gerrit.wikimedia.org/r/974632 (https://phabricator.wikimedia.org/T344036) [09:16:37] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/507/con" [puppet] - 10https://gerrit.wikimedia.org/r/974933 (https://phabricator.wikimedia.org/T351343) (owner: 10Slyngshede) [09:16:54] (03CR) 10Marostegui: [C: 03+1] mariadb: re-enable notifications 
for db1238 [puppet] - 10https://gerrit.wikimedia.org/r/974632 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [09:17:33] (03CR) 10Arnaudb: [C: 03+2] mariadb: re-enable notifications for db1238 [puppet] - 10https://gerrit.wikimedia.org/r/974632 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [09:18:57] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:19:02] (03CR) 10Slyngshede: "Perhaps best to split test and prod, and run test first." [puppet] - 10https://gerrit.wikimedia.org/r/974933 (https://phabricator.wikimedia.org/T351343) (owner: 10Slyngshede) [09:19:31] (NELNotReported) resolved: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [09:25:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1238 (re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53501 and previous config saved to /var/cache/conftool/dbconfig/20231116-092500-arnaudb.json [09:28:03] (03Abandoned) 10Hashar: Initial checkin. 
[software/charon] - 10https://gerrit.wikimedia.org/r/838127 (owner: 10Slyngshede) [09:29:46] (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.0.3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/974935 [09:30:06] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v8.0.3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/974935 (owner: 10Volans) [09:33:20] (03PS1) 10Arnaudb: mariadb: add db1241 and prepare db1141 retirement [puppet] - 10https://gerrit.wikimedia.org/r/974633 (https://phabricator.wikimedia.org/T344036) [09:34:26] (03PS2) 10Fabfur: haproxy: re-set varnish maxconn on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/974268 (https://phabricator.wikimedia.org/T310609) [09:35:26] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:35:41] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v8.0.3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/974935 (owner: 10Volans) [09:37:23] (03CR) 10Vgutierrez: [C: 03+1] "ulsfo looking good per pybal logs.. 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/974268 (https://phabricator.wikimedia.org/T310609) (owner: 10Fabfur) [09:38:07] (03PS1) 10Volans: Upstream release v8.0.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/974937 [09:38:35] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1164 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/974722 (https://phabricator.wikimedia.org/T351176) (owner: 10Marostegui) [09:38:49] (03CR) 10Volans: [C: 03+2] Upstream release v8.0.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/974937 (owner: 10Volans) [09:38:57] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:39:28] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:40:04] (03PS2) 10Hnowlan: envoy: use ENTRYPOINT instead of CMD [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/974662 (https://phabricator.wikimedia.org/T300033) [09:40:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1238 (re)pooling @ 15%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53502 and previous config saved to /var/cache/conftool/dbconfig/20231116-094005-arnaudb.json [09:40:32] (03CR) 10Hnowlan: envoy: use ENTRYPOINT instead of CMD (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/974662 (https://phabricator.wikimedia.org/T300033) (owner: 10Hnowlan) [09:40:48] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:43:49] (03CR) 10Fabfur: [C: 03+2] haproxy: re-set varnish maxconn on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/974268 (https://phabricator.wikimedia.org/T310609) (owner: 10Fabfur) [09:45:39] (03PS2) 10D3r1ck01: wmf-config: Introduce setting for "mcrouter-primary-dc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974506 (https://phabricator.wikimedia.org/T336004) [09:45:45] !log applying https://gerrit.wikimedia.org/r/c/operations/puppet/+/974268 to all cp hosts everywhere (setting maxconn on varnish to 20k) T310609 [09:45:56] (03CR) 10Jbond: [C: 03+1] Upstream release v8.0.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/974937 (owner: 10Volans) [09:47:17] (03Merged) 10jenkins-bot: Upstream release v8.0.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/974937 (owner: 10Volans) [09:47:41] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [09:48:29] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [09:48:49] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/974648 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [09:50:55] (03PS1) 10Kosta Harlan: ipoid: Add DATADIR environment variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/974939 (https://phabricator.wikimedia.org/T350500) [09:51:56] !log uploaded spicerack_8.0.3 to apt.wikimedia.org bullseye-wikimedia [09:51:59] moritzm: FYI ^^^ [09:52:11] I will deploy it shortly to the cumin hosts [09:53:24] excellent, thanks [09:54:24] RECOVERY - Check systemd state on ml-cache2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:54:26] (03PS1) 10MVernon: 
admin: add hghani, osefu to analytics-product-users [puppet] - 10https://gerrit.wikimedia.org/r/974940 (https://phabricator.wikimedia.org/T351130) [09:55:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1238 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53503 and previous config saved to /var/cache/conftool/dbconfig/20231116-095510-arnaudb.json [09:56:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: kubernetes::worker [09:59:28] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:01:05] (03CR) 10Btullis: [C: 03+2] Update the contact info for the wikireplica servers [puppet] - 10https://gerrit.wikimedia.org/r/973203 (https://phabricator.wikimedia.org/T345698) (owner: 10Btullis) [10:01:17] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: thanos::backend [10:01:36] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=93) for host sretest2003.mgmt.codfw.wmnet with reboot policy FORCED [10:02:03] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [10:03:10] (03CR) 10Btullis: [C: 03+2] Add a prometheus_instance parameter to prometheus::statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/973320 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [10:03:19] !log bounce thanos components on titan1001 [10:03:19] (03PS1) 10Muehlenhoff: Switch thanos::backend to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974941 (https://phabricator.wikimedia.org/T349619) [10:03:40] (03CR) 10Vgutierrez: [C: 03+1] swift: migrate one node to envoy for TLS termination (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/974215 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [10:03:57] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:04:43] (03PS2) 10Muehlenhoff: Switch thanos::backend to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974941 (https://phabricator.wikimedia.org/T349619) [10:05:27] !log stopping bacula on backup1001 [10:05:31] (03PS2) 10Hnowlan: service: move mw-jobrunner to prod [puppet] - 10https://gerrit.wikimedia.org/r/973827 (https://phabricator.wikimedia.org/T349796) [10:05:56] prometheus job runner for backup1001 will complain for a bit [10:06:10] (03CR) 10Muehlenhoff: [C: 03+2] Switch thanos::backend to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974941 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:08:20] (03PS1) 10Jbond: sre.hosts.reimage: add acmechief_host hiera data [cookbooks] - 10https://gerrit.wikimedia.org/r/974942 [10:08:30] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.wikimedia.org with OS bullseye [10:08:36] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:09:45] (03CR) 10Slyngshede: [C: 03+2] NTP: alert on ntp/time errors [alerts] - 10https://gerrit.wikimedia.org/r/973306 (owner: 10Slyngshede) [10:10:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1238 (re)pooling @ 45%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53504 and previous config saved to /var/cache/conftool/dbconfig/20231116-101015-arnaudb.json [10:11:09] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/974940 (https://phabricator.wikimedia.org/T351130) (owner: 10MVernon) [10:11:33] 
(03Merged) 10jenkins-bot: NTP: alert on ntp/time errors [alerts] - 10https://gerrit.wikimedia.org/r/973306 (owner: 10Slyngshede) [10:12:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: thanos::backend [10:12:46] (03CR) 10MVernon: [C: 03+2] admin: add hghani, osefu to analytics-product-users [puppet] - 10https://gerrit.wikimedia.org/r/974940 (https://phabricator.wikimedia.org/T351130) (owner: 10MVernon) [10:13:04] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [10:13:57] (JobUnavailable) firing: (7) Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:15:05] ^that's me [10:17:19] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:17:34] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add Hamid & Omari to analytics-product-users - https://phabricator.wikimedia.org/T351130 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon This is now done (but allow a little while for puppet to do its thing). [10:20:43] !log Failover m1 from db1119 to db1164 - T351176 [10:21:06] done [10:21:35] /clear [10:21:48] !log installed spicerack v8.0.3 on the cumin hosts [10:21:51] etherpad works fine [10:22:26] marostegui: there is one puppet thing I think I saw [10:22:35] jynus: which one? [10:22:36] shouldn't it say 10.6 on hiera? [10:22:43] maybe I am wrong [10:22:54] good point! 
[10:22:54] https://gerrit.wikimedia.org/r/c/operations/puppet/+/974722/2/hieradata/hosts/db1164.yaml [10:23:04] fixing [10:23:23] jynus: this can be merged too: https://gerrit.wikimedia.org/r/c/operations/puppet/+/973804 [10:23:39] if you are asking, yes, please do when you can [10:24:03] (03CR) 10Marostegui: [C: 03+2] Revert "dbbackups: Switchover master from db1164 to db1119" [puppet] - 10https://gerrit.wikimedia.org/r/973804 (owner: 10Marostegui) [10:24:10] (03PS1) 10Marostegui: db1164: Add mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/974945 [10:24:23] let me know when finished to run puppet and restart bacula [10:24:28] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1007.wikimedia.org with reason: host reimage [10:24:36] jynus: it is merged now [10:24:51] (03CR) 10Marostegui: [C: 03+2] db1164: Add mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/974945 (owner: 10Marostegui) [10:24:54] !log reenabling puppet and starting bacula on backup1001 [10:25:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1238 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53505 and previous config saved to /var/cache/conftool/dbconfig/20231116-102520-arnaudb.json [10:27:26] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1007.wikimedia.org with reason: host reimage [10:28:57] (JobUnavailable) resolved: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:29:06] ^ everything looking good [10:29:13] no pki alerts today [10:32:18] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: kubernetes::master [10:33:43] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-role (exit_code=99) for role: kubernetes::master 
[10:36:04] (03CR) 10Marostegui: [C: 03+1] mariadb: add db1241 and prepare db1141 retirement [puppet] - 10https://gerrit.wikimedia.org/r/974633 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:36:12] (03PS1) 10Hnowlan: cassandra: remove references to graphite [puppet] - 10https://gerrit.wikimedia.org/r/974946 (https://phabricator.wikimedia.org/T351193) [10:36:30] btullis: 3 druid hosts are out of space on /, known ? [10:37:59] (PuppetZeroResources) firing: Puppet has failed generate resources on kubernetes1038:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:39:08] 10SRE-Access-Requests: Requesting access to wmf and analytics-privatedata-users for EHughes (superset access with no server access) - https://phabricator.wikimedia.org/T351387 (10ehughes) [10:39:12] stevemunene: ^ re: druid out of space [10:39:50] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:40:00] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host kubemaster1002.eqiad.wmnet [10:40:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1238 (re)pooling @ 75%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53506 and previous config saved to /var/cache/conftool/dbconfig/20231116-104025-arnaudb.json [10:40:50] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:40:59] (03PS1) 10Giuseppe Lavagetto: kubernetes::global_config: add listener for mw on k8s transition [puppet] - 10https://gerrit.wikimedia.org/r/974947 [10:41:57] (03PS1) 10Muehlenhoff: Switch kubemaster1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974948 
(https://phabricator.wikimedia.org/T349619) [10:43:03] (03CR) 10Muehlenhoff: [C: 03+2] Switch kubemaster1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974948 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:44:54] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/974942 (owner: 10Jbond) [10:46:26] (03CR) 10Jbond: [C: 03+2] sre.hosts.reimage: add acmechief_host hiera data [cookbooks] - 10https://gerrit.wikimedia.org/r/974942 (owner: 10Jbond) [10:47:02] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host sretest1004.eqiad.wmnet [10:47:08] !log ayounsi@cumin1001 START - Cookbook sre.hosts.dhcp for host sretest1004.eqiad.wmnet [10:49:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host kubemaster1002.eqiad.wmnet [10:50:36] (03Merged) 10jenkins-bot: sre.hosts.reimage: add acmechief_host hiera data [cookbooks] - 10https://gerrit.wikimedia.org/r/974942 (owner: 10Jbond) [10:52:24] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1007.wikimedia.org with OS bullseye [10:52:40] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: syslog::centralserver [10:52:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on kubernetes1038:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:53:26] PROBLEM - Check systemd state on titan1001 is CRITICAL: CRITICAL - degraded: The following units failed: pyrra-generate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:55:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1238 (re)pooling @ 90%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53507 and previous config saved to /var/cache/conftool/dbconfig/20231116-105530-arnaudb.json 
[10:56:17] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/974928 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [10:58:09] (03PS1) 10Muehlenhoff: Switch syslog::centralserver to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974952 (https://phabricator.wikimedia.org/T349619) [10:58:47] (03CR) 10Effie Mouzeli: [C: 03+1] service: move mw-jobrunner to prod [puppet] - 10https://gerrit.wikimedia.org/r/973827 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [10:59:07] (03CR) 10Muehlenhoff: [C: 03+2] Switch syslog::centralserver to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974952 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:59:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T348183)', diff saved to https://phabricator.wikimedia.org/P53508 and previous config saved to /var/cache/conftool/dbconfig/20231116-105930-arnaudb.json [11:00:06] mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T1100). 
[11:00:07] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T1100) [11:03:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: syslog::centralserver [11:05:49] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:07:04] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:07:42] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: kubernetes::master [11:08:33] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:10:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1238 (re)pooling @ 100%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53509 and previous config saved to /var/cache/conftool/dbconfig/20231116-111035-arnaudb.json [11:11:03] (03PS1) 10Muehlenhoff: Switch kubernetes::master to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974955 (https://phabricator.wikimedia.org/T349619) [11:12:23] (03CR) 10Muehlenhoff: [C: 03+2] Switch kubernetes::master to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974955 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:13:22] PROBLEM - Check systemd state on an-coord1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:14:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P53510 and previous config saved to /var/cache/conftool/dbconfig/20231116-111436-arnaudb.json [11:18:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: kubernetes::master [11:19:46] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:20:56] RECOVERY - Check systemd state on titan1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:31] that was me ^ [11:22:34] (03PS1) 10Hnowlan: device-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/974956 [11:22:59] (03CR) 10Sergio Gimeno: [C: 03+1] mediawiki: Add missing frequency param to the purge_temporary_accounts job [puppet] - 10https://gerrit.wikimedia.org/r/974726 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [11:23:47] (03CR) 10Sg912: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/974956 (owner: 10Hnowlan) [11:24:41] (03CR) 10Santiago Faci: [C: 03+1] "looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/974956 (owner: 10Hnowlan) [11:24:50] (03CR) 10Btullis: [C: 03+1] device-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/974956 (owner: 10Hnowlan) [11:26:19] (03CR) 10Hnowlan: [C: 03+2] device-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/974956 (owner: 10Hnowlan) [11:27:37] (03Merged) 10jenkins-bot: device-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/974956 (owner: 10Hnowlan) [11:28:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/974648 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [11:28:47] (03CR) 10Btullis: [C: 03+2] Update our kerberos scripts to remove oozie customisation [puppet] - 10https://gerrit.wikimedia.org/r/974648 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [11:28:57] (03PS2) 10Btullis: Update our kerberos scripts to remove oozie customisation [puppet] - 10https://gerrit.wikimedia.org/r/974648 (https://phabricator.wikimedia.org/T341893) [11:29:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P53512 and previous config saved to /var/cache/conftool/dbconfig/20231116-112942-arnaudb.json [11:29:48] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/974285 (owner: 10Dzahn) [11:33:22] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: insetup::serviceops [11:34:04] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/device-analytics: apply [11:34:24] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest1004.eqiad.wmnet [11:34:24] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [11:34:31] (03PS1) 10Muehlenhoff: Switch insetup::serviceops to Puppet 7 [puppet] - 
10https://gerrit.wikimedia.org/r/974957 (https://phabricator.wikimedia.org/T349619) [11:40:05] (03CR) 10Muehlenhoff: [C: 03+2] Switch insetup::serviceops to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974957 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:41:02] (03PS3) 10Effie Mouzeli: (WIP) kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/948551 (https://phabricator.wikimedia.org/T231006) [11:44:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T348183)', diff saved to https://phabricator.wikimedia.org/P53513 and previous config saved to /var/cache/conftool/dbconfig/20231116-114450-arnaudb.json [11:44:52] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [11:44:55] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [11:44:59] 10SRE, 10Infrastructure-Foundations, 10netops: Use default BGP multihop TTL between devices - https://phabricator.wikimedia.org/T350488 (10cmooney) 05Open→03Resolved Patches merged, all looking ok. For example on dns5004 this was situation before, server using TTL 2, CR using 193: ` 19:27:22.338917 IP (... 
[11:45:05] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [11:45:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T348183)', diff saved to https://phabricator.wikimedia.org/P53514 and previous config saved to /var/cache/conftool/dbconfig/20231116-114511-arnaudb.json [11:45:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: insetup::serviceops [11:49:35] !log taavi@cumin1001 START - Cookbook sre.puppet.migrate-host for host clouddb1021.eqiad.wmnet [11:49:53] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [11:50:19] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [11:50:28] (03PS1) 10Majavah: hieradata: upgrade clouddb1021 to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974960 [11:50:37] (03PS1) 10Cathal Mooney: Remove reference to row E/F in reimage clear_dhcp_cache function [cookbooks] - 10https://gerrit.wikimedia.org/r/974961 (https://phabricator.wikimedia.org/T306421) [11:50:42] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [11:51:00] (03CR) 10Majavah: [C: 03+2] hieradata: upgrade clouddb1021 to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974960 (owner: 10Majavah) [11:51:11] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [11:54:51] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:55:28] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host clouddb1021.eqiad.wmnet [11:55:52] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host ms-fe1014.eqiad.wmnet [11:56:32] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: 
Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10taavi) [11:57:06] (03PS1) 10Muehlenhoff: Switch ms-fe1014 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974962 (https://phabricator.wikimedia.org/T349619) [11:57:38] 10SRE, 10SRE-Access-Requests: Requesting access to wmf and analytics-privatedata-users for EHughes (superset access with no server access) - https://phabricator.wikimedia.org/T351387 (10MatthewVernon) [11:58:10] (03PS4) 10Effie Mouzeli: (WIP) kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/948551 (https://phabricator.wikimedia.org/T231006) [11:58:31] (03PS5) 10Effie Mouzeli: (WIP) kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/948551 (https://phabricator.wikimedia.org/T231006) [11:58:33] (03PS2) 10Majavah: site: remove references to cloudmetrics hosts [puppet] - 10https://gerrit.wikimedia.org/r/973745 (https://phabricator.wikimedia.org/T351077) [11:58:57] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:59:45] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/973745 (https://phabricator.wikimedia.org/T351077) (owner: 10Majavah) [12:00:15] (03CR) 10Muehlenhoff: [C: 03+2] Switch ms-fe1014 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974962 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:00:17] (03CR) 10Majavah: [C: 03+2] site: remove references to cloudmetrics hosts [puppet] - 10https://gerrit.wikimedia.org/r/973745 (https://phabricator.wikimedia.org/T351077) (owner: 10Majavah) [12:00:29] 10SRE, 10SRE-Access-Requests: Requesting access to wmf and analytics-privatedata-users for EHughes (superset access with no server access) - https://phabricator.wikimedia.org/T351387 (10MatthewVernon) 
Approvals-wise, this needs manager approval from @spatton and analytics-privatedata-users approval from @odimi... [12:04:26] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/974961 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [12:07:09] (03PS2) 10Jcrespo: [WIP]Prepare for release [software/transferpy] - 10https://gerrit.wikimedia.org/r/974683 [12:07:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ms-fe1014.eqiad.wmnet [12:07:52] (03CR) 10Slyngshede: [C: 03+2] P:IDM Limit Envoy proxing for idm-test. [puppet] - 10https://gerrit.wikimedia.org/r/974933 (https://phabricator.wikimedia.org/T351343) (owner: 10Slyngshede) [12:08:03] (03CR) 10CI reject: [V: 04-1] [WIP]Prepare for release [software/transferpy] - 10https://gerrit.wikimedia.org/r/974683 (owner: 10Jcrespo) [12:14:46] (03CR) 10Majavah: [C: 03+1] P:idm Limit the domains Envoy will proxy. [puppet] - 10https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343) (owner: 10Slyngshede) [12:15:16] (03CR) 10Slyngshede: [C: 03+2] P:idm Limit the domains Envoy will proxy. 
[puppet] - 10https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343) (owner: 10Slyngshede) [12:16:29] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-host for host db1124.eqiad.wmnet [12:16:44] (03CR) 10Cathal Mooney: [C: 03+2] Remove reference to row E/F in reimage clear_dhcp_cache function [cookbooks] - 10https://gerrit.wikimedia.org/r/974961 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [12:17:30] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [12:18:34] (03PS1) 10Jbond: db1124: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974965 (https://phabricator.wikimedia.org/T349619) [12:18:55] (03CR) 10Jbond: [C: 03+2] db1124: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974965 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [12:20:16] (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/974966 (owner: 10L10n-bot) [12:20:55] (03Merged) 10jenkins-bot: Remove reference to row E/F in reimage clear_dhcp_cache function [cookbooks] - 10https://gerrit.wikimedia.org/r/974961 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [12:21:07] (03CR) 10Cathal Mooney: [C: 03+2] Fail when setting int relations if PuppetDB parent not found in Netbox (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/973185 (https://phabricator.wikimedia.org/T350479) (owner: 10Cathal Mooney) [12:21:41] (03Merged) 10jenkins-bot: Fail when setting int relations if PuppetDB parent not found in Netbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/973185 (https://phabricator.wikimedia.org/T350479) (owner: 10Cathal Mooney) [12:23:09] (03PS4) 10Volans: sre.discovery.datacenter: customize lock arguments [cookbooks] - 10https://gerrit.wikimedia.org/r/967165 (https://phabricator.wikimedia.org/T341973) [12:23:11] !log jmm@cumin1001 START - Cookbook sre.puppet.migrate-host for host cumin2002.codfw.wmnet [12:23:11] (03PS4) 10Volans: sre.discovery.service-route: customize lock args [cookbooks] - 10https://gerrit.wikimedia.org/r/967166 (https://phabricator.wikimedia.org/T341973) [12:23:13] (03PS5) 10Volans: sre.ganeti.*: customize lock arguments [cookbooks] - 10https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973) [12:23:16] (03PS1) 10Volans: sre.hardware.upgrade-firmware: prepare for fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/974970 [12:23:17] (03PS1) 10Volans: sre.hardware.upgrade-firmware: run on all hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/974971 [12:23:19] (03PS1) 10Volans: sre.hardware.upgrade-firmware: add custom locking [cookbooks] - 10https://gerrit.wikimedia.org/r/974972 (https://phabricator.wikimedia.org/T341973) [12:23:21] (03PS1) 10Volans: sre.hosts.decommission: acquire lock for each host [cookbooks] - 
10https://gerrit.wikimedia.org/r/974973 (https://phabricator.wikimedia.org/T341973) [12:24:23] (03PS1) 10Muehlenhoff: Switch cumin2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974974 (https://phabricator.wikimedia.org/T349619) [12:26:34] (03CR) 10Muehlenhoff: [C: 03+2] Switch cumin2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974974 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:27:10] !log cmooney@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [12:27:40] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1124.eqiad.wmnet [12:27:47] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: prepare for fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/974970 (owner: 10Volans) [12:27:57] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: run on all hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/974971 (owner: 10Volans) [12:29:42] !log cmooney@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [12:29:47] !log cmooney@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [12:29:53] !log cmooney@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [12:33:06] !log Install Test MariaDB 10.6.16 (Bookworm) on pc2014 T351283 [12:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:12] T351283: Compile and package MariaDB 10.6.16 and 10.4.32 - https://phabricator.wikimedia.org/T351283 [12:33:29] !log jmm@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cumin2002.codfw.wmnet [12:34:31] (03CR) 10JMeybohm: [C: 03+1] envoy: use ENTRYPOINT instead of CMD [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/974662 (https://phabricator.wikimedia.org/T300033) (owner: 10Hnowlan) [12:37:04] 
(03CR) 10Arnaudb: [C: 03+2] mariadb: add db1241 and prepare db1141 retirement [puppet] - 10https://gerrit.wikimedia.org/r/974633 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [12:38:15] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [12:46:21] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox PuppetDB Import Script Failing for cloudnet1006 - https://phabricator.wikimedia.org/T350479 (10cmooney) I'm gonna close this one for now, if we see an issue again we should get a better error message which should point us to what PuppetDB data triggered i... [12:46:29] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox PuppetDB Import Script Failing for cloudnet1006 - https://phabricator.wikimedia.org/T350479 (10cmooney) 05Open→03Resolved [12:47:04] 10SRE, 10Infrastructure-Foundations, 10netops: Use default BGP multihop TTL between CRs and servers - https://phabricator.wikimedia.org/T350488 (10cmooney) [12:53:32] PROBLEM - Check systemd state on puppetserver1002 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:54:23] !log cmooney@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.4 - cmooney@cumin1001 [12:54:44] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: provisionning db1241.eqiad.wmnet - T344036 [12:54:48] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: provisionning db1241.eqiad.wmnet - T344036 [12:54:49] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [12:54:51] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
db1241.eqiad.wmnet with reason: provisionning db1241.eqiad.wmnet - T344036 [12:55:04] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: provisionning db1241.eqiad.wmnet - T344036 [12:55:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'cloning db1141 - T350458', diff saved to https://phabricator.wikimedia.org/P53515 and previous config saved to /var/cache/conftool/dbconfig/20231116-125515-arnaudb.json [12:55:21] T350458: Decommission db11[26-49] - https://phabricator.wikimedia.org/T350458 [12:55:39] (03PS1) 10Marostegui: control-mariadb-10.6-bookworm: Bump version [software] - 10https://gerrit.wikimedia.org/r/974978 (https://phabricator.wikimedia.org/T351283) [12:56:00] !log cmooney@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.4 - cmooney@cumin1001 [12:56:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'cloning db1141 - T350458', diff saved to https://phabricator.wikimedia.org/P53516 and previous config saved to /var/cache/conftool/dbconfig/20231116-125649-arnaudb.json [12:58:33] (03PS23) 10Aqu: Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [13:00:06] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T1300) [13:00:26] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1141.eqiad.wmnet onto db1241.eqiad.wmnet [13:02:28] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-host for host db1133.eqiad.wmnet [13:04:02] (03PS1) 10Jbond: db1133: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974980 (https://phabricator.wikimedia.org/T349619) [13:04:42] (03CR) 10Jbond: [C: 03+2] db1133: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974980 (https://phabricator.wikimedia.org/T349619) (owner: 
10Jbond) [13:05:14] (03CR) 10Arnaudb: [C: 03+1] control-mariadb-10.6-bookworm: Bump version [software] - 10https://gerrit.wikimedia.org/r/974978 (https://phabricator.wikimedia.org/T351283) (owner: 10Marostegui) [13:05:30] (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.6-bookworm: Bump version [software] - 10https://gerrit.wikimedia.org/r/974978 (https://phabricator.wikimedia.org/T351283) (owner: 10Marostegui) [13:06:02] (03Merged) 10jenkins-bot: control-mariadb-10.6-bookworm: Bump version [software] - 10https://gerrit.wikimedia.org/r/974978 (https://phabricator.wikimedia.org/T351283) (owner: 10Marostegui) [13:09:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-fe1014.eqiad.wmnet [13:09:24] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1133.eqiad.wmnet [13:10:12] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-host for host backup1001.eqiad.wmnet [13:10:36] (03PS2) 10Volans: sre.hardware.upgrade-firmware: prepare for fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/974970 [13:10:38] (03PS2) 10Volans: sre.hardware.upgrade-firmware: run on all hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/974971 [13:10:40] (03PS2) 10Volans: sre.hardware.upgrade-firmware: add custom locking [cookbooks] - 10https://gerrit.wikimedia.org/r/974972 (https://phabricator.wikimedia.org/T341973) [13:10:42] (03PS2) 10Volans: sre.hosts.decommission: acquire lock for each host [cookbooks] - 10https://gerrit.wikimedia.org/r/974973 (https://phabricator.wikimedia.org/T341973) [13:14:00] (03CR) 10Volans: sre.ganeti.*: customize lock arguments (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [13:14:08] (03PS1) 10Jbond: backup1001: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974981 (https://phabricator.wikimedia.org/T349619) [13:15:12] (03CR) 10Jbond: [C: 03+2] backup1001: migrate to puppet7 [puppet] - 
10https://gerrit.wikimedia.org/r/974981 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [13:15:15] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [13:15:24] (03CR) 10Volans: sre.hardware.upgrade-firmware: prepare for fixes (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/974970 (owner: 10Volans) [13:17:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1014.eqiad.wmnet [13:17:54] (03CR) 10Volans: sre.hardware.upgrade-firmware: add custom locking (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/974972 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [13:19:59] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host backup1001.eqiad.wmnet [13:21:46] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-host for host dbprov2001.codfw.wmnet [13:22:17] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:23:17] (03PS1) 10Jbond: dbprov2001: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974983 (https://phabricator.wikimedia.org/T349619) [13:24:34] (03CR) 10Brouberol: [C: 03+2] Define a wmflib function to compute the last IP in a subnet [puppet] - 10https://gerrit.wikimedia.org/r/974928 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:24:52] (03PS1) 10Majavah: P:ldap::client: updated outdated comment [puppet] - 10https://gerrit.wikimedia.org/r/974985 [13:25:04] (03CR) 10Ladsgroup: [C: 03+1] "❤️" [cookbooks] - 10https://gerrit.wikimedia.org/r/973424 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [13:26:14] (03CR) 10Jbond: [C: 03+2] dbprov2001: migrate to puppet7 [puppet] - 
10https://gerrit.wikimedia.org/r/974983 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [13:26:29] (03PS3) 10Jcrespo: [WIP]Prepare for release [software/transferpy] - 10https://gerrit.wikimedia.org/r/974683 [13:26:31] (03PS1) 10Jcrespo: Transferer: Update logic for is_empty_dir() to avoid future bugs [software/transferpy] - 10https://gerrit.wikimedia.org/r/974986 [13:26:36] (03CR) 10Marostegui: "Remember to update the doc (if needed)" [cookbooks] - 10https://gerrit.wikimedia.org/r/973424 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [13:27:26] (03CR) 10CI reject: [V: 04-1] Transferer: Update logic for is_empty_dir() to avoid future bugs [software/transferpy] - 10https://gerrit.wikimedia.org/r/974986 (owner: 10Jcrespo) [13:27:30] (03CR) 10CI reject: [V: 04-1] [WIP]Prepare for release [software/transferpy] - 10https://gerrit.wikimedia.org/r/974683 (owner: 10Jcrespo) [13:28:49] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host ms-be2050.codfw.wmnet [13:28:52] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host ms-be2050.codfw.wmnet [13:30:20] (03PS1) 10Muehlenhoff: Switch bs-be2050 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974987 (https://phabricator.wikimedia.org/T349619) [13:30:31] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host dbprov2001.codfw.wmnet [13:31:19] (03CR) 10Muehlenhoff: [C: 03+2] Switch bs-be2050 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974987 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:33:05] (03CR) 10Arnaudb: [C: 03+2] mariadb: clone and upgrade mariadb [cookbooks] - 10https://gerrit.wikimedia.org/r/973424 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [13:33:29] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-host for host backup2001.codfw.wmnet [13:34:50] (03PS1) 10Jbond: backup2001: migrate to puppet7 [puppet] - 
10https://gerrit.wikimedia.org/r/974988 (https://phabricator.wikimedia.org/T349619) [13:34:57] !log stat1008: Add `sowiki`, `stwiki`, `tgwiki` and `ugwiki` to `/srv/published/datasets/one-off/research-mwaddlink/wikis.txt` (T340944) [13:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:03] (03CR) 10Muehlenhoff: sre.ganeti.*: customize lock arguments (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [13:35:10] T340944: The published dataset's list of wikis misses a couple of wikis with existing data - https://phabricator.wikimedia.org/T340944 [13:35:19] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [13:35:36] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/974985 (owner: 10Majavah) [13:35:44] (03CR) 10Jbond: [C: 03+2] backup2001: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974988 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [13:36:36] (03CR) 10Btullis: "Adding Filippo to verify that the alertmanager config is correct." 
[puppet] - 10https://gerrit.wikimedia.org/r/974652 (https://phabricator.wikimedia.org/T346438) (owner: 10Btullis) [13:36:50] (03CR) 10Majavah: [C: 03+2] P:ldap::client: updated outdated comment [puppet] - 10https://gerrit.wikimedia.org/r/974985 (owner: 10Majavah) [13:37:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ms-be2050.codfw.wmnet [13:39:46] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host backup2001.codfw.wmnet [13:40:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2050.codfw.wmnet [13:42:48] (03PS1) 10Jforrester: Conditionally render the content of header-action instead of the slot [extensions/WikiLambda] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/974244 (https://phabricator.wikimedia.org/T351121) [13:44:22] !log restart bacula at backup1001 [13:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:46] (03PS1) 10Giuseppe Lavagetto: modules/mesh: add new configuration and networkpolicy modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/974990 [13:44:48] (03PS1) 10Giuseppe Lavagetto: modules/mesh: add capability for traffic splitting [deployment-charts] - 10https://gerrit.wikimedia.org/r/974991 (https://phabricator.wikimedia.org/T350846) [13:45:50] (03CR) 10CI reject: [V: 04-1] modules/mesh: add capability for traffic splitting [deployment-charts] - 10https://gerrit.wikimedia.org/r/974991 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [13:47:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2050.codfw.wmnet [13:49:44] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: prometheus [13:50:27] RECOVERY - Check systemd state on puppetserver1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:55] PROBLEM - Check systemd 
state on an-worker1091 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:31] (03PS1) 10Muehlenhoff: Switch prometheus to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974992 (https://phabricator.wikimedia.org/T349619) [13:53:08] (03CR) 10Muehlenhoff: [C: 03+2] Switch prometheus to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974992 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:54:17] PROBLEM - Check systemd state on an-presto1013 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T1400). Please do the needful. [14:00:07] abijeet: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:18] (03PS1) 10Arnaudb: decommission: db1136 [puppet] - 10https://gerrit.wikimedia.org/r/974634 (https://phabricator.wikimedia.org/T351065) [14:00:26] (unable to deploy today, sorry!) [14:00:40] (03PS1) 10Btullis: Add spark.sql.warehouse.dir to spark3 defaults [puppet] - 10https://gerrit.wikimedia.org/r/974993 (https://phabricator.wikimedia.org/T349523) [14:01:15] I can take care of backport. 
[14:01:21] (03CR) 10Marostegui: [C: 03+1] decommission: db1136 [puppet] - 10https://gerrit.wikimedia.org/r/974634 (https://phabricator.wikimedia.org/T351065) (owner: 10Arnaudb) [14:01:23] (03CR) 10Hnowlan: [C: 03+2] service: move mw-jobrunner to prod [puppet] - 10https://gerrit.wikimedia.org/r/973827 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [14:01:25] (03PS2) 10Btullis: Add spark.sql.warehouse.dir to spark3 defaults [puppet] - 10https://gerrit.wikimedia.org/r/974993 (https://phabricator.wikimedia.org/T349523) [14:02:08] abijeet: hi. I'll start the deployment. [14:02:17] kart_, ok, thanks! [14:02:20] and let you know once it is ready for testing on mwdebug. [14:02:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy2002 using scap backport" [extensions/Translate] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/974242 (https://phabricator.wikimedia.org/T351273) (owner: 10Abijeet Patro) [14:03:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: prometheus [14:04:23] PROBLEM - Check systemd state on kubernetes2034 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:07] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/508/co" [puppet] - 10https://gerrit.wikimedia.org/r/974993 (https://phabricator.wikimedia.org/T349523) (owner: 10Btullis) [14:06:19] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [14:06:49] * Lucas_WMDE also around now if needed [14:07:12] (03CR) 10Eevans: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/974946 (https://phabricator.wikimedia.org/T351193) (owner: 10Hnowlan) [14:07:19] !log hnowlan@cumin1001 START - 
Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T349796) [14:07:23] T349796: Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 [14:07:57] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: kafka::monitoring_bullseye [14:08:40] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T349796) [14:08:55] (03CR) 10Ayounsi: [C: 03+1] Adjust homer templates to support anycast gw with single IP [homer/public] - 10https://gerrit.wikimedia.org/r/973267 (https://phabricator.wikimedia.org/T350579) (owner: 10Cathal Mooney) [14:09:03] !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T349796) [14:10:01] (03PS1) 10Jbond: WIP: update get_ca_server [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 [14:10:01] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T349796) [14:10:38] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/509/con" [puppet] - 10https://gerrit.wikimedia.org/r/974947 (owner: 10Giuseppe Lavagetto) [14:11:25] (03PS1) 10Muehlenhoff: Switch kafka::monitoring_bullseye to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974996 (https://phabricator.wikimedia.org/T349619) [14:12:54] (03PS2) 10Clément Goubert: mediawiki: Fix rsyslog errorlog ruleset [deployment-charts] - 10https://gerrit.wikimedia.org/r/974521 (https://phabricator.wikimedia.org/T350430) [14:13:10] (03PS2) 10Giuseppe Lavagetto: kubernetes::global_config: add listener for mw on k8s transition [puppet] - 10https://gerrit.wikimedia.org/r/974947 [14:13:12] (03CR) 10Muehlenhoff: [C: 03+2] Switch 
kafka::monitoring_bullseye to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974996 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:15:31] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/510/con" [puppet] - 10https://gerrit.wikimedia.org/r/974947 (owner: 10Giuseppe Lavagetto) [14:15:42] !log stop puppet on puppet7 agents to debug puppet performance [14:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:52] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/511/console" [puppet] - 10https://gerrit.wikimedia.org/r/974946 (https://phabricator.wikimedia.org/T351193) (owner: 10Hnowlan) [14:16:45] (03CR) 10CI reject: [V: 04-1] WIP: update get_ca_server [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (owner: 10Jbond) [14:19:00] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] cassandra: remove references to graphite [puppet] - 10https://gerrit.wikimedia.org/r/974946 (https://phabricator.wikimedia.org/T351193) (owner: 10Hnowlan) [14:20:20] (03Merged) 10jenkins-bot: TranslatablePageMarker: Add patrol status for translatable page [extensions/Translate] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/974242 (https://phabricator.wikimedia.org/T351273) (owner: 10Abijeet Patro) [14:20:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: kafka::monitoring_bullseye [14:21:34] !log kartik@deploy2002 Started scap: Backport for [[gerrit:974242|TranslatablePageMarker: Add patrol status for translatable page (T351273)]] [14:21:35] (03CR) 10Ayounsi: [C: 03+1] "Had a look at https://puppet-compiler.wmflabs.org/output/974500/502/apt1001.wikimedia.org/fulldiff.html and lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [14:21:39] T351273: Revisions for unit markers addition are not longer autopatrolled - https://phabricator.wikimedia.org/T351273 [14:21:54] jouncebot: nowandnext [14:21:54] For the next 0 hour(s) and 38 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T1400) [14:21:54] In 2 hour(s) and 38 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T1700) [14:22:34] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [14:23:03] !log kartik@deploy2002 kartik and abi: Backport for [[gerrit:974242|TranslatablePageMarker: Add patrol status for translatable page (T351273)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:23:12] kart_: hi, can you ping me when done deploying please? [14:23:16] same :D [14:24:50] wanna go before or after me? :D [14:25:14] urbanecm: sure [14:25:19] claime: sure [14:25:28] (03CR) 10Jbond: [C: 03+2] "im going to merge this now. we have a few issues and im hoping this will help" [puppet] - 10https://gerrit.wikimedia.org/r/974283 (https://phabricator.wikimedia.org/T350809) (owner: 10JHathaway) [14:25:31] abijeet: can you test the patch on mwdebug servers? 
[14:25:34] (03CR) 10Jbond: [C: 03+2] puppetserver: change ssldir to a concat fragment [puppet] - 10https://gerrit.wikimedia.org/r/974282 (owner: 10JHathaway) [14:25:35] kart_, ok [14:25:39] urbanecm: I'm deploying my syslog rule update to mw-on-k8s, I think we can go at the same time depending on what you're deploying [14:27:53] (03CR) 10Brouberol: [V: 03+1 C: 03+2] Automatically generate autoinstall subnet DHCP config files [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [14:28:41] claime: i just want to run git pull on deployment host, it's a beta-specific change :)) [14:29:01] so unless you plan on stopping it for a while, i think we can do both changes at once. [14:29:06] yep [14:29:32] kart_, doesn't appear to break anything so I'd say we are good to go. will need to ask someone who has autopatrol permissions to check. [14:29:32] (03CR) 10Nikerabbit: [V: 03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/974966 (owner: 10L10n-bot) [14:29:49] abijeet: which wiki? [14:29:53] abijeet: cool. [14:30:06] urbanecm, metawiki [14:30:17] urbanecm, specifically this issue: https://phabricator.wikimedia.org/T351273 [14:30:32] abijeet: what is your username? ill grant them to you [14:30:42] (03CR) 10Joal: [C: 03+1] Add spark.sql.warehouse.dir to spark3 defaults [puppet] - 10https://gerrit.wikimedia.org/r/974993 (https://phabricator.wikimedia.org/T349523) (owner: 10Btullis) [14:31:01] (03PS1) 10Muehlenhoff: Add crm1001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/975000 (https://phabricator.wikimedia.org/T349402) [14:31:08] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/973756 (owner: 10L10n-bot) [14:31:13] urbanecm, APatro (WMF) - https://meta.wikimedia.org/wiki/User:APatro_(WMF) [14:31:29] 10SRE, 10SRE-Access-Requests: Add Hamid & Omari to analytics-product-users - https://phabricator.wikimedia.org/T351130 (10mpopov) Lovely! Thank you @MatthewVernon! @OSefu-WMF @Hghani: okay, both of you should now be able to run commands like the ones documented in T350750#9316104 [14:31:43] abijeet: you already seem to have autopatrol? [14:32:21] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:22] (it is bundled within "Translation administrators") [14:32:41] ah ok. [14:32:59] (03CR) 10Muehlenhoff: [C: 03+2] Add crm1001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/975000 (https://phabricator.wikimedia.org/T349402) (owner: 10Muehlenhoff) [14:34:23] (03PS1) 10Muehlenhoff: Add crm* pattern to partman setup [puppet] - 10https://gerrit.wikimedia.org/r/975002 (https://phabricator.wikimedia.org/T349402) [14:34:28] abijeet: let me know when we're ready :) [14:34:34] urbanecm, thanks I see it now on https://meta.wikimedia.org/wiki/Special:ListGroupRights [14:36:28] (03PS2) 10Giuseppe Lavagetto: modules/mesh: add new configuration and networkpolicy modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/974990 [14:36:30] (03PS2) 10Giuseppe Lavagetto: modules/mesh: add capability for traffic splitting [deployment-charts] - 10https://gerrit.wikimedia.org/r/974991 (https://phabricator.wikimedia.org/T350846) [14:36:51] kart_, we can go ahead. this change is not breaking any existing functionality. [14:37:23] awesome! 
[14:37:26] !log kartik@deploy2002 kartik and abi: Continuing with sync [14:38:39] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add spark.sql.warehouse.dir to spark3 defaults [puppet] - 10https://gerrit.wikimedia.org/r/974993 (https://phabricator.wikimedia.org/T349523) (owner: 10Btullis) [14:38:57] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:49] (03CR) 10Muehlenhoff: [C: 03+2] Add crm* pattern to partman setup [puppet] - 10https://gerrit.wikimedia.org/r/975002 (https://phabricator.wikimedia.org/T349402) (owner: 10Muehlenhoff) [14:42:30] (03CR) 10JHathaway: puppetserver: cache code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974283 (https://phabricator.wikimedia.org/T350809) (owner: 10JHathaway) [14:43:16] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:974242|TranslatablePageMarker: Add patrol status for translatable page (T351273)]] (duration: 21m 41s) [14:43:20] T351273: Revisions for unit markers addition are not longer autopatrolled - https://phabricator.wikimedia.org/T351273 [14:43:56] !log re-enable puppet on puppet7 agents [14:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:03] abijeet: we are done. [14:44:19] urbanecm: claime I'm done with deployment.. 
[14:44:24] thanks [14:44:30] tyvm [14:44:36] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Fix rsyslog errorlog ruleset [deployment-charts] - 10https://gerrit.wikimedia.org/r/974521 (https://phabricator.wikimedia.org/T350430) (owner: 10Clément Goubert) [14:44:45] (03CR) 10Urbanecm: [C: 03+2] IP Masking temp account expiry: Fix a typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974728 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [14:45:34] (03Merged) 10jenkins-bot: IP Masking temp account expiry: Fix a typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974728 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [14:45:46] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 141626 [14:45:55] (03Merged) 10jenkins-bot: mediawiki: Fix rsyslog errorlog ruleset [deployment-charts] - 10https://gerrit.wikimedia.org/r/974521 (https://phabricator.wikimedia.org/T350430) (owner: 10Clément Goubert) [14:46:45] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 141626 [14:46:56] * urbanecm done [14:47:12] kart_, thanks [14:48:16] (03CR) 10Arnaudb: [C: 03+2] decommission: db1136 [puppet] - 10https://gerrit.wikimedia.org/r/974634 (https://phabricator.wikimedia.org/T351065) (owner: 10Arnaudb) [14:48:54] !log Redeploying mw-on-k8s for T350430 [14:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:58] T350430: php-fpm logs from Kubernetes lack 'message' and 'normalized_message' - https://phabricator.wikimedia.org/T350430 [14:49:00] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:49:13] RECOVERY - Check systemd state on an-worker1091 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:22] !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1136.eqiad.wmnet [14:49:40] 
!log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:49:41] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:49:59] (PuppetFailure) firing: Puppet has failed on puppetdb1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:50:11] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:50:12] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [14:50:46] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [14:50:47] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [14:51:21] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [14:51:23] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [14:51:33] RECOVERY - Check systemd state on an-presto1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:46] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [14:51:47] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [14:52:21] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [14:52:22] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [14:52:59] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on kubemaster2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:52:59] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on ganeti1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:53:23] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [14:53:24] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [14:53:53] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [14:53:57] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:03] (PuppetFailure) firing: (2) Puppet has failed on kubernetes2031:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:54:07] (PuppetFailure) firing: Puppet has failed on ganeti2028:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:54:14] (03CR) 10Filippo Giunchedi: [C: 03+1] Send recovery emails to data-engineering-alerts [puppet] - 10https://gerrit.wikimedia.org/r/974652 (https://phabricator.wikimedia.org/T346438) (owner: 10Btullis) [14:54:38] !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox [14:54:59] (PuppetFailure) firing: Puppet has failed on bast1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:55:43] !log cgoubert@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [14:55:43] !log cgoubert@deploy2002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [14:55:49] (03PS1) 10Btullis: Set a non-default mapreduce file committer algorithm for spark 
[puppet] - 10https://gerrit.wikimedia.org/r/975006 (https://phabricator.wikimedia.org/T351388) [14:55:54] !log cgoubert@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [14:55:54] !log cgoubert@deploy2002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [14:56:01] !log cgoubert@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [14:56:01] !log cgoubert@deploy2002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [14:56:16] !log cgoubert@deploy2002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [14:56:16] !log cgoubert@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [14:56:17] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [14:56:35] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [14:56:36] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [14:56:37] !log arnaudb@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1136.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [14:57:02] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [14:57:03] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [14:57:20] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [14:57:22] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [14:57:42] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1136.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [14:57:42] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [14:57:43] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1136.eqiad.wmnet [14:57:50] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [14:57:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'remove db1136', diff saved to https://phabricator.wikimedia.org/P53519 and previous config saved to /var/cache/conftool/dbconfig/20231116-145754-arnaudb.json [14:58:57] 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: decommission db1136.eqiad.wmnet - https://phabricator.wikimedia.org/T351065 (10ABran-WMF) 05In progress→03Open a:05ABran-WMF→03None [14:59:17] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/515/co" [puppet] - 10https://gerrit.wikimedia.org/r/975006 (https://phabricator.wikimedia.org/T351388) (owner: 10Btullis) [14:59:59] (PuppetFailure) resolved: Puppet has failed on puppetdb1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:01:07] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.wikimedia.org with OS bullseye [15:01:13] RECOVERY - Check systemd state on kubernetes2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:30] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1009.wikimedia.org with OS bullseye [15:03:58] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1008.wikimedia.org with OS bullseye [15:03:59] (PuppetFailure) resolved: Puppet has failed on ganeti2028:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:04:03] 
(PuppetFailure) resolved: (2) Puppet has failed on kubernetes2031:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:04:59] (PuppetFailure) resolved: Puppet has failed on bast1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:05:49] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:07:19] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:07:59] (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on kubemaster2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:07:59] (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on ganeti1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:11:39] (03PS1) 10Bking: Revert "elastic relforge: update logstash transport" [puppet] - 10https://gerrit.wikimedia.org/r/974245 [15:12:33] (03CR) 10Bking: [C: 03+2] Revert "elastic relforge: update logstash transport" [puppet] - 10https://gerrit.wikimedia.org/r/974245 (owner: 10Bking) 
[15:15:14] !log ayounsi@cumin1001 START - Cookbook sre.hosts.dhcp for host sretest1004.eqiad.wmnet [15:16:58] (03CR) 10DCausse: query_service: add monitoring for ldf endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [15:17:08] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: revert logstash changes - bking@cumin2002 - T324335 [15:17:13] T324335: Remove logstash from the Search Elasticsearch servers - https://phabricator.wikimedia.org/T324335 [15:18:31] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1008.wikimedia.org with reason: host reimage [15:21:31] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1008.wikimedia.org with reason: host reimage [15:21:55] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: revert logstash changes - bking@cumin2002 - T324335 [15:22:03] (03PS1) 10Ssingh: conftool: introduce schema and host file for dnsboxes [puppet] - 10https://gerrit.wikimedia.org/r/975009 (https://phabricator.wikimedia.org/T347054) [15:22:47] !log brouberol@cumin1001 START - Cookbook sre.hosts.reimage for host an-druid1002.eqiad.wmnet with OS bullseye [15:23:32] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/517/console" [puppet] - 10https://gerrit.wikimedia.org/r/975009 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:23:39] (03PS1) 10Cwhite: logstash: beta-logs to use current w3creportingapi template [puppet] - 10https://gerrit.wikimedia.org/r/974635 (https://phabricator.wikimedia.org/T350786) [15:25:32] (03PS6) 10Brouberol: Generate subnet DHCP 
configuration [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) [15:26:21] (03CR) 10Cwhite: [C: 03+2] logstash: beta-logs to use current w3creportingapi template [puppet] - 10https://gerrit.wikimedia.org/r/974635 (https://phabricator.wikimedia.org/T350786) (owner: 10Cwhite) [15:26:22] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1141.eqiad.wmnet onto db1241.eqiad.wmnet [15:26:46] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/518/con" [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [15:28:12] (03CR) 10Ssingh: [V: 03+1] "I think we might need to iterate on this a bit but at least this is ready for an initial review. I was a bit unsure of the schema but I sp" [puppet] - 10https://gerrit.wikimedia.org/r/975009 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:34:59] (PuppetFailure) firing: Puppet has failed on kubernetes2008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:35:52] !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1002.eqiad.wmnet with reason: host reimage [15:36:22] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1008.wikimedia.org with OS bullseye [15:37:48] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['aqs1012'] [15:38:01] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['aqs1012'] [15:38:17] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['aqs1012'] [15:38:47] !log brouberol@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1002.eqiad.wmnet with reason: host reimage [15:44:33] (03CR) 10Hnowlan: [C: 03+1] services: bump cpu limits and Docker images for cp instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/974476 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [15:44:59] (PuppetFailure) resolved: Puppet has failed on kubernetes2008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:51:59] (03PS3) 10D3r1ck01: wmf-config: Remove StatsCacheType (unused) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974508 (https://phabricator.wikimedia.org/T336004) [15:54:33] (03CR) 10D3r1ck01: "This is pretty much ready and I can go ahead and deploy but I'll like a signal (+1) from either Krinkle or Effie 😊, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01) [15:54:39] (03PS1) 10FNegri: wmcs-cold-migrate: quick fixes [puppet] - 10https://gerrit.wikimedia.org/r/975012 [15:55:17] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: kafka::logging [15:55:51] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1002.eqiad.wmnet with OS bullseye [15:56:42] (03PS2) 10FNegri: wmcs-cold-migrate: quick fixes [puppet] - 10https://gerrit.wikimedia.org/r/975012 [15:56:48] (03PS3) 10Bking: query_service: add monitoring for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) [15:57:10] (03PS1) 10Muehlenhoff: Switch kafka::logging to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975015 (https://phabricator.wikimedia.org/T349619) [15:57:18] (03CR) 10CI reject: [V: 04-1] query_service: add monitoring for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/974281 
(https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [15:57:39] (03CR) 10Bking: query_service: add monitoring for ldf endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [15:57:56] (03CR) 10Muehlenhoff: [C: 03+2] Switch kafka::logging to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975015 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:58:57] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:59:37] (03CR) 10Brouberol: [V: 03+1] Generate subnet DHCP configuration (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [16:00:18] (03PS4) 10Bking: query_service: add monitoring for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) [16:01:01] (03CR) 10CI reject: [V: 04-1] query_service: add monitoring for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [16:02:06] (03PS7) 10Brouberol: Generate subnet DHCP configuration [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) [16:03:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: kafka::logging [16:03:33] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4037.ulsfo.wmnet [16:04:35] (03PS1) 10Muehlenhoff: Switch cp4037 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975016 (https://phabricator.wikimedia.org/T349619) [16:05:18] (03CR) 10Muehlenhoff: [C: 03+2] Switch cp4037 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975016 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [16:08:32] 
(03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/519/con" [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [16:08:32] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['aqs1012'] [16:09:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4037.ulsfo.wmnet [16:11:16] (03CR) 10Brouberol: Generate subnet DHCP configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [16:14:04] (03PS1) 10Hnowlan: wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) [16:15:28] (03CR) 10CI reject: [V: 04-1] wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:17:04] !log depool cp4037 for reboot [post puppet 7 upgrade] [16:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:37] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host acmechief1001.eqiad.wmnet with OS bookworm [16:18:00] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4037.ulsfo.wmnet [16:18:35] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1108.eqiad.wmnet [16:18:36] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1108.eqiad.wmnet [16:19:14] (03PS2) 10Hnowlan: wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) [16:20:27] !log swapped cp1108 <-> cp1083 (T349244) [16:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:34] 
T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 [16:21:26] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1010.wikimedia.org with OS bullseye [16:21:47] (03CR) 10Effie Mouzeli: mc: Make it possible to use mcrouter server set by environment (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01) [16:21:49] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1009.wikimedia.org with OS bullseye [16:23:03] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host sretest1004.eqiad.wmnet [16:24:27] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1109.eqiad.wmnet [16:24:28] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1109.eqiad.wmnet [16:26:06] !log swapped cp1109 <-> cp1084 (T349244) [16:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:11] T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 [16:26:48] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on acmechief1001.eqiad.wmnet with reason: host reimage [16:26:50] (03PS4) 10Hnowlan: changeprop: add config support for migration to k8s jobrunners [deployment-charts] - 10https://gerrit.wikimedia.org/r/972358 (https://phabricator.wikimedia.org/T349796) [16:26:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [16:27:17] 10SRE, 10ops-eqiad: aqs1012: reseat SSD (/dev/sdh)? - https://phabricator.wikimedia.org/T351320 (10Jclark-ctr) {F41511225} Reseated hard drives. 
update idrac and bios firmware [16:27:34] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4037.ulsfo.wmnet [16:30:07] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on 6 hosts with reason: Extending downtime for depooled cp hosts [16:30:23] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on 6 hosts with reason: Extending downtime for depooled cp hosts [16:30:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=154babc2-d86e-4f5b-baf5-fb36e9d129e4) set by fabfur@cumin1001 for 14 days, 0:00:00 on 6 host(s) and thei... [16:31:21] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on acmechief1001.eqiad.wmnet with reason: host reimage [16:33:08] PROBLEM - Check systemd state on acmechief2001 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief-certs-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:37:11] !log repool cp4037 [16:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:53] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye [16:44:35] (03PS1) 10BCornwall: acme_chief: Remove acmechief1001 passive host [puppet] - 10https://gerrit.wikimedia.org/r/975024 [16:45:01] (03PS2) 10BCornwall: acme_chief: Remove acmechief1001 passive host [puppet] - 10https://gerrit.wikimedia.org/r/975024 [16:45:44] (03CR) 10Vgutierrez: [C: 03+1] "missing Bug: header on the commit msg, LGTM otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/975024 (owner: 10BCornwall) [16:45:44] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for sfaci - https://phabricator.wikimedia.org/T351431 (10Sfaci) [16:46:10] (03PS3) 10BCornwall: 
acme_chief: Remove acmechief1001 passive host [puppet] - 10https://gerrit.wikimedia.org/r/975024 (https://phabricator.wikimedia.org/T342154) [16:47:21] (03CR) 10Elukey: [C: 03+2] services: bump cpu limits and Docker images for cp instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/974476 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [16:48:02] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for sfaci - https://phabricator.wikimedia.org/T351431 (10WDoranWMF) Approved as @Sfaci 's manager [16:48:34] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I would assume you'd also need to:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972358 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:49:21] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host acmechief1001.eqiad.wmnet with OS bookworm [16:49:23] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/521/con" [puppet] - 10https://gerrit.wikimedia.org/r/975024 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [16:50:25] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync [16:50:42] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync [16:51:16] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync [16:52:18] RECOVERY - Check systemd state on acmechief2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:40] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync [16:53:37] 10SRE, 10ops-eqiad: aqs1012: reseat SSD (/dev/sdh)? - https://phabricator.wikimedia.org/T351320 (10Eevans) >>! In T351320#9338030, @Jclark-ctr wrote: > {F41511225} Reseated hard drives. 
update idrac and bios firmware I confirmed this to be the case before proceeding, but after restarting via the reimage cook... [16:58:18] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1009.wikimedia.org with OS bullseye [17:00:05] jbond and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T1700) [17:00:05] urbanecm: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:22] !log Disabling puppet on all acme-chief clients for acme-chief bookworm upgrades - T342154 [17:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:29] Here! [17:00:42] T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 [17:01:10] urbanecm: hey! one sec [17:01:27] Sure [17:02:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T348183)', diff saved to https://phabricator.wikimedia.org/P53522 and previous config saved to /var/cache/conftool/dbconfig/20231116-170241-arnaudb.json [17:03:09] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [17:04:00] (03PS1) 10BCornwall: acme-chief: Set acmechief1001 as active [puppet] - 10https://gerrit.wikimedia.org/r/975046 (https://phabricator.wikimedia.org/T342154) [17:04:44] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:04:55] (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/975046 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:05:29] urbanecm: lgtm, need a manual run? 
[17:05:59] (03CR) 10RLazarus: [C: 03+2] mediawiki: Add missing frequency param to the purge_temporary_accounts job [puppet] - 10https://gerrit.wikimedia.org/r/974726 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [17:06:06] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/522/con" [puppet] - 10https://gerrit.wikimedia.org/r/975046 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:06:16] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief2001 is CRITICAL: PROCS CRITICAL: 0 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [17:06:29] ^^ expected [17:06:56] rzl: thanks! Not needed, the related feature is currently only enabled on beta, and i can run it myself there :)). Thanks for the +2! [17:07:05] 👍 [17:07:22] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync [17:07:31] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [17:08:02] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for acmechief1001.eqiad.wmnet [17:08:03] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief1001.eqiad.wmnet [17:08:33] (03CR) 10Btullis: [C: 03+2] Send recovery emails to data-engineering-alerts [puppet] - 10https://gerrit.wikimedia.org/r/974652 (https://phabricator.wikimedia.org/T346438) (owner: 10Btullis) [17:08:52] (03CR) 10BCornwall: [V: 03+1 C: 03+2] acme-chief: Set acmechief1001 as active [puppet] - 10https://gerrit.wikimedia.org/r/975046 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:12:14] 10SRE, 10ops-esams, 10DC-Ops: Audit future knams power usage - https://phabricator.wikimedia.org/T331358 (10RobH) 05Open→03Invalid Never followed through on updated estimates since drmrs gave us a very clear indicator on what our esams racks
would use (identical circuits and hardware) so this wasn't needed. [17:12:59] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1009.wikimedia.org with reason: host reimage [17:13:02] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudelastic1009.wikimedia.org with reason: host reimage [17:13:51] (03PS5) 10Bking: query_service: add monitoring for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) [17:14:24] (03PS1) 10BCornwall: acme-chief: Switch acmechief_host to acmechief1001 [puppet] - 10https://gerrit.wikimedia.org/r/975047 (https://phabricator.wikimedia.org/T342154) [17:14:48] (03CR) 10Vgutierrez: [C: 03+1] acme-chief: Switch acmechief_host to acmechief1001 [puppet] - 10https://gerrit.wikimedia.org/r/975047 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:16:50] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.wikimedia.org with OS bullseye [17:17:04] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [17:17:40] (03CR) 10BCornwall: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/975046 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:17:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P53523 and previous config saved to /var/cache/conftool/dbconfig/20231116-171748-arnaudb.json [17:18:16] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudelastic1010.wikimedia.org with OS bullseye [17:19:29] (03CR) 10BCornwall: [C: 03+2] acme-chief: Switch acmechief_host to acmechief1001
[puppet] - 10https://gerrit.wikimedia.org/r/975047 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:19:37] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.wikimedia.org with OS bullseye [17:20:06] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudelastic1010.wikimedia.org with OS bullseye [17:21:19] (03CR) 10Bking: [C: 03+2] Revert "staging-eqiad: raise rdf-streaming-updater quota" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972725 (owner: 10Bking) [17:21:39] (03CR) 10CI reject: [V: 04-1] Revert "staging-eqiad: raise rdf-streaming-updater quota" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972725 (owner: 10Bking) [17:22:52] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief2001 is OK: PROCS OK: 0 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [17:23:57] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.wikimedia.org with OS bullseye [17:25:28] PROBLEM - Check systemd state on acmechief2001 is CRITICAL: CRITICAL - degraded: The following units failed: reload-acme-chief-backend.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:04] that unit shouldn't be there anymore? :) [17:26:15] or maybe icinga host needs a puppet agent run :) [17:26:58] Loaded: not-found (Reason: Unit reload-acme-chief-backend.service not found.) 
[17:27:00] !log Re-enabling puppet on all acme-chief clients post-bookworm upgrade - T342154 [17:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:09] T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 [17:28:08] RECOVERY - Check systemd state on acmechief2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:28:23] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1009.wikimedia.org with OS bullseye [17:29:27] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1012.eqiad.wmnet with OS bullseye [17:30:09] (03PS2) 10Bking: Revert "staging-eqiad: raise rdf-streaming-updater quota" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972725 [17:30:41] (03CR) 10Bking: Revert "staging-eqiad: raise rdf-streaming-updater quota" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/972725 (owner: 10Bking) [17:32:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P53525 and previous config saved to /var/cache/conftool/dbconfig/20231116-173254-arnaudb.json [17:33:19] (03CR) 10WMDE-Fisch: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974984 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [17:34:04] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [17:35:48] PROBLEM - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief-certs-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:41:53] ^forgot to arm keyholder. 
Hopefully that fixes this [17:44:34] RECOVERY - Check systemd state on acmechief1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:37] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye [17:47:15] (03Abandoned) 10BCornwall: acme_chief: Remove acmechief1001 passive host [puppet] - 10https://gerrit.wikimedia.org/r/975024 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:48:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T348183)', diff saved to https://phabricator.wikimedia.org/P53526 and previous config saved to /var/cache/conftool/dbconfig/20231116-174800-arnaudb.json [17:48:03] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [17:48:06] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [17:48:16] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [17:50:38] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/974971 (owner: 10Volans) [17:54:33] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/974970 (owner: 10Volans) [17:55:29] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/974972 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [17:55:57] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/974973 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [17:59:03] (03CR) 10Cathal Mooney: [C: 03+2] Adjust homer templates to support anycast gw with single IP [homer/public] - 10https://gerrit.wikimedia.org/r/973267 (https://phabricator.wikimedia.org/T350579) (owner: 10Cathal Mooney) 
[17:59:41] (03Merged) 10jenkins-bot: Adjust homer templates to support anycast gw with single IP [homer/public] - 10https://gerrit.wikimedia.org/r/973267 (https://phabricator.wikimedia.org/T350579) (owner: 10Cathal Mooney) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T1800) [18:06:18] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:14:55] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1012.eqiad.wmnet with OS bullseye [18:15:07] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye [18:19:51] (03PS1) 10Btullis: Configure Matomo's TagManager to write to existing tmpdir [puppet] - 10https://gerrit.wikimedia.org/r/975058 (https://phabricator.wikimedia.org/T349910) [18:20:21] (03CR) 10CI reject: [V: 04-1] Configure Matomo's TagManager to write to existing tmpdir [puppet] - 10https://gerrit.wikimedia.org/r/975058 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) [18:21:14] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/524/con" [puppet] - 10https://gerrit.wikimedia.org/r/975058 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) [18:31:15] (03PS1) 10DDesouza: Pre-deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975059 (https://phabricator.wikimedia.org/T351353) [18:31:54] (03CR) 10Majavah: [C: 03+1] wmcs-cold-migrate: quick fixes [puppet] - 10https://gerrit.wikimedia.org/r/975012 (owner: 10FNegri) [18:34:38] PROBLEM - Host ps1-e8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [18:35:51] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-cold-migrate: quick
fixes [puppet] - 10https://gerrit.wikimedia.org/r/975012 (owner: 10FNegri) [18:44:07] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1010.wikimedia.org with OS bullseye [18:46:15] (03PS1) 10Tchanders: ipoid: Disable the daily updates job and schedule an import [deployment-charts] - 10https://gerrit.wikimedia.org/r/975061 (https://phabricator.wikimedia.org/T351449) [18:48:05] (03PS2) 10Jbond: puppet: update gat_ca_server to also suport srv discovry [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 [18:49:03] (03CR) 10Jbond: "Let me know what you think of the general approach and if good I'll update the tests etc." [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (owner: 10Jbond) [18:50:21] (03CR) 10Bking: [C: 03+2] Revert "staging-eqiad: raise rdf-streaming-updater quota" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972725 (owner: 10Bking) [18:51:24] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [18:53:06] (03Merged) 10jenkins-bot: Revert "staging-eqiad: raise rdf-streaming-updater quota" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972725 (owner: 10Bking) [18:53:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10Jclark-ctr) [18:54:20] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker11 - jclark@cumin1001" [18:55:09] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker11 - jclark@cumin1001" [18:55:09] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:56:01] (03PS1) 10Volans: sre.hosts.reimage: fix puppet version choice [cookbooks] - 10https://gerrit.wikimedia.org/r/975067 [18:56:19] (03CR) 10CI reject: [V: 04-1] puppet:
update gat_ca_server to also suport srv discovry [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (owner: 10Jbond) [18:56:22] !log bking@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [18:56:31] !log bking@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [18:56:38] !log bking@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [18:57:17] 10SRE, 10Infrastructure-Foundations, 10netops: Add BGP to protocols contributing to aggregates - https://phabricator.wikimedia.org/T351456 (10cmooney) p:05Triage→03Low [18:59:34] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1157.mgmt.eqiad.wmnet with reboot policy FORCED [18:59:40] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1158.mgmt.eqiad.wmnet with reboot policy FORCED [18:59:54] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1159.mgmt.eqiad.wmnet with reboot policy FORCED [18:59:56] Emperor: I wanted to ask you about your thoughts on increasing the max upload size from 4GB to 5GB (or failing that, allowing users to request such uploads on a case by case basis). I'm told you're the person to talk to. For background context the previous limit was due to storing file size as a 32 bit integer, which has now been changed so is no longer a limiting factor. I would appreciate [19:00:02] your thoughts on T191804. [19:00:04] !log bking@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [19:00:04] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1160.mgmt.eqiad.wmnet with reboot policy FORCED [19:00:06] T191804: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 [19:00:06] jeena and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T1900).
[19:00:13] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1163.mgmt.eqiad.wmnet with reboot policy FORCED [19:00:26] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1162.mgmt.eqiad.wmnet with reboot policy FORCED [19:00:32] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1161.mgmt.eqiad.wmnet with reboot policy FORCED [19:00:49] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1164.mgmt.eqiad.wmnet with reboot policy FORCED [19:01:54] !log bking@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [19:02:00] !log bking@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [19:02:33] (03CR) 10Eevans: [C: 03+1] sre.hosts.reimage: fix puppet version choice [cookbooks] - 10https://gerrit.wikimedia.org/r/975067 (owner: 10Volans) [19:02:44] (03PS1) 10Cathal Mooney: Add BGP to the contributing protocols for aggregate routes on CRs [homer/public] - 10https://gerrit.wikimedia.org/r/975070 (https://phabricator.wikimedia.org/T351456) [19:02:54] (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975071 (https://phabricator.wikimedia.org/T350081) [19:02:56] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.42.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975071 (https://phabricator.wikimedia.org/T350081) (owner: 10TrainBranchBot) [19:03:18] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: fix puppet version choice [cookbooks] - 10https://gerrit.wikimedia.org/r/975067 (owner: 10Volans) [19:03:41] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975071 (https://phabricator.wikimedia.org/T350081) (owner: 10TrainBranchBot) [19:04:08] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1012.eqiad.wmnet with OS bullseye [19:07:35] (03Merged) 10jenkins-bot: sre.hosts.reimage: fix puppet 
version choice [cookbooks] - 10https://gerrit.wikimedia.org/r/975067 (owner: 10Volans) [19:07:37] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [19:08:21] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:08:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:09:42] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1163 [19:09:49] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1163 [19:09:53] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1164 [19:09:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1164 [19:10:23] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye [19:10:37] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.5 refs T350081 [19:10:45] T350081: 1.42.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T350081 [19:14:01] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1163.mgmt.eqiad.wmnet with reboot policy FORCED [19:16:46] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
an-worker1164.mgmt.eqiad.wmnet with reboot policy FORCED [19:19:49] (03PS1) 10Andrew Bogott: puppetserver: create a necessary parent dirs [puppet] - 10https://gerrit.wikimedia.org/r/975073 [19:22:34] (03CR) 10CI reject: [V: 04-1] puppetserver: create a necessary parent dirs [puppet] - 10https://gerrit.wikimedia.org/r/975073 (owner: 10Andrew Bogott) [19:22:55] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1012.eqiad.wmnet with reason: host reimage [19:24:52] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1161.mgmt.eqiad.wmnet with reboot policy FORCED [19:24:57] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1157.mgmt.eqiad.wmnet with reboot policy FORCED [19:25:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1159.mgmt.eqiad.wmnet with reboot policy FORCED [19:25:05] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1162.mgmt.eqiad.wmnet with reboot policy FORCED [19:25:09] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1160.mgmt.eqiad.wmnet with reboot policy FORCED [19:25:51] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1012.eqiad.wmnet with reason: host reimage [19:26:00] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1163.mgmt.eqiad.wmnet with reboot policy FORCED [19:26:01] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1164.mgmt.eqiad.wmnet with reboot policy FORCED [19:26:49] (03CR) 10Dzahn: puppetserver: create a necessary parent dirs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/975073 (owner: 10Andrew Bogott) [19:27:25] (03PS2) 10Andrew Bogott: puppetserver: create a necessary parent dirs [puppet] - 10https://gerrit.wikimedia.org/r/975073 [19:28:50] !log 
jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1158.mgmt.eqiad.wmnet with reboot policy FORCED [19:30:12] (LVSHighRX) firing: Excessive RX traffic on lvs1019:9100 (eno1np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1019 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX [19:31:09] hmm [19:31:25] (03CR) 10Andrew Bogott: puppetserver: create a necessary parent dirs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/975073 (owner: 10Andrew Bogott) [19:32:19] (03PS3) 10Andrew Bogott: puppetserver: create a necessary parent dirs [puppet] - 10https://gerrit.wikimedia.org/r/975073 [19:40:12] (LVSHighRX) resolved: Excessive RX traffic on lvs1019:9100 (eno1np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1019 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX [19:41:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/528/con" [puppet] - 10https://gerrit.wikimedia.org/r/975073 (owner: 10Andrew Bogott) [19:44:09] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [19:44:47] (03CR) 10Dzahn: [V: 04-1] "https://puppet-compiler.wmflabs.org/output/974285/526/netmon1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/974285 (owner: 10Dzahn) [19:45:07] (03CR) 10Andrew Bogott: [C: 03+2] puppetserver: create a necessary parent dirs [puppet] - 10https://gerrit.wikimedia.org/r/975073 (owner: 10Andrew Bogott) [19:46:24] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS for IPs in public1-b-codfw vlan - cmooney@cumin1001" [19:46:25] (03CR) 10Jbond: [V: 03+1] puppetserver: create a necessary parent dirs (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/975073 (owner: 10Andrew Bogott) [19:46:45] (03PS7) 10Dzahn: wmflib: add function to return PHP version based on distro version [puppet] - 10https://gerrit.wikimedia.org/r/974285 [19:47:14] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS for IPs in public1-b-codfw vlan - cmooney@cumin1001" [19:47:14] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:50:38] (03PS2) 10Jcrespo: Transferer: Update logic for is_empty_dir() to avoid future bugs [software/transferpy] - 10https://gerrit.wikimedia.org/r/974986 [19:51:26] (03CR) 10CI reject: [V: 04-1] Transferer: Update logic for is_empty_dir() to avoid future bugs [software/transferpy] - 10https://gerrit.wikimedia.org/r/974986 (owner: 10Jcrespo) [19:53:23] (03PS1) 10Andrew Bogott: puppetserver: only create '/var/lib/puppet/server' when needed [puppet] - 10https://gerrit.wikimedia.org/r/975075 [19:53:27] (03CR) 10Jcrespo: "I believe this works in my testing, but want a double check." 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/974986 (owner: 10Jcrespo) [19:54:01] (03CR) 10CI reject: [V: 04-1] puppetserver: only create '/var/lib/puppet/server' when needed [puppet] - 10https://gerrit.wikimedia.org/r/975075 (owner: 10Andrew Bogott) [19:54:25] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1012.eqiad.wmnet with OS bullseye [19:54:36] (03PS2) 10Andrew Bogott: puppetserver: only create '/var/lib/puppet/server' when needed [puppet] - 10https://gerrit.wikimedia.org/r/975075 [19:58:42] (03PS4) 10Jcrespo: [WIP]Prepare for release [software/transferpy] - 10https://gerrit.wikimedia.org/r/974683 [19:58:44] (03CR) 10Jbond: [C: 03+1] "thx" [puppet] - 10https://gerrit.wikimedia.org/r/975075 (owner: 10Andrew Bogott) [19:58:58] (03CR) 10Andrew Bogott: [C: 03+2] puppetserver: only create '/var/lib/puppet/server' when needed [puppet] - 10https://gerrit.wikimedia.org/r/975075 (owner: 10Andrew Bogott) [19:59:39] (03CR) 10CI reject: [V: 04-1] [WIP]Prepare for release [software/transferpy] - 10https://gerrit.wikimedia.org/r/974683 (owner: 10Jcrespo) [19:59:51] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:00:24] (03PS8) 10Dzahn: wmflib: add function to return PHP version based on distro version [puppet] - 10https://gerrit.wikimedia.org/r/974285 [20:06:03] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/974285/530/" [puppet] - 10https://gerrit.wikimedia.org/r/974285 (owner: 10Dzahn) [20:14:05] PROBLEM - Check systemd state on kubernetes2053 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:18:08] (03CR) 10Jcrespo: "There is one thing missing, which is handling the new exceptions:" 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [20:18:17] (03PS4) 10Dzahn: planet: Update for rawdog v3 on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: 10Legoktm) [20:18:48] (03CR) 10CI reject: [V: 04-1] planet: Update for rawdog v3 on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: 10Legoktm) [20:23:20] !log dr0ptp4kt@deploy2002 Started deploy [airflow-dags/search@b00c6ca]: Deploying Airflow search WDQS graph split HDFS job [20:23:47] !log dr0ptp4kt@deploy2002 Finished deploy [airflow-dags/search@b00c6ca]: Deploying Airflow search WDQS graph split HDFS job (duration: 00m 27s) [20:27:53] (03CR) 10AW GitLab Bot: "PAN-PAN: end-to-end deploy stage failed" [extensions/WikiLambda] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/974244 (https://phabricator.wikimedia.org/T351121) (owner: 10Jforrester) [20:41:03] (03PS1) 10JHathaway: puppetserver: remove log spam from user home dir sync [puppet] - 10https://gerrit.wikimedia.org/r/975080 (https://phabricator.wikimedia.org/T351465) [20:41:20] !log adding anycast GW for public1-b-codfw vlan to codfw spine switches (T347191) [20:41:22] 10Puppet, 10MediaModeration (MediaModeration 2.0), 10Trust and Safety Product Sprint (Sprint Bodhrán): [S] Add mediamoderation_scan to the private tables list on puppet - https://phabricator.wikimedia.org/T351095 (10kostajh) [20:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:25] T347191: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 [20:41:42] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/975080 (https://phabricator.wikimedia.org/T351465) (owner: 10JHathaway) [20:46:43] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - 
AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:47:51] ^^ this is due to me, BGP reset but came back up [20:48:46] thanks! [20:49:57] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:50:58] ^^ this is doh2002, investigating [20:51:17] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:52:41] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:52:43] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:53:11] ^^ this required a manual reset on doh2002 for BFD I didn't expect [20:53:22] clear on the CR side didn't resolve [20:53:41] proceeding to next step [20:54:17] !log changing VRRP GW IP for public1-b-codfw on codfw CRs and disabling IPv6 RAs on the CRs (T347191) [20:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:21] T347191: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 [20:56:43] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:58:57] RECOVERY - Disk space on druid1010 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1010&var-datasource=eqiad+prometheus/ops [20:59:19] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:59:27] PROBLEM - Host dns2004 is DOWN: PING CRITICAL - 
Packet loss = 100% [20:59:39] PROBLEM - Host ldap-rw2001 is DOWN: PING CRITICAL - Packet loss = 100% [20:59:43] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:59:49] brennen: i might be interested to do the config deployment if that's okay with you [21:00:07] brennen and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T2100). [21:00:07] danisztls and James_F: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:17] RECOVERY - Host dns2004 is UP: PING WARNING - Packet loss = 90%, RTA = 33.21 ms [21:00:18] o/ [21:00:19] topranks: need me to check anything? [21:00:28] o/ [21:00:29] PROBLEM - Host 208.80.153.48 is DOWN: PING CRITICAL - Packet loss = 100% [21:00:37] RECOVERY - Host ldap-rw2001 is UP: PING OK - Packet loss = 0%, RTA = 33.13 ms [21:00:39] o/ [21:01:15] RECOVERY - Host 208.80.153.48 is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms [21:01:15] (I'm not able to deploy this evening) [21:01:23] RECOVERY - Disk space on druid1011 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1011&var-datasource=eqiad+prometheus/ops [21:01:25] dr0ptp4kt: sure. 
:) [21:01:39] RECOVERY - Disk space on druid1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1009&var-datasource=eqiad+prometheus/ops [21:02:29] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:03:09] sukhe: thanks not right now [21:03:21] (ProbeDown) firing: (5) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:03:39] things seemed to be ok, I attempted rollback but got worse so pushed forward and all seems ok [21:03:53] np! gl [21:04:07] PROBLEM - LDAP -read-only server- on ldap-replica2005 is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [21:04:47] (ProbeDown) firing: (5) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:05:11] (03PS2) 10Dr0ptp4kt: Pre-deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975059 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza) [21:06:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dr0ptp4kt@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975059 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza) [21:07:02] (03Merged) 10jenkins-bot: Pre-deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975059 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza) [21:07:16] !log dr0ptp4kt@deploy2002 Started scap: Backport for [[gerrit:975059|Pre-deploy Annual Plan Core 
Metrics survey (T351353)]] [21:07:21] T351353: Deploy survey (Community Feedback on Core Metrics Reports) on Meta-Wiki - https://phabricator.wikimedia.org/T351353 [21:08:36] !log dr0ptp4kt@deploy2002 dr0ptp4kt and dani: Backport for [[gerrit:975059|Pre-deploy Annual Plan Core Metrics survey (T351353)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:09:22] danisztls: please check [21:10:06] 10ops-codfw, 10ops-esams, 10DC-Ops: ship MPC5E-40G10G-IRB from esams to codfw - https://phabricator.wikimedia.org/T351467 (10RobH) p:05Triage→03High [21:12:05] dr0ptp4kt: looks good, I'm not able to fully test right now as the messages weren't created yet but coverage is set to 0 [21:12:11] RECOVERY - Check systemd state on kubernetes2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:12:26] danisztls: okay, would you prefer we sync or rather abandon? [21:12:39] dr0ptp4kt: sync [21:12:42] on it [21:12:45] !log dr0ptp4kt@deploy2002 dr0ptp4kt and dani: Continuing with sync [21:13:37] @seen xqt [21:17:29] (03CR) 10Kosta Harlan: [C: 03+1] ipoid: Disable the daily updates job and schedule an import [deployment-charts] - 10https://gerrit.wikimedia.org/r/975061 (https://phabricator.wikimedia.org/T351449) (owner: 10Tchanders) [21:18:23] (03PS1) 10Sohom Datta: Make the feed gracefully handle long snippets and urls [extensions/PageTriage] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975028 (https://phabricator.wikimedia.org/T347732) [21:18:29] !log dr0ptp4kt@deploy2002 Finished scap: Backport for [[gerrit:975059|Pre-deploy Annual Plan Core Metrics survey (T351353)]] (duration: 11m 12s) [21:18:33] T351353: Deploy survey (Community Feedback on Core Metrics Reports) on Meta-Wiki - https://phabricator.wikimedia.org/T351353 [21:18:49] danisztls: sync'd [21:19:41] Um sorry if I'm late to the deployment window, can 
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/975028 be backported (it fixes a regression in the newer PageTriage UI) [21:19:44] dr0ptp4kt: thanks! [21:21:02] (03PS1) 10RobH: update site.pp and partition info for new an-workers [puppet] - 10https://gerrit.wikimedia.org/r/975085 (https://phabricator.wikimedia.org/T349936) [21:21:02] It's fine if the answer is a no, asking since this is the last deploy window before the weekend :) [21:21:33] (03CR) 10RobH: [C: 03+2] update site.pp and partition info for new an-workers [puppet] - 10https://gerrit.wikimedia.org/r/975085 (https://phabricator.wikimedia.org/T349936) (owner: 10RobH) [21:21:39] Sohom_Datta: looking now [21:21:59] I came here to ask the same thing. Sohom is way ahead of me :) [21:23:24] (might be able to take a look in a little bit if someone else doesn't) [21:23:33] Looks okay Sohom_Datta would you please add it to https://wikitech.wikimedia.org/wiki/Deployments#Thursday,_November_16 and let us know once added? 
[21:24:42] Sohom_Datta: NovemLinguae we've got one in the queue in front of you, we'll get it out after James_F's patch :) [21:25:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1163.mgmt.eqiad.wmnet with reboot policy FORCED [21:25:13] ty :) [21:25:19] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1164.mgmt.eqiad.wmnet with reboot policy FORCED [21:26:10] (CR) TrainBranchBot: [C: +2] "Approved by dr0ptp4kt@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/974244 (https://phabricator.wikimedia.org/T351121) (owner: Jforrester) [21:26:34] thcipriani: <3 [21:26:43] PROBLEM - cassandra-b CQL 10.64.32.145:9042 on aqs1012 is CRITICAL: connect to address 10.64.32.145 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [21:26:49] PROBLEM - cassandra-b SSL 10.64.32.145:7000 on aqs1012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:26:55] PROBLEM - cassandra-b service on aqs1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:27:09] dr0ptp4kt: Added :) [21:27:16] Thank you :) [21:27:23] Sohom_Datta: thanks, will take a check [21:29:14] (PS5) Dzahn: planet: Update for rawdog v3 on bookworm [puppet] - https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: Legoktm) [21:30:21] RECOVERY - LDAP -read-only server- on ldap-replica2005 is OK: LDAP OK - 0.107 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [21:30:31] (Merged) jenkins-bot: Conditionally render the content of header-action instead of the slot [extensions/WikiLambda] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/974244 
(https://phabricator.wikimedia.org/T351121) (owner: Jforrester) [21:30:43] !log dr0ptp4kt@deploy2002 Started scap: Backport for [[gerrit:974244|Conditionally render the content of header-action instead of the slot (T351121)]] [21:30:50] T351121: Button to run implementations and testers is gone - https://phabricator.wikimedia.org/T351121 [21:31:37] (CR) Dr0ptp4kt: [C: +2] "Preparing for backport window" [extensions/PageTriage] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/975028 (https://phabricator.wikimedia.org/T347732) (owner: Sohom Datta) [21:31:59] !log dr0ptp4kt@deploy2002 dr0ptp4kt and jforrester: Backport for [[gerrit:974244|Conditionally render the content of header-action instead of the slot (T351121)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:32:13] James_F: would you please have a look and let me know when good to sync? [21:32:25] dr0ptp4kt: Yup, all looks good! [21:32:31] On it [21:32:34] !log dr0ptp4kt@deploy2002 dr0ptp4kt and jforrester: Continuing with sync [21:32:36] (Almost like I had the page ready in debug to test.) [21:32:39] Thank you! [21:33:45] in case you *weren't* already hovering over the refresh button :) [21:38:20] !log dr0ptp4kt@deploy2002 Finished scap: Backport for [[gerrit:974244|Conditionally render the content of header-action instead of the slot (T351121)]] (duration: 07m 36s) [21:38:27] Thank you again. [21:38:36] Thank you, as always. [21:38:36] T351121: Button to run implementations and testers is gone - https://phabricator.wikimedia.org/T351121 [21:40:47] Sohom_Datta: still going through gate and submit... 
[21:42:59] !log Removing VRRP config for public1-b-codfw on codfw CRs (T347191) [21:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:04] T347191: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 [21:44:17] PROBLEM - Host doh2002 is DOWN: PING CRITICAL - Packet loss = 100% [21:44:19] PROBLEM - Host dns2004 is DOWN: PING CRITICAL - Packet loss = 100% [21:44:47] PROBLEM - Host serpens is DOWN: PING CRITICAL - Packet loss = 100% [21:45:07] RECOVERY - Host doh2002 is UP: PING WARNING - Packet loss = 77%, RTA = 33.35 ms [21:46:09] RECOVERY - Host dns2004 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [21:46:19] (ProbeDown) firing: Service contint2002:1443 has failed probes (http_integration_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#contint2002:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:46:39] sukhe: You doing something fun with doh? [21:47:03] that was me sry, "cleaning up" after previous work, seems I'd left the VIP on the CRs [21:47:21] they'll clear shortly, reverted immediately [21:47:26] thanks [21:47:41] RECOVERY - Host serpens is UP: PING OK - Packet loss = 0%, RTA = 33.41 ms [21:48:39] (Merged) jenkins-bot: Make the feed gracefully handle long snippets and urls [extensions/PageTriage] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/975028 (https://phabricator.wikimedia.org/T347732) (owner: Sohom Datta) [21:49:02] merged \o/ [21:49:10] \o/ [21:50:03] PROBLEM - Host dns2004 is DOWN: PING CRITICAL - Packet loss = 100% [21:50:21] PROBLEM - Host serpens is DOWN: PING CRITICAL - Packet loss = 100% [21:50:23] sry... 
[21:50:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10Jclark-ctr) [21:50:26] on it [21:50:37] !log dr0ptp4kt@deploy2002 Started scap: Backport for [[gerrit:975028|Make the feed gracefully handle long snippets and urls (T347732 T351463)]] [21:50:41] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:50:43] T347732: Mock up a 100% Codex front end for PageTriage - https://phabricator.wikimedia.org/T347732 [21:50:43] T351463: mwe-vue-pt-snippet is way too narrow - https://phabricator.wikimedia.org/T351463 [21:51:17] RECOVERY - Host dns2004 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [21:51:19] (ProbeDown) resolved: Service contint2002:1443 has failed probes (http_integration_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#contint2002:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:51:50] !log dr0ptp4kt@deploy2002 dr0ptp4kt and soda: Backport for [[gerrit:975028|Make the feed gracefully handle long snippets and urls (T347732 T351463)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:52:09] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1157'] [21:52:12] Sohom_Datta: would you please check and advise if okay to commence with sync? 
[21:52:24] On it [21:52:31] RECOVERY - Host serpens is UP: PING WARNING - Packet loss = 77%, RTA = 33.49 ms [21:52:48] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1157'] [21:53:03] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1157'] [21:53:14] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1157'] [21:53:21] (ProbeDown) firing: (5) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:53:24] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1157'] [21:53:27] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:53:29] PROBLEM - Host doh2002 is DOWN: PING CRITICAL - Packet loss = 100% [21:53:53] PROBLEM - Host dns2004 is DOWN: PING CRITICAL - Packet loss = 100% [21:54:11] topranks: This you too? 
[21:54:22] Looks good to me :) [21:54:34] Thanks a lot for doing this on such a short notice :) [21:54:50] thx Sohom_Datta, will begin sync in a few secs [21:54:53] !log dr0ptp4kt@deploy2002 dr0ptp4kt and soda: Continuing with sync [21:54:55] brett: yeah certainly although looks up to me [21:55:00] (ProbeDown) firing: (5) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:55:19] RECOVERY - Host doh2002 is UP: PING OK - Packet loss = 0%, RTA = 33.49 ms [21:55:51] PROBLEM - SSH on contint2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:56:11] RECOVERY - Host dns2004 is UP: PING OK - Packet loss = 0%, RTA = 33.24 ms [21:57:01] RECOVERY - SSH on contint2002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:57:59] PROBLEM - Check systemd state on puppetserver2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:59:10] :O [21:59:35] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1157'] [22:00:27] !log dr0ptp4kt@deploy2002 Finished scap: Backport for [[gerrit:975028|Make the feed gracefully handle long snippets and urls (T347732 T351463)]] (duration: 09m 50s) [22:00:37] T347732: Mock up a 100% Codex front end for PageTriage - https://phabricator.wikimedia.org/T347732 [22:00:38] T351463: mwe-vue-pt-snippet is way too narrow - https://phabricator.wikimedia.org/T351463 [22:00:41] thx Sohom_Datta, sync done [22:02:04] brett: the puppetserver2002 alert is not related to anything I'm working on, only codfw row B public vlan is what I'm at [22:02:32] Can confirm that it 
works on my end after clearing the browser cache :) [22:03:14] ack [22:04:00] Hm, but puppet2002 doesn't have sync-puppet-volatile.service? [22:04:15] jouncebot: now [22:04:16] No deployments scheduled for the next 8 hour(s) and 55 minute(s) [22:05:41] brett: I need to try to flip this gw again, it may trigger BFD/BGP alerts on the dns/doh hosts, but right now things are inconsistent which we can't leave that way. [22:05:44] oh, puppetserver [22:05:50] cool, thanks for the heads up [22:06:05] those hosts are backed up in terms of function, so in terms of services we should be good [22:06:16] I'd rather not downtime as the alerts may be useful - sorry for the noise [22:07:09] sync-puppet-volatile.service is all good. Temporary dns resolution failure [22:07:41] RECOVERY - Check systemd state on puppetserver2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:09:22] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1157.eqiad.wmnet with OS bullseye [22:09:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye [22:21:09] (03PS1) 10Andrew Bogott: puppetserver: '/srv/puppet_code/environments' owned by puppet/puppet [puppet] - 10https://gerrit.wikimedia.org/r/975089 (https://phabricator.wikimedia.org/T351468) [22:21:39] PROBLEM - VRRP status on cr1-codfw is CRITICAL: VRRP CRITICAL - 1 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [22:22:22] Assuming this is related [22:26:06] brett: sorry, yeah it's giving out cos I removed the equivalent on cr2, I'm removing cr1 now so it should resolve shortly [22:26:14] No prob! [22:26:23] I owe you a beer I think :) [22:27:04] not at all! 
[22:27:11] RECOVERY - VRRP status on cr1-codfw is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [22:27:51] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:28:42] ^^ uncommitted dns is probably me, I'll run the cookbook (fairly sure I don't have to re-add those IPs, things look ok) [22:29:03] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [22:30:11] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [22:30:41] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old vlan 1117 entries - cmooney@cumin1001" [22:31:29] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-11-14 11:01:41 +0000 (expires in 1824 days) https://wikitech.wikimedia.org/wiki/Logs [22:31:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old vlan 1117 entries - cmooney@cumin1001" [22:31:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:34:08] (CR) Dzahn: [V: +1 C: +2] wmflib: add function to return PHP version based on distro version [puppet] - https://gerrit.wikimedia.org/r/974285 (owner: Dzahn) [22:36:04] !log disabled puppet on miscweb*, netmon* and phab* hosts, deploying gerrit:974285, confirming noop [22:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:33] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes 
https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:38:56] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [22:39:09] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [22:39:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T348183)', diff saved to https://phabricator.wikimedia.org/P53529 and previous config saved to /var/cache/conftool/dbconfig/20231116-223915-arnaudb.json [22:39:20] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [22:40:03] (CR) Dzahn: [V: +1 C: +2] "confirmed noop on all miscweb*, netmon* and phab* prod machines. additionally compiled on a cloud VPS using simplelamp2 role." [puppet] - https://gerrit.wikimedia.org/r/974285 (owner: Dzahn) [22:50:53] (PS1) Dzahn: piwik: avoid hardcoded PHP version string [puppet] - https://gerrit.wikimedia.org/r/975093 [22:51:21] (CR) CI reject: [V: -1] piwik: avoid hardcoded PHP version string [puppet] - https://gerrit.wikimedia.org/r/975093 (owner: Dzahn) [22:52:58] (PS1) Dzahn: simplelap: avoid hardcoded PHP version string [puppet] - https://gerrit.wikimedia.org/r/975094 [22:54:52] (PS2) Dzahn: piwik: avoid hardcoded PHP version string [puppet] - https://gerrit.wikimedia.org/r/975093 [22:55:22] (PS2) Dzahn: simplelap: avoid hardcoded PHP version string [puppet] - https://gerrit.wikimedia.org/r/975094 [23:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - 
https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [23:10:09] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on cr[1-2]-codfw,cr[1-2]-codfw IPv6 with reason: Move public1-a-codfw vlan GW from codfw CR routers to ssw [23:10:25] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr[1-2]-codfw,cr[1-2]-codfw IPv6 with reason: Move public1-a-codfw vlan GW from codfw CR routers to ssw [23:10:31] SRE, ops-codfw, Infrastructure-Foundations, netops: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c937612c-c0eb-4c9e-a245-9810a56c0a33) set by cmooney@cu... [23:12:29] (PS1) Jdlrobson: Disable drawer temporarily while erroring [mediawiki-config] - https://gerrit.wikimedia.org/r/975097 (https://phabricator.wikimedia.org/T351362) [23:13:04] (PS2) Jdlrobson: Disable drawer temporarily while erroring [mediawiki-config] - https://gerrit.wikimedia.org/r/975097 (https://phabricator.wikimedia.org/T351362) [23:13:38] jeena: could we backport the above patch to get the error rate back down to normal? [23:13:57] it disables the codepath that is erroring (which is broken anyway :-)) [23:21:13] Jdlrobson: need a deployer? [23:25:10] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [23:27:13] TheresNoTime: if you could! [23:27:14] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old vlan 2001 entries - cmooney@cumin1001" [23:27:19] ack! [23:27:22] Would be nice to go into the weekend without lots of email alerts :) [23:27:27] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/975097/ correct? 
[23:27:33] correct [23:27:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975097 (https://phabricator.wikimedia.org/T351362) (owner: 10Jdlrobson) [23:27:49] What's the process for this? Do I need to log it on https://wikitech.wikimedia.org/wiki/Deployments somewhere? [23:28:05] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old vlan 2001 entries - cmooney@cumin1001" [23:28:05] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:28:26] (03Merged) 10jenkins-bot: Disable drawer temporarily while erroring [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975097 (https://phabricator.wikimedia.org/T351362) (owner: 10Jdlrobson) [23:28:39] Jdlrobson: I'll log it in the SAL [23:28:43] !log samtar@deploy2002 Started scap: Backport for [[gerrit:975097|Disable drawer temporarily while erroring (T351362)]] [23:28:50] T351362: Regression: AMC Outreach campaign is not showing when mobile users click desktop link - https://phabricator.wikimedia.org/T351362 [23:28:51] TheresNoTime: thx ! 
I can check this on stat1001 before you sync [23:29:02] debug2001 rather :) [23:29:08] in analytics mindset haha [23:29:16] (03PS8) 10Krinkle: mc: Make it possible to use mcrouter server set by environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01) [23:29:34] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1157.eqiad.wmnet with OS bullseye [23:29:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye executed with errors: - an-worke... [23:29:58] (03CR) 10Krinkle: mc: Make it possible to use mcrouter server set by environment (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01) [23:29:59] !log samtar@deploy2002 jdlrobson and samtar: Backport for [[gerrit:975097|Disable drawer temporarily while erroring (T351362)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:30:08] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1157.eqiad.wmnet with OS bullseye [23:30:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye [23:30:21] Jdlrobson: ready on mwdebug [23:30:26] !log Add gateway IP for public1-a-codfw Vlan to ssw in codfw T347191 [23:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:44] T347191: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 
[23:31:03] thanks looking [23:33:22] !log Change VRRP IP for public1-a-codfw vlan on codfw CRs T347191 [23:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:29] sorry Jdlrobson, I stepped away for a moment. Thanks TheresNoTime [23:33:37] np! [23:33:39] TheresNoTime: oh no.. it doesn't look like this fully solves the issue like I hoped. :( [23:33:46] So I guess there's no point in syncing it [23:33:57] Jdlrobson: ack :( [23:34:01] !log samtar@deploy2002 Sync cancelled. [23:34:31] (03PS1) 10Samtar: Revert "Disable drawer temporarily while erroring" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975029 [23:34:34] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/975096 should also fix it but it hasn't been reviewed yet [23:34:39] so I am not sure what protocol is for that. [23:34:51] it's pretty simple: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/975096/1/src/mobile.startup/mobile.startup.js what do you think TheresNoTime ? [23:35:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975029 (owner: 10Samtar) [23:35:23] Jdlrobson: I'll take a look [23:35:28] It looks pretty simple [23:35:40] I assume it's cheap to backport it, test it? [23:35:44] (03Merged) 10jenkins-bot: Revert "Disable drawer temporarily while erroring" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975029 (owner: 10Samtar) [23:35:49] and unbackport it if it doesn't work? [23:35:55] seems like it to me [23:35:59] !log samtar@deploy2002 Started scap: Backport for [[gerrit:975029|Revert "Disable drawer temporarily while erroring"]] [23:36:07] Ideally someone would +1 it first [23:36:27] PROBLEM - Host doh2001 is DOWN: PING CRITICAL - Packet loss = 100% [23:37:03] TheresNoTime: i'll see if I can get someone in the team to vouch for it. 
It's near the end of the day though so am not sure who is still around (I'm the furthest west). [23:37:16] !log samtar@deploy2002 samtar: Backport for [[gerrit:975029|Revert "Disable drawer temporarily while erroring"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:37:16] I can get one for tomorrow if we're okay with a backport tomorrow? [23:37:33] !log samtar@deploy2002 samtar: Continuing with sync [23:37:35] RECOVERY - Host doh2001 is UP: PING OK - Packet loss = 0%, RTA = 33.39 ms [23:37:35] let me check [23:38:24] We can do it tomorrow if needed [23:39:27] just syncing that revert (not entirely sure if I needed to, but *shrug*) [23:40:43] thanks TheresNoTime and sorry for the run around [23:40:55] not a problem at all! :) [23:43:31] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:975029|Revert "Disable drawer temporarily while erroring"]] (duration: 07m 31s) [23:43:53] Okay jeena i'll ping you tomorrow since I can't seem to find a review from my team [23:44:00] okay [23:44:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 46.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:46:26] hm [23:46:35] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1158'] [23:49:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 46.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:51:49] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host 
an-worker1158.eqiad.wmnet with OS bullseye [23:51:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye [23:52:02] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1158'] [23:58:31] PROBLEM - Host doh2001 is DOWN: PING CRITICAL - Packet loss = 100% [23:58:42] ^^ just doing a test with this one [23:59:37] RECOVERY - Host doh2001 is UP: PING OK - Packet loss = 0%, RTA = 33.40 ms