[00:01:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2034.codfw.wmnet with reason: host reimage
[00:04:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2034.codfw.wmnet with reason: host reimage
[00:07:00] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (Eevans) Resolved→Open >>! In T349758#9365653, @Papaul wrote: > @Eevans All your's Hi @Papaul, Did these get the additional 3 IPs per host (i.e. restbase2028-{a,...
[00:09:04] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:09:59] (CR) Dzahn: [C: +2] "thank you! planet1003 works but on 2003:" [puppet] - https://gerrit.wikimedia.org/r/978073 (owner: Dzahn)
[00:22:42] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:23:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:23:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2033.codfw.wmnet with OS bullseye
[00:23:49] SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ganeti2033.codfw.wmnet with OS bullseye completed: - ganeti2033 (**PAS...
[00:24:20] SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (Papaul)
[00:24:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:24:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2034.codfw.wmnet with OS bullseye
[00:24:36] SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ganeti2034.codfw.wmnet with OS bullseye completed: - ganeti2034 (**PAS...
[00:24:47] SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (Papaul)
[00:24:58] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:26:08] SRE, Wikimedia-Mailing-lists: Request for mailing list - Wikimedia Bénin user group - https://phabricator.wikimedia.org/T352285 (Ladsgroup) Open→Resolved {{done}} ^_^ https://lists.wikimedia.org/postorius/lists/wikimedia-bj.lists.wikimedia.org
[00:27:54] SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (Papaul) Open→Resolved @MoritzMuehlenhoff all your's
[00:31:27] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (Papaul) @Eevans i don't know since @Jhancock.wm did the provision and i just did the OS install, But I will check and let you know tomorrow. Thanks
[00:38:38] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/978649
[00:38:42] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/978649 (owner: TrainBranchBot)
[00:52:24] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (Papaul) @Jhancock.wm I think you forget to setup the 3 additional IP's for those nodes (Networking Setup: Speed:1G - VLAN:Private(?)/Public/Other(Specify) : AAAA records:...
[00:54:06] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (Papaul) a:Jhancock.wm→Papaul
[01:05:53] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/978649 (owner: TrainBranchBot)
[01:08:21] (PS1) Papaul: Add new kubernetes node to site.pp [puppet] - https://gerrit.wikimedia.org/r/978716 (https://phabricator.wikimedia.org/T349873)
[01:11:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:18:07] (CR) Papaul: [C: +2] Add new kubernetes node to site.pp [puppet] - https://gerrit.wikimedia.org/r/978716 (https://phabricator.wikimedia.org/T349873) (owner: Papaul)
[01:21:18] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:25:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2057.codfw.wmnet with OS bullseye
[01:25:48] SRE, ops-codfw, DC-Ops, serviceops, Patch-For-Review: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2057.codfw.wmnet with OS bullseye
[01:30:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:35:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2058.codfw.wmnet with OS bullseye
[01:35:52] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2058.codfw.wmnet with OS bullseye
[01:36:45] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (Papaul)
[01:43:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2057.codfw.wmnet with reason: host reimage
[01:46:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2057.codfw.wmnet with reason: host reimage
[01:51:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:52:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2058.codfw.wmnet with reason: host reimage
[01:56:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2058.codfw.wmnet with reason: host reimage
[01:56:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2059.codfw.wmnet with OS bullseye
[01:56:22] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2059.codfw.wmnet with OS bullseye
[01:57:35] SRE, ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T352027 (Papaul) Open→Resolved a:Papaul fix
[02:04:14] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:07:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:07:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2057.codfw.wmnet with OS bullseye
[02:07:57] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2057.codfw.wmnet with OS bullseye completed: - kubernetes2057 (**PASS**)...
[02:08:21] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (Papaul)
[02:09:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2060.codfw.wmnet with OS bullseye
[02:09:22] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2060.codfw.wmnet with OS bullseye
[02:14:40] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:14:58] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[02:18:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2059.codfw.wmnet with reason: host reimage
[02:19:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:19:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2058.codfw.wmnet with OS bullseye
[02:20:00] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2058.codfw.wmnet with OS bullseye completed: - kubernetes2058 (**PASS**)...
[02:22:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2059.codfw.wmnet with reason: host reimage
[02:22:29] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (Eevans) >>! In T349758#9369257, @Papaul wrote: > [ ... ] > @Eevans if i add the other 3 IP's addresses manually you should be good or do we have to re image all the hosts...
[02:26:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2060.codfw.wmnet with reason: host reimage
[02:29:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2060.codfw.wmnet with reason: host reimage
[02:38:04] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (Eevans) >>! In T349758#9369347, @Eevans wrote: >>>! In T349758#9369257, @Papaul wrote: >> [ ... ] >> @Eevans if i add the other 3 IP's addresses manually you should be go...
[02:39:00] (JobUnavailable) firing: (11) Reduced availability for job lvs_realserver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:42:43] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:43:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:44:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2059.codfw.wmnet with OS bullseye
[02:44:07] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2059.codfw.wmnet with OS bullseye completed: - kubernetes2059 (**PASS**)...
[02:47:56] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [02:49:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [02:49:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2060.codfw.wmnet with OS bullseye [02:50:00] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2060.codfw.wmnet with OS bullseye completed: - kubernetes2060 (**PASS**)... [02:50:02] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T352354 (10phaultfinder) [02:50:53] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS13030/IPv4: Connect - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:57:17] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:01:41] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:07:00] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (10Papaul) [03:07:59] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (10Papaul) 05Open→03Resolved @Clement_Goubert @Joe all your's [03:09:00] (JobUnavailable) firing: (11) Reduced availability for job lvs_realserver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:43:54] (03PS32) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [03:51:26] (03CR) 10Dwisehaupt: "Thanks for the feedback. I've fixed the nits and added a firewall::service stanza preemptively." 
[puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [03:53:59] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv6: Active - Init7, AS13030/IPv4: Idle - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:54:31] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [04:05:03] (03CR) 10Andrew Bogott: puppetserver: '/srv/puppet_code/environments' owned by puppet/puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975089 (https://phabricator.wikimedia.org/T351468) (owner: 10Andrew Bogott) [04:07:15] (03CR) 10Andrew Bogott: [C: 03+1] P:wmcs::instance: adjust syslog handling [puppet] - 10https://gerrit.wikimedia.org/r/850633 (owner: 10Majavah) [04:10:02] (03CR) 10Andrew Bogott: [C: 03+1] [openstack] Upgrade all remaining hosts to Antelope [puppet] - 10https://gerrit.wikimedia.org/r/978636 (https://phabricator.wikimedia.org/T348843) (owner: 10FNegri) [04:22:40] 10ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T352357 (10phaultfinder) [04:24:33] PROBLEM - Hadoop NodeManager on an-worker1133 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:24:58] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:25:11] PROBLEM - Check systemd state on an-worker1133 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:33:31] 
PROBLEM - Hadoop NodeManager on an-worker1145 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:34:01] RECOVERY - Check systemd state on an-worker1133 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:34:05] PROBLEM - Check systemd state on an-worker1145 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:34:51] RECOVERY - Hadoop NodeManager on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:38:29] RECOVERY - Check systemd state on an-worker1145 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:39:25] RECOVERY - Hadoop NodeManager on an-worker1145 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:45:03] PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:45:09] PROBLEM - Check systemd state on an-worker1113 is CRITICAL: CRITICAL 
- degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:59:43] RECOVERY - Hadoop NodeManager on an-worker1113 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:59:47] RECOVERY - Check systemd state on an-worker1113 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:21:19] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:35:39] (03PS3) 10KartikMistry: Update cxserver to 2023-11-28-064518-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982) [05:35:48] (03CR) 10KartikMistry: Update cxserver to 2023-11-28-064518-production (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982) (owner: 10KartikMistry) [05:39:04] I am going to put phabricator in RO for a few seconds to switch its database master [05:41:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2134,2160].codfw.wmnet,db[1119,1159,1217].eqiad.wmnet with reason: m3 master switchover T352149 [05:41:29] T352149: Switchover m3 master db1159 -> db1119 - https://phabricator.wikimedia.org/T352149 [05:41:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2134,2160].codfw.wmnet,db[1119,1159,1217].eqiad.wmnet with reason: m3 master switchover T352149 [05:43:25] (03PS1) 10Marostegui: mariadb: Promote db1119 to m3 master [puppet] - 10https://gerrit.wikimedia.org/r/978721 
(https://phabricator.wikimedia.org/T352149) [05:45:59] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1119 to m3 master [puppet] - 10https://gerrit.wikimedia.org/r/978721 (https://phabricator.wikimedia.org/T352149) (owner: 10Marostegui) [05:47:24] !log Failover m3 from db1159 to db1119 - T352149 [05:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:30] T352149: Switchover m3 master db1159 -> db1119 - https://phabricator.wikimedia.org/T352149 [05:51:00] (03PS1) 10Marostegui: db1159: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/978722 (https://phabricator.wikimedia.org/T351990) [05:51:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:51:38] (03CR) 10Marostegui: [C: 03+2] db1159: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/978722 (https://phabricator.wikimedia.org/T351990) (owner: 10Marostegui) [05:52:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1159.eqiad.wmnet with OS bookworm [06:05:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1159.eqiad.wmnet with reason: host reimage [06:08:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1159.eqiad.wmnet with reason: host reimage [06:13:49] (03PS1) 10Marostegui: Revert "db1159: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/978515 [06:13:56] (03CR) 10Marostegui: [C: 04-2] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/978515 (owner: 10Marostegui) [06:14:58] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [06:23:31] 
(03PS2) 10KartikMistry: Update Apertium to 2023-11-30-061450-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/978189 (https://phabricator.wikimedia.org/T270060) [06:24:28] marostegui: OK to deploy apertium service? [06:24:33] kart_: absolutely! [06:24:39] Thanks! [06:27:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1159.eqiad.wmnet with OS bookworm [06:32:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126 T351283', diff saved to https://phabricator.wikimedia.org/P53951 and previous config saved to /var/cache/conftool/dbconfig/20231130-063258-root.json [06:33:04] T351283: Compile and package MariaDB 10.6.16 and 10.4.32 - https://phabricator.wikimedia.org/T351283 [06:33:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1210 T351283', diff saved to https://phabricator.wikimedia.org/P53952 and previous config saved to /var/cache/conftool/dbconfig/20231130-063317-root.json [06:34:39] (03PS1) 10Marostegui: db1210,db1126: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/978726 (https://phabricator.wikimedia.org/T351283) [06:35:24] (03CR) 10Marostegui: [C: 03+2] db1210,db1126: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/978726 (https://phabricator.wikimedia.org/T351283) (owner: 10Marostegui) [06:35:55] (03CR) 10KartikMistry: [C: 03+2] Update Apertium to 2023-11-30-061450-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/978189 (https://phabricator.wikimedia.org/T270060) (owner: 10KartikMistry) [06:36:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1210.eqiad.wmnet with OS bookworm [06:36:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1126.eqiad.wmnet with OS bookworm [06:36:45] (03Merged) 10jenkins-bot: Update Apertium to 2023-11-30-061450-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/978189 (https://phabricator.wikimedia.org/T270060) (owner: 
10KartikMistry) [06:37:18] (03CR) 10Marostegui: Revert "db1159: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/978515 (owner: 10Marostegui) [06:37:30] (03CR) 10Marostegui: [C: 03+2] Revert "db1159: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/978515 (owner: 10Marostegui) [06:39:52] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/apertium: apply [06:40:17] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/apertium: apply [06:41:45] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10Language-Team (Language-2023-October-December): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 (10santhosh) @elukey, What do you mean by 'reaching out to you by next time' ? Regarding the architecture of... [06:42:35] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/apertium: apply [06:43:13] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply [06:44:10] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/apertium: apply [06:44:38] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/apertium: apply [06:45:03] !log Updated Apertium to 2023-11-30-061450-production (T270060) [06:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:08] T270060: Package apertium-fra-frp (French-Arpitan) - https://phabricator.wikimedia.org/T270060 [06:46:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1126.eqiad.wmnet with reason: host reimage [06:49:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1210.eqiad.wmnet with reason: host reimage [06:49:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1126.eqiad.wmnet with reason: host reimage [06:53:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
db1210.eqiad.wmnet with reason: host reimage [06:57:58] (03PS1) 10Kevin Bazira: ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/978651 (https://phabricator.wikimedia.org/T343123) [07:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T0700) [07:00:05] kormat, marostegui, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T0700). [07:09:01] (JobUnavailable) firing: (10) Reduced availability for job lvs_realserver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:09:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1126.eqiad.wmnet with OS bookworm [07:13:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1210.eqiad.wmnet with OS bookworm [07:20:42] (03PS1) 10KartikMistry: Update MinT to 2023-11-21-115852-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/978727 [07:31:01] (03PS1) 10Marostegui: Revert "db1210,db1126: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/978524 [07:31:56] (03CR) 10Marostegui: [C: 03+2] Revert "db1210,db1126: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/978524 (owner: 10Marostegui) [07:32:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53953 and previous config saved to /var/cache/conftool/dbconfig/20231130-073210-root.json [07:32:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 1%: Upgrade 
to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53954 and previous config saved to /var/cache/conftool/dbconfig/20231130-073212-root.json [07:42:22] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:45:56] (03PS1) 10Marostegui: phabricator.my.cnf.erb: Increase innodb_buffer_pool_size [puppet] - 10https://gerrit.wikimedia.org/r/978850 (https://phabricator.wikimedia.org/T352360) [07:47:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53955 and previous config saved to /var/cache/conftool/dbconfig/20231130-074715-root.json [07:47:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53956 and previous config saved to /var/cache/conftool/dbconfig/20231130-074717-root.json [07:49:52] (03PS1) 10Clare Ming: Add stream config for *uiactionstracking via Metrics Platform [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978947 (https://phabricator.wikimedia.org/T351298) [07:52:14] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/output/978850/781/db1159.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/978850 (https://phabricator.wikimedia.org/T352360) (owner: 10Marostegui) [07:54:45] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [07:54:49] (03CR) 10Marostegui: [C: 03+2] phabricator.my.cnf.erb: Increase innodb_buffer_pool_size [puppet] - 10https://gerrit.wikimedia.org/r/978850 (https://phabricator.wikimedia.org/T352360) (owner: 10Marostegui) [08:00:06] Amir1, apergos, and jnuche: gettimeofday() says it's time for UTC morning backport and config training. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T0800)
[08:00:06] aanzx: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:36] uh, there is no patch listed on the Deployment calendar actually
[08:00:53] no trainees signed up either
[08:01:08] I removed mine
[08:01:13] okey dokey
[08:01:34] in that case, have a quiet day everyone and see you all next time!
[08:02:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53957 and previous config saved to /var/cache/conftool/dbconfig/20231130-080220-root.json
[08:02:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53958 and previous config saved to /var/cache/conftool/dbconfig/20231130-080222-root.json
[08:09:43] (PS1) Marostegui: pc1014: Move it to pc1 [puppet] - https://gerrit.wikimedia.org/r/978963 (https://phabricator.wikimedia.org/T352244)
[08:10:32] (CR) Marostegui: [C: +2] pc1014: Move it to pc1 [puppet] - https://gerrit.wikimedia.org/r/978963 (https://phabricator.wikimedia.org/T352244) (owner: Marostegui)
[08:12:34] SRE, Data-Platform-SRE: Harden the netboot configuration against typos - https://phabricator.wikimedia.org/T351059 (brouberol) Thanks @MoritzMuehlenhoff !
[08:14:22] (CR) Brouberol: [C: +1] "Nice! This will definitely be useful."
[deployment-charts] - https://gerrit.wikimedia.org/r/978504 (https://phabricator.wikimedia.org/T264625) (owner: Btullis)
[08:17:10] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[08:17:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53959 and previous config saved to /var/cache/conftool/dbconfig/20231130-081726-root.json
[08:17:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53960 and previous config saved to /var/cache/conftool/dbconfig/20231130-081727-root.json
[08:19:09] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:20:26] (PS3) Brouberol: Mention the fact that 'history' is a valid argument in the error message [docker-images/production-images] - https://gerrit.wikimedia.org/r/977994 (https://phabricator.wikimedia.org/T351722)
[08:21:17] (PS1) Muehlenhoff: ganeti: Switch drmrs clusters to PKI [puppet] - https://gerrit.wikimedia.org/r/978965 (https://phabricator.wikimedia.org/T350686)
[08:23:21] (PS2) Clare Ming: Add stream config for *uiactionstracking via Metrics Platform [mediawiki-config] - https://gerrit.wikimedia.org/r/978947 (https://phabricator.wikimedia.org/T351298)
[08:24:58] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:28:05] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host druid1010.eqiad.wmnet with OS bullseye
[08:30:11] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2
(protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:30:43] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:31:07] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:32:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53961 and previous config saved to /var/cache/conftool/dbconfig/20231130-083231-root.json
[08:32:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53962 and previous config saved to /var/cache/conftool/dbconfig/20231130-083232-root.json
[08:32:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:32:55] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:33:10] (CR) Elukey: [C: +1] ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/978651 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[08:34:47] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:35:07] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs.
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:37:31] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:38:05] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:39:15] (CR) Kevin Bazira: [C: +2] "Thanks for the review :)" [deployment-charts] - https://gerrit.wikimedia.org/r/978651 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[08:40:10] (Merged) jenkins-bot: ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/978651 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[08:40:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126 T352362', diff saved to https://phabricator.wikimedia.org/P53963 and previous config saved to /var/cache/conftool/dbconfig/20231130-084015-root.json
[08:40:21] T352362: decommission db1126.eqiad.wmnet - https://phabricator.wikimedia.org/T352362
[08:40:47] (CR) Muehlenhoff: [C: +2] ganeti: Switch drmrs clusters to PKI [puppet] - https://gerrit.wikimedia.org/r/978965 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff)
[08:41:03] (PS1) Marostegui: db1126: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/979018 (https://phabricator.wikimedia.org/T352362)
[08:41:42] (CR) Marostegui: [C: +2] db1126: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/979018 (https://phabricator.wikimedia.org/T352362) (owner: Marostegui)
[08:42:29] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs.
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:42:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:43:47] (PS1) Marostegui: dbproxy1025: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/979021 (https://phabricator.wikimedia.org/T351864)
[08:44:06] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[08:44:50] (CR) Marostegui: [C: +2] dbproxy1025: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/979021 (https://phabricator.wikimedia.org/T351864) (owner: Marostegui)
[08:44:56] (PS1) Marostegui: instances.yaml: Remove db1126 from dbctl [puppet] - https://gerrit.wikimedia.org/r/979022 (https://phabricator.wikimedia.org/T352362)
[08:45:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1025.eqiad.wmnet with OS bookworm
[08:45:44] (CR) Marostegui: [C: +2] instances.yaml: Remove db1126 from dbctl [puppet] - https://gerrit.wikimedia.org/r/979022 (https://phabricator.wikimedia.org/T352362) (owner: Marostegui)
[08:46:53] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:46:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1126 from dbctl T352362', diff saved to https://phabricator.wikimedia.org/P53964 and previous config saved to /var/cache/conftool/dbconfig/20231130-084655-marostegui.json
[08:47:01] T352362: decommission db1126.eqiad.wmnet - https://phabricator.wikimedia.org/T352362
[08:47:16] (PS1) Marostegui: Revert "dbproxy1025: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/978525
[08:47:19] !log
jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM install6002.wikimedia.org
[08:47:22] (CR) Marostegui: [C: -2] "Not yet" [puppet] - https://gerrit.wikimedia.org/r/978525 (owner: Marostegui)
[08:47:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53965 and previous config saved to /var/cache/conftool/dbconfig/20231130-084737-root.json
[08:48:53] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:49:07] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:49:27] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:49:35] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:52:48] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1010.eqiad.wmnet with reason: host reimage
[08:53:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM install6002.wikimedia.org
[08:54:20] SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[08:54:27] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:54:39] RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:54:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netflow6001.drmrs.wmnet
[08:55:17] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:55:43] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1010.eqiad.wmnet with reason: host reimage
[08:56:18] (CR) JMeybohm: [C: +1] istio: upgrade Docker images to 1.15.7-2 [deployment-charts] - https://gerrit.wikimedia.org/r/978637 (https://phabricator.wikimedia.org/T351933) (owner: Elukey)
[08:57:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1025.eqiad.wmnet with reason: host reimage
[08:58:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow6001.drmrs.wmnet
[09:00:05] hashar and jeena: Time to snap out of that daydream and deploy MediaWiki train - Utc-0+Utc-7 Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T0900).
[09:01:58] (PS1) Marostegui: pc1014: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/979023
[09:02:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53966 and previous config saved to /var/cache/conftool/dbconfig/20231130-090242-root.json
[09:02:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1025.eqiad.wmnet with reason: host reimage
[09:04:31] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:05:02] (CR) Marostegui: [C: +2] pc1014: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/979023 (owner: Marostegui)
[09:06:12] I am going to run the MediaWiki train
[09:06:59] (PS1) TrainBranchBot: group2 wikis to 1.42.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/979024 (https://phabricator.wikimedia.org/T350083)
[09:07:02] (CR) TrainBranchBot: [C: +2] group2 wikis to 1.42.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/979024 (https://phabricator.wikimedia.org/T350083) (owner: TrainBranchBot)
[09:07:44] (Merged) jenkins-bot: group2 wikis to 1.42.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/979024 (https://phabricator.wikimedia.org/T350083) (owner: TrainBranchBot)
[09:07:45] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:07:45] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 127, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:09:10] (CR) Slyngshede: [C: +2] RAID - Add instance name to MD RAID alert summary [alerts] - https://gerrit.wikimedia.org/r/978485 (owner: Slyngshede)
[09:10:05] SRE, Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (MoritzMuehlenhoff)
[09:10:55] (Merged) jenkins-bot: RAID - Add instance name to MD RAID alert summary [alerts] - https://gerrit.wikimedia.org/r/978485 (owner: Slyngshede)
[09:13:06] (PS1) Muehlenhoff: ganeti: Switch esams to PKI [puppet] - https://gerrit.wikimedia.org/r/979025 (https://phabricator.wikimedia.org/T350686)
[09:13:25] (CR) Muehlenhoff: [C: -2] "Not yet ready" [puppet] - https://gerrit.wikimedia.org/r/978615 (owner: Muehlenhoff)
[09:13:56] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (Clement_Goubert) Thanks!
[09:15:06] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.7 refs T350083
[09:15:13] T350083: 1.42.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T350083
[09:18:02] SRE, serviceops: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (Clement_Goubert)
[09:18:51] SRE, serviceops: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (Clement_Goubert) p:Triage→Medium
[09:19:00] (PS2) Muehlenhoff: ganeti: Switch esams to PKI [puppet] - https://gerrit.wikimedia.org/r/979025 (https://phabricator.wikimedia.org/T350686)
[09:19:11] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1010.eqiad.wmnet with OS bullseye
[09:21:19] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:22:49] (CR) Muehlenhoff: [C: +2] ganeti: Switch esams to PKI [puppet] - https://gerrit.wikimedia.org/r/979025 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff)
[09:24:38] (PS2) Marostegui: Revert "dbproxy1025: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/978525
[09:24:43] (CR) Marostegui: Revert "dbproxy1025: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/978525 (owner: Marostegui)
[09:25:05] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - No response from remote host 185.15.59.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:25:09] (CR) Marostegui: [C: +2] Revert "dbproxy1025: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/978525 (owner: Marostegui)
[09:25:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1025.eqiad.wmnet with OS bookworm
[09:26:45] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs.
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:29:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netflow3003.esams.wmnet
[09:31:14] (CR) Vgutierrez: [V: +1 C: +2] ncredir: Enable IPIP encapsulation on codfw [puppet] - https://gerrit.wikimedia.org/r/978624 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[09:31:20] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[09:33:39] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2033.codfw.wmnet with reason: reboot
[09:33:55] !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on es2033.codfw.wmnet with reason: reboot
[09:34:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow3003.esams.wmnet
[09:35:08] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2033.codfw.wmnet with reason: reboot
[09:35:12] !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on es2033.codfw.wmnet with reason: reboot
[09:36:06] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2033.codfw.wmnet with reason: reboot
[09:36:09] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2033.codfw.wmnet with reason: reboot
[09:37:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'change es2 master to es2026 as es2033 is rebooting', diff saved to https://phabricator.wikimedia.org/P53967 and previous config saved to /var/cache/conftool/dbconfig/20231130-093740-arnaudb.json
[09:38:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 is depooled', diff saved to https://phabricator.wikimedia.org/P53968 and previous config saved to
/var/cache/conftool/dbconfig/20231130-093814-arnaudb.json
[09:39:01] (JobUnavailable) firing: (10) Reduced availability for job lvs_realserver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:39:16] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2048:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2048 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:44:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ncredir3003.esams.wmnet
[09:44:16] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2048:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2048 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:46:23] (PS1) Clément Goubert: kubernetes20[57-60]: Add to codfw.k8s [homer/public] - https://gerrit.wikimedia.org/r/979029 (https://phabricator.wikimedia.org/T352369)
[09:46:32] (PS1) Clément Goubert: wikikube: put kubernetes20[57-60] in production [puppet] - https://gerrit.wikimedia.org/r/979030 (https://phabricator.wikimedia.org/T352369)
[09:46:36] (PS1) Clément Goubert: wikikube: add kubernetes20[57-60] to LVS [puppet] - https://gerrit.wikimedia.org/r/979031 (https://phabricator.wikimedia.org/T352369)
[09:48:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ncredir3003.esams.wmnet
[09:51:02] (CR) Vgutierrez: [V: +1 C: +2] hiera: Enable IPIP on codfw text|secondary LVS [puppet] - https://gerrit.wikimedia.org/r/978625 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[09:51:20] (PS2) Vgutierrez: hiera: Enable IPIP on
codfw text|secondary LVS [puppet] - https://gerrit.wikimedia.org/r/978625 (https://phabricator.wikimedia.org/T351069)
[09:53:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 10%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53969 and previous config saved to /var/cache/conftool/dbconfig/20231130-095325-arnaudb.json
[09:59:01] (JobUnavailable) firing: (9) Reduced availability for job lvs_realserver in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:59:19] !log rolling restart of pybal on lvs2011 and lvs2014, effectively enabling IPIP encapsulation on ncredir@codfw - T351069
[09:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:24] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069
[09:59:32] volans, Amir1 ^^ blame me if anything pages
[10:00:31] vgutierrez: ack
[10:01:56] all good apparently :)
[10:03:11] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[10:03:26] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[10:03:27] nice!
[10:03:44] (CR) Muehlenhoff: [C: +1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/978056 (https://phabricator.wikimedia.org/T351139) (owner: Slyngshede)
[10:05:13] (CR) Cathal Mooney: "Looks good to me.
Some of the finer points of the logic to build the server lists I don't grok fully, but overall I'm happy with the appr" [software/homer/deploy] - https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[10:06:34] (CR) Muehlenhoff: [C: +2] gerrit : Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/952457 (owner: Muehlenhoff)
[10:07:16] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2041:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2041 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:08:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 20%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53970 and previous config saved to /var/cache/conftool/dbconfig/20231130-100830-arnaudb.json
[10:09:03] (PS1) Alexandros Kosiaris: rdb101[34]: Set them up as redis misc replicas [puppet] - https://gerrit.wikimedia.org/r/979034 (https://phabricator.wikimedia.org/T326171)
[10:09:07] (PS1) Alexandros Kosiaris: Update references to rdb1010 to point to rdb1014 [puppet] - https://gerrit.wikimedia.org/r/979035 (https://phabricator.wikimedia.org/T326171)
[10:09:11] (PS1) Alexandros Kosiaris: Switch rdb1009 replicas to replicate from rdb1013 [puppet] - https://gerrit.wikimedia.org/r/979036 (https://phabricator.wikimedia.org/T326171)
[10:09:15] (PS1) Alexandros Kosiaris: Promote rdb1013 to master, drop rdb1009, rdb1010 [puppet] - https://gerrit.wikimedia.org/r/979037 (https://phabricator.wikimedia.org/T326171)
[10:12:16] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2041:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2041 -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:12:49] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[10:13:02] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[10:14:58] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[10:22:17] (CR) Clément Goubert: [C: +1] mediawiki::php: Set php-common version dependent on OS [puppet] - https://gerrit.wikimedia.org/r/978540 (owner: Muehlenhoff)
[10:22:22] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[10:22:49] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[10:22:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T348183)', diff saved to https://phabricator.wikimedia.org/P53971 and previous config saved to /var/cache/conftool/dbconfig/20231130-102255-arnaudb.json
[10:23:00] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[10:23:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 30%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53972 and previous config saved to /var/cache/conftool/dbconfig/20231130-102336-arnaudb.json
[10:24:56] (PS1) Ayounsi: Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152)
[10:26:54] (CR) Hnowlan: [C: +1] wikikube: put kubernetes20[57-60] in
production [puppet] - https://gerrit.wikimedia.org/r/979030 (https://phabricator.wikimedia.org/T352369) (owner: Clément Goubert)
[10:27:16] (CR) Hnowlan: [C: +1] wikikube: add kubernetes20[57-60] to LVS [puppet] - https://gerrit.wikimedia.org/r/979031 (https://phabricator.wikimedia.org/T352369) (owner: Clément Goubert)
[10:28:18] (PS2) Muehlenhoff: ceph::server: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/945785
[10:28:52] (CR) Hnowlan: [C: +1] kubernetes20[57-60]: Add to codfw.k8s [homer/public] - https://gerrit.wikimedia.org/r/979029 (https://phabricator.wikimedia.org/T352369) (owner: Clément Goubert)
[10:30:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T348183)', diff saved to https://phabricator.wikimedia.org/P53973 and previous config saved to /var/cache/conftool/dbconfig/20231130-103004-arnaudb.json
[10:30:13] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[10:30:45] (PS1) Muehlenhoff: Configure rdb1013/rdb1014 for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/979041 (https://phabricator.wikimedia.org/T349619)
[10:32:57] (CR) CI reject: [V: -1] Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: Ayounsi)
[10:34:38] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/945785 (owner: Muehlenhoff)
[10:37:10] (PS1) Arnaudb: mariadb: toggle notifications [puppet] - https://gerrit.wikimedia.org/r/978652 (https://phabricator.wikimedia.org/T343674)
[10:37:19] (CR) Volans: [C: -1] "The idea looks ok, some comments inline.
Also missing tests ;)" [software/spicerack] - https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: Ayounsi)
[10:37:27] (CR) Muehlenhoff: [C: +2] Configure rdb1013/rdb1014 for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/979041 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[10:37:30] SRE, Wikimedia-Mailing-lists: Request for mailing list - Wikimedia Bénin user group - https://phabricator.wikimedia.org/T352285 (Mh-3110) Many thanks
[10:38:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 40%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53974 and previous config saved to /var/cache/conftool/dbconfig/20231130-103841-arnaudb.json
[10:39:59] (PS5) Effie Mouzeli: (WIP) mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[10:40:38] (PS5) Arnaudb: debug: printing results when return object count > 1 [software/conftool] - https://gerrit.wikimedia.org/r/971437 (https://phabricator.wikimedia.org/T350656)
[10:40:43] (CR) CI reject: [V: -1] (WIP) mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[10:42:25] (CR) Clément Goubert: [C: +2] wikikube: put kubernetes20[57-60] in production [puppet] - https://gerrit.wikimedia.org/r/979030 (https://phabricator.wikimedia.org/T352369) (owner: Clément Goubert)
[10:42:33] (CR) Marostegui: [C: +1] mariadb: toggle notifications [puppet] - https://gerrit.wikimedia.org/r/978652 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[10:43:02] (CR) Arnaudb: [C: +2] mariadb: toggle notifications [puppet] - https://gerrit.wikimedia.org/r/978652 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[10:43:12] (CR) Clément Goubert: [C: +2] kubernetes20[57-60]: Add to codfw.k8s
[homer/public] - https://gerrit.wikimedia.org/r/979029 (https://phabricator.wikimedia.org/T352369) (owner: Clément Goubert)
[10:43:46] (Merged) jenkins-bot: kubernetes20[57-60]: Add to codfw.k8s [homer/public] - https://gerrit.wikimedia.org/r/979029 (https://phabricator.wikimedia.org/T352369) (owner: Clément Goubert)
[10:43:48] (CR) Muehlenhoff: [C: +2] ceph::server: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/945785 (owner: Muehlenhoff)
[10:45:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P53975 and previous config saved to /var/cache/conftool/dbconfig/20231130-104510-arnaudb.json
[10:48:20] (PS2) Klausman: ml-services/article-description: set OMP_NUM_THREADS=1 [deployment-charts] - https://gerrit.wikimedia.org/r/979042 (https://phabricator.wikimedia.org/T343123)
[10:48:40] (CR) Elukey: [C: +1] ml-services/article-description: set OMP_NUM_THREADS=1 [deployment-charts] - https://gerrit.wikimedia.org/r/979042 (https://phabricator.wikimedia.org/T343123) (owner: Klausman)
[10:50:22] (PS3) Klausman: ml-services: article-description set OMP_NUM_THREADS=1 [deployment-charts] - https://gerrit.wikimedia.org/r/979042 (https://phabricator.wikimedia.org/T343123)
[10:50:37] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2057.codfw.wmnet with OS bullseye
[10:50:47] SRE, serviceops, Patch-For-Review: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host kubernetes2057.codfw.wmnet with OS bullseye
[10:52:37] !log installing python-git security updates
[10:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 50%: Post reboot repooling', diff saved to
https://phabricator.wikimedia.org/P53976 and previous config saved to /var/cache/conftool/dbconfig/20231130-105346-arnaudb.json [10:53:55] (03PS1) 10Majavah: cloudlb: wikireplicas: fix timeouts [puppet] - 10https://gerrit.wikimedia.org/r/979045 (https://phabricator.wikimedia.org/T346947) [10:56:15] (03CR) 10Elukey: [C: 03+2] istio: upgrade Docker images to 1.15.7-2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/978637 (https://phabricator.wikimedia.org/T351933) (owner: 10Elukey) [10:56:45] (03PS2) 10Elukey: cert-manager: bump appVersion [deployment-charts] - 10https://gerrit.wikimedia.org/r/978640 (https://phabricator.wikimedia.org/T351933) [10:58:58] (03CR) 10Majavah: [C: 03+2] "Self-merging given this is relatively straightforward (just copied from the old wiki replica proxies) and will fix an user-facing issue. P" [puppet] - 10https://gerrit.wikimedia.org/r/979045 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [10:59:17] (03CR) 10Klausman: [C: 03+2] ml-services: article-description set OMP_NUM_THREADS=1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/979042 (https://phabricator.wikimedia.org/T343123) (owner: 10Klausman) [10:59:47] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2058.codfw.wmnet with OS bullseye [10:59:57] 10SRE, 10serviceops, 10Patch-For-Review: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host kubernetes2058.codfw.wmnet with OS bullseye [11:00:05] mvolz: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T1100). 
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T1100) [11:00:06] (03Merged) 10jenkins-bot: ml-services: article-description set OMP_NUM_THREADS=1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/979042 (https://phabricator.wikimedia.org/T343123) (owner: 10Klausman) [11:00:08] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2059.codfw.wmnet with OS bullseye [11:00:17] 10SRE, 10serviceops, 10Patch-For-Review: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host kubernetes2059.codfw.wmnet with OS bullseye [11:00:18] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P53977 and previous config saved to /var/cache/conftool/dbconfig/20231130-110017-arnaudb.json [11:00:27] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2060.codfw.wmnet with OS bullseye [11:00:36] 10SRE, 10serviceops, 10Patch-For-Review: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host kubernetes2060.codfw.wmnet with OS bullseye [11:01:38] !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[11:06:35] (03PS5) 10Brouberol: Add a unit-test that validates the structure of the preseed hieradata [puppet] - 10https://gerrit.wikimedia.org/r/979039 (https://phabricator.wikimedia.org/T351059) [11:08:52] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 60%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53978 and previous config saved to /var/cache/conftool/dbconfig/20231130-110851-arnaudb.json [11:11:23] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2057.codfw.wmnet with reason: host reimage [11:11:49] (03CR) 10FNegri: [C: 03+2] [openstack] Upgrade all remaining hosts to Antelope [puppet] - 10https://gerrit.wikimedia.org/r/978636 (https://phabricator.wikimedia.org/T348843) (owner: 10FNegri) [11:11:52] (03PS6) 10Brouberol: Add a unit-test that validates the structure of the preseed hieradata [puppet] - 10https://gerrit.wikimedia.org/r/979039 (https://phabricator.wikimedia.org/T351059) [11:12:51] (03CR) 10Awight: "cold review: I think the application config is missing?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/948551 (https://phabricator.wikimedia.org/T231006) (owner: 10Effie Mouzeli) [11:14:20] (03PS7) 10Brouberol: Add a unit-test that validates the structure of the preseed hieradata [puppet] - 10https://gerrit.wikimedia.org/r/979039 (https://phabricator.wikimedia.org/T351059) [11:14:35] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2057.codfw.wmnet with reason: host reimage [11:15:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T348183)', diff saved to https://phabricator.wikimedia.org/P53979 and previous config saved to /var/cache/conftool/dbconfig/20231130-111524-arnaudb.json [11:15:26] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [11:15:35] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [11:15:40] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [11:15:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T348183)', diff saved to https://phabricator.wikimedia.org/P53980 and previous config saved to /var/cache/conftool/dbconfig/20231130-111546-arnaudb.json [11:19:11] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2058.codfw.wmnet with reason: host reimage [11:20:13] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2060.codfw.wmnet with reason: host reimage [11:22:07] (03CR) 10Effie Mouzeli: (WIP) kartotherian: add kartotherian chart (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/948551 (https://phabricator.wikimedia.org/T231006) (owner: 10Effie Mouzeli) [11:22:31] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
2:00:00 on kubernetes2058.codfw.wmnet with reason: host reimage [11:22:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T348183)', diff saved to https://phabricator.wikimedia.org/P53981 and previous config saved to /var/cache/conftool/dbconfig/20231130-112258-arnaudb.json [11:23:03] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [11:23:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 70%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53982 and previous config saved to /var/cache/conftool/dbconfig/20231130-112356-arnaudb.json [11:25:22] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2060.codfw.wmnet with reason: host reimage [11:25:30] (03CR) 10Jbond: [C: 03+1] "lgtm but see inline" [puppet] - 10https://gerrit.wikimedia.org/r/979039 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [11:25:34] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2059.codfw.wmnet with reason: host reimage [11:26:47] (03PS2) 10Alexandros Kosiaris: rdb101[34]: Set them up as redis misc replicas [puppet] - 10https://gerrit.wikimedia.org/r/979034 (https://phabricator.wikimedia.org/T326171) [11:26:49] (03PS2) 10Alexandros Kosiaris: Update references to rdb1010 to point to rdb1014 [puppet] - 10https://gerrit.wikimedia.org/r/979035 (https://phabricator.wikimedia.org/T326171) [11:26:51] (03PS2) 10Alexandros Kosiaris: Switch rdb1009 replicas to replicate from rdb1013 [puppet] - 10https://gerrit.wikimedia.org/r/979036 (https://phabricator.wikimedia.org/T326171) [11:26:53] (03PS2) 10Alexandros Kosiaris: Promote rdb1013 to master, drop rdb1009, rdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/979037 (https://phabricator.wikimedia.org/T326171) [11:27:33] (03PS1) 10Muehlenhoff: ganeti: Switch codfw to PKI [puppet] - 
10https://gerrit.wikimedia.org/r/979050 (https://phabricator.wikimedia.org/T350686) [11:27:35] (03PS6) 10Effie Mouzeli: (WIP) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [11:28:37] (03CR) 10CI reject: [V: 04-1] (WIP) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:28:48] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2059.codfw.wmnet with reason: host reimage [11:28:55] (03PS1) 10Hnowlan: changeprop-jobqueue: migrate more lower-impact jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/979051 (https://phabricator.wikimedia.org/T349796) [11:31:59] (03PS7) 10Effie Mouzeli: (WIP) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [11:32:39] (03CR) 10CI reject: [V: 04-1] (WIP) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:33:51] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2057.codfw.wmnet with OS bullseye [11:34:05] 10SRE, 10serviceops, 10Patch-For-Review: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host kubernetes2057.codfw.wmnet with OS bullseye completed: - kubernetes2057 (**PASS**) - Down... 
[11:34:31] (KubernetesAPINotScrapable) resolved: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [11:35:22] (03CR) 10Clément Goubert: [C: 03+1] changeprop-jobqueue: migrate more lower-impact jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/979051 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [11:36:01] (03PS8) 10Brouberol: Add a unit-test that validates the structure of the preseed hieradata [puppet] - 10https://gerrit.wikimedia.org/r/979039 (https://phabricator.wikimedia.org/T351059) [11:36:16] (03CR) 10Brouberol: Add a unit-test that validates the structure of the preseed hieradata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979039 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [11:38:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P53983 and previous config saved to /var/cache/conftool/dbconfig/20231130-113804-arnaudb.json [11:38:31] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (10Clement_Goubert) [11:39:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 80%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53984 and previous config saved to /var/cache/conftool/dbconfig/20231130-113901-arnaudb.json [11:40:44] (03PS8) 10Effie Mouzeli: (WIP) mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [11:43:06] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2058.codfw.wmnet with OS bullseye [11:43:16] 10SRE, 10serviceops, 10Patch-For-Review: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host kubernetes2058.codfw.wmnet with OS bullseye completed: - kubernetes2058 (**PASS**) - Down... [11:45:14] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2060.codfw.wmnet with OS bullseye [11:45:22] (03CR) 10Cathal Mooney: [C: 03+2] Reverse logic to select correct virtual console serial mode [cookbooks] - 10https://gerrit.wikimedia.org/r/978127 (owner: 10Cathal Mooney) [11:45:24] 10SRE, 10serviceops, 10Patch-For-Review: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host kubernetes2060.codfw.wmnet with OS bullseye completed: - kubernetes2060 (**PASS**) - Down... [11:50:58] (03Merged) 10jenkins-bot: Reverse logic to select correct virtual console serial mode [cookbooks] - 10https://gerrit.wikimedia.org/r/978127 (owner: 10Cathal Mooney) [11:53:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P53985 and previous config saved to /var/cache/conftool/dbconfig/20231130-115312-arnaudb.json [11:54:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 90%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53986 and previous config saved to /var/cache/conftool/dbconfig/20231130-115406-arnaudb.json [11:54:11] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: migrate more lower-impact jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/979051 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [11:55:02] (03Merged) 10jenkins-bot: changeprop-jobqueue: migrate more lower-impact jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/979051 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [12:02:04] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host kubernetes2059.codfw.wmnet with OS bullseye [12:02:13] 10SRE, 10serviceops, 10Patch-For-Review: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host kubernetes2059.codfw.wmnet with OS bullseye completed: - kubernetes2059 (**WARN**) - Down... [12:02:14] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [12:02:30] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [12:03:32] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [12:03:55] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [12:06:07] !log Running homer 'cr*codfw*' commit T352369 [12:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:12] T352369: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 [12:06:16] (03PS1) 10Majavah: openstack: spreadcheck: remove in favour of server groups [puppet] - 10https://gerrit.wikimedia.org/r/979056 (https://phabricator.wikimedia.org/T247213) [12:08:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T348183)', diff saved to https://phabricator.wikimedia.org/P53987 and previous config saved to /var/cache/conftool/dbconfig/20231130-120819-arnaudb.json [12:08:21] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [12:08:22] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [12:08:25] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [12:08:26] (03PS9) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) [12:08:35] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [12:08:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T348183)', diff saved to https://phabricator.wikimedia.org/P53988 and previous config saved to /var/cache/conftool/dbconfig/20231130-120841-arnaudb.json [12:08:45] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [12:08:58] (03CR) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [12:09:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 100%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53989 and previous config saved to /var/cache/conftool/dbconfig/20231130-120911-arnaudb.json [12:10:39] (03CR) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [12:11:02] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/782/console" [puppet] - 10https://gerrit.wikimedia.org/r/979056 (https://phabricator.wikimedia.org/T247213) (owner: 10Majavah) [12:12:32] (03CR) 10Clément Goubert: [C: 03+2] wikikube: add kubernetes20[57-60] to LVS [puppet] - 10https://gerrit.wikimedia.org/r/979031 (https://phabricator.wikimedia.org/T352369) (owner: 10Clément Goubert) [12:13:02] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/783/con" [puppet] - 
10https://gerrit.wikimedia.org/r/979056 (https://phabricator.wikimedia.org/T247213) (owner: 10Majavah) [12:15:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T348183)', diff saved to https://phabricator.wikimedia.org/P53990 and previous config saved to /var/cache/conftool/dbconfig/20231130-121554-arnaudb.json [12:16:00] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [12:18:00] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:19:18] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:21:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] rdb101[34]: Set them up as redis misc replicas [puppet] - 10https://gerrit.wikimedia.org/r/979034 (https://phabricator.wikimedia.org/T326171) (owner: 10Alexandros Kosiaris) [12:22:34] !log Pooling kubernetes20(5[4789]|60).codfw.wmnet - T352369 [12:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:39] T352369: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 [12:24:25] !log Uncordoning kubernetes20(5[4789]|60).codfw.wmnet - T352369 [12:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:36] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/978654 [12:26:05] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernetes2059.codfw.wmnet [12:26:05] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes2059.codfw.wmnet [12:27:12] (03PS6) 10MdsShakil: Create new namespaces and namespace aliases for bd.wikimedia.org [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/977196 (https://phabricator.wikimedia.org/T351903) [12:27:22] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2059 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:27:35] ^that's me, on it [12:28:07] (03PS1) 10Btullis: Bring an-coord1003 into service as a hadoop coordinator [puppet] - 10https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045) [12:28:40] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2059 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:29:01] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:29:36] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/784/con" [puppet] - 10https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis) [12:30:13] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (10Clement_Goubert) [12:31:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P53991 and previous config saved to /var/cache/conftool/dbconfig/20231130-123100-arnaudb.json [12:31:01] 10SRE, 10serviceops: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (10Clement_Goubert) 05Open→03Resolved Hosts are in production, resolving. 
[12:31:41] (03PS2) 10Btullis: Bring an-coord1003 into service as a hadoop coordinator [puppet] - 10https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045) [12:32:55] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:59] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:35:03] (03CR) 10Awight: (WIP) kartotherian: add kartotherian chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/948551 (https://phabricator.wikimedia.org/T231006) (owner: 10Effie Mouzeli) [12:36:40] (03PS1) 10Muehlenhoff: Add explicit Hiera records to mark the new coordinator nodes as running Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/979087 (https://phabricator.wikimedia.org/T336045) [12:37:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'change es2 master to es2033 after reboot', diff saved to https://phabricator.wikimedia.org/P53992 and previous config saved to /var/cache/conftool/dbconfig/20231130-123752-arnaudb.json [12:39:37] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2034.codfw.wmnet with reason: reboot [12:39:51] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2034.codfw.wmnet with reason: reboot [12:40:23] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:40:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'change es3 master to es2029 as es2034 will reboot', diff saved to https://phabricator.wikimedia.org/P53993 and previous config saved to 
/var/cache/conftool/dbconfig/20231130-124050-arnaudb.json [12:41:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 is depooled', diff saved to https://phabricator.wikimedia.org/P53994 and previous config saved to /var/cache/conftool/dbconfig/20231130-124110-arnaudb.json [12:42:22] (03CR) 10Muehlenhoff: [C: 03+2] Add explicit Hiera records to mark the new coordinator nodes as running Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/979087 (https://phabricator.wikimedia.org/T336045) (owner: 10Muehlenhoff) [12:42:59] (03PS1) 10Btullis: Add dummy keytabs for new hadoop coordinators [labs/private] - 10https://gerrit.wikimedia.org/r/979088 (https://phabricator.wikimedia.org/T336045) [12:43:01] (03CR) 10Awight: (WIP) kartotherian: add kartotherian chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/948551 (https://phabricator.wikimedia.org/T231006) (owner: 10Effie Mouzeli) [12:44:11] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:44:38] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add dummy keytabs for new hadoop coordinators [labs/private] - 10https://gerrit.wikimedia.org/r/979088 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis) [12:46:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P53995 and previous config saved to /var/cache/conftool/dbconfig/20231130-124607-arnaudb.json [12:46:24] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/786/con" [puppet] - 10https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis) [12:48:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 10%: Post reboot 
repooling', diff saved to https://phabricator.wikimedia.org/P53996 and previous config saved to /var/cache/conftool/dbconfig/20231130-124849-arnaudb.json [12:53:04] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host schema2004.codfw.wmnet [12:53:30] (03PS5) 10Btullis: Update default airflow_version and remove overrides [puppet] - 10https://gerrit.wikimedia.org/r/977638 (https://phabricator.wikimedia.org/T351621) [12:54:16] (03PS1) 10Muehlenhoff: Switch schema2004 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/979090 (https://phabricator.wikimedia.org/T349619) [12:56:03] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [12:56:50] (03CR) 10Muehlenhoff: [C: 03+2] Switch schema2004 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/979090 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:57:53] (03PS3) 10Btullis: Bring an-coord1003 into service as a hadoop coordinator [puppet] - 10https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045) [12:59:42] !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T1300) [13:00:09] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/787/con" [puppet] - 10https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis) [13:00:18] (03PS1) 10JMeybohm: Add new mesh module versions: certificate, configuration, deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/979094 (https://phabricator.wikimedia.org/T300033) [13:00:20] (03PS1) 10JMeybohm: Remove cergen certificate support from mesh module 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/979095 (https://phabricator.wikimedia.org/T300033) [13:00:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/979039 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:01:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T348183)', diff saved to https://phabricator.wikimedia.org/P53997 and previous config saved to /var/cache/conftool/dbconfig/20231130-130113-arnaudb.json [13:01:15] (03CR) 10CI reject: [V: 04-1] Remove cergen certificate support from mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/979095 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [13:01:16] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [13:01:30] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [13:01:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T348183)', diff saved to https://phabricator.wikimedia.org/P53998 and previous config saved to /var/cache/conftool/dbconfig/20231130-130136-arnaudb.json [13:01:40] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL [13:01:41] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [13:02:56] (03PS1) 10Marostegui: mariadb: Remove db1126 [puppet] - 10https://gerrit.wikimedia.org/r/979096 (https://phabricator.wikimedia.org/T352362) [13:03:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1126.eqiad.wmnet [13:03:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 20%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53999 and previous config saved 
to /var/cache/conftool/dbconfig/20231130-130354-arnaudb.json [13:04:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host schema2004.codfw.wmnet [13:05:20] !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host elastic2108.mgmt.codfw.wmnet with reboot policy FORCED [13:06:03] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2108.mgmt.codfw.wmnet with reboot policy FORCED [13:06:25] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove db1126 [puppet] - 10https://gerrit.wikimedia.org/r/979096 (https://phabricator.wikimedia.org/T352362) (owner: 10Marostegui) [13:08:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T348183)', diff saved to https://phabricator.wikimedia.org/P54000 and previous config saved to /var/cache/conftool/dbconfig/20231130-130851-arnaudb.json [13:09:07] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [13:09:12] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [13:10:40] (03CR) 10Brouberol: [C: 03+2] Add a unit-test that validates the structure of the preseed hieradata [puppet] - 10https://gerrit.wikimedia.org/r/979039 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:11:05] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1126.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [13:12:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1126.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [13:12:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:12:08] !log marostegui@cumin1001 END 
(PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1126.eqiad.wmnet [13:13:19] 10ops-eqiad, 10DBA, 10DC-Ops, 10decommission-hardware: decommission db1126.eqiad.wmnet - https://phabricator.wikimedia.org/T352362 (10Marostegui) a:03Jclark-ctr This is ready for #dc-ops [13:14:25] 10ops-eqiad, 10DBA, 10DC-Ops, 10decommission-hardware: decommission db1126.eqiad.wmnet - https://phabricator.wikimedia.org/T352362 (10Marostegui) [13:15:02] (03CR) 10Brouberol: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis) [13:15:04] (03PS12) 10Marostegui: mariadb: cookbook draft to clone multiinstance [cookbooks] - 10https://gerrit.wikimedia.org/r/976709 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [13:19:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 30%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54001 and previous config saved to /var/cache/conftool/dbconfig/20231130-131859-arnaudb.json [13:21:12] (03PS1) 10Volans: CI: test apt_repo failures [puppet] - 10https://gerrit.wikimedia.org/r/979098 (https://phabricator.wikimedia.org/T351059) [13:21:19] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:23:41] (03PS2) 10JMeybohm: Add new mesh module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/979094 (https://phabricator.wikimedia.org/T300033) [13:23:43] (03PS2) 10JMeybohm: Remove cergen certificate support from mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/979095 (https://phabricator.wikimedia.org/T300033) [13:23:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P54002 and previous config 
saved to /var/cache/conftool/dbconfig/20231130-132357-arnaudb.json [13:24:28] (03CR) 10Stevemunene: "Should we consider including the other instances of Presto discovery_uri settings https://github.com/wikimedia/operations-puppet/blob/prod" [puppet] - 10https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis) [13:27:01] (03CR) 10JMeybohm: wikifunctions: Reduce drain time from 600s default to 60s (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/975873 (owner: 10Jforrester) [13:30:20] (03CR) 10JMeybohm: [C: 04-1] cert-manager: bump appVersion (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/978640 (https://phabricator.wikimedia.org/T351933) (owner: 10Elukey) [13:32:24] (03PS2) 10Ayounsi: Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) [13:32:40] (03CR) 10JMeybohm: Remove cergen certificate support from mesh module (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/979095 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [13:33:09] (03CR) 10Ayounsi: Netbox module: add get/set for primary IPs and access vlan (036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [13:33:12] 10SRE, 10ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T352357 (10phaultfinder) [13:34:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 40%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54003 and previous config saved to /var/cache/conftool/dbconfig/20231130-133404-arnaudb.json [13:35:46] (03CR) 10Stevemunene: [C: 03+1] "LGTM!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977994 (https://phabricator.wikimedia.org/T351722) (owner: 10Brouberol) [13:37:48] (03PS1) 10Elukey: profile::pyrra::filesystem: add Lift Wing SLO request use case [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390) [13:38:57] (03CR) 10CI reject: [V: 04-1] Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [13:39:00] (03PS9) 10Effie Mouzeli: (WIP)mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [13:39:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P54004 and previous config saved to /var/cache/conftool/dbconfig/20231130-133904-arnaudb.json [13:41:55] (03CR) 10Muehlenhoff: [C: 03+2] ganeti: Switch codfw to PKI [puppet] - 10https://gerrit.wikimedia.org/r/979050 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [13:43:47] (03CR) 10Alexandros Kosiaris: [C: 03+2] Update references to rdb1010 to point to rdb1014 [puppet] - 10https://gerrit.wikimedia.org/r/979035 (https://phabricator.wikimedia.org/T326171) (owner: 10Alexandros Kosiaris) [13:44:06] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/788/con" [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [13:44:12] (03PS3) 10Alexandros Kosiaris: Switch rdb1009 replicas to replicate from rdb1013 [puppet] - 10https://gerrit.wikimedia.org/r/979036 (https://phabricator.wikimedia.org/T326171) [13:44:55] (03CR) 10Alexandros Kosiaris: [C: 03+2] Switch rdb1009 replicas to replicate from rdb1013 [puppet] - 
10https://gerrit.wikimedia.org/r/979036 (https://phabricator.wikimedia.org/T326171) (owner: 10Alexandros Kosiaris) [13:45:28] (03PS2) 10Elukey: profile::pyrra::filesystem: add Lift Wing SLO latency use case [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390) [13:46:27] (03CR) 10Btullis: "I think we still have to mention the change in each changelog if we want it to build." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977994 (https://phabricator.wikimedia.org/T351722) (owner: 10Brouberol) [13:47:49] (03PS1) 10Alexandros Kosiaris: redis_lock: Switch from rdb1009 to rdb1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979101 (https://phabricator.wikimedia.org/T326171) [13:48:50] (03CR) 10Brouberol: Mention the fact that 'history' is a valid argument in the error message (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977994 (https://phabricator.wikimedia.org/T351722) (owner: 10Brouberol) [13:49:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 50%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54005 and previous config saved to /var/cache/conftool/dbconfig/20231130-134909-arnaudb.json [13:54:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T348183)', diff saved to https://phabricator.wikimedia.org/P54006 and previous config saved to /var/cache/conftool/dbconfig/20231130-135410-arnaudb.json [13:54:14] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [13:54:16] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [13:54:29] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [13:54:31] !log arnaudb@cumin1001 START - Cookbook 
sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:54:47] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:54:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1212 (T348183)', diff saved to https://phabricator.wikimedia.org/P54007 and previous config saved to /var/cache/conftool/dbconfig/20231130-135453-arnaudb.json [13:59:01] (JobUnavailable) firing: (8) Reduced availability for job lvs_realserver in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. 
[14:03:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T348183)', diff saved to https://phabricator.wikimedia.org/P54008 and previous config saved to /var/cache/conftool/dbconfig/20231130-140308-arnaudb.json [14:03:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM stewards2001.codfw.wmnet [14:03:25] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [14:04:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 60%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54009 and previous config saved to /var/cache/conftool/dbconfig/20231130-140414-arnaudb.json [14:05:54] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1199 - https://phabricator.wikimedia.org/T352238 (10VRiley-WMF) a:03VRiley-WMF [14:06:53] nothing to deploy indeed [14:07:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM stewards2001.codfw.wmnet [14:07:24] (03PS1) 10Alexandros Kosiaris: mediawiki: Add rdb1013, rdb1014, mark rdb1009 as deprecated [deployment-charts] - 10https://gerrit.wikimedia.org/r/979102 (https://phabricator.wikimedia.org/T326171) [14:07:28] (03PS1) 10Alexandros Kosiaris: changeprop: Add rdb1013, rdb1014, mark rdb1009 as deprecated [deployment-charts] - 10https://gerrit.wikimedia.org/r/979103 (https://phabricator.wikimedia.org/T326171) [14:07:32] (03PS1) 10Alexandros Kosiaris: api-gateway: Add rdb1013, rdb1014, mark rdb1009 as deprecated [deployment-charts] - 10https://gerrit.wikimedia.org/r/979104 (https://phabricator.wikimedia.org/T326171) [14:07:36] (03PS1) 10Alexandros Kosiaris: cp-jobqueue: Add rdb1013, rdb1014, mark rdb1009 as deprecated [deployment-charts] - 10https://gerrit.wikimedia.org/r/979105 (https://phabricator.wikimedia.org/T326171) [14:07:40] (03PS1) 10Alexandros Kosiaris: Remove rdb1009 unused references from repo [deployment-charts] - 
10https://gerrit.wikimedia.org/r/979106 (https://phabricator.wikimedia.org/T326171) [14:11:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2093.mgmt.codfw.wmnet with reboot policy FORCED [14:11:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2095.mgmt.codfw.wmnet with reboot policy FORCED [14:12:33] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10Papaul) a:03Jhancock.wm [14:12:56] (03PS1) 10Effie Mouzeli: (WIP) mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 [14:13:41] (03CR) 10CI reject: [V: 04-1] (WIP) mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (owner: 10Effie Mouzeli) [14:13:43] (03PS10) 10Effie Mouzeli: (WIP)mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [14:14:07] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic2095.mgmt.codfw.wmnet with reboot policy FORCED [14:14:12] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic2093.mgmt.codfw.wmnet with reboot policy FORCED [14:14:14] (03CR) 10Btullis: [C: 03+1] Mention the fact that 'history' is a valid argument in the error message [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977994 (https://phabricator.wikimedia.org/T351722) (owner: 10Brouberol) [14:14:59] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [14:15:19] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host planet2003.codfw.wmnet [14:15:44] !log jmm@cumin2002 END (FAIL) - Cookbook 
sre.puppet.migrate-host (exit_code=99) for host planet2003.codfw.wmnet [14:17:18] (03CR) 10Brouberol: [C: 03+2] Mention the fact that 'history' is a valid argument in the error message [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977994 (https://phabricator.wikimedia.org/T351722) (owner: 10Brouberol) [14:17:21] (03CR) 10Brouberol: [V: 03+2 C: 03+2] Mention the fact that 'history' is a valid argument in the error message [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977994 (https://phabricator.wikimedia.org/T351722) (owner: 10Brouberol) [14:18:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P54010 and previous config saved to /var/cache/conftool/dbconfig/20231130-141815-arnaudb.json [14:19:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 70%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54011 and previous config saved to /var/cache/conftool/dbconfig/20231130-141919-arnaudb.json [14:43:42] (03PS1) 10Matthias Mullie: No custom UW licensing config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979113 [14:43:51] (03CR) 10Matthias Mullie: [C: 04-1] No custom UW licensing config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979113 (owner: 10Matthias Mullie) [14:44:39] (03PS1) 10Ilias Sarantopoulos: ml-services: update article-desc image [deployment-charts] - 10https://gerrit.wikimedia.org/r/979114 (https://phabricator.wikimedia.org/T351940) [14:45:00] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: lower space-based retention to 2800GB [puppet] - 10https://gerrit.wikimedia.org/r/979110 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [14:45:21] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1106.mgmt.eqiad.wmnet with reboot policy FORCED [14:45:48] !log vriley@cumin1001 END (PASS) - Cookbook 
sre.hosts.provision (exit_code=0) for host elastic1105.mgmt.eqiad.wmnet with reboot policy FORCED [14:46:26] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1107.mgmt.eqiad.wmnet with reboot policy FORCED [14:48:13] 10SRE, 10ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T352357 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated lightning. alert cleared. if reoccurs, replace eth cable [14:48:22] !log roll-restart prometheus/ops in eqiad/codfw to apply new size-based retention - T351179 [14:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T348183)', diff saved to https://phabricator.wikimedia.org/P54015 and previous config saved to /var/cache/conftool/dbconfig/20231130-144831-arnaudb.json [14:48:34] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [14:48:38] T351179: LVM vg0 close to getting full on prometheus eqiad - https://phabricator.wikimedia.org/T351179 [14:48:48] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [14:48:52] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [14:48:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1223 (T348183)', diff saved to https://phabricator.wikimedia.org/P54016 and previous config saved to /var/cache/conftool/dbconfig/20231130-144854-arnaudb.json [14:48:59] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1105'] [14:49:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 90%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54017 and previous config saved to 
/var/cache/conftool/dbconfig/20231130-144929-arnaudb.json [14:50:01] (03PS2) 10Jforrester: wikifunctions: Reduce helm deploy timeout from 600s default to 120s [deployment-charts] - 10https://gerrit.wikimedia.org/r/975873 [14:50:03] (03CR) 10Jforrester: wikifunctions: Reduce helm deploy timeout from 600s default to 120s (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/975873 (owner: 10Jforrester) [14:50:11] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic1105'] [14:50:31] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1105'] [14:53:13] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1105'] [14:53:49] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1106'] [14:53:58] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:01] (MDRAIDFailedDisk) resolved: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [14:54:01] (PuppetFailure) resolved: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:54:11] (03PS1) 10Majavah: P:cache::haproxy: make http_proxy optional [puppet] - 10https://gerrit.wikimedia.org/r/979115 [14:54:32] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic1106'] [14:54:59] (JobUnavailable) 
firing: (9) Reduced availability for job lvs_realserver in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:58] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/792/console" [puppet] - 10https://gerrit.wikimedia.org/r/979115 (owner: 10Majavah) [14:56:04] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:57:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T348183)', diff saved to https://phabricator.wikimedia.org/P54018 and previous config saved to /var/cache/conftool/dbconfig/20231130-145707-arnaudb.json [14:57:15] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [14:57:59] (03CR) 10Vgutierrez: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/979115 (owner: 10Majavah) [14:58:11] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:cache::haproxy: make http_proxy optional [puppet] - 10https://gerrit.wikimedia.org/r/979115 (owner: 10Majavah) [14:59:01] (JobUnavailable) firing: (9) Reduced availability for job lvs_realserver in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:59] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:00:22] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [15:01:04] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:02:44] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:01] (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:04:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 100%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54019 and previous config saved to 
/var/cache/conftool/dbconfig/20231130-150434-arnaudb.json [15:06:20] (03PS1) 10Brouberol: Mention topic/cluster in the kafka replication factor alert message [alerts] - 10https://gerrit.wikimedia.org/r/979116 [15:07:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'change es3 master back to es2034', diff saved to https://phabricator.wikimedia.org/P54020 and previous config saved to /var/cache/conftool/dbconfig/20231130-150712-arnaudb.json [15:07:26] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1103.mgmt.eqiad.wmnet with reboot policy FORCED [15:08:22] matthiasmullie: way too late, but nothing going on AFAIK, yes [15:08:43] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1105'] [15:09:34] (03CR) 10Btullis: [C: 03+1] "Nice." [alerts] - 10https://gerrit.wikimedia.org/r/979116 (owner: 10Brouberol) [15:11:18] (03CR) 10Herron: "Nice one, great to see a latency SLO coming onboard! Please see comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [15:11:22] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/979117 [15:12:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P54021 and previous config saved to /var/cache/conftool/dbconfig/20231130-151214-arnaudb.json [15:13:12] (03CR) 10Brouberol: [C: 03+2] Mention topic/cluster in the kafka replication factor alert message [alerts] - 10https://gerrit.wikimedia.org/r/979116 (owner: 10Brouberol) [15:14:26] (03Merged) 10jenkins-bot: Mention topic/cluster in the kafka replication factor alert message [alerts] - 10https://gerrit.wikimedia.org/r/979116 (owner: 10Brouberol) [15:15:34] (03PS3) 10Elukey: profile::pyrra::filesystem: add Lift Wing SLO latency use case [puppet] - 10https://gerrit.wikimedia.org/r/979099 
(https://phabricator.wikimedia.org/T351390) [15:16:14] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/979117 (owner: 10Muehlenhoff) [15:17:39] (03CR) 10Elukey: profile::pyrra::filesystem: add Lift Wing SLO latency use case (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [15:18:55] (03PS4) 10Elukey: profile::pyrra::filesystem: add Lift Wing SLO latency use case [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390) [15:19:30] hello [15:20:05] (03CR) 10Herron: profile::pyrra::filesystem: add Lift Wing SLO latency use case (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [15:21:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10Jclark-ctr) [15:21:25] !log installing libbsd bugfix updates from Bullseye point release [15:21:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10Jclark-ctr) 05Open→03Resolved [15:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P54022 and previous config saved to /var/cache/conftool/dbconfig/20231130-152721-arnaudb.json [15:29:39] (03CR) 10Herron: [C: 03+1] profile::pyrra::filesystem: add Lift Wing SLO latency use case (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [15:30:01] (03CR) 10Elukey: "Had a chat with Keith on IRC, we are good to go :)" [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [15:30:31] (03CR) 
10Herron: [C: 03+1] profile::pyrra::filesystem: add Lift Wing SLO latency use case (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [15:30:54] (03CR) 10Elukey: [C: 03+2] profile::pyrra::filesystem: add Lift Wing SLO latency use case [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [15:31:20] !log installing dbus security updates on buster [15:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:24] moritzm: ok to merge? [15:31:59] yes nothing dangerous afaics :) [15:32:33] oh sorry, yes please go ahead [15:33:24] (03PS2) 10Ilias Sarantopoulos: ml-services: update article-desc image [deployment-charts] - 10https://gerrit.wikimedia.org/r/979114 (https://phabricator.wikimedia.org/T351940) [15:33:39] !log clean-up /etc/hosts on A:dns-rec to remove entries populated by host_core: T347054 [15:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:47] T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 [15:36:34] !log installing minizip security updates [15:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:56] (03PS1) 10Aqu: Rewrite metrics sent by Airflow [puppet] - 10https://gerrit.wikimedia.org/r/979118 (https://phabricator.wikimedia.org/T349532) [15:42:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T348183)', diff saved to https://phabricator.wikimedia.org/P54023 and previous config saved to /var/cache/conftool/dbconfig/20231130-154227-arnaudb.json [15:42:31] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [15:42:46] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT 
- https://phabricator.wikimedia.org/T348183 [15:42:46] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [15:50:57] (03CR) 10Muehlenhoff: [C: 03+2] mediawiki::php: Set php-common version dependent on OS [puppet] - 10https://gerrit.wikimedia.org/r/978540 (owner: 10Muehlenhoff) [15:52:07] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [15:52:17] (03PS1) 10Ayounsi: Netbox: add generic function to execute a Netbox script [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) [15:52:20] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [15:52:45] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [15:52:52] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T348183)', diff saved to https://phabricator.wikimedia.org/P54024 and previous config saved to /var/cache/conftool/dbconfig/20231130-155251-arnaudb.json [15:52:57] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [15:53:14] (03CR) 10Ayounsi: Netbox: add generic function to execute a Netbox script (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [15:54:33] !log installing stunnel4 bugfix updates from bookworm point release [15:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:00] (03PS3) 10Ayounsi: Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) [15:59:12] (03CR) 10CI 
reject: [V: 04-1] Netbox: add generic function to execute a Netbox script [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [16:00:09] (03PS1) 10Hnowlan: changeprop-jobqueue: toggle another job [deployment-charts] - 10https://gerrit.wikimedia.org/r/979122 (https://phabricator.wikimedia.org/T349796) [16:02:02] (03PS3) 10Jcrespo: Prepare for 0.2.0 release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/978643 (https://phabricator.wikimedia.org/T327157) [16:03:24] (03PS1) 10Ladsgroup: Revert "PoolCounterConnectionManager: Add support for ipv6" [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979080 (https://phabricator.wikimedia.org/T352444) [16:03:31] (03CR) 10Ladsgroup: [C: 03+2] Revert "PoolCounterConnectionManager: Add support for ipv6" [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979080 (https://phabricator.wikimedia.org/T352444) (owner: 10Ladsgroup) [16:04:36] (03CR) 10Hashar: [C: 03+1] Revert "PoolCounterConnectionManager: Add support for ipv6" [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979080 (https://phabricator.wikimedia.org/T352444) (owner: 10Ladsgroup) [16:08:13] UBN being deployed [16:11:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979080 (https://phabricator.wikimedia.org/T352444) (owner: 10Ladsgroup) [16:11:49] (03CR) 10Bartosz Dziewoński: Update CentralAuth login failures metric (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/978679 (https://phabricator.wikimedia.org/T351948) (owner: 10Bartosz Dziewoński) [16:12:06] (03CR) 10Ejegg: CentralNotice: Add wmflabs to banner preview CSP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/864327 (https://phabricator.wikimedia.org/T199055) (owner: 10AndyRussG) [16:21:31] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db2109 (T348183)', diff saved to https://phabricator.wikimedia.org/P54025 and previous config saved to /var/cache/conftool/dbconfig/20231130-162131-arnaudb.json [16:21:38] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [16:23:25] (03Merged) 10jenkins-bot: Revert "PoolCounterConnectionManager: Add support for ipv6" [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979080 (https://phabricator.wikimedia.org/T352444) (owner: 10Ladsgroup) [16:23:39] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:979080|Revert "PoolCounterConnectionManager: Add support for ipv6" (T352444)]] [16:23:50] T352444: CirrusSearch generates a massive amount of "poolcounter-connection-error" messages - https://phabricator.wikimedia.org/T352444 [16:24:00] (03PS1) 10Elukey: profile::pyrra::filesystem: use histogram count for LW latency pilot [puppet] - 10https://gerrit.wikimedia.org/r/979126 (https://phabricator.wikimedia.org/T351390) [16:24:34] (03PS1) 10Andrew Bogott: rabbitmq: don't include openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/979127 [16:24:52] (03CR) 10Herron: [C: 03+1] profile::pyrra::filesystem: use histogram count for LW latency pilot [puppet] - 10https://gerrit.wikimedia.org/r/979126 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [16:25:03] (03PS4) 10Brouberol: Explicit the link between apt_repo.yaml and running modules/profile specs [puppet] - 10https://gerrit.wikimedia.org/r/979119 [16:25:08] (03CR) 10Brouberol: "After our discussion in #wikimedia-dcops, I tried to explicitly link hieradata/role/common/apt_repo.yaml with the profile rspecs. 
This way" [puppet] - 10https://gerrit.wikimedia.org/r/979119 (owner: 10Brouberol) [16:26:31] (03CR) 10Elukey: [C: 03+2] profile::pyrra::filesystem: use histogram count for LW latency pilot [puppet] - 10https://gerrit.wikimedia.org/r/979126 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [16:26:55] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:979080|Revert "PoolCounterConnectionManager: Add support for ipv6" (T352444)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:27:09] (03PS2) 10Andrew Bogott: rabbitmq: don't include openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/979127 [16:27:22] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [16:28:13] Amir1: looks like poolcounter connections are resuming :) [16:28:23] \o/ [16:33:24] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:979080|Revert "PoolCounterConnectionManager: Add support for ipv6" (T352444)]] (duration: 09m 45s) [16:33:30] T352444: CirrusSearch generates a massive amount of "poolcounter-connection-error" messages - https://phabricator.wikimedia.org/T352444 [16:36:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P54026 and previous config saved to /var/cache/conftool/dbconfig/20231130-163637-arnaudb.json [16:40:05] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979127 (owner: 10Andrew Bogott) [16:42:43] (03CR) 10FNegri: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/979127 (owner: 10Andrew Bogott) [16:51:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P54027 and previous config saved to /var/cache/conftool/dbconfig/20231130-165144-arnaudb.json [16:58:13] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [17:00:06] jhathaway and rzl: I, the Bot under the Fountain, call upon 
thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T1700). [17:00:06] No Gerrit patches in the queue for this window AFAICS. [17:00:35] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ips to restbase servers in codfw - jhancock@cumin2002" [17:01:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ips to restbase servers in codfw - jhancock@cumin2002" [17:01:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:04:22] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10Jhancock.wm) 05Open→03Resolved @Eevans Hey my bad. newbie mistake. Papaul taught me how to fix this and you should be good now. 
[17:04:33] (03CR) 10Giuseppe Lavagetto: [C: 03+1] changeprop-jobqueue: toggle another job [deployment-charts] - 10https://gerrit.wikimedia.org/r/979122 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [17:06:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T348183)', diff saved to https://phabricator.wikimedia.org/P54028 and previous config saved to /var/cache/conftool/dbconfig/20231130-170650-arnaudb.json [17:06:53] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [17:07:07] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [17:07:10] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [17:07:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T348183)', diff saved to https://phabricator.wikimedia.org/P54029 and previous config saved to /var/cache/conftool/dbconfig/20231130-170713-arnaudb.json [17:08:47] (03PS3) 10Ilias Sarantopoulos: ml-services: update article-desc image [deployment-charts] - 10https://gerrit.wikimedia.org/r/979114 (https://phabricator.wikimedia.org/T351940) [17:11:30] (03CR) 10Kevin Bazira: [C: 03+1] ml-services: update article-desc image [deployment-charts] - 10https://gerrit.wikimedia.org/r/979114 (https://phabricator.wikimedia.org/T351940) (owner: 10Ilias Sarantopoulos) [17:12:18] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: update article-desc image [deployment-charts] - 10https://gerrit.wikimedia.org/r/979114 (https://phabricator.wikimedia.org/T351940) (owner: 10Ilias Sarantopoulos) [17:13:08] (03Merged) 10jenkins-bot: ml-services: update article-desc image [deployment-charts] - 10https://gerrit.wikimedia.org/r/979114 (https://phabricator.wikimedia.org/T351940) (owner: 10Ilias Sarantopoulos) [17:14:44] !log 
isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [17:18:34] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: toggle another job [deployment-charts] - 10https://gerrit.wikimedia.org/r/979122 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [17:19:26] (03Merged) 10jenkins-bot: changeprop-jobqueue: toggle another job [deployment-charts] - 10https://gerrit.wikimedia.org/r/979122 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [17:23:29] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [17:23:47] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [17:24:15] (03PS1) 10Ebernhardson: cirrus: Disable event bus bridge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979133 (https://phabricator.wikimedia.org/T351503) [17:24:21] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [17:24:47] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [17:24:57] (03PS1) 10Ilias Sarantopoulos: Revert "ml-services: update article-desc image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979081 [17:25:48] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [17:26:11] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [17:26:44] (03CR) 10DCausse: "you might to disable canary events for this stream in ext-EventStreamConfig.php as well" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979133 (https://phabricator.wikimedia.org/T351503) (owner: 10Ebernhardson) [17:26:46] (03CR) 10Ilias Sarantopoulos: [C: 03+2] Revert "ml-services: update article-desc image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979081 (owner: 10Ilias Sarantopoulos) [17:27:15] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' 
command on namespace 'experimental' for release 'main' . [17:33:38] (03PS2) 10Ebernhardson: cirrus: Disable event bus bridge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979133 (https://phabricator.wikimedia.org/T351503) [17:33:40] (03CR) 10Ebernhardson: cirrus: Disable event bus bridge (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979133 (https://phabricator.wikimedia.org/T351503) (owner: 10Ebernhardson) [17:34:13] (03CR) 10DCausse: [C: 03+1] cirrus: Disable event bus bridge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979133 (https://phabricator.wikimedia.org/T351503) (owner: 10Ebernhardson) [17:36:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T348183)', diff saved to https://phabricator.wikimedia.org/P54030 and previous config saved to /var/cache/conftool/dbconfig/20231130-173635-arnaudb.json [17:36:49] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [17:41:44] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:50:22] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P54031 and previous config saved to /var/cache/conftool/dbconfig/20231130-175141-arnaudb.json [18:00:05] bd808: Dear deployers, time to do the Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T1800). 
[18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T1800) [18:02:09] !log planet2003 - revoking old puppet cert, following the "fix forward" steps from T349619 - puppet running again [18:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:15] T349619: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 [18:06:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P54032 and previous config saved to /var/cache/conftool/dbconfig/20231130-180648-arnaudb.json [18:06:50] (03CR) 10Andrew Bogott: [C: 03+2] rabbitmq: don't include openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/979127 (owner: 10Andrew Bogott) [18:08:35] (03PS1) 10BryanDavis: toolhub: Bump container version to 2023-11-30-180312-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/979142 (https://phabricator.wikimedia.org/T308938) [18:09:42] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1003.wikimedia.org with OS bookworm [18:09:43] (03CR) 10SBassett: [C: 04-1] CentralNotice: Add wmflabs to banner preview CSP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/864327 (https://phabricator.wikimedia.org/T199055) (owner: 10AndyRussG) [18:09:55] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2023-11-30-180312-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/979142 (https://phabricator.wikimedia.org/T308938) (owner: 10BryanDavis) [18:11:08] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2023-11-30-180312-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/979142 (https://phabricator.wikimedia.org/T308938) (owner: 10BryanDavis) [18:12:33] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply [18:13:08] !log bd808@deploy2002
helmfile [staging] DONE helmfile.d/services/toolhub: apply [18:13:16] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/toolhub: apply [18:13:56] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [18:14:28] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [18:15:00] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:15:20] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [18:17:50] (03Abandoned) 10Ebernhardson: cirrus: Disable event bus bridge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979133 (https://phabricator.wikimedia.org/T351503) (owner: 10Ebernhardson) [18:21:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T348183)', diff saved to https://phabricator.wikimedia.org/P54033 and previous config saved to /var/cache/conftool/dbconfig/20231130-182155-arnaudb.json [18:21:57] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [18:22:01] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [18:22:10] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [18:22:57] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudrabbit1003.wikimedia.org with OS bookworm [18:24:07] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1003.wikimedia.org with OS bookworm [18:26:47] (03PS1) 10Ebernhardson: cirrus: Disable cloudelastic writes on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979146 (https://phabricator.wikimedia.org/T352335) [18:27:29] (03CR) 10CI reject: [V: 04-1] cirrus: 
Disable cloudelastic writes on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979146 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [18:31:30] (03PS1) 10Ebernhardson: cirrus updater: Expand test deployment to prod+cloudelastic [deployment-charts] - 10https://gerrit.wikimedia.org/r/979147 (https://phabricator.wikimedia.org/T352335) [18:36:47] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit1003.wikimedia.org with reason: host reimage [18:38:15] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10Eevans) >>! In T349758#9372338, @Jhancock.wm wrote: > @Eevans Hey my bad. newbie mistake. Papaul taught me how to fix this and you should be good now. No worries; Thanks... [18:40:03] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit1003.wikimedia.org with reason: host reimage [18:44:24] (03PS2) 10Ebernhardson: cirrus: Disable cloudelastic writes on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979146 (https://phabricator.wikimedia.org/T352335) [18:48:31] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for sfaci - https://phabricator.wikimedia.org/T351431 (10Dzahn) a:03thcipriani [18:48:40] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [18:48:54] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [18:49:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T348183)', diff saved to https://phabricator.wikimedia.org/P54034 and previous config saved to /var/cache/conftool/dbconfig/20231130-184900-arnaudb.json [18:49:22] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - 
https://phabricator.wikimedia.org/T348183 [18:49:31] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host elastic1104 [18:49:33] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1104 [18:50:14] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host elastic1104.mgmt.eqiad.wmnet with reboot policy FORCED [18:52:24] (03CR) 10Dzahn: [V: 03+1 C: 03+2] phabricator: turn deploy script into template, support for php7.4-fpm [puppet] - 10https://gerrit.wikimedia.org/r/978710 (https://phabricator.wikimedia.org/T334519) (owner: 10Dzahn) [18:56:26] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "no change in prod except a newline added to the script" [puppet] - 10https://gerrit.wikimedia.org/r/978710 (https://phabricator.wikimedia.org/T334519) (owner: 10Dzahn) [18:56:37] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit1003.wikimedia.org with OS bookworm [18:57:10] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1002.wikimedia.org with OS bookworm [19:00:00] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:00:05] hashar and jeena: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T1900). 
[19:00:22] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [19:01:04] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:04:01] (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:04:45] (03PS1) 10Bking: miscweb: remove wdqs ldf endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/979149 (https://phabricator.wikimedia.org/T347355) [19:05:16] (03CR) 10CI reject: [V: 04-1] miscweb: remove wdqs ldf endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/979149 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:06:20] (03PS2) 10Bking: miscweb: remove wdqs ldf endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/979149 (https://phabricator.wikimedia.org/T347355) [19:08:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10Jclark-ctr) a:03VRiley-WMF [19:08:18] (03CR) 10Dzahn: "ahh! so should this be moved to a profile applied on that (one) wdqs host?" 
[puppet] - 10https://gerrit.wikimedia.org/r/979149 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:09:47] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1104.mgmt.eqiad.wmnet with reboot policy FORCED [19:10:01] (03PS3) 10Dzahn: miscweb: remove wdqs ldf endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/979149 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:10:27] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit1002.wikimedia.org with reason: host reimage [19:10:43] (03CR) 10Dzahn: [C: 03+1] miscweb: remove wdqs ldf endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/979149 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:11:39] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1104'] [19:11:40] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1105'] [19:11:44] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1106'] [19:11:45] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1107'] [19:12:05] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1103'] [19:12:15] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1104'] [19:12:23] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1103'] [19:12:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1107'] [19:12:37] (03CR) 10Bking: [C: 03+2] miscweb: remove wdqs ldf endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/979149 (https://phabricator.wikimedia.org/T347355) (owner: 
10Bking) [19:13:21] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit1002.wikimedia.org with reason: host reimage [19:13:39] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host elastic1104.mgmt.eqiad.wmnet with reboot policy FORCED [19:13:41] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host elastic1107.mgmt.eqiad.wmnet with reboot policy FORCED [19:13:42] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host elastic1103.mgmt.eqiad.wmnet with reboot policy FORCED [19:14:04] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1103.mgmt.eqiad.wmnet with reboot policy FORCED [19:14:12] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1107.mgmt.eqiad.wmnet with reboot policy FORCED [19:14:59] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1107'] [19:14:59] (ProbeDown) firing: Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1003:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:15:05] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1107'] [19:15:22] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host elastic1103.mgmt.eqiad.wmnet with reboot policy FORCED [19:15:35] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1103.mgmt.eqiad.wmnet with reboot policy FORCED [19:15:49] re: etherpad alert. 
I checked and it was temporary [19:17:13] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host elastic1103.mgmt.eqiad.wmnet with reboot policy FORCED [19:17:56] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1103.mgmt.eqiad.wmnet with reboot policy FORCED [19:18:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T348183)', diff saved to https://phabricator.wikimedia.org/P54035 and previous config saved to /var/cache/conftool/dbconfig/20231130-191822-arnaudb.json [19:18:29] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [19:18:52] (03CR) 10Volans: "Thanks for finding a workaround. I'm not sure if this is the best place where to put it, adding Jesse for it, but if there aren't other al" [puppet] - 10https://gerrit.wikimedia.org/r/979119 (owner: 10Brouberol) [19:18:59] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host elastic1103.mgmt.eqiad.wmnet with reboot policy FORCED [19:19:03] (ProbeDown) resolved: Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1003:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:19:19] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1106'] [19:19:20] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1105'] [19:19:43] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host elastic1107.mgmt.eqiad.wmnet with reboot policy FORCED [19:19:44] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1103.mgmt.eqiad.wmnet with reboot policy FORCED [19:20:05] 
!log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1103'] [19:20:08] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1103'] [19:20:31] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1105'] [19:20:34] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1107.mgmt.eqiad.wmnet with reboot policy FORCED [19:21:13] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1105'] [19:21:46] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1106'] [19:22:12] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1107'] [19:22:15] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1106'] [19:22:36] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1107'] [19:24:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1104.mgmt.eqiad.wmnet with reboot policy FORCED [19:24:31] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1104'] [19:24:35] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1104'] [19:24:39] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1104'] [19:25:44] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2028.codfw.wmnet [19:27:09] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1103'] [19:27:28] 10SRE, 10LDAP-Access-Requests: 
Grant Access to archiva-deploy for pfischer - https://phabricator.wikimedia.org/T352475 (10EBernhardson) [19:28:15] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti1035'] [19:28:34] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ganeti1035'] [19:29:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1107'] [19:29:02] (ProbeDown) firing: Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1003:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:29:23] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti1035'] [19:29:32] ^ they are using it at the "Data Modeling Days" [19:29:52] 10SRE, 10LDAP-Access-Requests: Grant Access to archiva-deploy for pfischer - https://phabricator.wikimedia.org/T352475 (10EBernhardson) [19:30:18] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit1002.wikimedia.org with OS bookworm [19:30:21] 10SRE, 10LDAP-Access-Requests: Grant Access to archiva-deployers for pfischer - https://phabricator.wikimedia.org/T352475 (10EBernhardson) [19:30:49] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti1036'] [19:31:55] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti1037'] [19:33:03] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti1038'] [19:33:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P54036 and previous config saved 
to /var/cache/conftool/dbconfig/20231130-193329-arnaudb.json [19:33:34] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2028.codfw.wmnet [19:34:02] (ProbeDown) resolved: Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1003:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:34:59] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti1035'] [19:36:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti103[5-8] - https://phabricator.wikimedia.org/T349925 (10VRiley-WMF) [19:37:24] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti1036'] [19:37:32] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti1037'] [19:40:55] (03PS1) 10Ryan Kemper: elastic: prepare new hosts elastic110[3-7] [puppet] - 10https://gerrit.wikimedia.org/r/979154 (https://phabricator.wikimedia.org/T349777) [19:41:26] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti1038'] [19:41:39] (03PS3) 10Ebernhardson: cirrus: Disable cloudelastic writes on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979146 (https://phabricator.wikimedia.org/T352335) [19:41:41] (03PS1) 10Ebernhardson: cirrus: Enable event bus bridge on more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979155 (https://phabricator.wikimedia.org/T352335) [19:41:55] (03CR) 10Ryan Kemper: "Just tagging jclark for visibility" [puppet] - 10https://gerrit.wikimedia.org/r/979154 (https://phabricator.wikimedia.org/T349777) (owner: 10Ryan Kemper) [19:41:59] 
(03CR) 10Ebernhardson: [C: 04-2] "The necessary kafka topic changes have not been performed yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979155 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [19:42:06] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979154 (https://phabricator.wikimedia.org/T349777) (owner: 10Ryan Kemper) [19:48:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P54037 and previous config saved to /var/cache/conftool/dbconfig/20231130-194835-arnaudb.json [19:49:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1104'] [19:54:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10Patch-For-Review: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10Jclark-ctr) [19:54:15] (03CR) 10Jclark-ctr: [C: 03+2] elastic: prepare new hosts elastic110[3-7] [puppet] - 10https://gerrit.wikimedia.org/r/979154 (https://phabricator.wikimedia.org/T349777) (owner: 10Ryan Kemper) [19:57:37] (03PS1) 10Ssingh: hiera: dnsbox: remove anycast-hc dependency on pdns-rec [puppet] - 10https://gerrit.wikimedia.org/r/979159 (https://phabricator.wikimedia.org/T347054) [19:57:37] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:57:43] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:57:58] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1104.eqiad.wmnet with OS bookworm [19:58:00] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1103.eqiad.wmnet with OS bookworm [19:58:01] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host 
elastic1106.eqiad.wmnet with OS bookworm [19:58:02] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1105.eqiad.wmnet with OS bookworm [19:58:03] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1107.eqiad.wmnet with OS bookworm [19:58:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10Patch-For-Review: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1104.eqiad.wmnet with OS bookworm [19:58:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10Patch-For-Review: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1103.eqiad.wmnet with OS bookworm [19:58:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10Patch-For-Review: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1106.eqiad.wmnet with OS bookworm [19:58:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10Patch-For-Review: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm [19:58:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10Patch-For-Review: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1105.eqiad.wmnet with OS bookworm [19:59:00] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 
10https://gerrit.wikimedia.org/r/979159 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [20:00:23] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.293 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:00:27] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:03:25] (03PS1) 10Jcrespo: add_recent_uploads: Be more solid resilient against errors [software/mediabackups] - 10https://gerrit.wikimedia.org/r/979160 [20:03:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T348183)', diff saved to https://phabricator.wikimedia.org/P54039 and previous config saved to /var/cache/conftool/dbconfig/20231130-200342-arnaudb.json [20:03:45] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [20:03:48] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [20:03:48] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [20:03:50] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [20:04:03] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [20:04:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T348183)', diff saved to https://phabricator.wikimedia.org/P54040 and previous config saved to /var/cache/conftool/dbconfig/20231130-200409-arnaudb.json [20:04:27] (03PS2) 10Jcrespo: add_recent_uploads: Be more resilient against errors [software/mediabackups] - 10https://gerrit.wikimedia.org/r/979160 [20:07:37] (03PS1) 
10Eevans: restbase: set production role and add config for restbase2028 [puppet] - 10https://gerrit.wikimedia.org/r/979161 (https://phabricator.wikimedia.org/T352468) [20:11:43] (03PS1) 10Jdrewniak: Increase "large" font-size option for client-preferences [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979084 (https://phabricator.wikimedia.org/T351693) [20:12:06] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1104.eqiad.wmnet with reason: host reimage [20:12:36] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1103.eqiad.wmnet with reason: host reimage [20:14:58] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1106.eqiad.wmnet with reason: host reimage [20:15:00] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic1106.eqiad.wmnet with reason: host reimage [20:15:19] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1104.eqiad.wmnet with reason: host reimage [20:16:10] (03PS1) 10Jelto: aptrepo: upgrade gitlab-ce and gitlab-runner package to 16.4 [puppet] - 10https://gerrit.wikimedia.org/r/979162 (https://phabricator.wikimedia.org/T352480) [20:17:25] (03PS1) 10Herron: thanos-query: enable auto-downsampling [puppet] - 10https://gerrit.wikimedia.org/r/979163 [20:17:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1103.eqiad.wmnet with reason: host reimage [20:18:47] (03CR) 10Herron: "follow-up to irc conovo -- interested in your thoughts" [puppet] - 10https://gerrit.wikimedia.org/r/979163 (owner: 10Herron) [20:22:48] (03PS1) 10Papaul: Add new elastic nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/979164 (https://phabricator.wikimedia.org/T349780) [20:28:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T348183)', diff saved to 
https://phabricator.wikimedia.org/P54041 and previous config saved to /var/cache/conftool/dbconfig/20231130-202830-arnaudb.json [20:28:36] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [20:30:03] (03CR) 10Papaul: [C: 03+2] Add new elastic nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/979164 (https://phabricator.wikimedia.org/T349780) (owner: 10Papaul) [20:30:48] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:35:52] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:36:33] (03CR) 10Jdlrobson: [C: 03+1] Increase "large" font-size option for client-preferences [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979084 (https://phabricator.wikimedia.org/T351693) (owner: 10Jdrewniak) [20:37:15] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:37:17] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1103.eqiad.wmnet with OS bookworm [20:37:18] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:37:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host elastic1103.eqiad.wmnet with OS bookworm completed: - elastic1103 (**PASS**)... 
[20:37:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1106.eqiad.wmnet with OS bookworm [20:37:25] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:37:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host elastic1106.eqiad.wmnet with OS bookworm completed: - elastic1106 (**WARN**)... [20:38:15] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:38:20] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1104.eqiad.wmnet with OS bookworm [20:38:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host elastic1104.eqiad.wmnet with OS bookworm completed: - elastic1104 (**PASS**)... 
[20:43:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P54042 and previous config saved to /var/cache/conftool/dbconfig/20231130-204336-arnaudb.json [21:42:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T348183)', diff saved to https://phabricator.wikimedia.org/P54046 and previous config saved to /var/cache/conftool/dbconfig/20231130-214252-arnaudb.json [21:43:04] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [21:43:28] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:44:55] (03Merged) 10jenkins-bot: Increase "large" font-size option for client-preferences [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979084 (https://phabricator.wikimedia.org/T351693) (owner: 10Jdrewniak) [21:45:10] !log dancy@deploy2002 Started scap: Backport for [[gerrit:979084|Increase "large" font-size option for client-preferences (T351693)]] [21:45:20] T351693: Implement new default typography options - https://phabricator.wikimedia.org/T351693 [21:45:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:45:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2092.codfw.wmnet with OS bookworm [21:45:57] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2092.codfw.wmnet with OS bookworm completed: - elastic2092 (**PASS**)... 
[21:46:04] (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:46:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2094.codfw.wmnet with OS bookworm [21:46:17] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2094.codfw.wmnet with OS bookworm [21:46:21] (03PS1) 10Dzahn: planet: add ensure parameter allowing to disable update jobs [puppet] - 10https://gerrit.wikimedia.org/r/979169 (https://phabricator.wikimedia.org/T348392) [21:46:26] !log dancy@deploy2002 jdrewniak and dancy: Backport for [[gerrit:979084|Increase "large" font-size option for client-preferences (T351693)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:46:53] kimberly_sarabia: Ready for testing [21:47:06] Thanks! One moment [21:47:54] (03PS2) 10Dzahn: planet: add ensure parameter allowing to disable update jobs [puppet] - 10https://gerrit.wikimedia.org/r/979169 (https://phabricator.wikimedia.org/T348392) [21:48:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2093.codfw.wmnet with reason: host reimage [21:49:06] LGTM! Thanks [21:49:13] OK. 
Proceeding [21:49:16] !log dancy@deploy2002 jdrewniak and dancy: Continuing with sync [21:50:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10Jclark-ctr) [21:51:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10Jclark-ctr) [21:52:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2093.codfw.wmnet with reason: host reimage [21:54:16] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1105.eqiad.wmnet with OS bookworm [21:54:17] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1107.eqiad.wmnet with OS bookworm [21:54:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1105.eqiad.wmnet with OS bookworm [21:54:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm [21:55:12] !log dancy@deploy2002 Finished scap: Backport for [[gerrit:979084|Increase "large" font-size option for client-preferences (T351693)]] (duration: 10m 01s) [21:55:17] T351693: Implement new default typography options - https://phabricator.wikimedia.org/T351693 [21:55:43] kimberly_sarabia: Your change has been fully deployed [21:55:56] Thanks so much! 
[21:58:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P54047 and previous config saved to /var/cache/conftool/dbconfig/20231130-215759-arnaudb.json [22:00:43] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host elastic1107.mgmt.eqiad.wmnet with reboot policy FORCED [22:00:56] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host elastic1105.mgmt.eqiad.wmnet with reboot policy FORCED [22:02:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1105.mgmt.eqiad.wmnet with reboot policy FORCED [22:02:57] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1107.mgmt.eqiad.wmnet with reboot policy FORCED [22:08:47] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:13:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P54048 and previous config saved to /var/cache/conftool/dbconfig/20231130-221308-arnaudb.json [22:14:38] (03CR) 10Kimberly Sarabia: [C: 03+1] "This makes sense to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978947 (https://phabricator.wikimedia.org/T351298) (owner: 10Clare Ming) [22:20:54] (03CR) 10Krinkle: [C: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [22:21:20] (03PS2) 10Krinkle: noc: fix indentation in base.css [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972857 [22:22:19] dancy: all done with deployments? 
[22:23:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:23:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2093.codfw.wmnet with OS bookworm [22:24:01] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2093.codfw.wmnet with OS bookworm completed: - elastic2093 (**PASS**)... [22:24:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2095.codfw.wmnet with OS bookworm [22:24:38] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2095.codfw.wmnet with OS bookworm [22:28:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T348183)', diff saved to https://phabricator.wikimedia.org/P54050 and previous config saved to /var/cache/conftool/dbconfig/20231130-222814-arnaudb.json [22:28:17] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: Maintenance [22:28:20] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [22:28:30] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: Maintenance [22:28:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2190 (T348183)', diff saved to https://phabricator.wikimedia.org/P54051 and previous config saved to /var/cache/conftool/dbconfig/20231130-222836-arnaudb.json [22:42:36] !log 
pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2095.codfw.wmnet with reason: host reimage [22:43:53] dancy: All done. [22:44:35] oops. Krinkle: All done. :-) [22:46:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2095.codfw.wmnet with reason: host reimage [22:58:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T348183)', diff saved to https://phabricator.wikimedia.org/P54053 and previous config saved to /var/cache/conftool/dbconfig/20231130-225802-arnaudb.json [22:58:09] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [23:00:14] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:00:22] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [23:02:49] (03PS1) 10Terasail: Add ability for sysop to manage functioneer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495) [23:03:30] (03CR) 10CI reject: [V: 04-1] Add ability for sysop to manage functioneer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495) (owner: 10Terasail) [23:03:49] !log removing 5 files for legal compliance [23:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:39] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:04:59] (JobUnavailable) firing: (9) 
Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:05:44] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10Papaul) [23:05:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:06:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2095.codfw.wmnet with OS bookworm [23:06:05] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2095.codfw.wmnet with OS bookworm completed: - elastic2095 (**PASS**)... [23:06:33] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2094.codfw.wmnet with OS bookworm [23:06:38] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2094.codfw.wmnet with OS bookworm executed with errors: - elastic2094... 
[23:11:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2096.codfw.wmnet with OS bookworm [23:11:37] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2096.codfw.wmnet with OS bookworm [23:13:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P54054 and previous config saved to /var/cache/conftool/dbconfig/20231130-231309-arnaudb.json [23:16:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2094.codfw.wmnet with OS bookworm [23:16:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2094.codfw.wmnet with OS bookworm [23:18:04] !log removing 1 file for legal compliance [23:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:14] (03PS2) 10Terasail: Bug: T352495 Change-Id: Ib6fcfb2df83204f148da9706dcb751b2f6050a63 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495) [23:28:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P54055 and previous config saved to /var/cache/conftool/dbconfig/20231130-232815-arnaudb.json [23:31:12] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1105.eqiad.wmnet with OS bookworm [23:31:14] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1107.eqiad.wmnet with OS bookworm [23:31:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - 
https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1105.eqiad.wmnet with OS bookworm [23:31:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm [23:31:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2096.codfw.wmnet with reason: host reimage [23:35:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2096.codfw.wmnet with reason: host reimage [23:35:45] !log removing 1 file for legal compliance [23:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:07] (03PS2) 10Krinkle: Remove wgResourceLoaderStorageVersion override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963124 (https://phabricator.wikimedia.org/T343407) [23:36:15] (03PS2) 10Krinkle: Remove wgResourceLoaderStorageVersion override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963124 (https://phabricator.wikimedia.org/T343407) [23:36:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2097.codfw.wmnet with OS bookworm [23:36:22] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2097.codfw.wmnet with OS bookworm [23:37:17] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10Papaul) [23:37:28] (03PS3) 10Jdlrobson: Filter errors originating in external tools [puppet] - 10https://gerrit.wikimedia.org/r/975836 (https://phabricator.wikimedia.org/T349935) 
[23:39:38] (03CR) 10Krinkle: [C: 03+2] noc: fix indentation in base.css [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972857 (owner: 10Krinkle) [23:39:53] (03CR) 10Krinkle: [C: 03+2] Remove wgResourceLoaderStorageVersion override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963124 (https://phabricator.wikimedia.org/T343407) (owner: 10Krinkle) [23:40:21] (03Merged) 10jenkins-bot: noc: fix indentation in base.css [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972857 (owner: 10Krinkle) [23:40:33] (03Merged) 10jenkins-bot: Remove wgResourceLoaderStorageVersion override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963124 (https://phabricator.wikimedia.org/T343407) (owner: 10Krinkle) [23:43:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T348183)', diff saved to https://phabricator.wikimedia.org/P54056 and previous config saved to /var/cache/conftool/dbconfig/20231130-234322-arnaudb.json [23:43:29] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [23:44:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2098.codfw.wmnet with OS bookworm [23:44:23] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2098.codfw.wmnet with OS bookworm [23:45:58] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1107.eqiad.wmnet with reason: host reimage [23:46:05] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1105.eqiad.wmnet with reason: host reimage [23:47:29] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:49:13] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1107.eqiad.wmnet with reason: host reimage [23:50:21] !log krinkle@deploy2002 Synchronized docroot/noc/: (no justification provided) (duration: 08m 28s) [23:52:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1105.eqiad.wmnet with reason: host reimage [23:52:13] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:54:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2097.codfw.wmnet with reason: host reimage [23:55:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:55:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2096.codfw.wmnet with OS bookworm [23:55:56] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2096.codfw.wmnet with OS bookworm completed: - elastic2096 (**PASS**)... 
[23:56:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2099.codfw.wmnet with OS bookworm [23:56:16] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2099.codfw.wmnet with OS bookworm [23:56:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2094.codfw.wmnet with reason: host reimage [23:56:40] (03PS3) 10Terasail: Add ability for sysop to manage functioneer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495) [23:57:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2097.codfw.wmnet with reason: host reimage