[00:01:15] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2034.codfw.wmnet with reason: host reimage
[00:04:40] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2034.codfw.wmnet with reason: host reimage
[00:07:00] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (Eevans) Resolved→Open >>! In T349758#9365653, @Papaul wrote: > @Eevans All yours  Hi @Papaul,  Did these get the additional 3 IPs per host (i.e. restbase2028-{a,...
[00:09:04] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:09:59] <wikibugs>	 (CR) Dzahn: [C: +2] "thank you! planet1003 works but on 2003:" [puppet] - https://gerrit.wikimedia.org/r/978073 (owner: Dzahn)
[00:22:42] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:23:41] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:23:43] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2033.codfw.wmnet with OS bullseye
[00:23:49] <wikibugs>	 SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ganeti2033.codfw.wmnet with OS bullseye completed: - ganeti2033 (**PAS...
[00:24:20] <wikibugs>	 SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (Papaul)
[00:24:30] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:24:32] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2034.codfw.wmnet with OS bullseye
[00:24:36] <wikibugs>	 SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ganeti2034.codfw.wmnet with OS bullseye completed: - ganeti2034 (**PAS...
[00:24:47] <wikibugs>	 SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (Papaul)
[00:24:58] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:26:08] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Request for mailing list - Wikimedia Bénin user group - https://phabricator.wikimedia.org/T352285 (Ladsgroup) Open→Resolved {{done}} ^_^ https://lists.wikimedia.org/postorius/lists/wikimedia-bj.lists.wikimedia.org
[00:27:54] <wikibugs>	 SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (Papaul) Open→Resolved @MoritzMuehlenhoff  all yours
[00:31:27] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (Papaul) @Eevans i don't know since @Jhancock.wm did the provision and i just did the OS install, But I will check and let you know tomorrow. Thanks
[00:38:38] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/978649
[00:38:42] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/978649 (owner: TrainBranchBot)
[00:52:24] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (Papaul) @Jhancock.wm I think you forgot to set up the 3 additional IPs for those nodes (Networking Setup: Speed:1G - VLAN:Private(?)/Public/Other(Specify) : AAAA records:...
[00:54:06] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (Papaul) a:Jhancock.wm→Papaul
[01:05:53] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/978649 (owner: TrainBranchBot)
[01:08:21] <wikibugs>	 (PS1) Papaul: Add new kubernetes node to site.pp [puppet] - https://gerrit.wikimedia.org/r/978716 (https://phabricator.wikimedia.org/T349873)
[01:11:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:18:07] <wikibugs>	 (CR) Papaul: [C: +2] Add new kubernetes node to site.pp [puppet] - https://gerrit.wikimedia.org/r/978716 (https://phabricator.wikimedia.org/T349873) (owner: Papaul)
[01:21:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:25:37] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2057.codfw.wmnet with OS bullseye
[01:25:48] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops, Patch-For-Review: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2057.codfw.wmnet with OS bullseye
[01:30:51] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:35:45] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2058.codfw.wmnet with OS bullseye
[01:35:52] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2058.codfw.wmnet with OS bullseye
[01:36:45] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (Papaul)
[01:43:18] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2057.codfw.wmnet with reason: host reimage
[01:46:35] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2057.codfw.wmnet with reason: host reimage
[01:51:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:52:51] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2058.codfw.wmnet with reason: host reimage
[01:56:11] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2058.codfw.wmnet with reason: host reimage
[01:56:16] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2059.codfw.wmnet with OS bullseye
[01:56:22] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2059.codfw.wmnet with OS bullseye
[01:57:35] <wikibugs>	 SRE, ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T352027 (Papaul) Open→Resolved a:Papaul fix
[02:04:14] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:07:49] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:07:51] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2057.codfw.wmnet with OS bullseye
[02:07:57] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2057.codfw.wmnet with OS bullseye completed: - kubernetes2057 (**PASS**)...
[02:08:21] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (Papaul)
[02:09:14] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2060.codfw.wmnet with OS bullseye
[02:09:22] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2060.codfw.wmnet with OS bullseye
[02:14:40] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:14:58] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[02:18:36] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2059.codfw.wmnet with reason: host reimage
[02:19:51] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:19:53] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2058.codfw.wmnet with OS bullseye
[02:20:00] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2058.codfw.wmnet with OS bullseye completed: - kubernetes2058 (**PASS**)...
[02:22:04] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2059.codfw.wmnet with reason: host reimage
[02:22:29] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (Eevans) >>! In T349758#9369257, @Papaul wrote: > [ ... ] > @Eevans if i add the other 3 IP's addresses manually you should be good or do we have to re image all the hosts...
[02:26:24] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2060.codfw.wmnet with reason: host reimage
[02:29:45] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2060.codfw.wmnet with reason: host reimage
[02:38:04] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (Eevans) >>! In T349758#9369347, @Eevans wrote: >>>! In T349758#9369257, @Papaul wrote: >> [ ... ] >> @Eevans if i add the other 3 IP's addresses manually you should be go...
[02:39:00] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job lvs_realserver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:42:43] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:43:57] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:44:00] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2059.codfw.wmnet with OS bullseye
[02:44:07] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2059.codfw.wmnet with OS bullseye completed: - kubernetes2059 (**PASS**)...
[02:47:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:49:52] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:49:54] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2060.codfw.wmnet with OS bullseye
[02:50:00] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2060.codfw.wmnet with OS bullseye completed: - kubernetes2060 (**PASS**)...
[02:50:02] <wikibugs>	 ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T352354 (phaultfinder)
[02:50:53] <icinga-wm>	 PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS13030/IPv4: Connect - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:57:17] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:01:41] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:07:00] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (Papaul)
[03:07:59] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (Papaul) Open→Resolved @Clement_Goubert @Joe all yours
[03:09:00] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job lvs_realserver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:11:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:43:54] <wikibugs>	 (PS32) Dwisehaupt: Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[03:51:26] <wikibugs>	 (CR) Dwisehaupt: "Thanks for the feedback. I've fixed the nits and added a firewall::service stanza preemptively." [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[03:53:59] <icinga-wm>	 PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv6: Active - Init7, AS13030/IPv4: Idle - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:54:31] <jinxer-wm>	 (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[04:05:03] <wikibugs>	 (CR) Andrew Bogott: puppetserver: '/srv/puppet_code/environments' owned by puppet/puppet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/975089 (https://phabricator.wikimedia.org/T351468) (owner: Andrew Bogott)
[04:07:15] <wikibugs>	 (CR) Andrew Bogott: [C: +1] P:wmcs::instance: adjust syslog handling [puppet] - https://gerrit.wikimedia.org/r/850633 (owner: Majavah)
[04:10:02] <wikibugs>	 (CR) Andrew Bogott: [C: +1] [openstack] Upgrade all remaining hosts to Antelope [puppet] - https://gerrit.wikimedia.org/r/978636 (https://phabricator.wikimedia.org/T348843) (owner: FNegri)
[04:22:40] <wikibugs>	 ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T352357 (phaultfinder)
[04:24:33] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1133 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:24:58] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:25:11] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1133 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:33:31] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1145 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:34:01] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1133 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:34:05] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1145 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:34:51] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:38:29] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1145 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:39:25] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1145 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:41:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[04:45:03] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:45:09] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1113 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:59:43] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1113 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:59:47] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1113 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:21:19] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:35:39] <wikibugs>	 (PS3) KartikMistry: Update cxserver to 2023-11-28-064518-production [deployment-charts] - https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982)
[05:35:48] <wikibugs>	 (CR) KartikMistry: Update cxserver to 2023-11-28-064518-production (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982) (owner: KartikMistry)
[05:39:04] <marostegui>	 I am going to put phabricator in RO for a few seconds to switch its database master
[05:41:23] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2134,2160].codfw.wmnet,db[1119,1159,1217].eqiad.wmnet with reason: m3 master switchover T352149
[05:41:29] <stashbot>	 T352149: Switchover m3 master db1159 -> db1119 - https://phabricator.wikimedia.org/T352149
[05:41:43] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2134,2160].codfw.wmnet,db[1119,1159,1217].eqiad.wmnet with reason: m3 master switchover T352149
[05:43:25] <wikibugs>	 (PS1) Marostegui: mariadb: Promote db1119 to m3 master [puppet] - https://gerrit.wikimedia.org/r/978721 (https://phabricator.wikimedia.org/T352149)
[05:45:59] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db1119 to m3 master [puppet] - https://gerrit.wikimedia.org/r/978721 (https://phabricator.wikimedia.org/T352149) (owner: Marostegui)
[05:47:24] <marostegui>	 !log Failover m3 from db1159 to db1119 - T352149
[05:47:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:30] <stashbot>	 T352149: Switchover m3 master db1159 -> db1119 - https://phabricator.wikimedia.org/T352149
[05:51:00] <wikibugs>	 (PS1) Marostegui: db1159: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/978722 (https://phabricator.wikimedia.org/T351990)
[05:51:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:51:38] <wikibugs>	 (CR) Marostegui: [C: +2] db1159: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/978722 (https://phabricator.wikimedia.org/T351990) (owner: Marostegui)
[05:52:54] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1159.eqiad.wmnet with OS bookworm
[06:05:04] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1159.eqiad.wmnet with reason: host reimage
[06:08:20] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1159.eqiad.wmnet with reason: host reimage
[06:13:49] <wikibugs>	 (PS1) Marostegui: Revert "db1159: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/978515
[06:13:56] <wikibugs>	 (CR) Marostegui: [C: -2] "Not yet" [puppet] - https://gerrit.wikimedia.org/r/978515 (owner: Marostegui)
[06:14:58] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[06:23:31] <wikibugs>	 (PS2) KartikMistry: Update Apertium to 2023-11-30-061450-production [deployment-charts] - https://gerrit.wikimedia.org/r/978189 (https://phabricator.wikimedia.org/T270060)
[06:24:28] <kart_>	 marostegui: OK to deploy apertium service?
[06:24:33] <marostegui>	 kart_: absolutely!
[06:24:39] <kart_>	 Thanks!
[06:27:11] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1159.eqiad.wmnet with OS bookworm
[06:32:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126 T351283', diff saved to https://phabricator.wikimedia.org/P53951 and previous config saved to /var/cache/conftool/dbconfig/20231130-063258-root.json
[06:33:04] <stashbot>	 T351283: Compile and package MariaDB 10.6.16 and 10.4.32 - https://phabricator.wikimedia.org/T351283
[06:33:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1210 T351283', diff saved to https://phabricator.wikimedia.org/P53952 and previous config saved to /var/cache/conftool/dbconfig/20231130-063317-root.json
[06:34:39] <wikibugs>	 (PS1) Marostegui: db1210,db1126: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/978726 (https://phabricator.wikimedia.org/T351283)
[06:35:24] <wikibugs>	 (CR) Marostegui: [C: +2] db1210,db1126: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/978726 (https://phabricator.wikimedia.org/T351283) (owner: Marostegui)
[06:35:55] <wikibugs>	 (CR) KartikMistry: [C: +2] Update Apertium to 2023-11-30-061450-production [deployment-charts] - https://gerrit.wikimedia.org/r/978189 (https://phabricator.wikimedia.org/T270060) (owner: KartikMistry)
[06:36:23] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1210.eqiad.wmnet with OS bookworm
[06:36:30] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1126.eqiad.wmnet with OS bookworm
[06:36:45] <wikibugs>	 (Merged) jenkins-bot: Update Apertium to 2023-11-30-061450-production [deployment-charts] - https://gerrit.wikimedia.org/r/978189 (https://phabricator.wikimedia.org/T270060) (owner: KartikMistry)
[06:37:18] <wikibugs>	 (CR) Marostegui: Revert "db1159: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/978515 (owner: Marostegui)
[06:37:30] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1159: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/978515 (owner: Marostegui)
[06:39:52] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/apertium: apply
[06:40:17] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/apertium: apply
[06:41:45] <wikibugs>	 SRE-swift-storage, CX-deployments, MinT, Language-Team (Language-2023-October-December): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 (santhosh) @elukey, What do you mean by 'reaching out to you by next time' ?  Regarding the architecture of...
[06:42:35] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/apertium: apply
[06:43:13] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply
[06:44:10] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/apertium: apply
[06:44:38] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/apertium: apply
[06:45:03] <kart_>	 !log Updated Apertium to 2023-11-30-061450-production (T270060)
[06:45:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:08] <stashbot>	 T270060: Package apertium-fra-frp (French-Arpitan) - https://phabricator.wikimedia.org/T270060
[06:46:55] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1126.eqiad.wmnet with reason: host reimage
[06:49:21] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1210.eqiad.wmnet with reason: host reimage
[06:49:59] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1126.eqiad.wmnet with reason: host reimage
[06:53:15] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1210.eqiad.wmnet with reason: host reimage
[06:57:58] <wikibugs>	 (PS1) Kevin Bazira: ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/978651 (https://phabricator.wikimedia.org/T343123)
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T0700)
[07:00:05] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T0700).
[07:09:01] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job lvs_realserver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:09:33] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1126.eqiad.wmnet with OS bookworm
[07:13:25] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1210.eqiad.wmnet with OS bookworm
[07:20:42] <wikibugs>	 (PS1) KartikMistry: Update MinT to 2023-11-21-115852-production [deployment-charts] - https://gerrit.wikimedia.org/r/978727
[07:31:01] <wikibugs>	 (PS1) Marostegui: Revert "db1210,db1126: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/978524
[07:31:56] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1210,db1126: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/978524 (owner: Marostegui)
[07:32:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53953 and previous config saved to /var/cache/conftool/dbconfig/20231130-073210-root.json
[07:32:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53954 and previous config saved to /var/cache/conftool/dbconfig/20231130-073212-root.json
[07:42:22] <icinga-wm>	 PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:45:56] <wikibugs>	 (PS1) Marostegui: phabricator.my.cnf.erb: Increase innodb_buffer_pool_size [puppet] - https://gerrit.wikimedia.org/r/978850 (https://phabricator.wikimedia.org/T352360)
[07:47:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53955 and previous config saved to /var/cache/conftool/dbconfig/20231130-074715-root.json
[07:47:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53956 and previous config saved to /var/cache/conftool/dbconfig/20231130-074717-root.json
[07:49:52] <wikibugs>	 (PS1) Clare Ming: Add stream config for *uiactionstracking via Metrics Platform [mediawiki-config] - https://gerrit.wikimedia.org/r/978947 (https://phabricator.wikimedia.org/T351298)
[07:52:14] <wikibugs>	 (CR) Marostegui: "https://puppet-compiler.wmflabs.org/output/978850/781/db1159.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/978850 (https://phabricator.wikimedia.org/T352360) (owner: Marostegui)
[07:54:45] <jinxer-wm>	 (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[07:54:49] <wikibugs>	 (CR) Marostegui: [C: +2] phabricator.my.cnf.erb: Increase innodb_buffer_pool_size [puppet] - https://gerrit.wikimedia.org/r/978850 (https://phabricator.wikimedia.org/T352360) (owner: Marostegui)
[08:00:06] <jouncebot>	 Amir1, apergos, and jnuche: gettimeofday() says it's time for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T0800)
[08:00:06] <jouncebot>	 aanzx: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:36] <apergos>	 uh, there is no patch listed on the Deployment calendar actually
[08:00:53] <apergos>	 no trainees signed up either
[08:01:08] <anzx>	 I removed mine
[08:01:13] <apergos>	 okey dokey
[08:01:34] <apergos>	 in that case, have a quiet day everyone and see you all next time!
[08:02:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53957 and previous config saved to /var/cache/conftool/dbconfig/20231130-080220-root.json
[08:02:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53958 and previous config saved to /var/cache/conftool/dbconfig/20231130-080222-root.json
[08:09:43] <wikibugs>	 (PS1) Marostegui: pc1014: Move it to pc1 [puppet] - https://gerrit.wikimedia.org/r/978963 (https://phabricator.wikimedia.org/T352244)
[08:10:32] <wikibugs>	 (CR) Marostegui: [C: +2] pc1014: Move it to pc1 [puppet] - https://gerrit.wikimedia.org/r/978963 (https://phabricator.wikimedia.org/T352244) (owner: Marostegui)
[08:12:34] <wikibugs>	 SRE, Data-Platform-SRE: Harden the netboot configuration against typos - https://phabricator.wikimedia.org/T351059 (brouberol) Thanks @MoritzMuehlenhoff !
[08:14:22] <wikibugs>	 (CR) Brouberol: [C: +1] "Nice! This will definitely be useful." [deployment-charts] - https://gerrit.wikimedia.org/r/978504 (https://phabricator.wikimedia.org/T264625) (owner: Btullis)
[08:17:10] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[08:17:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53959 and previous config saved to /var/cache/conftool/dbconfig/20231130-081726-root.json
[08:17:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53960 and previous config saved to /var/cache/conftool/dbconfig/20231130-081727-root.json
[08:19:09] <icinga-wm>	 PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:20:26] <wikibugs>	 (PS3) Brouberol: Mention the fact that 'history' is a valid argument in the error message [docker-images/production-images] - https://gerrit.wikimedia.org/r/977994 (https://phabricator.wikimedia.org/T351722)
[08:21:17] <wikibugs>	 (PS1) Muehlenhoff: ganeti: Switch drmrs clusters to PKI [puppet] - https://gerrit.wikimedia.org/r/978965 (https://phabricator.wikimedia.org/T350686)
[08:23:21] <wikibugs>	 (PS2) Clare Ming: Add stream config for *uiactionstracking via Metrics Platform [mediawiki-config] - https://gerrit.wikimedia.org/r/978947 (https://phabricator.wikimedia.org/T351298)
[08:24:58] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:28:05] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host druid1010.eqiad.wmnet with OS bullseye
[08:30:11] <icinga-wm>	 RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:30:43] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:31:07] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:32:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53961 and previous config saved to /var/cache/conftool/dbconfig/20231130-083231-root.json
[08:32:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53962 and previous config saved to /var/cache/conftool/dbconfig/20231130-083232-root.json
[08:32:42] <jinxer-wm>	 (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:32:55] <icinga-wm>	 RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:33:10] <wikibugs>	 (CR) Elukey: [C: +1] ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/978651 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[08:34:47] <icinga-wm>	 PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:35:07] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:37:31] <icinga-wm>	 PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:38:05] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:39:15] <wikibugs>	 (CR) Kevin Bazira: [C: +2] "Thanks for the review :)" [deployment-charts] - https://gerrit.wikimedia.org/r/978651 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[08:40:10] <wikibugs>	 (Merged) jenkins-bot: ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/978651 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[08:40:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126 T352362', diff saved to https://phabricator.wikimedia.org/P53963 and previous config saved to /var/cache/conftool/dbconfig/20231130-084015-root.json
[08:40:21] <stashbot>	 T352362: decommission db1126.eqiad.wmnet - https://phabricator.wikimedia.org/T352362
[08:40:47] <wikibugs>	 (CR) Muehlenhoff: [C: +2] ganeti: Switch drmrs clusters to PKI [puppet] - https://gerrit.wikimedia.org/r/978965 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff)
[08:41:03] <wikibugs>	 (PS1) Marostegui: db1126: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/979018 (https://phabricator.wikimedia.org/T352362)
[08:41:42] <wikibugs>	 (CR) Marostegui: [C: +2] db1126: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/979018 (https://phabricator.wikimedia.org/T352362) (owner: Marostegui)
[08:42:29] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:42:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:43:47] <wikibugs>	 (PS1) Marostegui: dbproxy1025: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/979021 (https://phabricator.wikimedia.org/T351864)
[08:44:06] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[08:44:50] <wikibugs>	 (CR) Marostegui: [C: +2] dbproxy1025: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/979021 (https://phabricator.wikimedia.org/T351864) (owner: Marostegui)
[08:44:56] <wikibugs>	 (PS1) Marostegui: instances.yaml: Remove db1126 from dbctl [puppet] - https://gerrit.wikimedia.org/r/979022 (https://phabricator.wikimedia.org/T352362)
[08:45:27] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1025.eqiad.wmnet with OS bookworm
[08:45:44] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Remove db1126 from dbctl [puppet] - https://gerrit.wikimedia.org/r/979022 (https://phabricator.wikimedia.org/T352362) (owner: Marostegui)
[08:46:53] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:46:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1126 from dbctl T352362', diff saved to https://phabricator.wikimedia.org/P53964 and previous config saved to /var/cache/conftool/dbconfig/20231130-084655-marostegui.json
[08:47:01] <stashbot>	 T352362: decommission db1126.eqiad.wmnet - https://phabricator.wikimedia.org/T352362
[08:47:16] <wikibugs>	 (PS1) Marostegui: Revert "dbproxy1025: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/978525
[08:47:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM install6002.wikimedia.org
[08:47:22] <wikibugs>	 (CR) Marostegui: [C: -2] "Not yet" [puppet] - https://gerrit.wikimedia.org/r/978525 (owner: Marostegui)
[08:47:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53965 and previous config saved to /var/cache/conftool/dbconfig/20231130-084737-root.json
[08:48:53] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:49:07] <icinga-wm>	 RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:49:27] <jinxer-wm>	 (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:49:35] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:52:48] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1010.eqiad.wmnet with reason: host reimage
[08:53:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM install6002.wikimedia.org
[08:54:20] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[08:54:27] <jinxer-wm>	 (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:54:39] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:54:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netflow6001.drmrs.wmnet
[08:55:17] <icinga-wm>	 RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:55:43] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1010.eqiad.wmnet with reason: host reimage
[08:56:18] <wikibugs>	 (CR) JMeybohm: [C: +1] istio: upgrade Docker images to 1.15.7-2 [deployment-charts] - https://gerrit.wikimedia.org/r/978637 (https://phabricator.wikimedia.org/T351933) (owner: Elukey)
[08:57:52] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1025.eqiad.wmnet with reason: host reimage
[08:58:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow6001.drmrs.wmnet
[09:00:05] <jouncebot>	 hashar and jeena: Time to snap out of that daydream and deploy MediaWiki train - Utc-0+Utc-7 Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T0900).
[09:01:58] <wikibugs>	 (PS1) Marostegui: pc1014: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/979023
[09:02:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P53966 and previous config saved to /var/cache/conftool/dbconfig/20231130-090242-root.json
[09:02:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1025.eqiad.wmnet with reason: host reimage
[09:04:31] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:05:02] <wikibugs>	 (CR) Marostegui: [C: +2] pc1014: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/979023 (owner: Marostegui)
[09:06:12] <hashar>	 I am going to run the MediaWiki train
[09:06:59] <wikibugs>	 (PS1) TrainBranchBot: group2 wikis to 1.42.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/979024 (https://phabricator.wikimedia.org/T350083)
[09:07:02] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group2 wikis to 1.42.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/979024 (https://phabricator.wikimedia.org/T350083) (owner: TrainBranchBot)
[09:07:44] <wikibugs>	 (Merged) jenkins-bot: group2 wikis to 1.42.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/979024 (https://phabricator.wikimedia.org/T350083) (owner: TrainBranchBot)
[09:07:45] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:07:45] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 127, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:09:10] <wikibugs>	 (CR) Slyngshede: [C: +2] RAID - Add instance name to MD RAID alert summary [alerts] - https://gerrit.wikimedia.org/r/978485 (owner: Slyngshede)
[09:10:05] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (MoritzMuehlenhoff)
[09:10:55] <wikibugs>	 (Merged) jenkins-bot: RAID - Add instance name to MD RAID alert summary [alerts] - https://gerrit.wikimedia.org/r/978485 (owner: Slyngshede)
[09:13:06] <wikibugs>	 (PS1) Muehlenhoff: ganeti: Switch esams to PKI [puppet] - https://gerrit.wikimedia.org/r/979025 (https://phabricator.wikimedia.org/T350686)
[09:13:25] <wikibugs>	 (CR) Muehlenhoff: [C: -2] "Not yet ready" [puppet] - https://gerrit.wikimedia.org/r/978615 (owner: Muehlenhoff)
[09:13:56] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (Clement_Goubert) Thanks!
[09:15:06] <logmsgbot>	 !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.7  refs T350083
[09:15:13] <stashbot>	 T350083: 1.42.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T350083
[09:18:02] <wikibugs>	 SRE, serviceops: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (Clement_Goubert)
[09:18:51] <wikibugs>	 SRE, serviceops: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (Clement_Goubert) p:Triage→Medium
[09:19:00] <wikibugs>	 (PS2) Muehlenhoff: ganeti: Switch esams to PKI [puppet] - https://gerrit.wikimedia.org/r/979025 (https://phabricator.wikimedia.org/T350686)
[09:19:11] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1010.eqiad.wmnet with OS bullseye
[09:21:19] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:22:49] <wikibugs>	 (CR) Muehlenhoff: [C: +2] ganeti: Switch esams to PKI [puppet] - https://gerrit.wikimedia.org/r/979025 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff)
[09:24:38] <wikibugs>	 (PS2) Marostegui: Revert "dbproxy1025: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/978525
[09:24:43] <wikibugs>	 (CR) Marostegui: Revert "dbproxy1025: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/978525 (owner: Marostegui)
[09:25:05] <icinga-wm>	 PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - No response from remote host 185.15.59.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:25:09] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "dbproxy1025: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/978525 (owner: Marostegui)
[09:25:38] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1025.eqiad.wmnet with OS bookworm
[09:26:45] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:29:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netflow3003.esams.wmnet
[09:31:14] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] ncredir: Enable IPIP encapsulation on codfw [puppet] - https://gerrit.wikimedia.org/r/978624 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[09:31:20] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[09:33:39] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2033.codfw.wmnet with reason: reboot
[09:33:55] <logmsgbot>	 !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on es2033.codfw.wmnet with reason: reboot
[09:34:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow3003.esams.wmnet
[09:35:08] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2033.codfw.wmnet with reason: reboot
[09:35:12] <logmsgbot>	 !log arnaudb@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on es2033.codfw.wmnet with reason: reboot
[09:36:06] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2033.codfw.wmnet with reason: reboot
[09:36:09] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2033.codfw.wmnet with reason: reboot
[09:37:40] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'change es2 master to es2026 as es2033 is rebooting', diff saved to https://phabricator.wikimedia.org/P53967 and previous config saved to /var/cache/conftool/dbconfig/20231130-093740-arnaudb.json
[09:38:14] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 is depooled', diff saved to https://phabricator.wikimedia.org/P53968 and previous config saved to /var/cache/conftool/dbconfig/20231130-093814-arnaudb.json
[09:39:01] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job lvs_realserver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:39:16] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2048:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2048 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:44:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ncredir3003.esams.wmnet
[09:44:16] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2048:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2048 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:46:23] <wikibugs>	 (PS1) Clément Goubert: kubernetes20[57-60]: Add to codfw.k8s [homer/public] - https://gerrit.wikimedia.org/r/979029 (https://phabricator.wikimedia.org/T352369)
[09:46:32] <wikibugs>	 (PS1) Clément Goubert: wikikube: put kubernetes20[57-60] in production [puppet] - https://gerrit.wikimedia.org/r/979030 (https://phabricator.wikimedia.org/T352369)
[09:46:36] <wikibugs>	 (PS1) Clément Goubert: wikikube: add kubernetes20[57-60] to LVS [puppet] - https://gerrit.wikimedia.org/r/979031 (https://phabricator.wikimedia.org/T352369)
[09:48:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ncredir3003.esams.wmnet
[09:51:02] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] hiera: Enable IPIP on codfw text|secondary LVS [puppet] - https://gerrit.wikimedia.org/r/978625 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[09:51:20] <wikibugs>	 (PS2) Vgutierrez: hiera: Enable IPIP on codfw text|secondary LVS [puppet] - https://gerrit.wikimedia.org/r/978625 (https://phabricator.wikimedia.org/T351069)
[09:53:26] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 10%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53969 and previous config saved to /var/cache/conftool/dbconfig/20231130-095325-arnaudb.json
[09:59:01] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job lvs_realserver in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:59:19] <vgutierrez>	 !log rolling restart of pybal on lvs2011 and lvs2014, effectively enabling IPIP encapsulation on ncredir@codfw - T351069
[09:59:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:24] <stashbot>	 T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069
[09:59:32] <vgutierrez>	 volans, Amir1 ^^ blame me if anything pages
[10:00:31] <volans>	 vgutierrez: ack
[10:01:56] <vgutierrez>	 all good apparently :)
[10:03:11] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[10:03:26] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[10:03:27] <volans>	 nice!
[10:03:44] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/978056 (https://phabricator.wikimedia.org/T351139) (owner: Slyngshede)
[10:05:13] <wikibugs>	 (CR) Cathal Mooney: "Looks good to me.  Some of the finer points of the logic to build the server lists I don't grok fully, but overall I'm happy with the appr" [software/homer/deploy] - https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[10:06:34] <wikibugs>	 (CR) Muehlenhoff: [C: +2] gerrit : Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/952457 (owner: Muehlenhoff)
[10:07:16] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2041:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2041 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:08:32] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 20%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53970 and previous config saved to /var/cache/conftool/dbconfig/20231130-100830-arnaudb.json
[10:09:03] <wikibugs>	 (PS1) Alexandros Kosiaris: rdb101[34]: Set them up as redis misc replicas [puppet] - https://gerrit.wikimedia.org/r/979034 (https://phabricator.wikimedia.org/T326171)
[10:09:07] <wikibugs>	 (PS1) Alexandros Kosiaris: Update references to rdb1010 to point to rdb1014 [puppet] - https://gerrit.wikimedia.org/r/979035 (https://phabricator.wikimedia.org/T326171)
[10:09:11] <wikibugs>	 (PS1) Alexandros Kosiaris: Switch rdb1009 replicas to replicate from rdb1013 [puppet] - https://gerrit.wikimedia.org/r/979036 (https://phabricator.wikimedia.org/T326171)
[10:09:15] <wikibugs>	 (PS1) Alexandros Kosiaris: Promote rdb1013 to master, drop rdb1009, rdb1010 [puppet] - https://gerrit.wikimedia.org/r/979037 (https://phabricator.wikimedia.org/T326171)
[10:12:16] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2041:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2041 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:12:49] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[10:13:02] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[10:14:58] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[10:22:17] <wikibugs>	 (CR) Clément Goubert: [C: +1] mediawiki::php: Set php-common version dependent on OS [puppet] - https://gerrit.wikimedia.org/r/978540 (owner: Muehlenhoff)
[10:22:22] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[10:22:49] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[10:22:55] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T348183)', diff saved to https://phabricator.wikimedia.org/P53971 and previous config saved to /var/cache/conftool/dbconfig/20231130-102255-arnaudb.json
[10:23:00] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[10:23:37] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 30%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53972 and previous config saved to /var/cache/conftool/dbconfig/20231130-102336-arnaudb.json
[10:24:56] <wikibugs>	 (PS1) Ayounsi: Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152)
[10:26:54] <wikibugs>	 (CR) Hnowlan: [C: +1] wikikube: put kubernetes20[57-60] in production [puppet] - https://gerrit.wikimedia.org/r/979030 (https://phabricator.wikimedia.org/T352369) (owner: Clément Goubert)
[10:27:16] <wikibugs>	 (CR) Hnowlan: [C: +1] wikikube: add kubernetes20[57-60] to LVS [puppet] - https://gerrit.wikimedia.org/r/979031 (https://phabricator.wikimedia.org/T352369) (owner: Clément Goubert)
[10:28:18] <wikibugs>	 (PS2) Muehlenhoff: ceph::server: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/945785
[10:28:52] <wikibugs>	 (CR) Hnowlan: [C: +1] kubernetes20[57-60]: Add to codfw.k8s [homer/public] - https://gerrit.wikimedia.org/r/979029 (https://phabricator.wikimedia.org/T352369) (owner: Clément Goubert)
[10:30:05] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T348183)', diff saved to https://phabricator.wikimedia.org/P53973 and previous config saved to /var/cache/conftool/dbconfig/20231130-103004-arnaudb.json
[10:30:13] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[10:30:45] <wikibugs>	 (PS1) Muehlenhoff: Configure rdb1013/rdb1014 for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/979041 (https://phabricator.wikimedia.org/T349619)
[10:32:57] <wikibugs>	 (CR) CI reject: [V: -1] Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: Ayounsi)
[10:34:38] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/945785 (owner: Muehlenhoff)
[10:37:10] <wikibugs>	 (PS1) Arnaudb: mariadb: toggle notifications [puppet] - https://gerrit.wikimedia.org/r/978652 (https://phabricator.wikimedia.org/T343674)
[10:37:19] <wikibugs>	 (CR) Volans: [C: -1] "The idea looks ok, some comments inline. Also missing tests ;)" [software/spicerack] - https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: Ayounsi)
[10:37:27] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Configure rdb1013/rdb1014 for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/979041 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[10:37:30] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Request for mailing list - Wikimedia Bénin user group - https://phabricator.wikimedia.org/T352285 (Mh-3110) Many thanks
[10:38:42] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 40%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53974 and previous config saved to /var/cache/conftool/dbconfig/20231130-103841-arnaudb.json
[10:39:59] <wikibugs>	 (PS5) Effie Mouzeli: (WIP) mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[10:40:38] <wikibugs>	 (PS5) Arnaudb: debug: printing results when return object count > 1 [software/conftool] - https://gerrit.wikimedia.org/r/971437 (https://phabricator.wikimedia.org/T350656)
[10:40:43] <wikibugs>	 (CR) CI reject: [V: -1] (WIP) mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[10:42:25] <wikibugs>	 (CR) Clément Goubert: [C: +2] wikikube: put kubernetes20[57-60] in production [puppet] - https://gerrit.wikimedia.org/r/979030 (https://phabricator.wikimedia.org/T352369) (owner: Clément Goubert)
[10:42:33] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: toggle notifications [puppet] - https://gerrit.wikimedia.org/r/978652 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[10:43:02] <wikibugs>	 (CR) Arnaudb: [C: +2] mariadb: toggle notifications [puppet] - https://gerrit.wikimedia.org/r/978652 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[10:43:12] <wikibugs>	 (CR) Clément Goubert: [C: +2] kubernetes20[57-60]: Add to codfw.k8s [homer/public] - https://gerrit.wikimedia.org/r/979029 (https://phabricator.wikimedia.org/T352369) (owner: Clément Goubert)
[10:43:46] <wikibugs>	 (Merged) jenkins-bot: kubernetes20[57-60]: Add to codfw.k8s [homer/public] - https://gerrit.wikimedia.org/r/979029 (https://phabricator.wikimedia.org/T352369) (owner: Clément Goubert)
[10:43:48] <wikibugs>	 (CR) Muehlenhoff: [C: +2] ceph::server: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/945785 (owner: Muehlenhoff)
[10:45:11] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P53975 and previous config saved to /var/cache/conftool/dbconfig/20231130-104510-arnaudb.json
[10:48:20] <wikibugs>	 (PS2) Klausman: ml-services/article-description: set OMP_NUM_THREADS=1 [deployment-charts] - https://gerrit.wikimedia.org/r/979042 (https://phabricator.wikimedia.org/T343123)
[10:48:40] <wikibugs>	 (CR) Elukey: [C: +1] ml-services/article-description: set OMP_NUM_THREADS=1 [deployment-charts] - https://gerrit.wikimedia.org/r/979042 (https://phabricator.wikimedia.org/T343123) (owner: Klausman)
[10:50:22] <wikibugs>	 (PS3) Klausman: ml-services: article-description set OMP_NUM_THREADS=1 [deployment-charts] - https://gerrit.wikimedia.org/r/979042 (https://phabricator.wikimedia.org/T343123)
[10:50:37] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2057.codfw.wmnet with OS bullseye
[10:50:47] <wikibugs>	 SRE, serviceops, Patch-For-Review: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host kubernetes2057.codfw.wmnet with OS bullseye
[10:52:37] <moritzm>	 !log installing python-git security updates
[10:52:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:47] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 50%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53976 and previous config saved to /var/cache/conftool/dbconfig/20231130-105346-arnaudb.json
[10:53:55] <wikibugs>	 (PS1) Majavah: cloudlb: wikireplicas: fix timeouts [puppet] - https://gerrit.wikimedia.org/r/979045 (https://phabricator.wikimedia.org/T346947)
[10:56:15] <wikibugs>	 (CR) Elukey: [C: +2] istio: upgrade Docker images to 1.15.7-2 [deployment-charts] - https://gerrit.wikimedia.org/r/978637 (https://phabricator.wikimedia.org/T351933) (owner: Elukey)
[10:56:45] <wikibugs>	 (PS2) Elukey: cert-manager: bump appVersion [deployment-charts] - https://gerrit.wikimedia.org/r/978640 (https://phabricator.wikimedia.org/T351933)
[10:58:58] <wikibugs>	 (CR) Majavah: [C: +2] "Self-merging given this is relatively straightforward (just copied from the old wiki replica proxies) and will fix an user-facing issue. P" [puppet] - https://gerrit.wikimedia.org/r/979045 (https://phabricator.wikimedia.org/T346947) (owner: Majavah)
[10:59:17] <wikibugs>	 (CR) Klausman: [C: +2] ml-services: article-description set OMP_NUM_THREADS=1 [deployment-charts] - https://gerrit.wikimedia.org/r/979042 (https://phabricator.wikimedia.org/T343123) (owner: Klausman)
[10:59:47] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2058.codfw.wmnet with OS bullseye
[10:59:57] <wikibugs>	 SRE, serviceops, Patch-For-Review: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host kubernetes2058.codfw.wmnet with OS bullseye
[11:00:05] <jouncebot>	 mvolz: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T1100).
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T1100)
[11:00:06] <wikibugs>	 (Merged) jenkins-bot: ml-services: article-description set OMP_NUM_THREADS=1 [deployment-charts] - https://gerrit.wikimedia.org/r/979042 (https://phabricator.wikimedia.org/T343123) (owner: Klausman)
[11:00:08] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2059.codfw.wmnet with OS bullseye
[11:00:17] <wikibugs>	 SRE, serviceops, Patch-For-Review: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host kubernetes2059.codfw.wmnet with OS bullseye
[11:00:18] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P53977 and previous config saved to /var/cache/conftool/dbconfig/20231130-110017-arnaudb.json
[11:00:27] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2060.codfw.wmnet with OS bullseye
[11:00:36] <wikibugs>	 SRE, serviceops, Patch-For-Review: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host kubernetes2060.codfw.wmnet with OS bullseye
[11:01:38] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[11:06:35] <wikibugs>	 (PS5) Brouberol: Add a unit-test that validates the structure of the preseed hieradata [puppet] - https://gerrit.wikimedia.org/r/979039 (https://phabricator.wikimedia.org/T351059)
[11:08:52] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 60%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53978 and previous config saved to /var/cache/conftool/dbconfig/20231130-110851-arnaudb.json
[11:11:23] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2057.codfw.wmnet with reason: host reimage
[11:11:49] <wikibugs>	 (CR) FNegri: [C: +2] [openstack] Upgrade all remaining hosts to Antelope [puppet] - https://gerrit.wikimedia.org/r/978636 (https://phabricator.wikimedia.org/T348843) (owner: FNegri)
[11:11:52] <wikibugs>	 (PS6) Brouberol: Add a unit-test that validates the structure of the preseed hieradata [puppet] - https://gerrit.wikimedia.org/r/979039 (https://phabricator.wikimedia.org/T351059)
[11:12:51] <wikibugs>	 (CR) Awight: "cold review:  I think the application config is missing?" [deployment-charts] - https://gerrit.wikimedia.org/r/948551 (https://phabricator.wikimedia.org/T231006) (owner: Effie Mouzeli)
[11:14:20] <wikibugs>	 (PS7) Brouberol: Add a unit-test that validates the structure of the preseed hieradata [puppet] - https://gerrit.wikimedia.org/r/979039 (https://phabricator.wikimedia.org/T351059)
[11:14:35] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2057.codfw.wmnet with reason: host reimage
[11:15:24] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T348183)', diff saved to https://phabricator.wikimedia.org/P53979 and previous config saved to /var/cache/conftool/dbconfig/20231130-111524-arnaudb.json
[11:15:26] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[11:15:35] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[11:15:40] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[11:15:46] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T348183)', diff saved to https://phabricator.wikimedia.org/P53980 and previous config saved to /var/cache/conftool/dbconfig/20231130-111546-arnaudb.json
[11:19:11] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2058.codfw.wmnet with reason: host reimage
[11:20:13] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2060.codfw.wmnet with reason: host reimage
[11:22:07] <wikibugs>	 (CR) Effie Mouzeli: (WIP) kartotherian: add kartotherian chart (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/948551 (https://phabricator.wikimedia.org/T231006) (owner: Effie Mouzeli)
[11:22:31] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2058.codfw.wmnet with reason: host reimage
[11:22:58] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T348183)', diff saved to https://phabricator.wikimedia.org/P53981 and previous config saved to /var/cache/conftool/dbconfig/20231130-112258-arnaudb.json
[11:23:03] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[11:23:57] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 70%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53982 and previous config saved to /var/cache/conftool/dbconfig/20231130-112356-arnaudb.json
[11:25:22] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2060.codfw.wmnet with reason: host reimage
[11:25:30] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm but see inline" [puppet] - https://gerrit.wikimedia.org/r/979039 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol)
[11:25:34] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2059.codfw.wmnet with reason: host reimage
[11:26:47] <wikibugs>	 (PS2) Alexandros Kosiaris: rdb101[34]: Set them up as redis misc replicas [puppet] - https://gerrit.wikimedia.org/r/979034 (https://phabricator.wikimedia.org/T326171)
[11:26:49] <wikibugs>	 (PS2) Alexandros Kosiaris: Update references to rdb1010 to point to rdb1014 [puppet] - https://gerrit.wikimedia.org/r/979035 (https://phabricator.wikimedia.org/T326171)
[11:26:51] <wikibugs>	 (PS2) Alexandros Kosiaris: Switch rdb1009 replicas to replicate from rdb1013 [puppet] - https://gerrit.wikimedia.org/r/979036 (https://phabricator.wikimedia.org/T326171)
[11:26:53] <wikibugs>	 (PS2) Alexandros Kosiaris: Promote rdb1013 to master, drop rdb1009, rdb1010 [puppet] - https://gerrit.wikimedia.org/r/979037 (https://phabricator.wikimedia.org/T326171)
[11:27:33] <wikibugs>	 (PS1) Muehlenhoff: ganeti: Switch codfw to PKI [puppet] - https://gerrit.wikimedia.org/r/979050 (https://phabricator.wikimedia.org/T350686)
[11:27:35] <wikibugs>	 (PS6) Effie Mouzeli: (WIP) mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[11:28:37] <wikibugs>	 (CR) CI reject: [V: -1] (WIP) mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[11:28:48] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2059.codfw.wmnet with reason: host reimage
[11:28:55] <wikibugs>	 (PS1) Hnowlan: changeprop-jobqueue: migrate more lower-impact jobs [deployment-charts] - https://gerrit.wikimedia.org/r/979051 (https://phabricator.wikimedia.org/T349796)
[11:31:59] <wikibugs>	 (PS7) Effie Mouzeli: (WIP) mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[11:32:39] <wikibugs>	 (CR) CI reject: [V: -1] (WIP) mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[11:33:51] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2057.codfw.wmnet with OS bullseye
[11:34:05] <wikibugs>	 SRE, serviceops, Patch-For-Review: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host kubernetes2057.codfw.wmnet with OS bullseye completed: - kubernetes2057 (**PASS**)   - Down...
[11:34:31] <jinxer-wm>	 (KubernetesAPINotScrapable) resolved: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[11:35:22] <wikibugs>	 (CR) Clément Goubert: [C: +1] changeprop-jobqueue: migrate more lower-impact jobs [deployment-charts] - https://gerrit.wikimedia.org/r/979051 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[11:36:01] <wikibugs>	 (PS8) Brouberol: Add a unit-test that validates the structure of the preseed hieradata [puppet] - https://gerrit.wikimedia.org/r/979039 (https://phabricator.wikimedia.org/T351059)
[11:36:16] <wikibugs>	 (CR) Brouberol: Add a unit-test that validates the structure of the preseed hieradata (1 comment) [puppet] - https://gerrit.wikimedia.org/r/979039 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol)
[11:38:06] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P53983 and previous config saved to /var/cache/conftool/dbconfig/20231130-113804-arnaudb.json
[11:38:31] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (Clement_Goubert)
[11:39:02] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 80%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53984 and previous config saved to /var/cache/conftool/dbconfig/20231130-113901-arnaudb.json
[11:40:44] <wikibugs>	 (PS8) Effie Mouzeli: (WIP) mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[11:43:06] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2058.codfw.wmnet with OS bullseye
[11:43:16] <wikibugs>	 SRE, serviceops, Patch-For-Review: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host kubernetes2058.codfw.wmnet with OS bullseye completed: - kubernetes2058 (**PASS**)   - Down...
[11:45:14] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2060.codfw.wmnet with OS bullseye
[11:45:22] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Reverse logic to select correct virtual console serial mode [cookbooks] - https://gerrit.wikimedia.org/r/978127 (owner: Cathal Mooney)
[11:45:24] <wikibugs>	 SRE, serviceops, Patch-For-Review: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host kubernetes2060.codfw.wmnet with OS bullseye completed: - kubernetes2060 (**PASS**)   - Down...
[11:50:58] <wikibugs>	 (Merged) jenkins-bot: Reverse logic to select correct virtual console serial mode [cookbooks] - https://gerrit.wikimedia.org/r/978127 (owner: Cathal Mooney)
[11:53:13] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P53985 and previous config saved to /var/cache/conftool/dbconfig/20231130-115312-arnaudb.json
[11:54:07] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 90%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53986 and previous config saved to /var/cache/conftool/dbconfig/20231130-115406-arnaudb.json
[11:54:11] <wikibugs>	 (CR) Hnowlan: [C: +2] changeprop-jobqueue: migrate more lower-impact jobs [deployment-charts] - https://gerrit.wikimedia.org/r/979051 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[11:55:02] <wikibugs>	 (Merged) jenkins-bot: changeprop-jobqueue: migrate more lower-impact jobs [deployment-charts] - https://gerrit.wikimedia.org/r/979051 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[12:02:04] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2059.codfw.wmnet with OS bullseye
[12:02:13] <wikibugs>	 SRE, serviceops, Patch-For-Review: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host kubernetes2059.codfw.wmnet with OS bullseye completed: - kubernetes2059 (**WARN**)   - Down...
[12:02:14] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[12:02:30] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[12:03:32] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[12:03:55] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[12:06:07] <claime>	 !log Running homer 'cr*codfw*' commit T352369
[12:06:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:12] <stashbot>	 T352369: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369
[12:06:16] <wikibugs>	 (PS1) Majavah: openstack: spreadcheck: remove in favour of server groups [puppet] - https://gerrit.wikimedia.org/r/979056 (https://phabricator.wikimedia.org/T247213)
[12:08:19] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T348183)', diff saved to https://phabricator.wikimedia.org/P53987 and previous config saved to /var/cache/conftool/dbconfig/20231130-120819-arnaudb.json
[12:08:21] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[12:08:22] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[12:08:25] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[12:08:26] <wikibugs>	 (PS9) D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366)
[12:08:35] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[12:08:42] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T348183)', diff saved to https://phabricator.wikimedia.org/P53988 and previous config saved to /var/cache/conftool/dbconfig/20231130-120841-arnaudb.json
[12:08:45] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[12:08:58] <wikibugs>	 (CR) D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: D3r1ck01)
[12:09:12] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 100%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53989 and previous config saved to /var/cache/conftool/dbconfig/20231130-120911-arnaudb.json
[12:10:39] <wikibugs>	 (CR) D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: D3r1ck01)
[12:11:02] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/782/console" [puppet] - https://gerrit.wikimedia.org/r/979056 (https://phabricator.wikimedia.org/T247213) (owner: Majavah)
[12:12:32] <wikibugs>	 (CR) Clément Goubert: [C: +2] wikikube: add kubernetes20[57-60] to LVS [puppet] - https://gerrit.wikimedia.org/r/979031 (https://phabricator.wikimedia.org/T352369) (owner: Clément Goubert)
[12:13:02] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/783/con" [puppet] - https://gerrit.wikimedia.org/r/979056 (https://phabricator.wikimedia.org/T247213) (owner: Majavah)
[12:15:54] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T348183)', diff saved to https://phabricator.wikimedia.org/P53990 and previous config saved to /var/cache/conftool/dbconfig/20231130-121554-arnaudb.json
[12:16:00] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[12:18:00] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:19:18] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:21:46] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] rdb101[34]: Set them up as redis misc replicas [puppet] - https://gerrit.wikimedia.org/r/979034 (https://phabricator.wikimedia.org/T326171) (owner: Alexandros Kosiaris)
[12:22:34] <claime>	 !log Pooling kubernetes20(5[4789]|60).codfw.wmnet - T352369
[12:22:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:39] <stashbot>	 T352369: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369
[12:24:25] <claime>	 !log Uncordoning kubernetes20(5[4789]|60).codfw.wmnet - T352369
[12:24:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:36] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/978654
[12:26:05] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernetes2059.codfw.wmnet
[12:26:05] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes2059.codfw.wmnet
[12:27:12] <wikibugs>	 (PS6) MdsShakil: Create new namespaces and namespace aliases for bd.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/977196 (https://phabricator.wikimedia.org/T351903)
[12:27:22] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2059 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:27:35] <claime>	 ^that's me, on it
[12:28:07] <wikibugs>	 (PS1) Btullis: Bring an-coord1003 into service as a hadoop coordinator [puppet] - https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045)
[12:28:40] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2059 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:29:01] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:29:36] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/784/con" [puppet] - https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045) (owner: Btullis)
[12:30:13] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (Clement_Goubert)
[12:31:01] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P53991 and previous config saved to /var/cache/conftool/dbconfig/20231130-123100-arnaudb.json
[12:31:01] <wikibugs>	 SRE, serviceops: setup/install kubernetes20[57-60] - https://phabricator.wikimedia.org/T352369 (Clement_Goubert) Open→Resolved Hosts are in production, resolving.
[12:31:41] <wikibugs>	 (PS2) Btullis: Bring an-coord1003 into service as a hadoop coordinator [puppet] - https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045)
[12:32:55] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:33:59] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:35:03] <wikibugs>	 (CR) Awight: (WIP) kartotherian: add kartotherian chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/948551 (https://phabricator.wikimedia.org/T231006) (owner: Effie Mouzeli)
[12:36:40] <wikibugs>	 (PS1) Muehlenhoff: Add explicit Hiera records to mark the new coordinator nodes as running Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/979087 (https://phabricator.wikimedia.org/T336045)
[12:37:53] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'change es2 master to es2033 after reboot', diff saved to https://phabricator.wikimedia.org/P53992 and previous config saved to /var/cache/conftool/dbconfig/20231130-123752-arnaudb.json
[12:39:37] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2034.codfw.wmnet with reason: reboot
[12:39:51] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2034.codfw.wmnet with reason: reboot
[12:40:23] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:40:51] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'change es3 master to es2029 as es2034 will reboot', diff saved to https://phabricator.wikimedia.org/P53993 and previous config saved to /var/cache/conftool/dbconfig/20231130-124050-arnaudb.json
[12:41:11] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 is depooled', diff saved to https://phabricator.wikimedia.org/P53994 and previous config saved to /var/cache/conftool/dbconfig/20231130-124110-arnaudb.json
[12:42:22] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add explicit Hiera records to mark the new coordinator nodes as running Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/979087 (https://phabricator.wikimedia.org/T336045) (owner: Muehlenhoff)
[12:42:59] <wikibugs>	 (PS1) Btullis: Add dummy keytabs for new hadoop coordinators [labs/private] - https://gerrit.wikimedia.org/r/979088 (https://phabricator.wikimedia.org/T336045)
[12:43:01] <wikibugs>	 (CR) Awight: (WIP) kartotherian: add kartotherian chart (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/948551 (https://phabricator.wikimedia.org/T231006) (owner: Effie Mouzeli)
[12:44:11] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:44:38] <wikibugs>	 (CR) Btullis: [V: +2 C: +2] Add dummy keytabs for new hadoop coordinators [labs/private] - https://gerrit.wikimedia.org/r/979088 (https://phabricator.wikimedia.org/T336045) (owner: Btullis)
[12:46:07] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P53995 and previous config saved to /var/cache/conftool/dbconfig/20231130-124607-arnaudb.json
[12:46:24] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/786/con" [puppet] - https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045) (owner: Btullis)
[12:48:49] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 10%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53996 and previous config saved to /var/cache/conftool/dbconfig/20231130-124849-arnaudb.json
[12:53:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host schema2004.codfw.wmnet
[12:53:30] <wikibugs>	 (PS5) Btullis: Update default airflow_version and remove overrides [puppet] - https://gerrit.wikimedia.org/r/977638 (https://phabricator.wikimedia.org/T351621)
[12:54:16] <wikibugs>	 (PS1) Muehlenhoff: Switch schema2004 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/979090 (https://phabricator.wikimedia.org/T349619)
[12:56:03] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[12:56:50] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch schema2004 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/979090 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[12:57:53] <wikibugs>	 (PS3) Btullis: Bring an-coord1003 into service as a hadoop coordinator [puppet] - https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045)
[12:59:42] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[13:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T1300)
[13:00:09] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/787/con" [puppet] - https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045) (owner: Btullis)
[13:00:18] <wikibugs>	 (PS1) JMeybohm: Add new mesh module versions: certificate, configuration, deployment [deployment-charts] - https://gerrit.wikimedia.org/r/979094 (https://phabricator.wikimedia.org/T300033)
[13:00:20] <wikibugs>	 (PS1) JMeybohm: Remove cergen certificate support from mesh module [deployment-charts] - https://gerrit.wikimedia.org/r/979095 (https://phabricator.wikimedia.org/T300033)
[13:00:32] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/979039 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol)
[13:01:14] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T348183)', diff saved to https://phabricator.wikimedia.org/P53997 and previous config saved to /var/cache/conftool/dbconfig/20231130-130113-arnaudb.json
[13:01:15] <wikibugs>	 (CR) CI reject: [V: -1] Remove cergen certificate support from mesh module [deployment-charts] - https://gerrit.wikimedia.org/r/979095 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[13:01:16] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[13:01:30] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[13:01:36] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T348183)', diff saved to https://phabricator.wikimedia.org/P53998 and previous config saved to /var/cache/conftool/dbconfig/20231130-130136-arnaudb.json
[13:01:40] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[13:01:41] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[13:02:56] <wikibugs>	 (PS1) Marostegui: mariadb: Remove db1126 [puppet] - https://gerrit.wikimedia.org/r/979096 (https://phabricator.wikimedia.org/T352362)
[13:03:46] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1126.eqiad.wmnet
[13:03:55] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 20%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P53999 and previous config saved to /var/cache/conftool/dbconfig/20231130-130354-arnaudb.json
[13:04:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host schema2004.codfw.wmnet
[13:05:20] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host elastic2108.mgmt.codfw.wmnet with reboot policy FORCED
[13:06:03] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2108.mgmt.codfw.wmnet with reboot policy FORCED
[13:06:25] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Remove db1126 [puppet] - https://gerrit.wikimedia.org/r/979096 (https://phabricator.wikimedia.org/T352362) (owner: Marostegui)
[13:08:51] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T348183)', diff saved to https://phabricator.wikimedia.org/P54000 and previous config saved to /var/cache/conftool/dbconfig/20231130-130851-arnaudb.json
[13:09:07] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[13:09:12] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[13:10:40] <wikibugs>	 (CR) Brouberol: [C: +2] Add a unit-test that validates the structure of the preseed hieradata [puppet] - https://gerrit.wikimedia.org/r/979039 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol)
[13:11:05] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1126.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[13:12:08] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1126.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[13:12:08] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:12:08] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1126.eqiad.wmnet
[13:13:19] <wikibugs>	 ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission db1126.eqiad.wmnet - https://phabricator.wikimedia.org/T352362 (Marostegui) a:Jclark-ctr This is ready for #dc-ops
[13:14:25] <wikibugs>	 ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission db1126.eqiad.wmnet - https://phabricator.wikimedia.org/T352362 (Marostegui)
[13:15:02] <wikibugs>	 (CR) Brouberol: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045) (owner: Btullis)
[13:15:04] <wikibugs>	 (PS12) Marostegui: mariadb: cookbook draft to clone multiinstance [cookbooks] - https://gerrit.wikimedia.org/r/976709 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[13:19:00] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 30%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54001 and previous config saved to /var/cache/conftool/dbconfig/20231130-131859-arnaudb.json
[13:21:12] <wikibugs>	 (PS1) Volans: CI: test apt_repo failures [puppet] - https://gerrit.wikimedia.org/r/979098 (https://phabricator.wikimedia.org/T351059)
[13:21:19] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:23:41] <wikibugs>	 (PS2) JMeybohm: Add new mesh module versions [deployment-charts] - https://gerrit.wikimedia.org/r/979094 (https://phabricator.wikimedia.org/T300033)
[13:23:43] <wikibugs>	 (PS2) JMeybohm: Remove cergen certificate support from mesh module [deployment-charts] - https://gerrit.wikimedia.org/r/979095 (https://phabricator.wikimedia.org/T300033)
[13:23:58] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P54002 and previous config saved to /var/cache/conftool/dbconfig/20231130-132357-arnaudb.json
[13:24:28] <wikibugs>	 (CR) Stevemunene: "Should we consider including the other instances of Presto discovery_uri settings https://github.com/wikimedia/operations-puppet/blob/prod" [puppet] - https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045) (owner: Btullis)
[13:27:01] <wikibugs>	 (CR) JMeybohm: wikifunctions: Reduce drain time from 600s default to 60s (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/975873 (owner: Jforrester)
[13:30:20] <wikibugs>	 (CR) JMeybohm: [C: -1] cert-manager: bump appVersion (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/978640 (https://phabricator.wikimedia.org/T351933) (owner: Elukey)
[13:32:24] <wikibugs>	 (PS2) Ayounsi: Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152)
[13:32:40] <wikibugs>	 (CR) JMeybohm: Remove cergen certificate support from mesh module (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/979095 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[13:33:09] <wikibugs>	 (CR) Ayounsi: Netbox module: add get/set for primary IPs and access vlan (6 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: Ayounsi)
[13:33:12] <wikibugs>	 SRE, ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T352357 (phaultfinder)
[13:34:05] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 40%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54003 and previous config saved to /var/cache/conftool/dbconfig/20231130-133404-arnaudb.json
[13:35:46] <wikibugs>	 (CR) Stevemunene: [C: +1] "LGTM!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/977994 (https://phabricator.wikimedia.org/T351722) (owner: Brouberol)
[13:37:48] <wikibugs>	 (PS1) Elukey: profile::pyrra::filesystem: add Lift Wing SLO request use case [puppet] - https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390)
[13:38:57] <wikibugs>	 (CR) CI reject: [V: -1] Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: Ayounsi)
[13:39:00] <wikibugs>	 (PS9) Effie Mouzeli: (WIP)mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[13:39:04] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P54004 and previous config saved to /var/cache/conftool/dbconfig/20231130-133904-arnaudb.json
[13:41:55] <wikibugs>	 (CR) Muehlenhoff: [C: +2] ganeti: Switch codfw to PKI [puppet] - https://gerrit.wikimedia.org/r/979050 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff)
[13:43:47] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] Update references to rdb1010 to point to rdb1014 [puppet] - https://gerrit.wikimedia.org/r/979035 (https://phabricator.wikimedia.org/T326171) (owner: Alexandros Kosiaris)
[13:44:06] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/788/con" [puppet] - https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390) (owner: Elukey)
[13:44:12] <wikibugs>	 (PS3) Alexandros Kosiaris: Switch rdb1009 replicas to replicate from rdb1013 [puppet] - https://gerrit.wikimedia.org/r/979036 (https://phabricator.wikimedia.org/T326171)
[13:44:55] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] Switch rdb1009 replicas to replicate from rdb1013 [puppet] - https://gerrit.wikimedia.org/r/979036 (https://phabricator.wikimedia.org/T326171) (owner: Alexandros Kosiaris)
[13:45:28] <wikibugs>	 (PS2) Elukey: profile::pyrra::filesystem: add Lift Wing SLO latency use case [puppet] - https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390)
[13:46:27] <wikibugs>	 (CR) Btullis: "I think we still have to mention the change in each changelog if we want it to build." [docker-images/production-images] - https://gerrit.wikimedia.org/r/977994 (https://phabricator.wikimedia.org/T351722) (owner: Brouberol)
[13:47:49] <wikibugs>	 (PS1) Alexandros Kosiaris: redis_lock: Switch from rdb1009 to rdb1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/979101 (https://phabricator.wikimedia.org/T326171)
[13:48:50] <wikibugs>	 (CR) Brouberol: Mention the fact that 'history' is a valid argument in the error message (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/977994 (https://phabricator.wikimedia.org/T351722) (owner: Brouberol)
[13:49:10] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 50%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54005 and previous config saved to /var/cache/conftool/dbconfig/20231130-134909-arnaudb.json
[13:54:11] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T348183)', diff saved to https://phabricator.wikimedia.org/P54006 and previous config saved to /var/cache/conftool/dbconfig/20231130-135410-arnaudb.json
[13:54:14] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[13:54:16] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[13:54:29] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[13:54:31] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[13:54:47] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[13:54:54] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1212 (T348183)', diff saved to https://phabricator.wikimedia.org/P54007 and previous config saved to /var/cache/conftool/dbconfig/20231130-135453-arnaudb.json
[13:59:01] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job lvs_realserver in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T1400).
[14:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[14:03:09] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T348183)', diff saved to https://phabricator.wikimedia.org/P54008 and previous config saved to /var/cache/conftool/dbconfig/20231130-140308-arnaudb.json
[14:03:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM stewards2001.codfw.wmnet
[14:03:25] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[14:04:15] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 60%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54009 and previous config saved to /var/cache/conftool/dbconfig/20231130-140414-arnaudb.json
[14:05:54] <wikibugs>	 SRE, ops-eqiad, DBA: Degraded RAID on db1199 - https://phabricator.wikimedia.org/T352238 (VRiley-WMF) a:VRiley-WMF
[14:06:53] <Lucas_WMDE>	 nothing to deploy indeed
[14:07:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM stewards2001.codfw.wmnet
[14:07:24] <wikibugs>	 (PS1) Alexandros Kosiaris: mediawiki: Add rdb1013, rdb1014, mark rdb1009 as deprecated [deployment-charts] - https://gerrit.wikimedia.org/r/979102 (https://phabricator.wikimedia.org/T326171)
[14:07:28] <wikibugs>	 (PS1) Alexandros Kosiaris: changeprop: Add rdb1013, rdb1014, mark rdb1009 as deprecated [deployment-charts] - https://gerrit.wikimedia.org/r/979103 (https://phabricator.wikimedia.org/T326171)
[14:07:32] <wikibugs>	 (PS1) Alexandros Kosiaris: api-gateway: Add rdb1013, rdb1014, mark rdb1009 as deprecated [deployment-charts] - https://gerrit.wikimedia.org/r/979104 (https://phabricator.wikimedia.org/T326171)
[14:07:36] <wikibugs>	 (PS1) Alexandros Kosiaris: cp-jobqueue: Add rdb1013, rdb1014, mark rdb1009 as deprecated [deployment-charts] - https://gerrit.wikimedia.org/r/979105 (https://phabricator.wikimedia.org/T326171)
[14:07:40] <wikibugs>	 (PS1) Alexandros Kosiaris: Remove rdb1009 unused references from repo [deployment-charts] - https://gerrit.wikimedia.org/r/979106 (https://phabricator.wikimedia.org/T326171)
[14:11:35] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2093.mgmt.codfw.wmnet with reboot policy FORCED
[14:11:37] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2095.mgmt.codfw.wmnet with reboot policy FORCED
[14:12:33] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (Papaul) a:Jhancock.wm
[14:12:56] <wikibugs>	 (PS1) Effie Mouzeli: (WIP) mcrouter: add vanila chart [deployment-charts] - https://gerrit.wikimedia.org/r/979107
[14:13:41] <wikibugs>	 (CR) CI reject: [V: -1] (WIP) mcrouter: add vanila chart [deployment-charts] - https://gerrit.wikimedia.org/r/979107 (owner: Effie Mouzeli)
[14:13:43] <wikibugs>	 (PS10) Effie Mouzeli: (WIP)mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[14:14:07] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic2095.mgmt.codfw.wmnet with reboot policy FORCED
[14:14:12] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic2093.mgmt.codfw.wmnet with reboot policy FORCED
[14:14:14] <wikibugs>	 (CR) Btullis: [C: +1] Mention the fact that 'history' is a valid argument in the error message [docker-images/production-images] - https://gerrit.wikimedia.org/r/977994 (https://phabricator.wikimedia.org/T351722) (owner: Brouberol)
[14:14:59] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[14:15:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host planet2003.codfw.wmnet
[14:15:44] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host planet2003.codfw.wmnet
[14:17:18] <wikibugs>	 (CR) Brouberol: [C: +2] Mention the fact that 'history' is a valid argument in the error message [docker-images/production-images] - https://gerrit.wikimedia.org/r/977994 (https://phabricator.wikimedia.org/T351722) (owner: Brouberol)
[14:17:21] <wikibugs>	 (CR) Brouberol: [V: +2 C: +2] Mention the fact that 'history' is a valid argument in the error message [docker-images/production-images] - https://gerrit.wikimedia.org/r/977994 (https://phabricator.wikimedia.org/T351722) (owner: Brouberol)
[14:18:19] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P54010 and previous config saved to /var/cache/conftool/dbconfig/20231130-141815-arnaudb.json
[14:19:20] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 70%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54011 and previous config saved to /var/cache/conftool/dbconfig/20231130-141919-arnaudb.json
[14:43:42] <wikibugs>	 (PS1) Matthias Mullie: No custom UW licensing config [mediawiki-config] - https://gerrit.wikimedia.org/r/979113
[14:43:51] <wikibugs>	 (CR) Matthias Mullie: [C: -1] No custom UW licensing config [mediawiki-config] - https://gerrit.wikimedia.org/r/979113 (owner: Matthias Mullie)
[14:44:39] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: update article-desc image [deployment-charts] - https://gerrit.wikimedia.org/r/979114 (https://phabricator.wikimedia.org/T351940)
[14:45:00] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: lower space-based retention to 2800GB [puppet] - https://gerrit.wikimedia.org/r/979110 (https://phabricator.wikimedia.org/T351179) (owner: Filippo Giunchedi)
[14:45:21] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1106.mgmt.eqiad.wmnet with reboot policy FORCED
[14:45:48] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1105.mgmt.eqiad.wmnet with reboot policy FORCED
[14:46:26] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1107.mgmt.eqiad.wmnet with reboot policy FORCED
[14:48:13] <wikibugs>	 SRE, ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T352357 (Jhancock.wm) Open→Resolved a:Jhancock.wm reseated lightning. alert cleared. if reoccurs, replace eth cable
[14:48:22] <godog>	 !log roll-restart prometheus/ops in eqiad/codfw to apply new size-based retention - T351179
[14:48:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:32] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T348183)', diff saved to https://phabricator.wikimedia.org/P54015 and previous config saved to /var/cache/conftool/dbconfig/20231130-144831-arnaudb.json
[14:48:34] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[14:48:38] <stashbot>	 T351179: LVM vg0 close to getting full on prometheus eqiad - https://phabricator.wikimedia.org/T351179
[14:48:48] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[14:48:52] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[14:48:54] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1223 (T348183)', diff saved to https://phabricator.wikimedia.org/P54016 and previous config saved to /var/cache/conftool/dbconfig/20231130-144854-arnaudb.json
[14:48:59] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1105']
[14:49:30] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 90%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54017 and previous config saved to /var/cache/conftool/dbconfig/20231130-144929-arnaudb.json
[14:50:01] <wikibugs>	 (PS2) Jforrester: wikifunctions: Reduce helm deploy timeout from 600s default to 120s [deployment-charts] - https://gerrit.wikimedia.org/r/975873
[14:50:03] <wikibugs>	 (CR) Jforrester: wikifunctions: Reduce helm deploy timeout from 600s default to 120s (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/975873 (owner: Jforrester)
[14:50:11] <logmsgbot>	 !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic1105']
[14:50:31] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1105']
[14:53:13] <logmsgbot>	 !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1105']
[14:53:49] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1106']
[14:53:58] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:54:01] <jinxer-wm>	 (MDRAIDFailedDisk) resolved: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[14:54:01] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:54:11] <wikibugs>	 (PS1) Majavah: P:cache::haproxy: make http_proxy optional [puppet] - https://gerrit.wikimedia.org/r/979115
[14:54:32] <logmsgbot>	 !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic1106']
[14:54:59] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job lvs_realserver in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:55:58] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/792/console" [puppet] - https://gerrit.wikimedia.org/r/979115 (owner: Majavah)
[14:56:04] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:57:08] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T348183)', diff saved to https://phabricator.wikimedia.org/P54018 and previous config saved to /var/cache/conftool/dbconfig/20231130-145707-arnaudb.json
[14:57:15] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[14:57:59] <wikibugs>	 (CR) Vgutierrez: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/979115 (owner: Majavah)
[14:58:11] <wikibugs>	 (CR) Majavah: [V: +1 C: +2] P:cache::haproxy: make http_proxy optional [puppet] - https://gerrit.wikimedia.org/r/979115 (owner: Majavah)
[14:59:01] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job lvs_realserver in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:59:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:00:22] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[15:01:04] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:02:44] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:04:01] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:04:35] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 100%: Post reboot repooling', diff saved to https://phabricator.wikimedia.org/P54019 and previous config saved to /var/cache/conftool/dbconfig/20231130-150434-arnaudb.json
[15:06:20] <wikibugs>	 (PS1) Brouberol: Mention topic/cluster in the kafka replication factor alert message [alerts] - https://gerrit.wikimedia.org/r/979116
[15:07:13] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'change es3 master back to es2034', diff saved to https://phabricator.wikimedia.org/P54020 and previous config saved to /var/cache/conftool/dbconfig/20231130-150712-arnaudb.json
[15:07:26] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1103.mgmt.eqiad.wmnet with reboot policy FORCED
[15:08:22] <Lucas_WMDE>	 matthiasmullie: way too late, but nothing going on AFAIK, yes
[15:08:43] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1105']
[15:09:34] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Nice." [alerts] - 10https://gerrit.wikimedia.org/r/979116 (owner: 10Brouberol)
[15:11:18] <wikibugs>	 (03CR) 10Herron: "Nice one, great to see a latency SLO coming onboard!  Please see comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey)
[15:11:22] <wikibugs>	 (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/979117
[15:12:15] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P54021 and previous config saved to /var/cache/conftool/dbconfig/20231130-151214-arnaudb.json
[15:13:12] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Mention topic/cluster in the kafka replication factor alert message [alerts] - 10https://gerrit.wikimedia.org/r/979116 (owner: 10Brouberol)
[15:14:26] <wikibugs>	 (03Merged) 10jenkins-bot: Mention topic/cluster in the kafka replication factor alert message [alerts] - 10https://gerrit.wikimedia.org/r/979116 (owner: 10Brouberol)
[15:15:34] <wikibugs>	 (03PS3) 10Elukey: profile::pyrra::filesystem: add Lift Wing SLO latency use case [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390)
[15:16:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/979117 (owner: 10Muehlenhoff)
[15:17:39] <wikibugs>	 (03CR) 10Elukey: profile::pyrra::filesystem: add Lift Wing SLO latency use case (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey)
[15:18:55] <wikibugs>	 (03PS4) 10Elukey: profile::pyrra::filesystem: add Lift Wing SLO latency use case [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390)
[15:19:30] <Amir1>	 hello
[15:20:05] <wikibugs>	 (03CR) 10Herron: profile::pyrra::filesystem: add Lift Wing SLO latency use case (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey)
[15:21:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10Jclark-ctr)
[15:21:25] <moritzm>	 !log installing libbsd bugfix updates from Bullseye point release
[15:21:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10Jclark-ctr) 05Open→03Resolved
[15:21:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:21] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P54022 and previous config saved to /var/cache/conftool/dbconfig/20231130-152721-arnaudb.json
[15:29:39] <wikibugs>	 (03CR) 10Herron: [C: 03+1] profile::pyrra::filesystem: add Lift Wing SLO latency use case (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey)
[15:30:01] <wikibugs>	 (03CR) 10Elukey: "Had a chat with Keith on IRC, we are good to go :)" [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey)
[15:30:31] <wikibugs>	 (03CR) 10Herron: [C: 03+1] profile::pyrra::filesystem: add Lift Wing SLO latency use case (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey)
[15:30:54] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::pyrra::filesystem: add Lift Wing SLO latency use case [puppet] - 10https://gerrit.wikimedia.org/r/979099 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey)
[15:31:20] <moritzm>	 !log installing dbus security updates on buster
[15:31:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:24] <elukey>	 moritzm: ok to merge?
[15:31:59] <elukey>	 yes nothing dangerous afaics :)
[15:32:33] <moritzm>	 oh sorry, yes please go ahead
[15:33:24] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: ml-services: update article-desc image [deployment-charts] - 10https://gerrit.wikimedia.org/r/979114 (https://phabricator.wikimedia.org/T351940)
[15:33:39] <sukhe>	 !log clean-up /etc/hosts on A:dns-rec to remove entries populated by host_core: T347054
[15:33:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:47] <stashbot>	 T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054
[15:36:34] <moritzm>	 !log installing minizip security updates
[15:36:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:56] <wikibugs>	 (03PS1) 10Aqu: Rewrite metrics sent by Airflow [puppet] - 10https://gerrit.wikimedia.org/r/979118 (https://phabricator.wikimedia.org/T349532)
[15:42:29] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T348183)', diff saved to https://phabricator.wikimedia.org/P54023 and previous config saved to /var/cache/conftool/dbconfig/20231130-154227-arnaudb.json
[15:42:31] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[15:42:46] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[15:42:46] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[15:50:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] mediawiki::php: Set php-common version dependent on OS [puppet] - 10https://gerrit.wikimedia.org/r/978540 (owner: 10Muehlenhoff)
[15:52:07] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff)
[15:52:17] <wikibugs>	 (03PS1) 10Ayounsi: Netbox: add generic function to execute a Netbox script [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152)
[15:52:20] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance
[15:52:45] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance
[15:52:52] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T348183)', diff saved to https://phabricator.wikimedia.org/P54024 and previous config saved to /var/cache/conftool/dbconfig/20231130-155251-arnaudb.json
[15:52:57] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[15:53:14] <wikibugs>	 (03CR) 10Ayounsi: Netbox: add generic function to execute a Netbox script (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi)
[15:54:33] <moritzm>	 !log installing stunnel4 bugfix updates from bookworm point release
[15:54:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:00] <wikibugs>	 (03PS3) 10Ayounsi: Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152)
[15:59:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Netbox: add generic function to execute a Netbox script [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi)
[16:00:09] <wikibugs>	 (03PS1) 10Hnowlan: changeprop-jobqueue: toggle another job [deployment-charts] - 10https://gerrit.wikimedia.org/r/979122 (https://phabricator.wikimedia.org/T349796)
[16:02:02] <wikibugs>	 (03PS3) 10Jcrespo: Prepare for 0.2.0 release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/978643 (https://phabricator.wikimedia.org/T327157)
[16:03:24] <wikibugs>	 (03PS1) 10Ladsgroup: Revert "PoolCounterConnectionManager: Add support for ipv6" [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979080 (https://phabricator.wikimedia.org/T352444)
[16:03:31] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Revert "PoolCounterConnectionManager: Add support for ipv6" [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979080 (https://phabricator.wikimedia.org/T352444) (owner: 10Ladsgroup)
[16:04:36] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] Revert "PoolCounterConnectionManager: Add support for ipv6" [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979080 (https://phabricator.wikimedia.org/T352444) (owner: 10Ladsgroup)
[16:08:13] <Amir1>	 UBN being deployed
[16:11:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979080 (https://phabricator.wikimedia.org/T352444) (owner: 10Ladsgroup)
[16:11:49] <wikibugs>	 (03CR) 10Bartosz Dziewoński: Update CentralAuth login failures metric (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/978679 (https://phabricator.wikimedia.org/T351948) (owner: 10Bartosz Dziewoński)
[16:12:06] <wikibugs>	 (03CR) 10Ejegg: CentralNotice: Add wmflabs to banner preview CSP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/864327 (https://phabricator.wikimedia.org/T199055) (owner: 10AndyRussG)
[16:21:31] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T348183)', diff saved to https://phabricator.wikimedia.org/P54025 and previous config saved to /var/cache/conftool/dbconfig/20231130-162131-arnaudb.json
[16:21:38] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[16:23:25] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "PoolCounterConnectionManager: Add support for ipv6" [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979080 (https://phabricator.wikimedia.org/T352444) (owner: 10Ladsgroup)
[16:23:39] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:979080|Revert "PoolCounterConnectionManager: Add support for ipv6" (T352444)]]
[16:23:50] <stashbot>	 T352444: CirrusSearch generates a massive amount of "poolcounter-connection-error" messages - https://phabricator.wikimedia.org/T352444
[16:24:00] <wikibugs>	 (03PS1) 10Elukey: profile::pyrra::filesystem: use histogram count for LW latency pilot [puppet] - 10https://gerrit.wikimedia.org/r/979126 (https://phabricator.wikimedia.org/T351390)
[16:24:34] <wikibugs>	 (03PS1) 10Andrew Bogott: rabbitmq: don't include openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/979127
[16:24:52] <wikibugs>	 (03CR) 10Herron: [C: 03+1] profile::pyrra::filesystem: use histogram count for LW latency pilot [puppet] - 10https://gerrit.wikimedia.org/r/979126 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey)
[16:25:03] <wikibugs>	 (03PS4) 10Brouberol: Explicit the link between apt_repo.yaml and running modules/profile specs [puppet] - 10https://gerrit.wikimedia.org/r/979119
[16:25:08] <wikibugs>	 (03CR) 10Brouberol: "After our discussion in #wikimedia-dcops, I tried to explicitly link hieradata/role/common/apt_repo.yaml with the profile rspecs. This way" [puppet] - 10https://gerrit.wikimedia.org/r/979119 (owner: 10Brouberol)
[16:26:31] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::pyrra::filesystem: use histogram count for LW latency pilot [puppet] - 10https://gerrit.wikimedia.org/r/979126 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey)
[16:26:55] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:979080|Revert "PoolCounterConnectionManager: Add support for ipv6" (T352444)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:27:09] <wikibugs>	 (03PS2) 10Andrew Bogott: rabbitmq: don't include openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/979127
[16:27:22] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[16:28:13] <hashar>	 Amir1: looks like poolcounter connections are resuming :)
[16:28:23] <Amir1>	 \o/
[16:33:24] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:979080|Revert "PoolCounterConnectionManager: Add support for ipv6" (T352444)]] (duration: 09m 45s)
[16:33:30] <stashbot>	 T352444: CirrusSearch generates a massive amount of "poolcounter-connection-error" messages - https://phabricator.wikimedia.org/T352444
[16:36:38] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P54026 and previous config saved to /var/cache/conftool/dbconfig/20231130-163637-arnaudb.json
[16:40:05] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979127 (owner: 10Andrew Bogott)
[16:42:43] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/979127 (owner: 10Andrew Bogott)
[16:51:44] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P54027 and previous config saved to /var/cache/conftool/dbconfig/20231130-165144-arnaudb.json
[16:58:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[17:00:06] <jouncebot>	 jhathaway and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T1700).
[17:00:06] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:35] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ips to restbase servers in codfw - jhancock@cumin2002"
[17:01:27] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ips to restbase servers in codfw - jhancock@cumin2002"
[17:01:27] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:04:22] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10Jhancock.wm) 05Open→03Resolved @Eevans Hey my bad. newbie mistake. Papaul taught me how to fix this and you should be good now.
[17:04:33] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] changeprop-jobqueue: toggle another job [deployment-charts] - 10https://gerrit.wikimedia.org/r/979122 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[17:06:51] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T348183)', diff saved to https://phabricator.wikimedia.org/P54028 and previous config saved to /var/cache/conftool/dbconfig/20231130-170650-arnaudb.json
[17:06:53] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
[17:07:07] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
[17:07:10] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[17:07:13] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T348183)', diff saved to https://phabricator.wikimedia.org/P54029 and previous config saved to /var/cache/conftool/dbconfig/20231130-170713-arnaudb.json
[17:08:47] <wikibugs>	 (03PS3) 10Ilias Sarantopoulos: ml-services: update article-desc image [deployment-charts] - 10https://gerrit.wikimedia.org/r/979114 (https://phabricator.wikimedia.org/T351940)
[17:11:30] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+1] ml-services: update article-desc image [deployment-charts] - 10https://gerrit.wikimedia.org/r/979114 (https://phabricator.wikimedia.org/T351940) (owner: 10Ilias Sarantopoulos)
[17:12:18] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: update article-desc image [deployment-charts] - 10https://gerrit.wikimedia.org/r/979114 (https://phabricator.wikimedia.org/T351940) (owner: 10Ilias Sarantopoulos)
[17:13:08] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update article-desc image [deployment-charts] - 10https://gerrit.wikimedia.org/r/979114 (https://phabricator.wikimedia.org/T351940) (owner: 10Ilias Sarantopoulos)
[17:14:44] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[17:18:34] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: toggle another job [deployment-charts] - 10https://gerrit.wikimedia.org/r/979122 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[17:19:26] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop-jobqueue: toggle another job [deployment-charts] - 10https://gerrit.wikimedia.org/r/979122 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[17:23:29] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[17:23:47] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[17:24:15] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus: Disable event bus bridge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979133 (https://phabricator.wikimedia.org/T351503)
[17:24:21] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[17:24:47] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[17:24:57] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: Revert "ml-services: update article-desc image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979081
[17:25:48] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[17:26:11] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[17:26:44] <wikibugs>	 (03CR) 10DCausse: "you might want to disable canary events for this stream in ext-EventStreamConfig.php as well" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979133 (https://phabricator.wikimedia.org/T351503) (owner: 10Ebernhardson)
[17:26:46] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+2] Revert "ml-services: update article-desc image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979081 (owner: 10Ilias Sarantopoulos)
[17:27:15] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[17:33:38] <wikibugs>	 (03PS2) 10Ebernhardson: cirrus: Disable event bus bridge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979133 (https://phabricator.wikimedia.org/T351503)
[17:33:40] <wikibugs>	 (03CR) 10Ebernhardson: cirrus: Disable event bus bridge (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979133 (https://phabricator.wikimedia.org/T351503) (owner: 10Ebernhardson)
[17:34:13] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] cirrus: Disable event bus bridge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979133 (https://phabricator.wikimedia.org/T351503) (owner: 10Ebernhardson)
[17:36:35] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T348183)', diff saved to https://phabricator.wikimedia.org/P54030 and previous config saved to /var/cache/conftool/dbconfig/20231130-173635-arnaudb.json
[17:36:49] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[17:41:44] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:50:22] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:51:42] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P54031 and previous config saved to /var/cache/conftool/dbconfig/20231130-175141-arnaudb.json
[18:00:05] <jouncebot>	 bd808: Dear deployers, time to do the Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T1800).
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T1800)
[18:02:09] <mutante>	 !log planet2003 - revoking old puppet cert, following the "fix forward" steps from T349619 - puppet running again 
[18:02:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:15] <stashbot>	 T349619: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619
[18:06:49] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P54032 and previous config saved to /var/cache/conftool/dbconfig/20231130-180648-arnaudb.json
[18:06:50] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] rabbitmq: don't include openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/979127 (owner: 10Andrew Bogott)
[18:08:35] <wikibugs>	 (03PS1) 10BryanDavis: toolhub: Bump container version to 2023-11-30-180312-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/979142 (https://phabricator.wikimedia.org/T308938)
[18:09:42] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1003.wikimedia.org with OS bookworm
[18:09:43] <wikibugs>	 (03CR) 10SBassett: [C: 04-1] CentralNotice: Add wmflabs to banner preview CSP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/864327 (https://phabricator.wikimedia.org/T199055) (owner: 10AndyRussG)
[18:09:55] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2023-11-30-180312-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/979142 (https://phabricator.wikimedia.org/T308938) (owner: 10BryanDavis)
[18:11:08] <wikibugs>	 (03Merged) 10jenkins-bot: toolhub: Bump container version to 2023-11-30-180312-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/979142 (https://phabricator.wikimedia.org/T308938) (owner: 10BryanDavis)
[18:12:33] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply
[18:13:08] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/toolhub: apply
[18:13:16] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/toolhub: apply
[18:13:56] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply
[18:14:28] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/toolhub: apply
[18:15:00] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:15:20] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply
[18:17:50] <wikibugs>	 (03Abandoned) 10Ebernhardson: cirrus: Disable event bus bridge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979133 (https://phabricator.wikimedia.org/T351503) (owner: 10Ebernhardson)
[18:21:55] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T348183)', diff saved to https://phabricator.wikimedia.org/P54033 and previous config saved to /var/cache/conftool/dbconfig/20231130-182155-arnaudb.json
[18:21:57] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[18:22:01] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[18:22:10] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[18:22:57] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudrabbit1003.wikimedia.org with OS bookworm
[18:24:07] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1003.wikimedia.org with OS bookworm
[18:26:47] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus: Disable cloudelastic writes on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979146 (https://phabricator.wikimedia.org/T352335)
[18:27:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cirrus: Disable cloudelastic writes on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979146 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson)
[18:31:30] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus updater: Expand test deployment to prod+cloudelastic [deployment-charts] - 10https://gerrit.wikimedia.org/r/979147 (https://phabricator.wikimedia.org/T352335)
[18:36:47] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit1003.wikimedia.org with reason: host reimage
[18:38:15] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10Eevans) >>! In T349758#9372338, @Jhancock.wm wrote: > @Eevans Hey my bad. newbie mistake. Papaul taught me how to fix this and you should be good now.  No worries; Thanks...
[18:40:03] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit1003.wikimedia.org with reason: host reimage
[18:44:24] <wikibugs>	 (03PS2) 10Ebernhardson: cirrus: Disable cloudelastic writes on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979146 (https://phabricator.wikimedia.org/T352335)
[18:48:31] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to deployment for sfaci - https://phabricator.wikimedia.org/T351431 (10Dzahn) a:03thcipriani
[18:48:40] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
[18:48:54] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
[18:49:00] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T348183)', diff saved to https://phabricator.wikimedia.org/P54034 and previous config saved to /var/cache/conftool/dbconfig/20231130-184900-arnaudb.json
[18:49:22] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[18:49:31] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host elastic1104
[18:49:33] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1104
[18:50:14] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host elastic1104.mgmt.eqiad.wmnet with reboot policy FORCED
[18:52:24] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] phabricator: turn deploy script into template, support for php7.4-fpm [puppet] - 10https://gerrit.wikimedia.org/r/978710 (https://phabricator.wikimedia.org/T334519) (owner: 10Dzahn)
[18:56:26] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "no change in prod except a newline added to the script" [puppet] - 10https://gerrit.wikimedia.org/r/978710 (https://phabricator.wikimedia.org/T334519) (owner: 10Dzahn)
[18:56:37] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit1003.wikimedia.org with OS bookworm
[18:57:10] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1002.wikimedia.org with OS bookworm
[19:00:00] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:00:05] <jouncebot>	 hashar and jeena: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231130T1900).
[19:00:22] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[19:01:04] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:04:01] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:04:45] <wikibugs>	 (03PS1) 10Bking: miscweb: remove wdqs ldf endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/979149 (https://phabricator.wikimedia.org/T347355)
[19:05:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] miscweb: remove wdqs ldf endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/979149 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[19:06:20] <wikibugs>	 (03PS2) 10Bking: miscweb: remove wdqs ldf endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/979149 (https://phabricator.wikimedia.org/T347355)
[19:08:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10Jclark-ctr) a:03VRiley-WMF
[19:08:18] <wikibugs>	 (03CR) 10Dzahn: "ahh! so should this be moved to a profile applied on that (one) wdqs host?" [puppet] - 10https://gerrit.wikimedia.org/r/979149 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[19:09:47] <logmsgbot>	 !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1104.mgmt.eqiad.wmnet with reboot policy FORCED
[19:10:01] <wikibugs>	 (PS3) Dzahn: miscweb: remove wdqs ldf endpoint check [puppet] - https://gerrit.wikimedia.org/r/979149 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[19:10:27] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit1002.wikimedia.org with reason: host reimage
[19:10:43] <wikibugs>	 (CR) Dzahn: [C: +1] miscweb: remove wdqs ldf endpoint check [puppet] - https://gerrit.wikimedia.org/r/979149 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[19:11:39] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1104']
[19:11:40] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1105']
[19:11:44] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1106']
[19:11:45] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1107']
[19:12:05] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1103']
[19:12:15] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1104']
[19:12:23] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1103']
[19:12:24] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1107']
[19:12:37] <wikibugs>	 (CR) Bking: [C: +2] miscweb: remove wdqs ldf endpoint check [puppet] - https://gerrit.wikimedia.org/r/979149 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[19:13:21] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit1002.wikimedia.org with reason: host reimage
[19:13:39] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host elastic1104.mgmt.eqiad.wmnet with reboot policy FORCED
[19:13:41] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host elastic1107.mgmt.eqiad.wmnet with reboot policy FORCED
[19:13:42] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host elastic1103.mgmt.eqiad.wmnet with reboot policy FORCED
[19:14:04] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1103.mgmt.eqiad.wmnet with reboot policy FORCED
[19:14:12] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1107.mgmt.eqiad.wmnet with reboot policy FORCED
[19:14:59] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1107']
[19:14:59] <jinxer-wm>	 (ProbeDown) firing: Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1003:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:15:05] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1107']
[19:15:22] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host elastic1103.mgmt.eqiad.wmnet with reboot policy FORCED
[19:15:35] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1103.mgmt.eqiad.wmnet with reboot policy FORCED
[19:15:49] <mutante>	 re: etherpad alert. I checked and it was temporary
[19:17:13] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host elastic1103.mgmt.eqiad.wmnet with reboot policy FORCED
[19:17:56] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1103.mgmt.eqiad.wmnet with reboot policy FORCED
[19:18:23] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T348183)', diff saved to https://phabricator.wikimedia.org/P54035 and previous config saved to /var/cache/conftool/dbconfig/20231130-191822-arnaudb.json
[19:18:29] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[19:18:52] <wikibugs>	 (CR) Volans: "Thanks for finding a workaround. I'm not sure if this is the best place where to put it, adding Jesse for it, but if there aren't other al" [puppet] - https://gerrit.wikimedia.org/r/979119 (owner: Brouberol)
[19:18:59] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host elastic1103.mgmt.eqiad.wmnet with reboot policy FORCED
[19:19:03] <jinxer-wm>	 (ProbeDown) resolved: Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1003:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:19:19] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1106']
[19:19:20] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1105']
[19:19:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host elastic1107.mgmt.eqiad.wmnet with reboot policy FORCED
[19:19:44] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1103.mgmt.eqiad.wmnet with reboot policy FORCED
[19:20:05] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1103']
[19:20:08] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1103']
[19:20:31] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1105']
[19:20:34] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1107.mgmt.eqiad.wmnet with reboot policy FORCED
[19:21:13] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1105']
[19:21:46] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1106']
[19:22:12] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1107']
[19:22:15] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1106']
[19:22:36] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1107']
[19:24:12] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1104.mgmt.eqiad.wmnet with reboot policy FORCED
[19:24:31] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1104']
[19:24:35] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1104']
[19:24:39] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1104']
[19:25:44] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2028.codfw.wmnet
[19:27:09] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1103']
[19:27:28] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to archiva-deploy for pfischer - https://phabricator.wikimedia.org/T352475 (EBernhardson)
[19:28:15] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti1035']
[19:28:34] <logmsgbot>	 !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ganeti1035']
[19:29:01] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1107']
[19:29:02] <jinxer-wm>	 (ProbeDown) firing: Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1003:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:29:23] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti1035']
[19:29:32] <mutante>	 ^ they are using it at the "Data Modeling Days"
[19:29:52] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to archiva-deploy for pfischer - https://phabricator.wikimedia.org/T352475 (EBernhardson)
[19:30:18] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit1002.wikimedia.org with OS bookworm
[19:30:21] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to archiva-deployers for pfischer - https://phabricator.wikimedia.org/T352475 (EBernhardson)
[19:30:49] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti1036']
[19:31:55] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti1037']
[19:33:03] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti1038']
[19:33:29] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P54036 and previous config saved to /var/cache/conftool/dbconfig/20231130-193329-arnaudb.json
[19:33:34] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2028.codfw.wmnet
[19:34:02] <jinxer-wm>	 (ProbeDown) resolved: Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#etherpad1003:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:34:59] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti1035']
[19:36:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti103[5-8] - https://phabricator.wikimedia.org/T349925 (VRiley-WMF)
[19:37:24] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti1036']
[19:37:32] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti1037']
[19:40:55] <wikibugs>	 (PS1) Ryan Kemper: elastic: prepare new hosts elastic110[3-7] [puppet] - https://gerrit.wikimedia.org/r/979154 (https://phabricator.wikimedia.org/T349777)
[19:41:26] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti1038']
[19:41:39] <wikibugs>	 (PS3) Ebernhardson: cirrus: Disable cloudelastic writes on selected wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/979146 (https://phabricator.wikimedia.org/T352335)
[19:41:41] <wikibugs>	 (PS1) Ebernhardson: cirrus: Enable event bus bridge on more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/979155 (https://phabricator.wikimedia.org/T352335)
[19:41:55] <wikibugs>	 (CR) Ryan Kemper: "Just tagging jclark for visibility" [puppet] - https://gerrit.wikimedia.org/r/979154 (https://phabricator.wikimedia.org/T349777) (owner: Ryan Kemper)
[19:41:59] <wikibugs>	 (CR) Ebernhardson: [C: -2] "The necessary kafka topic changes have not been performed yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/979155 (https://phabricator.wikimedia.org/T352335) (owner: Ebernhardson)
[19:42:06] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/979154 (https://phabricator.wikimedia.org/T349777) (owner: Ryan Kemper)
[19:48:36] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P54037 and previous config saved to /var/cache/conftool/dbconfig/20231130-194835-arnaudb.json
[19:49:21] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic1104']
[19:54:10] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, Patch-For-Review: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (Jclark-ctr)
[19:54:15] <wikibugs>	 (CR) Jclark-ctr: [C: +2] elastic: prepare new hosts elastic110[3-7] [puppet] - https://gerrit.wikimedia.org/r/979154 (https://phabricator.wikimedia.org/T349777) (owner: Ryan Kemper)
[19:57:37] <wikibugs>	 (PS1) Ssingh: hiera: dnsbox: remove anycast-hc dependency on pdns-rec [puppet] - https://gerrit.wikimedia.org/r/979159 (https://phabricator.wikimedia.org/T347054)
[19:57:37] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:57:43] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:57:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1104.eqiad.wmnet with OS bookworm
[19:58:00] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1103.eqiad.wmnet with OS bookworm
[19:58:01] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1106.eqiad.wmnet with OS bookworm
[19:58:02] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1105.eqiad.wmnet with OS bookworm
[19:58:03] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1107.eqiad.wmnet with OS bookworm
[19:58:07] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, Patch-For-Review: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1104.eqiad.wmnet with OS bookworm
[19:58:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, Patch-For-Review: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1103.eqiad.wmnet with OS bookworm
[19:58:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, Patch-For-Review: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1106.eqiad.wmnet with OS bookworm
[19:58:21] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, Patch-For-Review: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm
[19:58:27] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, Patch-For-Review: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1105.eqiad.wmnet with OS bookworm
[19:59:00] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/979159 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh)
[20:00:23] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.293 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:00:27] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:03:25] <wikibugs>	 (PS1) Jcrespo: add_recent_uploads: Be more solid resilient against errors [software/mediabackups] - https://gerrit.wikimedia.org/r/979160
[20:03:42] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T348183)', diff saved to https://phabricator.wikimedia.org/P54039 and previous config saved to /var/cache/conftool/dbconfig/20231130-200342-arnaudb.json
[20:03:45] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance
[20:03:48] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[20:03:48] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance
[20:03:50] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[20:04:03] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[20:04:10] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T348183)', diff saved to https://phabricator.wikimedia.org/P54040 and previous config saved to /var/cache/conftool/dbconfig/20231130-200409-arnaudb.json
[20:04:27] <wikibugs>	 (PS2) Jcrespo: add_recent_uploads: Be more resilient against errors [software/mediabackups] - https://gerrit.wikimedia.org/r/979160
[20:07:37] <wikibugs>	 (PS1) Eevans: restbase: set production role and add config for restbase2028 [puppet] - https://gerrit.wikimedia.org/r/979161 (https://phabricator.wikimedia.org/T352468)
[20:11:43] <wikibugs>	 (PS1) Jdrewniak: Increase "large" font-size option for client-preferences [skins/Vector] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979084 (https://phabricator.wikimedia.org/T351693)
[20:12:06] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1104.eqiad.wmnet with reason: host reimage
[20:12:36] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1103.eqiad.wmnet with reason: host reimage
[20:14:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1106.eqiad.wmnet with reason: host reimage
[20:15:00] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic1106.eqiad.wmnet with reason: host reimage
[20:15:19] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1104.eqiad.wmnet with reason: host reimage
[20:16:10] <wikibugs>	 (PS1) Jelto: aptrepo: upgrade gitlab-ce and gitlab-runner package to 16.4 [puppet] - https://gerrit.wikimedia.org/r/979162 (https://phabricator.wikimedia.org/T352480)
[20:17:25] <wikibugs>	 (PS1) Herron: thanos-query: enable auto-downsampling [puppet] - https://gerrit.wikimedia.org/r/979163
[20:17:58] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1103.eqiad.wmnet with reason: host reimage
[20:18:47] <wikibugs>	 (CR) Herron: "follow-up to irc convo -- interested in your thoughts" [puppet] - https://gerrit.wikimedia.org/r/979163 (owner: Herron)
[20:22:48] <wikibugs>	 (PS1) Papaul: Add new elastic nodes to site.pp [puppet] - https://gerrit.wikimedia.org/r/979164 (https://phabricator.wikimedia.org/T349780)
[20:28:30] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T348183)', diff saved to https://phabricator.wikimedia.org/P54041 and previous config saved to /var/cache/conftool/dbconfig/20231130-202830-arnaudb.json
[20:28:36] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[20:30:03] <wikibugs>	 (CR) Papaul: [C: +2] Add new elastic nodes to site.pp [puppet] - https://gerrit.wikimedia.org/r/979164 (https://phabricator.wikimedia.org/T349780) (owner: Papaul)
[20:30:48] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[20:35:52] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[20:36:33] <wikibugs>	 (CR) Jdlrobson: [C: +1] Increase "large" font-size option for client-preferences [skins/Vector] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979084 (https://phabricator.wikimedia.org/T351693) (owner: Jdrewniak)
[20:37:15] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[20:37:17] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1103.eqiad.wmnet with OS bookworm
[20:37:18] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[20:37:22] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host elastic1103.eqiad.wmnet with OS bookworm completed: - elastic1103 (**PASS**)...
[20:37:24] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1106.eqiad.wmnet with OS bookworm
[20:37:25] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[20:37:29] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host elastic1106.eqiad.wmnet with OS bookworm completed: - elastic1106 (**WARN**)...
[20:38:15] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[20:38:20] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1104.eqiad.wmnet with OS bookworm
[20:38:25] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host elastic1104.eqiad.wmnet with OS bookworm completed: - elastic1104 (**PASS**)...
[20:43:37] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P54042 and previous config saved to /var/cache/conftool/dbconfig/20231130-204336-arnaudb.json
[21:42:53] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T348183)', diff saved to https://phabricator.wikimedia.org/P54046 and previous config saved to /var/cache/conftool/dbconfig/20231130-214252-arnaudb.json
[21:43:04] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[21:43:28] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[21:44:55] <wikibugs>	 (Merged) jenkins-bot: Increase "large" font-size option for client-preferences [skins/Vector] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979084 (https://phabricator.wikimedia.org/T351693) (owner: Jdrewniak)
[21:45:10] <logmsgbot>	 !log dancy@deploy2002 Started scap: Backport for [[gerrit:979084|Increase "large" font-size option for client-preferences (T351693)]]
[21:45:20] <stashbot>	 T351693: Implement new default typography options - https://phabricator.wikimedia.org/T351693
[21:45:49] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[21:45:52] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2092.codfw.wmnet with OS bookworm
[21:45:57] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2092.codfw.wmnet with OS bookworm completed: - elastic2092 (**PASS**)...
[21:46:04] <jinxer-wm>	 (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:46:11] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2094.codfw.wmnet with OS bookworm
[21:46:17] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2094.codfw.wmnet with OS bookworm
[21:46:21] <wikibugs>	 (PS1) Dzahn: planet: add ensure parameter allowing to disable update jobs [puppet] - https://gerrit.wikimedia.org/r/979169 (https://phabricator.wikimedia.org/T348392)
[21:46:26] <logmsgbot>	 !log dancy@deploy2002 jdrewniak and dancy: Backport for [[gerrit:979084|Increase "large" font-size option for client-preferences (T351693)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:46:53] <dancy>	 kimberly_sarabia: Ready for testing 
[21:47:06] <kimberly_sarabia>	 Thanks! One moment
[21:47:54] <wikibugs>	 (PS2) Dzahn: planet: add ensure parameter allowing to disable update jobs [puppet] - https://gerrit.wikimedia.org/r/979169 (https://phabricator.wikimedia.org/T348392)
[21:48:41] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2093.codfw.wmnet with reason: host reimage
[21:49:06] <kimberly_sarabia>	 LGTM! Thanks
[21:49:13] <dancy>	 OK.  Proceeding
[21:49:16] <logmsgbot>	 !log dancy@deploy2002 jdrewniak and dancy: Continuing with sync
[21:50:29] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (Jclark-ctr)
[21:51:07] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (Jclark-ctr)
[21:52:07] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2093.codfw.wmnet with reason: host reimage
[21:54:16] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1105.eqiad.wmnet with OS bookworm
[21:54:17] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1107.eqiad.wmnet with OS bookworm
[21:54:21] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1105.eqiad.wmnet with OS bookworm
[21:54:27] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm
[21:55:12] <logmsgbot>	 !log dancy@deploy2002 Finished scap: Backport for [[gerrit:979084|Increase "large" font-size option for client-preferences (T351693)]] (duration: 10m 01s)
[21:55:17] <stashbot>	 T351693: Implement new default typography options - https://phabricator.wikimedia.org/T351693
[21:55:43] <dancy>	 kimberly_sarabia: Your change has been fully deployed
[21:55:56] <kimberly_sarabia>	 Thanks so much! 
[21:58:02] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P54047 and previous config saved to /var/cache/conftool/dbconfig/20231130-215759-arnaudb.json
[22:00:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host elastic1107.mgmt.eqiad.wmnet with reboot policy FORCED
[22:00:56] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host elastic1105.mgmt.eqiad.wmnet with reboot policy FORCED
[22:02:47] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1105.mgmt.eqiad.wmnet with reboot policy FORCED
[22:02:57] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1107.mgmt.eqiad.wmnet with reboot policy FORCED
[22:08:47] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[22:13:08] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P54048 and previous config saved to /var/cache/conftool/dbconfig/20231130-221308-arnaudb.json
[22:14:38] <wikibugs>	 (CR) Kimberly Sarabia: [C: +1] "This makes sense to me." [mediawiki-config] - https://gerrit.wikimedia.org/r/978947 (https://phabricator.wikimedia.org/T351298) (owner: Clare Ming)
[22:20:54] <wikibugs>	 (CR) Krinkle: [C: -1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: D3r1ck01)
[22:21:20] <wikibugs>	 (PS2) Krinkle: noc: fix indentation in base.css [mediawiki-config] - https://gerrit.wikimedia.org/r/972857
[22:22:19] <Krinkle>	 dancy: all done with deployments?
[22:23:53] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[22:23:56] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2093.codfw.wmnet with OS bookworm
[22:24:01] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2093.codfw.wmnet with OS bookworm completed: - elastic2093 (**PASS**)...
[22:24:31] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2095.codfw.wmnet with OS bookworm
[22:24:38] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2095.codfw.wmnet with OS bookworm
[22:28:15] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T348183)', diff saved to https://phabricator.wikimedia.org/P54050 and previous config saved to /var/cache/conftool/dbconfig/20231130-222814-arnaudb.json
[22:28:17] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: Maintenance
[22:28:20] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[22:28:30] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: Maintenance
[22:28:37] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2190 (T348183)', diff saved to https://phabricator.wikimedia.org/P54051 and previous config saved to /var/cache/conftool/dbconfig/20231130-222836-arnaudb.json
[22:42:36] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2095.codfw.wmnet with reason: host reimage
[22:43:53] <dancy>	 dancy: All done.
[22:44:35] <dancy>	 oops.  Krinkle: All done. :-)
[22:46:00] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2095.codfw.wmnet with reason: host reimage
[22:58:03] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T348183)', diff saved to https://phabricator.wikimedia.org/P54053 and previous config saved to /var/cache/conftool/dbconfig/20231130-225802-arnaudb.json
[22:58:09] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[23:00:14] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:00:22] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[23:02:49] <wikibugs>	 (PS1) Terasail: Add ability for sysop to manage functioneer [mediawiki-config] - https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495)
[23:03:30] <wikibugs>	 (CR) CI reject: [V: -1] Add ability for sysop to manage functioneer [mediawiki-config] - https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495) (owner: Terasail)
[23:03:49] <foks>	 !log removing 5 files for legal compliance 
[23:03:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[23:04:59] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:05:44] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (Papaul)
[23:05:58] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[23:06:00] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2095.codfw.wmnet with OS bookworm
[23:06:05] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2095.codfw.wmnet with OS bookworm completed: - elastic2095 (**PASS**)...
[23:06:33] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2094.codfw.wmnet with OS bookworm
[23:06:38] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2094.codfw.wmnet with OS bookworm executed with errors: - elastic2094...
[23:11:31] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2096.codfw.wmnet with OS bookworm
[23:11:37] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2096.codfw.wmnet with OS bookworm
[23:13:09] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P54054 and previous config saved to /var/cache/conftool/dbconfig/20231130-231309-arnaudb.json
[23:16:53] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2094.codfw.wmnet with OS bookworm
[23:16:58] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2094.codfw.wmnet with OS bookworm
[23:18:04] <foks>	 !log removing 1 file for legal compliance 
[23:18:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:14] <wikibugs>	 (PS2) Terasail: Bug: T352495 Change-Id: Ib6fcfb2df83204f148da9706dcb751b2f6050a63 [mediawiki-config] - https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495)
[23:28:16] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P54055 and previous config saved to /var/cache/conftool/dbconfig/20231130-232815-arnaudb.json
[23:31:12] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1105.eqiad.wmnet with OS bookworm
[23:31:14] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1107.eqiad.wmnet with OS bookworm
[23:31:19] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1105.eqiad.wmnet with OS bookworm
[23:31:21] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic110[3-7] - https://phabricator.wikimedia.org/T349777 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host elastic1107.eqiad.wmnet with OS bookworm
[23:31:37] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2096.codfw.wmnet with reason: host reimage
[23:35:03] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2096.codfw.wmnet with reason: host reimage
[23:35:45] <foks>	 !log removing 1 file for legal compliance 
[23:35:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:07] <wikibugs>	 (PS2) Krinkle: Remove wgResourceLoaderStorageVersion override [mediawiki-config] - https://gerrit.wikimedia.org/r/963124 (https://phabricator.wikimedia.org/T343407)
[23:36:15] <wikibugs>	 (PS2) Krinkle: Remove wgResourceLoaderStorageVersion override [mediawiki-config] - https://gerrit.wikimedia.org/r/963124 (https://phabricator.wikimedia.org/T343407)
[23:36:16] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2097.codfw.wmnet with OS bookworm
[23:36:22] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2097.codfw.wmnet with OS bookworm
[23:37:17] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (Papaul)
[23:37:28] <wikibugs>	 (PS3) Jdlrobson: Filter errors originating in external tools [puppet] - https://gerrit.wikimedia.org/r/975836 (https://phabricator.wikimedia.org/T349935)
[23:39:38] <wikibugs>	 (CR) Krinkle: [C: +2] noc: fix indentation in base.css [mediawiki-config] - https://gerrit.wikimedia.org/r/972857 (owner: Krinkle)
[23:39:53] <wikibugs>	 (CR) Krinkle: [C: +2] Remove wgResourceLoaderStorageVersion override [mediawiki-config] - https://gerrit.wikimedia.org/r/963124 (https://phabricator.wikimedia.org/T343407) (owner: Krinkle)
[23:40:21] <wikibugs>	 (Merged) jenkins-bot: noc: fix indentation in base.css [mediawiki-config] - https://gerrit.wikimedia.org/r/972857 (owner: Krinkle)
[23:40:33] <wikibugs>	 (Merged) jenkins-bot: Remove wgResourceLoaderStorageVersion override [mediawiki-config] - https://gerrit.wikimedia.org/r/963124 (https://phabricator.wikimedia.org/T343407) (owner: Krinkle)
[23:43:23] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T348183)', diff saved to https://phabricator.wikimedia.org/P54056 and previous config saved to /var/cache/conftool/dbconfig/20231130-234322-arnaudb.json
[23:43:29] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[23:44:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2098.codfw.wmnet with OS bookworm
[23:44:23] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2098.codfw.wmnet with OS bookworm
[23:45:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1107.eqiad.wmnet with reason: host reimage
[23:46:05] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1105.eqiad.wmnet with reason: host reimage
[23:47:29] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:49:13] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1107.eqiad.wmnet with reason: host reimage
[23:50:21] <logmsgbot>	 !log krinkle@deploy2002 Synchronized docroot/noc/: (no justification provided) (duration: 08m 28s)
[23:52:00] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1105.eqiad.wmnet with reason: host reimage
[23:52:13] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[23:54:12] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2097.codfw.wmnet with reason: host reimage
[23:55:48] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[23:55:52] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2096.codfw.wmnet with OS bookworm
[23:55:56] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2096.codfw.wmnet with OS bookworm completed: - elastic2096 (**PASS**)...
[23:56:09] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2099.codfw.wmnet with OS bookworm
[23:56:16] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2099.codfw.wmnet with OS bookworm
[23:56:31] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2094.codfw.wmnet with reason: host reimage
[23:56:40] <wikibugs>	 (PS3) Terasail: Add ability for sysop to manage functioneer [mediawiki-config] - https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495)
[23:57:31] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2097.codfw.wmnet with reason: host reimage