[00:17:16] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[00:29:26] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:29:34] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:29:58] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:33:46] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.539 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:33:56] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:34:18] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:38:24] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/962238
[00:38:26] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/962238 (owner: TrainBranchBot)
[00:55:08] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/962238 (owner: TrainBranchBot)
[01:31:16] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:29:30] <icinga-wm>	 PROBLEM - Check systemd state on dumpsdata1003 is CRITICAL: CRITICAL - degraded: The following units failed: cleanup_tmpdumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:38:47] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:03:47] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:32:28] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:32:32] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:29] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to 2023-09-28-043003-production [deployment-charts] - https://gerrit.wikimedia.org/r/961979 (https://phabricator.wikimedia.org/T343450) (owner: KartikMistry)
[03:46:30] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:47:16] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2023-09-28-043003-production [deployment-charts] - https://gerrit.wikimedia.org/r/961979 (https://phabricator.wikimedia.org/T343450) (owner: KartikMistry)
[03:47:42] <icinga-wm>	 PROBLEM - Check systemd state on analytics1073 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:48:27] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[03:48:51] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[03:51:17] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[03:51:52] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[03:52:12] <icinga-wm>	 RECOVERY - Check systemd state on analytics1073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:52:30] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:53:26] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:54:50] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:55:46] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[03:56:20] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[03:56:51] <kart_>	 !log Updated cxserver to 2023-09-28-043003-production (T343450, T347389, T338689)
[03:56:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:56:58] <stashbot>	 T343450: Enable MinT for closely-related languages based on community input - https://phabricator.wikimedia.org/T343450
[03:56:58] <stashbot>	 T338689: error translating court cases out of english - https://phabricator.wikimedia.org/T338689
[03:56:59] <stashbot>	 T347389: Integrate improved sentence segmentation algorithm in CXServer - https://phabricator.wikimedia.org/T347389
[04:41:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[04:41:44] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:44:24] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:44:48] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:46:20] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:47:24] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:49:14] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:05:52] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1084 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:06:46] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1084 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:09:48] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1084 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:10:22] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1084 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:21:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:21:30] <wikibugs>	 (PS6) Anzx: fonwiki: add wgSiteName, wgMetaNamespace and timezone [mediawiki-config] - https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939)
[05:43:39] <logmsgbot>	 !log sg912@deploy2002 Started deploy [analytics/refinery@e954b12]: Regular analytics weekly train [analytics/refinery@e954b12a]
[05:44:35] <Surbhi_>	 Refinery deployment in progress
[05:49:41] <logmsgbot>	 !log sg912@deploy2002 Finished deploy [analytics/refinery@e954b12]: Regular analytics weekly train [analytics/refinery@e954b12a] (duration: 06m 02s)
[05:50:10] <logmsgbot>	 !log sg912@deploy2002 Started deploy [analytics/refinery@e954b12] (thin): Regular analytics weekly train THIN [analytics/refinery@e954b12a]
[05:50:17] <logmsgbot>	 !log sg912@deploy2002 Finished deploy [analytics/refinery@e954b12] (thin): Regular analytics weekly train THIN [analytics/refinery@e954b12a] (duration: 00m 06s)
[05:51:02] <logmsgbot>	 !log sg912@deploy2002 Started deploy [analytics/refinery@e954b12] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@e954b12a]
[05:54:03] <logmsgbot>	 !log sg912@deploy2002 Finished deploy [analytics/refinery@e954b12] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@e954b12a] (duration: 03m 00s)
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T0600)
[06:19:43] <Surbhi_>	 !log Deployed refinery using scap, then deployed onto hdfs
[06:19:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:53] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (ayounsi) Indeed, looks about right :)  For Puppet, if we can change the Hiera merge strategy to `hash`  for `profile::bird::adve...
[06:30:07] <moritzm>	 !log installing glibc security updates
[06:30:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:10] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-admins for sfaci - https://phabricator.wikimedia.org/T348101 (Sfaci)
[06:58:42] <wikibugs>	 SRE, SRE-Access-Requests: Remove zxane from restricted and analytics-privatedata-users - https://phabricator.wikimedia.org/T348103 (Nahid)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T0700)
[07:00:05] <jouncebot>	 aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:03:48] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:05:53] <wikibugs>	 SRE, Data-Persistence, Performance-Team, serviceops, Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (Jelto)
[07:12:05] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/962636 (owner: EoghanGaffney)
[07:12:16] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (dcaro)
[07:19:53] <XioNoX>	 !log Remove static routes for anycast prefixes - T347494
[07:19:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:57] <stashbot>	 T347494: Remove static routes for anycast prefixes - https://phabricator.wikimedia.org/T347494
[07:22:08] <wikibugs>	 SRE, SRE-Access-Requests, Data Engineering and Event Platform Team, Data-Engineering, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (Jelto)
[07:22:40] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, thank you" [cookbooks] - https://gerrit.wikimedia.org/r/963189 (https://phabricator.wikimedia.org/T347954) (owner: Volans)
[07:24:57] <wikibugs>	 SRE, SRE-Access-Requests, Data Engineering and Event Platform Team, Data-Engineering, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (Jelto) p:Triage→Medium
[07:25:05] <wikibugs>	 SRE, SRE-Access-Requests, Data Engineering and Event Platform Team, Data-Engineering, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (Jelto) Thanks for opening the access request. There is an official [access request form](https://phabricator.wi...
[07:27:28] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Remove static routes for anycast prefixes - https://phabricator.wikimedia.org/T347494 (ayounsi) Open→Resolved All done.
[07:30:43] <wikibugs>	 SRE, SRE-Access-Requests, Data Engineering and Event Platform Team, Data-Engineering, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (Jelto)
[07:32:42] <wikibugs>	 (CR) David Caro: [C: +2] ci: manage cinder volume on Castor instance [puppet] - https://gerrit.wikimedia.org/r/961844 (https://phabricator.wikimedia.org/T304080) (owner: Hashar)
[07:33:06] <wikibugs>	 SRE, SRE-Access-Requests, Data Engineering and Event Platform Team, Data-Engineering, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (Jelto) @Antoine_Quhen Can you confirm and add your wikitech username and email address in the task description...
[07:34:26] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe2003.codfw.wmnet with OS bullseye
[07:34:31] <wikibugs>	 SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi...
[07:36:03] <wikibugs>	 SRE, Traffic, Patch-For-Review: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (ayounsi) I was wondering what to do for all the appliances that have ntp.site.wikimedia.org configured. To me the best here is to...
[07:36:51] <wikibugs>	 (CR) Volans: [C: +2] sre.hosts.reimage: fix confctl repool command [cookbooks] - https://gerrit.wikimedia.org/r/963189 (https://phabricator.wikimedia.org/T347954) (owner: Volans)
[07:38:55] <wikibugs>	 (CR) Hashar: gerrit: remove duplicate $gerrit_site definition (1 comment) [puppet] - https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: Hashar)
[07:39:25] <wikibugs>	 (Merged) jenkins-bot: sre.hosts.reimage: fix confctl repool command [cookbooks] - https://gerrit.wikimedia.org/r/963189 (https://phabricator.wikimedia.org/T347954) (owner: Volans)
[07:42:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:43:03] <wikibugs>	 (PS1) Muehlenhoff: Remove access for essexigyan [puppet] - https://gerrit.wikimedia.org/r/963256
[07:47:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:47:15] <wikibugs>	 SRE, SRE-Access-Requests: Remove zxane from restricted and analytics-privatedata-users - https://phabricator.wikimedia.org/T348103 (Aklapper) For context, https://meta.wikimedia.org/wiki/Special:CentralAuth?target=ZSoo%20(WMF)
[07:50:34] <wikibugs>	 SRE, DC-Ops, Infrastructure-Foundations: sre.hardware.upgrade-firmware cookbook: product slug parsing - https://phabricator.wikimedia.org/T348036 (Volans) IMHO I think we should stick to the agreed format in T284614#7214588 and T284614#7222919 and rename (and re-slug) the 3 non matching ones into the...
[07:53:33] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2003.codfw.wmnet with reason: host reimage
[07:55:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[07:56:05] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2003.codfw.wmnet with reason: host reimage
[07:59:29] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove access for essexigyan [puppet] - https://gerrit.wikimedia.org/r/963256 (owner: Muehlenhoff)
[08:00:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[08:00:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Eigyan out of all services on: 2176 hosts
[08:01:18] <wikibugs>	 (PS5) Clément Goubert: P:mw::deployment::server: Don't alert for train-presync [puppet] - https://gerrit.wikimedia.org/r/953200 (https://phabricator.wikimedia.org/T342755)
[08:01:27] <wikibugs>	 (CR) Clément Goubert: P:mw::deployment::server: Don't alert for train-presync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/953200 (https://phabricator.wikimedia.org/T342755) (owner: Clément Goubert)
[08:01:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Eigyan out of all services on: 2176 hosts
[08:08:48] <wikibugs>	 (PS1) Muehlenhoff: Remove LDAP access for whatamidoing [puppet] - https://gerrit.wikimedia.org/r/963257
[08:11:23] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove LDAP access for whatamidoing [puppet] - https://gerrit.wikimedia.org/r/963257 (owner: Muehlenhoff)
[08:14:26] <wikibugs>	 (PS1) Slyngshede: P:IDM Enable logging of remote IPs. [puppet] - https://gerrit.wikimedia.org/r/963258
[08:14:27] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe2003.codfw.wmnet with OS bullseye
[08:14:32] <wikibugs>	 SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100...
[08:15:59] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43858/console" [puppet] - https://gerrit.wikimedia.org/r/963258 (owner: Slyngshede)
[08:19:10] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe2004.codfw.wmnet with OS bullseye
[08:19:15] <wikibugs>	 SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi...
[08:23:20] <wikibugs>	 (CR) Volans: "Thanks for taking the time to migrate this cookbook to the newer class API." [cookbooks] - https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: Ryan Kemper)
[08:30:39] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/963160 (https://phabricator.wikimedia.org/T345531) (owner: EoghanGaffney)
[08:31:33] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [dns] - https://gerrit.wikimedia.org/r/963161 (https://phabricator.wikimedia.org/T345531) (owner: EoghanGaffney)
[08:37:16] <wikibugs>	 (PS1) Muehlenhoff: Remove access for erayfield [puppet] - https://gerrit.wikimedia.org/r/963259
[08:38:35] <wikibugs>	 (PS1) David Caro: disable_tool: use the gitlab repository [puppet] - https://gerrit.wikimedia.org/r/963260 (https://phabricator.wikimedia.org/T327057)
[08:42:52] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove access for erayfield [puppet] - https://gerrit.wikimedia.org/r/963259 (owner: Muehlenhoff)
[08:43:21] <wikibugs>	 ops-eqiad, Cloud-VPS, Data-Platform-SRE, cloud-services-team, User-aborrero: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (taavi) a:Jclark-ctr Hi! Can we please have `cloudvirt-wdqs100[1-3]` moved to the WMCS racks, preferably `E4` or `F4`? They will all need a s...
[08:43:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging EllenR out of all services on: 2175 hosts
[08:44:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging EllenR out of all services on: 2175 hosts
[08:52:04] <wikibugs>	 (PS1) Muehlenhoff: Remove LDAP access for agueyte [puppet] - https://gerrit.wikimedia.org/r/963262
[08:52:43] <wikibugs>	 (CR) Effie Mouzeli: [C: +1] tegola: Bump codfw to latest image [deployment-charts] - https://gerrit.wikimedia.org/r/962645 (owner: Jgiannelos)
[08:55:34] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove LDAP access for agueyte [puppet] - https://gerrit.wikimedia.org/r/963262 (owner: Muehlenhoff)
[08:57:03] <wikibugs>	 (CR) Jgiannelos: [C: +2] tegola: Bump codfw to latest image [deployment-charts] - https://gerrit.wikimedia.org/r/962645 (owner: Jgiannelos)
[08:57:07] <wikibugs>	 (PS1) Jelto: admin: add scampos to ldap_only users [puppet] - https://gerrit.wikimedia.org/r/963265 (https://phabricator.wikimedia.org/T348001)
[08:57:40] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T343198)', diff saved to https://phabricator.wikimedia.org/P52813 and previous config saved to /var/cache/conftool/dbconfig/20231004-085739-arnaudb.json
[08:57:44] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[08:58:28] <wikibugs>	 (Merged) jenkins-bot: tegola: Bump codfw to latest image [deployment-charts] - https://gerrit.wikimedia.org/r/962645 (owner: Jgiannelos)
[08:59:00] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmf for Sara Campos - https://phabricator.wikimedia.org/T348001 (Jelto) p:Triage→Medium
[09:01:58] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply
[09:02:15] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] rabbitmq: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/945750 (owner: Muehlenhoff)
[09:02:43] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: rabbitmq: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/945750 (owner: Muehlenhoff)
[09:02:48] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply
[09:04:53] <wikibugs>	 (PS1) Muehlenhoff: Remove access for kmorgan [puppet] - https://gerrit.wikimedia.org/r/963267
[09:06:54] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove access for kmorgan [puppet] - https://gerrit.wikimedia.org/r/963267 (owner: Muehlenhoff)
[09:08:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging KMorgan out of all services on: 2175 hosts
[09:08:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging KMorgan out of all services on: 2175 hosts
[09:12:29] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (jbond) Proposal looks good to me, minor nit would be to rename `ACAST_PS_ADVERTISE` to remove references to anycast to avoid con...
[09:12:46] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P52814 and previous config saved to /var/cache/conftool/dbconfig/20231004-091245-arnaudb.json
[09:13:54] <wikibugs>	 (PS1) Muehlenhoff: Remove access for tsepothoabala [puppet] - https://gerrit.wikimedia.org/r/963272
[09:14:09] <wikibugs>	 (CR) CI reject: [V: -1] Remove access for tsepothoabala [puppet] - https://gerrit.wikimedia.org/r/963272 (owner: Muehlenhoff)
[09:16:30] <wikibugs>	 (PS1) Jgiannelos: tegola: Use latest image for all envs [deployment-charts] - https://gerrit.wikimedia.org/r/963273
[09:19:43] <wikibugs>	 (PS2) Muehlenhoff: Remove access for tsepothoabala [puppet] - https://gerrit.wikimedia.org/r/963272
[09:20:04] <wikibugs>	 SRE, SRE-Access-Requests, Data Engineering and Event Platform Team, Data-Engineering, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (Jelto)
[09:21:58] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:22:32] <wikibugs>	 (CR) Muehlenhoff: [C: +1] admin: add scampos to ldap_only users [puppet] - https://gerrit.wikimedia.org/r/963265 (https://phabricator.wikimedia.org/T348001) (owner: Jelto)
[09:25:01] <wikibugs>	 (CR) Jelto: [C: +2] admin: add scampos to ldap_only users [puppet] - https://gerrit.wikimedia.org/r/963265 (https://phabricator.wikimedia.org/T348001) (owner: Jelto)
[09:25:37] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove access for tsepothoabala [puppet] - https://gerrit.wikimedia.org/r/963272 (owner: Muehlenhoff)
[09:25:50] <logmsgbot>	 !log sg912@deploy2002 Started deploy [airflow-dags/analytics@3b374a9]: (no justification provided)
[09:26:11] <wikibugs>	 (PS1) Filippo Giunchedi: mw: allow egress to excimer [deployment-charts] - https://gerrit.wikimedia.org/r/963274 (https://phabricator.wikimedia.org/T347926)
[09:26:35] <logmsgbot>	 !log sg912@deploy2002 Finished deploy [airflow-dags/analytics@3b374a9]: (no justification provided) (duration: 00m 45s)
[09:27:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging TsepoThoabala out of all services on: 2175 hosts
[09:27:52] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P52815 and previous config saved to /var/cache/conftool/dbconfig/20231004-092752-arnaudb.json
[09:28:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging TsepoThoabala out of all services on: 2175 hosts
[09:28:40] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T347687) (owner: Arturo Borrero Gonzalez)
[09:33:06] <wikibugs>	 (CR) Clément Goubert: [C: +1] mw: allow egress to excimer [deployment-charts] - https://gerrit.wikimedia.org/r/963274 (https://phabricator.wikimedia.org/T347926) (owner: Filippo Giunchedi)
[09:33:55] <logmsgbot>	 !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe2004.codfw.wmnet with OS bullseye
[09:34:01] <wikibugs>	 SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100...
[09:34:06] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "As discussed there are some more improvements we can make here I think.  The /30 tricks are cool but probably better to use /29." [puppet] - https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T347687) (owner: Arturo Borrero Gonzalez)
[09:34:37] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] mw: allow egress to excimer [deployment-charts] - https://gerrit.wikimedia.org/r/963274 (https://phabricator.wikimedia.org/T347926) (owner: Filippo Giunchedi)
[09:35:07] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudgw: refactor to set up routes independently from keepalived [puppet] - https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T347687) (owner: Arturo Borrero Gonzalez)
[09:35:24] <wikibugs>	 (Merged) jenkins-bot: mw: allow egress to excimer [deployment-charts] - https://gerrit.wikimedia.org/r/963274 (https://phabricator.wikimedia.org/T347926) (owner: Filippo Giunchedi)
[09:35:50] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe2004.codfw.wmnet with OS bullseye
[09:35:55] <wikibugs>	 SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi...
[09:37:07] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[09:37:23] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[09:37:24] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[09:37:31] <godog>	 hold steady for !log wall
[09:37:41] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[09:37:42] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[09:37:51] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[09:37:52] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[09:38:18] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[09:38:19] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[09:38:32] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[09:38:33] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[09:38:45] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[09:38:46] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[09:38:55] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[09:38:56] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[09:39:08] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[09:39:09] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/mw-misc: apply
[09:39:16] <logmsgbot>	 !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply
[09:39:17] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply
[09:39:31] <logmsgbot>	 !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply
[09:39:49] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Sara Campos - https://phabricator.wikimedia.org/T348001 (10Jelto) 05Open→03Resolved a:03Jelto Thanks for the request. I can confirm the accounts are linked since 2023-10-03.  Sara Campos was added to wmf ldap group. I'm closing this task. Feel free t...
[09:42:59] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T343198)', diff saved to https://phabricator.wikimedia.org/P52816 and previous config saved to /var/cache/conftool/dbconfig/20231004-094258-arnaudb.json
[09:43:00] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
[09:43:03] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[09:43:14] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
[09:43:20] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T343198)', diff saved to https://phabricator.wikimedia.org/P52817 and previous config saved to /var/cache/conftool/dbconfig/20231004-094320-arnaudb.json
[09:50:10] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for sfaci - https://phabricator.wikimedia.org/T348101 (10Jelto) p:05Triage→03Medium
[09:50:25] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] tegola: Use latest image for all envs [deployment-charts] - 10https://gerrit.wikimedia.org/r/963273 (owner: 10Jgiannelos)
[09:50:30] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] tegola: Use latest image for all envs [deployment-charts] - 10https://gerrit.wikimedia.org/r/963273 (owner: 10Jgiannelos)
[09:51:18] <wikibugs>	 (03Merged) 10jenkins-bot: tegola: Use latest image for all envs [deployment-charts] - 10https://gerrit.wikimedia.org/r/963273 (owner: 10Jgiannelos)
[09:51:33] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for sfaci - https://phabricator.wikimedia.org/T348101 (10Jelto) Thanks for the access request.  I need approval from @WDoranWMF as the manager and @odimitrijevic  or @BTullis as the group owners for `analytics-admins`.
[09:52:20] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for sfaci - https://phabricator.wikimedia.org/T348101 (10BTullis) Approved.
[09:53:33] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for sfaci - https://phabricator.wikimedia.org/T348101 (10Jelto)
[09:56:02] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Limit global account linking to LDAP properties page. [software/bitu] - 10https://gerrit.wikimedia.org/r/961702 (owner: 10Slyngshede)
[09:58:35] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: interfaces: set up cloudgw <-> cloudnet routes in the right interface [puppet] - 10https://gerrit.wikimedia.org/r/963279 (https://phabricator.wikimedia.org/T347687)
[09:59:19] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Lucas_Werkmeister_WMDE)
[09:59:47] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "Sry that was my bad." [puppet] - 10https://gerrit.wikimedia.org/r/963279 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1000)
[10:00:36] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations: Remove zxane from restricted and analytics-privatedata-users - https://phabricator.wikimedia.org/T348103 (10Jelto) This task popped up in the clinic duty board because it's a #sre-access-requests. However [wikitech](https://wikitech.wikimedia.org/wi...
[10:01:44] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: interfaces: set up cloudgw <-> cloudnet routes in the right interface [puppet] - 10https://gerrit.wikimedia.org/r/963279 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez)
[10:02:35] <logmsgbot>	 !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe2004.codfw.wmnet with OS bullseye
[10:02:39] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100...
[10:02:51] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe2004.codfw.wmnet with OS bullseye
[10:02:56] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi...
[10:04:34] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove LDAP access for elitre [puppet] - 10https://gerrit.wikimedia.org/r/963280
[10:07:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for elitre [puppet] - 10https://gerrit.wikimedia.org/r/963280 (owner: 10Muehlenhoff)
[10:15:28] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10cmooney) > Otherwise, it should be fairly straightforward: we add the VIP the same way we do for the anycast IPs, making sure to...
[10:15:54] <wikibugs>	 (03PS1) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[10:18:31] <wikibugs>	 (03CR) 10Jbond: "see inline for comments questions" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/962614 (https://phabricator.wikimedia.org/T295774) (owner: 10Cathal Mooney)
[10:18:50] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[10:20:08] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply
[10:20:11] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply
[10:20:16] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply
[10:20:18] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply
[10:20:22] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply
[10:20:35] <logmsgbot>	 !log filippo@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host thanos-fe2004.codfw.wmnet with OS bullseye
[10:20:41] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100...
[10:21:00] <wikibugs>	 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: GPU purchase for ml-staging in codfw - https://phabricator.wikimedia.org/T348118 (10elukey)
[10:21:12] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply
[10:24:54] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: put cloud-realm routes back under keepalived control [puppet] - 10https://gerrit.wikimedia.org/r/963283 (https://phabricator.wikimedia.org/T347687)
[10:25:25] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] cloudgw: put cloud-realm routes back under keepalived control [puppet] - 10https://gerrit.wikimedia.org/r/963283 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez)
[10:25:30] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudgw: put cloud-realm routes back under keepalived control [puppet] - 10https://gerrit.wikimedia.org/r/963283 (https://phabricator.wikimedia.org/T347687)
[10:25:41] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963283 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez)
[10:27:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:27:44] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: put cloud-realm routes back under keepalived control [puppet] - 10https://gerrit.wikimedia.org/r/963283 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez)
[10:27:47] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] cloudgw: put cloud-realm routes back under keepalived control [puppet] - 10https://gerrit.wikimedia.org/r/963283 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez)
[10:27:52] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10MoritzMuehlenhoff) >>! In T247045#9212891, @nshahquinn-wmf wrote: > https://os-reports.wikimedia.org/stretch.html now reports: >> A total of 0 hosts are runni...
[10:29:11] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['thanos-fe2004']
[10:29:34] <logmsgbot>	 !log filippo@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['thanos-fe2004']
[10:30:44] <wikibugs>	 (03PS1) 10Jbond: DO NOT MERGE: test wmflib::have_puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/963284
[10:31:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] DO NOT MERGE: test wmflib::have_puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/963284 (owner: 10Jbond)
[10:31:47] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43859/console" [puppet] - 10https://gerrit.wikimedia.org/r/963284 (owner: 10Jbond)
[10:32:51] <wikibugs>	 (03PS2) 10Jbond: DO NOT MERGE: test wmflib::have_puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/963284
[10:33:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Peachey88)
[10:33:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] DO NOT MERGE: test wmflib::have_puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/963284 (owner: 10Jbond)
[10:33:18] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: thanos-fe2004 fail to boot into PXE - https://phabricator.wikimedia.org/T348119 (10fgiunchedi)
[10:33:48] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:33:55] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43860/console" [puppet] - 10https://gerrit.wikimedia.org/r/963284 (owner: 10Jbond)
[10:37:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:39:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/963052 (https://phabricator.wikimedia.org/T304491) (owner: 10MVernon)
[10:39:11] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:39:42] <wikibugs>	 (03PS3) 10Jbond: DO NOT MERGE: test wmflib::have_puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/963284
[10:40:08] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Firewall filter blocking traceroute in underlay QFX5120 EVPN - https://phabricator.wikimedia.org/T348120 (10cmooney) p:05Triage→03Low
[10:40:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] DO NOT MERGE: test wmflib::have_puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/963284 (owner: 10Jbond)
[10:40:20] <wikibugs>	 (03CR) 10Ladsgroup: mariadb: update the ssl-ca value used by mariadb (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond)
[10:40:58] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43861/console" [puppet] - 10https://gerrit.wikimedia.org/r/963284 (owner: 10Jbond)
[10:43:03] <wikibugs>	 (03PS4) 10Jbond: DO NOT MERGE: test wmflib::have_puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/963284
[10:43:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] DO NOT MERGE: test wmflib::have_puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/963284 (owner: 10Jbond)
[10:44:59] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43862/console" [puppet] - 10https://gerrit.wikimedia.org/r/963284 (owner: 10Jbond)
[10:46:06] <wikibugs>	 (03PS2) 10Kevin Bazira: ml-services: increase recommendation-api-ng uwsgi workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475)
[10:47:36] <wikibugs>	 (03CR) 10Kevin Bazira: "elukey, in T347475#9224115 you had suggested the number 2 but for this test I've used 8, we can always scale them down once we've tested a" [deployment-charts] - 10https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira)
[10:49:06] <wikibugs>	 (03CR) 10Elukey: [C: 04-1] ml-services: increase recommendation-api-ng uwsgi workers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira)
[10:50:37] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations: Remove zxane from restricted and analytics-privatedata-users - https://phabricator.wikimedia.org/T348103 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff >>! In T348103#9224094, @Jelto wrote: > @MoritzMuehlenhoff can I hand this over to you/your team?...
[10:51:25] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Move 25% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T348122 (10Clement_Goubert)
[10:52:05] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove access for zxane [puppet] - 10https://gerrit.wikimedia.org/r/963285 (https://phabricator.wikimedia.org/T348103)
[10:52:46] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[10:52:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Remove access for zxane [puppet] - 10https://gerrit.wikimedia.org/r/963285 (https://phabricator.wikimedia.org/T348103) (owner: 10Muehlenhoff)
[10:54:06] <wikibugs>	 (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[10:54:16] <wikibugs>	 (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[10:54:47] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove access for zxane [puppet] - 10https://gerrit.wikimedia.org/r/963285 (https://phabricator.wikimedia.org/T348103)
[10:55:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Remove access for zxane [puppet] - 10https://gerrit.wikimedia.org/r/963285 (https://phabricator.wikimedia.org/T348103) (owner: 10Muehlenhoff)
[10:57:43] <wikibugs>	 (03PS3) 10Muehlenhoff: Remove access for zxane [puppet] - 10https://gerrit.wikimedia.org/r/963285 (https://phabricator.wikimedia.org/T348103)
[10:58:08] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe2004.codfw.wmnet with OS bullseye
[10:58:14] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi...
[10:58:18] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:00:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove access for zxane [puppet] - 10https://gerrit.wikimedia.org/r/963285 (https://phabricator.wikimedia.org/T348103) (owner: 10Muehlenhoff)
[11:00:44] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Server moves in codfw to support switch numbering scheme - https://phabricator.wikimedia.org/T348125 (10cmooney) p:05Triage→03Medium
[11:00:59] <wikibugs>	 (03PS3) 10Kevin Bazira: ml-services: increase recommendation-api-ng uwsgi workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475)
[11:02:42] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] ml-services: increase recommendation-api-ng uwsgi workers (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira)
[11:02:54] <wikibugs>	 (03CR) 10Kevin Bazira: ml-services: increase recommendation-api-ng uwsgi workers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira)
[11:02:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Zsoo out of all services on: 2175 hosts
[11:03:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:03:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Zsoo out of all services on: 2175 hosts
[11:03:58] <wikibugs>	 (03CR) 10Jbond: "ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[11:04:11] <wikibugs>	 (03Abandoned) 10Jbond: DO NOT MERGE: test wmflib::have_puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/963284 (owner: 10Jbond)
[11:04:13] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+2] ml-services: increase recommendation-api-ng uwsgi workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira)
[11:04:20] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: thanos-fe2004 fail to boot into PXE - https://phabricator.wikimedia.org/T348119 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi I'm not sure how it happened but from the `ctrl-s` menu from broadcom...
[11:04:22] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10fgiunchedi)
[11:05:00] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: increase recommendation-api-ng uwsgi workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/961509 (https://phabricator.wikimedia.org/T347475) (owner: 10Kevin Bazira)
[11:05:07] <wikibugs>	 (03PS2) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[11:06:30] <wikibugs>	 (03PS2) 10Hnowlan: wmnet: add records for edit-, editor- and page-analytics [dns] - 10https://gerrit.wikimedia.org/r/963106 (https://phabricator.wikimedia.org/T336391)
[11:06:35] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10Patch-For-Review: Remove zxane from restricted and analytics-privatedata-users - https://phabricator.wikimedia.org/T348103 (10MoritzMuehlenhoff) 05Open→03Resolved @Nahid: I have removed  Zxane's access to the "restricted" and "analytics-priv...
[11:07:20] <wikibugs>	 (03PS2) 10Jbond: mariadb: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741)
[11:07:25] <wikibugs>	 (03CR) 10Jbond: "updated thanks for the feedback" [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond)
[11:08:25] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw row A-B server moves - port-block constraint / numbering - https://phabricator.wikimedia.org/T348125 (10cmooney)
[11:10:01] <wikibugs>	 (03PS9) 10Jbond: prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373)
[11:10:04] <wikibugs>	 (03CR) 10Jbond: prometheus::class_config: convert to wmflib::puppetdb_query (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[11:10:13] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] wmnet: add records for edit-, editor- and page-analytics [dns] - 10https://gerrit.wikimedia.org/r/963106 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan)
[11:10:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[11:10:17] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] service: add {edit,editor,page}-analytics services [puppet] - 10https://gerrit.wikimedia.org/r/962570 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan)
[11:11:03] <wikibugs>	 (03CR) 10Jbond: "FYI, I'll come back to these patches after the get_clusters patch is merged and working" [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[11:11:13] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] aptrepo: install zip on aptrepo servers [puppet] - 10https://gerrit.wikimedia.org/r/963052 (https://phabricator.wikimedia.org/T304491) (owner: 10MVernon)
[11:12:23] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] gerrit: remove duplicate $gerrit_site definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar)
[11:12:40] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] wmnet: add records for edit-, editor- and page-analytics [dns] - 10https://gerrit.wikimedia.org/r/963106 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan)
[11:12:54] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:dumps::distribution::ferm: pass array directly do ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/963062 (https://phabricator.wikimedia.org/T346165) (owner: 10Jbond)
[11:13:11] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for sfaci - https://phabricator.wikimedia.org/T348101 (10WDoranWMF) Approved
[11:14:17] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[11:14:35] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2004.codfw.wmnet with reason: host reimage
[11:16:45] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for sfaci - https://phabricator.wikimedia.org/T348101 (10Jelto)
[11:17:47] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2004.codfw.wmnet with reason: host reimage
[11:18:32] <wikibugs>	 (03PS1) 10Jelto: admin: add sfaci to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/963288 (https://phabricator.wikimedia.org/T348101)
[11:20:43] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[11:23:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:25:31] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:25:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10cmooney) p:05Triage→03Medium
[11:26:59] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (10cmooney)
[11:26:59] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:27:23] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1142 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:27:59] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1142 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:28:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:30:27] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[11:33:29] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe2004.codfw.wmnet with OS bullseye
[11:33:34] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100...
[11:34:13] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] service: add {edit,editor,page}-analytics services [puppet] - 10https://gerrit.wikimedia.org/r/962570 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan)
[11:35:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/963288 (https://phabricator.wikimedia.org/T348101) (owner: 10Jelto)
[11:38:07] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ayounsi) `ACAST_PS_ADVERTISE` is hardcoded in [[ https://github.com/unixsurfer/anycast_healthchecker | anycast_healthchecker ]]...
[11:41:29] <wikibugs>	 (03PS1) 10Jgiannelos: tegola: Enable structured logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/963289
[11:42:55] <wikibugs>	 (03PS2) 10Jgiannelos: tegola: Enable structured logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/963289 (https://phabricator.wikimedia.org/T347717)
[11:43:12] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi)
[11:43:14] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is complete, `thanos-fe*` ho...
[11:43:28] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi)
[11:43:33] <wikibugs>	 (03CR) 10Jgiannelos: "Now that we use latest tegola we can enable JSON logs." [deployment-charts] - 10https://gerrit.wikimedia.org/r/963289 (https://phabricator.wikimedia.org/T347717) (owner: 10Jgiannelos)
[11:44:17] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is done, we have split `titan*` hosts a...
[11:44:40] <wikibugs>	 (03CR) 10Btullis: "check automatic" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[11:45:13] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1142 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:45:34] <moritzm>	 !log installing exim4 security updates
[11:45:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:01] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1142 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:34] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[11:49:05] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:49:19] <icinga-wm>	 PROBLEM - Check systemd state on analytics1071 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:52:04] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] admin: add sfaci to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/963288 (https://phabricator.wikimedia.org/T348101) (owner: 10Jelto)
[11:52:19] <wikibugs>	 (03PS3) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[11:52:59] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:53:29] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:34] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[12:02:04] <wikibugs>	 (03PS4) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[12:02:41] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[12:05:30] <wikibugs>	 (03PS5) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[12:06:53] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[12:08:59] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-admins for sfaci - https://phabricator.wikimedia.org/T348101 (10Jelto) 05Open→03Resolved a:03Jelto sfaci has access to `analytics-admins` now (in the next 30 minutes). I'm closing the task. Feel free to reopen if you have p...
[12:11:15] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:11:24] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Firewall filter blocking traceroute in underlay QFX5120 EVPN - https://phabricator.wikimedia.org/T348120 (10ayounsi) Nice rabbit hole! I found this: https://www.reddit.com/r/Juniper/comments/g12qxh/the_right_way_to_allow_traceroute_in_re_filter/ So it's possible...
[12:11:27] <icinga-wm>	 RECOVERY - Check systemd state on analytics1071 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:12:07] <wikibugs>	 (03PS6) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[12:12:32] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[12:14:31] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner1003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:17:31] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:45:33] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1074 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:45:43] <icinga-wm>	 PROBLEM - Check systemd state on analytics1074 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:46:40] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 90 days, 0:00:00 on 22 hosts with reason: Downtime for graceful shutdown and later decom
[12:46:58] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 90 days, 0:00:00 on 22 hosts with reason: Downtime for graceful shutdown and later decom
[12:47:41] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (10cmooney) 05Open→03Resolved I am going to close this task, the FPC issue was addressed through card replacement (although we decom'd router in the meantime).  Despite my best efforts i...
[12:51:51] <klausman>	 !log powering off ores100{2..9}.eqiad.wmnet (1001 is kept powered-on in case we need access to files from the old install). The machines have a 90d downtime already put in
[12:51:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:55] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1074 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:53:07] <icinga-wm>	 RECOVERY - Check systemd state on analytics1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:53:14] <klausman>	 !log powering off ores200{2..9}.codfw.wmnet (2001 is kept powered-on in case we need access to files from the old install). The machines have a 90d downtime already put in.
[12:53:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:00] <klausman>	 !log powering off orespoolcounter{1004,2003,2004}.{eqiad,codfw}.wmnet (1003 is kept powered-on in case we need access to files from the old install). The machines have a 90d downtime already put in.
[12:56:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:48] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[12:56:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[12:57:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond)
[12:57:58] <wikibugs>	 10SRE, 10Traffic-Icebox, 10HTTPS, 10Upstream: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (10Aklapper) @DennisJJackson Hi and welcome to Phabricator! What //in this ticket// led you to asking for "retriage" (and what does that mean)?
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1300).
[13:00:05] <jouncebot>	 aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:07] <logmsgbot>	 !log rook@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2006-dev.codfw.wmnet with reason: host reimage
[13:00:12] <wikibugs>	 (03CR) 10Muehlenhoff: Implement Codex design, from design team. (033 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/962003 (https://phabricator.wikimedia.org/T338824) (owner: 10Slyngshede)
[13:00:19] <aanzx>	 o/
[13:03:15] <logmsgbot>	 !log rook@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2006-dev.codfw.wmnet with reason: host reimage
[13:03:22] <taavi>	 o/ I can deploy in a moment
[13:03:44] <Lucas_WMDE>	 o/
[13:03:53] <taavi>	 (unless Lucas is faster)
[13:03:56] <Lucas_WMDE>	 I can also deploy
[13:04:27] <taavi>	 please do
[13:05:53] <Lucas_WMDE>	 ok
[13:06:57] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): fonwiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963066 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx)
[13:07:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963066 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx)
[13:07:47] <wikibugs>	 (03Merged) 10jenkins-bot: fonwiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963066 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx)
[13:08:31] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10cmooney) p:05Triage→03Low
[13:08:41] <wikibugs>	 (03PS8) 10Arturo Borrero Gonzalez: cloudgw: refactor interfaces setting to use the base module [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T347687)
[13:08:45] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469)
[13:08:55] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10cmooney)
[13:08:58] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:963066|fonwiki: add logos (T347939)]]
[13:09:05] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] fonwiki: add wgSiteName, wgMetaNamespace and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx)
[13:09:08] <stashbot>	 T347939: Post-creation work for fonwiki - https://phabricator.wikimedia.org/T347939
[13:09:09] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469)
[13:09:34] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "LGTM but please fix the commit message" [puppet] - 10https://gerrit.wikimedia.org/r/963147 (owner: 10Fabfur)
[13:10:27] <godog>	 nemo-yiannis: latest deployment of tegola is spamming logs and dumping its full request/response, please mitigate or revert, it is overwhelming logstash :(
[13:10:29] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and anzx: Backport for [[gerrit:963066|fonwiki: add logos (T347939)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:10:33] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469)
[13:10:35] <godog>	 filing a task now
[13:10:48] <Lucas_WMDE>	 aanzx: please test
[13:10:50] <aanzx>	 checking
[13:11:02] <Lucas_WMDE>	 godog: should I hold off deploying or is this sufficiently unrelated to mw?
[13:11:13] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez)
[13:11:28] <nemo-yiannis>	 godog: on it
[13:11:41] <wikibugs>	 (03PS1) 10Jbond: wmflib: Add monkey patching [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165)
[13:11:42] <godog>	 Lucas_WMDE: some mw logs will be delayed
[13:11:43] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez)
[13:11:45] <godog>	 nemo-yiannis: thank you
[13:11:45] <wikibugs>	 (03PS1) 10Jbond: realm: test monkey patch [puppet] - 10https://gerrit.wikimedia.org/r/963300
[13:11:49] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez)
[13:11:50] <Lucas_WMDE>	 ok
[13:11:58] <aanzx>	 Lucas_WMDE: looks good
[13:12:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib: Add monkey patching [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165) (owner: 10Jbond)
[13:12:59] <wikibugs>	 (03PS1) 10Jgiannelos: Revert "Enable aws-sdk (s3) debug logging" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/963301
[13:13:52] <Lucas_WMDE>	 there’s an “expected but failed to find position index” error in mwdebug logstash but that seems to be unrelated
[13:13:55] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43863/console" [puppet] - 10https://gerrit.wikimedia.org/r/963300 (owner: 10Jbond)
[13:13:59] <Lucas_WMDE>	 there’s lots of instances of that on mediawiki-warnings
[13:13:59] <nemo-yiannis>	 godog: can you take a look at this patch? https://gerrit.wikimedia.org/r/c/operations/software/tegola/+/963301
[13:14:05] <wikibugs>	 (03PS2) 10Fabfur: purged: use unix socket for varnish in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/963147 (https://phabricator.wikimedia.org/T347837)
[13:14:10] <Lucas_WMDE>	 let’s sync then
[13:14:11] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and anzx: Continuing with sync
[13:14:18] <wikibugs>	 (03CR) 10Fabfur: purged: use unix socket for varnish in eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963147 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[13:14:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "Enable aws-sdk (s3) debug logging" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/963301 (owner: 10Jgiannelos)
[13:14:26] <wikibugs>	 (03CR) 10Cathal Mooney: "Overall looks good to me, one comment.  I'm not familiar with those interface:: classes but agree if it works how it looks it seems cleane" [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez)
[13:14:28] <urandom>	 !log Cassandra bootstrap, restbase1030-a (`auto_bootstrap: false`) — T346803
[13:14:28] <wikibugs>	 (03PS9) 10Arturo Borrero Gonzalez: cloudgw: refactor interfaces setting to use the base module [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T347687)
[13:14:29] <godog>	 nemo-yiannis: for sure! LGTM
[13:14:30] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469)
[13:14:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:32] <stashbot>	 T346803: Unable to bootstrap restbase1030-{a,b,c} - https://phabricator.wikimedia.org/T346803
[13:14:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] purged: use unix socket for varnish in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/963147 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[13:14:44] <godog>	 filed related task as https://phabricator.wikimedia.org/T348141
[13:14:58] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] Revert "Enable aws-sdk (s3) debug logging" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/963301 (owner: 10Jgiannelos)
[13:15:12] <wikibugs>	 (03Abandoned) 10Jbond: realm: test monkey patch [puppet] - 10https://gerrit.wikimedia.org/r/963300 (owner: 10Jbond)
[13:15:46] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Enable aws-sdk (s3) debug logging" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/963301 (owner: 10Jgiannelos)
[13:15:48] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469)
[13:16:07] <Daimona>	 Lucas_WMDE: Hi, would you be willing to deploy a couple of last-minute patches?
[13:16:36] <Lucas_WMDE>	 Daimona: there’s still one regular config change in the queue
[13:16:38] <Lucas_WMDE>	 are they urgent?
[13:16:40] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.234:9042 on restbase1030 is OK: TCP OK - 0.000 second response time on 10.64.48.234 port 9042 https://phabricator.wikimedia.org/T93886
[13:16:47] <Daimona>	 ("What patches", you may ask. I still have to write them...)
[13:17:05] <Lucas_WMDE>	 (also kinda waiting for godog / nemo-yiannis to be done fixing logstash – I only realized after continuing the scap that delayed logstash makes the canaries less useful)
[13:17:18] <Daimona>	 Not urgent in the sense that something's broken. Just rolling out a new feature. It can wait if we're having infra issues
[13:17:44] <wikibugs>	 (03PS10) 10Arturo Borrero Gonzalez: cloudgw: refactor interfaces setting to use the base module [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T347687)
[13:17:46] <wikibugs>	 (03PS6) 10Arturo Borrero Gonzalez: cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469)
[13:17:56] <Daimona>	 (The feature rollout itself is overdue, but I can schedule it for another window today or tomorrow)
[13:18:20] <wikibugs>	 (03PS1) 10Fabfur: purged: use unix socket for varnish in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/963302 (https://phabricator.wikimedia.org/T347837)
[13:18:30] <Lucas_WMDE>	 I don’t think the infra issues will persist long
[13:18:34] <Lucas_WMDE>	 (a patch was already merged)
[13:18:35] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez)
[13:18:40] <godog>	 Lucas_WMDE: thank you, yes you are correct that the lag might impact the canaries check
[13:18:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] purged: use unix socket for varnish in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/963302 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[13:19:22] <Lucas_WMDE>	 ok, the config changes are low risk I think but let’s still wait with the second one then
[13:19:26] <logmsgbot>	 !log rook@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2006-dev.codfw.wmnet with OS bullseye
[13:19:29] <Lucas_WMDE>	 (first one is currently at 62% php-fpm-restart)
[13:19:42] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez)
[13:19:59] <wikibugs>	 (03PS1) 10Jgiannelos: tegola: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/963303
[13:20:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:20:42] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:963066|fonwiki: add logos (T347939)]] (duration: 11m 43s)
[13:20:45] <wikibugs>	 (03PS2) 10Fabfur: purged: use unix socket for varnish in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/963302 (https://phabricator.wikimedia.org/T347837)
[13:20:46] <stashbot>	 T347939: Post-creation work for fonwiki - https://phabricator.wikimedia.org/T347939
[13:20:59] <Lucas_WMDE>	 Daimona: feel free to start uploading patches, at least ^^
[13:21:12] <Daimona>	 Yup, writing them now, sorry
[13:21:25] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] tegola: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/963303 (owner: 10Jgiannelos)
[13:21:59] <Daimona>	 And it's actually 3 patches, but one is just beta
[13:22:17] <wikibugs>	 (03Merged) 10jenkins-bot: tegola: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/963303 (owner: 10Jgiannelos)
[13:22:57] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[13:23:35] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply
[13:23:44] <wikibugs>	 10SRE, 10Traffic-Icebox, 10HTTPS, 10Upstream: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (10DennisJJackson) @Aklapper - It looks like this issue was originally raised several years ago and put in the icebox. I'm flagging that the situation around standardization and deploy...
[13:24:05] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply
[13:24:05] <wikibugs>	 (03PS1) 10Daimona Eaytoy: beta: Explicitly assign campaignevents-email-participants to all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963305 (https://phabricator.wikimedia.org/T336939)
[13:24:11] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply
[13:24:27] <wikibugs>	 (03Abandoned) 10Fabfur: purged: use unix socket for varnish in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/963302 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[13:24:49] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply
[13:25:00] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply
[13:25:05] <wikibugs>	 (03PS3) 10Fabfur: purged: use unix socket for varnish in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/963147 (https://phabricator.wikimedia.org/T347837)
[13:25:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:25:40] <wikibugs>	 (03CR) 10Fabfur: [C: 03+2] purged: use unix socket for varnish in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/963147 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[13:25:48] <wikibugs>	 10ops-codfw: codfw: Move sessionstore2001 to B8 - https://phabricator.wikimedia.org/T348142 (10Jhancock.wm)
[13:25:55] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply
[13:25:56] <wikibugs>	 (03PS5) 10Slyngshede: Implement Codex design, from design team. [software/bitu] - 10https://gerrit.wikimedia.org/r/962003 (https://phabricator.wikimedia.org/T338824)
[13:26:54] <wikibugs>	 (03CR) 10Slyngshede: "Tool tips are back 😊" [software/bitu] - 10https://gerrit.wikimedia.org/r/962003 (https://phabricator.wikimedia.org/T338824) (owner: 10Slyngshede)
[13:27:08] <wikibugs>	 (03PS1) 10Daimona Eaytoy: metawiki: Restrict campaignevents-email-participants right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963306 (https://phabricator.wikimedia.org/T336939)
[13:27:30] <nemo-yiannis>	 godog: tegola log rate should be more reasonable now
[13:27:36] <fabfur>	 !log merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/963147 (T347837). `purged` daemon will be restarted by puppet in eqiad in the next 30m
[13:27:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:40] <stashbot>	 T347837: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837
[13:28:17] <godog>	 nemo-yiannis: indeed, I can confirm the kafka lag is going down, thank you for the quick action on this
[13:28:23] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur)
[13:28:29] <wikibugs>	 (03PS2) 10Jbond: wmflib: Add monkey patching [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165)
[13:28:39] <Lucas_WMDE>	 alright, then I’ll continue now
[13:28:52] <wikibugs>	 (03PS7) 10Lucas Werkmeister (WMDE): fonwiki: add wgSiteName, wgMetaNamespace and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx)
[13:29:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib: Add monkey patching [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165) (owner: 10Jbond)
[13:29:03] <godog>	 Lucas_WMDE: there's still some lag though it should clear in 10m or so FYI, should be safe to proceed
[13:29:08] <Lucas_WMDE>	 ah ok
[13:29:13] <Lucas_WMDE>	 yeah it’ll have to go out to mwdebug first and all
[13:29:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx)
[13:29:28] <Lucas_WMDE>	 can I see the lag somewhere?
[13:29:45] <godog>	 yes sorry https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus%2Fops&orgId=1&from=now-30m&to=now&var-topic=All&var-consumer_group=All
[13:30:01] <godog>	 err, this one
[13:30:02] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43864/console" [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165) (owner: 10Jbond)
[13:30:02] <Lucas_WMDE>	 cool, thank you!
[13:30:03] <godog>	 https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus%2Fops&orgId=1&from=now-30m&to=now&var-topic=All&var-consumer_group=logstash7-eqiad
[13:30:10] <wikibugs>	 (03PS1) 10Daimona Eaytoy: prod: Enable wgCampaignEventsEnableEmail in meta and officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963307 (https://phabricator.wikimedia.org/T347065)
[13:30:14] <Lucas_WMDE>	 I’ll look at that before continuing the sync at the end then
[13:30:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] fonwiki: add wgSiteName, wgMetaNamespace and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx)
[13:30:20] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v7.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/963308
[13:30:45] <godog>	 Lucas_WMDE: ack
[13:31:17] <Lucas_WMDE>	 oops, scap failed actually
[13:31:19] * Lucas_WMDE looks
[13:31:35] <Lucas_WMDE>	 “mkdir: cannot create directory ‘log’: Permission denied” https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-test-docker/6591/console
[13:31:39] <Lucas_WMDE>	 that’s a transient one isn’t it
[13:31:43] <Lucas_WMDE>	 I’ll try again
[13:31:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx)
[13:32:00] <godog>	 quite poetic
[13:32:25] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "trying again" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx)
[13:32:29] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx)
[13:32:37] <Daimona>	 Lucas_WMDE: thanks for bearing with me, my patches are https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/963307 and its 2 dependencies
[13:32:40] <wikibugs>	 (03Merged) 10jenkins-bot: fonwiki: add wgSiteName, wgMetaNamespace and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx)
[13:32:47] <Lucas_WMDE>	 yay now it went through
[13:32:48] * Lucas_WMDE looks
[13:33:06] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:963036|fonwiki: add wgSiteName, wgMetaNamespace and timezone (T347939)]]
[13:33:10] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] Revert "Revert "[openstack] upgrade codfw1dev to Antelope (2023.1)"" [puppet] - 10https://gerrit.wikimedia.org/r/963029 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri)
[13:33:22] <stashbot>	 T347939: Post-creation work for fonwiki - https://phabricator.wikimedia.org/T347939
[13:33:33] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43866/console" [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165) (owner: 10Jbond)
[13:33:47] <Lucas_WMDE>	 jouncebot: next
[13:33:47] <jouncebot>	 In 0 hour(s) and 26 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1400)
[13:34:08] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10cmooney)
[13:34:30] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 anzx and lucaswerkmeister-wmde: Backport for [[gerrit:963036|fonwiki: add wgSiteName, wgMetaNamespace and timezone (T347939)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:34:36] <Lucas_WMDE>	 aanzx: please test :)
[13:34:37] <aanzx>	 checking
[13:34:38] <Lucas_WMDE>	 ok
[13:36:45] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: move routes out of keepalived into interfaces [puppet] - 10https://gerrit.wikimedia.org/r/963311 (https://phabricator.wikimedia.org/T347469)
[13:37:06] <wikibugs>	 (03PS3) 10Jbond: wmflib: Add monkey patching [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165)
[13:37:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudgw: move routes out of keepalived into interfaces [puppet] - 10https://gerrit.wikimedia.org/r/963311 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez)
[13:37:40] <wikibugs>	 (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v7.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/963308 (owner: 10Volans)
[13:38:06] <wikibugs>	 10SRE, 10Traffic-Icebox, 10HTTPS, 10Upstream: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (10ssingh) Hi @DennisJJackson: Thanks for the question. We do plan to work on ECH and enable it for our sites and have had some discussions internally. There is no timeline yet as such...
[13:38:20] <Lucas_WMDE>	 seems to be working for me at least
[13:38:24] <aanzx>	 Lucas_WMDE: looks good
[13:38:29] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43867/console" [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165) (owner: 10Jbond)
[13:38:30] <Lucas_WMDE>	 ok thanks!
[13:39:13] <Lucas_WMDE>	 hm, logstash hasn’t caught up yet
[13:39:20] <Lucas_WMDE>	 (though it’s definitely improving)
[13:39:45] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudgw: move routes out of keepalived into interfaces [puppet] - 10https://gerrit.wikimedia.org/r/963311 (https://phabricator.wikimedia.org/T347469)
[13:39:54] <Lucas_WMDE>	 I’ll start it, this should be low risk
[13:39:56] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 anzx and lucaswerkmeister-wmde: Continuing with sync
[13:40:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib: Add monkey patching [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165) (owner: 10Jbond)
[13:40:13] <wikibugs>	 (03CR) 10David Caro: "It seems that cloudbackups are being resolved as ip6 (from pcc):" [puppet] - 10https://gerrit.wikimedia.org/r/961783 (owner: 10Majavah)
[13:40:23] <godog>	 Lucas_WMDE: +1
[13:40:28] <icinga-wm>	 PROBLEM - Check systemd state on releases2003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_exim4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:40:48] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10Papaul) @cmooney this should be a complication if we did have a mixed of 1G and 10G servers within the same rack which is not the case. In all exist...
[13:42:18] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v7.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/963308 (owner: 10Volans)
[13:43:09] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ssingh) >>! In T347054#9223568, @ayounsi wrote: > I was wondering what to do for all the appliances that have ntp.site.wikimedia.o...
[13:44:12] <wikibugs>	 (03CR) 10Klausman: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144) (owner: 10Klausman)
[13:44:36] <wikibugs>	 (03PS1) 10Volans: Upstream release v7.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/963314
[13:45:04] <aanzx>	 Lucas_WMDE: can you run namespaceDupes.php after the sync?
[13:45:14] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963311 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez)
[13:45:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:45:37] <Lucas_WMDE>	 aanzx: so far it says there’s nothing to do
[13:45:48] <Lucas_WMDE>	 and I assume mwmaint has the updated code already
[13:45:51] <Lucas_WMDE>	 but I can check again after the scap is done
[13:46:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Ship it" [software/bitu] - 10https://gerrit.wikimedia.org/r/962003 (https://phabricator.wikimedia.org/T338824) (owner: 10Slyngshede)
[13:46:53] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:963036|fonwiki: add wgSiteName, wgMetaNamespace and timezone (T347939)]] (duration: 13m 46s)
[13:46:57] <stashbot>	 T347939: Post-creation work for fonwiki - https://phabricator.wikimedia.org/T347939
[13:47:15] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Implement Codex design, from design team. [software/bitu] - 10https://gerrit.wikimedia.org/r/962003 (https://phabricator.wikimedia.org/T338824) (owner: 10Slyngshede)
[13:47:25] <wikibugs>	 (03PS4) 10Jbond: wmflib: Add monkey patching [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T346165)
[13:47:27] <Lucas_WMDE>	 !log mwscript namespaceDupes fonwiki --fix # T347939 – 0 pages to fix, 0 resolvable; 0 links to fix, 0 resolvable, 0 deleted
[13:47:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:35] <Lucas_WMDE>	 aanzx: ^
[13:47:44] <aanzx>	 thanks
[13:47:49] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): beta: Explicitly assign campaignevents-email-participants to all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963305 (https://phabricator.wikimedia.org/T336939) (owner: 10Daimona Eaytoy)
[13:47:55] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): metawiki: Restrict campaignevents-email-participants right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963306 (https://phabricator.wikimedia.org/T336939) (owner: 10Daimona Eaytoy)
[13:48:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963305 (https://phabricator.wikimedia.org/T336939) (owner: 10Daimona Eaytoy)
[13:48:14] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963306 (https://phabricator.wikimedia.org/T336939) (owner: 10Daimona Eaytoy)
[13:48:27] <Lucas_WMDE>	 Daimona: deploying the first two changes for now
[13:48:31] <Lucas_WMDE>	 we’ll see if there’s time for the third one
[13:48:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] beta: Explicitly assign campaignevents-email-participants to all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963305 (https://phabricator.wikimedia.org/T336939) (owner: 10Daimona Eaytoy)
[13:48:37] <Lucas_WMDE>	 (maybe after the wikifunctions window)
[13:48:42] <Lucas_WMDE>	 oops, CI reject :(
[13:48:57] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Upstream release v7.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/963314 (owner: 10Volans)
[13:48:59] <wikibugs>	 (03Merged) 10jenkins-bot: beta: Explicitly assign campaignevents-email-participants to all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963305 (https://phabricator.wikimedia.org/T336939) (owner: 10Daimona Eaytoy)
[13:49:00] <Lucas_WMDE>	 ugh, the same operation not permitted error again
[13:49:03] <wikibugs>	 (03Merged) 10jenkins-bot: metawiki: Restrict campaignevents-email-participants right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963306 (https://phabricator.wikimedia.org/T336939) (owner: 10Daimona Eaytoy)
[13:49:07] <Daimona>	 Yup, sounds good to me, ty!
[13:49:10] <Lucas_WMDE>	 ok, it was only the test, not the gate-and-submit 🤷
[13:49:29] <Lucas_WMDE>	 ugh, but now it got merged and the scap backport exited…
[13:49:30] <Daimona>	 The first 2 should also be no-ops, so...
[13:49:32] <wikibugs>	 (03CR) 10David Caro: P:cloudceph: cleanup firewall rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961783 (owner: 10Majavah)
[13:49:38] * Lucas_WMDE restarts it
[13:49:54] <wikibugs>	 (03PS1) 10Jclark-ctr: correct an-master1003,4 in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/963315 (https://phabricator.wikimedia.org/T342291)
[13:49:55] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:963305|beta: Explicitly assign campaignevents-email-participants to all users (T336939)]], [[gerrit:963306|metawiki: Restrict campaignevents-email-participants right (T336939)]]
[13:49:58] <stashbot>	 T336939: Add new user right to meta - https://phabricator.wikimedia.org/T336939
[13:50:21] <wikibugs>	 (03CR) 10Ladsgroup: "I don't know this well enough to confidently say it should be merged but generally speaking it looks okay." [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond)
[13:50:22] <Lucas_WMDE>	 Daimona: shouldn’t the second one have an effect?
[13:50:28] <Lucas_WMDE>	 or do you mean it’s a no-op because the right doesn’t do anything yet?
[13:50:35] <jinxer-wm>	 (KubernetesAPILatency) resolved: (26) High Kubernetes API latency (GET blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:50:42] <Daimona>	 Yeah, it has an effect but there should be nothing using that right
[13:50:51] <Lucas_WMDE>	 ok, got it
[13:50:56] <Lucas_WMDE>	 but I can still test it with the siteinfo api ^^
[13:50:59] <Daimona>	 Still testable with UserGroupRights though I guess
[13:51:02] <Daimona>	 ^^^
[13:51:12] <wikibugs>	 (03CR) 10Jclark-ctr: [C: 03+2] correct an-master1003,4 in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/963315 (https://phabricator.wikimedia.org/T342291) (owner: 10Jclark-ctr)
[13:51:16] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 daimona and lucaswerkmeister-wmde: Backport for [[gerrit:963305|beta: Explicitly assign campaignevents-email-participants to all users (T336939)]], [[gerrit:963306|metawiki: Restrict campaignevents-email-participants right (T336939)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:52:21] <wikibugs>	 (03PS7) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[13:52:26] <Lucas_WMDE>	 okay, API output diff looks good to me
[13:52:34] <Lucas_WMDE>	 Daimona: agree? ^^
[13:53:10] <wikibugs>	 (03Merged) 10jenkins-bot: Upstream release v7.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/963314 (owner: 10Volans)
[13:53:32] <Daimona>	 Lemme see
[13:53:51] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye
[13:54:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye
[13:54:11] <Daimona>	 Yup, LGTM
[13:54:14] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 daimona and lucaswerkmeister-wmde: Continuing with sync
[13:54:19] <Lucas_WMDE>	 \o/
[13:56:25] <Daimona>	 Lucas_WMDE: As for the third patch: I was waiting for someone from my team to volunteer for testing it, but nobody seems to be available, and I don't have those rights on meta
[13:56:34] <Lucas_WMDE>	 ah, hm
[13:56:40] <Daimona>	 So, given that we're also approaching the end of the window, I think it should be done another time
[13:56:55] <Lucas_WMDE>	 ok
[13:57:04] <wikibugs>	 (03CR) 10Ottomata: "I'd like to take a stab at doing the broader changes needed for automating this, but I probably won't have time very soon.  Don't want to " [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: 10Ebernhardson)
[13:57:10] <Daimona>	 Oh wait
[13:57:27] <Daimona>	 There's actually someone available. But still, it's late, so I'm also still fine with doing that another time
[13:57:49] <Lucas_WMDE>	 I think this deployment will already overrun a little bit into the wikifunctions window
[13:57:54] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.decommission for hosts ores2008.codfw.wmnet
[13:57:58] <Lucas_WMDE>	 it only just started php-fpm-restart
[13:58:07] <Lucas_WMDE>	 let’s see how much wikifunctions stuff there is to do, I guess
[13:58:30] <wikibugs>	 (03PS8) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[13:58:31] <Daimona>	 Ok, ty
[14:00:04] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1400)
[14:00:12] <Lucas_WMDE>	 I’m still deploying, please hold
[14:00:32] <wikibugs>	 (03PS1) 10Fabfur: purged: use unix socket for varnish in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/963321 (https://phabricator.wikimedia.org/T347837)
[14:00:35] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:963305|beta: Explicitly assign campaignevents-email-participants to all users (T336939)]], [[gerrit:963306|metawiki: Restrict campaignevents-email-participants right (T336939)]] (duration: 10m 40s)
[14:00:39] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) Thanks everyone for the discussion and feedback above! So it seems like two main points have come up above:  1. We can c...
[14:00:40] <Lucas_WMDE>	 alright, I’m done for now
[14:00:45] <stashbot>	 T336939: Add new user right to meta - https://phabricator.wikimedia.org/T336939
[14:00:50] <Lucas_WMDE>	 is there anything to deploy from wikifunctions?
[14:00:59] <Lucas_WMDE>	 otherwise I have one more config change I’d like to do
[14:01:01] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10jbond) >ACAST_PS_ADVERTISE is hardcoded in anycast_healthchecker (the tool we use to monitor services). in that case agree its t...
[14:01:04] * Lucas_WMDE will wait a few minutes
[14:01:04] <jinxer-wm>	 (KubernetesAPILatency) firing: (22) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:01:38] <wikibugs>	 10SRE, 10ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (10Jhancock.wm)
[14:03:12] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) >>! In T348041#9224990, @jbond wrote: >>ACAST_PS_ADVERTISE is hardcoded in anycast_healthchecker (the tool we use to mon...
[14:03:41] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43868/console" [puppet] - 10https://gerrit.wikimedia.org/r/963321 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[14:04:32] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.dns.netbox
[14:05:35] <wikibugs>	 (03PS2) 10Fabfur: purged: use unix socket for varnish in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/963321 (https://phabricator.wikimedia.org/T347837)
[14:05:47] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.decommission for hosts ores2007.codfw.wmnet
[14:06:05] <jinxer-wm>	 (KubernetesAPILatency) resolved: (25) High Kubernetes API latency (GET blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:07:44] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores2008.codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001"
[14:08:34] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.decommission for hosts ores2006.codfw.wmnet
[14:08:47] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43869/console" [puppet] - 10https://gerrit.wikimedia.org/r/963321 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[14:08:56] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores2008.codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001"
[14:08:56] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:08:57] <logmsgbot>	 !log klausman@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ores2008.codfw.wmnet
[14:10:13] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.decommission for hosts ores2005.codfw.wmnet
[14:10:36] <Lucas_WMDE>	 doesn’t sound like there’s anything to do for wikifunctions today
[14:10:45] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.dns.netbox
[14:10:48] <Lucas_WMDE>	 Daimona: if you still have a tester available then I think we can go ahead
[14:10:56] <Daimona>	 Yup, I do, thank you
[14:11:00] <Lucas_WMDE>	 ok, let’s go
[14:11:13] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): prod: Enable wgCampaignEventsEnableEmail in meta and officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963307 (https://phabricator.wikimedia.org/T347065) (owner: 10Daimona Eaytoy)
[14:11:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963307 (https://phabricator.wikimedia.org/T347065) (owner: 10Daimona Eaytoy)
[14:12:15] <wikibugs>	 (03Merged) 10jenkins-bot: prod: Enable wgCampaignEventsEnableEmail in meta and officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963307 (https://phabricator.wikimedia.org/T347065) (owner: 10Daimona Eaytoy)
[14:12:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:12:44] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:963307|prod: Enable wgCampaignEventsEnableEmail in meta and officewiki (T347065)]]
[14:12:47] <stashbot>	 T347065: Release the email participants feature - https://phabricator.wikimedia.org/T347065
[14:12:52] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.decommission for hosts ores2009.codfw.wmnet
[14:13:51] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores2007.codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001"
[14:14:07] <wikibugs>	 (03PS9) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[14:14:12] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 daimona and lucaswerkmeister-wmde: Backport for [[gerrit:963307|prod: Enable wgCampaignEventsEnableEmail in meta and officewiki (T347065)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:14:44] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.dns.netbox
[14:15:03] <wikibugs>	 (03CR) 10Papaul: [V: 03+2] correct an-master1003,4 in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/963315 (https://phabricator.wikimedia.org/T342291) (owner: 10Jclark-ctr)
[14:15:50] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.dns.netbox
[14:16:30] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores2007.codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001"
[14:16:31] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:16:31] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ores2007.codfw.wmnet
[14:16:45] <urandom>	 !log starting Cassandra rebuild, restbase1030-a — T346803
[14:16:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:48] <stashbot>	 T346803: Unable to bootstrap restbase1030-{a,b,c} - https://phabricator.wikimedia.org/T346803
[14:17:04] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores2006.codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001"
[14:17:09] <Lucas_WMDE>	 Daimona: can you test the change?
[14:17:13] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:17:13] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ores2005.codfw.wmnet
[14:17:16] <Daimona>	 Yup, coordinating right now
[14:17:21] <Lucas_WMDE>	 ok thanks
[14:17:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:17:55] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores2006.codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001"
[14:17:55] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:17:56] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ores2006.codfw.wmnet
[14:17:57] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.dns.netbox
[14:18:01] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.decommission for hosts ores[1002-1009].eqiad.wmnet
[14:19:01] <wikibugs>	 (03PS1) 10Andrew Bogott: Move radosgw/swift API to port 443, the standard swift port [puppet] - 10https://gerrit.wikimedia.org/r/963325 (https://phabricator.wikimedia.org/T341380)
[14:20:16] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores2009.codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001"
[14:21:18] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores2009.codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001"
[14:21:18] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:21:19] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ores2009.codfw.wmnet
[14:21:46] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] purged: use unix socket for varnish in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/963321 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[14:21:54] <wikibugs>	 (03PS10) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[14:22:03] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.decommission for hosts ores[2001-2004].codfw.wmnet
[14:22:21] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye
[14:22:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:23:27] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1 C: 03+2] purged: use unix socket for varnish in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/963321 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[14:23:34] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e...
[14:24:01] <fabfur>	 !log merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/963321 (T347837). `purged` daemon will be restarted by puppet in drmrs in the next 30m
[14:24:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:05] <stashbot>	 T347837: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837
[14:25:00] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Move radosgw/swift API to port 443, the standard swift port [puppet] - 10https://gerrit.wikimedia.org/r/963325 (https://phabricator.wikimedia.org/T341380) (owner: 10Andrew Bogott)
[14:25:09] <Daimona>	 Lucas_WMDE: It's working!
[14:25:20] <Lucas_WMDE>	 \o/
[14:25:22] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 daimona and lucaswerkmeister-wmde: Continuing with sync
[14:25:37] <Lucas_WMDE>	 anakin_phantom_menace.gif
[14:25:55] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deploy new purged version with UDS feature - https://phabricator.wikimedia.org/T347837 (10Fabfur)
[14:26:05] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Move radosgw/swift API to port 443, the standard swift port [puppet] - 10https://gerrit.wikimedia.org/r/963325 (https://phabricator.wikimedia.org/T341380) (owner: 10Andrew Bogott)
[14:26:43] * Daimona staring at my IRC client that does not display GIFs, I guess :(
[14:26:56] <Daimona>	 But google always has an answer for you :D
[14:27:21] <Lucas_WMDE>	 I just typed a fake file name and trusted your brain to fill it in :P
[14:27:31] <Lucas_WMDE>	 my client definitely doesn’t support gifs either
[14:28:02] <Daimona>	 Oooooooh :D I just by default assumed that gifs were too much for ye olde hexchat
[14:28:09] <Lucas_WMDE>	 :D
[14:29:03] <wikibugs>	 (03PS11) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[14:29:18] * Daimona is happy because, OTOH, his IRC client automatically replaces passwords with ********* when you type them :P
[14:29:44] <klausman>	 ah, good ole' hunter1
[14:29:52] <Daimona>	 After all, that's the must-have feature for all IRC clients
[14:30:04] <Lucas_WMDE>	 :D
[14:31:10] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:963307|prod: Enable wgCampaignEventsEnableEmail in meta and officewiki (T347065)]] (duration: 18m 26s)
[14:31:14] <stashbot>	 T347065: Release the email participants feature - https://phabricator.wikimedia.org/T347065
[14:31:16] * Lucas_WMDE observes klausman has an older version of hunter
[14:31:48] <rzl>	 it's a shame about bash.org! end of an era
[14:32:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:33:08] <Lucas_WMDE>	 a german one (http://ibash.de/) grew a weird second life – last quote from ten months ago, but people are just chatting in the comments now, https://xkcd.com/1305/ -style
[14:33:22] <Daimona>	 Oh yeah, the new version is *******
[14:34:16] <wikibugs>	 (03PS1) 10Fabfur: purged: use unix socket for varnish in all DCs. [puppet] - 10https://gerrit.wikimedia.org/r/963326 (https://phabricator.wikimedia.org/T347837)
[14:34:28] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.dns.netbox
[14:34:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] purged: use unix socket for varnish in all DCs. [puppet] - 10https://gerrit.wikimedia.org/r/963326 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[14:36:36] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores[2001-2004].codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001"
[14:36:45] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ayounsi) Ah right! My bad.  Unrelated and maybe a scope creep, but we could also start by advertising a unicast v6 IP to validat...
[14:37:23] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.dns.netbox
[14:37:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:38:28] <Lucas_WMDE>	 !log spontaneously extended UTC afternoon backport+config window done now
[14:38:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:47] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:38:48] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:48] <logmsgbot>	 !log klausman@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ores[1002-1009].eqiad.wmnet
[14:39:07] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores[2001-2004].codfw.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001"
[14:39:07] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:39:08] <logmsgbot>	 !log klausman@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ores[2001-2004].codfw.wmnet
[14:39:43] <wikibugs>	 (03PS1) 10Btullis: Bump the maximum number of HDFS files allwoed before triggering an alert [alerts] - 10https://gerrit.wikimedia.org/r/963327 (https://phabricator.wikimedia.org/T342587)
[14:40:35] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) 05Open→03In progress p:05Medium→03Low a:03bking
[14:41:29] <wikibugs>	 (03PS2) 10Btullis: Bump the maximum number of HDFS files allwoed before triggering an alert [alerts] - 10https://gerrit.wikimedia.org/r/963327 (https://phabricator.wikimedia.org/T342587)
[14:41:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Bump the maximum number of HDFS files allwoed before triggering an alert [alerts] - 10https://gerrit.wikimedia.org/r/963327 (https://phabricator.wikimedia.org/T342587) (owner: 10Btullis)
[14:41:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) Taking this back, as I was able to get the host to boot by changing the boot option for the 2nd NIC interfac...
[14:41:49] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] P:mw::deployment::server: Don't alert for train-presync [puppet] - 10https://gerrit.wikimedia.org/r/953200 (https://phabricator.wikimedia.org/T342755) (owner: 10Clément Goubert)
[14:42:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:42:43] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Bump the maximum number of HDFS files allwoed before triggering an alert [alerts] - 10https://gerrit.wikimedia.org/r/963327 (https://phabricator.wikimedia.org/T342587) (owner: 10Btullis)
[14:43:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:45:04] <wikibugs>	 (03PS3) 10Btullis: Bump the maximum number of HDFS files allwoed before triggering an alert [alerts] - 10https://gerrit.wikimedia.org/r/963327 (https://phabricator.wikimedia.org/T342587)
[14:48:52] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:49:11] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:49:56] <wikibugs>	 (03PS1) 10Bking: partman: fix raid0-3dev.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963328 (https://phabricator.wikimedia.org/T342463)
[14:52:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] partman: fix raid0-3dev.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963328 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking)
[14:53:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:54:01] <wikibugs>	 (03CR) 10Bking: [C: 03+2] partman: fix raid0-3dev.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963328 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking)
[14:55:41] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1003.eqiad.wmnet with OS bullseye
[14:55:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w...
[14:55:52] <wikibugs>	 (03PS1) 10Majavah: wikimediacloud: Add a dedicated CNAME for object storage [dns] - 10https://gerrit.wikimedia.org/r/963330 (https://phabricator.wikimedia.org/T341380)
[14:56:11] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1007.eqiad.wmnet with OS bullseye
[14:56:19] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad...
[14:56:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wikimediacloud: Add a dedicated CNAME for object storage [dns] - 10https://gerrit.wikimedia.org/r/963330 (https://phabricator.wikimedia.org/T341380) (owner: 10Majavah)
[14:57:21] <wikibugs>	 (03PS2) 10Majavah: wikimediacloud: Add a dedicated CNAME for object storage [dns] - 10https://gerrit.wikimedia.org/r/963330 (https://phabricator.wikimedia.org/T341380)
[14:59:19] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw row A-B server moves - port-block constraint / numbering - https://phabricator.wikimedia.org/T348125 (10cmooney) 05Open→03Resolved @papaul answered in T348129#9224878, seems like we're in a good place given previous rack assignment as '1...
[14:59:22] <taavi>	 !log revoke a bot password, https://phabricator.wikimedia.org/T348132
[14:59:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:25] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney)
[14:59:40] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10procurement: GPU purchase for ml-staging in codfw - https://phabricator.wikimedia.org/T348118 (10wiki_willy) a:03RobH Adding the procurement project tag.  @RobH - can you move this to the S4 space as well?  Thanks, Willy
[15:00:10] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye
[15:00:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e...
[15:01:41] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10cmooney) >>! In T348129#9224878, @Papaul wrote: > @cmooney this should be a complication if we did have a mixed of 1G and 10G servers within the sam...
[15:02:22] <wikibugs>	 (03PS1) 10Slyngshede: Handle mobile viewport correct. [software/bitu] - 10https://gerrit.wikimedia.org/r/963331
[15:03:02] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Handle mobile viewport correct. [software/bitu] - 10https://gerrit.wikimedia.org/r/963331 (owner: 10Slyngshede)
[15:05:02] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[15:07:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt cloudvirt1062-67 - jclark@cumin1001"
[15:07:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10Jclark-ctr)
[15:08:00] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt cloudvirt1062-67 - jclark@cumin1001"
[15:08:00] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:08:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:12:22] <wikibugs>	 (03PS1) 10RLazarus: admin: Temporarily add a second ssh key for rzl [puppet] - 10https://gerrit.wikimedia.org/r/963333
[15:12:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10cmooney) >>! In T348041#9222035, @ssingh wrote: > We can and probably should have a backup static routes for each of `ns[01]` bu...
[15:12:48] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1062.mgmt.eqiad.wmnet with reboot policy FORCED
[15:12:50] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1063.mgmt.eqiad.wmnet with reboot policy FORCED
[15:12:52] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1064.mgmt.eqiad.wmnet with reboot policy FORCED
[15:12:54] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1065.mgmt.eqiad.wmnet with reboot policy FORCED
[15:12:55] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1066.mgmt.eqiad.wmnet with reboot policy FORCED
[15:12:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1067.mgmt.eqiad.wmnet with reboot policy FORCED
[15:13:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST replicasets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:13:47] <wikibugs>	 (03PS1) 10Bking: cloudelastic: include raid0.cfg in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963334 (https://phabricator.wikimedia.org/T342463)
[15:14:22] <wikibugs>	 (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/963335
[15:14:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] cloudelastic: include raid0.cfg in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963334 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking)
[15:15:07] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] admin: Temporarily add a second ssh key for rzl [puppet] - 10https://gerrit.wikimedia.org/r/963333 (owner: 10RLazarus)
[15:15:24] <wikibugs>	 (03CR) 10Bking: [C: 03+2] cloudelastic: include raid0.cfg in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963334 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking)
[15:16:04] <wikibugs>	 (03PS1) 10Slyngshede: Add viewport meta tag [software/bitu] - 10https://gerrit.wikimedia.org/r/963336
[15:16:41] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Add viewport meta tag [software/bitu] - 10https://gerrit.wikimedia.org/r/963336 (owner: 10Slyngshede)
[15:17:21] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1007.eqiad.wmnet with OS bullseye
[15:17:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad...
[15:18:09] <wikibugs>	 (03PS4) 10Btullis: Bump the maximum number of HDFS files before triggering an alert [alerts] - 10https://gerrit.wikimedia.org/r/963327 (https://phabricator.wikimedia.org/T342587)
[15:21:21] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye
[15:21:31] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e...
[15:22:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/963335 (owner: 10Muehlenhoff)
[15:23:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10cmooney) >>! In T348041#9222035, @ssingh wrote: > We can and probably should have a backup static routes for each of `ns[01]` bu...
[15:24:05] <wikibugs>	 (03PS1) 10Slyngshede: Better wording for sign in text. [software/bitu] - 10https://gerrit.wikimedia.org/r/963339
[15:24:28] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Better wording for sign in text. [software/bitu] - 10https://gerrit.wikimedia.org/r/963339 (owner: 10Slyngshede)
[15:25:09] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ayounsi) Oops, I missed some of the comments.  * I'm in favor of ditching the statics * Changing the  Hiera merge strategy seems...
[15:26:05] <wikibugs>	 (03PS1) 10Jclark-ctr: corrected an-master100[3-4] in site.ppi [puppet] - 10https://gerrit.wikimedia.org/r/963340 (https://phabricator.wikimedia.org/T342291)
[15:26:12] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] P:mw::deployment::server: Don't alert for train-presync [puppet] - 10https://gerrit.wikimedia.org/r/953200 (https://phabricator.wikimedia.org/T342755) (owner: 10Clément Goubert)
[15:26:49] <wikibugs>	 (03CR) 10Jclark-ctr: [C: 03+2] corrected an-master100[3-4] in site.ppi [puppet] - 10https://gerrit.wikimedia.org/r/963340 (https://phabricator.wikimedia.org/T342291) (owner: 10Jclark-ctr)
[15:30:27] <wikibugs>	 (03PS1) 10Jbond: test_init: correctly mock spicerack.Dns [software/spicerack] - 10https://gerrit.wikimedia.org/r/963343
[15:32:21] <icinga-wm>	 RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:32:41] <logmsgbot>	 !log hashar@deploy2002 Started deploy [integration/docroot@b3b712f]: (no justification provided)
[15:32:47] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [integration/docroot@b3b712f]: (no justification provided) (duration: 00m 06s)
[15:32:49] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Undelivered mail posted to wikimediacz-l - https://phabricator.wikimedia.org/T348158 (10Urbanecm)
[15:32:56] <wikibugs>	 (03PS2) 10Fabfur: purged: use unix socket for varnish in all DCs. [puppet] - 10https://gerrit.wikimedia.org/r/963326 (https://phabricator.wikimedia.org/T347837)
[15:33:48] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:35:53] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:35:56] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:36:41] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:37:29] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1007.eqiad.wmnet with OS bullseye
[15:37:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad...
[15:38:09] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate atlas-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348159 (10cmooney) p:05Triage→03Medium
[15:38:29] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate atlas-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348159 (10cmooney)
[15:38:39] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (10cmooney)
[15:39:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) >>! In T348041#9225321, @cmooney wrote: >>>! In T348041#9222035, @ssingh wrote: >> We can and probably should have a bac...
[15:39:54] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye
[15:40:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e...
[15:40:31] <icinga-wm>	 PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:40:31] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:41:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) >>! In T348041#9225405, @ayounsi wrote: > Oops, I missed some of the comments. >  > * I'm in favor of ditching the stati...
[15:42:18] <wikibugs>	 (03PS1) 10Jclark-ctr: add cloudvirt10[62-67] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963366 (https://phabricator.wikimedia.org/T342537)
[15:43:48] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:44:41] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1063.mgmt.eqiad.wmnet with reboot policy FORCED
[15:44:44] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1064.mgmt.eqiad.wmnet with reboot policy FORCED
[15:44:47] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1067.mgmt.eqiad.wmnet with reboot policy FORCED
[15:44:50] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1066.mgmt.eqiad.wmnet with reboot policy FORCED
[15:44:52] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1065.mgmt.eqiad.wmnet with reboot policy FORCED
[15:44:54] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:45:09] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1062.mgmt.eqiad.wmnet with reboot policy FORCED
[15:45:19] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:45:35] <wikibugs>	 (03CR) 10Jclark-ctr: [C: 03+2] add cloudvirt10[62-67] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963366 (https://phabricator.wikimedia.org/T342537) (owner: 10Jclark-ctr)
[15:45:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10jbond) >>! In T348041#9225478, @ssingh wrote: >>>! In T348041#9225405, @ayounsi wrote: >> * Changing the  Hiera merge strategy s...
[15:46:54] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) For posterity:   - no static routes - merge strategy Arzhel mentioned above - I am going to rename `ACAST_PS_ADVERTISE`...
[15:47:15] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye
[15:47:19] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye
[15:47:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye
[15:47:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[15:49:06] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43870/console" [puppet] - 10https://gerrit.wikimedia.org/r/963326 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[15:51:09] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar)
[15:52:13] <wikibugs>	 (03PS1) 10Andrea Denisse: alertmanager: Add the "Auto-Submitted: auto-generated" header to AM emails [puppet] - 10https://gerrit.wikimedia.org/r/963368 (https://phabricator.wikimedia.org/T347850)
[15:55:20] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:56:04] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:56:31] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:58:17] <wikibugs>	 (03CR) 10Hashar: "Puppet compiler https://puppet-compiler.wmflabs.org/output/908604/2460/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar)
[15:58:34] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "I guess this can be merged at any time." [puppet] - 10https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar)
[15:59:20] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (10Papaul) I am thinking about something to consider when going servers refresh or new servers
[16:00:04] <jinxer-wm>	 (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST clusterissuers) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:05:04] <jinxer-wm>	 (KubernetesAPILatency) resolved: (10) High Kubernetes API latency (LIST clusterissuers) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:05:41] <wikibugs>	 (03PS14) 10Ebernhardson: Draft: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075)
[16:05:43] <wikibugs>	 (03CR) 10Ebernhardson: Draft: cirrus streaming updater service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson)
[16:05:52] <wikibugs>	 (03Abandoned) 10Jforrester: mw-on-k8s: Serve 100% of wikifunctions.org traffic [puppet] - 10https://gerrit.wikimedia.org/r/961413 (https://phabricator.wikimedia.org/T347509) (owner: 10Jforrester)
[16:06:29] <wikibugs>	 (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/963368/43872/" [puppet] - 10https://gerrit.wikimedia.org/r/963368 (https://phabricator.wikimedia.org/T347850) (owner: 10Andrea Denisse)
[16:06:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Draft: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson)
[16:07:17] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1062.eqiad.wmnet with OS bullseye
[16:07:18] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bullseye
[16:07:20] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye
[16:07:21] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1065.eqiad.wmnet with OS bullseye
[16:07:31] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS b...
[16:07:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1062.eqiad.wmnet with OS b...
[16:07:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS b...
[16:07:49] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS b...
[16:09:41] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate mr1-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348164 (10cmooney) p:05Triage→03Medium
[16:11:11] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (10cmooney)
[16:11:17] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate mr1-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348164 (10cmooney)
[16:15:12] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1062']
[16:15:18] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1063']
[16:15:24] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1064']
[16:15:49] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1065']
[16:21:17] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1062']
[16:21:19] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1064']
[16:21:20] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1065']
[16:21:24] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1063']
[16:21:34] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066']
[16:21:40] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1067']
[16:21:46] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1066']
[16:21:48] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1067']
[16:22:06] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066']
[16:22:18] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1066']
[16:22:28] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066']
[16:22:32] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1066']
[16:22:53] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate atlas-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348159 (10ayounsi) Yeah, that's perfect. We can revisit the day it dies and needs to be migrated to a VM.
[16:23:15] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1066.eqiad.wmnet with OS bullseye
[16:23:17] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bullseye
[16:23:21] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS bullseye
[16:23:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye
[16:23:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066']
[16:23:44] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1067']
[16:23:54] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1066']
[16:23:57] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1067']
[16:24:23] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066']
[16:24:29] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1067']
[16:24:37] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1066']
[16:24:39] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1067']
[16:24:54] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066']
[16:24:57] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1067']
[16:25:03] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1067']
[16:25:44] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066']
[16:25:46] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066']
[16:25:54] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1066']
[16:25:55] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1066']
[16:26:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066']
[16:26:50] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1067']
[16:26:58] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1066']
[16:27:00] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1067']
[16:28:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1067']
[16:28:31] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066']
[16:31:50] <wikibugs>	 (03PS1) 10Cathal Mooney: Add ns0 and ns1 /32 routes to anycast_prefixes list [homer/public] - 10https://gerrit.wikimedia.org/r/963375 (https://phabricator.wikimedia.org/T348041)
[16:34:00] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1066']
[16:34:02] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1067']
[16:34:49] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "change itself looks good, PCC should cover each DC though" [puppet] - 10https://gerrit.wikimedia.org/r/963326 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[16:35:13] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[16:36:58] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate atlas-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348159 (10cmooney)
[16:39:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) Hello DC Ops,   I've confirmed that our new partman recipe works in T342463 , but the reimage for `cloudelas...
[16:39:36] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) p:05Low→03Medium a:05bking→03None
[16:40:12] <wikibugs>	 (03PS1) 10Andrew Bogott: openstack admin scripts: remove wmcs-ceph-migrate [puppet] - 10https://gerrit.wikimedia.org/r/963377
[16:40:33] <icinga-wm>	 RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:40:33] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:41:22] <wikibugs>	 (03PS5) 10Brion VIBBER: Drop old VP8 video transcodes, enable HLS on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961864 (https://phabricator.wikimedia.org/T312152)
[16:42:19] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] openstack admin scripts: remove wmcs-ceph-migrate [puppet] - 10https://gerrit.wikimedia.org/r/963377 (owner: 10Andrew Bogott)
[16:48:01] <taavi>	 jouncebot: nowandnext
[16:48:01] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 11 minute(s)
[16:48:01] <jouncebot>	 In 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1700)
[16:49:06] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye
[16:49:07] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1003.eqiad.wmnet with OS bullseye
[16:49:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w...
[16:49:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w...
[16:49:18] <taavi>	 !log taavi@mwmaint2002 ~ $ mwscript extensions/OATHAuth/maintenance/UpdateForMultipleDevicesSupport.php metawiki | tee T242031-sul.log # T242031
[16:49:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:22] <stashbot>	 T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031
[16:53:43] <icinga-wm>	 PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:53:43] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:54:56] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1007.eqiad.wmnet with OS bullseye
[16:55:06] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad...
[16:55:11] <icinga-wm>	 RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:55:11] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:56:35] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] purged: use unix socket for varnish in all DCs. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963326 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[16:57:23] <wikibugs>	 (03PS1) 10Andrew Bogott: Keystone: upgrade init scripts for Antelope [puppet] - 10https://gerrit.wikimedia.org/r/963378
[16:58:07] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] openstack admin scripts: remove wmcs-ceph-migrate [puppet] - 10https://gerrit.wikimedia.org/r/963377 (owner: 10Andrew Bogott)
[16:59:10] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "one nit" [homer/public] - 10https://gerrit.wikimedia.org/r/963375 (https://phabricator.wikimedia.org/T348041) (owner: 10Cathal Mooney)
[16:59:12] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 10 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43873/console" [puppet] - 10https://gerrit.wikimedia.org/r/963326 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[16:59:54] <fabfur>	 !log merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/963326 (T347837). `purged` daemon will be restarted by puppet in esams in the next 30m
[16:59:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:58] <stashbot>	 T347837: Deploy new purged version with UDS feature - https://phabricator.wikimedia.org/T347837
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1700)
[17:00:06] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1 C: 03+2] purged: use unix socket for varnish in all DCs. [puppet] - 10https://gerrit.wikimedia.org/r/963326 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur)
[17:00:43] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deploy new purged version with UDS feature - https://phabricator.wikimedia.org/T347837 (10Fabfur)
[17:01:07] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deploy new purged version with UDS feature - https://phabricator.wikimedia.org/T347837 (10Fabfur) 05Open→03Resolved
[17:03:17] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1062.eqiad.wmnet with OS bullseye
[17:03:25] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1065.eqiad.wmnet with OS bullseye
[17:06:11] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[17:10:36] <wikibugs>	 (03PS1) 10DCausse: rdf-streaming-updater: bump to 0.3.135 [deployment-charts] - 10https://gerrit.wikimedia.org/r/963383 (https://phabricator.wikimedia.org/T326914)
[17:16:04] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1066.eqiad.wmnet with OS bullseye
[17:19:28] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "Are you imagining that we'd also move the openstack swift endpoint in the keystone catalog, or just keep this around as a fallback?  (Or, " [dns] - 10https://gerrit.wikimedia.org/r/963330 (https://phabricator.wikimedia.org/T341380) (owner: 10Majavah)
[17:20:22] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: bump to 0.3.135 [deployment-charts] - 10https://gerrit.wikimedia.org/r/963383 (https://phabricator.wikimedia.org/T326914) (owner: 10DCausse)
[17:21:12] <wikibugs>	 (03Merged) 10jenkins-bot: rdf-streaming-updater: bump to 0.3.135 [deployment-charts] - 10https://gerrit.wikimedia.org/r/963383 (https://phabricator.wikimedia.org/T326914) (owner: 10DCausse)
[17:22:17] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[17:22:34] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[17:23:47] <wikibugs>	 10SRE, 10Cloud-VPS: cloudlb2001-dev and cloudlb2002-dev connected at different speeds - https://phabricator.wikimedia.org/T348173 (10cmooney) p:05Triage→03Low
[17:24:39] <wikibugs>	 10SRE, 10Traffic: Rename ACAST_PS_ADVERTISE in bird and anycast-healthchecker to BIRD_IP_ADVERTISE - https://phabricator.wikimedia.org/T348174 (10ssingh)
[17:24:51] <wikibugs>	 10SRE, 10Traffic: Rename ACAST_PS_ADVERTISE in bird and anycast-healthchecker to BIRD_IP_ADVERTISE - https://phabricator.wikimedia.org/T348174 (10ssingh)
[17:24:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh)
[17:26:23] <icinga-wm>	 RECOVERY - Disk space on testreduce1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=testreduce1001&var-datasource=eqiad+prometheus/ops
[17:27:37] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1064.eqiad.wmnet with OS bullseye
[17:27:40] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1063.eqiad.wmnet with OS bullseye
[17:27:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye executed with erro...
[17:27:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye executed with erro...
[17:29:12] <urbanecm>	 jouncebot: nowandnext
[17:29:12] <jouncebot>	 For the next 0 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1700)
[17:29:12] <jouncebot>	 In 0 hour(s) and 30 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1800)
[17:29:12] <jouncebot>	 In 0 hour(s) and 30 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1800)
[17:29:58] <wikibugs>	 (03PS1) 10Ssingh: bird: rename ACAST_PS_ADVERTISE to BIRD_IP{4,6}_ADVERTISE [puppet] - 10https://gerrit.wikimedia.org/r/963385 (https://phabricator.wikimedia.org/T348174)
[17:30:16] <wikibugs>	 (03PS1) 10Urbanecm: SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963347 (https://phabricator.wikimedia.org/T346760)
[17:30:31] <wikibugs>	 (03PS1) 10Urbanecm: SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963348 (https://phabricator.wikimedia.org/T346760)
[17:31:59] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1062']
[17:32:06] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1063']
[17:32:14] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1064']
[17:32:20] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1065']
[17:32:23] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1066']
[17:33:36] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1062.eqiad.wmnet with OS bullseye
[17:33:38] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bullseye
[17:33:39] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye
[17:33:41] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1065.eqiad.wmnet with OS bullseye
[17:33:42] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1066.eqiad.wmnet with OS bullseye
[17:33:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1062.eqiad.wmnet with OS bullseye
[17:33:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye
[17:33:49] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye
[17:33:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS bullseye
[17:34:07] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS bullseye
[17:37:22] <wikibugs>	 (03PS1) 10Majavah: Set READ_NEW for CA wikis on OATHAuth multiple devices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963388 (https://phabricator.wikimedia.org/T242031)
[17:37:27] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] designate pools.yaml: remove a domain-terminating '.' [puppet] - 10https://gerrit.wikimedia.org/r/961170 (owner: 10Andrew Bogott)
[17:41:28] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Release-Engineering-Team: PCC failing with "No space left on device" - https://phabricator.wikimedia.org/T348176 (10ssingh)
[17:42:42] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Release-Engineering-Team: PCC failing with "No space left on device" - https://phabricator.wikimedia.org/T348176 (10ssingh) p:05Triage→03Medium
[17:43:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testreduce1002.eqiad.wmnet
[17:43:34] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1067.eqiad.wmnet with OS bullseye
[17:43:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye executed with erro...
[17:47:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testreduce1002.eqiad.wmnet
[17:47:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963348 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm)
[17:52:17] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bullseye
[17:52:24] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye
[18:00:07] <jouncebot>	 jeena and dduvall: My dear minions, it's time we take the moon! Just kidding. Time for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1800).
[18:00:07] <jouncebot>	 jeena and dduvall: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T1800).
[18:00:51] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963391 (https://phabricator.wikimedia.org/T347080)
[18:00:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963391 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot)
[18:00:58] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Migrate lvs2011 and lvs2012 to new top-of-rack switches - https://phabricator.wikimedia.org/T348178 (10cmooney) p:05Triage→03Medium
[18:01:35] <wikibugs>	 (03PS1) 10Subramanya Sastry: parsoid-rt-client: Reduce worker pool to 24 clients [puppet] - 10https://gerrit.wikimedia.org/r/963392 (https://phabricator.wikimedia.org/T345220)
[18:01:47] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Migrate lvs2011 and lvs2012 to new top-of-rack switches - https://phabricator.wikimedia.org/T348178 (10cmooney)
[18:01:53] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (10cmooney)
[18:02:03] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963391 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot)
[18:08:28] <dancy>	 Lots of errors are being logged.
[18:08:40] <dancy>	 jeena: Roll back!
[18:09:00] <logmsgbot>	 !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.29  refs T347080
[18:09:02] <jeena>	 rolling back
[18:09:04] <stashbot>	 T347080: 1.41.0-wmf.29 deployment blockers - https://phabricator.wikimedia.org/T347080
[18:09:32] <jeena>	 dancy: can I just cancel this deploy if it's not done?
[18:09:45] <dduvall>	 that should work
[18:09:47] <dancy>	 yes
[18:10:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:10:41] <thcipriani>	 bah deprecations, looks like this one was filed already. Adding as a blocker.
[18:10:49] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963394 (https://phabricator.wikimedia.org/T347080)
[18:10:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963394 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot)
[18:11:35] <brennen>	 fun
[18:11:41] <dancy>	 Hmm.. that commit message title is wrong.
[18:11:43] <jeena>	 I thought it was fine since they were deprecation warnings
[18:11:43] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963394 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot)
[18:11:46] <dancy>	 Was that autogenerated?
[18:11:48] <jeena>	 but they did increase a lot
[18:11:50] <jeena>	 yeah
[18:12:05] <dduvall>	 https://phabricator.wikimedia.org/T348180
[18:12:20] <dduvall>	 thcipriani: oh, it was?
[18:12:38] <wikibugs>	 (03PS12) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[18:13:02] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[18:13:17] <dancy>	 jeena: Can you send me a transcript of what you ran to rollback?  I want to fix that.
[18:13:47] <jeena>	 okay
[18:14:00] <dduvall>	 looks like the deprecation errors are known and filtered out on the New Errors dash. there are others that spiked, however
[18:14:33] <jeena>	 Yeah I also see a cirrusSearchHandler error, but not as many as the deprecation warnings
[18:14:35] <thcipriani>	 dduvall: hrm, coming from the same place, but slightly different path https://phabricator.wikimedia.org/T348134 (not an api error)
[18:14:47] <dduvall>	 i see
[18:15:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis)
[18:15:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:17:31] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] "Looks good, merging." [puppet] - 10https://gerrit.wikimedia.org/r/963392 (https://phabricator.wikimedia.org/T345220) (owner: 10Subramanya Sastry)
[18:18:15] <icinga-wm>	 PROBLEM - Disk space on testreduce1001 is CRITICAL: DISK CRITICAL - free space: /srv/data 0 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=testreduce1001&var-datasource=eqiad+prometheus/ops
[18:18:35] <jinxer-wm>	 (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST certificaterequests) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:19:22] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1062.eqiad.wmnet with OS bullseye
[18:19:28] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1066.eqiad.wmnet with OS bullseye
[18:19:44] <logmsgbot>	 !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.29  refs T347080
[18:19:48] <stashbot>	 T347080: 1.41.0-wmf.29 deployment blockers - https://phabricator.wikimedia.org/T347080
[18:20:50] <dduvall>	 jeena, thcipriani i went ahead and filed https://phabricator.wikimedia.org/T348181 as well
[18:21:04] <jeena>	 thanks dduvall
[18:21:16] <dduvall>	 np
[18:21:27] <thcipriani>	 dduvall: thanks – merged your other task with the existing one and noted the different stack trace, added as a blocker
[18:21:41] <dduvall>	 k
[18:23:35] <jinxer-wm>	 (KubernetesAPILatency) firing: (29) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:28:35] <jinxer-wm>	 (KubernetesAPILatency) resolved: (29) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:33:37] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[18:38:43] <icinga-wm>	 RECOVERY - Disk space on testreduce1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=testreduce1001&var-datasource=eqiad+prometheus/ops
[18:45:21] <icinga-wm>	 RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[18:53:56] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1064.eqiad.wmnet with OS bullseye
[18:54:00] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye executed with erro...
[18:54:02] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1063.eqiad.wmnet with OS bullseye
[18:54:08] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye executed with erro...
[19:04:27] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T343198)', diff saved to https://phabricator.wikimedia.org/P52822 and previous config saved to /var/cache/conftool/dbconfig/20231004-190427-arnaudb.json
[19:04:35] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[19:05:56] <wikibugs>	 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10Eevans) Speaking here only with respect to the data model:  TL;DR I think you need to change the schema like so...  `lang=diff ---...
[19:07:02] <wikibugs>	 10SRE, 10Data Products, 10Traffic: Data Quality - requestctl not getting set - https://phabricator.wikimedia.org/T342577 (10VirginiaPoundstone) @Milimetric What is the status on this task?
[19:12:33] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1067.eqiad.wmnet with OS bullseye
[19:12:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye executed with erro...
[19:19:14] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[19:19:34] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P52823 and previous config saved to /var/cache/conftool/dbconfig/20231004-191933-arnaudb.json
[19:34:40] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P52824 and previous config saved to /var/cache/conftool/dbconfig/20231004-193439-arnaudb.json
[19:43:48] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:49:46] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T343198)', diff saved to https://phabricator.wikimedia.org/P52825 and previous config saved to /var/cache/conftool/dbconfig/20231004-194946-arnaudb.json
[19:49:49] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance
[19:49:51] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[19:50:03] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance
[19:50:04] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[19:50:18] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[19:50:24] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T343198)', diff saved to https://phabricator.wikimedia.org/P52826 and previous config saved to /var/cache/conftool/dbconfig/20231004-195023-arnaudb.json
[19:52:44] <wikibugs>	 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10Htriedman) @Eevans Understood! I'll make that change to the schema soon.  As far as returning a single `DPPageviews` vs. an array w...
[19:58:36] <wikibugs>	 (03PS1) 10Bking: wdqs: Set up graph_split hosts [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505)
[19:59:40] <wikibugs>	 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10Eevans) >>! In T343855#9226451, @Htriedman wrote: > @Eevans Understood! I'll make that change to the schema soon. >  > As far as re...
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T2000).
[20:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:00:23] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for postal32 - https://phabricator.wikimedia.org/T348197 (10Postal32)
[20:00:45] <urbanecm>	 i'll steal the window
[20:00:59] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963347 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm)
[20:01:49] <wikibugs>	 (03PS1) 10Urbanecm: Fix phan for GrowthExperiments [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963349 (https://phabricator.wikimedia.org/T347571)
[20:02:03] <wikibugs>	 (03PS2) 10Urbanecm: SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963348 (https://phabricator.wikimedia.org/T346760)
[20:02:42] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Fix phan for GrowthExperiments [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963349 (https://phabricator.wikimedia.org/T347571) (owner: 10Urbanecm)
[20:02:45] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963348 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm)
[20:03:09] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:03:29] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:03:41] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Keystone: upgrade init scripts for Antelope [puppet] - 10https://gerrit.wikimedia.org/r/963378 (owner: 10Andrew Bogott)
[20:04:29] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50714 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:04:49] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:14:48] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for postal32 - https://phabricator.wikimedia.org/T348197 (10Aklapper) 05Open→03Stalled Hi @Postal32, thanks for taking the time to report this and welcome to Wikimedia Phabricator! Please provide reasons why you'd like to access Netbox and how you plan...
[20:21:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963347 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm)
[20:22:16] <wikibugs>	 (03CR) 10Urbanecm: SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963347 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm)
[20:22:19] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963347 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm)
[20:23:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963347 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm)
[20:23:15] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963348 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm)
[20:23:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963349 (https://phabricator.wikimedia.org/T347571) (owner: 10Urbanecm)
[20:29:11] <wikibugs>	 (03Merged) 10jenkins-bot: Fix phan for GrowthExperiments [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963349 (https://phabricator.wikimedia.org/T347571) (owner: 10Urbanecm)
[20:29:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963348 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm)
[20:30:12] <wikibugs>	 (03CR) 10Ebernhardson: "I certainly think the wider goal is still worth pursuing, but indeed this is also blocking our staging deployment of the cirrus service so " [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: 10Ebernhardson)
[20:31:27] <wikibugs>	 (03PS9) 10Ebernhardson: Pull some flink config down into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901)
[20:31:29] <wikibugs>	 (03PS15) 10Ebernhardson: Draft: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075)
[20:31:31] <wikibugs>	 (03PS1) 10Ebernhardson: rdf-streaming-updater: dse: Rename release to wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/963409
[20:32:18] <wikibugs>	 (03CR) 10Ebernhardson: "This also ensures that when we use .Release.Name in a path it has a more descriptive path." [deployment-charts] - 10https://gerrit.wikimedia.org/r/963409 (owner: 10Ebernhardson)
[20:32:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Draft: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson)
[20:45:19] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1065.eqiad.wmnet with OS bullseye
[20:45:29] <wikibugs>	 (03Merged) 10jenkins-bot: SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963347 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm)
[20:45:51] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1062.eqiad.wmnet with OS bullseye
[20:45:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1062.eqiad.wmnet with OS bullseye
[20:45:59] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bullseye
[20:46:03] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye
[20:46:07] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye
[20:46:09] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1065.eqiad.wmnet with OS bullseye
[20:46:11] <wikibugs>	 (03CR) 10Urbanecm: [V: 03+2 C: 03+2] SpecialManageMentors: Skip OOUI initialization when transcluding [extensions/GrowthExperiments] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963348 (https://phabricator.wikimedia.org/T346760) (owner: 10Urbanecm)
[20:46:12] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1066.eqiad.wmnet with OS bullseye
[20:46:14] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bullseye
[20:46:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye
[20:46:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS bullseye
[20:46:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS bullseye
[20:46:35] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye
[20:46:55] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:963347|SpecialManageMentors: Skip OOUI initialization when transcluding (T346760)]], [[gerrit:963348|SpecialManageMentors: Skip OOUI initialization when transcluding (T346760)]], [[gerrit:963349|Fix phan for GrowthExperiments (T347571)]]
[20:47:00] <stashbot>	 T346760: Pages transcluding Special:ManageMentors are sometimes being rendered in the default skin - https://phabricator.wikimedia.org/T346760
[20:47:00] <stashbot>	 T347571: GrowthExperiments fails CI: mwext-php74-phan-docker - https://phabricator.wikimedia.org/T347571
[20:48:20] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:963347|SpecialManageMentors: Skip OOUI initialization when transcluding (T346760)]], [[gerrit:963348|SpecialManageMentors: Skip OOUI initialization when transcluding (T346760)]], [[gerrit:963349|Fix phan for GrowthExperiments (T347571)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:48:39] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[20:53:03] <icinga-wm>	 RECOVERY - Check systemd state on releases2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:54:45] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:963347|SpecialManageMentors: Skip OOUI initialization when transcluding (T346760)]], [[gerrit:963348|SpecialManageMentors: Skip OOUI initialization when transcluding (T346760)]], [[gerrit:963349|Fix phan for GrowthExperiments (T347571)]] (duration: 07m 49s)
[20:54:56] <stashbot>	 T346760: Pages transcluding Special:ManageMentors are sometimes being rendered in the default skin - https://phabricator.wikimedia.org/T346760
[20:54:56] <stashbot>	 T347571: GrowthExperiments fails CI: mwext-php74-phan-docker - https://phabricator.wikimedia.org/T347571
[20:56:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (5) High Kubernetes API latency (POST events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:57:04] <urbanecm>	 scap's quick again. yay! :)
[20:57:05] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "Some comments about the structure, see inline, or ping me if you want more context." [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking)
[20:58:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2054.codfw.wmnet with OS bullseye
[20:58:15] <wikibugs>	 10SRE, 10ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2054.codfw.wmnet with OS bullseye
[20:59:03] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1062.eqiad.wmnet with OS bullseye
[21:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T2100)
[21:01:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (POST events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:02:33] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1062.eqiad.wmnet with OS bullseye
[21:02:37] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1149 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:02:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1062.eqiad.wmnet with OS bullseye
[21:04:05] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1149 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:10:25] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1113 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:10:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:10:49] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:11:36] <wikibugs>	 (03PS3) 10JHathaway: postgresql: fix ordering on a new install [puppet] - 10https://gerrit.wikimedia.org/r/959228 (https://phabricator.wikimedia.org/T346842)
[21:13:37] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1132 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:14:31] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:15:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: (6) High Kubernetes API latency (POST events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:16:16] <wikibugs>	 (03CR) 10JHathaway: postgresql: fix ordering on a new install (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959228 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway)
[21:20:19] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1062.eqiad.wmnet with reason: host reimage
[21:23:34] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1062.eqiad.wmnet with reason: host reimage
[21:26:34] <wikibugs>	 (03PS1) 10Subramanya Sastry: parsoid-rt-client: Further reduce worker pool to 16 clients [puppet] - 10https://gerrit.wikimedia.org/r/963413 (https://phabricator.wikimedia.org/T345220)
[21:30:35] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1113 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:31:03] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1113 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:31:13] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:31:21] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:32:19] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "LGTM, thx for the fix" [software/spicerack] - 10https://gerrit.wikimedia.org/r/963343 (owner: 10Jbond)
[21:33:55] <wikibugs>	 (03PS1) 10Subramanya Sastry: Revert "Deprecate TOC mutation in OutputPageParserOutput hook" [core] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963351 (https://phabricator.wikimedia.org/T348134)
[21:34:51] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1066.eqiad.wmnet with reason: host reimage
[21:36:24] <wikibugs>	 (03Merged) 10jenkins-bot: test_init: correctly mock spicerack.Dns [software/spicerack] - 10https://gerrit.wikimedia.org/r/963343 (owner: 10Jbond)
[21:36:41] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1115 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:37:05] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1115 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:37:15] <brennen>	 jouncebot nowandnext
[21:37:16] <jouncebot>	 For the next 0 hour(s) and 22 minute(s): Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231004T2100)
[21:37:16] <jouncebot>	 In 8 hour(s) and 22 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T0600)
[21:37:16] <jouncebot>	 In 8 hour(s) and 22 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T0600)
[21:38:09] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1066.eqiad.wmnet with reason: host reimage
[21:38:49] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[21:39:03] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy2002 using scap backport" [core] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963351 (https://phabricator.wikimedia.org/T348134) (owner: 10Subramanya Sastry)
[21:39:29] <brennen>	 ^ cc: jeena
[21:40:16] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[21:40:21] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1062.eqiad.wmnet with OS bullseye
[21:40:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1062.eqiad.wmnet with OS bullseye completed: - cloud...
[21:42:27] <jeena>	 brennen: +1 I think there is another blocker still before we can roll forward
[21:43:14] <brennen>	 yeah.
[21:43:15] <jeena>	 oh, looks like it has a fix as well https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseCirrusSearch/+/963405/
[21:43:16] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1102 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:43:35] <brennen>	 hmm, is that one a straightforward revert or should we wait for review?
[21:44:04] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:44:07] <jeena>	 I'm not sure, they said revert but the patch doesn't so I'd prefer to wait
[21:44:23] <subbu>	 brennen, afk for about 10 mins. but will be here again then if you need anything from me before i bail for the evening.
[21:44:31] <subbu>	 looks like zuul needs that time anyway.
[21:44:32] <jeena>	 there's no reviewer added though
[21:44:35] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1065.eqiad.wmnet with OS bullseye
[21:44:37] <wikibugs>	 (03CR) 10Bking: [C: 03+1] rdf-streaming-updater: dse: Rename release to wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/963409 (owner: 10Ebernhardson)
[21:44:44] <brennen>	 subbu: thanks - i'm guessing there's not much to test with this one?
[21:44:51] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v7.3.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/963414
[21:44:59] <wikibugs>	 (03CR) 10Bking: [C: 03+1] Pull some flink config down into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: 10Ebernhardson)
[21:45:02] <subbu>	 dont think so? it is just going to stop emitting the deprecations.
[21:45:08] <wikibugs>	 (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v7.3.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/963414 (owner: 10Volans)
[21:45:23] <brennen>	 cool.
[21:46:05] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1065.eqiad.wmnet with OS bullseye
[21:46:12] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS bullseye
[21:46:53] <brennen>	 jeena: i asked - https://phabricator.wikimedia.org/T348181#9226698
[21:47:20] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:47:34] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1102 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:49:00] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1132 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:49:01] <wikibugs>	 (03PS3) 10JHathaway: puppetdb: avoid creating database users via dbconfig [puppet] - 10https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842)
[21:49:38] <wikibugs>	 (03CR) 10JHathaway: puppetdb: avoid creating database users via dbconfig (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway)
[21:50:55] <wikibugs>	 (03PS1) 10Volans: Upstream release v7.3.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/963415
[21:51:06] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Upstream release v7.3.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/963415 (owner: 10Volans)
[21:51:20] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:52:06] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:52:20] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:52:46] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Deprecate TOC mutation in OutputPageParserOutput hook" [core] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963351 (https://phabricator.wikimedia.org/T348134) (owner: 10Subramanya Sastry)
[21:53:14] <logmsgbot>	 !log brennen@deploy2002 Started scap: Backport for [[gerrit:963351|Revert "Deprecate TOC mutation in OutputPageParserOutput hook" (T348134)]]
[21:53:18] <stashbot>	 T348134: PHP Deprecated: Use of OutputPageParserOutput hook to mutate TOC was deprecated in MediaWiki 1.41 - https://phabricator.wikimedia.org/T348134
[21:53:26] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[21:54:39] <logmsgbot>	 !log brennen@deploy2002 brennen and ssastry: Backport for [[gerrit:963351|Revert "Deprecate TOC mutation in OutputPageParserOutput hook" (T348134)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:54:42] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[21:54:47] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1066.eqiad.wmnet with OS bullseye
[21:54:53] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS bullseye completed: - cloud...
[21:55:53] <wikibugs>	 (PS3) Jforrester: wikifunctions: Use function-orchestrator image with better logging [deployment-charts] - https://gerrit.wikimedia.org/r/962650 (https://phabricator.wikimedia.org/T346264)
[21:55:55] <wikibugs>	 (PS2) Jforrester: wikifunctions: Begin split of function-evaluator into js and python services [deployment-charts] - https://gerrit.wikimedia.org/r/962716 (https://phabricator.wikimedia.org/T343388)
[21:55:57] <wikibugs>	 (PS2) Jforrester: wikifunctions: Switch execution from main to language-specific evaluators [deployment-charts] - https://gerrit.wikimedia.org/r/962717 (https://phabricator.wikimedia.org/T343388)
[21:55:59] <wikibugs>	 (PS2) Jforrester: wikifunctions: Drop legacy main (all languages) evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/962718 (https://phabricator.wikimedia.org/T343388)
[21:56:12] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1115 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:56:25] <logmsgbot>	 !log brennen@deploy2002 brennen and ssastry: Continuing with sync
[21:56:58] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1115 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:57:06] <wikibugs>	 (PS3) JHathaway: puppetdb: add ability to configure db_ro_host [puppet] - https://gerrit.wikimedia.org/r/959229 (https://phabricator.wikimedia.org/T346842)
[21:58:10] <wikibugs>	 SRE, All-and-every-Wiktionary, Product-Analytics, Wikimedia-Fundraising, SEO: Google displays “Wikipedia” as site title for some Wiktionary pages - https://phabricator.wikimedia.org/T348203 (Pols12)
[21:58:43] <volans>	 !log uploaded spicerack_7.3.1 to apt.wikimedia.org bullseye-wikimedia
[21:58:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:01] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1065.eqiad.wmnet with OS bullseye
[21:59:17] <wikibugs>	 (CR) Bking: [C: +2] rdf-streaming-updater: dse: Rename release to wikidata [deployment-charts] - https://gerrit.wikimedia.org/r/963409 (owner: Ebernhardson)
[21:59:25] <wikibugs>	 (CR) Bking: [C: +2] Pull some flink config down into the chart [deployment-charts] - https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: Ebernhardson)
[22:00:13] <subbu>	 brennen, how is it looking?
[22:00:17] <wikibugs>	 (Merged) jenkins-bot: Pull some flink config down into the chart [deployment-charts] - https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: Ebernhardson)
[22:00:35] <brennen>	 just restarting php now, we'll know shortly
[22:02:18] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.48.235:9042 on restbase1030 is OK: TCP OK - 0.000 second response time on 10.64.48.235 port 9042 https://phabricator.wikimedia.org/T93886
[22:02:28] <logmsgbot>	 !log brennen@deploy2002 Finished scap: Backport for [[gerrit:963351|Revert "Deprecate TOC mutation in OutputPageParserOutput hook" (T348134)]] (duration: 09m 13s)
[22:02:32] <urandom>	 !log starting Cassandra rebuild, restbase1030-b — T346803
[22:02:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:02:40] <stashbot>	 T348134: PHP Deprecated: Use of OutputPageParserOutput hook to mutate TOC was deprecated in MediaWiki 1.41 - https://phabricator.wikimedia.org/T348134
[22:02:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:53] <stashbot>	 T346803: Unable to bootstrap restbase1030-{a,b,c} - https://phabricator.wikimedia.org/T346803
[22:03:27] <brennen>	 subbu: last deprecation notice at 21:57 UTC, if i don't see any in the next 10 min or so i'll assume that's fixed.
[22:03:49] <subbu>	 ok .. i would be surprised if you saw any new ones .. 
[22:04:09] <subbu>	 i am going to sign off now .. but will check in again in a couple hours.
[22:04:22] <subbu>	 (will look at the phab task).
[22:05:14] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1065.eqiad.wmnet with OS bullseye
[22:05:20] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS bullseye
[22:06:15] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1063.eqiad.wmnet with OS bullseye
[22:06:18] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1064.eqiad.wmnet with OS bullseye
[22:06:21] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye executed with erro...
[22:06:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye executed with erro...
[22:06:30] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1067.eqiad.wmnet with OS bullseye
[22:06:30] <wikibugs>	 SRE, AQS2.0, Cassandra, serviceops, Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (Htriedman) @Eevans In that case, I'll change the data model to drop it! Will update this thread when it's done.
[22:06:35] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye executed with erro...
[22:07:35] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:09:21] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/963368 (https://phabricator.wikimedia.org/T347850) (owner: Andrea Denisse)
[22:11:47] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bullseye
[22:11:53] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye
[22:13:18] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2054.codfw.wmnet with OS bullseye
[22:13:23] <wikibugs>	 SRE, ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2054.codfw.wmnet with OS bullseye executed with errors: - kubernetes2054 (**FAIL**)   - Removed from...
[22:18:41] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye
[22:18:47] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye
[22:21:25] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1065.eqiad.wmnet with reason: host reimage
[22:22:44] <wikibugs>	 SRE, All-and-every-Wiktionary, Language-Team, Product-Analytics, SEO: Google displays “Wikipedia” as site title for some Wiktionary pages - https://phabricator.wikimedia.org/T348203 (greg)
[22:23:54] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bullseye
[22:24:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye
[22:24:39] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1065.eqiad.wmnet with reason: host reimage
[22:27:54] <wikibugs>	 (PS16) Ebernhardson: Draft: cirrus streaming updater service [deployment-charts] - https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075)
[22:28:27] <wikibugs>	 (PS17) Ebernhardson: cirrus streaming updater service [deployment-charts] - https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075)
[22:39:35] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[22:40:45] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[22:40:51] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1065.eqiad.wmnet with OS bullseye
[22:40:57] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS bullseye completed: - cloud...
[23:00:22] <icinga-wm>	 RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:32:02] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1063.eqiad.wmnet with OS bullseye
[23:32:08] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye executed with erro...
[23:38:58] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1064.eqiad.wmnet with OS bullseye
[23:39:03] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye executed with erro...
[23:43:48] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:44:11] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1067.eqiad.wmnet with OS bullseye
[23:44:17] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye executed with erro...