[00:00:10] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10320717 (Jclark-ctr)
[00:05:18] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1041.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:05:22] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:05:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: rsyslog-imfile-remedy.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:05:33] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:05:34] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, observability, Patch-For-Review, Release-Engineering-Team (Seen): Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089#10320722 (colewhite) All dashboards in the [[ https://grafana-rw.wikimedia.org/dashboards/f/NHnAVr54k/rel...
[00:05:50] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:06:49] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10320725 (Jclark-ctr)
[00:08:19] <wikibugs>	 (PS1) Bvibber: Correction to virtual-globaljsonlinks mapping [mediawiki-config] - https://gerrit.wikimedia.org/r/1090988 (https://phabricator.wikimedia.org/T374746)
[00:10:48] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1090988 (https://phabricator.wikimedia.org/T374746) (owner: Bvibber)
[00:10:52] <wikibugs>	 (CR) Eevans: [C:+2] corto: configure for production phabricator [puppet] - https://gerrit.wikimedia.org/r/1090981 (https://phabricator.wikimedia.org/T356790) (owner: Eevans)
[00:12:39] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10320743 (Jclark-ctr) @ABran-WMF these have been racked/ cabled/ configured  Per the racking instructions that where in the Racking Proposal :  and ju...
[00:12:55] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10320727 (Jclark-ctr) @bking  these have been racked/ cabled/ configured and just need puppet updated for os install
[00:13:28] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:13:31] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1046.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:13:40] <wikibugs>	 (CR) BCornwall: [C:+1] dse-k8s-services: add CNAME for blunderbuss (nee hdfs-synchronizer) [dns] - https://gerrit.wikimedia.org/r/1090972 (https://phabricator.wikimedia.org/T365659) (owner: Bking)
[00:24:46] <icinga-wm>	 PROBLEM - Dell PowerEdge RAID Controller on an-worker1169 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[00:24:47] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-worker1169 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T379856 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[00:24:53] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1169 - https://phabricator.wikimedia.org/T379856 (ops-monitoring-bot) NEW
[00:31:45] <wikibugs>	 ops-eqiad, SRE, Data-Persistence-SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10320769 (Jclark-ctr) @Marostegui  Dell is requesting  SOS report and TSR report from this server and another.     I can pull TSR reports but while logging int...
[00:35:18] <wikibugs>	 ops-eqiad, SRE, Data-Persistence-SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10320773 (Jclark-ctr) {F57699009}  {F57699010}
[00:38:40] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1090990
[00:38:40] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1090990 (owner: TrainBranchBot)
[00:38:49] <wikibugs>	 SRE-OnFire, Incident Tooling: corto: failure to create google doc should not be fatal - https://phabricator.wikimedia.org/T379858 (Eevans) NEW
[01:04:58] <wikibugs>	 (PS1) Bvibber: Avoid use of globaljsonlinks* tables on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1090993 (https://phabricator.wikimedia.org/T374746)
[01:05:32] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1090993 (https://phabricator.wikimedia.org/T374746) (owner: Bvibber)
[01:06:34] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp5027.eqsin.wmnet, cp5026.eqsin.wmnet, cp5030.eqsin.wmnet, cp5031.eqsin.wmnet, cp5032.eqsin.wmnet, cp5025.eqsin.wmnet, cp5029.eqsin.wmnet are marked down but pooled: uploadlb_443: Servers cp5027.eqsin.wmnet, cp5026.eqsin.wmnet, cp5030.eqsin.wmnet, cp5031.eqsin.wmnet, cp5032.eqsin.wmnet, cp5025.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:06:34] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5005 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp5026.eqsin.wmnet, cp5030.eqsin.wmnet, cp5031.eqsin.wmnet, cp5032.eqsin.wmnet, cp5025.eqsin.wmnet, cp5029.eqsin.wmnet are marked down but pooled: uploadlb_443: Servers cp5027.eqsin.wmnet, cp5031.eqsin.wmnet, cp5032.eqsin.wmnet, cp5028.eqsin.wmnet, cp5026.eqsin.wmnet, cp5025.eqsin.wmnet, cp5030.eqsin.wmnet, cp5029.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:06:58] <jinxer-wm>	 FIRING: ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:07:44] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[01:08:41] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1090996
[01:08:41] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1090996 (owner: TrainBranchBot)
[01:09:08] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:11:58] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:13:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:15:34] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp5029 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[01:16:17] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1090990 (owner: TrainBranchBot)
[01:18:34] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:18:36] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:18:57] <sukhe>	 oh boy
[01:19:00] <sukhe>	 !incidents
[01:19:01] <sirenbot>	 5440 (ACKED)  [2x] ProbeDown sre (wikikube-ctrl2002:6443 probes/custom codfw)
[01:19:01] <sirenbot>	 5445 (UNACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[01:19:04] <sukhe>	 !ack 5445
[01:19:05] <sirenbot>	 5445 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[01:19:08] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:19:53] <jinxer-wm>	 FIRING: DDoSDetected: FastNetMon has detected an attack on eqsin #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected
[01:20:12] <sukhe>	 !incidents
[01:20:13] <sirenbot>	 5440 (ACKED)  [2x] ProbeDown sre (wikikube-ctrl2002:6443 probes/custom codfw)
[01:20:13] <sirenbot>	 5445 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[01:20:13] <sirenbot>	 5446 (UNACKED)  DDoSDetected sre (netflow5002:9100 eqsin)
[01:20:15] <sukhe>	 !ack 5446
[01:20:16] <sirenbot>	 5446 (ACKED)  DDoSDetected sre (netflow5002:9100 eqsin)
[01:21:58] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:22:44] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[01:24:53] <jinxer-wm>	 RESOLVED: DDoSDetected: FastNetMon has detected an attack on eqsin #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected
[01:27:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[01:32:07] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[01:35:36] <icinga-wm>	 RECOVERY - Webrequests Varnishkafka log producer on cp5029 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[01:35:45] <sukhe>	 ok great
[01:35:51] <sukhe>	 nothing outstanding for cleanup
[01:45:16] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1090996 (owner: TrainBranchBot)
[01:55:22] <wikibugs>	 (CR) Ottomata: [C:+1] dse-k8s-services: add CNAME for blunderbuss (nee hdfs-synchronizer) [dns] - https://gerrit.wikimedia.org/r/1090972 (https://phabricator.wikimedia.org/T365659) (owner: Bking)
[02:01:34] <wikibugs>	 (PS1) Reedy: CommonSettings.php: Properly set / to array [mediawiki-config] - https://gerrit.wikimedia.org/r/1090998 (https://phabricator.wikimedia.org/T379834)
[02:02:18] <wikibugs>	 (CR) CI reject: [V:-1] CommonSettings.php: Properly set / to array [mediawiki-config] - https://gerrit.wikimedia.org/r/1090998 (https://phabricator.wikimedia.org/T379834) (owner: Reedy)
[02:04:43] <wikibugs>	 (PS2) Reedy: CommonSettings.php: Properly set / to array [mediawiki-config] - https://gerrit.wikimedia.org/r/1090998 (https://phabricator.wikimedia.org/T379834)
[02:05:09] <wikibugs>	 (PS3) Reedy: CommonSettings.php: Properly set $wgCSPReportOnlyHeader/$wgCSPHeader to array [mediawiki-config] - https://gerrit.wikimedia.org/r/1090998 (https://phabricator.wikimedia.org/T379834)
[02:23:27] <wikibugs>	 (PS4) Reedy: CommonSettings.php: Properly set $wgCSPReportOnlyHeader/$wgCSPHeader to array [mediawiki-config] - https://gerrit.wikimedia.org/r/1090998 (https://phabricator.wikimedia.org/T379834)
[02:32:30] <jinxer-wm>	 FIRING: Primary outbound port utilisation over 80%  #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[02:37:30] <jinxer-wm>	 RESOLVED: Primary outbound port utilisation over 80%  #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[03:05:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: rsyslog-imfile-remedy.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:34:30] <wikibugs>	 (PS1) KartikMistry: CX3 Build 0.2.0+20241113 [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - https://gerrit.wikimedia.org/r/1091007 (https://phabricator.wikimedia.org/T368718)
[03:35:11] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - https://gerrit.wikimedia.org/r/1091007 (https://phabricator.wikimedia.org/T368718) (owner: KartikMistry)
[03:42:30] <icinga-wm>	 PROBLEM - Host mr1-magru.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[03:47:32] <icinga-wm>	 RECOVERY - Host mr1-magru.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 123.37 ms
[03:53:54] <wikibugs>	 (PS1) JHathaway: WIP: don't remove override twice [cookbooks] - https://gerrit.wikimedia.org/r/1091009
[03:56:27] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bullseye
[03:56:35] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10321140 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye
[04:09:26] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage
[04:11:56] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage
[04:24:12] <wikibugs>	 (PS2) JHathaway: WIP: EFI don't remove override twice [cookbooks] - https://gerrit.wikimedia.org/r/1091009
[04:25:37] <wikibugs>	 (CR) JHathaway: "With this patch I am no longer able to reproduce the double d-i issue. I am fairly confident it is the cause of our woes, as it explains w" [cookbooks] - https://gerrit.wikimedia.org/r/1091009 (owner: JHathaway)
[04:34:23] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2082.codfw.wmnet with OS bullseye
[04:34:29] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10321172 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye comple...
[05:13:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T0700)
[07:00:05] <jouncebot>	 marostegui, Amir1, and arnaudb: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T0700).
[07:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:06:24] <XioNoX>	 !log delete office interco IP/prefixes/vlan in ulsfo - T379778
[07:06:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:27] <stashbot>	 T379778: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778
[07:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:17:38] <wikibugs>	 (PS1) Ayounsi: Remove office interco include [dns] - https://gerrit.wikimedia.org/r/1091130 (https://phabricator.wikimedia.org/T379778)
[07:27:43] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/1091130 (https://phabricator.wikimedia.org/T379778) (owner: Ayounsi)
[07:27:58] <wikibugs>	 (CR) Ayounsi: [C:+2] Remove office interco include [dns] - https://gerrit.wikimedia.org/r/1091130 (https://phabricator.wikimedia.org/T379778) (owner: Ayounsi)
[07:30:35] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[07:32:50] <wikibugs>	 (CR) Slyngshede: [C:+2] Netbox: Update runbook, add dashboard and physicalhosts report. [alerts] - https://gerrit.wikimedia.org/r/1090875 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[07:34:17] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove office link dns records - ayounsi@cumin1002"
[07:34:28] <wikibugs>	 (Merged) jenkins-bot: Netbox: Update runbook, add dashboard and physicalhosts report. [alerts] - https://gerrit.wikimedia.org/r/1090875 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[07:34:35] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove office link dns records - ayounsi@cumin1002"
[07:34:35] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:34:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2017.codfw.wmnet
[07:34:56] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10321323 (ops-monitoring-bot) Draining ganeti2017.codfw.wmnet of running VMs
[07:36:23] <wikibugs>	 SRE, Infrastructure-Foundations, netops, procurement, Patch-For-Review: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10321324 (ayounsi)
[07:41:40] <icinga-wm>	 RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 145, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:41:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2017.codfw.wmnet
[07:42:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2017.codfw.wmnet
[07:42:34] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10321346 (ops-monitoring-bot) Draining ganeti2017.codfw.wmnet of running VMs
[07:45:42] <wikibugs>	 (PS1) Ayounsi: Remove test BGP session to e8 sonic switch [homer/public] - https://gerrit.wikimedia.org/r/1091169
[07:47:52] <wikibugs>	 (CR) Ayounsi: [C:+2] Remove test BGP session to e8 sonic switch [homer/public] - https://gerrit.wikimedia.org/r/1091169 (owner: Ayounsi)
[07:48:26] <wikibugs>	 (Merged) jenkins-bot: Remove test BGP session to e8 sonic switch [homer/public] - https://gerrit.wikimedia.org/r/1091169 (owner: Ayounsi)
[07:51:57] <wikibugs>	 (PS1) Arnaudb: bashrc: add alias + dbctl alias [puppet] - https://gerrit.wikimedia.org/r/1091171
[07:51:58] <wikibugs>	 (CR) Arnaudb: [C:+2] bashrc: add alias + dbctl alias [puppet] - https://gerrit.wikimedia.org/r/1091171 (owner: Arnaudb)
[07:54:50] <icinga-wm>	 RECOVERY - BGP status on ssw1-e1-eqiad.mgmt is OK: BGP OK - up: 15, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:58:38] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 94, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T0800).
[08:00:05] <jouncebot>	 kart_, DreamRimmer, and bvibber: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:21] <bvibber>	 o/
[08:00:32] <kart_>	 \0
[08:00:45] <DreamRimmer>	 o/
[08:00:47] <kart_>	 I'll start with my patch. KCVelaga around?
[08:00:52] <KCVelaga>	 Yes
[08:01:43] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1084704 (https://phabricator.wikimedia.org/T378565) (owner: KCVelaga)
[08:02:24] <wikibugs>	 (Merged) jenkins-bot: Update stream registration and config for MinT for Readers [mediawiki-config] - https://gerrit.wikimedia.org/r/1084704 (https://phabricator.wikimedia.org/T378565) (owner: KCVelaga)
[08:03:34] <logmsgbot>	 !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1084704|Update stream registration and config for MinT for Readers (T378565)]]
[08:03:37] <stashbot>	 T378565: MinT for Readers instrumentation: update stream configuration and registration for new schema fragment - https://phabricator.wikimedia.org/T378565
[08:07:02] <wikibugs>	 (PS2) KartikMistry: Update recommendation api to 2024-11-11-200548-production [deployment-charts] - https://gerrit.wikimedia.org/r/1089964 (https://phabricator.wikimedia.org/T379037)
[08:08:33] <logmsgbot>	 !log kartik@deploy2002 kcvelaga, kartik: Backport for [[gerrit:1084704|Update stream registration and config for MinT for Readers (T378565)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:09:15] <kart_>	 KCVelaga: Can you test the patch on mwdebug?
[08:09:59] <KCVelaga>	 Let me try
[08:10:53] <wikibugs>	 (PS3) KartikMistry: Update recommendation api to 2024-11-13-183159-production [deployment-charts] - https://gerrit.wikimedia.org/r/1089964 (https://phabricator.wikimedia.org/T379592)
[08:12:17] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, IPv6: Enable ipv6 on ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T379890 (MoritzMuehlenhoff) NEW
[08:13:59] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, IPv6: Enable ipv6 on ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T379890#10321458 (MoritzMuehlenhoff) p:Triage→Medium
[08:18:40] <icinga-wm>	 RECOVERY - BGP status on cr2-magru is OK: BGP OK - up: 79, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:18:56] <wikibugs>	 (CR) Elukey: "I tried to recall why I've set up the code in the first place, and this is what I found on IRC:" [cookbooks] - https://gerrit.wikimedia.org/r/1091009 (owner: JHathaway)
[08:19:35] <kart_>	 We are taking some time to test, bvibber - you're next once I'm done with config patch. Backport patch from me is postponed.
[08:19:38] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect, ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:19:50] <bvibber>	 tx
[08:20:36] <icinga-wm>	 RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 67, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:20:40] <wikibugs>	 (CR) Muehlenhoff: "My personal take is this: We don't use SGX and have no plans to do so (and who knows if Intel doesn't even abandon it in total at some poi" [cookbooks] - https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: Volans)
[08:21:42] <icinga-wm>	 RECOVERY - BGP status on cr1-esams is OK: BGP OK - up: 464, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:23:29] <wikibugs>	 (PS3) TChin: flink-app: Add default checkpointing config for Flink 1.20 [deployment-charts] - https://gerrit.wikimedia.org/r/1090430 (https://phabricator.wikimedia.org/T375176)
[08:23:44] <logmsgbot>	 !log kartik@deploy2002 kcvelaga, kartik: Continuing with sync
[08:24:15] <KCVelaga>	 Okay
[08:24:55] <kart_>	 bvibber: Sorry. Just in time, our dev is back who can test :D I'll go ahead and +2 the patch as it will take a while to merge..
[08:25:06] <bvibber>	 so mine affect job queue stuff so i won't be able to test them on the debug server :)
[08:25:09] <bvibber>	 \o/
[08:25:16] <bvibber>	 tx
[08:25:53] <wikibugs>	 (CR) KartikMistry: [C:+2] CX3 Build 0.2.0+20241113 [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - https://gerrit.wikimedia.org/r/1091007 (https://phabricator.wikimedia.org/T368718) (owner: KartikMistry)
[08:26:35] <wikibugs>	 (CR) Vgutierrez: haproxykafka: working on TLS client authentication to kafka (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: Fabfur)
[08:27:24] <wikibugs>	 (PS1) Brouberol: airflow-search: define k8s namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1091175 (https://phabricator.wikimedia.org/T378441)
[08:27:25] <wikibugs>	 (PS1) Brouberol: airflow-search: register tenant namespace in ceph-csi and cloudnative-pg operator configs [deployment-charts] - https://gerrit.wikimedia.org/r/1091176 (https://phabricator.wikimedia.org/T378441)
[08:27:27] <wikibugs>	 (PS1) Brouberol: airflow-search: define helmfile and values [deployment-charts] - https://gerrit.wikimedia.org/r/1091177 (https://phabricator.wikimedia.org/T378441)
[08:27:57] <kart_>	 Did I forgot DreamRimmer? You're next! :)
[08:28:14] <DreamRimmer>	 thanks
[08:28:21] <wikibugs>	 (PS1) Brouberol: airflow-research: change user identitiy files owner to analytics-deploy [puppet] - https://gerrit.wikimedia.org/r/1091178 (https://phabricator.wikimedia.org/T378442)
[08:28:22] <wikibugs>	 (PS1) Brouberol: airflow-search: define user kubeconfigs [puppet] - https://gerrit.wikimedia.org/r/1091179 (https://phabricator.wikimedia.org/T378441)
[08:28:24] <wikibugs>	 (PS1) Brouberol: airflow-search: define OIDC config [puppet] - https://gerrit.wikimedia.org/r/1091180 (https://phabricator.wikimedia.org/T378441)
[08:28:25] <logmsgbot>	 !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084704|Update stream registration and config for MinT for Readers (T378565)]] (duration: 24m 50s)
[08:28:25] <wikibugs>	 (PS1) Brouberol: airflow-search: define ATS mapping and cache config [puppet] - https://gerrit.wikimedia.org/r/1091181 (https://phabricator.wikimedia.org/T378441)
[08:28:28] <stashbot>	 T378565: MinT for Readers instrumentation: update stream configuration and registration for new schema fragment - https://phabricator.wikimedia.org/T378565
[08:30:32] <kart_>	 KCVelaga: done. DreamRimmer go ahead!
[08:31:26] <wikibugs>	 (CR) Elukey: "I found only references about how to do it (if available) via manual BIOS config (like https://www.supermicro.com/support/faqs/faq.cfm?faq" [cookbooks] - https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: Volans)
[08:32:41] <KCVelaga>	 kart_ the stream registration is showing up fine on my end as well now. Thank you.
[08:32:55] <kart_>	 Nice!
[08:33:17] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 140407
[08:33:31] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 140407
[08:33:56] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 9299
[08:34:54] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9299
[08:35:14] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 141082
[08:35:15] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 141082
[08:35:19] <wikibugs>	 (03CR) 10Muehlenhoff: "I've reworked the Envoy firewall setup in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1090798, this patch will need to be adapted" [puppet] - 10https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn)
[08:35:34] <DreamRimmer>	 kart_: deploying mine?
[08:37:19] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 26744
[08:37:40] <kart_>	 oh, I thought you're doing it :)
[08:38:11] <DreamRimmer>	 I don't have deployment access
[08:38:21] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 26744
[08:39:02] <kart_>	 ouch. Let me take a look at patch.
[08:41:14] <kart_>	 DreamRimmer: deploying..
[08:41:21] <DreamRimmer>	 tx
[08:41:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090937 (https://phabricator.wikimedia.org/T379635) (owner: 10Dreamrimmer)
[08:42:06] <wikibugs>	 (03Merged) 10jenkins-bot: Allow Wikidata bureaucrats to remove admin rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090937 (https://phabricator.wikimedia.org/T379635) (owner: 10Dreamrimmer)
[08:42:36] <logmsgbot>	 !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1090937|Allow Wikidata bureaucrats to remove admin rights (T379635)]]
[08:42:40] <stashbot>	 T379635: Allow Wikidata bureaucrats to remove admin rights - https://phabricator.wikimedia.org/T379635
[08:43:16] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10321525 (10elukey) @jhathaway something interesting that I found on Redfish related to BIOS boot options:  ms-be2088  ` BootModeSelect UEFI BootOption_1...
[08:44:37] <wikibugs>	 (03Merged) 10jenkins-bot: CX3 Build 0.2.0+20241113 [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091007 (https://phabricator.wikimedia.org/T368718) (owner: 10KartikMistry)
[08:45:03] <wikibugs>	 (03PS2) 10Brouberol: airflow-search: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091177 (https://phabricator.wikimedia.org/T378441)
[08:46:22] <wikibugs>	 (03PS1) 10Ayounsi: Replace fasw-c-eqiad with new fasw2 [puppet] - 10https://gerrit.wikimedia.org/r/1091182 (https://phabricator.wikimedia.org/T377381)
[08:47:25] <logmsgbot>	 !log kartik@deploy2002 dreamrimmer, kartik: Backport for [[gerrit:1090937|Allow Wikidata bureaucrats to remove admin rights (T379635)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:47:36] <wikibugs>	 (03PS2) 10Ayounsi: Remove old fasw-c-eqiad from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1091182 (https://phabricator.wikimedia.org/T377381)
[08:48:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1087474 (owner: 10Slyngshede)
[08:48:32] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "I'm guessing you intended to check 9.2.6 and not 9.2.5 (same output though)" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1090933 (https://phabricator.wikimedia.org/T379797) (owner: 10Ssingh)
[08:48:40] <kart_>	 DreamRimmer: Patch is available to test using mwdebug servers. Will you be able to test it?
[08:49:13] <DreamRimmer>	 looks good to me. https://www.wikidata.org/w/api.php?action=query&format=json&meta=siteinfo&formatversion=2&siprop=general%7Cusergroups
[08:49:18] <DreamRimmer>	 go for it
[08:49:36] <kart_>	 cool!
[08:49:42] <logmsgbot>	 !log kartik@deploy2002 dreamrimmer, kartik: Continuing with sync
[08:52:56] <icinga-wm>	 PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:53:24] <icinga-wm>	 RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 30.50 ms
[08:53:45] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] airflow-research: change user identitiy files owner to analytics-deploy [puppet] - 10https://gerrit.wikimedia.org/r/1091178 (https://phabricator.wikimedia.org/T378442) (owner: 10Brouberol)
[08:53:57] <icinga-wm>	 ACKNOWLEDGEMENT - Juniper alarms on fasw-c-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.30 ayounsi https://gerrit.wikimedia.org/r/c/operations/puppet/+/1091182 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[08:54:25] <logmsgbot>	 !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1090937|Allow Wikidata bureaucrats to remove admin rights (T379635)]] (duration: 11m 49s)
[08:54:29] <stashbot>	 T379635: Allow Wikidata bureaucrats to remove admin rights - https://phabricator.wikimedia.org/T379635
[08:54:58] <kart_>	 DreamRimmer: done. 
[08:55:11] <vgutierrez>	 !log import haproxy 2.8.12 to thirdparty/haproxy28 component for bullseye-wikimedia (apt.wm.o) - T379891
[08:55:16] <kart_>	 Going ahead with my backport patch.. We are running out of time :/
[08:55:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:19] <stashbot>	 T379891: Upgrade haproxy to 2.8.12 on cp hosts - https://phabricator.wikimedia.org/T379891
[08:55:44] <bvibber>	 if we have to reschedule mine that's ok
[08:55:49] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10321573 (10elukey) Tried to manually set the continuous flag on sretest2001, rebooted but I didn't see the boot options changing like ms-be2088. So at th...
[08:55:51] <DreamRimmer>	 kart_: Thanks :)
[08:56:05] <logmsgbot>	 !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1091007|CX3 Build 0.2.0+20241113 (T368718 T374567)]]
[08:56:09] <stashbot>	 T368718: Community-defined Translation Collections: Single selection mode UI - https://phabricator.wikimedia.org/T368718
[08:56:09] <stashbot>	 T374567: SX: Set aria-label to icon-only Codex buttons - https://phabricator.wikimedia.org/T374567
[08:56:54] <kart_>	 bvibber: looks like train window is next, so might need to check with brennen and jnuche (who are doing train deployment..)
[08:57:02] <bvibber>	 ok
[09:00:05] <jouncebot>	 brennen and jnuche: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T0900).
[09:00:09] <logmsgbot>	 !log kartik@deploy2002 kartik: Backport for [[gerrit:1091007|CX3 Build 0.2.0+20241113 (T368718 T374567)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:00:59] <wikibugs>	 (03PS1) 10JMeybohm: k8s.reimage-stacked-control-plane: Wait for 3m after depooling [cookbooks] - 10https://gerrit.wikimedia.org/r/1091185 (https://phabricator.wikimedia.org/T362408)
[09:01:01] <jnuche>	 kart_, bvibber: hi there, train deployments are happening in US time this week, so you can go ahead with more backports if you want to
[09:01:10] <bvibber>	 \o/
[09:03:41] <kart_>	 cool.
[09:03:53] <kart_>	 bvibber: I'm testing my patch, give me a few minutes.
[09:03:59] <bvibber>	 thx!
[09:04:05] <wikibugs>	 (03PS1) 10Ayounsi: Netbox: disable translation [puppet] - 10https://gerrit.wikimedia.org/r/1091187
[09:04:34] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp4044.ulsfo.wmnet,cp4052.ulsfo.wmnet} and A:cp
[09:04:42] <wikibugs>	 (03CR) 10Stevemunene: airflow-search: define user kubeconfigs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1091179 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol)
[09:05:42] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1091185 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm)
[09:05:57] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] airflow-search: define OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1091180 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol)
[09:06:19] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] airflow-search: define ATS mapping and cache config [puppet] - 10https://gerrit.wikimedia.org/r/1091181 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol)
[09:06:39] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] airflow-search: define k8s namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091175 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol)
[09:07:24] <wikibugs>	 (03PS1) 10Muehlenhoff: lintian: Fix selection of vendor profile [puppet] - 10https://gerrit.wikimedia.org/r/1091188
[09:08:17] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] airflow-search: register tenant namespace in ceph-csi and cloudnative-pg operator configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091176 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol)
[09:08:30] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp4044.ulsfo.wmnet,cp4052.ulsfo.wmnet} and A:cp
[09:09:43] <wikibugs>	 (03CR) 10Muehlenhoff: "FYI; The error "bad-distribution-in-changes-file bullseye-wikimedia" will go away when  https://gerrit.wikimedia.org/r/c/operations/puppet" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1090933 (https://phabricator.wikimedia.org/T379797) (owner: 10Ssingh)
[09:10:44] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] k8s.reimage-stacked-control-plane: Wait for 3m after depooling [cookbooks] - 10https://gerrit.wikimedia.org/r/1091185 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm)
[09:12:38] <kart_>	 still testing.. 
[09:12:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "I'm not going to miss "Vorlagen für Dienste" or "Routen-Ziele"..." [puppet] - 10https://gerrit.wikimedia.org/r/1091187 (owner: 10Ayounsi)
[09:12:49] <wikibugs>	 (03CR) 10Volans: [C:03+2] "Thanks for the fix!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1090927 (owner: 10Kamila Součková)
[09:13:08] <jayme>	 kart_, bvibber: would you be so kind as to ping me when you're done? AIUI the train window is not used so I could start maintenance work early right after you
[09:13:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:13:31] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] airflow-search: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091177 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol)
[09:14:22] <kart_>	 jayme: sure.
[09:14:27] <jayme>	 thanks
[09:16:18] <wikibugs>	 (03Merged) 10jenkins-bot: k8s.reimage-stacked-control-plane: Wait for 3m after depooling [cookbooks] - 10https://gerrit.wikimedia.org/r/1091185 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm)
[09:17:53] <volans>	 !log installed spicerack v8.16.0 on cumin2002
[09:17:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:59] <wikibugs>	 (03CR) 10TChin: flink-app: Add default checkpointing config for Flink 1.20 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090430 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin)
[09:21:06] <logmsgbot>	 !log kartik@deploy2002 kartik: Continuing with sync
[09:21:20] <wikibugs>	 10SRE-tools, 06Data-Persistence-SRE, 06DBA, 06Infrastructure-Foundations, and 2 others: spicerack mysql_legacy: support fetch metrics for instance - https://phabricator.wikimedia.org/T376596#10321649 (10ABran-WMF) >>! In T376596#10205946, @Volans wrote: > Spicerack has support for prometheus, why not getti...
[09:21:51] <kart_>	 bvibber: You'll deploy your patches, right? :)
[09:21:59] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Netbox: disable translation [puppet] - 10https://gerrit.wikimedia.org/r/1091187 (owner: 10Ayounsi)
[09:22:19] <bvibber>	 kart_: in theory i can but i haven't done a deploy by hand in some time :)
[09:23:26] <bvibber>	 best if someone more familiar pushes button
[09:23:48] <bvibber>	 if no time then i'll reschedule
[09:23:49] <kart_>	 OK! :)
[09:23:49] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] lintian: Fix selection of vendor profile [puppet] - 10https://gerrit.wikimedia.org/r/1091188 (owner: 10Muehlenhoff)
[09:23:52] <bvibber>	 thx :)
[09:25:34] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-research: change user identitiy files owner to analytics-deploy [puppet] - 10https://gerrit.wikimedia.org/r/1091178 (https://phabricator.wikimedia.org/T378442) (owner: 10Brouberol)
[09:25:38] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-search: define user kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1091179 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol)
[09:25:45] <logmsgbot>	 !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1091007|CX3 Build 0.2.0+20241113 (T368718 T374567)]] (duration: 29m 40s)
[09:25:50] <stashbot>	 T368718: Community-defined Translation Collections: Single selection mode UI - https://phabricator.wikimedia.org/T368718
[09:25:50] <stashbot>	 T374567: SX: Set aria-label to icon-only Codex buttons - https://phabricator.wikimedia.org/T374567
[09:26:04] <kart_>	 bvibber: ok. deploying first patch..
[09:26:08] <bvibber>	 whee
[09:26:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090988 (https://phabricator.wikimedia.org/T374746) (owner: 10Bvibber)
[09:26:57] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] "Oh dang sorry, I misread merged too fast. I'll update this in a subsequent patch" [puppet] - 10https://gerrit.wikimedia.org/r/1091179 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol)
[09:27:16] <wikibugs>	 (03Merged) 10jenkins-bot: Correction to virtual-globaljsonlinks mapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090988 (https://phabricator.wikimedia.org/T374746) (owner: 10Bvibber)
[09:27:44] <logmsgbot>	 !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1090988|Correction to virtual-globaljsonlinks mapping (T374746)]]
[09:27:47] <stashbot>	 T374746: Cache invalidation based on usage tracking of Data: pages - https://phabricator.wikimedia.org/T374746
[09:27:50] <bvibber>	 \o/
[09:28:52] <wikibugs>	 (03Merged) 10jenkins-bot: doc: fix introduction code bug [software/spicerack] - 10https://gerrit.wikimedia.org/r/1090927 (owner: 10Kamila Součková)
[09:30:45] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "Then yes, I guess we can live with the kernel warning :-)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: 10Volans)
[09:31:27] <logmsgbot>	 !log kartik@deploy2002 bvibber, kartik: Backport for [[gerrit:1090988|Correction to virtual-globaljsonlinks mapping (T374746)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:31:46] <bvibber>	 it's job queue stuff so all i can test is that it doesn't explode ;)
[09:31:49] <kart_>	 bvibber: possible to test this patch? ^
[09:31:58] <kart_>	 ah :)
[09:32:34] <bvibber>	 should be ready to roll, no explody :)
[09:32:39] <kart_>	 cool
[09:32:42] <logmsgbot>	 !log kartik@deploy2002 bvibber, kartik: Continuing with sync
[09:33:04] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-search: define k8s namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091175 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol)
[09:34:25] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:35:03] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-search: register tenant namespace in ceph-csi and cloudnative-pg operator configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091176 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol)
[09:35:09] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:36:29] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:37:03] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:37:48] <logmsgbot>	 !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1090988|Correction to virtual-globaljsonlinks mapping (T374746)]] (duration: 10m 03s)
[09:37:51] <stashbot>	 T374746: Cache invalidation based on usage tracking of Data: pages - https://phabricator.wikimedia.org/T374746
[09:38:01] <bvibber>	 whee
[09:38:03] <kart_>	 bvibber: second patch now..
[09:38:05] <bvibber>	 thx!
[09:38:37] <kart_>	 beta only? should be fast!
[09:38:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090993 (https://phabricator.wikimedia.org/T374746) (owner: 10Bvibber)
[09:39:30] <bvibber>	 success! first patch is a-ok and functional <3
[09:39:36] <wikibugs>	 (03CR) 10Fabfur: "thanks for the review and suggestions" [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur)
[09:39:36] <wikibugs>	 (03Merged) 10jenkins-bot: Avoid use of globaljsonlinks* tables on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090993 (https://phabricator.wikimedia.org/T374746) (owner: 10Bvibber)
[09:39:42] <bvibber>	 whee
[09:40:16] <wikibugs>	 (03PS2) 10Fabfur: haproxykafka: working on TLS client authentication to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776)
[09:42:12] <kart_>	 bvibber: all done.
[09:42:17] <bvibber>	 thx!
[09:42:33] <kart_>	 `09:40:05 Skipping sync since all commits were beta/labs-only changes. Operation completed.`
[09:42:39] <bvibber>	 super :D
[09:42:51] <kart_>	 jayme: we're done with deployment
[09:42:55] <jinxer-wm>	 FIRING: MaxConntrack: Max conntrack at 94.88% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[09:43:02] <jayme>	 kart_: cool, thanks
[09:43:11] <kart_>	 !log Done: UTC morning backport window
[09:43:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:40] <wikibugs>	 (03PS3) 10Fabfur: haproxykafka: working on TLS client authentication to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776)
[09:47:05] <wikibugs>	 (03PS1) 10Stevemunene: Add airflow oidc clients for pcc [labs/private] - 10https://gerrit.wikimedia.org/r/1091193 (https://phabricator.wikimedia.org/T378440)
[09:47:49] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow-search: define OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1091180 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol)
[09:47:55] <jinxer-wm>	 RESOLVED: MaxConntrack: Max conntrack at 94.88% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[09:48:07] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow-search: define ATS mapping and cache config [puppet] - 10https://gerrit.wikimedia.org/r/1091181 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol)
[09:49:08] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow-search: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091177 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol)
[09:49:38] <wikibugs>	 (03PS1) 10Muehlenhoff: Move Puppet CA monitoring out of the puppetmaster module [puppet] - 10https://gerrit.wikimedia.org/r/1091194 (https://phabricator.wikimedia.org/T365798)
[09:50:38] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Add airflow oidc clients for pcc [labs/private] - 10https://gerrit.wikimedia.org/r/1091193 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene)
[09:52:16] <wikibugs>	 (03CR) 10Stevemunene: [V:03+2 C:03+2] Add airflow oidc clients for pcc [labs/private] - 10https://gerrit.wikimedia.org/r/1091193 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene)
[09:53:53] <wikibugs>	 (03PS1) 10Urbanecm: [GrowthExperiments] Add virtual domain config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091197 (https://phabricator.wikimedia.org/T354939)
[09:54:36] <wikibugs>	 10SRE-tools, 06Data-Persistence-SRE, 06DBA, 06Infrastructure-Foundations, and 2 others: spicerack mysql_legacy: support fetch metrics for instance - https://phabricator.wikimedia.org/T376596#10321785 (10ABran-WMF) >>>! In T376596#10205946, @Volans wrote: >> why not getting the metrics directly from there i...
[09:55:05] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur)
[09:56:52] <wikibugs>	 (03CR) 10Vgutierrez: haproxykafka: working on TLS client authentication to kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur)
[10:03:02] <logmsgbot>	 !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@34b35a5] (releasing): (no justification provided)
[10:03:22] <logmsgbot>	 !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@34b35a5] (releasing): (no justification provided) (duration: 00m 21s)
[10:06:11] <wikibugs>	 (03CR) 10Fabfur: haproxykafka: working on TLS client authentication to kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur)
[10:06:23] <logmsgbot>	 !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@34b35a5] (releasing): (no justification provided)
[10:06:30] <wikibugs>	 (03CR) 10FNegri: [C:03+1] "> it is best if we disable it in the provisioning to have a reliable, deterministic state" [cookbooks] - 10https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: 10Volans)
[10:06:35] <logmsgbot>	 !log jayme@cumin2002 START - Cookbook sre.k8s.reimage-stacked-control-plane Reimaging k8s control planes of cluster wikikube-codfw: containerd migration
[10:07:09] <logmsgbot>	 !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@34b35a5] (releasing): (no justification provided) (duration: 00m 47s)
[10:11:07] <logmsgbot>	 !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2001.codfw.wmnet with OS bookworm
[10:14:02] <icinga-wm>	 PROBLEM - BGP status on lsw1-b7-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:15:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2017.codfw.wmnet
[10:16:48] <moritzm>	 !log remove ganeti2017 from active ganeti nodes T376594
[10:16:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:51] <stashbot>	 T376594: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594
[10:19:20] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti2017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[10:19:38] <icinga-wm>	 PROBLEM - ganeti-confd running on ganeti2017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[10:21:42] <wikibugs>	 (03PS1) 10Stevemunene: airflow-analytics-product: register namespace in ceph-csi and cloudnative-pg operator configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091199 (https://phabricator.wikimedia.org/T378440)
[10:21:44] <wikibugs>	 (03PS1) 10Stevemunene: airflow-analytics-product: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091200 (https://phabricator.wikimedia.org/T378440)
[10:22:03] <jinxer-wm>	 FIRING: ProbeDown: Service ganeti2017:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:22:18] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Revert "Remove labswiki from HDFS imported dumps" [puppet] - 10https://gerrit.wikimedia.org/r/1090832 (https://phabricator.wikimedia.org/T217792) (owner: 10Btullis)
[10:24:08] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Great! Thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090900 (https://phabricator.wikimedia.org/T379711) (owner: 10Brouberol)
[10:25:00] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1087609 (https://phabricator.wikimedia.org/T378260) (owner: 10Zabe)
[10:28:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove ganeti role from ganeti2017 [puppet] - 10https://gerrit.wikimedia.org/r/1091201
[10:30:14] <wikibugs>	 (03PS1) 10JMeybohm: k8s.reimage-stacked-control-plane: Ask for the management password early [cookbooks] - 10https://gerrit.wikimedia.org/r/1091202 (https://phabricator.wikimedia.org/T362408)
[10:32:28] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] lintian: Fix selection of vendor profile [puppet] - 10https://gerrit.wikimedia.org/r/1091188 (owner: 10Muehlenhoff)
[10:34:23] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1091194 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[10:36:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] k8s.reimage-stacked-control-plane: Ask for the management password early [cookbooks] - 10https://gerrit.wikimedia.org/r/1091202 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm)
[10:38:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti role from ganeti2017 [puppet] - 10https://gerrit.wikimedia.org/r/1091201 (owner: 10Muehlenhoff)
[10:38:36] <wikibugs>	 (03PS2) 10JMeybohm: k8s.reimage-stacked-control-plane: Ask for the management password early [cookbooks] - 10https://gerrit.wikimedia.org/r/1091202 (https://phabricator.wikimedia.org/T362408)
[10:41:35] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] datahub: leverage liveness and readiness probes for the gms and consumers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090900 (https://phabricator.wikimedia.org/T379711) (owner: 10Brouberol)
[10:42:10] <logmsgbot>	 !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2001.codfw.wmnet with reason: host reimage
[10:44:11] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[10:45:08] <logmsgbot>	 !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2001.codfw.wmnet with reason: host reimage
[10:47:38] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[10:49:08] <jinxer-wm>	 RESOLVED: ProbeDown: Service ganeti2017:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1100)
[11:01:15] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] haproxykafka: working on TLS client authentication to kafka (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur)
[11:06:08] <icinga-wm>	 RECOVERY - BGP status on lsw1-b7-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:06:18] <wikibugs>	 (03CR) 10Physikerwelt: "I like the idea. I was wondering if there is a check for the validity of the project name according to the ldap requirements, see e.g." [puppet] - 10https://gerrit.wikimedia.org/r/1090854 (https://phabricator.wikimedia.org/T379030) (owner: 10Arturo Borrero Gonzalez)
[11:06:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1089638 (owner: 10Slyngshede)
[11:07:51] <wikibugs>	 (03CR) 10Vgutierrez: apt/varnish: Add/Pin varnish-staging component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall)
[11:08:11] <logmsgbot>	 !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl2001.codfw.wmnet with OS bookworm
[11:08:17] <logmsgbot>	 !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.reimage-stacked-control-plane (exit_code=0) Reimaging k8s control planes of cluster wikikube-codfw: containerd migration
[11:09:27] <wikibugs>	 (03PS4) 10Fabfur: haproxykafka: working on TLS client authentication to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776)
[11:09:48] <wikibugs>	 (03CR) 10Fabfur: haproxykafka: working on TLS client authentication to kafka (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur)
[11:14:02] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur)
[11:17:23] <moritzm>	 !log installing openssl security updates
[11:17:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:03] <wikibugs>	 (CR) Elukey: Move Puppet CA monitoring out of the puppetmaster module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1091194 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[11:27:25] <wikibugs>	 (PS1) Volans: mysql_legacy: fix pymysql queries [software/spicerack] - https://gerrit.wikimedia.org/r/1091207
[11:29:46] <wikibugs>	 (PS1) Muehlenhoff: Add cumin alias for ircstream [puppet] - https://gerrit.wikimedia.org/r/1091208
[11:30:36] <wikibugs>	 (PS3) Sergio Gimeno: GrowthExperiments: set experiment config only in pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1090830 (https://phabricator.wikimedia.org/T379681)
[11:30:50] <wikibugs>	 (CR) Elukey: [C:+1] "I am not 100% sure what is the difference in set_replication_parameters (practically) but I trust that you tested it :)" [software/spicerack] - https://gerrit.wikimedia.org/r/1091207 (owner: Volans)
[11:30:54] <wikibugs>	 (CR) Sergio Gimeno: GrowthExperiments: set experiment config only in pilot wikis (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1090830 (https://phabricator.wikimedia.org/T379681) (owner: Sergio Gimeno)
[11:32:49] <wikibugs>	 (CR) Volans: "purely typos, pymysql uses python % string formatting underneath, so it was just a bad syntax and yes I've tested it :)" [software/spicerack] - https://gerrit.wikimedia.org/r/1091207 (owner: Volans)
[11:52:25] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add cumin alias for ircstream [puppet] - https://gerrit.wikimedia.org/r/1091208 (owner: Muehlenhoff)
[11:57:18] <moritzm>	 !log restarting postfix on inbound/outbound servers to pick up openssl updates
[11:57:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling restart_daemons on A:ncredir
[12:00:39] <wikibugs>	 (PS1) Ayounsi: LibreNMS report: various fixes [software/netbox-extras] - https://gerrit.wikimedia.org/r/1091212 (https://phabricator.wikimedia.org/T379907)
[12:04:01] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: Q1:eqiad:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371435#10322139 (cmooney) >>! In T371435#10318507, @RobH wrote: > I'd hand this over to either John or Valerie as ops-eqiad for them to remove any devices...
[12:04:22] <wikibugs>	 (PS2) Muehlenhoff: Move Puppet CA monitoring out of the puppetmaster module [puppet] - https://gerrit.wikimedia.org/r/1091194 (https://phabricator.wikimedia.org/T365798)
[12:05:25] <wikibugs>	 (CR) Muehlenhoff: Move Puppet CA monitoring out of the puppetmaster module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1091194 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[12:08:51] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1091194 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[12:10:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling restart_daemons on A:ncredir
[12:12:45] <wikibugs>	 (CR) Urbanecm: [C:+1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/1090830 (https://phabricator.wikimedia.org/T379681) (owner: Sergio Gimeno)
[12:17:35] <logmsgbot>	 !log jayme@cumin2002 START - Cookbook sre.k8s.reimage-stacked-control-plane Reimaging k8s control planes of cluster wikikube-codfw: containerd migration
[12:18:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-codfw
[12:19:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-codfw
[12:22:13] <logmsgbot>	 !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2003.codfw.wmnet with OS bookworm
[12:23:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[12:23:20] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm (as discussed offline after successful test-cookbook run)" [cookbooks] - https://gerrit.wikimedia.org/r/1091202 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[12:25:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7002.magru.wmnet
[12:25:08] <icinga-wm>	 PROBLEM - BGP status on lsw1-a2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:26:10] <wikibugs>	 (PS1) Muehlenhoff: Enable nftables for ganeti7002 [puppet] - https://gerrit.wikimedia.org/r/1091224
[12:28:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[12:29:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7002.magru.wmnet
[12:29:38] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Enable nftables for ganeti7002 [puppet] - https://gerrit.wikimedia.org/r/1091224 (owner: Muehlenhoff)
[12:32:17] <wikibugs>	 (PS1) Clément Goubert: wikikube: Add wikikube-worker13[05-12] [puppet] - https://gerrit.wikimedia.org/r/1091225 (https://phabricator.wikimedia.org/T377022)
[12:35:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet
[12:38:36] <logmsgbot>	 !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2003.codfw.wmnet with reason: host reimage
[12:38:39] <Dreamy_Jazz>	 jouncebot: nowandnext
[12:38:39] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 21 minute(s)
[12:38:39] <jouncebot>	 In 0 hour(s) and 21 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1300)
[12:38:58] <Dreamy_Jazz>	 Any objections to me deploying a config change?
[12:40:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1090511 (https://phabricator.wikimedia.org/T379583) (owner: Dreamy Jazz)
[12:41:22] <wikibugs>	 (Merged) jenkins-bot: Hide IP reveal tools on Special:AbuseLog and Special:GlobalBlockList [mediawiki-config] - https://gerrit.wikimedia.org/r/1090511 (https://phabricator.wikimedia.org/T379583) (owner: Dreamy Jazz)
[12:41:52] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1090511|Hide IP reveal tools on Special:AbuseLog and Special:GlobalBlockList (T379583)]]
[12:41:56] <stashbot>	 T379583: Find and exclude special pages where temporary account IP reveal is not necessary - https://phabricator.wikimedia.org/T379583
[12:42:06] <logmsgbot>	 !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2003.codfw.wmnet with reason: host reimage
[12:43:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet
[12:45:47] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1090511|Hide IP reveal tools on Special:AbuseLog and Special:GlobalBlockList (T379583)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:46:22] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync
[12:48:01] <wikibugs>	 ops-eqsin, SRE: Inbound interface errors - asw1-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T376837#10322250 (RobH) Open→Declined
[12:49:08] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ganeti7002:1811 has failed probes (tcp_ganeti_noded_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:49:09] <moritzm>	 !log failover ganeti master of magru02 to ganeti7002
[12:49:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:56] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti7004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[12:51:00] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1090511|Hide IP reveal tools on Special:AbuseLog and Special:GlobalBlockList (T379583)]] (duration: 09m 08s)
[12:51:16] <stashbot>	 T379583: Find and exclude special pages where temporary account IP reveal is not necessary - https://phabricator.wikimedia.org/T379583
[12:51:28] <wikibugs>	 (PS1) KartikMistry: CX3 Build 0.2.0+20241114 [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - https://gerrit.wikimedia.org/r/1091227
[12:51:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7004.magru.wmnet
[12:52:03] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ganeti7002:1811 has failed probes (tcp_ganeti_noded_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:52:25] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - https://gerrit.wikimedia.org/r/1091227 (owner: KartikMistry)
[12:52:33] <moritzm>	 !log installing apache2 security updates
[12:52:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:59] <wikibugs>	 (PS1) Cathal Mooney: Remove old fr-tech switch stack from rancid backups [puppet] - https://gerrit.wikimedia.org/r/1091228 (https://phabricator.wikimedia.org/T377381)
[12:53:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7004.magru.wmnet
[12:53:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-eqiad
[12:54:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-eqiad
[12:57:20] <icinga-wm>	 PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:58:19] <wikibugs>	 (CR) Arnaudb: [C:+1] mysql_legacy: fix pymysql queries [software/spicerack] - https://gerrit.wikimedia.org/r/1091207 (owner: Volans)
[12:58:22] <icinga-wm>	 PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:58:22] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin
[12:58:23] <icinga-wm>	 status
[12:58:24] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin
[12:58:24] <icinga-wm>	 status
[12:59:20] <icinga-wm>	 RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 44, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1300)
[13:00:43] <jinxer-wm>	 FIRING: ProbeDown: Service miscweb2003:30443 has failed probes (http_bienvenida_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:01:24] <icinga-wm>	 PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:01:24] <icinga-wm>	 RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:03:18] <icinga-wm>	 RECOVERY - BGP status on lsw1-a2-codfw.mgmt is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:03:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:04:33] <jinxer-wm>	 FIRING: [5x] KubernetesCalicoDown: kubernetes2052.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:04:56] <logmsgbot>	 !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl2003.codfw.wmnet with OS bookworm
[13:05:01] <logmsgbot>	 !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.reimage-stacked-control-plane (exit_code=0) Reimaging k8s control planes of cluster wikikube-codfw: containerd migration
[13:05:20] <icinga-wm>	 PROBLEM - BGP status on lsw1-d4-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:05:24] <icinga-wm>	 RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 38, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:05:24] <icinga-wm>	 PROBLEM - BGP status on lsw1-b3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:05:30] <wikibugs>	 (PS1) Muehlenhoff: Switch ganeti7004 to nftables [puppet] - https://gerrit.wikimedia.org/r/1091230
[13:05:43] <jinxer-wm>	 RESOLVED: ProbeDown: Service miscweb2003:30443 has failed probes (http_bienvenida_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:06:20] <icinga-wm>	 RECOVERY - BGP status on lsw1-d4-codfw.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:06:35] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1090830 (https://phabricator.wikimedia.org/T379681) (owner: Sergio Gimeno)
[13:07:28] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 286, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:07:31] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 370, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:07:43] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch ganeti7004 to nftables [puppet] - https://gerrit.wikimedia.org/r/1091230 (owner: Muehlenhoff)
[13:08:21] <wikibugs>	 (PS1) Sergio Gimeno: HomepageHooks: run metrics increment in deferred update [extensions/GrowthExperiments] (wmf/1.44.0-wmf.3) - https://gerrit.wikimedia.org/r/1091231 (https://phabricator.wikimedia.org/T379682)
[13:08:24] <icinga-wm>	 RECOVERY - BGP status on lsw1-b3-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:08:34] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.3) - https://gerrit.wikimedia.org/r/1091231 (https://phabricator.wikimedia.org/T379682) (owner: Sergio Gimeno)
[13:09:12] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:09:33] <jinxer-wm>	 RESOLVED: [7x] KubernetesCalicoDown: kubernetes2052.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:13:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:17:11] <wikibugs>	 (PS1) Giuseppe Lavagetto: Deploy fix for search button height [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1091232
[13:17:25] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+2 C:+2] Deploy fix for search button height [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1091232 (owner: Giuseppe Lavagetto)
[13:18:13] <logmsgbot>	 !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Fix search button height - oblivian@cumin1002"
[13:18:15] <logmsgbot>	 !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix search button height - oblivian@cumin1002
[13:18:51] <logmsgbot>	 !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix search button height - oblivian@cumin1002
[13:18:52] <logmsgbot>	 !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Fix search button height - oblivian@cumin1002"
[13:21:08] <logmsgbot>	 !log kcvelaga@deploy2002 Started deploy [airflow-dags/analytics_product@c5ab766]: T379546
[13:21:13] <stashbot>	 T379546: Update the product-analytics DAGs to use miniforge instead of condaforge - https://phabricator.wikimedia.org/T379546
[13:21:48] <logmsgbot>	 !log kcvelaga@deploy2002 Finished deploy [airflow-dags/analytics_product@c5ab766]: T379546 (duration: 00m 54s)
[13:26:44] <wikibugs>	 (CR) Volans: [C:+2] mysql_legacy: fix pymysql queries [software/spicerack] - https://gerrit.wikimedia.org/r/1091207 (owner: Volans)
[13:29:20] <wikibugs>	 SRE, Infrastructure-Foundations, netops, procurement, Patch-For-Review: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10322386 (RobH)
[13:30:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7004.magru.wmnet
[13:34:59] <kart_>	 jouncebot: !next
[13:35:20] <kart_>	 forgot the command :D
[13:35:41] <kart_>	 jouncebot: nowandnext
[13:35:41] <jouncebot>	 For the next 0 hour(s) and 24 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1300)
[13:35:41] <jouncebot>	 In 0 hour(s) and 24 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1400)
[13:36:12] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@2220747]: Stage Refine parallelization improvment [airflow-dags@2220747d]
[13:36:21] <kart_>	 Doing early +2 for my upcoming backport. CI taking some 30 minutes..
[13:36:28] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@2220747]: Stage Refine parallelization improvment [airflow-dags@2220747d] (duration: 00m 15s)
[13:37:23] <wikibugs>	 (Merged) jenkins-bot: mysql_legacy: fix pymysql queries [software/spicerack] - https://gerrit.wikimedia.org/r/1091207 (owner: Volans)
[13:38:10] <wikibugs>	 (PS3) Volans: mysql: remove unused module [software/spicerack] - https://gerrit.wikimedia.org/r/1087855
[13:38:10] <wikibugs>	 (PS3) Volans: mysql_legacy: rename to mysql [software/spicerack] - https://gerrit.wikimedia.org/r/1087856
[13:38:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7004.magru.wmnet
[13:38:29] <wikibugs>	 (PS5) Fabfur: haproxykafka: working on TLS client authentication to kafka [puppet] - https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776)
[13:38:55] <wikibugs>	 (CR) Fabfur: haproxykafka: working on TLS client authentication to kafka (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: Fabfur)
[13:42:35] <wikibugs>	 (PS11) Fabfur: haproxy: add ring support to haproxy configuration [puppet] - https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332)
[13:44:11] <Lucas_WMDE>	 kart_: I don’t see any early +2 yet…
[13:44:49] <wikibugs>	 (CR) KartikMistry: [C:+2] CX3 Build 0.2.0+20241114 [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - https://gerrit.wikimedia.org/r/1091227 (owner: KartikMistry)
[13:44:53] <Lucas_WMDE>	 yay
[13:45:01] <kart_>	 Lucas_WMDE: sorry :D
[13:45:06] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[13:45:42] <wikibugs>	 (CR) Slyngshede: [C:+2] Account Managers: Allow account managers to be assigned by LDAP. [software/bitu] - https://gerrit.wikimedia.org/r/1087474 (owner: Slyngshede)
[13:47:54] <wikibugs>	 (PS1) Volans: CHANGELOG: add changelogs for release v8.16.1 [software/spicerack] - https://gerrit.wikimedia.org/r/1091235
[13:48:10] <sergi0>	 kart_: should I also do that with my change or that would interfere with yours?
[13:48:11] <wikibugs>	 (CR) Slyngshede: [C:+1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/1091180 (https://phabricator.wikimedia.org/T378441) (owner: Brouberol)
[13:48:21] <wikibugs>	 (Merged) jenkins-bot: Account Managers: Allow account managers to be assigned by LDAP. [software/bitu] - https://gerrit.wikimedia.org/r/1087474 (owner: Slyngshede)
[13:49:32] <kart_>	 sergi0: Please wait till I start the deployment.. I'll ping.
[13:49:33] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics@2220747]: Stage Refine parallelization improvment [airflow-dags@2220747d]
[13:50:05] <sergi0>	 kart_: ack
[13:50:25] <wikibugs>	 (CR) Ladsgroup: [C:+1] [GrowthExperiments] Add virtual domain config [mediawiki-config] - https://gerrit.wikimedia.org/r/1091197 (https://phabricator.wikimedia.org/T354939) (owner: Urbanecm)
[13:50:41] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@2220747]: Stage Refine parallelization improvment [airflow-dags@2220747d] (duration: 01m 08s)
[13:51:53] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: Fabfur)
[13:54:35] <wikibugs>	 (PS1) Slyngshede: P:idp experimental webauthn [puppet] - https://gerrit.wikimedia.org/r/1091237 (https://phabricator.wikimedia.org/T311236)
[13:57:45] <wikibugs>	 (PS12) Fabfur: haproxy: add ring support to haproxy configuration [puppet] - https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332)
[13:59:12] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1400)
[14:00:05] <jouncebot>	 kart_ and sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:18] <sergi0>	 o/
[14:00:42] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[14:02:26] <Lucas_WMDE>	 o/
[14:02:30] <Lucas_WMDE>	 I think I can deploy!
[14:02:57] <kart_>	 Lucas_WMDE: I'll be deploying my patch first :)
[14:03:05] <Lucas_WMDE>	 I was about to ask if you wanted to self-service :)
[14:03:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:04:29] * urbanecm waves
[14:04:37] <urbanecm>	 sergi0: let me know if you need any assistance with your patches
[14:05:09] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10322543 (MoritzMuehlenhoff)
[14:05:27] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough and A:wikidough
[14:05:28] <sergi0>	 urbanecm: yes, I'd appreciate, ty
[14:05:39] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10322555 (MoritzMuehlenhoff)
[14:05:56] <Lucas_WMDE>	 kart_: can you already start your scap backport?
[14:06:00] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - https://gerrit.wikimedia.org/r/1091227 (owner: KartikMistry)
[14:06:11] <Lucas_WMDE>	 thanks :)
[14:06:17] <Lucas_WMDE>	 then it’s more visible that you’re first in line :P
[14:06:43] <wikibugs>	 (CR) Volans: [C:+2] CHANGELOG: add changelogs for release v8.16.1 [software/spicerack] - https://gerrit.wikimedia.org/r/1091235 (owner: Volans)
[14:07:18] <wikibugs>	 (PS2) Clément Goubert: wikikube: Add wikikube-worker13[05-12] [puppet] - https://gerrit.wikimedia.org/r/1091225 (https://phabricator.wikimedia.org/T377022)
[14:08:25] <kart_>	 Lucas_WMDE: yeah, started.. still waiting for CI..
[14:09:35] <Lucas_WMDE>	 yeah
[14:10:24] <wikibugs>	 (PS3) Clément Goubert: wikikube: Add wikikube-worker13[05-12] [puppet] - https://gerrit.wikimedia.org/r/1091225 (https://phabricator.wikimedia.org/T377022)
[14:11:28] <wikibugs>	 SRE, Observability-Alerting, Traffic, Patch-For-Review: PuppetFailure alert is not being fired for host(s) where agent has failed - https://phabricator.wikimedia.org/T379807#10322574 (ssingh) Thanks for the investigation and fix @colewhite!   >>! In T379807#10319559, @colewhite wrote: > The issue...
[14:13:09] <wikibugs>	 (Merged) jenkins-bot: CX3 Build 0.2.0+20241114 [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - https://gerrit.wikimedia.org/r/1091227 (owner: KartikMistry)
[14:13:39] <logmsgbot>	 !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1091227|CX3 Build 0.2.0+20241114]]
[14:13:40] <wikibugs>	 (CR) Ssingh: "@vgutierrez@wikimedia.org: yes indeed and it seems like I did run on 9.2.6 but I erroneously pasted the one from my Ctrl+R. Thanks for che" [debs/trafficserver] - https://gerrit.wikimedia.org/r/1090933 (https://phabricator.wikimedia.org/T379797) (owner: Ssingh)
[14:13:54] <wikibugs>	 (CR) Ssingh: [C:+2] Release 9.2.6-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/1090933 (https://phabricator.wikimedia.org/T379797) (owner: Ssingh)
[14:15:07] <wikibugs>	 (CR) Xcollazo: [C:+1] dse-k8s-services: mw-dump: version bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1088275 (https://phabricator.wikimedia.org/T368746) (owner: Gmodena)
[14:16:10] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4522/co" [puppet] - https://gerrit.wikimedia.org/r/1091237 (https://phabricator.wikimedia.org/T311236) (owner: Slyngshede)
[14:16:46] <wikibugs>	 (Merged) jenkins-bot: CHANGELOG: add changelogs for release v8.16.1 [software/spicerack] - https://gerrit.wikimedia.org/r/1091235 (owner: Volans)
[14:17:21] <logmsgbot>	 !log kartik@deploy2002 kartik: Backport for [[gerrit:1091227|CX3 Build 0.2.0+20241114]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:17:58] <wikibugs>	 (PS2) Slyngshede: P:idp experimental webauthn [puppet] - https://gerrit.wikimedia.org/r/1091237 (https://phabricator.wikimedia.org/T311236)
[14:18:19] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough and A:wikidough
[14:18:47] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4523/co" [puppet] - https://gerrit.wikimedia.org/r/1091237 (https://phabricator.wikimedia.org/T311236) (owner: Slyngshede)
[14:21:37] <wikibugs>	 (CR) Ssingh: [C:+1] "I did a brief check for other ^# deploy-sites and it seems like this is the only one with the erroneous entry. Thanks for the fix!" [alerts] - https://gerrit.wikimedia.org/r/1090976 (https://phabricator.wikimedia.org/T379807) (owner: Cwhite)
[14:22:15] <logmsgbot>	 !log kartik@deploy2002 kartik: Continuing with sync
[14:25:19] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox and A:magru and A:dnsbox
[14:26:27] <wikibugs>	 (PS1) Volans: Upstream release v8.16.1 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/1091242
[14:27:03] <logmsgbot>	 !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1091227|CX3 Build 0.2.0+20241114]] (duration: 13m 23s)
[14:28:40] <sergi0>	 urbanecm: shall we proceed?
[14:29:01] <urbanecm>	 sergi0: if kart_ is done, why not!
[14:29:18] <kart_>	 yes
[14:29:26] <kart_>	 Please go ahead.
[14:30:19] <urbanecm>	 sergi0: wanna do the deployment?
[14:30:57] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on A:dnsbox and A:magru and A:dnsbox
[14:31:39] * sergi0 trying to log in deploy server to answer that
[14:32:55] <wikibugs>	 (CR) Urbanecm: [C:+2] HomepageHooks: run metrics increment in deferred update [extensions/GrowthExperiments] (wmf/1.44.0-wmf.3) - https://gerrit.wikimedia.org/r/1091231 (https://phabricator.wikimedia.org/T379682) (owner: Sergio Gimeno)
[14:32:58] <urbanecm>	 started CI first
[14:33:00] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox and not A:magru and A:dnsbox
[14:33:29] <wikibugs>	 (CR) Gmodena: [C:+2] dse-k8s-services: mw-dump: version bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1088275 (https://phabricator.wikimedia.org/T368746) (owner: Gmodena)
[14:34:46] <wikibugs>	 (Merged) jenkins-bot: dse-k8s-services: mw-dump: version bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1088275 (https://phabricator.wikimedia.org/T368746) (owner: Gmodena)
[14:36:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1090830 (https://phabricator.wikimedia.org/T379681) (owner: Sergio Gimeno)
[14:37:03] <wikibugs>	 (Merged) jenkins-bot: GrowthExperiments: set experiment config only in pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1090830 (https://phabricator.wikimedia.org/T379681) (owner: Sergio Gimeno)
[14:37:29] <logmsgbot>	 !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1090830|GrowthExperiments: set experiment config only in pilot wikis (T379681)]]
[14:37:33] <stashbot>	 T379681: community-updates-module variant is assigned outside of Growth pilot wikis - https://phabricator.wikimedia.org/T379681
[14:38:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] spark: Avoid Ferm-specific syntax (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/1087488 (owner: 10Muehlenhoff)
[14:40:03] <wikibugs>	 (03CR) 10Volans: [C:03+2] Upstream release v8.16.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1091242 (owner: 10Volans)
[14:41:24] <logmsgbot>	 !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1090830|GrowthExperiments: set experiment config only in pilot wikis (T379681)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:45:50] <logmsgbot>	 !log sgimeno@deploy2002 sgimeno: Continuing with sync
[14:48:51] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] CommonSettings.php: Properly set $wgCSPReportOnlyHeader/$wgCSPHeader to array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090998 (https://phabricator.wikimedia.org/T379834) (owner: 10Reedy)
[14:49:54] <wikibugs>	 (03Merged) 10jenkins-bot: Upstream release v8.16.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1091242 (owner: 10Volans)
[14:50:31] <logmsgbot>	 !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1090830|GrowthExperiments: set experiment config only in pilot wikis (T379681)]] (duration: 13m 02s)
[14:50:35] <stashbot>	 T379681: community-updates-module variant is assigned outside of Growth pilot wikis - https://phabricator.wikimedia.org/T379681
[14:52:17] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] Remove old fasw-c-eqiad from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1091182 (https://phabricator.wikimedia.org/T377381) (owner: 10Ayounsi)
[14:52:36] <wikibugs>	 (03Abandoned) 10Cathal Mooney: Remove old fr-tech switch stack from rancid backups [puppet] - 10https://gerrit.wikimedia.org/r/1091228 (https://phabricator.wikimedia.org/T377381) (owner: 10Cathal Mooney)
[14:52:49] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Remove old fasw-c-eqiad from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1091182 (https://phabricator.wikimedia.org/T377381) (owner: 10Ayounsi)
[14:53:18] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Remove old fasw-c-eqiad from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1091182 (https://phabricator.wikimedia.org/T377381) (owner: 10Ayounsi)
[14:53:35] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091231 (https://phabricator.wikimedia.org/T379682) (owner: 10Sergio Gimeno)
[14:53:46] <volans>	 !log uploaded spicerack_8.16.1 to apt.wikimedia.org bullseye-wikimedia
[14:53:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:51] <wikibugs>	 (03PS1) 10Ssingh: hiera: set do_ipv6_primary_ra for all LVS in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1091243 (https://phabricator.wikimedia.org/T358260)
[14:54:53] <wikibugs>	 (03Merged) 10jenkins-bot: HomepageHooks: run metrics increment in deferred update [extensions/GrowthExperiments] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091231 (https://phabricator.wikimedia.org/T379682) (owner: 10Sergio Gimeno)
[14:55:25] <logmsgbot>	 !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1091231|HomepageHooks: run metrics increment in deferred update (T379682)]]
[14:55:29] <stashbot>	 T379682: Growth KPI Grafana dashboard claims control is not assigned to any users at enwiki - https://phabricator.wikimedia.org/T379682
[14:56:52] <wikibugs>	 (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/1091243/4524/" [puppet] - 10https://gerrit.wikimedia.org/r/1091243 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh)
[14:58:41] <wikibugs>	 (03PS1) 10Muehlenhoff: puppetserver: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1091245
[14:59:21] <logmsgbot>	 !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1091231|HomepageHooks: run metrics increment in deferred update (T379682)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:02:03] <logmsgbot>	 !log sgimeno@deploy2002 sgimeno: Continuing with sync
[15:02:11] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:06:41] <logmsgbot>	 !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1091231|HomepageHooks: run metrics increment in deferred update (T379682)]] (duration: 11m 15s)
[15:06:57] <urbanecm>	 hi Amir1, re https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1091197... will that work even in beta (where there is no x1 AFAIK)?
[15:07:03] <stashbot>	 T379682: Growth KPI Grafana dashboard claims control is not assigned to any users at enwiki - https://phabricator.wikimedia.org/T379682
[15:07:05] <urbanecm>	 or do i need to negate that in CS-labs.php?
[15:07:06] <sergi0>	 !log UTC afternoon deploys done
[15:07:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:35] <Amir1>	 urbanecm: I think (not sure), beta has x1 too?
[15:07:45] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:07:45] <Amir1>	 but it's wikishared db maybe
[15:07:56] <Amir1>	 if not, then yes, negate it in -labs :D
[15:08:10] <Amir1>	 or make it conditional in CS.php
[15:08:31] <urbanecm>	 Amir1: ahh, it defines `extension1` in the config, but it points it to the same server...
[15:08:50] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-search: define OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1091180 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol)
[15:08:57] <urbanecm>	 thanks!
[15:13:02] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-search: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091177 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol)
[15:13:26] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1091245 (owner: 10Muehlenhoff)
[15:15:27] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply
[15:15:56] <wikibugs>	 (03PS3) 10JHathaway: EFI don't remove override twice [cookbooks] - 10https://gerrit.wikimedia.org/r/1091009
[15:16:16] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply
[15:16:41] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1091181 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol)
[15:16:42] <wikibugs>	 (03CR) 10JHathaway: "per our discussion on IRC, added some more context to the patch, noting the reason for the original addition." [cookbooks] - 10https://gerrit.wikimedia.org/r/1091009 (owner: 10JHathaway)
[15:17:51] <wikibugs>	 (03PS1) 10Ladsgroup: Revert "mmv.js: Store comingFromHashChange as a class property" [extensions/MultimediaViewer] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091248 (https://phabricator.wikimedia.org/T379835)
[15:18:41] <Amir1>	 jouncebot: nowandnext
[15:18:42] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 41 minute(s)
[15:18:42] <jouncebot>	 In 0 hour(s) and 41 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1600)
[15:19:45] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Revert "mmv.js: Store comingFromHashChange as a class property" [extensions/MultimediaViewer] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091248 (https://phabricator.wikimedia.org/T379835) (owner: 10Ladsgroup)
[15:22:15] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [extensions/MultimediaViewer] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091248 (https://phabricator.wikimedia.org/T379835) (owner: 10Ladsgroup)
[15:22:34] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mmv.js: Store comingFromHashChange as a class property" [extensions/MultimediaViewer] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091248 (https://phabricator.wikimedia.org/T379835) (owner: 10Ladsgroup)
[15:23:04] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1091248|Revert "mmv.js: Store comingFromHashChange as a class property" (T379835)]]
[15:23:08] <stashbot>	 T379835: Closing an image in MultimediaViewer does not remove the URL fragment - https://phabricator.wikimedia.org/T379835
[15:24:02] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on A:dnsbox and not A:magru and A:dnsbox
[15:24:12] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: wikikube-ctrl2002: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379719#10322813 (10Jhancock.wm) I could do this today. or we can wait until next week. assuming no one wants to do a maintenance o...
[15:24:31] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:24:32] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:25:05] <wikibugs>	 (03PS4) 10JHathaway: EFI don't remove override twice [cookbooks] - 10https://gerrit.wikimedia.org/r/1091009
[15:25:24] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1091009 (owner: 10JHathaway)
[15:25:38] <wikibugs>	 (03PS1) 10Andrew Bogott: resolvconf: don't update resolv.conf with 0 nameservers [puppet] - 10https://gerrit.wikimedia.org/r/1091249
[15:25:40] <wikibugs>	 (03CR) 10Brouberol: "Because the service is fully behind the kubernetes ingress, we _don't have to_ register it under LVS. We can though, but this is not what " [puppet] - 10https://gerrit.wikimedia.org/r/1090977 (https://phabricator.wikimedia.org/T365659) (owner: 10Bking)
[15:26:15] <wikibugs>	 06SRE, 10Bitu, 06Infrastructure-Foundations: Allow to provide links for Bitu permissions - https://phabricator.wikimedia.org/T379926 (10MoritzMuehlenhoff) 03NEW
[15:26:16] <wikibugs>	 (03CR) 10CI reject: [V:04-1] resolvconf: don't update resolv.conf with 0 nameservers [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (owner: 10Andrew Bogott)
[15:27:28] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1091248|Revert "mmv.js: Store comingFromHashChange as a class property" (T379835)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:27:40] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[15:28:14] <wikibugs>	 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927 (10fnegri) 03NEW
[15:28:44] <logmsgbot>	 !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl2002.codfw.wmnet
[15:28:46] <logmsgbot>	 !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl2002.codfw.wmnet
[15:28:49] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: wikikube-ctrl2002: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379719#10322881 (10ops-monitoring-bot) depool host wikikube-ctrl2002.codfw.wmnet by jayme@cumin2002 with reason: None
[15:28:52] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: wikikube-ctrl2002: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379719#10322882 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin2002 depool for host wiki...
[15:29:05] <wikibugs>	 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10322866 (10fnegri) 05Open→03Resolved a:03fnegri The issue is resolved, I created this task to track it in case it happens...
[15:29:13] <logmsgbot>	 !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2002.codfw.wmnet with reason: T379719
[15:29:19] <stashbot>	 T379719: wikikube-ctrl2002: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379719
[15:29:29] <logmsgbot>	 !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2002.codfw.wmnet with reason: T379719
[15:29:43] <wikibugs>	 (03CR) 10Herron: [C:03+1] "let em in!" [alerts] - 10https://gerrit.wikimedia.org/r/1090976 (https://phabricator.wikimedia.org/T379807) (owner: 10Cwhite)
[15:30:05] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-ntp rolling restart_daemons on A:dnsbox
[15:30:14] <wikibugs>	 (03PS2) 10Andrew Bogott: resolvconf: don't update resolv.conf with 0 nameservers [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927)
[15:31:41] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "+1 but see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1091225 (https://phabricator.wikimedia.org/T377022) (owner: 10Clément Goubert)
[15:32:39] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] EFI don't remove override twice [cookbooks] - 10https://gerrit.wikimedia.org/r/1091009 (owner: 10JHathaway)
[15:33:30] <sukhe>	 !log reprepro -C main include bullseye-wikimedia trafficserver_9.2.6-1wm1_amd64.changes: T379797
[15:33:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:46] <stashbot>	 T379797: Package and deploy ATS 9.2.6 - https://phabricator.wikimedia.org/T379797
[15:34:45] <wikibugs>	 (03CR) 10Clément Goubert: wikikube: Add wikikube-worker13[05-12] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1091225 (https://phabricator.wikimedia.org/T377022) (owner: 10Clément Goubert)
[15:35:14] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1091248|Revert "mmv.js: Store comingFromHashChange as a class property" (T379835)]] (duration: 12m 10s)
[15:35:18] <stashbot>	 T379835: Closing an image in MultimediaViewer does not remove the URL fragment - https://phabricator.wikimedia.org/T379835
[15:35:30] <icinga-wm>	 PROBLEM - BGP status on lsw1-c7-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:36:05] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqiad [reason: junos upgrade, T364092]
[15:36:10] <stashbot>	 T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092
[15:36:21] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqiad [reason: junos upgrade, T364092]
[15:36:25] <wikibugs>	 (03CR) 10Jbond: [C:04-1] "I don't think this will fix the underlying issue, see comments. I'll take a look at the task" [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott)
[15:37:07] <wikibugs>	 (03PS4) 10Clément Goubert: wikikube: Add wikikube-worker13[05-12] [puppet] - 10https://gerrit.wikimedia.org/r/1091225 (https://phabricator.wikimedia.org/T377022)
[15:37:16] <volans>	 !log installed spicerack v8.16.1 to cumin hosts
[15:37:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:17] <wikibugs>	 (03CR) 10Volans: "This can now be tested with test-cookbook as spicerack has been released and deployed." [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans)
[15:38:24] <wikibugs>	 (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans)
[15:38:47] <wikibugs>	 (03PS5) 10Reedy: CommonSettings.php: Properly set $wgCSPReportOnlyHeader/$wgCSPHeader to array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090998 (https://phabricator.wikimedia.org/T379834)
[15:38:51] <wikibugs>	 (03CR) 10Reedy: [C:03+2] CommonSettings.php: Properly set $wgCSPReportOnlyHeader/$wgCSPHeader to array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090998 (https://phabricator.wikimedia.org/T379834) (owner: 10Reedy)
[15:39:22] <wikibugs>	 (03CR) 10Clément Goubert: wikikube: Add wikikube-worker13[05-12] (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1091225 (https://phabricator.wikimedia.org/T377022) (owner: 10Clément Goubert)
[15:39:34] <wikibugs>	 (03Merged) 10jenkins-bot: CommonSettings.php: Properly set $wgCSPReportOnlyHeader/$wgCSPHeader to array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090998 (https://phabricator.wikimedia.org/T379834) (owner: 10Reedy)
[15:39:49] <logmsgbot>	 !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1020.eqiad.wmnet with OS bullseye
[15:40:16] <logmsgbot>	 !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1016.eqiad.wmnet with OS bullseye
[15:42:24] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp4043*,cp4051*} and A:cp for 9.2.6-1wm1
[15:43:16] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.network.cf
[15:43:16] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[15:44:30] <icinga-wm>	 RECOVERY - BGP status on lsw1-c7-codfw.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:45:18] <logmsgbot>	 !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl2002.codfw.wmnet
[15:45:21] <logmsgbot>	 !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl2002.codfw.wmnet
[15:45:24] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: wikikube-ctrl2002: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379719#10322923 (10ops-monitoring-bot) pool host wikikube-ctrl2002.codfw.wmnet by jayme@cumin2002 with reason: None
[15:45:28] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: wikikube-ctrl2002: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379719#10322926 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin2002 pool for host wikiku...
[15:45:39] <logmsgbot>	 !log jayme@cumin2002 START - Cookbook sre.hosts.remove-downtime for wikikube-ctrl2002.codfw.wmnet
[15:45:40] <logmsgbot>	 !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-ctrl2002.codfw.wmnet
[15:46:31] <wikibugs>	 (03PS3) 10Arnaudb: dbtools: command line helper to evaluate a host, or a group of hosts [software] - 10https://gerrit.wikimedia.org/r/1091250 (https://phabricator.wikimedia.org/T378715)
[15:46:31] <wikibugs>	 (03CR) 10Arnaudb: "this script has been tested and used here: https://phabricator.wikimedia.org/T378715#10322914" [software] - 10https://gerrit.wikimedia.org/r/1091250 (https://phabricator.wikimedia.org/T378715) (owner: 10Arnaudb)
[15:47:25] <logmsgbot>	 !log sukhe@cumin1002 END (ERROR) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=97) Rolling upgrade/restart of Apache Traffic Server on P{cp4043*,cp4051*} and A:cp for 9.2.6-1wm1
[15:47:40] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp4043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:47:46] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4043.ulsfo.wmnet
[15:47:50] <icinga-wm>	 PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:47:58] <sukhe>	 ^ depooled, looking
[15:47:59] <logmsgbot>	 !log reedy@deploy2002 Synchronized wmf-config/CommonSettings.php: T379834 (duration: 08m 02s)
[15:48:03] <stashbot>	 T379834: PHP Deprecated: Automatic conversion of false to array is deprecated - https://phabricator.wikimedia.org/T379834
[15:48:04] <sukhe>	 upgrade didn't go smoothly :)
[15:48:53] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] wikikube: Add wikikube-worker13[05-12] [puppet] - 10https://gerrit.wikimedia.org/r/1091225 (https://phabricator.wikimedia.org/T377022) (owner: 10Clément Goubert)
[15:49:21] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: wikikube-ctrl2002: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379719#10322929 (10JMeybohm) 05Open→03Resolved a:03JMeybohm @Jhancock.wm swapped the cable into port 1, I've changed BIO...
[15:49:49] <moritzm>	 !log installing nss security updates
[15:49:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:38] <wikibugs>	 (03PS4) 10Arnaudb: dbtools: command line helper to evaluate a host, or a group of hosts [software] - 10https://gerrit.wikimedia.org/r/1091250 (https://phabricator.wikimedia.org/T378715)
[15:55:09] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr1-eqiad,cr1-eqiad IPV6,cr1-eqiad.mgmt with reason: router upgrade
[15:55:10] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cr1-eqiad,cr1-eqiad IPV6,cr1-eqiad.mgmt with reason: router upgrade
[15:55:30] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] puppetserver: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1091245 (owner: 10Muehlenhoff)
[15:55:54] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance backend on cp4043 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:56:22] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on cp4043.ulsfo.wmnet with reason: depooled, debugging
[15:56:22] <wikibugs>	 (03CR) 10Michael Große: [C:03+1] [GrowthExperiments] Add virtual domain config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091197 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm)
[15:56:35] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cp4043.ulsfo.wmnet with reason: depooled, debugging
[15:57:15] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr1-eqiad,cr1-eqiad IPV6,re0.cr1-eqiad.mgmt with reason: router upgrade
[15:57:16] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cr1-eqiad,cr1-eqiad IPV6,re0.cr1-eqiad.mgmt with reason: router upgrade
[16:00:05] <jouncebot>	 brennen and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1600).
[16:00:56] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2139.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:01:28] <wikibugs>	 (03PS1) 10Muehlenhoff: Add ml-lab Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1091259
[16:01:42] <papaul>	 !log ongoing maintenance on cr1-eqiad
[16:01:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:01] <wikibugs>	 (03PS5) 10Arnaudb: dbtools: command line helper to evaluate a host, or a group of hosts [software] - 10https://gerrit.wikimedia.org/r/1091250 (https://phabricator.wikimedia.org/T378715)
[16:02:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[16:03:57] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 151575
[16:04:10] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:04:41] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 151575
[16:07:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[16:10:55] <jinxer-wm>	 FIRING: [4x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs1017 with peer 208.80.154.196 are failing #page - https://wikitech.wikimedia.org/wiki/PyBal#Alerts  - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[16:11:02] <sukhe>	 hmm 
[16:11:05] <akosiaris>	 ?
[16:11:05] <vgutierrez>	 uh'
[16:11:09] <akosiaris>	 !incidents
[16:11:10] <sirenbot>	 5449 (UNACKED)  [4x] PyBalBGPUnstable lvs sre (pybal 64600 208.80.154.196 eqiad)
[16:11:10] <sukhe>	 eqiad is depooled
[16:11:10] <sirenbot>	 5447 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr1-eqiad.wikimedia.org)
[16:11:10] <sirenbot>	 5446 (RESOLVED)  DDoSDetected sre (netflow5002:9100 eqsin)
[16:11:10] <Amir1>	 ?
[16:11:10] <sirenbot>	 5445 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[16:11:10] <sirenbot>	 5440 (RESOLVED)  [2x] ProbeDown sre (wikikube-ctrl2002:6443 probes/custom codfw)
[16:11:17] <sukhe>	 !ack 5449
[16:11:18] <sirenbot>	 5449 (ACKED)  [4x] PyBalBGPUnstable lvs sre (pybal 64600 208.80.154.196 eqiad)
[16:11:20] <Amir1>	 ah okay
[16:11:23] <vgutierrez>	 ah ok
[16:11:26] <sukhe>	 so this alert works, nice :)
[16:11:39] <sukhe>	 downtiming
[16:11:43] <akosiaris>	 cr1?
[16:11:44] <Amir1>	 !ack 5449
[16:11:45] <sirenbot>	 5449 (ACKED)  [4x] PyBalBGPUnstable lvs sre (pybal 64600 208.80.154.196 eqiad)
[16:11:54] <akosiaris>	 ah, ok.
[16:12:27] <sukhe>	 silenced for all eqiad
[16:13:35] <wikibugs>	 (03CR) 10Jbond: [C:04-1] "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott)
[16:15:27] <wikibugs>	 (03PS4) 10Volans: mysql: remove unused module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087855
[16:15:28] <wikibugs>	 (03PS4) 10Volans: mysql_legacy: rename to mysql [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087856
[16:15:28] <wikibugs>	 (03PS1) 10Volans: mysql: make fetch_one_row return always a dict [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091278
[16:16:07] <wikibugs>	 (03CR) 10Volans: "mypy failure in CI will be fixed by  I2d5bc3e26c537acc14e282d9ad23c271c2dba5cd but doesn't change the behaviour of the cookbook so it can " [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans)
[16:18:04] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1305.eqiad.wmnet with OS bullseye
[16:19:13] <wikibugs>	 (03CR) 10Jbond: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1064114 (owner: 10Cwhite)
[16:19:26] <wikibugs>	 (03CR) 10Jbond: [C:03+1] openssh: Remove code to disable NIST key exchange [puppet] - 10https://gerrit.wikimedia.org/r/1074381 (owner: 10Muehlenhoff)
[16:23:41] <wikibugs>	 (03CR) 10Jbond: "adding simon as they seem to have picked up the next CR in the chain" [puppet] - 10https://gerrit.wikimedia.org/r/978049 (owner: 10Jbond)
[16:29:26] <wikibugs>	 (03CR) 10Jbond: "good idea, the `systemd::sysuser` has an `$additional_groups` param which should DTRT.  Will need to be updated in `profile::puppetserver:" [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond)
[16:31:14] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[16:31:31] <wikibugs>	 (03CR) 10Jbond: "This can probably be removed from the chain and either abandoned or considered separately" [puppet] - 10https://gerrit.wikimedia.org/r/978049 (owner: 10Jbond)
[16:31:49] <icinga-wm>	 PROBLEM - Host db1190 #page is DOWN: PING CRITICAL - Packet loss = 100%
[16:31:52] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[16:31:59] <sukhe>	 !incidents
[16:31:59] <sirenbot>	 5449 (ACKED)  [4x] PyBalBGPUnstable lvs sre (pybal 64600 208.80.154.196 eqiad)
[16:31:59] <sirenbot>	 5451 (UNACKED)  Host db1190 (paged) - PING  - Packet loss = 100%
[16:32:00] <sirenbot>	 5447 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr1-eqiad.wikimedia.org)
[16:32:00] <sirenbot>	 5446 (RESOLVED)  DDoSDetected sre (netflow5002:9100 eqsin)
[16:32:00] <sirenbot>	 5445 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[16:32:00] <sirenbot>	 5440 (RESOLVED)  [2x] ProbeDown sre (wikikube-ctrl2002:6443 probes/custom codfw)
[16:32:05] <sukhe>	 !ack 5451
[16:32:05] <sirenbot>	 5451 (ACKED)  Host db1190 (paged) - PING  - Packet loss = 100%
[16:32:06] <icinga-wm>	 PROBLEM - Host ms-fe1012 is DOWN: PING CRITICAL - Packet loss = 100%
[16:32:17] <Amir1>	 I can take a look
[16:32:21] <sukhe>	 turning out to be a nice day
[16:32:22] <akosiaris>	 thanks
[16:32:23] <sukhe>	 Amir1: thanks <3
[16:32:26] <icinga-wm>	 PROBLEM - Host dbproxy1026 is DOWN: PING CRITICAL - Packet loss = 100%
[16:32:28] <icinga-wm>	 PROBLEM - Host kubernetes1059 is DOWN: PING CRITICAL - Packet loss = 100%
[16:32:28] <icinga-wm>	 PROBLEM - Host ml-cache1001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:32:29] <cdanis>	 yeah
[16:32:30] <icinga-wm>	 PROBLEM - Host cephosd1001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:32:30] <icinga-wm>	 PROBLEM - Host dse-k8s-worker1005 is DOWN: PING CRITICAL - Packet loss = 100%
[16:32:31] <akosiaris>	 ok, what is going on?
[16:32:32] <icinga-wm>	 PROBLEM - Host dumpsdata1006 is DOWN: PING CRITICAL - Packet loss = 100%
[16:32:34] <icinga-wm>	 PROBLEM - Host ms-be1068 is DOWN: PING CRITICAL - Packet loss = 100%
[16:32:34] <cdanis>	 akosiaris: this is the network
[16:32:36] <akosiaris>	 ah
[16:32:37] <sukhe>	 yeah
[16:32:39] <cdanis>	 --> #-sre
[16:32:40] <icinga-wm>	 PROBLEM - Host lvs1013 is DOWN: PING CRITICAL - Packet loss = 100%
[16:32:42] <icinga-wm>	 PROBLEM - Host elastic1090 is DOWN: PING CRITICAL - Packet loss = 100%
[16:32:42] <icinga-wm>	 PROBLEM - Host elastic1104 is DOWN: PING CRITICAL - Packet loss = 100%
[16:32:43] <cdanis>	 it's gonna be flooded in here
[16:32:49] <akosiaris>	 thanks, I got worried for a sec, just got out of meeting
[16:32:52] <icinga-wm>	 PROBLEM - Host elastic1089 is DOWN: PING CRITICAL - Packet loss = 100%
[16:32:52] <icinga-wm>	 PROBLEM - Host logstash1036 is DOWN: PING CRITICAL - Packet loss = 100%
[16:32:54] <icinga-wm>	 PROBLEM - Router interfaces on pfw1-eqiad is CRITICAL: CRITICAL: host 208.80.154.219, interfaces up: 58, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:33:02] <icinga-wm>	 PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:33:06] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 4, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:33:10] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:33:11] <sukhe>	 what should have been downtimed here I wonder
[16:33:12] <sukhe>	 or not
[16:33:12] <icinga-wm>	 PROBLEM - Host kafka-jumbo1010 is DOWN: PING CRITICAL - Packet loss = 100%
[16:33:16] <icinga-wm>	 PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:33:16] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:33:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1190 sad', diff saved to https://phabricator.wikimedia.org/P71044 and previous config saved to /var/cache/conftool/dbconfig/20241114-163317-ladsgroup.json
[16:33:28] <icinga-wm>	 PROBLEM - BGP status on ssw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:33:31] <icinga-wm>	 PROBLEM - BGP status on pfw1-eqiad is CRITICAL: BGP CRITICAL - AS64701/IPv4: Idle - frack-codfw, AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:33:38] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1190.eqiad.wmnet with reason: Sad
[16:33:44] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:33:50] <icinga-wm>	 PROBLEM - Host ssw1-e1-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[16:33:50] <icinga-wm>	 PROBLEM - Host ssw1-e1-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[16:33:52] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1190.eqiad.wmnet with reason: Sad
[16:34:06] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:34:08] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:34:16] <icinga-wm>	 PROBLEM - OSPF status on mr1-eqiad is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:34:20] <icinga-wm>	 RECOVERY - Host cephosd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[16:34:20] <icinga-wm>	 RECOVERY - Host elastic1089 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[16:34:21] <icinga-wm>	 RECOVERY - Host db1190 #page is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[16:34:22] <icinga-wm>	 RECOVERY - Host ms-be1068 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[16:34:34] <icinga-wm>	 RECOVERY - Host dbproxy1026 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[16:34:34] <icinga-wm>	 RECOVERY - Host elastic1104 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[16:34:38] <icinga-wm>	 RECOVERY - Host dumpsdata1006 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[16:34:38] <icinga-wm>	 RECOVERY - Host elastic1090 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[16:34:38] <icinga-wm>	 RECOVERY - Host logstash1036 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[16:34:38] <icinga-wm>	 RECOVERY - Host kafka-jumbo1010 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[16:34:40] <icinga-wm>	 RECOVERY - Host ml-cache1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[16:34:42] <icinga-wm>	 RECOVERY - Host ms-fe1012 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[16:34:44] <icinga-wm>	 RECOVERY - Host dse-k8s-worker1005 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[16:34:46] <icinga-wm>	 RECOVERY - Host kubernetes1059 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[16:35:10] <icinga-wm>	 RECOVERY - Host lvs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[16:35:12] <wikibugs>	 (PS1) Klausman: ml-staging/experimental: bump max container/pod size to 75G/80G [deployment-charts] - https://gerrit.wikimedia.org/r/1091289
[16:36:07] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.discovery.datacenter
[16:36:10] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:36:12] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0)
[16:36:16] <icinga-wm>	 RECOVERY - BFD status on cr2-drmrs is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:36:30] <icinga-wm>	 RECOVERY - BGP status on pfw1-eqiad is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:36:35] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1305.eqiad.wmnet with reason: host reimage
[16:36:42] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[16:36:46] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:36:54] <icinga-wm>	 RECOVERY - Router interfaces on pfw1-eqiad is OK: OK: host 208.80.154.219, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:36:57] <wikibugs>	 (CR) Klausman: [C:+1] Add ml-lab Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1091259 (owner: Muehlenhoff)
[16:37:02] <icinga-wm>	 RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:37:03] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:37:08] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:37:16] <icinga-wm>	 RECOVERY - OSPF status on mr1-eqiad is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:37:59] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.discovery.datacenter
[16:38:02] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0)
[16:38:44] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] ml-staging/experimental: bump max container/pod size to 75G/80G [deployment-charts] - https://gerrit.wikimedia.org/r/1091289 (owner: Klausman)
[16:38:52] <icinga-wm>	 RECOVERY - Host ssw1-e1-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.97 ms
[16:38:52] <icinga-wm>	 RECOVERY - Host ssw1-e1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 4.12 ms
[16:39:06] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:39:16] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:40:01] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1305.eqiad.wmnet with reason: host reimage
[16:45:28] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.discovery.datacenter status all services in all: None - None
[16:45:48] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None
[16:48:11] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [alerts] - https://gerrit.wikimedia.org/r/1090976 (https://phabricator.wikimedia.org/T379807) (owner: Cwhite)
[16:51:45] <logmsgbot>	 !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@7c4873e]: decouple article-level image suggestions from section-level ones
[16:52:17] <logmsgbot>	 !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@7c4873e]: decouple article-level image suggestions from section-level ones (duration: 00m 53s)
[16:57:24] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.discovery.datacenter depool all active/active services in eqiad: Network maintenance - None
[16:59:17] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1305.eqiad.wmnet with OS bullseye
[17:00:05] <jouncebot>	 jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1700).
[17:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:02:06] <wikibugs>	 (CR) Jbond: [C:-1] resolvconf: don't update resolv.conf with 0 nameservers (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: Andrew Bogott)
[17:07:07] <wikibugs>	 (CR) Bking: "The gitlab trusted runners will need to POST to this service...I was thinking that we needed an ingress config for that, but if that's not" [puppet] - https://gerrit.wikimedia.org/r/1090977 (https://phabricator.wikimedia.org/T365659) (owner: Bking)
[17:09:40] <icinga-wm>	 RECOVERY - BGP status on ssw1-e1-eqiad.mgmt is OK: BGP OK - up: 15, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:10:59] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1306.eqiad.wmnet with OS bullseye
[17:13:11] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:13:11] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:13:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:13:40] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2139.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:14:08] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance backend on cp4043 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[17:15:06] <wikibugs>	 SRE-swift-storage, Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942 (Ladsgroup) NEW
[17:15:49] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=4043.ulsfo.wmnet
[17:18:30] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1307.eqiad.wmnet with OS bullseye
[17:18:53] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all active/active services in eqiad: Network maintenance - None
[17:18:55] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool db1190 gradually with 4 steps - Maint over
[17:18:59] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1308.eqiad.wmnet with OS bullseye
[17:21:17] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1309.eqiad.wmnet with OS bullseye
[17:24:32] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.discovery.datacenter status all services in all: None - None
[17:24:45] <wikibugs>	 (CR) Bking: "Per IRC conversation with @cdanis@wikimedia.org, it does seem that this patch is necessary." [puppet] - https://gerrit.wikimedia.org/r/1090977 (https://phabricator.wikimedia.org/T365659) (owner: Bking)
[17:24:51] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None
[17:25:03] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1310.eqiad.wmnet with OS bullseye
[17:25:47] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1311.eqiad.wmnet with OS bullseye
[17:26:23] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1312.eqiad.wmnet with OS bullseye
[17:27:19] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2139.codfw.wmnet with OS bookworm
[17:27:34] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops, Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10323460 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2139.codfw.wmnet with O...
[17:27:48] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10323409 (Ladsgroup) Noting that we are starting to slowly drop all thumbnails in swift as a one-off clean up which would make the change in size of thu...
[17:29:10] <wikibugs>	 (PS3) Andrew Bogott: resolvconf: don't update resolv.conf with 0 nameservers [puppet] - https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927)
[17:29:14] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1306.eqiad.wmnet with reason: host reimage
[17:29:34] <wikibugs>	 (PS1) DCausse: rdf-streaming-updater: bump to 0.3.150 [deployment-charts] - https://gerrit.wikimedia.org/r/1091306 (https://phabricator.wikimedia.org/T374919)
[17:29:48] <wikibugs>	 (CR) CI reject: [V:-1] resolvconf: don't update resolv.conf with 0 nameservers [puppet] - https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: Andrew Bogott)
[17:30:13] <wikibugs>	 (PS2) DCausse: rdf-streaming-updater: bump to 0.3.150 [deployment-charts] - https://gerrit.wikimedia.org/r/1091306 (https://phabricator.wikimedia.org/T374919)
[17:30:33] <wikibugs>	 (CR) DCausse: [C:-1] "needs Ife016662f5fde835c21457ef457b567d9be61d2a to be fully deployed everywhere" [deployment-charts] - https://gerrit.wikimedia.org/r/1091306 (https://phabricator.wikimedia.org/T374919) (owner: DCausse)
[17:31:18] <wikibugs>	 (PS4) Andrew Bogott: resolvconf: don't update resolv.conf with 0 nameservers [puppet] - https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927)
[17:31:55] <wikibugs>	 (CR) CI reject: [V:-1] resolvconf: don't update resolv.conf with 0 nameservers [puppet] - https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: Andrew Bogott)
[17:32:56] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1306.eqiad.wmnet with reason: host reimage
[17:33:00] <wikibugs>	 (PS5) Andrew Bogott: resolvconf: don't update resolv.conf with 0 nameservers [puppet] - https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927)
[17:35:08] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: Andrew Bogott)
[17:37:00] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1307.eqiad.wmnet with reason: host reimage
[17:37:01] <icinga-wm>	 PROBLEM - Host mr1-esams.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[17:37:28] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1308.eqiad.wmnet with reason: host reimage
[17:39:38] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1307.eqiad.wmnet with reason: host reimage
[17:39:48] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1309.eqiad.wmnet with reason: host reimage
[17:42:38] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1309.eqiad.wmnet with reason: host reimage
[17:43:32] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1310.eqiad.wmnet with reason: host reimage
[17:44:21] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1311.eqiad.wmnet with reason: host reimage
[17:45:16] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2139.codfw.wmnet with reason: host reimage
[17:45:27] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1312.eqiad.wmnet with reason: host reimage
[17:46:13] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1308.eqiad.wmnet with reason: host reimage
[17:47:25] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10323538 (Papaul) Open→Resolved This is done, re0 is now the master. Closing this task ` re0.cr1-eqiad> show chassis routing-engine  Routing Engine statu...
[17:48:06] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10323542 (Papaul)
[17:48:13] <wikibugs>	 (CR) Bking: wdqs: remove 5 codfw hosts from production (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1088185 (https://phabricator.wikimedia.org/T376150) (owner: Ryan Kemper)
[17:48:47] <wikibugs>	 (CR) Andrew Bogott: resolvconf: don't update resolv.conf with 0 nameservers (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: Andrew Bogott)
[17:49:42] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1311.eqiad.wmnet with reason: host reimage
[17:50:14] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2024.11.09 - 2024.11.29): an-presto1018.eqiad.wmnet: DRAC is down - https://phabricator.wikimedia.org/T378854#10323546 (Volans) Did you go through https://wikitech.wikimedia.org/wiki/Management_Interfaces#Troubleshooting_Commands ?
[17:52:20] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1306.eqiad.wmnet with OS bullseye
[17:53:24] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2139.codfw.wmnet with reason: host reimage
[17:57:05] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1310.eqiad.wmnet with reason: host reimage
[17:59:19] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1307.eqiad.wmnet with OS bullseye
[18:00:05] <jouncebot>	 bd808: Time to do the Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1800).
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1800)
[18:00:43] <bd808>	 nothing for me to deploy today.
[18:01:02] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1312.eqiad.wmnet with reason: host reimage
[18:01:31] <wikibugs>	 (PS3) Bking: wdqs: remove 3 codfw hosts from production [puppet] - https://gerrit.wikimedia.org/r/1088185 (https://phabricator.wikimedia.org/T376150) (owner: Ryan Kemper)
[18:02:16] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1309.eqiad.wmnet with OS bullseye
[18:03:21] <wikibugs>	 (PS1) Scott French: sre.discovery.datacenter: fix eligible actions in _get_all_services [cookbooks] - https://gerrit.wikimedia.org/r/1091314 (https://phabricator.wikimedia.org/T335364)
[18:04:19] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1190 gradually with 4 steps - Maint over
[18:04:28] <wikibugs>	 (PS1) Scott French: sre.switchdc.mediawiki: add mwdebug-next to MEDIAWIKI_SERVICES [cookbooks] - https://gerrit.wikimedia.org/r/1078736 (https://phabricator.wikimedia.org/T372604)
[18:05:56] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1308.eqiad.wmnet with OS bullseye
[18:06:12] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] sre.switchdc.mediawiki: add mwdebug-next to MEDIAWIKI_SERVICES [cookbooks] - https://gerrit.wikimedia.org/r/1078736 (https://phabricator.wikimedia.org/T372604) (owner: Scott French)
[18:07:58] <wikibugs>	 (CR) Bking: [C:+2] dse-k8s-services: add CNAME for blunderbuss (nee hdfs-synchronizer) [dns] - https://gerrit.wikimedia.org/r/1090972 (https://phabricator.wikimedia.org/T365659) (owner: Bking)
[18:08:02] <wikibugs>	 (PS2) Bking: dse-k8s-services: add CNAME for blunderbuss (nee hdfs-synchronizer) [dns] - https://gerrit.wikimedia.org/r/1090972 (https://phabricator.wikimedia.org/T365659)
[18:08:14] <wikibugs>	 (CR) Bking: [V:+2 C:+2] dse-k8s-services: add CNAME for blunderbuss (nee hdfs-synchronizer) [dns] - https://gerrit.wikimedia.org/r/1090972 (https://phabricator.wikimedia.org/T365659) (owner: Bking)
[18:08:20] <wikibugs>	 (CR) Scott French: [C:+2] sre.switchdc.mediawiki: add mwdebug-next to MEDIAWIKI_SERVICES [cookbooks] - https://gerrit.wikimedia.org/r/1078736 (https://phabricator.wikimedia.org/T372604) (owner: Scott French)
[18:09:24] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1311.eqiad.wmnet with OS bullseye
[18:11:18] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[18:13:04] <icinga-wm>	 RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4043 is OK: HTTP OK: HTTP/1.1 200 OK - 48046 bytes in 0.827 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[18:13:33] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on cp4043.ulsfo.wmnet with reason: depooled, debugging
[18:13:36] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cp4043.ulsfo.wmnet with reason: depooled, debugging
[18:13:43] <wikibugs>	 (CR) Clément Goubert: [C:+1] sre.discovery.datacenter: fix eligible actions in _get_all_services [cookbooks] - https://gerrit.wikimedia.org/r/1091314 (https://phabricator.wikimedia.org/T335364) (owner: Scott French)
[18:15:13] <wikibugs>	 (Merged) jenkins-bot: sre.switchdc.mediawiki: add mwdebug-next to MEDIAWIKI_SERVICES [cookbooks] - https://gerrit.wikimedia.org/r/1078736 (https://phabricator.wikimedia.org/T372604) (owner: Scott French)
[18:16:18] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1310.eqiad.wmnet with OS bullseye
[18:18:59] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.discovery.datacenter
[18:19:18] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0)
[18:20:37] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1312.eqiad.wmnet with OS bullseye
[18:20:57] <wikibugs>	 (CR) Scott French: [C:+2] sre.discovery.datacenter: fix eligible actions in _get_all_services [cookbooks] - https://gerrit.wikimedia.org/r/1091314 (https://phabricator.wikimedia.org/T335364) (owner: Scott French)
[18:22:29] <wikibugs>	 (PS1) Andrew Bogott: prometheus-openstack-exporter: try to re-enable placement metrics [puppet] - https://gerrit.wikimedia.org/r/1091319
[18:23:43] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1091319 (owner: Andrew Bogott)
[18:27:20] <wikibugs>	 (Merged) jenkins-bot: sre.discovery.datacenter: fix eligible actions in _get_all_services [cookbooks] - https://gerrit.wikimedia.org/r/1091314 (https://phabricator.wikimedia.org/T335364) (owner: Scott French)
[18:28:24] <wikibugs>	 (CR) Andrew Bogott: [C:+2] prometheus-openstack-exporter: try to re-enable placement metrics [puppet] - https://gerrit.wikimedia.org/r/1091319 (owner: Andrew Bogott)
[18:34:28] <wikibugs>	 (PS1) Jforrester: build: Upgrade mediawiki/mediawiki-codesniffer from v43.0.0 to v45.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1091320 (https://phabricator.wikimedia.org/T379955)
[18:47:26] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 216, down: 4, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:47:32] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:47:34] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:47:38] <icinga-wm>	 PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:47:38] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:47:44] <icinga-wm>	 PROBLEM - BFD status on cr1-magru is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:47:46] <icinga-wm>	 PROBLEM - BGP status on ssw1-f1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:47:48] <icinga-wm>	 PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:47:52] <icinga-wm>	 PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:47:52] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:47:54] <icinga-wm>	 PROBLEM - OSPF status on mr1-eqiad is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:47:58] <icinga-wm>	 PROBLEM - BGP status on pfw1-eqiad is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:48:16] <icinga-wm>	 PROBLEM - Router interfaces on pfw1-eqiad is CRITICAL: CRITICAL: host 208.80.154.219, interfaces up: 58, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:49:21] <sukhe>	 !next
[18:49:35] <sukhe>	 jouncebot: now
[18:49:35] <jouncebot>	 For the next 0 hour(s) and 10 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1800)
[18:49:35] <jouncebot>	 For the next 0 hour(s) and 10 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1800)
[18:50:48] <icinga-wm>	 RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:50:52] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:50:58] <icinga-wm>	 RECOVERY - BGP status on pfw1-eqiad is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:51:16] <icinga-wm>	 RECOVERY - Router interfaces on pfw1-eqiad is OK: OK: host 208.80.154.219, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:51:34] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:51:40] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:51:52] <icinga-wm>	 RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:51:54] <icinga-wm>	 RECOVERY - OSPF status on mr1-eqiad is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:54:36] <wikibugs>	 (03CR) 10Btullis: "Could you link to that conversation, please?" [puppet] - 10https://gerrit.wikimedia.org/r/1090977 (https://phabricator.wikimedia.org/T365659) (owner: 10Bking)
[18:56:42] <wikibugs>	 (03PS15) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529
[18:56:42] <wikibugs>	 (03PS1) 10Ebernhardson: opensearch: Introduce resource for keystore values [puppet] - 10https://gerrit.wikimedia.org/r/1091325
[18:56:42] <wikibugs>	 (03PS1) 10Ebernhardson: opensearch: Add resource to define cross-cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1091326
[18:56:43] <wikibugs>	 (03PS1) 10Ebernhardson: opensearch: Add resource to log busy threads [puppet] - 10https://gerrit.wikimedia.org/r/1091327
[18:56:44] <wikibugs>	 (03CR) 10Btullis: "Oh right, so it's not actually using LVS is it?" [puppet] - 10https://gerrit.wikimedia.org/r/1090977 (https://phabricator.wikimedia.org/T365659) (owner: 10Bking)
[18:59:58] <wikibugs>	 (03PS1) 10Bvibber: Enabling shared globaljsonlinks table in x1 for JsonConfig/Charts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091328 (https://phabricator.wikimedia.org/T379689)
[19:00:05] <jouncebot>	 brennen and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7+Utc-0 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1900).
[19:00:26] <brennen>	 !log 1.44.0-wmf.3 train status (T375662): no current blockers, but holding for network maintenance.
[19:00:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:52] <stashbot>	 T375662: 1.44.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T375662
[19:01:14] <wikibugs>	 06SRE-OnFire, 10Incident Tooling: corto: binary doesn't include build information - https://phabricator.wikimedia.org/T379958 (10Eevans) 03NEW
[19:03:26] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:03:32] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:03:44] <icinga-wm>	 RECOVERY - BGP status on ssw1-f1-eqiad.mgmt is OK: BGP OK - up: 15, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:04:38] <icinga-wm>	 RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:05:46] <icinga-wm>	 RECOVERY - BFD status on cr1-magru is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:06:35] <wikibugs>	 06SRE-OnFire, 10Incident Tooling: corto: update production deployment for project changes - https://phabricator.wikimedia.org/T379204#10323950 (10Eevans) 05Open→03Resolved
[19:12:14] <wikibugs>	 (03CR) 10Ebernhardson: "I do wonder, there is nothing particularly opensearch specific here. This is really the same thing we used on elastic, but I was opting to" [puppet] - 10https://gerrit.wikimedia.org/r/1091327 (owner: 10Ebernhardson)
[19:12:52] <icinga-wm>	 PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:12:52] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:12:56] <icinga-wm>	 PROBLEM - OSPF status on mr1-eqiad is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:12:58] <icinga-wm>	 PROBLEM - BGP status on pfw1-eqiad is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:13:00] <sukhe>	 brennen: sorry for the delay. we ran into some issues, so it was unexpected that it would take this long
[19:13:13] <sukhe>	 but there is definitely value in waiting since eqiad is depooled for edge traffic and services
[19:13:18] <icinga-wm>	 PROBLEM - Router interfaces on pfw1-eqiad is CRITICAL: CRITICAL: host 208.80.154.219, interfaces up: 58, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:13:26] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 216, down: 4, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:13:32] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:13:34] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:13:40] <icinga-wm>	 PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:13:42] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:13:44] <icinga-wm>	 PROBLEM - BFD status on cr1-magru is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:13:50] <icinga-wm>	 PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:14:12] <wikibugs>	 (03PS1) 10Ssingh: trafficserver: explicitly specify user/group for systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1091330
[19:14:24] <swfrench-wmf>	 !log running sre.discovery.datacenter status all to test deployed fix
[19:14:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:30] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.discovery.datacenter
[19:14:44] <icinga-wm>	 PROBLEM - BGP status on ssw1-f1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:14:50] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0)
[19:15:32] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4525/co" [puppet] - 10https://gerrit.wikimedia.org/r/1091330 (owner: 10Ssingh)
[19:15:40] <brennen>	 sukhe: no worries, we have a long window here on purpose.
[19:16:32] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:16:40] <icinga-wm>	 RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:16:44] <icinga-wm>	 RECOVERY - BGP status on ssw1-f1-eqiad.mgmt is OK: BGP OK - up: 15, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:16:45] <icinga-wm>	 RECOVERY - BFD status on cr1-magru is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:16:50] <icinga-wm>	 RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:16:52] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:16:56] <icinga-wm>	 RECOVERY - OSPF status on mr1-eqiad is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:16:58] <icinga-wm>	 RECOVERY - BGP status on pfw1-eqiad is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:17:12] <wikibugs>	 (03PS2) 10Ssingh: trafficserver: explicitly specify user/group for systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1091330
[19:17:18] <icinga-wm>	 RECOVERY - Router interfaces on pfw1-eqiad is OK: OK: host 208.80.154.219, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:17:26] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:17:34] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:17:44] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:17:52] <icinga-wm>	 RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:18:25] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4526/co" [puppet] - 10https://gerrit.wikimedia.org/r/1091330 (owner: 10Ssingh)
[19:18:34] <wikibugs>	 (03Abandoned) 10BCornwall: apt/varnish: Add/Pin varnish-staging component [puppet] - 10https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall)
[19:19:25] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-ntp (exit_code=0) rolling restart_daemons on A:dnsbox
[19:20:16] <James_F>	 !log Running `mwscript-k8s -f -- extensions/WikiLambda/maintenance/updateSecondaryTables.php --wiki=wikifunctionswiki --zType Z8 --report --verbose` for T375972, T367005, T373038, T358737
[19:20:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:32] <stashbot>	 T375972: in the object selector, functions that return a Typed list are not available when a Typed list is expected or required  - https://phabricator.wikimedia.org/T375972
[19:20:32] <stashbot>	 T367005: Map function should be correctly type-hinted that it returns a Typed list of Z1s - https://phabricator.wikimedia.org/T367005
[19:20:33] <stashbot>	 T373038: fetchZidsOfType only returns objects that have at least one label - https://phabricator.wikimedia.org/T373038
[19:20:34] <stashbot>	 T358737: Object selector cannot select unlabeled object by ZID - https://phabricator.wikimedia.org/T358737
[19:21:29] <wikibugs>	 (03Restored) 10BCornwall: apt/varnish: Add/Pin varnish-staging component [puppet] - 10https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall)
[19:21:38] <wikibugs>	 (03PS3) 10BCornwall: apt/varnish: Add varnish-staging component [puppet] - 10https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737)
[19:22:34] <wikibugs>	 (03CR) 10BCornwall: apt/varnish: Add varnish-staging component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall)
[19:25:40] <wikibugs>	 (03PS2) 10Ebernhardson: opensearch: Introduce resource for keystore values [puppet] - 10https://gerrit.wikimedia.org/r/1091325
[19:25:40] <wikibugs>	 (03PS2) 10Ebernhardson: opensearch: Add resource to define cross-cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1091326
[19:25:40] <wikibugs>	 (03PS2) 10Ebernhardson: opensearch: Add resource to log busy threads [puppet] - 10https://gerrit.wikimedia.org/r/1091327
[19:25:41] <wikibugs>	 (03PS16) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529
[19:26:37] <wikibugs>	 (03CR) 10CI reject: [V:04-1] opensearch: Add resource to log busy threads [puppet] - 10https://gerrit.wikimedia.org/r/1091327 (owner: 10Ebernhardson)
[19:31:51] <wikibugs>	 (03PS3) 10Ebernhardson: opensearch: Add resource to log busy threads [puppet] - 10https://gerrit.wikimedia.org/r/1091327
[19:31:51] <wikibugs>	 (03PS17) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529
[19:31:55] <jinxer-wm>	 RESOLVED: [4x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs1017 with peer 208.80.154.197 are failing #page - https://wikitech.wikimedia.org/wiki/PyBal#Alerts  - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[19:32:04] <sukhe>	 nice
[19:32:08] <cdanis>	 😌
[19:32:28] <sukhe>	 we were so split on making this paging. but no regrets
[19:32:53] <cdanis>	 what does that alert check?
[19:33:09] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence-SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10324063 (10Jclark-ctr) @ABran-WMF  Dell is requesting SOS report and TSR report from this server and another. can you assist?
[19:33:13] <cdanis>	 it also needs a runbook entry or at least a mention in the wikitech page it links ;)
[19:33:45] <sukhe>	 will just link to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/88526c0114c520878c9c6801ce1ba431b1d3bddf but yes, good idea, will add
[19:33:47] <cdanis>	 ah neat
[19:33:48] <sukhe>	 pybal_bgp_session_established != 1 and ignoring (local_asn, peer) pybal_bgp_enabled == 1
[19:37:25] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqiad [reason: junos upgrade done, T364092]
[19:37:28] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqiad [reason: junos upgrade done, T364092]
[19:37:29] <stashbot>	 T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092
[19:39:39] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10324084 (10Papaul)
[19:45:51] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091337 (https://phabricator.wikimedia.org/T375662)
[19:45:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091337 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot)
[19:46:37] <wikibugs>	 (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091337 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot)
[19:51:28] <wikibugs>	 (03PS4) 10Ryan Kemper: wdqs: remove 3 codfw hosts from production [puppet] - 10https://gerrit.wikimedia.org/r/1088185 (https://phabricator.wikimedia.org/T376150)
[19:54:10] <wikibugs>	 (03PS5) 10Ryan Kemper: wdqs: remove 3 codfw hosts from production [puppet] - 10https://gerrit.wikimedia.org/r/1088185 (https://phabricator.wikimedia.org/T376150)
[19:54:10] <wikibugs>	 (03PS4) 10Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329)
[19:54:10] <wikibugs>	 (03PS2) 10Ryan Kemper: wdqs: new pybal pools for internal graph split [puppet] - 10https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330)
[19:54:30] <wikibugs>	 (03CR) 10Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper)
[19:55:30] <wikibugs>	 (03CR) 10Ryan Kemper: wdqs: remove 3 codfw hosts from production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088185 (https://phabricator.wikimedia.org/T376150) (owner: 10Ryan Kemper)
[19:55:47] <logmsgbot>	 !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.3  refs T375662
[19:55:51] <stashbot>	 T375662: 1.44.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T375662
[19:59:36] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1091340 (https://phabricator.wikimedia.org/T379333)
[20:01:40] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: Network maintenance complete - None
[20:17:24] <wikibugs>	 (03PS1) 10Bvibber: Update charts-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091341 (https://phabricator.wikimedia.org/T375235)
[20:18:04] <wikibugs>	 (03CR) 10CDanis: [C:03+1] Update charts-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091341 (https://phabricator.wikimedia.org/T375235) (owner: 10Bvibber)
[20:18:42] <bvibber>	 going to deploy a slight update to chart-renderer :D
[20:20:12] <bvibber>	 hm, i don't have +2 in that repo :D
[20:21:00] <cdanis>	 ah weird
[20:21:01] <cdanis>	 I'll +2 it 
[20:21:05] <bvibber>	 tx
[20:21:09] <wikibugs>	 (03CR) 10CDanis: [C:03+2] Update charts-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091341 (https://phabricator.wikimedia.org/T375235) (owner: 10Bvibber)
[20:21:46] <cdanis>	 `wmf-deployment` and `mediawiki-services` ldap groups have Submit there
[20:22:12] <wikibugs>	 (03Merged) 10jenkins-bot: Update charts-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091341 (https://phabricator.wikimedia.org/T375235) (owner: 10Bvibber)
[20:23:01] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in eqiad: Network maintenance complete - None
[20:23:20] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[20:23:32] <logmsgbot>	 !log bvibber@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply
[20:23:35] <logmsgbot>	 !log bvibber@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply
[20:24:07] <logmsgbot>	 !log bvibber@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply
[20:24:10] <logmsgbot>	 !log bvibber@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply
[20:24:14] <logmsgbot>	 !log bvibber@deploy2002 helmfile [codfw] START helmfile.d/services/chart-renderer: apply
[20:24:16] <logmsgbot>	 !log bvibber@deploy2002 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply
[20:24:26] <bvibber>	 well let's try er out
[20:26:38] <bvibber>	 still renders charts at least :D
[20:26:44] <cdanis>	 uh
[20:26:48] <cdanis>	 I think something didn't work, one moment
[20:27:37] <bvibber>	 ok
[20:28:55] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.discovery.datacenter
[20:29:14] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0)
[20:31:49] <wikibugs>	 (03PS1) 10CDanis: chart-renderer: use 'app' instead of old 'main_app' [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091348 (https://phabricator.wikimedia.org/T375235)
[20:32:18] <cdanis>	 so, I didn't actually look at the CI output from your patch at the time, bvibber, but if I had, I would have noticed it had zero effect 😅
[20:32:25] <wikibugs>	 (03PS2) 10CDanis: chart-renderer: use 'app' instead of old 'main_app' [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091348 (https://phabricator.wikimedia.org/T375235)
[20:33:28] <bvibber>	 aha
[20:33:36] <cdanis>	 like, for instance, the diffs shown on that patch at https://integration.wikimedia.org/ci/job/helm-lint/21551/console
[20:33:38] <cdanis>	 😅
[20:33:52] <bvibber>	 lolol
[20:33:54] <wikibugs>	 (03CR) 10CDanis: [C:03+2] chart-renderer: use 'app' instead of old 'main_app' [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091348 (https://phabricator.wikimedia.org/T375235) (owner: 10CDanis)
[20:34:24] <cdanis>	 I don't know how the whole world wound up with "we'll templatize yaml" as being the way to drive k8s, but here we are
[20:34:45] <cdanis>	 bvibber: okay try your deploy again, and it should give you some diffs to look at in helmfile this time too :D
[20:34:56] <bvibber>	 :)
[20:34:56] <bvibber>	 ok
[20:34:57] <wikibugs>	 (03Merged) 10jenkins-bot: chart-renderer: use 'app' instead of old 'main_app' [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091348 (https://phabricator.wikimedia.org/T375235) (owner: 10CDanis)
[20:35:11] <logmsgbot>	 !log bvibber@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply
[20:35:18] <bvibber>	 ahh that looks better
[20:35:24] <cdanis>	 great
[20:35:53] <logmsgbot>	 !log bvibber@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply
[20:36:44] <logmsgbot>	 !log bvibber@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply
[20:37:20] <logmsgbot>	 !log bvibber@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply
[20:37:34] <logmsgbot>	 !log bvibber@deploy2002 helmfile [codfw] START helmfile.d/services/chart-renderer: apply
[20:38:05] <logmsgbot>	 !log bvibber@deploy2002 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply
[20:40:01] <bvibber>	 https://test.wikipedia.org/wiki/Charts we have titles rendering :D
[20:40:06] <bvibber>	 cdanis: i think it worked :D
[20:40:29] <bvibber>	 thanks for walking me through the confusing bits :D <3
[20:43:19] <cdanis>	 no worries!
[20:43:22] <wikibugs>	 (03PS1) 10Herron: aux_k8s: enable new eqiad workers [puppet] - 10https://gerrit.wikimedia.org/r/1091349 (https://phabricator.wikimedia.org/T378989)
[20:47:05] <wikibugs>	 (03CR) 10Herron: [V:03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1091349 (https://phabricator.wikimedia.org/T378989) (owner: 10Herron)
[20:47:31] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:47:32] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2139.codfw.wmnet with OS bookworm
[20:47:49] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10324342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2139.codfw.wmnet with OS bo...
[20:50:07] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082853 (https://phabricator.wikimedia.org/T376923) (owner: 10Pppery)
[20:51:31] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10324347 (10Jhancock.wm)
[20:53:29] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10324366 (10RobH)
[20:55:39] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10324348 (10Jhancock.wm) 05Open→03Resolved @Clement_Goubert This one's complete. took me a minute to get that last one to behave.
[20:56:25] <wikibugs>	 (03PS3) 10Herron: role::aux_k8s::worker: add role to 2 new eqiad workers [puppet] - 10https://gerrit.wikimedia.org/r/1088610 (https://phabricator.wikimedia.org/T378989)
[20:56:25] <wikibugs>	 (03CR) 10Herron: [V:03+1] "following along with https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Add_or_remove_nodes mostly" [puppet] - 10https://gerrit.wikimedia.org/r/1088610 (https://phabricator.wikimedia.org/T378989) (owner: 10Herron)
[20:58:00] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10324353 (10RobH)
[21:00:06] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T2100). Please do the needful.
[21:00:06] <jouncebot>	 Pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:20] <Pppery>	 here
[21:01:02] <cjming>	 Pppery: i can deploy - unless you are able and want to self-deploy?
[21:01:13] <Pppery>	 no, i'm a volunteer with no access to anything
[21:01:23] <cjming>	 gotcha - here we go then - 1 sec
[21:01:46] <Pppery>	 You're not the first person to think I have more technical abilities than I do
[21:01:57] <brennen>	 thanks cjming, just realized what time it was.
[21:01:58] <wikibugs>	 (03PS6) 10Pppery: Redirect to wikis using subpages rather than namespaces too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082853 (https://phabricator.wikimedia.org/T376923)
[21:02:34] <cjming>	 np!
[21:02:55] <TheresNoTime>	 Pppery: can always fix that! :D
[21:03:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082853 (https://phabricator.wikimedia.org/T376923) (owner: 10Pppery)
[21:03:42] <wikibugs>	 (03Merged) 10jenkins-bot: Redirect to wikis using subpages rather than namespaces too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082853 (https://phabricator.wikimedia.org/T376923) (owner: 10Pppery)
[21:04:01] <logmsgbot>	 !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1082853|Redirect to wikis using subpages rather than namespaces too (T376923)]]
[21:04:07] <stashbot>	 T376923: Setup missing.php layer redirects for wikipedia hosting the other projects too - https://phabricator.wikimedia.org/T376923
[21:05:59] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] apt/varnish: Add varnish-staging component [puppet] - 10https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall)
[21:06:55] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896#10324415 (10Jhancock.wm)
[21:07:52] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10324416 (10Jhancock.wm)
[21:07:59] <logmsgbot>	 !log cjming@deploy2002 cjming, pppery: Backport for [[gerrit:1082853|Redirect to wikis using subpages rather than namespaces too (T376923)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:08:03] <Pppery>	 testing
[21:08:03] <cjming>	 Pppery: on mwdebug if testable
[21:08:08] <Pppery>	 testing now
[21:09:04] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896#10324422 (10Jhancock.wm) a:03Jhancock.wm
[21:09:10] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10324423 (10Jhancock.wm) a:05ABran-WMF→03Jhancock.wm
[21:10:45] <Pppery>	 Looks good
[21:12:22] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10324429 (10Jhancock.wm) a:03Jhancock.wm
[21:12:57] <cjming>	 cool - syncing
[21:13:00] <logmsgbot>	 !log cjming@deploy2002 cjming, pppery: Continuing with sync
[21:13:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:17:46] <logmsgbot>	 !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082853|Redirect to wikis using subpages rather than namespaces too (T376923)]] (duration: 13m 44s)
[21:18:00] <stashbot>	 T376923: Setup missing.php layer redirects for wikipedia hosting the other projects too - https://phabricator.wikimedia.org/T376923
[21:18:00] <cjming>	 Pppery: should be live!
[21:18:17] <Pppery>	 Thanks
[21:18:21] <cjming>	 yw!
[21:19:06] <cjming>	 i gotta run - so i'll err on closing the window for now
[21:20:50] <cjming>	 !log end of UTC late backport window
[21:20:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:41] <wikibugs>	 (03CR) 10Pppery: Add 'rup' as alias for 'roa-rup' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/527917 (https://phabricator.wikimedia.org/T17988) (owner: 10Fomafix)
[21:23:01] <wikibugs>	 06SRE, 06Traffic-Icebox, 10Wikimedia-Apache-configuration, 10Wiki-Setup (Delete / Redirect): redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648#10324468 (10Pppery)
[21:24:21] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] apt/varnish: Add varnish-staging component [puppet] - 10https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall)
[21:25:34] <icinga-wm>	 PROBLEM - Host ganeti2042 is DOWN: PING CRITICAL - Packet loss = 100%
[21:26:01] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@2220747]: Stage Refine test fix
[21:26:13] <wikibugs>	 (CR) Fomafix: Add 'rup' as alias for 'roa-rup' (1 comment) [puppet] - https://gerrit.wikimedia.org/r/527917 (https://phabricator.wikimedia.org/T17988) (owner: Fomafix)
[21:26:17] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@2220747]: Stage Refine test fix (duration: 00m 16s)
[21:30:02] <icinga-wm>	 RECOVERY - Host ganeti2042 is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms
[21:47:34] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@7a66849]: Stage Refine: fix Airflow skip
[21:47:49] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@7a66849]: Stage Refine: fix Airflow skip (duration: 00m 14s)
[21:48:07] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics@7a66849]: Stage Refine: fix Airflow skip
[21:49:06] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@7a66849]: Stage Refine: fix Airflow skip (duration: 00m 59s)
[22:03:20] <jinxer-wm>	 RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[22:09:20] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[22:13:44] <icinga-wm>	 PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[22:14:22] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance backend on cp4043 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[22:17:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:26:00] <wikibugs>	 (PS1) Aleksandar Mastilovic: Rename to Blunderbuss [deployment-charts] - https://gerrit.wikimedia.org/r/1091389
[22:26:52] <wikibugs>	 (CR) CI reject: [V:-1] Rename to Blunderbuss [deployment-charts] - https://gerrit.wikimedia.org/r/1091389 (owner: Aleksandar Mastilovic)
[22:28:51] <wikibugs>	 (CR) Bking: [C:+2] wdqs: remove 3 codfw hosts from production [puppet] - https://gerrit.wikimedia.org/r/1088185 (https://phabricator.wikimedia.org/T376150) (owner: Ryan Kemper)
[22:30:49] <ryankemper>	 !log T376150 Depooled `wdqs20[18-20]` in preparation of merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1088185
[22:30:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:02] <stashbot>	 T376150: Prepare 5 codfw hosts to serve wdqs-internal from main graph - https://phabricator.wikimedia.org/T376150
[22:31:56] <wikibugs>	 (CR) Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[22:32:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:36:47] <wikibugs>	 (PS5) Bking: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[22:36:58] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[22:37:12] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp4043.ulsfo.wmnet with reason: ATS upgrade 9.2.6
[22:37:28] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp4043.ulsfo.wmnet with reason: ATS upgrade 9.2.6
[22:38:58] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1120 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:42:58] <wikibugs>	 (PS6) Bking: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[22:42:58] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1120 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:43:06] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[22:49:04] <wikibugs>	 (PS7) Bking: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[22:49:13] <wikibugs>	 (PS1) Andrew Bogott: Openstack placement: make read-only endpoints public [puppet] - https://gerrit.wikimedia.org/r/1091392
[22:50:12] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1091392 (owner: Andrew Bogott)
[22:50:59] <wikibugs>	 (PS8) Bking: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[22:52:36] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[22:52:54] <wikibugs>	 (PS2) Andrew Bogott: Openstack placement: make read-only endpoints public [puppet] - https://gerrit.wikimedia.org/r/1091392
[22:52:57] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1091392 (owner: Andrew Bogott)
[22:56:09] <wikibugs>	 (PS9) Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329)
[22:56:09] <wikibugs>	 (PS3) Ryan Kemper: wdqs: new pybal pools for internal graph split [puppet] - https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330)
[22:56:09] <wikibugs>	 (PS2) Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - https://gerrit.wikimedia.org/r/1091340 (https://phabricator.wikimedia.org/T379333)
[22:56:46] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Openstack placement: make read-only endpoints public [puppet] - https://gerrit.wikimedia.org/r/1091392 (owner: Andrew Bogott)
[22:58:13] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[22:59:20] <jinxer-wm>	 RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[22:59:37] <wikibugs>	 (PS10) Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329)
[22:59:37] <wikibugs>	 (PS4) Ryan Kemper: wdqs: new pybal pools for internal graph split [puppet] - https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330)
[22:59:37] <wikibugs>	 (PS3) Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - https://gerrit.wikimedia.org/r/1091340 (https://phabricator.wikimedia.org/T379333)
[23:00:15] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[23:31:55] <wikibugs>	 SRE-swift-storage, Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10324792 (Ladsgroup) I'm deleting all thumbnails on every container except commons right now. Only on codfw and in alphabetical order and in serial. Right now, it's on enwikibooks (...
[23:44:22] <wikibugs>	 (PS1) Scott French: debug.json: add support for mwdebug-next [mediawiki-config] - https://gerrit.wikimedia.org/r/1076848 (https://phabricator.wikimedia.org/T372605)
[23:48:57] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[23:53:21] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[23:53:57] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[23:58:20] <jinxer-wm>	 RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh