[00:00:10] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10320717 (10Jclark-ctr) [00:05:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1041.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:05:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:05:25] FIRING: SystemdUnitFailed: rsyslog-imfile-remedy.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:05:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:05:34] 06SRE, 10Continuous-Integration-Infrastructure, 10observability, 13Patch-For-Review, 10Release-Engineering-Team (Seen): Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089#10320722 (10colewhite) All dashboards in the [[ https://grafana-rw.wikimedia.org/dashboards/f/NHnAVr54k/rel... 
[00:05:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:06:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10320725 (10Jclark-ctr) [00:08:19] (03PS1) 10Bvibber: Correction to virtual-globaljsonlinks mapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090988 (https://phabricator.wikimedia.org/T374746) [00:10:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090988 (https://phabricator.wikimedia.org/T374746) (owner: 10Bvibber) [00:10:52] (03CR) 10Eevans: [C:03+2] corto: configure for production phabricator [puppet] - 10https://gerrit.wikimedia.org/r/1090981 (https://phabricator.wikimedia.org/T356790) (owner: 10Eevans) [00:12:39] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10320743 (10Jclark-ctr) @ABran-WMF these have been racked/ cabled/ configured Per the racking instructions that where in the Racking Proposal : and ju... 
[00:12:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10320727 (10Jclark-ctr) @bking these have been racked/ cabled/ configured and just need puppet updated for os install [00:13:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:13:31] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1046.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:13:40] (03CR) 10BCornwall: [C:03+1] dse-k8s-services: add CNAME for blunderbuss (nee hdfs-synchronizer) [dns] - 10https://gerrit.wikimedia.org/r/1090972 (https://phabricator.wikimedia.org/T365659) (owner: 10Bking) [00:24:46] PROBLEM - Dell PowerEdge RAID Controller on an-worker1169 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [00:24:47] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-worker1169 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T379856 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [00:24:53] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1169 - https://phabricator.wikimedia.org/T379856 (10ops-monitoring-bot) 03NEW [00:31:45] 10ops-eqiad, 06SRE, 06Data-Persistence-SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10320769 (10Jclark-ctr) @Marostegui Dell is requesting SOS report and TSR report from this server and another. I can pull TSR reports but while logging int... 
[00:35:18] 10ops-eqiad, 06SRE, 06Data-Persistence-SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10320773 (10Jclark-ctr) {F57699009} {F57699010} [00:38:40] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1090990 [00:38:40] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1090990 (owner: 10TrainBranchBot) [00:38:49] 06SRE-OnFire, 10Incident Tooling: corto: failure to create google doc should not be fatal - https://phabricator.wikimedia.org/T379858 (10Eevans) 03NEW [01:04:58] (03PS1) 10Bvibber: Avoid use of globaljsonlinks* tables on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090993 (https://phabricator.wikimedia.org/T374746) [01:05:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090993 (https://phabricator.wikimedia.org/T374746) (owner: 10Bvibber) [01:06:34] PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp5027.eqsin.wmnet, cp5026.eqsin.wmnet, cp5030.eqsin.wmnet, cp5031.eqsin.wmnet, cp5032.eqsin.wmnet, cp5025.eqsin.wmnet, cp5029.eqsin.wmnet are marked down but pooled: uploadlb_443: Servers cp5027.eqsin.wmnet, cp5026.eqsin.wmnet, cp5030.eqsin.wmnet, cp5031.eqsin.wmnet, cp5032.eqsin.wmnet, cp5025.eqsin.wmnet are marked down but p [01:06:34] tps://wikitech.wikimedia.org/wiki/PyBal [01:06:34] PROBLEM - PyBal backends health check on lvs5005 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp5026.eqsin.wmnet, cp5030.eqsin.wmnet, cp5031.eqsin.wmnet, cp5032.eqsin.wmnet, cp5025.eqsin.wmnet, cp5029.eqsin.wmnet are marked down but pooled: uploadlb_443: 
Servers cp5027.eqsin.wmnet, cp5031.eqsin.wmnet, cp5032.eqsin.wmnet, cp5028.eqsin.wmnet, cp5026.eqsin.wmnet, cp5025.eqsin.wmnet, cp5030.eqsin.wmnet, cp5029.eqsin.wmnet a [01:06:34] d down but pooled https://wikitech.wikimedia.org/wiki/PyBal [01:06:58] FIRING: ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:07:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [01:08:41] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1090996 [01:08:41] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1090996 (owner: 10TrainBranchBot) [01:09:08] FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:11:58] FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:15:34] PROBLEM - Webrequests Varnishkafka log producer on cp5029 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [01:16:17] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1090990 (owner: 10TrainBranchBot) [01:18:34] RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:18:36] RECOVERY - PyBal backends health check on lvs5005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:18:57] oh boy [01:19:00] !incidents [01:19:01] 5440 (ACKED) [2x] ProbeDown sre (wikikube-ctrl2002:6443 probes/custom codfw) [01:19:01] 5445 (UNACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [01:19:04] !ack 5445 [01:19:05] 5445 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [01:19:08] RESOLVED: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:19:53] FIRING: DDoSDetected: FastNetMon has detected an attack on eqsin #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected [01:20:12] !incidents [01:20:13] 5440 (ACKED) [2x] ProbeDown sre (wikikube-ctrl2002:6443 probes/custom codfw) [01:20:13] 5445 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [01:20:13] 5446 (UNACKED) DDoSDetected sre (netflow5002:9100 eqsin) [01:20:15] !ack 5446 [01:20:16] 5446 (ACKED) DDoSDetected sre 
(netflow5002:9100 eqsin) [01:21:58] RESOLVED: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:22:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [01:24:53] RESOLVED: DDoSDetected: FastNetMon has detected an attack on eqsin #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected [01:27:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [01:32:07] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [01:35:36] RECOVERY - Webrequests Varnishkafka log producer on cp5029 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [01:35:45] ok great [01:35:51] nothing outstanding for cleanup [01:45:16] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1090996 (owner: 10TrainBranchBot) [01:55:22] (03CR) 
10Ottomata: [C:03+1] dse-k8s-services: add CNAME for blunderbuss (nee hdfs-synchronizer) [dns] - 10https://gerrit.wikimedia.org/r/1090972 (https://phabricator.wikimedia.org/T365659) (owner: 10Bking) [02:01:34] (03PS1) 10Reedy: CommonSettings.php: Properly set / to array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090998 (https://phabricator.wikimedia.org/T379834) [02:02:18] (03CR) 10CI reject: [V:04-1] CommonSettings.php: Properly set / to array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090998 (https://phabricator.wikimedia.org/T379834) (owner: 10Reedy) [02:04:43] (03PS2) 10Reedy: CommonSettings.php: Properly set / to array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090998 (https://phabricator.wikimedia.org/T379834) [02:05:09] (03PS3) 10Reedy: CommonSettings.php: Properly set $wgCSPReportOnlyHeader/$wgCSPHeader to array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090998 (https://phabricator.wikimedia.org/T379834) [02:23:27] (03PS4) 10Reedy: CommonSettings.php: Properly set $wgCSPReportOnlyHeader/$wgCSPHeader to array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090998 (https://phabricator.wikimedia.org/T379834) [02:32:30] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:37:30] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [03:05:25] RESOLVED: SystemdUnitFailed: rsyslog-imfile-remedy.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[03:34:30] (03PS1) 10KartikMistry: CX3 Build 0.2.0+20241113 [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091007 (https://phabricator.wikimedia.org/T368718) [03:35:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091007 (https://phabricator.wikimedia.org/T368718) (owner: 10KartikMistry) [03:42:30] PROBLEM - Host mr1-magru.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [03:47:32] RECOVERY - Host mr1-magru.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 123.37 ms [03:53:54] (03PS1) 10JHathaway: WIP: don't remove override twice [cookbooks] - 10https://gerrit.wikimedia.org/r/1091009 [03:56:27] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bullseye [03:56:35] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10321140 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye [04:09:26] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [04:11:56] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [04:24:12] (03PS2) 10JHathaway: WIP: EFI don't remove override twice [cookbooks] - 10https://gerrit.wikimedia.org/r/1091009 [04:25:37] (03CR) 10JHathaway: "With this patch I am no longer able to reproduce the double d-i issue. 
I am fairly confident it is the cause of our woes, as it explains w" [cookbooks] - 10https://gerrit.wikimedia.org/r/1091009 (owner: 10JHathaway) [04:34:23] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2082.codfw.wmnet with OS bullseye [04:34:29] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10321172 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye comple... [05:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T0700) [07:00:05] marostegui, Amir1, and arnaudb: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T0700). 
[07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:06:24] !log delete office interco IP/prefixes/vlan in ulsfo - T379778 [07:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:27] T379778: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778 [07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:17:38] (03PS1) 10Ayounsi: Remove office interco include [dns] - 10https://gerrit.wikimedia.org/r/1091130 (https://phabricator.wikimedia.org/T379778) [07:27:43] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1091130 (https://phabricator.wikimedia.org/T379778) (owner: 10Ayounsi) [07:27:58] (03CR) 10Ayounsi: [C:03+2] Remove office interco include [dns] - 10https://gerrit.wikimedia.org/r/1091130 (https://phabricator.wikimedia.org/T379778) (owner: 10Ayounsi) [07:30:35] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [07:32:50] (03CR) 10Slyngshede: [C:03+2] Netbox: Update runbook, add dashboard and physicalhosts report. 
[alerts] - 10https://gerrit.wikimedia.org/r/1090875 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [07:34:17] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove office link dns records - ayounsi@cumin1002" [07:34:28] (03Merged) 10jenkins-bot: Netbox: Update runbook, add dashboard and physicalhosts report. [alerts] - 10https://gerrit.wikimedia.org/r/1090875 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [07:34:35] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove office link dns records - ayounsi@cumin1002" [07:34:35] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:34:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2017.codfw.wmnet [07:34:56] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10321323 (10ops-monitoring-bot) Draining ganeti2017.codfw.wmnet of running VMs [07:36:23] 06SRE, 06Infrastructure-Foundations, 10netops, 10procurement, 13Patch-For-Review: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10321324 (10ayounsi) [07:41:40] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 145, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:41:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2017.codfw.wmnet [07:42:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2017.codfw.wmnet [07:42:34] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10321346 
(10ops-monitoring-bot) Draining ganeti2017.codfw.wmnet of running VMs [07:45:42] (03PS1) 10Ayounsi: Remove test BGP session to e8 sonic switch [homer/public] - 10https://gerrit.wikimedia.org/r/1091169 [07:47:52] (03CR) 10Ayounsi: [C:03+2] Remove test BGP session to e8 sonic switch [homer/public] - 10https://gerrit.wikimedia.org/r/1091169 (owner: 10Ayounsi) [07:48:26] (03Merged) 10jenkins-bot: Remove test BGP session to e8 sonic switch [homer/public] - 10https://gerrit.wikimedia.org/r/1091169 (owner: 10Ayounsi) [07:51:57] (03PS1) 10Arnaudb: bashrc: add alias + dbctl alias [puppet] - 10https://gerrit.wikimedia.org/r/1091171 [07:51:58] (03CR) 10Arnaudb: [C:03+2] bashrc: add alias + dbctl alias [puppet] - 10https://gerrit.wikimedia.org/r/1091171 (owner: 10Arnaudb) [07:54:50] RECOVERY - BGP status on ssw1-e1-eqiad.mgmt is OK: BGP OK - up: 15, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:58:38] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 94, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:00:05] Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T0800). [08:00:05] kart_, DreamRimmer, and bvibber: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:21] o/ [08:00:32] \0 [08:00:45] o/ [08:00:47] I'll start with my patch. KCVelaga around? 
[08:00:52] Yes [08:01:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084704 (https://phabricator.wikimedia.org/T378565) (owner: 10KCVelaga) [08:02:24] (03Merged) 10jenkins-bot: Update stream registration and config for MinT for Readers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084704 (https://phabricator.wikimedia.org/T378565) (owner: 10KCVelaga) [08:03:34] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1084704|Update stream registration and config for MinT for Readers (T378565)]] [08:03:37] T378565: MinT for Readers instrumentation: update stream configuration and registration for new schema fragment - https://phabricator.wikimedia.org/T378565 [08:07:02] (03PS2) 10KartikMistry: Update recommendation api to 2024-11-11-200548-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089964 (https://phabricator.wikimedia.org/T379037) [08:08:33] !log kartik@deploy2002 kcvelaga, kartik: Backport for [[gerrit:1084704|Update stream registration and config for MinT for Readers (T378565)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:09:15] KCVelaga: Can you test the patch on mwdebug? 
[08:09:59] Let me try [08:10:53] (03PS3) 10KartikMistry: Update recommendation api to 2024-11-13-183159-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089964 (https://phabricator.wikimedia.org/T379592) [08:12:17] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 07IPv6: Enable ipv6 on ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T379890 (10MoritzMuehlenhoff) 03NEW [08:13:59] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 07IPv6: Enable ipv6 on ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T379890#10321458 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:18:40] RECOVERY - BGP status on cr2-magru is OK: BGP OK - up: 79, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:18:56] (03CR) 10Elukey: "I tried to recall why I've set up the code in the first place, and this is what I found on IRC:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1091009 (owner: 10JHathaway) [08:19:35] We are taking some time to test, bvibber - you're next once I'm done with config patch. Backport patch from me is postponed. 
[08:19:38] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect, ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:19:50] tx [08:20:36] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 67, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:20:40] (03CR) 10Muehlenhoff: "My personal take is this: We don't use SGX and have no plans to do so (and who knows if Intel doesn't even abandon it in total at some poi" [cookbooks] - 10https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: 10Volans) [08:21:42] RECOVERY - BGP status on cr1-esams is OK: BGP OK - up: 464, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:23:29] (03PS3) 10TChin: flink-app: Add default checkpointing config for Flink 1.20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090430 (https://phabricator.wikimedia.org/T375176) [08:23:44] !log kartik@deploy2002 kcvelaga, kartik: Continuing with sync [08:24:15] Okay [08:24:55] bvibber: Sorry. Just in time, our dev is back who can test :D I'll go ahead and +2 the patch as it will take a while to merge.. 
[08:25:06] so mine affect job queue stuff so i won't be able to test them on the debug server :) [08:25:09] \o/ [08:25:16] tx [08:25:53] (03CR) 10KartikMistry: [C:03+2] CX3 Build 0.2.0+20241113 [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091007 (https://phabricator.wikimedia.org/T368718) (owner: 10KartikMistry) [08:26:35] (03CR) 10Vgutierrez: haproxykafka: working on TLS client authentication to kafka (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [08:27:24] (03PS1) 10Brouberol: airflow-search: define k8s namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091175 (https://phabricator.wikimedia.org/T378441) [08:27:25] (03PS1) 10Brouberol: airflow-search: register tenant namespace in ceph-csi and cloudnative-pg operator configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091176 (https://phabricator.wikimedia.org/T378441) [08:27:27] (03PS1) 10Brouberol: airflow-search: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091177 (https://phabricator.wikimedia.org/T378441) [08:27:57] Did I forget DreamRimmer? You're next! 
:) [08:28:14] thanks [08:28:21] (03PS1) 10Brouberol: airflow-research: change user identitiy files owner to analytics-deploy [puppet] - 10https://gerrit.wikimedia.org/r/1091178 (https://phabricator.wikimedia.org/T378442) [08:28:22] (03PS1) 10Brouberol: airflow-search: define user kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1091179 (https://phabricator.wikimedia.org/T378441) [08:28:24] (03PS1) 10Brouberol: airflow-search: define OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1091180 (https://phabricator.wikimedia.org/T378441) [08:28:25] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084704|Update stream registration and config for MinT for Readers (T378565)]] (duration: 24m 50s) [08:28:25] (03PS1) 10Brouberol: airflow-search: define ATS mapping and cache config [puppet] - 10https://gerrit.wikimedia.org/r/1091181 (https://phabricator.wikimedia.org/T378441) [08:28:28] T378565: MinT for Readers instrumentation: update stream configuration and registration for new schema fragment - https://phabricator.wikimedia.org/T378565 [08:30:32] KCVelaga: done. DreamRimmer go ahead! [08:31:26] (03CR) 10Elukey: "I found only references about how to do it (if available) via manual BIOS config (like https://www.supermicro.com/support/faqs/faq.cfm?faq" [cookbooks] - 10https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: 10Volans) [08:32:41] kart_ the stream registration is showing up fine on my end as well now. Thank you. [08:32:55] Nice! 
[08:33:17] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 140407 [08:33:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 140407 [08:33:56] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 9299 [08:34:54] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9299 [08:35:14] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 141082 [08:35:15] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 141082 [08:35:19] (03CR) 10Muehlenhoff: "I've reworked the Envoy firewall setup in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1090798, this patch will need to be adapted" [puppet] - 10https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [08:35:34] kart_: deploying mine? [08:37:19] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 26744 [08:37:40] oh, I thought you're doing it :) [08:38:11] I don't have deployment access [08:38:21] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 26744 [08:39:02] ouch. Let me take a look at patch. [08:41:14] DreamRimmer: deploying.. 
[08:41:21] tx [08:41:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090937 (https://phabricator.wikimedia.org/T379635) (owner: 10Dreamrimmer) [08:42:06] (03Merged) 10jenkins-bot: Allow Wikidata bureaucrats to remove admin rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090937 (https://phabricator.wikimedia.org/T379635) (owner: 10Dreamrimmer) [08:42:36] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1090937|Allow Wikidata bureaucrats to remove admin rights (T379635)]] [08:42:40] T379635: Allow Wikidata bureaucrats to remove admin rights - https://phabricator.wikimedia.org/T379635 [08:43:16] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10321525 (10elukey) @jhathaway something interesting that I found on Redfish related to BIOS boot options: ms-be2088 ` BootModeSelect UEFI BootOption_1... 
[08:44:37] (03Merged) 10jenkins-bot: CX3 Build 0.2.0+20241113 [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091007 (https://phabricator.wikimedia.org/T368718) (owner: 10KartikMistry) [08:45:03] (03PS2) 10Brouberol: airflow-search: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091177 (https://phabricator.wikimedia.org/T378441) [08:46:22] (03PS1) 10Ayounsi: Replace fasw-c-eqiad with new fasw2 [puppet] - 10https://gerrit.wikimedia.org/r/1091182 (https://phabricator.wikimedia.org/T377381) [08:47:25] !log kartik@deploy2002 dreamrimmer, kartik: Backport for [[gerrit:1090937|Allow Wikidata bureaucrats to remove admin rights (T379635)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:47:36] (03PS2) 10Ayounsi: Remove old fasw-c-eqiad from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1091182 (https://phabricator.wikimedia.org/T377381) [08:48:16] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1087474 (owner: 10Slyngshede) [08:48:32] (03CR) 10Vgutierrez: [C:03+1] "I'm guessing you intended to check 9.2.6 and not 9.2.5 (same output though)" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1090933 (https://phabricator.wikimedia.org/T379797) (owner: 10Ssingh) [08:48:40] DreamRimmer: Patch is available to test using mwdebug servers. Will you be able to test it? [08:49:13] looks good to me. https://www.wikidata.org/w/api.php?action=query&format=json&meta=siteinfo&formatversion=2&siprop=general%7Cusergroups [08:49:18] go for it [08:49:36] cool! 
[08:49:42] !log kartik@deploy2002 dreamrimmer, kartik: Continuing with sync [08:52:56] PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100% [08:53:24] RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 30.50 ms [08:53:45] (03CR) 10Stevemunene: [C:03+1] airflow-research: change user identitiy files owner to analytics-deploy [puppet] - 10https://gerrit.wikimedia.org/r/1091178 (https://phabricator.wikimedia.org/T378442) (owner: 10Brouberol) [08:53:57] ACKNOWLEDGEMENT - Juniper alarms on fasw-c-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.30 ayounsi https://gerrit.wikimedia.org/r/c/operations/puppet/+/1091182 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [08:54:25] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1090937|Allow Wikidata bureaucrats to remove admin rights (T379635)]] (duration: 11m 49s) [08:54:29] T379635: Allow Wikidata bureaucrats to remove admin rights - https://phabricator.wikimedia.org/T379635 [08:54:58] DreamRimmer: done. [08:55:11] !log import haproxy 2.8.12 to thirtdparty/haproxy28 component for bullseye-wikimedia (apt.wm.o) - T379891 [08:55:16] Going ahead with my backport patch.. We are running out of time :/ [08:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:19] T379891: Upgrade haproxy to 2.8.12 on cp hosts - https://phabricator.wikimedia.org/T379891 [08:55:44] if we have to reschedule mine that's ok [08:55:49] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10321573 (10elukey) Tried to manually set the continuous flag on sretest2001, rebooted but I didn't see the boot options changing like ms-be2088. So at th...
[08:55:51] kart_: Thanks :) [08:56:05] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1091007|CX3 Build 0.2.0+20241113 (T368718 T374567)]] [08:56:09] T368718: Community-defined Translation Collections: Single selection mode UI - https://phabricator.wikimedia.org/T368718 [08:56:09] T374567: SX: Set aria-label to icon-only Codex buttons - https://phabricator.wikimedia.org/T374567 [08:56:54] bvibber: looks like train window is next, so might need to check with brennen and jnuche (who are doing train deployment..) [08:57:02] ok [09:00:05] brennen and jnuche: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T0900). [09:00:09] !log kartik@deploy2002 kartik: Backport for [[gerrit:1091007|CX3 Build 0.2.0+20241113 (T368718 T374567)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:00:59] (03PS1) 10JMeybohm: k8s.reimage-stacked-control-plane: Wait for 3m after depooling [cookbooks] - 10https://gerrit.wikimedia.org/r/1091185 (https://phabricator.wikimedia.org/T362408) [09:01:01] kart_, bvibber: hi there, train deployments are happening in US time this week, so you can go ahead with more backports if you want to [09:01:10] \o/ [09:03:41] cool. [09:03:53] bvibber: I'm testing my patch, give me a few minutes. [09:03:59] thx!
[09:04:05] (03PS1) 10Ayounsi: Netbox: disable translation [puppet] - 10https://gerrit.wikimedia.org/r/1091187 [09:04:34] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp4044.ulsfo.wmnet,cp4052.ulsfo.wmnet} and A:cp [09:04:42] (03CR) 10Stevemunene: airflow-search: define user kubeconfigs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1091179 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [09:05:42] (03CR) 10Jelto: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1091185 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:05:57] (03CR) 10Stevemunene: [C:03+1] airflow-search: define OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1091180 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [09:06:19] (03CR) 10Stevemunene: [C:03+1] airflow-search: define ATS mapping and cache config [puppet] - 10https://gerrit.wikimedia.org/r/1091181 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [09:06:39] (03CR) 10Stevemunene: [C:03+1] airflow-search: define k8s namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091175 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [09:07:24] (03PS1) 10Muehlenhoff: lintian: Fix selection of vendor profile [puppet] - 10https://gerrit.wikimedia.org/r/1091188 [09:08:17] (03CR) 10Stevemunene: [C:03+1] airflow-search: register tenant namespace in ceph-csi and cloudnative-pg operator configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091176 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [09:08:30] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp4044.ulsfo.wmnet,cp4052.ulsfo.wmnet} and A:cp [09:09:43] (03CR) 10Muehlenhoff: "FYI; The error "bad-distribution-in-changes-file bullseye-wikimedia" will go away when https://gerrit.wikimedia.org/r/c/operations/puppet" 
[debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1090933 (https://phabricator.wikimedia.org/T379797) (owner: 10Ssingh) [09:10:44] (03CR) 10JMeybohm: [C:03+2] k8s.reimage-stacked-control-plane: Wait for 3m after depooling [cookbooks] - 10https://gerrit.wikimedia.org/r/1091185 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:12:38] still testing.. [09:12:49] (03CR) 10Muehlenhoff: [C:03+1] "I'm not going to miss "Vorlagen für Dienste" or "Routen-Ziele"..." [puppet] - 10https://gerrit.wikimedia.org/r/1091187 (owner: 10Ayounsi) [09:12:49] (03CR) 10Volans: [C:03+2] "Thanks for the fix!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1090927 (owner: 10Kamila Součková) [09:13:08] kart_: , bvibber: would you be so kind to ping me when you're done. AIUI the train window is not used so I could start maintenance work early right after you [09:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:13:31] (03CR) 10Stevemunene: [C:03+1] airflow-search: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091177 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [09:14:22] jayme: sure. 
[09:14:27] thanks [09:16:18] (03Merged) 10jenkins-bot: k8s.reimage-stacked-control-plane: Wait for 3m after depooling [cookbooks] - 10https://gerrit.wikimedia.org/r/1091185 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:17:53] !log installed spicerack v8.16.0 on cumin2002 [09:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:59] (03CR) 10TChin: flink-app: Add default checkpointing config for Flink 1.20 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090430 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [09:21:06] !log kartik@deploy2002 kartik: Continuing with sync [09:21:20] 10SRE-tools, 06Data-Persistence-SRE, 06DBA, 06Infrastructure-Foundations, and 2 others: spicerack mysql_legacy: support fetch metrics for instance - https://phabricator.wikimedia.org/T376596#10321649 (10ABran-WMF) >>! In T376596#10205946, @Volans wrote: > Spicerack has support for prometheus, why not getti... [09:21:51] bvibber: You'll deploy your patches, right? :) [09:21:59] (03CR) 10Ayounsi: [C:03+2] Netbox: disable translation [puppet] - 10https://gerrit.wikimedia.org/r/1091187 (owner: 10Ayounsi) [09:22:19] kart_: in theory i can but i haven't done a deploy by hand in some time :) [09:23:26] best if someone more familiar pushes button [09:23:48] if no time then i'll reschedule [09:23:49] OK! 
:) [09:23:49] (03CR) 10Vgutierrez: [C:03+1] lintian: Fix selection of vendor profile [puppet] - 10https://gerrit.wikimedia.org/r/1091188 (owner: 10Muehlenhoff) [09:23:52] thx :) [09:25:34] (03CR) 10Brouberol: [C:03+2] airflow-research: change user identitiy files owner to analytics-deploy [puppet] - 10https://gerrit.wikimedia.org/r/1091178 (https://phabricator.wikimedia.org/T378442) (owner: 10Brouberol) [09:25:38] (03CR) 10Brouberol: [C:03+2] airflow-search: define user kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1091179 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [09:25:45] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1091007|CX3 Build 0.2.0+20241113 (T368718 T374567)]] (duration: 29m 40s) [09:25:50] T368718: Community-defined Translation Collections: Single selection mode UI - https://phabricator.wikimedia.org/T368718 [09:25:50] T374567: SX: Set aria-label to icon-only Codex buttons - https://phabricator.wikimedia.org/T374567 [09:26:04] bvibber: ok. deploying first patch.. [09:26:08] whee [09:26:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090988 (https://phabricator.wikimedia.org/T374746) (owner: 10Bvibber) [09:26:57] (03CR) 10Brouberol: [C:03+2] "Oh dang sorry, I misread merged too fast. 
I'll update this in a subsequent patch" [puppet] - 10https://gerrit.wikimedia.org/r/1091179 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [09:27:16] (03Merged) 10jenkins-bot: Correction to virtual-globaljsonlinks mapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090988 (https://phabricator.wikimedia.org/T374746) (owner: 10Bvibber) [09:27:44] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1090988|Correction to virtual-globaljsonlinks mapping (T374746)]] [09:27:47] T374746: Cache invalidation based on usage tracking of Data: pages - https://phabricator.wikimedia.org/T374746 [09:27:50] \o/ [09:28:52] (03Merged) 10jenkins-bot: doc: fix introduction code bug [software/spicerack] - 10https://gerrit.wikimedia.org/r/1090927 (owner: 10Kamila Součková) [09:30:45] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "Then yes, I guess we can live with the kernel warning :-)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: 10Volans) [09:31:27] !log kartik@deploy2002 bvibber, kartik: Backport for [[gerrit:1090988|Correction to virtual-globaljsonlinks mapping (T374746)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:31:46] it's job queue stuff so all i can test is that it doesn't explode ;) [09:31:49] bvibber: possible to test this patch? ^ [09:31:58] ah :) [09:32:34] should be ready to roll, no explody :) [09:32:39] cool [09:32:42] !log kartik@deploy2002 bvibber, kartik: Continuing with sync [09:33:04] (03CR) 10Brouberol: [C:03+2] airflow-search: define k8s namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091175 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [09:34:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[09:35:03] (03CR) 10Brouberol: [C:03+2] airflow-search: register tenant namespace in ceph-csi and cloudnative-pg operator configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091176 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [09:35:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:36:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:37:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:37:48] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1090988|Correction to virtual-globaljsonlinks mapping (T374746)]] (duration: 10m 03s) [09:37:51] T374746: Cache invalidation based on usage tracking of Data: pages - https://phabricator.wikimedia.org/T374746 [09:38:01] whee [09:38:03] bvibber: second patch now.. [09:38:05] thx! [09:38:37] beta only? should be fast! [09:38:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090993 (https://phabricator.wikimedia.org/T374746) (owner: 10Bvibber) [09:39:30] success! first patch is a-ok and functional <3 [09:39:36] (03CR) 10Fabfur: "thanks for the review and suggestions" [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [09:39:36] (03Merged) 10jenkins-bot: Avoid use of globaljsonlinks* tables on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090993 (https://phabricator.wikimedia.org/T374746) (owner: 10Bvibber) [09:39:42] whee [09:40:16] (03PS2) 10Fabfur: haproxykafka: working on TLS client authentication to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) [09:42:12] bvibber: all done. [09:42:17] thx! [09:42:33] `09:40:05 Skipping sync since all commits were beta/labs-only changes. 
Operation completed.` [09:42:39] super :D [09:42:51] jayme: we're done with deployment [09:42:55] FIRING: MaxConntrack: Max conntrack at 94.88% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [09:43:02] kart_: cool, thanks [09:43:11] !log Done: UTC morning backport window [09:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:40] (03PS3) 10Fabfur: haproxykafka: working on TLS client authentication to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) [09:47:05] (03PS1) 10Stevemunene: Add airflow oidc clients for pcc [labs/private] - 10https://gerrit.wikimedia.org/r/1091193 (https://phabricator.wikimedia.org/T378440) [09:47:49] (03CR) 10Btullis: [C:03+1] airflow-search: define OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1091180 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [09:47:55] RESOLVED: MaxConntrack: Max conntrack at 94.88% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [09:48:07] (03CR) 10Btullis: [C:03+1] airflow-search: define ATS mapping and cache config [puppet] - 10https://gerrit.wikimedia.org/r/1091181 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [09:49:08] (03CR) 10Btullis: [C:03+1] airflow-search: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091177 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [09:49:38] (03PS1) 10Muehlenhoff: Move Puppet CA monitoring out of the puppetmaster module [puppet] - 10https://gerrit.wikimedia.org/r/1091194 (https://phabricator.wikimedia.org/T365798) [09:50:38] (03CR) 10Brouberol: [C:03+1] Add airflow 
oidc clients for pcc [labs/private] - 10https://gerrit.wikimedia.org/r/1091193 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [09:52:16] (03CR) 10Stevemunene: [V:03+2 C:03+2] Add airflow oidc clients for pcc [labs/private] - 10https://gerrit.wikimedia.org/r/1091193 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [09:53:53] (03PS1) 10Urbanecm: [GrowthExperiments] Add virtual domain config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091197 (https://phabricator.wikimedia.org/T354939) [09:54:36] 10SRE-tools, 06Data-Persistence-SRE, 06DBA, 06Infrastructure-Foundations, and 2 others: spicerack mysql_legacy: support fetch metrics for instance - https://phabricator.wikimedia.org/T376596#10321785 (10ABran-WMF) >>>! In T376596#10205946, @Volans wrote: >> why not getting the metrics directly from there i... [09:55:05] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [09:56:52] (03CR) 10Vgutierrez: haproxykafka: working on TLS client authentication to kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [10:03:02] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@34b35a5] (releasing): (no justification provided) [10:03:22] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@34b35a5] (releasing): (no justification provided) (duration: 00m 21s) [10:06:11] (03CR) 10Fabfur: haproxykafka: working on TLS client authentication to kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [10:06:23] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@34b35a5] (releasing): (no justification provided) [10:06:30] (03CR) 10FNegri: [C:03+1] "> it is best if we disable it in the provisioning to have a reliable, deterministic state" [cookbooks] - 
10https://gerrit.wikimedia.org/r/1089664 (https://phabricator.wikimedia.org/T379351) (owner: 10Volans) [10:06:35] !log jayme@cumin2002 START - Cookbook sre.k8s.reimage-stacked-control-plane Reimaging k8s control planes of cluster wikikube-codfw: containerd migration [10:07:09] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@34b35a5] (releasing): (no justification provided) (duration: 00m 47s) [10:11:07] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2001.codfw.wmnet with OS bookworm [10:14:02] PROBLEM - BGP status on lsw1-b7-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:15:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2017.codfw.wmnet [10:16:48] !log remove ganeti2017 from active ganeti nodes T376594 [10:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:51] T376594: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594 [10:19:20] PROBLEM - ganeti-noded running on ganeti2017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [10:19:38] PROBLEM - ganeti-confd running on ganeti2017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [10:21:42] (03PS1) 10Stevemunene: airflow-analytics-product: register namespace in ceph-csi and cloudnative-pg operator configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091199 (https://phabricator.wikimedia.org/T378440) [10:21:44] (03PS1) 10Stevemunene: airflow-analytics-product: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091200 (https://phabricator.wikimedia.org/T378440) 
[10:22:03] FIRING: ProbeDown: Service ganeti2017:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:22:18] (03CR) 10Btullis: [C:03+2] Revert "Remove labswiki from HDFS imported dumps" [puppet] - 10https://gerrit.wikimedia.org/r/1090832 (https://phabricator.wikimedia.org/T217792) (owner: 10Btullis) [10:24:08] (03CR) 10Btullis: [C:03+1] "Great! Thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090900 (https://phabricator.wikimedia.org/T379711) (owner: 10Brouberol) [10:25:00] (03CR) 10Btullis: [C:03+1] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1087609 (https://phabricator.wikimedia.org/T378260) (owner: 10Zabe) [10:28:50] (03PS1) 10Muehlenhoff: Remove ganeti role from ganeti2017 [puppet] - 10https://gerrit.wikimedia.org/r/1091201 [10:30:14] (03PS1) 10JMeybohm: k8s.reimage-stacked-control-plane: Ask for the management password early [cookbooks] - 10https://gerrit.wikimedia.org/r/1091202 (https://phabricator.wikimedia.org/T362408) [10:32:28] (03CR) 10Muehlenhoff: [C:03+2] lintian: Fix selection of vendor profile [puppet] - 10https://gerrit.wikimedia.org/r/1091188 (owner: 10Muehlenhoff) [10:34:23] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1091194 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:36:50] (03CR) 10CI reject: [V:04-1] k8s.reimage-stacked-control-plane: Ask for the management password early [cookbooks] - 10https://gerrit.wikimedia.org/r/1091202 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [10:38:36] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti role from ganeti2017 [puppet] - 10https://gerrit.wikimedia.org/r/1091201 (owner: 10Muehlenhoff) [10:38:36] (03PS2) 10JMeybohm: k8s.reimage-stacked-control-plane: Ask for the management password early 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1091202 (https://phabricator.wikimedia.org/T362408) [10:41:35] (03CR) 10Brouberol: [C:03+2] datahub: leverage liveness and readiness probes for the gms and consumers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1090900 (https://phabricator.wikimedia.org/T379711) (owner: 10Brouberol) [10:42:10] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2001.codfw.wmnet with reason: host reimage [10:44:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [10:45:08] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2001.codfw.wmnet with reason: host reimage [10:47:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [10:49:08] RESOLVED: ProbeDown: Service ganeti2017:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1100) [11:01:15] (03CR) 10Vgutierrez: [C:04-1] haproxykafka: working on TLS client authentication to kafka (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [11:06:08] RECOVERY - BGP status on lsw1-b7-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:06:18] (03CR) 10Physikerwelt: "I like the idea. I was wondering if there is a check for the validity of the project name according to the ldap requirements, see e.g." 
[puppet] - 10https://gerrit.wikimedia.org/r/1090854 (https://phabricator.wikimedia.org/T379030) (owner: 10Arturo Borrero Gonzalez) [11:06:21] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1089638 (owner: 10Slyngshede) [11:07:51] (03CR) 10Vgutierrez: apt/varnish: Add/Pin varnish-staging component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [11:08:11] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl2001.codfw.wmnet with OS bookworm [11:08:17] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.reimage-stacked-control-plane (exit_code=0) Reimaging k8s control planes of cluster wikikube-codfw: containerd migration [11:09:27] (03PS4) 10Fabfur: haproxykafka: working on TLS client authentication to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) [11:09:48] (03CR) 10Fabfur: haproxykafka: working on TLS client authentication to kafka (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [11:14:02] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [11:17:23] !log installing openssl security updates [11:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:03] (03CR) 10Elukey: Move Puppet CA monitoring out of the puppetmaster module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1091194 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [11:27:25] (03PS1) 10Volans: mysql_legacy: fix pymysql queries [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091207 [11:29:46] (03PS1) 10Muehlenhoff: Add cumin alias for ircstream [puppet] - 10https://gerrit.wikimedia.org/r/1091208 [11:30:36] (03PS3) 10Sergio Gimeno: 
GrowthExperiments: set experiment config only in pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090830 (https://phabricator.wikimedia.org/T379681) [11:30:50] (03CR) 10Elukey: [C:03+1] "I am not 100% sure what is the difference in set_replication_parameters (practically) but I trust that you tested it :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091207 (owner: 10Volans) [11:30:54] (03CR) 10Sergio Gimeno: GrowthExperiments: set experiment config only in pilot wikis (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090830 (https://phabricator.wikimedia.org/T379681) (owner: 10Sergio Gimeno) [11:32:49] (03CR) 10Volans: "purely typos, pymysql uses python % string formatting underneath, so it was just a bad syntax and yes I've tested it :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091207 (owner: 10Volans) [11:52:25] (03CR) 10Muehlenhoff: [C:03+2] Add cumin alias for ircstream [puppet] - 10https://gerrit.wikimedia.org/r/1091208 (owner: 10Muehlenhoff) [11:57:18] !log restarting postfix on inbound/outbound servers to pick up openssl updates [11:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:24] !log jmm@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling restart_daemons on A:ncredir [12:00:39] (03PS1) 10Ayounsi: LibreNMS report: various fixes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1091212 (https://phabricator.wikimedia.org/T379907) [12:04:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Q1:eqiad:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371435#10322139 (10cmooney) >>! In T371435#10318507, @RobH wrote: > I'd hand this over to either John or Valerie as ops-eqiad for them to remove any devices... 
[12:04:22] (03PS2) 10Muehlenhoff: Move Puppet CA monitoring out of the puppetmaster module [puppet] - 10https://gerrit.wikimedia.org/r/1091194 (https://phabricator.wikimedia.org/T365798) [12:05:25] (03CR) 10Muehlenhoff: Move Puppet CA monitoring out of the puppetmaster module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1091194 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:08:51] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1091194 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:10:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling restart_daemons on A:ncredir [12:12:45] (03CR) 10Urbanecm: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090830 (https://phabricator.wikimedia.org/T379681) (owner: 10Sergio Gimeno) [12:17:35] !log jayme@cumin2002 START - Cookbook sre.k8s.reimage-stacked-control-plane Reimaging k8s control planes of cluster wikikube-codfw: containerd migration [12:18:37] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-codfw [12:19:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-codfw [12:22:13] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2003.codfw.wmnet with OS bookworm [12:23:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:23:20] (03CR) 10Jelto: [C:03+1] "lgtm (as discussed offline after successful test-cookbook run)" [cookbooks] - 
10https://gerrit.wikimedia.org/r/1091202 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [12:25:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7002.magru.wmnet [12:25:08] PROBLEM - BGP status on lsw1-a2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:26:10] (03PS1) 10Muehlenhoff: Enable nftables for ganeti7002 [puppet] - 10https://gerrit.wikimedia.org/r/1091224 [12:28:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:29:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7002.magru.wmnet [12:29:38] (03CR) 10Muehlenhoff: [C:03+2] Enable nftables for ganeti7002 [puppet] - 10https://gerrit.wikimedia.org/r/1091224 (owner: 10Muehlenhoff) [12:32:17] (03PS1) 10Clément Goubert: wikikube: Add wikikube-worker13[05-12] [puppet] - 10https://gerrit.wikimedia.org/r/1091225 (https://phabricator.wikimedia.org/T377022) [12:35:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet [12:38:36] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2003.codfw.wmnet with reason: host reimage [12:38:39] jouncebot: nowandnext [12:38:39] No deployments scheduled for the next 0 hour(s) and 21 minute(s) [12:38:39] In 0 hour(s) and 21 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1300) [12:38:58] Any objections to me deploying a config change? 
[12:40:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090511 (https://phabricator.wikimedia.org/T379583) (owner: 10Dreamy Jazz) [12:41:22] (03Merged) 10jenkins-bot: Hide IP reveal tools on Special:AbuseLog and Special:GlobalBlockList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090511 (https://phabricator.wikimedia.org/T379583) (owner: 10Dreamy Jazz) [12:41:52] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1090511|Hide IP reveal tools on Special:AbuseLog and Special:GlobalBlockList (T379583)]] [12:41:56] T379583: Find and exclude special pages where temporary account IP reveal is not necessary - https://phabricator.wikimedia.org/T379583 [12:42:06] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2003.codfw.wmnet with reason: host reimage [12:43:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet [12:45:47] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1090511|Hide IP reveal tools on Special:AbuseLog and Special:GlobalBlockList (T379583)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:46:22] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [12:48:01] 10ops-eqsin, 06SRE: Inbound interface errors - asw1-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T376837#10322250 (10RobH) 05Open→03Declined [12:49:08] FIRING: [2x] ProbeDown: Service ganeti7002:1811 has failed probes (tcp_ganeti_noded_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:49:09] !log failover ganeti master of magru02 to ganeti7002 [12:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:56] PROBLEM - ganeti-wconfd running on ganeti7004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-masterd), command name ganeti-wconfd 
https://wikitech.wikimedia.org/wiki/Ganeti [12:51:00] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1090511|Hide IP reveal tools on Special:AbuseLog and Special:GlobalBlockList (T379583)]] (duration: 09m 08s) [12:51:16] T379583: Find and exclude special pages where temporary account IP reveal is not necessary - https://phabricator.wikimedia.org/T379583 [12:51:28] (03PS1) 10KartikMistry: CX3 Build 0.2.0+20241114 [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091227 [12:51:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7004.magru.wmnet [12:52:03] RESOLVED: [2x] ProbeDown: Service ganeti7002:1811 has failed probes (tcp_ganeti_noded_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:52:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091227 (owner: 10KartikMistry) [12:52:33] !log installing apache2 security updates [12:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:59] (03PS1) 10Cathal Mooney: Remove old fr-tech switch stack from rancid backups [puppet] - 10https://gerrit.wikimedia.org/r/1091228 (https://phabricator.wikimedia.org/T377381) [12:53:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7004.magru.wmnet [12:53:43] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-eqiad [12:54:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-eqiad [12:57:20] PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, 
AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:58:19] (03CR) 10Arnaudb: [C:03+1] mysql_legacy: fix pymysql queries [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091207 (owner: 10Volans) [12:58:22] PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:58:22] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin [12:58:23] status [12:58:24] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin [12:58:24] status [12:59:20] RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 44, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1300) [13:00:43] FIRING: ProbeDown: Service miscweb2003:30443 has failed probes (http_bienvenida_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:01:24] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:01:24] RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:03:18] RECOVERY - BGP status on lsw1-a2-codfw.mgmt is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:03:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:33] FIRING: [5x] KubernetesCalicoDown: kubernetes2052.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:04:56] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl2003.codfw.wmnet with OS bookworm [13:05:01] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.reimage-stacked-control-plane (exit_code=0) Reimaging k8s control planes of cluster wikikube-codfw: containerd migration [13:05:20] PROBLEM - BGP status on lsw1-d4-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:05:24] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 38, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:05:24] PROBLEM - BGP status 
on lsw1-b3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:05:30] (03PS1) 10Muehlenhoff: Switch ganeti7004 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1091230 [13:05:43] RESOLVED: ProbeDown: Service miscweb2003:30443 has failed probes (http_bienvenida_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:30443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:06:20] RECOVERY - BGP status on lsw1-d4-codfw.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:06:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090830 (https://phabricator.wikimedia.org/T379681) (owner: 10Sergio Gimeno) [13:07:28] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 286, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:07:31] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 370, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:07:43] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti7004 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1091230 (owner: 10Muehlenhoff) [13:08:21] (03PS1) 10Sergio Gimeno: HomepageHooks: run metrics increment in deferred update [extensions/GrowthExperiments] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091231 (https://phabricator.wikimedia.org/T379682) [13:08:24] RECOVERY - BGP status on lsw1-b3-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status 
[13:08:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091231 (https://phabricator.wikimedia.org/T379682) (owner: 10Sergio Gimeno) [13:09:12] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:09:33] RESOLVED: [7x] KubernetesCalicoDown: kubernetes2052.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:17:11] (03PS1) 10Giuseppe Lavagetto: Deploy fix for search button height [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1091232 [13:17:25] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Deploy fix for search button height [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1091232 (owner: 10Giuseppe Lavagetto) [13:18:13] !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Fix search button height - oblivian@cumin1002" [13:18:15] !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix search button height - oblivian@cumin1002 [13:18:51] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix search button height - oblivian@cumin1002 
[13:18:52] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Fix search button height - oblivian@cumin1002" [13:21:08] !log kcvelaga@deploy2002 Started deploy [airflow-dags/analytics_product@c5ab766]: T379546 [13:21:13] T379546: Update the product-analytics DAGs to use miniforge instead of condaforge - https://phabricator.wikimedia.org/T379546 [13:21:48] !log kcvelaga@deploy2002 Finished deploy [airflow-dags/analytics_product@c5ab766]: T379546 (duration: 00m 54s) [13:26:44] (03CR) 10Volans: [C:03+2] mysql_legacy: fix pymysql queries [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091207 (owner: 10Volans) [13:29:20] 06SRE, 06Infrastructure-Foundations, 10netops, 10procurement, 13Patch-For-Review: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10322386 (10RobH) [13:30:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7004.magru.wmnet [13:34:59] jouncebot: !next [13:35:20] forgot the command :D [13:35:41] jouncebot: nowandnext [13:35:41] For the next 0 hour(s) and 24 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1300) [13:35:41] In 0 hour(s) and 24 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1400) [13:36:12] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@2220747]: Stage Refine parallelization improvment [airflow-dags@2220747d] [13:36:21] Doing early +2 for my upcoming backport. CI taking some 30 minutes.. 
[13:36:28] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@2220747]: Stage Refine parallelization improvment [airflow-dags@2220747d] (duration: 00m 15s) [13:37:23] (03Merged) 10jenkins-bot: mysql_legacy: fix pymysql queries [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091207 (owner: 10Volans) [13:38:10] (03PS3) 10Volans: mysql: remove unused module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087855 [13:38:10] (03PS3) 10Volans: mysql_legacy: rename to mysql [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087856 [13:38:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7004.magru.wmnet [13:38:29] (03PS5) 10Fabfur: haproxykafka: working on TLS client authentication to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) [13:38:55] (03CR) 10Fabfur: haproxykafka: working on TLS client authentication to kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [13:42:35] (03PS11) 10Fabfur: haproxy: add ring support to haproxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) [13:44:11] kart_: I don’t see any early +2 yet… [13:44:49] (03CR) 10KartikMistry: [C:03+2] CX3 Build 0.2.0+20241114 [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091227 (owner: 10KartikMistry) [13:44:53] yay [13:45:01] Lucas_WMDE: sorry :D [13:45:06] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [13:45:42] (03CR) 10Slyngshede: [C:03+2] Account Managers: Allow account managers to be assigned by LDAP. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1087474 (owner: 10Slyngshede) [13:47:54] (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.16.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091235 [13:48:10] kart_: should I also do that with my change or that would interfere with yours? [13:48:11] (03CR) 10Slyngshede: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1091180 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [13:48:21] (03Merged) 10jenkins-bot: Account Managers: Allow account managers to be assigned by LDAP. [software/bitu] - 10https://gerrit.wikimedia.org/r/1087474 (owner: 10Slyngshede) [13:49:32] sergi0: Please wait till I start the deployment.. I'll ping. [13:49:33] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@2220747]: Stage Refine parallelization improvment [airflow-dags@2220747d] [13:50:05] kart_: ack [13:50:25] (03CR) 10Ladsgroup: [C:03+1] [GrowthExperiments] Add virtual domain config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091197 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [13:50:41] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@2220747]: Stage Refine parallelization improvment [airflow-dags@2220747d] (duration: 01m 08s) [13:51:53] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [13:54:35] (03PS1) 10Slyngshede: P:idp experimental webauthn [puppet] - 10https://gerrit.wikimedia.org/r/1091237 (https://phabricator.wikimedia.org/T311236) [13:57:45] (03PS12) 10Fabfur: haproxy: add ring support to haproxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) [13:59:12] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1400) [14:00:05] kart_ and sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:18] o/ [14:00:42] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [14:02:26] o/ [14:02:30] I think I can deploy! [14:02:57] Lucas_WMDE: I'll be deploying my patch first :) [14:03:05] I was about to ask if you wanted to self-service :) [14:03:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:04:29] * urbanecm waves [14:04:37] sergi0: let me know if you need any assistance with your patches [14:05:09] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10322543 (10MoritzMuehlenhoff) [14:05:27] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough and A:wikidough [14:05:28] urbanecm: yes, I'd appreciate, ty [14:05:39] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10322555 (10MoritzMuehlenhoff) [14:05:56] kart_: can you already start your scap backport? 
[14:06:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091227 (owner: 10KartikMistry) [14:06:11] thanks :) [14:06:17] then it’s more visible that you’re first in line :P [14:06:43] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v8.16.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091235 (owner: 10Volans) [14:07:18] (03PS2) 10Clément Goubert: wikikube: Add wikikube-worker13[05-12] [puppet] - 10https://gerrit.wikimedia.org/r/1091225 (https://phabricator.wikimedia.org/T377022) [14:08:25] Lucas_WMDE: yeah, started.. still waiting for CI.. [14:09:35] yeah [14:10:24] (03PS3) 10Clément Goubert: wikikube: Add wikikube-worker13[05-12] [puppet] - 10https://gerrit.wikimedia.org/r/1091225 (https://phabricator.wikimedia.org/T377022) [14:11:28] 06SRE, 10Observability-Alerting, 06Traffic, 13Patch-For-Review: PuppetFailure alert is not being fired for host(s) where agent has failed - https://phabricator.wikimedia.org/T379807#10322574 (10ssingh) Thanks for the investigation and fix @colewhite! >>! In T379807#10319559, @colewhite wrote: > The issue... [14:13:09] (03Merged) 10jenkins-bot: CX3 Build 0.2.0+20241114 [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091227 (owner: 10KartikMistry) [14:13:39] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1091227|CX3 Build 0.2.0+20241114]] [14:13:40] (03CR) 10Ssingh: "@vgutierrez@wikimedia.org: yes indeed and it seems like I did run on 9.2.6 but I erroneously pasted the one from my Ctrl+R. 
Thanks for che" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1090933 (https://phabricator.wikimedia.org/T379797) (owner: 10Ssingh) [14:13:54] (03CR) 10Ssingh: [C:03+2] Release 9.2.6-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1090933 (https://phabricator.wikimedia.org/T379797) (owner: 10Ssingh) [14:15:07] (03CR) 10Xcollazo: [C:03+1] dse-k8s-services: mw-dump: version bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088275 (https://phabricator.wikimedia.org/T368746) (owner: 10Gmodena) [14:16:10] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4522/co" [puppet] - 10https://gerrit.wikimedia.org/r/1091237 (https://phabricator.wikimedia.org/T311236) (owner: 10Slyngshede) [14:16:46] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v8.16.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091235 (owner: 10Volans) [14:17:21] !log kartik@deploy2002 kartik: Backport for [[gerrit:1091227|CX3 Build 0.2.0+20241114]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:17:58] (03PS2) 10Slyngshede: P:idp experimental webauthn [puppet] - 10https://gerrit.wikimedia.org/r/1091237 (https://phabricator.wikimedia.org/T311236) [14:18:19] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough and A:wikidough [14:18:47] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4523/co" [puppet] - 10https://gerrit.wikimedia.org/r/1091237 (https://phabricator.wikimedia.org/T311236) (owner: 10Slyngshede) [14:21:37] (03CR) 10Ssingh: [C:03+1] "I did a brief check for other ^# deploy-sites and it seems like this is the only one with the erroneous entry. Thanks for the fix!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1090976 (https://phabricator.wikimedia.org/T379807) (owner: 10Cwhite) [14:22:15] !log kartik@deploy2002 kartik: Continuing with sync [14:25:19] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox and A:magru and A:dnsbox [14:26:27] (03PS1) 10Volans: Upstream release v8.16.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1091242 [14:27:03] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1091227|CX3 Build 0.2.0+20241114]] (duration: 13m 23s) [14:28:40] urbanecm: shall we proceed? [14:29:01] sergi0: if kart_ is done, why not! [14:29:18] yes [14:29:26] Please go ahead. [14:30:19] sergi0: wanna do the deployment? [14:30:57] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on A:dnsbox and A:magru and A:dnsbox [14:31:39] * sergi0 trying to log in deploy server to answer that [14:32:55] (03CR) 10Urbanecm: [C:03+2] HomepageHooks: run metrics increment in deferred update [extensions/GrowthExperiments] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091231 (https://phabricator.wikimedia.org/T379682) (owner: 10Sergio Gimeno) [14:32:58] started CI first [14:33:00] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox and not A:magru and A:dnsbox [14:33:29] (03CR) 10Gmodena: [C:03+2] dse-k8s-services: mw-dump: version bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088275 (https://phabricator.wikimedia.org/T368746) (owner: 10Gmodena) [14:34:46] (03Merged) 10jenkins-bot: dse-k8s-services: mw-dump: version bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088275 (https://phabricator.wikimedia.org/T368746) (owner: 10Gmodena) [14:36:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090830 
(https://phabricator.wikimedia.org/T379681) (owner: 10Sergio Gimeno) [14:37:03] (03Merged) 10jenkins-bot: GrowthExperiments: set experiment config only in pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090830 (https://phabricator.wikimedia.org/T379681) (owner: 10Sergio Gimeno) [14:37:29] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1090830|GrowthExperiments: set experiment config only in pilot wikis (T379681)]] [14:37:33] T379681: community-updates-module variant is assigned outside of Growth pilot wikis - https://phabricator.wikimedia.org/T379681 [14:38:35] (03CR) 10Muehlenhoff: [C:03+2] spark: Avoid Ferm-specific syntax (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/1087488 (owner: 10Muehlenhoff) [14:40:03] (03CR) 10Volans: [C:03+2] Upstream release v8.16.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1091242 (owner: 10Volans) [14:41:24] !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1090830|GrowthExperiments: set experiment config only in pilot wikis (T379681)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:45:50] !log sgimeno@deploy2002 sgimeno: Continuing with sync [14:48:51] (03CR) 10Jforrester: [C:03+1] CommonSettings.php: Properly set $wgCSPReportOnlyHeader/$wgCSPHeader to array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090998 (https://phabricator.wikimedia.org/T379834) (owner: 10Reedy) [14:49:54] (03Merged) 10jenkins-bot: Upstream release v8.16.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1091242 (owner: 10Volans) [14:50:31] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1090830|GrowthExperiments: set experiment config only in pilot wikis (T379681)]] (duration: 13m 02s) [14:50:35] T379681: community-updates-module variant is assigned outside of Growth pilot wikis - https://phabricator.wikimedia.org/T379681 [14:52:17] (03CR) 10Cathal Mooney: [C:03+1] Remove old fasw-c-eqiad from monitoring 
[puppet] - 10https://gerrit.wikimedia.org/r/1091182 (https://phabricator.wikimedia.org/T377381) (owner: 10Ayounsi) [14:52:36] (03Abandoned) 10Cathal Mooney: Remove old fr-tech switch stack from rancid backups [puppet] - 10https://gerrit.wikimedia.org/r/1091228 (https://phabricator.wikimedia.org/T377381) (owner: 10Cathal Mooney) [14:52:49] (03CR) 10Ayounsi: [C:03+2] Remove old fasw-c-eqiad from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1091182 (https://phabricator.wikimedia.org/T377381) (owner: 10Ayounsi) [14:53:18] (03CR) 10Cathal Mooney: [C:03+2] Remove old fasw-c-eqiad from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1091182 (https://phabricator.wikimedia.org/T377381) (owner: 10Ayounsi) [14:53:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091231 (https://phabricator.wikimedia.org/T379682) (owner: 10Sergio Gimeno) [14:53:46] !log uploaded spicerack_8.16.1 to apt.wikimedia.org bullseye-wikimedia [14:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:51] (03PS1) 10Ssingh: hiera: set do_ipv6_primary_ra for all LVS in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1091243 (https://phabricator.wikimedia.org/T358260) [14:54:53] (03Merged) 10jenkins-bot: HomepageHooks: run metrics increment in deferred update [extensions/GrowthExperiments] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091231 (https://phabricator.wikimedia.org/T379682) (owner: 10Sergio Gimeno) [14:55:25] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1091231|HomepageHooks: run metrics increment in deferred update (T379682)]] [14:55:29] T379682: Growth KPI Grafana dashboard claims control is not assigned to any users at enwiki - https://phabricator.wikimedia.org/T379682 [14:56:52] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/1091243/4524/" [puppet] - 
10https://gerrit.wikimedia.org/r/1091243 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh) [14:58:41] (03PS1) 10Muehlenhoff: puppetserver: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1091245 [14:59:21] !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1091231|HomepageHooks: run metrics increment in deferred update (T379682)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:02:03] !log sgimeno@deploy2002 sgimeno: Continuing with sync [15:02:11] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:06:41] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1091231|HomepageHooks: run metrics increment in deferred update (T379682)]] (duration: 11m 15s) [15:06:57] hi Amir1, re https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1091197... will that work even in beta (where there is no x1 AFAIK)? [15:07:03] T379682: Growth KPI Grafana dashboard claims control is not assigned to any users at enwiki - https://phabricator.wikimedia.org/T379682 [15:07:05] or do i need to negate that in CS-labs.php? [15:07:06] !log UTC afternoon deploys done [15:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:35] urbanecm: I think (not sure), beta has x1 too? [15:07:45] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:07:45] but it's wikishared db maybe [15:07:56] if not, then yes, negate it in -labs :D [15:08:10] or make it conditional in CS.php [15:08:31] Amir1: ahh, it defines `extension1` in the config, but it points it to the same server... [15:08:50] (03CR) 10Brouberol: [C:03+2] airflow-search: define OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1091180 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [15:08:57] thanks! 
[15:13:02] (03CR) 10Brouberol: [C:03+2] airflow-search: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091177 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [15:13:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1091245 (owner: 10Muehlenhoff) [15:15:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [15:15:56] (03PS3) 10JHathaway: EFI don't remove override twice [cookbooks] - 10https://gerrit.wikimedia.org/r/1091009 [15:16:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [15:16:41] (03CR) 10Brouberol: [C:03+2] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1091181 (https://phabricator.wikimedia.org/T378441) (owner: 10Brouberol) [15:16:42] (03CR) 10JHathaway: "per our discussion on IRC, added some more context to the patch, noting the reason for the original addition." [cookbooks] - 10https://gerrit.wikimedia.org/r/1091009 (owner: 10JHathaway) [15:17:51] (03PS1) 10Ladsgroup: Revert "mmv.js: Store comingFromHashChange as a class property" [extensions/MultimediaViewer] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091248 (https://phabricator.wikimedia.org/T379835) [15:18:41] jouncebot: nowandnext [15:18:42] No deployments scheduled for the next 0 hour(s) and 41 minute(s) [15:18:42] In 0 hour(s) and 41 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1600) [15:19:45] (03CR) 10Ladsgroup: [C:03+2] Revert "mmv.js: Store comingFromHashChange as a class property" [extensions/MultimediaViewer] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091248 (https://phabricator.wikimedia.org/T379835) (owner: 10Ladsgroup) [15:22:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [extensions/MultimediaViewer] (wmf/1.44.0-wmf.3) - 
10https://gerrit.wikimedia.org/r/1091248 (https://phabricator.wikimedia.org/T379835) (owner: 10Ladsgroup) [15:22:34] (03Merged) 10jenkins-bot: Revert "mmv.js: Store comingFromHashChange as a class property" [extensions/MultimediaViewer] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091248 (https://phabricator.wikimedia.org/T379835) (owner: 10Ladsgroup) [15:23:04] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1091248|Revert "mmv.js: Store comingFromHashChange as a class property" (T379835)]] [15:23:08] T379835: Closing an image in MultimediaViewer does not remove the URL fragment - https://phabricator.wikimedia.org/T379835 [15:24:02] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on A:dnsbox and not A:magru and A:dnsbox [15:24:12] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: wikikube-ctrl2002: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379719#10322813 (10Jhancock.wm) I could do this today. or we can wait until next week. assuming no one wants to do a maintenance o... 
[15:24:31] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:24:32] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:25:05] (03PS4) 10JHathaway: EFI don't remove override twice [cookbooks] - 10https://gerrit.wikimedia.org/r/1091009 [15:25:24] (03CR) 10Elukey: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1091009 (owner: 10JHathaway) [15:25:38] (03PS1) 10Andrew Bogott: resolvconf: don't update resolv.conf with 0 nameservers [puppet] - 10https://gerrit.wikimedia.org/r/1091249 [15:25:40] (03CR) 10Brouberol: "Because the service is fully behind the kubernetes ingress, we _don't have to_ register it under LVS. We can though, but this is not what " [puppet] - 10https://gerrit.wikimedia.org/r/1090977 (https://phabricator.wikimedia.org/T365659) (owner: 10Bking) [15:26:15] 06SRE, 10Bitu, 06Infrastructure-Foundations: Allow to provide links for Bitu permissions - https://phabricator.wikimedia.org/T379926 (10MoritzMuehlenhoff) 03NEW [15:26:16] (03CR) 10CI reject: [V:04-1] resolvconf: don't update resolv.conf with 0 nameservers [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (owner: 10Andrew Bogott) [15:27:28] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1091248|Revert "mmv.js: Store comingFromHashChange as a class property" (T379835)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:27:40] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [15:28:14] 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927 (10fnegri) 03NEW [15:28:44] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-ctrl2002.codfw.wmnet 
[15:28:46] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-ctrl2002.codfw.wmnet [15:28:49] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: wikikube-ctrl2002: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379719#10322881 (10ops-monitoring-bot) depool host wikikube-ctrl2002.codfw.wmnet by jayme@cumin2002 with reason: None [15:28:52] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: wikikube-ctrl2002: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379719#10322882 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin2002 depool for host wiki... [15:29:05] 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10322866 (10fnegri) 05Open→03Resolved a:03fnegri The issue is resolved, I created this task to track it in case it happens... [15:29:13] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2002.codfw.wmnet with reason: T379719 [15:29:19] T379719: wikikube-ctrl2002: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379719 [15:29:29] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2002.codfw.wmnet with reason: T379719 [15:29:43] (03CR) 10Herron: [C:03+1] "let em in!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1090976 (https://phabricator.wikimedia.org/T379807) (owner: 10Cwhite) [15:30:05] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-ntp rolling restart_daemons on A:dnsbox [15:30:14] (03PS2) 10Andrew Bogott: resolvconf: don't update resolv.conf with 0 nameservers [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) [15:31:41] (03CR) 10Kamila Součková: [C:03+1] "+1 but see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1091225 (https://phabricator.wikimedia.org/T377022) (owner: 10Clément Goubert) [15:32:39] (03CR) 10JHathaway: [C:03+2] EFI don't remove override twice [cookbooks] - 10https://gerrit.wikimedia.org/r/1091009 (owner: 10JHathaway) [15:33:30] !log reprepro -C main include bullseye-wikimedia trafficserver_9.2.6-1wm1_amd64.changes: T379797 [15:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:46] T379797: Package and deploy ATS 9.2.6 - https://phabricator.wikimedia.org/T379797 [15:34:45] (03CR) 10Clément Goubert: wikikube: Add wikikube-worker13[05-12] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1091225 (https://phabricator.wikimedia.org/T377022) (owner: 10Clément Goubert) [15:35:14] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1091248|Revert "mmv.js: Store comingFromHashChange as a class property" (T379835)]] (duration: 12m 10s) [15:35:18] T379835: Closing an image in MultimediaViewer does not remove the URL fragment - https://phabricator.wikimedia.org/T379835 [15:35:30] PROBLEM - BGP status on lsw1-c7-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:36:05] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqiad [reason: junos upgrade, T364092] [15:36:10] T364092: Upgrade core routers to Junos 23.4R2 - 
https://phabricator.wikimedia.org/T364092 [15:36:21] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqiad [reason: junos upgrade, T364092] [15:36:25] (03CR) 10Jbond: [C:04-1] "i don;t think this will fix the underlining issue, see comments. ill take a look at the task" [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [15:37:07] (03PS4) 10Clément Goubert: wikikube: Add wikikube-worker13[05-12] [puppet] - 10https://gerrit.wikimedia.org/r/1091225 (https://phabricator.wikimedia.org/T377022) [15:37:16] !log installed spicerack v8.16.1 to cumin hosts [15:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:17] (03CR) 10Volans: "This can now be tested with test-cookbook as spicerack has been released and deployed." [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans) [15:38:24] (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans) [15:38:47] (03PS5) 10Reedy: CommonSettings.php: Properly set $wgCSPReportOnlyHeader/$wgCSPHeader to array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090998 (https://phabricator.wikimedia.org/T379834) [15:38:51] (03CR) 10Reedy: [C:03+2] CommonSettings.php: Properly set $wgCSPReportOnlyHeader/$wgCSPHeader to array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090998 (https://phabricator.wikimedia.org/T379834) (owner: 10Reedy) [15:39:22] (03CR) 10Clément Goubert: wikikube: Add wikikube-worker13[05-12] (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1091225 (https://phabricator.wikimedia.org/T377022) (owner: 10Clément Goubert) [15:39:34] (03Merged) 10jenkins-bot: CommonSettings.php: Properly set $wgCSPReportOnlyHeader/$wgCSPHeader to array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090998 (https://phabricator.wikimedia.org/T379834) (owner: 10Reedy) [15:39:49] !log stevemunene@cumin1002 END 
(FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1020.eqiad.wmnet with OS bullseye [15:40:16] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1016.eqiad.wmnet with OS bullseye [15:42:24] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp4043*,cp4051*} and A:cp for 9.2.6-1wm1 [15:43:16] !log pt1979@cumin2002 START - Cookbook sre.network.cf [15:43:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.cf (exit_code=0) [15:44:30] RECOVERY - BGP status on lsw1-c7-codfw.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:45:18] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl2002.codfw.wmnet [15:45:21] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl2002.codfw.wmnet [15:45:24] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: wikikube-ctrl2002: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379719#10322923 (10ops-monitoring-bot) pool host wikikube-ctrl2002.codfw.wmnet by jayme@cumin2002 with reason: None [15:45:28] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: wikikube-ctrl2002: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379719#10322926 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin2002 pool for host wikiku... 
[15:45:39] !log jayme@cumin2002 START - Cookbook sre.hosts.remove-downtime for wikikube-ctrl2002.codfw.wmnet [15:45:40] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-ctrl2002.codfw.wmnet [15:46:31] (03PS3) 10Arnaudb: dbtools: command line helper to evaluate a host, or a group of hosts [software] - 10https://gerrit.wikimedia.org/r/1091250 (https://phabricator.wikimedia.org/T378715) [15:46:31] (03CR) 10Arnaudb: "this script has been tested and used here: https://phabricator.wikimedia.org/T378715#10322914" [software] - 10https://gerrit.wikimedia.org/r/1091250 (https://phabricator.wikimedia.org/T378715) (owner: 10Arnaudb) [15:47:25] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=97) Rolling upgrade/restart of Apache Traffic Server on P{cp4043*,cp4051*} and A:cp for 9.2.6-1wm1 [15:47:40] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp4043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:47:46] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4043.ulsfo.wmnet [15:47:50] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:47:58] ^ depooled, looking [15:47:59] !log reedy@deploy2002 Synchronized wmf-config/CommonSettings.php: T379834 (duration: 08m 02s) [15:48:03] T379834: PHP Deprecated: Automatic conversion of false to array is deprecated - https://phabricator.wikimedia.org/T379834 [15:48:04] upgrade didn't go smoothly :) [15:48:53] (03CR) 10Clément Goubert: [C:03+2] wikikube: Add wikikube-worker13[05-12] [puppet] - 10https://gerrit.wikimedia.org/r/1091225 (https://phabricator.wikimedia.org/T377022) (owner: 10Clément Goubert) [15:49:21] 10ops-codfw, 06SRE, 06DC-Ops, 
10Prod-Kubernetes, and 2 others: wikikube-ctrl2002: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379719#10322929 (10JMeybohm) 05Open→03Resolved a:03JMeybohm @Jhancock.wm swapped the cable into port 1, I've changed BIO... [15:49:49] !log installing nss security updates [15:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:38] (03PS4) 10Arnaudb: dbtools: command line helper to evaluate a host, or a group of hosts [software] - 10https://gerrit.wikimedia.org/r/1091250 (https://phabricator.wikimedia.org/T378715) [15:55:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr1-eqiad,cr1-eqiad IPV6,cr1-eqiad.mgmt with reason: router upgrade [15:55:10] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cr1-eqiad,cr1-eqiad IPV6,cr1-eqiad.mgmt with reason: router upgrade [15:55:30] (03CR) 10JHathaway: [C:03+1] puppetserver: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1091245 (owner: 10Muehlenhoff) [15:55:54] PROBLEM - Ensure traffic_server is running for instance backend on cp4043 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:56:22] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on cp4043.ulsfo.wmnet with reason: depooled, debugging [15:56:22] (03CR) 10Michael Große: [C:03+1] [GrowthExperiments] Add virtual domain config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091197 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [15:56:35] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cp4043.ulsfo.wmnet with reason: depooled, debugging [15:57:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr1-eqiad,cr1-eqiad IPV6,re0.cr1-eqiad.mgmt with reason: router upgrade [15:57:16] !log 
pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cr1-eqiad,cr1-eqiad IPV6,re0.cr1-eqiad.mgmt with reason: router upgrade [16:00:05] brennen and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1600). [16:00:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2139.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:01:28] (03PS1) 10Muehlenhoff: Add ml-lab Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1091259 [16:01:42] !log ongoing maintenance on cr1-eqiad [16:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:01] (03PS5) 10Arnaudb: dbtools: command line helper to evaluate a host, or a group of hosts [software] - 10https://gerrit.wikimedia.org/r/1091250 (https://phabricator.wikimedia.org/T378715) [16:02:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:03:57] !log cmooney@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 151575 [16:04:10] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:04:41] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 151575 [16:07:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:10:55] FIRING: [4x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs1017 with peer 208.80.154.196 are failing #page - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable [16:11:02] hmm [16:11:05] ? [16:11:05] uh' [16:11:09] !incidents [16:11:10] 5449 (UNACKED) [4x] PyBalBGPUnstable lvs sre (pybal 64600 208.80.154.196 eqiad) [16:11:10] eqiad is depooled [16:11:10] 5447 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [16:11:10] 5446 (RESOLVED) DDoSDetected sre (netflow5002:9100 eqsin) [16:11:10] ? [16:11:10] 5445 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [16:11:10] 5440 (RESOLVED) [2x] ProbeDown sre (wikikube-ctrl2002:6443 probes/custom codfw) [16:11:17] !ack 5449 [16:11:18] 5449 (ACKED) [4x] PyBalBGPUnstable lvs sre (pybal 64600 208.80.154.196 eqiad) [16:11:20] ah okay [16:11:23] ah ok [16:11:26] so this alert works, nice :) [16:11:39] downtiming [16:11:43] cr1? [16:11:44] !ack 5449 [16:11:45] 5449 (ACKED) [4x] PyBalBGPUnstable lvs sre (pybal 64600 208.80.154.196 eqiad) [16:11:54] ah, ok. 
[16:12:27] silenced for all eqiad [16:13:35] (03CR) 10Jbond: [C:04-1] "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [16:15:27] (03PS4) 10Volans: mysql: remove unused module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087855 [16:15:28] (03PS4) 10Volans: mysql_legacy: rename to mysql [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087856 [16:15:28] (03PS1) 10Volans: mysql: make fetch_one_row return always a dict [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091278 [16:16:07] (03CR) 10Volans: "mypy failure in CI will be fixed by I2d5bc3e26c537acc14e282d9ad23c271c2dba5cd but doesn't change the behaviour of the cookbook so it can " [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans) [16:18:04] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1305.eqiad.wmnet with OS bullseye [16:19:13] (03CR) 10Jbond: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1064114 (owner: 10Cwhite) [16:19:26] (03CR) 10Jbond: [C:03+1] openssh: Remove code to disable NIST key exchange [puppet] - 10https://gerrit.wikimedia.org/r/1074381 (owner: 10Muehlenhoff) [16:23:41] (03CR) 10Jbond: "adding simon as they seem to have picked up the next CR in the chain" [puppet] - 10https://gerrit.wikimedia.org/r/978049 (owner: 10Jbond) [16:29:26] (03CR) 10Jbond: "good idea, the `systemd::sysuser` has an `$additional_groups` param which should DTRT. Will need to be updated in `profile::puppetserver:" [puppet] - 10https://gerrit.wikimedia.org/r/978017 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [16:31:14] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. 
[16:31:31] (03CR) 10Jbond: "This can probably be removed from the chain and either abandond or considered seperatly" [puppet] - 10https://gerrit.wikimedia.org/r/978049 (owner: 10Jbond) [16:31:49] PROBLEM - Host db1190 #page is DOWN: PING CRITICAL - Packet loss = 100% [16:31:52] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [16:31:59] !incidents [16:31:59] 5449 (ACKED) [4x] PyBalBGPUnstable lvs sre (pybal 64600 208.80.154.196 eqiad) [16:31:59] 5451 (UNACKED) Host db1190 (paged) - PING - Packet loss = 100% [16:32:00] 5447 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [16:32:00] 5446 (RESOLVED) DDoSDetected sre (netflow5002:9100 eqsin) [16:32:00] 5445 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [16:32:00] 5440 (RESOLVED) [2x] ProbeDown sre (wikikube-ctrl2002:6443 probes/custom codfw) [16:32:05] !ack 5451 [16:32:05] 5451 (ACKED) Host db1190 (paged) - PING - Packet loss = 100% [16:32:06] PROBLEM - Host ms-fe1012 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:17] I can take a look [16:32:21] turning out to be a nice day [16:32:22] thanks [16:32:23] Amir1: thanks <3 [16:32:26] PROBLEM - Host dbproxy1026 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:28] PROBLEM - Host kubernetes1059 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:28] PROBLEM - Host ml-cache1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:29] yeah [16:32:30] PROBLEM - Host cephosd1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:30] PROBLEM - Host dse-k8s-worker1005 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:31] ok, what is going on? 
[16:32:32] PROBLEM - Host dumpsdata1006 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:34] PROBLEM - Host ms-be1068 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:34] akosiaris: this is the network [16:32:36] ah [16:32:37] yeah [16:32:39] --> #-sre [16:32:40] PROBLEM - Host lvs1013 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:42] PROBLEM - Host elastic1090 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:42] PROBLEM - Host elastic1104 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:43] it's gonna be flooded in here [16:32:49] thanks, I got worried for a sec, just got out of meeting [16:32:52] PROBLEM - Host elastic1089 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:52] PROBLEM - Host logstash1036 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:54] PROBLEM - Router interfaces on pfw1-eqiad is CRITICAL: CRITICAL: host 208.80.154.219, interfaces up: 58, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:33:02] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:33:06] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 4, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:33:10] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:33:11] what should have been downtimed here I wonder [16:33:12] or not [16:33:12] PROBLEM - Host kafka-jumbo1010 is DOWN: PING CRITICAL - Packet loss = 100% [16:33:16] PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:33:16] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, 
dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:33:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1190 sad', diff saved to https://phabricator.wikimedia.org/P71044 and previous config saved to /var/cache/conftool/dbconfig/20241114-163317-ladsgroup.json [16:33:28] PROBLEM - BGP status on ssw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:33:31] PROBLEM - BGP status on pfw1-eqiad is CRITICAL: BGP CRITICAL - AS64701/IPv4: Idle - frack-codfw, AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:33:38] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1190.eqiad.wmnet with reason: Sad [16:33:44] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:33:50] PROBLEM - Host ssw1-e1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:33:50] PROBLEM - Host ssw1-e1-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [16:33:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1190.eqiad.wmnet with reason: Sad [16:34:06] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:34:08] FIRING: [2x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:34:16] PROBLEM - OSPF status on mr1-eqiad is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 
1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:34:20] RECOVERY - Host cephosd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [16:34:20] RECOVERY - Host elastic1089 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [16:34:21] RECOVERY - Host db1190 #page is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [16:34:22] RECOVERY - Host ms-be1068 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [16:34:34] RECOVERY - Host dbproxy1026 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [16:34:34] RECOVERY - Host elastic1104 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [16:34:38] RECOVERY - Host dumpsdata1006 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [16:34:38] RECOVERY - Host elastic1090 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:34:38] RECOVERY - Host logstash1036 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [16:34:38] RECOVERY - Host kafka-jumbo1010 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [16:34:40] RECOVERY - Host ml-cache1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [16:34:42] RECOVERY - Host ms-fe1012 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [16:34:44] RECOVERY - Host dse-k8s-worker1005 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [16:34:46] RECOVERY - Host kubernetes1059 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [16:35:10] RECOVERY - Host lvs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [16:35:12] (03PS1) 10Klausman: ml-staging/experimental: bump max container/pod size to 75G/80G [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091289 [16:36:07] !log swfrench@cumin2002 START - Cookbook sre.discovery.datacenter [16:36:10] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:36:12] !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) [16:36:16] RECOVERY - BFD status on cr2-drmrs is OK: UP: 6 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:36:30] RECOVERY - BGP status on pfw1-eqiad is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:36:35] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1305.eqiad.wmnet with reason: host reimage [16:36:42] PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [16:36:46] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:36:54] RECOVERY - Router interfaces on pfw1-eqiad is OK: OK: host 208.80.154.219, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:36:57] (03CR) 10Klausman: [C:03+1] Add ml-lab Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1091259 (owner: 10Muehlenhoff) [16:37:02] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:37:03] RESOLVED: [2x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:37:08] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:37:16] RECOVERY - OSPF status on mr1-eqiad is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:37:59] !log swfrench@cumin2002 START - Cookbook sre.discovery.datacenter [16:38:02] !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) [16:38:44] (03CR) 10Ilias 
Sarantopoulos: [C:03+1] ml-staging/experimental: bump max container/pod size to 75G/80G [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091289 (owner: 10Klausman) [16:38:52] RECOVERY - Host ssw1-e1-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.97 ms [16:38:52] RECOVERY - Host ssw1-e1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 4.12 ms [16:39:06] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:39:16] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:40:01] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1305.eqiad.wmnet with reason: host reimage [16:45:28] !log swfrench@cumin2002 START - Cookbook sre.discovery.datacenter status all services in all: None - None [16:45:48] !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None [16:48:11] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1090976 (https://phabricator.wikimedia.org/T379807) (owner: 10Cwhite) [16:51:45] !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@7c4873e]: decouple article-level image suggestions from section-level ones [16:52:17] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@7c4873e]: decouple article-level image suggestions from section-level ones (duration: 00m 53s) [16:57:24] !log swfrench@cumin2002 START - Cookbook sre.discovery.datacenter depool all active/active services in eqiad: Network maintenance - None [16:59:17] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1305.eqiad.wmnet with OS bullseye [17:00:05] jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:02:06] (03CR) 10Jbond: [C:04-1] resolvconf: don't update resolv.conf with 0 nameservers (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [17:07:07] (03CR) 10Bking: "The gitlab trusted runners will need to POST to this service...I was thinking that we needed an ingress config for that, but if that's not" [puppet] - 10https://gerrit.wikimedia.org/r/1090977 (https://phabricator.wikimedia.org/T365659) (owner: 10Bking) [17:09:40] RECOVERY - BGP status on ssw1-e1-eqiad.mgmt is OK: BGP OK - up: 15, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:10:59] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1306.eqiad.wmnet with OS bullseye [17:13:11] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:13:11] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:13:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2139.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:14:08] RECOVERY - Ensure traffic_server is running for instance backend on cp4043 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:15:06] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942 (10Ladsgroup) 
03NEW [17:15:49] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=4043.ulsfo.wmnet [17:18:30] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1307.eqiad.wmnet with OS bullseye [17:18:53] !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all active/active services in eqiad: Network maintenance - None [17:18:55] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool db1190 gradually with 4 steps - Maint over [17:18:59] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1308.eqiad.wmnet with OS bullseye [17:21:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1309.eqiad.wmnet with OS bullseye [17:24:32] !log swfrench@cumin2002 START - Cookbook sre.discovery.datacenter status all services in all: None - None [17:24:45] (03CR) 10Bking: "Per IRC conversation with @cdanis@wikimedia.org, it does seem that this patch is necessary." 
[puppet] - 10https://gerrit.wikimedia.org/r/1090977 (https://phabricator.wikimedia.org/T365659) (owner: 10Bking) [17:24:51] !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None [17:25:03] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1310.eqiad.wmnet with OS bullseye [17:25:47] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1311.eqiad.wmnet with OS bullseye [17:26:23] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1312.eqiad.wmnet with OS bullseye [17:27:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2139.codfw.wmnet with OS bookworm [17:27:34] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10323460 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2139.codfw.wmnet with O... [17:27:48] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10323409 (10Ladsgroup) Noting that we are starting to slowly drop all thumbnails in swift as a one-off clean up which would make the change in size of thu... 
[17:29:10] (03PS3) 10Andrew Bogott: resolvconf: don't update resolv.conf with 0 nameservers [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) [17:29:14] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1306.eqiad.wmnet with reason: host reimage [17:29:34] (03PS1) 10DCausse: rdf-streaming-updater: bump to 0.3.150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091306 (https://phabricator.wikimedia.org/T374919) [17:29:48] (03CR) 10CI reject: [V:04-1] resolvconf: don't update resolv.conf with 0 nameservers [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [17:30:13] (03PS2) 10DCausse: rdf-streaming-updater: bump to 0.3.150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091306 (https://phabricator.wikimedia.org/T374919) [17:30:33] (03CR) 10DCausse: [C:04-1] "needs Ife016662f5fde835c21457ef457b567d9be61d2a to be fully deployed everywhere" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091306 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [17:31:18] (03PS4) 10Andrew Bogott: resolvconf: don't update resolv.conf with 0 nameservers [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) [17:31:55] (03CR) 10CI reject: [V:04-1] resolvconf: don't update resolv.conf with 0 nameservers [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [17:32:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1306.eqiad.wmnet with reason: host reimage [17:33:00] (03PS5) 10Andrew Bogott: resolvconf: don't update resolv.conf with 0 nameservers [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) [17:35:08] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [17:37:00] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1307.eqiad.wmnet with reason: host reimage [17:37:01] PROBLEM - Host mr1-esams.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [17:37:28] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1308.eqiad.wmnet with reason: host reimage [17:39:38] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1307.eqiad.wmnet with reason: host reimage [17:39:48] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1309.eqiad.wmnet with reason: host reimage [17:42:38] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1309.eqiad.wmnet with reason: host reimage [17:43:32] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1310.eqiad.wmnet with reason: host reimage [17:44:21] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1311.eqiad.wmnet with reason: host reimage [17:45:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2139.codfw.wmnet with reason: host reimage [17:45:27] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1312.eqiad.wmnet with reason: host reimage [17:46:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1308.eqiad.wmnet with reason: host reimage [17:47:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10323538 (10Papaul) 05Open→03Resolved This is done, re0 is now the master. 
Closing this task ` re0.cr1-eqiad> show chassis routing-engine Routing Engine statu... [17:48:06] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10323542 (10Papaul) [17:48:13] (03CR) 10Bking: wdqs: remove 5 codfw hosts from production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088185 (https://phabricator.wikimedia.org/T376150) (owner: 10Ryan Kemper) [17:48:47] (03CR) 10Andrew Bogott: resolvconf: don't update resolv.conf with 0 nameservers (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott) [17:49:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1311.eqiad.wmnet with reason: host reimage [17:50:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): an-presto1018.eqiad.wmnet: DRAC is down - https://phabricator.wikimedia.org/T378854#10323546 (10Volans) Did you go through https://wikitech.wikimedia.org/wiki/Management_Interfaces#Troubleshooting_Commands ? [17:52:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1306.eqiad.wmnet with OS bullseye [17:53:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2139.codfw.wmnet with reason: host reimage [17:57:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1310.eqiad.wmnet with reason: host reimage [17:59:19] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1307.eqiad.wmnet with OS bullseye [18:00:05] bd808: Time to do the Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1800). [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1800) [18:00:43] nothing for me to deploy today. [18:01:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1312.eqiad.wmnet with reason: host reimage [18:01:31] (03PS3) 10Bking: wdqs: remove 3 codfw hosts from production [puppet] - 10https://gerrit.wikimedia.org/r/1088185 (https://phabricator.wikimedia.org/T376150) (owner: 10Ryan Kemper) [18:02:16] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1309.eqiad.wmnet with OS bullseye [18:03:21] (03PS1) 10Scott French: sre.discovery.datacenter: fix eligible actions in _get_all_services [cookbooks] - 10https://gerrit.wikimedia.org/r/1091314 (https://phabricator.wikimedia.org/T335364) [18:04:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1190 gradually with 4 steps - Maint over [18:04:28] (03PS1) 10Scott French: sre.switchdc.mediawiki: add mwdebug-next to MEDIAWIKI_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/1078736 (https://phabricator.wikimedia.org/T372604) [18:05:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1308.eqiad.wmnet with OS bullseye [18:06:12] (03CR) 10Giuseppe Lavagetto: [C:03+1] sre.switchdc.mediawiki: add mwdebug-next to MEDIAWIKI_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/1078736 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [18:07:58] (03CR) 10Bking: [C:03+2] dse-k8s-services: add CNAME for blunderbuss (nee hdfs-synchronizer) [dns] - 10https://gerrit.wikimedia.org/r/1090972 (https://phabricator.wikimedia.org/T365659) (owner: 10Bking) [18:08:02] (03PS2) 10Bking: dse-k8s-services: add CNAME for blunderbuss (nee hdfs-synchronizer) [dns] - 
10https://gerrit.wikimedia.org/r/1090972 (https://phabricator.wikimedia.org/T365659) [18:08:14] (03CR) 10Bking: [V:03+2 C:03+2] dse-k8s-services: add CNAME for blunderbuss (nee hdfs-synchronizer) [dns] - 10https://gerrit.wikimedia.org/r/1090972 (https://phabricator.wikimedia.org/T365659) (owner: 10Bking) [18:08:20] (03CR) 10Scott French: [C:03+2] sre.switchdc.mediawiki: add mwdebug-next to MEDIAWIKI_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/1078736 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [18:09:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1311.eqiad.wmnet with OS bullseye [18:11:18] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:13:04] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4043 is OK: HTTP OK: HTTP/1.1 200 OK - 48046 bytes in 0.827 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:13:33] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on cp4043.ulsfo.wmnet with reason: depooled, debugging [18:13:36] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cp4043.ulsfo.wmnet with reason: depooled, debugging [18:13:43] (03CR) 10Clément Goubert: [C:03+1] sre.discovery.datacenter: fix eligible actions in _get_all_services [cookbooks] - 10https://gerrit.wikimedia.org/r/1091314 (https://phabricator.wikimedia.org/T335364) (owner: 10Scott French) [18:15:13] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: add mwdebug-next to MEDIAWIKI_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/1078736 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [18:16:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1310.eqiad.wmnet with OS bullseye 
[18:18:59] !log swfrench@cumin2002 START - Cookbook sre.discovery.datacenter [18:19:18] !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) [18:20:37] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1312.eqiad.wmnet with OS bullseye [18:20:57] (03CR) 10Scott French: [C:03+2] sre.discovery.datacenter: fix eligible actions in _get_all_services [cookbooks] - 10https://gerrit.wikimedia.org/r/1091314 (https://phabricator.wikimedia.org/T335364) (owner: 10Scott French) [18:22:29] (03PS1) 10Andrew Bogott: prometheus-openstack-exporter: try to re-enable placement metrics [puppet] - 10https://gerrit.wikimedia.org/r/1091319 [18:23:43] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1091319 (owner: 10Andrew Bogott) [18:27:20] (03Merged) 10jenkins-bot: sre.discovery.datacenter: fix eligible actions in _get_all_services [cookbooks] - 10https://gerrit.wikimedia.org/r/1091314 (https://phabricator.wikimedia.org/T335364) (owner: 10Scott French) [18:28:24] (03CR) 10Andrew Bogott: [C:03+2] prometheus-openstack-exporter: try to re-enable placement metrics [puppet] - 10https://gerrit.wikimedia.org/r/1091319 (owner: 10Andrew Bogott) [18:34:28] (03PS1) 10Jforrester: build: Upgrade mediawiki/mediawiki-codesniffer from v43.0.0 to v45.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091320 (https://phabricator.wikimedia.org/T379955) [18:47:26] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 216, down: 4, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:47:32] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:47:34] PROBLEM - OSPF status on cr1-drmrs is 
CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:47:38] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:47:38] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:47:44] PROBLEM - BFD status on cr1-magru is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:47:46] PROBLEM - BGP status on ssw1-f1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:47:48] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:47:52] PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:47:52] PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:47:54] PROBLEM - OSPF status on mr1-eqiad is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:47:58] PROBLEM - BGP status on pfw1-eqiad is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:48:16] PROBLEM - Router interfaces on pfw1-eqiad is CRITICAL: CRITICAL: host 208.80.154.219, interfaces up: 58, down: 1, dormant: 0, excluded: 0, unused: 
0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:49:21] !next [18:49:35] jouncebot: now [18:49:35] For the next 0 hour(s) and 10 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1800) [18:49:35] For the next 0 hour(s) and 10 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1800) [18:50:48] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:50:52] RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:50:58] RECOVERY - BGP status on pfw1-eqiad is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:51:16] RECOVERY - Router interfaces on pfw1-eqiad is OK: OK: host 208.80.154.219, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:51:34] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:51:40] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:51:52] RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:51:54] RECOVERY - OSPF status on mr1-eqiad is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:54:36] 
(03CR) 10Btullis: "Could you link to that conversation, please?" [puppet] - 10https://gerrit.wikimedia.org/r/1090977 (https://phabricator.wikimedia.org/T365659) (owner: 10Bking) [18:56:42] (03PS15) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [18:56:42] (03PS1) 10Ebernhardson: opensearch: Introduce resource for keystore values [puppet] - 10https://gerrit.wikimedia.org/r/1091325 [18:56:42] (03PS1) 10Ebernhardson: opensearch: Add resource to define cross-cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1091326 [18:56:43] (03PS1) 10Ebernhardson: opensearch: Add resource to log busy threads [puppet] - 10https://gerrit.wikimedia.org/r/1091327 [18:56:44] (03CR) 10Btullis: "Oh right, so it's not actually using LVS is it?" [puppet] - 10https://gerrit.wikimedia.org/r/1090977 (https://phabricator.wikimedia.org/T365659) (owner: 10Bking) [18:59:58] (03PS1) 10Bvibber: Enabling shared globaljsonlinks table in x1 for JsonConfig/Charts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091328 (https://phabricator.wikimedia.org/T379689) [19:00:05] brennen and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7+Utc-0 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T1900). [19:00:26] !log 1.44.0-wmf.3 train status (T375662): no current blockers, but holding for network maintenance. 
[19:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:52] T375662: 1.44.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T375662 [19:01:14] 06SRE-OnFire, 10Incident Tooling: corto: binary doesn't include build information - https://phabricator.wikimedia.org/T379958 (10Eevans) 03NEW [19:03:26] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:03:32] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:03:44] RECOVERY - BGP status on ssw1-f1-eqiad.mgmt is OK: BGP OK - up: 15, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:04:38] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:05:46] RECOVERY - BFD status on cr1-magru is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:06:35] 06SRE-OnFire, 10Incident Tooling: corto: update production deployment for project changes - https://phabricator.wikimedia.org/T379204#10323950 (10Eevans) 05Open→03Resolved [19:12:14] (03CR) 10Ebernhardson: "I do wonder, there is nothing particularly opensearch specific here. 
This is really the same thing we used on elastic, but I was opting to" [puppet] - 10https://gerrit.wikimedia.org/r/1091327 (owner: 10Ebernhardson) [19:12:52] PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:12:52] PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:12:56] PROBLEM - OSPF status on mr1-eqiad is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:12:58] PROBLEM - BGP status on pfw1-eqiad is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:13:00] brennen: sorry the delay. we ran into some issues so unexpected that it would take this long [19:13:13] but there is definitely value in waiting since eqiad is depooled for edge traffic and services [19:13:18] PROBLEM - Router interfaces on pfw1-eqiad is CRITICAL: CRITICAL: host 208.80.154.219, interfaces up: 58, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:13:26] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 216, down: 4, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:13:32] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:13:34] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:13:40] PROBLEM - OSPF 
status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:13:42] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:13:44] PROBLEM - BFD status on cr1-magru is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:13:50] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:14:12] (03PS1) 10Ssingh: trafficserver: explicitly specify user/group for systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1091330 [19:14:24] !log running sre.discovery.datacenter status all to test deployed fix [19:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:30] !log swfrench@cumin2002 START - Cookbook sre.discovery.datacenter [19:14:44] PROBLEM - BGP status on ssw1-f1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:14:50] !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) [19:15:32] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4525/co" [puppet] - 10https://gerrit.wikimedia.org/r/1091330 (owner: 10Ssingh) [19:15:40] sukhe: no worries, we have a long window here on purpose. 
[19:16:32] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:16:40] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:16:44] RECOVERY - BGP status on ssw1-f1-eqiad.mgmt is OK: BGP OK - up: 15, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:16:45] RECOVERY - BFD status on cr1-magru is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:16:50] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:16:52] RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:16:56] RECOVERY - OSPF status on mr1-eqiad is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:16:58] RECOVERY - BGP status on pfw1-eqiad is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:17:12] (03PS2) 10Ssingh: trafficserver: explicitly specify user/group for systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1091330 [19:17:18] RECOVERY - Router interfaces on pfw1-eqiad is OK: OK: host 208.80.154.219, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:17:26] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:17:34] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:17:44] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:17:52] RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:18:25] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4526/co" [puppet] - 10https://gerrit.wikimedia.org/r/1091330 (owner: 10Ssingh) [19:18:34] (03Abandoned) 10BCornwall: apt/varnish: Add/Pin varnish-staging component [puppet] - 10https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:19:25] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-ntp (exit_code=0) rolling restart_daemons on A:dnsbox [19:20:16] !log Running `mwscript-k8s -f -- extensions/WikiLambda/maintenance/updateSecondaryTables.php --wiki=wikifunctionswiki --zType Z8 --report --verbose` for T375972, T367005, T373038, T358737 [19:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:32] T375972: in the object selector, functions that return a Typed list are not available when a Typed list is expected or required - https://phabricator.wikimedia.org/T375972 [19:20:32] T367005: Map function should be correctly type-hinted that it returns a Typed list of Z1s - https://phabricator.wikimedia.org/T367005 [19:20:33] T373038: fetchZidsOfType only returns objects that have at least one label - https://phabricator.wikimedia.org/T373038 [19:20:34] T358737: Object selector cannot select unlabeled object 
by ZID - https://phabricator.wikimedia.org/T358737 [19:21:29] (03Restored) 10BCornwall: apt/varnish: Add/Pin varnish-staging component [puppet] - 10https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:21:38] (03PS3) 10BCornwall: apt/varnish: Add varnish-staging component [puppet] - 10https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) [19:22:34] (03CR) 10BCornwall: apt/varnish: Add varnish-staging component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:25:40] (03PS2) 10Ebernhardson: opensearch: Introduce resource for keystore values [puppet] - 10https://gerrit.wikimedia.org/r/1091325 [19:25:40] (03PS2) 10Ebernhardson: opensearch: Add resource to define cross-cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/1091326 [19:25:40] (03PS2) 10Ebernhardson: opensearch: Add resource to log busy threads [puppet] - 10https://gerrit.wikimedia.org/r/1091327 [19:25:41] (03PS16) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [19:26:37] (03CR) 10CI reject: [V:04-1] opensearch: Add resource to log busy threads [puppet] - 10https://gerrit.wikimedia.org/r/1091327 (owner: 10Ebernhardson) [19:31:51] (03PS3) 10Ebernhardson: opensearch: Add resource to log busy threads [puppet] - 10https://gerrit.wikimedia.org/r/1091327 [19:31:51] (03PS17) 10Ebernhardson: [WIP] Transition relforge to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1090529 [19:31:55] RESOLVED: [4x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs1017 with peer 208.80.154.197 are failing #page - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable [19:32:04] nice [19:32:08] 😌 [19:32:28] we were so split on making this paging. but no regrets [19:32:53] what does that alert check? 
[19:33:09] 10ops-eqiad, 06SRE, 06Data-Persistence-SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10324063 (10Jclark-ctr) @ABran-WMF Dell is requesting SOS report and TSR report from this server and another. can you assist? [19:33:13] it also needs a runbook entry or at least a mention in the wikitech page it links ;) [19:33:45] will just link to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/88526c0114c520878c9c6801ce1ba431b1d3bddf but yes, good idea, will add [19:33:47] ah neat [19:33:48] pybal_bgp_session_established != 1 and ignoring (local_asn, peer) pybal_bgp_enabled == 1 [19:37:25] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqiad [reason: junos upgrade done, T364092] [19:37:28] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqiad [reason: junos upgrade done, T364092] [19:37:29] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [19:39:39] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10324084 (10Papaul) [19:45:51] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091337 (https://phabricator.wikimedia.org/T375662) [19:45:52] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091337 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot) [19:46:37] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091337 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot) [19:51:28] (03PS4) 10Ryan Kemper: wdqs: remove 3 codfw hosts from production [puppet] - 10https://gerrit.wikimedia.org/r/1088185 (https://phabricator.wikimedia.org/T376150) [19:54:10] (03PS5) 10Ryan Kemper: wdqs: remove 3 codfw hosts from 
production [puppet] - 10https://gerrit.wikimedia.org/r/1088185 (https://phabricator.wikimedia.org/T376150) [19:54:10] (03PS4) 10Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) [19:54:10] (03PS2) 10Ryan Kemper: wdqs: new pybal pools for internal graph split [puppet] - 10https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330) [19:54:30] (03CR) 10Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [19:55:30] (03CR) 10Ryan Kemper: wdqs: remove 3 codfw hosts from production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088185 (https://phabricator.wikimedia.org/T376150) (owner: 10Ryan Kemper) [19:55:47] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.3 refs T375662 [19:55:51] T375662: 1.44.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T375662 [19:59:36] (03PS1) 10Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1091340 (https://phabricator.wikimedia.org/T379333) [20:01:40] !log swfrench@cumin2002 START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: Network maintenance complete - None [20:17:24] (03PS1) 10Bvibber: Update charts-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091341 (https://phabricator.wikimedia.org/T375235) [20:18:04] (03CR) 10CDanis: [C:03+1] Update charts-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091341 (https://phabricator.wikimedia.org/T375235) (owner: 10Bvibber) [20:18:42] going to do a deploy of chart-renderer slight update :D [20:20:12] hm, i don't have +2 in that repo :D [20:21:00] ah weird [20:21:01] I'll +2 it [20:21:05] tx [20:21:09] (03CR) 10CDanis: 
[C:+2] Update charts-renderer service [deployment-charts] - https://gerrit.wikimedia.org/r/1091341 (https://phabricator.wikimedia.org/T375235) (owner: Bvibber) [20:21:46] `wmf-deployment` and `mediawiki-services` ldap groups have Submit there [20:22:12] (Merged) jenkins-bot: Update charts-renderer service [deployment-charts] - https://gerrit.wikimedia.org/r/1091341 (https://phabricator.wikimedia.org/T375235) (owner: Bvibber) [20:23:01] !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in eqiad: Network maintenance complete - None [20:23:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:23:32] !log bvibber@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply [20:23:35] !log bvibber@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [20:24:07] !log bvibber@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [20:24:10] !log bvibber@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [20:24:14] !log bvibber@deploy2002 helmfile [codfw] START helmfile.d/services/chart-renderer: apply [20:24:16] !log bvibber@deploy2002 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply [20:24:26] well let's try er out [20:26:38] still renders charts at least :D [20:26:44] uh [20:26:48] I think something didn't work, one moment [20:27:37] ok [20:28:55] !log swfrench@cumin2002 START - Cookbook sre.discovery.datacenter [20:29:14] !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) [20:31:49] (PS1) CDanis: chart-renderer: use 'app'
instead of old 'main_app' [deployment-charts] - https://gerrit.wikimedia.org/r/1091348 (https://phabricator.wikimedia.org/T375235) [20:32:18] so, I didn't actually look at the CI output from your patch at the time, bvibber, but if I had, I would have noticed it had zero effect 😅 [20:32:25] (PS2) CDanis: chart-renderer: use 'app' instead of old 'main_app' [deployment-charts] - https://gerrit.wikimedia.org/r/1091348 (https://phabricator.wikimedia.org/T375235) [20:33:28] aha [20:33:36] like, for instance, the diffs shown on that patch at https://integration.wikimedia.org/ci/job/helm-lint/21551/console [20:33:38] 😅 [20:33:52] lolol [20:33:54] (CR) CDanis: [C:+2] chart-renderer: use 'app' instead of old 'main_app' [deployment-charts] - https://gerrit.wikimedia.org/r/1091348 (https://phabricator.wikimedia.org/T375235) (owner: CDanis) [20:34:24] I don't know how the whole world wound up with "we'll templatize yaml" as being the way to drive k8s, but here we are [20:34:45] bvibber: okay try your deploy again, and it should give you some diffs to look at in helmfile this time too :D [20:34:56] :) [20:34:56] ok [20:34:57] (Merged) jenkins-bot: chart-renderer: use 'app' instead of old 'main_app' [deployment-charts] - https://gerrit.wikimedia.org/r/1091348 (https://phabricator.wikimedia.org/T375235) (owner: CDanis) [20:35:11] !log bvibber@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply [20:35:18] ahh that looks better [20:35:24] great [20:35:53] !log bvibber@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [20:36:44] !log bvibber@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [20:37:20] !log bvibber@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [20:37:34] !log bvibber@deploy2002 helmfile [codfw] START helmfile.d/services/chart-renderer: apply [20:38:05] !log bvibber@deploy2002 helmfile [codfw] DONE helmfile.d/services/chart-renderer:
apply [20:40:01] https://test.wikipedia.org/wiki/Charts we have titles rendering :D [20:40:06] cdanis: i think it worked :D [20:40:29] thanks for walking me through the confusing bits :D <3 [20:43:19] no worries! [20:43:22] (PS1) Herron: aux_k8s: enable new eqiad workers [puppet] - https://gerrit.wikimedia.org/r/1091349 (https://phabricator.wikimedia.org/T378989) [20:47:05] (CR) Herron: [V:+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1091349 (https://phabricator.wikimedia.org/T378989) (owner: Herron) [20:47:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:47:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2139.codfw.wmnet with OS bookworm [20:47:49] ops-codfw, SRE, DC-Ops, serviceops, Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10324342 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2139.codfw.wmnet with OS bo...
[20:50:07] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1082853 (https://phabricator.wikimedia.org/T376923) (owner: Pppery) [20:51:31] ops-codfw, SRE, DC-Ops, serviceops, Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10324347 (Jhancock.wm) [20:53:29] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, and 3 others: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10324366 (RobH) [20:55:39] ops-codfw, SRE, DC-Ops, serviceops, Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10324348 (Jhancock.wm) Open→Resolved @Clement_Goubert This one's complete. took me a minute to get that last one to behave. [20:56:25] (PS3) Herron: role::aux_k8s::worker: add role to 2 new eqiad workers [puppet] - https://gerrit.wikimedia.org/r/1088610 (https://phabricator.wikimedia.org/T378989) [20:56:25] (CR) Herron: [V:+1] "following along with https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Add_or_remove_nodes mostly" [puppet] - https://gerrit.wikimedia.org/r/1088610 (https://phabricator.wikimedia.org/T378989) (owner: Herron) [20:58:00] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, and 3 others: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10324353 (RobH) [21:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241114T2100). Please do the needful. [21:00:06] Pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process.
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:20] here [21:01:02] Pppery: i can deploy - unless you are able and want to self-deploy? [21:01:13] no, i'm a volunteer with no access to anything [21:01:23] gotcha - here we go then - 1 sec [21:01:46] You're not the first person to think I have more technical abilities than I do [21:01:57] thanks cjming, just realized what time it was. [21:01:58] (PS6) Pppery: Redirect to wikis using subpages rather than namespaces too [mediawiki-config] - https://gerrit.wikimedia.org/r/1082853 (https://phabricator.wikimedia.org/T376923) [21:02:34] np! [21:02:55] Pppery: can always fix that! :D [21:03:04] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1082853 (https://phabricator.wikimedia.org/T376923) (owner: Pppery) [21:03:42] (Merged) jenkins-bot: Redirect to wikis using subpages rather than namespaces too [mediawiki-config] - https://gerrit.wikimedia.org/r/1082853 (https://phabricator.wikimedia.org/T376923) (owner: Pppery) [21:04:01] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1082853|Redirect to wikis using subpages rather than namespaces too (T376923)]] [21:04:07] T376923: Setup missing.php layer redirects for wikipedia hosting the other projects too - https://phabricator.wikimedia.org/T376923 [21:05:59] (CR) Ssingh: [C:+1] apt/varnish: Add varnish-staging component [puppet] - https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall) [21:06:55] ops-codfw, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896#10324415 (Jhancock.wm) [21:07:52] ops-codfw, SRE, Data-Persistence, Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10324416 (Jhancock.wm) [21:07:59] !log
cjming@deploy2002 cjming, pppery: Backport for [[gerrit:1082853|Redirect to wikis using subpages rather than namespaces too (T376923)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:08:03] testig [21:08:03] Pppery: on mwdebug if testable [21:08:08] testing now [21:09:04] ops-codfw, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896#10324422 (Jhancock.wm) a:Jhancock.wm [21:09:10] ops-codfw, SRE, Data-Persistence, Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10324423 (Jhancock.wm) a:ABran-WMF→Jhancock.wm [21:10:45] Looks good [21:12:22] ops-codfw, SRE, DC-Ops, Discovery-Search, Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10324429 (Jhancock.wm) a:Jhancock.wm [21:12:57] cool - syncing [21:13:00] !log cjming@deploy2002 cjming, pppery: Continuing with sync [21:13:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:17:46] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082853|Redirect to wikis using subpages rather than namespaces too (T376923)]] (duration: 13m 44s) [21:18:00] T376923: Setup missing.php layer redirects for wikipedia hosting the other projects too - https://phabricator.wikimedia.org/T376923 [21:18:00] Pppery: should be live! [21:18:17] Thanks [21:18:21] yw!
[21:19:06] i gotta run - so i'll err on closing the window for now [21:20:50] !log end of UTC late backport window [21:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:41] (CR) Pppery: Add 'rup' as alias for 'roa-rup' (1 comment) [puppet] - https://gerrit.wikimedia.org/r/527917 (https://phabricator.wikimedia.org/T17988) (owner: Fomafix) [21:23:01] SRE, Traffic-Icebox, Wikimedia-Apache-configuration, Wiki-Setup (Delete / Redirect): redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648#10324468 (Pppery) [21:24:21] (CR) BCornwall: [C:+2] apt/varnish: Add varnish-staging component [puppet] - https://gerrit.wikimedia.org/r/1090572 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall) [21:25:34] PROBLEM - Host ganeti2042 is DOWN: PING CRITICAL - Packet loss = 100% [21:26:01] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@2220747]: Stage Refine test fix [21:26:13] (CR) Fomafix: Add 'rup' as alias for 'roa-rup' (1 comment) [puppet] - https://gerrit.wikimedia.org/r/527917 (https://phabricator.wikimedia.org/T17988) (owner: Fomafix) [21:26:17] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@2220747]: Stage Refine test fix (duration: 00m 16s) [21:30:02] RECOVERY - Host ganeti2042 is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms [21:47:34] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@7a66849]: Stage Refine: fix Airflow skip [21:47:49] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@7a66849]: Stage Refine: fix Airflow skip (duration: 00m 14s) [21:48:07] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@7a66849]: Stage Refine: fix Airflow skip [21:49:06] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@7a66849]: Stage Refine: fix Airflow skip (duration: 00m 59s) [22:03:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch
more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [22:09:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [22:13:44] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [22:14:22] PROBLEM - Ensure traffic_server is running for instance backend on cp4043 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [22:17:25] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:26:00] (PS1) Aleksandar Mastilovic: Rename to Blunderbuss [deployment-charts] - https://gerrit.wikimedia.org/r/1091389 [22:26:52] (CR) CI reject: [V:-1] Rename to Blunderbuss [deployment-charts] - https://gerrit.wikimedia.org/r/1091389 (owner: Aleksandar Mastilovic) [22:28:51] (CR) Bking: [C:+2] wdqs: remove 3 codfw hosts from production [puppet] - https://gerrit.wikimedia.org/r/1088185 (https://phabricator.wikimedia.org/T376150) (owner: Ryan Kemper) [22:30:49] !log T376150 Depooled `wdqs20[18-20]` in preparation of
merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1088185 [22:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:02] T376150: Prepare 5 codfw hosts to serve wdqs-internal from main graph - https://phabricator.wikimedia.org/T376150 [22:31:56] (CR) Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper) [22:32:25] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:36:47] (PS5) Bking: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper) [22:36:58] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper) [22:37:12] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp4043.ulsfo.wmnet with reason: ATS upgrade 9.2.6 [22:37:28] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp4043.ulsfo.wmnet with reason: ATS upgrade 9.2.6 [22:38:58] PROBLEM - Hadoop NodeManager on an-worker1120 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:42:58] (PS6) Bking: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper) [22:42:58] RECOVERY - Hadoop NodeManager on an-worker1120 is OK: PROCS OK: 1
process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:43:06] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper) [22:49:04] (PS7) Bking: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper) [22:49:13] (PS1) Andrew Bogott: Openstack placement: make read-only endpoints public [puppet] - https://gerrit.wikimedia.org/r/1091392 [22:50:12] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1091392 (owner: Andrew Bogott) [22:50:59] (PS8) Bking: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper) [22:52:36] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper) [22:52:54] (PS2) Andrew Bogott: Openstack placement: make read-only endpoints public [puppet] - https://gerrit.wikimedia.org/r/1091392 [22:52:57] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1091392 (owner: Andrew Bogott) [22:56:09] (PS9) Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) [22:56:09] (PS3) Ryan Kemper: wdqs: new pybal pools for internal graph split [puppet] - https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330) [22:56:09] (PS2) Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - https://gerrit.wikimedia.org/r/1091340
(https://phabricator.wikimedia.org/T379333) [22:56:46] (CR) Andrew Bogott: [C:+2] Openstack placement: make read-only endpoints public [puppet] - https://gerrit.wikimedia.org/r/1091392 (owner: Andrew Bogott) [22:58:13] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper) [22:59:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [22:59:37] (PS10) Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) [22:59:37] (PS4) Ryan Kemper: wdqs: new pybal pools for internal graph split [puppet] - https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330) [22:59:37] (PS3) Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - https://gerrit.wikimedia.org/r/1091340 (https://phabricator.wikimedia.org/T379333) [23:00:15] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper) [23:31:55] SRE-swift-storage, Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10324792 (Ladsgroup) I'm deleting all thumbnails on every container except commons right now. Only on codfw and in alphabetical order and in serial. Right now, it's on enwikibooks (...
[23:44:22] (PS1) Scott French: debug.json: add support for mwdebug-next [mediawiki-config] - https://gerrit.wikimedia.org/r/1076848 (https://phabricator.wikimedia.org/T372605) [23:48:57] PROBLEM - Work requests waiting in Zuul Gearman server on contint1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [23:53:21] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [23:53:57] RECOVERY - Work requests waiting in Zuul Gearman server on contint1002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [23:58:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh