[00:00:38] RESOLVED: [15x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:04:14] (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1056268 (owner: TrainBranchBot) [00:07:37] ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10009135 (Jclark-ctr) [00:08:40] ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10009137 (Jclark-ctr) a:Jclark-ctr [00:09:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:12:47] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host pc1017.eqiad.wmnet with OS bookworm [00:12:59] ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10009155 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm [00:14:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:25:38] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:29:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:59:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:04:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:14:55] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host pc2017.codfw.wmnet with OS bookworm [01:15:06] ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install pc2017 - https://phabricator.wikimedia.org/T369658#10009184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc2017.codfw.wmnet with OS bookworm executed with errors: - pc2017... 
[01:15:38] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:19:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:49:20] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:54:20] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:05:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:07:01] (PS5) Dzahn: phabricator: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677) [02:09:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:09:25] (CR) Dzahn: 
"https://puppet-compiler.wmflabs.org/output/1055886/3388/" [puppet] - https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) (owner: Jelto) [02:14:21] (CR) Dzahn: [V:+1] "if you click through to the full change catalog in the compiler results it shows how the config file is present in 1003 and absent on 1004" [puppet] - https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) (owner: Jelto) [02:39:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:39:20] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:44:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:55:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:59:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:59:20] RESOLVED: JobUnavailable: Reduced 
availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:29:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:34:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:36:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:45:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:49:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:19:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:24:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:35:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:39:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:09:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:12:43] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host deploy1003.eqiad.wmnet with OS bullseye [05:14:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:24:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:25:38] FIRING: 
[3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:29:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:31:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T367856)', diff saved to https://phabricator.wikimedia.org/P66906 and previous config saved to /var/cache/conftool/dbconfig/20240724-053128-marostegui.json [05:31:33] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [05:45:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [05:45:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [05:46:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P66907 and previous config saved to /var/cache/conftool/dbconfig/20240724-054635-marostegui.json [05:59:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240724T0600) [06:01:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to 
https://phabricator.wikimedia.org/P66908 and previous config saved to /var/cache/conftool/dbconfig/20240724-060142-marostegui.json [06:04:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:16:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T367856)', diff saved to https://phabricator.wikimedia.org/P66909 and previous config saved to /var/cache/conftool/dbconfig/20240724-061650-marostegui.json [06:16:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db2174.codfw.wmnet with reason: Maintenance [06:17:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db2174.codfw.wmnet with reason: Maintenance [06:17:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T367856)', diff saved to https://phabricator.wikimedia.org/P66910 and previous config saved to /var/cache/conftool/dbconfig/20240724-061712-marostegui.json [06:49:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on 
mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:54:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:54:42] !log deploy CR1056198 Rename LVS-service-IPs prefix-list [06:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:05] Amir1 and Urbanecm: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240724T0700). Please do the needful. [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:05:38] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:09:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:36:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:39:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:44:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:05:59] (CR) CI reject: [V:-1] (WIP) modules/app: update to job 2.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: Effie Mouzeli) [08:24:43] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1056268 (owner: TrainBranchBot) [08:26:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:29:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:33:09] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1056445 [08:33:09] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1056445 (owner: TrainBranchBot) [08:34:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:36:46] (CR) Filippo Giunchedi: "Alert LGTM, though 
please reach out to dcops for heads up and see if they are ready to take the tasks (e.g. in terms of procedures, debugg" [alerts] - https://gerrit.wikimedia.org/r/1054649 (https://phabricator.wikimedia.org/T368088) (owner: Herron) [08:38:10] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1056445 (owner: TrainBranchBot) [08:39:06] (CR) Filippo Giunchedi: "I've glanced at the patch and LGTM, thus virtual +1, I'm not voting as I'm not following puppet(server) development/maintenance" [puppet] - https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) (owner: Andrew Bogott) [08:39:39] (CR) Filippo Giunchedi: "I've glanced at the patch and LGTM, thus virtual +1, I'm not voting as I'm not following puppet(server) development/maintenance" [puppet] - https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) (owner: Andrew Bogott) [08:41:09] (PS4) Hashar: puppetmaster: set git::clone umask to match requested file mode [puppet] - https://gerrit.wikimedia.org/r/1056201 (https://phabricator.wikimedia.org/T338277) [08:43:22] (PS14) Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [08:44:33] (CR) CI reject: [V:-1] (WIP) modules/app: update to job 2.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: Effie Mouzeli) [08:45:38] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:46:37] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 
https://gerrit.wikimedia.org/r/1056447 [08:46:37] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1056447 (owner: TrainBranchBot) [08:49:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:51:30] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1056142 (https://phabricator.wikimedia.org/T370765) (owner: Anzx) [08:52:44] (PS15) Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [08:53:58] (CR) CI reject: [V:-1] (WIP) modules/app: update to job 2.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: Effie Mouzeli) [09:00:54] (PS2) Stevemunene: wdqs: add main and scholarly role assignments [puppet] - https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364364) [09:01:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on kubernetes1055:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:03:52] (PS16) Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [09:04:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:04:49] (CR) CI reject: [V:-1] (WIP) modules/app: update to job 2.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: Effie Mouzeli) [09:05:38] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:07:55] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:08:10] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on kubernetes1055:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:12:55] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:13:50] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1056447 (owner: TrainBranchBot) [09:16:25] (PS17) Effie Mouzeli: cronjobs : update modules to job 2.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [09:19:20] FIRING: [3x] SystemdUnitFailed: 
update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:24:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:27:55] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:35:38] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:36:40] (CR) Jelto: "thanks! 
looks mostly good, two nits and one comment about a service port parameter" [puppet] - https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) (owner: Jelto) [09:38:10] (PS1) Filippo Giunchedi: admin: add tappof to ops [puppet] - https://gerrit.wikimedia.org/r/1056452 [09:39:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:45:32] (CR) Effie Mouzeli: cronjobs : update modules to job 2.0.0 (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: Effie Mouzeli) [09:47:08] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10009576 (cmooney) [09:50:54] SRE, Infrastructure-Foundations, netops: Netbox automation to move selected hosts from ASW to LSW - https://phabricator.wikimedia.org/T370846 (cmooney) NEW p:Triage→Medium [09:51:01] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10009596 (cmooney) [09:51:02] SRE, Infrastructure-Foundations, netops: Netbox automation to move selected hosts from ASW to LSW - https://phabricator.wikimedia.org/T370846#10009595 (cmooney) [09:52:55] jouncebot: now [09:52:56] No deployments scheduled for the next 0 hour(s) and 7 minute(s) [09:53:01] jouncebot: next [09:53:01] In 0 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240724T1000) [09:53:28] (PS3) Effie Mouzeli: mediawiki: make mcrouter deployment compatible with mcrouter 1.3 
[deployment-charts] - https://gerrit.wikimedia.org/r/1055871 [09:55:35] (PS4) Effie Mouzeli: mediawiki: make mcrouter deployment compatible with mcrouter 1.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1055871 [09:55:36] (PS1) Effie Mouzeli: DNM: testing minor change [deployment-charts] - https://gerrit.wikimedia.org/r/1056459 [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240724T1000) [10:03:33] jouncebot: nowandnext [10:03:33] For the next 0 hour(s) and 56 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240724T1000) [10:03:34] In 0 hour(s) and 56 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240724T1100) [10:03:40] effie: ^ potentially useful shortcut btw :) [10:06:16] ops-codfw, SRE, DBA, DC-Ops, and 2 others: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852 (cmooney) NEW p:Triage→Medium [10:06:54] ops-codfw, SRE, DBA, DC-Ops, and 2 others: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852#10009700 (cmooney) [10:06:54] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10009701 (cmooney) [10:07:15] (PS3) Elukey: WIP dhcp: add dhcp_filename and dhcp_options [software/spicerack] - https://gerrit.wikimedia.org/r/1056176 [10:09:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:13:19] (CR) CI reject: [V:-1] WIP dhcp: add dhcp_filename and 
dhcp_options [software/spicerack] - https://gerrit.wikimedia.org/r/1056176 (owner: Elukey) [10:14:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:15:09] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 21 days, 0:00:00 on 16 hosts with reason: Legacy appserver spindown [10:15:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 21 days, 0:00:00 on 16 hosts with reason: Legacy appserver spindown [10:15:50] SRE, MW-on-K8s, serviceops, Traffic, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#10009738 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=52c5c269-d4e9-4489-a397-00874b75eb1c) set by cgoubert@cumin1002 for 21 days, 0:0... 
[10:16:41] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host deploy1003.eqiad.wmnet with OS bullseye [10:17:39] Lucas_WMDE: I know:) I just like chatting with the bot, that is all [10:17:47] okay :) [10:19:18] (PS1) Abijeet Patro: TranslatablePage: Split translatable page id cache into multiple shards [extensions/Translate] (wmf/1.43.0-wmf.15) - https://gerrit.wikimedia.org/r/1056460 (https://phabricator.wikimedia.org/T366455) [10:25:38] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:28:30] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on deploy1003.eqiad.wmnet with reason: host reimage [10:29:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:31:01] jouncebot: nowandnext [10:31:01] For the next 0 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240724T1000) [10:31:01] In 0 hour(s) and 28 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240724T1100) [10:33:17] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on deploy1003.eqiad.wmnet with reason: host reimage [10:34:19] (PS2) Filippo Giunchedi: admin: add tappof to sre-admins [puppet] - https://gerrit.wikimedia.org/r/1056452 [10:35:02] (PS1) Dreamy Jazz: Remove now unused $wgGlobalBlockingDatabase definition [mediawiki-config] - https://gerrit.wikimedia.org/r/1056462 (https://phabricator.wikimedia.org/T370856) [10:35:17] 
(03CR) 10Elukey: "Jesse: one thing that was suggested by Filippo is to check the LDAP groups, since in theory we have sre-admins and we'd need to create ops" [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [10:36:21] I'd like to do a deploy if that's okay? [10:36:54] If not, I can wait till the window. [10:37:06] Dreamy_Jazz: let me do quick one [10:37:15] Sure. [10:37:42] (03PS1) 10AOkoth: install: adjust vrts partition configs [puppet] - 10https://gerrit.wikimedia.org/r/1056463 (https://phabricator.wikimedia.org/T369674) [10:40:11] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Update codfw LVS connectivity to support new LSW in rows C & D - https://phabricator.wikimedia.org/T370635#10009886 (10cmooney) The LVS moves are a pre-requisite before we start moving other hosts, so I am going to start pepping the... [10:48:49] (03CR) 10Elukey: [C:04-1] "Precautionary -1, just want to have time to go through this before we go ahead." 
[puppet] - 10https://gerrit.wikimedia.org/r/1056220 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [10:49:25] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki: make mcrouter deployment compatible with mcrouter 1.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055871 (owner: 10Effie Mouzeli) [10:50:59] (03PS1) 10Giuseppe Lavagetto: profile::haproxy: move tls_terminator.pp to profile module [puppet] - 10https://gerrit.wikimedia.org/r/1056466 [10:51:28] (03Merged) 10jenkins-bot: mediawiki: make mcrouter deployment compatible with mcrouter 1.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055871 (owner: 10Effie Mouzeli) [10:51:38] (03PS1) 10Clément Goubert: site.pp: Put legacy api and appservers insetup [puppet] - 10https://gerrit.wikimedia.org/r/1056467 (https://phabricator.wikimedia.org/T367949) [10:52:44] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862 (10cmooney) 03NEW p:05Triage→03Medium [10:52:54] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10009979 (10cmooney) [10:52:57] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Update codfw LVS connectivity to support new LSW in rows C & D - https://phabricator.wikimedia.org/T370635#10009980 (10cmooney) [10:53:54] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:54:01] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:54:20] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:54:34] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:54:47] (03CR) 10Elukey: [C:04-1] "I am not 100% convinced that this is a good way to go, for some reasons:" [puppet] - 
10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [10:55:25] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:57:28] !log jiji@deploy1002 Started scap sync-world: Noop, bumping mediawiki chart version [10:58:33] (03Abandoned) 10Effie Mouzeli: DNM: testing minor change [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056459 (owner: 10Effie Mouzeli) [10:59:05] (03Abandoned) 10Effie Mouzeli: mw-mcouter: use bookworm images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051740 (https://phabricator.wikimedia.org/T368366) (owner: 10Effie Mouzeli) [10:59:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:00:01] !log jiji@deploy1002 Finished scap: Noop, bumping mediawiki chart version (duration: 02m 32s) [11:00:04] mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240724T1100). 
nyaa~ [11:00:25] RESOLVED: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:00:38] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:00:43] Dreamy_Jazz: I am done [11:00:50] Thanks! [11:01:08] I was running out of battery, I had to switch places, sorry for the delay [11:01:29] No problem. [11:01:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056462 (https://phabricator.wikimedia.org/T370856) (owner: 10Dreamy Jazz) [11:02:06] Going to deploy now. [11:02:38] (03Merged) 10jenkins-bot: Remove now unused $wgGlobalBlockingDatabase definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056462 (https://phabricator.wikimedia.org/T370856) (owner: 10Dreamy Jazz) [11:02:58] (03CR) 10Volans: "The approach looks good to me. Very few minor aesthetical nits. I skipped the tests for now." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1056176 (owner: 10Elukey) [11:03:37] !log dreamyjazz@deploy1002 Started scap sync-world: Backport for [[gerrit:1056462|Remove now unused $wgGlobalBlockingDatabase definition (T370856)]] [11:03:41] T370856: Remove now un-used $wgGlobalBlockingDatabase definition in operations/mediawiki-config - https://phabricator.wikimedia.org/T370856 [11:06:03] !log dreamyjazz@deploy1002 dreamyjazz: Backport for [[gerrit:1056462|Remove now unused $wgGlobalBlockingDatabase definition (T370856)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:06:18] !log dreamyjazz@deploy1002 dreamyjazz: Continuing with sync [11:08:01] (03CR) 10EoghanGaffney: [C:03+1] install: adjust vrts partition configs [puppet] - 10https://gerrit.wikimedia.org/r/1056463 (https://phabricator.wikimedia.org/T369674) (owner: 10AOkoth) [11:11:04] !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1056462|Remove now unused $wgGlobalBlockingDatabase definition (T370856)]] (duration: 07m 27s) [11:11:13] T370856: Remove now un-used $wgGlobalBlockingDatabase definition in operations/mediawiki-config - https://phabricator.wikimedia.org/T370856 [11:11:42] Finished my deploys [11:15:38] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:15:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [11:19:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:24:18] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852#10010089 (10Volans) Without too much previous experience from past migrations, I think we could tackle it per DB section (aka shard), moving all easily... [11:24:20] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:26:33] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852#10010096 (10Ladsgroup) This should have the map: https://fault-tolerance.toolforge.org/map?cluster=s1 [11:30:21] (03CR) 10Hnowlan: [C:03+1] "nice nice nice" [puppet] - 10https://gerrit.wikimedia.org/r/1056467 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [11:31:56] (03PS82) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [11:32:18] (03PS83) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [11:32:21] (03CR) 10CI reject: [V:04-1] prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [11:32:42] (03CR) 10CI reject: [V:04-1] prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [11:33:52] (03PS84) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [11:34:10] (03CR) 10Clément 
Goubert: [C:03+2] site.pp: Put legacy api and appservers insetup [puppet] - 10https://gerrit.wikimedia.org/r/1056467 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [11:38:28] (03PS1) 10Cathal Mooney: lvs2012: move row C & D vlans to primary uplink and and new ones [puppet] - 10https://gerrit.wikimedia.org/r/1056478 (https://phabricator.wikimedia.org/T370862) [11:42:06] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10010124 (10cmooney) [11:43:38] (03PS2) 10Cathal Mooney: lvs2012: move row C & D vlans to primary uplink and add new ones [puppet] - 10https://gerrit.wikimedia.org/r/1056478 (https://phabricator.wikimedia.org/T370862) [11:44:47] (03PS1) 10Clément Goubert: sre.mediawiki.restart-appservers: Remove legacy [cookbooks] - 10https://gerrit.wikimedia.org/r/1056470 (https://phabricator.wikimedia.org/T367949) [11:45:01] (03PS2) 10Clément Goubert: sre.mediawiki.route-traffic: Use switchdc defined services [cookbooks] - 10https://gerrit.wikimedia.org/r/1056471 (https://phabricator.wikimedia.org/T367949) [11:45:12] (03PS3) 10Cathal Mooney: lvs2012: move row C & D vlans to primary uplink and add new ones [puppet] - 10https://gerrit.wikimedia.org/r/1056478 (https://phabricator.wikimedia.org/T370862) [11:45:23] (03PS2) 10Clément Goubert: sre.switchdc.mediawiki: No-op formatting change [cookbooks] - 10https://gerrit.wikimedia.org/r/1056472 (https://phabricator.wikimedia.org/T367949) [11:45:36] (03PS2) 10Clément Goubert: sre.switchdc.mediawiki: Remove legacy services [cookbooks] - 10https://gerrit.wikimedia.org/r/1056473 (https://phabricator.wikimedia.org/T367949) [11:49:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:50:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:56:49] (03PS1) 10Clément Goubert: Don't force puppet 7 on legacy appservers [puppet] - 10https://gerrit.wikimedia.org/r/1056481 (https://phabricator.wikimedia.org/T367949) [11:57:53] (03CR) 10Clément Goubert: [C:03+2] Don't force puppet 7 on legacy appservers [puppet] - 10https://gerrit.wikimedia.org/r/1056481 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [11:59:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056460 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [12:04:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:07:32] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10010171 (10cmooney) [12:09:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:12:44] (03PS1) 10Sergio Gimeno: dewiki: enable CommunityConfiguration 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056484 (https://phabricator.wikimedia.org/T370261) [12:14:20] FIRING: [3x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:16:28] (03PS1) 10Brouberol: growthbook: replace ferretdb by mongo itself [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056485 (https://phabricator.wikimedia.org/T365839) [12:17:12] (03CR) 10CI reject: [V:04-1] growthbook: replace ferretdb by mongo itself [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056485 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [12:18:28] (03PS2) 10Brouberol: growthbook: replace ferretdb by mongo itself [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056485 (https://phabricator.wikimedia.org/T365839) [12:23:35] (03CR) 10Filippo Giunchedi: "Joanna: sre-admins group lists you as approval, would you mind taking a look? thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1056452 (owner: 10Filippo Giunchedi) [12:23:37] (03CR) 10David Caro: [C:03+2] envvars backend: update endpoints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1055146 (https://phabricator.wikimedia.org/T365014) (owner: 10Slavina Stefanova) [12:24:39] (03PS1) 10David Caro: replica_cnf: fix bad endpoint path [puppet] - 10https://gerrit.wikimedia.org/r/1056489 [12:27:14] (03CR) 10CI reject: [V:04-1] replica_cnf: fix bad endpoint path [puppet] - 10https://gerrit.wikimedia.org/r/1056489 (owner: 10David Caro) [12:30:26] (03PS2) 10David Caro: replica_cnf: fix bad endpoint path [puppet] - 10https://gerrit.wikimedia.org/r/1056489 [12:31:27] !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@24f95a8]: (no justification provided) [12:31:58] !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@24f95a8]: (no justification provided) (duration: 00m 30s) [12:39:20] FIRING: [3x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:39:21] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host deploy1003.eqiad.wmnet with OS bullseye [12:40:38] FIRING: [3x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:40:51] (03PS1) 10Slavina Stefanova: replica-cnf-api: fix envvars endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1056494 (https://phabricator.wikimedia.org/T365014) [12:41:17] (03CR) 10Slavina Stefanova: envvars backend: update endpoints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1055146 (https://phabricator.wikimedia.org/T365014) 
(owner: 10Slavina Stefanova) [12:43:21] (03CR) 10CI reject: [V:04-1] replica-cnf-api: fix envvars endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1056494 (https://phabricator.wikimedia.org/T365014) (owner: 10Slavina Stefanova) [12:46:17] (03PS3) 10D3r1ck01: [wmf-config] Remove trailing slash in SSO domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056495 [12:47:41] (03CR) 10Jelto: "comment in-line" [puppet] - 10https://gerrit.wikimedia.org/r/1056463 (https://phabricator.wikimedia.org/T369674) (owner: 10AOkoth) [12:48:03] (03PS2) 10Slavina Stefanova: replica-cnf-api: fix envvars endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1056494 (https://phabricator.wikimedia.org/T365014) [12:48:54] (03PS2) 10Sergio Gimeno: frwiktionary, dewiki: enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056484 (https://phabricator.wikimedia.org/T370261) [12:52:39] (03CR) 10David Caro: [C:03+2] replica_cnf: fix bad endpoint path [puppet] - 10https://gerrit.wikimedia.org/r/1056489 (owner: 10David Caro) [12:54:20] FIRING: [3x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:57:54] (03PS1) 10Kamila Součková: shellbox-video: set mesh.idle_timeout to 1d [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056498 (https://phabricator.wikimedia.org/T356241) [12:58:27] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852#10010285 (10Volans) @Ladsgroup that's looks very useful, I didn't know about it, is it mentioned anywhere? I can't find in wikitech. Having glimpsed at... 
[12:59:20] FIRING: [3x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:59:28] (03PS10) 10Jelto: firewall/gitlab: add option to throttle and drop traffic using nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) [12:59:53] (03CR) 10CI reject: [V:04-1] firewall/gitlab: add option to throttle and drop traffic using nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240724T1300). [13:00:05] anzx and abijeet: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:08] o/ [13:04:17] o/ [13:04:23] I can probably deploy in a few minutes [13:04:53] (03CR) 10Ssingh: [C:03+1] "Looks good, compared against Netbox as well!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1056478 (https://phabricator.wikimedia.org/T370862) (owner: 10Cathal Mooney) [13:05:07] hello [13:06:53] (03PS11) 10Jelto: firewall/gitlab: add option to throttle and drop traffic using nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) [13:07:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056484 (https://phabricator.wikimedia.org/T370261) (owner: 10Sergio Gimeno) [13:07:59] (03PS3) 10Brouberol: growthbook: replace ferretdb by mongo itself [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056485 (https://phabricator.wikimedia.org/T365839) [13:10:19] hello. Is anyone available for the deployment window? I have a patch for backport: 1056460 TranslatablePage: Split translatable page id cache into multiple shards - task T366455 [13:10:29] 1056460: TranslatablePage: Split translatable page id cache into multiple shards | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1056460 [13:10:29] yes, I’m here [13:10:37] currently looking at the change by anzx [13:10:38] FIRING: [3x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:10:43] and trying to figure out if I feel confident enough to deploy it ^^ [13:10:48] Lucas_WMDE, OK :-) [13:11:02] * Lucas_WMDE tries to find the knwiki change [13:11:16] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3392/co" [puppet] - 10https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) (owner: 
10Jelto) [13:11:42] apparently that was https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/350439, about a month after I joined WMDE heh [13:12:06] Lucas_WMDE: T133137 [13:12:06] T133137: Local upload on Kannada Wikipedia - https://phabricator.wikimedia.org/T133137 [13:12:59] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068#10010334 (10ssingh) https://grafana.wikimedia.org/goto/8urj7LXIR?orgId=1 {F56642739} The hypothesis that reducing logging should help the CPU... [13:13:32] ok, policy is indeed identical [13:13:40] I guess that’s fine then [13:13:59] (03PS4) 10Anzx: knwikisource: Enable local uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056142 (https://phabricator.wikimedia.org/T370765) [13:14:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056142 (https://phabricator.wikimedia.org/T370765) (owner: 10Anzx) [13:14:11] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host pc1017.eqiad.wmnet with OS bookworm [13:14:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10010355 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm executed with errors: - pc1017 (*... 
[13:14:20] FIRING: [3x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:14:49] (03Merged) 10jenkins-bot: knwikisource: Enable local uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056142 (https://phabricator.wikimedia.org/T370765) (owner: 10Anzx) [13:15:16] (03PS1) 10Ssingh: Release 0.9.8-1+wmf12u2 [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1056500 (https://phabricator.wikimedia.org/T370068) [13:15:18] !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1056142|knwikisource: Enable local uploads (T370765)]] [13:15:22] T370765: Enable Local uploads on Kannada Wikisource - https://phabricator.wikimedia.org/T370765 [13:15:53] (03PS1) 10Ssingh: Revert "hiera: dns6001: reduce anycast_hc logging level and backups" [puppet] - 10https://gerrit.wikimedia.org/r/1056501 [13:16:35] (03CR) 10Ssingh: "For awareness mostly, @ayounsi@wikimedia.org" [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1056500 (https://phabricator.wikimedia.org/T370068) (owner: 10Ssingh) [13:16:44] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit for backport" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056460 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [13:17:19] abijeet: should we also backport the change to wmf.14 or did you leave it out intentionally? [13:17:24] wmf.15 is only on group0 at the moment [13:17:59] Lucas_WMDE, I think it should be fine to backport to wmf.15 only. 
[13:18:03] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, anzx: Backport for [[gerrit:1056142|knwikisource: Enable local uploads (T370765)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:18:16] Lucas_WMDE: checking [13:18:31] abijeet: okay [13:19:01] Lucas_WMDE: looks good [13:19:42] abijeet: the train is going to roll when we are not online, we may want to consider backporting to wmf.14 [13:19:51] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, anzx: Continuing with sync [13:20:06] oh too late [13:20:19] OK, let me schedule that too. [13:20:30] you can still schedule a wmf.14 backport if you want [13:20:36] jouncebot: next [13:20:36] In 0 hour(s) and 39 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240724T1400) [13:20:43] ok, we should avoid running into that window [13:21:01] but if you upload it now there might still be enough time for CI to go through [13:21:02] (03PS1) 10Abijeet Patro: TranslatablePage: Split translatable page id cache into multiple shards [extensions/Translate] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056502 (https://phabricator.wikimedia.org/T366455) [13:21:18] (03PS2) 10Giuseppe Lavagetto: profile::haproxy: move tls_terminator.pp to profile module [puppet] - 10https://gerrit.wikimedia.org/r/1056466 [13:21:24] Lucas_WMDE: I reckon abi is working on it as we speak [13:21:25] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:21:40] yeah, I can see it in wikibugs above :) [13:21:54] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of backport" [extensions/Translate] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056502 
(https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [13:23:36] Lucas_WMDE, thanks, done: 1056502: TranslatablePage: Split translatable page id cache into multiple shards | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1056502 [13:23:41] (03CR) 10Fabfur: [C:03+1] "LGTM!" [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1056500 (https://phabricator.wikimedia.org/T370068) (owner: 10Ssingh) [13:24:03] (03CR) 10Ssingh: [C:03+2] Release 0.9.8-1+wmf12u2 [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1056500 (https://phabricator.wikimedia.org/T370068) (owner: 10Ssingh) [13:24:04] also added in https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240724T1300 [13:24:20] FIRING: [5x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:24:41] I’m wondering if we want to deploy both backports together or separately [13:24:59] I’m a bit worried about the potential effect of basically starting over from a cold cache in this feature [13:25:16] (since I don’t see a fallback to the old key in the code – but maybe I’m missing something) [13:25:33] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1056142|knwikisource: Enable local uploads (T370765)]] (duration: 10m 14s) [13:25:36] so if we have enough time, I feel like it’s safer to deploy wmf.15 first, and see if there are any issues from that, before doing wmf.14 [13:25:37] T370765: Enable Local uploads on Kannada Wikisource - https://phabricator.wikimedia.org/T370765 [13:25:43] MetaWiki is the wiki with the largest users of translatable pages. 
[13:25:45] but first they need to make it through CI anyway ^^ [13:25:54] (03PS3) 10Giuseppe Lavagetto: profile::haproxy: move tls_terminator.pp to profile module [puppet] - 10https://gerrit.wikimedia.org/r/1056466 [13:26:03] ok, and metawiki is on wmf.14 it seems [13:26:44] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3394/co" [puppet] - 10https://gerrit.wikimedia.org/r/1056466 (owner: 10Giuseppe Lavagetto) [13:26:54] Lucas_WMDE: I will trust your judgement here [13:27:30] o_O I just realized what the code does [13:27:43] it’s *one* huge array of all page IDs of translatable pages [13:27:44] wow [13:27:56] (03PS1) 10Ssingh: durum: change logging for anycast-hc to INFO [puppet] - 10https://gerrit.wikimedia.org/r/1056503 [13:28:03] yup, to quickly verify if a given page is a translatable page or not. [13:28:10] okay, so the cache would only be cold for one (or rather, 3) accesses [13:28:20] plus however many simultaneous requests try to look it up at the same time [13:28:33] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3395/co" [puppet] - 10https://gerrit.wikimedia.org/r/1056503 (owner: 10Ssingh) [13:28:34] but I’m hoping that WANObjectCache already has some logic to make sure they don’t all try to recalculate the value in parallel [13:28:37] but just wait for each other [13:28:53] !log reprepro -C main include bookworm-wikimedia anycast-healthchecker_0.9.8-1+wmf12u2_amd64.changes: T370068 [13:28:53] ok but then the fact that it doesn’t fall back to an old value is probably fine [13:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:57] T370068: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068 [13:29:05] I’ll still sync them separately, I 
think we should have enough time [13:29:08] but I’m less worried [13:29:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056460 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [13:29:46] (03PS3) 10Slavina Stefanova: replica-cnf-api: fix envvars endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1056494 (https://phabricator.wikimedia.org/T365014) [13:30:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055228 (https://phabricator.wikimedia.org/T370045) (owner: 10Joely Rooke WMDE) [13:30:47] (03CR) 10Ssingh: [V:03+1 C:03+2] durum: change logging for anycast-hc to INFO [puppet] - 10https://gerrit.wikimedia.org/r/1056503 (owner: 10Ssingh) [13:31:19] https://wikitech.wikimedia.org/wiki/Memcached_for_MediaWiki says that WANObjectCache is responsible for preventing cache stampedes, so I think my hope above is reasonably well-founded ^^ [13:32:12] (03Abandoned) 10Slavina Stefanova: replica-cnf-api: fix envvars endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1056494 (https://phabricator.wikimedia.org/T365014) (owner: 10Slavina Stefanova) [13:32:56] Lucas_WMDE: I have taken off the server hosting that key to reimage it, and nothing bad happened, that I know of [13:33:10] okay, that also sounds promising ^^ [13:33:32] Lucas_WMDE, your reasoning makes sense to me. 
I wouldn't expect new caches being introduced to overload memcache [13:33:53] when a memcached server goes offline, requests go to another pool which is by default cold [13:34:20] FIRING: [4x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:35:38] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host pc1017.eqiad.wmnet with OS bookworm [13:35:47] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10010510 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm [13:36:12] I also deployed the patch on translatewiki.net an hour back to verify that there are no unintended side-effects. We use a single memcache instance there. Everything looks OK for now. [13:36:22] nice, thanks for testing there! 
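Note on the stampede discussion above (13:28 to 13:31): the hope expressed there is that simultaneous requests hitting a cold key wait for each other instead of all recomputing the value. The sketch below illustrates that general "single flight" idea inside one Python process. It is not how WANObjectCache is implemented (that works across servers on top of memcached); the class name, method name, key name and page IDs are all hypothetical.

```python
import threading
import time


class SingleFlightCache:
    """In-process cache where concurrent misses on a key trigger one recomputation."""

    def __init__(self):
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()  # protects the two dicts above

    def get_or_compute(self, key, compute):
        # Fast path: the value is already cached.
        with self._guard:
            if key in self._values:
                return self._values[key]
            lock = self._locks.setdefault(key, threading.Lock())
        # Slow path: one caller computes, the others block on the per-key lock.
        with lock:
            with self._guard:
                if key in self._values:
                    return self._values[key]
            value = compute()
            with self._guard:
                self._values[key] = value
            return value


calls = []


def expensive_lookup():
    """Stand-in for the costly query; records how often it actually runs."""
    calls.append(1)
    time.sleep(0.05)
    return [101, 202, 303]  # made-up page IDs


cache = SingleFlightCache()
results = []


def worker():
    results.append(cache.get_or_compute("translatable-page-ids", expensive_lookup))


# Eight simultaneous requests against a cold key.
threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All eight callers receive the same list while `expensive_lookup` runs only once, which is the behaviour being hoped for when the new cache keys start out cold.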
[13:36:26] Lucas_WMDE: regardless, if this patch is to cause issues, it means there is something we need to addressed (so would be a good find), given that memcached data are volatile [13:36:33] (03CR) 10Alexandros Kosiaris: [C:04-1] Add MPIC service port (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1056163 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [13:36:35] abijeet: thank you [13:36:40] (03CR) 10Hnowlan: [C:03+1] "I am sceptical if the issues we're seeing are on the shellbox-video side, but this is worth trying" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056498 (https://phabricator.wikimedia.org/T356241) (owner: 10Kamila Součková) [13:37:47] !log silence OtelCollectorRefusedSpans in codfw for 7d - T370043 [13:37:50] bleh, the wmf.14 backport is already failing in zuul [13:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:51] T370043: occasional OtelCollectorRefusedSpans alert for memory_limiter - https://phabricator.wikimedia.org/T370043 [13:38:04] npmjs ECONNRESET -.- [13:38:06] !nowandnext [13:38:19] jouncebot: nowandnext [13:38:19] For the next 0 hour(s) and 21 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240724T1300) [13:38:19] In 0 hour(s) and 21 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240724T1400) [13:38:22] * kamila_ is very good at this [13:39:20] FIRING: [4x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:40:27] I’ll restart the gate-and-submit once the wmf.14 backport finishes [13:40:33] but it might not complete before the end of the window :/ [13:41:06] abijeet, effie: if you’re around for a bit 
longer, we could probably deploy it after James_F (or $otherWikifunctionsPerson) is done with that window [13:41:11] Lucas_WMDE: lest check if the wikifunctions folks have things to backport [13:41:20] We do. [13:41:29] But we can delay if you need it. [13:41:32] (03PS6) 10Alexandros Kosiaris: Add MPIC service listener proxy [puppet] - 10https://gerrit.wikimedia.org/r/1056062 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [13:42:04] I don’t have a strong preference about delaying the wikifunctions window vs. waiting until it’s done, I’m here either way [13:42:06] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1056470 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [13:42:21] James_F: if you are ok with delaying, that would be great [13:42:26] Go for it. [13:42:31] (03CR) 10WMDE-Fisch: [C:03+1] Add wikibase client interaction stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055228 (https://phabricator.wikimedia.org/T370045) (owner: 10Joely Rooke WMDE) [13:42:33] alright, thank you! [13:42:34] it is getting late for abi too [13:42:37] James_F: cheers! [13:42:37] James_F, thanks [13:42:47] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1056471 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [13:43:03] (03CR) 10Alexandros Kosiaris: [C:03+1] "I 've upload a patchset to test the new gerrit functionality that happens when one clicks on "apply edit". 
Anyway, with applying claime's " [puppet] - 10https://gerrit.wikimedia.org/r/1056062 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [13:43:35] (still need to wait for the wmf.15 backport to finish before the wmf.14 one can even be started again, unfortunately) [13:43:48] (03PS2) 10Ayounsi: Netbox report timers: run as sre_bot user [puppet] - 10https://gerrit.wikimedia.org/r/1055957 [13:43:48] (03PS1) 10Ayounsi: Cumin Alias, add temp netbox4 and restore global netbox ones [puppet] - 10https://gerrit.wikimedia.org/r/1056505 (https://phabricator.wikimedia.org/T336275) [13:44:05] (IIUC zuul is keeping the wmf.14 one in the gate-and-submit-wmf queue, behind the wmf.15 one, because if the wmf.15 one fails it’ll resubmit the wmf.14 one independently) [13:44:20] FIRING: [4x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:44:43] (or we force-merge the wmf.14 backport… but I don’t think that’s warranted yet) [13:45:02] Ideally not, but if you're confident it's a problem with CI rather than code… [13:45:11] CI exists to give confidence, not just spin wheels. [13:45:14] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1056472 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [13:45:35] it was an ECONNRESET while downloading npm dependencies https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php74/20086/console [13:45:36] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1056473 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [13:45:49] Eh. I'd just force-merge, if you're deploying immediately. [13:45:56] Presumably you'll be testing it. 
:-) [13:46:05] * Lucas_WMDE turns head and looks at abijeet ;) [13:46:12] (03Merged) 10jenkins-bot: TranslatablePage: Split translatable page id cache into multiple shards [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056460 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [13:46:13] (03CR) 10CI reject: [V:04-1] TranslatablePage: Split translatable page id cache into multiple shards [extensions/Translate] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056502 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [13:46:44] !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1056460|TranslatablePage: Split translatable page id cache into multiple shards (T366455)]] [13:47:30] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "try again; but I might also just force-merge this – the error was a random ECONNRESET from npm (nothing we haven’t seen before), and there" [extensions/Translate] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056502 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [13:47:48] we can let the wmf.15 deployment run through first [13:48:09] Lucas_WMDE, lets deploy wmf.15 and then we can think at force merging wmf.14 [13:48:14] * Lucas_WMDE nods [13:48:27] scap for wmf.15 is running now [13:48:32] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1017.eqiad.wmnet with reason: host reimage [13:49:05] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, abi: Backport for [[gerrit:1056460|TranslatablePage: Split translatable page id cache into multiple shards (T366455)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:49:13] abijeet: please test on a group0 wiki :) [13:49:20] FIRING: [4x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:50:05] (03PS1) 10Hnowlan: services_proxy: add idle_timeout parameter, enable in shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1056506 (https://phabricator.wikimedia.org/T356241) [13:50:06] Lucas_WMDE, testing on mediawikiwiki [13:50:15] (03CR) 10Filippo Giunchedi: [C:04-1] "Tested in Pontoon and it works! See inline for docs comment, other than that LGTM and we're good to merge" [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [13:50:17] (03CR) 10Dzahn: "This would be right for a virtual machine but now that it's a physical host i think we want something different, as Jelto said already." [puppet] - 10https://gerrit.wikimedia.org/r/1056463 (https://phabricator.wikimedia.org/T369674) (owner: 10AOkoth) [13:51:03] (03CR) 10Clare Ming: Add MPIC service port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056163 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [13:51:59] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1017.eqiad.wmnet with reason: host reimage [13:52:05] Lucas_WMDE, looks good [13:52:13] effie, any spikes on Memcache? [13:52:15] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, abi: Continuing with sync [13:52:20] abijeet: thanks, syncing [13:52:36] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10010610 (10elukey) Finally a commit from puppetserver1001's private repo was propagated correctly to all nodes. What I check... 
[13:53:02] (03PS1) 10Elukey: Move the dump_cloud_ip_ranges to puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1056508 (https://phabricator.wikimedia.org/T368023) [13:53:33] (03CR) 10Ayounsi: "Confirmed that this parameter exists in netbox 3 and 4." [puppet] - 10https://gerrit.wikimedia.org/r/1055957 (owner: 10Ayounsi) [13:53:54] abijeet: looking [13:54:33] so far so good from my end [13:54:59] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc2017 - https://phabricator.wikimedia.org/T369658#10010616 (10Jhancock.wm) [13:57:05] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1056460|TranslatablePage: Split translatable page id cache into multiple shards (T366455)]] (duration: 10m 21s) [13:57:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/Translate] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056502 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [13:57:31] * Lucas_WMDE looks at zuul [13:57:38] ETA 13 mins… let’s just force-merge then [13:57:43] the main test build already finished successfully [13:57:50] Lucas_WMDE, I think it should be fine. [13:57:55] Ack. 
[13:58:21] (and because the main test build finished after the gate-and-submit build started, I didn’t even have to V+2 myself, heh) [13:58:26] scap running now [13:58:33] !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1056502|TranslatablePage: Split translatable page id cache into multiple shards (T366455)]] [13:59:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pc2017.codfw.wmnet with OS bookworm [13:59:29] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc2017 - https://phabricator.wikimedia.org/T369658#10010642 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pc2017.codfw.wmnet with OS bookworm [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240724T1400) [14:00:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:00:54] !log lucaswerkmeister-wmde@deploy1002 abi, lucaswerkmeister-wmde: Backport for [[gerrit:1056502|TranslatablePage: Split translatable page id cache into multiple shards (T366455)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:01:56] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 7 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1056508 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [14:01:57] abijeet: do you want to test again on group1/2? 
[14:02:16] yup [14:03:07] (03CR) 10Kamila Součková: [C:03+2] shellbox-video: set mesh.idle_timeout to 1d [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056498 (https://phabricator.wikimedia.org/T356241) (owner: 10Kamila Součková) [14:04:03] (03Merged) 10jenkins-bot: shellbox-video: set mesh.idle_timeout to 1d [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056498 (https://phabricator.wikimedia.org/T356241) (owner: 10Kamila Součková) [14:04:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:05:01] Lucas_WMDE, looks good [14:05:06] !log lucaswerkmeister-wmde@deploy1002 abi, lucaswerkmeister-wmde: Continuing with sync [14:05:08] \o/ [14:05:13] Yay. [14:06:20] (03PS2) 10Elukey: Move the dump_cloud_ip_ranges etcd upload to puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1056508 (https://phabricator.wikimedia.org/T368023) [14:08:43] (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-07-17-145014 to 2024-07-19-164024 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056266 (https://phabricator.wikimedia.org/T57876) (owner: 10Jforrester) [14:09:36] * Lucas_WMDE notices no non-k8s canaries in scap output anymore [14:09:38] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-07-17-145014 to 2024-07-19-164024 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056266 (https://phabricator.wikimedia.org/T57876) (owner: 10Jforrester) [14:09:54] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1056502|TranslatablePage: Split translatable page id cache into multiple shards (T366455)]] (duration: 11m 21s) [14:10:01] !log UTC afternoon backport+config window done [14:10:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:09] James_F: all yours, sorry for running into your window again :/ [14:10:24] traffic just spiked on a couple of memcached servers, but either way we hould wait [14:10:27] Lucas_WMDE: It happens! Thanks for fixing production. [14:10:27] should* [14:11:07] * Lucas_WMDE looks at the grafana dashboard like the “I have no idea what I’m doing” dog [14:11:08] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 8 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1056508 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [14:11:24] Lucas_WMDE: you are just the messenger haha [14:11:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:11:53] bit early to say but looks like it might be tapering off at least [14:11:56] so I’m not panicking yet ^^ [14:12:20] effie: Hey, is it you that has an un-gitted change to shellbox-video in prod deploy-charts? We can't deploy with it there. [14:12:47] I am not I am afraid [14:12:58] hnowlan ^ [14:13:11] that's me, apologies! [14:13:12] fixing [14:13:15] Thanks. [14:13:30] fixed [14:13:49] (03CR) 10Alexandros Kosiaris: [C:03+1] services_proxy: add idle_timeout parameter, enable in shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1056506 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:14:30] abijeet: there is a chance things are worse than before [14:14:33] abijeet, effie: feels a bit suspicious that (AFAICT) two memcached hosts are seeing higher traffic, and one roughly twice as much as the other… are two of the three shards getting hashed to the same server? 
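Note on the question above: whether two of three shard keys can land on the same memcached server depends on how keys are mapped to servers. The sketch below assumes the simplest possible scheme, hashing the key and taking it modulo the pool size, which may well differ from the routing actually used in production. The key names and the pool size of 18 are hypothetical and only serve the illustration.

```python
import hashlib


def server_for(key: str, servers: list[str]) -> str:
    """Pick a server by hashing the key (plain modulo placement, an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def collision_probability(num_shards: int, num_servers: int) -> float:
    """Chance that at least two shards share a server under uniform, independent placement."""
    p_distinct = 1.0
    for i in range(num_shards):
        p_distinct *= (num_servers - i) / num_servers
    return 1.0 - p_distinct


# Hypothetical pool and shard key names, for illustration only.
servers = [f"mc{i:02d}" for i in range(18)]
shard_keys = [f"pagetranslation:translatable-ids:shard{i}" for i in range(3)]
placement = {key: server_for(key, servers) for key in shard_keys}

p_shared = collision_probability(len(shard_keys), len(servers))
```

Under these assumptions `p_shared` is about 0.16, so two of three shards ending up on one server is roughly a one-in-six event rather than a sign of a bug in itself. It would explain one host seeing about twice the traffic of the other, but not the overall volume.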
[14:14:36] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs6003.drmrs.wmnet [14:15:16] effie, ohh [14:15:17] Lucas_WMDE: that could be an explanation, but traffic is even higher than back when it was a single key, so something is not right [14:15:30] :/ [14:15:30] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10010718 (10elukey) I didn't know about the `dump_cloud_ip_ranges` use cases, so https://gerrit.wikimedia.org/r/1056508 will b... [14:15:37] !log ecarg@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:16:12] we may have to revert after james is done [14:16:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:16:25] Oh dear. [14:16:32] You can probably deploy in parallel to us. [14:16:36] !log ecarg@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:16:50] I’m ready to revert if needed [14:16:52] effie, could the reason for the spike be that all the wikis are filling the caches at the same time? Was that the case previously too? [14:18:12] !log ecarg@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:18:39] !log disable puppet on O:durum [14:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:51] abijeet: can you please elaborate a bit? 
[14:19:08] !log ecarg@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:19:27] !log ecarg@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:20:19] !log ecarg@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:20:36] (03CR) 10Kamila Součková: [C:03+1] services_proxy: add idle_timeout parameter, enable in shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1056506 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:20:55] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs6003.drmrs.wmnet [14:21:01] (03CR) 10Volans: "Have you already tested if the script works manually running it?" [puppet] - 10https://gerrit.wikimedia.org/r/1056508 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [14:21:22] effie, translate extension is deployed on quite a few wikis and since the cache key changed all of them will be filling the cache in at the same time. Could the spike be because of that? [14:21:22] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T370672#10010743 (10Jhancock.wm) 05Open→03Resolved [14:21:35] (03CR) 10Elukey: [V:03+1] "The script already runs on all puppetservers, but without the option to write to etcd (so it does run correctly)." 
[puppet] - 10https://gerrit.wikimedia.org/r/1056508 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [14:21:58] abijeet: if multiple wikis do that, then yes that would make sense [14:22:13] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs5006.eqsin.wmnet [14:22:54] !log kamila@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [14:22:55] (03PS2) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-07-17-145805 to 2024-07-23-225548 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056267 [14:23:41] (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade evaluators from 2024-07-17-145805 to 2024-07-23-225548 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056267 (owner: 10Jforrester) [14:23:42] abijeet: lets give it al little time, but I dont recall such a spike when I took the server hosting this key down [14:24:11] !log upgrade O:durum to anycast-hc 0.9.8-1+wmf12u2 [14:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:22] (03PS1) 10Southparkfan: LabsServices: convert more services to svc records [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056513 (https://phabricator.wikimedia.org/T361383) [14:24:23] !log kamila@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [14:24:34] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [14:24:42] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-07-17-145805 to 2024-07-23-225548 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056267 (owner: 10Jforrester) [14:25:00] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [14:25:09] !log kamila@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [14:25:47] effie, ok. 
I'm around [14:26:04] !log kamila@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [14:26:10] !log ecarg@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:26:56] !log ecarg@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:27:04] !log ecarg@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:27:06] abijeet: I would guess that by now things should have been stable, but I just see steadily high traffic [14:27:29] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org [reason: upgrading anycast-hc: T370068] [14:27:33] T370068: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068 [14:28:15] (03CR) 10Ssingh: [C:03+2] Revert "hiera: dns6001: reduce anycast_hc logging level and backups" [puppet] - 10https://gerrit.wikimedia.org/r/1056501 (owner: 10Ssingh) [14:28:52] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5006.eqsin.wmnet [14:29:04] abijeet: but also the number of calls to even only the #2 shard of the key, is very high [14:29:38] 22172.33 req/sec to 1 key shard [14:29:59] !log ecarg@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:30:11] I’m guessing this must be due to the removal of 'checkKeys', rather than the splitting in 3 on its own… I’m trying to figure out what checkKeys does now [14:30:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2017.codfw.wmnet with reason: host reimage [14:30:35] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs4010.ulsfo.wmnet [14:30:57] as far as I understand, checkKeys marks the value as stale but does not purge it immediately. So you might temporarily see stale values for a long time. 
[14:31:00] !log ecarg@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:31:05] for a short time** [14:31:06] oh, I didn’t see that the patch adds ->delete() [14:31:09] I missed that part before [14:31:22] (03CR) 10AOkoth: install: adjust vrts partition configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056463 (https://phabricator.wikimedia.org/T369674) (owner: 10AOkoth) [14:31:46] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org [reason: finished upgrading anycast-hc: T370068] [14:31:47] given that this key is cached for 2 hours, I don't see why its continuously high [14:32:41] I guess we are making 3 times more requests than we would normally but those should ideally be split across 3 servers and the size of each one should be 1/3rd. [14:32:52] !log ecarg@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:33:07] (03CR) 10Scott French: [C:03+1] "Thank you!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1056473 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [14:33:07] !log ecarg@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:33:14] but bytes written and TX traffic have gone up by way more than 3× [14:33:21] (03PS2) 10Clément Goubert: Add MPIC service port [puppet] - 10https://gerrit.wikimedia.org/r/1056163 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [14:33:25] abijeet: right now we have 2 shards in 1 server (I dont unserstand why), and the 3rd one on another [14:33:45] (03CR) 10Scott French: [C:03+1] sre.switchdc.mediawiki: No-op formatting change [cookbooks] - 10https://gerrit.wikimedia.org/r/1056472 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [14:33:46] (03PS3) 10Clément Goubert: Add MPIC service port [puppet] - 10https://gerrit.wikimedia.org/r/1056163 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [14:33:50] !log 
jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2017.codfw.wmnet with reason: host reimage [14:33:54] but it feels like we are making way too many calls [14:34:02] effie, the spike in the graphs support that fact. [14:34:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:34:24] (03CR) 10Clément Goubert: "Applied the edits myself because gerrit is confusing." [puppet] - 10https://gerrit.wikimedia.org/r/1056163 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [14:34:54] ok abijeet Lucas_WMDE I suggest we revert this patch [14:35:22] (03CR) 10Scott French: [C:03+1] sre.mediawiki.restart-appservers: Remove legacy [cookbooks] - 10https://gerrit.wikimedia.org/r/1056470 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [14:35:24] okay [14:35:25] (03CR) 10Southparkfan: "For the record: this prefix list contains all /24 or /27 ranges belonging to {private1,public1}-lvs-{codfw,drmrs,esams,eqiad,magru,ulsfo} " [homer/public] - 10https://gerrit.wikimedia.org/r/1056198 (https://phabricator.wikimedia.org/T370156) (owner: 10Cathal Mooney) [14:35:38] !log ecarg@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:35:57] (03CR) 10Southparkfan: "(+ eqsin, of course)" [homer/public] - 10https://gerrit.wikimedia.org/r/1056198 (https://phabricator.wikimedia.org/T370156) (owner: 10Cathal Mooney) [14:36:05] (03CR) 10JHathaway: [C:03+2] fix Puppet::FileServing::Content for puppet 7 [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1056258 (https://phabricator.wikimedia.org/T367547) (owner: 10JHathaway) [14:36:37] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4010.ulsfo.wmnet [14:36:42] 
Lucas_WMDE: abijeet sorry, I forgot to share the graph btw https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&from=now-30m&to=now&viewPanel=58, my bad [14:37:24] James_F: okay to deploy a MW revert? (can’t find ecarg’s IRC name) [14:37:29] Lucas_WMDE: Go for it. [14:37:42] (03CR) 10Scott French: [C:03+1] sre.mediawiki.route-traffic: Use switchdc defined services [cookbooks] - 10https://gerrit.wikimedia.org/r/1056471 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [14:38:07] (03PS1) 10TrainBranchBot: Revert "TranslatablePage: Split translatable page id cache into multiple shards" [extensions/Translate] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056515 [14:38:07] (03CR) 10TrainBranchBot: "lucaswerkmeister-wmde@deploy1002 created a revert of this change as I8739b0dd82d16cd85baaa1da637e70de56ad7909" [extensions/Translate] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056502 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [14:38:11] (03PS1) 10TrainBranchBot: Revert "TranslatablePage: Split translatable page id cache into multiple shards" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056516 [14:38:12] (03CR) 10TrainBranchBot: "lucaswerkmeister-wmde@deploy1002 created a revert of this change as I0c8c0b7bb5ad2f46e56ca0dc84e50c99081a95e9" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056460 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [14:38:27] * Lucas_WMDE started scap backport --revert [14:38:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/Translate] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056515 (owner: 10TrainBranchBot) [14:38:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056516 (owner: 
10TrainBranchBot) [14:38:59] oh good, that’s going to wait for CI too… [14:39:14] Only if you want it to. [14:39:19] I think that’s another force-merge-justified situation [14:39:19] :-D [14:39:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:39:25] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:38] (03CR) 10Andrew Bogott: [C:03+2] LabsServices: convert more services to svc records [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056513 (https://phabricator.wikimedia.org/T361383) (owner: 10Southparkfan) [14:39:40] I’ll force-merge in a moment if nobody objects [14:39:47] (which the other day I found out is called “lazy consensus” :P) [14:39:56] I am lazy so sure [14:40:14] (03CR) 10Lucas Werkmeister (WMDE): [V:03+2 C:03+2] "Force-merging, relatively urgent production revert and there’s no reason why CI should fail." [extensions/Translate] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056515 (owner: 10TrainBranchBot) [14:40:17] (03CR) 10Volans: [C:03+1] "TIL, I thought we were running it just in one place to avoid polling all the data N times from the same sources." [puppet] - 10https://gerrit.wikimedia.org/r/1056508 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [14:40:26] (03CR) 10Lucas Werkmeister (WMDE): [V:03+2 C:03+2] "Force-merging, relatively urgent production revert and there’s no reason why CI should fail." 
[extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056516 (owner: 10TrainBranchBot) [14:40:50] oh good, unexpected commit pulled in the meantime [14:41:08] andrewbogott: can I continue with my scap? [14:41:26] (I’m not sure if that would deploy “LabsServices: convert more services to svc records”… I think yes?) [14:41:51] Lucas_WMDE: yes. Sorry about the collision [14:42:03] ok, doing [14:42:04] thanks [14:42:17] (I guess LabsServices.php is probably beta-only anyway) [14:42:22] Yes. [14:42:24] !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1056515|Revert "TranslatablePage: Split translatable page id cache into multiple shards"]], [[gerrit:1056516|Revert "TranslatablePage: Split translatable page id cache into multiple shards"]] [14:42:57] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10010868 (10Volans) Let's just make sure that `requestctl` works fine on the puppetmasters, it's installed but the `pyparsing`... [14:44:08] Lucas_WMDE: abijeet it seems that it went even more badly than we thought https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?orgId=1&var-kClass=pagetranslation&from=now-1h&to=now [14:44:24] wowee [14:44:35] 100k to 4mil [14:44:46] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, trainbranchbot: Backport for [[gerrit:1056515|Revert "TranslatablePage: Split translatable page id cache into multiple shards"]], [[gerrit:1056516|Revert "TranslatablePage: Split translatable page id cache into multiple shards"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:44:55] abijeet: want to test anything on mwdebug? [14:45:03] the main impact will only be visible when we roll out evreywhere anyway [14:45:24] effie: but some latency went down at least! 
[14:45:38] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: lvs2011: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370891 (10cmooney) 03NEW p:05Triage→03Medium [14:45:46] Lucas_WMDE, doing a quick check [14:45:56] Lucas_WMDE: haha [14:46:28] uploaded the master branch revert at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1056520 [14:46:47] and I’ll probably +2 that soon, so we don’t risk accidentally rolling this out with the next train [14:46:47] doing a lot of low-lat requests best recipe to attain SLO :takes_notes: [14:46:49] (03PS1) 10Cathal Mooney: lvs2011: move row C & D vlans to primary uplink and add new ones [puppet] - 10https://gerrit.wikimedia.org/r/1056521 (https://phabricator.wikimedia.org/T370891) [14:47:06] Lucas_WMDE, looks ok [14:47:11] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, trainbranchbot: Continuing with sync [14:47:16] ok, thanks for checking [14:49:08] (03CR) 10Clément Goubert: Add MPIC service port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056163 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [14:49:42] (03PS1) 10JHathaway: bump version [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1056522 [14:50:36] (03PS2) 10Hnowlan: services_proxy: add idle_timeout parameter, enable in shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1056506 (https://phabricator.wikimedia.org/T356241) [14:50:37] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2011: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370891#10010933 (10cmooney) [14:50:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:51:31] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Update codfw LVS connectivity to support new LSW in rows C & D - https://phabricator.wikimedia.org/T370635#10010941 (10cmooney) [14:51:35] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2011: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370891#10010940 (10cmooney) [14:52:02] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1056515|Revert "TranslatablePage: Split translatable page id cache into multiple shards"]], [[gerrit:1056516|Revert "TranslatablePage: Split translatable page id cache into multiple shards"]] (duration: 09m 37s) [14:52:04] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:52:30] lines are going down already [14:53:45] AFAICT everything’s back to pre-backport levels [14:54:02] (including the latency, which has gone back up, heh) [14:54:08] Lucas_WMDE, thanks [14:54:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:54:34] abijeet: Lucas_WMDE thank you for your time [14:54:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:54:49] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host pc2017.codfw.wmnet with OS bookworm [14:54:52] Lucas_WMDE: tx for going back and reverting [14:54:59] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: 
Q1:rack/setup/install pc2017 - https://phabricator.wikimedia.org/T369658#10010962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc2017.codfw.wmnet with OS bookworm completed: - pc2017 (**FAIL**)... [14:55:00] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc2017 - https://phabricator.wikimedia.org/T369658#10010963 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc2017.codfw.wmnet with OS bookworm executed with errors: - pc2017... [14:55:40] np [14:56:14] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10010948 (10cmooney) a:03cmooney [14:56:19] (03CR) 10JHathaway: [C:03+2] bump version [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/1056522 (owner: 10JHathaway) [14:56:41] (03PS3) 10Hnowlan: services_proxy: add idle_timeout parameter, enable in shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1056506 (https://phabricator.wikimedia.org/T356241) [14:57:49] (03CR) 10Alexandros Kosiaris: Add MPIC service port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056163 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [14:57:55] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2011: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370891#10010968 (10cmooney) [14:58:04] (03CR) 10Dzahn: install: adjust vrts partition configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056463 (https://phabricator.wikimedia.org/T369674) (owner: 10AOkoth) [14:58:33] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10010969 
(10cmooney) [14:58:42] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10010971 (10cmooney) [14:59:19] (03CR) 10CI reject: [V:04-1] services_proxy: add idle_timeout parameter, enable in shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1056506 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [14:59:20] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:41] (03PS1) 10JHathaway: pcc: bump version for puppet7 workers [puppet] - 10https://gerrit.wikimedia.org/r/1056523 (https://phabricator.wikimedia.org/T367547) [15:00:12] abijeet, effie: I’m definitely glad we backported to wmf.14 as well – it seems like the effect wasn’t really visible on wmf.15, so without the wmf.14 backport this would’ve instead been confusing and harder to track down when metawiki was promoted to wmf.15 in a few hours… [15:00:39] (03CR) 10JHathaway: [C:03+2] pcc: bump version for puppet7 workers [puppet] - 10https://gerrit.wikimedia.org/r/1056523 (https://phabricator.wikimedia.org/T367547) (owner: 10JHathaway) [15:01:09] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs3010.esams.wmnet [15:01:43] (03CR) 10Volans: [C:04-1] "Adding Luca to gather more feedback and hopefully consensus on one direction." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1056001 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [15:01:50] (03PS2) 10AOkoth: install: adjust vrts partition configs [puppet] - 10https://gerrit.wikimedia.org/r/1056463 (https://phabricator.wikimedia.org/T369674) [15:02:14] (03CR) 10AOkoth: install: adjust vrts partition configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056463 (https://phabricator.wikimedia.org/T369674) (owner: 10AOkoth) [15:02:50] Lucas_WMDE, yup, thanks effie for surfacing that. [15:03:02] (03PS4) 10Hnowlan: services_proxy: add idle_timeout parameter, enable in shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1056506 (https://phabricator.wikimedia.org/T356241) [15:03:39] (03PS4) 10Elukey: dhcp: add dhcp_filename and dhcp_options for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1056176 (https://phabricator.wikimedia.org/T363576) [15:04:22] (03CR) 10Elukey: "All fixed! Also extended the new config to ConfMac, should be ready for another review." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1056176 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [15:04:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pc2017.codfw.wmnet with OS bookworm [15:04:35] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc2017 - https://phabricator.wikimedia.org/T369658#10011004 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pc2017.codfw.wmnet with OS bookworm [15:05:13] master branch revert +2ed as well now [15:05:39] * Lucas_WMDE done deploying, hopefully [15:07:06] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3402/co" [puppet] - 10https://gerrit.wikimedia.org/r/1056506 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [15:07:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2017.codfw.wmnet with reason: host reimage [15:07:54] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3010.esams.wmnet [15:09:59] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [15:10:22] (03CR) 10Elukey: [C:03+1] "Checked the uid with LDAP, the change looks good! 
After https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054894 we should get more pe" [puppet] - 10https://gerrit.wikimedia.org/r/1056452 (owner: 10Filippo Giunchedi) [15:10:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2017.codfw.wmnet with reason: host reimage [15:12:29] (03CR) 10JHathaway: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3403/console" [puppet] - 10https://gerrit.wikimedia.org/r/1056030 (owner: 10Scott French) [15:12:43] (03CR) 10Volans: [C:03+1] "LGTM, thanks for the addition." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1056176 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [15:13:12] (03CR) 10Elukey: Netbox report timers: run as sre_bot user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1055957 (owner: 10Ayounsi) [15:14:56] (03CR) 10Elukey: [C:03+1] Cumin Alias, add temp netbox4 and restore global netbox ones [puppet] - 10https://gerrit.wikimedia.org/r/1056505 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [15:15:27] (03CR) 10Elukey: [C:03+2] dhcp: add dhcp_filename and dhcp_options for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1056176 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [15:18:02] (03CR) 10Elukey: "From https://phabricator.wikimedia.org/T289779#7347372 it seems that it must happen manually, no idea how though :)" [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [15:19:24] (03PS2) 10Ayounsi: Cumin Alias, add temp netbox4 and restore global netbox ones [puppet] - 10https://gerrit.wikimedia.org/r/1056505 (https://phabricator.wikimedia.org/T336275) [15:19:39] (03CR) 10Ssingh: [V:03+1] "The reload here might not be required as suggested by Brandon. 
gdnsd does this:" [puppet] - 10https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [15:20:06] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (NOOP 3 DIFF 1 CORE_DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compil" [puppet] - 10https://gerrit.wikimedia.org/r/1056506 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [15:20:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2017.codfw.wmnet with OS bookworm [15:20:49] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc2017 - https://phabricator.wikimedia.org/T369658#10011062 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc2017.codfw.wmnet with OS bookworm completed: - pc2017 (**PASS**)... [15:21:16] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc2017 - https://phabricator.wikimedia.org/T369658#10011077 (10Jhancock.wm) [15:21:57] (03Merged) 10jenkins-bot: dhcp: add dhcp_filename and dhcp_options for DHCPConfMac and DHCPConfOpt82 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1056176 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [15:22:37] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc2017 - https://phabricator.wikimedia.org/T369658#10011078 (10Jhancock.wm) 05Open→03Resolved @Marostegui all yours! 
[15:24:03] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs1020.eqiad.wmnet [15:24:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:25:01] (03CR) 10Hnowlan: [V:03+1 C:03+2] services_proxy: add idle_timeout parameter, enable in shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1056506 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [15:27:48] (03PS3) 10Ssingh: dnsbox: set anycast-hc num_backups to one [puppet] - 10https://gerrit.wikimedia.org/r/1051381 [15:28:28] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3405/co" [puppet] - 10https://gerrit.wikimedia.org/r/1051381 (owner: 10Ssingh) [15:29:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:30:45] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1020.eqiad.wmnet [15:33:07] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc2017 - https://phabricator.wikimedia.org/T369658#10011129 (10Ladsgroup) Thank you so much! [15:38:45] (03PS1) 10Elukey: sre.hosts.reimage: add workaround for PXE boot issue on some NICs [cookbooks] - 10https://gerrit.wikimedia.org/r/1056534 (https://phabricator.wikimedia.org/T363576) [15:39:22] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Move lvs2014 uplink to lsw1-d2-codfw, connect to private1-d2-codfw and trunk all vlans on primary. 
- https://phabricator.wikimedia.org/T370897 (10cmooney) 03NEW p:05Triage→03Medium [15:40:00] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Move lvs2014 uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10011160 (10cmooney) [15:40:11] (03CR) 10Elukey: "Likely CI will report a -1, we need to release a new version of spicerack first." [cookbooks] - 10https://gerrit.wikimedia.org/r/1056534 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [15:40:22] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10011161 (10cmooney) [15:40:51] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#10011162 (10elukey) [15:42:08] (03CR) 10JHathaway: [C:03+1] puppetmaster: set git::clone umask to match requested file mode [puppet] - 10https://gerrit.wikimedia.org/r/1056201 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [15:42:32] (03CR) 10CI reject: [V:04-1] sre.hosts.reimage: add workaround for PXE boot issue on some NICs [cookbooks] - 10https://gerrit.wikimedia.org/r/1056534 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [15:42:36] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [15:43:21] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [15:45:41] (03PS1) 10Ilias Sarantopoulos: ml-services: update hf image for gemma2 and cmd args [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056538 (https://phabricator.wikimedia.org/T370670) [15:51:32] (03CR) 10AikoChou: [C:03+1] ml-services: update hf image for gemma2 and cmd args [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056538 (https://phabricator.wikimedia.org/T370670) (owner: 10Ilias Sarantopoulos) [15:59:20] 
FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:00:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:10:38] (03PS4) 10Giuseppe Lavagetto: puppet-lint.rc: add specification of our indentation [puppet] - 10https://gerrit.wikimedia.org/r/1056131 [16:10:38] (03PS4) 10Giuseppe Lavagetto: profile::haproxy: move tls_terminator.pp to profile module [puppet] - 10https://gerrit.wikimedia.org/r/1056466 [16:11:23] (03PS1) 10Ssingh: wikidough: set log level to INFO for anycast-hc [puppet] - 10https://gerrit.wikimedia.org/r/1056544 [16:11:42] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3406/co" [puppet] - 10https://gerrit.wikimedia.org/r/1056466 (owner: 10Giuseppe Lavagetto) [16:12:03] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3407/co" [puppet] - 10https://gerrit.wikimedia.org/r/1056544 (owner: 10Ssingh) [16:12:48] (03CR) 10Ssingh: [V:03+1 C:03+2] wikidough: set log level to INFO for anycast-hc [puppet] - 10https://gerrit.wikimedia.org/r/1056544 (owner: 10Ssingh) [16:15:18] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Máté Szabó - https://phabricator.wikimedia.org/T370904 (10mszabo) 03NEW [16:15:27] 06SRE, 10Dumps 2.0, 10Dumps-Generation, 10MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), 13Patch-For-Review: Dumps generation without prefetch cause 
disruption to the production environment - https://phabricator.wikimedia.org/T368098#10011339 (10WDoranWMF) [16:15:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:19:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:20:54] (03CR) 10Btullis: [C:03+1] growthbook: replace ferretdb by mongo itself (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056485 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [16:21:57] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: update hf image for gemma2 and cmd args [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056538 (https://phabricator.wikimedia.org/T370670) (owner: 10Ilias Sarantopoulos) [16:22:52] (03Merged) 10jenkins-bot: ml-services: update hf image for gemma2 and cmd args [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056538 (https://phabricator.wikimedia.org/T370670) (owner: 10Ilias Sarantopoulos) [16:23:21] (03PS5) 10Giuseppe Lavagetto: profile::haproxy: move tls_terminator.pp to profile module [puppet] - 10https://gerrit.wikimedia.org/r/1056466 [16:24:37] (03CR) 10CDanis: "Overall I love this, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [16:26:08] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3408/console" [puppet] - 10https://gerrit.wikimedia.org/r/1056466 (owner: 10Giuseppe Lavagetto) [16:30:21] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC now reports no differences, so this is in fact a noop refactor that saves us about 150 LOC." [puppet] - 10https://gerrit.wikimedia.org/r/1056466 (owner: 10Giuseppe Lavagetto) [16:33:15] !log sudo cumin -b1 -s120 'O:wikidough' 'systemctl restart anycast-healthchecker.service' [16:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:45] (03CR) 10Elukey: "Thanks a lot for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [16:38:19] (03PS1) 10Cathal Mooney: lvs2014: move A and B vlans to primary link and add new C and D vlans [puppet] - 10https://gerrit.wikimedia.org/r/1056550 (https://phabricator.wikimedia.org/T370897) [16:38:46] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:42:37] 10ops-codfw, 06DC-Ops, 10observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10011540 (10andrea.denisse) 05Open→03Stalled [16:42:38] (03CR) 10CDanis: admin: add dcops to the system adm POSIX group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [16:42:48] 10ops-eqiad, 06DC-Ops, 10observability: Q1:rack/setup/install alert1002 - https://phabricator.wikimedia.org/T370111#10011544 (10andrea.denisse) 05Open→03Stalled [16:43:12] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding frack servers to codfw - jhancock@cumin2002" 
[16:43:45] (03PS2) 10Cathal Mooney: lvs2014: move A and B vlans to primary link and add new C and D vlans [puppet] - 10https://gerrit.wikimedia.org/r/1056550 (https://phabricator.wikimedia.org/T370897) [16:44:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding frack servers to codfw - jhancock@cumin2002" [16:44:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:49:05] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Update codfw LVS connectivity to support new LSW in rows C & D - https://phabricator.wikimedia.org/T370635#10011631 (10cmooney) [16:49:06] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10011630 (10cmooney) [16:49:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:49:36] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10011653 (10cmooney) [16:50:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:50:47] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:53:17] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding frack servers 
to codfw - jhancock@cumin2002" [16:54:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding frack servers to codfw - jhancock@cumin2002" [16:54:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:59:25] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240724T1700) [17:02:08] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding frack servers to codfw - jhancock@cumin2002" [17:03:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding frack servers to codfw - jhancock@cumin2002" [17:03:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:05:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:06:25] 10ops-eqiad, 06DC-Ops, 10observability: Q1:rack/setup/install alert1002 - https://phabricator.wikimedia.org/T370111#10011831 (10RobH) @andrea.denisse: Can you please update the puppet repo site.pp and preseed.yaml for these hosts and once done update this task and unassign from yourself. Once the servers ar... [17:07:16] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frqueue2003, pay-lb2001, pay-lb2002 - https://phabricator.wikimedia.org/T369566#10011817 (10Jhancock.wm) a:05Jhancock.wm→03Papaul These are ready for you. frqueue2003 ETH0 <-> FASW-C8A eth-0/0/32 ETH1 <-> FASW-C8B eth-1/0/32... 
[17:07:45] 10ops-eqiad, 06DC-Ops, 10observability: Q1:rack/setup/install alert1002 - https://phabricator.wikimedia.org/T370111#10011834 (10andrea.denisse) @RobH Yes, I'm working on it. [17:08:17] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Update codfw LVS connectivity to support new LSW in rows C & D - https://phabricator.wikimedia.org/T370635#10011845 (10cmooney) [17:08:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd100[1-4] - https://phabricator.wikimedia.org/T370546#10011849 (10RobH) a:03colewhite @colewhite, Please note that while this racking task is filed, we still need one more update from you or your team before these systems arri... [17:09:01] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10011851 (10RobH) a:03BTullis @btullis, Please note that while this racking task is filed, we still need one more update from you or your team before these systems arrive o... [17:09:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:10:10] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be1005 - https://phabricator.wikimedia.org/T370453#10011857 (10RobH) a:03MatthewVernon @MatthewVernon, Please note that while this racking task is filed, we still need one more update from you or your t... [17:10:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus100[78] - https://phabricator.wikimedia.org/T370426#10011859 (10RobH) a:03fgiunchedi @fgiunchedi, Please note that while this racking task is filed, we still need one more update from you or your team before these systems arr... 
[17:12:31] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban1002 - https://phabricator.wikimedia.org/T369947#10011888 (10RobH) a:03Dwisehaupt @Dwisehaupt, The fundraising hosts follow a different workflow than normal hosts, in that we typically don't reimage them and hand them off to your te... [17:12:43] (03PS1) 10Hnowlan: mesh.configuration: set idle_timeout to timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241) [17:13:38] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [17:14:25] (03PS1) 10Andrea Denisse: alert: Add node definitions for the alert1002 and alert2002 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1056561 (https://phabricator.wikimedia.org/T370111) [17:15:49] (03CR) 10Andrea Denisse: "The preseed.yml file already has the right partition recipe for all hosts matching `alert*`." [puppet] - 10https://gerrit.wikimedia.org/r/1056561 (https://phabricator.wikimedia.org/T370111) (owner: 10Andrea Denisse) [17:17:11] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fran1002 - https://phabricator.wikimedia.org/T369940#10011920 (10RobH) a:03Dwisehaupt @Dwisehaupt, The fundraising hosts follow a different workflow than normal hosts, in that we typically don't reimage them and hand them off to your tea... [17:17:27] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb1007 - https://phabricator.wikimedia.org/T369922#10011925 (10RobH) a:03Dwisehaupt @Dwisehaupt, The fundraising hosts follow a different workflow than normal hosts, in that we typically don't reimage them and hand them off to your tea... 
[17:18:21] (03PS2) 10Hnowlan: mesh.configuration: set idle_timeout to timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241) [17:18:21] (03PS1) 10Hnowlan: shellbox: use latest mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056562 (https://phabricator.wikimedia.org/T356241) [17:19:11] (03CR) 10CI reject: [V:04-1] mesh.configuration: set idle_timeout to timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [17:19:16] (03CR) 10CI reject: [V:04-1] shellbox: use latest mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056562 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [17:20:04] (03PS1) 10Cathal Mooney: lvs2013: move A and B vlans to primary link and add new C and D vlans [puppet] - 10https://gerrit.wikimedia.org/r/1056563 (https://phabricator.wikimedia.org/T370927) [17:20:43] (03PS3) 10Cathal Mooney: lvs2014: move A and B vlans to primary link and add new C and D vlans [puppet] - 10https://gerrit.wikimedia.org/r/1056550 (https://phabricator.wikimedia.org/T370897) [17:21:41] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding frack servers to codfw - jhancock@cumin2002" [17:22:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding frack servers to codfw - jhancock@cumin2002" [17:22:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:25:02] (03PS1) 10BCornwall: ncmonitor: Enable patches, email; Set monthly [puppet] - 10https://gerrit.wikimedia.org/r/1056567 [17:26:11] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - 
https://phabricator.wikimedia.org/T370897#10011966 (10cmooney) [17:27:12] (03CR) 10David Caro: admin: add dcops to the system adm POSIX group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [17:29:45] (03PS3) 10AOkoth: install: adjust vrts partition configs [puppet] - 10https://gerrit.wikimedia.org/r/1056463 (https://phabricator.wikimedia.org/T369674) [17:31:03] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10011969 (10cmooney) [17:31:18] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2013: move uplink to lsw1-c2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370927#10011975 (10cmooney) [17:35:49] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransw200[1-3].frack.codfw.wmnet - https://phabricator.wikimedia.org/T367800#10012024 (10Jhancock.wm) a:05Jhancock.wm→03Papaul @Papaul these are ready for your part. fransw2001 ETH0 <-> FASW-C8A eth-0/0/29 ETH1 <-> FASW-C8B e... 
[17:39:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:44:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:45:02] (03CR) 10Dzahn: [C:03+1] install: adjust vrts partition configs [puppet] - 10https://gerrit.wikimedia.org/r/1056463 (https://phabricator.wikimedia.org/T369674) (owner: 10AOkoth) [17:49:00] (03CR) 10Dzahn: [C:03+2] firewall/gitlab: add option to throttle and drop traffic using nftables (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [17:54:48] (03CR) 10Dzahn: [C:03+2] "noop on gitlab1004 and gitlab2002 confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [17:55:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:59:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:00:05] dduvall and dancy: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240724T1800). 
nyaa~ [18:02:02] (03CR) 10Dzahn: [C:03+2] "on gitlab1003 : /etc/nftables/99_throttling_puppet.nft was removed and /etc/nftables/099_throttling-chain_puppet.nft was created" [puppet] - 10https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [18:02:25] (03PS1) 10TrainBranchBot: group1 to 1.43.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056572 (https://phabricator.wikimedia.org/T366960) [18:02:27] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.43.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056572 (https://phabricator.wikimedia.org/T366960) (owner: 10TrainBranchBot) [18:03:05] (03Merged) 10jenkins-bot: group1 to 1.43.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056572 (https://phabricator.wikimedia.org/T366960) (owner: 10TrainBranchBot) [18:10:33] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.15 refs T366960 [18:10:38] T366960: 1.43.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T366960 [18:10:56] 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 10Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#10012178 (10Dzahn) gitlab1003 and gitlab1004 are unchanged. gitlab2002 now has the new fil... 
[18:15:10] (03CR) 10Scott French: mediawiki: fetch active deployment host (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1056001 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [18:16:25] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:18:29] (03PS1) 10Dzahn: gerrit: add nft throttling on replica but don't enable yet [puppet] - 10https://gerrit.wikimedia.org/r/1056574 (https://phabricator.wikimedia.org/T365259) [18:19:33] (03PS1) 10CDanis: make mypy happy [software/statograph] - 10https://gerrit.wikimedia.org/r/1056575 [18:21:34] (03CR) 10Dzahn: "uh.. Gerrit has unexpected compilation failures BEFORE this change: https://puppet-compiler.wmflabs.org/output/1056574/3409/gerrit1003.wik" [puppet] - 10https://gerrit.wikimedia.org/r/1056574 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [18:29:22] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:31:35] (03PS1) 10Ebernhardson: Check the output of RevisionStore::getRevisionById [extensions/CirrusSearch] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056578 (https://phabricator.wikimedia.org/T370770) [18:33:01] (03PS1) 10Ebernhardson: Check the output of RevisionStore::getRevisionById [extensions/CirrusSearch] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056580 (https://phabricator.wikimedia.org/T370770) [18:33:05] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T367547#10012271" [puppet] - 10https://gerrit.wikimedia.org/r/1056574 (https://phabricator.wikimedia.org/T365259) (owner: 
10Dzahn) [18:34:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:36:41] (03PS1) 10Dzahn: gitlab: set nft throttling policy to drop on replica [puppet] - 10https://gerrit.wikimedia.org/r/1056581 (https://phabricator.wikimedia.org/T366882) [18:37:50] (03CR) 10Volans: [C:03+1] "LGTM, just recheck it once the new spicerack release is out to make CI happy" [cookbooks] - 10https://gerrit.wikimedia.org/r/1056534 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey) [18:44:14] (03PS1) 10Zabe: Initial configuration for cswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056584 (https://phabricator.wikimedia.org/T370905) [18:45:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:49:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:49:27] (03CR) 10Volans: [C:03+1] "LGTM" [software/statograph] - 10https://gerrit.wikimedia.org/r/1056575 (owner: 10CDanis) [18:52:04] (03CR) 10CDanis: [C:03+2] make mypy happy [software/statograph] - 10https://gerrit.wikimedia.org/r/1056575 (owner: 10CDanis) [18:54:24] (03Merged) 10jenkins-bot: make mypy happy [software/statograph] - 10https://gerrit.wikimedia.org/r/1056575 (owner: 10CDanis) [18:54:42] (03PS3) 10Volans: Filter out NaN data from Prometheus [software/statograph] - 
10https://gerrit.wikimedia.org/r/1055875 (https://phabricator.wikimedia.org/T370386) [18:55:26] (03PS4) 10Andrea Denisse: burrow: Create the /var/run/burrow dir with systemd-tmpfiles [puppet] - 10https://gerrit.wikimedia.org/r/1056579 (https://phabricator.wikimedia.org/T366573) [18:55:26] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/1056579/3413/" [puppet] - 10https://gerrit.wikimedia.org/r/1056579 (https://phabricator.wikimedia.org/T366573) (owner: 10Andrea Denisse) [18:59:21] (03PS1) 10Dzahn: site: simplify regex for doc hosts [puppet] - 10https://gerrit.wikimedia.org/r/1056586 [18:59:37] (03CR) 10Herron: [C:03+1] alert: Add node definitions for the alert1002 and alert2002 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1056561 (https://phabricator.wikimedia.org/T370111) (owner: 10Andrea Denisse) [19:00:31] (03CR) 10CDanis: [C:03+2] Filter out NaN data from Prometheus [software/statograph] - 10https://gerrit.wikimedia.org/r/1055875 (https://phabricator.wikimedia.org/T370386) (owner: 10Volans) [19:01:46] (03CR) 10Dzahn: [C:03+2] "https://puppet-compiler.wmflabs.org/output/1055488/3414/" [puppet] - 10https://gerrit.wikimedia.org/r/1055488 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [19:02:44] (03Merged) 10jenkins-bot: Filter out NaN data from Prometheus [software/statograph] - 10https://gerrit.wikimedia.org/r/1055875 (https://phabricator.wikimedia.org/T370386) (owner: 10Volans) [19:04:46] (03CR) 10Scott French: "Thanks again for the reviews here, Reuven. 
I just realized this was still active, despite being superseded by Ic48417e5acb0a64cd6af1c66a2b" [puppet] - 10https://gerrit.wikimedia.org/r/1053819 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [19:05:23] (03Abandoned) 10Scott French: mediawiki-cache-warmup: prepare for bare-metal turndown [puppet] - 10https://gerrit.wikimedia.org/r/1053819 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [19:06:43] (03PS1) 10Jdrewniak: Create dark mode launch banner for Vector 2022 [skins/Vector] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056587 (https://phabricator.wikimedia.org/T370303) [19:08:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [skins/Vector] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056587 (https://phabricator.wikimedia.org/T370303) (owner: 10Jdrewniak) [19:09:39] (03PS7) 10Scott French: deployment_server: install the cache warmup script [puppet] - 10https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) [19:10:12] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [19:19:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:20:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:22:37] (03CR) 10JHathaway: cloud-vps puppetservers: remove use of the 
'gitpuppet' user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [19:25:33] (03CR) 10JHathaway: [C:03+1] Move the dump_cloud_ip_ranges etcd upload to puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/1056508 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [19:27:26] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [19:27:27] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [19:29:48] (03CR) 10Gergő Tisza: [C:03+1] [wmf-config] Remove trailing slash in SSO domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056495 (owner: 10D3r1ck01) [19:30:07] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [19:30:09] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [19:31:23] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [19:31:40] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [19:35:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:39:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:40:49] (03PS66) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [19:42:51] (03CR) 10Andrea Denisse: 
[C:03+2] alert: Add node definitions for the alert1002 and alert2002 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1056561 (https://phabricator.wikimedia.org/T370111) (owner: 10Andrea Denisse) [19:46:39] 10ops-codfw, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10012590 (10andrea.denisse) a:05andrea.denisse→03None [19:46:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q1:rack/setup/install alert1002 - https://phabricator.wikimedia.org/T370111#10012587 (10andrea.denisse) [19:59:02] (03PS1) 10JHathaway: pcc-db1002: use latest pcc [puppet] - 10https://gerrit.wikimedia.org/r/1056597 (https://phabricator.wikimedia.org/T367547) [19:59:44] (03CR) 10JHathaway: [C:03+2] pcc-db1002: use latest pcc [puppet] - 10https://gerrit.wikimedia.org/r/1056597 (https://phabricator.wikimedia.org/T367547) (owner: 10JHathaway) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240724T2000). [20:00:05] sergi0, ebernhardson, and jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:16] hello [20:00:24] o/ [20:01:22] \o [20:03:49] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T370556#10012656 (10VRiley-WMF) We have located a spare drive and ready to replace it at any time. 
[20:04:59] i can deploy [20:06:27] (03CR) 10Zabe: [C:03+2] Create dark mode launch banner for Vector 2022 [skins/Vector] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056587 (https://phabricator.wikimedia.org/T370303) (owner: 10Jdrewniak) [20:06:29] (03CR) 10Zabe: [C:03+2] frwiktionary, dewiki: enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056484 (https://phabricator.wikimedia.org/T370261) (owner: 10Sergio Gimeno) [20:06:46] ebernhardson: you can self-deploy afterwards? [20:06:59] zabe: sure i can do that [20:07:10] (03Merged) 10jenkins-bot: frwiktionary, dewiki: enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056484 (https://phabricator.wikimedia.org/T370261) (owner: 10Sergio Gimeno) [20:07:28] (03CR) 10Zabe: [C:03+2] Check the output of RevisionStore::getRevisionById [extensions/CirrusSearch] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056578 (https://phabricator.wikimedia.org/T370770) (owner: 10Ebernhardson) [20:07:30] (03CR) 10Zabe: [C:03+2] Check the output of RevisionStore::getRevisionById [extensions/CirrusSearch] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056580 (https://phabricator.wikimedia.org/T370770) (owner: 10Ebernhardson) [20:08:52] !log zabe@deploy1002 Started scap sync-world: Backport for [[gerrit:1056484|frwiktionary, dewiki: enable CommunityConfiguration (T370261 T369711)]] [20:09:00] T370261: Release CommunityConfiguration extension to dewiki - https://phabricator.wikimedia.org/T370261 [20:09:00] T369711: Release CommunityConfiguration to French Wiktionary - https://phabricator.wikimedia.org/T369711 [20:09:18] * MichaelG_WMF is here now, after creating a change for T370941 [20:09:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:09:31] still hit +2 on your patches so that CI gets running [20:10:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q1:rack/setup/install alert1002 - https://phabricator.wikimedia.org/T370111#10012684 (10andrea.denisse) a:05andrea.denisse→03None [20:11:24] !log zabe@deploy1002 zabe, sgimeno: Backport for [[gerrit:1056484|frwiktionary, dewiki: enable CommunityConfiguration (T370261 T369711)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:11:49] sergi0: is your patch testable? [20:12:13] yes, testing now [20:13:26] Looks fine in both wikis [20:13:42] !log zabe@deploy1002 zabe, sgimeno: Continuing with sync [20:14:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:17:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet - https://phabricator.wikimedia.org/T369011#10012706 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF [20:18:35] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1056484|frwiktionary, dewiki: enable CommunityConfiguration (T370261 T369711)]] (duration: 09m 43s) [20:18:41] T370261: Release CommunityConfiguration extension to dewiki - https://phabricator.wikimedia.org/T370261 [20:18:41] T369711: Release CommunityConfiguration to French Wiktionary - https://phabricator.wikimedia.org/T369711 [20:19:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet - https://phabricator.wikimedia.org/T369011#10012714 (10VRiley-WMF) I have shut down the server and completed a flea power drain. 
Booted this server back u... [20:23:32] !log sgimeno@mwmaint1002:~$ mwscript extensions/GrowthExperiments/maintenance/migrateCommunityConfig.php --wiki=dewiki --force [20:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:36] !log mwscript extensions/GrowthExperiments/maintenance/migrateCommunityConfig.php --wiki=frwiktionary #T369711 [20:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:40] T369711: Release CommunityConfiguration to French Wiktionary - https://phabricator.wikimedia.org/T369711 [20:25:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:27:13] (03CR) 10Ebernhardson: [C:03+2] Check the output of RevisionStore::getRevisionById [extensions/CirrusSearch] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056580 (https://phabricator.wikimedia.org/T370770) (owner: 10Ebernhardson) [20:27:17] (03CR) 10Ebernhardson: [C:03+2] Check the output of RevisionStore::getRevisionById [extensions/CirrusSearch] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056578 (https://phabricator.wikimedia.org/T370770) (owner: 10Ebernhardson) [20:28:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1002 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056587 (https://phabricator.wikimedia.org/T370303) (owner: 10Jdrewniak) [20:29:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:35:38] (03Merged) 10jenkins-bot: Create dark mode launch banner for Vector 2022 
[skins/Vector] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056587 (https://phabricator.wikimedia.org/T370303) (owner: 10Jdrewniak) [20:36:10] !log zabe@deploy1002 Started scap sync-world: Backport for [[gerrit:1056587|Create dark mode launch banner for Vector 2022 (T370303)]] [20:36:15] T370303: Vector 2022 Dark Mode launch notifications - https://phabricator.wikimedia.org/T370303 [20:38:05] oh it adds i18n stuff [20:38:13] that explains why its so slow [20:38:40] Yeah sorry about that... [20:38:49] (03Merged) 10jenkins-bot: Check the output of RevisionStore::getRevisionById [extensions/CirrusSearch] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056578 (https://phabricator.wikimedia.org/T370770) (owner: 10Ebernhardson) [20:41:54] zabe: are you around for a while longer and could do another spontaneous deploy? [20:42:52] We have an issue with a lot of logspam in GrowthExperiments. We have a fix, but knowing the time CI takes to merge tasks in that repo, it will take us well out of this window. [20:43:13] If not, then that is also not the end of the world, I'll just schedule it for tomorrow morning [20:45:08] Difficult, I need to go for around 30 min in about 40 min [20:49:00] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10012824 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm executed with errors: - pc1017 (*... [20:49:16] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host pc1017.eqiad.wmnet with OS bookworm [20:49:22] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10012826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm [20:49:31] zabe: ok, then we will find another way. 
[20:49:49] Thank you for doing today's deploys! 🙏 [20:54:11] yw [20:56:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:59:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:59:34] (03PS1) 10Dzahn: wikistats: mount a cinder volume to store backups externally [puppet] - 10https://gerrit.wikimedia.org/r/1056601 [20:59:55] maybe 5 min until testable lol [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240724T2100) [21:01:34] (03Merged) 10jenkins-bot: Check the output of RevisionStore::getRevisionById [extensions/CirrusSearch] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1056580 (https://phabricator.wikimedia.org/T370770) (owner: 10Ebernhardson) [21:02:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb1007 - https://phabricator.wikimedia.org/T369922#10012861 (10Dwisehaupt) [21:02:50] (03PS2) 10Dzahn: wikistats: mount a cinder volume to store backups externally [puppet] - 10https://gerrit.wikimedia.org/r/1056601 [21:03:37] (03CR) 10Dzahn: [C:03+2] wikistats: mount a cinder volume to store backups externally [puppet] - 10https://gerrit.wikimedia.org/r/1056601 (owner: 10Dzahn) [21:04:21] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:05:38] 
10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb1007 - https://phabricator.wikimedia.org/T369922#10012862 (10Dwisehaupt) a:05Dwisehaupt→03None @RobH Correct. That is our assumption. Once the host is racked, cabled, and all the basic provisioning of the hardware is comple... [21:05:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban1002 - https://phabricator.wikimedia.org/T369947#10012865 (10Dwisehaupt) [21:05:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fran1002 - https://phabricator.wikimedia.org/T369940#10012866 (10Dwisehaupt) [21:06:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban1002 - https://phabricator.wikimedia.org/T369947#10012867 (10Dwisehaupt) a:05Dwisehaupt→03None @RobH Correct. That is our assumption. Once the host is racked, cabled, and all the basic provisioning of the hardware is compl... [21:07:09] !log zabe@deploy1002 jdrewniak, zabe: Backport for [[gerrit:1056587|Create dark mode launch banner for Vector 2022 (T370303)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:07:13] T370303: Vector 2022 Dark Mode launch notifications - https://phabricator.wikimedia.org/T370303 [21:07:14] jan_drewniak: you can finally test now:) [21:07:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fran1002 - https://phabricator.wikimedia.org/T369940#10012870 (10Dwisehaupt) a:05Dwisehaupt→03None @RobH Correct. That is our assumption. Once the host is racked, cabled, and all the basic provisioning of the hardware is comple... 
[21:07:58] zabe: ok I'm on it [21:08:36] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T370556#10012887 (10VRiley-WMF) Replaced the SSD in Disk 1 in Backplane 1 [21:10:10] zabe: ok good to sync [21:10:55] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/1056581/3416/gitlab1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1056581 (https://phabricator.wikimedia.org/T366882) (owner: 10Dzahn) [21:11:04] (03CR) 10Dzahn: [V:03+1] gitlab: set nft throttling policy to drop on replica [puppet] - 10https://gerrit.wikimedia.org/r/1056581 (https://phabricator.wikimedia.org/T366882) (owner: 10Dzahn) [21:11:16] !log zabe@deploy1002 jdrewniak, zabe: Continuing with sync [21:11:19] cool [21:11:28] 06SRE, 10DNS: DKIM Key to Public DNS - https://phabricator.wikimedia.org/T370961#10012888 (10Peachey88) [21:13:02] 06SRE, 10DNS: DKIM Key to Public DNS (Dayforce) - https://phabricator.wikimedia.org/T370961#10012889 (10Peachey88) [21:13:17] (03CR) 10Dzahn: "@Arnold maybe we can do this together and chat a little bit about it while deploying" [puppet] - 10https://gerrit.wikimedia.org/r/1055495 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [21:14:54] (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1055495/3417/vrts1001.eqiad.wmnet/change.vrts1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/1055495 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [21:15:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:17:55] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1056587|Create dark mode launch banner for Vector 2022 (T370303)]] (duration: 41m 44s) [21:17:56] jan_drewniak: patch is live [21:18:00] 
T370303: Vector 2022 Dark Mode launch notifications - https://phabricator.wikimedia.org/T370303 [21:18:07] ebernhardson: over to you, sorry that the i18n stuff took so long [21:18:24] zabe: no worries, thanks! [21:18:41] (03PS3) 10Dzahn: vrts: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055495 (https://phabricator.wikimedia.org/T370677) [21:19:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:19:49] (03CR) 10Dzahn: [C:04-1] "unexpected new problem https://puppet-compiler.wmflabs.org/output/1055495/3418/vrts1001.eqiad.wmnet/change.vrts1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/1055495 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [21:20:00] zabe: yes, thank you! [21:20:43] !log ebernhardson@deploy1002 Started scap sync-world: Backport for [[gerrit:1056580|Check the output of RevisionStore::getRevisionById (T370770)]] [21:20:47] T370770: Error: Call to a member function audienceCan() on null - https://phabricator.wikimedia.org/T370770 [21:21:50] (03PS18) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) [21:26:30] !log ebernhardson@deploy1002 ebernhardson: Backport for [[gerrit:1056580|Check the output of RevisionStore::getRevisionById (T370770)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:26:39] T370770: Error: Call to a member function audienceCan() on null - https://phabricator.wikimedia.org/T370770 [21:27:49] (03PS3) 10Ryan Kemper: wdqs: add main and scholarly role assignments [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364364) (owner: 10Stevemunene) [21:28:09] 
!log ebernhardson@deploy1002 ebernhardson: Continuing with sync [21:28:40] (03PS1) 10Dzahn: aphlict: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1056603 (https://phabricator.wikimedia.org/T370677) [21:29:43] (03PS4) 10Ryan Kemper: wdqs: add main and scholarly role assignments [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364364) (owner: 10Stevemunene) [21:29:44] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364364) (owner: 10Stevemunene) [21:30:22] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1056603/3420/aphlict1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1056603 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [21:30:48] (03PS5) 10Ryan Kemper: wdqs: add main and scholarly role assignments [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364364) (owner: 10Stevemunene) [21:31:08] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364364) (owner: 10Stevemunene) [21:32:40] (03CR) 10Dzahn: [V:04-1 C:04-1] "https://puppet-compiler.wmflabs.org/output/1055493/3421/phab1004.eqiad.wmnet/change.phab1004.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [21:32:50] !log ebernhardson@deploy1002 Finished scap: Backport for [[gerrit:1056580|Check the output of RevisionStore::getRevisionById (T370770)]] (duration: 12m 07s) [21:32:55] T370770: Error: Call to a member function audienceCan() on null - https://phabricator.wikimedia.org/T370770 [21:33:08] that closes out the backport window [21:35:15] !log ryankemper@cumin2002 START - Cookbook sre.apifeatureusage.roll-restart-reboot-logstash rolling reboot on A:apifeatureusage [21:38:27] !log 
ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on wdqs[1012-1013].eqiad.wmnet with reason: T366555 security [21:38:44] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on wdqs[1012-1013].eqiad.wmnet with reason: T366555 security [21:38:46] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1012.eqiad.wmnet [21:38:53] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1013.eqiad.wmnet [21:40:20] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1055999/3422/" [puppet] - 10https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [21:40:45] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on wdqs[2007,2009-2012].codfw.wmnet with reason: T366555 security [21:41:06] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on wdqs[2007,2009-2012].codfw.wmnet with reason: T366555 security [21:42:11] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1013.eqiad.wmnet [21:42:19] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2007.codfw.wmnet [21:42:24] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2009.codfw.wmnet [21:42:28] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2010.codfw.wmnet [21:42:30] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2011.codfw.wmnet [21:42:35] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2012.codfw.wmnet [21:44:18] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.apifeatureusage.roll-restart-reboot-logstash (exit_code=0) rolling reboot on A:apifeatureusage [21:45:28] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
wdqs2009.codfw.wmnet [21:45:33] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2012.codfw.wmnet [21:45:34] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1012.eqiad.wmnet [21:45:36] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2011.codfw.wmnet [21:45:45] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2010.codfw.wmnet [21:46:16] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2007.codfw.wmnet [21:47:07] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on wdqs[1014-1015].eqiad.wmnet with reason: T366555 security [21:47:23] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on wdqs[1014-1015].eqiad.wmnet with reason: T366555 security [21:47:41] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1015.eqiad.wmnet [21:47:46] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1014.eqiad.wmnet [21:49:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:50:46] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1015.eqiad.wmnet [21:50:46] (03PS1) 10Dzahn: wikistats: drop min_gb parameter from cinder volume mount [puppet] - 10https://gerrit.wikimedia.org/r/1056605 [21:51:04] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1014.eqiad.wmnet [21:52:05] (03PS6) 10Ryan Kemper: wdqs: add main and scholarly role assignments [puppet] - 
10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364364) (owner: 10Stevemunene) [21:52:14] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364364) (owner: 10Stevemunene) [21:54:12] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on wdqs[1018-1021].eqiad.wmnet with reason: T366555 security [21:54:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:54:31] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on wdqs[1018-1021].eqiad.wmnet with reason: T366555 security [21:55:11] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1018.eqiad.wmnet [21:55:16] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1019.eqiad.wmnet [21:55:24] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1020.eqiad.wmnet [21:55:37] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1021.eqiad.wmnet [21:55:48] (03PS1) 10Dzahn: cinderutils: add --allow-unattended-format when preparing volumes [puppet] - 10https://gerrit.wikimedia.org/r/1056606 [21:56:12] (03CR) 10Dzahn: [C:03+2] "fails. 
---> https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056606" [puppet] - 10https://gerrit.wikimedia.org/r/1056601 (owner: 10Dzahn) [21:56:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [21:57:00] ^^ checking [21:58:01] (03PS2) 10Dzahn: cinderutils: add --allow-unattended-format when preparing volumes [puppet] - 10https://gerrit.wikimedia.org/r/1056606 [21:58:49] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1018.eqiad.wmnet [21:58:54] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1019.eqiad.wmnet [21:59:02] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1020.eqiad.wmnet [21:59:14] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1021.eqiad.wmnet [22:01:43] FIRING: [2x] ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [22:04:20] FIRING: JobUnavailable: Reduced availability for job jmx_query_service_streaming_updater in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:05:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:09:20] FIRING: [2x] 
SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:09:20] RESOLVED: JobUnavailable: Reduced availability for job jmx_query_service_streaming_updater in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:11:43] RESOLVED: [2x] ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [22:39:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:40:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:52:58] jouncebot: nowandnext [22:52:58] No deployments scheduled for the next 7 hour(s) and 7 minute(s) [22:52:59] In 7 hour(s) and 7 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240725T0600) [22:52:59] In 7 hour(s) and 7 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240725T0600) [22:53:31] (03CR) 10Zabe: [C:03+2] Initial configuration for cswikivoyage [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1056584 (https://phabricator.wikimedia.org/T370905) (owner: 10Zabe) [22:54:11] (03Merged) 10jenkins-bot: Initial configuration for cswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056584 (https://phabricator.wikimedia.org/T370905) (owner: 10Zabe) [22:55:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:59:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:59:24] !log Create Wikivoyage Czech # T370905 [22:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:29] T370905: Create Wikivoyage Czech - https://phabricator.wikimedia.org/T370905 [22:59:40] !log zabe@deploy1002 Started scap sync-world: T370905 [23:01:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [23:02:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [23:02:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1160 (T367856)', diff saved to https://phabricator.wikimedia.org/P66919 and previous config saved to /var/cache/conftool/dbconfig/20240724-230209-marostegui.json [23:02:14] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [23:08:54] !log zabe@deploy1002 Finished scap: T370905 (duration: 09m 14s) [23:08:58] T370905: Create Wikivoyage Czech - https://phabricator.wikimedia.org/T370905 [23:09:30] !log 
zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=cswikivoyage --cluster=all 2>&1 | tee /tmp/cswikivoyage.UpdateSearchIndexConfig.log # T370905 [23:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:00] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056618 [23:11:01] (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056618 (owner: 10Zabe) [23:11:46] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056618 (owner: 10Zabe) [23:11:52] !log zabe@deploy1002 Started scap sync-world: update interwiki cache [23:14:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:15:38] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:20:18] !log zabe@deploy1002 Finished scap: update interwiki cache (duration: 08m 25s) [23:29:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:29:55] 06SRE, 10Charts, 06serviceops, 10Shellbox: Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10013287 (10aude) For building this as a node service, is it still recommended to use 
service-template-node? I noticed that it has some security issue... [23:34:20] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:38:29] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1056622 [23:38:29] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1056622 (owner: 10TrainBranchBot) [23:55:10] (03PS1) 10DDesouza: miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056625 (https://phabricator.wikimedia.org/T344471) [23:57:41] (03CR) 10DDesouza: [C:03+2] miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056625 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [23:58:45] (03Merged) 10jenkins-bot: miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056625 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [23:59:35] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [23:59:57] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [23:59:58] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply